Tuesday, December 06, 2005
A page hijack is a technique that exploits the way search engines interpret certain commands a web server can send to a visitor. In essence, it allows a hijacking website to replace pages belonging to target websites in the Search Engine Results Pages ("SERPs").
When a visitor searches for a term, a hijacking webmaster can replace the pages that appear for that search with pages that (s)he controls. The new pages that the hijacking webmaster inserts into the search engine are "virtual pages", meaning that they don't exist as real pages. Technically speaking they are server-side scripts, not pages, so the searcher is taken directly from the search engine listing to a script that the hijacker controls. The hijacked pages appear to the searcher as copies of the target pages, but with a different web address ("URL") than the target pages.
Once a hijack has taken place, a malicious hijacker can redirect any visitor that clicks on the target page listing to any other page of the hijacker's choosing. If this redirect is hidden from the search engine spiders, the hijack can be sustained indefinitely.
Possible abuses include making "adult" pages appear as, say, CNN pages in the search engines, setting up false bank frontends, false storefronts, and so on. All the "usual suspects", that is.
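The redirect script at the center of the scheme can be tiny. Here is a minimal sketch in Python of a click-tracker-style handler that answers with "302 Found" plus a `Location` header (the article talks about ASP and PHP scripts; the `/go` path and the `url` parameter here are my own illustrative assumptions, not anything from a real hijack):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ClickTracker(BaseHTTPRequestHandler):
    """Answers e.g. /go?url=http://... with a "302 Found" redirect."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        target = query.get("url", ["/"])[0]
        # The crucial part: 302 tells the visitor (or spider) that the
        # content is only *temporarily* at the target location.
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet
```

Run it with `HTTPServer(("", 8000), ClickTracker).serve_forever()`; a request to `/go?url=http://www.example.com/` then gets back the 302 status code and a `Location` header, exactly the two responses that the walkthrough below hinges on.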
# Googlebot (the "web spider" that Google uses to harvest pages) visits a page with a redirect script. In this example it is a link that redirects to another page using a click tracker script, but it need not be so. That page is the "hijacking" page, or "offending" page.
# This click tracker script issues a server response code "302 Found" when the link is clicked. This response code is the important part; it does not need to be caused by a click tracker script. Most webmaster tools use this response code by default, as it is the standard redirect in both ASP and PHP.
# Googlebot indexes the content and makes a list of the links on the hijacker page (including one or more links that are really a redirect script)
# All the links on the hijacker page are sent to a database for storage until another Googlebot is ready to spider them. At this point the connection breaks between your site and the hijacker page, so you (as webmaster) can do nothing about the following:
# Some other Googlebot tries one of these links - this one happens to be the redirect script (Google has thousands of spiders, all are called "Googlebot")
# It receives a "302 Found" status code and goes "yummy, here's a nice new page for me"
# It then receives a "Location: http://www.your-domain.tld/" header and hurries to your page to get the content.
# It heads straight to your page without telling your server on what page it found the link it used to get there (as, obviously, it doesn't know - another Googlebot fetched it)
# It has the URL of the redirect script (which is the link it was given, not the page that link was on), so now it indexes your content as belonging to that URL.
# It deliberately chooses to keep the redirect URL, as the redirect script has just told it that the new location (that is, the target URL: your web page) is only a temporary location for the content. That's what "302 Found" means: a temporary location for content.
# Bingo, a brand new page is created (never mind that it does not exist IRL, to Googlebot it does)
# Some other Googlebot finds your page at your right URL and indexes it.
# When both pages arrive at the reception of the "index", the "duplicate filter" spots them and discovers that they are identical.
# The "duplicate filter" doesn't know that one of these pages is not a page but just a link (to a script). It has two URLs and identical content, so this is a piece of cake: Let the best page win. The other disappears.
# Optional: For mischievous webmasters only: For any visitor other than "Googlebot", make the redirect script point to any other page of their choice.
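That last "mischievous" step boils down to a user-agent check inside the same redirect script. A minimal sketch of the idea (the two URLs and the substring check are my own illustrative assumptions; real cloakers typically also match spider IP ranges, which this does not):

```python
# Hypothetical URLs for illustration only.
VICTIM_PAGE = "http://www.target-site.tld/page.html"  # the hijacked listing
OTHER_PAGE = "http://www.somewhere-else.tld/"         # where humans are sent

def redirect_target(user_agent: str) -> str:
    """Pick the Location header value based on who is asking."""
    if "Googlebot" in user_agent:
        # The spider keeps seeing the victim's content,
        # so the hijacked listing stays in the index.
        return VICTIM_PAGE
    # Every human visitor goes wherever the hijacker pleases.
    return OTHER_PAGE
```

Because Googlebot always gets the "honest" redirect, the duplicate filter never sees anything change, while every searcher who clicks the listing ends up somewhere else entirely.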