Ever run a search on Google for some information you were interested in, only to find that after you click on a result you are taken to a site which requires registration in order to access the full article? I sure have, on many occasions. Ever wonder how Google was able to index something that can only be accessed by registered users? That thought occurred to me on my ride home today as I was listening to some podcasts. I do some of my best thinking while driving and listening to something totally unrelated to the problem I’m trying to solve. And that’s when it struck me…
There’s no way that the Google bot happened to have ‘registered’ an account on some of these sites and thus had access. Even if it did, the bot just follows links, albeit in an intelligent manner. So, it had to be something else. My guess: the Google bot has a unique browser ‘User Agent’. When one of these sites sees that Google is spidering their pages, they just give it free rein over all their content. After all, it’s important to them to get as much of their content indexed as possible, and in turn to get more people directed to their website from search results. When you and I go to the same site, our browser transmits a User Agent header indicating whether we’re using Firefox, IE, or another browser.
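To make the theory concrete, here is a minimal Python sketch of the kind of check I imagine such a site running. It is purely a guess at the logic, not any real site's code: the function name `select_article_body`, its parameters, and the two-paragraph preview are all my own assumptions.

```python
def select_article_body(user_agent, full_text, is_subscriber=False, preview_paragraphs=2):
    """Hypothetical server-side check: return the full article to subscribers
    and to clients whose User Agent mentions Googlebot, else a short preview."""
    # Assumption: the site trusts the User Agent string and does nothing else
    # to verify that the request really comes from Google.
    looks_like_crawler = "googlebot" in user_agent.lower()
    if is_subscriber or looks_like_crawler:
        return full_text
    # Everyone else only sees the first few paragraphs.
    paragraphs = full_text.split("\n\n")
    return "\n\n".join(paragraphs[:preview_paragraphs])
```

If a site really does work this way, anything that sends a crawler-looking User Agent gets the same treatment as the crawler itself. A site that verifies the crawler some other way would not be fooled.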
I decided to test my theory tonight. I fired up my Firefox browser and grabbed an extension which would enable me to customize my User Agent value. I downloaded the “User Agent Switcher” Firefox extension, set my User Agent to “GoogleBot/2.1” (no quotes) and I was ready to go. I needed to find a site that was indexed by Google, but that needed ‘registration’ to be able to view its content. Take the following URL for example: http://www.windowsitpro.com/Windows/Article/ArticleID/46980/
Try going to it without modifying your User Agent and you’ll notice you need to be a subscriber to have access to more than two paragraphs of the full article. Now, modify your User Agent, reload the page, and presto! You have access to the full article!
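The same experiment can be sketched outside the browser. The snippet below uses Python's standard `urllib.request` to attach a custom User Agent header to a request. The helper names `build_request` and `fetch` are mine, and whether a given site responds differently is exactly the thing being tested, so treat this as a rough sketch rather than a guaranteed recipe.

```python
import urllib.request


def build_request(url, user_agent):
    """Build a request object carrying the given User-Agent header (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()
```

Calling `fetch` on the same URL twice, once with a regular browser string and once with `GoogleBot/2.1`, and comparing the lengths of the two responses would be a quick way to see whether a site treats the two differently.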
While this won’t work with some of the larger sites like the New York Times or Washington Post, it does work with some of the smaller sites which rely more heavily on Google to route some traffic to their site. For now, anytime I hit a site which requires me to register before I can view the full article, I’ll switch my User Agent just in case. I have a strange suspicion this might work on many, many sites…
#1 by Scott A'Hearn on July 27, 2005 - 8:56 am
As you probably know, too, the BugMeNot extension [bugmenot.com] will help you out for sites like the Post and NYT, sending “dummy” free registration credentials across.