
Utah .NET User Group

Home of Utah's professional .NET developers.

Josh's Blog

Google Whammy

Do you ever search for help on a problem or some piece of information on Google? Do you ever find that some sites rank high in the results, but when you click the link you are greeted by a "login" page with maybe a short excerpt?

I run into this quite a bit, and I'll list a few sites that I've run into this with. But first, an analysis of how and why sites do this, and how I've gotten around it.

I'm not sure whether Google flags this behavior, since they are quite secretive about what goes into their Special Sauce, but to me it seems wrong. Google hosts a service that is free for the user and makes its money from advertisers hoping to make money from those users. When someone subverts that to try to make a buck, it smells bad.

User Agent

Every web browser has an identifier that it sends to a web server every time you request a page. This ID is referred to as a user-agent (UA). (I know this is quite elementary, and most of you have figured everything out already after this one sentence, but I like to be complete.) The user-agent is used for many different purposes, including identifying the browser type, the operating system type, and the current version of each. The user-agent is also often used to tell the web server about certain extensions installed in your web browser. It is quite common for sites to deliver slightly different content to different browsers in order to work around some incompatibility or to chide you about your poor choice of web browser.
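To make the pieces of a UA concrete, here is a minimal sketch that pulls one apart. The sample string is only an illustration of the kind of value a Firefox build of this era would send, and `parse_user_agent` is a hypothetical helper written for this post, not part of any library.

```python
import re

# Illustrative UA string (an assumption for this example, not taken from the post).
SAMPLE_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) "
    "Gecko/20080201 Firefox/2.0.0.12"
)


def parse_user_agent(ua):
    """Split a UA string into (product, version) pairs and comment details."""
    # Parenthesised comments usually carry the OS and locale details.
    comments = re.findall(r"\(([^)]*)\)", ua)
    # What is left after removing the comments is a list of product/version tokens.
    without_comments = re.sub(r"\([^)]*\)", " ", ua)
    products = []
    for token in without_comments.split():
        name, _, version = token.partition("/")
        products.append((name, version))
    details = [part.strip() for comment in comments for part in comment.split(";")]
    return {"products": products, "details": details}


parsed = parse_user_agent(SAMPLE_UA)
print(parsed["products"])
print(parsed["details"])
```

Real UA strings are messier than this, so treat the parsing as a rough illustration of what a server can read out of the header rather than a robust parser.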

Robot Spiders

The robots that Google and all other search engines use also have UAs, both to be good netizens and to allow sites to block certain data from being indexed. Imagine having a section of your site indexed where you used hyperlinks to delete items from a database (this actually happened and would be a great write-up for another day).

The culmination of all of this is that anything accessing someone's site has a UA, and all web indexing bots have uniquely identifiable UAs. The people who noticed this relationship have turned the search engines' internet goodwill to a purpose it was never meant for. They have decided to give search engines a different version of their site than they deliver to end users. The site owners do this purely to build their user base and, in effect, their advertising and subscription revenue.
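I have no idea what any particular site's server code looks like, but the behavior described above could be as simple as the following sketch. Every name here (`is_search_bot`, `choose_body`, the marker list, the login text) is made up for illustration.

```python
# Hypothetical list of substrings a site might look for in the UA header.
BOT_MARKERS = ("googlebot",)


def is_search_bot(user_agent):
    """Return True if the UA string looks like a search engine crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def choose_body(user_agent, article_paragraphs):
    """Serve the whole article to crawlers and only a teaser to everyone else."""
    if is_search_bot(user_agent):
        return "\n\n".join(article_paragraphs)
    # Ordinary visitors get the first paragraph and a registration prompt.
    return article_paragraphs[0] + "\n\n[Log in or register to read the rest]"


article = ["First paragraph.", "Second paragraph.", "Third paragraph."]
print(choose_body("Googlebot/2.1 (+http://www.google.com/bot.html)", article))
print(choose_body("Mozilla/5.0 (Windows) Firefox/2.0", article))
```

The point is only that the decision hinges on a header the client controls, which is why the workaround further down is possible at all.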

In Action

Let's see an example: I searched Google for "DateTime sql 1900" to see how 1900-01-01 in SQL relates to a .NET DateTime object. The very first hit is http://www.sqlservercentral.com/articles/Advanced+Querying/workingwithdatetime/1634/ and if you go to that link you will see that you get only the first paragraph of that article, even though Google has indexed the whole thing and is delivering search results based on the contents of the whole article.

Work Around

There are a couple of things that you can do to work around this. The easiest is to just click on the Google cache link to see the actual content that Google indexed. The problem with this approach is that a lot of the time the cache is not available; this is the case with our example above. The more effective solution is to change your UA so that the site thinks that you are Google.

Changing your UA can be a tricky proposition, involving a registry hackathon with some browsers. Luckily, if you are using Firefox there is an easier way. There is an add-on extension called - of all things - User Agent Switcher that lets you quickly choose a user agent from a pull-down menu. It does not come with a list of UAs for search engines, but by searching you should be able to find as many as you need.

Here we have set up a UA called Googlebot with a value of "Googlebot/2.1 (+http://www.google.com/bot.html)". By simply selecting it from the pull-down and refreshing the page, we get the whole article.
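If you would rather not install a browser extension, the same trick can be sketched with nothing but the Python standard library. The UA value is the one quoted above; `build_request` and `fetch` are hypothetical helper names, and whether any given site still responds differently to this header is not something the sketch can promise.

```python
import urllib.request

# The Googlebot UA value quoted in the post.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Create a request object that carries the chosen User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=GOOGLEBOT_UA, timeout=10):
    """Download the page body while presenting the given UA to the server."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

To check a suspect page, call `fetch(url)` once with the default UA and once with an ordinary browser UA, then compare the lengths of the two results; a large difference suggests the site is serving the crawler something you are not getting.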

I've been meaning to write this down for a while but procrastinated too long; I guess I was more bored this time than the times before. Anyway, I may contact Google to see whether this is something they will do anything about. I would like my browsing experience to be such that when I click on a top link, I get the most relevant data. In the past the top links have been postings by people with the good of the community at heart, but more often nowadays I'm finding the Google Whammys... And Stop.

List of sites (let me know if you run into any)

  • sqlservercentral.com
  • techtarget.com
  • windowsitpro.com (Penton property)
  • sqlmag.com (Penton property)
  • exchangeprovip.com (Penton property)
  • scriptingprovip.com (Penton property)
  • securityprovip.com (Penton property)

I am obfuscating the site domain names so that they don't get any link-back points from search engines.

Comments

 

HintonBR said:

Josh - are you saying that SQLServerCentral.com is trying to play the system by forcing you to register when you go to view the content on the site, but allowing Google to see the content in order to index it?  

I don't see anything wrong with this at all. Is it annoying at times? Yes. But SQLServerCentral's business model includes requiring people to register in order to see the information.

I think we are best served by having mechanisms for search engines to reach into all content whether secured by registration or not.  Google will rank the information based on relevance.  If the most relevant site requires a registration so be it.  I am sure Google is well aware of this practice and wants sites to make an exception for their bots that allows them to find and deliver the most relevant search results.

March 7, 2008 11:57 AM
 

josh said:

HintonBR, thank you for the comment.  Sorry for the late reply, this site doesn't seem to send me comment notifications.

In a way I agree with you, it would be nice to have the data indexed so that we can easily find it.  I believe that is exactly what a synopsis is for, and this is even what SqlServerCentral tries to make it look like by leaving the first paragraph when an actual user visits.

Like you, I find this merely an annoyance currently, but imagine if every site with content started doing this same thing. It would make web searching more like walking through a minefield. If a company's revenue is based on ads surrounding a 6 paragraph article and deceptive search-engine-optimisation (SEO) practices, then I probably don't want to visit their site anyway. But when their links come first because of those practices, then the users are the ones that lose.

It seems that if you value that site's content enough to register, or pay to join, then you would probably use that site's own search functionality once you have logged in.

I have since done some more research on the subject, and it looks like this practice is called "cloaking" and is against Google's guidelines. www.google.com/.../answer.py

Also, Google does provide a way to notify them of sites that cloak their content: www.google.com/.../spamreport.html

Personally, I will be notifying Google of any sites that I find that participate in such activities.

I value being able to get things done; like you said, it is only an annoyance currently, but will it still be only annoying when the first whole page of search links leads to a registration/pay page instead of a page containing the content summarized on the search page?

March 24, 2008 2:58 PM
Copyright © 2000-2007, Utah .NET User Group