Search marketing beyond the search engines | Search Optimization Marketing (SOM)

Google Crawls Through HTML. Webmasters Not Satisfied

Google has been testing a new search-related technology that lets its crawlers explore some HTML forms in an attempt to discover web pages and URLs that have not yet been found and indexed. However, Michael VanDeMar at Smackdown has found a flaw in Google’s new crawling method: certain pages are being indexed that are not supposed to be. Here are some of the links that have been indexed. Note that these pages share an almost identical format:

http://smackdown.blogsblogsblogs.com/index.php?s=lube

http://smackdown.blogsblogsblogs.com/index.php?s=bestest

http://smackdown.blogsblogsblogs.com/index.php?s=scanned

These URLs are search result pages on Michael’s blog, and each was meant to lead only to his regular blog pages as the final result. The phrases themselves are insignificant as far as search is concerned, since users are unlikely to use them as search terms. Michael then devised an experiment in which he excluded these search pages from indexing on his WordPress blog. This is the code that he used for the exclusion:

<?php
// Regular pages may be indexed; archive and search result pages may not.
if (!is_archive() && !is_search()) {
?>
<meta name="robots" content="all" />
<?php
} else {
?>
<meta name="robots" content="noindex, follow" />
<?php
}
?>
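
To check what a template like the one above actually emits, one can parse a rendered page and read its robots meta tag. The following is a minimal Python sketch, not a description of how Googlebot works; the names `RobotsMetaParser`, `robots_directives`, and `is_indexable` are hypothetical helpers introduced here for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of lowercase robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def is_indexable(html):
    """Assumption: a page is indexable unless it says 'noindex' or 'none'."""
    directives = robots_directives(html)
    return "noindex" not in directives and "none" not in directives
```

Running `is_indexable` against a saved copy of a search result page and a regular post is a quick way to confirm that the conditional emits the tag you expect on each page type.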

The sole purpose of this code was to stop these pages from being indexed. It worked, but only for a while. In February 2008, Michael noticed visits to his website from Googlers, and he reasoned that only Googlers at the Googleplex had the tools required to view pages that carry the noindex tag. This was not the only such occurrence: reports of pages being indexed in spite of ‘noindex’ tags started surfacing from other websites as well. After some time, new pages had been indexed that, according to Michael, were not supposed to be. With confusion mounting, Google’s Matt Cutts finally stepped in to clear it up and stated, “it’s less about crawling search [...]”

This is what Matt Cutts had to say about the crawling of forms by Googlebot: “The main thing that I want to communicate is that crawling these forms doesn’t cause any PageRank hit for the other pages on a site. So it’s pretty much free crawling for the form pages that get crawled.”

In the end, it all comes down to this. If Google intends to make this new form of crawling a part of its mainstream discovery, then it needs to implement some changes in the Robots Exclusion Protocol, such as a NOFORM meta tag. This could help in reducing the inconsistencies. When a form has no action, Googlebot can assume that the page the form resides on must be the intended target.
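
The effect of a missing action can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's actual method: the helper names `resolve_form_target` and `candidate_urls` are hypothetical, the field name `s` is taken from the indexed URLs listed earlier, and how a crawler would choose the words to submit is not known from this article.

```python
from urllib.parse import urlencode, urljoin, urlsplit, urlunsplit


def resolve_form_target(page_url, action=None):
    """Return the URL a GET form submits to.

    With no action (None or empty), the form's own page is the target.
    Any existing query string or fragment is dropped, because a GET
    submission replaces the query with the form data.
    """
    target = urljoin(page_url, action) if action else page_url
    parts = urlsplit(target)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def candidate_urls(page_url, action, field, words):
    """Build the URLs that submitting each word through the form would produce."""
    target = resolve_form_target(page_url, action)
    return [f"{target}?{urlencode({field: word})}" for word in words]


# Hypothetical usage: a search form with no action on the blog's index page.
urls = candidate_urls(
    "http://smackdown.blogsblogsblogs.com/index.php",
    None,
    "s",
    ["lube", "bestest", "scanned"],
)
```

Under these assumptions, `urls` contains the same three addresses listed at the top of this article, which shows how a form without an action can turn arbitrary words into crawlable search result pages.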

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivs 2.5 License.