Google News: Crawling

Your news article gets into the Google News index in three phases:

1. discovery and crawling

2. grouping

3. ranking


Google News discovers your recent articles much as Google Web Search does. There are two ways Google News finds your new articles, and both are carried out by Googlebot-News:

1. standard discovery, where the robot runs through your site looking for new URLs in the HTML

2. news sitemap discovery, where the bot collects and crawls the URLs listed in your news sitemap
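
To make the second route more concrete, here is a minimal sketch of how a news sitemap could be generated with Python's standard library. The element names follow the public Google News sitemap schema; the function name and the dictionary keys 'url', 'title' and 'published' are made up for this illustration, and the set of tags shown is deliberately small, so check it against the current Google documentation before relying on it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(articles, publication_name, language="en"):
    """Return a news sitemap as an XML string.

    `articles` is a list of dicts with the hypothetical keys
    'url', 'title' and 'published' (a W3C/ISO 8601 date string).
    """
    # Register prefixes so the output uses xmlns and xmlns:news.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for article in articles:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["url"]

        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        publication = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(publication, f"{{{NEWS_NS}}}name").text = publication_name
        ET.SubElement(publication, f"{{{NEWS_NS}}}language").text = language
        ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = article["published"]
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = article["title"]

    return ET.tostring(urlset, encoding="unicode")
```

Calling `build_news_sitemap([{"url": "https://example.com/news/12345-story.html", "title": "Story", "published": "2024-01-02"}], "Example Times")` returns a single `<urlset>` document with one `<url>` entry; the example.com address and publication name are placeholders.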


Sitemap discovery complements standard discovery. In some situations there are pages that are not linked from anywhere on your site, so Googlebot-News cannot find them in your HTML unless you give it a bit of help. A news sitemap ensures that the robot finds the article even if it's 'hidden'. In both cases, the URLs found by Googlebot-News should not be blocked by a robots meta tag or by robots.txt, unless you really want them blocked, and they should return a successful HTTP status code (e.g. 200). As the final step, the crawled URLs are passed to another engine on Google's end for further processing.
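
As a rough way to check these two conditions yourself, the sketch below uses Python's standard `urllib.robotparser` to test a URL against the text of a robots.txt file for the Googlebot-News user agent. The helper name `is_crawlable` is invented for this example, treating any 2xx response as "good" is an assumption, and the robots meta tag is not covered. Python's parser may also resolve edge cases differently from Google's own robots.txt handling, so read the result as a hint only.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, url, status_code, user_agent="Googlebot-News"):
    """Return True if robots.txt allows `url` and the status code is 2xx.

    `robots_txt` is the text of the robots.txt file and `status_code` is
    the HTTP status you observed for `url`; both are supplied by the
    caller, so this function makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # Assumption: any 2xx response counts as a "good" status code.
    return allowed and 200 <= status_code < 300
```

For example, with a robots.txt that contains `User-agent: Googlebot-News` followed by `Disallow: /private/`, a URL under `/news/` that returns 200 passes the check, while any URL under `/private/` fails it.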


A major difference between the crawling processes of Google News and Google Web Search is that Googlebot-News visits a website dozens of times a day. As a result, a newly published article can appear in the news index within a few minutes, provided it passes all the article-related requirements.

To ensure that Google can crawl your site easily and find all your articles, you have to meet the following requirements:

  • Have at least 3 digits in your articles' URLs or submit a news sitemap in Webmaster Tools. If you do both, you're pretty much bulletproof
  • Links to individual articles should be simple HTML links. They should not be images linked to the article, links generated by JavaScript, and definitely not Flash objects
  • Googlebot-News can reach your website and its articles: you don't block it with a firewall rule or the robots protocol
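
The first two requirements can be spot-checked with a short script. The sketch below is only an approximation under stated assumptions: it reads the digit rule as "a run of three or more consecutive digits in the URL path", and it counts a link as simple when it is an `<a href>` element in the static HTML with visible text inside it. The names `has_three_digits`, `PlainLinkCollector` and `plain_links` are hypothetical helpers, not part of any Google tool.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


def has_three_digits(url):
    """Assumption: the rule means 3+ consecutive digits in the URL path."""
    return re.search(r"\d{3,}", urlparse(url).path) is not None


class PlainLinkCollector(HTMLParser):
    """Collect hrefs of <a> elements that contain visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._has_text = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._has_text = False

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self._has_text = True

    def handle_endtag(self, tag):
        if tag == "a":
            href = self._href
            # Skip image-only anchors and javascript: pseudo-links.
            if href and self._has_text and not href.lower().startswith("javascript:"):
                self.links.append(href)
            self._href = None


def plain_links(html):
    """Return the simple text links found in a static HTML string."""
    collector = PlainLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Because the parser only reads the static markup, links that JavaScript would insert at runtime never show up in the result, which roughly mirrors the requirement that article links be present as plain HTML.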