Google News:Googlebot-news

Related blog post: Same Protocol, More Options for News Publishers by Josh Cohen

Identifying Googlebot-News

A typical Apache access.log entry for a request made by Googlebot-News looks like this:

66.249.71.168 - - [31/Mar/2010:07:07:44 -0500] "GET /index.php HTTP/1.1" 200 5404 "-" "Googlebot-News"
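As a minimal sketch, the fields of such an entry can be pulled apart with a regular expression. This assumes the log uses Apache's combined log format, as the sample line does; the names LOG_PATTERN, parse_log_line and claims_googlebot_news are hypothetical helpers introduced here for illustration.

```python
import re

# Assumes the Apache "combined" log format shown in the sample entry.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def claims_googlebot_news(line):
    """True if the request's user agent string mentions Googlebot-News.

    This only reports what the client claims; it does not verify the claim.
    """
    entry = parse_log_line(line)
    return entry is not None and "googlebot-news" in entry["agent"].lower()


sample = (
    '66.249.71.168 - - [31/Mar/2010:07:07:44 -0500] '
    '"GET /index.php HTTP/1.1" 200 5404 "-" "Googlebot-News"'
)
entry = parse_log_line(sample)
```

The ip field of the parsed entry is the address to check with the host look-up; the user agent field by itself can be set by any client.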

The user agent string alone proves nothing, since any client can send it. To verify the crawler, run a reverse DNS look-up on the IP address from a UNIX-like shell, for example:

host 66.249.71.168

The command should return something like:

[root@google]# host 66.249.71.168

168.71.249.66.in-addr.arpa domain name pointer crawl-66-249-71-168.googlebot.com.

Because whoever controls an IP range also controls its reverse DNS record, this result can still be spoofed. To confirm it, run the check again, this time as a forward look-up on the hostname returned by the previous command (crawl-66-249-71-168.googlebot.com):

[root@google]# host crawl-66-249-71-168.googlebot.com

crawl-66-249-71-168.googlebot.com has address 66.249.71.168

If the IP address you started with matches the IP address returned by the second command, you know that the crawler is legitimate.
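The two look-ups can be sketched in Python with the standard socket module. This is an illustration under stated assumptions, not an official verification routine: verify_crawler_ip is a hypothetical name, the resolver arguments exist only so the logic can be exercised without network access, and the check that the hostname ends in .googlebot.com is an assumption drawn from the example output above.

```python
import socket


def verify_crawler_ip(ip, reverse_lookup=None, forward_lookup=None,
                      allowed_suffixes=(".googlebot.com",)):
    """Reverse look-up, then forward look-up, then compare the addresses."""
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    hostname = hostname.rstrip(".").lower()
    # Assumption: a genuine crawler hostname ends in one of these suffixes.
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False

    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False

    # Legitimate only if the forward look-up leads back to the original IP.
    return ip in addresses
```

Calling verify_crawler_ip("66.249.71.168") with the default resolvers performs live DNS queries, so the result depends on the DNS records at the time of the call.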


Blocking the Google News robot

Like all Google robots, Googlebot-news obeys the Robots Protocol. If you decide to block Googlebot-news from indexing your content, there are two ways to do it:

  1. Robots Meta tag
  2. Robots.txt


Both methods rely on the Robots Protocol.

Blocking Google News with Meta Tag: If you want to prevent Google News from indexing your content, place the following meta tag in the head of each page you don't want indexed:

<meta name="Googlebot-news" content="noindex">

Note, however, that Google has to crawl the page to see this meta tag, so you will still see Googlebot-news requesting the articles.
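To check that a published page actually carries the tag, a rough sketch with Python's standard html.parser module could look like the following. The helper name has_news_noindex is hypothetical, and the sketch only looks for a meta tag named Googlebot-news; it does not model how Google itself interprets the page.

```python
from html.parser import HTMLParser


class _NewsMetaFinder(HTMLParser):
    """Records whether a Googlebot-news meta tag with noindex was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "googlebot-news":
            return
        directives = [
            part.strip().lower()
            for part in attributes.get("content", "").split(",")
        ]
        if "noindex" in directives:
            self.noindex = True


def has_news_noindex(html_text):
    """True if the markup contains <meta name="Googlebot-news" content="noindex">."""
    finder = _NewsMetaFinder()
    finder.feed(html_text)
    finder.close()
    return finder.noindex
```

Passing the HTML source of an article to has_news_noindex returns True only when the tag is present, which makes it easy to spot templates where the tag was left out.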

Blocking Google News with Robots.txt: With a robots.txt file you can block crawling of specific articles or of the whole site. To block the whole domain, include the following in your robots.txt file:

User-agent: Googlebot-news
Disallow: /

To block access only to specific articles, use the following snippet in your robots.txt file:

User-agent: Googlebot-news
Disallow: /path-to/article.html
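Before deploying either rule set, it can be tried out locally. The sketch below uses Python's standard urllib.robotparser module; it is an approximation, since that parser is not Google's and may differ in edge cases. The helper build_parser and the example.com URLs are placeholders introduced for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The two rule sets described above, with User-agent and Disallow adjacent.
WHOLE_SITE = """\
User-agent: Googlebot-news
Disallow: /
"""

SINGLE_ARTICLE = """\
User-agent: Googlebot-news
Disallow: /path-to/article.html
"""

site_rules = build_parser(WHOLE_SITE)
article_rules = build_parser(SINGLE_ARTICLE)

# example.com is a placeholder host; only the path is matched against rules.
blocked_everywhere = not site_rules.can_fetch(
    "Googlebot-news", "https://example.com/index.php")
blocked_article = not article_rules.can_fetch(
    "Googlebot-news", "https://example.com/path-to/article.html")
other_page_allowed = article_rules.can_fetch(
    "Googlebot-news", "https://example.com/other.html")
```

Python's parser treats a blank line as the end of a record, so the User-agent line and its Disallow line are kept on consecutive lines here.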

If you'd like to keep all your articles out of Google News without changing your robots.txt file or meta tags, you can contact Google News and ask to have them excluded.