Google News:Googlebot-news
Related blog post: Same Protocol, More Options for News Publishers by Josh Cohen
Identifying Googlebot-News
A typical Apache access.log entry for a request made by Googlebot-News looks like this:
66.249.71.168 - - [31/Mar/2010:07:07:44 -0500] "GET /index.php HTTP/1.1" 200 5404 "-" "Googlebot-News"
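As a minimal sketch of how such entries could be picked out of a log programmatically, the following Python snippet parses one line and checks the User-Agent field. It assumes the Apache "combined" log format shown above; the names `LOG_PATTERN`, `parse_log_line` and `claims_googlebot_news` are illustrative and not part of any Google or Apache tool. A matching User-Agent only shows what the client claims to be, so the IP address still needs to be verified.

```python
import re

# Assumption: the log uses the Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one access.log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def claims_googlebot_news(line):
    """True if the User-Agent field mentions Googlebot-News (case-insensitive)."""
    fields = parse_log_line(line)
    return fields is not None and "googlebot-news" in fields["user_agent"].lower()


entry = (
    '66.249.71.168 - - [31/Mar/2010:07:07:44 -0500] '
    '"GET /index.php HTTP/1.1" 200 5404 "-" "Googlebot-News"'
)
fields = parse_log_line(entry)
print(fields["ip"], fields["user_agent"])  # 66.249.71.168 Googlebot-News
```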
To verify the crawler, you can do a reverse DNS lookup for the IP address from a UNIX-like shell, for example:
host 66.249.71.168
The command should return something like:
[root@google]# host 66.249.71.168
168.71.249.66.in-addr.arpa domain name pointer crawl-66-249-71-168.googlebot.com.
Since the reverse DNS record can still be spoofed, run a second lookup, this time for the hostname returned by the previous command:
[root@google]# host crawl-66-249-71-168.googlebot.com
crawl-66-249-71-168.googlebot.com has address 66.249.71.168
If the hostname ends in googlebot.com and the IP address returned by the second command matches the IP address you looked up in the first command, you know that the crawler is legitimate.
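The two lookups above can be sketched in Python using the standard `socket` module. This is an illustration under stated assumptions, not an official verification tool: the function names are hypothetical, the list of trusted domain suffixes is an assumption based on the hostname shown above, and only IPv4 addresses are handled. The resolver functions are passed in as parameters so the logic can be exercised without live DNS.

```python
import socket

# Assumption: genuine crawler hostnames end in one of these domains.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """Reverse DNS: return the primary hostname for an IP (like `host <ip>`)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Forward DNS: return the list of IPv4 addresses for a hostname."""
    return socket.gethostbyname_ex(hostname)[2]


def is_genuine_googlebot(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for a crawler IP address."""
    try:
        # Step 1: IP -> hostname; drop a trailing dot and normalize case.
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        # Step 2: hostname -> IPs; the original IP must be among them.
        return ip in forward(hostname)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses;
        # a failed lookup is treated as "not verified".
        return False
```

Calling `is_genuine_googlebot("66.249.71.168")` with the default resolvers performs live DNS queries, so the result depends on the current DNS records for that address.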
Blocking the Google News robot
Like all Google robots, Googlebot-news obeys the Robots Protocol. If you decide to block Googlebot-news from indexing your content, there are two ways to do so:
1. Robots meta tag
2. robots.txt
Both methods rely on the Robots Protocol.
Blocking Google News with a meta tag: To prevent Google News from indexing your content, place the following meta tag in each page you don't want indexed:
<meta name="Googlebot-news" content="noindex">
Note, however, that Google has to crawl a page to see this meta tag, so you will still see Googlebot-news requesting the articles.
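To confirm that a page really carries the tag, a small check with Python's standard `html.parser` module might look like the sketch below. The class and function names are hypothetical, and the check only looks at the `Googlebot-news` meta tag shown above; it does not model how Google itself evaluates a page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name=... content=...> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lower-cased meta name -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if name:
            # content may hold several comma-separated directives
            self.directives.setdefault(name, set()).update(
                part.strip() for part in content.split(",") if part.strip()
            )


def blocks_google_news(html):
    """True if the page has a Googlebot-news meta tag containing noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives.get("googlebot-news", set())


page = '<html><head><meta name="Googlebot-news" content="noindex"></head></html>'
print(blocks_google_news(page))  # True
```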
Blocking Google News with robots.txt: With a robots.txt file you can block specific articles or the whole site. To block the whole site, include the following in your robots.txt file:
User-agent: Googlebot-news
Disallow: /
To block access to specific articles only, use the following snippet in your robots.txt file:
User-agent: Googlebot-news
Disallow: /path-to/article.html
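Before deploying rules like these, they can be tried out locally. The sketch below uses Python's standard `urllib.robotparser` module; the `www.example.com` URLs and the `build_parser` helper are placeholders for illustration. Python's parser is only an approximation of how a crawler interprets robots.txt, so treat the result as a sanity check rather than a guarantee of Google's behavior.

```python
from urllib.robotparser import RobotFileParser

# The article-specific rule from above.
ROBOTS_TXT = """\
User-agent: Googlebot-news
Disallow: /path-to/article.html
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The blocked article is off limits for Googlebot-news ...
print(parser.can_fetch("Googlebot-News", "https://www.example.com/path-to/article.html"))  # False
# ... while other pages stay accessible to it,
print(parser.can_fetch("Googlebot-News", "https://www.example.com/other.html"))  # True
# and the rule does not apply to a different user agent.
print(parser.can_fetch("Googlebot", "https://www.example.com/path-to/article.html"))  # True
```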
If you'd like to keep all your articles out of Google News without using the Robots Protocol, you can contact Google News and ask to be excluded.