Removing potential bots from query results

Added 8/9/2011 12:55:11 AM and updated 8/9/2011 1:01:04 AM

Unless your site is only available internally, and isn't even crawled by an internal search engine, chances are it's crawled by a wide variety of robots each day. Fortunately, there's a way to exclude the good bots fairly easily.

Each Web site should generally have a robots.txt file, which search engine bots read to determine what content should not be crawled. While not all bots honor it, the good ones do, and since a bot has to request the file in order to read it, those requests are a handy way to identify the good bots in your logs.
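
For reference, a minimal robots.txt might look like the following; the disallowed paths are just placeholder examples:

# Applies to all user agents; well-behaved crawlers will skip the listed paths
User-agent: *
Disallow: /admin/
Disallow: /temp/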

Using Log Parser, we can find the legitimate bots with a command like the following:

logparser -rtp:-1 -i:w3c -o:w3c file:Robots.log.sql

Where Robots.log.sql is as follows:

SELECT c-ip, cs(User-Agent) AS [UserAgent], COUNT(*) AS [Requests]
    , MIN(TO_DATE(TO_LOCALTIME(TO_TIMESTAMP(date, time)))) AS FirstDate
    , MAX(TO_DATE(TO_LOCALTIME(TO_TIMESTAMP(date, time)))) AS LastDate
INTO _Robots.log
FROM W3SVC7\*ex*.log
WHERE cs-uri-stem = '/robots.txt'
GROUP BY c-ip, UserAgent
ORDER BY c-ip, UserAgent

One thing to watch out for: check the results to make sure your own IP isn't listed because you viewed robots.txt yourself, since you probably don't want to exclude your own requests.
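
Since _Robots.log is itself written in W3C format, a quick way to check is to query it directly; 192.0.2.10 below is a placeholder for your own address:

logparser -i:w3c "SELECT c-ip, UserAgent, Requests FROM _Robots.log WHERE TO_STRING(c-ip) = '192.0.2.10'"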

After verifying the above, you can use that log file in your WHERE clause. For example, the following query pulls the first 10 requests for a .txt file from an IP that isn't in _Robots.log:

logparser -i:w3c -o:w3c "SELECT TOP 10 c-ip, cs-uri-stem FROM W3SVC7\*ex*.log WHERE EXTRACT_EXTENSION(cs-uri-stem) = 'txt' AND c-ip NOT IN (SELECT TO_STRING(c-ip) FROM _Robots.log)"

Tying it all together, the following returns the 10 most requested pieces of content, excluding requests from the legitimate robots identified above:

logparser -i:w3c -o:w3c "select top 10 cs-uri-stem, COUNT(*) AS [Requests] FROM W3SVC7\*ex*.log WHERE c-ip not in (select TO_STRING(c-ip) from _Robots.log) GROUP BY cs-uri-stem ORDER BY Requests DESC"

Removing the WHERE clause should return some rather different results.
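
For comparison, that unfiltered version is simply:

logparser -i:w3c -o:w3c "SELECT TOP 10 cs-uri-stem, COUNT(*) AS [Requests] FROM W3SVC7\*ex*.log GROUP BY cs-uri-stem ORDER BY Requests DESC"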

Of course this may not catch all robots, particularly ones that last fetched your robots.txt file before the window covered by the logs you're querying, but it should give you a reasonably accurate picture. If you're willing to do a bit more checking, you can also filter on the user agent passed by the bot instead of the IP, as sketched below.
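
As a rough sketch of that user agent approach, reusing the _Robots.log generated above, the last query could filter on cs(User-Agent) instead of c-ip:

logparser -i:w3c -o:w3c "SELECT TOP 10 cs-uri-stem, COUNT(*) AS [Requests] FROM W3SVC7\*ex*.log WHERE cs(User-Agent) NOT IN (SELECT UserAgent FROM _Robots.log) GROUP BY cs-uri-stem ORDER BY Requests DESC"

Keep in mind that user agents are easily spoofed, so this trades one kind of inaccuracy for another.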

Article provided by James Skemp.
