Each Web site should generally have a robots.txt file, which well-behaved search engine bots request to find out what content they should not crawl. Not every bot honors it, but the good ones do.
Using Log Parser, we can find these legit bots with a command like the following:
logparser -rtp:-1 -i:w3c -o:w3c file:Robots.log.sql
Where Robots.log.sql is as follows:
SELECT c-ip,
       cs(User-Agent) AS [UserAgent],
       COUNT(*) AS [Requests],
       MIN(TO_DATE(TO_LOCALTIME(TO_TIMESTAMP(date, time)))) AS FirstDate,
       MAX(TO_DATE(TO_LOCALTIME(TO_TIMESTAMP(date, time)))) AS LastDate
INTO _Robots.log
FROM W3SVC7\*ex*.log
WHERE cs-uri-stem = '/robots.txt'
GROUP BY c-ip, UserAgent
ORDER BY c-ip, UserAgent
One thing to watch out for: if you've viewed the robots.txt file yourself, your own IP address will be in _Robots.log too. Check for it and remove it, since you probably don't want to exclude yourself.
After verifying the above, you can use that log file in the WHERE clause of other queries. For example, the following query will pull the first 10 requests for a .txt file from an IP address not in the above _Robots.log:
logparser -i:w3c -o:w3c "SELECT TOP 10 c-ip, cs-uri-stem FROM W3SVC7\*ex*.log WHERE EXTRACT_EXTENSION(cs-uri-stem) = 'txt' AND c-ip NOT IN (SELECT TO_STRING(c-ip) FROM _Robots.log)"
Tying it together, the following grabs the 10 most requested pieces of content, excluding requests from legit robots:
logparser -i:w3c -o:w3c "SELECT TOP 10 cs-uri-stem, COUNT(*) AS [Requests] FROM W3SVC7\*ex*.log WHERE c-ip NOT IN (SELECT TO_STRING(c-ip) FROM _Robots.log) GROUP BY cs-uri-stem ORDER BY Requests DESC"
Removing the WHERE clause counts the robots' requests again, which should return some rather different results.
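If you want to cross-check the numbers outside Log Parser, the same exclusion logic can be sketched in a few lines of Python. This is only a minimal illustration, not part of the Log Parser approach above: the helper names (parse_w3c, top_content) and the sample log lines are made up, and it assumes the #Fields: directive in your logs lists c-ip and cs-uri-stem. Unlike the queries above, it builds the robot list and the top-content count in one pass over the same entries instead of writing an intermediate _Robots.log.

```python
from collections import Counter


def parse_w3c(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields: directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # "#Fields: date time c-ip ..." -> ["date", "time", "c-ip", ...]
            fields = line.split()[1:]
        elif line and not line.startswith("#"):
            yield dict(zip(fields, line.split()))


def top_content(entries, limit=10):
    """Return the most requested URI stems, skipping IPs that fetched /robots.txt."""
    entries = list(entries)
    robot_ips = {e["c-ip"] for e in entries if e["cs-uri-stem"] == "/robots.txt"}
    counts = Counter(e["cs-uri-stem"] for e in entries if e["c-ip"] not in robot_ips)
    return counts.most_common(limit)


# Hypothetical sample data, for illustration only.
SAMPLE_LOG = """\
#Version: 1.0
#Fields: date time c-ip cs-uri-stem cs(User-Agent)
2010-01-01 00:00:01 192.0.2.10 /robots.txt ExampleBot/1.0
2010-01-01 00:00:02 192.0.2.10 /index.htm ExampleBot/1.0
2010-01-01 00:00:03 192.0.2.10 /about.htm ExampleBot/1.0
2010-01-01 00:00:04 198.51.100.7 /index.htm Mozilla/5.0
2010-01-01 00:00:05 198.51.100.7 /about.htm Mozilla/5.0
2010-01-01 00:00:06 203.0.113.5 /index.htm Mozilla/5.0
""".splitlines()

if __name__ == "__main__":
    for stem, requests in top_content(parse_w3c(SAMPLE_LOG)):
        print(stem, requests)
```

In the sample, 192.0.2.10 requested robots.txt, so none of its three requests are counted.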
Of course, this may not catch all robots, especially those that last grabbed your robots.txt file before the period your logs cover, but it should give you a pretty good number. If you're willing to do a bit more checking, you can also match on the user agent passed by the bot instead of the IP address.
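To see why matching on the user agent can catch more, here is a small Python sketch (again only an illustration, not a Log Parser query). The Entry type, the top_content helper, and the sample entries are all hypothetical. The sample assumes a bot that fetches robots.txt from one address and then crawls from another, so an IP-based filter misses its page requests while a user-agent filter removes them. Keep in mind that the user agent is whatever the client chooses to send, which is where the extra checking comes in.

```python
from collections import Counter
from typing import NamedTuple


class Entry(NamedTuple):
    ip: str     # c-ip
    stem: str   # cs-uri-stem
    agent: str  # cs(User-Agent)


# Hypothetical sample data, for illustration only.
ENTRIES = [
    Entry("192.0.2.10", "/robots.txt", "ExampleBot/1.0"),
    Entry("192.0.2.11", "/index.htm", "ExampleBot/1.0"),  # same bot, second address
    Entry("192.0.2.11", "/about.htm", "ExampleBot/1.0"),
    Entry("198.51.100.7", "/index.htm", "Mozilla/5.0"),
    Entry("203.0.113.5", "/index.htm", "Mozilla/5.0"),
    Entry("203.0.113.5", "/about.htm", "Mozilla/5.0"),
]


def top_content(entries, key, limit=10):
    """Count URI stems, excluding clients whose key value also fetched /robots.txt.

    key picks what identifies a client: its IP address or its user agent.
    """
    robots = {key(e) for e in entries if e.stem == "/robots.txt"}
    counts = Counter(e.stem for e in entries if key(e) not in robots)
    return counts.most_common(limit)


by_ip = top_content(ENTRIES, key=lambda e: e.ip)
by_agent = top_content(ENTRIES, key=lambda e: e.agent)

if __name__ == "__main__":
    print("Excluding by IP:        ", by_ip)
    print("Excluding by user agent:", by_agent)
```

Filtering by IP still counts the bot's two page requests from 192.0.2.11; filtering by user agent drops them.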
Article provided by James Skemp.