Web crawlers are software programs used by applications such as search engines to crawl and index website pages. Those indexed pages are later surfaced in search results, ranked by each engine's own (often complex) algorithm. We want our website to be crawled by well-known search engines so that it appears in those results; we are happy to have crawlers like Googlebot, Bingbot, or Moz's bot visit our site. But not all of them. Unfortunately, there has been a tremendous rise in the number of web crawlers and in how deeply they penetrate sites, and the speed and scope of crawling keeps increasing. For some websites the crawler-to-visitor ratio has reached 100:1.
This has led to a waste of online resources: server bandwidth, CPU, and memory. Imagine a medical clinic where the MR (Medical Representative) to patient ratio becomes 100:1; a patient has to wait and compete with 100 MRs just to reach the doctor. Something similar could be happening to your website without you being aware of it: your real audience may be competing with crawlers for server resources. The only way to know is to look at your web server's access log. No, you won't find this in Google Analytics or in the Google Webmaster/Search Console reports.
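If you want a rough estimate of that ratio yourself, a few lines of Python can tally user agents from an access log. This is only a sketch: it assumes the common Apache/Nginx "combined" log format, where the user-agent string is the last double-quoted field, and the sample lines below are made up for illustration.

```python
import re
from collections import Counter

# The last quoted field of a combined-format log line is the
# User-Agent string; adjust this if your server logs differently.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(lines):
    """Tally user-agent strings across access-log lines."""
    counts = Counter()
    for line in lines:
        m = UA_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Two sample lines standing in for a real access log:
sample = [
    '1.2.3.4 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"',
    '5.6.7.8 - - [01/Jan/2025:00:00:02 +0000] "GET /about HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for agent, n in count_user_agents(sample).most_common():
    print(n, agent)
```

Run over a real log (for example `count_user_agents(open("/var/log/nginx/access.log"))`, path depending on your server) this shows at a glance how much of your traffic is crawlers versus real visitors.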
Today, crawlers are not crawling websites for search engines alone. There are newer purposes: market analysis, content ranking, income estimation, authenticity checks, archival, and some crawlers may simply be stealing data in the name of crawling. 'AI' has given birth to a new scenario where a crawler takes your data, rephrases it, and posts it on another website. Scary, right?
It’s a long topic of discussion, so I’ll jump to one possible solution: robots.txt. The robots.txt file is kept at the root level of the website. A genuine web crawler reads this file first and, based on its rules, decides whether to skip the website entirely, skip some URLs, or slow down its crawling.
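You can check how a given robots.txt will be interpreted before deploying it, using Python's standard urllib.robotparser module. The rules and URLs below are a small made-up example:

```python
from urllib.robotparser import RobotFileParser

# A tiny robots.txt: block GPTBot everywhere, allow everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved GPTBot should skip the site entirely;
# any other crawler is allowed.
print(rp.can_fetch("GPTBot", "https://example.com/page"))
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))
```

This only tells you what a rule-abiding crawler should do; it cannot, of course, make a misbehaving crawler comply.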
Following is one suggested robots.txt file to avoid unwanted crawling and save your website's resources. You can copy this content and save it as robots.txt in the document root of your website; it could improve your website's performance. But be careful: if some of the listed crawlers are required for your business, remove their entries before publishing this on your website. If you make that mistake, you may not see the effect immediately, but in the long run your website might get fewer referrals from the sites behind those crawlers.
User-agent: PetalBot
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: Exabot
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: SemrushBot-BA
Disallow: /
User-agent: SemrushBot-SI
Disallow: /
User-agent: SemrushBot-SWA
Disallow: /
User-agent: SemrushBot-CT
Disallow: /
User-agent: SemrushBot-BM
Disallow: /
User-agent: SemrushBot-SEOAB
Disallow: /
User-agent: AdIdxBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: TwengaBot
Disallow: /
User-agent: 008
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: WotBox
Disallow: /
User-agent: Sosospider
Disallow: /
User-agent: SeznamBot
Disallow: /
User-agent: ZumBot
Disallow: /
User-agent: coccocbot-web
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: CriteoBot/0.1
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: serpstatbot
Disallow: /
User-agent: PiplBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: DataForSeoBot
Disallow: /
User-agent: proximic
Disallow: /
User-agent: 360Spider
Disallow: /
User-agent: Yisouspider
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Sogou web spider
Disallow: /
User-agent: Sogou inst spider
Disallow: /
User-agent: SeekportBot
Disallow: /
User-agent: barkrowler
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: grapeshot
Disallow: /
User-agent: ImagesiftBot
Disallow: /

User-agent: *
Allow: /
Crawl-delay: 60
One more thing to note when you look at your website log: what seems to be a genuine web crawler could be a fake one that does not abide by the rules set in your robots.txt. For those you might need alternative solutions such as IP blocking, a firewall, rate limiting, or a mix of several measures.
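One common way to separate a real Googlebot from an impostor that merely copies its user-agent string is a reverse-then-forward DNS check: resolve the client IP to a hostname, check that the hostname belongs to the crawler's published domain, then resolve that hostname back and confirm it returns the original IP. Below is a rough sketch using only Python's standard socket module; the domain suffixes are the ones Google documents for Googlebot, and the table is meant to be extended for whichever crawlers you care about.

```python
import socket

# Documented hostname suffixes per crawler; extend as needed.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
}

def host_matches(hostname, suffixes):
    """True if hostname ends with one of the expected domain suffixes."""
    return hostname.rstrip(".").lower().endswith(tuple(suffixes))

def verify_crawler(ip, crawler="googlebot"):
    """Reverse-resolve ip, check the domain, then forward-confirm it."""
    suffixes = CRAWLER_DOMAINS[crawler]
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse DNS
        if not host_matches(hostname, suffixes):
            return False
        _, _, addrs = socket.gethostbyname_ex(hostname)  # forward DNS
        return ip in addrs                               # must round-trip
    except OSError:                                      # DNS lookup failed
        return False
```

Calling `verify_crawler("66.249.66.1")` against an address from Google's published Googlebot ranges should return True, but the result depends on live DNS, so treat this as a demo rather than production code.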