AI web crawlers such as GPTBot, CCBot, and Google's AI bots now crawl our websites and collect data for their own needs. The question arises: should we block these AI bots in our robots.txt file to protect our content? The short answer is yes. If you examine the list at originality.ai/ai-bot-blocking, you'll realize that a lot of websites are already blocking them.
As we know, robots.txt is a plain text file which gives instructions to web crawlers (also called robots) about which pages or files they should or should not crawl. The file is placed in the root directory of a website: thecustomizewindows.com/robots.txt. So far, we have not needed to do much with this file, since our intention was to get indexed by the search engines.
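The file's format is simple: each record names a user agent and lists the paths it should stay away from. A minimal sketch (the paths here are purely illustrative, not from any real site):

```
# Apply to every crawler
User-agent: *
# Keep crawlers out of a hypothetical private directory
Disallow: /private/
# Everything else remains crawlable
Allow: /
```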
Why Block the AI Crawlers Scraping Our Content?
Because they are scraping our content and we are not getting permanent dofollow backlinks in return. Essentially, your article gets spun (they say it is AI) and maybe SEO-optimized content will be created from it. If you spend some time with Google Bard or any similar tool, you'll realize that:
- Part of your content, cited somewhere else (such as Stack Overflow or Reddit), becomes Bard's content without any mention of your site. You can do nothing, since the language is different after modern text spinning (you could say that is what Generative AI does).
- Your code and snippets get reproduced without attribution.
When you scrape 10 great webpages and spin them, obviously the resulting content becomes good and may outperform the original websites on the SERP; or maybe you just increase your engagement, or you make money. Most importantly, some of these tools remove almost any chance of the source website receiving traffic. Bing usually links to its information sources, but none of these usages will do any good for you. Indeed, your SERP position can fall (maybe it is already falling).

Image credit: seosandwitch.com
Why Not to Block All the AI Crawlers?
If you block Google or Bing or any such search engine, they can indirectly penalize you in the future in some way or the other. Yoast pointed this out on their website, and it is logically acceptable: these companies can tweak their algorithms and use AI bots to show meta descriptions or summaries of your article.
What robots.txt Do We Suggest?
We suggest you block certain bots, including those related to ChatGPT's training and browsing (CCBot from Common Crawl, ChatGPT-User, GPTBot), Anthropic (anthropic-ai), Omgili (Omgilibot, Omgili) and Facebook (FacebookBot, which they use for speech recognition). This is the relevant part of the robots.txt file:
```
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /
```
Blocking them cannot harm our site's SEO. Please note that we have not blocked Google's AI tools or Bing's AI tools. You can add this directive to block Google Bard and similar tools:
```
User-agent: Google-Extended
Disallow: /
```
Will This Work?
God knows! robots.txt merely requests that bots not crawl; compliant bots honor it voluntarily. There are other ways to block them, such as blocking their IP ranges at the server level (or serving them confusing content). OpenAI publishes GPTBot's IP ranges here:
```
https://openai.com/gptbot-ranges.txt
```
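If you prefer firewall-level blocking, those published ranges can be turned into rules. A minimal sketch, assuming the file lists one CIDR block per line and has already been downloaded locally (verify the live file's actual format first); it prints the iptables commands instead of applying them, so you can review them before running anything as root:

```shell
# Read a local copy of gptbot-ranges.txt and emit one DROP rule per CIDR block.
# Assumption: one CIDR per line. Download first with:
#   curl -sO https://openai.com/gptbot-ranges.txt
while read -r cidr; do
  # Skip empty lines defensively
  [ -n "$cidr" ] && printf 'iptables -A INPUT -s %s -j DROP\n' "$cidr"
done < gptbot-ranges.txt
```

Pipe the output through review, then into a root shell to apply.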
You can use .htaccess rules or a WAF which supports blocking these bots (such as Cloudflare, Sucuri and so on). For images, there is also an upcoming method named Glaze. Of course, we need some fail2ban filter for these odd scrapers.
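On Apache, the .htaccess route can be sketched as follows (assuming mod_rewrite is enabled); it returns 403 Forbidden to the same user agents blocked in the robots.txt above:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Match the AI bot user agents case-insensitively
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT-User|GPTBot|anthropic-ai|Omgilibot|Omgili|FacebookBot) [NC]
# Refuse the request with 403 Forbidden
RewriteRule .* - [F,L]
</IfModule>
```

Unlike robots.txt, this does not depend on the bot's cooperation.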
Governments need to enforce rules here.