Your website's robots.txt file is the first thing AI crawlers read when they visit your domain. If it tells them to leave, they will — and your brand won't appear in AI-generated answers from ChatGPT, Perplexity, Gemini, or Claude. Most brands don't realise their robots.txt is blocking AI crawlers. And the ones using a WAF like Cloudflare may be blocking them at the network level without robots.txt even being involved.
This guide covers every AI crawler that matters in 2026, what each one does, and how to configure your robots.txt to control AI access to your site.
Why AI Crawlers Exist
AI crawlers serve three distinct purposes, and understanding the difference matters for how you configure access:
Training crawlers scrape content for future model training. GPTBot (OpenAI), ClaudeBot (Anthropic), and Google-Extended (Google) fall here. Blocking these prevents your content from being used for training — but also means the model won't "know" your brand from its training data.
Search crawlers index content for real-time AI search results. OAI-SearchBot indexes for ChatGPT Search, Claude-SearchBot for Claude. They work like Googlebot — building an index queried when users ask questions.
Browsing bots fetch pages in real time during live conversations. ChatGPT-User, Claude-User, and Perplexity-User operate this way — blocking them means AI can't pull live data from your pages.
The Complete AI Crawler Reference
Tier 1 — Core AI Training Crawlers
These are the most important crawlers to be aware of. They collect data used to train the foundation models behind major AI platforms.
| Bot | Company | Purpose | robots.txt Token |
|---|---|---|---|
| GPTBot | OpenAI | Training data for ChatGPT models | GPTBot |
| ClaudeBot | Anthropic | Training data for Claude | ClaudeBot |
| PerplexityBot | Perplexity | Real-time search answers | PerplexityBot |
| Google-Extended | Google | Gemini / AI Overviews training | Google-Extended |
Tier 2 — Search and Browsing Bots
These crawlers fetch content for real-time AI search and live browsing during conversations. Blocking them has a more immediate impact on your AI visibility than blocking training crawlers.
| Bot | Company | Purpose | robots.txt Token |
|---|---|---|---|
| OAI-SearchBot | OpenAI | ChatGPT Search index | OAI-SearchBot |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | ChatGPT-User |
| OAI-AdsBot | OpenAI | Ad landing page validation | OAI-AdsBot |
| Claude-User | Anthropic | Real-time browsing in Claude | Claude-User |
| Claude-SearchBot | Anthropic | Search result indexing for Claude | Claude-SearchBot |
| anthropic-ai | Anthropic | General-purpose crawler | anthropic-ai |
| Perplexity-User | Perplexity | Real-time fetching during queries | Perplexity-User |
| GoogleOther | Google | Supplementary AI training crawling | GoogleOther |
| Bytespider | ByteDance | TikTok AI / Doubao training | Bytespider |
Note on Bytespider: ByteDance does not publish official documentation for this crawler. Many site operators block it due to aggressive crawl volumes and lack of transparency.
Tier 3 — AI-Adjacent Crawlers
| Bot | Company | Purpose | robots.txt Token |
|---|---|---|---|
| CCBot | Common Crawl | Open dataset used by many LLMs | CCBot |
| Amazonbot | Amazon | Alexa / Amazon AI features | Amazonbot |
| Meta-ExternalAgent | Meta | Meta AI / Llama training | Meta-ExternalAgent |
| Applebot-Extended | Apple | Apple Intelligence / Siri | Applebot-Extended |
Other crawlers worth knowing about: Bingbot powers Microsoft Copilot — blocking Bingbot means blocking Copilot entirely. cohere-ai is Cohere's training crawler. Diffbot builds knowledge graphs used by multiple AI applications.
The Default Blocking Problem
Many brands are blocking AI crawlers without knowing it. This happens in three ways:
Platform defaults. Shopify, WordPress, and other CMS platforms ship with robots.txt configurations that don't explicitly allow AI crawlers. If you haven't edited your robots.txt since 2024, you likely haven't accounted for the wave of new AI crawlers.
Aggressive bot protection. Cloudflare's "Bot Fight Mode," AWS WAF rules, and similar services block automated traffic — and AI crawlers are automated traffic. Without explicit allow-lists, these services block legitimate AI bots alongside malicious scrapers.
Inherited configurations. Development teams copy robots.txt files from templates that predate the AI crawler era and frequently include blanket bot-blocking rules.
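A hypothetical example of such an inherited template (not taken from any specific platform) shows why this matters: the blanket `User-agent: *` group denies every AI crawler listed above, because none of them has a group of its own.

```
# Legacy template: allow traditional search, block everything else
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
```

Under robots.txt matching rules, a crawler follows the most specific group that names it; with no group for GPTBot, ClaudeBot, or PerplexityBot, they all fall through to `User-agent: *` and are blocked.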
How to Check Your AI Crawler Access
Manual check: Visit yourdomain.com/robots.txt in a browser and search for the bot names listed above. If you see Disallow: / for GPTBot, ClaudeBot, or PerplexityBot, those crawlers are blocked.
Automated check: Crawl Radar tests your site against every major AI crawler and reports which ones can access your pages — including WAF-level blocks that robots.txt alone won't reveal.
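The manual check can also be scripted with Python's standard-library `urllib.robotparser`. A minimal sketch, using an inline sample robots.txt body (in practice you would paste in the contents of your own file or fetch `https://yourdomain.com/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt body — substitute your own file's contents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def check_access(robots_body: str, bots: list[str]) -> dict[str, bool]:
    """Return whether each bot may fetch the site root under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return {bot: parser.can_fetch(bot, "/") for bot in bots}

for bot, allowed in check_access(ROBOTS_TXT, AI_BOTS).items():
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

Note that a bot with no matching group and no `User-agent: *` fallback (ClaudeBot and Google-Extended in this sample) defaults to allowed.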
Example robots.txt Configuration
Here's a configuration that allows all major AI crawlers to access your site:
```
# AI Crawlers — Allow
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```
If you want to allow AI search and browsing but block training, you can be selective:
```
# Block training crawlers
User-agent: GPTBot
Disallow: /

# Allow search and browsing bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
```
This tells OpenAI not to use your content for model training while still allowing ChatGPT Search and Perplexity to cite your pages in real-time answers.
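Before deploying a selective policy like this, it is worth sanity-checking that each token resolves the way you intend. A quick sketch with Python's standard-library parser, with the policy inlined:

```python
from urllib.robotparser import RobotFileParser

# The selective policy, inlined for the check.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(POLICY.splitlines())

for bot in ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]:
    verdict = "allowed" if parser.can_fetch(bot, "/") else "blocked"
    print(f"{bot} -> {verdict}")
```

One subtlety this surfaces: because there is no `User-agent: *` group, any token not named here (ClaudeBot, for instance) defaults to allowed, which may or may not be what you want.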
WAF vs robots.txt — A Critical Distinction
robots.txt is advisory — it tells well-behaved crawlers what to do, but the crawler must first reach your server to read the file.
A Web Application Firewall (WAF) operates at the network level. It blocks requests before they reach your server. If your Cloudflare settings block GPTBot's IP range or user-agent string, your robots.txt is irrelevant — the crawler never gets to read it.
This is why testing actual access matters more than reading your robots.txt. Many brands have a robots.txt that says "allow" while their WAF silently blocks every AI crawler at the door.
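One way to probe for user-agent-based WAF blocks is to request the same URL with a bot user-agent string and a browser user-agent string and compare status codes. The sketch below assumes the WAF returns an error status (commonly 403) for blocked user-agents; the helper names are illustrative, and this detects only UA-based blocks, not IP-range blocks.

```python
import urllib.error
import urllib.request

def fetch_status(url: str, user_agent: str) -> int:
    """Return the HTTP status code for a GET with the given User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

def classify(bot_status: int, browser_status: int) -> str:
    """Interpret the status pair: a bot-only error suggests a UA-level block."""
    if bot_status == browser_status:
        return "same response - no UA-based block detected"
    if bot_status >= 400 and browser_status < 400:
        return "likely WAF block on the bot user-agent"
    return "inconclusive - compare responses manually"

# Example (requires network access; substitute your own domain):
# bot = fetch_status("https://yourdomain.com/", "GPTBot/1.0")
# human = fetch_status("https://yourdomain.com/",
#                      "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
# print(classify(bot, human))
```

A 200 for the browser string alongside a 403 for the bot string is the telltale pattern of a WAF blocking at the door while robots.txt says "allow".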
The Agentic Browser Challenge
A new category of AI agents emerged in 2025-2026 that robots.txt cannot control at all.
OpenAI's Operator and Google's Project Mariner are agentic browsers — they control real browser instances with standard Chrome user-agent strings. They navigate websites the way a human would. From your server's perspective, they look identical to a human visitor. There is no robots.txt token to block them and no reliable way to detect them in server logs.
The practical implication: assume that AI agents will eventually access anything a human can access on the open web. Focus on making that content work for you — structured, accurate, and citation-ready — rather than trying to restrict it.
Key Takeaways
- AI crawlers fall into three categories: training (GPTBot, ClaudeBot), search (OAI-SearchBot), and browsing (ChatGPT-User) — each has different visibility implications
- Blocking training crawlers is a choice, but blocking search and browsing bots directly reduces your AI visibility
- robots.txt is advisory — WAFs like Cloudflare can block AI crawlers at the network level before robots.txt is even read
- Many CMS platforms and security tools block AI crawlers by default without explicit configuration
- Agentic browsers like OpenAI Operator use standard Chrome user-agents and cannot be controlled via robots.txt
- Test actual crawler access with Crawl Radar rather than relying on your robots.txt file alone