AI training crawlers can pull your pages, images, and feeds at high volume. Search crawlers can also crawl your site, but they help you rank and earn traffic. You can use robots.txt to send clear crawl rules to both groups. You can block many known training bots while you keep Google, Bing, and other search bots allowed.
This guide shows how to configure robots.txt to block training bots while allowing search bots. It also explains what robots.txt can and cannot do, so you set the right expectations and add stronger controls where needed.
Key Takeaways
- robots.txt is a crawl directive file that many bots follow, but it is not a hard security barrier.
- Allow major search bots (Googlebot, Bingbot) and block known AI training bots by user-agent.
- Keep your rules simple to avoid accidental deindexing, broken rendering, or blocked assets.
- Test your robots.txt with Google Search Console and live fetch checks before and after you deploy.
- Use layered controls (WAF, rate limits, bot management) for bots that ignore robots.txt.
- Document and monitor changes so you can reverse mistakes fast and track bot traffic shifts.
Comprehensive List of All Bots (Updated)
| Category | Bot name (robots.txt user-agent token) | Operated by | What it’s typically used for |
|---|---|---|---|
| AI Search bots | OAI-SearchBot | OpenAI | Indexing for ChatGPT search results. (OpenAI Platform) |
| AI Search bots | PerplexityBot | Perplexity | Surfacing and linking sites in Perplexity search results. (docs.perplexity.ai) |
| AI Search bots | Claude-SearchBot | Anthropic | Improving search result quality for Claude users. (privacy.claude.com) |
| AI Search bots | Applebot | Apple | Search-related features across Apple experiences (Spotlight, Siri, Safari). (Apple Support) |
| AI Search bots | Amazonbot | Amazon | Crawling to improve products and services; may be used to train Amazon AI models. (Developer Portal Master) |
| AI “user fetch” bots | ChatGPT-User | OpenAI | User-initiated page fetches from ChatGPT (not automatic crawling). (OpenAI Platform) |
| AI “user fetch” bots | Perplexity-User | Perplexity | User-initiated fetches to answer queries (not training crawl). (docs.perplexity.ai) |
| AI “user fetch” bots | Claude-User | Anthropic | User-initiated website access for Claude responses. (privacy.claude.com) |
| AI “user fetch” bots | Meta-ExternalFetcher | Meta | Fetches individual links for product features (user-initiated fetch style). (Cloudflare Radar) |
| AI Training bots | GPTBot | OpenAI | Collecting content that may be used for training OpenAI foundation models. (OpenAI Platform) |
| AI Training bots | ClaudeBot | Anthropic | Collecting web content that could contribute to model training. (privacy.claude.com) |
| AI Training control token | Google-Extended | Google | Controls whether Google-crawled content can be used for Gemini training and grounding. (Google for Developers) |
| AI Training control token | Applebot-Extended | Apple | Controls whether content can be used to train Apple foundation models (Apple Intelligence, etc.). (Apple Support) |
| AI Training bots | AI2Bot | Allen Institute for AI (AI2) | Crawls web content used to train open language models (per AI2 notice). (Allen AI) |
| AI Training/data bots | CCBot | Common Crawl | Crawls the web for Common Crawl datasets used for research and ML. (commoncrawl.org) |
| AI Training/data bots | meta-externalagent | Meta | Meta’s general web crawler, used for content retrieval and often referenced for AI data collection. (Facebook Developers) |
| AI Training/data bots | Bytespider | ByteDance | Commonly identified as a crawler token seen in logs/robots policies. (ColorTokens) |
| AI Training/data bots | cohere-ai | Cohere | Commonly identified as an AI crawler token seen in logs/robots policies. (ColorTokens) |
| AI Training/data bots | DeepSeekBot | DeepSeek | Commonly identified as an AI crawler token for data collection. (DataDome) |
| Other Search bots | Googlebot | Google | Main Google Search crawler. (Google for Developers) |
| Other Search bots | bingbot | Microsoft Bing | Main Bing Search crawler. (Search - Microsoft Bing) |
| Other Search bots | DuckDuckBot | DuckDuckGo | DuckDuckGo search crawler. (DuckDuckGo) |
| Other Search/preview bots | facebookexternalhit | Meta | Fetches URLs for link previews and related features. (Facebook Developers) |
| Other Search/preview bots | Facebot | Meta | A second Meta crawler used to fetch content for Meta surfaces. (humansecurity.com) |
| Other Search bots | Baiduspider | Baidu | Baidu search crawler (commonly referenced bot token). (humansecurity.com) |
| Other Search bots | YandexBot | Yandex | Yandex search crawler (commonly referenced bot token). (humansecurity.com) |
| Other crawler bots | AhrefsBot | Ahrefs | SEO/backlink crawler. (humansecurity.com) |
| Other crawler bots | SemrushBot | Semrush | SEO research crawler. (humansecurity.com) |
| Other crawler bots | MJ12Bot | Majestic | Backlink index crawler. (humansecurity.com) |
Understand what robots.txt does (and what it does not)

Before you edit rules, you need a clear model of how bots read robots.txt. This helps you block training bots without harming search visibility.
What robots.txt controls
- Crawling: It tells compliant bots which paths they should not fetch.
- Crawl focus: It can reduce load by keeping bots out of low-value sections.
- Bot-specific rules: You can set different rules per user-agent.
What robots.txt does not control
- Access: A blocked URL can still load in a browser if it is public.
- Indexing in all cases: Some search engines can index a URL based on links even if crawling is blocked. (Google usually needs crawl access to fully index content, but a URL can still appear as a “URL-only” result.)
- Non-compliant bots: Some scrapers and some training crawlers can ignore robots.txt.
Action steps
- Use robots.txt as your first filter for known, compliant training bots.
- Add server-side controls for bots that ignore rules.
- Do not place secrets behind robots.txt. Use auth or IP allowlists for that.
Know the difference: training bots vs search bots
You can block training bots and allow search bots, but you need to identify them correctly. You also need to avoid broad rules that catch the wrong crawlers.
Common search bots you usually want to allow
- Googlebot (Google Search)
- Bingbot (Bing Search)
- DuckDuckBot (DuckDuckGo)
- Applebot (Apple Search and Spotlight)
- YandexBot (if you serve that market)
- Baiduspider (if you serve that market)
Common training bots you may want to block
Bot names change often. Use your server logs to confirm what hits your site. Many sites block these user-agents in robots.txt:
- GPTBot (OpenAI)
- CCBot (Common Crawl)
- Google-Extended (Google’s AI training control token for some products)
- anthropic-ai and ClaudeBot (Anthropic)
- Bytespider (ByteDance)
- Amazonbot (Amazon)
Action steps
- List the bots that matter to your traffic and your risk.
- Confirm bot names in logs, not in guesswork.
- Decide what you will allow: search crawling, ads crawling, preview bots, uptime monitors.
Set your robots.txt goals before you write rules
Clear goals prevent rule sprawl. They also prevent a common mistake: blocking search bots by accident while you chase training bots.
Goal checklist
- Keep indexing stable: Allow Googlebot and Bingbot on key pages.
- Block training crawlers: Disallow known AI user-agents.
- Protect server resources: Reduce crawl load on heavy endpoints.
- Keep rendering intact: Avoid blocking CSS, JS, and images needed for page rendering.
- Keep private areas private: Use auth for private paths, then also disallow them in robots.txt as a hint.
Decide what content you want to block from training bots
- Full site (common for publishers who do not want AI reuse)
- Only premium sections (paywalled, member-only)
- Only high-value assets (images, PDFs, datasets)
- Only APIs and feeds (RSS, JSON endpoints)
Action steps
- Write down the exact paths you want to protect (examples: /premium/, /members/, /api/, /feeds/).
- List the assets that must stay crawlable for SEO (examples: /wp-content/, /assets/, /images/).
Build a safe baseline robots.txt (allow search bots first)
Start with a baseline that keeps search crawling healthy. Then add training bot blocks. This order reduces the risk of an SEO outage.
Baseline example for most sites
This baseline allows all bots by default, blocks only common private or low-value areas, and declares your sitemap:
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Why this baseline works
- It avoids broad blocks like Disallow: /.
- It blocks areas that create thin or duplicate pages.
- It keeps important AJAX endpoints available for WordPress sites.
Action steps
- Replace example.com with your domain.
- Remove paths you do not use.
- Add your actual sitemap URL(s). Some sites use multiple sitemap files.
Add rules to block training bots while allowing search bots
Now you can add user-agent groups for training bots. In robots.txt, the most specific group that matches a bot name applies. Keep each group simple.
Copy-paste robots.txt example (block training bots, allow search bots)
This example blocks several known training bots and keeps broad access for search bots:
# --- Allow major search bots (optional but clear) ---
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: DuckDuckBot
Disallow:
User-agent: Applebot
Disallow:
# --- Block common AI training crawlers ---
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
# --- Default rules for all other bots ---
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
How this pattern works
- Search bots get an explicit allow rule (Disallow: with a blank value means allow all).
- Training bots get a full-site block (Disallow: /).
- All other bots follow the default group.
Action steps
- Keep the bot names exact. Spelling matters.
- Keep one User-agent line per group for clarity.
- Place the Sitemap line at the end; it is not tied to any group, so bots can read it anywhere in the file.
Block training bots from specific sections (instead of the full site)
Some site owners want search engines to index everything, but they want to block training bots from premium pages, datasets, or images. You can disallow only those paths for training bots.
Example: block premium and API paths for training bots
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Disallow: /api/
Disallow: /feeds/
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
Disallow: /api/
Disallow: /feeds/
User-agent: *
Disallow: /admin/
Disallow: /login/
Sitemap: https://example.com/sitemap.xml
Example: block image folders for training bots (use with care)
Blocking images can reduce reuse, but it can also reduce image search traffic. If image SEO matters, do not block images for Googlebot-Image or Bingbot.
User-agent: GPTBot
Disallow: /images/
Disallow: /media/
Disallow: /wp-content/uploads/
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Action steps
- Pick path blocks that match your URL structure.
- Do not block shared asset folders if your pages need them to render.
- Check if your CDN uses a separate hostname for images. robots.txt applies per host.
Avoid common robots.txt mistakes that break SEO
Many sites lose rankings because of a small robots.txt error. Use this section as a pre-launch checklist.
Mistake: blocking the entire site for all bots
This rule blocks everything:
User-agent: *
Disallow: /
- Use this only for staging sites or emergency cases.
- If you need to block training bots only, target their user-agents instead.
Mistake: blocking CSS and JS needed for rendering
- Google renders pages. If you block key assets, Google can misread layout, content, and structured data.
- Do not disallow broad folders like /assets/ unless you know the impact.
Mistake: relying on robots.txt to hide sensitive content
- robots.txt is public. Anyone can read it.
- Use authentication, signed URLs, or network controls for sensitive areas.
Mistake: assuming user-agent strings prove identity
- Any client can claim it is “Googlebot” in the user-agent header.
- Use reverse DNS validation for Googlebot and Bingbot if you need proof (see the sketch below).
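A minimal verification sketch using only the Python standard library: reverse-resolve the requesting IP, confirm the hostname belongs to a known crawler domain (googlebot.com or google.com for Googlebot, search.msn.com for Bingbot), then forward-resolve that hostname and confirm it maps back to the same IP. The sample IP is a placeholder; feed in addresses from your own logs.

```python
# Sketch: reverse + forward DNS validation for an IP that claims to be a search crawler.
import socket

VALID_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def verify_crawler_ip(ip: str) -> bool:
    """Return True if the IP reverse-resolves to a known crawler domain
    and that hostname forward-resolves back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        if not hostname.endswith(VALID_SUFFIXES):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward DNS
        return ip in forward_ips
    except OSError:
        return False

print(verify_crawler_ip("66.249.66.1"))  # placeholder IP; use addresses from your logs
```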
Action steps
- Run a diff check before you deploy changes.
- Keep a rollback copy of the last known good robots.txt.
- Monitor index coverage and crawl stats after each change.
Use the right syntax: rules, wildcards, and precedence
robots.txt syntax looks simple, but small details change outcomes. Use these rules to keep behavior predictable.
Core directives you will use
- User-agent: The bot name the group applies to.
- Disallow: A path prefix the bot should not crawl.
- Allow: A path prefix the bot may crawl even if a broader disallow matches.
- Sitemap: A sitemap URL to help discovery.
Wildcards and end-of-line markers
Google and Bing support common pattern matching:
- * matches any string.
- $ matches the end of the URL.
Example: block all PDF files for a bot:
User-agent: GPTBot
Disallow: /*.pdf$
Precedence rules you should know
- A bot follows only the most specific user-agent group that matches its name; that group overrides the User-agent: * group.
- Within a group, the most specific matching rule usually wins (longer path match); a small illustration follows below.
- Some bots interpret edge cases differently. Keep patterns simple to reduce surprises.
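To make the longest-match rule concrete, here is a tiny illustrative sketch in Python (not a real robots.txt parser, and it ignores * and $): it picks the matching rule with the longest path and lets Allow win a tie, which mirrors how Google documents its precedence.

```python
# Illustration only: longest matching path wins; Allow beats Disallow on a tie.
RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

def is_allowed(url_path: str) -> bool:
    matches = [(kind, path) for kind, path in RULES if url_path.startswith(path)]
    if not matches:
        return True  # no rule matches, so crawling is allowed
    kind, _ = max(matches, key=lambda m: (len(m[1]), m[0] == "allow"))
    return kind == "allow"

print(is_allowed("/wp-admin/admin-ajax.php"))  # True: the longer Allow rule wins
print(is_allowed("/wp-admin/options.php"))     # False: only the Disallow rule matches
```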
Action steps
- Prefer direct path blocks over heavy wildcard use.
- Use
$only when you need exact file-type blocks. - Test patterns with real URLs from your site.
Deploy robots.txt correctly (location, status codes, and caching)
Even perfect rules fail if the file is not reachable. Bots fetch robots.txt from a fixed location.
Correct location
- robots.txt must be at: https://yourdomain.com/robots.txt
- Each subdomain needs its own file (example: https://cdn.yourdomain.com/robots.txt).
Correct server response
- Serve a 200 OK status for a valid file.
- A 404 often means “no restrictions” for many bots.
- A 5xx can cause bots to pause crawling or assume temporary limits.
Caching and propagation
- CDNs can cache robots.txt. Purge cache after updates.
- Some bots cache robots.txt for hours or days. Changes can take time to apply.
Action steps
- Open /robots.txt in a browser and confirm the content matches your latest version.
- Check headers to confirm you serve the right status code (a scripted check is sketched below).
- Purge CDN cache for /robots.txt after each update.
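If you want to script that header check, a minimal sketch with Python's standard library can print the status code and caching headers and spot-check that an expected rule survived the deploy (example.com and GPTBot are placeholders):

```python
# Sketch: confirm robots.txt returns 200, inspect caching headers, and spot-check a rule.
import urllib.request

URL = "https://example.com/robots.txt"  # replace with your domain

with urllib.request.urlopen(URL) as resp:
    body = resp.read().decode("utf-8", errors="replace")
    print("Status:", resp.status)                               # expect 200
    print("Cache-Control:", resp.headers.get("Cache-Control"))
    print("Age:", resp.headers.get("Age"))                      # a high Age can mean a stale CDN copy
    print("Mentions GPTBot:", "GPTBot" in body)
```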
Test robots.txt before and after you publish
Testing prevents silent SEO damage. You should test both syntax and real crawl behavior.
Use Google Search Console robots.txt tools
- Use Search Console's robots.txt report (which replaced the older robots.txt tester) or URL Inspection to check crawl access.
- Test key URLs: homepage, category pages, product pages, blog posts, CSS/JS assets, sitemap URLs.
Run live checks with curl
Fetch the file and confirm it returns 200 and the expected content:
curl -I https://example.com/robots.txt
curl https://example.com/robots.txt
Validate bot behavior with logs
- Check server logs for requests to /robots.txt.
- Track requests from blocked user-agents and confirm they decrease over time (see the sketch below).
- Watch for spikes from unknown bots. Add them to your block list if needed.
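A rough log sketch in Python can track this for you. It assumes a combined-format access log where the user-agent is the last quoted field; the log path and watchlist are placeholders you should adjust.

```python
# Sketch: count requests per blocked training bot so you can confirm they taper off.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder; use your server's log path
WATCHLIST = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider", "Amazonbot")

ua_pattern = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user-agent in combined format
counts = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = ua_pattern.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in WATCHLIST:
            if bot.lower() in user_agent:
                counts[bot] += 1

for bot, hits in counts.most_common():
    print(f"{bot}: {hits} requests")
```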
Action steps
- Create a list of 20 to 50 important URLs and test them after every robots.txt change (a scripted check is sketched below).
- Monitor crawl stats and index coverage for 7 to 14 days after deployment.
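For the URL list, a rough local check with Python's urllib.robotparser can flag obvious mistakes. It does not replicate Google's wildcard (*, $) or longest-match handling, so treat Search Console as the source of truth; the URLs and agents below are placeholders.

```python
# Sketch: check important URLs against the live robots.txt for several user-agents.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://example.com/robots.txt"
IMPORTANT_URLS = [
    "https://example.com/",
    "https://example.com/blog/some-post/",
    "https://example.com/premium/guide/",
]
AGENTS = ["Googlebot", "Bingbot", "GPTBot", "CCBot"]

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetches and parses the live file

for agent in AGENTS:
    for url in IMPORTANT_URLS:
        verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
        print(f"{agent:10} {verdict:8} {url}")
```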
Handle bots that ignore robots.txt (use stronger controls)
Some training bots and many scrapers do not follow robots.txt. If you need real enforcement, use server-side controls.
Use a WAF or bot management
- Block by verified bot signatures when your provider supports it.
- Challenge suspicious traffic with JS challenges or managed challenges.
- Set rate limits for high-request patterns (example: many pages per minute); a rough sketch follows below.
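If you cannot use a managed WAF or your CDN's rate limiting and you control the application layer, a very rough in-process sketch of a per-IP request budget looks like this. The window and threshold are placeholders to tune from your logs; a real deployment usually belongs at the proxy or WAF level.

```python
# Sketch: sliding-window request budget per client IP.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120  # placeholder threshold; tune from your traffic

_hits = defaultdict(deque)

def allow_request(client_ip: str) -> bool:
    """Return True if the client is still under its per-window request budget."""
    now = time.time()
    window = _hits[client_ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                 # drop requests older than the window
    if len(window) >= MAX_REQUESTS:
        return False                     # over budget: block or challenge this request
    window.append(now)
    return True
```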
Block by IP and ASN (with caution)
- IP blocks can stop abuse fast, but IPs can change.
- ASN blocks can be too broad and can hit real users if you block large providers.
Protect high-value endpoints
- Add auth to APIs and feeds that do not need to be public.
- Use signed URLs for file downloads.
- Throttle endpoints that cause heavy database load.
Action steps
- Start with robots.txt blocks for known compliant training bots.
- Add rate limiting for paths that get scraped (examples: /search, /tag/, /wp-json/).
- Escalate to WAF rules if logs show bots ignore robots.txt.
Use robots meta tags and headers for page-level control
robots.txt works at the path level. If you need page-level control, use meta robots tags or X-Robots-Tag headers. This helps when you want search bots to crawl but not index certain pages.
Meta robots tag (HTML)
<meta name="robots" content="noindex, nofollow">
- Use this for pages you do not want indexed.
- Do not use noindex on pages you want to rank.
X-Robots-Tag header (files and non-HTML)
You can set headers for PDFs and other assets:
X-Robots-Tag: noindex
Action steps
- Use robots.txt to block training bots from crawling.
- Use noindex to control indexing for search engines.
- Do not block a URL in robots.txt if you also need Google to see its noindex. Google must crawl the page to see the tag.
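To confirm a file actually serves the X-Robots-Tag header set above, a quick check with Python's standard library can help (the URL is a placeholder):

```python
# Sketch: confirm a non-HTML asset returns an X-Robots-Tag header.
import urllib.request

req = urllib.request.Request("https://example.com/downloads/report.pdf", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("X-Robots-Tag:", resp.headers.get("X-Robots-Tag"))  # expect "noindex" if configured
```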
Recommended robots.txt templates (choose one)
Pick the template that matches your goal. Then edit paths and sitemaps to match your site.
Template A: Block training bots site-wide, allow search bots
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: DuckDuckBot
Disallow:
User-agent: Applebot
Disallow:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: *
Disallow: /admin/
Disallow: /login/
Sitemap: https://example.com/sitemap.xml
Template B: Block training bots from premium areas only
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
User-agent: Google-Extended
Disallow: /premium/
Disallow: /members/
User-agent: *
Disallow: /admin/
Disallow: /login/
Sitemap: https://example.com/sitemap.xml
Template C: Block training bots from PDFs and downloads
User-agent: GPTBot
Disallow: /*.pdf$
Disallow: /downloads/
User-agent: CCBot
Disallow: /*.pdf$
Disallow: /downloads/
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Action steps
- Pick one template and keep it close to the original structure.
- Update your sitemap line and confirm it loads.
- Test 10 to 20 URLs that represent your main site sections.
Monitor results and keep your bot list current
Training bot names and behavior change. You need a simple process to keep rules current without constant rework.
What to monitor each week
- Requests by user-agent in server logs
- Bandwidth and response time by path
- Crawl stats in Google Search Console
- Index coverage warnings after robots.txt edits
How to update your block list safely
- Add one or two new user-agent blocks at a time.
- Deploy, then watch logs for 3 to 7 days.
- Keep a changelog with date, reason, and expected effect.
Action steps
- Create an internal “bots policy” doc that states which bots you allow and why.
- Set a monthly reminder to review logs and update rules.
Frequently Asked Questions (FAQs)
Can robots.txt fully block AI training bots?
No. robots.txt blocks compliant bots from crawling, but it does not enforce access. Some bots can ignore it. Use a WAF or bot management for enforcement.
Will blocking training bots hurt my Google rankings?
No, as long as you block only training bot user-agents and keep Googlebot allowed. Rankings drop when you block Googlebot, block key assets, or block important pages by mistake.
Should I block Google-Extended?
It depends on your policy. Many sites block Google-Extended to limit some AI training uses while still allowing Googlebot for search. You should confirm your goals before you add the rule.
Do I need separate robots.txt files for subdomains?
Yes. Each host needs its own robots.txt. A file on www does not control crawling on cdn or blog.
How do I confirm a bot is really Googlebot?
Check reverse DNS and forward DNS validation for the requesting IP, then confirm it maps to Google. User-agent text alone is not proof.
What is the safest first change if I am worried about mistakes?
Add only training bot blocks first and do not change your existing User-agent: * rules. Then test key URLs in Search Console and watch logs for a week.
Final Thoughts
Configuring robots.txt to block training bots while allowing search bots comes down to clear goals, clean user-agent groups, and careful testing. Start with a safe baseline, block known training crawlers by name, and keep Googlebot and Bingbot open on the pages you want to rank. Then add stronger controls for bots that ignore robots.txt. If you want help, review your server logs today, list the top training bot user-agents you see, and update your robots.txt with one of the templates above.