AI training crawlers can pull your pages, images, and feeds at high volume. Search crawlers can also crawl your site, but they help you rank and earn traffic. You can use robots.txt to send clear crawl rules to both groups. You can block many known training bots while you keep Google, Bing, and other search bots allowed.
This guide shows how to configure robots.txt to block training bots while allowing search bots. It also explains what robots.txt can and cannot do, so you set the right expectations and add stronger controls where needed.
Key Takeaways
- robots.txt is a crawl directive file that many bots follow, but it is not a hard security barrier.
- Allow major search bots (Googlebot, Bingbot) and block known AI training bots by user-agent.
- Keep your rules simple to avoid accidental deindexing, broken rendering, or blocked assets.
- Test your robots.txt with Google Search Console and live fetch checks before and after you deploy.
- Use layered controls (WAF, rate limits, bot management) for bots that ignore robots.txt.
- Document and monitor changes so you can reverse mistakes fast and track bot traffic shifts.
Comprehensive List of All Bots (Updated)
| Category | Bot name (robots.txt user-agent token) | Operated by | What it’s typically used for |
|---|---|---|---|
| AI Search bots | OAI-SearchBot | OpenAI | Indexing for ChatGPT search results. (OpenAI Platform) |
| AI Search bots | PerplexityBot | Perplexity | Surfacing and linking sites in Perplexity search results. (docs.perplexity.ai) |
| AI Search bots | Claude-SearchBot | Anthropic | Improving search result quality for Claude users. (privacy.claude.com) |
| AI Search bots | Applebot | Apple | Search-related features across Apple experiences (Spotlight, Siri, Safari). (Apple Support) |
| AI Search bots | Amazonbot | Amazon | Crawling to improve products and services; may be used to train Amazon AI models. (Developer Portal Master) |
| AI “user fetch” bots | ChatGPT-User | OpenAI | User-initiated page fetches from ChatGPT (not automatic crawling). (OpenAI Platform) |
| AI “user fetch” bots | Perplexity-User | Perplexity | User-initiated fetches to answer queries (not training crawl). (docs.perplexity.ai) |
| AI “user fetch” bots | Claude-User | Anthropic | User-initiated website access for Claude responses. (privacy.claude.com) |
| AI “user fetch” bots | Meta-ExternalFetcher | Meta | Fetches individual links for product features (user-initiated fetch style). (Cloudflare Radar) |
| AI Training bots | GPTBot | OpenAI | Collecting content that may be used for training OpenAI foundation models. (OpenAI Platform) |
| AI Training bots | ClaudeBot | Anthropic | Collecting web content that could contribute to model training. (privacy.claude.com) |
| AI Training control token | Google-Extended | Google | Controls whether Google-crawled content can be used for Gemini training and grounding. (Google for Developers) |
| AI Training control token | Applebot-Extended | Apple | Controls whether content can be used to train Apple foundation models (Apple Intelligence, etc.). (Apple Support) |
| AI Training bots | AI2Bot | Allen Institute for AI (AI2) | Crawls web content used to train open language models (per AI2 notice). (Allen AI) |
| AI Training/data bots | CCBot | Common Crawl | Crawls the web for Common Crawl datasets used for research and ML. (commoncrawl.org) |
| AI Training/data bots | meta-externalagent | Meta | Meta’s general web crawler, used for content retrieval and often referenced for AI data collection. (Facebook Developers) |
| AI Training/data bots | Bytespider | ByteDance | Commonly identified as a crawler token seen in logs/robots policies. (ColorTokens) |
| AI Training/data bots | cohere-ai | Cohere | Commonly identified as an AI crawler token seen in logs/robots policies. (ColorTokens) |
| AI Training/data bots | DeepSeekBot | DeepSeek | Commonly identified as an AI crawler token for data collection. (DataDome) |
| Other Search bots | Googlebot | Google | Main Google Search crawler. (Google for Developers) |
| Other Search bots | bingbot | Microsoft Bing | Main Bing Search crawler. (Search - Microsoft Bing) |
| Other Search bots | DuckDuckBot | DuckDuckGo | DuckDuckGo search crawler. (DuckDuckGo) |
| Other Search/preview bots | facebookexternalhit | Meta | Fetches URLs for link previews and related features. (Facebook Developers) |
| Other Search/preview bots | Facebot | Meta | A second Meta crawler used to fetch content for Meta surfaces. (humansecurity.com) |
| Other Search bots | Baiduspider | Baidu | Baidu search crawler (commonly referenced bot token). (humansecurity.com) |
| Other Search bots | YandexBot | Yandex | Yandex search crawler (commonly referenced bot token). (humansecurity.com) |
| Other crawler bots | AhrefsBot | Ahrefs | SEO/backlink crawler. (humansecurity.com) |
| Other crawler bots | SemrushBot | Semrush | SEO research crawler. (humansecurity.com) |
| Other crawler bots | MJ12Bot | Majestic | Backlink index crawler. (humansecurity.com) |
Understand what robots.txt does (and what it does not)

Before you edit rules, you need a clear model of how bots read robots.txt. This helps you block training bots without harming search visibility.
What robots.txt controls
- Crawling: It tells compliant bots which paths they should not fetch.
- Crawl focus: It can reduce load by keeping bots out of low-value sections.
- Bot-specific rules: You can set different rules per user-agent.
What robots.txt does not control
- Access: A blocked URL can still load in a browser if it is public.
- Indexing in all cases: Some search engines can index a URL based on links even if crawling is blocked. (Google usually needs crawl access to fully index content, but a URL can still appear as a “URL-only” result.)
- Non-compliant bots: Some scrapers and some training crawlers can ignore robots.txt.
Action steps
- Use robots.txt as your first filter for known, compliant training bots.
- Add server-side controls for bots that ignore rules.
- Do not place secrets behind robots.txt. Use auth or IP allowlists for that.
Know the difference: training bots vs search bots
You can block training bots and allow search bots, but you need to identify them correctly. You also need to avoid broad rules that catch the wrong crawlers.
Common search bots you usually want to allow
- Googlebot (Google Search)
- Bingbot (Bing Search)
- DuckDuckBot (DuckDuckGo)
- Applebot (Apple Search and Spotlight)
- YandexBot (if you serve that market)
- Baiduspider (if you serve that market)
Common training bots you may want to block
Bot names change often. Use your server logs to confirm what hits your site. Many sites block these user-agents in robots.txt:
- GPTBot (OpenAI)
- CCBot (Common Crawl)
- Google-Extended (Google’s AI training control token for some products)
- anthropic-ai and ClaudeBot (Anthropic)
- Bytespider (ByteDance)
- Amazonbot (Amazon)
Action steps
- List the bots that matter to your traffic and your risk.
- Confirm bot names in logs, not in guesswork.
- Decide what you will allow: search crawling, ads crawling, preview bots, uptime monitors.
Set your robots.txt goals before you write rules
Clear goals prevent rule sprawl. They also prevent a common mistake: blocking search bots by accident while you chase training bots.
Goal checklist
- Keep indexing stable: Allow Googlebot and Bingbot on key pages.
- Block training crawlers: Disallow known AI user-agents.
- Protect server resources: Reduce crawl load on heavy endpoints.
- Keep rendering intact: Avoid blocking CSS, JS, and images needed for page rendering.
- Keep private areas private: Use auth for private paths, then also disallow them in robots.txt as a hint.
Decide what content you want to block from training bots
- Full site (common for publishers who do not want AI reuse)
- Only premium sections (paywalled, member-only)
- Only high-value assets (images, PDFs, datasets)
- Only APIs and feeds (RSS, JSON endpoints)
Action steps
- Write down the exact paths you want to protect (examples: /premium/, /members/, /api/, /feeds/).
- List the assets that must stay crawlable for SEO (examples: /wp-content/, /assets/, /images/).
Build a safe baseline robots.txt (allow search bots first)
Start with a baseline that keeps search crawling healthy. Then add training bot blocks. This order reduces the risk of an SEO outage.
Baseline example for most sites
This baseline allows all bots by default, blocks only common private or low-value areas, and declares your sitemap:
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Why this baseline works
- It avoids broad blocks like Disallow: /.
- It blocks areas that create thin or duplicate pages.
- It keeps important AJAX endpoints available for WordPress sites.
Action steps
- Replace example.com with your domain.
- Remove paths you do not use.
- Add your actual sitemap URL(s). Some sites use multiple sitemap files.
Add rules to block training bots while allowing search bots
Now you can add user-agent groups for training bots. In robots.txt, the most specific group that matches a bot name applies. Keep each group simple.
Copy-paste robots.txt example (block training bots, allow search bots)
This example blocks several known training bots and keeps broad access for search bots:
# --- Allow major search bots (optional but clear) ---
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: DuckDuckBot
Disallow:
User-agent: Applebot
Disallow:
# --- Block common AI training crawlers ---
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
# --- Default rules for all other bots ---
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
How this pattern works
- Search bots get an explicit allow rule (Disallow: with a blank value means allow all).
- Training bots get a full-site block (Disallow: /).
- All other bots follow the default group.
Action steps
- Keep the bot names exact. Spelling matters.
- Keep one User-agent line per group for clarity.
- Place the Sitemap line at the end; it is not tied to any group, so bots can read it anywhere in the file.
Block training bots from specific sections (instead of the full site)
Some site owners want search engines to index everything, but they want to block training bots from premium pages, datasets, or images. You can disallow only those paths for training bots.
Example: block premium and API paths for training bots
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Disallow: /api/
Disallow: /feeds/
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
Disallow: /api/
Disallow: /feeds/
User-agent: *
Disallow: /admin/
Disallow: /login/
Sitemap: https://example.com/sitemap.xml
Example: block image folders for training bots (use with care)
Blocking images can reduce reuse, but it can also reduce image search traffic. If image SEO matters, do not block images for Googlebot-Image or Bingbot.
User-agent: GPTBot
Disallow: /images/
Disallow: /media/
Disallow: /wp-content/uploads/
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Action steps
- Pick path blocks that match your URL structure.
- Do not block shared asset folders if your pages need them to render.
- Check if your CDN uses a separate hostname for images. robots.txt applies per host.
Avoid common robots.txt mistakes that break SEO
Many sites lose rankings because of a small robots.txt error. Use this section as a pre-launch checklist.
Mistake: blocking the entire site for all bots
This rule blocks everything:
User-agent: *
Disallow: /
- Use this only for staging sites or emergency cases.
- If you need to block training bots only, target their user-agents instead.
Mistake: blocking CSS and JS needed for rendering
- Google renders pages. If you block key assets, Google can misread layout, content, and structured data.
- Do not disallow broad folders like /assets/ unless you know the impact.
Mistake: relying on robots.txt to hide sensitive content
- robots.txt is public. Anyone can read it.
- Use authentication, signed URLs, or network controls for sensitive areas.
Mistake: assuming user-agent strings prove identity
- Any client can claim it is “Googlebot” in the user-agent header.
- Use reverse DNS validation for Googlebot and Bingbot if you need proof (see the sketch below).
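A minimal verification sketch using only the Python standard library: reverse-resolve the requesting IP, confirm the hostname belongs to a known crawler domain (googlebot.com or google.com for Googlebot, search.msn.com for Bingbot), then forward-resolve that hostname and confirm it maps back to the same IP. The sample IP is a placeholder; feed in addresses from your own logs.

```python
# Sketch: reverse + forward DNS validation for an IP that claims to be a search crawler.
import socket

VALID_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def verify_crawler_ip(ip: str) -> bool:
    """Return True if the IP reverse-resolves to a known crawler domain
    and that hostname forward-resolves back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        if not hostname.endswith(VALID_SUFFIXES):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward DNS
        return ip in forward_ips
    except OSError:
        return False

print(verify_crawler_ip("66.249.66.1"))  # placeholder IP; use addresses from your logs
```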
Action steps
- Run a diff check before you deploy changes.
- Keep a rollback copy of the last known good robots.txt.
- Monitor index coverage and crawl stats after each change.
Use the right syntax: rules, wildcards, and precedence
robots.txt syntax looks simple, but small details change outcomes. Use these rules to keep behavior predictable.
Core directives you will use
- User-agent: The bot name the group applies to.
- Disallow: A path prefix the bot should not crawl.
- Allow: A path prefix the bot may crawl even if a broader disallow matches.
- Sitemap: A sitemap URL to help discovery.
Wildcards and end-of-line markers
Google and Bing support common pattern matching:
- * matches any string.
- $ matches the end of the URL.
Example: block all PDF files for a bot:
User-agent: GPTBot
Disallow: /*.pdf$
Precedence rules you should know
- A bot follows only the most specific user-agent group that matches its name; that group overrides the User-agent: * group.
- Within a group, the most specific matching rule usually wins (longer path match); a small illustration follows below.
- Some bots interpret edge cases differently. Keep patterns simple to reduce surprises.
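To make the longest-match rule concrete, here is a tiny illustrative sketch in Python (not a real robots.txt parser, and it ignores * and $): it picks the matching rule with the longest path and lets Allow win a tie, which mirrors how Google documents its precedence.

```python
# Illustration only: longest matching path wins; Allow beats Disallow on a tie.
RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

def is_allowed(url_path: str) -> bool:
    matches = [(kind, path) for kind, path in RULES if url_path.startswith(path)]
    if not matches:
        return True  # no rule matches, so crawling is allowed
    kind, _ = max(matches, key=lambda m: (len(m[1]), m[0] == "allow"))
    return kind == "allow"

print(is_allowed("/wp-admin/admin-ajax.php"))  # True: the longer Allow rule wins
print(is_allowed("/wp-admin/options.php"))     # False: only the Disallow rule matches
```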
Action steps
- Prefer direct path blocks over heavy wildcard use.
- Use
$only when you need exact file-type blocks. - Test patterns with real URLs from your site.
Deploy robots.txt correctly (location, status codes, and caching)
Even perfect rules fail if the file is not reachable. Bots fetch robots.txt from a fixed location.
Correct location
- robots.txt must be at: https://yourdomain.com/robots.txt
- Each subdomain needs its own file (example: https://cdn.yourdomain.com/robots.txt).
Correct server response
- Serve a 200 OK status for a valid file.
- A 404 often means “no restrictions” for many bots.
- A 5xx can cause bots to pause crawling or assume temporary limits.
Caching and propagation
- CDNs can cache robots.txt. Purge cache after updates.
- Some bots cache robots.txt for hours or days. Changes can take time to apply.
Action steps
- Open /robots.txt in a browser and confirm the content matches your latest version.
- Check headers to confirm you serve the right status code (a scripted check is sketched below).
- Purge CDN cache for /robots.txt after each update.
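If you want to script that header check, a minimal sketch with Python's standard library can print the status code and caching headers and spot-check that an expected rule survived the deploy (example.com and GPTBot are placeholders):

```python
# Sketch: confirm robots.txt returns 200, inspect caching headers, and spot-check a rule.
import urllib.request

URL = "https://example.com/robots.txt"  # replace with your domain

with urllib.request.urlopen(URL) as resp:
    body = resp.read().decode("utf-8", errors="replace")
    print("Status:", resp.status)                               # expect 200
    print("Cache-Control:", resp.headers.get("Cache-Control"))
    print("Age:", resp.headers.get("Age"))                      # a high Age can mean a stale CDN copy
    print("Mentions GPTBot:", "GPTBot" in body)
```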
Test robots.txt before and after you publish
Testing prevents silent SEO damage. You should test both syntax and real crawl behavior.
Use Google Search Console robots.txt tools
- Use Search Console's robots.txt report (which replaced the older robots.txt tester) or URL Inspection to check crawl access.
- Test key URLs: homepage, category pages, product pages, blog posts, CSS/JS assets, sitemap URLs.
Run live checks with curl
Fetch the file and confirm it returns 200 and the expected content:
curl -I https://example.com/robots.txt
curl https://example.com/robots.txt
Validate bot behavior with logs
- Check server logs for requests to /robots.txt.
- Track requests from blocked user-agents and confirm they decrease over time (see the sketch below).
- Watch for spikes from unknown bots. Add them to your block list if needed.
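A rough log sketch in Python can track this for you. It assumes a combined-format access log where the user-agent is the last quoted field; the log path and watchlist are placeholders you should adjust.

```python
# Sketch: count requests per blocked training bot so you can confirm they taper off.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder; use your server's log path
WATCHLIST = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider", "Amazonbot")

ua_pattern = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user-agent in combined format
counts = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = ua_pattern.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in WATCHLIST:
            if bot.lower() in user_agent:
                counts[bot] += 1

for bot, hits in counts.most_common():
    print(f"{bot}: {hits} requests")
```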
Action steps
- Create a list of 20 to 50 important URLs and test them after every robots.txt change (a scripted check is sketched below).
- Monitor crawl stats and index coverage for 7 to 14 days after deployment.
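For the URL list, a rough local check with Python's urllib.robotparser can flag obvious mistakes. It does not replicate Google's wildcard (*, $) or longest-match handling, so treat Search Console as the source of truth; the URLs and agents below are placeholders.

```python
# Sketch: check important URLs against the live robots.txt for several user-agents.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://example.com/robots.txt"
IMPORTANT_URLS = [
    "https://example.com/",
    "https://example.com/blog/some-post/",
    "https://example.com/premium/guide/",
]
AGENTS = ["Googlebot", "Bingbot", "GPTBot", "CCBot"]

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetches and parses the live file

for agent in AGENTS:
    for url in IMPORTANT_URLS:
        verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
        print(f"{agent:10} {verdict:8} {url}")
```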
Handle bots that ignore robots.txt (use stronger controls)
Some training bots and many scrapers do not follow robots.txt. If you need real enforcement, use server-side controls.
Use a WAF or bot management
- Block by verified bot signatures when your provider supports it.
- Challenge suspicious traffic with JS challenges or managed challenges.
- Set rate limits for high-request patterns (example: many pages per minute); a rough sketch follows below.
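If you cannot use a managed WAF or your CDN's rate limiting and you control the application layer, a very rough in-process sketch of a per-IP request budget looks like this. The window and threshold are placeholders to tune from your logs; a real deployment usually belongs at the proxy or WAF level.

```python
# Sketch: sliding-window request budget per client IP.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120  # placeholder threshold; tune from your traffic

_hits = defaultdict(deque)

def allow_request(client_ip: str) -> bool:
    """Return True if the client is still under its per-window request budget."""
    now = time.time()
    window = _hits[client_ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                 # drop requests older than the window
    if len(window) >= MAX_REQUESTS:
        return False                     # over budget: block or challenge this request
    window.append(now)
    return True
```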
Block by IP and ASN (with caution)
- IP blocks can stop abuse fast, but IPs can change.
- ASN blocks can be too broad and can hit real users if you block large providers.
Protect high-value endpoints
- Add auth to APIs and feeds that do not need to be public.
- Use signed URLs for file downloads.
- Throttle endpoints that cause heavy database load.
Action steps
- Start with robots.txt blocks for known compliant training bots.
- Add rate limiting for paths that get scraped (examples: /search, /tag/, /wp-json/).
- Escalate to WAF rules if logs show bots ignore robots.txt.
Use robots meta tags and headers for page-level control
robots.txt works at the path level. If you need page-level control, use meta robots tags or X-Robots-Tag headers. This helps when you want search bots to crawl but not index certain pages.
Meta robots tag (HTML)
<meta name="robots" content="noindex, nofollow">
- Use this for pages you do not want indexed.
- Do not use noindex on pages you want to rank.
X-Robots-Tag header (files and non-HTML)
You can set headers for PDFs and other assets:
X-Robots-Tag: noindex
Action steps
- Use robots.txt to block training bots from crawling.
- Use noindex to control indexing for search engines.
- Do not block a URL in robots.txt if you also need Google to see its noindex. Google must crawl the page to see the tag.
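To confirm a file actually serves the X-Robots-Tag header set above, a quick check with Python's standard library can help (the URL is a placeholder):

```python
# Sketch: confirm a non-HTML asset returns an X-Robots-Tag header.
import urllib.request

req = urllib.request.Request("https://example.com/downloads/report.pdf", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("X-Robots-Tag:", resp.headers.get("X-Robots-Tag"))  # expect "noindex" if configured
```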
Recommended robots.txt templates (choose one)
Pick the template that matches your goal. Then edit paths and sitemaps to match your site.
Template A: Block training bots site-wide, allow search bots
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: DuckDuckBot
Disallow:
User-agent: Applebot
Disallow:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: *
Disallow: /admin/
Disallow: /login/
Sitemap: https://example.com/sitemap.xml
Template B: Block training bots from premium areas only
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
User-agent: Google-Extended
Disallow: /premium/
Disallow: /members/
User-agent: *
Disallow: /admin/
Disallow: /login/
Sitemap: https://example.com/sitemap.xml
Template C: Block training bots from PDFs and downloads
User-agent: GPTBot
Disallow: /*.pdf$
Disallow: /downloads/
User-agent: CCBot
Disallow: /*.pdf$
Disallow: /downloads/
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Action steps
- Pick one template and keep it close to the original structure.
- Update your sitemap line and confirm it loads.
- Test 10 to 20 URLs that represent your main site sections.
Monitor results and keep your bot list current
Training bot names and behavior change. You need a simple process to keep rules current without constant rework.
What to monitor each week
- Requests by user-agent in server logs
- Bandwidth and response time by path
- Crawl stats in Google Search Console
- Index coverage warnings after robots.txt edits
How to update your block list safely
- Add one or two new user-agent blocks at a time.
- Deploy, then watch logs for 3 to 7 days.
- Keep a changelog with date, reason, and expected effect.
Action steps
- Create an internal “bots policy” doc that states which bots you allow and why.
- Set a monthly reminder to review logs and update rules.
Frequently Asked Questions (FAQs)
Can robots.txt fully block AI training bots?
No. robots.txt blocks compliant bots from crawling, but it does not enforce access. Some bots can ignore it. Use a WAF or bot management for enforcement.
Will blocking training bots hurt my Google rankings?
No, as long as you block only training bot user-agents and keep Googlebot allowed. Rankings drop when you block Googlebot, block key assets, or block important pages by mistake.
Should I block Google-Extended?
It depends on your policy. Many sites block Google-Extended to limit some AI training uses while still allowing Googlebot for search. You should confirm your goals before you add the rule.
Do I need separate robots.txt files for subdomains?
Yes. Each host needs its own robots.txt. A file on www does not control crawling on cdn or blog.
How do I confirm a bot is really Googlebot?
Check reverse DNS and forward DNS validation for the requesting IP, then confirm it maps to Google. User-agent text alone is not proof.
What is the safest first change if I am worried about mistakes?
Add only training bot blocks first and do not change your existing User-agent: * rules. Then test key URLs in Search Console and watch logs for a week.
Final Thoughts
Configuring robots.txt to block training bots while allowing search bots comes down to clear goals, clean user-agent groups, and careful testing. Start with a safe baseline, block known training crawlers by name, and keep Googlebot and Bingbot open on the pages you want to rank. Then add stronger controls for bots that ignore robots.txt. If you want help, review your server logs today, list the top training bot user-agents you see, and update your robots.txt with one of the templates above.