CheckTown

robots.txt Generator: Control Search Engine Crawling for Your Site


What Is robots.txt?

robots.txt is a plain text file placed at the root of a website that tells web crawlers which pages or sections they should or should not access. It follows the Robots Exclusion Protocol, a standard that has been used since 1994 to communicate crawling preferences to search engine bots, AI crawlers, and other automated agents.

When a crawler visits your site, it first checks for a robots.txt file at yourdomain.com/robots.txt before crawling any pages. The file contains directives that specify which user agents (crawlers) can access which paths. Note that robots.txt is advisory — well-behaved crawlers respect it, but malicious bots may ignore it entirely.

robots.txt Syntax

The robots.txt file uses a simple directive-based syntax. Each block starts with a User-agent line followed by one or more rules:

  • User-agent — specifies which crawler the rules apply to. Use * for all crawlers, or a specific name like Googlebot, Bingbot, or GPTBot
  • Disallow — blocks access to a specific path or pattern. Disallow: /admin/ prevents crawling of the admin directory
  • Allow — explicitly permits access to a path, useful for overriding a broader Disallow rule. Allow: /admin/public/ would allow that specific subfolder
  • Sitemap — specifies the URL of your XML sitemap so crawlers can discover all your pages. Sitemap: https://example.com/sitemap.xml
  • Crawl-delay — suggests a delay in seconds between successive requests. Not all crawlers support this directive (Google ignores it; Bing respects it)

Wildcards are supported in Disallow and Allow: * matches any sequence of characters, and $ matches the end of a URL. For example, Disallow: /*.pdf$ blocks all URLs ending in .pdf.
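Combining these directives, a complete file might look like the following sketch (the paths and sitemap URL are placeholders, not recommendations for any particular site):

```
# Rules for all crawlers
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Disallow: /*.pdf$

# Block one specific bot entirely
User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

Note that blocks are separated by blank lines, and the Sitemap directive stands outside any User-agent block because it applies to all crawlers.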

Common robots.txt Patterns

Here are the most useful robots.txt configurations for different scenarios:

  • Allow everything — an empty Disallow directive (User-agent: * with Disallow: ) or no robots.txt at all permits full crawling
  • Block everything — Disallow: / blocks all crawlers from accessing any page. Use this for staging environments or pre-launch sites
  • Block AI crawlers — target specific AI bots by stacking one User-agent line per bot (e.g. User-agent: GPTBot, then User-agent: CCBot, then User-agent: anthropic-ai) above a shared Disallow: / to prevent content scraping while still allowing search engines
  • Allow only Google — combine User-agent: Googlebot with Allow: / and a separate block for User-agent: * with Disallow: /
  • Protect admin paths — add a separate Disallow line for each sensitive directory (/admin/, /wp-admin/, /api/) to keep well-behaved crawlers away; pair this with noindex or authentication if those pages must stay out of search results entirely
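As a concrete example, the AI-crawler pattern above can be written as follows (bot user-agent tokens change over time, so verify the current names before relying on this list):

```
# Block common AI training crawlers...
User-agent: GPTBot
User-agent: CCBot
User-agent: anthropic-ai
Disallow: /

# ...while all other crawlers may access everything
User-agent: *
Disallow:
```

An empty Disallow value means "nothing is disallowed," which makes the intent of the catch-all block explicit.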


Common Use Cases

A well-configured robots.txt file serves several important purposes for website management:

  • SEO optimization — prevent crawling of duplicate content, pagination pages, search result pages, and filtered URLs that could dilute your search rankings
  • Blocking scrapers — deter content scrapers and AI training bots from copying your content by disallowing their specific user agents
  • Protecting staging environments — block all crawlers on staging and development servers to prevent unfinished content from appearing in search results
  • Managing crawl budget — for large sites, blocking low-value pages (tag archives, internal search results, session URLs) ensures search engines spend their crawl budget on your important pages
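For the crawl-budget case, a sketch blocking typical low-value sections might look like this (the paths are hypothetical and should be adapted to your site's actual URL structure):

```
User-agent: *
Disallow: /tag/
Disallow: /search/
Disallow: /*?sessionid=
```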

robots.txt and SEO

robots.txt has a direct impact on how search engines discover and index your content. Understanding its relationship with SEO is essential:

  • Blocking does not mean de-indexing — if other sites link to a page you have blocked in robots.txt, search engines may still index the URL (without content). Use the noindex meta tag instead to prevent indexing
  • Always include your sitemap — adding a Sitemap directive helps search engines discover all your pages, especially new ones that may not yet have inbound links
  • Do not block CSS or JavaScript files — search engines need these to render your pages. Blocking them can hurt your rankings because the crawler cannot understand your page layout
  • Common mistakes — blocking entire directories accidentally, using robots.txt instead of noindex for sensitive pages, or forgetting to update robots.txt after a site restructure can all harm your SEO
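The noindex alternative mentioned above lives in the page itself rather than in robots.txt, which means a crawler must be able to fetch the page to see it:

```html
<!-- Placed in the page's <head>: tells crawlers not to index this page -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the same signal can be sent with the X-Robots-Tag: noindex HTTP response header.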

Frequently Asked Questions

Is robots.txt mandatory or just advisory?

robots.txt is entirely advisory. Well-behaved crawlers like Googlebot, Bingbot, and most legitimate bots respect it, but there is no technical enforcement. Malicious bots, scrapers, and some AI crawlers may ignore it completely. For sensitive content, use server-side access controls (authentication, IP blocking) rather than relying solely on robots.txt.

How do I test my robots.txt file?

Google Search Console includes a robots.txt report that shows whether Google fetched and parsed your file successfully and flags syntax problems. You can also use online validators that parse your robots.txt and simulate crawler behavior. Test by checking that critical pages are accessible and non-essential pages are blocked, and always verify after deploying changes.
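One way to test rules locally is Python's standard-library urllib.robotparser. Note that it implements the original exclusion protocol: it matches rules in file order and does not support Google-style wildcards, so results can differ from Google's parser for files that use * or $. The rules and URLs below are illustrative placeholders:

```python
from urllib import robotparser

# Sample robots.txt, parsed locally -- no network request needed.
# Allow is listed before Disallow because this parser matches in file order.
lines = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(lines)

# Crawlers without their own block fall through to the * group.
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))        # True
print(rp.can_fetch("Googlebot", "https://example.com/admin/secret"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/admin/public/faq")) # True
# GPTBot matches its own group and is blocked everywhere.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))           # False
```

The same parser can fetch a live file with rp.set_url(...) followed by rp.read(), which is convenient for spot-checking a deployed robots.txt.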

Does Google respect Crawl-delay in robots.txt?

No, Google does not support the Crawl-delay directive. Instead, Google uses its own algorithms to determine the optimal crawl rate based on server response times. Google has been phasing out the legacy crawl-rate limiter in Search Console; its documented guidance for urgent overload is to temporarily serve 429 or 503 responses. Bing does respect Crawl-delay, so include it if Bing traffic matters to you.
