Table of Contents
- What Is a Robots.txt File, Really?
- Where Your Robots.txt File Must Live
- Basic Robots.txt Syntax You Need to Know
- Robots.txt vs Meta Robots vs X-Robots-Tag
- Robots.txt Best Practices for Modern SEO
- Common Robots.txt Mistakes (and How to Avoid Them)
- Practical Robots.txt Templates
- Real-World Robots.txt Experiences From the SEO Front Lines
- Key Takeaways
If search engines are the party guests, your robots.txt file is the bouncer at the door.
It can’t physically keep anyone out of the club, but it does tell the well-behaved bots
where they can and can’t wander on your site. Used correctly, this tiny text file can protect your
server, tidy up your crawl budget, and keep unnecessary clutter out of Google’s way.
In this guide, inspired by the practical approach popularized by Moz and other major SEO platforms,
we’ll walk through what a robots.txt file is, how robots.txt syntax works, and the best practices
that help you avoid catastrophic “why did my whole site disappear?!” mistakes.
What Is a Robots.txt File, Really?
A robots.txt file is a simple, UTF-8 encoded text file that lives in the root directory
of your site (for example, https://www.example.com/robots.txt). Its job is to give
crawl directives to bots: which URLs they’re allowed to crawl and which they should avoid.
When a polite crawler (like Googlebot, Bingbot, or other major search engine spiders) arrives at your
domain, it usually checks robots.txt first. If it finds rules that apply to it, it follows them. If it
doesn’t, it crawls more freely.
Important: Robots.txt Is About Crawling, Not Indexing
One of the biggest misunderstandings: a robots.txt file controls crawling, not
necessarily indexing. A URL can still end up in search results even if it’s blocked
from crawling, as long as search engines discover it through links elsewhere. They may show a bare-bones
listing without a snippet because they weren’t allowed to fetch the page content.
If you want to prevent indexing, you should use:
- A <meta name="robots" content="noindex"> tag on the page, or
- An X-Robots-Tag: noindex HTTP header.
Robots.txt is a polite “please don’t crawl this,” not a secure lock or a guaranteed “stay out of the index.”
Where Your Robots.txt File Must Live
For your robots.txt file to be valid, it must:
- Be located at the root of the host: https://example.com/robots.txt
- Be accessible over HTTP/HTTPS without redirects that break the URL
- Be a plain text file encoded in UTF-8 (no smart quotes, no rich text formatting)
You can’t put it at /blog/robots.txt or /folder/robots.txt and expect it to
control crawling for the whole domain. Subdomains need their own file (for example,
https://blog.example.com/robots.txt).
Basic Robots.txt Syntax You Need to Know
Robots.txt syntax is surprisingly simple. It’s made up of one or more groups.
Each group contains:
- A User-agent line (which bot the group applies to)
- One or more Allow or Disallow lines
Core Directives
User-agent
User-agent specifies which crawler the rule applies to.
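For example, this opening line starts a group that applies to every crawler:

User-agent: *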
Here, * means “all crawlers.” You can also target specific bots:
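# /example-path/ is just an illustrative placeholder
User-agent: Googlebot
Disallow: /example-path/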
Disallow
Disallow tells a crawler which path it cannot crawl.
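For example, this rule keeps all crawlers out of one folder (the path is illustrative):

User-agent: *
Disallow: /private/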
A blank Disallow means “crawl everything you want”:
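User-agent: *
Disallow: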
Allow
Allow lets you make exceptions inside blocked folders, especially useful for Googlebot.
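A common WordPress-style pattern, for example, blocks the admin area but leaves the AJAX endpoint crawlable:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php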
Wildcards and End-of-Line ($)
Many modern crawlers support pattern matching:
- * matches any sequence of characters.
- $ matches the end of a URL.
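For example, this pair of rules uses both:

User-agent: *
Disallow: /*?sessionid=
Disallow: /*.pdf$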
This setup blocks any URL with a ?sessionid= parameter and any URL ending in .pdf.
Sitemap
While not a “crawl block,” you can declare your XML sitemap in robots.txt to help crawlers discover URLs:
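Sitemap: https://www.example.com/sitemap.xml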
You can list multiple sitemap URLs if needed (for large or segmented sites).
Crawl-delay (Handle With Care)
Some search engines support Crawl-delay to slow down how quickly they request pages.
Googlebot ignores this directive and instead lets you control crawl rate in Search Console.
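For example:

User-agent: Bingbot
Crawl-delay: 5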
This example asks Bingbot to wait 5 seconds between requests. Use it only if you truly have server strain issues.
Robots.txt vs Meta Robots vs X-Robots-Tag
Think of these as different tools in your SEO toolbox:
- Robots.txt – Controls crawling at the site or directory level.
- Meta robots tag – Page-level directive in HTML (like noindex, nofollow).
- X-Robots-Tag header – HTTP header version, great for non-HTML files like PDFs.
A good rule of thumb:
- Use robots.txt to shape crawling paths and protect your crawl budget.
- Use meta robots or X-Robots-Tag when you care about indexing status or search appearance.
Robots.txt Best Practices for Modern SEO
1. Never Use Robots.txt for Security
Robots.txt is public. Anyone can go to /robots.txt and read exactly which folders you’re
trying to keep bots away from. That includes attackers, scrapers, and very curious competitors.
If a section of your site is truly sensitive (payment data, internal dashboards, customer records), use:
- Authentication (password protection, login)
- Proper permission systems and server-side controls
- Network-level security, not robots.txt
2. Don’t Block CSS, JavaScript, or Important Resources
Google and other engines render your pages to understand layout, mobile-friendliness, and Core Web Vitals.
If your robots.txt blocks critical CSS or JS files, it can hurt how search engines evaluate your site and
may affect rankings.
Make sure folders like /wp-includes/js/ or theme assets aren’t blocked if they’re needed to
render actual content pages.
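If a legacy rule blocks a whole includes folder, an Allow exception can carve the needed scripts back out. A sketch, with illustrative paths:

User-agent: *
Disallow: /wp-includes/
# Keep the scripts public pages rely on crawlable
Allow: /wp-includes/js/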
3. Use Robots.txt to Protect Crawl Budget
Big sites (e-commerce, classifieds, UGC platforms) have thousands or millions of URLs. Crawlers have a
practical limit to how much they’ll fetch in a reasonable time (“crawl budget”). If bots spend that budget
chewing on endless filter combinations or thin faceted pages, they may never reach your more valuable content.
Use robots.txt to:
- Block URLs with duplicate or low-value parameters (sort, filter, session IDs).
- Block infinite calendar URLs or auto-generated archives.
- Block internal search results pages that don’t need to rank.
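A sketch of rules along those lines (parameter names and paths are illustrative):

User-agent: *
# Low-value parameters
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*sessionid=
# Auto-generated archives
Disallow: /calendar/
# Internal search results
Disallow: /search/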
4. Test Your Rules Before Shipping
A single typo, like Disallow: / in the wrong place, can block your entire site. That’s the
SEO equivalent of unplugging the internet and wondering why traffic dropped.
Always test robots.txt in:
- Google Search Console’s robots.txt report (the successor to the old robots.txt tester), and
- Any available tools from your CMS or SEO plugins (Yoast, etc.).
5. Keep Your Robots.txt File Clean and Commented
Add comments (lines starting with #) to explain your logic. Future you (and future devs) will thank you.
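For example (the path is illustrative):

# Internal search results create near-infinite thin URLs
User-agent: *
Disallow: /search/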
6. Be Mindful When Blocking SEO Tools
Many SEOs use crawling tools (like Ahrefs, Semrush, etc.) to audit sites. You can block their bots via
robots.txt if you really want to, but it will limit the data those tools can show about your site.
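For example:

User-agent: AhrefsBot
Disallow: /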
That line stops AhrefsBot from crawling entirely, which also means no new backlink checks or content audits
from that tool. Use sparingly.
Common Robots.txt Mistakes (and How to Avoid Them)
1. Accidentally Blocking the Entire Site
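The offending file usually looks like this:

User-agent: *
Disallow: /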
This tells all bots: “Do not crawl anything. Seriously. Nothing.” This is sometimes used during staging or
development, then forgotten when the site goes live. Traffic then crashes and everyone panics.
Fix: Make sure your live robots.txt has Disallow: (blank) for production unless you’re
intentionally blocking specific sections.
2. Trying to Noindex via Robots.txt
For a while, some SEOs used Noindex: directives in robots.txt. Google officially stopped
supporting that approach; it’s now treated like a comment rather than a directive.
Fix: Use noindex in meta robots or X-Robots-Tag, not robots.txt.
3. Blocking Resources Needed for Rendering
Over-aggressive “clean up” rules like:
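User-agent: *
# Illustrative paths; note how broad these patterns are
Disallow: /assets/
Disallow: /*.css$
Disallow: /*.js$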
can block images, CSS, and JS that Google needs to understand your layout and mobile usability.
Fix: Narrow your rules. Block only truly internal paths (like /wp-admin/) and allow assets used
on public pages.
4. Misusing Wildcards and Trailing Slashes
Tiny changes can have big consequences:
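# Blocks /shop, /shop/, and even /shopping-guide/ (illustrative path)
Disallow: /shop

# Blocks only URLs inside the /shop/ directory
Disallow: /shop/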
Depending on the crawler’s interpretation, one might block the folder plus similarly named URLs, and the
other strictly the directory. Always test patterns with real URLs in tools before pushing to production.
Practical Robots.txt Templates
Example 1: Simple Blog or Small Business Site
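A sketch of such a file (the search paths and sitemap URL are placeholders):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml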
This pattern is common for WordPress sites: you block the admin area, keep AJAX available, and prevent
internal search results from cluttering the index.
Example 2: Large E-commerce Site with Faceted Navigation
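One possible shape, with placeholder parameter names and sitemap URLs:

User-agent: *
# Faceted navigation and sorting parameters
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?price=
Disallow: /*sort=
Disallow: /*sessionid=
# Internal search results
Disallow: /search/

Sitemap: https://www.example.com/sitemap-products.xml
Sitemap: https://www.example.com/sitemap-categories.xml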
Here, robots.txt focuses crawl budget on core product and category URLs while steering bots away from
endless filter combinations.
Real-World Robots.txt Experiences From the SEO Front Lines
Theory is great, but robots.txt becomes truly interesting (and sometimes terrifying) once you’ve watched
it break (or save) real websites. Here are a few experiences and patterns that often show up in practice.
Story 1: The “Launch Day Traffic Cliff”
A brand-new site relaunched on a shiny platform with a fresh design and improved content. Everyone was
excited, until organic traffic plummeted to almost zero. The culprit? A single line copied over from the
staging environment:
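Disallow: /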
Search engines were politely doing as instructed and ignoring the entire domain. Once the directive was
changed to:
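Disallow: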
crawlers slowly returned, but it took weeks for rankings to recover. The lesson: treat robots.txt like
production code. Review it as part of every major deployment checklist, right alongside redirects,
canonical tags, and XML sitemaps.
Story 2: When Crawlers Love Your Filters a Little Too Much
Another site had thousands of products and a slick faceted navigation system: filter by color, size, price,
brand, rating, you name it. It was awesome for users, but it became a nightmare for Googlebot. Each filter
combination created unique URLs with query parameters, and the crawl stats in Search Console showed bots
spending more time crawling weird filtered pages than core product detail pages.
The fix involved three steps:
- Mapping the main parameter patterns driving duplicate content.
- Blocking those parameters in robots.txt to reduce crawl chaos.
- Pairing that with canonical tags pointing to the “clean” versions of category and product URLs.
Within a few weeks, logs showed crawlers spending more time on valuable URLs, and new products began
appearing in search results much faster. The moral: robots.txt is a powerful tool for taming faceted
navigation before it eats your crawl budget alive.
Story 3: Blocking SEO Tools (and Then Missing Out on Data)
Some brands don’t love being crawled by third-party SEO tools, so their robots.txt files ban popular
bots like AhrefsBot or Semrush’s crawler. That’s technically fine, and the bots usually respect the rules.
But in one case, an in-house SEO team was puzzled: they could see backlinks in Google Search Console, but
their backlink tools showed almost nothing.
A quick look at /robots.txt revealed:
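User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /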
The security team had added those lines years earlier to reduce bot traffic. Once the team relaxed those
blocks (or whitelisted specific paths), third-party data started flowing again. For many SEOs, that kind of
visibility is worth the small amount of extra crawl activity.
Story 4: Too Many Rules, Not Enough Strategy
A large publisher had accumulated robots.txt directives over the years like digital dust bunnies. Every
time someone ran into a crawling or duplication issue, they added another rule. Eventually, the file
became a confusing wall of disallows, wildcards, and comments that contradicted each other.
An audit found multiple overlapping rules for similar paths, some of which actually blocked new content
types from being crawled. The fix wasn’t glamorous: the team stripped robots.txt back to a smaller,
strategy-driven set of rules that matched their current site architecture. Crawl stats became more
predictable, and future changes were easier to manage.
The takeaway from all these stories: robots.txt works best when it’s:
- Deliberate, not reactive
- Documented with comments
- Tested in real tools before deployment
- Reviewed regularly as your site architecture evolves
Key Takeaways
- Robots.txt is a simple text file that controls crawling, not indexing.
- It must live at the root of your domain and be properly formatted as UTF-8 plain text.
- Use User-agent, Disallow, Allow, wildcards, and Sitemap wisely.
- Never treat robots.txt as a security layer or the sole way to remove URLs from search results.
- Use it strategically to protect crawl budget and keep bots away from low-value or infinite spaces.
- Test your rules, comment them clearly, and revisit them whenever your site structure changes.
Handle your robots.txt file thoughtfully and you’ll give search engines exactly the guidance they need, no more and no less, while keeping your most important content front and center in the SERPs.