Robots.txt Configuration Issues

Fix robots.txt errors to ensure search engines can properly crawl and index your website

What This Means

The robots.txt file is a plain-text file placed in your website's root directory that tells search engine crawlers which pages and files they may crawl. When it is misconfigured, you can accidentally block important pages from Google, keep your entire site out of search results, expose sensitive directories, or fail to guide crawlers efficiently, resulting in poor search visibility and wasted crawl budget.

How Robots.txt Works

Basic Structure:

# robots.txt - Lives at https://example.com/robots.txt

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

Sitemap: https://example.com/sitemap.xml

What This Means:

  • User-agent: * - Rules apply to all search engines
  • Disallow: /admin/ - Don't crawl /admin/ directory
  • Allow: /public/ - Override disallow for specific path
  • Sitemap: - Location of XML sitemap
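
When a crawler that follows the robots.txt standard (RFC 9309, which Googlebot implements) evaluates a URL, the longest matching rule path wins, and Allow wins a tie. A simplified Node.js sketch of that precedence logic, ignoring wildcards and using the sample rules above:

// Simplified robots.txt rule matching: longest matching path wins,
// Allow wins ties. Wildcards (*, $) are ignored in this sketch.
const rules = [
    { type: 'disallow', path: '/admin/' },
    { type: 'disallow', path: '/private/' },
    { type: 'allow', path: '/public/' },
];

function isAllowed(urlPath) {
    let best = { type: 'allow', path: '' };  // default: everything allowed
    for (const rule of rules) {
        if (urlPath.startsWith(rule.path) && rule.path.length >= best.path.length) {
            // On equal length, prefer Allow (the least restrictive rule)
            if (rule.path.length > best.path.length || rule.type === 'allow') {
                best = rule;
            }
        }
    }
    return best.type === 'allow';
}

console.log(isAllowed('/public/page.html'));  // true
console.log(isAllowed('/admin/users'));       // false
console.log(isAllowed('/blog/post'));         // true (no rule matches)

Real parsers also handle the * and $ wildcards and group rules by User-agent; this sketch only shows the precedence idea.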

Impact on Your Business

SEO Consequences:

  • Pages not indexed - Content invisible in search results
  • Lost organic traffic - Users can't find your site
  • Revenue loss - Product pages blocked = no sales
  • Brand invisibility - Company doesn't appear in searches
  • Competitor advantage - They rank, you don't

Common Disasters:

# DISASTER 1: Blocks entire site
User-agent: *
Disallow: /
# Result: NOTHING gets indexed!

# DISASTER 2: Blocks all CSS/JS
User-agent: *
Disallow: /*.css$
Disallow: /*.js$
# Result: Google can't render pages properly

# DISASTER 3: Exposes sensitive info
User-agent: *
Disallow: /admin-login-page/
Disallow: /customer-database/
Disallow: /financial-reports/
# Result: Hackers know exactly where to look!

Real-World Examples:

  • BBC accidentally blocked itself from Google - massive traffic drop
  • Major retailer blocked /products/ - lost millions in revenue
  • Site blocked JavaScript - Google couldn't render pages

How to Diagnose

Method 1: Check Robots.txt Exists

  1. Visit https://yoursite.com/robots.txt
  2. Check if file loads
  3. Review contents
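
You can also run this check from a script. A minimal sketch in Node.js 18+ (which ships a global fetch); the URL is a placeholder for your own site:

// Quick robots.txt existence check (Node.js 18+, global fetch assumed)
const url = 'https://example.com/robots.txt';  // placeholder - use your own domain

(async () => {
    const res = await fetch(url);
    console.log('Status:', res.status);                            // expect 200
    console.log('Content-Type:', res.headers.get('content-type')); // expect text/plain

    if (res.ok) {
        console.log(await res.text());  // review the rules by eye
    } else {
        // Google treats a 404 as "no restrictions"; persistent 5xx errors
        // can be treated as a temporary block of the whole site.
        console.log('robots.txt did not load (status ' + res.status + ')');
    }
})();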

What to Look For:

Good robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Allow: /

Sitemap: https://example.com/sitemap.xml

Problematic robots.txt:

User-agent: *
Disallow: /
# BLOCKS EVERYTHING!

or

# Empty file - wastes opportunity to guide crawlers

Method 2: Google Search Console Robots.txt Tester

  1. Open Google Search Console
  2. Select your property
  3. Go to Settings → robots.txt report (or use the legacy robots.txt Tester)
  4. View your robots.txt file
  5. Test specific URLs

Test Process:

Enter URL to test: https://example.com/products/widget
User-agent: Googlebot
Result: ALLOWED ✅

Enter URL to test: https://example.com/admin/
User-agent: Googlebot
Result: BLOCKED 🚫 (expected)

Method 3: Screaming Frog SEO Spider

  1. Download Screaming Frog
  2. Enter your domain
  3. Click Start
  4. Check Configuration → Robots.txt

What to Check:

  • Does robots.txt exist?
  • Are important pages blocked?
  • Are unnecessary pages allowed?
  • Syntax errors?

Method 4: Manual Syntax Check

Common errors to look for:

# ERROR 1: Wrong location
# Must be at: example.com/robots.txt
# NOT: example.com/blog/robots.txt

# ERROR 2: Case sensitivity
Disallow: /Admin/  # Won't block /admin/ (lowercase)

# ERROR 3: Typos
User-agnet: *  # Typo - should be "User-agent"
Dissallow: /private/  # Typo - should be "Disallow"

# ERROR 4: Invalid wildcards
Disallow: /category/*/page/  # Wildcards not universally supported

# ERROR 5: Wrong syntax
Allow all: /  # Invalid - should be "Allow: /"
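
A few lines of code can catch the typo class of mistake before you deploy. A rough lint sketch in Node.js that flags directives it does not recognize; the directive list is an assumption covering the common ones, and this is not a full validator:

// Rough robots.txt lint: flags unknown directives (catches typos like
// "User-agnet" or "Dissallow"). Not a full syntax validator.
const KNOWN = ['user-agent', 'disallow', 'allow', 'sitemap', 'crawl-delay'];

function lintRobotsTxt(content) {
    const problems = [];
    content.split('\n').forEach((line, i) => {
        const trimmed = line.trim();
        if (trimmed === '' || trimmed.startsWith('#')) return;  // blank or comment
        const [directive] = trimmed.split(':');
        if (!KNOWN.includes(directive.trim().toLowerCase())) {
            problems.push(`Line ${i + 1}: unknown directive "${directive.trim()}"`);
        }
    });
    return problems;
}

console.log(lintRobotsTxt('User-agnet: *\nDissallow: /private/\nAllow: /'));
// [ 'Line 1: unknown directive "User-agnet"', 'Line 2: unknown directive "Dissallow"' ]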

Method 5: Check for Blocking Important Resources

  1. View page source
  2. Look at CSS/JS file paths
  3. Check if robots.txt blocks them

Test:

# Your robots.txt:
User-agent: *
Disallow: /assets/

# Your page uses:
<link rel="stylesheet" href="/assets/css/style.css">
<script src="/assets/js/app.js"></script>

# Problem: Google can't load CSS/JS to render page!
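
To automate this check, you can pull the href/src paths out of a page and compare them against your Disallow prefixes. A rough Node.js sketch using a regex (good enough for a spot check, not a real HTML parser); the prefixes and markup here are examples:

// Rough check: which asset URLs on a page fall under a Disallow prefix?
const disallowed = ['/assets/'];  // prefixes copied from your robots.txt

const html = `
<link rel="stylesheet" href="/assets/css/style.css">
<script src="/assets/js/app.js"></script>
<script src="/js/vendor.js"></script>
`;

const assetPaths = [...html.matchAll(/(?:href|src)="([^"]+)"/g)].map(m => m[1]);

for (const path of assetPaths) {
    const blocked = disallowed.some(prefix => path.startsWith(prefix));
    console.log(`${path} -> ${blocked ? 'BLOCKED by robots.txt' : 'allowed'}`);
}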

General Fixes

Fix 1: Create Proper Robots.txt

Recommended structure for most sites:

# /robots.txt
User-agent: *

# Block admin areas
Disallow: /admin/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Block user accounts & checkout
Disallow: /account/
Disallow: /checkout/
Disallow: /cart/

# Block search & filter pages (duplicate content)
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=

# Block private directories
Disallow: /private/
Disallow: /temp/

# Allow everything else
Allow: /

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml

Fix 2: Never Block CSS/JavaScript

DON'T DO THIS:

# BAD - Blocks resources Google needs
User-agent: *
Disallow: /*.css$
Disallow: /*.js$
Disallow: /assets/
Disallow: /static/

DO THIS:

# GOOD - Allow CSS/JS for rendering
User-agent: *
Disallow: /admin/

# Explicitly allow assets
Allow: /assets/
Allow: /css/
Allow: /js/
Allow: /static/

Fix 3: Fix "Disallow: /" Blocking

PROBLEM:

User-agent: *
Disallow: /
# Blocks entire website!

SOLUTION:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
# Only blocks specific directories

Fix 4: Block Search & Filter Pages

Prevent duplicate content issues:

User-agent: *

# Block URL parameters
Disallow: /*?*  # Blocks all URLs with parameters
# OR be more specific:
Disallow: /*?q=  # Block search queries
Disallow: /*?sort=  # Block sorted pages
Disallow: /*?page=  # Block pagination
Disallow: /*?filter=  # Block filters

# Allow specific parameters you want indexed
Allow: /*?utm_source=  # Allow tracking parameters in search

Fix 5: Sitemap Reference

Include sitemap location:

User-agent: *
Disallow: /admin/

# One or more sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-blog.xml

# For international sites
Sitemap: https://example.com/sitemap-en.xml
Sitemap: https://example.com/sitemap-fr.xml

Fix 6: Different Rules for Different Bots

Customize per search engine. Note that a crawler obeys only the group whose User-agent line matches it most specifically; rules under User-agent: * do not combine with a bot-specific group, so repeat any shared rules in each group:

# Google
User-agent: Googlebot
Disallow: /private/
Allow: /

# Bing
User-agent: Bingbot
Disallow: /private/
Crawl-delay: 10

# Block bad bots
User-agent: MJ12bot
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

# Default for all others
User-agent: *
Disallow: /private/

Fix 7: E-commerce Specific

Optimize for online stores:

User-agent: *

# Block checkout & account pages
Disallow: /checkout/
Disallow: /cart/
Disallow: /my-account/
Disallow: /login/
Disallow: /register/

# Block filtered/sorted product pages
Disallow: /*?orderby=
Disallow: /*?filter
Disallow: /*?min_price=
Disallow: /*?max_price=

# Allow product images
Allow: /wp-content/uploads/

# Block paginated category pages (thin/duplicate content)
Disallow: /*/page/

# Sitemaps
Sitemap: https://example.com/product-sitemap.xml
Sitemap: https://example.com/category-sitemap.xml

Fix 8: Development/Staging Site

Block entire staging site:

# robots.txt for staging.example.com
User-agent: *
Disallow: /

# Prevent indexing of dev site
# Also add <meta name="robots" content="noindex, nofollow"> to all pages

Platform-Specific Guides

Detailed implementation instructions for your specific platform:

  • Shopify - Shopify Robots.txt Guide
  • WordPress - WordPress Robots.txt Guide
  • Wix - Wix Robots.txt Guide
  • Squarespace - Squarespace Robots.txt Guide
  • Webflow - Webflow Robots.txt Guide

Verification

After updating robots.txt:

Test 1: Direct Access

  1. Visit https://yoursite.com/robots.txt
  2. Verify changes are live
  3. Check syntax is correct

Test 2: Google Search Console

  1. Go to the robots.txt report (or legacy Tester)
  2. Test important URLs
  3. Verify they're allowed
  4. Test admin URLs blocked

Test 3: URL Inspection (formerly Fetch as Google)

  1. Google Search Console
  2. URL Inspection tool
  3. Enter important product/page URL
  4. Click "Test Live URL"
  5. Should report that crawling is allowed ("Crawl allowed? Yes")

Test 4: Wait and Monitor

  1. Wait 24-48 hours for Google to recrawl
  2. Check the Search Console Page indexing (Coverage) report
  3. Previously blocked pages should appear
  4. Check organic traffic increases

Common Mistakes

  1. Blocking entire site - Disallow: / blocks everything
  2. Blocking CSS/JS - Google can't render pages
  3. Case sensitivity - /Admin/ does not block /admin/
  4. Wrong location - Must be at root domain
  5. Listing sensitive directories - Don't advertise what to attack
  6. Not including sitemap - Missed opportunity to guide crawlers
  7. Wildcards - Not all bots support * in paths
  8. Not testing - Always test before deploying

Advanced Topics

Virtual Robots.txt

Serve robots.txt dynamically:

// Node.js/Express example
const express = require('express');
const app = express();

app.get('/robots.txt', (req, res) => {
    const robotsTxt = `
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: ${req.protocol}://${req.get('host')}/sitemap.xml
    `.trim();

    res.type('text/plain');
    res.send(robotsTxt);
});
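
Because the file is generated per request, you can also vary it by environment, which is a tidy way to implement the staging block from Fix 8. A minimal sketch, assuming the conventional NODE_ENV variable marks production:

// Hypothetical variation: serve a restrictive robots.txt outside production
const express = require('express');
const app = express();

app.get('/robots.txt', (req, res) => {
    const isProduction = process.env.NODE_ENV === 'production';

    const robotsTxt = isProduction
        ? ['User-agent: *', 'Disallow: /admin/', 'Allow: /', '',
           `Sitemap: ${req.protocol}://${req.get('host')}/sitemap.xml`].join('\n')
        : 'User-agent: *\nDisallow: /';  // staging/dev: block all crawling

    res.type('text/plain');
    res.send(robotsTxt);
});

app.listen(3000);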

Crawl Delay

Slow down aggressive crawlers:

User-agent: *
Crawl-delay: 10
# Wait 10 seconds between requests
# Note: Google ignores Crawl-delay; Bing and Yandex respect it

Noindex vs Robots.txt

When to use each:

Robots.txt (prevents crawling):
- Private or low-value areas you don't want crawled
- Duplicate content (filtered, sorted, or internal search URLs)
- Conserving crawl budget on large sites

Meta robots / X-Robots-Tag (prevents indexing):
- Pages that should be crawled but kept out of the index
- Use: <meta name="robots" content="noindex, follow"> or the X-Robots-Tag HTTP header

Important: robots.txt only stops crawling. A blocked URL can still end up indexed (without a snippet) if other pages link to it, and a noindex tag on a blocked page will never be seen. To keep a page out of the index, leave it crawlable and apply noindex.
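
For pages you cannot easily edit, or for non-HTML files such as PDFs, the same noindex signal can be sent as an HTTP header. A minimal Express sketch, with a hypothetical /search route standing in for a crawl-but-don't-index page:

// Send noindex as an HTTP header (X-Robots-Tag) instead of a meta tag.
// The /search route is a hypothetical crawl-but-don't-index page.
const express = require('express');
const app = express();

app.get('/search', (req, res) => {
    res.set('X-Robots-Tag', 'noindex, follow');  // crawlable, but kept out of the index
    res.send('<html><body>Internal search results</body></html>');
});

app.listen(3000);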
