A robots.txt file provides guidelines for web crawlers, such as those used by search engines, indicating which areas of your website they should or shouldn’t access. It’s important to note that robots.txt is not a security tool; it won’t protect sensitive areas, it simply asks compliant crawlers to avoid them. A well-configured robots.txt is beneficial for SEO, directing crawlers towards important sections while reducing unnecessary server load.
Here’s a sample robots.txt file to give you an idea:
User-agent: *
Disallow: /archive/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /archive/uploads/
Sitemap: https://www.pbrumby.com/image-sitemap-1.xml
Sitemap: https://www.pbrumby.com/sitemap-1.xml
Elements of a robots.txt file
User-agent
User-agent specifies which web crawlers a set of rules is meant for. A web crawler is an automated program that search engines use to discover and read the pages on your website.
User-agent: * means the rules that follow apply to all web crawlers. User-agent: Googlebot would mean the rules only apply to Google’s crawler.
You can set multiple User-agent rules in a single robots.txt file to target different web crawlers with specific rules. Each set of rules for a User-agent starts with the User-agent line and ends when another User-agent line starts or the file ends.
For example, Google’s own robots.txt is quite complex and targets various user-agents differently. Check out Google’s robots.txt for inspiration.
Here’s a simplified example:
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /sensitive/
User-agent: *
Disallow: /hidden/
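A compliant crawler follows only the most specific group that matches its name, so Googlebot would obey its own block and ignore the catch-all rules. If you want to check how the groups resolve for a particular crawler, Python’s standard urllib.robotparser module can parse the rules locally. A minimal sketch, using the example above and a placeholder example.com domain:

import urllib.robotparser

# The multi-group example from above; example.com is a placeholder domain.
rules = """
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /sensitive/

User-agent: *
Disallow: /hidden/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot follows only its own group, so /private/ is blocked for it...
print(parser.can_fetch("Googlebot", "https://www.example.com/private/page"))    # False
# ...but /hidden/ is not, because the catch-all group does not apply to Googlebot.
print(parser.can_fetch("Googlebot", "https://www.example.com/hidden/page"))     # True
# Any other crawler falls back to the * group.
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/hidden/page"))  # False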
Disallow and Allow
In a robots.txt file, Disallow and Allow commands control where web crawlers can go on your site. They are followed by a path to specify which part of the website you’re talking about.
Disallow: /something/ stops crawlers from scanning the ‘something’ folder and all its subfolders and files. Allow: /something/images/ lets crawlers scan the ‘images’ folder inside it. The ‘Allow’ rule overrides the ‘Disallow’ for that specific path.
You can have multiple Disallow and Allow lines for each User-agent.
It’s important to note that you don’t have to ‘Allow’ everything. By default, if a path isn’t mentioned in a Disallow line, it’s fair game for crawlers. So, unless you explicitly block it with Disallow, it’s accessible to the crawlers.
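You can see this default-allow behaviour by checking a couple of paths locally with the same urllib.robotparser module. A small sketch, where /hidden/ and /blog/ are just placeholder paths:

import urllib.robotparser

# Only /hidden/ is disallowed; every other path is allowed by default.
rules = """
User-agent: *
Disallow: /hidden/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyCrawler", "/hidden/page.html"))  # False, explicitly disallowed
print(parser.can_fetch("MyCrawler", "/blog/post.html"))    # True, not mentioned, so allowed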
Sitemap
The Sitemap directive in robots.txt tells search engines where to find your XML sitemap, which lists the pages on your site you want them to crawl and index.
You can specify multiple sitemaps:
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/another-sitemap.xml
It’s important to use the full web address, including the protocol (for example ‘https’). Including sitemaps helps with SEO by directing search engines to the key pages you want indexed, ensuring they recognise all the significant parts of your site, which in turn supports your site’s visibility in search results.
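If you ever need to read the sitemap URLs out of a robots.txt file programmatically, urllib.robotparser exposes them through site_maps() (available from Python 3.8). A minimal sketch, with the example.com addresses as placeholders:

import urllib.robotparser

rules = """
User-agent: *
Disallow: /hidden/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/another-sitemap.xml
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Returns the declared Sitemap URLs as a list, or None if there are none.
print(parser.site_maps())
# ['https://www.example.com/sitemap.xml', 'https://www.example.com/another-sitemap.xml']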
Creating the file
To create a robots.txt file:
- Open a plain text editor on your computer.
- Type the required content and directives for the web crawlers.
- Save the file with the name ‘robots.txt’.
- Upload the saved file to the root directory of your website.
- Ensure the file is accessible by visiting https://yourdomain.com/robots.txt in a web browser.
This placement ensures search engines can locate and abide by the rules you’ve set for your site.
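A quick way to confirm the upload worked is to request the file and check for a 200 response. The sketch below uses Python’s standard urllib.request; www.example.com is a placeholder for your own domain:

import urllib.request

# Replace with your own domain; the file must live at the site root.
url = "https://www.example.com/robots.txt"

with urllib.request.urlopen(url) as response:
    print(response.status)                    # Expect 200 if the file is in place
    print(response.read().decode("utf-8"))    # The contents of your robots.txt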
How to test
You can test your file with Google’s robots.txt checker.
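If you prefer a quick local check, you can also point urllib.robotparser at the live file and test the URLs you care about. A minimal sketch, again with placeholder URLs:

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")  # placeholder domain
parser.read()  # fetches and parses the live file

# Check the pages you care about against the rules Googlebot would see.
for path in ("/", "/wp-admin/", "/archive/uploads/photo.jpg"):
    allowed = parser.can_fetch("Googlebot", "https://www.example.com" + path)
    print(path, "->", "allowed" if allowed else "blocked")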
Summary
A robots.txt file guides search engines on how to interact with a website. Its primary purpose isn’t to hide content from search engines but to avoid overwhelming the site with unnecessary requests, saving on server costs. While you can use it to block specific user-agents, including AI crawlers, that isn’t always advisable, especially as tools like ChatGPT may become prominent in future search methods.
Beyond managing traffic, robots.txt also serves as an SEO tool. It directs search engines like Google towards the most vital parts of your website, ensuring these areas get priority and focused attention during crawls.
