Semalt Expert Speaks On Common Robots.txt Issues and Explains How To Fix Them

Robots.txt is a useful and powerful SEO asset. As SEO experts, we create articles like this one to teach you what to expect when dealing with the different parts of SEO. In a nutshell, robots.txt is a file that tells search engine crawlers how you want them to crawl your site.

There is no guarantee that Google or any other search engine will follow these instructions. What this means is that you cannot rely on robots.txt to keep a web page out of Google's search results. Instead, the robots.txt file is mainly used to manage crawler traffic and to prevent your server from getting overloaded by crawler requests.

In this guide, we will be looking at some of the most common issues we've encountered with robots.txt files and the impact they have on your website and search visibility. 

But first, let's take a quick look at the robots.txt file.

What is Robots.txt?

Robots.txt is a plain text file placed in the root directory of your website to guide web crawlers as they discover and crawl your pages. To serve its intended purpose, the robots.txt file must be placed in the topmost directory of your site. Placing it in a subdirectory won't deliver the desired effect, as search engines will ignore it.
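To make this concrete, here is a minimal sketch of how a well-behaved crawler reads such a file. It uses Python's standard urllib.robotparser module; the file content and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks each URL against the rules before requesting it.
blog_allowed = parser.can_fetch("Googlebot", "https://www.example.com/blog/post.html")
private_allowed = parser.can_fetch("Googlebot", "https://www.example.com/private/notes.html")

print(blog_allowed)     # the blog page is not covered by any Disallow rule
print(private_allowed)  # the /private/ page is covered by "Disallow: /private/"
```

Nothing forces a crawler to run a check like this, which is why robots.txt works as a request rather than a lock.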

Creating a robots.txt file is simple and can be done in a matter of seconds using a plain text editor such as Notepad. There are also alternatives to robots.txt. You can include a robots meta tag within the page code itself.

In some cases, we can use the X-Robots-Tag HTTP header to influence how content appears in search results. 

Why Do You Need A Robots.txt File?

Robots.txt enables us to achieve different results depending on the content type. Here are some typical uses of robots.txt:

To stop crawlers from crawling a web page

In this case, the page may still appear in SERPs, but it will not have a text description. The HTML content on the page will not be crawled either.

To stop media files from appearing in SERP

Robots.txt can be used to prevent your images, videos, and audio files from appearing in search results. If these files are public, they will still exist online and can be viewed. Private content, however, will be kept away from Google search results.

To Block Resource files like unimportant external scripts

This means that if Google attempts to crawl a page that needs a resource file to load, the Google bot will 'see' a version of the page without that resource file. Doing this may affect the indexing of the page.

While we cannot use robots.txt to prevent a web page from appearing on SERPs completely, we can use noindex meta tags. Adding a noindex meta tag to the head of a page prevents it from appearing on SERPs.

What Can Go Wrong If There's a Mistake in Your Robots.txt File

Mistakes in robots.txt can have heartbreaking consequences, but it's not the end of the world. Fixing these mistakes will help you recover from any drawbacks quickly. In most cases, these mistakes can be completely erased with no trace. 

Web crawlers are very flexible, and in most cases, minor errors in the robots.txt file hardly have an effect. At worst, unsupported or incorrect directives will simply be ignored. Once you've identified a fault in your robots.txt file, it is relatively easy to fix.

Common Robots.txt Mistakes and How To Fix Them

If you notice your website behaving strangely in search results, consider checking your robots.txt file to look for any syntax errors, mistakes, and overreaching rules. Here are some common mistakes you should look out for. 

Robots.txt Not In The Root Directory

Search crawlers can only discover the robots.txt file if it's stored in the root folder. That is why there must be only one slash (/) between the domain of your website and the robots.txt filename in the URL of your robots.txt file. 

If you save your robots.txt file in a subfolder, it will most likely remain invisible to search robots. In that case, your website will behave as if it has no robots.txt file at all.

To fix this issue, move your robots.txt file to your root directory. This isn't a difficult process, and all you'll need is root access to your server. Some content management systems automatically upload files to a media subdirectory, so keep that in mind and make sure your robots.txt file ends up in the root directory.
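As a rough illustration of the "one slash" rule, the sketch below derives the only location where crawlers look for the file from any page URL, and checks whether a given robots.txt URL sits at the root. The function names and example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_at_root(robots_url: str) -> bool:
    """True only if the file sits directly under the domain."""
    return urlsplit(robots_url).path == "/robots.txt"


expected_location = robots_txt_url("https://www.example.com/blog/2021/post.html?ref=home")
print(expected_location)
print(is_at_root("https://www.example.com/robots.txt"))
print(is_at_root("https://www.example.com/media/robots.txt"))
```

A file uploaded to a path like /media/robots.txt fails the second check, which matches the symptom of a site behaving as if it has no robots.txt.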

Poor Use Of Wildcards

Robots.txt supports two wildcard characters: the asterisk (*), which matches any sequence of characters, and the dollar sign ($), which marks the end of a URL.

It is safer to use wildcards only when you have to, as they can restrict much broader portions of your website than intended. Using wildcards carelessly can cause you to restrict robots from accessing your entire site. It is not unusual for an audit to reveal that a website has not been showing up on SERPs because of a misplaced asterisk.

If you have a wildcard issue, you need to find where the incorrect wildcard was placed. You can then move or delete it so your robots.txt can function as it should.

Noindex in Robots.txt
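One way to audit a suspicious rule is to test it against a few real paths from your site. The sketch below is a simplified matcher, assuming that the asterisk (*) matches any run of characters and that a trailing dollar sign ($) anchors the rule to the end of the URL. The function name and sample paths are hypothetical, and real crawlers may differ in edge cases.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each '*' stand for any characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # Rules always match from the start of the path.
    return re.match(regex, path) is not None


# A narrowly scoped rule: only URLs that end in .pdf.
print(rule_matches("/*.pdf$", "/docs/guide.pdf"))
print(rule_matches("/*.pdf$", "/docs/guide.pdf?download=1"))

# A careless rule: "/*" matches every path on the site.
print(rule_matches("/*", "/blog/post.html"))
```

Running candidate rules against a list of pages you want indexed is a quick way to spot a pattern that reaches further than intended.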

This is more common in older websites. Google stopped obeying noindex rules in robots.txt files in 2019. If your robots.txt file was created before that date and still contains noindex rules, you may be affected: you're likely to see those pages indexed in Google search results even though the file tells crawlers not to index them.

The solution is to use an alternative noindex method. A noindex robots meta tag in the head of your web page will prevent Google from indexing its content.
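To confirm that a page actually carries the tag, you could scan its HTML. The sketch below uses Python's standard html.parser module; the class name, function name, and sample markup are hypothetical, and it only inspects the robots meta tag, not HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head><body>Hi</body></html>'
print(has_noindex(page))
print(has_noindex("<html><head><title>Home</title></head><body>Hi</body></html>"))
```

Keep in mind that a crawler can only see this tag if it is allowed to crawl the page, so the page must not also be disallowed in robots.txt.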

Blocked Scripts and Stylesheets

While you may consider it logical to block crawlers' access to external JavaScript and cascading stylesheets (CSS), you may be doing your website harm. Remember that Googlebot needs access to both CSS and JS files in order to see your HTML and PHP pages correctly. 

If you notice your pages behaving strangely in Google's results, or it looks like Google is misinterpreting those pages, check whether you're blocking crawlers from accessing the necessary external files. 

To fix this, remove the line that's blocking access from your robots.txt file, or insert an exception that restores crawler access to the necessary CSS and JavaScript.
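The sketch below compares a blocking rule with a version that adds exceptions, again using Python's urllib.robotparser. The /assets/ directory layout and the helper name are assumptions made for illustration. Python's parser applies the first rule that matches, so the Allow lines are listed before the Disallow line; other crawlers may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Parse robots.txt text and report whether the agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rule that hides every asset, including CSS and JavaScript.
BLOCKING = """\
User-agent: *
Disallow: /assets/
"""

# Same rule with exceptions that restore access to stylesheets and scripts.
WITH_EXCEPTIONS = """\
User-agent: *
Allow: /assets/css/
Allow: /assets/js/
Disallow: /assets/
"""

css_url = "https://www.example.com/assets/css/site.css"
image_url = "https://www.example.com/assets/img/logo.png"

print(allowed(BLOCKING, css_url))           # stylesheet is blocked
print(allowed(WITH_EXCEPTIONS, css_url))    # stylesheet is reachable again
print(allowed(WITH_EXCEPTIONS, image_url))  # the rest of /assets/ stays blocked
```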

No Sitemap URL

You can include the URL of your sitemap in your robots.txt file. In fact, we encourage this because it benefits not only your robots.txt file but your SEO itself.

Robots.txt is the first place Googlebot looks when crawling your website, so a sitemap listed there helps crawlers understand how to navigate the site and which pages are the most important.

On its own, a missing sitemap URL isn't strictly considered an error, because omitting it shouldn't negatively affect the core functionality of your site or how it appears in the search results. But if you're looking for new ways to optimize your site for SEO, consider adding your sitemap URL to robots.txt.
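As a quick check, the sketch below parses a robots.txt file and lists any sitemap URLs it declares. It relies on the site_maps() method of Python's urllib.robotparser (available from Python 3.8), and the file content and sitemap URL are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that declares a sitemap alongside its crawl rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None when no Sitemap line exists.
sitemaps = parser.site_maps()
print(sitemaps)

# A file without a Sitemap line, for comparison.
bare = RobotFileParser()
bare.parse(["User-agent: *", "Disallow: /private/"])
missing = bare.site_maps()
print(missing)
```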

Access to Development Sites 

Blocking web crawlers from accessing a live site is bad for rankings, but it's equally bad to let search engine bots crawl a website or page under development. It is common practice to add a disallow instruction to the robots.txt file when a website is still under construction, so its contents don't surface in search results until you're ready to show the world what you've been up to.

Equally, it is important to remove the disallow instruction once you've completed the site and you're ready to launch. It is common to forget about removing this instruction, but it comes at a heavy price. With the disallow instruction still part of your robots.txt, you can stop your entire website from being crawled or indexed correctly. 

If you recently launched your site and it's performing worse than anticipated, look for a universal user agent disallow rule in your robots.txt file.
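A leftover rule of this kind is easy to detect automatically. The sketch below flags a robots.txt file that blocks the site root for a given crawler; the function name and the sample files are hypothetical, and the parsing is done by Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, agent: str = "Googlebot") -> bool:
    """True if the rules stop the agent from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


# Typical rule left over from a development site.
STAGING = """\
User-agent: *
Disallow: /
"""

# What the file might look like after launch.
LIVE = """\
User-agent: *
Disallow: /private/
"""

print(blocks_entire_site(STAGING))
print(blocks_entire_site(LIVE))
```

A check like this could run as part of a launch checklist so the staging rule never reaches production unnoticed.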

Final notes

If mistakes in your robots.txt file are having unwanted effects on your website's search appearance, the first step is correcting them. You can identify and fix these errors using SEO crawling tools. Once you fix the mistake, it's only a matter of time before things return to normal.

Generally speaking, prevention is better than cure, which is why it's always important that you get professionals to write your robots.txt file. Our experts here at Semalt would love to help you out, so reach out, and we will get back to you shortly.