With all the effort you’re putting in to increase website traffic, why would you want to ask search engines NOT to crawl your webpages?

Or at least this is what so many bloggers and webmasters think.

And this is why so many of them never bother to find out what Robots.txt files *really* are, let alone learn how to create one.

I would know… because I was one of them a couple of years back.

So, what does Robots.txt mean? It’s basically a “Do Not Enter” sign for the search engines.

It’s a plain text file, stored in the root (main) directory of your website, that tells the search spiders which webpages/links on your domain NOT TO CRAWL.

How does Robots.txt work?

Again, the concept is extremely simple. Reputable search engines don’t barge into every link they come across like an outlaw.

Before crawling, the spiders make sure they have permission to do so.

So, before they access the pages on your site, the first thing they look at is the Robots.txt file, to make sure they have permission to access and crawl each page.

In this file, you can tell Google, Bing and other bots to KEEP AWAY from a particular webpage or an entire section when crawling.
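To make that permission check concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The domain, the rule, the `ExampleBot` name and the `may_crawl` helper are all placeholders made up for illustration. Real crawlers usually download the file once per site and reuse it, rather than re-reading it for every page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /terms-and-conditions.html
"""

def may_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The blocked page is refused, every other page is fine.
print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://www.yourwebsitename.com/terms-and-conditions.html"))  # False
print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://www.yourwebsitename.com/blog/first-post.html"))  # True
```

A polite bot would only request a page when a check like this comes back True.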

How to create a Robots.txt file?

A generic Robots.txt file looks like:

User-agent: *
Disallow: /terms-and-conditions.html

Here, ‘User-agent’ with the asterisk value (*) says the rules that follow are for ALL the search bots.

If you DON’T want Google to crawl the page but are fine with Bing crawling it, the User-agent value would be ‘Googlebot’. If it’s the other way around (no to Bing and yes to Google), the User-agent value would be ‘Bingbot’.

Different search engines have different crawlers/bots. If you don’t want a specific search engine to crawl your page, you have to point that out to its bot by mentioning its name in the ‘User-agent’ line.
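Here is a small sketch of that per-bot behaviour, again with Python's standard `urllib.robotparser`. The rules and the domain are hypothetical: Googlebot is told to stay away from one page, and no other bot is mentioned.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block Googlebot from one page, say nothing to other bots.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /terms-and-conditions.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://www.yourwebsitename.com/terms-and-conditions.html"
googlebot_allowed = parser.can_fetch("Googlebot", page)
bingbot_allowed = parser.can_fetch("Bingbot", page)
print(googlebot_allowed, bingbot_allowed)  # False True
```

Since no rule group names Bingbot (and there is no `*` group), Bingbot is free to crawl the page.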

For now, let’s stick to (*), which addresses ALL search engines.

The ‘Disallow’ part is the actual rule that asks the bots NOT to crawl the ‘Terms and Conditions’ page on your domain.

You can include multiple ‘Disallow’ lines.

You can prevent crawling by blocking either one particular webpage (like above) OR a whole category/label, like…

User-agent: *
Disallow: /category1
Disallow: /category2

There’s also an ‘Allow’ directive, by the way.

Say you want to make a category un-crawlable, but there’s one specific page in that category that you would like search engines to crawl. You can use the ‘Allow’ directive to make that possible.

The Robots.txt file will look something like…

User-agent: *
Disallow: /category1
Allow: /category1/product1

You can mix and match these rules however you need to make pages un-crawlable.
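When an ‘Allow’ and a ‘Disallow’ rule both match the same URL, Google and the Robots Exclusion Protocol standard (RFC 9309) go with the most specific rule, meaning the longest matching path. The sketch below hand-rolls that idea so the logic is visible. The `is_allowed` helper is made up for illustration, it handles plain path prefixes only (no wildcards), and it assumes every rule has a non-empty path.

```python
def is_allowed(rules, path):
    """Decide whether path may be crawled.

    rules is a list of (directive, path_prefix) pairs, e.g. ("disallow", "/category1").
    The longest matching prefix wins; on a tie, "allow" wins. No match means allowed.
    """
    best_length = -1
    allowed = True
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue  # this rule does not apply to the path
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed

rules = [("disallow", "/category1"), ("allow", "/category1/product1")]
print(is_allowed(rules, "/category1/product1"))  # True
print(is_allowed(rules, "/category1/product2"))  # False
print(is_allowed(rules, "/about"))  # True
```

So the one product page stays crawlable while the rest of the category is blocked.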

Including sitemap in Robots.txt file

It’s also good practice to include your sitemap in the robots.txt file.

Keeping the technical details aside, think about it…

Before crawling a website, the search bots read the Robots.txt file first (if the file is located in the root directory) to see whether they have permission to crawl or not.

If the first file they read on your domain points them to your XML sitemap, which is a central list of all the webpages on your website, would that not make crawling quicker and easier?

Of course, it would help.

So, it makes sense to have your Robots.txt and sitemap.xml work hand-in-hand. Tell the search bots what not to crawl, and then what to crawl and index.

Now, including the sitemap in robots.txt is fairly straightforward. You simply add another line to your existing file:

Sitemap: http://www.yourwebsitename.com/sitemap.xml

Change the above link if the URL of your sitemap is different.

After including the XML sitemap, here’s what robots.txt should look like in its most basic form:

User-agent: *
Disallow: /category1
Sitemap: http://www.yourwebsitename.com/sitemap.xml

If you have multiple sitemaps, add one line for each, every line with its own URL (the second file name here is just a placeholder). Like:

User-agent: *
Disallow: /category1
Sitemap: http://www.yourwebsitename.com/sitemap.xml
Sitemap: http://www.yourwebsitename.com/sitemap2.xml
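To see how a bot might pick up those lines, here is a rough sketch that pulls every ‘Sitemap’ URL out of a robots.txt file. The `sitemap_urls` helper and the file content are made up for illustration; real crawlers have their own parsers.

```python
def sitemap_urls(robots_txt):
    """Collect the URLs from every 'Sitemap:' line in robots_txt."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the colon in "http://" is left alone.
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

# Hypothetical file with two sitemap lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /category1
Sitemap: http://www.yourwebsitename.com/sitemap.xml
Sitemap: http://www.yourwebsitename.com/sitemap2.xml
"""
print(sitemap_urls(ROBOTS_TXT))
```

The ‘Sitemap’ lines are picked up no matter which ‘User-agent’ group they sit next to, because they are not tied to any one bot.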

Here's what the Robots.txt file of The New York Times looks like:

[Image: the Robots.txt file of The New York Times]

*****

Everything sound good so far? Now you know why putting the sitemap in the Robots.txt file is a good choice?
GREAT!

Now, before moving ahead, it’s also important that we draw a proper distinction between Robots.txt and the Noindex meta tag.

Because many people confuse robots.txt with the noindex directive.

Robots.txt vs. Noindex

If you noticed, in the part above all I talked about is ‘crawl’ and NOT ‘index’. And this is where the biggest difference between Robots.txt and the Noindex meta tag lies.

When you use a Robots.txt file, the search engine DOES NOT crawl that particular webpage. But that is all it promises.

Now imagine: You publish a new article and, in it, link to a particular webpage that you originally wanted to be un-crawlable.

When the spider crawls this published article, it discovers the link. Well-behaved bots will still obey robots.txt and won’t read the blocked page itself, but the search engine now knows the URL exists.

And a URL it knows about can still be indexed, even without being crawled.

And if indexed, there’s a chance it will show up on the SERP, usually as a bare link with little or no description.

Another situation is when someone links to this particular ‘un-crawlable webpage’ from their own page. The search engine learns about the URL from there, and your robots.txt can’t do anything about that.

Meaning, just because you’re asking search bots not to crawl a page doesn’t necessarily mean the page stays out of the index.

This is why even Google says, “You should not use robots.txt as a means to hide your web pages from Google Search results.”

Because Robots.txt doesn’t hide your page from search engines, it might get indexed by search spiders and show up on SERP.

This is the flaw that the Noindex meta tag fixes. It does exactly what its name says.

The Noindex tag ensures the search bots DO NOT index your page. It ensures the search engines DO NOT list that page on their result pages.

A noindex meta tag looks something like:

<meta name="robots" content="noindex">

Here, ‘robots’ means ALL search engine crawlers.

Again, if you want to keep just one search engine from indexing the page while the others still can, replace ‘robots’ with the name of that crawler (for example, ‘googlebot’).

The content value ‘noindex’ is the directive that tells the crawlers they are not allowed to index this particular webpage. One catch: the bots have to crawl the page to see the tag, so don’t block the same page in Robots.txt.
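If you want to double-check that a page really carries the tag, a short script can do it. This is only a sketch built on Python's standard `html.parser` module. The `RobotsMetaParser` class and `has_noindex` helper are made-up names, and the sketch looks only at the generic ‘robots’ meta tag, not at bot-specific ones.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # The content value may hold several comma-separated directives.
            self.directives.update(part.strip().lower() for part in content.split(","))

def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

print(has_noindex('<html><head><meta name="robots" content="noindex"></head></html>'))  # True
print(has_noindex('<html><head><title>Hello</title></head></html>'))  # False
```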

Still with me?

It’s okay if not all of this makes sense yet. Bookmark this post and give the section a re-read sometime later. For now, continue with the rest.

Why is Robots.txt important?

It’s another very common question that people ask.

After all, isn’t it a good thing if the search bots crawl the page? Would it not help the search ranking?

The simple answer is NO.

The reason you want to use a Robots.txt file is that you don’t want search bots to crawl webpages that aren’t important.

There are many pages on your website that are irrelevant to searchers and serve only behind-the-scenes purposes.

They aren’t exactly meant for visitors to land on from search. The thank-you page, internal lead generation pages and so much more…

Using this exclusion protocol for each of them is a fitting idea here.

Other benefits of Robots.txt include:
  • Keeping bots away from duplicate content on the website
  • Informing search bots about the location of the XML sitemap

But aside from these minor benefits, there’s a very direct SEO advantage of using the Robots.txt file.

Are You Wasting Crawl Budget?

Here’s how Google explains what a Robots.txt file is used for…

[Image: Google’s explanation of what Robots.txt is used for]

The key line is—“(Not) waste crawl budget crawling unimportant or similar pages on your site.”

Here’s the thing: Big websites like Huffington Post publish hundreds of articles every day. And they have been around for years now. Meaning, they are sitting on an enormous pile of content.

Sadly, by and large, the old articles aren’t of much use to them because they focus on time-sensitive topics.

So, when search engines crawl their domain, do you think the bots crawl all the thousands of old posts? Or, for that matter, do you think they give the latest posts and the old posts equal priority?

The answer is: NO.

As is evident in Google’s explanation above, each website gets a “crawl budget”.

Now, presumably, the size of this budget depends on a range of factors, like backlinks, domain authority, PageRank, brand mentions and more.

Meaning, large websites like HuffPost likely get a larger crawl budget than a newer, lesser-known website.

However, everyone gets a (limited) budget nonetheless.

Now imagine you’re spending this crawl budget on old, irrelevant, unnecessary URLs. Wouldn’t that be a waste?

Of course, it would!

This is where Robots.txt for SEO comes in. It makes URLs un-crawlable, which saves the crawl budget for the pages that matter.

Coming back to our Huffington Post example, they certainly wouldn’t make their old posts un-crawlable.

However, to counter the challenge of duplicate content, which is a big source of crawl budget wastage, they have included archive, search, feed, and referral URLs in their Robots.txt file. Here it is:

[Image: Huffington Post’s Robots.txt file]

Similarly, if you have multiple categories/labels, an internal search option, a long archive, and other irrelevant URLs on your domain, it’s a great idea to lock them away in the Robots.txt file.

This basic step can help ensure your crawl budget is preserved and used well.
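Before you upload a new set of rules, it can help to see which of your URLs they would actually block. Below is a rough sketch with Python's standard `urllib.robotparser`. The paths, the domain and the `split_by_crawlability` helper are all hypothetical, so swap in your own rules and URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that keep bots out of search, feed, and archive URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /feed
Disallow: /archive
"""

def split_by_crawlability(robots_txt, urls, user_agent="*"):
    """Return (crawlable, blocked) lists for the given URLs."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable, blocked = [], []
    for url in urls:
        target = crawlable if parser.can_fetch(user_agent, url) else blocked
        target.append(url)
    return crawlable, blocked

urls = [
    "http://www.yourwebsitename.com/blog/what-is-robots-txt",
    "http://www.yourwebsitename.com/search/label/seo",
    "http://www.yourwebsitename.com/feed/posts",
    "http://www.yourwebsitename.com/archive/2016",
]
crawlable, blocked = split_by_crawlability(ROBOTS_TXT, urls)
print(len(crawlable), "crawlable;", len(blocked), "blocked")  # 1 crawlable; 3 blocked
```

Every URL in the blocked list is one less request eating into the crawl budget.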

And this is one of the core reasons why you MUST use the Robots.txt file carefully and smartly: it can positively affect the search ranking of your website.

Quick (And Basic) Questions

  • Where should Robots.txt be located?
It must be located at the root (the main directory) of the web host, i.e.
www.yourwebsitename.com/robots.txt

...And NOT in a subdirectory like
www.yourwebsitename.com/category/robots.txt

In the case of Blogger (BlogSpot), the Robots.txt file already exists for you. Check it by adding ‘/robots.txt’ to the end of your blog’s URL.

And to modify this file, go to your dashboard’s settings, click on ‘Search preferences’ and look under the heading ‘Crawlers and indexing’.

  • How do I find my Robots.txt file?
Simple. Type: www.yourdomainname.com/robots.txt

If you don’t find the file there, simply use this tool at sitechecker.pro. If the file exists, the tool will come up with the right URL.
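Since the file always sits at the root of the host, you can work out its address from any page URL on the site. A tiny sketch using Python's standard `urllib.parse` (the `robots_txt_url` helper and the page URL are made up for illustration):

```python
from urllib.parse import urlsplit

def robots_txt_url(page_url):
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the file never lives in a subdirectory.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url("http://www.yourwebsitename.com/category/some-post.html"))
# http://www.yourwebsitename.com/robots.txt
```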

  • How to create a Robots.txt file for SEO?
You can use your regular Notepad to create the file.

Know what to allow and disallow, write the directives as described above, and then save the file as ‘robots.txt’. Then upload it to the root directory of the host.

Avoid using a word processor like MS Word, since it can add formatting that breaks the file.

Instead of writing the directives yourself, you can even use a Robots.txt generator to create the file. And then you can make changes to that.

Here’s a good Robots.txt generator.

Save the result in Notepad and upload it to the root of the host.
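If you’d rather script it than type it, the same file can be generated with a few lines of Python. This is just a sketch. The `build_robots_txt` helper, the paths and the sitemap URL are placeholders, and the temporary folder stands in for wherever you keep the files you upload to your site root.

```python
import tempfile
from pathlib import Path

def build_robots_txt(disallow_paths, sitemap_urls=(), user_agent="*"):
    """Return robots.txt content as plain text, one directive per line."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow_paths]
    lines += [f"Sitemap: {url}" for url in sitemap_urls]
    return "\n".join(lines) + "\n"

content = build_robots_txt(
    ["/category1", "/thank-you.html"],
    ["http://www.yourwebsitename.com/sitemap.xml"],
)

# Save as plain text named exactly "robots.txt"; the temp folder is a stand-in
# for the folder you upload to the root of your host.
output_dir = Path(tempfile.mkdtemp())
robots_path = output_dir / "robots.txt"
robots_path.write_text(content, encoding="utf-8")
print(content)
```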

*****

Now that you know all the whats, hows and whys, it’s time for some action!

Check the Robots.txt file on your domain. See which URLs are made un-crawlable. And then decide which new ones need to be locked in there.

If you don’t already have one, create it. It’s extremely simple. And then use it smartly to boost your SEO.

If you have any questions, ask in the comment section below! :)

Asif Ali

Asif is a certified content marketer and professional full-stack writer with 3 years of experience under his belt. A straightedge, a blogger and a full-time cloud-lover, find him on Facebook and Twitter.


© 2018 Spell Out Marketing