When we talk about the SEO of a WordPress blog, the robots.txt file plays an important role in search engine ranking.
It tells search engine bots which parts of your blog to crawl and which to skip, so crawlers spend their time on your important content. However, a misconfigured robots.txt file can get your blog completely blocked from search engines.
Therefore, whenever you make changes to your robots.txt file, make sure it is well optimized and does not block access to important parts of your blog.
There are many misunderstandings about what Robots.txt does and does not do for indexing, and we will look at this aspect as well.
SEO consists of hundreds of elements, and one of the most important parts is Robots.txt. This small text file located in the root directory of your website can help you optimize your website in depth.
Most webmasters tend to avoid editing the Robots.txt file, but it's actually not that difficult. Anyone with basic knowledge can create and edit a Robots file. If you're new to this, this article is for you.
If your website does not have a Robots.txt file, you can learn how to create one here. If your blog/website already has a Robots.txt file but it is not yet optimized, you can follow this article to optimize it.
What is WordPress Robots.txt and why should we use it?
Let me start with the basics. All search engines have robots that crawl websites. Crawling and indexing are two different things; if you want to learn more, you can read: Google Crawl and Index.
When search engine bots (Google bot, Bing bot, third-party search engine crawlers) visit your website through a link or a sitemap link submitted in your webmaster dashboard, they follow all the links on your blog to crawl and index your website.
Now, these two files – Sitemap.xml and Robots.txt – are located in the root directory of your domain. As I mentioned, robots follow the Robots.txt rules to decide whether to crawl your website or not. Here is how the robots.txt file is used:
When search engine bots visit your blog, they have a limited budget of resources to crawl your site. If they cannot crawl all the pages of your site within that budget, they stop crawling, which hurts your indexing.
Now, there are many parts of your website that you don't want search engine bots to crawl. For example, your WP-admin folder, admin panel, or other pages that are useless to search engines. Using Robots.txt, you can instruct search engine crawlers (bots) not to crawl these areas of your website. This will not only speed up the crawling of your blog, but also help in deep crawling of your inner pages.
The biggest misconception about the Robots.txt file is that people use it for noindexing.
Remember, a Robots.txt file is not for "indexing" or "noindexing." It is for telling search engine robots to stop crawling certain parts of your blog. For example, if you look at the ShoutMeLoud Robots.txt file (WordPress platform), you'll see exactly which parts of my blog I don't want search engine robots to crawl.
The Robots.txt file helps search engine robots by telling them which parts to crawl and which parts to avoid. When a search engine spider visits your website, it fetches the Robots.txt file first and follows its instructions to decide which of your pages it may crawl. Note that a page blocked by Robots.txt can still end up in the index if other sites link to it; to keep a page out of the index, use a noindex meta tag instead.
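To see how crawlers interpret these rules in practice, you can experiment with Python's built-in urllib.robotparser module, which implements the same basic matching logic well-behaved bots use. The ruleset and URLs below are made-up examples, not this blog's actual file:

```python
from urllib import robotparser

# A minimal ruleset like the ones discussed in this article
rules = """
User-agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Regular content is crawlable; the disallowed folders are not
print(rp.can_fetch("Googlebot", "https://example.com/my-post/"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/wp-admin/"))  # False
```

can_fetch answers the same question a polite bot asks before requesting a page: am I allowed to crawl this URL?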
If you use WordPress, you will find a Robots.txt file in the root directory of your WordPress installation.
For static websites, if you or your developer has already created one, you can find it in the root folder. If not, just create a new Notepad file and name it Robots.txt, then upload it to the root directory of your domain using FTP.
How do I create a robots.txt file?
As I mentioned before, Robots.txt is a plain text file. So, if your website doesn't have one, open any text editor you like (such as Notepad) and create a Robots.txt file with one or more records. Each record contains an instruction for search engines. For example:
User-agent: googlebot
Disallow: /cgi-bin
If you write these lines in the Robots.txt file, it means Google's robot (Googlebot) may crawl every page of your website except the cgi-bin folder under the root directory. In other words, Googlebot will not crawl anything inside /cgi-bin.
By using the "Disallow" directive, you can stop any search robot or spider from crawling a page or folder. Many websites disallow their archive folders or pages to avoid duplicate content.
Where do you get the name of the search robot?
You can find bot names in your website's log files, but if you want to attract visitors from all search engines, you should allow every search robot. To do that, write User-agent: * so the rules apply to all search robots. For example:
User-agent: *
Disallow: /cgi-bin
With these rules, every search bot may crawl your website, except for the cgi-bin folder.
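You can verify the difference between a wildcard group and a group naming a specific bot: the wildcard applies to everyone, while a named group leaves all other bots unrestricted. A sketch using Python's standard urllib.robotparser, with made-up URLs:

```python
from urllib import robotparser

# Wildcard group: the rule applies to every bot
wildcard = ["User-agent: *", "Disallow: /cgi-bin"]
rp = robotparser.RobotFileParser()
rp.parse(wildcard)
print(rp.can_fetch("Googlebot", "https://example.com/cgi-bin/env"))  # False
print(rp.can_fetch("Bingbot", "https://example.com/cgi-bin/env"))    # False

# Named group: only Googlebot is restricted, other bots are untouched
named = ["User-agent: googlebot", "Disallow: /cgi-bin"]
rp2 = robotparser.RobotFileParser()
rp2.parse(named)
print(rp2.can_fetch("Googlebot", "https://example.com/cgi-bin/env"))  # False
print(rp2.can_fetch("Bingbot", "https://example.com/cgi-bin/env"))    # True
```

Note that user-agent matching is case-insensitive, so "googlebot" in the file still matches "Googlebot".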
Considerations for Robots.txt files
- Keep comments to a minimum. Lines starting with # are treated as comments and ignored by crawlers.
- Do not start lines with leading spaces, and do not add stray spaces elsewhere in the file. For example:
Bad practice:
   User-agent: *
 Disallow: /support
Good practice:
User-agent: *
Disallow: /support
- Don't change the order of the directives. Always put User-agent first, then the rules.
Bad practice:
Disallow: /support
User-agent: *
Good practice:
User-agent: *
Disallow: /support
- If you want to block multiple directories or pages, do not put them all on one line; use one Disallow line per path:
Bad practice:
User-agent: *
Disallow: /support /cgi-bin /images/
Good practice:
User-agent: *
Disallow: /support
Disallow: /cgi-bin
Disallow: /images
- Use upper and lower case letters correctly: paths in Robots.txt are case-sensitive. For example, if the directory is named "Downloads" but you write "download" in the Robots.txt file, search robots will treat it as a different path and the rule will not match.
- If you want search robots to crawl all pages and directories of your site, write:
User-agent: *
Disallow:
- But if you want to block search robots from all pages and directories of your website, write this:
User-agent: *
Disallow: /
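These two variants are easy to confuse, since they differ by a single slash. A quick check with Python's built-in urllib.robotparser (example.com is a placeholder):

```python
from urllib import robotparser

# Empty Disallow: nothing is blocked, everything may be crawled
allow_all = ["User-agent: *", "Disallow:"]
rp_open = robotparser.RobotFileParser()
rp_open.parse(allow_all)
print(rp_open.can_fetch("Googlebot", "https://example.com/any-page/"))  # True

# Disallow with a bare slash: the entire site is blocked
block_all = ["User-agent: *", "Disallow: /"]
rp_closed = robotparser.RobotFileParser()
rp_closed.parse(block_all)
print(rp_closed.can_fetch("Googlebot", "https://example.com/any-page/"))  # False
```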
After editing the Robots.txt file, upload it to the root or main directory of your website through any FTP software.
WordPress Robots.txt Guide:
You can edit your WordPress Robots.txt file by logging into your server via FTP, or from your WordPress dashboard using a plugin like Robots Meta. Besides the rules themselves, you should also add your sitemap URL to the Robots.txt file: it helps search engine bots find your sitemap, which can speed up indexing of your pages.
Here's an example Robots.txt file that works for any domain. Just replace the sitemap URL with your own:
Sitemap: https://www.sidelineplay.com/sitemap.xml
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /archives/
Disallow: /*?*
Disallow: *?replytocom
Disallow: /comments/feed/
User-agent: Mediapartners-Google
Allow: /
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Mobile
Allow: /
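Before uploading a file like the one above, you can sanity-check it. The sketch below parses a trimmed-down version of the example with Python's urllib.robotparser and confirms that the sitemap line and the per-bot rules behave as intended (robotparser does not understand wildcard paths like /*?*, so those lines are omitted here):

```python
from urllib import robotparser

sample = """
Sitemap: https://www.sidelineplay.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(sample)

# Generic bots fall under the * group and are kept out of /wp-admin/
print(rp.can_fetch("SomeBot", "https://www.sidelineplay.com/wp-admin/"))  # False

# Googlebot-Image has its own group with an explicit Allow
print(rp.can_fetch(
    "Googlebot-Image",
    "https://www.sidelineplay.com/wp-content/uploads/photo.jpg"))  # True

# The sitemap declaration is picked up too (Python 3.8+)
print(rp.site_maps())  # ['https://www.sidelineplay.com/sitemap.xml']
```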
Block Bad SEO Bots (Full List)
There are many SEO tools on the market, such as Ahrefs, SEMrush, Majestic, etc., that constantly crawl your website to collect SEO data about it. Your competitors can use that data to their advantage, while it brings no value to you. On top of that, these SEO crawlers add load to your server and increase your server costs.
Unless you use these SEO tools yourself, it's best to block them from crawling your site. Here's what I use in my robots.txt file to block some of the most common SEO crawlers:
User-agent: MJ12bot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: Alexibot
Disallow: /
User-agent: SurveyBot
Disallow: /
User-agent: Xenu's
Disallow: /
User-agent: Xenu's Link Sleuth 1.1c
Disallow: /
User-agent: rogerbot
Disallow: /
# Block NextGenSearchBot
User-agent: NextGenSearchBot
Disallow: /
# Block ia-archiver from crawling site
User-agent: ia_archiver
Disallow: /
# Block archive.org_bot from crawling site
User-agent: archive.org_bot
Disallow: /
# Block Archive.org Bot from crawling site
User-agent: Archive.org Bot
Disallow: /
# Block LinkWalker from crawling site
User-agent: LinkWalker
Disallow: /
# Block GigaBlast Spider from crawling site
User-agent: GigaBlast Spider
Disallow: /
# Block ia_archiver-web.archive.org_bot from crawling site
User-agent: ia_archiver-web.archive.org
Disallow: /
# Block PicScout Crawler from crawling site
User-agent: PicScout
Disallow: /
# Block BLEXBot Crawler from crawling site
User-agent: BLEXBot Crawler
Disallow: /
# Block TinEye from crawling site
User-agent: TinEye
Disallow: /
# Block SEOkicks
User-agent: SEOkicks-Robot
Disallow: /
# Block BlexBot
User-agent: BLEXBot
Disallow: /
# Block SISTRIX
User-agent: SISTRIX Crawler
Disallow: /
# Block Uptime robot
User-agent: UptimeRobot/2.0
Disallow: /
# Block Ezooms Robot
User-agent: Ezooms Robot
Disallow: /
# Block netEstate NE Crawler (+http://www.website-datenbank.de/)
User-agent: netEstate NE Crawler (+http://www.website-datenbank.de/)
Disallow: /
# Block WiseGuys Robot
User-agent: WiseGuys Robot
Disallow: /
# Block Turnitin Robot
User-agent: Turnitin Robot
Disallow: /
# Block Heritrix
User-agent: Heritrix
Disallow: /
# Block pricepi
User-agent: pimonster
Disallow: /
User-agent: Pimonster
Disallow: /
User-agent: Pi-Monster
Disallow: /
# Block Eniro
User-agent: ECCP/1.0 (search@eniro.com)
Disallow: /
# Block Psbot
User-agent: Psbot
Disallow: /
# Block Youdao
User-agent: YoudaoBot
Disallow: /
# Block NaverBot
User-agent: NaverBot
User-agent: Yeti
Disallow: /
# Block ZBot
User-agent: ZBot
Disallow: /
# Block Vagabondo
User-agent: Vagabondo
Disallow: /
# Block SimplePie
User-agent: SimplePie
Disallow: /
# Block Wget
User-agent: Wget
Disallow: /
# Block Pixray-Seeker
User-agent: Pixray-Seeker
Disallow: /
# Block BoardReader
User-agent: BoardReader
Disallow: /
# Block Quantify
User-agent: Quantify
Disallow: /
# Block Plukkie
User-agent: Plukkie
Disallow: /
# Block Cuam
User-agent: Cuam
Disallow: /
# https://megaindex.com/crawler
User-agent: MegaIndex.ru
Disallow: /
User-agent: megaindex.com
Disallow: /
User-agent: +http://megaindex.com/crawler
Disallow: /
User-agent: MegaIndex.ru/2.0
Disallow: /
User-agent: megaIndex.ru
Disallow: /
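After adding blocks like these, you can confirm they behave as intended. This sketch checks a short excerpt of the list above with Python's urllib.robotparser. Keep in mind that robots.txt is only a request: truly abusive crawlers may ignore it, so server-level blocking (by user agent or IP) is the stronger option.

```python
from urllib import robotparser

rules = """
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The named SEO bots are blocked from the whole site...
print(rp.can_fetch("SemrushBot", "https://example.com/"))          # False
print(rp.can_fetch("AhrefsBot", "https://example.com/any-post/"))  # False

# ...while normal search engine bots are unaffected (no rule applies to them)
print(rp.can_fetch("Googlebot", "https://example.com/"))           # True
```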
Make sure no content is affected by the new Robots.txt file
Now that you have made some changes to your Robots.txt file, it is time to check if any of your content has been affected as a result of the update to your robots.txt file.
You can use Google Search Console's Fetch as Google tool to check whether your content is still accessible, or is now blocked by your Robots.txt file.
The steps are simple.
Log in to Google Search Console, select your site, and go to Crawl > Fetch as Google.
Enter one of your post URLs and check whether Googlebot has any issue accessing it.
You can also check for crawl errors caused by your Robots.txt file under the Crawl Errors section of the Search Console.
Under Crawl > Crawl Errors, select Restricted by Robots.txt and you will see all the links that are disallowed by the Robots.txt file.
Do you use WordPress Robots.txt to optimize your website? Would you like to add more insights to your Robots.txt file? Let us know in the comments below. Don’t forget to subscribe to our email newsletter to keep receiving more SEO tips.
Here are some of our handpicked guides for you to read:
- How to Fix 500 Internal Server Error in WordPress
- WordPress SEO Tutorial (From Beginner to Advanced Guide)
Disclosure: Some of the links in this article are affiliate links, which means we may earn a commission if you click through and make a purchase, at no extra cost to you. See how SidelinePlay is funded, why it's important, and how you can support us.