What is Google Crawl and Index in SEO?

Forrest Pykes Jun 13, 2025

What is crawling?

Crawling is the process by which search engines discover new content on the Internet. To do this, search engines use crawlers that follow links from known web pages to new web pages.

Since thousands of web pages are created or updated every day, the crawling process is a never-ending, repetitive mechanism.

Martin Splitt, Webmaster Trends Analyst at Google, describes the crawling process very simply:

"We start with some URLs and then basically follow the links from them. So we're basically crawling the internet page by page, pretty much."

Crawling is the first step in the process. Next comes indexing, ranking (pages go through various ranking algorithms), and finally providing search results.


Let’s take a closer look at how crawling works.

What are search engine crawlers?

A search engine crawler (also called a web spider or crawler robot) is a program that crawls web pages, scans their contents, and collects data for indexing.

Whenever a crawler visits a new web page via a hyperlink, it looks at what the page contains - scanning its text, visual elements, links, and HTML, CSS, or JavaScript files - and then passes that information on for processing and eventual indexing.

Google, as a search engine, uses its own web crawler, Googlebot. There are two main types of crawlers:

  • Googlebot Smartphone – Primary crawler
  • Googlebot Desktop – Secondary crawler

Googlebot crawls websites primarily with its smartphone agent, but it can also recrawl each page with its desktop crawler to examine the performance and behavior of the website from both perspectives.

How often new pages are crawled is determined by your crawl budget.

What is crawl budget?

Crawl budget determines how many pages of a site Googlebot crawls and how often it re-crawls them.

Crawl budget is determined by two main factors:

  • Crawl rate limit – the number of pages that can be crawled simultaneously without overloading the website's server.
  • Crawl demand – the number of pages that Googlebot needs to crawl and/or recrawl.

Crawl budget is primarily a concern for large sites with millions of pages, not for small sites with only a few hundred.

Furthermore, having a larger crawl budget does not necessarily provide any additional benefit to a site, as it is not a quality signal to search engines.

What is an index?

Indexing is the process of analyzing and storing the crawled web page content in a database (also called an index). Only indexed web pages can be ranked and used for relevant search queries.

Whenever the web crawler discovers a new web page, Googlebot passes its content (e.g., text, images, videos, meta tags, attributes, etc.) to the indexing phase where it is parsed to better understand the context and stored in the index.

Martin Splitt explains what the indexing phase actually does:

"Once we have these pages... we need to understand them. We need to figure out what this content is about and what it does. So, that's the second phase, which is indexing."

To do this, Google uses the so-called Caffeine indexing system, which it launched in 2010.

The Caffeine index database can store millions of GB of web pages, which are systematically processed and indexed (and re-crawled) by Googlebot based on their content.

Googlebot not only visits websites with its mobile crawler first; since the so-called mobile-first indexing update, it also primarily indexes the content found on the mobile version of a site.

What is mobile-first indexing?

Mobile-first indexing was first introduced in 2016 when Google announced that they would primarily index and use content on the mobile version of a website.

Google’s official statement clearly stated:

“In mobile-first indexing, we only get information from the mobile version of your site, so make sure Googlebot can see the full content and all the resources there.”

Since most people browse the internet using their phones these days, it’s no surprise that Google wants websites to be browsed “the same way” humans do. This is also a clear call to website owners to ensure their sites are responsive and mobile-friendly.

Note: It’s important to realize that mobile-first indexing doesn’t necessarily mean that Google won’t crawl your site using its desktop agent (Googlebot Desktop) to compare the two versions of your content.

So far, we have introduced the concepts of crawling and indexing from a theoretical perspective.

Now, let’s look at actionable steps you can take to crawl and/or index your site.

How do you get Google to crawl and index your website?

When it comes to the actual crawling and indexing, there is no “direct command” to have search engines index your site.

However, there are several ways to influence if, when, or how your site is crawled and indexed.

So let's examine what options you have when it comes to telling Google you exist.

1. Do nothing - passive attitude

From a technical perspective, you don't have to do anything to allow Google to crawl and index your site.

All you need is one link from an external website and Googlebot will eventually start crawling and indexing all available pages.

However, taking a “do nothing” approach may cause delays in web crawling and indexing, as it may take some time for web crawlers to discover your site.

2. Submit the page through the URL Inspection tool

One way to make sure individual web pages are crawled and indexed is to ask Google directly to index (or re-index) them using the URL Inspection tool in Google Search Console.


This tool can be very useful when you have a brand new page or have made some substantial changes to an existing page and want to get it indexed as quickly as possible.

The process is pretty straightforward:

1. Go to Google Search Console and type your URL into the search bar at the top. Hit enter.

2. Search Console will show the status of the page. If the page is not indexed yet, you can request indexing. If it is already indexed, no action is needed unless you have made major changes and want the page re-crawled.


3. The URL Inspection tool will test whether the live version of the URL can be indexed (this may take a few seconds to a few minutes).

4. If the test succeeds, a notification confirms that your URL has been added to the priority crawl queue. Indexing itself may take anywhere from a few minutes to a few days.

Note: This indexing method is only recommended for a small number of web pages; if you have a large number of URLs to index, do not abuse this tool.

Requesting indexing does not necessarily guarantee that your URL will be indexed. If a URL is blocked from crawling and/or indexing, or has some quality issues that violate Google's quality guidelines, then the URL may not be indexed at all.
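
If you have many URLs to check programmatically, Search Console also offers a URL Inspection API. The sketch below only builds the request; the endpoint and field names are based on the public API reference and should be treated as assumptions to verify, and actually sending the request requires OAuth 2.0 credentials for a verified property:

```python
import json

# Endpoint and field names are based on Google's public URL Inspection API;
# treat them as assumptions and verify against the current API reference.
INSPECT_ENDPOINT = "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"

def build_inspection_request(page_url, property_url):
    """Build the JSON payload for inspecting one URL.

    Actually sending it requires an OAuth 2.0 access token for a verified
    Search Console property, which this sketch deliberately leaves out.
    """
    return {
        "inspectionUrl": page_url,  # the page whose index status you want
        "siteUrl": property_url,    # the Search Console property it belongs to
    }

payload = build_inspection_request(
    "https://www.example.com/blog/new-post",
    "https://www.example.com/",
)
print(json.dumps(payload, indent=2))
```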

3. Submit a sitemap

A sitemap is an XML file listing all the web pages that you want search engines to crawl and index.

The main benefit of a sitemap is that it makes it easier for search engines to crawl your website. You can submit a large number of URLs at once, which speeds up the overall indexing of your site.
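
A minimal sitemap is just an XML file listing URLs (the domain and dates below are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2025-06-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/what-is-crawling/</loc>
    <lastmod>2025-05-20</lastmod>
  </url>
</urlset>
```

A single sitemap file is limited to 50,000 URLs; larger sites split their URLs across several sitemaps and list them in a sitemap index file.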


To let Google know about your sitemap, you’ll need to use Google Search Console again.

Note: The easiest way to create a sitemap for your WordPress site is to use the Yoast SEO plugin, which will automatically create it for you. Check out this guide on how to find the URL for your sitemap.

Then go to Google Search Console > Sitemaps and paste your sitemap's URL under Add a new sitemap.


Once submitted, Googlebot will eventually check your sitemap and crawl every listed page you provided (assuming they are not blocked from crawling and indexing in any way).

4. Use appropriate internal linking

A strong internal linking structure is a great long-term way to make your pages easily crawlable.

How do you do this? The answer is a flat website architecture, in which every page is reachable within three clicks of the homepage.


A good link architecture ensures that all the pages you want to index are crawled because they are easily accessible to web crawlers. This practice is especially important for large websites with thousands of product pages, such as e-commerce websites.
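
The "three clicks" rule can be checked programmatically. Here is a minimal sketch (the page names and link graph are hypothetical) that models internal links as a dictionary and computes each page's click depth from the homepage with a breadth-first search:

```python
from collections import deque

def click_depths(links, start="home"):
    """Return the minimum number of clicks from `start` to every reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit = shortest path in a BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical internal-link graph: page -> pages it links to
site = {
    "home": ["blog", "shop"],
    "blog": ["post-1", "post-2"],
    "shop": ["category-a"],
    "category-a": ["product-x"],
}

depths = click_depths(site)
too_deep = [page for page, depth in depths.items() if depth > 3]
print(depths)    # {'home': 0, 'blog': 1, 'shop': 1, 'post-1': 2, 'post-2': 2, 'category-a': 2, 'product-x': 3}
print(too_deep)  # [] - every page is within three clicks
```

Running this against a real crawl of your site (instead of a hand-written dictionary) quickly reveals pages buried too deep for crawlers to reach efficiently.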

Tip: While internal links are important, you should also strive to get strong and relevant external links from high-authority websites. This will not only improve the crawling and indexing efficiency of your website, but also improve your ranking in relevant SERPs.

How to prevent Google from crawling and indexing your pages?

There are many reasons why you might want to block Googlebot from crawling and/or indexing parts of your site. For example:

  • Private content (for example, user information that should not appear in search results)
  • Duplicate pages (e.g. pages with identical content that should not be crawled to save crawl budget and/or pages that appear multiple times in search results)
  • Blank or error pages (e.g., work-in-progress pages that are not yet ready to be indexed or shown in search results)
  • Pages that have little to no value (for example, user-generated pages that don’t contribute any quality content to a search query).

By now, it should be clear that Googlebot is very efficient at discovering new web pages, even if that’s not your intention.

As Google puts it: "It is virtually impossible to keep a web server secret by not publishing links."

Let's look at your options when it comes to preventing crawling and/or indexing.

1. Use robots.txt (prevent crawling)

Robots.txt is a small text file that contains direct instructions for web spiders on how to crawl your website.

Whenever web crawlers visit your website, they first check if your website contains a robots.txt file and the instructions in it. After reading the commands in the file, they follow the instructions and start crawling your website.

By using Allow and Disallow directives in the robots.txt file, you can tell web crawlers which parts of your website may be accessed and crawled and which pages should be left alone.
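
For example, a simple robots.txt (the paths here are placeholders) might look like this:

```
User-agent: *
Disallow: /admin/
Disallow: /drafts/

User-agent: Googlebot
Allow: /blog/
Disallow: /drafts/

Sitemap: https://www.example.com/sitemap.xml
```

The file must sit at the root of the host (e.g., https://www.example.com/robots.txt); crawlers will not look for it anywhere else.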

The robots.txt file of the New York Times website, for example, contains a long list of Disallow directives.


For example, you can block Googlebot from crawling:

  • Pages with duplicate content
  • Private pages
  • URLs with query parameters
  • Pages with thin content
  • Test pages

Without instructions from this file, the web crawler will visit every web page it can find, including URLs that you don't want crawled.
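
You can sanity-check how crawlers will interpret your rules with Python's built-in urllib.robotparser module. A minimal sketch (the rules and URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed directly instead of fetched over HTTP
rules = """
User-agent: *
Disallow: /drafts/
Allow: /blog/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# No Googlebot-specific group exists, so the "*" group applies
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/post"))   # True
print(parser.can_fetch("Googlebot", "https://www.example.com/drafts/wip"))  # False
```

This is handy in a test suite: assert that your important pages are fetchable and your private paths are not, before deploying a robots.txt change.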

While robots.txt works well to prevent Googlebot from crawling your pages, you should not rely on this method to hide your content.

Disallowed pages can still be indexed by Google if other sites point to those URLs.

To prevent web pages from being indexed, there is a more effective method: robots meta directives.

2. Use the “noindex” directive (prevent indexing)

Robots meta directives (sometimes called meta tags) are small snippets of HTML code placed in the <head> section of a web page that instruct search engines how to crawl or index that page.

One of the most common directives is the so-called “noindex” directive (a robot meta directive with the noindex value in its content attribute). It prevents search engines from indexing your web pages and displaying them in the SERPs.

It looks like this:

<meta name="robots" content="noindex"/>

The "robots" value in the name attribute indicates that the directive applies to all types of web crawlers.

The noindex directive is particularly useful for pages that you want to be seen by visitors but you don’t want them to be indexed or appear in search results.

noindex is often used together with the follow or nofollow values to tell search engines whether they should crawl the links on the page.
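
A combined directive that blocks both indexing and link-following looks like this:

<meta name="robots" content="noindex, nofollow"/>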

Important: Do not combine the noindex directive with a robots.txt disallow rule for the same page.

As Google clearly states:

"In order for a noindex directive to be effective, the page must not be blocked by a robots.txt file. If a page is blocked by a robots.txt file, crawlers will never be able to see the noindex directive, and the page may still appear in search results, for example, if other pages link to it."
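
A quick way to verify that the directive is actually present is to parse the page's HTML. Here is a minimal sketch using only Python's standard library (the sample HTML is hypothetical; on a live page you should also check the X-Robots-Tag HTTP header, which can carry the same directive):

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the content of every <meta name="robots"> tag in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", "").lower())

def has_noindex(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return any("noindex" in directive for directive in finder.directives)

page = '<html><head><meta name="robots" content="noindex, nofollow"/></head><body></body></html>'
print(has_noindex(page))  # True
```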

How to check if a page is indexed?

When checking to see if a page is crawled and indexed or if there is a problem with a particular page, you have several options.

1. Manual inspection

The easiest way to check whether your site has been indexed is to search for it manually with the site: operator.


If your site has been crawled and indexed, the indexed pages appear in the results, and the "About XY results" line shows their approximate number.

If you want to check whether a specific URL has been indexed, use the full URL instead of just the domain name.
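
For example (example.com stands in for your own domain):

```
site:example.com                 all indexed pages on the domain
site:example.com/blog/my-post    check whether one specific URL is indexed
```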


If your page has been indexed, you should see it in the search results.

2. Check index coverage status

To get a more detailed overview of your indexed (or not-indexed) pages, you can use the Index Coverage report in Google Search Console.


The detailed graphs in the Index Coverage report can provide valuable information about the status of URLs and the types of problems with crawled and/or indexed pages.

3. Use the URL Inspection Tool

The URL Inspection tool provides information about a page on your site as of the last time Google crawled it.

You can check:

  • Whether the page has any issues (and details to help you diagnose them)
  • When the page was last crawled
  • Whether the page has been indexed and can appear in search results


I hope this article was helpful. If you have anything you want to say to me, please leave me a comment.

