
A brief introduction to crawling and indexing

In this post, I am going to provide an introduction to crawling and indexing. I will also share how you can check whether your site is being crawled and indexed successfully, and what to do if it isn’t.

Let’s get into it. 

Indexing your website on search engines begins with crawling.

In order to show up in search results, your content needs to first be visible to search engines. It’s arguably the most important piece of the SEO puzzle: if your site can’t be found, there’s no way you’ll ever show up in the SERPs (Search Engine Results Pages).

What is crawling?

Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content.

Google distinguishes two kinds of crawling: Discovery, where it finds pages that are new to it, and Refresh, where it finds changes in webpages that are already indexed.

Googlebot starts out by fetching a few web pages and then follows the links on those web pages to find new URLs. By hopping along this path of links, the crawler is able to find new content and add it to its index.

For more in-depth information, check out this guide by Google.

What is indexing?

Search engines process and store information they find in an index, a huge database of all the content they’ve discovered and deem good enough to serve up to searchers.

Indexing essentially refers to adding a webpage’s content to Google’s index so it can be considered for rankings.

For more in-depth information, check out this guide by Google.

How to check if your website is being indexed

To check if your website is on Google, you can do a “site:yoursite.com” search. This will return the results Google has in its index for the specified site:

[Image: Google search results for a “site:” query]

The number of results Google displays (see “About XX results” above) isn’t exact, but it does give you a solid idea of which pages are indexed on your site and how they are currently showing up in search results.

If your site shows up in the results, great! You have nothing to worry about.

It is worth checking some specific pages, perhaps a key service page, to make sure they’re in the index.

To do this, simply add the URL string after “site:”. For example, “site:yoursite.com/best-seller”.

If your site doesn’t show up, it could simply mean that your site is new and Google hasn’t found it yet.

If you know your site isn’t new, it probably means that your site has inadvertently blocked search engines from crawling it (which is surprisingly common!). Either way, you want to get this fixed ASAP.

What stops a site from being crawled, and how do you fix it?

Here are the three most common causes that stop web pages from being crawled.

A whole website, or certain pages on it, can remain unseen by Google for a simple reason: its crawlers are not allowed to access them.

  1. Blocking the page from indexing through the robots meta tag

Without realising it, you may be blocking a page from indexing through the robots meta tag.

If you do this, the search bot won’t even start looking at your page’s content; it will move directly to the next page.

You can detect this issue by checking if your page’s code contains a directive like this:

<meta name="robots" content="noindex" />

To check your page’s code in Chrome, right-click on the page and select “View Page Source”.

  2. Blocking pages from indexing through robots.txt

Second, you may be blocking pages from being crawled (and therefore indexed) through robots.txt.

Robots.txt is the first file on your website that crawlers look at. The most painful thing you can find there is:

User-agent: *

Disallow: /

It means that all the website’s pages are blocked from indexing.

It might happen that only certain pages or sections are blocked, for instance:

User-agent: *

Disallow: /products/

In this case, any page in the Products subfolder will be blocked from indexing and, therefore, none of your product descriptions will be visible in Google.

To check your robots.txt, visit “yoursite.com/robots.txt”.
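If you do find a rule like this blocking pages you want in Google, the fix is to remove or relax the Disallow line. As a minimal sketch, a robots.txt that allows all crawlers to access the whole site looks like this:

User-agent: *

Disallow:

An empty Disallow value means nothing is blocked. Before editing the file, check with your developer that nothing else on the site relies on the existing rules.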

  3. Nofollow links

In this case, the site crawler will index your page’s content but will not follow the links. There are two types of nofollow directives:

  • for the whole page. Check if you have <meta name="robots" content="nofollow" /> in the page’s code – that would mean the crawler can’t follow any link on the page.

  • for a single link. This is what the piece of code looks like in this case:

<a href="pagename.html" rel="nofollow">anchor text</a>

If you see one or all of these directives, ask your developer to remove them, as they are stopping your website from appearing in Google.

How to make it easy for Google to crawl and index your site

Issues with meta tags and robots.txt aren’t the only things stopping your website from showing in Google.

To give your website the best chance of being crawled and indexed, make sure you do the following:

Upload a sitemap to Google Search Console

A sitemap is just like it sounds: a “map” of your site. Google and other search engines use sitemaps to find all of the pages on your site.  

You can usually find yours by typing one of these URLs into your browser:

website.com/sitemap.xml

website.com/sitemap_index.xml

If it’s not there, go to website.com/robots.txt, where it’ll usually be listed.
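For reference, the sitemap is declared in robots.txt with a single line pointing at its full URL, for example:

Sitemap: https://website.com/sitemap.xml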

A sitemap helps ensure that all of your important pages are being crawled and indexed.

If you don’t have an XML sitemap, it’s essential to create one! If you are not sure how to do this, speak to your developer or get in touch with me.
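To illustrate, an XML sitemap is just a list of your page URLs in a fixed format. A bare-bones sketch (the URLs and date are placeholders) looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://website.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://website.com/best-seller</loc>
  </url>
</urlset>

In practice, most CMSs and SEO plugins generate and update this file for you automatically.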

Once you have found or created your sitemap, you need to submit it to Google via Search Console:

[Image: submitting a sitemap in Google Search Console]

Create a logical site structure

Both visitors and search engines need to be able to navigate your site easily and intuitively, which is why it’s important to create a logical hierarchy for your content.

The easiest way to do this is to sketch out a mind map. Each of the branches in your mind map will become internal links, which are links from one page on a website to another.

[Image: example of a logical navigation structure. Source: Kinsta]

Use this site structure as your menu/navigation so that Google can easily crawl your website.
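To make this concrete, here is a minimal sketch of what that hierarchy might look like as an HTML navigation menu (the page names and URLs are placeholders):

<nav>
  <ul>
    <li><a href="/services/">Services</a>
      <ul>
        <li><a href="/services/seo/">SEO</a></li>
        <li><a href="/services/web-design/">Web Design</a></li>
      </ul>
    </li>
    <li><a href="/blog/">Blog</a></li>
    <li><a href="/contact/">Contact</a></li>
  </ul>
</nav>

Each link mirrors a branch of the mind map, so crawlers can reach every section of the site from the menu.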

Internal linking

An internal link is any link from one page on your website to another page on the same website. Both users and search engines use links to find content: your users use them to navigate the site and find the content they want, while search engines use them to discover and crawl your pages. A page with no links pointing to it is unlikely to be found at all.

There are several types of internal links. In addition to links in your homepage, menu, posts, and so on, you can also add links within your content. We call those contextual links. Contextual links point your users to interesting and related content (there’s a short example after the list below).

Internal links are crucial for UX and SEO for a few reasons:

  • They help search engines find new pages. Pages without internal links are rarely found and indexed.
  • They help pass PageRank around your site. PageRank is the foundation of Google’s ranking algorithm that tries to determine the “value” of a page.
  • They help search engines understand what your page is about. Google looks at link anchors and surrounding text for this.
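As a sketch, a contextual internal link is just a standard HTML link placed inside your copy, with descriptive anchor text (the URL here is a placeholder):

<p>For more detail, read our <a href="/blog/crawling-and-indexing/">guide to crawling and indexing</a>.</p>

Descriptive anchor text like this is what gives Google the context mentioned in the last point above.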

Takeaway

You should now have a good understanding of crawling and indexing: what they are and why they are important.

To ensure your website is in the best position to be crawled and indexed, make sure there are no directives blocking Google from crawling. Create an XML sitemap and upload it to Google Search Console. Build a logical website structure and use internal links to help Google find the content you want indexed.

If you need help with any of this, book a free 30-minute call today or email me, and we can discuss your website needs.

About the author

Rory has worked as an SEO consultant for 9 years. He founded Parlez Creative after returning to Europe from Malawi, where he lived for 2 years working for a safari tour operator. When he isn't working on client campaigns, you will likely find him on the golf course working on his swing, or exploring the great outdoors with his family.
