Google: Crawl Rate and Crawl Demand Affect the Crawl Budget
Google has recently published details to explain the term “crawl budget” and how it will affect Googlebot crawling your website.
Googlebot is the name given to Google’s web crawling bot. Google uses Googlebot to crawl and fetch the billions of pages on the web to index in its search engine.
Googlebot uses an algorithmic process that determines which websites to crawl, how often, as well as the number of web pages to fetch from each site.
It is this algorithm that has created the concept of “crawl budget” as webmasters seek to optimize the number of pages crawled by Google. You can read more about how Googlebot works here.
In the announcement, Google stated that:
Prioritizing what to crawl, when, and how much resource the server hosting the website can allocate to crawling is more important for bigger websites, or those that auto-generate pages based on URL parameters, for example.
This makes perfect sense.
A website that has duplicate content caused by millions of auto-generated pages may struggle to have all of its pages indexed. Even if they are indexed, those pages may not be crawled often, so any changes to them will be slow to appear in search results.
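As an illustration of how URL parameters create this kind of duplication (the URLs and parameter names below are hypothetical), a single page reachable under several parameter combinations looks like several distinct pages to Googlebot. One common mitigation is a canonical link element:

```html
<!-- Hypothetical example: these parameterized URLs may all serve the
     same product page, and each one consumes crawl budget separately:
       https://example.com/shoes?sessionid=123&sort=price
       https://example.com/shoes?sort=price
       https://example.com/shoes?color=all                            -->

<!-- Declaring a single canonical URL on each variant tells Google which
     version to index, so ranking signals are not split across duplicates -->
<link rel="canonical" href="https://example.com/shoes">
```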
Crawl rate limit
The crawl rate limit relates to the maximum number of simultaneous parallel connections Googlebot will use to crawl your website, and the time it will wait before fetching another page.
Because Googlebot will use your server resources, Google limits the number of connections to your website to prevent your site from slowing down and affecting the user experience for your visitors.
The crawl rate is affected by the following factors:
Crawl health: When Google crawls the website, it monitors its responsiveness. If the website responds fast, then the limits increase, and more connections are used to crawl the site. If the site is slow, slows down once crawling starts, or responds with server errors, the limit goes down, and Googlebot crawls the site less.
Google confirmed in the follow-up FAQs that website speed would increase the crawl rate, and conversely, a high number of errors will decrease the crawl rate.
Limit set in Search Console: If you wish to reduce the rate that Googlebot crawls your website, then you can set this within your Search Console. You cannot increase the rate of crawling by setting higher limits.
Googlebot does not obey the “crawl-delay” directive in the robots.txt file; any crawl rate limit for Googlebot must be set from within Search Console.
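For reference, this is what the directive looks like in a hypothetical robots.txt file. Some other crawlers honor it, but, as noted above, Googlebot does not:

```text
# Hypothetical robots.txt. Crawl-delay asks a crawler to wait the given
# number of seconds between requests. Googlebot ignores this directive --
# for Google, the crawl rate must be reduced in Search Console instead.
User-agent: *
Crawl-delay: 10
```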
You can improve the crawl rate limit by ensuring your server is as responsive as possible.
While a higher crawl rate will help get all your pages indexed, or updated content re-indexed, Google has confirmed explicitly that a higher crawl rate is not a ranking factor.
That being said, we suspect the same actions you take to optimize the crawl rate will also have search ranking benefits. For instance, reducing the number of duplicate pages will help concentrate the page rank on the website.
Fewer duplicate pages will also help prevent your website from being marked as low-quality and help avoid a Panda penalty.
Crawl demand
Googlebot can also reduce the number of pages it crawls even if the crawl rate limit is not reached. Many factors appear to influence crawl demand, but Google points out two of the main ones:
- Popularity: Web pages that are more popular on the web tend to be crawled more often. A page that is frequently shared on social media or has many backlinks will be discovered more often, as there are more links through which Googlebot can reach it.
- Staleness: Google endeavors to prevent URLs from becoming stale in the index.
Google has also confirmed that significant changes to a website, such as a website move or URL structure change, may trigger increased crawling.
To improve crawl demand, you can share your web page on social media, or frequently update the content on the page. Creating high-quality content that may attract backlinks will also help.
Factors affecting crawl budget
Google has analyzed various factors affecting the crawl budget and subsequent indexing. Described as “low-value-add” URLs, these fall into the following categories, which Google lists in order of significance:
- Faceted navigation and session identifiers: This relates to the duplicate content that may be created by URL parameters. In addition, the link juice or page rank may be distributed across all these near-duplicate pages, causing lower rankings among the pages on your website.
- On-site duplicate content: The same principle that applies to the above point is relevant to any duplicate content.
- Soft error pages: You can now see your soft errors from within Search Console. Spending your crawl budget on files or pages that result in an error is wasteful, and in some cases a misconfigured website can produce hundreds or even thousands of such errors.
- Hacked pages: One of the most common types of hacks is to either inject links into your pages (using up your crawl budget by sending the bot away from your website) or to create hidden pages on your website that provide outbound links. A great article published by Google some time ago discusses this in more detail. However, an easy way to check your website is to enter your domain into the Sucuri website checker, which scans various pages on your site for signs of compromise.
- Infinite spaces and proxies: Imagine you have a page that contains a calendar. On that calendar, you can click on any day, or move to the previous or next month. Each of those links leads to a new page. In this situation, Googlebot will continuously find new pages as it jumps from month to month, and it is this type of situation that is referred to as an infinite space. You can read more about infinite spaces here.
- Low quality and spam content: Low quality and spam content have very little value for search engines, and in addition to risking a Panda penalty, they also use up your crawl budget.
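Several of the low-value-add URL types above can be kept away from Googlebot via robots.txt. The following is a hypothetical sketch; the paths and parameter names are examples for your own site's URL structure, not rules to copy verbatim:

```text
# Hypothetical robots.txt sketch: keep crawlers out of low-value-add URL spaces.

User-agent: *
# Faceted navigation and session identifiers passed as URL parameters
Disallow: /*?sessionid=
Disallow: /*&sort=
# An infinite space, such as an endlessly browsable calendar
Disallow: /calendar/
```

Note that blocking a URL in robots.txt prevents crawling, not necessarily indexing, so it is a crawl-budget tool rather than a ranking tool.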
Google confirms that:
Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a website.
Alternate URLs and embedded content
It can be useful to use the nofollow attribute on links to parts of your website that do not need to be indexed, such as shopping cart pages.
While this won’t necessarily stop them from being crawled if a dofollow link elsewhere on your website points to them, it is a strategy worth considering if you have a crawl budget problem.
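In practice that combination might look like the following (a hypothetical cart URL is used here):

```html
<!-- Hypothetical cart link: rel="nofollow" hints that crawlers should
     not follow this link, conserving crawl budget -->
<a href="/cart" rel="nofollow">View cart</a>

<!-- Because a followed link elsewhere may still lead crawlers to the
     page, a robots meta tag on the cart page itself keeps it out of
     the index even if it does get crawled -->
<meta name="robots" content="noindex">
```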
Logic would suggest that thinking about your crawl budget is a waste of time if you have fewer than two thousand pages. Technically, this is correct, but we believe even small websites will benefit from reviewing these issues from a page rank point of view.
Having a well thought out website structure, with no duplicate content, and ensuring your page rank is directed to your most important pages can be very beneficial for SEO.