I presume we are all aware that before a website’s URL can be ranked, it has to be indexed by search engines first. And before a single URL gets indexed and virtually recorded within search engine databases, it has to be discovered by search engine robots — whether fed through XML sitemap submissions or followed through inbound links.
A related concept, crawl budget, has been formally defined by Google as “the number of URLs Googlebot can and wants to crawl.” It gives webmasters an idea of how Googlebot, Google’s search spider, behaves when crawling websites.
Websites differ in size, structure, update frequency and so on. If you have a small website, say 100 pages or fewer, “crawl budget” is not an issue as long as you stick with the basics of giving search engine robots access:
. Place no crawl restrictions in the robots.txt file
. Create search engine-friendly site navigation
. Submit an XML sitemap in Google Search Console
. Give Googlebot a reasonable level of access in Google Search Console
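As a rough illustration of the first point, the sketch below checks which URLs a robots.txt file would block for Googlebot. It uses Python's standard `urllib.robotparser`; the helper name `blocked_urls`, the sample rules and the example.com URLs are made up for this example, and the standard-library parser only approximates how Googlebot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], user_agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and URLs, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

SAMPLE_URLS = [
    "https://example.com/",
    "https://example.com/products/widget",
    "https://example.com/admin/login",
]

if __name__ == "__main__":
    for url in blocked_urls(SAMPLE_ROBOTS, SAMPLE_URLS):
        print("Blocked for Googlebot:", url)
```

Running a check like this against the pages you want indexed is a quick way to spot an accidental Disallow rule.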
Google’s Gary Illyes even goes so far as to say that if a website has “fewer than a few thousand URLs, most of the time it will be crawled efficiently.” So unless you have a huge website with tens of thousands of pages, crawl budget should not be a big concern.
For bigger sites, such as news websites or online retail shops that constantly add and update content, deciding which pages get crawled first becomes an important consideration. To help Google pick the right ones, the webmaster is responsible for putting the house in order. For example, designating the canonical URL or using URL parameters properly can go a long way in guiding Googlebot to the right pages.
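To make the canonical URL idea concrete, here is a minimal sketch of how a site might collapse URL variants that differ only in tracking or session parameters into one canonical address. The parameter list in `IGNORED_PARAMS` and the function names are assumptions for this example; which parameters are safe to ignore depends entirely on your own site.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change page content on this hypothetical site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def canonicalize(url: str) -> str:
    """Drop ignored parameters, sort the rest, and lowercase scheme and host."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> tag for the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonicalize(url), quote=True)}">'


if __name__ == "__main__":
    variants = [
        "https://Example.com/shoes?utm_source=news&color=red&size=9",
        "https://example.com/shoes?size=9&color=red&ref=home",
    ]
    for variant in variants:
        print(canonicalize(variant))
```

Both variants above reduce to the same address, which is the one you would declare as canonical.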
Googlebot tries to strike a balance between fulfilling its duty of crawling pages and ensuring that its activity doesn’t interfere with human visitors, since the spider also uses the website’s server resources. This is the basic idea behind the crawl rate, which sets a limit on the number of simultaneous connections Googlebot may use on any given website.
Factors, according to Google, that determine the Googlebot crawl rate include:
a. How quickly a site responds to the crawl. If no errors are generated (page not found, internal server error, infinite redirection, etc.), more parallel connections can be used to speed up the crawl, if necessary. Otherwise, if there are errors, Googlebot slows down, increasing the time it waits between page fetches, as new errors are recorded in Google Search Console.
b. The limit you set in Google Search Console. Under “Site Settings”, accessible via the gear icon at the top right, you can let Google optimize its crawl rate (recommended) or set it yourself, especially when you suspect Googlebot has slowed down the site.
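The two factors above can be pictured with a toy model. To be clear, this is not Google's algorithm, which is not public; the class name, the step sizes and the limits are all invented for illustration. It only shows the general shape of the behavior: healthy responses allow more parallel connections and shorter waits, errors do the opposite, and a webmaster-set cap bounds the result.

```python
from dataclasses import dataclass


@dataclass
class ToyCrawlRate:
    """Illustrative model only; all numbers here are made-up assumptions."""

    connections: int = 2        # parallel connections currently in use
    delay_seconds: float = 1.0  # wait time between fetches
    max_connections: int = 8    # stands in for the limit set by the webmaster
    min_delay: float = 0.25
    max_delay: float = 8.0

    def record(self, ok: bool) -> None:
        """Adjust the crawl rate after one fetch."""
        if ok:
            # Healthy response: crawl a little faster, within the cap.
            self.connections = min(self.connections + 1, self.max_connections)
            self.delay_seconds = max(self.delay_seconds / 2, self.min_delay)
        else:
            # Error or slow response: back off by waiting longer between fetches.
            self.connections = max(self.connections - 1, 1)
            self.delay_seconds = min(self.delay_seconds * 2, self.max_delay)


if __name__ == "__main__":
    rate = ToyCrawlRate()
    for ok in (True, True, True, False):
        rate.record(ok)
        print(f"ok={ok} connections={rate.connections} delay={rate.delay_seconds}s")
```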
If the site has low activity, meaning few daily updates as reflected in the matching documents in Google’s index, there might be no need for more activity from Googlebot.
But not all sites are treated the same way. For instance, Illyes says that crawl demand can increase for URLs that are more popular on the Internet, something Google can perhaps determine from inbound traffic or the volume of incoming links.
And even for low-activity sites, like corporate websites whose content is rarely updated, crawl demand can increase when they make site-wide changes such as moving to a new domain name. This is done in an effort to update Google’s index as soon as possible and display relevant search results that reflect the site changes.
Google’s crawl budget is influenced by, among other things, soft error pages (such as a “page not found” page that still returns a success status), low-quality pages and duplicate content within the website. If such issues are discovered, Googlebot’s way of saying “don’t waste my time” is to not increase its crawl budget for the site in question.
So to make sure Googlebot crawls efficiently, we must ensure our sites are set up properly:
a. Eliminate duplicate content
b. Set proper canonical URLs
c. Eliminate errors as indicated on Google Search Console
d. Set up proper redirects, if required.
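Two of the items in this checklist lend themselves to a quick self-audit. The sketch below, which assumes you already have your pages' text and your redirect rules exported as plain Python dictionaries, groups URLs that serve identical content and follows redirect chains to flag loops or overly long hops. The function names, the whitespace-and-case normalization, and the `max_hops` default are assumptions for this example, not a standard.

```python
import hashlib
from collections import defaultdict


def duplicate_groups(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose body text is identical after whitespace and case normalization."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for url, body in pages.items():
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


def resolve_redirect(start: str, redirects: dict[str, str], max_hops: int = 5) -> tuple[str, int]:
    """Follow a redirect chain; return (final URL, hop count) or raise ValueError."""
    seen = {start}
    current, hops = start, 0
    while current in redirects:
        current = redirects[current]
        hops += 1
        if current in seen:
            raise ValueError(f"redirect loop detected at {current}")
        if hops > max_hops:
            raise ValueError(f"redirect chain from {start} exceeds {max_hops} hops")
        seen.add(current)
    return current, hops


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    pages = {"/a": "Hello  World", "/b": "hello world", "/c": "Something else"}
    redirects = {"/old": "/mid", "/mid": "/new"}
    print(duplicate_groups(pages))
    print(resolve_redirect("/old", redirects))
```

Each duplicate group is a candidate for a single canonical URL, and any chain longer than one hop is usually worth pointing straight at its final destination.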