Google: Don't Use Crawlers to Build Sitemaps. Automate Them.
When updating your website’s sitemap, you should “automate it,” says Google’s John Mueller. He added that you should not use services that crawl your site, as “Google already does that.”
Mueller was responding to a question by a Reddit user who wanted to know the best strategy to keep their XML sitemap up to date on a large site with thousands of articles, with many new ones added weekly.
The Reddit user said that they had already tried paid solutions and an open source Node.js library to crawl the website and generate the XML sitemap.
“I tried some paid solution and even open source node.js library to crawl the website and generate the xml sitemap. But the process takes forever. So I’m assuming it’s not the best way to tackle this problem.”
Mueller responded that sitemap files should be generated on the backend, directly from the site's database. That way, the sitemap can be pinged as soon as something changes, and every entry carries an exact last-modification date.
“Automate it on your backend (generate the files based on your local database). That way you can ping sitemap files immediately when something changes, and you have an exact last-modification date.”
He continued, “Don’t crawl your own site, Google already does that.”
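As a rough illustration of what "generate the files based on your local database" could look like, here is a minimal Python sketch using only the standard library. The `articles` table, its `url` and `updated_at` columns, and the example.com URLs are assumptions made for this example; a real CMS will have its own schema.

```python
import sqlite3
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(conn: sqlite3.Connection) -> str:
    """Return sitemap XML for every row in the hypothetical `articles` table."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    rows = conn.execute("SELECT url, updated_at FROM articles ORDER BY url")
    for url, updated_at in rows:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # Assumption: updated_at is already stored as a W3C datetime string,
        # so the exact last-modification date comes straight from the database.
        ET.SubElement(entry, "lastmod").text = updated_at
    return ET.tostring(urlset, encoding="unicode")


# Demo with an in-memory database standing in for the site's real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (url TEXT PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("https://example.com/articles/a", "2024-01-15T09:30:00+00:00"),
        ("https://example.com/articles/b", "2024-01-16T12:00:00+00:00"),
    ],
)
sitemap_xml = build_sitemap(conn)
print(sitemap_xml)
```

Because the file is built from a single query rather than a crawl, regenerating it whenever an article is saved is cheap, even with thousands of rows.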
Why Use Automated XML Sitemaps
An XML sitemap is an unstyled document that lists the pages on your site, along with each page's last-modified date and other optional information.
You can then declare your sitemap in your robots.txt file and submit it directly to search engines so that they know to crawl all the pages on your site.
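Declaring a sitemap in robots.txt comes down to a single `Sitemap:` line containing the sitemap's absolute URL. The helper below is a small sketch of how a backend might add that line; the function name and the example.com URL are hypothetical.

```python
def add_sitemap_directive(robots_txt: str, sitemap_url: str) -> str:
    """Append a `Sitemap:` line unless that exact URL is already declared."""
    lines = robots_txt.splitlines()
    for line in lines:
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip() == sitemap_url:
            return robots_txt  # already declared, leave the file unchanged
    return "\n".join(lines + [f"Sitemap: {sitemap_url}"]) + "\n"


robots = "User-agent: *\nDisallow: /admin/\n"
updated = add_sitemap_directive(robots, "https://example.com/sitemap.xml")
print(updated)
```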
A sitemap looks like the following:
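The sample here is a minimal two-URL sitemap in the sitemaps.org format, embedded in a short Python snippet that parses it to show which fields each entry carries. The example.com URLs and dates are placeholders, not taken from any real site.

```python
from xml.etree import ElementTree as ET

SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/articles/first-post</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must use it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE_SITEMAP.encode("utf-8"))
entries = [
    (url.findtext("sm:loc", namespaces=NS), url.findtext("sm:lastmod", namespaces=NS))
    for url in root.findall("sm:url", NS)
]
for loc, lastmod in entries:
    print(loc, lastmod)
```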
According to Google’s Gary Illyes, XML sitemaps are the second most important source of URLs for Google to crawl.
The most important source is Google’s own crawler, aptly named Googlebot, which discovers URLs as it crawls.
Googlebot works by following links on websites, so if you have orphaned pages or a very large site with deep pages (pages only found after traveling through many other pages), Googlebot may not crawl them.
There is no sense in crawling a website yourself with an automated tool as this will not pick up orphaned pages, nor will it keep important information such as last modified times up to date.
These tools are only doing what Google is already doing.
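The orphaned-page problem can be shown with a toy example. In the Python sketch below, a hypothetical link graph stands in for a real site: a crawl that starts at the home page and follows links never reaches a page that nothing links to, while the backend's own list of pages includes it.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "/": ["/articles/a", "/about"],
    "/articles/a": ["/articles/b"],
    "/articles/b": [],
    "/about": ["/"],
    "/articles/orphan": ["/"],  # links out, but no page links to it
}


def crawl(start: str, links: dict[str, list[str]]) -> set[str]:
    """Breadth-first crawl that follows links from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


database_urls = set(LINKS)        # the backend knows every published page
crawled_urls = crawl("/", LINKS)  # a crawler only sees what is linked
missed = database_urls - crawled_urls
print(sorted(missed))  # ['/articles/orphan']
```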