SEO With Hugo (12) Sitemaps

by Samuele Lilliu | 29 January 2023

Sitemaps & robots.txt help search engines crawl your website efficiently. A sitemap lists all of your pages, while robots.txt keeps crawlers away from sensitive or unwanted pages.

  • Producer: Samuele Lilliu
  • Software: Hugo, HTML

A website sitemap is an XML file that contains a list of all the URLs on a website and is used to help search engines understand the structure and organization of the site. The sitemap is typically submitted to search engines, such as Google and Bing, to help them more efficiently and effectively crawl and index the site.

One of the main benefits of having a sitemap for SEO is that it makes it easier for search engines to find and index all of the pages on a website. Without a sitemap, search engines have to rely on internal links within the site and external links pointing to the site in order to discover new pages. This can be a time-consuming and incomplete process, as not all pages may be linked from other pages or from external sites. A sitemap provides a comprehensive list of all the URLs on a website, making it easier for search engines to discover and index new pages.

Another important aspect of a sitemap is that it can include additional information about each page, such as when it was last updated and how often it changes. This can be useful for search engines when deciding how often to crawl the site, as they can prioritize pages that are frequently updated or considered important. A sitemap can also help ensure that all of the pages on a website are indexed by providing a complete list of URLs for the search engine to crawl. This is especially useful for websites with a large number of pages or with a complex structure that may be difficult for search engines to navigate.

Overall, a website sitemap is an essential tool for SEO, as it helps search engines crawl and index a site more efficiently and effectively. It provides a comprehensive list of all the URLs on a website and can include additional information such as when each page was last updated and how often it changes. This helps search engines prioritize which pages to crawl, which in turn can improve a website's visibility in search engine results pages.

Generating a Sitemap with Hugo

You can take a look at the Hugo documentation page on sitemap templates. Hugo provides built-in sitemap templates. You can set the default values for change frequency and priority, as well as the name of the generated file, in your site configuration /config/_default/config.yaml:

sitemap:
  changefreq: monthly
  filename: sitemap.xml
  priority: 0.5
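
Hugo also lets you override these defaults for an individual page in its front matter. A minimal sketch, where the title and values are only an illustration:

---
title: "Weekly Update"
sitemap:
  changefreq: weekly
  priority: 0.8
---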

In our case, we override the built-in sitemap.xml template by creating a new file /layouts/sitemap.xml. This was built by modifying Hugo's default sitemap.xml:

{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  {{ $pages := .Data.Pages }}
  {{ $pages = where $pages "Params.private" "!=" true }}
  {{ range $pages }}
    {{- if .Permalink -}}
  <url>
    <loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
    <lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
    <changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
    <priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Language.Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Language.Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
  </url>
    {{- end -}}
  {{ end }}
</urlset>

With {{ $pages = where $pages "Params.private" "!=" true }} we exclude from the sitemap all pages whose front matter sets private: true.
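
A page is marked as private in its front matter. For example (a hypothetical content file):

---
title: "Checkout"
private: true
---

Since private is a custom front matter parameter rather than a built-in Hugo field, it is read through .Params, which is why the template above filters on Params.private.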

We can see the resulting sitemap if we build the site in production mode and take a look at the generated sitemap.xml in the public folder (trimmed version below):

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  
  
  <url>
    <loc>http://localhost:1313/about/faq/</loc>
    <lastmod>2022-12-30T11:24:53+00:00</lastmod>
  </url><url>
    <loc>http://localhost:1313/about/people/</loc>
    <lastmod>2022-11-11T21:47:10+00:00</lastmod>
  </url><url>
    <loc>http://localhost:1313/about/reviews/</loc>
    <lastmod>2022-12-30T11:24:40+00:00</lastmod>
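
As a quick reference, one way to produce and inspect this output locally (assuming the default Hugo CLI behaviour, where a plain hugo build uses the production environment and writes the site to public/):

hugo --environment production   # build the site into ./public (production is already the default for plain builds)
cat public/sitemap.xml          # inspect the generated sitemap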

Robots.txt

A robots.txt file is a simple text file placed at the root of a website to instruct web crawlers (such as search engine bots) which pages or sections of the site should not be crawled. It is important because it allows website owners to keep crawlers away from sensitive or low-value pages and to prevent bots from overloading the site with too many requests. Note that robots.txt controls crawling rather than indexing: well-behaved crawlers will respect it, but it is not a guarantee, and a disallowed page can still end up in search results if other sites link to it.

Hugo can generate a robots.txt file automatically, but we can also provide a custom robots.txt to (try to) keep search engines away from private pages. This is an example of a robots.txt that tells googlebot not to crawl the checkout and about pages:

User-agent: googlebot
Disallow: /checkout/
Disallow: /about/
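
It is also common practice, and part of the sitemaps.org protocol, to point crawlers at your sitemap directly from robots.txt with a Sitemap directive (example.com is just a placeholder here):

Sitemap: https://www.example.com/sitemap.xml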

For Hugo to generate the robots.txt, we need to set the following in config/_default/config.yaml:

enableRobotsTXT: true

This generates a robots.txt (check the public folder) with User-agent: *, i.e. everything can be crawled. To override this we just need to create a file /layouts/robots.txt:

User-agent: *
{{- $pages := .Data.Pages -}}
{{- $pages = where $pages "Params.private" true -}}
{{- range $pages }}
  Disallow: {{ .RelPermalink }}
{{- end -}}

This disallows crawling of every page whose front matter sets private: true.
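
Assuming, for illustration, that /checkout/ and /about/ are the pages marked private: true, the generated public/robots.txt would look something like this:

User-agent: *
  Disallow: /checkout/
  Disallow: /about/

The leading spaces come from the template's indentation; robots.txt parsers generally ignore leading whitespace on directive lines.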

Submitting a Sitemap to Google

There are a few ways to submit a website sitemap to Google:

Using Google Search Console: Google Search Console is a free tool that allows you to submit your sitemap, check for crawl errors, and monitor your website’s performance in Google search results. To submit your sitemap through Google Search Console, you’ll first need to verify that you own the website. Once you’ve done that, you can go to the Sitemaps section and submit your sitemap’s URL.

Submit via the sitemap ping endpoint: Google also accepts sitemaps submitted through the ping mechanism defined by the Sitemaps protocol. You can submit your sitemap by sending an HTTP GET request to http://www.google.com/ping?sitemap=<sitemap-url>, replacing <sitemap-url> with the full URL of your sitemap.
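
For example, a minimal version of that request from the command line, with example.com standing in for your own domain:

curl "http://www.google.com/ping?sitemap=https://www.example.com/sitemap.xml"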

Once you’ve submitted your sitemap, you can use Google Search Console to check the status of your sitemap and see how many URLs have been indexed. Additionally, you can use the tool to see any crawl errors that were encountered while trying to access your pages.

You can also use the URL Inspection tool in Google Search Console (the successor to the older “Fetch as Google” feature) to request indexing of individual pages. This is useful if you have recently updated or added pages to your website and want them indexed quickly, as it lets you ask Google to recrawl a specific URL without waiting for the next scheduled crawl.

It’s important to note that submitting a sitemap to Google does not guarantee that all of the URLs in the sitemap will be indexed or that the website will rank higher in search results. However, it does provide a way for Google to easily discover and crawl your website’s pages, which can improve the visibility of your website in search results.