Catch SEO Regressions Before Googlebot Does

February 16, 2022 / Cristi Ingineru

The Why

What do you do when you care about the functionality of your products? You write, of course, unit and E2E tests. What about performance? You write benchmarks. Security? You perform security audits. So, what about SEO? You write a library for SEO tests!

The Available Tools

While plenty of libraries and tools are available, most of them focus on general analysis, and very little has been written about using them in a CI environment. Lighthouse, for example, is mostly known as a performance audit tool, but it can run basic SEO checks too. However, even when used in a CI environment, and much like the rest of the available tools, it cannot prevent the specific regressions that might be detrimental to a website’s SEO.

How We Got Here

We maintain dozens of sites and launch new ones periodically, so keeping SEO regressions to a minimum is no easy task. Using a monorepo proved to be a great decision because it enabled code and feature reuse and limited the amount of testing required during a release cycle: basically, write once, test once, and deploy multiple times. But this doesn’t exclude the risk of fixing a bug for a site serving one country and breaking something in a site serving another. SEO bugs are also typically hard to notice because they are neither functional nor visual. As such, a specific tool was needed: one that checks each site for regressions.

Initially, Lighthouse was added to the CI and used to detect major degradations, mostly in performance. Then the first version of seo-slip was introduced to prevent embarrassing 404s or unexpected status codes on high-traffic pages. These kinds of errors were easy for users to find as well, not only for Googlebot. Eventually, seo-slip evolved and allowed us to catch other site-specific regressions, such as incorrect URLs, canonical inconsistencies, redundant redirects, broken internal links, or even broken CDN configurations.

Details

Seo-slip was built with flexibility in mind. It can be used in any CI environment and with any JS unit test framework. The built-in checker list is also flexible: it’s up to the test writer to select which checkers are needed for the site under test, and new checkers can be written by implementing a fairly simple interface.

But the real power of this library comes from the separation between checkers and rules. Each checker is a piece of JS code focused on a very specific task: for example, there is a checker that verifies the status code, another for the canonical URL, another for hreflang URLs, etc. Each of these checkers is blind without the rules telling it what status code, canonical URL, or hreflang URLs to expect:

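statusCodeRules:
  code: 200            # default expected status code
  exceptions:          # per-path overrides
    "/": 301
    "/blog": 301
    "/register": 302
# Patterns are single-quoted so that backslashes reach the regex engine as-is
# ("\?" and "\d" are not valid escapes in a double-quoted YAML string).
# "[^?]+" keeps the query string out of group 3 so it lands in group 4.
canonicalRules:
  - url: '(.*.site.com)/(..)/search/([^?]+)(\?.+)?'
    expected: "($1)/($2)/search/($3)"
  - url: '(.*.site.com)/(..)/product-(\d+)(\?.+)?'
    expected: "($1)/($2)/product-($3)"
hreflangRules:
  - url: '(.*.site.com)/(..)/search/([^?]+)(\?.+)?'
    expected:
      en: "($1)/en/search/($3)($4)"
      ro: "($1)/ro/search/($3)($4)"
  - url: '(.*.site.com)/(..)/product-(\d+)(\?.+)?'
    expected:
      en: "($1)/en/product-($3)($4)"
      ro: "($1)/ro/product-($3)($4)"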

As illustrated, the rules can be written in a human-friendly way using JSON or YAML. The latter is the better choice because its anchors and aliases make it possible to reuse rules across multiple sites and environments, or even to run the tests against multiple user agents covering both desktop and mobile implementations.
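To make the rule format more concrete, here is a minimal sketch of how a canonical rule could be evaluated. It is not seo-slip's actual code: the function names `expectedCanonical` and `checkCanonical`, the full-URL anchoring, and the way `($n)` placeholders are substituted are all assumptions made for illustration. The search pattern uses `[^?]+` for the search term so that a query string ends up in its own capture group.

```javascript
// Illustrative only: these helpers are hypothetical, not part of seo-slip.
// The rules mirror the canonicalRules sample, written as JS strings.
const canonicalRules = [
  {
    url: String.raw`(.*.site.com)/(..)/search/([^?]+)(\?.+)?`,
    expected: "($1)/($2)/search/($3)",
  },
  {
    url: String.raw`(.*.site.com)/(..)/product-(\d+)(\?.+)?`,
    expected: "($1)/($2)/product-($3)",
  },
];

// Returns the canonical URL the first matching rule expects for pageUrl,
// or null when no rule applies to that URL.
function expectedCanonical(pageUrl, rules) {
  for (const rule of rules) {
    // Assumption: a rule must match the whole URL, hence the ^...$ anchors.
    const match = new RegExp(`^${rule.url}$`).exec(pageUrl);
    if (!match) continue;
    // Replace every "($n)" placeholder with capture group n
    // (empty string when an optional group did not take part in the match).
    return rule.expected.replace(/\(\$(\d+)\)/g, (_, n) => match[Number(n)] ?? "");
  }
  return null;
}

// Compares the canonical found on the page with the one the rules expect.
function checkCanonical(pageUrl, actualCanonical, rules) {
  const expected = expectedCanonical(pageUrl, rules);
  if (expected === null) return { ok: true, skipped: true };
  return { ok: actualCanonical === expected, expected, actual: actualCanonical };
}

// A product page whose canonical wrongly keeps the tracking parameter.
const result = checkCanonical(
  "https://www.site.com/ro/product-42?ref=home",
  "https://www.site.com/ro/product-42?ref=home",
  canonicalRules
);
console.log(result.ok, result.expected);
// prints: false https://www.site.com/ro/product-42
```

Keeping the pattern and the expected value side by side is what lets one generic checker serve many sites: only the rules change from one site to the next.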

Seo-slip uses a simple crawler to find and download content, and each resource it finds (HTML, CSS, images, etc.) is validated by the checkers. The best approach is to start the crawl from a high-traffic page and go 2 or 3 levels deep, downloading just enough content to sample the site. Checking a minimal but relevant area of the site makes the tests run fast, which means they can be included in a CI pipeline or used as a monitoring tool. An exhaustive analysis is also possible, but the tool was not designed to be used in this manner. Ultimately, it's up to the test writer and/or SEO specialist to decide how to use it.
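The sampling idea can be sketched as a breadth-first crawl with a depth limit, running a checker on every page it reaches. The sketch below is an illustration under assumptions, not seo-slip's crawler: `fakeSite` is a hypothetical in-memory stand-in for real HTTP responses, and `checkStatus` and `crawl` are made-up names. The rules object follows the shape of the status code rules shown earlier.

```javascript
// Hypothetical in-memory site: path -> status code and outgoing links.
// A real crawl would fetch pages over HTTP and extract links from the HTML.
const fakeSite = {
  "/en": { status: 200, links: ["/en/search/shoes", "/register"] },
  "/register": { status: 302, links: [] },
  "/en/search/shoes": { status: 200, links: ["/en/product-1", "/en/product-2"] },
  "/en/product-1": { status: 200, links: ["/en/product-3"] },
  "/en/product-2": { status: 200, links: [] },
  "/en/product-3": { status: 404, links: [] },
};

// Same shape as the statusCodeRules sample: a default plus per-path exceptions.
const statusCodeRules = { code: 200, exceptions: { "/register": 302 } };

// Returns null when the status matches the rules, or a failure message.
function checkStatus(path, status, rules) {
  const expected = rules.exceptions[path] ?? rules.code;
  return status === expected ? null : `${path}: expected ${expected}, got ${status}`;
}

// Breadth-first crawl that stops maxDepth link hops away from the start page.
function crawl(site, start, maxDepth, rules) {
  const seen = new Set([start]); // every path already queued
  const visited = [];            // paths actually checked, in crawl order
  const failures = [];
  let frontier = [start];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const path of frontier) {
      const page = site[path];
      if (!page) continue; // link to a path the fake site does not define
      visited.push(path);
      const failure = checkStatus(path, page.status, rules);
      if (failure) failures.push(failure);
      for (const link of page.links) {
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return { visited, failures };
}

console.log(crawl(fakeSite, "/en", 2, statusCodeRules).failures);
// prints: []
console.log(crawl(fakeSite, "/en", 3, statusCodeRules).failures);
// prints: [ '/en/product-3: expected 200, got 404' ]
```

The two calls show the trade-off: with a depth of 2, the broken product page three hops from the start goes unnoticed, while a depth of 3 catches it at the cost of a longer crawl.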

Some sites have a URL structure so complex that it is almost impossible to describe the expectations using rules like the ones illustrated above. In this case, the checkers can be used only to pull the SEO data without asserting anything and store it in a CSV file as an "SEO snapshot", which can then serve as a reference for snapshot testing.
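A snapshot comparison could look roughly like the following. This is a sketch under assumptions rather than seo-slip's implementation: the record fields (`url`, `status`, `canonical`) and the helper names `toCsv` and `diffSnapshots` are hypothetical, and the snapshots are kept in memory instead of being read from and written to files.

```javascript
// Hypothetical record shape: one row of collected SEO data per crawled URL.
const COLUMNS = ["url", "status", "canonical"];

// Quote a field only when it contains a comma, a quote, or a newline.
function csvField(value) {
  const text = String(value);
  return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize records to CSV, sorted by URL so that crawl order does not
// produce spurious differences between two snapshots.
function toCsv(records) {
  const rows = [...records]
    .sort((a, b) => (a.url < b.url ? -1 : a.url > b.url ? 1 : 0))
    .map((record) => COLUMNS.map((column) => csvField(record[column])).join(","));
  return [COLUMNS.join(","), ...rows].join("\n");
}

// Line-based diff; assumes no field contains a newline, which holds for URLs.
function diffSnapshots(referenceCsv, currentCsv) {
  const reference = new Set(referenceCsv.split("\n"));
  const current = new Set(currentCsv.split("\n"));
  return {
    removed: [...reference].filter((line) => !current.has(line)),
    added: [...current].filter((line) => !reference.has(line)),
  };
}

// Reference snapshot, as it would have been stored by an earlier run.
const referenceCsv = toCsv([
  { url: "/en/product-1", status: 200, canonical: "https://www.site.com/en/product-1" },
  { url: "/en", status: 200, canonical: "https://www.site.com/en" },
]);

// Current run: the product canonical now carries a tracking parameter.
const currentCsv = toCsv([
  { url: "/en", status: 200, canonical: "https://www.site.com/en" },
  { url: "/en/product-1", status: 200, canonical: "https://www.site.com/en/product-1?ref=home" },
]);

console.log(diffSnapshots(referenceCsv, currentCsv));
```

A test would fail whenever `removed` or `added` is non-empty, and an intentional change would be accepted by replacing the stored reference with the current snapshot.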

Future

Seo-slip is not meant to become popular. Most of the time, Lighthouse, another general-purpose equivalent, or even occasional in-depth crawls are enough. However, for catching specific regressions, or for catching them early in a development or staging environment, seo-slip might be a good candidate.
