Catch SEO Regressions Before Googlebot Does

The Why

What do you do when you care about the functionality of your products? You write unit and E2E tests, of course. What about performance? You write benchmarks. Security? You perform security audits. So, what about SEO? You write a library for SEO tests!

The Available Tools

While there are plenty of libraries and tools available, they mostly focus on general analyses, and there's very little guidance on using them in a CI environment. Lighthouse, for example, is best known as a performance audit tool, but it can run basic SEO checks too. However, even when used in CI, and much like the rest of the available tools, it lacks the ability to prevent the specific regressions that can be detrimental to a website's SEO.

How We Got Here

We maintain dozens of sites and launch new ones periodically, so keeping SEO regressions to a minimum is no easy task. Using a monorepo proved to be a great decision because it enabled code and feature reuse and limited the amount of testing required during a release cycle - basically, write once, test once, and deploy multiple times. But this doesn't remove the risk of fixing a bug for a site serving one country and breaking something in a site serving another, and SEO bugs are typically hard to notice because they are neither functional nor visual. A dedicated tool was therefore needed: one that checks each site for regressions.

Initially, Lighthouse was added to the CI to detect major degradations, mostly in performance. Then the first version of seo-slip was introduced to prevent embarrassing 404s or unexpected status codes on high-traffic pages - the kind of errors that users notice just as easily as Googlebot. Eventually seo-slip evolved and allowed us to catch other site-specific regressions like incorrect URLs, canonical inconsistencies, redundant redirects, broken internal links or even broken CDN configurations.

Details

Seo-slip was built with flexibility in mind. It can be used in any CI environment and with any preferred JS unit test framework. The built-in checker list is also flexible: it's up to the test writer to select which checkers are needed for the site under test, and they can even write new checkers by implementing a fairly simple interface.
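
To make that "fairly simple interface" tangible, here is a sketch of what a custom checker could look like. The class shape, the method names and the resource object are assumptions made for illustration rather than the library's documented API, but the principle is the one described here: one small piece of JS, one very specific verification, configured entirely by rules.

// Hypothetical sketch only: the method names (id, appliesTo, check) and the
// resource object are assumptions, not seo-slip's documented interface.
class MetaRobotsChecker {
  constructor(rules) {
    this.rules = rules; // e.g. { disallowNoindexOn: ['/', '/blog'] }
  }

  id() {
    return 'meta-robots';
  }

  appliesTo(resource) {
    return resource.contentType.includes('text/html');
  }

  check(resource) {
    const errors = [];
    const hasNoindex = /<meta[^>]+name=["']robots["'][^>]+noindex/i.test(resource.body);
    if (hasNoindex && this.rules.disallowNoindexOn.includes(resource.path)) {
      errors.push(`${resource.path} unexpectedly carries a noindex directive`);
    }
    return errors;
  }
}

module.exports = { MetaRobotsChecker };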

But the real power of this library comes from the separation between checkers and rules. Each checker is a piece of JS code focused on a very specific task: for example, there is a checker that verifies the status code, another for the canonical URL, another for hreflang URLs, and so on. And each of these checkers is blind without the rules telling it what status code, canonical URL or hreflang URLs to expect.

statusCodeRules:
  code: 200
  exceptions:
    "/": 301
    "/blog": 301
    "/register": 302
canonicalRules:
  - url: '(.*.site.com)/(..)/search/(.+)(\?.+)?'
    expected: '($1)/($2)/search/($3)'
  - url: '(.*.site.com)/(..)/product-(\d+)(\?.+)?'
    expected: '($1)/($2)/product-($3)'
hreflangRules:
  - url: '(.*.site.com)/(..)/search/(.+)(\?.+)?'
    expected:
      en: '($1)/en/search/($3)($4)'
      ro: '($1)/ro/search/($3)($4)'
  - url: '(.*.site.com)/(..)/product-(\d+)(\?.+)?'
    expected:
      en: '($1)/en/product-($3)($4)'
      ro: '($1)/ro/product-($3)($4)'

As illustrated, the rules can be written in a human-friendly way using JSON or YAML. The latter is preferable because its anchors and aliases can be leveraged to reuse rules across multiple sites and environments, or even to run the tests against multiple user agents, covering both desktop and mobile implementations.
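
As a quick illustration of that reuse, shared rules can be defined once under a YAML anchor and referenced with an alias wherever they apply; the keys and site names below are placeholders, not seo-slip's exact schema:

# Placeholder example of anchors/aliases for rule reuse across sites.
sharedCanonicalRules: &canonical
  - url: '(.*.site.com)/(..)/search/(.+)(\?.+)?'
    expected: '($1)/($2)/search/($3)'

site-ro:
  canonicalRules: *canonical
site-hu:
  canonicalRules: *canonical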

Seo-slip uses a simple crawler to find and download content, and every HTML page, CSS file, image, etc. it finds is validated against the checkers. The best approach is to start the crawl from a high-traffic page and go 2 or 3 levels deep, downloading just enough content to sample the site. Checking a minimal but relevant area of the site keeps the tests fast, which means they can be included in a CI pipeline or used as a monitoring tool. An exhaustive analysis is also possible, but the tool was not envisioned for that; ultimately, it's up to the test writer and/or SEO specialist to decide how to use it.
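
To make the shape of such a test concrete, here is a sketch of how a sampled crawl might be wired into a plain Jest test; the crawl() call, its options and the result shape are hypothetical stand-ins, not the library's actual API:

// Hypothetical sketch: crawl(), its options and the report shape are stand-ins,
// shown only to illustrate the "sample a few levels from a high-traffic page"
// approach inside an ordinary CI test.
const { crawl } = require('seo-slip'); // assumed entry point

test('no SEO regressions on the search section', async () => {
  const report = await crawl({
    startUrl: 'https://www.site.com/ro/search/phones', // high-traffic entry page
    maxDepth: 2,                                       // sample, do not crawl exhaustively
    userAgent: 'Googlebot',                            // or a mobile UA for the mobile run
    rules: require('./rules/site-ro.json'),
  });

  expect(report.errors).toEqual([]);
}, 120000); // generous timeout: the test downloads real pages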

Some sites have a URL structure complex enough that describing the expectations with rules as illustrated above becomes almost impossible. In that case, the checkers can be used to only pull the SEO data without asserting anything and store it in a CSV as an "SEO snapshot" that can then serve as the reference for snapshot testing.
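
The comparison step itself needs nothing library-specific; below is a minimal, generic sketch that diffs the current crawl output against a committed reference CSV, assuming one record per line (the file names are placeholders):

// Library-agnostic snapshot comparison: diff the latest crawl output against a
// committed baseline CSV and fail the CI job on any drift.
const fs = require('fs');

function readSnapshot(path) {
  return new Set(
    fs.readFileSync(path, 'utf8')
      .split('\n')
      .map((line) => line.trim())
      .filter(Boolean)
  );
}

const reference = readSnapshot('seo-snapshot.reference.csv'); // committed baseline
const current = readSnapshot('seo-snapshot.current.csv');     // produced by the latest crawl

const missing = [...reference].filter((row) => !current.has(row));
const unexpected = [...current].filter((row) => !reference.has(row));

if (missing.length || unexpected.length) {
  console.error('SEO snapshot drift detected', { missing, unexpected });
  process.exit(1); // fail the CI job
}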

Future

Seo-slip is not meant to become popular. Most of the time, Lighthouse, other general-purpose equivalents or even the occasional in-depth crawl are enough. However, for catching specific regressions, or catching them early in a development or staging environment, seo-slip might be a good candidate.

Stupid simple serverless with usage-only pricing

After discovering a small repo that we had made public by mistake, we wondered how we could prevent this, or at least react sooner. We immediately devised a solution based on a polling agent that would repeatedly query the GitHub API for our organization's public repos and track the changes over time, hooked up to a notification system (a Slack channel in our case). Later on we chose to handle this via GitHub's purpose-built webhooks, but for the sake of the exercise, it's still interesting to think about the polling agent with history persistence.
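
For the sake of that exercise, the polling agent itself boils down to a handful of lines. The sketch below uses the public GitHub REST API and a Slack incoming webhook; the organization name and the loadSeen()/saveSeen() persistence hooks are placeholders for wherever the history would actually live.

// Polling approach in a nutshell: list the organization's public repos, compare
// with the previously seen list and notify a Slack channel about new ones.
// Requires a runtime with a global fetch (Node 18+, Cloudflare Workers, etc.).
const ORG = 'my-org'; // placeholder organization

async function checkPublicRepos(loadSeen, saveSeen, slackWebhookUrl) {
  const res = await fetch(`https://api.github.com/orgs/${ORG}/repos?type=public&per_page=100`, {
    headers: {
      'Accept': 'application/vnd.github+json',
      'User-Agent': 'public-repo-watcher', // GitHub's API requires a User-Agent
    },
  });
  const repos = await res.json();
  const current = repos.map((r) => r.full_name);

  const seen = await loadSeen(); // e.g. a KV store, a small DB, even a file
  const newlyPublic = current.filter((name) => !seen.includes(name));

  if (newlyPublic.length) {
    await fetch(slackWebhookUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text: `New public repos: ${newlyPublic.join(', ')}` }),
    });
  }

  await saveSeen(current);
}

module.exports = { checkPublicRepos };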

Because this is the kind of program that only runs from time to time, and because a tech company ends up building numerous such automations, it makes sense to deal with them systematically. The distilled list of requirements is below:

  1. Minimal setup costs - don’t pay per application, don’t pay per developer, don’t waste time configuring much, if anything at all; if the transactional cost of starting up such an app is low, we’ll get the chance to automate more

  2. Minimal running costs - no dedicated computational resources, so that means no recurrent costs per unit of time, which basically means it should be usage-based

  3. General purpose - it should be in a conventional tech stack and it should be capable of supporting a diverse set of needs (computation, persistence, networking)

Looking at our existing tech stack, there are a few contenders:

  • AWS Lambda - excluded because we don’t consider AWS to be a developer-friendly platform, which means setup costs aren’t minimal; we use it extensively, but we shield our application developers from it

  • Heroku - excluded because of running costs; for every automation we develop in isolation, we’d need one web dyno running, regardless of throughput

  • A Linux VM, e.g. a DigitalOcean droplet or an EC2 machine - excluded because the setup is very involved: secret isolation, multi-tenancy of apps and the low-level nature of the approach make it unattractive

Although there are ways to work around these issues, as a great Python programmer once said, there must be a better way!

Serverless platforms to the rescue

Some of the more popular serverless computing platforms we took into account

One approach that piqued our interest was Serverless.com. Because it relies on a per-developer pricing model and we have hundreds of developers across the organization, it's an option that's hard to digest. However, Cloudflare, one of our service providers, already has a solution in that space, with the added benefit of running the code close to the users and very reasonable usage-based pricing, so we gave it a try. They are by no means the only ones doing this - the landscape is filled with solutions like Firebase Cloud Functions - however, Firebase's pricing model is not as simple and, apparently, not as low.
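
As a hint of the shape this takes on Cloudflare Workers, a job like the polling agent above can be attached to a cron trigger so that it only runs, and only costs, when scheduled. The following is a minimal sketch under assumed bindings (a SEEN_REPOS KV namespace and a SLACK_WEBHOOK_URL secret), not the implementation we describe in part 2:

// Minimal cron-triggered Worker sketch (module syntax). The cron expression
// itself lives in the Worker's configuration; SEEN_REPOS and SLACK_WEBHOOK_URL
// are assumed bindings used here as the history store and the notification target.
import { checkPublicRepos } from './check-public-repos.js'; // the function sketched earlier (hypothetical path)

export default {
  async scheduled(controller, env, ctx) {
    const loadSeen = async () => JSON.parse((await env.SEEN_REPOS.get('repos')) || '[]');
    const saveSeen = (repos) => env.SEEN_REPOS.put('repos', JSON.stringify(repos));
    ctx.waitUntil(checkPublicRepos(loadSeen, saveSeen, env.SLACK_WEBHOOK_URL));
  },
};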

In part 2, we describe test-driving Cloudflare Workers with our implementation of the GitHub automation.