This is the last post in “The Making of Yext Pages”:
- Design, Accessibility, & SEO
- Technical Architecture & Content Generation
- Serving, On-Call and High Availability (you are here)
The Pages Serving sub-system receives site files (primarily HTML, CSS, JS) from Pages Generation and is responsible for providing them to users. To improve reliability and performance, site files are replicated to multiple regions around the world, and user requests are directed to the nearest one based on geographic routing. A high-level view is provided in the following diagram.
In the diagram above:
- The Pages Publishing Server receives site files from Content Generation and places them in blob storage in multiple regions around the world. It also reconfigures the CDN by registering new sites and purging the CDN’s cache of updated file URLs (not pictured above).
- The Site Director returns the requested files and applies some light logic such as redirects and setting cache headers.
A big advantage of the “build and publish” architecture is that we can set long-lived cache headers on the vast majority of pages and allow the CDN to serve those requests without checking with the origin (Site Director). This results in very low latencies for users around the world and high reliability since it has the minimum number of moving parts in the “hot path”. Around 80% of page views are served without hitting our servers.
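To make the cache-header idea concrete, here is a minimal Go sketch of a handler wrapper that sets long-lived cache headers on static files. It is not Site Director's actual code: the function names, the `/preview/` prefix, and the max-age values are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// cacheControlFor picks a Cache-Control value for a request path.
// uncached is a hypothetical list of path prefixes that must always reach the origin.
func cacheControlFor(path string, uncached []string) string {
	for _, prefix := range uncached {
		if strings.HasPrefix(path, prefix) {
			return "no-store"
		}
	}
	// max-age applies to browsers, s-maxage to shared caches such as a CDN.
	// Both values are illustrative, not the ones used in production.
	return "public, max-age=300, s-maxage=86400"
}

// withCacheHeaders wraps a handler and sets Cache-Control before it runs.
func withCacheHeaders(next http.Handler, uncached []string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", cacheControlFor(r.URL.Path, uncached))
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Stand-in for the handler that returns site files.
	files := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html>hello</html>")
	})
	h := withCacheHeaders(files, []string{"/preview/"})

	for _, p := range []string{"/locations/nyc.html", "/preview/draft.html"} {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, p, nil))
		fmt.Printf("%-22s %s\n", p, rec.Header().Get("Cache-Control"))
	}
}
```

A long shared-cache lifetime is only safe because publishing purges updated URLs from the CDN, as described above; without that purge step, stale pages would linger at the edge.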
The subsequent sections go into a little more detail on the individual pieces.
Redundant Content Delivery Networks
Zooming in on the CDNs a bit (above), there are two aspects worth noting:
- We support multiple CDNs (Cloudflare and Fastly at the moment) for redundancy.
- The CDNs are configured to use the geographically nearest instance of the Site Director as the origin for each request.
A majority of page views (around 80%) are served from the CDN’s edge cache, and the ones that do require a request to the origin won’t have too far to travel.
Supporting multiple CDNs does increase the amount of integration work, since we have a number of touch points with the CDNs:
- On site creation, we automatically provision a new DNS name and SSL certificate.
- When site files are updated, we purge individual URLs or wildcards.
- The monitoring system processes and archives access logs.
- Request serving (Site Director) has to be aware of differences like header names used for tracing or device detection.
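As an illustration of the last touch point, the sketch below shows one way an origin server could normalize a request ID that arrives under different header names depending on the CDN. The lookup table is an assumption: `CF-Ray` is, to my knowledge, the request ID header Cloudflare adds, while `X-Example-Trace-Id` is a placeholder and not a real header of any CDN. Check each provider's documentation for the actual names.

```go
package main

import (
	"fmt"
	"net/http"
)

// traceHeaders maps a CDN label to the header assumed to carry its request ID.
// The second entry is a placeholder, not a real CDN header.
var traceHeaders = []struct{ cdn, header string }{
	{"cloudflare", "CF-Ray"},
	{"other-cdn", "X-Example-Trace-Id"},
}

// traceID returns the first request ID found, along with the CDN it came from.
// ok is false when none of the known headers is present.
func traceID(h http.Header) (cdn, id string, ok bool) {
	for _, t := range traceHeaders {
		// Header.Get canonicalizes the name, so lookup is case-insensitive.
		if v := h.Get(t.header); v != "" {
			return t.cdn, v, true
		}
	}
	return "", "", false
}

func main() {
	h := http.Header{}
	h.Set("CF-Ray", "abc123")
	cdn, id, ok := traceID(h)
	fmt.Println(cdn, id, ok)
}
```

Keeping the per-CDN differences in one table means the rest of the request path can work with a single trace ID, and adding a CDN is a one-line change.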
Although it takes some extra work and testing, keeping our customers’ sites online is the top priority for us, and we’ve found redundancy at every level to be valuable in that pursuit.
For similar reasons, we deploy our software to multiple cloud providers. For example, our European customers are served from a GCP region located in Frankfurt, Germany.
Site Director
Site Director handles HTTP requests from our CDNs. It’s a Go program that does little more than direct requests to the correct backend. Mostly, it fetches a file from S3 and returns it. As Filippo Valsorda demonstrates in his 2018 GopherCon talk, it’s possible to make an extremely efficient reverse proxy in a straightforward manner.
We based Site Director on the httputil.ReverseProxy provided by Go’s standard library. To evaluate it, we replayed traffic logs using Vegeta to confirm its behavior and performance. We tested it at 400 qps, and at that level it consumed only 100 MB of memory with a mostly idle CPU and served files reliably, with a p99 latency of 1 s and a p99.9 of 1.25 s. Satisfied with these results, we introduced it into production in 2017, and since then, it’s only gotten better. Our latest load tests, run in November 2020 with Go 1.15, showed a median latency (p50) of 27 ms and a p99 of 200 ms at 1000 qps.
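For readers unfamiliar with httputil.ReverseProxy, the following is a minimal sketch of a proxy that maps each site's hostname to a path in a storage origin. It is not Site Director itself: the `/sites/<host>/...` layout, the `index.html` default, and the function names are assumptions, and a local test server stands in for blob storage.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// objectPath maps a site host and request path to a storage path.
// The "/sites/<host>/..." layout is a hypothetical convention.
func objectPath(host, path string) string {
	if strings.HasSuffix(path, "/") {
		path += "index.html" // serve a directory's index page
	}
	return "/sites/" + host + path
}

// newSiteProxy returns a reverse proxy that forwards every request to target,
// rewriting the path so that each site reads from its own prefix.
func newSiteProxy(target *url.URL) *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
			r.URL.Path = objectPath(r.Host, r.URL.Path)
			r.URL.RawPath = ""   // let net/url re-encode the rewritten path
			r.Host = target.Host // the origin expects its own Host header
		},
	}
}

func main() {
	// Stand-in for the storage origin: it echoes the path it was asked for.
	origin := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "origin saw "+r.URL.Path)
	}))
	defer origin.Close()

	target, err := url.Parse(origin.URL)
	if err != nil {
		log.Fatal(err)
	}
	proxy := httptest.NewServer(newSiteProxy(target))
	defer proxy.Close()

	req, err := http.NewRequest(http.MethodGet, proxy.URL+"/", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Host = "locations.example.com" // pretend the CDN forwarded this site's hostname
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```

The sketch uses the `Director` hook because it is available in the Go versions mentioned above; newer Go releases also offer a `Rewrite` hook on the same type.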
Pages, Images, and Search
Although most content is served using Site Director, some types are not:
- Images are managed by a separate system described in the Images section of part 2.
- Sites have a Search page which is powered by our Live API.
We deploy Images and Search backends to each region and replicate all of the content there so that everything is as close to visitors as possible.
Monitoring, Alerting, and On-Call
We perform both white-box and black-box monitoring and alerting:
- We collect application and host metrics with Prometheus. This includes everything from resource usage, to the volume and latency of requests served by type, to errors logged by the application. Any application being down or logging a lot of errors will trigger an alert.
- We deploy Cloudprober, an open source black-box monitoring service. When a site is created, we reconfigure Cloudprober to fetch a couple of pages from the new site. More than one failure will trigger an alert. Cloudprober also provides us with detailed information about SSL and DNS.
We use AlertManager to classify alerts by severity: if an individual site is experiencing errors, then it is posted to a Slack channel that’s monitored by our Consulting team. In some cases it’s a problem with the site configuration or an unexpected DNS change and they either resolve it or follow up with the client.
Alerts across many sites will escalate to PagerDuty, where the on-call engineer is paged to investigate the anomaly. They have access to Grafana dashboards that can often highlight where the error is occurring, as well as playbooks for some specific alerts on the wiki. Most times of day, relevant engineers are also reachable via Slack and happy to swarm an incident.
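The escalation rule in the two paragraphs above can be sketched in a few lines of Go. This only illustrates the decision logic and is not an Alertmanager configuration; the `Alert` type, the destination names, and the threshold are assumptions.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified alert record.
type Alert struct {
	Name string
	Site string
}

// distinctSites counts how many different sites appear in the firing alerts.
func distinctSites(alerts []Alert) int {
	seen := make(map[string]struct{})
	for _, a := range alerts {
		seen[a.Site] = struct{}{}
	}
	return len(seen)
}

// destination decides where a batch of firing alerts should go.
// pageThreshold is an illustrative cutoff for "many sites".
func destination(alerts []Alert, pageThreshold int) string {
	if distinctSites(alerts) >= pageThreshold {
		return "pagerduty" // widespread problem: page the on-call engineer
	}
	return "slack" // isolated problem: post to the monitored channel
}

func main() {
	isolated := []Alert{
		{Name: "ProbeFailed", Site: "a.example.com"},
		{Name: "CertExpiring", Site: "a.example.com"},
	}
	widespread := []Alert{
		{Name: "ProbeFailed", Site: "a.example.com"},
		{Name: "ProbeFailed", Site: "b.example.com"},
		{Name: "ProbeFailed", Site: "c.example.com"},
	}
	fmt.Println(destination(isolated, 3))
	fmt.Println(destination(widespread, 3))
}
```

Counting distinct sites rather than raw alerts keeps a single noisy site from paging the on-call engineer.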
Issues can usually be classified as a problem with one of the following:
- Individual site: An individual site may have a configuration flaw, an error in its templates, or an error in the surrounding system, such as a client-operated reverse proxy or DNS records.
- Application code: For example, a bug in the request handling logic, a query of death, or an error handling path that is being executed for the first time.
- Infrastructure configuration or availability: For example, a disk filled up, or a security group was updated to close a port that was in use.
Typically, bugs in a serving application can be resolved most quickly by rolling back. Our deployment system allows a one-command rollback to the last live version, and it completes within ~30 seconds. A fast, easy, reliable rollback capability is easily the most valuable tool that you can have at your disposal when responding to an incident.
Sadly, rolling back is not a panacea. A bug affecting Content Generation could have resulted in incorrect site files. To respond to this scenario, we make periodic backup copies of all site files, and we have the ability to put Site Director into backup mode, either globally or for a specific site. In that mode, it serves from the backup directory in Cloud Storage instead; since no files are being moved, this change can take effect instantly, restoring service in case Content Generation has run amok.
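The reason backup mode can take effect instantly is that it only changes which storage prefix is read. A minimal sketch of that switch, assuming hypothetical `live/` and `backup/` prefixes and a made-up `Config` type, might look like this:

```go
package main

import "fmt"

// Config holds hypothetical backup-mode switches.
type Config struct {
	GlobalBackup bool            // serve every site from the backup copy
	SiteBackup   map[string]bool // serve only these sites from the backup copy
}

// prefix returns the storage prefix to read a site's files from.
// The "live/" and "backup/" names are illustrative.
func (c Config) prefix(site string) string {
	// Reading from a nil map is safe in Go and yields false.
	if c.GlobalBackup || c.SiteBackup[site] {
		return "backup/" + site
	}
	return "live/" + site
}

func main() {
	cfg := Config{SiteBackup: map[string]bool{"a.example.com": true}}
	fmt.Println(cfg.prefix("a.example.com")) // this site is in backup mode
	fmt.Println(cfg.prefix("b.example.com")) // this site is served normally

	cfg.GlobalBackup = true
	fmt.Println(cfg.prefix("b.example.com")) // global switch overrides
}
```

Because the decision is made per request, flipping a flag redirects reads immediately and flipping it back is just as cheap.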
What else could go wrong? Important infrastructure or our cloud provider itself could become unavailable. We mitigate that scenario by performing a failover, which reconfigures CDNs to send requests only to the remaining regions. This can be done for individual services (e.g. only the Live API or Pages Serving), and we have automated the process so that it is triggered when elevated error rates are detected.
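A rough sketch of the failover decision, under stated assumptions, follows. The `RegionStats` type, the region names, and the 5% threshold are invented for illustration, and the step that actually reconfigures the CDNs is left out.

```go
package main

import "fmt"

// RegionStats is a hypothetical summary of recent traffic for one region.
type RegionStats struct {
	Name     string
	Requests int
	Errors   int
}

// servingRegions returns the regions that should keep receiving traffic.
// A region is dropped when its error rate exceeds maxErrorRate.
func servingRegions(stats []RegionStats, maxErrorRate float64) []string {
	var healthy, all []string
	for _, s := range stats {
		all = append(all, s.Name)
		// A region with no traffic gives no evidence of failure, so keep it.
		if s.Requests == 0 || float64(s.Errors)/float64(s.Requests) <= maxErrorRate {
			healthy = append(healthy, s.Name)
		}
	}
	if len(healthy) == 0 {
		// Design choice in this sketch: never remove every region at once.
		return all
	}
	return healthy
}

func main() {
	stats := []RegionStats{
		{Name: "us-east", Requests: 1000, Errors: 3},
		{Name: "eu-frankfurt", Requests: 1000, Errors: 250},
		{Name: "ap-southeast", Requests: 0, Errors: 0},
	}
	fmt.Println(servingRegions(stats, 0.05))
}
```

Running the same check per service would allow, for example, the Live API to fail over while Pages Serving stays put.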
The simplicity of serving a static site enables us to support sophistication in its delivery: a geographically distributed system that serves each request from the region closest to the user, with redundancy at every level. On top of that, we have layers of failsafes like universal content backups and controls to gracefully take regions out of service, supported by a 24/7 on-call rotation whose engineers have global monitoring from a single dashboard and playbooks for responding to incidents. Although we always see things to improve upon, I think the past 7 years of building and serving pages for clients using our custom system have gone well, and these architectural choices have endured.
What does the future hold? The main thing is opening this system up! We want to enable all developers to build sites using our platform and reap the benefits. Coming soon…