This is the second post in “The Making of Yext Pages”:
- Design, Accessibility, & SEO
- Technical Architecture & Content Generation (you are here)
- Serving, On-Call and High Availability
Last time we discussed the technical aspects of designing and implementing an effective web presence, a process that is handled by our Consulting team in collaboration with the client. There are two types of work product that come out:
- HTML templates, CSS, JS, and media assets.
- A data model and workflow for populating content into the “Entity Profiles” in the customer’s Knowledge Graph, e.g. custom data types for content and approval flows, automated ETLs, or other mechanisms for populating it.
The Pages system, whose design we cover in this post, is responsible for taking those source materials and serving them as a web site.
The discussion uses a few terms that may be unfamiliar. Please refer to this glossary:
| Term | Definition |
|---|---|
| Knowledge Graph | A Content Management System developed by Yext with custom data types and rich support for media, hierarchy, and relationships between data records. |
| Entity | A node in the Knowledge Graph. Each Entity has a primary Profile, and it may have alternate Profiles containing localized content for different locales. |
| Entity Profile | A data record, usually expressed in JSON. |
When designing the Pages system, we had the following requirements:
- The system had to generate a page per entity, a directory of entities in the site, a search page, and a sitemap.
- Data related to an entity must be available to use in the template, for example the last 5 reviews, associated menus, upcoming events, or other nearby business locations.
- It must be possible to view and share edits to a site before pushing them to production.
Here’s what that looks like:
Our non-functional priorities were Reliability, Latency, and “SEO goodness”. That combination led directly to a “build and publish” architecture with staging & production environments:
- The site configuration, templates, and assets are managed directly by the developer in a git repository.
- On each update to the source data, the system regenerates the site’s pages and uploads them to hosting providers around the world.
- The system automatically creates a private staging domain for each site backed by a different set of files. Site configuration is stored in a git repository, and staging and production may be generated independently from different revisions of the repository.
This has some major benefits:
Serving static files is as fast and reliable as you can get. They can be globally distributed using a CDN so it’s fast no matter where in the world your users are.
It also has some challenges compared to a more traditional setup (e.g. WordPress):
- Slower updates - Updates to templates or backing data take longer to be reflected in the live site. This is a fundamental consequence of the architecture, but the delay can be reduced to very reasonable levels (<10s) by a good implementation.
- Stale pages - The system must regenerate pages if any of the data used on those pages is updated. If it fails to do so, the pages will show stale content indefinitely. This problem is essentially “cache invalidation”, and it is famously one of the two hardest problems in computer science.
We launched Yext Pages with this architecture in 2014; since then we have had to solve a bunch of challenges to make the “build and publish” architecture work well for pages that change frequently, but we’ve found the benefits so compelling that we haven’t looked back.
We divided the overall system into two subsystems, developed by separate teams:
- Content Generation, which produces site files as shown in the diagram above.
- Serving, which receives site files and serves them to consumers around the world.
This post discusses Content Generation, and the next one discusses Serving.
The Content Generation process involves combining source files (designed and implemented by our Consulting team for this site) with data populated in the customer’s Knowledge Graph to produce the actual site files. Here is how it looks in concept:
Processing Source File Updates
The term “Source Files” refers to the templates, assets (CSS, JS, images), and configuration of the site. Everything is stored in a Git repository, and developers may manually or automatically apply updates to their staging or production sites. When processing an update to the source files, the system publishes the entire site, which involves generating and uploading all of the site files to hosting. Individual sites can involve gigabytes of files, so a naive solution of doing this on each publish is too slow to support an efficient workflow for developers, who will make hundreds or thousands of updates to site templates during development. It’s also usually unnecessary, since many changes are minor and involve just one asset.
We tackle this in two ways:
Make a full publish faster by quickly detecting unchanged files.
Calculating which incremental update to make based on arbitrary changes to configuration is infeasible. Instead, we regenerate all site files but only upload files that have changed. We detect changes by hashing each site file to calculate its fingerprint, which we store locally. On each publish, we filter out any files whose fingerprints have not changed. The major file hosting providers support this workflow by returning a hash of each uploaded file, e.g. the ETag field in S3’s ListObjects response, which is the MD5 digest for objects uploaded in a single part without SSE-C or SSE-KMS encryption.
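As a minimal sketch of the fingerprint filter described above: the function names and the in-memory maps are assumptions made for illustration, and MD5 is used only because it matches the hash that hosting providers commonly report. The post does not show the real implementation.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint returns the hex-encoded MD5 digest of a generated file.
func fingerprint(content []byte) string {
	sum := md5.Sum(content)
	return hex.EncodeToString(sum[:])
}

// changedFiles compares freshly generated files against the fingerprints
// stored at the previous publish. It returns the sorted paths that need
// uploading and the new fingerprint set to store for next time.
// (Hypothetical helper, not the production code.)
func changedFiles(generated map[string][]byte, previous map[string]string) ([]string, map[string]string) {
	var upload []string
	current := make(map[string]string, len(generated))
	for path, content := range generated {
		fp := fingerprint(content)
		current[path] = fp
		// A missing entry yields "", so new files are always uploaded.
		if previous[path] != fp {
			upload = append(upload, path)
		}
	}
	sort.Strings(upload)
	return upload, current
}

func main() {
	previous := map[string]string{
		"index.html": fingerprint([]byte("<h1>Home</h1>")),
		"about.html": fingerprint([]byte("<h1>About</h1>")),
	}
	generated := map[string][]byte{
		"index.html": []byte("<h1>Home</h1>"),
		"about.html": []byte("<h1>About us</h1>"),
		"new.html":   []byte("<h1>New</h1>"),
	}
	upload, _ := changedFiles(generated, previous)
	fmt.Println(upload) // [about.html new.html]
}
```

This sketch leaves out deletions: a path present in `previous` but absent from `generated` would also need to be removed from hosting.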
There is a pitfall in this approach: it requires that generated files are deterministic – that the site files are byte-for-byte identical when generated using the same configuration and data. This sounds like it would naturally be the case, but some common operations like iterating through a map can result in variations in the output files. Beware.
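To illustrate the map pitfall in Go: the language leaves map iteration order unspecified, so output built by ranging over a map directly can differ between runs. The usual fix is to sort the keys first. The `renderAttrs` helper below is a made-up example, not part of the Pages codebase, and it skips HTML escaping for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderAttrs writes key="value" pairs in sorted key order so the output
// is byte-for-byte identical on every run.
func renderAttrs(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for i, k := range keys {
		if i > 0 {
			b.WriteByte(' ')
		}
		fmt.Fprintf(&b, "%s=%q", k, attrs[k])
	}
	return b.String()
}

func main() {
	attrs := map[string]string{"lang": "en", "class": "hours", "id": "store-12"}
	fmt.Println(renderAttrs(attrs)) // class="hours" id="store-12" lang="en"
}
```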
Provide a command-line tool for developers for quick iteration.
This command line tool runs the same code as the production system, but it generates just the single page that the developer is working on. This allows developers to see the effects of their changes ~instantaneously.
The next article will go into detail about our global, redundant file hosting to which we upload the files.
Processing Data Updates
To process a data update, the system has a job that’s simple in concept: notice when source data has been updated and republish any pages that changed as a result.
Indeed, some cases are that simple; when an individual insurance agent updates their description, the system needs only to republish their entity page. However, the general case is more complicated: let’s say the agent’s page includes their most recent reviews. The system has to process all new reviews and determine which sites and pages they appear on. Not only does the agent’s page need to be updated, but related pages may also need to be updated: their average review rating may appear on directory pages, pages for nearby agents, and the site’s search index.
In fact, sites can incorporate data from half a dozen source systems, and updates to the page do not necessarily correspond to updates to records in the source systems. Here are a few examples:
- New or Archived Entities require new pages to be published or existing pages to be deleted.
- Holiday Hours can be provided, which override an entity’s normal hours on specific dates.
- Upcoming Events can be displayed chronologically, and the list changes as event dates pass.
- Daily Hours show “Open Now” based on the time of day reported by the user’s browser, combined with the entity’s operating hours.
- Last 5 Reviews show engagement from other users.
We use RabbitMQ to integrate all of these data sources. The source systems that manage these data sources publish updates when the data changes, and our Content Generator reacts to those notifications by triggering the minimal update to site files. It still requires a lot of code (and tests!) to handle all of the various cases, but an organized team can plow through it.
To handle effective data that changes based on the calendar (holiday hours, upcoming events), the system schedules affected pages to be regenerated on time boundaries where the data changes, at most once a day. We use my cron package for that.
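One way to picture the "at most once a day" rule is to collapse every instant at which a page's effective data changes into a distinct day boundary. The sketch below uses only the standard library rather than the cron package mentioned above; `regenerationDays` and `utc` are hypothetical names, and treating midnight as the boundary is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// utc builds a UTC timestamp; it only keeps the example short.
func utc(year, month, day, hour int) time.Time {
	return time.Date(year, time.Month(month), day, hour, 0, 0, 0, time.UTC)
}

// regenerationDays maps each change instant to the start of its day (in the
// instant's own time zone), drops boundaries that are not after now, removes
// duplicates, and returns the remaining boundaries in ascending order.
func regenerationDays(now time.Time, changes ...time.Time) []time.Time {
	seen := make(map[string]bool)
	var days []time.Time
	for _, c := range changes {
		y, m, d := c.Date()
		day := time.Date(y, m, d, 0, 0, 0, 0, c.Location())
		key := day.Format(time.RFC3339)
		if !day.After(now) || seen[key] {
			continue
		}
		seen[key] = true
		days = append(days, day)
	}
	sort.Slice(days, func(i, j int) bool { return days[i].Before(days[j]) })
	return days
}

func main() {
	now := utc(2024, 12, 20, 12)
	days := regenerationDays(now,
		utc(2024, 12, 25, 9),  // holiday hours take effect
		utc(2024, 12, 25, 17), // second change on the same day
		utc(2024, 12, 24, 8),  // an event date
		utc(2024, 12, 20, 15), // today's boundary has already passed
	)
	for _, d := range days {
		fmt.Println(d.Format("2006-01-02"))
	}
}
```

Each returned boundary would then be handed to whatever scheduler triggers the page regeneration.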
To regenerate only the pages which have changed, we need to know which data fields are used on each page. How? Developers write HTML templates using any data associated with an entity that the design calls for. The system then analyzes the HTML templates to see what fields and related records were used, and it uses that information to fetch the minimum information needed and to ignore updates to fields not in that set. You can find more detail in this article: Soy - Programmable Templates for Go.
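Once the set of fields used by each page is known, deciding what to republish becomes a set intersection. The sketch below assumes the per-page field sets have already been extracted; here they are written by hand, whereas the system described above derives them from template analysis. The type and method names are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// pageDeps maps a page path to the set of entity fields its template reads.
type pageDeps map[string]map[string]bool

// pagesToRegenerate returns the pages whose templates use at least one of
// the updated fields, sorted for stable output. Updates that touch only
// unused fields produce an empty result and can be ignored.
func (d pageDeps) pagesToRegenerate(updatedFields []string) []string {
	var pages []string
	for page, fields := range d {
		for _, f := range updatedFields {
			if fields[f] {
				pages = append(pages, page)
				break
			}
		}
	}
	sort.Strings(pages)
	return pages
}

func main() {
	deps := pageDeps{
		"/agents/jane": {"description": true, "reviews": true},
		"/directory":   {"name": true, "reviews": true},
		"/search":      {"name": true},
	}
	fmt.Println(deps.pagesToRegenerate([]string{"reviews"})) // [/agents/jane /directory]
	fmt.Println(len(deps.pagesToRegenerate([]string{"fax"}))) // 0
}
```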
The incremental processing of source and data updates allows us to update static sites quickly and efficiently. Besides that, we also perform a full publish for all sites every 24 hours. This ensures any errors in the incremental updates are corrected in a timely fashion, using dormant compute capacity at off-peak times.
An entity’s profile is composed of a customizable set of data fields of many different types. Besides text, we have types for important business data such as address, phone number, operating hours, and service areas. Most data types can be naturally used in HTML templates or converted to JSON. However, images are special: storing image data in the profile would increase the size of the profile by orders of magnitude, and the image bytes would not be helpful in an HTML template anyway.
Our approach was to peel off a separate system to manage images. It accepts images and returns a URL that the image can be accessed from, and that URL is what gets stored into entity profiles. For example:
The long random-looking sequence of characters in the URL is a fingerprint of the content. Once an image is written, we never update it. This allows us to set long-lived cache headers, so consumers never have to download an image more than once. It also allows us to effectively use CDNs to get images close to customers without having to check with the origin.
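A content-addressed URL of this kind can be sketched as follows. The post does not say which hash function or URL layout the image system uses, so SHA-256, the `images.example.com` host, and the `contentURL` helper are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// immutableCacheControl tells browsers and CDNs that the response never
// changes, which is safe because the URL changes whenever the bytes do.
const immutableCacheControl = "public, max-age=31536000, immutable"

// contentURL derives the image URL from a fingerprint of its bytes, so
// identical content always maps to the same URL. (Hypothetical layout.)
func contentURL(base string, image []byte, ext string) string {
	sum := sha256.Sum256(image)
	return base + "/" + hex.EncodeToString(sum[:]) + ext
}

func main() {
	url := contentURL("https://images.example.com", []byte("abc"), ".png")
	fmt.Println(url)
	fmt.Println("Cache-Control:", immutableCacheControl)
}
```

Because an updated image gets a new URL, the old URL never needs to be invalidated in browser or CDN caches.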
Customers often upload much higher resolution images than are needed for their pages, so we also resize and transcode images automatically. For example, the original image linked above can be accessed like this to return a thumbnail at least as big as 1440x400 pixels:
To make these URLs easy to compose in HTML templates, we provide a special function that produces the thumbnail URL given the original URL and the desired dimensions.
To avoid the possibility of DoS via a malicious user asking for large numbers of different dimensions, we only produce thumbnails at a predetermined set of dimensions. As a result, the example thumbnail above may not be exactly 1440x400.
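Snapping a request to a fixed set of sizes might look like the sketch below. The list of allowed dimensions is invented for the example (the real set is not given in this post), as is the fallback to the largest size when a request exceeds all of them.

```go
package main

import "fmt"

// size is a thumbnail bounding box in pixels.
type size struct{ W, H int }

// allowed is a hypothetical list of supported thumbnail sizes, ordered from
// smallest to largest in both dimensions.
var allowed = []size{{150, 150}, {450, 338}, {900, 675}, {1900, 1425}}

// snap returns the smallest allowed size that is at least as big as the
// requested one, or the largest allowed size if the request exceeds them all.
func snap(w, h int) size {
	for _, s := range allowed {
		if s.W >= w && s.H >= h {
			return s
		}
	}
	return allowed[len(allowed)-1]
}

func main() {
	fmt.Println(snap(1440, 400)) // {1900 1425}
	fmt.Println(snap(100, 100))  // {150 150}
}
```

However many distinct dimensions a client asks for, at most `len(allowed)` thumbnails are ever produced per image.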
Lastly, we normalize and optimize the image, for example:
- Although the uploaded image was a PNG, the returned image is actually in the highly optimized WebP format, if you’re using Chrome.
- Uploaded JPGs are re-encoded to follow the image optimization guidelines; this helps a lot because many originals are encoded at very high quality factors and are therefore huge for no benefit.
- All images are converted to JPG or PNG (if they are not already), depending on whether the original image is in a lossy or lossless format. This allows us to support more unusual formats on input like SVGs, TGAs, TIFFs, and even Photo Spheres.
- Mobile phones often encode image rotation using EXIF metadata, and there are scenarios where that is not respected, making the image appear upside down. We apply that rotation to the image content and strip that tag.
- We normalize the color palette for screen display and strip the embedded color profile.
URLs & Redirects
Since Entity Pages are dynamic, their URLs also need to be dynamic. We accomplish that by using a Soy template, just like we do for page content, with a reduced set of available data. For example, a URL may look like this:
Astute readers may notice an issue – the data fields may have characters which are not URL friendly, such as spaces, non-latin characters, or emoji. To address that, we take advantage of the ability to introspect these templates, and we modify the tags to add print directives that lowercase and “latinize” the values, along with other tweaks such as replacing spaces with hyphens.
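The effect of those directives can be approximated with a plain Go function. This is a sketch under stated assumptions: the tiny `latin` table stands in for a full transliteration table, and `slug` is a hypothetical helper, not the actual print directive used by the templates.

```go
package main

import (
	"fmt"
	"strings"
)

// latin maps a few accented runes to ASCII. It is deliberately incomplete
// and exists only to illustrate the idea.
var latin = map[rune]string{
	'á': "a", 'à': "a", 'ã': "a", 'ä': "a", 'ç': "c", 'é': "e", 'è': "e",
	'í': "i", 'ñ': "n", 'ó': "o", 'ö': "o", 'ú': "u", 'ü': "u", 'ß': "ss",
}

// slug lowercases s, replaces mapped accented runes with ASCII, and turns
// every run of other characters (spaces, punctuation, emoji) into a single
// hyphen, without leading or trailing hyphens.
func slug(s string) string {
	var b strings.Builder
	pendingHyphen := false
	write := func(text string) {
		if pendingHyphen && b.Len() > 0 {
			b.WriteByte('-')
		}
		pendingHyphen = false
		b.WriteString(text)
	}
	for _, r := range strings.ToLower(s) {
		switch {
		case r >= 'a' && r <= 'z', r >= '0' && r <= '9':
			write(string(r))
		case latin[r] != "":
			write(latin[r])
		default:
			pendingHyphen = true
		}
	}
	return b.String()
}

func main() {
	fmt.Println(slug("São Paulo Café #12")) // sao-paulo-cafe-12
	fmt.Println(slug("  Joe's 🍕 Pizza "))   // joe-s-pizza
}
```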
Especially astute readers may notice another issue – since they incorporate entity fields, page URLs may change as a result of processing a data update. This requires some special behavior that updates to the page content do not, like updating other pages that have incoming links and updating the sitemap. But that’s not all. Although we can update links on the pages under our control, incoming links could be present anywhere, and we’d like to avoid breaking them. Relatedly, in SEO lore, the SEO reputation/benefit that the page has built up would be squandered. To avoid these dire outcomes, our system maintains a set of redirects alongside other site files. It automatically creates them when page URLs change to avoid breaking incoming links, and we’ve also found it to be a useful mechanism to configure redirects in bulk for more complicated migrations.
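A redirect set that stays consistent as URLs change repeatedly could be maintained along these lines. Retargeting older redirects so that chains stay one hop long, and dropping a redirect when a page moves back to an old URL, are design choices assumed for this sketch; the post does not describe how the real system handles those cases.

```go
package main

import "fmt"

// redirects maps an old URL path to the path that currently serves the page.
type redirects map[string]string

// move records that the page at oldURL now lives at newURL.
func (r redirects) move(oldURL, newURL string) {
	if oldURL == newURL {
		return
	}
	// Retarget existing redirects that pointed at the old URL so that
	// visitors are never sent through a chain of hops.
	for from, to := range r {
		if to == oldURL {
			r[from] = newURL
		}
	}
	r[oldURL] = newURL
	// The new URL now serves a page, so it must not redirect elsewhere.
	delete(r, newURL)
}

func main() {
	r := redirects{}
	r.move("/ny/new-york/123-main-st", "/ny/new-york/125-main-st")
	r.move("/ny/new-york/125-main-st", "/ny/brooklyn/125-main-st")
	fmt.Println(r["/ny/new-york/123-main-st"]) // /ny/brooklyn/125-main-st
	fmt.Println(len(r))                        // 2
}
```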
Since this article could only cover so much, here are a few tidbits from topics that didn’t make the cut:
- Nearby Locations - An entity page that represents a geographic location may have a list of nearby locations on it. To efficiently return those, we calculate and store every entity’s geohash, which allows efficient prefix lookups.
- HTML Templating - The system is written in Go, and we ported the Closure Template language for authoring site templates; the port is also open source. It is about twice as fast as the built-in Go templating library.
- Directories - A directory page requires information about all entities in scope. At the top of the directory, that involves all entities in the entire site! For this to be feasible, we’re careful to load precisely the fields used, which turns out to be a tiny fraction of the entity profile.
  Besides that optimization challenge, directories can be surprisingly complicated to produce. Some span countries (such as the Arby’s example from earlier), and countries have different geographical browsing conventions. For example, the USA conventionally has 2 levels of hierarchy: “State > City”, while England uses just the City (e.g. Burger King’s directory). There are also various cases to implement for the benefit of user experience, like sending the user directly to the entity page from directory nodes that only contain a single entity.
- Related records - Incorporating related data on pages using a relational database can be slow. Each page requires many related records, each of which must be located in an index containing all records in the system. Large sites can spend a huge amount of time just loading data. Graph databases address exactly this problem with a property called “index-free adjacency”. This is something we are actively prototyping, although we haven’t yet incorporated it into the production system.
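For the Nearby Locations item above, here is a sketch of standard geohash encoding together with a prefix lookup over a sorted slice. The `entry` type, the `withPrefix` helper, and the in-memory slice are assumptions for illustration; the post does not say how the hashes are stored or queried.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

// geohash encodes a latitude/longitude pair by repeatedly halving the
// longitude and latitude ranges, interleaving one bit from each
// (longitude first) and emitting a base32 character every five bits.
func geohash(lat, lon float64, precision int) string {
	latLo, latHi := -90.0, 90.0
	lonLo, lonHi := -180.0, 180.0
	var b strings.Builder
	bits, ch := 0, 0
	useLon := true
	for b.Len() < precision {
		if useLon {
			mid := (lonLo + lonHi) / 2
			if lon >= mid {
				ch, lonLo = ch<<1|1, mid
			} else {
				ch, lonHi = ch<<1, mid
			}
		} else {
			mid := (latLo + latHi) / 2
			if lat >= mid {
				ch, latLo = ch<<1|1, mid
			} else {
				ch, latHi = ch<<1, mid
			}
		}
		useLon = !useLon
		bits++
		if bits == 5 {
			b.WriteByte(base32[ch])
			bits, ch = 0, 0
		}
	}
	return b.String()
}

// entry pairs a stored geohash with an entity ID (hypothetical schema).
type entry struct{ hash, id string }

// withPrefix expects entries sorted by hash. It binary-searches for the
// first hash >= prefix and collects IDs while the prefix still matches.
func withPrefix(sorted []entry, prefix string) []string {
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i].hash >= prefix })
	var ids []string
	for ; i < len(sorted) && strings.HasPrefix(sorted[i].hash, prefix); i++ {
		ids = append(ids, sorted[i].id)
	}
	return ids
}

func main() {
	entries := []entry{
		{geohash(42.6, -5.6, 8), "store-a"},
		{geohash(42.61, -5.59, 8), "store-b"},
		{geohash(40.4168, -3.7038, 8), "store-c"},
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].hash < entries[j].hash })

	fmt.Println(geohash(42.6, -5.6, 5)) // ezs42
	fmt.Println(withPrefix(entries, geohash(42.6, -5.6, 4)))
}
```

A shared prefix implies proximity, but the reverse does not hold: two points on opposite sides of a cell boundary can be close together and still have different prefixes, so a complete nearby search also has to check neighboring cells.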
Comparison to JAMStack
A JAMStack site, which serves static HTML and loads its data via API from the browser, is easier to implement because it does not require a robust solution to efficiently processing updates, but we believe it doesn’t go far enough. Although the HTML for a JAMStack entity page could be returned very quickly, the main point of the page is the entity’s data, which would still be loaded via API from the browser. That loses the latency and reliability benefits that would otherwise be endowed by static serving, and it is bad for SEO.
To be continued…
In this post, you’ve gotten an idea of how the content for a site gets produced and the supporting systems that make it possible. In the next post, we’ll share how that content actually arrives at users’ browsers. Stay tuned!