Here at Yext, we make it easy for our customers to manage their location data. A very important aspect of this is the ability to quickly and powerfully search this data.
For the last four years or so, we’ve been using a custom-built server called ProfileSearchServer (PSS) to search our location data. PSS mainly searches over continuously updated in-memory data, but can also make external queries to our datastore for more rarely used fields. That said, since the most common searches involve in-memory fields, searches are generally very fast.
While the performance and functionality of PSS has met our customers’ needs, we are always looking for ways to make our systems more scalable and powerful. Specifically, while the in-memory model is great for performance, it puts a hard limit on system capacity. As the number of locations in the system grows, ProfileSearchServer requires more and more JVM heap space (we currently allocate about 90 GB per instance), and issues like garbage collection become trickier and trickier to manage. Further, loading all this data upon restart of the server takes increasingly long, making it difficult to deploy code updates quickly.
As a long-term replacement for the current version of ProfileSearchServer, we’ve decided to develop a new search system based on Elasticsearch (ES). ES is a distributed, Lucene-based search engine that is currently the most popular enterprise search solution. Instead of linearly searching over documents (as is done in PSS), ES maintains a disk-based inverted index mapping terms to documents, allowing sublinear scaling.
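To give some intuition for the difference, here is a minimal toy sketch of an inverted index (this is illustrative only, not how ES or Lucene actually implement it): rather than scanning every document for a term, we look the term up directly and get back the documents that contain it.

```java
import java.util.*;

// A toy inverted index: maps each term to the IDs of documents containing it.
// A lookup is a single hash-map access plus the size of the posting list,
// rather than a linear scan over every document (as in PSS-style search).
public class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the IDs of documents containing the term; empty if none.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "Yext New York office");
        index.add(2, "Chicago office");
        System.out.println(index.search("office")); // both documents match
    }
}
```

Real inverted indices add term dictionaries, compressed posting lists, and scoring on top of this idea, but the core access pattern is the same.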
ES has features that make it ideal for a replacement of PSS. For one, ES has Near Real Time (NRT) search, so documents that are ingested are available almost immediately for search. Further, as described below, ES has advanced caching for filters, which is a key use case for PSS. ES also has a rich Java API, making it simple to integrate into our existing codebase. Finally, ES is built to scale. Horizontal scaling is essentially as simple as installing ES on a new node in the network.
Before going forward with a full-fledged migration to ES, we built a prototype to demonstrate how an ES-based search server would work with our data and use cases. While PSS supports several different types of queries, for our prototype we chose our geography-based search as a case study, since it allowed us to demonstrate several different aspects of ES functionality. We show below how we implemented two such geo-type queries.
The Contains Query
The contains query is a type of geo query that is based on matching the text in a user query to a geographical place name. A document is considered a match if any of its address fields match the given string. Below we see how we’d implement this as an ES-based query via the Java API:
BoolQueryBuilder queryBuilder = boolQuery();
queryBuilder.should(matchQuery("primary_profile.country.value", country.getCountryCode()))
    .should(matchPhraseQuery("primary_profile.address.value", text))
    .should(matchPhraseQuery("primary_profile.address2.value", text))
    .should(matchPhraseQuery("primary_profile.address3.value", text))
    .should(matchPhraseQuery("primary_profile.sublocality.value", text))
    .should(matchPhraseQuery("primary_profile.city.value", text))
    .should(matchPhraseQuery("primary_profile.state.value", text))
    .should(matchQuery("primary_profile.postal_code.value", text));
boolQuery() is a method in the ES API which returns a BoolQueryBuilder, an object which allows you to build up arbitrary boolean logic within queries. matchPhraseQuery(...) allows us to match a phrase, while matchQuery(...) will match on individual terms1. Finally, should(...) should be interpreted like “or”: the document is a match if any of the criteria are met.
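For reference, the builder above corresponds roughly to the following Elasticsearch query DSL (abbreviated to a few of the clauses; "US" and "Brooklyn" are placeholder inputs standing in for the country code and query text):

```json
{
  "query": {
    "bool": {
      "should": [
        { "match":        { "primary_profile.country.value": "US" } },
        { "match_phrase": { "primary_profile.address.value": "Brooklyn" } },
        { "match_phrase": { "primary_profile.city.value": "Brooklyn" } },
        { "match":        { "primary_profile.postal_code.value": "Brooklyn" } }
      ]
    }
  }
}
```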
The Lat/Long Query
The Lat/Long query allows us to find locations which are within a given radius of a certain point. Currently in PSS, we do this by doing a distance calculation based on the query lat/long and a candidate location’s lat/long. ES has a sophisticated geolocation library that allows us to replicate this query, as shown here:
queryBuilder.filter(
    geoDistanceQuery("geo_pin")
        .point(latitude, longitude)
        .distance(filter.getMiles(), DistanceUnit.MILES));
geo_pin is the name of an object field in our data that has latitude and longitude subfields. The geoDistanceQuery(...) method builds an ES query based on this field, in comparison with the supplied latitude and longitude from our user search query. Also note that, as opposed to should(...) in the example above, all queries supplied in a filter(...) method chain are required to match.
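For intuition about what such a radius filter computes, here is a rough sketch of the kind of point-to-point distance calculation described above, using the standard haversine formula. This is illustrative only, not the actual PSS or ES implementation; the class and method names are our own.

```java
// Great-circle distance between two lat/long points via the haversine formula.
// Illustrative sketch only -- not the actual PSS or ES implementation.
public class GeoDistance {
    private static final double EARTH_RADIUS_MILES = 3958.8;

    public static double haversineMiles(double lat1, double lon1,
                                        double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    // A candidate location matches if it lies within the requested radius.
    public static boolean withinRadius(double queryLat, double queryLon,
                                       double docLat, double docLon,
                                       double miles) {
        return haversineMiles(queryLat, queryLon, docLat, docLon) <= miles;
    }

    public static void main(String[] args) {
        // Lower Manhattan to Times Square: roughly 3.3 miles.
        System.out.println(haversineMiles(40.7128, -74.0060, 40.7580, -73.9855));
    }
}
```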
Since ES represents a substantial departure from our current implementation, one of the biggest questions we set out to answer with the prototype was that of performance, especially since PSS documents are stored in memory, while ES stores its indices on disk.
To evaluate these differences, we ran an experiment consisting of 2000 different search queries of various types, including different limits on the max number of results that ES was instructed to return. In addition to the contains and lat/long queries described above, we also included another geo query type called in (which includes locations within any given geographic region). Finally, “simple” represents searches over all the different fields in the document.
Below we see a graph of mean query time for ES (“elastic”) and PSS (“profile”) for the different types of searches we implemented in the prototype, with result sets limited to a max of 25:
We see that while the ES-based geo filters take significantly longer than the PSS implementations, the average times, in absolute terms, are all quite acceptable (<35 ms). Further, for simple search, ES even outperforms the current implementation.
Even though the ES times are satisfactory, you may still wonder why PSS performs better on some searches than ES. A significant portion of this difference is the overhead required to convert the documents returned from ES (which are in JSON) to our internal serialization format (protocol buffers). Below is a scatter plot of query time vs. number of results returned, this time with no limit imposed on the size of the result set:
The above chart shows a clear linear relationship between result size and query time, with a slope of about 1.25 ms / result. As the contains and in queries returned approximately 5 and 7 results, respectively, in the first timing experiment above, this accounts for about 6-9 ms of the gap between ES and PSS for these query types. Luckily, our key performance sensitive use cases all have document limits (e.g. 25) as part of their requests, mitigating the performance hit due to this encoding overhead.
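The arithmetic behind that estimate can be sanity-checked directly (the 1.25 ms/result slope and the 5 and 7 result counts come from the experiments above; the class and method names here are our own):

```java
// Back-of-the-envelope estimate of JSON -> protocol buffer conversion
// overhead, using the ~1.25 ms/result slope from the scatter plot.
public class OverheadEstimate {
    static final double MS_PER_RESULT = 1.25;

    static double overheadMs(int resultCount) {
        return resultCount * MS_PER_RESULT;
    }

    public static void main(String[] args) {
        System.out.println(overheadMs(5)); // contains query: ~6.25 ms
        System.out.println(overheadMs(7)); // in query: ~8.75 ms
    }
}
```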
While these experiments helped us gauge the feasibility of using Elasticsearch in our system, they also demonstrated interesting considerations one needs to make in designing a search application. For example, it turns out that in some cases, serialization and encoding overhead may actually matter more to performance than the within-index search time itself.
Overall, the prototype demonstrates that ES represents a viable long-term solution for Yext’s search needs. Using it, we will be able to recreate the logic in our current solution, and even before any optimization, get satisfactory, and in some cases even improved, performance. Going forward, we will look to improve this performance even further by experimenting with things like ES cluster configurations and alternative serialization techniques.
If you are curious about the title…
Typically in ES, you see filters used with a “term” query. The “match” and “matchPhrase” queries run the search query through an analysis phase, which allows for variation in case, etc., and thus makes more sense for this type of filter. ↩