Last year, I wrote a post about how we were planning to replace our aging location data search system (ProfileSearchServer v1) with an updated version that used Elasticsearch as a backend.
To recap, ProfileSearchServer (PSS) is a critical system for us. It is the key service behind the main entry page into our Knowledge Manager product, and it powers much of the search functionality for our Pages product, among many other use cases. The problem we confronted last year was that v1 was built on an in-memory architecture that had reached the limit of its ability to scale. Thus, we decided to build a new version using Elasticsearch as the primary storage mechanism instead of memory.
When I wrote the post in February of last year, we had just finished evaluating and prototyping Elasticsearch and had decided to go forward with the replacement. But how could we carry out the complete substitution of a critical production system in a way that minimized risk?
To confront this, we developed a strategy based on three core techniques: integration testing, response comparison, and proxy-based deployment.
Testing is of course crucial to the successful development of any software system, but when replacing an existing, critical system it takes on even greater importance. For PSS v2, we specifically focused our energy on having a complete set of integration tests. Integration tests sit in the testing hierarchy between unit tests, which test logic in isolation from outside systems, and acceptance tests, which are broad, high-level tests that confirm general functionality. Integration tests, by contrast, test the interaction of multiple modules, but in a relatively focused way.
For PSS2, we were able to write integration tests for every type of interaction we expected the system to have with Elasticsearch. To make this possible, rather than running these tests against a single static Elasticsearch instance, we used a pool of lightweight Docker instances, each running its own Elasticsearch cluster. Each suite of related integration tests grabs an instance from the pool, creates a fresh index, uploads test data, tests the relevant set of queries, and then destroys the index and returns the instance to the pool once complete.
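The acquire, set up, test, and tear down cycle described above can be sketched roughly as follows. This is only an illustration, not the actual PSS2 test harness: FakeSearchInstance is a hypothetical in-memory stand-in for a Dockerized Elasticsearch instance, and InstancePool and fresh_index are names invented for this example.

```python
import queue
import uuid
from contextlib import contextmanager


class FakeSearchInstance:
    """Hypothetical in-memory stand-in for one Dockerized Elasticsearch instance."""

    def __init__(self, name):
        self.name = name
        self.indices = {}  # index name -> list of documents

    def create_index(self, index):
        self.indices[index] = []

    def bulk_upload(self, index, docs):
        self.indices[index].extend(docs)

    def search(self, index, field, value):
        return [doc for doc in self.indices[index] if doc.get(field) == value]

    def delete_index(self, index):
        self.indices.pop(index, None)


class InstancePool:
    """Hands out instances so that no two test suites share one at the same time."""

    def __init__(self, instances):
        self._free = queue.Queue()
        for instance in instances:
            self._free.put(instance)

    @contextmanager
    def fresh_index(self, docs, timeout=30):
        # Wait until some instance is free (raises queue.Empty on timeout).
        instance = self._free.get(timeout=timeout)
        index = f"test-{uuid.uuid4().hex}"  # unique name, so suites never collide
        try:
            instance.create_index(index)
            instance.bulk_upload(index, docs)
            yield instance, index
        finally:
            # Always clean up and hand the instance back, even if a test fails.
            try:
                instance.delete_index(index)
            finally:
                self._free.put(instance)


# Example suite: load test data, run a query, let the pool clean up afterwards.
pool = InstancePool([FakeSearchInstance("es-0"), FakeSearchInstance("es-1")])
test_docs = [{"id": 1, "city": "New York"}, {"id": 2, "city": "Chicago"}]

with pool.fresh_index(test_docs) as (instance, index):
    hits = instance.search(index, "city", "Chicago")
    assert [hit["id"] for hit in hits] == [2]
```

Because cleanup lives in the finally blocks, a failing test still destroys its index and returns the instance, which keeps one broken suite from starving the others.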
These integration tests represented a wide variety of interactions with Elasticsearch, including index creation, document updating, and querying. Every possible type of input query was given its own integration test, so we were able to be very confident that successful passage of our tests implied proper functioning of our system.
In addition to exhaustive automated testing, another technique we used to ensure that PSS2 behaved correctly was simply to see how its responses compared to those of PSS1. As shown in the diagram below, we did this via a module we called the differ. The differ was originally created by our listings team when they did a similar replacement of a live system, and for this project we expanded it a bit and used it extensively. Essentially, after completing each request, the original system (PSS1) sends out a message containing the original request and response. The differ then calls the new system (PSS2) with the same request, compares the result with the original response, and produces configurable reports and metrics on any differences.
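To make the flow concrete, here is a minimal sketch of what such a differ could look like. It is not the real differ: the message shape (a dict with "request" and "response" keys), the ignored-fields option, and every name below are assumptions made for illustration.

```python
from collections import Counter

_MISSING = object()  # marks a key that is present on one side only


def diff_values(old, new, path="", ignored=frozenset()):
    """Return (path, old, new) tuples for every place two JSON-like values differ."""
    if path in ignored:
        return []
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(set(old) | set(new)):
            child_path = f"{path}.{key}" if path else key
            diffs += diff_values(
                old.get(key, _MISSING), new.get(key, _MISSING), child_path, ignored
            )
        return diffs
    return [] if old == new else [(path, old, new)]


class Differ:
    """Replays requests from the old system against the new one and records diffs."""

    def __init__(self, call_new_system, ignored=frozenset()):
        self.call_new_system = call_new_system  # callable: request -> response
        self.ignored = ignored                  # field paths expected to differ
        self.metrics = Counter()
        self.report = []

    def handle(self, message):
        request, old_response = message["request"], message["response"]
        new_response = self.call_new_system(request)
        diffs = diff_values(old_response, new_response, ignored=self.ignored)
        self.metrics["compared"] += 1
        if diffs:
            self.metrics["mismatched"] += 1
            self.report.append({"request": request, "diffs": diffs})
        return diffs


# Hypothetical stand-in for PSS2; a real differ would make a network call here.
def fake_pss2(request):
    return {"ids": [1, 2], "total": 2, "took_ms": 7}


differ = Differ(fake_pss2, ignored={"took_ms"})
differ.handle({"request": {"q": "pizza"},
               "response": {"ids": [1, 2], "total": 2, "took_ms": 31}})
differ.handle({"request": {"q": "tacos"},
               "response": {"ids": [1], "total": 1, "took_ms": 12}})
assert differ.metrics["compared"] == 2
assert differ.metrics["mismatched"] == 1
```

Ignoring fields that are expected to differ, such as timings, is one way to keep the report free of noise, so that what remains points at real behavioral differences.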
One question this approach may raise is “what if the original had bugs?” There are a couple of answers to this. The first is that the current system presumably works relatively well (in terms of correctness, at least), so using it as a starting point to get a broad idea of whether your new system is working correctly probably isn’t so bad. The second is that this process will often expose bugs previously hidden in the original when differences between it and the new version are investigated. The developer can then choose to reproduce the bug in the new system (if only to filter out the noise from the differ, so that it truly shows errors in the new version), or she can use this opportunity to fix the old system for the lifespan it has left.
An important feature of any deployment strategy is the ability to quickly roll back to the previous version of the system if there is a problem with the new one. Since we use Zookeeper to provide service discovery, one option we had was to use service discovery to manage the deployment of the new system, so clients would not notice any change, and the switchover would simply be handled by Zookeeper. However, not all of our clients use service discovery, so for those clients we would still need to do a redeploy to pick up the new service. Further, while in theory this approach was doable, there were some nuances of our configuration that made it slightly complicated, and certainly not something that could be easily done by a typical developer (as opposed to, say, a devops engineer).
Thus, we opted for a slightly different approach, as shown below. We first added code to the new server that would proxy calls to the old server based on a flag in the database. To start, we set the flag to PROXY=true. Next, we redeployed all existing clients to hit the new server instead of the old. Once this was complete, we validated that the proxy was working correctly, and then took our time doing final checks before setting the flag to PROXY=false, effectively deploying the new version of the service. While in practice we never needed to switch back to the old version, at any time after the deploy we could have gone back by simply switching the flag back. Finally, after several weeks, when we were happy that the new system was working well, we turned off the old PSS1 servers, deleted the old code, and removed the proxying logic from PSS2.
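The flag-controlled proxy can be sketched in a few lines. Again, this is an illustration under assumptions rather than the actual PSS2 code: FlagStore stands in for the database, and NewServer, call_old_server, and search_new_backend are hypothetical names.

```python
class FlagStore:
    """Hypothetical stand-in for the database table holding the PROXY flag."""

    def __init__(self, proxy=True):
        self._flags = {"PROXY": proxy}

    def get(self, name):
        return self._flags[name]

    def set(self, name, value):
        self._flags[name] = value


class NewServer:
    """New service that forwards to the old one for as long as PROXY is true."""

    def __init__(self, flags, call_old_server, search_new_backend):
        self.flags = flags
        self.call_old_server = call_old_server        # callable: request -> response
        self.search_new_backend = search_new_backend  # callable: request -> response

    def handle(self, request):
        # The flag is read on every request, so flipping it takes effect
        # immediately, without redeploying the server or its clients.
        if self.flags.get("PROXY"):
            return self.call_old_server(request)
        return self.search_new_backend(request)


flags = FlagStore(proxy=True)
server = NewServer(
    flags,
    call_old_server=lambda request: f"v1:{request}",
    search_new_backend=lambda request: f"v2:{request}",
)

assert server.handle("pizza") == "v1:pizza"  # clients hit PSS2, PSS1 still answers
flags.set("PROXY", False)                    # the actual cutover
assert server.handle("pizza") == "v2:pizza"
flags.set("PROXY", True)                     # rollback is the same one-line change
assert server.handle("pizza") == "v1:pizza"
```

A real server would probably cache the flag briefly rather than query the database on every request, trading a few seconds of switchover delay for less database load.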
The minimization of risk should be a key consideration whenever making large changes to pre-existing systems, including replacing them entirely. The above techniques helped us have a successful upgrade to such a system, and hopefully can help you if you ever find yourself planning such a replacement.