When you hear “distributed systems” you might think of big clusters of machines and large AWS bills, but the same rules apply to systems of all sizes. Whenever you have two or more machines communicating with each other over a network, they aren’t always going to agree on everything. If you’re not careful, you can run into consistency issues in a system as small as two machines.
A Toy Example
Let’s consider a toy example. Imagine you’ve built a social media site that allows your users to share posts with other users.
Let’s pretend we’re using MySQL to store our data. Here’s what our table looks like in MySQL. A post has an ID, a user ID, and text content.
CREATE TABLE posts (
    id INT(11) NOT NULL AUTO_INCREMENT,
    user_id INT(11),
    content VARCHAR(300),
    create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id),
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
Here’s an insert operation. It creates a new post with the given user ID and content in our database. The database handles creating the ID for the post.
INSERT INTO posts (user_id, content) VALUES (:user_id, :content);
If we have our database on a separate server from our application server, sometimes posts will fail because of network issues.
But we can fix this, right? Just add retry logic: now if a post fails because of a network issue, the code will simply retry until the post is created.
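A minimal sketch of that retry logic in Python. The `insert_post` function here is a hypothetical stand-in for the network call that runs the INSERT statement; assume it raises `TimeoutError` on a network failure:

```python
def create_post_with_retry(insert_post, user_id, content, max_attempts=5):
    """Keep retrying the insert until the database confirms success.

    `insert_post` is a stand-in for the network call that runs the
    INSERT; it raises TimeoutError when the network fails.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return insert_post(user_id, content)
        except TimeoutError as exc:
            last_error = exc  # network issue -- try again
    raise last_error
```

This looks harmless, but as we’re about to see, retrying a non-idempotent insert is exactly how duplicates appear.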
Everything seems to work great until your users start complaining about a new problem. Sometimes when a user creates a post, multiple copies of the same post are created. What’s going on?
The key thing to note here is that a network failure doesn’t always mean we failed to create a post. When a request to create a post times out, one of two things may have happened. Either the request failed to reach the database and create a post, or the request created a post and the response from the database didn’t make its way back to our application server. There’s no way for our application server to know which of the two scenarios happened.
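To make the second scenario concrete, here is a small simulation (all names hypothetical): the INSERT reaches the database and commits, but the success response is lost on the way back, so the application server’s retry inserts a second copy:

```python
class FakeDatabase:
    """Simulates a database sitting behind an unreliable network."""

    def __init__(self):
        self.posts = []
        self.drop_next_response = False

    def insert_post(self, user_id, content):
        # The INSERT itself reaches the database and commits...
        self.posts.append((user_id, content))
        # ...but the success response can be lost on the way back.
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost in the network")
        return len(self.posts)

db = FakeDatabase()
db.drop_next_response = True
try:
    db.insert_post(1, "hello world")
except TimeoutError:
    # The application server only sees a timeout, so it retries...
    db.insert_post(1, "hello world")
# ...and the same post now exists twice.
```

From the application server’s point of view both runs look identical: a timeout, then a retry. Only the database knows the first insert actually landed.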
There are two ways to solve this problem. The easiest is to just not retry. Depending on your product requirements this can be the right thing to do. We can show an error message to the user and have them refresh the page and try again if they don’t see their post.
The other option is to retry, but make the operation idempotent. If an operation is idempotent, a client can make the same call repeatedly and get the same result as making it once. This means we can have the application server retry over and over until it gets a success response from our database.
In this case, we can use universally unique identifiers. UUIDs can be generated without coordination and are (with very high probability) guaranteed to be unique if generated correctly. Each post will be given its own UUID by our application server.
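In Python, for example, the standard library’s `uuid` module generates version-4 UUIDs from random bits, with no coordination between servers:

```python
import uuid

# Each uuid4() call draws 122 random bits, so collisions between any
# two application servers are astronomically unlikely.
post_uuid = str(uuid.uuid4())
another_uuid = str(uuid.uuid4())
```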
We change the posts table to include a UUID column with a unique key constraint, and we change our insert operation to handle duplicate keys gracefully.
INSERT INTO posts (uuid, user_id, content)
VALUES (:uuid, :user_id, :content)
ON DUPLICATE KEY UPDATE uuid=:uuid;
Our application server will generate the UUID when it wants to create a post and retry the insert statement until it gets a success response from our database. There will always be exactly one post created.
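Putting it together, here is a sketch of the idempotent version (again with hypothetical names): the application server generates the UUID once and reuses it on every retry, and the fake database mimics the unique key constraint with `INSERT ... ON DUPLICATE KEY UPDATE` semantics, so a retried insert can never create a second row:

```python
import uuid

class FakeDatabase:
    """Simulates the posts table with a unique key on uuid."""

    def __init__(self):
        self.posts = {}  # uuid -> (user_id, content)
        self.drop_next_response = False

    def upsert_post(self, post_uuid, user_id, content):
        # Mimics INSERT ... ON DUPLICATE KEY UPDATE: a repeated UUID
        # hits the unique key and does not create a second row.
        self.posts.setdefault(post_uuid, (user_id, content))
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost in the network")

def create_post(db, user_id, content, max_attempts=5):
    post_uuid = str(uuid.uuid4())  # generated once, reused on every retry
    for _ in range(max_attempts):
        try:
            db.upsert_post(post_uuid, user_id, content)
            return post_uuid
        except TimeoutError:
            continue  # safe to retry: the operation is idempotent
    raise TimeoutError("could not reach the database")
```

Even when the first attempt commits and its response is dropped, the retry matches the same UUID and the table ends up with exactly one row for the post.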