Strategies For Breaking Up Commits

As software engineers, a major part of our job involves the breakdown of work into smaller units. We know that just as a complex organism must be composed of many individual cells, so even the most grand and ambitious of software projects will at the end of the day be accomplished as a series of commits. Deciding how to organize your changes into commits is a very open-ended problem with many potential solutions, but an ideal solution will follow these three guidelines:

  1. Each commit should accomplish just one purpose, and unrelated changes should not live together in the same commit.
  2. A commit should be a self-contained unit; it should not depend on other, future commits to avoid breaking the codebase.
  3. Each commit should be small enough for a code reviewer to read through and understand in its entirety without feeling overwhelmed.

The specifics of how to best accomplish these goals may depend on the design patterns you are following. At Yext, we use Domain-Driven-Design (DDD), including the Repository pattern. Implementation of a new server endpoint will typically consist of the following pieces:

  1. Domain models
  2. Repositories (or repos), which are interfaces encapsulating the required logic for accessing data sources
  3. Repo implementations, which are implementations of the repos backed by a specific data source (e.g. a database)
  4. Operation classes, which encapsulate the business logic required for the endpoint and often use one or more repos
  5. Unit tests for the operations
  6. Integration tests for the repo implementations

With that in mind, I’d like to take a look at two commonly used commit strategies for implementing such a new endpoint, evaluate these strategies in light of the three guidelines mentioned above, and propose a third strategy that fixes some of the shortcomings of the first two.

Strategy #1: “Goliath”

The strategy here is that there is no strategy: just implement the endpoint in its entirety, including the operation class, models, repos, implementations, unit and integration tests, in one big commit. This strategy satisfies guidelines 1 and 2: it has a single purpose (to implement the endpoint) and it is a self-contained unit. The problem is with guideline 3: this commit will be large and tedious to review. If you’ve ever reviewed a massive commit you know it takes much more mental energy than a smaller one, and it’s easier for issues to slip through the cracks. Thus, this strategy works well for the writer of the code but makes life harder for the reviewer.

Strategy #2: “Test Later”

This strategy involves two commits. The first contains the full implementation of the endpoint, but without unit or integration tests, which come in the second commit. This helps alleviate the main problem with the Goliath strategy, but it introduces a new problem: the first commit is no longer a fully self-contained unit. Because it doesn’t have any tests, it’s more likely to contain bugs, and shipping it could result in broken code in production. Even if this does not happen, the code from the first commit may still have to be revisited in the second commit if tests reveal bugs or non-ideal behavior. This means the Test Later strategy doesn’t quite satisfy guideline 2.

Strategy #3: “Logic-then-Infrastructure”

This third strategy also involves two commits, but they are structured differently. The first commit includes all the “Logic” (the operation class, the domain models, the repo interfaces, and the unit tests), while the second commit creates the “Infrastructure” (repo implementations, integration tests, and finally the actual server endpoint code that instantiates/runs the operation). Note the following characteristics of this strategy:

  1. Each commit has just one purpose, either to create the business/domain logic for the endpoint or to create the infrastructure layer for the endpoint.
  2. Both the operation and the repo implementations are shipped with their corresponding tests, therefore untested code is never shipped.
  3. Each commit is smaller and easier to review than a single monolithic commit would be.

Therefore, this strategy helps us satisfy all three of the guidelines mentioned above. As an added bonus, writing the business logic and models before the infrastructure is in place helps limit assumptions in the domain layer. For these reasons, I believe Logic-then-Infrastructure is the best of these three approaches and has great potential to ensure that code is both well-tested and easily-reviewable.

Implementing an example endpoint using Logic-then-Infrastructure

Say we want to implement a server endpoint for creating reviews, where a review consists of an author name, a rating, and some content. Our first (Logic) commit would create the domain model (if it did not already exist):

public class Review {
  private final String authorName;
  private final int rating;
  private final String content;
  // Constructor, getters, etc.
}

Add a repo interface method:

public interface ReviewsRepo {
  void createReview(Review review);
}

Create the operation (note that the operation contains some business logic):

public class CreateReviewOperation {
  private final ReviewsRepo reviewsRepo;

  public CreateReviewOperation(ReviewsRepo reviewsRepo) {
    this.reviewsRepo = reviewsRepo;
  }

  public void run(Review review) {
    if (review.getRating() < 1 || review.getRating() > 5) {
      throw new IllegalArgumentException("Invalid rating");
    }
    if (Strings.isNullOrEmpty(review.getAuthorName())) {
      review.setAuthorName("Anonymous Reviewer")
    }
    reviewsRepo.createReview(review);
  }
}

And add some unit tests for the operation:

public class CreateReviewOperationTest {
  private ReviewsRepo reviewsRepo;
  private CreateReviewOperation operation;

  @Test
  public void testCreatesReview() {
    // Test code
  }

  // Other tests
}

At this point, all this code compiles and the unit tests should pass, because the operation doesn’t depend on the repo implementation code.

Our second (Infrastructure) commit would contain a repo implementation backed by a specific resource (in this case a database):

public class DbReviewsRepo implements ReviewsRepo {
  private final Connection connection;

  public DbReviewsRepo(Connection connection) {
    this.connection = connection;
  }

  public void createReview(Review review) {
    //Implementation of createReview that writes to the provided database
  }
}

Integration tests for this repo implementation:

public class DbReviewsRepoIT {
  private DbReviewsRepo repo;

  @Test
  public void testCreateReview() {
    // Test code
  }

  // Other tests
}

And finally, the endpoint itself:

public class ReviewServer {

  public void createReviewEndpoint(Request request) {
    Review review = transformToReview(request);
    try (Connection connection = getReviewsDatabase()) {
      new CreateReviewOperation(new DbReviewsRepo(connection)).run(review);
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }
}