Scripts are programs that are run as needed in order to accomplish such tasks as migrating or correcting data. As such, they have very different requirements from long-running server-side applications. We’ll discuss some script-specific demands and present some possible strategies for meeting them.

Aside: at Yext we deal a lot with businesses that have locations, so they’ll pop up a bunch in my examples.

Divide (and then maybe combine) and conquer

It is important to determine your script’s ‘unit of work.’ The specifics of this will vary, but in general we’ll say that this is the smallest indivisible thing on which your script’s task could operate. This might be a location, a business, a file, or a database row.

Once you know what your unit of work is (it is often rather obvious), you should see if you can batch your operations. The best batches are both logical and neither too big nor too small. By batching, we hope to save time on common operations, and by keeping our batches at a reasonable size, we hope to gain parallelizability and failure tolerance (more on these later). You might batch locations by business, batch files by date, or batch database rows by some constant group size. If your units don’t really make sense to batch, then don’t batch them.
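To make the two batching styles concrete, here is a minimal sketch. The Location class, its businessId field, and the method names are all hypothetical, made up for illustration; your real unit of work will look different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Batching {
  // Hypothetical unit of work: a location that belongs to a business.
  static final class Location {
    final long id;
    final long businessId;

    Location(long id, long businessId) {
      this.id = id;
      this.businessId = businessId;
    }
  }

  // Logical batches: group locations by the business that owns them.
  static Map<Long, List<Location>> byBusiness(List<Location> locations) {
    Map<Long, List<Location>> batches = new LinkedHashMap<>();
    for (Location location : locations) {
      batches.computeIfAbsent(location.businessId, k -> new ArrayList<>()).add(location);
    }
    return batches;
  }

  // Fixed-size batches: split any list into chunks of at most batchSize units.
  static <T> List<List<T>> bySize(List<T> units, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < units.size(); start += batchSize) {
      int end = Math.min(start + batchSize, units.size());
      batches.add(new ArrayList<>(units.subList(start, end)));
    }
    return batches;
  }
}
```

Grouping by business gives batches that share setup work (say, loading the business once), while fixed-size chunks are the fallback when no natural grouping exists.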

Sometimes what you need to do is too involved to reduce to this paradigm. In such cases, it might make more sense to write multiple simple scripts rather than a single complex one. For example, if you wanted to transform some files into CSVs (maybe you have 1,000s of files to reduce to 100s of CSVs) and then load the CSVs into a database, it could make more sense to write one script to transform the files and another to load the resulting CSVs into the database.

Concurrency

With all our stuff nicely batched, we can often execute multiple batches concurrently. This helps us mitigate blocking operations and exceptionally slow batches. In Java, this might look like:

// Determine best number of threads via trial and error
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
batches.forEach(batch -> executor.submit(new BatchRunnable(batch)));

// Wait for everything to finish
executor.shutdown();
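while (!executor.isTerminated()) {
  try {
    executor.awaitTermination(500, TimeUnit.MILLISECONDS);
  } catch (InterruptedException ex) {
    // Restore the interrupt flag and stop waiting instead of swallowing it
    Thread.currentThread().interrupt();
    break;
  }
}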

Error / Exception handling

Error handling is easy to overlook, but it is important to get right. An error while processing a batch means the batch wasn’t properly processed; we need a robust and exhaustive way to deal with this. When writing a script, you should assume that you will encounter errors.

Perhaps the most obvious approach is to log when things fail. But if the log isn’t written or the program crashes, we are left with an incomplete record. Instead of logging our progress on failure, we will log it on success: anything not recorded as done is assumed to still need processing, which makes recovering from errors and restarting our script much simpler. It is vital that either our operation is idempotent or we have a way to 100% guarantee we don’t mistakenly reprocess things. A set processing order is also often necessary in order to make restarting simpler.

Some approaches based on these ideas follow.

Approach 1: crash the program on error

Crashing the program on Exception makes sense where a script is short, data is basically valid, and Exceptions are caused by bugs. When the script crashes, the bug should be fixed and the script started where it left off:

static LinkedHashSet<Object> idsToProcess;

static void run(Object id) {
  try {
    doTask(id);
    markCompleted(id);
  } catch (Throwable t) {
    logger.error("Unable to do task for id " + id, t);
    System.exit(1);
  }
}

static synchronized void markCompleted(Object id) {
  idsToProcess.remove(id);
  if (!idsToProcess.isEmpty()) {
    logger.info("First id remaining: " + idsToProcess.iterator().next());
  }
}

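public static void main(String[] args) {
  // Parse args, etc. firstIdRemaining presumably comes from the last
  // "First id remaining" log line of the previous run.
  idsToProcess = getIdsToProcess(firstIdRemaining);
  // ...
}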
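One possible shape for getIdsToProcess is sketched below. It assumes the ids come in a stable order (here, a list passed in as an extra argument) and that firstIdRemaining is null on a fresh run; both the class name and the two-argument signature are hypothetical, not the original script’s.

```java
import java.util.LinkedHashSet;
import java.util.List;

final class ResumePoint {
  // Returns the ids from firstIdRemaining onward, preserving the fixed order.
  // A null firstIdRemaining means a fresh run, so every id is returned.
  static LinkedHashSet<Object> getIdsToProcess(List<?> allIdsInOrder, Object firstIdRemaining) {
    LinkedHashSet<Object> ids = new LinkedHashSet<>();
    boolean reached = (firstIdRemaining == null);
    for (Object id : allIdsInOrder) {
      if (!reached && id.equals(firstIdRemaining)) {
        reached = true;
      }
      if (reached) {
        ids.add(id);
      }
    }
    if (!reached) {
      // Failing loudly beats silently processing nothing.
      throw new IllegalArgumentException("Unknown resume id: " + firstIdRemaining);
    }
    return ids;
  }
}
```

If batches run concurrently, some ids after the first remaining one may already have been finished before the crash, so restarting this way relies on doTask being idempotent.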

Approach 2: try to log errors, return to them later

This approach makes the most sense when the script is very long running and you want to finish the good cases as efficiently as possible, and come back later to clean up mistakes.

static void run(Object id) {
  try {
    doTask(id);
    markCompleted(id);
  } catch (Throwable t) {
    logger.error("Unable to do task for id " + id, t);
    try {
      // Saving failures saves us time if we need to restart
      markFailed(id);
    } catch (Throwable t2) {
      logger.error("Unable to mark id as failed " + id, t2);
    }
  }
}

static synchronized void markCompleted(Object id) throws IOException {
  write(completedIdsFile, id);
}

static synchronized void markFailed(Object id) throws IOException {
  write(failedIdsFile, id);
}

public static void main(String[] args) {
  // ... Initialization stuff
  Set<Object> completedIds = loadIds(completedIdsFile);
  Set<Object> failedIds = loadIds(failedIdsFile);
  Set<Object> idsToProcess = getIdsToProcess(completedIds, failedIds);
  // ...
}
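The helpers used above (write, loadIds, getIdsToProcess) could be sketched roughly as follows. This assumes ids are stored as one string per line in plain text files and that the full ordered id list is passed in as an extra argument; the class name and exact signatures are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class IdFiles {
  // Appends one id per line; APPEND mode keeps ids recorded by earlier runs.
  static synchronized void write(Path file, Object id) throws IOException {
    Files.write(file, List.of(id.toString()), StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  // Loads previously recorded ids; a missing file means nothing was recorded yet.
  static Set<String> loadIds(Path file) throws IOException {
    if (!Files.exists(file)) {
      return new HashSet<>();
    }
    return new HashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
  }

  // Keeps the original order but skips anything already completed or failed.
  static Set<String> getIdsToProcess(
      List<String> allIds, Set<String> completedIds, Set<String> failedIds) {
    Set<String> remaining = new LinkedHashSet<>(allIds);
    remaining.removeAll(completedIds);
    remaining.removeAll(failedIds);
    return remaining;
  }
}
```

Once the main run finishes, a cleanup pass can feed the contents of the failed ids file back in as its own id list.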

Running your script

You could simply compile and run your script, either on your machine or a remote server. But what happens if your terminal crashes or your machine goes to sleep, killing both your script and your logs?

tee

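tee copies its input to both the console and a file, so you can watch your output and still save your logs. Here’s a very fancy looking command (source) that does this separately for stdout and stderr; note that the >(...) process substitution syntax requires a shell such as bash: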

./myScript > >(tee stdout.log) 2> >(tee stderr.log >&2)

screen

screen allows you to leave sessions running on machines. It’s useful locally, but is especially nice for running your script on a remote machine: you can start a new screen, run your script, and not worry about your ssh connection dying or having to turn off your local machine.

Logging progress

When you write and run a script for someone, they’ll often, for some reason, want to know when your script will be finished. If you organize your script as I’ve recommended, it’s easy to intermittently log how much work has been done, how long doing that work took, and how long it’ll be until your script can stop doing work forever.
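A minimal sketch of such progress tracking is below. It assumes you know the total number of units up front and that future units will take about as long, on average, as completed ones; ProgressTracker and its methods are hypothetical names, not part of any library.

```java
import java.time.Duration;
import java.util.Optional;

final class ProgressTracker {
  private final long totalUnits;
  private final long startNanos;
  private long completedUnits;

  ProgressTracker(long totalUnits, long startNanos) {
    this.totalUnits = totalUnits;
    this.startNanos = startNanos;
  }

  // Call from worker threads each time a batch finishes.
  synchronized void recordCompleted(long units) {
    completedUnits += units;
  }

  // Linear extrapolation: average time per completed unit times units left.
  // Empty until at least one unit has completed.
  synchronized Optional<Duration> estimateRemaining(long nowNanos) {
    if (completedUnits == 0) {
      return Optional.empty();
    }
    double nanosPerUnit = (double) (nowNanos - startNanos) / completedUnits;
    long remainingUnits = totalUnits - completedUnits;
    return Optional.of(Duration.ofNanos((long) (nanosPerUnit * remainingUnits)));
  }

  // One-line summary suitable for an intermittent log statement.
  synchronized String summary(long nowNanos) {
    Duration elapsed = Duration.ofNanos(nowNanos - startNanos);
    String eta = estimateRemaining(nowNanos).map(Duration::toString).orElse("unknown");
    return String.format("%d/%d done, elapsed %s, estimated remaining %s",
        completedUnits, totalUnits, elapsed, eta);
  }
}
```

You might create it with new ProgressTracker(total, System.nanoTime()), call recordCompleted from each batch, and log summary(System.nanoTime()) every so often. The estimate is only as good as the assumption that remaining batches resemble finished ones.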