Configuring Sentry with JavaScript Source Maps

Sentry is an essential part of a Yext engineer’s workflow. Whenever an error occurs in one of our 1,000+ microservices, Sentry helps our engineers understand what’s wrong, so they can make fixes quickly.

Sentry is an open-source system which serves as a pipeline for errors from many different systems. When an application experiences an error, it will report it to the owning team’s Sentry board. Similar errors are grouped together into issues, and when new issues are identified our tooling will send a Slack message to the team who owns the service experiencing a problem so they can resolve it.

When viewing an issue, Sentry shows detailed information about the context in which the issue occurred. This usually includes some form of exception traceback, depending on what is supported by the programming language.

While we primarily use Sentry as a tool for tracking backend application bugs, it works on the frontend too! The Yext Platform includes the Sentry JavaScript SDK, which reports frontend JavaScript errors to our Sentry instance. This allows our engineers to view frontend bugs within the same interface as those on the backend.

This comes with one major complication, however. JavaScript tracebacks in Sentry are more or less unreadable. This is because the tracebacks are generated from minified JavaScript files, designed to be served in production for end-users of Yext. Since minified JavaScript is intended for machines, not humans, it doesn’t contain helpful niceties like readable variable names, indentation, or comments. For example:

An unreadable, minified JavaScript traceback inside Sentry

In order to actually make sense of a minified code traceback like this, source maps are essential.

What’s a source map?

Source maps are so named because they map minified JavaScript files to their non-minified counterparts. They essentially serve as a translation table which can be used in tandem with a minified JavaScript file to derive the original, unobfuscated source code.

Minified JavaScript files typically have a .min.js file extension, and can inform developer tooling of the presence of a .js.map source map file with a comment such as:

//# sourceMappingURL=entitysearchstorm.js.map

Developer tooling which understands source maps interprets that directive as a command to fetch the .js.map file and then parse it alongside the .min.js file to map minified source code to unminified code. Source maps are ubiquitous across the frontend developer tooling ecosystem, with support in Chrome and Firefox’s DevTools in addition to tools like Sentry.
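
To get a feel for what one of these files contains, here is a minimal sketch that inspects a source map using only the Python standard library (the file name is borrowed from the directive above; any .js.map file would do):

import json

# Source maps are plain JSON. The version 3 format includes fields such as
# "version", "sources", "names", and the VLQ-encoded "mappings" string.
with open("entitysearchstorm.js.map") as f:
    source_map = json.load(f)

print(source_map["version"])        # typically 3
print(source_map["sources"][:5])    # paths of the original, unminified source files
print(len(source_map["mappings"]))  # the encoded minified-to-original mapping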

Specifically, Sentry uses source maps to make its JavaScript tracebacks more readable, by augmenting each frame in the stack trace with the unminified source code present in that area.

The Problem

If source maps are so well supported across developer tooling, then why write this blog post?

For a multitude of reasons, the process of enabling JavaScript source map support in Sentry seemed simple initially, but spiraled into a months-long troubleshooting adventure, rife with wrong turns, bad guesses, and – eventually – payoff.

After finally getting it working, I felt that the process of solving this problem was a perfect encapsulation of the kind of frustrating-yet-satisfying work that goes on in the realm of Production Engineering.

I’ll chronicle the journey by sharing each major troubleshooting step along the way, as well as how it either moved us closer to a solution or further away from it. So sit back and enjoy the ride…

Attempt 1. Adjusting The Setting

The most difficult problems often appear simple at the start, and this was no exception.

Slack thread: "Any idea why JS sourcemaps are not getting pulled in?" "Could be because 'Allow JavaScript Source Fetching' is disabled"

I enabled the aptly-named “Allow JavaScript Source Fetching” setting within the Sentry UI, and waited for a new JavaScript event to be ingested by Sentry.

Attempt 2. Increasing the Cache Size

A few minutes later, we had a clue as to what was wrong: this banner began appearing on all JavaScript issues in Sentry.

There were errors encountered while processing this event: Remote file too large for caching

Per the message, the remote JavaScript files we were attempting to fetch appeared to be “too large for caching”. What cache is it talking about?

Sentry contains many different components which together allow for events to be ingested, but is at its core a Django-based Python application. Knowing this, I dived into its default settings file and looked at the available options, one of which was SENTRY_CACHE_MAX_VALUE_SIZE:

# Maximum content length for cache value.  Currently used only to avoid
# pointless compression of sourcemaps and other release files because we
# silently fail to cache the compressed result anyway.  Defaults to None which
# disables the check and allows different backends for unlimited payload.
# e.g. memcached defaults to 1MB  = 1024 * 1024
SENTRY_CACHE_MAX_VALUE_SIZE = None

This seemed like a reasonable setting to look at to fix the “too large for caching” error. It mentions source maps explicitly, as well as a possible default limit if using Memcached.

But having set up our Sentry deployment, I knew that we weren't using Memcached, and the comment block also mentions that a value of None disables the check, which was already the case. Given the mention of different, backend-related maximums, I assumed that Redis was being used as our cache backend rather than Memcached, since Redis was running inside our cluster for use by Sentry. Maybe setting a high value here would override a different maximum check present somewhere for Redis?

I set the variable to 40MB on the off chance that it made a difference. Unfortunately, it did not, and the error remained.
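
For reference, the override amounted to a single line in our Sentry settings file, something like:

# Attempted override: allow cached values of up to 40MB (this did not fix the error)
SENTRY_CACHE_MAX_VALUE_SIZE = 40 * 1024 * 1024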

Attempt 3. Redis Limits

I next considered whether our Redis cluster might have a maximum value size that was being applied. However, I quickly stopped going down this troubleshooting route as the Redis documentation on data types clearly states:

A String value can be at max 512 Megabytes in length.

The JavaScript files being fetched were in the low-megabyte range, so that certainly didn't seem like it would be an issue. I moved on, figuring that Redis was probably functioning completely fine. (It was.)
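
For what it's worth, ruling Redis out directly is easy; a quick sanity check along these lines (the host name is a placeholder for wherever Sentry's Redis runs) confirms that multi-megabyte values store just fine:

import redis  # redis-py

# Placeholder host; point this at the Redis instance backing Sentry.
r = redis.Redis(host="sentry-redis", port=6379)

# Store and read back a 5MB value -- far below Redis' 512MB string limit.
r.set("sourcemap-size-test", b"a" * (5 * 1024 * 1024))
print(len(r.get("sourcemap-size-test")))  # 5242880
r.delete("sourcemap-size-test")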

Attempt 4. Maybe we’re timing out?

I next attempted to change some settings related to fetch timeouts, on the off chance that network requests to our served JavaScript files were timing out:

##################################
# JavaScript source map settings #
##################################

# Timeout (in seconds) for fetching remote source files (e.g. JS)
SENTRY_SOURCE_FETCH_TIMEOUT = 30

# Timeout (in seconds) for socket operations when fetching remote source files
SENTRY_SOURCE_FETCH_SOCKET_TIMEOUT = 30

# Maximum content length for source files before we abort fetching (40MB)
SENTRY_SOURCE_FETCH_MAX_SIZE = 40 * 1024 * 1024

This didn’t move the needle in any direction. These timeouts weren’t being hit before, and they still weren’t.

Attempt 5. Maybe upgrading will fix it?

While this troubleshooting was going on, our team was also planning to upgrade to a more recent version of Sentry due to bugs we had hit in past upgrades. The Sentry development team shipped release 21.4.0 with a bug that prevented the Dashboard page from being displayed, after previously shipping release 21.3.0 with a bug that prevented new alerts from being created. Unluckily for us, we had run into both of those issues, and while we did not depend strongly on these specific Sentry monitoring features as an organization, we wanted to quickly upgrade to a more stable version of Sentry without any showstopper bugs.

Given that we were planning to perform this upgrade soon, we thought it would be worth postponing further source map investigation until after the upgrade.

Attempt 6. Did we break it?

After the upgrade, we gradually started experiencing more issues with Sentry unrelated to source map fetching. Specifically, we were encountering scenarios where, after a spike in Sentry events sent from a downstream service, Sentry would get backed up with events and eventually stop processing them entirely. This culminated in us declaring an internal incident, where for a few days our team focused entirely on getting our internal instance of Sentry working reliably.

A list of commits in Gerrit relating to improving our deployed instance of Sentry

During this period, we disabled source map fetching (which already wasn’t working) in an attempt to improve Sentry performance. We discovered later that the source map fetching had nothing to do with the issues we were experiencing, but this did set us back to square one on the source map troubleshooting front. The Sentry issues we experienced could themselves be the basis for a different blog post, so I won’t make this one any longer by going into our deployment problems in more depth.

After the incident, we made use of the aptly-timed Engineering-wide Quality Sprint to improve our monitoring of Sentry. I built out a customized Prometheus Exporter for black-box monitoring of Sentry’s event throughput, for instance, which helped us identify future event processing delays with alerts.

After all of this time spent on Sentry-related work, we spread out the remaining stories in our Sentry epic, including fixing source maps, across sprints so that we could make progress on other deliverables.

Attempt 7. Blame Kubernetes

A few weeks later, we pulled the source maps story into our sprint and began a deep-dive into the configuration of our deployed Sentry application. We run Sentry in a Google Kubernetes Engine cluster, using a modified version of the sentry-kubernetes Helm chart.

Sentry uses worker processes, which fetch background tasks from RabbitMQ, as part of the event processing flow which includes fetching source maps. We identified that the sentry-worker Kubernetes deployment was configured to mount a sentry-data persistent volume, but this mount was read-only. This meant that no sentry-worker pods could fetch source maps and save them to disk.

A simple solution would be to remove the readOnly: true flag on the VolumeMount for the data volume. However, this would not work because the access mode of the volume was set to ReadWriteOnce. This meant that only a single pod could be given read-write access to the volume. As part of our Kubernetes deployment of Sentry, we have anywhere from 5 to 15 replicas of sentry-worker configured (based on autoscaling), plus the sentry-web pod, so that wouldn’t be workable.

What we needed was a way for multiple pods to read and write to the same storage. Kubernetes has a PVC access mode called ReadWriteMany which, as the name implies, allows multiple pods to have a read-write attachment to a single volume: exactly what we needed. However, Google Kubernetes Engine does not support this!

PersistentVolume resources support the following access modes:

  • ReadWriteOnce: The volume can be mounted as read-write by a single node.
  • ReadOnlyMany: The volume can be mounted read-only by many nodes.
  • ReadWriteMany: The volume can be mounted as read-write by many nodes. PersistentVolume resources that are backed by Compute Engine persistent disks don’t support this access mode.

However, there was still a path forward, hinted at by the documentation for the Sentry Kubernetes Helm chart: running an NFS server inside our Kubernetes deployment, and having the sentry-web and sentry-worker deployments mount the volume using the NFS volume type. If we had wanted to fully utilize the Google Cloud ecosystem, Cloud Filestore might have been an alternative option for a GCP-managed NFS-like deployment, but we didn’t investigate this.

Attempt 8. The Gang Runs an NFS Server

My coworker spun up an NFS server deployment inside our sentry namespace, and through it was able to set up a multi-mount PVC. This involved creating a StatefulSet for the NFS server itself, which was backed by a normal GKE ReadWriteOnce volume, and then creating a separate PersistentVolume and PersistentVolumeClaim which referenced the StatefulSet via its DNS hostname as an NFS server. That final PVC could then have access mode ReadWriteMany, in order to be consumed by the worker and web pods.

Once deployed, we could now successfully kubectl exec into different worker pods and write data to the shared volume! I felt fairly confident for once that we were nearing the end of this rollercoaster.

But after updating our deployment with the ReadWriteMany volume fully working, we still had the same issue. After all of our work so far, we still didn’t have any source code – minified or not – showing up in Sentry.

At this point, my team had spent a lot of time on Sentry-related work, and we were convinced that scraping source maps might just never work properly. Sentry steers new users towards uploading their source maps directly to Sentry itself, which is something we had considered earlier, but that would require some more complicated logic in our CI deployment pipelines which we wanted to avoid if possible.

We kept a “fix Sentry source maps” JIRA task in our backlog, hoping to look into making those deployment-related changes for uploading source maps within a few months.

Attempt 9. A New Hope

But then, around a month later, a Slack post in our #help-production channel asking about Sentry source maps brought about a flurry of activity to try and fix the issue again…

One thing I especially enjoy about working at Yext is the broader engineering culture that is present across all of our teams. Instead of putting up silos between the different groups and teams in our Engineering organization, we encourage members of different teams to collaborate and help each other. Using a Git monorepo with a single code review tool (Gerrit) also helps eliminate friction, allowing anyone within Yext Engineering to view and upload a code review that touches another team’s code.

In this case, two engineers from our Web Publishing group noticed that the Content-Type of source map files being served from our webserver appeared to be wrong. They shipped a change to the base configuration file for the Java Play! Framework, which is used by many of our microservice applications that serve their own static JavaScript files.

I also think that our source maps may be wrong. They have a content type of application/x-navimap. The sourcemap validator can't find them.
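
A quick way to confirm a Content-Type problem like this is to inspect the response headers directly; a minimal sketch (the URL is a placeholder for any served .js.map file):

import requests

# Placeholder URL; substitute the address of any served source map.
resp = requests.head("https://www.example.com/static/entitysearchstorm.js.map")
print(resp.headers.get("Content-Type"))
# We'd expect something like "application/json" here, but these files were
# being served with "application/x-navimap".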

The hope within Slack was palpable as Justin made the change locally, exposed his local instance through ngrok, and verified that the Source Map Validator at sourcemaps.io was now able to identify the source map. As if that wasn’t exciting enough, the Source Map Validator website was run by the Sentry team. If the site could detect the source map, that sure seemed to imply that Sentry would be able to as well.

Justin and Brandon discussed further and put up a code review that was quickly shipped on a Thursday evening. After waiting over the weekend for Sentry errors to pop up in some newly redeployed applications…

Looks like this is still happening (sadparrot)

By mid-day Monday, the error persisted. Nothing appeared to have changed in Sentry.

Attempt 10. The Other Caching Mechanism

Newly convinced that any source map serving issues were now resolved, I decided to dive deep into the Sentry source code to try and figure out what was going on. I cloned the github.com/getsentry/sentry repository, searched for references to source maps, and tried to build a mental model of what I thought might be going on.

I homed in on the fetch_sourcemap() function in Sentry’s Python source code, and traced the source map fetching process to this section in fetch_file(). That function contains the only reference in the entire codebase to the EventError.TOO_LARGE_FOR_CACHE constant, which corresponds to the “Remote file too large for caching” error message we were observing. Deductive reasoning suggested that this code path was being hit, producing the error message:

def fetch_file(url, project=None, release=None, dist=None, allow_scraping=True):
    """
    Pull down a URL, returning a UrlResult object.
    Attempts to fetch from the database first (assuming there's a release on the
    event), then the internet. Caches the result of each of those two attempts
    separately, whether or not those attempts are successful. Used for both
    source files and source maps.
    """
    
    cache_key = f"source:cache:v4:{md5_text(url).hexdigest()}"

    logger.debug("Checking cache for url %r", url)
    result = cache.get(cache_key)

    # ..snip..

    if result is None:

        # ..snip..

        with metrics.timer("sourcemaps.fetch"):
            result = http.fetch_file(url, headers=headers, verify_ssl=verify_ssl)
            z_body = zlib.compress(result.body)
            cache.set(
                cache_key,
                (url, result.headers, z_body, result.status, result.encoding),
                get_max_age(result.headers),
            )

            # since the cache.set above can fail we can end up in a situation
            # where the file is too large for the cache. In that case we abort
            # the fetch and cache a failure and lock the domain for future
            # http fetches.
            if cache.get(cache_key) is None:
                error = {
                    "type": EventError.TOO_LARGE_FOR_CACHE,
                    "url": http.expose_url(url),
                }
                http.lock_domain(url, error=error)
                raise http.CannotFetch(error)

    # ..snip..

    return result

(For simplicity, I truncated sections of code in this snippet. You can view the full source code here.)

So, what’s going on in this code?

Sentry fetches the file requested, and then adds a mapping between the cache_key and the file’s contents in its cache backend. But then immediately afterwards, it queries that cache backend for the cache_key it just set to ensure that the value was properly stored in the cache. If data from the cache couldn’t be read, then Sentry assumes that the file was too large to fit in the cache and returns the error message we had been experiencing.

This made it clear that the core issue at play was, indeed, that fetched files weren’t persisting in Sentry’s cache. But Sentry’s cache was already configured via this line in the Sentry configuration file to use Redis, and I’d already verified that Redis was able to store large files:

# A primary cache is required for things such as processing events
SENTRY_CACHE = "sentry.cache.redis.RedisCache"

But then I had a major realization. I noticed this insightful comment at the top of the file, where the cache object referenced in the source code above was imported:

# separate from either the source cache or the source maps cache, this is for
# holding the results of attempting to fetch both kinds of files, either from the
# database or from the internet
from sentry.utils.cache import cache

“Separate from either the source cache or the source maps cache…” what does that mean? Are there different types of caches?

I quickly found the cache.py file which was being imported, and right at the top of the file was:

from django.core.cache import cache

default_cache = cache

Sentry’s cache.py itself imports the django.core.cache.cache object, which is backed by the Django cache framework. And because anything in a module’s scope, including its imports, can itself be imported from another file in Python, fetch_file() is, via an import of an import, using that framework.
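
A toy illustration of that import-of-an-import behavior (the module names here are made up):

# a.py
import json          # "json" is now an attribute of module a

# b.py
from a import json   # perfectly legal: re-import a name that a itself imported
print(json.dumps({"works": True}))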

As I mentioned earlier, the Sentry application is technically a Django application, albeit an odd one. The most recent version of Sentry uses Django 2.1, which was released in 2018 and had extended support dropped in December 2019. At a configuration level, the Sentry developers attempt to abstract away many of Django’s underlying settings with Sentry-specific equivalents, many of which end up overriding Django’s native settings.

(Sentry has always been closely tied to Django. In fact, it started as a library for the Django logging framework which added exceptions from Django projects into a database, using Django.)

For example, take a relatively mundane configuration option such as specifying an outbound email server. In a standard Django project, you would define EMAIL_BACKEND to be a valid backend type, such as SMTP. Then you would set options such as EMAIL_HOST, EMAIL_PORT and EMAIL_USE_TLS accordingly with the details of the SMTP server.

But in Sentry, you are instead expected to set something like SENTRY_OPTIONS['mail.backend'] in your settings file. Sentry itself then contains code which takes this SENTRY_OPTIONS value and sets Django’s EMAIL_BACKEND accordingly. More or less, this reinforces that when configuring Sentry, the developers want you to set Sentry-specific options rather than typical Django ones.
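
To make the contrast concrete, here’s roughly what the two styles look like side by side in a settings file (the host and port values are illustrative):

# Plain Django: configure the SMTP backend directly
EMAIL_BACKEND = "django.core.mail.backends.smtp.EmailBackend"
EMAIL_HOST = "smtp.example.com"
EMAIL_PORT = 587
EMAIL_USE_TLS = True

# Sentry: set the Sentry-specific options instead (in sentry.conf.py, where
# SENTRY_OPTIONS is defined), and let Sentry's startup code translate them
# into the Django settings above
SENTRY_OPTIONS["mail.backend"] = "smtp"
SENTRY_OPTIONS["mail.host"] = "smtp.example.com"
SENTRY_OPTIONS["mail.port"] = 587
SENTRY_OPTIONS["mail.use-tls"] = True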

Because of this pattern, I took it for granted that the SENTRY_CACHE setting was likely implicitly configuring the Django-managed cache. So I was certainly surprised after opening a Python shell inside of our sandbox test instance of Sentry to check the Django CACHES setting:

$ kubectl exec -it svc/sentry-web -- bash
root@sentry-web-66f999bfdd-6q6cq:/# sentry shell
Python 3.6.13 (default, May 12 2021, 16:48:24)
>>> from django.conf import settings
>>> settings.CACHES
{'default': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'}}

This confirmed it: Redis wasn’t being used as a caching mechanism for fetching source maps, at least not in the context of this code. In fact, the Django cache backend being used for this purpose was, by definition, not caching anything at all!

After this revelation, I felt a sense of relief. There was now a tangible problem that I knew we could solve.

I think I might have an idea of how to fix this. will put up a CR soon. Summary: Sentry has two different caching mechanisms, it's own caching system (which uses Redis) and the Django caching system. we have the Redis cache set up, but the Django cache is unconfigured in sentry's settings.py. This means that Django defaulted to the django.core.cache.backends.dummy.DummyCache backend, which doesn't do anything. When Sentry tries to fetch a JS or sourcemap file, it runs this code which stores the result to the Django-backed cache, and then immediately checks if the data actually persisted in the cache. If it didn't, it gives the "too large for cache" error -- this appears to be the only place where this error message is used. Since the DummyCache is currently being used, that means that this error behavior happens 100% of the time. Reply: This is some awesome detective work and sounds very promising

Attempt 11. Configuring Memcached with Django

I spun up a basic Memcached deployment inside of the Kubernetes cluster which runs Sentry, and configured the Django CACHES backend to use it. The configuration of both was fairly straightforward. After creating Kubernetes sentry-memcached StatefulSet and Service objects, I simply added the following to Sentry’s configuration:

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': [
            'sentry-memcached:11211'
        ]
    }
}

I chose Memcached because it is an easy-to-configure yet reliable caching system, and one of the default backends supported by Django.

Knowing from before that source maps and JavaScript files could take up a fair amount of space, I realized Memcached’s default 1MB maximum item size was not going to cut it. Thankfully, since Memcached 1.4.2 this maximum is easily configurable with the -I argument, so I set it to 25MB in the container arguments by way of a ConfigMap:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sentry-memcached
  namespace: sentry
spec:
  template:
    spec:
      containers:
        - name: sentry-memcached
          image: memcached:1.6.6
          args: [
            "-m $(MEMCACHED_MEMORY_LIMIT)",
            "-I $(MEMCACHED_MAX_ITEM_SIZE)"
          ]
          envFrom:
            - configMapRef:
                name: sentry-memcached-config
          # ...
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: sentry-memcached-config
  namespace: sentry
data:
  # 4GB max memory limit per pod
  MEMCACHED_MEMORY_LIMIT: "4096"
  # 25MB max item size (to support large JS files and sourcemaps)
  MEMCACHED_MAX_ITEM_SIZE: "26214400"

(In previous versions, setting a limit greater than 1MB would require recompiling Memcached from source. Thankfully that wasn’t necessary here.)

After deploying Memcached and the new Sentry configuration, we had a new error message:

HTTP returned error response: 403

Although learning that there’s yet another thing wrong might at first seem discouraging, when troubleshooting a problem it’s often a good thing, because it usually means you’re moving in the right direction. I could now move back to checking the Yext side of serving source maps to identify why they were returning an HTTP 403 Forbidden response to Sentry.

Attempt 12. Updating Our Internal IPs

Thankfully, this next fix was fairly straightforward. As mentioned previously, our Sentry instance was set up with Google Kubernetes Engine. For security reasons, our HAProxy load balancers are configured with an allow-list of internal IP ranges, which are used to protect source map files from being accessed by the outside world. (They do contain our unobfuscated source code, after all.)

Update: testing in devops-sbx, am now getting a HTTP 403 error trying to fetch the sourcemaps. I suspect this is because it isn't counted as an internal IP in haproxy

With a fairly trivial change to our HAProxy configuration, we classified Sentry’s egress IPs as part of our internal IP range. One configuration update later, and…

Attempt 13. Back to Square One?

We were back to getting a “Remote file too large for caching” error! But how?

Well, it turned out that, for the first time, the error message text was actually right. The remote file WAS too large for the cache, but not because the remote file was too large… instead, because the cache’s maximum item size was too small.

I had set a 25MB max size for Memcached items previously, which was much larger than the JavaScript files in question. Shouldn’t Sentry have been able to store items larger than 1MB now that Memcached’s limit had been raised?

It turns out that the Python Memcached library being used by Django, and thus implicitly by Sentry, didn’t check Memcached’s max value size and instead defaulted to 1MB always. Even though the Memcached server was able to support large value sizes, the library which manages the connection to Memcached performed its own checks before making a new request to Memcached to ensure that the value was not too large, but did so using its own configuration setting!
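
The client-side default is easy to see for yourself (assuming the python-memcached package is installed):

import memcache

# python-memcached's module-level default, applied regardless of what limit
# the Memcached server itself was started with
print(memcache.SERVER_MAX_VALUE_LENGTH)  # 1048576 bytes, i.e. 1MB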

Supposedly, Django’s Memcached backend supports specifying the server_max_value_length via an OPTIONS dictionary passed to the underlying python-memcached library, so that it can be made aware of the server’s increased size limit. This didn’t seem to work for me, however.
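
For completeness, the documented approach we tried looked something like this in Sentry’s settings file (it did not resolve the issue for us):

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': [
            'sentry-memcached:11211'
        ],
        'OPTIONS': {
            # Tell python-memcached that the server accepts 25MB values
            'server_max_value_length': 1024 * 1024 * 25,
        }
    }
}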

This override behavior was not especially well documented, and Sentry’s use of a very old Django version didn’t help when searching through bug reports to find a documented example of propagating this setting. (Newer versions of Django have moved to a new Memcached backend, which is more actively maintained and appears to have better documentation.)

I ended up needing to import the library in Sentry’s settings file and manually override one of its constants:

# 25MB max object size: keep in sync with MEMCACHED_MAX_ITEM_SIZE
import memcache
memcache.SERVER_MAX_VALUE_LENGTH = 1024 * 1024 * 25

I uploaded the CR, deployed to production, and waited.

It worked.

Source maps present within the Sentry UI

As an English proverb says, “all things come to those who wait.”

I hope you enjoyed following along on this adventure! If this blog post brought out your inner love for DevOps and infrastructure, check out our careers page – we’re hiring!


James Woglom

James works on infrastructure and developer tooling as a part of Yext's Production Engineering group, and is focused on making it easier to operate applications effectively. When working from home, he enjoys sharing his office chair with his sometimes-nosy cat.
