
Notes on running production code

This tweet by Charity Majors prompted me to write this post. It’s structured along the lines of the thread she posted about how production deploys are more of an organic thing that evolves over time than a flip-of-a-switch event. This post summarizes what I learnt over the course of running and maintaining a production deployment of a JavaScript error tracker.

Background

In the early days of my career as a programmer, I worked on Rails applications that were mostly deployed on Heroku. Pushing to prod was nearly the push of a button. Adding metrics was a click in the UI or a command run from my local machine.

Away from such apps, I’ve also worked on writing and modifying build scripts that pushed code to production in an on-premise setup. But the one project in which I maintained the entire pipeline (the Software Development Life Cycle, or SDLC, except maybe not much of the D) was an on-premise installation of Sentry (a JavaScript error tracker). It’s a typical [1] Django project with solid documentation and instructions on how to set it up on your own.

This system initially ran on a single machine serving a single team, handling a moderate rate of errors (“events” in Sentry-speak). Over a period of one and a half years, though, it became critical infrastructure that was servicing more than 18 teams by the time I left the project.

Even though I had read up on and become familiar with how best to run applications in production (the 12-factor app, for instance), it’s never quite the same as actually maintaining an application in production, especially without the convenience of all the modern cloud tooling that a well-done {S,P}aaS (AWS, GCP, Heroku) provides.

It’s been an interesting journey that I think is worth documenting.

Lesson 1: Documentation

Which is a neat segue into documentation. Sentry’s documentation is very thorough, and the code is open source. Since I had seen many Rails applications like it [1], it was relatively easy, even though I’m not fluent in Python, to grok the codebase with the documentation’s help and to know where to expect something in the code. For instance, caching configuration: it’s usually a middleware.
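
To make that concrete: in a Django project, the per-site cache is typically wired up in the settings module as a pair of middlewares plus a cache backend. Here is a minimal sketch; the Redis backend, location, and key prefix are illustrative placeholders, not Sentry’s actual configuration.

```python
# settings.py -- a minimal sketch of Django's per-site caching, configured
# as middleware. The Redis backend, location, and key prefix below are
# illustrative placeholders, not Sentry's actual configuration.

CACHES = {
    "default": {
        # django-redis is one common backend choice; memcached is another.
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
    }
}

MIDDLEWARE = [
    "django.middleware.cache.UpdateCacheMiddleware",     # must be first
    "django.middleware.common.CommonMiddleware",
    "django.middleware.cache.FetchFromCacheMiddleware",  # must be last
]

CACHE_MIDDLEWARE_SECONDS = 60          # how long responses stay cached
CACHE_MIDDLEWARE_KEY_PREFIX = "myapp"  # hypothetical prefix
```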

However, that’s not what I’m referring to. Even with thorough documentation of the tool itself, there are usually a lot of underlying assumptions and bits of knowledge that go into the deployment. For instance:

  1. How do the environment variables and secrets get populated in production?
  2. What’s the architecture layout? What worked best and what were the initial mistakes, if any?
  3. Who owns the cluster? How does someone SSH into a particular box, for instance?
  4. What are the relevant dashboards for the service? What kind of monitoring is set up, and why that specific choice?

All these (and more) should be propagated within the team, in order to ensure continuity of the system. This is even more important if the team is primarily composed of non-ops programmers.

Moreover, as the application and the choices around it evolve, there will be more information on why the older decisions aren’t valid anymore. That chain of decisions is quite important if someone else has to refer to it in the future.

Lesson 2: Just enough tooling

The Sentry on-premise installation can be done in two ways: use the supplied Docker images, or wrangle the Python setup directly on the machine. Before I worked on this project, the web processes were run using Docker images, and the database was installed directly on a separate machine. The entire setup was manually checked and fixed during downtimes, and there were a lot of those. That was fine during the evaluation stage, but not good when multiple teams are going to rely on the system being available for longer than a week.

When it comes to deployment automation, the possible choices range anywhere from Ansible to Kubernetes. I’m comfortable with Ansible, so I started writing the first set of scripts that automated everything except the database. A bonus was that the team also used Ansible in other projects, so knowledge-pool-wise it was a good choice. Docker was new, though, and Ansible + Docker was not a particularly smooth combination to work with. Still, I had to resist the rabbit hole of shiny DevOps tooling and concentrate on robustness.

So this ended up being a set of glorified shell scripts that ran on multiple machines. Having these early on (rudimentary at first, but slightly more sophisticated towards the end) was very helpful in scaling up the operations [2].
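
To give a flavour of what those scripts boiled down to, here is a hypothetical reconstruction in plain Python rather than the actual Ansible playbooks: SSH into each web host, pull the new image, and restart the container. The host names, image tag, and env-file path are made up.

```python
#!/usr/bin/env python3
"""A hypothetical reconstruction, in plain Python, of what the deploy
scripts boiled down to. The real version was a set of Ansible playbooks;
the host names, image tag, and env-file path here are made up."""

import subprocess
import sys

WEB_HOSTS = ["web-01.internal", "web-02.internal"]  # placeholder inventory
IMAGE = "sentry:8.22"                               # placeholder image tag


def run_remote(host: str, command: str) -> None:
    """Run a shell command on a remote host over SSH, failing loudly."""
    subprocess.run(["ssh", host, command], check=True)


def deploy(host: str) -> None:
    """Pull the new image and replace the running web container."""
    run_remote(host, f"docker pull {IMAGE}")
    run_remote(host, "docker rm -f sentry-web || true")
    run_remote(
        host,
        f"docker run -d --name sentry-web --env-file /etc/sentry/env "
        f"-p 9000:9000 {IMAGE} run web",
    )


if __name__ == "__main__":
    for host in WEB_HOSTS:
        print(f"deploying to {host}...", file=sys.stderr)
        deploy(host)
```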

Lesson 3: Monitoring, Alerting

Again, not having these was fine while the entire system was being tried out, but once there is even one team other than yours relying on it, it’s almost necessary to have monitoring set up.

As far as alerting goes, even the most rudimentary alerting helps you debug a production issue much faster than not having any. This sounds like a no-brainer, I know. In our case, it helped me find out at a glance which particular piece of the machinery was down; I didn’t have to check all the machines (before alerting, I used an Ansible script for that). Soon I built up enough intuition about which process might’ve gone down when there was an outage.
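
The pre-alerting routine of “check all the machines” amounted to something like the sketch below; the hosts and ports are hypothetical, and the real thing was an Ansible ad-hoc run rather than this script.

```python
#!/usr/bin/env python3
"""A hypothetical sketch of the manual "check every machine" step that
proper monitoring eventually replaced: try a TCP connect to each known
service and report the ones that don't answer. Hosts and ports are made up."""

import socket

SERVICES = {
    "web-01.internal": 9000,       # a Sentry web process
    "redis-queue.internal": 6379,  # Redis used as the event queue
    "db-primary.internal": 5432,   # PostgreSQL
}


def is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host, port in SERVICES.items():
        print(f"{host}:{port} {'up' if is_up(host, port) else 'DOWN'}")
```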

The company I was working for had infrastructure for logging metrics (via statsd) and visualising them (using Grafana), and Sentry has statsd support and can push important internal metrics if needed.
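
From what I remember of the Sentry versions of that era, pointing it at statsd was a couple of lines in sentry.conf.py; treat the exact setting names as something to verify against your version’s documentation, and the host and port as placeholders.

```python
# sentry.conf.py -- enabling Sentry's statsd metrics backend, as I remember
# it from that era's versions; double-check the backend path and option
# names against the docs for your version. Host and port are placeholders.

SENTRY_METRICS_BACKEND = "sentry.metrics.statsd.StatsdMetricsBackend"
SENTRY_METRICS_OPTIONS = {
    "host": "statsd.internal",
    "port": 8125,
}
# If your version supports it, metrics can also be sampled:
SENTRY_METRICS_SAMPLE_RATE = 1.0
```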

Lesson 4: Capacity planning

This is another “concept” that sounds easy and intuitive, but requires effort (or perhaps experience?) to get a good grasp of. When the code was running on a single machine, which ran both the database and the web processes (the latter proxied by Nginx), there were far fewer moving parts than there were towards the end [2]. This means most of the effects of an unexpected increase in throughput are easier to understand: Is there enough free memory? Is there enough CPU? Are there enough free file descriptors? And so on. If a process goes down, it’s relatively easy to know why.
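
On a single box, those questions can be answered with a handful of top/df/ulimit invocations, or a few lines of standard-library Python. A rough, Linux-only sketch:

```python
#!/usr/bin/env python3
"""A rough, Linux-only sketch of the single-machine sanity checks:
available memory, CPU count, and the file-descriptor limit for the
current process."""

import os
import resource


def available_memory_kb() -> int:
    """Return MemAvailable from /proc/meminfo, in kilobytes."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found in /proc/meminfo")


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"CPUs:             {os.cpu_count()}")
    print(f"Available memory: {available_memory_kb() // 1024} MiB")
    print(f"Open-file limit:  soft={soft}, hard={hard}")
```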

But when I started moving processes to individual machines [2] for “easier scalability”, I faced issues that I did not expect. For instance, to take on more requests per second, I “simply” increased the number of machines and web processes, but that choked the database, because PostgreSQL’s connection mechanism is quite resource-heavy. I then had to run a database connection pooler (PgBouncer is pretty simple to set up, if you’re looking for one). That didn’t solve the overall problem, though, because the database was just the first component to fail. Adding the pooler solved the database’s connection problem, but the high web-process count then caused Redis [3] to choke on connections. This was a slower failure; it showed up only after a while at runtime, when there was a spike in requests.
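
The arithmetic that bit me is easy to write down after the fact. With made-up but representative numbers, the point is that web capacity multiplies into database connections unless a pooler sits in between:

```python
# Back-of-the-envelope connection math with made-up but representative
# numbers. Each web process holds a couple of PostgreSQL connections, so
# adding web machines multiplies the connection count.

web_machines = 30
processes_per_machine = 8
db_connections_per_process = 2           # e.g. a small per-process pool

direct = web_machines * processes_per_machine * db_connections_per_process
print(f"direct connections to PostgreSQL: {direct}")          # 480

postgres_max_connections = 100           # PostgreSQL's default ceiling
print(f"over the limit by: {direct - postgres_max_connections}")

# A pooler such as PgBouncer multiplexes all those client connections
# onto a much smaller pool of real server connections:
pgbouncer_server_pool = 40               # illustrative pool size
print(f"server connections with a pooler: {pgbouncer_server_pool}")
```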

In the end, once I knew the breaking points of the existing components, I was able to make much better decisions around capacity calculations. However, this should have started much earlier in the life cycle of the deployment.

Overall, I agree with what @mipsytipsy says:

That’s the start of a thought-provoking thread that is necessary reading if you’re a beginner at running your code in production and/or maintaining infrastructure that you own.

For instance:

This, as I explained above, is definitely true. In my case the code was already written, but the more important work started after the first version was deployed to production and other teams started using the application.

And to close out:


Footnotes

  1. A “typical” Django/Rails app ends up with these components:

    1. The main web process
    2. A background job queue (Sidekiq in Ruby-land, Celery in Python-land)
    3. Redis, optionally, as the store for this background job queue
    4. A database, typically PostgreSQL in both frameworks

    The background job queue is pretty much a necessary component for an application that needs higher-than-normal scale. The frameworks have ORMs that handle communication with the database. A typical application also needs authentication, which relies on user information stored in the database. This means any request that needs authentication or authorisation has to talk to the database, so if you need to handle 1000 queries per second, you pretty much need to make 1000 connections to the database. An additional layer of caching is therefore used to store frequently-used database queries, sometimes by reusing the Redis layer if it’s already available; this is true for Sentry at least. And to relieve pressure on the database when there are transactions that take a long time to complete, applications push that work onto the background job queue. Processing images or files, for instance, is typically done this way.
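
As a hypothetical illustration of that last point, offloading a slow image-processing step onto the job queue with Celery looks roughly like this; the broker URL and task body are made up.

```python
# A hypothetical illustration of pushing slow work onto the background job
# queue (Celery in Python-land) so the web process can respond immediately.
# The broker URL and task body are made up.

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")


@app.task
def process_image(image_id: int) -> None:
    """Do the slow work (resizing, thumbnailing, ...) outside the request."""
    ...


# In the web request handler, enqueue instead of doing the work inline:
# process_image.delay(image_id)
```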


  2. Scaled-up configuration

    At the time of the last change I made to the infrastructure, the cluster configuration looked something like this:

    1. 8 web processes per machine running on around 30 machines
    2. 5 machines running Redis; in the simplest configuration, Sentry uses these as a distributed cache for temporary storage.
    3. 1 high-memory machine running Redis, used as the message queue. The web processes push the “event” data onto this queue, and the background workers pick it up, process it, and store it in the database later.
    4. 4 worker processes per machine running on around 12 machines.
    5. Database setup on 2 machines (one primary, one secondary) with the database pooler process.

    Last time I checked, this setup was handling around 5000 event registrations per second at maximum capacity.
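
Multiplying those numbers out, as a rough sanity check and nothing more:

```python
# Rough arithmetic over the configuration above: total process counts and
# the implied per-process throughput at the ~5000 events/second peak.

web_processes = 30 * 8       # 240 web processes in total
worker_processes = 12 * 4    # 48 background workers in total
peak_events_per_second = 5000

print(f"events/sec per web process:    {peak_events_per_second / web_processes:.1f}")     # ~20.8
print(f"events/sec per worker process: {peak_events_per_second / worker_processes:.1f}")  # ~104.2
```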


  3. The Redis connection and listen-queue configuration is well commented in redis.conf