July 11, 2020

Building a Container-Friendly App

I've been going down two semi-related rabbit holes lately: DevOps and Kubernetes. And it's gotten me thinking about what kinds of best practices there are or should be for application development in this distributed event-sourced cloud-native buzzword-laden ecosystem. (Blockchain!)

So I'm just going to lay out some of the things I've determined are useful to keep in mind when designing an application that is likely to be used in a cloud/container environment. Much if not all of this is probably already out there, and may even be well-known, but this is for my own edification, so I have these findings in one place and can reference them.

1. The UNIX Philosophy (Do One Thing)

Just as has been hammered in for good design of software functions, discrete bounded contexts continue to be useful. Separation of duties and modularity allow both functions and applications to be easier to manage and easier to reason about. Don't try to do more than you need to, and if you need to do multiple things, each of those things should be done separately. If your app needs state, don't keep it yourself; use an external database instead. (Your app shouldn't also be a database, it should just be an app.)

Do one thing. Do it well. And if you need to do more than one thing, make another app to do it.

And speaking of databases and state, another thing to keep in mind is that if at all possible, your application should...

2. Be Stateless

If you give them the chance, functional programmers will spend all day telling you how horrible state is. Changing state in your application means that doing the same thing twice might have different results the second time, and likewise all other functionality loses any previous guarantees. Additionally, if you're keeping state, your application can't easily scale out horizontally because each instance is different from every other instance.

For some apps state is simply unavoidable and necessary, but try your best to get rid of it. Users/clients will appreciate being able to scale by just running more instances of your container, and it makes security, upgrades, debugging and life in general much easier for everybody.

If you really do need state and you are using an external database, always use transactions. I'm assuming this is a "duh" statement, but explicit is better than implicit. Lock your writes and commit or roll back after every change. This ensures that multiple instances of your app don't conflict with each other, and scaling and reasoning about your application remain reasonably easy. And of course, failures and shutdowns aren't as problematic with transactions.
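To make the commit-or-roll-back point concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3 as a stand-in for whatever external database you actually run; the accounts table and the transfer function are made up for illustration.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` between two rows, all-or-nothing (hypothetical example)."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError(f"insufficient funds in {src!r}")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def demo() -> dict:
    """Run one good and one bad transfer and return the final balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    transfer(conn, "a", "b", 30)       # succeeds and is committed
    try:
        transfer(conn, "a", "b", 500)  # overdraws, so the whole change is rolled back
    except ValueError:
        pass
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves no half-applied write behind, which is the property you want when several instances (or a sudden shutdown) are in play.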

3. Keep Horizontal Scaling in Mind

It sure would be nice if your app could scale by just running more instances. That would make everybody's life much easier. In order to do this, your application should use as few resources as possible, your container images should be as small as possible, you should document what resources your app requires, and your app should fail gracefully and shut down quickly. And your application should have the ability to output a whole bunch of...

4. Metrics, Logging, and Health Checks

One of the nifty features of this new cloud native society is bringing big data to your own resources. Service meshes like Istio come with Prometheus and Grafana almost as an afterthought, and those track everything. It's trivial (almost) to make sure logs are read from every container in the cluster and aggregated into Elasticsearch. So make your cluster admins' jobs easier by providing robust metrics, logging, and health checks. (And format the metrics and logging in a way that is easy for computers to parse. It's not hard to JSON everything. Computers really like JSON event streams.)
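As a rough illustration of "JSON everything", here is a sketch using only Python's standard logging module. The field names (ts, level, logger, msg, exc) are my own choice, not any standard, so rename them to whatever your log pipeline expects.

```python
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            # Field names here are illustrative assumptions.
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            event["exc"] = self.formatException(record.exc_info)
        return json.dumps(event)


def make_logger(name: str = "app", stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or `stream`)."""
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling make_logger().info("listening on port %d", 8080) at startup also covers the "print some basic startup info to stdout" request.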

Side note: I know I've seen a Kelsey Hightower video where he begged application developers to print some basic startup info to stdout as well.

If something unexpected happens, your application should immediately start failing any health checks, log an error, and shut down with a non-zero exit status. Let the administrator deal with the problem by noticing the shutdown in the metrics and then reading the useful error messages from your application.
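One way to wire up "start failing health checks" is a single flag that the rest of the app flips when something goes wrong. This is only a sketch: the /healthz path, the Health class, and the response shape are assumptions of mine, and the handler is built on Python's standard http.server.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler


class Health:
    """Thread-safe flag the app flips once something unexpected happens."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._failure = None

    def fail(self, reason: str) -> None:
        with self._lock:
            # Keep the first failure; it is usually the root cause.
            if self._failure is None:
                self._failure = reason

    def status(self) -> tuple:
        with self._lock:
            if self._failure is None:
                return 200, {"status": "ok"}
            return 503, {"status": "failing", "reason": self._failure}


HEALTH = Health()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/healthz":  # hypothetical path
            self.send_error(404)
            return
        code, body = HEALTH.status()
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

Serving it is one line with http.server.HTTPServer(("", 8080), HealthHandler); the port is also just an example.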

Make the admin's job easier. Give them the logs and metrics they need to determine how best to diagnose and fix any problems. Which brings us to...

5. Staying in Your Own Lane

Don't try to do things you don't need to be doing. Don't retry connections. Don't use https (!). Don't do authentication.

These are the domain of the scheduler, the service mesh, the ops people, and other parts of the organization and cluster management teams. They are not your concern. Your job is to make sure your application works as expected under the right conditions, and to document what the right conditions are. Your job is not to create the right conditions.

Don't retry connections. If your application tries to reconnect to a database that it can't live without, then your application is simply taking up resources when it's not doing anything. You should log the error and shut down with a non-zero exit status. Then, without any effort by you, Kubernetes will bring your app back up, and it will fail again, and eventually the admins will notice that shit ain't working right and they'll go look at the logs.

Your application does nobody any good by taking up RAM and CPU trying to see if the network is up. That's not your job. Your application shouldn't be doing this. Let it fail.
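In code, "let it fail" can be as small as the sketch below: try the dependency once, log, and exit non-zero. The connect_or_die name is made up, and the connect parameter exists only so the function can be exercised without a real network; by default it is the standard socket.create_connection.

```python
import logging
import socket
import sys

log = logging.getLogger("app")


def connect_or_die(host: str, port: int, timeout: float = 3.0,
                   connect=socket.create_connection) -> socket.socket:
    """Try the dependency exactly once; on failure log and exit non-zero."""
    try:
        return connect((host, port), timeout)
    except OSError as exc:
        # No retry loop: the scheduler restarts us, the logs say why.
        log.error("cannot reach %s:%d: %s", host, port, exc)
        sys.exit(1)
```

There is deliberately no backoff, no loop, and no "is the network up yet" polling in there.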

Don't use https. This might sound crazy coming from a security guy, but network security isn't your domain. Your service mesh and network/firewall policies are already taking care of network security. If you're using TLS in your application, you're adding redundant layers of encryption and simply taking up extra time and resources doing unnecessary double encryption. Even worse, you're preventing your service mesh and firewall from routing and monitoring your traffic.

As far as your application is concerned, every connection is unencrypted and on localhost or in a local network. (Of course, this may not be the case when connecting with outside third parties. But you get the idea.) Network security is no longer the responsibility of the application. Don't try to reclaim that responsibility. Let your ops people take care of it for you. As far as your application knows, your service and database dependencies are on http. That's OK. Trust me, the network traffic is getting encrypted.

Don't do authorization or authentication. Everybody who connects to your application is allowed to connect to it. The network policies, service mesh, and other layers of security have made sure of that. Your app should simply wait for instructions and then do as it's told. (Now of course your organization may do things differently, but without clear organizational direction, you should not be doing any authentication yourself.) Just like https, authorization and authentication are not your responsibility.

The main point here with all of this is that your application should only focus on what it needs to focus on. If your app is a database and has users, fine, maybe it needs to worry about users and authentication. And for security, yes you still need to protect your C code against buffer overflows. But make sure the things you focus on are actually things that lie within your own domain and aren't things that should be off-loaded to the service mesh, scheduler, network/firewall policies, or other operations components.

Modularity, being self-contained, and doing only what you need to do are the name of the game here. If you try to steer out of your own lane, even with the best of intentions, you'll just cause accidents. Make your app easy to reason about, and let operations take care of securing it.

6. Least Privilege, Least Access, etc.

One of the ways the ops admins are going to secure your app is to force it to run as an unprivileged user, in a read-only filesystem, with no ability to do... anything. Your application had better work with those constraints. This means don't try to read (or god forbid write to) the filesystem unless you have to. All network listening should be bound to high ports. Don't require root for something, and if you absolutely do need to do something as root, try to do it first and make it an InitContainer.

Every security seminar everywhere has emphasized the idea of "least privilege". Don't do more than you need to do. Don't access more than you need to access. Don't have permissions you don't need to have. Do as little as necessary in order to get the job done. Your application should not and will not have access to anything. Make sure it works without access. And if it really needs something, explicitly document and request that exact thing and only that thing, and nothing more. (And that thing had better not be "run as root"...)

7. Don't Hard Code Variables

It's not a container-specific idea, but it's still worth repeating. Everything should be configurable. Users will reimplement your other service, so its address should be configurable. If there are variables your application simply cannot function without, log an error and gracefully fail when the application is started without those variables provided.

All variables should be able to be loaded from four places: a file that contains only that variable's value, a configuration file with many variables' values, environment variables, and the arguments when the application is run. The final set of variables for the application's configuration is built by the following steps, in order:

A hashmap/dictionary is created and any hard-coded defaults are put in.
If no config file is specified with runtime arguments, look for a global config file and load/overwrite the hashmap with values from that.
If no config file is specified with runtime arguments, look for a local config file and load/overwrite the hashmap with values from that.
If a config file is specified in the runtime arguments, load/overwrite the hashmap with values from that.
If environment variables specify the path to a file that contains the value of a variable, load/overwrite the hashmap with that value for that variable. (Repeat for each variable that has its own file location.)
Load/overwrite the hashmap with environment variables. (And if environment variables exist for both a variable and a file path for a variable, log an error and exit with a non-zero status.)
Load/overwrite the hashmap with runtime arguments.

This requires that you ensure your application container starts with no environment variables set. It is also useful to have a clear format for variables. For example, use a "_FILEPATH" suffix for all variables' file paths. This way, a user who sees USERNAME_FILEPATH somewhere can intuit that the env var for the DB_HOSTNAME file path is likely to be DB_HOSTNAME_FILEPATH.

Having a file path allows your variables to be mounted from secrets and other similar constructs, and although most container management software can set env vars from secrets, not all of them can, and it's better to give your users more options.
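The load order above can be sketched in a few dozen lines of Python. Several things here are my assumptions rather than part of the scheme: config files are JSON, the two default paths are invented, only keys that appear in the defaults are looked up in the environment, and a default of None marks a variable as required.

```python
import json
import os
import sys

SUFFIX = "_FILEPATH"  # naming convention described in the text


def _read_json(path):
    """Return the dict stored at `path`, or {} when the file is absent."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def load_config(defaults, args, environ=None, explicit_file=None,
                global_file="/etc/myapp/config.json",  # hypothetical path
                local_file="./config.json"):           # hypothetical path
    env = os.environ if environ is None else environ
    config = dict(defaults)                       # 1. hard-coded defaults
    if explicit_file is None:
        config.update(_read_json(global_file))    # 2. global config file
        config.update(_read_json(local_file))     # 3. local config file
    else:
        with open(explicit_file) as fh:           # 4. file named in the arguments
            config.update(json.load(fh))
    for key in defaults:                          # assumption: only known keys
        file_var = key + SUFFIX
        if key in env and file_var in env:
            # Both forms set is ambiguous: message to stderr, exit status 1.
            sys.exit(f"error: both {key} and {file_var} are set")
        if file_var in env:                       # 5. value stored in its own file
            with open(env[file_var]) as fh:
                config[key] = fh.read().strip()
        elif key in env:                          # 6. plain environment variable
            config[key] = env[key]
    config.update(args)                           # 7. runtime arguments win
    missing = [k for k, v in config.items() if v is None]
    if missing:                                   # assumption: None means required
        sys.exit(f"error: missing required settings: {missing}")
    return config
```

Here args is whatever dict your argument parser produced, containing only the options the user actually passed.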

8. Catch SIGTERM and Quit Fast

SIGTERM is how container management engines tell your app it's time to shut down. Do whatever cleanup you need to and make sure you are able to exit cleanly within 5s-10s, since you will get a SIGKILL if you don't shut down fast enough. (Docker's default grace period is 10 seconds and Kubernetes' is 30, but don't count on getting all of it.) And make sure any cleanup processes can be interrupted without catastrophic error. For example, if you need to do one last database write, it had better be a transaction, so there is no corruption or mixed data if your application is killed before the write can finish.
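A bare-bones version of catching SIGTERM in Python might look like this. The GracefulExit class and the run loop are illustrative names of mine; the handler only sets a flag, and the main loop notices it and cleans up.

```python
import signal
import time


class GracefulExit:
    """Records that SIGTERM arrived so the main loop can wind down."""

    def __init__(self) -> None:
        self.requested = False
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        # Do as little as possible in the handler: just set the flag.
        self.requested = True


def run(do_work, cleanup, poll_seconds: float = 0.5) -> int:
    """Call `do_work` until SIGTERM arrives, then clean up and return 0."""
    stop = GracefulExit()
    while not stop.requested:
        do_work()
        if not stop.requested:
            time.sleep(poll_seconds)
    cleanup()
    return 0
```

Keep each do_work call and the cleanup itself short, otherwise the time between the signal and the exit can blow past the grace period.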

9. Know and Document Your Resource Requirements

This allows your application to be scheduled more effectively and makes sure your users' resources are being utilized optimally. When you publish your application, your documentation should include information about its resource requirements. Make the cluster admin's job easier. If your application runs out of resources, don't spaz out. Free up enough to log the error and shut down gracefully with a non-zero exit status.

10. Use Build Containers and Keep Images Small

Nobody wants to wait for your 3GB hello world Node.js application to download. Make your image as small as possible. Use build containers and pull from common base images to increase the chance that the image layer already exists on the user's system. (Not to mention, they're likely more robust and secure by default.) Where possible, your final image filesystem should just be your application as a single static binary and nothing else, with the possible exception of a self-contained init process manager such as tini.

11. Unit Test Both the Application and the Container

Ideally, you have CI/CD and your tests go fast because you have a small image and your change batch sizes are small.

12. Tag Your Images With Semantic Versioning

Semantic versioning is useful and tells your users what features are available and what kinds of backwards compatibility guarantees they have with your application. Use it.
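To show what a semver tag buys your users, here is a small Python sketch that parses MAJOR.MINOR.PATCH tags and checks whether an upgrade stays within the same major version. It is deliberately simplified: it ignores pre-release and build metadata and the 0.x special case, and the function names are my own.

```python
import re
from typing import NamedTuple

# Optional leading "v", then three dot-separated integers.
_SEMVER = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_tag(tag: str) -> Version:
    """Parse an image tag such as '1.4.2' or 'v1.4.2'."""
    m = _SEMVER.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    return Version(*(int(part) for part in m.groups()))


def is_compatible_upgrade(current: str, candidate: str) -> bool:
    """True when `candidate` is newer and keeps the same major version."""
    cur, new = parse_tag(current), parse_tag(candidate)
    return new.major == cur.major and new > cur
```

A deploy script could use a check like this to pick up 1.10.0 automatically while refusing to jump to 2.0.0 without a human looking at it.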

13. Be Tolerant in What You Accept and Strict in What You Output

This is more of a general design principle, but I run into it often enough that it's worth putting here. Your application should accept more than the letter of the specification. Common expected use cases should be handled (and documented). But everything your application emits should be picture-perfect and follow the specification exactly. This ensures your application has the most utility for users, because it works with other less precise applications, and it outputs in a manner that is guaranteed to be accepted by other applications.

I ran into this recently when my reverse proxy was failing to forward a path because the backend service returned a redirect without a response body, and that specific redirect type wasn't supposed to have an empty body according to the specification. Guess how much of a shit I gave about whose fault it was? That's right, I just wanted the reverse proxy to forward the damned traffic.

If you're really a stickler for the rules, you can add a "strict" configuration option and even default it to true/on, but you should give the user the ability to be more lenient about what your application accepts. If there's not-technically-correct behavior that is common and expected to behave in a certain way, just go with it.
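A toy example of the idea, with a "strict" switch as described: accept the common spellings of a boolean on input, but only ever emit one canonical form. The accepted spellings and the function names are assumptions for illustration.

```python
_TRUE = {"true", "yes", "on", "1"}
_FALSE = {"false", "no", "off", "0"}


def parse_bool(raw: str, strict: bool = False) -> bool:
    """Accept common spellings of a boolean unless `strict` is set."""
    if strict:
        # Strict mode: only the exact canonical spellings pass.
        if raw in ("true", "false"):
            return raw == "true"
        raise ValueError(f"expected 'true' or 'false', got {raw!r}")
    text = raw.strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"cannot interpret {raw!r} as a boolean")


def emit_bool(value: bool) -> str:
    """Always emit the one canonical spelling."""
    return "true" if value else "false"
```

Whatever sloppy spelling came in, the next application downstream only ever sees "true" or "false".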

14. Provide Thorough Documentation

That means help menus, man pages, websites or GitHub wikis, and of course, code comments. And in a container world, many are adopting the practice of an "extended" help menu that is basically the man page, and can be called with -hh (similar to additional debug levels with -vvv). However you do it, make sure you provide excellent documentation. Documentation helps your users utilize your software as effectively as possible, and it also makes them less likely to contact you for help. It's win-win. Document everything. Users should be able to know what something is, how it works, why it works that way, and how to change any aspect of it to suit their use case. Change logs should act as event streams from version to version, and features and functionality should theoretically be able to be derived by replaying the change logs from the beginning.
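For what it's worth, the -hh trick is easy to get with Python's argparse by turning off the built-in help and counting -h the same way you would count -v. The program name and help strings below are placeholders.

```python
import argparse

SHORT_HELP = "usage: myapp [-h | -hh] [-v ...]"  # hypothetical app name
LONG_HELP = SHORT_HELP + "\n\nExtended help: the man-page-length text would go here."


def build_parser() -> argparse.ArgumentParser:
    # add_help=False frees -h so it can be counted like -v.
    parser = argparse.ArgumentParser(prog="myapp", add_help=False)
    parser.add_argument("-h", "--help", action="count", default=0)
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


def help_text(argv):
    """Return the help text requested by argv, or None if none was requested."""
    args = build_parser().parse_args(argv)
    if args.help >= 2:
        return LONG_HELP
    if args.help == 1:
        return SHORT_HELP
    return None
```

The main function would print whatever help_text(sys.argv[1:]) returns and exit 0, or carry on normally when it returns None.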