One of Kubernetes' killer features is its ability to seamlessly update applications no matter how large your deployment is. Did a developer make a code change, and now you need to update a thousand running containers? Just run <span class="code-class-custom">kubectl apply -f manifest.yaml</span> and watch as Kubernetes replaces each outdated pod with the new version.

Unfortunately, as with many Kubernetes features, there are hidden risks here that could impact the reliability of your applications. Updates typically roll out gradually, not all at once. What happens if your team releases another update before the first rollout finishes? What happens if you push a release while Kubernetes is upgrading itself? Depending on how you identify container image versions, you might end up with two different versions running side by side: one with the latest fix, and one without it.

In this blog, we'll explore the container version uniformity problem, what the risks are, how you can avoid them, and how Gremlin helps ensure consistent versioning across your environment.

Looking for more Kubernetes risks lurking in your system? Grab a copy of our comprehensive ebook, “Kubernetes Reliability at Scale.” 

What is version uniformity and why is it important?

Version uniformity means that every replica of a workload runs the same version of its container image. When you define a pod or deployment in a Kubernetes manifest, you can specify which version of the container image to use in one of two ways:

  1. Tags, which are human-readable labels assigned by the image's publisher. A tag points to only one image version at a time, but it can be reassigned, so over time a single tag may refer to several different container versions.
  2. Digests, which are the result of running the image through a hashing function (usually SHA-256). Each digest identifies exactly one version of a container; changing the container in any way also changes the digest.

Tags are easier to read than digests, but they come with a catch: a single tag could refer to multiple image versions. The most infamous example is <span class="code-class-custom">latest</span>, which always points to the most recently released version of a container image. If you deploy a pod using the <span class="code-class-custom">latest</span> tag today, then deploy another pod tomorrow, you could end up with two completely different versions of the same pod.
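
If you want to see which digest a tag currently resolves to, one option is to pull the image and inspect it. Here's a minimal sketch using Docker and the userservice image we'll use in the example below (the exact output will vary):

BASH

# Pull the tagged image, then print the repository digest the tag resolves to
docker pull gcr.io/bank-of-anthos-ci/userservice:v0.5.10
docker inspect --format='{{index .RepoDigests 0}}' gcr.io/bank-of-anthos-ci/userservice:v0.5.10
# Output: gcr.io/bank-of-anthos-ci/userservice@sha256:<digest>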

As an example, imagine we have a Kubernetes application called the Bank of Anthos. One of the deployments in our application is the "userservice," which handles actions like authenticating users and storing personal data. We want this service to have plenty of redundancy and headroom, so we deploy 20 replicas of it across our clusters:

YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: userservice
spec:
  replicas: 20
  selector:
    matchLabels:
      app: userservice
  template:
    metadata:
      labels:
        app: userservice
    spec:
      containers:
        - name: userservice
          image: gcr.io/bank-of-anthos-ci/userservice:v0.5.10

Now, imagine we need to make a quick hotfix to the userservice. It's not a major change, so we push the updated image directly to our image repository without changing the tag. The updated image has a new digest, and the tag now points to it instead of the original version. If Kubernetes schedules another pod onto a different node (e.g. when adding a new replica or scaling up), the newly created pod will use the updated version, but the already-running pods will continue using the old version. We'll have two different versions of the same container running side by side.
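
To make this concrete, here's a sketch of the hotfix push described above. Re-pushing an image under an existing tag silently re-points that tag:

BASH

# Build the hotfix and push it under the SAME tag we already deployed.
# The tag now resolves to a new digest; already-running pods keep the old one.
docker build -t gcr.io/bank-of-anthos-ci/userservice:v0.5.10 .
docker push gcr.io/bank-of-anthos-ci/userservice:v0.5.10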

You can imagine what kind of problems this could cause if we changed the way user data was stored in the database, or the way passwords were hashed in response to a critical security bug. Users might see strange errors or might not be able to log in at all. Worse yet, only a percentage of users might be impacted, making it even harder to troubleshoot the problem.

How do I prevent version mismatches?

Start by checking your manifest (YAML) files. This is where you define the parameters for your pod(s), including the container image location and version. When specifying an image, always use a digest. This locks the deployment to a specific version, whereas using a general tag like <span class="code-class-custom">latest</span> could pull different versions depending on when you deploy the pod.

For example, when we're deploying the userservice, we should specify the image as <span class="code-class-custom">userservice@sha256:d33e608c24821613713e8b85ce5fbec118a18076140c1b3ee39359d606ce20ef</span> (note the <span class="code-class-custom">@</span> separator, not a colon). The default (risky) choice is just to use <span class="code-class-custom">userservice:latest</span>. However, <span class="code-class-custom">latest</span> is a floating tag that references whatever the current version of the image is. If we add another container instance to this deployment, it could end up pulling a different version of the container and running it alongside older, potentially incompatible versions.
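
In the deployment manifest from earlier, pinning by digest looks like this (using the digest from our example):

YAML

      containers:
        - name: userservice
          # Pinned by digest: this always resolves to exactly one image version
          image: gcr.io/bank-of-anthos-ci/userservice@sha256:d33e608c24821613713e8b85ce5fbec118a18076140c1b3ee39359d606ce20ef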

Another place where version mismatches often occur is during updates. When rolling out an update, Kubernetes doesn't replace every container at once, as this could cause service downtime, failed requests, and a poor experience for users. Instead, it gradually replaces individual pods while keeping a minimum percentage of the deployment's pods running (by default, at least 75%). Using digests will also prevent this problem, but an alternative approach is to use a different deployment strategy. For example, in a blue/green deployment, the new version (green) is released alongside the old version (blue), but traffic continues going to the old version. Once the new version is ready, traffic is switched over instantly, then the old version is taken down. We cover a few of these different methods in our blog on testing in production.
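
Those percentages come from the deployment's rolling update strategy. The Kubernetes defaults look like this; lowering <span class="code-class-custom">maxUnavailable</span> keeps more pods available during a rollout:

YAML

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%  # at most 25% of desired pods can be unavailable
      maxSurge: 25%        # up to 25% extra pods can be created during the rollout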

How do I validate that my fix works?

The most direct way to check for mismatched container versions is by using <span class="code-class-custom">kubectl</span> to query every container image. For example, the official Kubernetes documentation provides this command for listing each container image across all namespaces, along with the number of Pods actively using that image:

BASH

kubectl get pods --all-namespaces -o jsonpath="{.items[*].status.containerStatuses[*].imageID}" |\
tr -s '[[:space:]]' '\n' |\
sort |\
uniq -c

Running this against our example application produces output like the following:
1 gcr.io/bank-of-anthos-ci/accounts-db@sha256:04da06045c2ce2d9fd151fda682907eecb8eb9faeb84d0a60ea2a221e0b85441
2 gcr.io/bank-of-anthos-ci/balancereader@sha256:164ef93c47334e0c5ce114326397abbe730e8114398072f48fb63ffe447237ad
2 gcr.io/bank-of-anthos-ci/contacts@sha256:5f28ba99be16ac8173ac73d22f72b94e34c3b33b8d0497b8b05364fcbd1a161b
2 gcr.io/bank-of-anthos-ci/frontend@sha256:2317dfa4351d6cb63b9b52161c39feaf84e4f3e9460ac601175ffc5e1774d354
1 gcr.io/bank-of-anthos-ci/ledger-db@sha256:73e6f191dccc5344ee795470db676dd107f62a40d5425f47d116609dadf5efa4
2 gcr.io/bank-of-anthos-ci/ledgerwriter@sha256:bc8263483ea15427fe4ee06a67dea42811177c62fb68cefcab843d14dd54dc25
2 gcr.io/bank-of-anthos-ci/loadgenerator@sha256:6aaed05ef6342c8476fed2b32224fdace0ff6403688112cb816867b110dae0ac
2 gcr.io/bank-of-anthos-ci/transactionhistory@sha256:578eee3c7a84a6dceae1c0a8823fd0ab091fa32a216e47f4c7f8691adc2ba1ce
1 gcr.io/bank-of-anthos-ci/userservice@sha256:1d0e45ca69fed59a1fa4c5c3ea356b0e47779149b47e45f8d3ec422a61560909
1 gcr.io/bank-of-anthos-ci/userservice@sha256:d33e608c24821613713e8b85ce5fbec118a18076140c1b3ee39359d606ce20ef

You'll notice there are two versions of userservice, each with a different digest. To fix this, we'd lock down the image by adding the correct digest to our manifest, then re-deploying it.
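
If you'd rather patch the live deployment than edit and re-apply the manifest, a one-liner like this also works (using the desired digest from the output above; adjust the namespace to match your environment):

BASH

# Point the userservice container at a single, specific digest,
# then wait for the rollout to finish
kubectl set image deployment/userservice \
  userservice=gcr.io/bank-of-anthos-ci/userservice@sha256:d33e608c24821613713e8b85ce5fbec118a18076140c1b3ee39359d606ce20ef
kubectl rollout status deployment/userservice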

What other Kubernetes risks should I be looking for?

We covered the more common Kubernetes reliability risks, such as resource requests and limits, liveness probes, and high availability, in our eBook, "Kubernetes Reliability at Scale." We're also launching a brand new set of Detected Risks this year, followed by more blog posts like this one. When you're ready to start uncovering risks, sign up for a free 30-day trial and get a complete report of your reliability risks in minutes.

Andre Newman
Sr. Reliability Specialist