Beyond Native Policies: Implementing Cosign Webhooks and Managing Open Source Friction
Many validations can be done with Kubernetes Validating Admission Policies, but some cases require admission webhooks, registered through a ValidatingWebhookConfiguration or a MutatingWebhookConfiguration, to perform a different kind of validation. That is what this post from AS Inc Example covers.
Many of our important validations are written in the Common Expression Language (CEL) inside a native Kubernetes resource. Before you ask what kind of validations we mean: they encode important security decisions such as read-only root filesystems, dropping administrative capabilities, and not running as root. We also use Validating Admission Policies to enforce tagging and other important platform decisions.
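To make these checks concrete, here is a minimal Python sketch of the rules such a policy encodes, applied to a Pod spec held as a dict. It is only an illustration: in the cluster these rules are CEL expressions evaluated by the kube-apiserver, and the function name `violations` is hypothetical. The fields it reads (`readOnlyRootFilesystem`, `runAsNonRoot`, `capabilities.drop`) are standard Kubernetes security context fields; treating "removing administrative capabilities" as "drop `ALL`" is our assumption.

```python
from typing import Any

# Roughly the CEL a ValidatingAdmissionPolicy could use for the first rule:
#   object.spec.containers.all(c, has(c.securityContext) &&
#     has(c.securityContext.readOnlyRootFilesystem) &&
#     c.securityContext.readOnlyRootFilesystem)


def violations(pod_spec: dict[str, Any]) -> list[str]:
    """Return one message per failed check for every container in a Pod spec."""
    problems: list[str] = []
    pod_sc = pod_spec.get("securityContext", {})
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for c in containers:
        name = c.get("name", "<unnamed>")
        sc = c.get("securityContext", {})
        # The root filesystem must be explicitly read-only.
        if sc.get("readOnlyRootFilesystem") is not True:
            problems.append(f"{name}: root filesystem must be read-only")
        # runAsNonRoot can be set per container or inherited from the Pod.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: must run as non-root")
        # Assumption: "removing administrative capabilities" means dropping ALL.
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: must drop ALL capabilities")
    return problems


pod = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [{"name": "app", "image": "example.com/app:1.0"}],
}
print(violations(pod))
# prints ['app: root filesystem must be read-only', 'app: must drop ALL capabilities']
```

None of these checks needs anything outside the object being admitted, which is exactly why CEL inside the API server is enough for them.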
Why, then, do we need a webhook? Why add another admission controller to Kubernetes that increases the latency of every API request that creates resources involving containers? As our platform team knows, Validating Admission Policies run inside the kube-apiserver. They can evaluate CEL expressions, but they cannot reach anything outside it.
Our security team spent a good amount of time with the platform team discussing supply chain attacks and how important it is to control and validate every container image running in our cluster. That validation requires external calls, for example to the container registry to fetch signatures, which makes a validating (or mutating) admission webhook the right choice. In the end, the security team's goal was to add another layer that makes any attacker's job harder. Even an internal actor who wants to misuse our infrastructure would first have to inject a compromised container image into our own container registry. That means getting malicious code into our company repository, passing four-eyes review on the pull request, having it merged into the codebase, and then having it deployed to Kubernetes.
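For contrast with in-process CEL policies, a validating webhook is an HTTPS endpoint that receives an `AdmissionReview` (`admission.k8s.io/v1`) and must answer with the same `uid` and an `allowed` decision. The sketch below covers only that request/response contract. `verify_image` is a hypothetical stand-in for the external registry and signature check; TLS, HTTP serving and timeouts are left out.

```python
from typing import Any, Callable


def review(admission_review: dict[str, Any],
           verify_image: Callable[[str], bool]) -> dict[str, Any]:
    """Build an admission.k8s.io/v1 AdmissionReview response for a Pod."""
    request = admission_review["request"]
    spec = (request.get("object") or {}).get("spec", {})
    # Pods list containers directly under spec; other kinds are out of scope here.
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    # verify_image is hypothetical: in reality this is the remote registry call.
    rejected = [c["image"] for c in containers if not verify_image(c["image"])]
    response: dict[str, Any] = {"uid": request["uid"], "allowed": not rejected}
    if rejected:
        response["status"] = {
            "code": 403,
            "message": "unverified images: " + ", ".join(rejected),
        }
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}
```

Because every call to this endpoint sits on the API request path, the `failurePolicy` (`Fail` or `Ignore`) set on the ValidatingWebhookConfiguration decides what happens when the webhook is slow or unreachable.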
Swiss Army Knife or Scalpel
We evaluated a few options before choosing our software, and the decision was hard because there are so many great open-source tools. Kyverno and Policy-controller are excellent, feature-rich tools; to use an analogy, they are Swiss Army knives. Here at AS Inc Example, we do not have a large security team to manage Custom Resource Definitions or to review developers' own custom resources across all the features these tools offer. The team wanted something running as soon as possible without that operational cost. They may adopt one of these tools in the future, but not now; for the time being, they chose to invest heavily in alerting with Falco, for instance.
They chose a Kubernetes image admission controller called Dvorah, which is configured through a simple and efficient YAML file. It cuts out exactly what they need, like a scalpel.
Dvorah supports the main Cosign use cases. The first is key-pair signing, the straightforward Cosign case: you generate a private key and a public key, use the private key (kept somewhere very secure, or in a cloud provider service such as AWS KMS) to sign your container images, and upload the signature to the same registry that hosts the image. The second is keyless signing, where you sign with a verifiable identity, such as a GitHub Actions OIDC token issued by token.actions.githubusercontent.com. From there our process was simple: for every open-source image we found, we updated our configuration, because we had already decided not to import anything into our internal registry due to budget limitations (yes, my imaginary company is running under budget restrictions).
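A configuration like this maps image patterns to a verification mode, and each newly discovered open-source image adds a rule. The sketch below is not Dvorah's configuration format. `Rule`, `KeyPair`, `Keyless`, the patterns and the KMS alias are hypothetical. For keyless rules, the two fields mirror what Cosign checks on a keyless signature: the OIDC issuer recorded in the signing certificate and the certificate identity, matched here with a regular expression.

```python
import fnmatch
import re
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class KeyPair:
    public_key_ref: str  # hypothetical: a PEM path or a KMS URI


@dataclass(frozen=True)
class Keyless:
    issuer: str           # OIDC issuer that must appear in the signing certificate
    identity_regexp: str  # regex the certificate identity must fully match


@dataclass(frozen=True)
class Rule:
    pattern: str  # glob on the image name; note fnmatch's '*' also matches '/'
    verifier: Union[KeyPair, Keyless]


def rule_for(image_name: str, rules: list[Rule]) -> Optional[Rule]:
    """Return the first rule whose glob matches the image name, or None."""
    for rule in rules:
        if fnmatch.fnmatchcase(image_name, rule.pattern):
            return rule
    return None


def identity_allowed(rule: Keyless, issuer: str, identity: str) -> bool:
    """Accept a keyless signature only for the expected issuer and identity."""
    return issuer == rule.issuer and re.fullmatch(rule.identity_regexp, identity) is not None


# Hypothetical rules: our own images signed with a KMS key, one upstream signed keylessly.
RULES = [
    Rule("registry.example.com/as-inc/*", KeyPair("awskms:///alias/hypothetical-signing-key")),
    Rule("ghcr.io/example-org/*",
         Keyless("https://token.actions.githubusercontent.com",
                 r"https://github\.com/example-org/.+")),
]
```

An image with no matching rule gets no verifier at all, which in a strict setup means it is rejected.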
We configured an aggressive cache to avoid calling the container registry on every verification. We discovered that Dvorah keeps an in-memory entry for parent resources in Kubernetes, which lets it skip new calls for their children. For instance, when you create a Deployment, the Kubernetes controllers create a ReplicaSet, which in turn creates Pods. This parent relationship is recorded in metadata.ownerReferences, and by checking it Dvorah can skip another remote call to the container registry to validate the same image.
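The idea can be sketched as a cache keyed by parent object: if a Pod's controlling owner was already admitted with the same images, the registry is not called again. This is an illustration of the technique, not Dvorah's code. `OwnerCache` and the `(namespace, kind, name)` key are hypothetical, and a production version would key on UIDs and expire entries so that a deleted and recreated object cannot reuse a stale result.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


def images_of(obj: dict[str, Any]) -> frozenset[str]:
    """Collect images from a Pod spec or from a workload's Pod template."""
    spec = obj.get("spec", {})
    # Deployments and ReplicaSets nest the Pod spec under spec.template.spec.
    pod_spec = spec.get("template", {}).get("spec", spec)
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    return frozenset(c["image"] for c in containers)


@dataclass
class OwnerCache:
    # Hypothetical key: (namespace, kind, name) -> images already verified.
    verified: dict[tuple[str, str, str], frozenset[str]] = field(default_factory=dict)
    registry_calls: int = 0

    def _remember(self, obj: dict[str, Any], images: frozenset[str]) -> None:
        meta = obj.get("metadata", {})
        key = (meta.get("namespace", "default"), obj["kind"], meta["name"])
        self.verified[key] = images

    def admit(self, obj: dict[str, Any], verify: Callable[[str], bool]) -> bool:
        meta = obj.get("metadata", {})
        namespace = meta.get("namespace", "default")
        images = images_of(obj)
        for ref in meta.get("ownerReferences", []):
            parent = self.verified.get((namespace, ref["kind"], ref["name"]))
            # Trust only the managing controller, and only for images it already covered.
            if ref.get("controller") and parent is not None and images <= parent:
                self._remember(obj, images)
                return True
        for image in sorted(images):
            self.registry_calls += 1  # stands in for a remote registry lookup
            if not verify(image):
                return False
        self._remember(obj, images)
        return True
```

With a Deployment, its ReplicaSet and one Pod admitted in that order, only the Deployment triggers a registry call.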
After a few tests, we found that many open-source images signed keylessly are referenced only as image:tag, and the tag cache is not enabled by default: Dvorah was designed to enforce strict security by asking developers to use image:tag@digest, which is not very common. A tag can be moved to point at a different image, while a digest cannot, so caching by tag trades some strictness for fewer registry lookups. We enabled the tag cache to cover a wider range of images and save remote requests.
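The difference between the two cache modes comes down to how an image reference is split and keyed. The sketch below is hypothetical (it is not Dvorah's parser). It splits `name[:tag][@digest]`, treating a `:` before the last `/` as a registry port, and only produces a tag-based key when the tag cache is switched on. Defaulting a missing tag to `latest` follows the usual container tooling convention.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    name: str
    tag: Optional[str]
    digest: Optional[str]


def parse_ref(ref: str) -> ImageRef:
    """Split 'name[:tag][@digest]'; a ':' before the last '/' is a registry port."""
    name, _, digest = ref.partition("@")
    tag = None
    colon = name.rfind(":")
    if colon > name.rfind("/"):
        name, tag = name[:colon], name[colon + 1:]
    return ImageRef(name, tag, digest or None)


def cache_key(ref: str, tag_cache_enabled: bool) -> Optional[str]:
    """Return a cache key for a verification result, or None if it must not be cached."""
    image = parse_ref(ref)
    if image.digest:
        # A digest is an immutable content address, so it is always safe to cache.
        return f"{image.name}@{image.digest}"
    if tag_cache_enabled:
        # Tags are mutable: caching them saves registry calls but is less strict.
        return f"{image.name}:{image.tag or 'latest'}"
    return None
```

With the tag cache off, every `image:tag` reference costs a registry round trip, which is what pushed us to enable it.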
Open source issues
As you should always expect, no software fits every use case perfectly, and when it does not, you have two options: adapt your process to work around the limitation, or open an issue in the open-source repository and help the maintainers fix it. This is one of the trade-offs of choosing an open-source tool.
We run it for ArgoCD, but not all ArgoCD images come from that project: ArgoCD uses a Redis image from a public AWS registry without any signature. What can we do in this case? We could run our admission controller in audit mode, or we could request a feature to allowlist the image. We opened an issue and waited for the maintainers to review and work on it. We talked with their developer (me again), and he helped us with this and a couple of minor issues we found. This is the exception rather than the rule: often you open an issue and, a couple of months later, a bot closes it. When you choose an open-source solution, keep this in mind and check it during a thorough proof of concept.
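The two options map onto a small decision function: an explicit allowlist of repositories that are admitted without a signature, and an audit mode that admits failures but flags them. This is a hypothetical sketch, not Dvorah's feature. `decide` and its return values are invented for illustration, and `public.ecr.aws/docker/library/redis` is used only as an example name. Matching on the repository name lets upgrades through without a config change; allowlisting a pinned digest instead would be stricter.

```python
from typing import Callable


def repository(ref: str) -> str:
    """Strip the tag and digest from an image reference."""
    name = ref.partition("@")[0]
    colon = name.rfind(":")
    # A ':' before the last '/' belongs to a registry port, not a tag.
    return name[:colon] if colon > name.rfind("/") else name


def decide(image: str, allowlist: frozenset[str],
           verify: Callable[[str], bool], audit: bool = False) -> str:
    """Return 'allowed-unsigned', 'verified', 'audit-warn' or 'denied'."""
    if repository(image) in allowlist:
        # Explicit exception: no signature exists, so verification is skipped.
        return "allowed-unsigned"
    if verify(image):
        return "verified"
    # Audit mode admits the workload; the failure should still be logged or alerted.
    return "audit-warn" if audit else "denied"
```

An allowlist keeps enforcement on for everything else, whereas audit mode relaxes it cluster-wide, which is why we preferred asking for the allowlist feature.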
Everything was running fine until cert-manager restarted. As you may remember, cert-manager was installed before our admission controller, but every creation request, such as a new Pod or ReplicaSet, is analysed by the admission controller, and that includes Pods recreated on a new node after a restart. We discovered that Dvorah hardcoded crypto.SHA256 as the hash algorithm, while cert-manager happens to use SHA512 when signing its container images with Cosign. Again, we opened a new issue. This is the normal open-source lifecycle. Fortunately, our case was simple and Dvorah's developers delivered the fix quickly.
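The bug class is easy to show: the verifier must hash the signed payload with the same algorithm the signer used, so a hardcoded choice rejects otherwise valid signatures. `crypto.SHA256` is the Go standard library identifier; the sketch below expresses the same idea in Python with `hashlib`. `payload_digest` and the supported set are hypothetical, and the actual signature check is omitted.

```python
import hashlib

# Hypothetical allow-list of hash algorithms the verifier accepts.
SUPPORTED = ("sha256", "sha384", "sha512")


def payload_digest(payload: bytes, algorithm: str = "sha256") -> bytes:
    """Hash the signed payload with the algorithm the signer used.

    Hardcoding one algorithm makes every signature produced with
    another one fail verification, even when it is valid.
    """
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported hash algorithm: {algorithm}")
    return hashlib.new(algorithm, payload).digest()
```

Keeping the algorithm configurable, while still restricting it to a known-good set, avoids both the hardcoding bug and accepting weak hashes.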
Looking at how the Dvorah repository is run, we decided to adopt some of its best practices. First, we pin every action used in our GitHub Workflows to a full commit hash and add its version as a comment at the end of the same line. A tag can be moved but a commit hash cannot, so this makes our builds more reliable, and it follows the pattern Dependabot understands, so this GitHub bot can keep our workflows up to date. You can see in the Dvorah repository's closed pull requests how many updates it contributes. Keeping dependencies updated is an important security measure.
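To enforce this convention in CI, a small checker can flag every `uses:` line that is not pinned to a 40-character commit SHA, or that is pinned but lacks the trailing version comment. This is a hypothetical helper of our own, not part of GitHub or Dependabot. It is line-based rather than a full YAML parser, and it skips local (`./`) and `docker://` references.

```python
import re

# Matches "uses: owner/repo@ref" with an optional trailing "# comment".
USES = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^\s'\"#]+)['\"]?\s*(#.*)?$")
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def unpinned(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line number, problem) for each 'uses:' that breaks the pinning rule."""
    findings: list[tuple[int, str]] = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        m = USES.match(line)
        if not m:
            continue
        action = m.group(1)
        if action.startswith(("./", "docker://")):
            continue  # local actions and container images are out of scope here
        ref = action.partition("@")[2]
        if not FULL_SHA.fullmatch(ref):
            findings.append((lineno, f"{action}: not pinned to a commit SHA"))
        elif not m.group(2):
            findings.append((lineno, f"{action}: pinned but missing version comment"))
    return findings
```

A line such as `- uses: actions/checkout@<40-hex-sha> # v4.1.1` passes, while `- uses: actions/setup-go@v5` is reported.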
Two other practices are worth noting. Dvorah packages its Helm chart as an Open Container Initiative (OCI) artifact and publishes it to GitHub Packages, and it uses the e2e-framework, a testing framework from the Kubernetes developers that we covered in our first post. We don't think helm lint or helm template alone is enough to validate a chart; running extended end-to-end tests is good practice.
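For readers who have not used Helm's OCI support (Helm 3.8 and later), publishing and installing a chart through a registry such as GitHub Packages (`ghcr.io`) comes down to three commands. The sketch only builds the command lines and prints them; it does not run them. The owner, chart name, version, release name and the `charts` path segment are hypothetical.

```python
import shlex


def helm_oci_commands(chart_archive: str, registry: str, owner: str,
                      chart: str, version: str, release: str) -> list[list[str]]:
    """Build (not run) the helm commands to push and install a chart via OCI."""
    repo = f"oci://{registry}/{owner}/charts"  # hypothetical namespace layout
    return [
        ["helm", "registry", "login", registry],
        # helm push stores the chart as <repo>/<chart name>:<chart version>.
        ["helm", "push", chart_archive, repo],
        ["helm", "install", release, f"{repo}/{chart}", "--version", version],
    ]


if __name__ == "__main__":
    for cmd in helm_oci_commands("mychart-0.1.0.tgz", "ghcr.io", "example-org",
                                 "mychart", "0.1.0", "demo"):
        print(shlex.join(cmd))
```

Serving charts from the same kind of registry as the images means one set of credentials and one access model for both.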
To conclude, adding an open-source tool can bring drawbacks, like any software, and we should always validate it against our design decisions before committing team effort to implementing it. Having a clear vision of where your company wants to go is important and helps your teams make decisions. Choosing a tool because you like it, or because it is trending, could turn out to be a bad choice later.