title: "Kubernetes Security Best Practices: Lessons from Securing Production Clusters"
slug: "kubernetes-security-best-practices-2026"
meta_description: "Kubernetes security best practices from a production architect: Pod Security Standards, RBAC, network policies, OPA Gatekeeper, Falco, and image scanning."
author: "Kenny Ogunlowo"
date: "2026-05-29"
category: "Cybersecurity"
tags: ["kubernetes security", "k8s security best practices", "pod security standards", "rbac kubernetes", "falco", "opa gatekeeper", "container security"]
internal_links:
- "/collections/cybersecurity-frameworks"
- "/collections/devops-pipelines"
- "/pages/free-courses"
word_count: 2300
Kubernetes Security Best Practices: Lessons from Securing Production Clusters
Two incidents shaped how I think about Kubernetes security.
The first was at a healthcare platform running on GKE. A developer deployed a test container that pulled from a public Docker Hub image pinned to `latest`. That image had been replaced with a version containing a cryptominer. The miner ran for 11 days before our anomaly detection caught the spike in CPU utilization. The blast radius was contained — the workload had no database access, no secrets mounted — but the post-mortem required six hours of forensics and a conversation with our CISO about container provenance policies.
The second was at a financial services firm I consulted for. Their Kubernetes cluster had no network policies. A compromised microservice was able to make direct connections to the internal metadata service API, retrieve the attached service account credentials, and use those credentials to enumerate S3 buckets. That incident took three days to fully scope and resulted in a regulatory notification.
Neither incident required a novel attack vector. Both exploited basic misconfigurations that most Kubernetes documentation does not emphasize clearly enough. This guide covers the controls that would have prevented both — and the broader security posture I implement on every cluster I build or audit today.
For the complete cybersecurity framework toolkit, the Cybersecurity Frameworks collection at Citadel Cloud includes Kubernetes security baseline configurations and audit checklists.
The Kubernetes Threat Model
Before implementing controls, you need to know what you are defending against. The Kubernetes attack surface has four primary domains:
- The API server: The central control plane component. Unauthorized access to the API server gives an attacker full cluster control.
- Node compromise: If an attacker gains node-level access (through a container escape or SSH compromise), they can access all Pods on that node and potentially pivot to other nodes.
- Workload compromise: A compromised application container can attempt lateral movement to other services, exfiltrate secrets, or access the cloud provider's instance metadata service.
- Supply chain: Malicious or vulnerable container images, compromised base images, or dependency injection attacks.
The NSA and CISA released the Kubernetes Hardening Guide in 2021 (updated 2022). It remains the authoritative reference for US government and defense environments. Every recommendation in this guide aligns with that framework.
Pod Security Standards: Your First Line of Defense
Pod Security Policies (PSP) were deprecated in Kubernetes 1.21 and removed in 1.25, the same release in which Pod Security Admission, the built-in controller that enforces the Pod Security Standards (PSS), became stable. PSP was removed because it was difficult to configure correctly; in many cases a misconfigured policy was worse than no policy. PSS is simpler and is enforced at the namespace level via admission control.
The Three PSS Profiles
- Privileged: No restrictions. This is the default and should never be used for production workloads.
- Baseline: Prevents known privilege escalation vectors. Blocks `hostPID`, `hostIPC`, `hostNetwork`, privileged containers, and dangerous capabilities.
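- Restricted: The most locked-down profile. Requires non-root execution, `allowPrivilegeEscalation: false`, an explicit seccomp profile (`RuntimeDefault` or `Localhost`), and dropping `ALL` capabilities (only `NET_BIND_SERVICE` may be added back). A read-only root filesystem is not part of the profile but is strongly recommended.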
Apply PSS to namespaces via labels:
apiVersion: v1
kind: Namespace
metadata:
  name: production-api
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
The `enforce` label blocks non-compliant Pods at admission. The `audit` label logs violations without blocking. The `warn` label returns a user-facing warning. Use all three: enforce in production, audit and warn in lower environments to catch issues before they reach production.
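For a lower environment, the same labels can be split so that violations of the stricter profile are surfaced without blocking deployments. The manifest below is a minimal sketch of that idea; the `staging-api` namespace name and the choice of `baseline` as the enforced floor are assumptions for illustration, not a prescription.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging-api            # assumed name for a pre-production namespace
  labels:
    # Block only Pods that fail the baseline profile
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    # Log and warn on anything that would fail restricted in production
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
```

With this split, a developer deploying a non-compliant Pod to staging sees a warning in the `kubectl` output while the Pod is still admitted, which gives teams time to fix the spec before it reaches a namespace that enforces `restricted`.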
Writing PSS-Compliant Pod Specs
A restricted-compliant Pod specification:
apiVersion: v1
kind: Pod
metadata:
  name: api-server
  namespace: production-api
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: api
      image: registry.citadelcloud.com/api:v2.4.1@sha256:abc123...
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      resources:
        requests:
          memory: "128Mi"
          cpu: "250m"
        limits:
          memory: "256Mi"
          cpu: "500m"
      volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
  volumes:
    - name: tmp-dir
      emptyDir: {}
Three things worth noting here:
- The image is pinned to an immutable SHA digest, not a mutable tag. This prevents the Docker Hub cryptominer scenario from my introduction.
- `readOnlyRootFilesystem: true` forces any temporary writes to explicitly mounted volumes. This makes it harder for malware to persist files on disk.
- Resource limits are set. A container without limits can consume all node resources during an attack or runaway process. On the Lockheed Martin cluster, we enforced resource limits via OPA policies because developers consistently forgot to set them.
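Alongside admission policies that reject Pods without limits, Kubernetes has a native object that fills in defaults when a developer forgets them. The `LimitRange` below is a sketch of that alternative, not the OPA policy mentioned above; the object name and the specific values (copied from the Pod spec in this section) are assumptions.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits   # hypothetical name
  namespace: production-api
spec:
  limits:
    - type: Container
      # Applied as limits when a container declares none
      default:
        cpu: 500m
        memory: 256Mi
      # Applied as requests when a container declares none
      defaultRequest:
        cpu: 250m
        memory: 128Mi
```

A `LimitRange` only supplies defaults within its own namespace, so it complements rather than replaces a cluster-wide policy that rejects unbounded containers.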
RBAC: Least Privilege for Every Identity
Kubernetes RBAC controls what API operations each identity can perform. "Identity" means service accounts (for workloads), users (for humans), and groups (for sets of users or service accounts).
Common RBAC Mistakes
Cluster-admin for CI/CD service accounts. I see this in almost every cluster audit I do. The CI/CD system was given cluster-admin because it was "easier" during initial setup, and nobody revoked it. Any pipeline compromise immediately gives the attacker full cluster control.
Default service account mounting. By default, Kubernetes automatically mounts a token for the namespace's `default` service account into every Pod. That token carries whatever RBAC permissions have been bound to the account: minimal on a fresh cluster, but far broader if anyone has ever attached a Role or ClusterRole to it. Most Pods do not need API server access at all.
Wildcard resources and verbs. Rules like `resources: ["*"]` and `verbs: ["*"]` are cluster-admin in disguise. Every Role and ClusterRole should specify exactly which resources and verbs are needed.
Correct RBAC Configuration
Disable automatic token mounting on the service account, which covers every Pod that runs as it:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-server-sa
  namespace: production-api
automountServiceAccountToken: false
Or disable it per Pod:
spec:
  automountServiceAccountToken: false
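Pods that do not name a service account run as the namespace's `default` account, so the same setting is worth applying there too. A minimal sketch, assuming the `production-api` namespace used throughout this guide:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default                 # the account Pods use when none is specified
  namespace: production-api
automountServiceAccountToken: false
```

Every namespace has its own `default` account, so this manifest needs to be repeated (or templated) per namespace. A Pod that genuinely needs a token can still opt back in by setting `automountServiceAccountToken: true` in its own spec.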
Create narrow-scoped Roles for workloads that do need API access:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: production-api
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
    # Restrict to specific resource names. With resourceNames set, a list
    # request is only authorized when it uses a metadata.name field selector.
    resourceNames: ["app-config"]
For CI/CD systems, scope the service account to the minimum required for deployments:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: production-api
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
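A Role grants nothing until it is bound to an identity. The sketch below binds the `deployer` Role to a dedicated CI/CD service account in the same namespace; the `ci-deployer` and `ci-deployer-binding` names are hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer             # hypothetical account used only by the pipeline
  namespace: production-api
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer-binding     # hypothetical name
  namespace: production-api
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: production-api
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployer                # the namespaced Role defined above
```

Because this is a RoleBinding rather than a ClusterRoleBinding, a compromised pipeline credential is limited to Deployments and Pods in `production-api` instead of the whole cluster.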
Audit your cluster's RBAC configuration regularly. The `kubectl auth can-i` command lets you test permissions as specific service accounts:
kubectl auth can-i --list --as=system:serviceaccount:production-api:api-server-sa
Network Policies: Zero-Trust Pod Networking
By default, all Pods in a Kubernetes cluster can communicate with all other Pods across all namespaces. This is the definition of implicit trust — the opposite of zero-trust. Network Policies let you define explicit ingress and egress rules at the Pod level.
Network Policies require a CNI plugin that supports them. Calico, Cilium, and AWS VPC CNI all support Network Policies. Flannel does not.
Deny-All Base Policy
Start with a deny-all policy in every namespace and explicitly allow only required traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production-api
spec:
  podSelector: {} # Applies to all Pods in namespace
  policyTypes:
    - Ingress
    - Egress
Then explicitly allow required communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-ingress
  namespace: production-api
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Both selectors sit in one list item, so a peer must match both:
        # an ingress-nginx Pod running in the ingress-nginx namespace.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
          podSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-database
  namespace: production-api
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to: # Allow DNS resolution
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP # DNS falls back to TCP for large responses
          port: 53
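This approach would have blocked the lateral movement in the financial services incident I described. The compromised service had no policy allowing egress to the metadata service address, so with a default-deny egress policy in place the connection attempt would have been dropped. How link-local traffic to the metadata endpoint is treated varies by CNI plugin, so verify the block from inside a Pod rather than assuming it.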
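Workloads that legitimately need outbound internet access are the harder case, because a broad egress rule can quietly re-open the metadata endpoint. The policy below is a sketch of one common pattern: allow HTTPS to any address except `169.254.169.254`, the instance metadata address used by the major cloud providers. The policy name is hypothetical, and the `app: api-server` selector is reused from the examples above.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-https-except-metadata   # hypothetical name
  namespace: production-api
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32   # cloud instance metadata service
      ports:
        - protocol: TCP
          port: 443
```

Network Policies are additive, so this rule combines with the database and DNS rules above rather than replacing them. Narrowing `cidr` to the specific external ranges the workload calls is tighter still when those ranges are known.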
OPA Gatekeeper: Policy as Code at Admission
Open Policy Agent (OPA) Gatekeeper is an admission controller that enforces custom policies on all Kubernetes API operations. Where RBAC controls who can do what, OPA Gatekeeper controls what configurations are valid regardless of who submitted them.
Installing Gatekeeper
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.16.3/deploy/gatekeeper.yaml
Writing Constraint Templates
Constraint Templates define reusable policy logic in Rego (OPA's policy language). This example enforces that all container images must come from an approved registry:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          type: object
          properties:
            repos:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedrepos
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          satisfied := [good | repo = input.parameters.repos[_]; good = startswith(container.image, repo)]
          not any(satisfied)
          msg := sprintf("Container image <%v> is not from an approved registry. Approved registries: %v", [container.image, input.parameters.repos])
        }
Apply a Constraint to enforce the policy:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: require-approved-registries
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces:
      - production-api
      - staging-api
  parameters:
    repos:
      - "registry.citadelcloud.com/"
      - "gcr.io/citadel-prod/"
      - "public.ecr.aws/citadel/"
Other high-value Gatekeeper policies to implement:
- Require resource limits: Any Pod without CPU and memory limits is blocked
- Block latest tags: Force image digest pinning instead of mutable tags
- Require specific labels: Enforce labeling standards for cost allocation and incident response
- Block privileged containers: Belt-and-suspenders on top of PSS
- Require seccomp profiles: Ensure all workloads have a seccomp policy
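As an illustration of the first item in that list, here is a minimal sketch of a template and constraint that reject Pods whose containers lack CPU or memory limits. The `K8sRequiredLimits` kind and the object names are hypothetical, the Rego uses the same pre-1.0 syntax as the registry example above, and it only inspects `spec.containers` (init and ephemeral containers would need their own rules).

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlimits        # must be the lowercase form of the kind below
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLimits  # hypothetical kind name
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlimits

        # One violation per container that has no CPU limit
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.limits.cpu
          msg := sprintf("Container <%v> has no CPU limit", [container.name])
        }

        # One violation per container that has no memory limit
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.limits.memory
          msg := sprintf("Container <%v> has no memory limit", [container.name])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLimits
metadata:
  name: require-cpu-memory-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces:
      - production-api
```

Apply the template first and wait for Gatekeeper to create the corresponding CRD before applying the constraint; submitting both in a single pass can fail because the `K8sRequiredLimits` kind does not exist yet.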
The Cybersecurity Frameworks collection includes ready-to-apply Gatekeeper Constraint Templates for all of the above policies.
Falco: Runtime Threat Detection
Falco is a Cloud Native Computing Foundation (CNCF) project that monitors kernel system calls in real time and alerts on suspicious behavior. Where Gatekeeper prevents bad configurations at admission, Falco detects bad behavior at runtime.
What Falco Detects
Falco comes with a default rule set covering common attack patterns. Rules I have triggered in real environments:
- A shell spawned inside a running container (`proc.name = bash` inside a production API container)
- Sensitive file reads (`/etc/shadow`, `/root/.ssh/`, `/proc/*/mem`)
- Network connections to unexpected external destinations
- Container running as root when it should not be
- A process writing to a directory it should only be reading
Falco Deployment
The recommended deployment method in production is as a DaemonSet using the eBPF driver, which cannot crash the kernel the way a faulty kernel module can and works on managed nodes where loading kernel modules is restricted:
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco \
--namespace falco \
--create-namespace \
--set driver.kind=ebpf \
--set falcosidekick.enabled=true \
--set falcosidekick.config.slack.webhookurl="https://hooks.slack.com/services/..." \
--set falcosidekick.config.slack.minimumpriority=warning
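The same settings can be kept in a values file instead of a chain of `--set` flags, which keeps the webhook URL out of shell history and makes the configuration reviewable in Git. The sketch below is a direct translation of the flags above; the file name `falco-values.yaml` is an assumption.

```yaml
# falco-values.yaml (hypothetical file name)
driver:
  kind: ebpf
falcosidekick:
  enabled: true
  config:
    slack:
      webhookurl: "https://hooks.slack.com/services/..."
      minimumpriority: warning
```

Pass it to the same `helm install` command with `-f falco-values.yaml` in place of the `--set` lines. In a real setup the webhook URL is itself a secret and is better injected from a secrets store than committed.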
Custom Falco Rules
Add custom rules tailored to your environment:
- rule: Cryptominer Process Started
  desc: Detects execution of known cryptominer processes
  condition: >
    spawned_process and
    (proc.name in (xmrig, minerd, cpuminer, ccminer, cgminer, bfgminer))
  output: >
    Cryptominer process started (user=%user.name container=%container.name
    image=%container.image.repository:%container.image.tag
    proc=%proc.name pid=%proc.pid)
  priority: CRITICAL
  tags: [cryptominer, container, process]
- rule: Unexpected Outbound Connection
  desc: Detects unexpected outbound connections from production containers
  condition: >
    outbound and
    not fd.sport in (443, 80, 5432, 6379, 9200) and
    container.image.repository startswith "registry.citadelcloud.com"
  output: >
    Unexpected outbound connection (user=%user.name container=%container.name
    connection=%fd.name)
  priority: WARNING
  tags: [network, container]
The cryptominer rule would have alerted within seconds of the rogue container in my healthcare incident — rather than requiring 11 days of anomaly detection based on CPU metrics.
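The shell-in-container detection listed earlier can be scoped the same way. The rule below is a sketch that narrows it to images from the registry used in this guide; the rule name and the list of shell binaries are assumptions, and it relies on the `spawned_process` and `container` macros from Falco's default rule set.

```yaml
- rule: Shell Spawned in Production Container   # hypothetical rule name
  desc: Detects an interactive shell starting inside a container built from the production registry
  condition: >
    spawned_process and container and
    proc.name in (bash, sh, zsh, dash) and
    container.image.repository startswith "registry.citadelcloud.com"
  output: >
    Shell spawned in production container (user=%user.name container=%container.name
    image=%container.image.repository proc=%proc.name parent=%proc.pname
    cmdline=%proc.cmdline)
  priority: WARNING
  tags: [shell, container]
```

Images that legitimately run shell entrypoint scripts will trigger this rule at startup, so expect to add exclusions for known parent processes before routing it to an on-call channel.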
Image Scanning: Security Starts in the Pipeline
Every container image you run is a potential attack vector. Outdated base images contain known CVEs. Dependencies pulled in by your application code may have vulnerabilities. Scanning needs to happen at every stage: build time, push time, and deployment time.
Trivy: The Open-Source Standard
Trivy is one of the most widely used open-source vulnerability scanners for container images. It scans OS packages, language-specific dependencies, and Infrastructure as Code configurations.
In a GitHub Actions pipeline:
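- name: Scan container image
  # In a real pipeline, pin the action to a release tag or commit SHA
  # rather than a moving branch, for the same reason images are pinned.
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'registry.citadelcloud.com/api:${{ github.sha }}'
    format: 'sarif'
    output: 'trivy-results.sarif'
    severity: 'HIGH,CRITICAL'
    exit-code: '1' # Fail the pipeline on HIGH or CRITICAL findings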
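The SARIF file is only useful if something consumes it. A follow-on step along these lines publishes the findings to GitHub code scanning; it is a sketch that assumes the `github/codeql-action/upload-sarif` action at the `v3` tag and a job that has been granted the `security-events: write` permission.

```yaml
- name: Upload Trivy results to GitHub code scanning
  if: always()   # still upload when the scan step has failed the job
  uses: github/codeql-action/upload-sarif@v3   # version tag is an assumption
  with:
    sarif_file: 'trivy-results.sarif'
```

The `if: always()` condition matters here: with `exit-code: '1'`, the scan step fails exactly when there are findings worth reviewing, and later steps are skipped by default after a failure.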
Image Signing with Cosign
Trivy tells you what vulnerabilities are in an image. Cosign tells you that the image has not been tampered with between build and deployment. Cosign (part of the Sigstore project) lets you cryptographically sign and verify container images:
# Sign during CI
cosign sign --key cosign.key registry.citadelcloud.com/api:$GIT_SHA
# Verify during deployment (or in Gatekeeper)
cosign verify --key cosign.pub registry.citadelcloud.com/api:$GIT_SHA
A Gatekeeper policy can enforce that only signed images are deployed to production, combining image provenance verification with policy enforcement.
Secrets Management: Do Not Put Them in Kubernetes Secrets
The native `Secret` object (`apiVersion: v1`, `kind: Secret`) stores values base64-encoded, not encrypted. Anyone with access to the etcd datastore, or with RBAC permission to read Secrets in the namespace, can recover the plaintext. For production workloads in regulated environments, native Kubernetes Secrets on their own are not sufficient for sensitive values.
AWS Secrets Manager + External Secrets Operator: The External Secrets Operator syncs secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault into Kubernetes Secrets. Your applications consume standard Kubernetes Secrets, but the values are sourced from an encrypted, audited secrets store.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: production-api
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials
    creationPolicy: Owner
  data:
    - secretKey: db_password
      remoteRef:
        key: production/api/database
        property: password
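The `secretStoreRef` above points at a `ClusterSecretStore` named `aws-secrets-manager`, which has to exist before the ExternalSecret can sync. A minimal sketch of that store follows, assuming AWS Secrets Manager in `us-east-1` and IAM Roles for Service Accounts for authentication; the region and the `external-secrets-sa` service account are hypothetical.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1                  # assumed region
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa    # hypothetical account annotated with an IAM role
            namespace: external-secrets
```

Scoping the IAM role behind that service account to the `production/api/*` secret paths keeps the operator itself on least privilege, so a compromise of the operator does not expose every secret in the account.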
etcd encryption at rest: Enable envelope encryption for etcd using a KMS provider (AWS KMS, GCP Cloud KMS). This encrypts the resource types you list in the API server's encryption configuration, typically Secrets and optionally ConfigMaps or other sensitive objects, at the storage layer.
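On self-managed control planes, that choice is expressed in an `EncryptionConfiguration` file passed to the API server. The sketch below assumes a KMS v2 plugin is already running on the control plane node; the plugin name and socket path are hypothetical. Managed services such as EKS and GKE expose the same capability through their own cluster settings instead of this file.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # First provider is used for writes: envelope encryption via the KMS plugin
      - kms:
          apiVersion: v2
          name: example-kms-plugin                        # hypothetical plugin name
          endpoint: unix:///var/run/kmsplugin/socket.sock # hypothetical socket path
          timeout: 3s
      # Fallback so objects written before encryption was enabled stay readable
      - identity: {}
```

Existing Secrets are not re-encrypted automatically; they are only encrypted the next time they are written, so a one-time rewrite of all Secrets is needed after enabling this.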
Audit Logging and Compliance
Every Kubernetes API request should be logged. Audit logs are your forensic record for incident response and compliance evidence for FedRAMP, HIPAA, and PCI-DSS. Configure audit policy to capture security-relevant events without creating an unmanageable data volume:
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: None
    resources:
      - group: ""
        resources: ["events"]
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]
  - level: Request
    omitStages:
      - RequestReceived
Ship audit logs to a SIEM (Splunk, Elastic, Datadog) or a cloud-native logging solution (AWS CloudWatch Logs, GCP Cloud Logging) with retention policies that meet your compliance requirements. At Lockheed Martin, audit logs shipped to Splunk and triggered automated alerts on specific API operations (role binding creation, secret access from unexpected service accounts).
The DevOps Pipelines collection at Citadel Cloud includes logging pipeline configurations for Kubernetes audit data integrated with Splunk and Elastic.
The Security Baseline Checklist
Every cluster I build or audit gets measured against this baseline before going to production:
- [ ] Control plane access restricted to authorized networks (no `0.0.0.0/0` on the API server)
- [ ] etcd encrypted at rest with KMS
- [ ] Node OS hardened (CIS Benchmark Level 2 for the node OS)
- [ ] All namespaces have PSS enforce labels (baseline minimum, restricted for production workloads)
- [ ] RBAC audited — no cluster-admin service accounts, no wildcard permissions
- [ ] Network Policies deployed — default deny-all with explicit allow rules
- [ ] OPA Gatekeeper installed with policies for: approved registries, required resource limits, no latest tags
- [ ] Falco deployed with custom rules for the environment
- [ ] Image scanning in CI pipeline, blocking HIGH/CRITICAL CVEs
- [ ] Image signing with Cosign
- [ ] Secrets sourced from external secrets manager (not native Kubernetes Secrets)
- [ ] Audit logging enabled and shipping to SIEM with 90-day retention
- [ ] Node-level access restricted (no direct SSH in production — use Session Manager or kubectl exec with audit logging)
FAQ
How do I handle legacy applications that cannot run as non-root?
This is a real constraint. The practical approach is to start by applying the `baseline` PSS profile (which blocks privilege escalation and dangerous capabilities but does not require non-root) and target `restricted` as a migration goal. Document every exception with a risk acceptance and a timeline for remediation. At Cigna Healthcare, we had a 12-month roadmap to move all workloads from `privileged` namespace policies to `baseline`, with a subsequent 18-month roadmap to reach `restricted` for all customer-facing services. Progress matters more than perfection on day one.
Is Kubernetes security different between EKS, AKS, and GKE?
The core Kubernetes security primitives (PSS, RBAC, Network Policies, Audit Logging) work the same across all three. The differences are in the managed control plane configuration (EKS does not expose etcd directly, GKE Autopilot enforces more security constraints by default), the IAM integration (EKS uses IAM Roles for Service Accounts, AKS uses Workload Identity with Entra ID, GKE uses Workload Identity with Google IAM), and the default security posture. GKE Autopilot has the most opinionated security defaults. EKS Standard gives you the most control. AKS sits between them.
How often should I audit my cluster's security posture?
At minimum: run kube-bench (the CIS Kubernetes Benchmark tool) after every major cluster upgrade and quarterly otherwise. Run Trivy against your running workloads weekly for new CVE coverage. Review RBAC bindings whenever a team member leaves or changes roles. Run a full penetration test annually for production clusters in regulated environments. Automated tooling handles the continuous checks; manual review is for the edge cases and the policy questions that automation cannot answer.
What is the difference between Falco and a WAF for Kubernetes?
A WAF (Web Application Firewall) sits in front of your application endpoints and inspects HTTP/S traffic at the application layer (layer 7) for known attack signatures. Falco operates at the system call layer on the node and detects malicious behavior that has already bypassed the network perimeter. They are complementary, not alternatives. A WAF protects against external attacks on your application endpoints. Falco detects what happens after a successful compromise: a shell being spawned, a sensitive file being read, an unexpected process being launched. For production clusters handling regulated data, both layers are required.
Does Kubernetes security work differently at scale with hundreds of clusters?
Yes, significantly. Managing RBAC, Network Policies, and OPA Gatekeeper constraints across dozens or hundreds of clusters requires centralized policy management. GitOps tools like Flux CD or Argo CD become essential: ConstraintTemplates, Constraints, and baseline RBAC live in one Git repository and are synced to every cluster, so a policy change is a reviewed commit rather than a per-cluster `kubectl apply`. The Cybersecurity Frameworks collection at Citadel Cloud includes multi-cluster security architecture patterns for enterprise environments managing Kubernetes at scale.
*Kenny Ogunlowo is a Senior Multi-Cloud DevSecOps Architect with production Kubernetes experience in FedRAMP, HIPAA, and CMMC-regulated environments at Cigna Healthcare, Lockheed Martin, NantHealth, BP Refinery, and Patterson UTI. He holds AWS, Azure, and GCP professional certifications.*