Citadel Cloud Management; Sam O., Citadel Cloud Management

June 25, 2026 By Kenny Ogunlowo 14 min read

Kubernetes Security Best Practices: Lessons from Securing Production Clusters

author: "Kenny Ogunlowo"

date: "2026-05-29"

Two incidents shaped how I think about Kubernetes security.

The first was at a healthcare platform running on GKE. A developer deployed a test container that pulled from a public Docker Hub image pinned to `latest`. That image had been replaced with a version containing a cryptominer. The miner ran for 11 days before our anomaly detection caught the spike in CPU utilization. The blast radius was contained — the workload had no database access, no secrets mounted — but the post-mortem required six hours of forensics and a conversation with our CISO about container provenance policies.

The second was at a financial services firm I consulted for. Their Kubernetes cluster had no network policies. A compromised microservice was able to make direct connections to the internal metadata service API, retrieve the attached service account credentials, and use those credentials to enumerate S3 buckets. That incident took three days to fully scope and resulted in a regulatory notification.

Neither incident required a novel attack vector. Both exploited basic misconfigurations that most Kubernetes documentation does not emphasize clearly enough. This guide covers the controls that would have prevented both — and the broader security posture I implement on every cluster I build or audit today.

For the complete cybersecurity framework toolkit, the Cybersecurity Frameworks collection at Citadel Cloud includes Kubernetes security baseline configurations and audit checklists.

The Kubernetes Threat Model

Before implementing controls, you need to know what you are defending against. The Kubernetes attack surface has four primary domains:

The API server: The central control plane component. Unauthorized access to the API server gives an attacker full cluster control.
Node compromise: If an attacker gains node-level access (through a container escape or SSH compromise), they can access all Pods on that node and potentially pivot to other nodes.
Workload compromise: A compromised application container can attempt lateral movement to other services, exfiltrate secrets, or access the cloud provider's instance metadata service.
Supply chain: Malicious or vulnerable container images, compromised base images, or malicious packages pulled in as dependencies (for example, dependency confusion attacks).

The NSA and CISA released the Kubernetes Hardening Guide in 2021 (updated 2022). It remains the authoritative reference for US government and defense environments. Every recommendation in this guide aligns with that framework.

Pod Security Standards: Your First Line of Defense

Pod Security Standards (PSS), enforced by the built-in Pod Security Admission controller, replaced Pod Security Policies (PSP) as of Kubernetes 1.25. PSP was deprecated in 1.21 and removed in 1.25 because it was difficult to configure correctly; in many cases a misconfigured policy was worse than no policy. PSS is simpler and is enforced per namespace at admission.

The Three PSS Profiles

Privileged: No restrictions. This is the default and should never be used for production workloads.
Baseline: Prevents known privilege escalation vectors. Blocks `hostPID`, `hostIPC`, `hostNetwork`, privileged containers, and dangerous capabilities.
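Restricted: The most locked-down profile. Requires non-root execution, `allowPrivilegeEscalation: false`, a seccomp profile of `RuntimeDefault` or `Localhost`, and dropping all capabilities (only `NET_BIND_SERVICE` may be added back). A read-only root filesystem is not part of the profile but is strongly recommended.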

Apply PSS to namespaces via labels:


apiVersion: v1
kind: Namespace
metadata:
  name: production-api
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest

The `enforce` label blocks non-compliant Pods at admission. The `audit` label logs violations without blocking. The `warn` label returns a user-facing warning. Use all three: enforce in production, audit and warn in lower environments to catch issues before they reach production.

Writing PSS-Compliant Pod Specs

A restricted-compliant Pod specification:


apiVersion: v1
kind: Pod
metadata:
  name: api-server
  namespace: production-api
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: api
    image: registry.citadelcloud.com/api:v2.4.1@sha256:abc123...
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
    resources:
      requests:
        memory: "128Mi"
        cpu: "250m"
      limits:
        memory: "256Mi"
        cpu: "500m"
    volumeMounts:
    - name: tmp-dir
      mountPath: /tmp
  volumes:
  - name: tmp-dir
    emptyDir: {}

Three things worth noting here:

The image is pinned to an immutable SHA digest, not a mutable tag. This prevents the Docker Hub cryptominer scenario from my introduction.
`readOnlyRootFilesystem: true` forces any temporary writes to explicitly mounted volumes. This makes it harder for malware to persist files on disk.
Resource limits are set. A container without limits can consume all node resources during an attack or runaway process. On the Lockheed Martin cluster, we enforced resource limits via OPA policies because developers consistently forgot to set them.
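
Before a manifest reaches a namespace labeled `restricted`, the same fields can be checked locally. The following is a minimal sketch, not a full Pod Security Standards validator: it assumes the Pod spec has already been parsed into a Python dictionary, the function name `restricted_violations` is hypothetical, and it covers only four of the Restricted controls while ignoring init and ephemeral containers.

```python
def restricted_violations(pod_spec: dict) -> list[str]:
    """Return human-readable problems for a few Restricted-profile fields."""
    problems = []
    pod_ctx = pod_spec.get("securityContext", {})

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})

        # Container-level settings take precedence over Pod-level ones.
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        if run_as_non_root is not True:
            problems.append(f"{name}: runAsNonRoot must be true")

        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")

        dropped = ctx.get("capabilities", {}).get("drop", [])
        if "ALL" not in dropped:
            problems.append(f"{name}: capabilities.drop must include ALL")

        seccomp = ctx.get("seccompProfile", pod_ctx.get("seccompProfile", {}))
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")

    return problems


# Illustrative spec that mirrors the security settings of the manifest above.
example_spec = {
    "securityContext": {
        "runAsNonRoot": True,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "api",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}

if __name__ == "__main__":
    print(restricted_violations(example_spec))  # prints []
```

A check like this is only a fast feedback loop for developers; the namespace labels remain the actual enforcement point.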

RBAC: Least Privilege for Every Identity

Kubernetes RBAC controls what API operations each identity can perform. "Identity" means service accounts (for workloads), users (for humans), and groups (for sets of users or service accounts).

Common RBAC Mistakes

Cluster-admin for CI/CD service accounts. I see this in almost every cluster audit I do. The CI/CD system was given cluster-admin because it was "easier" during initial setup, and nobody revoked it. Any pipeline compromise immediately gives the attacker full cluster control.

Default service account mounting. By default, Kubernetes automatically mounts the namespace's default service account token into every Pod. That token carries whatever RBAC permissions have been bound to the default service account, and it hands any compromised container a valid credential for the API server. Most Pods do not need API server access at all.

Wildcard resources and verbs. Rules like `resources: ["*"]` and `verbs: ["*"]` are cluster-admin in disguise. Every Role and ClusterRole should specify exactly which resources and verbs are needed.
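
Wildcard rules are easy to find mechanically. Here is a small sketch, assuming Role or ClusterRole manifests have already been loaded into Python dictionaries (for example from `kubectl get roles -o json`); the helper name `wildcard_rules` and the sample role are hypothetical.

```python
def wildcard_rules(role: dict) -> list[dict]:
    """Return the rules of a Role/ClusterRole that use "*" in any field."""
    flagged = []
    for rule in role.get("rules", []):
        for field in ("apiGroups", "resources", "verbs"):
            if "*" in rule.get(field, []):
                flagged.append(rule)
                break  # one wildcard is enough to flag the rule
    return flagged


# Hypothetical role: the first rule is narrow, the second is cluster-admin in disguise.
sample_role = {
    "kind": "Role",
    "metadata": {"name": "ci-deployer", "namespace": "production-api"},
    "rules": [
        {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "patch"]},
        {"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]},
    ],
}

if __name__ == "__main__":
    for rule in wildcard_rules(sample_role):
        print("wildcard rule:", rule)
```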

Correct RBAC Configuration

Disable automatic token mounting on the service account, which applies to every Pod that uses it:


apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-server-sa
  namespace: production-api
automountServiceAccountToken: false

Or disable it per Pod:


spec:
  automountServiceAccountToken: false

Create narrow-scoped Roles for workloads that do need API access:


apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: production-api
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]  # "list" is not limited by resourceNames unless the request uses a metadata.name field selector
  resourceNames: ["app-config"]  # Restrict to specific resource names

For CI/CD systems, scope the service account to the minimum required for deployments:


apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: production-api
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "patch", "update"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]

Audit your cluster's RBAC configuration regularly. The `kubectl auth can-i` command lets you test permissions as specific service accounts:


kubectl auth can-i --list --as=system:serviceaccount:production-api:api-server-sa

Network Policies: Zero-Trust Pod Networking

By default, all Pods in a Kubernetes cluster can communicate with all other Pods across all namespaces. This is the definition of implicit trust — the opposite of zero-trust. Network Policies let you define explicit ingress and egress rules at the Pod level.

Network Policies require a CNI plugin that supports them. Calico and Cilium enforce them natively, and the AWS VPC CNI does so when its network policy feature is enabled. Flannel on its own does not.

Deny-All Base Policy

Start with a deny-all policy in every namespace and explicitly allow only required traffic:


apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production-api
spec:
  podSelector: {}  # Applies to all Pods in namespace
  policyTypes:
  - Ingress
  - Egress

Then explicitly allow required communication:


apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-ingress
  namespace: production-api
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-database
  namespace: production-api
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  - to:  # Allow DNS resolution
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP  # DNS falls back to TCP for large responses
      port: 53

This approach would have blocked the lateral movement in the financial services incident I described. Under default-deny egress, no rule permits traffic from the compromised service to the metadata service address (169.254.169.254 on AWS), so that connection attempt would have been dropped.
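
Default-deny only helps if every namespace actually has it. The following sketch assumes NetworkPolicy objects have been exported as Python dictionaries (for example from `kubectl get networkpolicies -A -o json`); the function names are hypothetical, and the check is deliberately strict in that it only recognizes the exact empty `podSelector: {}` form.

```python
def is_default_deny(policy: dict) -> bool:
    """True if the policy selects all Pods, covers both directions, and allows nothing."""
    spec = policy.get("spec", {})
    selects_all_pods = spec.get("podSelector") == {}
    covers_both = {"Ingress", "Egress"} <= set(spec.get("policyTypes", []))
    has_no_allow_rules = not spec.get("ingress") and not spec.get("egress")
    return selects_all_pods and covers_both and has_no_allow_rules


def namespaces_missing_default_deny(namespaces: list[str], policies: list[dict]) -> list[str]:
    """Return the namespaces that have no default-deny policy, sorted by name."""
    covered = {p["metadata"]["namespace"] for p in policies if is_default_deny(p)}
    return sorted(set(namespaces) - covered)


# Dictionary form of the default-deny-all manifest from this section.
default_deny = {
    "metadata": {"name": "default-deny-all", "namespace": "production-api"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
}

if __name__ == "__main__":
    # "staging-api" is an illustrative second namespace with no policy applied.
    print(namespaces_missing_default_deny(["production-api", "staging-api"], [default_deny]))
```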

OPA Gatekeeper: Policy as Code at Admission

Open Policy Agent (OPA) Gatekeeper is an admission controller that enforces custom policies on all Kubernetes API operations. Where RBAC controls who can do what, OPA Gatekeeper controls what configurations are valid regardless of who submitted them.

Installing Gatekeeper


kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.16.3/deploy/gatekeeper.yaml

Writing Constraint Templates

Constraint Templates define reusable policy logic in Rego (OPA's policy language). This example enforces that all container images must come from an approved registry:


apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          type: object
          properties:
            repos:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sallowedrepos

      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        satisfied := [good | repo = input.parameters.repos[_]; good = startswith(container.image, repo)]
        not any(satisfied)
        msg := sprintf("Container image <%v> is not from an approved registry. Approved registries: %v", [container.image, input.parameters.repos])
      }

Apply a Constraint to enforce the policy:


apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: require-approved-registries
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaces:
    - production-api
    - staging-api
  parameters:
    repos:
    - "registry.citadelcloud.com/"
    - "gcr.io/citadel-prod/"
    - "public.ecr.aws/citadel/"

Other high-value Gatekeeper policies to implement:

Require resource limits: Any Pod without CPU and memory limits is blocked
Block latest tags: Force image digest pinning instead of mutable tags
Require specific labels: Enforce labeling standards for cost allocation and incident response
Block privileged containers: Belt-and-suspenders on top of PSS
Require seccomp profiles: Ensure all workloads have a seccomp policy
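
The registry allow-list and the "block latest tags" policy both reduce to string checks on the image reference. As a rough illustration of that logic outside Rego, here is a Python sketch that a CI step could run before a manifest is ever submitted. The function name `image_violations` is hypothetical, the registry prefixes are the ones from the Constraint above, and the digest check assumes the `@sha256:<64 hex characters>` form.

```python
import re

APPROVED_REPOS = (
    "registry.citadelcloud.com/",
    "gcr.io/citadel-prod/",
    "public.ecr.aws/citadel/",
)

# An image reference is treated as pinned only if it ends with a full sha256 digest.
DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def image_violations(image: str, approved=APPROVED_REPOS) -> list[str]:
    """Return the policy problems found for a single image reference."""
    problems = []
    if not image.startswith(tuple(approved)):
        problems.append("not from an approved registry")
    if not DIGEST_SUFFIX.search(image):
        problems.append("not pinned to a sha256 digest")
    return problems


if __name__ == "__main__":
    fake_digest = "a" * 64  # placeholder digest for illustration only
    for ref in (
        f"registry.citadelcloud.com/api@sha256:{fake_digest}",
        "registry.citadelcloud.com/api:v2.4.1",
        "docker.io/library/nginx:latest",
    ):
        print(ref, "->", image_violations(ref))
```

The trailing slash on each prefix matters: without it, a lookalike host that merely begins with an approved registry name would pass the prefix check.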

The Cybersecurity Frameworks collection includes ready-to-apply Gatekeeper Constraint Templates for all of the above policies.

Falco: Runtime Threat Detection

Falco is a Cloud Native Computing Foundation (CNCF) project that monitors kernel system calls in real time and alerts on suspicious behavior. Where Gatekeeper prevents bad configurations at admission, Falco detects bad behavior at runtime.

What Falco Detects

Falco comes with a default rule set covering common attack patterns. Rules I have triggered in real environments:

A shell spawned inside a running container (`proc.name = bash` inside a production API container)
Sensitive file reads (`/etc/shadow`, `/root/.ssh/`, `/proc/*/mem`)
Network connections to unexpected external destinations
Container running as root when it should not be
A process writing to a directory it should only be reading

Falco Deployment

The recommended deployment method in production is as a DaemonSet using the eBPF driver, which avoids loading an out-of-tree kernel module on every node:


helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set driver.kind=ebpf \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl="https://hooks.slack.com/services/..." \
  --set falcosidekick.config.slack.minimumpriority=warning

Custom Falco Rules

Add custom rules tailored to your environment:


- rule: Cryptominer Process Started
  desc: Detects execution of known cryptominer processes
  condition: >
    spawned_process and
    (proc.name in (xmrig, minerd, cpuminer, ccminer, cgminer, bfgminer))
  output: >
    Cryptominer process started (user=%user.name container=%container.name
    image=%container.image.repository:%container.image.tag
    proc=%proc.name pid=%proc.pid)
  priority: CRITICAL
  tags: [cryptominer, container, process]

- rule: Unexpected Outbound Connection
  desc: Detects unexpected outbound connections from production containers
  condition: >
    outbound and
    not fd.sport in (443, 80, 5432, 6379, 9200) and
    container.image.repository startswith "registry.citadelcloud.com"
  output: >
    Unexpected outbound connection (user=%user.name container=%container.name
    connection=%fd.name)
  priority: WARNING
  tags: [network, container]

The cryptominer rule would have alerted within seconds of the rogue container in my healthcare incident — rather than requiring 11 days of anomaly detection based on CPU metrics.

Image Scanning: Security Starts in the Pipeline

Every container image you run is a potential attack vector. Outdated base images contain known CVEs. Dependencies pulled in by your application code may have vulnerabilities. Scanning needs to happen at every stage: build time, push time, and deployment time.

Trivy: The Open-Source Standard

Trivy is one of the most widely used open-source vulnerability scanners for container images. It scans OS packages, language-specific dependencies, and Infrastructure as Code configurations.

In a GitHub Actions pipeline:


- name: Scan container image
  uses: aquasecurity/trivy-action@master  # Pin to a release tag or commit SHA in real pipelines; master is mutable
  with:
    image-ref: 'registry.citadelcloud.com/api:${{ github.sha }}'
    format: 'sarif'
    output: 'trivy-results.sarif'
    severity: 'HIGH,CRITICAL'
    exit-code: '1'  # Fail the pipeline on HIGH or CRITICAL findings

Image Signing with Cosign

Trivy tells you what vulnerabilities are in an image. Cosign tells you that the image has not been tampered with between build and deployment. Cosign (part of the Sigstore project) lets you cryptographically sign and verify container images:


# Sign during CI
cosign sign --key cosign.key registry.citadelcloud.com/api:$GIT_SHA

# Verify during deployment (or in Gatekeeper)
cosign verify --key cosign.pub registry.citadelcloud.com/api:$GIT_SHA

A Gatekeeper policy can enforce that only signed images are deployed to production, combining image provenance verification with policy enforcement.

Secrets Management: Do Not Put Them in Kubernetes Secrets

The native `Secret` object (`apiVersion: v1`) stores values base64-encoded, not encrypted, unless encryption at rest has been configured. Anyone with access to the etcd datastore, or with RBAC permission to read Secrets, can recover them in plaintext. For production workloads in regulated environments, native Kubernetes Secrets alone are not sufficient for sensitive values.
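
To make the base64 point concrete, the short sketch below decodes the `data` field of a Secret the way any reader with `get` access could. The helper name `decode_secret_data` and the sample value are hypothetical; only the Python standard library is used.

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Decode the base64 values found under a Secret's `data` field."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


if __name__ == "__main__":
    # Simulates what `kubectl get secret ... -o json` would return under .data
    stored = {"db_password": base64.b64encode(b"example-password").decode("ascii")}
    print(stored)                      # looks opaque, but is only an encoding
    print(decode_secret_data(stored))  # recovers the plaintext with no key material
```

No key is involved at any point, which is why base64 provides no confidentiality.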

AWS Secrets Manager + External Secrets Operator: The External Secrets Operator syncs secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault into Kubernetes Secrets. Your applications consume standard Kubernetes Secrets, but the values are sourced from an encrypted, audited secrets store.


apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: production-api
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials
    creationPolicy: Owner
  data:
  - secretKey: db_password
    remoteRef:
      key: production/api/database
      property: password

etcd encryption at rest: Enable envelope encryption for etcd using a KMS provider (AWS KMS, GCP Cloud KMS). This encrypts the resource types listed in the API server's EncryptionConfiguration, typically Secrets and optionally ConfigMaps and other sensitive API objects, at the storage layer.

Audit Logging and Compliance

Every Kubernetes API request should be logged. Audit logs are your forensic record for incident response and compliance evidence for FedRAMP, HIPAA, and PCI-DSS. Configure audit policy to capture security-relevant events without creating an unmanageable data volume:


apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: None
  resources:
  - group: ""
    resources: ["events"]
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
- level: RequestResponse
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "rbac.authorization.k8s.io"
    resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]
- level: Request
  omitStages:
  - RequestReceived

Ship audit logs to a SIEM (Splunk, Elastic, Datadog) or a cloud-native logging solution (AWS CloudWatch Logs, GCP Cloud Logging) with retention policies that meet your compliance requirements. At Lockheed Martin, audit logs shipped to Splunk and triggered automated alerts on specific API operations (role binding creation, secret access from unexpected service accounts).
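
Rule order in the audit policy matters, because the API server applies the first rule that matches a request. The sketch below models that first-match behavior for the policy above. It is a simplification under stated assumptions: it matches only on verb, API group, and resource, ignores users, namespaces, and subresources, and treats a request that matches no rule as not logged. The function names are hypothetical.

```python
AUDIT_RULES = [
    {"level": "None", "resources": [{"group": "", "resources": ["events"]}]},
    {"level": "Metadata", "resources": [{"group": "", "resources": ["secrets", "configmaps"]}]},
    {
        "level": "RequestResponse",
        "verbs": ["create", "update", "patch", "delete"],
        "resources": [{
            "group": "rbac.authorization.k8s.io",
            "resources": ["clusterroles", "clusterrolebindings", "roles", "rolebindings"],
        }],
    },
    {"level": "Request"},  # catch-all: no verbs or resources means it matches everything
]


def rule_matches(rule: dict, verb: str, group: str, resource: str) -> bool:
    """Check one rule against a request (simplified matching)."""
    if "verbs" in rule and verb not in rule["verbs"]:
        return False
    if "resources" in rule:
        return any(
            group == entry["group"] and resource in entry["resources"]
            for entry in rule["resources"]
        )
    return True


def audit_level(verb: str, group: str, resource: str, rules=AUDIT_RULES) -> str:
    """Return the level of the first matching rule."""
    for rule in rules:
        if rule_matches(rule, verb, group, resource):
            return rule["level"]
    return "None"  # assumption: unmatched requests are not logged


if __name__ == "__main__":
    print(audit_level("get", "", "secrets"))
    print(audit_level("create", "rbac.authorization.k8s.io", "rolebindings"))
    print(audit_level("get", "rbac.authorization.k8s.io", "roles"))
```

Under this model, a read of a Role falls through to the final catch-all rule, since the RBAC rule lists only write verbs. Moving the catch-all to the top would shadow every rule beneath it.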

The DevOps Pipelines collection at Citadel Cloud includes logging pipeline configurations for Kubernetes audit data integrated with Splunk and Elastic.

The Security Baseline Checklist

Every cluster I build or audit gets measured against this baseline before going to production:

[ ] Control plane access restricted to authorized networks (no `0.0.0.0/0` on the API server)
[ ] etcd encrypted at rest with KMS
[ ] Node OS hardened (CIS Benchmark Level 2 for the node OS)
[ ] All namespaces have PSS enforce labels (baseline minimum, restricted for production workloads)
[ ] RBAC audited — no cluster-admin service accounts, no wildcard permissions
[ ] Network Policies deployed — default deny-all with explicit allow rules
[ ] OPA Gatekeeper installed with policies for: approved registries, required resource limits, no latest tags
[ ] Falco deployed with custom rules for the environment
[ ] Image scanning in CI pipeline, blocking HIGH/CRITICAL CVEs
[ ] Image signing with Cosign
[ ] Secrets sourced from external secrets manager (not native Kubernetes Secrets)
[ ] Audit logging enabled and shipping to SIEM with 90-day retention
[ ] Node-level access restricted (no direct SSH in production — use Session Manager or kubectl exec with audit logging)

FAQ

How do I handle legacy applications that cannot run as non-root?

This is a real constraint. The practical approach is to start by applying the `baseline` PSS profile (which blocks privilege escalation and dangerous capabilities but does not require non-root) and target `restricted` as a migration goal. Document every exception with a risk acceptance and a timeline for remediation. At Cigna Healthcare, we had a 12-month roadmap to move all workloads from `privileged` namespace policies to `baseline`, with a subsequent 18-month roadmap to reach `restricted` for all customer-facing services. Progress matters more than perfection on day one.

Is Kubernetes security different between EKS, AKS, and GKE?

The core Kubernetes security primitives (PSS, RBAC, Network Policies, Audit Logging) work the same across all three. The differences are in the managed control plane configuration (EKS does not expose etcd directly, GKE Autopilot enforces more security constraints by default), the IAM integration (EKS uses IAM Roles for Service Accounts, AKS uses Workload Identity with Entra ID, GKE uses Workload Identity with Google IAM), and the default security posture. GKE Autopilot has the most opinionated security defaults. EKS Standard gives you the most control. AKS sits between them.

How often should I audit my cluster's security posture?

At minimum: run kube-bench (the CIS Kubernetes Benchmark tool) after every major cluster upgrade and quarterly otherwise. Run Trivy against your running workloads weekly for new CVE coverage. Review RBAC bindings whenever a team member leaves or changes roles. Run a full penetration test annually for production clusters in regulated environments. Automated tooling handles the continuous checks; manual review is for the edge cases and the policy questions that automation cannot answer.

What is the difference between Falco and a WAF for Kubernetes?

A WAF (Web Application Firewall) sits in the network path in front of your application and inspects HTTP/S traffic at the application layer for known attack signatures. Falco operates at the system call layer on the node and detects malicious behavior that has already bypassed the network perimeter. They are complementary, not alternatives. A WAF protects against external attacks on your application endpoints. Falco detects what happens after a successful compromise: a shell being spawned, a sensitive file being read, an unexpected process being launched. For production clusters handling regulated data, both layers are required.

Does Kubernetes security work differently at scale with hundreds of clusters?

Yes, significantly. Managing RBAC, Network Policies, and OPA Gatekeeper constraints across 50+ clusters requires centralized policy management. GitOps tools like Flux CD or Argo CD become essential for rolling out policies from a single versioned repository, along with a consistent Gatekeeper `Config` resource in each cluster for namespace exemptions and data sync. The Cybersecurity Frameworks collection at Citadel Cloud includes multi-cluster security architecture patterns for enterprise environments managing Kubernetes at scale.

*Kenny Ogunlowo is a Senior Multi-Cloud DevSecOps Architect with production Kubernetes experience in FedRAMP, HIPAA, and CMMC-regulated environments at Cigna Healthcare, Lockheed Martin, NantHealth, BP Refinery, and Patterson UTI. He holds AWS, Azure, and GCP professional certifications.*
