Federation on Amazon EKS
This guide is the opinionated AWS happy path for running your own AGNTCY Directory instance and federating it with the public Directory network.
It makes the following choices for you:
- Amazon Web Services
- Amazon EKS
- Helm-first deployment
- https_web SPIRE federation
- NGINX ingress
- Route53 with ExternalDNS
- cert-manager with Let's Encrypt
- EBS-backed persistence
- Zot as the OCI registry
- SPIRE as the workload identity system
If you want the reference material behind this guide, see the following guides:
- Production Deployment
- Running a Federated Directory Instance
- Federation Bundle Profiles
- Federation Best Practices and Troubleshooting
This document is intentionally narrow. It does not try to cover GitOps, Azure, GCP, on-premises deployment, or the https_spiffe profile. The goal is to give first-time operators one concrete path that they can follow end to end.
What You Will Build
flowchart TD
internet[Internet]
route53[Route53]
nginx[NGINXIngress]
nlb[AWSNLB]
api[DirectoryAPI]
reconciler[Reconciler]
zot[ZotRegistry]
pg[PostgreSQL]
spireIngress[SPIREFederationIngress]
spire[SPIREServer]
oidc[SPIREOIDCDiscovery]
ebs[EBSVolumes]
prod[PublicDirectoryProd]
internet --> route53
route53 --> nginx
route53 --> nlb
nginx --> api
nginx --> zot
nginx --> spireIngress
nginx --> oidc
nlb -->|P2P routing| api
spireIngress --> spire
api --> pg
api --> ebs
reconciler --> pg
reconciler --> zot
zot --> ebs
pg --> ebs
api --> spire
reconciler --> spire
spire --> prod
Decisions Before You Start
Choose these before you install anything:
- Your SPIRE trust domain. This is permanent for the deployment.
- Your public DNS names.
- Your federation profile. This guide always uses https_web.
To keep the mental model simple, this guide uses the same base domain for both DNS and the trust domain:
- Trust domain: partner.example.com
- Directory API: api.partner.example.com
- Zot: zot.partner.example.com
- P2P routing: routing.partner.example.com
- SPIRE federation endpoint: spire.partner.example.com
- SPIRE OIDC discovery: oidc-discovery.spire.partner.example.com
Note
Do not split these in your first deployment. You can change the DNS names later, but the trust domain is permanent for the deployment.
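Because the trust domain cannot be changed after the deployment exists, it can be worth checking its shape before you export it. The following is a minimal sketch, not part of any AGNTCY or SPIRE tooling: `is_valid_trust_domain` is a hypothetical helper name, and the check only covers the character set and length limit that the SPIFFE ID specification defines for trust domains (lowercase letters, digits, dots, dashes, underscores, at most 255 characters). It does not check DNS ownership.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption: not shipped with Directory or SPIRE).
# Returns 0 when the argument is non-empty, at most 255 characters long, and
# made up only of lowercase letters, digits, dots, dashes, and underscores.
is_valid_trust_domain() {
  local td="$1"
  [ -n "$td" ] || return 1
  [ "${#td}" -le 255 ] || return 1
  case "$td" in
    # The allowed characters are listed explicitly so the match does not
    # depend on locale-specific range behavior.
    *[!abcdefghijklmnopqrstuvwxyz0123456789._-]*) return 1 ;;
  esac
  return 0
}

# Example: validate the candidate value before exporting TRUST_DOMAIN.
candidate="partner.example.com"
if is_valid_trust_domain "$candidate"; then
  echo "ok: $candidate"
else
  echo "invalid trust domain: $candidate" >&2
fi
```

An uppercase value such as Partner.Example.com is rejected by this check, which is the kind of mistake that is cheap to catch now and expensive to fix after installation.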
Version Pins Used in This Guide
These pins match the current staging deployment references at the time this guide was written:
| Component | Source | Version |
|---|---|---|
| Directory Helm chart | oci://ghcr.io/agntcy/dir/helm-charts/dir | v1.2.0 |
| Directory apiserver image | ghcr.io/agntcy/dir-apiserver | v1.2.0 |
| Directory reconciler image | ghcr.io/agntcy/dir-reconciler | v1.2.0 |
| SPIRE Helm chart | spiffe/spire | 0.28.3 |
Note
Older Directory docs still show v1.0.0 examples. Prefer the versions above for a fresh deployment.
Before You Start
This guide assumes you already have:
- An EKS cluster running and reachable with kubectl
- An ingress-nginx controller installed in the cluster
- --enable-ssl-passthrough=true enabled on that ingress controller
- cert-manager installed with a production ClusterIssuer such as letsencrypt-prod
- ExternalDNS configured to manage records in your Route53 hosted zone
- An EBS-backed StorageClass named ebs-sc-encrypted
- kubectl, helm, openssl, jq, and htpasswd installed locally
This guide does not try to provision the AWS infrastructure from zero in the main flow. For the AWS-specific provisioning considerations, see the Appendix.
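The local tool requirement above can be checked with a few lines of shell. This is a small sketch under the assumption that being on PATH is a good enough test; `missing_tools` is a hypothetical helper name, and it does not check tool versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the name of every listed command that is not
# found on PATH, one per line. Prints nothing when all commands are present.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the local tools that this guide relies on.
missing="$(missing_tools kubectl helm openssl jq htpasswd)"
if [ -n "$missing" ]; then
  echo "Install these tools before continuing:" >&2
  printf '%s\n' "$missing" >&2
fi
```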
Setting Up Federation Using AWS EKS
-
Export the Variables Used by the Rest of the Guide
    export TRUST_DOMAIN="partner.example.com"
    export BASE_DOMAIN="partner.example.com"
    export DIR_API_HOST="api.${BASE_DOMAIN}"
    export DIR_ZOT_HOST="zot.${BASE_DOMAIN}"
    export DIR_ROUTING_HOST="routing.${BASE_DOMAIN}"
    export SPIRE_FEDERATION_HOST="spire.${BASE_DOMAIN}"
    export SPIRE_OIDC_HOST="oidc-discovery.spire.${BASE_DOMAIN}"
    export DIR_NAMESPACE="dir"
    export SPIRE_NAMESPACE="spire"
    export CERT_ISSUER="letsencrypt-prod"
    export STORAGE_CLASS="ebs-sc-encrypted"
-
Verify the Cluster Add-Ons Before You Deploy Anything
Check that the required controllers and storage class already exist:
    kubectl get ns
    # grep -E replaces egrep, which recent GNU grep releases mark as obsolescent.
    kubectl get deployment -A | grep -E 'ingress-nginx|cert-manager|external-dns'
    kubectl get storageclass
    kubectl get clusterissuer
Confirm that the ingress-nginx controller has SSL passthrough enabled:
    kubectl get deployment -n ingress-nginx ingress-nginx-controller -o yaml | \
      grep enable-ssl-passthrough
Expected result:
- ingress-nginx-controller is present
- cert-manager is present
- external-dns is present
- your ClusterIssuer exists
- ebs-sc-encrypted exists
- the ingress controller arguments include --enable-ssl-passthrough=true
If any of these are missing, stop here and resolve them first. This guide assumes the platform layer is already working.
-
Generate the Credentials Used by Directory and Zot
Use generated values instead of hardcoding static credentials in your shell history or values files:
    export DIR_OCI_ADMIN_PASSWORD="$(openssl rand -base64 24)"
    export DIR_SYNC_PASSWORD="$(openssl rand -base64 24)"
    export DIR_DB_PASSWORD="$(openssl rand -base64 24)"
    htpasswd -nbB admin "${DIR_OCI_ADMIN_PASSWORD}" > zot.htpasswd
    htpasswd -nbB user "${DIR_SYNC_PASSWORD}" >> zot.htpasswd
    openssl genpkey -algorithm Ed25519 -out node.privkey
Keep node.privkey on disk. It will be injected into the Helm release via --set-file later.
-
Install SPIRE for https_web Federation
Add the hardened SPIRE chart repository and install the CRDs:
    helm repo add spiffe https://spiffe.github.io/helm-charts-hardened
    helm repo update
    helm upgrade --install spire-crds spire-crds \
      --repo https://spiffe.github.io/helm-charts-hardened \
      --namespace spire-crds \
      --create-namespace
Create spire-values.yaml:
    cat <<EOF > spire-values.yaml
    global:
      spire:
        trustDomain: ${TRUST_DOMAIN}
        clusterName: eks
        namespaces:
          create: false
        ingressControllerType: other
      installAndUpgradeHooks:
        enabled: false
      deleteHooks:
        enabled: false
    spire-server:
      federation:
        enabled: true
        tls:
          spire:
            enabled: false
          certManager:
            enabled: true
            issuer:
              create: false
            certificate:
              issuerRef:
                kind: ClusterIssuer
                name: ${CERT_ISSUER}
        ingress:
          enabled: true
          className: nginx
          controllerType: other
          host: ${SPIRE_FEDERATION_HOST}
          tlsSecret: spire-federation-cert
          annotations:
            cert-manager.io/cluster-issuer: ${CERT_ISSUER}
            external-dns.alpha.kubernetes.io/hostname: ${SPIRE_FEDERATION_HOST}
            nginx.ingress.kubernetes.io/ssl-passthrough: "false"
            nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
            nginx.ingress.kubernetes.io/proxy-ssl-server-name: "on"
            nginx.ingress.kubernetes.io/proxy-ssl-name: "${SPIRE_FEDERATION_HOST}"
            # SPIRE serves its own self-signed cert on the backend; the ingress
            # controller cannot validate it, so upstream verification is disabled.
            nginx.ingress.kubernetes.io/proxy-ssl-verify: "off"
      controllerManager:
        watchClassless: true
        className: dir-spire
        identities:
          clusterFederatedTrustDomain:
            enabled: true
          clusterSPIFFEIDs:
            default:
              federatesWith:
                - prod.ads.outshift.io
    spiffe-oidc-discovery-provider:
      ingress:
        enabled: true
        className: nginx
        host: ${SPIRE_OIDC_HOST}
        annotations:
          cert-manager.io/cluster-issuer: ${CERT_ISSUER}
          external-dns.alpha.kubernetes.io/hostname: ${SPIRE_OIDC_HOST}
      config:
        domains:
          - ${SPIRE_OIDC_HOST}
    EOF
Install SPIRE:
    helm upgrade --install spire spiffe/spire \
      --version 0.28.3 \
      --namespace "${SPIRE_NAMESPACE}" \
      --create-namespace \
      -f spire-values.yaml
Wait for SPIRE to become healthy:
    kubectl wait --for=condition=ready pod -n "${SPIRE_NAMESPACE}" \
      -l app.kubernetes.io/name=server --timeout=300s
    kubectl wait --for=condition=ready pod -n "${SPIRE_NAMESPACE}" \
      -l app.kubernetes.io/name=agent --timeout=300s
-
Verify the SPIRE Public Endpoints
Check that the ingress objects and public certificates exist:
    kubectl get ingress -n "${SPIRE_NAMESPACE}"
    kubectl get certificate -n "${SPIRE_NAMESPACE}"
Verify the federation endpoint:
    curl -I "https://${SPIRE_FEDERATION_HOST}"
Verify the OIDC discovery document:
    curl "https://${SPIRE_OIDC_HOST}/.well-known/openid-configuration" | jq .
Confirm that Route53 and ExternalDNS have published the hostnames:
    dig +short "${SPIRE_FEDERATION_HOST}"
    dig +short "${SPIRE_OIDC_HOST}"
At this point, your SPIRE bundle endpoint should already be externally reachable over standard HTTPS.
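DNS publication and certificate issuance are asynchronous, so the checks in this step can fail for a few minutes after the install even when nothing is wrong. A small retry wrapper avoids rerunning them by hand. This is a sketch only: `retry` and `resolves` are hypothetical helper names, and the attempt counts and delays in the usage comments are arbitrary starting points rather than recommended values.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or until it has failed ATTEMPTS times.
# Returns 0 on the first success and 1 once the attempts are used up.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while true; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Succeeds only when dig prints at least one record for the given host.
resolves() {
  [ -n "$(dig +short "$1")" ]
}

# Usage (commented out because it needs live DNS and a running cluster):
# retry 30 10 resolves "${SPIRE_FEDERATION_HOST}"
# retry 30 10 curl -fsS -o /dev/null \
#   "https://${SPIRE_OIDC_HOST}/.well-known/openid-configuration"
```

The same wrapper works for the Directory hostnames later in the guide, since any command that signals failure through its exit status can be passed to it.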
-
Create the Directory Values File
Create dir-values.yaml:
    cat <<EOF > dir-values.yaml
    apiserver:
      image:
        repository: ghcr.io/agntcy/dir-apiserver
        tag: v1.2.0
        pullPolicy: IfNotPresent
      spire:
        enabled: true
        className: dir-spire
        trustDomain: ${TRUST_DOMAIN}
        useCSIDriver: true
        dnsNameTemplates:
          - ${DIR_API_HOST}
        federation:
          - className: dir-spire
            trustDomain: prod.ads.outshift.io
            bundleEndpointURL: https://prod.spire.ads.outshift.io
            bundleEndpointProfile:
              type: https_web
      routingService:
        type: LoadBalancer
        cloudProvider: "aws"
        aws:
          internal: false
          nlbTargetType: "instance"
        externalTrafficPolicy: Local
        annotations:
          external-dns.alpha.kubernetes.io/hostname: "routing.${BASE_DOMAIN}"
      config:
        listen_address: "0.0.0.0:8888"
        oasf_api_validation:
          disable: true
        authn:
          enabled: true
          mode: "x509"
          socket_path: "unix:///run/spire/agent-sockets/api.sock"
          audiences:
            - "spiffe://${TRUST_DOMAIN}/spire/server"
        authz:
          enabled: true
          enforcer_policy_file_path: "/etc/agntcy/dir/authz_policies.csv"
        store:
          provider: "oci"
          oci:
            # Use the external address, not the internal .svc.cluster.local name.
            # The apiserver shares this address with remote peers via the
            # RequestRegistryCredentials RPC so they can pull records during sync.
            registry_address: "${DIR_ZOT_HOST}"
            auth_config:
              insecure: "false"
              username: "admin"
              password: "${DIR_OCI_ADMIN_PASSWORD}"
        routing:
          listen_address: "/ip4/0.0.0.0/tcp/5555"
          key_path: /etc/routing/node.privkey
          datastore_dir: /etc/routing/datastore
          directory_api_address: "${DIR_API_HOST}:443"
          gossipsub:
            enabled: true
        sync:
          auth_config:
            username: "user"
            password: "${DIR_SYNC_PASSWORD}"
        publication:
          scheduler_interval: "1h"
          worker_count: 1
          worker_timeout: "30m"
        database:
          type: "postgres"
          postgres:
            host: ""
            port: 5432
            database: "dir"
            # PostgreSQL runs as a subchart in the same namespace, so traffic
            # stays within the pod network. Use "require" or "verify-full" if
            # your PostgreSQL instance is external or crosses a network boundary.
            ssl_mode: "disable"
      authz_policies_csv: |
        p,${TRUST_DOMAIN},*
        p,*,/agntcy.dir.store.v1.StoreService/Pull
        p,*,/agntcy.dir.store.v1.StoreService/PullReferrer
        p,*,/agntcy.dir.store.v1.StoreService/Lookup
        p,*,/agntcy.dir.store.v1.SyncService/RequestRegistryCredentials
      pvc:
        create: true
        storageClassName: ${STORAGE_CLASS}
        size: 20Gi
      strategy:
        type: Recreate
      ingress:
        enabled: true
        className: nginx
        annotations:
          nginx.ingress.kubernetes.io/ssl-passthrough: "true"
          nginx.ingress.kubernetes.io/backend-protocol: "GRPCS"
          external-dns.alpha.kubernetes.io/hostname: ${DIR_API_HOST}
        hosts:
          - host: ${DIR_API_HOST}
            paths:
              - path: /
                pathType: ImplementationSpecific
        tls:
          - hosts:
              - ${DIR_API_HOST}
      postgresql:
        enabled: true
        auth:
          username: "dir"
          password: "${DIR_DB_PASSWORD}"
          database: "dir"
        primary:
          persistence:
            enabled: true
            storageClass: ${STORAGE_CLASS}
            size: 20Gi
      reconciler:
        enabled: true
        image:
          repository: ghcr.io/agntcy/dir-reconciler
          tag: v1.2.0
          pullPolicy: IfNotPresent
        config:
          database:
            type: "postgres"
            postgres:
              host: ""
              port: 5432
              database: "dir"
              ssl_mode: "disable"
          local_registry:
            registry_address: "${DIR_ZOT_HOST}"
            repository_name: ""
            auth_config:
              insecure: false
          regsync:
            enabled: true
            interval: "1m"
            timeout: "30m"
            authn:
              enabled: true
              mode: "x509"
              socket_path: "unix:///run/spire/agent-sockets/api.sock"
          indexer:
            enabled: true
            interval: "30m"
      secrets:
        # privKey is injected via --set-file in the helm install command.
        # PEM data is multiline and cannot be safely embedded in a YAML heredoc.
        syncAuth:
          username: "user"
          password: "${DIR_SYNC_PASSWORD}"
        ociAuth:
          username: "admin"
          password: "${DIR_OCI_ADMIN_PASSWORD}"
        postgresAuth:
          username: "dir"
          password: "${DIR_DB_PASSWORD}"
      zot:
        persistence: true
        pvc:
          create: true
          accessModes: ["ReadWriteOnce"]
          storage: 100Gi
          storageClassName: ${STORAGE_CLASS}
        mountSecret: true
        authHeader: "admin:${DIR_OCI_ADMIN_PASSWORD}"
        secretFiles:
          # sed supplies the indentation for every htpasswd line, so the
          # command substitution below starts at the left margin of the heredoc.
          htpasswd: |-
    $(sed 's/^/        /' zot.htpasswd)
        mountConfig: true
        ingress:
          enabled: true
          className: nginx
          annotations:
            cert-manager.io/cluster-issuer: ${CERT_ISSUER}
            external-dns.alpha.kubernetes.io/hostname: ${DIR_ZOT_HOST}
          hosts:
            - host: ${DIR_ZOT_HOST}
              paths:
                - path: /
                  pathType: ImplementationSpecific
          tls:
            - secretName: zot-public-tls
              hosts:
                - ${DIR_ZOT_HOST}
    EOF
Note
The routing datastore, PostgreSQL, and Zot PVCs are all pinned to ebs-sc-encrypted. If your cluster uses a different default StorageClass, update the storageClassName values for each component before you install Directory.
-
Install Directory
    helm upgrade --install dir oci://ghcr.io/agntcy/dir/helm-charts/dir \
      --version v1.2.0 \
      --namespace "${DIR_NAMESPACE}" \
      --create-namespace \
      -f dir-values.yaml \
      --set-file apiserver.secrets.privKey=node.privkey
--set-file reads the PEM file and passes its exact bytes to the chart, which base64-encodes them into the Kubernetes Secret. This avoids the multiline-in-YAML corruption that would happen if the key were embedded directly in the values file.
Wait for the main workloads:
    kubectl wait --for=condition=ready pod -n "${DIR_NAMESPACE}" \
      -l app.kubernetes.io/name=apiserver --timeout=300s
    kubectl get pods -n "${DIR_NAMESPACE}"
    kubectl get svc -n "${DIR_NAMESPACE}"
    kubectl get ingress -n "${DIR_NAMESPACE}"
    kubectl get pvc -n "${DIR_NAMESPACE}"
-
Verify the Directory Endpoints and Certificates
Check that the public API hostname presents the SPIFFE-issued certificate rather than the ingress default certificate:
    echo | openssl s_client -connect "${DIR_API_HOST}:443" \
      -servername "${DIR_API_HOST}" 2>/dev/null | \
      openssl x509 -noout -subject
If SSL passthrough is working, the certificate subject should come from SPIRE and not from the ingress controller.
Confirm that the API server successfully obtained an X.509-SVID:
    kubectl logs -n "${DIR_NAMESPACE}" -l app.kubernetes.io/name=apiserver | \
      grep "Successfully obtained valid X509-SVID"
Verify that Zot is reachable over HTTPS:
    curl -u "admin:${DIR_OCI_ADMIN_PASSWORD}" "https://${DIR_ZOT_HOST}/v2/_catalog"
Verify the DNS records:
    dig +short "${DIR_API_HOST}"
    dig +short "${DIR_ZOT_HOST}"
Verify the persistent volumes:
    kubectl get pvc -n "${DIR_NAMESPACE}" -o wide
The PVCs should be Bound and backed by EBS.
-
Confirm Federation with the Public Directory
The first proof point is that your SPIRE server can fetch the public Directory trust bundle:
    kubectl exec -n "${SPIRE_NAMESPACE}" spire-server-0 -c spire-server -- \
      spire-server bundle list -id spiffe://prod.ads.outshift.io -format spiffe
If the bundle is missing:
- check that your SPIRE federation endpoint is externally reachable
- check that https://prod.spire.ads.outshift.io is reachable from the cluster
- check cert-manager and DNS for ${SPIRE_FEDERATION_HOST}
-
Onboard Your Trust Domain into dir-staging
Your cluster trusting prod is only half of the setup. The public production Directory must also learn how to trust your SPIRE domain.
Create a file named onboarding/federation/${TRUST_DOMAIN}.yaml in your dir-staging fork with this content:
    className: dir-spire
    trustDomain: partner.example.com
    bundleEndpointURL: https://spire.partner.example.com
    bundleEndpointProfile:
      type: https_web
Then:
- Open a pull request against agntcy/dir-staging.
- Wait for the maintainers to merge it and roll it out.
- Make sure the public side also adds the authorization policy for your trust domain.
Until that pull request is merged and applied, prod will not accept requests authenticated with your trust domain.
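The onboarding file can be generated from the variables exported at the start of the walkthrough instead of being typed by hand, which keeps the trust domain and bundle endpoint consistent with the rest of the deployment. This is a minimal sketch: `render_onboarding` is a hypothetical helper name, and the field layout simply mirrors the onboarding file content shown in this step.

```shell
#!/usr/bin/env bash
# render_onboarding TRUST_DOMAIN FEDERATION_HOST
# Hypothetical helper: prints the onboarding entry for the given trust domain
# and SPIRE federation host, using the https_web profile this guide assumes.
render_onboarding() {
  cat <<EOF
className: dir-spire
trustDomain: $1
bundleEndpointURL: https://$2
bundleEndpointProfile:
  type: https_web
EOF
}

# Usage, run from the root of your dir-staging fork (commented out because
# it writes into the working tree):
# mkdir -p onboarding/federation
# render_onboarding "${TRUST_DOMAIN}" "${SPIRE_FEDERATION_HOST}" \
#   > "onboarding/federation/${TRUST_DOMAIN}.yaml"
```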
-
Validate from a SPIRE-Enabled Client
The easiest client validation is to run dirctl from an environment that already has a SPIRE agent socket for your trust domain.
Set the client environment:
    export DIRECTORY_CLIENT_SERVER_ADDRESS="${DIR_API_HOST}:443"
    export DIRECTORY_CLIENT_SPIFFE_SOCKET_PATH="/tmp/spire-agent/public.sock"
Then run a basic connectivity check against your own Directory:
    dirctl info bafytest123
    # Expected: Error: record not found
Once the dir-staging onboarding pull request is merged, validate access to the public Directory as well:
    dirctl pull bafytest123 \
      --server-addr prod.api.ads.outshift.io \
      --spiffe-socket-path "${DIRECTORY_CLIENT_SPIFFE_SOCKET_PATH}"
    # Expected: Error: record not found
If you want a fuller post-deployment smoke test against your own Directory, use the normal Directory CLI workflows from Directory CLI Guide:
- dirctl push record.json
- dirctl info <cid>
- dirctl search --name <name>
- dirctl sync create https://prod.api.ads.outshift.io:443
Troubleshooting
If you get stuck, check these first:
- certificate is valid for ingress.local: SSL passthrough is not working, or the API ingress is configured with a terminating TLS secret.
- certificate signed by unknown authority on the federation endpoint: cert-manager or the ClusterIssuer is misconfigured.
- missing prod bundle in SPIRE: your cluster cannot reach https://prod.spire.ads.outshift.io, or your SPIRE federation controller settings are wrong.
- Pending PVCs: your EBS CSI setup or StorageClass defaulting is incomplete.
- prod rejects your trust domain after local federation works: your dir-staging onboarding pull request has not been merged or rolled out yet.
For in-depth troubleshooting, see Federation Best Practices and Troubleshooting.
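For the first symptom in the list, one way to tell a SPIRE-issued certificate from an ingress default certificate is to look for a SPIFFE URI in the subject alternative names, since an X.509-SVID carries its SPIFFE ID there. The sketch below assumes OpenSSL 1.1.1 or newer for the -ext option; `has_spiffe_uri_san` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads certificate text on stdin and succeeds when it
# contains a URI subject alternative name that starts with spiffe://.
has_spiffe_uri_san() {
  grep -q 'URI:spiffe://'
}

# Usage (commented out because it needs the live API endpoint):
# if echo | openssl s_client -connect "${DIR_API_HOST}:443" \
#        -servername "${DIR_API_HOST}" 2>/dev/null \
#      | openssl x509 -noout -ext subjectAltName \
#      | has_spiffe_uri_san; then
#   echo "SPIRE-issued certificate: passthrough is working"
# else
#   echo "no SPIFFE URI SAN: the ingress is probably terminating TLS"
# fi
```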
Appendix: Best-Effort AWS Provisioning Notes
Note
This appendix is intentionally marked as best effort. It reflects the AWS shape assumed by the rest of the guide, but it was not validated end to end in the environment used to write this document.
If you do not already have the platform prerequisites, these are the usual AWS building blocks you need before the main walkthrough starts:
EKS Cluster
- Create an EKS cluster in subnets that allow the ingress controller and worker nodes to reach the public internet.
- Use managed node groups unless your platform team already standardizes on Karpenter or a custom node model.
- Make sure the cluster can provision EBS volumes through the EBS CSI driver.
Route53 and DNS
- Create or reuse a hosted zone for the base domain.
- Make sure ExternalDNS can write records into that zone.
- Reserve the five public names used in this guide:
  - api.<domain>
  - zot.<domain>
  - routing.<domain>
  - spire.<domain>
  - oidc-discovery.spire.<domain>
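Before handing the names to ExternalDNS, it can help to see whether any of them already resolve to something else. The following sketch derives the five names from a base domain; `directory_hostnames` is a hypothetical helper name, and the prefixes are the ones this guide uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the five public hostnames this guide uses for
# the given base domain, one per line.
directory_hostnames() {
  local domain="$1" prefix
  for prefix in api zot routing spire oidc-discovery.spire; do
    printf '%s.%s\n' "$prefix" "$domain"
  done
}

# Usage (commented out because it needs live DNS): print each name next to
# the first record it currently resolves to, if any.
# for host in $(directory_hostnames partner.example.com); do
#   printf '%-45s %s\n' "$host" "$(dig +short "$host" | head -n 1)"
# done
```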
IAM and Workload Identity
- ExternalDNS usually needs an IAM role that can change Route53 records.
- cert-manager may need additional AWS permissions if you use Route53-based DNS challenges instead of an HTTP challenge flow.
- If your platform uses IAM Roles for Service Accounts (IRSA) or the newer EKS Pod Identity, create those bindings before you install the controllers. AWS recommends EKS Pod Identity for new EC2-based clusters; IRSA is still required for Fargate workloads.
Ingress and Load Balancers
- The ingress-nginx controller must be exposed through an AWS load balancer that is reachable from the public internet.
- The Directory API path depends on SSL passthrough, so verify that the ingress controller keeps that capability when you customize the Service annotations.
- The public production deployment model uses an AWS Network Load Balancer for TCP passthrough. If your platform defaults to a different load balancer behavior, validate it carefully before exposing the API hostname.
Security Groups and Networking
- Worker nodes and the ingress load balancer must allow inbound HTTPS from the internet for the public hostnames.
- Egress must allow the cluster to reach:
  - Let's Encrypt
  - Route53 APIs, if used by your controller setup
  - https://prod.spire.ads.outshift.io
- If your company routes outbound traffic through a NAT or firewall, confirm that cert-manager and SPIRE can still complete their external calls.
Storage
- Make ebs-sc-encrypted the default StorageClass if you want the subcharts to inherit EBS automatically.
- Verify after install that PostgreSQL, Zot, and the routing datastore PVC all bind to the expected StorageClass.
A Good First Cut
If your platform team asks what they need to hand you before you can follow the main guide, ask for this:
- A working EKS cluster
- Ingress-nginx with SSL passthrough
- cert-manager with a production ClusterIssuer
- ExternalDNS wired to Route53
- An encrypted EBS StorageClass named ebs-sc-encrypted
- Public DNS delegation for your chosen domain