Federation on Amazon EKS

This guide is the opinionated AWS happy path for running your own AGNTCY Directory instance and federating it with the public Directory network.

It makes the following choices for you:

Amazon Web Services
Amazon EKS
Helm-first deployment
https_web SPIRE federation
NGINX ingress
Route53 with ExternalDNS
cert-manager with Let's Encrypt
EBS-backed persistence
Zot as the OCI registry
SPIRE as the workload identity system

If you want the reference material behind this guide, see the following guides:

This document is intentionally narrow. It does not try to cover GitOps, Azure, GCP, on-premises deployment, or the https_spiffe profile. The goal is to give first-time operators one concrete path that they can follow end to end.

What You Will Build

flowchart TD
    internet[Internet]
    route53[Route53]
    nginx[NGINXIngress]
    nlb[AWSNLB]
    api[DirectoryAPI]
    reconciler[Reconciler]
    zot[ZotRegistry]
    pg[PostgreSQL]
    spireIngress[SPIREFederationIngress]
    spire[SPIREServer]
    oidc[SPIREOIDCDiscovery]
    ebs[EBSVolumes]
    prod[PublicDirectoryProd]

    internet --> route53
    route53 --> nginx
    route53 --> nlb
    nginx --> api
    nginx --> zot
    nginx --> spireIngress
    nginx --> oidc
    nlb -->|P2P routing| api
    spireIngress --> spire
    api --> pg
    api --> ebs
    reconciler --> pg
    reconciler --> zot
    zot --> ebs
    pg --> ebs
    api --> spire
    reconciler --> spire
    spire --> prod

Decisions Before You Start

Choose these before you install anything:

Your SPIRE trust domain.

This is permanent for the deployment.
Your public DNS names.
Your federation profile.

This guide always uses https_web.

To keep the mental model simple, this guide uses the same base domain for both DNS and the trust domain:

Trust domain: partner.example.com
Directory API: api.partner.example.com
Zot: zot.partner.example.com
P2P routing: routing.partner.example.com
SPIRE federation endpoint: spire.partner.example.com
SPIRE OIDC discovery: oidc-discovery.spire.partner.example.com

Note

Do not split these in your first deployment. You can change them later.

Version Pins Used in This Guide

These pins match the current staging deployment references at the time this guide was written:

Component	Source	Version
Directory Helm chart	`oci://ghcr.io/agntcy/dir/helm-charts/dir`	`v1.2.0`
Directory apiserver image	`ghcr.io/agntcy/dir-apiserver`	`v1.2.0`
Directory reconciler image	`ghcr.io/agntcy/dir-reconciler`	`v1.2.0`
SPIRE Helm chart	`spiffe/spire`	`0.28.3`

Note

Older Directory docs still show v1.0.0 examples. Prefer the versions above for a fresh deployment.

Before You Start

This guide assumes you already have:

An EKS cluster running and reachable with kubectl
An ingress-nginx controller installed in the cluster
--enable-ssl-passthrough=true enabled on that ingress controller
cert-manager installed with a production ClusterIssuer such as letsencrypt-prod
ExternalDNS configured to manage records in your Route53 hosted zone
An EBS-backed StorageClass named ebs-sc-encrypted
kubectl, helm, openssl, jq, and htpasswd installed locally

This guide does not try to provision the AWS infrastructure from zero in the main flow. For the AWS-specific provisioning considerations, see the Appendix.

Setting Up Federation Using AWS EKS

Export the Variables Used by the Rest of the Guide

export TRUST_DOMAIN="partner.example.com"
export BASE_DOMAIN="partner.example.com"

export DIR_API_HOST="api.${BASE_DOMAIN}"
export DIR_ZOT_HOST="zot.${BASE_DOMAIN}"
export DIR_ROUTING_HOST="routing.${BASE_DOMAIN}"
export SPIRE_FEDERATION_HOST="spire.${BASE_DOMAIN}"
export SPIRE_OIDC_HOST="oidc-discovery.spire.${BASE_DOMAIN}"

export DIR_NAMESPACE="dir"
export SPIRE_NAMESPACE="spire"
export CERT_ISSUER="letsencrypt-prod"
export STORAGE_CLASS="ebs-sc-encrypted"

Verify the Cluster Add-Ons Before You Deploy Anything

Check that the required controllers and storage class already exist:
```
kubectl get ns
kubectl get deployment -A | egrep 'ingress-nginx|cert-manager|external-dns'
kubectl get storageclass
kubectl get clusterissuer
```
Confirm that the ingress-nginx controller has SSL passthrough enabled:
```
kubectl get deployment -n ingress-nginx ingress-nginx-controller -o yaml | \
  grep enable-ssl-passthrough
```
Expected result:
- ingress-nginx-controller is present
- cert-manager is present
- external-dns is present
- your ClusterIssuer exists
- ebs-sc-encrypted exists
- the ingress controller arguments include --enable-ssl-passthrough=true
If any of these are missing, stop here and resolve them first. This guide assumes the platform layer is already working.

Generate the Credentials Used by Directory and Zot

Use generated values instead of hardcoding static credentials in your shell history or values files:

export DIR_OCI_ADMIN_PASSWORD="$(openssl rand -base64 24)"
export DIR_SYNC_PASSWORD="$(openssl rand -base64 24)"
export DIR_DB_PASSWORD="$(openssl rand -base64 24)"

htpasswd -nbB admin "${DIR_OCI_ADMIN_PASSWORD}" > zot.htpasswd
htpasswd -nbB user "${DIR_SYNC_PASSWORD}" >> zot.htpasswd

openssl genpkey -algorithm Ed25519 -out node.privkey

Keep node.privkey on disk. It will be injected into the Helm release via --set-file later.

Install SPIRE for https_web Federation

Add the hardened SPIRE chart repository and install the CRDs:

helm repo add spiffe https://spiffe.github.io/helm-charts-hardened
helm repo update

helm upgrade --install spire-crds spire-crds \
  --repo https://spiffe.github.io/helm-charts-hardened \
  --namespace spire-crds \
  --create-namespace

Create spire-values.yaml:

cat <<EOF > spire-values.yaml
global:
  spire:
    trustDomain: ${TRUST_DOMAIN}
    clusterName: eks
    namespaces:
      create: false
    ingressControllerType: other

  installAndUpgradeHooks:
    enabled: false

  deleteHooks:
    enabled: false

spire-server:
  federation:
    enabled: true
    tls:
      spire:
        enabled: false
      certManager:
        enabled: true
        issuer:
          create: false
        certificate:
          issuerRef:
            kind: ClusterIssuer
            name: ${CERT_ISSUER}
    ingress:
      enabled: true
      className: nginx
      controllerType: other
      host: ${SPIRE_FEDERATION_HOST}
      tlsSecret: spire-federation-cert
      annotations:
        cert-manager.io/cluster-issuer: ${CERT_ISSUER}
        external-dns.alpha.kubernetes.io/hostname: ${SPIRE_FEDERATION_HOST}
        nginx.ingress.kubernetes.io/ssl-passthrough: "false"
        nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
        nginx.ingress.kubernetes.io/proxy-ssl-server-name: "on"
        nginx.ingress.kubernetes.io/proxy-ssl-name: "${SPIRE_FEDERATION_HOST}"
        # SPIRE serves its own self-signed cert on the backend; the ingress
        # controller cannot validate it, so upstream verification is disabled.
        nginx.ingress.kubernetes.io/proxy-ssl-verify: "off"
  controllerManager:
    watchClassless: true
    className: dir-spire
    identities:
      clusterFederatedTrustDomain:
        enabled: true
      clusterSPIFFEIDs:
        default:
          federatesWith:
            - prod.ads.outshift.io

spiffe-oidc-discovery-provider:
  ingress:
    enabled: true
    className: nginx
    host: ${SPIRE_OIDC_HOST}
    annotations:
      cert-manager.io/cluster-issuer: ${CERT_ISSUER}
      external-dns.alpha.kubernetes.io/hostname: ${SPIRE_OIDC_HOST}
  config:
    domains:
      - ${SPIRE_OIDC_HOST}
EOF

Install SPIRE:

helm upgrade --install spire spiffe/spire \
  --version 0.28.3 \
  --namespace "${SPIRE_NAMESPACE}" \
  --create-namespace \
  -f spire-values.yaml

Wait for SPIRE to become healthy:

kubectl wait --for=condition=ready pod -n "${SPIRE_NAMESPACE}" \
  -l app.kubernetes.io/name=server --timeout=300s

kubectl wait --for=condition=ready pod -n "${SPIRE_NAMESPACE}" \
  -l app.kubernetes.io/name=agent --timeout=300s

Verify the SPIRE Public Endpoints

Check that the ingress objects and public certificates exist:
```
kubectl get ingress -n "${SPIRE_NAMESPACE}"
kubectl get certificate -n "${SPIRE_NAMESPACE}"
```
Verify the federation endpoint:
```
curl -I "https://${SPIRE_FEDERATION_HOST}"
```
Verify the OIDC discovery document:
```
curl "https://${SPIRE_OIDC_HOST}/.well-known/openid-configuration" | jq .
```
Confirm that Route53 and ExternalDNS have published the hostnames:
```
dig +short "${SPIRE_FEDERATION_HOST}"
dig +short "${SPIRE_OIDC_HOST}"
```
At this point, your SPIRE bundle endpoint should already be externally reachable over standard HTTPS.

Create the Directory Values File

Create dir-values.yaml:

cat <<EOF > dir-values.yaml
apiserver:
  image:
    repository: ghcr.io/agntcy/dir-apiserver
    tag: v1.2.0
    pullPolicy: IfNotPresent

  spire:
    enabled: true
    className: dir-spire
    trustDomain: ${TRUST_DOMAIN}
    useCSIDriver: true
    dnsNameTemplates:
      - ${DIR_API_HOST}
    federation:
      - className: dir-spire
        trustDomain: prod.ads.outshift.io
        bundleEndpointURL: https://prod.spire.ads.outshift.io
        bundleEndpointProfile:
          type: https_web

  routingService:
    type: LoadBalancer
    cloudProvider: "aws"
    aws:
      internal: false
      nlbTargetType: "instance"
    externalTrafficPolicy: Local
    annotations:
      external-dns.alpha.kubernetes.io/hostname: "routing.${BASE_DOMAIN}"

  config:
    listen_address: "0.0.0.0:8888"
    oasf_api_validation:
      disable: true
    authn:
      enabled: true
      mode: "x509"
      socket_path: "unix:///run/spire/agent-sockets/api.sock"
      audiences:
        - "spiffe://${TRUST_DOMAIN}/spire/server"
    authz:
      enabled: true
      enforcer_policy_file_path: "/etc/agntcy/dir/authz_policies.csv"
    store:
      provider: "oci"
      oci:
        # Use the external address, not the internal .svc.cluster.local name.
        # The apiserver shares this address with remote peers via the
        # RequestRegistryCredentials RPC so they can pull records during sync.
        registry_address: "${DIR_ZOT_HOST}"
        auth_config:
          insecure: "false"
          username: "admin"
          password: "${DIR_OCI_ADMIN_PASSWORD}"
    routing:
      listen_address: "/ip4/0.0.0.0/tcp/5555"
      key_path: /etc/routing/node.privkey
      datastore_dir: /etc/routing/datastore
      directory_api_address: "${DIR_API_HOST}:443"
      gossipsub:
        enabled: true
    sync:
      auth_config:
        username: "user"
        password: "${DIR_SYNC_PASSWORD}"
    publication:
      scheduler_interval: "1h"
      worker_count: 1
      worker_timeout: "30m"
    database:
      type: "postgres"
      postgres:
        host: ""
        port: 5432
        database: "dir"
        # PostgreSQL runs as a subchart in the same namespace, so traffic
        # stays within the pod network. Use "require" or "verify-full" if
        # your PostgreSQL instance is external or crosses a network boundary.
        ssl_mode: "disable"

  authz_policies_csv: |
    p,${TRUST_DOMAIN},*
    p,*,/agntcy.dir.store.v1.StoreService/Pull
    p,*,/agntcy.dir.store.v1.StoreService/PullReferrer
    p,*,/agntcy.dir.store.v1.StoreService/Lookup
    p,*,/agntcy.dir.store.v1.SyncService/RequestRegistryCredentials

  pvc:
    create: true
    storageClassName: ${STORAGE_CLASS}
    size: 20Gi

  strategy:
    type: Recreate

  ingress:
    enabled: true
    className: nginx
    annotations:
      nginx.ingress.kubernetes.io/ssl-passthrough: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "GRPCS"
      external-dns.alpha.kubernetes.io/hostname: ${DIR_API_HOST}
    hosts:
      - host: ${DIR_API_HOST}
        paths:
          - path: /
            pathType: ImplementationSpecific
    tls:
      - hosts:
          - ${DIR_API_HOST}

  postgresql:
    enabled: true
    auth:
      username: "dir"
      password: "${DIR_DB_PASSWORD}"
      database: "dir"
    primary:
      persistence:
        enabled: true
        storageClass: ${STORAGE_CLASS}
        size: 20Gi

  reconciler:
    enabled: true
    image:
      repository: ghcr.io/agntcy/dir-reconciler
      tag: v1.2.0
      pullPolicy: IfNotPresent
    config:
      database:
        type: "postgres"
        postgres:
          host: ""
          port: 5432
          database: "dir"
          ssl_mode: "disable"
      local_registry:
        registry_address: "${DIR_ZOT_HOST}"
        repository_name: ""
        auth_config:
          insecure: false
      regsync:
        enabled: true
        interval: "1m"
        timeout: "30m"
        authn:
          enabled: true
          mode: "x509"
          socket_path: "unix:///run/spire/agent-sockets/api.sock"
      indexer:
        enabled: true
        interval: "30m"

  secrets:
    # privKey is injected via --set-file in the helm install command.
    # PEM data is multiline and cannot be safely embedded in a YAML heredoc.
    syncAuth:
      username: "user"
      password: "${DIR_SYNC_PASSWORD}"
    ociAuth:
      username: "admin"
      password: "${DIR_OCI_ADMIN_PASSWORD}"
    postgresAuth:
      username: "dir"
      password: "${DIR_DB_PASSWORD}"

  zot:
    persistence: true
    pvc:
      create: true
      accessModes: ["ReadWriteOnce"]
      storage: 100Gi
      storageClassName: ${STORAGE_CLASS}
    mountSecret: true
    authHeader: "admin:${DIR_OCI_ADMIN_PASSWORD}"
    secretFiles:
      htpasswd: |-
$(sed 's/^/        /' zot.htpasswd)
    mountConfig: true
    ingress:
      enabled: true
      className: nginx
      annotations:
        cert-manager.io/cluster-issuer: ${CERT_ISSUER}
        external-dns.alpha.kubernetes.io/hostname: ${DIR_ZOT_HOST}
      hosts:
        - host: ${DIR_ZOT_HOST}
          paths:
            - path: /
              pathType: ImplementationSpecific
      tls:
        - secretName: zot-public-tls
          hosts:
            - ${DIR_ZOT_HOST}
EOF

Note

The routing datastore, PostgreSQL, and Zot PVCs are all pinned to ebs-sc-encrypted. If your cluster uses a different default StorageClass, update the storageClassName values for each component before you install Directory.

Install Directory

helm upgrade --install dir oci://ghcr.io/agntcy/dir/helm-charts/dir \
  --version v1.2.0 \
  --namespace "${DIR_NAMESPACE}" \
  --create-namespace \
  -f dir-values.yaml \
  --set-file apiserver.secrets.privKey=node.privkey

--set-file reads the PEM file and passes its exact bytes to the chart, which base64-encodes them into the Kubernetes Secret. This avoids the multiline-in-YAML corruption that would happen if the key were embedded directly in the values file.

Wait for the main workloads:

kubectl wait --for=condition=ready pod -n "${DIR_NAMESPACE}" \
  -l app.kubernetes.io/name=apiserver --timeout=300s

kubectl get pods -n "${DIR_NAMESPACE}"
kubectl get svc -n "${DIR_NAMESPACE}"
kubectl get ingress -n "${DIR_NAMESPACE}"
kubectl get pvc -n "${DIR_NAMESPACE}"

Verify the Directory Endpoints and Certificates

Check that the public API hostname presents the SPIFFE-issued certificate rather than the ingress default certificate:

echo | openssl s_client -connect "${DIR_API_HOST}:443" \
  -servername "${DIR_API_HOST}" 2>/dev/null | \
  openssl x509 -noout -subject

If SSL passthrough is working, the certificate subject should come from SPIRE and not from the ingress controller.

Confirm that the API server successfully obtained an X.509-SVID:

kubectl logs -n "${DIR_NAMESPACE}" -l app.kubernetes.io/name=apiserver | \
  grep "Successfully obtained valid X509-SVID"

Verify that Zot is reachable over HTTPS:

curl -u "admin:${DIR_OCI_ADMIN_PASSWORD}" "https://${DIR_ZOT_HOST}/v2/_catalog"

Verify the DNS records:

dig +short "${DIR_API_HOST}"
dig +short "${DIR_ZOT_HOST}"

Verify the persistent volumes:

kubectl get pvc -n "${DIR_NAMESPACE}" -o wide

The PVCs should be Bound and backed by EBS.

Confirm Federation with the Public Directory

The first proof point is that your SPIRE server can fetch the public Directory trust bundle:
```
kubectl exec -n "${SPIRE_NAMESPACE}" spire-server-0 -c spire-server -- \
  spire-server bundle list -id spiffe://prod.ads.outshift.io -format spiffe
```
If the bundle is missing:
- check that your SPIRE federation endpoint is externally reachable
- check that https://prod.spire.ads.outshift.io is reachable from the cluster
- check cert-manager and DNS for ${SPIRE_FEDERATION_HOST}
Onboard Your Trust Domain into dir-staging

Your cluster trusting prod is only half of the setup. The public production Directory must also learn how to trust your SPIRE domain.

Create a file named onboarding/federation/${TRUST_DOMAIN}.yaml in your dir-staging fork with this content:
```
className: dir-spire
trustDomain: partner.example.com
bundleEndpointURL: https://spire.partner.example.com
bundleEndpointProfile:
  type: https_web
```
Then:
1. Open a pull request against agntcy/dir-staging.
2. Wait for the maintainers to merge it and roll it out.
3. Make sure the public side also adds the authorization policy for your trust domain.
Until that pull request is merged and applied, prod will not accept requests authenticated with your trust domain.
Validate from a SPIRE-Enabled Client

The easiest client validation is to run dirctl from an environment that already has a SPIRE agent socket for your trust domain.

Set the client environment:
```
export DIRECTORY_CLIENT_SERVER_ADDRESS="${DIR_API_HOST}:443"
export DIRECTORY_CLIENT_SPIFFE_SOCKET_PATH="/tmp/spire-agent/public.sock"
```
Then run a basic connectivity check against your own Directory:
```
dirctl info bafytest123
# Expected: Error: record not found
```
Once the dir-staging onboarding pull request is merged, validate access to the public Directory as well:
```
dirctl pull bafytest123 \
  --server-addr prod.api.ads.outshift.io \
  --spiffe-socket-path "${DIRECTORY_CLIENT_SPIFFE_SOCKET_PATH}"
# Expected: Error: record not found
```
If you want a fuller post-deployment smoke test against your own Directory, use the normal Directory CLI workflows from Directory CLI Guide:
- dirctl push record.json
- dirctl info <cid>
- dirctl search --name <name>
- dirctl sync create https://prod.api.ads.outshift.io:443

Troubleshooting

If you get stuck, check these first:

certificate is valid for ingress.local: SSL passthrough is not working, or the API ingress is configured with a terminating TLS secret.
certificate signed by unknown authority on the federation endpoint: cert-manager or the ClusterIssuer is misconfigured.
missing prod bundle in SPIRE: your cluster cannot reach https://prod.spire.ads.outshift.io, or your SPIRE federation controller settings are wrong.
Pending PVCs: your EBS CSI setup or StorageClass defaulting is incomplete.
prod rejects your trust domain after local federation works: your dir-staging onboarding pull request has not been merged or rolled out yet.

For in-depth troubleshooting, see Federation Best Practices and Troubleshooting.

Appendix: Best-Effort AWS Provisioning Notes

Note

This appendix is intentionally marked as best effort. It reflects the AWS shape assumed by the rest of the guide, but it was not validated end to end in the environment used to write this document.

If you do not already have the platform prerequisites, these are the usual AWS building blocks you need before the main walkthrough starts:

EKS Cluster

Create an EKS cluster in subnets that allow the ingress controller and worker nodes to reach the public internet.
Use managed node groups unless your platform team already standardizes on Karpenter or a custom node model.
Make sure the cluster can provision EBS volumes through the EBS CSI driver.

Route53 and DNS

Create or reuse a hosted zone for the base domain.
Make sure ExternalDNS can write records into that zone.
Reserve the five public names used in this guide:
- api.<domain>
- zot.<domain>
- routing.<domain>
- spire.<domain>
- oidc-discovery.spire.<domain>

IAM and Workload Identity

ExternalDNS usually needs an IAM role that can change Route53 records.
cert-manager may need additional AWS permissions if you use Route53-based DNS challenges instead of an HTTP challenge flow.
If your platform uses IAM Roles for Service Accounts (IRSA) or the newer EKS Pod Identity, create those bindings before you install the controllers. AWS recommends EKS Pod Identity for new EC2-based clusters; IRSA is still required for Fargate workloads.

Ingress and Load Balancers

The ingress-nginx controller must be exposed through an AWS load balancer that is reachable from the public internet.
The Directory API path depends on SSL passthrough, so verify that the ingress controller keeps that capability when you customize the Service annotations.
The public production deployment model uses an AWS Network Load Balancer for TCP passthrough. If your platform defaults to a different load balancer behavior, validate it carefully before exposing the API hostname.

Security Groups and Networking

Worker nodes and the ingress load balancer must allow inbound HTTPS from the internet for the public hostnames.
Egress must allow the cluster to reach:
- Let's Encrypt
- Route53 APIs, if used by your controller setup
- https://prod.spire.ads.outshift.io
If your company routes outbound traffic through a NAT or firewall, confirm that cert-manager and SPIRE can still complete their external calls.

Storage

Make ebs-sc-encrypted the default StorageClass if you want the subcharts to inherit EBS automatically.
Verify after install that PostgreSQL, Zot, and the routing datastore PVC all bind to the expected StorageClass.

A Good First Cut

If your platform team asks what they need to hand you before you can follow the main guide, ask for this:

A working EKS cluster
Ingress-nginx with SSL passthrough
cert-manager with a production ClusterIssuer
ExternalDNS wired to Route53
An encrypted EBS StorageClass named ebs-sc-encrypted
Public DNS delegation for your chosen domain