Federation Best Practices and Troubleshooting
This page covers operational best practices and troubleshooting for Directory federation. It applies to both https_web and https_spiffe profiles. For profile selection and configuration details, see Federation Bundle Profiles.
Operational Best Practices
Universal Best Practices
-
Consistent className configuration
- Use standardized className value across all resources (recommended: "dir-spire")
- Ensures SPIRE Controller Manager correctly processes ClusterSPIFFEID resources
- Prevents controller filtering issues
-
Enable ClusterFederatedTrustDomain Controller
- Required for automatic trust bundle fetching
- Without this controller, federation configuration is ignored
- Verify in controller manager logs
-
Implement Bundle Refresh Monitoring
- Monitor SPIRE server logs for "Bundle refreshed" events
- Alert on bundle staleness exceeding 24 hours
- Implement automated monitoring:
kubectl logs -n dir-spire -l app.kubernetes.io/name=server -c spire-server | \ grep "Bundle refreshed" | \ grep "trust_domain=partner.com" -
Configure Rate Limiting
- Protect federation endpoints from abuse
- Recommended configuration: 100 requests/second with 5x burst multiplier
- Adjust based on number of federation partners and refresh frequency
-
Set Appropriate CA TTL
- Recommended: 720 hours (30 days)
- Balances security (shorter TTL) with operational stability (fewer rotations)
- Consider operational capacity for handling rotation issues
https_web Specific Practices
-
Use Production Certificate Authority
- Deploy with production Let's Encrypt (
letsencryptClusterIssuer) - Avoid staging certificates in production federation scenarios
- Staging certificates are not trusted by remote SPIRE servers by default
- Deploy with production Let's Encrypt (
-
Monitor Certificate Expiration
- Although cert-manager auto-renews, implement monitoring
- Alert on certificates expiring within 7 days
- Verify renewal process during certificate lifecycle:
kubectl get certificate -n dir-spire spire-server-federation-cert -o yaml | \ grep -A5 "status:" -
Validate External Network Accessibility
- Periodically test federation endpoint from external network
- Verify DNS resolution from federation partner networks
- Validate certificate chain visibility:
curl -v https://spire.example.com 2>&1 | grep -A10 "SSL certificate"
https_spiffe Specific Practices
-
Maintain Bootstrap Bundle Freshness
- Update bootstrap bundles before CA rotation (every caTTL period)
- Implement bundle distribution automation where feasible
- Document manual update procedures for operational staff
-
Document Bundle Exchange Procedures
- Establish clear onboarding process for new federation partners
- Define secure channels for bundle transmission
- Maintain contact registry for federation partner technical contacts
-
Validate SSL Passthrough Persistence
- Verify ssl-passthrough configuration after cluster maintenance
- Test after NGINX ingress controller upgrades
- Automate validation:
kubectl get deployment -n ingress-nginx ingress-nginx-controller -o yaml | \ grep "enable-ssl-passthrough" || echo "WARNING: SSL passthrough not enabled"
Troubleshooting
Connection Issues
# Check SPIRE agent status
spire-agent api fetch x509-svid
# Verify network connectivity
curl -v https://prod.api.ads.outshift.io
# Verify env vars are set
echo $DIRECTORY_CLIENT_SERVER_ADDRESS $DIRECTORY_CLIENT_SPIFFE_SOCKET_PATH
Federation Issues
# Verify trust bundle exchange
spire-server federation show --trustDomain prod.ads.outshift.io
# Test bundle endpoint
curl https://prod.spire.ads.outshift.io/
Common Errors
| Error | Solution |
|---|---|
connection refused |
Check SPIRE agent running and socket path |
x509: certificate signed by unknown authority |
Verify trust bundle configuration |
context deadline exceeded |
Check network and firewall |
permission denied |
Ensure SPIFFE ID registration and policies |
https_web Profile Issues
Issue: "certificate is valid for ingress.local, not spire.example.com"
Root Cause: NGINX serving default certificate instead of cert-manager issued certificate.
Resolution: Ensure tlsSecret is set under spire-server.ingress in your Helm values:
# Under spire-server.ingress in values
ingress:
tlsSecret: "spire-server-federation-cert"
Verification:
kubectl get certificate -n dir-spire spire-server-federation-cert
# Status should be READY=True
Issue: "certificate signed by unknown authority"
Root Cause: cert-manager certificate not issued or ClusterIssuer misconfigured.
Resolution:
# Check Certificate status
kubectl describe certificate -n dir-spire spire-server-federation-cert
# Check ClusterIssuer status
kubectl get clusterissuer letsencrypt -o yaml
# Review cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
Issue: "no endpoints available for service"
Root Cause: Ingress not routing traffic to SPIRE server service.
Resolution:
# Verify Ingress status
kubectl get ingress -n dir-spire
# Check service endpoints
kubectl get endpoints -n dir-spire spire-server
# Verify SPIRE server pods are running
kubectl get pods -n dir-spire -l app.kubernetes.io/name=server
https_spiffe Profile Issues
Issue: "certificate contains no URI SAN"
Root Cause: SPIRE server has not issued itself a federation X.509-SVID.
Resolution:
# Restart SPIRE server to trigger SVID issuance
kubectl rollout restart -n dir-spire deployment/spire-server
# Wait for rollout to complete
kubectl rollout status -n dir-spire deployment/spire-server
# Verify federation SVID issuance in logs
kubectl logs -n dir-spire -l app.kubernetes.io/name=server -c spire-server | \
grep "federation"
Issue: "certificate signed by unknown authority"
Root Cause: Bootstrap bundle is missing, stale, or incorrectly formatted.
Resolution: Obtain a fresh bundle from your federation partner, validate it, and update trustDomainBundle in your federation resource. See Bootstrap Bundle Exchange Process for extraction and validation steps:
# Obtain fresh bundle from federation partner
curl -k https://spire.partner.com > fresh-partner-bundle.json
# Validate bundle structure
jq '.' fresh-partner-bundle.json
# Update trustDomainBundle in federation configuration
# Redeploy application
Issue: "ssl-passthrough annotation not taking effect"
Root Cause: NGINX ingress controller not configured with --enable-ssl-passthrough flag.
Resolution:
# Verify controller has ssl-passthrough enabled
kubectl get deployment -n ingress-nginx ingress-nginx-controller -o yaml | \
grep "enable-ssl-passthrough"
# If not present, patch deployment
kubectl patch deployment -n ingress-nginx ingress-nginx-controller --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value":"--enable-ssl-passthrough"}]'
# Verify configuration persisted after controller restart