
How Istio and Argo CD helped us to debug platform migration issues for our App

At Wehkamp, we’ve been very busy with the migration of our container platform. Our (old) Blaze platform runs on Mesos, Marathon, and NGINX, but we’re shifting to a new Atlas platform, powered by Kubernetes and Istio. This setup gives us more resilience, security, and auto-scaling. A year ago, we successfully migrated our www-ingress to this new platform. This week, we set our sights on migrating the app-ingress for our Android and iOS backend services. Unfortunately, things didn’t go as planned…
A big shout-out to Chris Vahl for working together on this story.
Key differences: routing and tokens
Atlas works with an Istio service mesh. Istio adds traffic management, security, and observability by deploying sidecar proxies alongside our services to intercept and control traffic. To make adoption easier, we’ve added new traffic classes to our internal helm chart. Argo CD applies these and generates the proper ingress rules by convention (routes + constraints). Services need to be more explicit about what they expose and where, which helps to reduce the attack surface.
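To give an idea, a route generated by such a convention could look something like the VirtualService below (the service name, hosts, and gateway are made up for illustration, not our actual chart output):
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: catalog-app
  namespace: catalog
spec:
  # Hosts and gateway are illustrative; the chart derives them by convention
  hosts:
    - app-ingress.domain.tld
  gateways:
    - istio-system/app-gateway
  http:
    - match:
        - uri:
            prefix: /catalog
      route:
        - destination:
            host: catalog
            port:
              number: 8080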
We’ve also enhanced JWT token security with stricter validation and encryption. One of our biggest challenges during the migration was dealing with old JWT tokens from our Blaze platform. Since Blaze and Atlas tokens aren’t backwards compatible, we had to exchange them during the migration. Because old versions of our app might still be in use, we needed to handle this dynamically. So we’ve built a Cloudflare worker to assist with the token exchange, ensuring a smooth transition for all Wehkamp users.
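Istio can enforce this kind of stricter validation at the mesh edge. A minimal sketch of what that could look like (the issuer and JWKS URL are placeholders, not our actual configuration):
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: app-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
    # Placeholder issuer/JWKS endpoint; requests carrying a token
    # that fails validation against these rules are rejected
    - issuer: "https://auth.example.tld"
      jwksUri: "https://auth.example.tld/.well-known/jwks.json"
Paired with an AuthorizationPolicy that requires a request principal, this also rejects requests that carry no token at all.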
Problems ahead!
We ran into problems almost immediately after switching our endpoints to the new platform, which forced us to revert to maintain stability. Unfortunately, finding the root cause wasn’t straightforward. There could have been subtle mismatches between our development and production environments, like differences in Cloudflare settings (e.g., WAF or Workers) or Kubernetes configurations (Istio routes or constraints) that didn’t fully align.
We have two main environments: dev and production. The dev environment is used for Q/A, but no matter how much effort we put in… dev and prod are never exactly the same. To debug effectively, we needed to connect a new subdomain for testing without disrupting production…
Leveraging Istio and Argo CD to generate routes
So what are our options?
- Reconfigure our app ingress to point to a different domain. However, due to the nature of our charts, this would require a redeploy of over 50 services in production.
- Add the extra domain and create additional virtual services for every app, but modifying the charts would still trigger a redeploy, which we want to avoid.
- 🤔 So… what if we add a separate Argo CD ApplicationSet that only contains virtual services? This would let us avoid a redeploy altogether while handling the routing changes we needed.
Script it
Let’s create a script that:
- Uses the Argo CD CLI to query all production applications (those following the prod- naming convention).
- Lists all manifests that belong to each app and finds the VirtualService whose name ends in -app (another convention).
- Modifies the manifest by dropping the labels, changing the name, and replacing the hosts.
- Stores the manifests in a single file with the --- YAML document separator.
- And we can write an Argo CD ApplicationSet that uses the YAML file.
Bash to the rescue!
Here’s what the script would look like (assuming you’re already logged into Argo CD):
#!/bin/bash
set -e

# Output file for all manifests
output_file="combined_virtualservices.yaml"

# Clear the output file if it already exists
> "$output_file"

# Get a list of all Argo CD apps
apps=$(argocd app list --output name)

for app in $apps; do
  # Skip any apps that do not contain the word 'prod-'
  if [[ $app != *prod-* ]]; then
    continue
  fi

  echo "Processing app: $app"

  # Get the VirtualService manifests for this app
  manifest=$(argocd app manifests "$app" | \
    yq e 'select(.kind == "VirtualService" and .metadata.name == "*-app")' -)

  # No manifest, go to the next app
  if [[ -z "$manifest" ]]; then
    continue
  fi

  # Modify the manifest:
  # delete metadata.labels,
  # replace spec.hosts,
  # and update the name
  modified_manifest=$(echo "$manifest" | \
    yq e 'del(.metadata.labels) | .spec.hosts = ["new-app-ingress.domain.tld"] | .metadata.name += "-new-ingress"' -)

  # Append the modified manifest to the output file
  echo "---" >> "$output_file"
  echo "$modified_manifest" >> "$output_file"
  echo "  Modified and appended manifest for app: $app"
done
The resulting YAML file can be quite large and should be stored in the manifests/app2-vs/prod folder of the repository. These virtual services will handle the routing on the new ingress URL, which allows us to debug and analyze discrepancies without interfering with production traffic.
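For reference, after the yq transformations a single entry in the combined file would look roughly like this (service, namespace, and gateway names are illustrative):
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  # Labels dropped and name suffixed by the script
  name: checkout-app-new-ingress
  namespace: checkout
spec:
  # Hosts replaced with the debug ingress domain
  hosts:
    - new-app-ingress.domain.tld
  gateways:
    - istio-system/app-gateway
  http:
    - route:
        - destination:
            host: checkout
            port:
              number: 8080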
Referencing the Virtual Services in Argo CD
Once we had the modified virtual service definitions, we needed an Argo CD ApplicationSet to deploy them to production. Here’s a snippet of how we defined it:
---
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: app2-vs
  namespace: argo-cd
spec:
  generators:
    - matrix:
        generators:
          - list:
              elements:
                - envType: prod
          - clusters:
              selector:
                matchLabels:
                  type: workload
                matchExpressions:
                  - { key: environment, operator: In, values: ["{{envType}}"] }
  template:
    metadata:
      name: "{{name}}-app2-vs"
      namespace: argo-cd
    spec:
      destination:
        server: "{{server}}"
        namespace: app2-vs
      project: default
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - PruneLast=true
          - ApplyOutOfSyncOnly=true
      sources:
        - repoURL: https://github.com/our-org/our-workload-services.git
          path: manifests/app2-vs/{{metadata.labels.environment}}
          targetRevision: HEAD
Once Argo CD synced this config and Istio picked up the new virtual services, all of our production services were available on the new domain. And our debugging began. In a matter of hours, all of the issues were identified and solved by our platform and application teams.
Lessons learned
What did we learn from this?
- Environments are never the same, so there is real value in being able to test our production changes on a different URL before switching over our backend.
- Describing Kubernetes / Istio resources using Argo CD is very powerful and makes it easy to extract information from a running cluster. Istio makes it simple to set up a secondary routing strategy.
- yq makes it easy to extract the right manifest and edit it.
- GitOps for the win! 💪
The combination of GitOps, Argo CD, Istio, and Kubernetes gives us a flexible and secure way to debug production issues with a secondary domain without any disruptions to the production workloads.
I work as a Pathfinder at Wehkamp.nl, one of the biggest e-commerce companies in the Netherlands. This article is part of our Tech Blog, check it out & subscribe. Looking for a great job? Check our job offers or drop me a line on LinkedIn.
Originally published at https://keestalkstech.com on October 31, 2024.