Helm troubleshooting

This guide helps you diagnose and resolve common issues when deploying or operating Midaz on Kubernetes with Helm. Each section covers a specific symptom, the diagnostic commands to investigate it, and the steps to resolve it.

General diagnostic commands

Start with these commands to get a broad picture of your deployment state before diving into specific issues.

# List all Helm releases in the midaz namespace
helm list -n midaz

# Check the status of a specific release
helm status midaz -n midaz

# List all pods and their current state
kubectl get pods -n midaz

# Get events for the namespace (useful for spotting recent failures)
kubectl get events -n midaz --sort-by='.lastTimestamp'

# Describe a specific pod (replace <pod-name> with the actual name)
kubectl describe pod <pod-name> -n midaz

# Tail logs for a pod
kubectl logs <pod-name> -n midaz --tail=100

# Follow logs in real time
kubectl logs <pod-name> -n midaz -f
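
On a busy namespace, the pod listing can be narrowed to the pods that need attention. The function below is a minimal sketch, not part of the Midaz tooling: the name unhealthy_pods is hypothetical, and it assumes the default kubectl get pods column layout (NAME, READY, STATUS).

```shell
# Hypothetical helper: print pods that are not fully ready.
# Assumes the default `kubectl get pods` columns: NAME READY STATUS ...
unhealthy_pods() {
  ns="${1:-midaz}"
  kubectl get pods -n "$ns" --no-headers |
    awk '{
      split($2, ready, "/")                 # READY looks like "1/2"
      if ($3 == "Completed") next           # finished jobs are expected
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
    }'
}
```

Calling unhealthy_pods midaz prints one line per pod that is either not Running or Running with fewer ready containers than expected; an empty result suggests every pod is healthy.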

Pods stuck in Pending

Symptom: One or more pods remain in Pending state and never start.

Diagnostic commands:

kubectl get pods -n midaz
kubectl describe pod <pod-name> -n midaz
kubectl get events -n midaz --sort-by='.lastTimestamp'
kubectl top nodes

Common causes and solutions:

Insufficient CPU or memory on nodes — The scheduler cannot find a node that satisfies the pod’s resource requests. Check the Events section of kubectl describe pod. Look for messages like Insufficient cpu or Insufficient memory. Either reduce resources.requests in your values.yaml, or add more nodes to the cluster.
PersistentVolumeClaim not bound — A PVC required by a dependency (PostgreSQL, MongoDB, Valkey) is stuck in Pending.
```
kubectl get pvc -n midaz
kubectl describe pvc <pvc-name> -n midaz
```
Verify that a StorageClass is available and set as the default. See PVC stuck in Pending below.
Node selector or affinity mismatch — The pod requires a specific node label that no node in the cluster has. Check your values.yaml for nodeSelector or affinity settings, and verify that your nodes have the expected labels:
```
kubectl get nodes --show-labels
```
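
The three causes above can usually be told apart from the scheduler message in the pod's Events. The sketch below maps that message to one of them. The function name classify_pending is hypothetical, and apart from Insufficient cpu and Insufficient memory the matched phrases are assumptions about scheduler wording, which varies between Kubernetes versions.

```shell
# Hypothetical helper: reads `kubectl describe pod` output (or event text)
# on stdin and prints a hint for each cause it recognizes.
classify_pending() {
  input="$(cat)"
  matched=0
  if printf '%s\n' "$input" | grep -Eq 'Insufficient (cpu|memory)'; then
    echo "resources: lower resources.requests or add nodes"
    matched=1
  fi
  # Assumed wording; adjust the pattern to what your cluster prints.
  if printf '%s\n' "$input" | grep -Eiq 'unbound.*PersistentVolumeClaim'; then
    echo "storage: a PVC is not bound"
    matched=1
  fi
  # Assumed wording; adjust the pattern to what your cluster prints.
  if printf '%s\n' "$input" | grep -Eiq "didn't match.*(selector|affinity)"; then
    echo "placement: nodeSelector or affinity matches no node"
    matched=1
  fi
  [ "$matched" -eq 1 ] || echo "unknown: read the Events section manually"
}
```

One way to use it is to pipe kubectl describe pod <pod-name> -n midaz into classify_pending.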

ImagePullBackOff

Symptom: Pods show ImagePullBackOff or ErrImagePull status.

Diagnostic commands:

kubectl describe pod <pod-name> -n midaz
kubectl get events -n midaz --sort-by='.lastTimestamp' | grep -i image

Common causes and solutions:

Wrong image tag — The specified tag does not exist in the registry. Check the image.tag value in your values.yaml against the version compatibility table.

Private registry requires authentication — The cluster cannot pull images without credentials. Create an image pull secret and reference it in your values.yaml:

kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n midaz

ledger:
  imagePullSecrets:
    - name: regcred

Missing imagePullSecrets — The secret exists but is not referenced in the component’s config. Ensure imagePullSecrets is set for all affected components.
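
To check tags against the version compatibility table, it helps to see exactly which image each failing pod is trying to pull. The following is a rough sketch under stated assumptions: the function name failed_pull_images is hypothetical, it relies on the default kubectl get pods column layout, and it only covers regular containers (pods whose status is prefixed with Init: are not matched).

```shell
# Hypothetical helper: list the images of pods that fail to pull.
failed_pull_images() {
  ns="${1:-midaz}"
  kubectl get pods -n "$ns" --no-headers |
    awk '$3 == "ImagePullBackOff" || $3 == "ErrImagePull" { print $1 }' |
    while read -r pod; do
      # Space-separated list of the images declared in the pod spec.
      images="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].image}')"
      printf '%s\t%s\n' "$pod" "$images"
    done
}
```

Each output line pairs a pod name with its image references, so a mistyped tag or an unexpected registry host stands out quickly.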

CrashLoopBackOff

Symptom: Pods start and immediately crash, restarting repeatedly.

Diagnostic commands:

kubectl get pods -n midaz
kubectl logs <pod-name> -n midaz --previous
kubectl describe pod <pod-name> -n midaz

Use --previous to see logs from the last crashed container instance, not the currently restarting one.

Common causes and solutions:

Bad or missing environment variables — A required config key is absent or has an incorrect value. Check the logs for messages like missing env var, invalid config, or similar. Review the configmap section of your values.yaml.
Missing Kubernetes Secret — The pod references a secret that does not exist.
```
kubectl get secrets -n midaz
kubectl describe secret <secret-name> -n midaz
```
If the secret is missing, create it manually or re-run the Helm install.
Wrong database credentials — The service cannot authenticate with PostgreSQL, MongoDB, or Valkey. Check logs for authentication failed. Messages such as connection refused or ECONNREFUSED point to a wrong host or port, or an unreachable database, rather than bad credentials. Verify the secrets section in your values.yaml and confirm the credentials match those used when the databases were provisioned.
OOMKilled — The container exceeded its memory limit and was killed by the kernel.
```
kubectl describe pod <pod-name> -n midaz | grep -A5 "Last State"
```
Look for OOMKilled in the Last State section. Increase resources.limits.memory in your values.yaml. See Pod eviction / OOMKilled below.
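
The termination reason and exit code of the last crash often narrow down which of the causes above applies. The sketch below reads both from the pod status and gives a rough interpretation of common exit codes. The function names last_termination and explain_exit_code are hypothetical; the exit-code mapping follows the usual 128 + signal number convention and is a heuristic, not a definitive diagnosis.

```shell
# Hypothetical helper: print container name, last termination reason,
# and exit code for each container in a pod.
last_termination() {
  kubectl get pod "$1" -n "${2:-midaz}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Hypothetical helper: rough interpretation of a container exit code.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check why the process stops on its own" ;;
    137) echo "killed by SIGKILL (128+9); check for OOMKilled" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)   echo "application error; read the --previous logs" ;;
  esac
}
```

For example, last_termination <pod-name> midaz followed by explain_exit_code 137 suggests looking at memory limits before anything else.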

Helm install timeout

Symptom: helm install or helm upgrade fails with a timeout error before the release reaches deployed state.

Diagnostic commands:

helm status midaz -n midaz
kubectl get pods -n midaz
kubectl describe pod <pod-name> -n midaz
kubectl get events -n midaz --sort-by='.lastTimestamp'

Common causes and solutions:

Slow image pulls — Large images on a slow connection can exceed the default timeout. Increase the timeout:

helm install midaz oci://registry-1.docker.io/lerianstudio/midaz-helm \
  --version <version> \
  -n midaz \
  --create-namespace \
  --timeout 15m

Init containers failing — An init container (e.g., the database bootstrap job) is hanging or retrying. Check init container logs:
```
kubectl logs <pod-name> -n midaz -c <init-container-name>
```
Readiness probes failing — The pod is running but not passing its readiness check, so Helm (when run with --wait) keeps waiting until the timeout expires. Describe the pod and look at the Conditions and Events sections. You may need to increase initialDelaySeconds in your readiness probe settings, or investigate why the service is not healthy on startup.
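
When the init container name is not known in advance, the names can be read from the pod spec and the logs fetched for each one in turn. This is a minimal sketch: the function name init_container_logs is hypothetical, and the 50-line tail is an arbitrary choice.

```shell
# Hypothetical helper: print recent logs for every init container of a pod.
init_container_logs() {
  pod="$1"
  ns="${2:-midaz}"
  # Space-separated init container names taken from the pod spec.
  for c in $(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.initContainers[*].name}'); do
    echo "== init container: $c =="
    kubectl logs "$pod" -n "$ns" -c "$c" --tail=50
  done
}
```

A pod without init containers produces no output, which by itself rules out this cause.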

Services not reachable

Symptom: Midaz APIs are unreachable from outside the cluster, or services cannot communicate internally.

Diagnostic commands:

kubectl get ingress -n midaz
kubectl describe ingress <ingress-name> -n midaz
kubectl get svc -n midaz
kubectl get endpoints -n midaz
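
In the endpoints listing, a Service whose selector matches no ready pod shows <none> in the ENDPOINTS column, and traffic sent to it has nowhere to go. The sketch below picks out those Services. The function name services_without_endpoints is hypothetical, and it assumes the default column layout (NAME, ENDPOINTS, AGE).

```shell
# Hypothetical helper: list Services that currently have no backing pods.
services_without_endpoints() {
  kubectl get endpoints -n "${1:-midaz}" --no-headers |
    awk '$2 == "<none>" { print $1 }'
}
```

If a Midaz Service appears in the output, compare its selector with the labels on the pods it is meant to route to, and check whether those pods are passing their readiness probes.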

Common causes and solutions:

Ingress misconfiguration — The Ingress resource exists but the controller is not picking it up. Verify that ingress.className matches the class of your installed ingress controller:
```
kubectl get ingressclass
```
Also check that the ingress controller pod itself is running:
```
kubectl get pods -n ingress-nginx
```
DNS not pointing to the load balancer — The hostname in your Ingress does not resolve to the controller’s external IP. Get the external IP and compare with your DNS record:
```
kubectl get svc -n ingress-nginx
```

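TLS misconfiguration — A missing or expired TLS secret can break HTTPS without any obvious error on the Ingress resource. Verify that the secret exists and has type kubernetes.io/tls. Note that kubectl describe secret lists only key names and sizes, so it confirms the secret is present but does not show the certificate's expiry date: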

kubectl get secret <tls-secret-name> -n midaz
kubectl describe secret <tls-secret-name> -n midaz

If using cert-manager, check the Certificate resource status:

kubectl get certificate -n midaz
kubectl describe certificate <cert-name> -n midaz
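
To read the expiry date of the certificate stored in a TLS secret, the certificate can be decoded and passed to openssl. This is a sketch under assumptions: the function name tls_secret_expiry is hypothetical, the secret is assumed to keep its certificate under the standard tls.crt key, and openssl and a base64 that accepts --decode are assumed to be installed locally.

```shell
# Hypothetical helper: print the notAfter date of the certificate in a TLS secret.
tls_secret_expiry() {
  kubectl get secret "$1" -n "${2:-midaz}" -o jsonpath='{.data.tls\.crt}' |
    base64 --decode |
    openssl x509 -noout -enddate   # reads the first certificate in the bundle
}
```

The output is a single notAfter= line; a date in the past means the certificate needs to be renewed or reissued.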

PVC stuck in Pending

Symptom: A PersistentVolumeClaim remains in Pending state and the dependent pod cannot start.

Diagnostic commands:

kubectl get pvc -n midaz
kubectl describe pvc <pvc-name> -n midaz
kubectl get storageclass

Common causes and solutions:

No default StorageClass — No StorageClass is marked as default in the cluster.
```
kubectl get storageclass
```
If none shows (default), either create a StorageClass or explicitly set one in your values.yaml for the affected dependency (e.g., postgresql.primary.persistence.storageClass).
Wrong access mode — The StorageClass does not support the access mode requested by the PVC (e.g., ReadWriteMany on a storage driver that only supports ReadWriteOnce). Check the Events section of kubectl describe pvc. Adjust accessModes in your values.yaml to match what your StorageClass supports.
Volume binding mode is WaitForFirstConsumer — Some StorageClasses use delayed binding. The PVC will stay Pending until a pod consuming it is scheduled. This is normal behavior; wait for the pod to be scheduled.
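
The default StorageClass check can be scripted so that it fails loudly when no default exists. The following is a minimal sketch: the function name default_storageclass is hypothetical, and it relies on kubectl get storageclass printing (default) after the name of the default class.

```shell
# Hypothetical helper: print the default StorageClass, or fail if there is none.
default_storageclass() {
  name="$(kubectl get storageclass --no-headers | awk '$2 == "(default)" { print $1 }')"
  if [ -n "$name" ]; then
    echo "$name"
  else
    echo "no default StorageClass found" >&2
    return 1
  fi
}
```

A non-zero exit status here corresponds to the first cause above; in that case set the storage class explicitly in values.yaml or mark one as the default.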

Pod eviction / OOMKilled

Symptom: Pods are repeatedly evicted or show OOMKilled in their last state.

Diagnostic commands:

kubectl get pods -n midaz
kubectl describe pod <pod-name> -n midaz | grep -A10 "Last State"
kubectl top pods -n midaz
kubectl top nodes

Common causes and solutions:

Memory limits set too low — The container’s resources.limits.memory is below what the service actually needs under load. Review the current memory usage with kubectl top pods, then increase the limit in your values.yaml:
```
ledger:
  resources:
    requests:
      memory: "256Mi"
      cpu: "250m"
    limits:
      memory: "512Mi"
      cpu: "500m"
```
Node under memory pressure — The node itself is under pressure and the kubelet is evicting lower-priority pods. Check node conditions:
```
kubectl describe node <node-name> | grep -A5 Conditions
```
Consider adding nodes or enabling the cluster autoscaler. You can also assign a PriorityClass to Midaz pods to make them less likely to be evicted than lower-priority workloads.
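
Before raising a limit, it can help to see which pods are already close to it. The sketch below flags pods whose current memory usage is at or above a percentage of a given limit. The function name near_memory_limit is hypothetical, the limit is passed in by hand rather than read from the pod spec, and only values reported in Mi are handled. Like kubectl top itself, it needs the metrics server to be installed.

```shell
# Hypothetical helper: flag pods using at least PCT percent of LIMIT_MI.
# Usage: near_memory_limit LIMIT_MI [PCT] [NAMESPACE]
near_memory_limit() {
  limit_mi="$1"
  pct="${2:-80}"
  ns="${3:-midaz}"
  kubectl top pods -n "$ns" --no-headers |
    awk -v limit="$limit_mi" -v pct="$pct" '{
      mem = $3                              # MEMORY(bytes) column, e.g. "480Mi"
      if (mem !~ /Mi$/) next                # sketch only handles Mi values
      sub(/Mi$/, "", mem)
      if (mem + 0 >= limit * pct / 100) print $1, $3 " of " limit "Mi"
    }'
}
```

For instance, near_memory_limit 512 80 lists pods using 80% or more of a 512Mi limit, which are the likeliest candidates for the next OOMKilled event.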

RabbitMQ definitions not loaded

Symptom: Midaz services start but transactions fail, queues are missing, or messages are not being processed. Logs may show AMQP connection errors or missing exchanges/queues.

Diagnostic commands:

kubectl get pods -n midaz | grep rabbit
kubectl logs <rabbitmq-pod-name> -n midaz --tail=100
# Check if the bootstrap job ran
kubectl get jobs -n midaz
kubectl logs job/<bootstrap-job-name> -n midaz

Common causes and solutions:

External RabbitMQ missing load_definitions.json — When using an external RabbitMQ instance, the required queues, exchanges, and bindings are not present. Enable the bootstrap job in your values.yaml:

global:
  externalRabbitmqDefinitions:
    enabled: true
    connection:
      protocol: "http"
      host: "your-rabbitmq-host"
      port: "15672"
      portAmqp: "5672"

Or apply the definitions manually:

curl -u <user>:<pass> -X POST -H "Content-Type: application/json" \
  -d @load_definitions.json \
  http://<host>:<port>/api/definitions

The load_definitions.json file is at charts/midaz/files/rabbitmq/load_definitions.json in the Helm repository.

Bootstrap job failed silently — The job ran but encountered an error (wrong credentials, network timeout, wrong port).
```
kubectl logs job/<bootstrap-job-name> -n midaz
```
Verify the rabbitmqAdminLogin credentials and that the management port (default 15672) is reachable from within the cluster.
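
Credentials and reachability of the management port can be checked together with a single request to the RabbitMQ management API. The sketch below interprets the HTTP status of GET /api/overview. The function name check_rabbitmq_management and its argument order are hypothetical, and the result only reflects reachability from wherever the command runs, so run it from a pod inside the cluster to mirror what the bootstrap job sees.

```shell
# Hypothetical helper: probe the RabbitMQ management API.
# Usage: check_rabbitmq_management BASE_URL USER PASSWORD
check_rabbitmq_management() {
  # -w prints only the HTTP status code; curl reports 000 when it cannot connect.
  code="$(curl -s -o /dev/null -w '%{http_code}' -u "$2:$3" "$1/api/overview" || true)"
  case "$code" in
    200) echo "ok: management API reachable and credentials accepted" ;;
    401) echo "auth: credentials rejected" ;;
    000) echo "network: management port not reachable" ;;
    *)   echo "unexpected: HTTP $code" ;;
  esac
}
```

An auth result points at the credentials, while a network result points at the host, port, or a network policy between the job and RabbitMQ.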

Related resources

Deploy Midaz using Helm — Initial installation guide
Upgrading Midaz and plugins via Helm — Upgrade procedures and rollback
Upgrading Helm — Breaking changes and migration paths between major versions
Version compatibility — Version mapping reference
Helm repository — Source code and release notes
