You know how detective stories—whether in police dramas or medical series like House or The Good Doctor—always have that moment where everything seems fine… until suddenly, it’s not?
Software engineering is no different. If you love detective stories, debugging systems might just be your kind of thrill.
Let me tell you a real story.
Chapter 1: The First Signs of Trouble
A few days ago, a critical system we built started failing. Users reported that some operations weren’t working. No clear error messages. No immediate patterns. Just a lot of complaints.
It felt like being on a plane in clear skies and suddenly hearing the engines sputter—alarming, yet no obvious clue why.
We kicked off our investigation.
- Were the triggers firing correctly?
- Was Lambda executing as expected?
- Was the engine service running properly?
- Was SNS failing to send messages?
- Was the API erroring out somewhere?
Everything seemed to be working. No outright failures. But something was definitely wrong.
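For the Lambda leg of that checklist, the check boiled down to scanning recent invocations for anything error-shaped. A minimal sketch of that kind of pass with boto3, where the log group name is purely illustrative:

```python
import time
import boto3

logs = boto3.client("logs")

# Scan the last hour of the Lambda's log group for error-like messages.
# The log group name is an illustrative placeholder, not our real one.
resp = logs.filter_log_events(
    logGroupName="/aws/lambda/job-dispatcher",
    filterPattern="?ERROR ?Exception",
    startTime=int((time.time() - 3600) * 1000),  # CloudWatch expects milliseconds
)

for event in resp["events"]:
    print(event["timestamp"], event["message"].strip())
```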
To make things more confusing, this was happening in both dev and QA environments.
- If the issue started in QA, why was it now happening in dev too?
- And why weren’t we seeing any obvious failures in the logs?
We checked everything we could reach. We didn't have direct access to the QA job queue, but we could see dev's queue, and the failure was present there too.
This ruled out a one-off deployment bug. It wasn’t just a bad rollout affecting one environment.
Something deeper was wrong.
The system was broken in multiple places, yet there was no smoking gun.
Chapter 2: The Clue in the Logs
After hours of combing through logs, we finally found something:
Lambda was failing to access an SSM parameter due to insufficient permissions.
Strange. The IAM policy showed that Lambda had access. Why was it still failing?
Digging deeper, we noticed something odd:
- The permission was granted at the parent path of the SSM parameter hierarchy.
- But the error was occurring on a specific child parameter under that path.
That’s when it hit us—encryption keys.
Each environment has its own encryption keys. Lambda had permission to read from SSM, but it couldn't decrypt the child parameter, because that parameter was encrypted with a key belonging to a different environment.
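A quick way to confirm that theory is to check which KMS key the failing child parameter is actually encrypted with, then try to read it with decryption under the Lambda's role. A rough sketch with boto3, using a made-up parameter path:

```python
import boto3

ssm = boto3.client("ssm")

# Which KMS key encrypts the child parameter? (the path is hypothetical)
meta = ssm.describe_parameters(
    ParameterFilters=[{"Key": "Name", "Option": "Equals",
                       "Values": ["/app/qa/engine/db-password"]}]
)
print(meta["Parameters"][0].get("KeyId"))  # a SecureString parameter reports its key here

# Reading it with decryption under the Lambda's role reproduces the failure:
# the access-denied error comes from KMS, not from SSM itself.
ssm.get_parameter(Name="/app/qa/engine/db-password", WithDecryption=True)
```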
This explained why it was failing in QA—but why did dev suddenly start failing as well?
That didn’t make sense.
In QA, the problem was clear: Lambda was trying to access a parameter encrypted with a key from a different environment, which it didn’t have access to.
But in dev, this shouldn’t even happen. Unlike QA, dev doesn’t have multiple environments, so the KMS encryption issue shouldn’t exist there at all.
That meant something else was wrong in dev.
We updated the key permissions, redeployed Lambda, and tested the fix.
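The real change lives in our infrastructure code, but the shape of the fix is a decrypt permission on the correct key for the Lambda's execution role. One way to sketch that is a KMS grant; both ARNs below are placeholders:

```python
import boto3

kms = boto3.client("kms")

# Allow the Lambda's execution role to decrypt with the environment's key.
# Both ARNs are hypothetical placeholders.
kms.create_grant(
    KeyId="arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    GranteePrincipal="arn:aws:iam::111122223333:role/job-dispatcher-lambda-role",
    Operations=["Decrypt"],
)
```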
Lambda was now working fine in both dev and QA.
But both environments were still broken.
Chapter 3: The Hidden Adversary
With Lambda fixed, dev should have started working again.
But it was still broken.
At this point, a rollback was tempting, but we had a major release coming up, and reverting now would cause unnecessary panic and delays. Tension was rising—time was ticking away.
We checked the logs again. Nothing new. Everything looked fine.
So we changed strategies. Instead of blindly searching, we actively simulated the issue.
- Created a debugging tunnel.
- Attached a debugger to the running API instance.
- Stopped the engine on the server.
- Ran the engine locally to see what happened.
The API was working fine.
But then we noticed something strange—jobs were getting pulled from the queue but were not hitting our local engine instance.
That meant something else was consuming the jobs before they reached our local instance.
This led us to suspect there might be a hidden engine instance running somewhere.
We checked the server.
And there it was.
A rogue process—an older instance of the engine, still running on the server.
Someone had manually started it—probably for testing—and forgotten to shut it down.
It wasn’t part of our deployment system. It wasn’t updated. And it wasn’t visible in the usual monitoring dashboards.
Our system supports multiple engine instances only when they share the same deployment, so this outdated instance was quietly pulling jobs meant for the current one.
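If you ever hunt a similar ghost, the check itself is unglamorous: list every engine-like process on the box and compare it against what the deployment should actually be running. A rough sketch, where the process name and deployment path are hypothetical:

```python
import subprocess

# List every process whose command line mentions the engine.
# "engine" is a stand-in for the real process name.
out = subprocess.run(
    ["pgrep", "-af", "engine"],
    capture_output=True, text=True, check=False,
)

for line in out.stdout.splitlines():
    pid, _, cmdline = line.partition(" ")
    # Anything not launched from the current deployment directory is suspect.
    if "/opt/deployments/current" not in cmdline:  # hypothetical path
        print(f"possible rogue instance: pid={pid} cmd={cmdline}")
```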
We killed the process.
Dev was now fully operational.
But QA? Still down.
Chapter 4: The Missing Piece
If the rogue process was the problem in dev, was the same thing happening in QA?
We checked for hidden instances in QA. No rogue processes. No extra engines running.
So if there was no hidden instance, why was QA still failing?
That’s when we realized:
QA doesn’t just have the engine service.
It also has the processor service.
And in QA, the processor service runs across multiple swimlanes: each engine server can host several swimlane instances, and the processor service runs on every one of them.
A quick check confirmed our suspicion:
The processor service was stopped on Swimlane 1.
And the reason test jobs were failing?
It wasn’t a coincidence—the testers had explicitly configured all test jobs to run only on Swimlane 1.
So even though other swimlanes were running fine, every single test job was failing, making it look like the entire QA environment was broken.
Why was it stopped?
It turned out that another developer had noticed jobs weren't running in QA.
To debug the issue, they had stopped the processor service to run it locally—but never restarted it.
This was a classic case of a mistaken diagnosis.
- The real issue was in Lambda.
- But since the developer was looking in the wrong place, they thought the processor service was the problem.
- They stopped the processor service, but never restarted it.
And since Lambda had been fixed, but the processor was still down, QA remained broken.
We restarted the processor service on Swimlane 1.
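Assuming the processor runs as a systemd unit (the unit and host names below are made up), checking each swimlane and restarting the stopped one is only a couple of commands per host:

```python
import subprocess

# Hypothetical swimlane hosts and unit name; substitute the real ones.
SWIMLANE_HOSTS = ["qa-swimlane-1", "qa-swimlane-2", "qa-swimlane-3"]
UNIT = "processor.service"

for host in SWIMLANE_HOSTS:
    status = subprocess.run(
        ["ssh", host, "systemctl", "is-active", UNIT],
        capture_output=True, text=True,
    ).stdout.strip()
    print(f"{host}: {status}")
    if status != "active":
        # Restart the stopped instance; bringing Swimlane 1's processor back was the fix here.
        subprocess.run(["ssh", host, "sudo", "systemctl", "start", UNIT], check=True)
```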
Chapter 5: The Final Fix
To summarize, here’s what it took to restore everything:
- Fixing Lambda’s encryption key issue.
- Killing the rogue engine process in dev.
- Restarting the processor service on Swimlane 1 in QA.
The system came back to life.
And with that, the mystery was solved.
Like this blog? You know what to do!