Note: This is a fictionalized account. If it sounds exactly like your company, that’s just because dysfunction has a standard operating procedure.
Eating Crow in the Disk Queue
“Sometimes the fix you mocked last time is the only fix this time.”
We had another P1 today.
Last outage, I gave the team grief for restarting the server before finding the root cause. Restarting without diagnosing felt like hitting CTRL+ALT+DEL on your career. I chided them because, in my mind, we needed to understand the why before we reached for the power button.
Fast forward to today. Guess who was on call?
Me.
The data warehouse SQL Server started gasping for breath. Jobs piled up. Maintenance tasks collided with reporting queries. The disk queue went vertical. It was like watching cars jammed on a one-lane bridge while more and more trucks arrived.
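If you want to watch that one-lane bridge jam up in numbers instead of metaphors, Windows’ built-in typeperf will chart the same counter. A minimal sketch, meant to be run on the guest itself; the counter path is the stock PhysicalDisk one:

```python
# Minimal sketch: chart the counter that "went vertical" using Windows' built-in
# typeperf. Run on the SQL Server guest; no extra packages required.
import subprocess

# Average disk queue length across all physical disks.
COUNTER = r"\PhysicalDisk(_Total)\Avg. Disk Queue Length"

# Sample every 5 seconds, 12 samples (about a minute of data), printed to stdout.
subprocess.run(["typeperf", COUNTER, "-si", "5", "-sc", "12"], check=True)
```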
I dug deeper and found the real problem:
We had the wrong VMware storage controller, and therefore the wrong disk driver: LSI Logic instead of PVSCSI. On a production data warehouse server.
And here’s the kicker. This wasn’t a new mistake. The servers had been built this way since 2019: SIX YEARS of wrong drivers quietly sitting under our data warehouse.
I joined in 2023, and a few months later, after uncovering collation mismatches and other oddities, I asked the data center team about the disk drivers. They told me everything was fine. They were the experts, so I believed them. What they hadn’t done was read the SQL Server best practices guide. And I didn’t yet have the knowledge to challenge their answer.
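“Trust, but verify” turns out to be about twenty lines of Python. Here is a minimal sketch of the check I wish I had run back in 2023, assuming pyVmomi and a read-only vCenter account; the vCenter host, user, and VM name below are placeholders:

```python
# Minimal sketch: ask vCenter which virtual SCSI controller a VM actually has.
# Assumes pyVmomi and a read-only vCenter account; host, user, and VM name
# are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def scsi_controller_types(si, vm_name):
    """Return the class names of the SCSI controllers attached to the named VM."""
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        vm = next(v for v in view.view if v.name == vm_name)
    finally:
        view.DestroyView()
    return [
        type(dev).__name__  # ParaVirtualSCSIController vs. the LSI Logic variants
        for dev in vm.config.hardware.device
        if isinstance(dev, vim.vm.device.VirtualSCSIController)
    ]

if __name__ == "__main__":
    ctx = ssl._create_unverified_context()  # lab shortcut; validate certs in production
    si = SmartConnect(host="vcenter.example.com",
                      user="readonly@vsphere.local",
                      pwd="********",
                      sslContext=ctx)
    try:
        print(scsi_controller_types(si, "DW-SQL01"))
    finally:
        Disconnect(si)
```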
The result: I spent the past two years trying to stabilize a server that was doomed by a silent misconfiguration that had been festering for six.
Under load, the LSI driver just collapsed. The queue backed up, and it didn’t matter what you canceled. Even if we had disabled every job, the backlog of outstanding I/O would still have taken hours to drain through that same choked queue.
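You can see that kind of backlog from inside SQL Server too, via the pending-I/O DMVs. A minimal sketch of a watcher, assuming pyodbc; the server name and ODBC driver string are placeholders:

```python
# Minimal sketch: poll SQL Server's pending-I/O DMVs to watch the backlog.
# Assumes pyodbc; the server name and ODBC driver string are placeholders.
import time
import pyodbc

# Pending I/O requests joined to the file they belong to (a standard DMV pattern).
QUERY = """
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       pio.io_type,
       pio.io_pending_ms_ticks
FROM sys.dm_io_pending_io_requests AS pio
JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
  ON pio.io_handle = vfs.file_handle
ORDER BY pio.io_pending_ms_ticks DESC;
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=DW-SQL01;"
    "Trusted_Connection=yes;",
    autocommit=True)

try:
    while True:
        rows = conn.execute(QUERY).fetchall()
        oldest_ms = rows[0].io_pending_ms_ticks if rows else 0
        # On a healthy server this list is usually empty or close to it;
        # a steadily growing count with ever-older entries means the storage
        # path underneath SQL Server is not keeping up.
        print(f"{len(rows)} pending I/Os; oldest has waited {oldest_ms} ms")
        time.sleep(5)
finally:
    conn.close()
```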
The only way out? The exact fix I had mocked: cancel everything and restart the server.
So that’s what I did. And then I ate crow.
I apologized to the team and admitted, out loud, that apparently I can be wrong. I even said it as a nudge, because everyone thinks I think I’m always right.
The lesson? Trust, but verify. Expertise without the manual is just confidence in costume. And sometimes, the crow you eat is served hot from a VMware driver misconfiguration.
The Illusion of Stability
The astounding part is that, despite the wrong driver, we still managed to make the server mostly stable.
When I started in 2023, Idera was screaming red alerts every single day. With indexing, maintenance, and query rewrites, we pushed it down into “constant yellow with only occasional spikes of red.”
Of course, when the red came back, it was RED: the kind of red that stops everything.
So we weren’t fixing the real problem. We were just painting over it in warmer shades.
Epilogue
We’re now pushing the data center team to correct the driver. Six years is long enough. With PVSCSI in place, this particular flavor of outage shouldn’t ever return.
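Once the change actually lands, a quick guest-side sanity check is cheap insurance. A minimal sketch using the third-party wmi package; the exact controller display names vary by Windows and VMware versions, so the expectations in the comments are assumptions, not gospel:

```python
# Minimal sketch: confirm, from inside the Windows guest, which storage
# controllers (and drivers) the OS actually sees. Assumes the third-party
# 'wmi' package; display names vary by Windows and VMware versions.
import wmi

def report_scsi_controllers():
    """Print every SCSI controller Windows reports, with its driver name."""
    conn = wmi.WMI()
    for ctrl in conn.Win32_SCSIController():
        # After the swap we expect a PVSCSI entry here rather than an LSI Logic
        # one; the exact strings are assumptions, so eyeball the output.
        print(f"{ctrl.Name}  (driver: {ctrl.DriverName})")

if __name__ == "__main__":
    report_scsi_controllers()
```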
At least next time I eat crow, it will be for a different mistake.
Disclaimer: Any resemblance to actual companies, teams, or outages is coincidental. Because, surely, no real organization would leave a production data warehouse on the wrong VMware driver for six years.
Fake Sponsor Break
This post is not sponsored by NordVPN. If you use my link I may earn a commission.
When your systems are unreliable, at least your connection does not have to be.
Affiliate disclosure: If you click and purchase, I may earn a commission. NordVPN does not endorse this post.