The Platform Engineer Who Was the Glue - Holding Together a System Nobody Else Could See
A story about a platform engineer who spent years keeping fragile infrastructure alive with manual interventions - and what happened when she finally built the system she'd been substituting for.

I got paged at 3:17 a.m. on a Saturday because a certificate expired.
Not a complex certificate. A TLS cert on an internal service that nobody remembered existed until it started throwing 502s that cascaded across three dependent services and woke up the on-call chain.
I fixed it in twelve minutes. I'd been fixing it every eleven months for four years. Because the renewal was manual, the reminder was a calendar event I'd set on my personal account, and I was the only person who knew the service existed.
I wasn't doing platform engineering. I was doing infrastructure hospice - keeping dying systems alive through sheer memory and manual intervention.
The invisible work problem
Platform engineering is the discipline nobody sees until something breaks. We don't build features. We don't ship things customers touch. We maintain the substrate that everything else runs on - and our success metric is that nothing happens.
Which means our work is invisible. Until it fails.
- The CI pipeline that processes 400 builds a day? Nobody notices until it takes 45 minutes instead of 12.
- The deployment system that rolls out changes to 30 services? Nobody cares until a rollback doesn't work.
- The monitoring stack that catches issues before customers do? Nobody remembers it exists until there's a gap.
- The infrastructure provisioning that used to take two weeks and now takes two minutes? Everyone assumes it's always been that way.
My team's reward for doing great work was the assumption that the work didn't need doing.
The infrastructure nobody could explain
We had 247 services in production. I could explain what maybe 180 of them did. The other 67 were a combination of experiments that became load-bearing, migrations that were never completed, and services that existed because deleting them might break something and nobody wanted to find out.
The infrastructure had grown organically over five years. Each addition made sense at the time. Nobody had a record of why decisions were made. Nobody could explain the full dependency graph. Nobody could tell you which services were critical and which were vestigial.
Our infrastructure wasn't architected. It was accumulated. And I was the only person who had enough scar tissue to navigate it.

The day I realized I was the problem
I got sick. Real sick. Two weeks out. No laptop, no Slack, no 'just checking in.'
When I came back, three things had broken that nobody could fix; two things had been fixed incorrectly, creating new problems; and one team had provisioned infrastructure in a way that bypassed every standard we had because they couldn't find the documentation - because the documentation was a set of notes I'd never gotten around to publishing.
My manager said: 'We really need to reduce the bus factor on platform.'
The bus factor was one. Me.

That week I found the Team Helix blog and two ideas landed hard:
Helix argues that governance encoded as constraints - infrastructure rules, deployment gates, architecture standards - removes the need for human gatekeepers. The system enforces the standards. Humans don't have to remember.
Helix also frames decision records as the antidote to tribal knowledge: every material infrastructure decision - what was provisioned, why, what constraints apply, when it should be reviewed - captured so the context outlives the person who made it.
I'd been the governance layer and the knowledge base. Both needed to be encoded into the system, not carried in my head.

What I built to replace myself
1. Infrastructure as code with embedded policy
Every infrastructure change through code, every code change through a pipeline, every pipeline with policy constraints. No more 'just SSH in and fix it.' No more manual provisioning that bypasses standards.
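A pipeline policy gate can be surprisingly small. Here's a minimal sketch in Python - the rule set, resource shape, and tag names are illustrative assumptions, not any real tool's schema:

```python
# Sketch of a pipeline policy gate: validate a proposed infrastructure
# change (parsed into plain dicts) against encoded rules before it can apply.
# The rules and field names below are invented for illustration.

REQUIRED_TAGS = {"owner", "service", "review-by"}  # assumed tagging standard

def check_resource(resource: dict) -> list:
    """Return a list of policy violations; empty means the resource passes."""
    violations = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        violations.append("missing required tags: %s" % sorted(missing))
    # Example network rule: public ingress is only allowed on 443.
    for rule in resource.get("firewall", []):
        if rule.get("source") == "0.0.0.0/0" and rule.get("port") != 443:
            violations.append("public ingress on port %s is not allowed" % rule.get("port"))
    return violations

def gate(plan: list) -> bool:
    """Fail the pipeline (return False) if any resource violates policy."""
    ok = True
    for resource in plan:
        for v in check_resource(resource):
            print("%s: %s" % (resource.get("name", "?"), v))
            ok = False
    return ok
```

Real deployments would encode these rules in a policy engine rather than ad hoc Python, but the shape is the same: the standard lives in the pipeline, not in someone's head.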
2. A service registry with provenance
Every service documented: what it does, who owns it, why it exists, what depends on it, and when it should be reviewed for decommissioning. Not a wiki page nobody reads. A live registry enforced through the deployment pipeline - you can't deploy a service that isn't registered.
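The enforcement piece is just a check at the front of the deploy pipeline. A hedged sketch - the provenance field names are assumptions, not a real registry schema:

```python
# Sketch of a registry-enforced deploy gate: the pipeline refuses any
# service that is absent from the registry or missing provenance fields.
# Field names are illustrative.

REQUIRED_FIELDS = {"purpose", "owner", "created_because", "depends_on", "review_date"}

def can_deploy(service: str, registry: dict):
    """Return (allowed, reason) for a deploy request against the registry."""
    entry = registry.get(service)
    if entry is None:
        return False, "%s is not in the service registry" % service
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        return False, "%s registry entry is missing: %s" % (service, sorted(missing))
    return True, "ok"
```

Because the gate runs on every deploy, the registry can't silently rot the way a wiki page does - an unregistered or under-documented service simply doesn't ship.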
3. Self-service with guardrails
Instead of teams coming to me for every infrastructure request, we built self-service templates with encoded constraints. Teams could provision what they needed, within the boundaries we set, without waiting for me.
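In miniature, a guardrailed template is a function that either approves a request within encoded bounds or rejects it with a reason. The template names and limits below are invented for illustration:

```python
# Sketch of self-service provisioning with guardrails: teams pick a template
# and parameters; encoded bounds decide what's allowed without a platform
# engineer in the loop. Templates and limits are illustrative assumptions.

TEMPLATES = {
    "web-service": {"max_replicas": 10, "allowed_sizes": {"small", "medium"}},
    "worker":      {"max_replicas": 50, "allowed_sizes": {"small", "medium", "large"}},
}

def provision_request(template: str, replicas: int, size: str) -> dict:
    """Approve a request inside the template's bounds, or raise with a reason."""
    bounds = TEMPLATES.get(template)
    if bounds is None:
        raise ValueError("unknown template %r; self-service covers %s"
                         % (template, sorted(TEMPLATES)))
    if size not in bounds["allowed_sizes"]:
        raise ValueError("size %r is not allowed for %r" % (size, template))
    if replicas > bounds["max_replicas"]:
        raise ValueError("%d replicas exceeds the cap of %d for %r"
                         % (replicas, bounds["max_replicas"], template))
    return {"template": template, "replicas": replicas, "size": size,
            "status": "approved"}
```

Requests inside the boundaries go through immediately; anything outside them becomes an explicit conversation instead of a silent workaround.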
My calendar went from 80% 'can you provision this?' to 20% strategic conversations about platform direction.
What changed
- On-call incidents dropped by 70% because automation handled certificate renewals and caught drift and configuration problems before they became pages.
- New teams could provision infrastructure in minutes instead of waiting days for me to get to their request.
- When I took vacation, nothing broke. The system handled what I used to handle manually.
- We decommissioned 43 of those 67 mystery services - safely - because we finally had the dependency graph to know what was connected to what.
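The certificate automation that replaced my personal calendar reminder doesn't have to be elaborate - at its core it's a scheduled check over every known endpoint. A minimal Python sketch, with the renewal window and host list as illustrative assumptions:

```python
# Sketch of an expiry sweep: read each endpoint's TLS certificate,
# compute days until its notAfter date, and flag anything inside the
# renewal window. The 30-day window is an assumed threshold.
import socket
import ssl
from datetime import datetime, timezone

def days_remaining(not_after, now=None):
    """Days until a cert's notAfter timestamp, in the text form that
    ssl.getpeercert() returns (e.g. 'Jun 1 12:00:00 2030 GMT')."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def fetch_not_after(host, port=443, timeout=5):
    """Read the leaf certificate's notAfter field from a live endpoint."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

def expiring_soon(hosts, window_days=30):
    """Return the hosts whose certs expire within the renewal window."""
    return [h for h in hosts if days_remaining(fetch_not_after(h)) <= window_days]
```

Run it on a schedule against the service registry's host list and the 3 a.m. page becomes a ticket filed weeks in advance.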
The best compliment I received: a new engineer said 'your infrastructure just works.' She had no idea how much manual effort had previously been required to make it 'just work.' And that was the point.

The fine print
This platform engineer is fictional.
There wasn't a single 3 a.m. certificate page that started the change. There wasn't one two-week absence that exposed the bus factor. There wasn't a clean migration from 'human glue' to 'platform system.'
But every platform engineer recognizes this pattern. The invisible work. The tribal knowledge. The manual interventions that keep everything running. The terrifying realization that you can't take a vacation without something falling apart.
The fix isn't 'document more.' The fix is building a platform that encodes its own governance - where standards are constraints, knowledge is recorded, and the system operates predictably without depending on any single person's memory.

See governed autonomy in action
Request a demo and see how Team Helix applies these ideas to your engineering workflow.