<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://abhishek-bhatta.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://abhishek-bhatta.github.io/" rel="alternate" type="text/html" /><updated>2026-04-12T03:41:41+00:00</updated><id>https://abhishek-bhatta.github.io/feed.xml</id><title type="html">My Work Blog</title><subtitle>Notes on AI-assisted development — workflows, failure modes, and what actually works.</subtitle><entry><title type="html">Eliminating Cold Starts</title><link href="https://abhishek-bhatta.github.io/2026/04/03/eliminating-cold-starts/" rel="alternate" type="text/html" title="Eliminating Cold Starts" /><published>2026-04-03T00:00:00+00:00</published><updated>2026-04-03T00:00:00+00:00</updated><id>https://abhishek-bhatta.github.io/2026/04/03/eliminating-cold-starts</id><content type="html" xml:base="https://abhishek-bhatta.github.io/2026/04/03/eliminating-cold-starts/"><![CDATA[<p>The internal postmortem automation in the last post eventually worked but that wasn’t the problem I’d been trying to solve. Customer-facing postmortems were a different kind of challenge. When an incident happened, we owed our customers a written explanation — what broke, why, and what we’d done to prevent it. We needed to convey what we had learnt from the postmortem and how we planned to evolve our product and our processes. The language had to be precise yet readable. More importantly, we wanted customers to trust that they were in good hands even after we made mistakes.</p>

<h2 id="a-simpler-problem">A Simpler Problem</h2>

<p>Technical writing ability on the team varied, as it does on any global team with diverse backgrounds. Engineers with customer-facing experience delivered postmortems with fewer review cycles and rewrites. Everyone was improving steadily, of course, but our incident rate demanded faster growth, which is challenging when writing and communication are skills that develop with experience and time. As the incident rate grew, delivery slowed and bottlenecks appeared in both drafting and reviewing. I’d learnt a lot from my previous automation attempts and was eager to try to optimize this.</p>

<p>The customer-facing postmortem had a few things working in my favor. It didn’t need a lot of tool calls: I wasn’t assembling a timeline from Slack channels and Freshdesk tickets. I already had the internal postmortem; I just needed to read it, pull in one or two customer-facing tickets, and translate the content for a different audience. There were also several published postmortems from which I could extract the overall writing style, along with good and bad examples.</p>

<h2 id="reducing-activation-energy">Reducing Activation Energy</h2>

<p>I skipped the free-form-first approach and went straight to phase-by-phase generation — one paragraph, correct it, move on. The feedback loop was the same: generate, refine in Confluence, feed the diff back to Claude, ask it to generate a new prompt, update the project instructions.</p>

<p>Four iterations. That was enough to get the output to a consistent floor — not polished, not deliverable as-is, but good enough that anyone could start from it. Starting a customer-facing postmortem involved much less ceremony and no longer felt like a monumental task. When I shared it with the team, the impact wasn’t that the output got dramatically better. It was that the variance dropped. First drafts arrived faster and review cycles shortened.</p>

<p>The automation didn’t make anyone a better writer — it gave everyone a starting point that wasn’t a blank page.</p>

<h2 id="where-the-pattern-broke">Where the Pattern Broke</h2>

<p>I tried the same approach on internal ticket replies, like status updates. Give Claude the ticket, generate a draft, correct it, feed the diff back.</p>

<p>Much to my surprise, it didn’t work. Every ticket had its own story. Every ticket involved a different set of stakeholders. The history of each ticket was unique and so were the decisions at different points in time. There was no template to converge on, so the prompt never stabilized. I spent significant time iterating on a skill that kept needing to be rewritten for each ticket. I eventually abandoned it.</p>

<p>The postmortem automation worked because there was an underlying structure to converge on. Ticket replies didn’t have one. That distinction mattered more than I’d initially assumed. Many months later, I realized an important lesson — if a decision is unique every time, make it easier for a user to decide. In short, give them the <em>context</em> they need but don’t automate the outcome. We’ll explore this in a future post.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[The internal postmortem automation in the last post eventually worked but that wasn’t the problem I’d been trying to solve. Customer-facing postmortems were a different kind of challenge. When an incident happened, we owed our customers a written explanation — what broke, why, and what we’d done to prevent it. We needed to convey what we had learnt from the postmortem and how we planned to evolve our product and our processes. The language had to be precise yet readable. More importantly, we wanted customers to trust that they were in good hands even after we made mistakes.]]></summary></entry><entry><title type="html">The First Automation</title><link href="https://abhishek-bhatta.github.io/2026/03/31/the-first-automation/" rel="alternate" type="text/html" title="The First Automation" /><published>2026-03-31T00:00:00+00:00</published><updated>2026-03-31T00:00:00+00:00</updated><id>https://abhishek-bhatta.github.io/2026/03/31/the-first-automation</id><content type="html" xml:base="https://abhishek-bhatta.github.io/2026/03/31/the-first-automation/"><![CDATA[<p>Our incident postmortem process had always been challenging to automate. At Workato’s scale, things break in nontrivial ways, and the postmortem was how we made sense of it. The goal was always to automate the toil — copying incident details from Slack channels and Jira tickets into a document by hand felt unnecessary. But the real constraint was depth and quality. 
We wanted to compile as many learnings from an incident as we could. We wanted timeline entries detailed enough to extract reliable metrics: not just that an incident happened, but how it evolved, where time was lost, what ideas we missed, and how to lay a foundation for a better future. I’d tried to automate it once before, and we briefly had an intern build a custom solution, but we deprecated it soon after the intern left. Nobody had made it work.</p>

<h2 id="what-i-had-tried">What I had tried</h2>

<p>Back in 2023, I’d tried to automate parts of document generation. I discovered that Confluence’s API was borderline unusable: parsing the document format it expected required a library and a lot of patience. Even as an expert Workato recipe builder, I found doing this well in a short timespan challenging. My first attempt got as far as filling in the basics — channel name, ticket links, date. Anything that required reading details and making sense of them, I gave up on. So, for a long time, RCA documents were filled in mostly by hand.</p>
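<p>For context, here is a minimal Python sketch of the payload shape Confluence’s REST API expects for a page update (the page title, version number, and body here are hypothetical). The body has to be valid storage-format XHTML rather than plain text or markdown, which is where most of the pain lived:</p>

```python
import json

def build_page_update(title, version_number, body_xhtml):
    """Build the JSON payload Confluence's REST API expects for a page update.

    The body must be valid "storage format" XHTML; plain text or markdown
    is rejected, which is a large part of why scripting this is tedious.
    """
    return {
        "type": "page",
        "title": title,
        "version": {"number": version_number},  # must be current version + 1
        "body": {
            "storage": {
                "value": body_xhtml,
                "representation": "storage",
            }
        },
    }

# The payload would be PUT to /rest/api/content/{page_id} with auth headers.
payload = build_page_update(
    "Incident RCA (example)",
    7,
    "<h2>Timeline</h2><p>First report received at 09:14 UTC.</p>",
)
print(json.dumps(payload, indent=2))
```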

<p>I’d used LLMs a year earlier and had been mostly unimpressed. My early experiments ended in generated content that I threw away, but I was still eager to find some solution that could reduce toil.</p>

<h2 id="the-first-iteration">The First Iteration</h2>

<p>Our organization unlocked Claude for everyone sometime last year, and I felt that it was a good chance to try automating postmortems again. I created a Claude project and wrote a simple prompt: read the template, read the Slack incident channel, pull in any relevant Jira and Freshdesk tickets, populate the document. Ask me before writing anything.</p>

<p>The first session was a long free-form conversation. Claude fetched things, generated text, asked questions, looped back. Eventually it produced something, but it didn’t look like any RCA we’d written before. The format was off. The framing was off. The first try wasn’t promising.</p>

<p>I also made a separate mistake early on. Once the project was set up, I decided to overhaul the RCA template — there were things I’d wanted to fix for years. What I didn’t know was that other teams had internal automations built on top of the existing structure, and my change broke them. I confused people who’d been using the old format. I reverted within a day. I’d caused a headache, but I learnt a valuable lesson.</p>

<h2 id="an-unexpected-roadblock">An Unexpected Roadblock</h2>

<p>The incident timeline needed to show when customers first reported the issue, and that information lived in Freshdesk, not Slack. Much to my dismay, Freshdesk doesn’t provide a native MCP server. But I’d been moving fast and I wanted to carry that momentum.</p>

<p>Since I work at Workato and we’d launched an MCP server offering earlier that year, I decided to poke around. Our platform can expose API collections as MCP servers. I didn’t want to build a full connector or design a set of recipes from scratch, so I looked for an OpenAPI spec. Freshdesk doesn’t publish an official one, but I found an unofficial spec in about ten minutes. I imported it, cleaned it up slightly, and had the MCP server running in fifteen minutes. That went way better than I’d expected, and I was working on RCA generation again within half an hour.</p>
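<p>The unofficial spec mostly wraps plain REST endpoints. Here is a hedged Python sketch of the kind of request it describes, with a hypothetical domain and API key; Freshdesk’s v2 API authenticates with the API key as the basic-auth username and a placeholder password:</p>

```python
import base64
import urllib.request

def build_ticket_request(domain, api_key, ticket_id):
    """Build (but don't send) a Freshdesk v2 API request for one ticket.

    Freshdesk uses HTTP basic auth with the API key as the username and
    any placeholder (conventionally "X") as the password.
    """
    url = f"https://{domain}.freshdesk.com/api/v2/tickets/{ticket_id}"
    token = base64.b64encode(f"{api_key}:X".encode()).decode()
    return urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )

# Hypothetical domain, key, and ticket id for illustration.
req = build_ticket_request("example", "MY_API_KEY", 1042)
print(req.full_url)  # https://example.freshdesk.com/api/v2/tickets/1042
```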

<p>One catch: our production Claude instance is locked down — adding custom MCP servers requires an approval process. So once I’d finished my RCA and wanted to share my Claude project with the team, I submitted an internal ticket. Getting the MCP approved and deployed took two days, since the approvers are distributed across the globe. Once it was in production, my team and the support team could use Claude with the Freshdesk MCP to speed up their work. Workato’s governance controls meant we could gate access by team, preventing unauthorized access.</p>

<p>That gap stuck with me. Fifteen minutes to build. Two days to deploy. I’d never experienced this kind of speed before, and I was determined to keep going.</p>

<h2 id="refining-the-output">Refining the Output</h2>

<p>I found that Claude was way too interested in exploring unrelated channels and context, so I added approval gates for tool calls — before Claude explored a Slack channel or fetched a ticket, I had to confirm it was relevant. This stopped the wandering. I also found that reviewing a single, large document was painful, so I broke generation into phases: I reviewed the output one section at a time, corrected it, and moved on. Only after the full document was approved did Claude write anything to Confluence.</p>
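<p>The gating lived in Claude’s tool-approval flow rather than in code I wrote, but the idea reduces to a tiny wrapper. A hypothetical sketch, with stand-in actions in place of real MCP tool calls:</p>

```python
def approval_gate(describe, action, confirm=input):
    """Run `action` only after an explicit yes; otherwise skip it."""
    answer = confirm(f"About to {describe}. Proceed? [y/N] ")
    if answer.strip().lower() != "y":
        return None
    return action()

# In the real sessions the actions were MCP tool calls (read a Slack
# channel, fetch a Jira or Freshdesk ticket); these are stand-ins.
fetched = approval_gate(
    "read #incident-1234",
    lambda: "channel history...",
    confirm=lambda _prompt: "y",  # auto-approve for this demo
)
skipped = approval_gate(
    "read #random",
    lambda: "unrelated chatter",
    confirm=lambda _prompt: "n",  # declined: the tool call never runs
)
print(fetched)  # channel history...
print(skipped)  # None
```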

<p>The process felt slower, but it was more reliable. That reliability was the foundation: I wasn’t undoing mistakes made three sections ago. The key idea I kept coming back to was “Slow is smooth, smooth is fast.”</p>

<h2 id="the-feedback-loop">The Feedback Loop</h2>

<p>At the end of each session, I was looking at another gap: the document Claude had generated versus the document I’d actually wanted. That gap contained information — specific corrections, things I’d rephrased, structural changes I’d made. I started feeding that diff back to Claude and asking it to analyze what had changed and why, and then used those differences to generate an updated prompt. I’d put the updated prompt back into the project instructions. Two or three iterations, and the output was consistent enough to share with the team.</p>
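<p>The loop itself is simple enough to sketch in Python. The file names and prompt wording below are illustrative, since in practice I pasted the diff into the Claude project by hand:</p>

```python
import difflib

def correction_diff(generated, final):
    """Unified diff between what the model produced and what I shipped."""
    return "".join(difflib.unified_diff(
        generated.splitlines(keepends=True),
        final.splitlines(keepends=True),
        fromfile="generated.md",
        tofile="final.md",
    ))

def build_refinement_prompt(diff):
    """Wrap the diff in a request for updated project instructions."""
    return (
        "Below is a diff between your draft and the version I published.\n"
        "Analyze what changed and why, then rewrite the project\n"
        "instructions so the next draft needs fewer of these edits.\n\n"
        + diff
    )

# Toy example: one corrected sentence stands in for a whole document.
diff = correction_diff(
    "Root cause: a bug.\n",
    "Root cause: a race between the retry worker and the scheduler.\n",
)
prompt = build_refinement_prompt(diff)
print(prompt)
```

The diff carries exactly the information the session produced for free: every rephrasing and structural change, with no extra effort to write it up.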

<p>The prompt loop was the thing worth keeping. Every completed RCA made the next one slightly better, without me having to work out what to change. But that presented me with a new problem: could I automate that too? Spoiler alert: it wasn’t easy.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Our incident postmortem process had always been challenging to automate. At Workato’s scale, things break in nontrivial ways, and the postmortem was how we made sense of it. The goal was always to automate the toil — copying incident details from Slack channels and Jira tickets into a document by hand felt unnecessary. But the real constraint was depth and quality. We wanted to compile as many learnings from an incident as we could. We wanted timeline entries detailed enough to extract reliable metrics: not just that an incident happened, but how it evolved, where time was lost, what ideas we missed, and how to lay a foundation for a better future. I’d tried to automate it once before, and we briefly had an intern build a custom solution, but we deprecated it soon after the intern left. Nobody had made it work.]]></summary></entry></feed>