Designing Human-in-the-Loop for Agentic LLM Workflows
Two patterns I shipped at Delo and the decision logic behind them.
Context
Where Delo came from, and how I found the wedge.
Delo is Wonder's bet on agentic AI. Wonder is Deloitte Canada's internal product incubator, structured like an internal VC: a partner sponsor pitches an idea to a funding committee, and if approved, gets runway to assemble a team. Delo's pitch was to build AI agents that automate high-effort accounting workflows. The team ran on quarterly funding tranches against outcome-based milestones. Year 1 came to roughly $1M.
I joined as founding PM after the project had lost its anchors. The previous sponsor had left to start his own company, the director of product was on leave, and the team had run through two pivots without converging. The new sponsor, a former CFO at Hopper, brought me in to find product-market fit before the next funding gate.
I spent the first six weeks on two tracks, working alongside Delo's other cofounder with our partner sponsor in support. With the build team, I ran one-on-ones and an on-site to understand who was carrying context and where momentum had broken. With my business counterpart, I ran 20+ discovery interviews across CFOs, controllers, and bookkeepers to find a wedge. The most consistent pain was not the categorization work everyone talks about. It was reconciliation: matching Stripe payouts against QuickBooks entries at month-end. Controllers were spending twenty or more hours a month on it, manually. That became the foundation for our first agent.
The two HITL patterns we shipped against that pain share a central design choice: trust flows through provenance, not through confidence.
Cross-platform reconciliation
Provenance-based escalation. Trust the document, not the model.
The first agent reconciled Stripe payouts to QuickBooks entries. Each morning, Finn (the AI accounting agent we built at Delo) called Flinks (a banking aggregator) to retrieve any new transactions from the client's bank account since the last run. We pulled from the bank rather than directly from QBO (QuickBooks Online, the cloud accounting platform most of Delo's SME clients used to manage their books) because QuickBooks does not expose uncategorized transactions via API. Once retrieved, Finn matched those transactions against open bills and entries in QBO, then wrote results back to the ERP (Enterprise Resource Planning system; in Delo's context, the client's accounting software, typically QuickBooks, that serves as the system of record for financial data): creating or updating payment objects, updating bill objects, and logging decisions to the vector store. In pilot, the 20-hour monthly task came down to about 15 minutes.
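To make the matching step concrete, here is a minimal sketch of pairing a bank transaction with an open bill. It is an illustration, not Finn's actual matching logic: the class names, the `match_to_open_bill` function, the exact-amount rule, the vendor-name substring check, and the five-day date window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional, Sequence


@dataclass(frozen=True)
class BankTransaction:
    txn_id: str
    posted_on: date
    amount: Decimal       # absolute value of the payment (assumption)
    description: str      # raw description line from the bank feed


@dataclass(frozen=True)
class OpenBill:
    bill_id: str
    vendor: str
    due_on: date
    amount: Decimal


def match_to_open_bill(
    txn: BankTransaction,
    bills: Sequence[OpenBill],
    max_days_apart: int = 5,  # hypothetical tolerance, not a Delo setting
) -> Optional[OpenBill]:
    """Return the one open bill that explains the transaction, or None.

    Assumed rule: same amount, vendor name found in the bank description,
    and dates within max_days_apart. Zero or several candidates returns
    None, so an ambiguous match is never silently picked.
    """
    candidates = [
        bill
        for bill in bills
        if bill.amount == txn.amount
        and bill.vendor.lower() in txn.description.lower()
        and abs((txn.posted_on - bill.due_on).days) <= max_days_apart
    ]
    return candidates[0] if len(candidates) == 1 else None
```

Returning None on ambiguity is deliberate in this sketch: a transaction with two plausible bills has no single document to point to, so it should not be treated as matched.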
The hard design question was not whether to use an agent but what the controller sees when she opens her reconciliation queue. In discovery interviews, controllers never used the language of confidence scores when describing what they needed. The recurring question was "what does it match to," not "how sure is it." We ran multiple usability sessions with Deloitte accountants and external controllers to pressure-test this early, and the pattern was consistent: when a prototype surfaced a confidence score, controllers either ignored it or immediately asked what it matched to, while a prototype that surfaced the source document got acted on. That distinction became the design constraint.
A controller cannot present results to her CFO that she cannot defend. If the agent reconciles a Stripe payout against a QBO entry, she needs to know why the agent matched them: "Matched against Bill #1234 in QBO, dated October 14, for Acme Co." is defensible in a way that "the model is highly confident" never will be, because the audit trail knows how to handle the first answer and has no place to put the second.
The escalation rule was binary. If the agent could match a transaction to a hard document (a bill, a receipt, a payout statement), the match was processed automatically. The match still surfaced in the review queue, rolled up under an optional tier the controller could open if she wanted. If the agent could not match to a hard document and had to fall back on inference alone, the match was flagged for required review.
The escalation rule keyed on the document rather than the model. The controller did not have to trust the agent; she had to trust the document, and the document was something her audit process already knew how to handle.
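The binary rule can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the shipped implementation: `Provenance`, `ProposedMatch`, `ReviewTier`, and `review_tier` are hypothetical names, and the set of hard document kinds simply mirrors the examples given above (a bill, a receipt, a payout statement).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewTier(Enum):
    OPTIONAL = "optional"   # processed automatically, still visible in the queue
    REQUIRED = "required"   # flagged, a human must review it


# Document kinds treated as "hard" evidence (taken from the examples in the text).
HARD_DOCUMENT_KINDS = frozenset({"bill", "receipt", "payout_statement"})


@dataclass(frozen=True)
class Provenance:
    kind: str        # e.g. "bill"
    reference: str   # e.g. "Bill #1234 in QBO, dated October 14, for Acme Co."


@dataclass(frozen=True)
class ProposedMatch:
    txn_id: str
    # None means the agent fell back on inference alone.
    provenance: Optional[Provenance]


def review_tier(match: ProposedMatch) -> ReviewTier:
    """Route on what the match points to, never on a model score."""
    if match.provenance is not None and match.provenance.kind in HARD_DOCUMENT_KINDS:
        return ReviewTier.OPTIONAL
    return ReviewTier.REQUIRED
```

Note what the sketch leaves out: there is no confidence field anywhere in the data model, so nothing downstream can route on one.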
Transaction categorization
Match-type review. Hide confidence. Show provenance.
The second agent automated transaction categorization, tagging raw expenses, deposits, and transfers with the right general ledger codes, tax treatments, and metadata. It is the most tedious and error-prone task in the month-end cycle, and in pilots Finn auto-categorized 70 to 80 percent of transactions, leaving the remaining 20 to 30 percent for human review. For a controller managing a month-end cycle of 100 to 300 transactions, that meant reviewing tens of items rather than the full ledger, and the ratio improved as the vector store built up client history.
Each morning Finn called Flinks (a banking aggregator) to retrieve any new transactions since the last run. Finn then queried those transactions against historical data in the ERP and a vector store of past client categorizations (a searchable database of every categorization decision the agent had previously made for that client), using prompt engineering developed with subject matter experts (accountants and controllers) to sort them. High-confidence matches (an open bill in QBO, a known vendor, a prior identical transaction) were grouped for automatic or batch approval, while low-confidence items (no historical reference, no document match, above the materiality threshold) were flagged for human review. Users approved directly in Finn's workspace without leaving the platform, and once approved, Finn wrote results back to the ERP, creating or updating payment objects, updating bill objects, and logging decisions to the vector store. The agent got more accurate the longer it ran on a given client.
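A rough sketch of the grouping step follows. The names (`MatchType`, `Categorization`, `build_review_queue`) are hypothetical, and one reading of the rule is assumed: an item goes to human review if it has no reference at all or if its amount exceeds the materiality threshold, and is otherwise grouped by match type for batch approval. The real routing may have combined these signals differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Dict, Iterable, List


class MatchType(Enum):
    OPEN_BILL = "open_bill"
    KNOWN_VENDOR = "known_vendor"
    PRIOR_IDENTICAL = "prior_identical_transaction"
    NONE = "no_reference"


@dataclass(frozen=True)
class Categorization:
    txn_id: str
    amount: Decimal
    gl_code: str          # proposed general ledger code
    match_type: MatchType


NEEDS_REVIEW = "needs_review"


def queue_group(item: Categorization, materiality_threshold: Decimal) -> str:
    """Name the queue group for one item (assumed rule, see lead-in)."""
    if item.match_type is MatchType.NONE:
        return NEEDS_REVIEW
    if abs(item.amount) > materiality_threshold:
        return NEEDS_REVIEW
    return item.match_type.value


def build_review_queue(
    items: Iterable[Categorization], materiality_threshold: Decimal
) -> Dict[str, List[Categorization]]:
    """Group proposed categorizations by match type, not by a score."""
    queue: Dict[str, List[Categorization]] = defaultdict(list)
    for item in items:
        queue[queue_group(item, materiality_threshold)].append(item)
    return dict(queue)
```

Grouping by match type means a batch reads as "these twelve all match open bills", which a reviewer can approve together, rather than as a ranked list of scores.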
The HITL design choice that mattered was how we organized the review queue. We did not expose confidence in the UI at all, and instead organized everything by match type.
When the controller opened a flagged transaction, the UI showed two things: the categorization Finn proposed, and the rationale behind it. If the rationale referenced a similar past transaction from the vector store, the document was linked. The controller's review action (accept, modify, override) fed back into the vector store as a training signal for the next similar case.
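The feedback loop can be sketched as follows. This is an assumption-laden illustration: `DecisionLog` is an in-memory stand-in for the vector store (no embedding or similarity search is modeled), and `Proposal`, `ReviewAction`, `record`, and `latest_for_vendor` are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ReviewAction(Enum):
    ACCEPT = "accept"
    MODIFY = "modify"
    OVERRIDE = "override"


@dataclass(frozen=True)
class Proposal:
    txn_id: str
    vendor: str
    proposed_gl_code: str
    rationale: str                    # the reasoning shown to the controller
    source_reference: Optional[str]   # linked past transaction or document, if any


@dataclass
class DecisionLog:
    """In-memory stand-in for the store of past categorization decisions."""

    records: List[Dict[str, Optional[str]]] = field(default_factory=list)

    def record(
        self,
        proposal: Proposal,
        action: ReviewAction,
        final_gl_code: Optional[str] = None,
    ) -> Dict[str, Optional[str]]:
        """Store the human decision so later proposals can cite it."""
        if action is ReviewAction.ACCEPT:
            final = proposal.proposed_gl_code
        elif final_gl_code is None:
            raise ValueError("modify/override needs the reviewer's GL code")
        else:
            final = final_gl_code
        entry = {
            "txn_id": proposal.txn_id,
            "vendor": proposal.vendor,
            "gl_code": final,
            "action": action.value,
            "source_reference": proposal.source_reference,
        }
        self.records.append(entry)
        return entry

    def latest_for_vendor(self, vendor: str) -> Optional[Dict[str, Optional[str]]]:
        """Most recent human-confirmed decision for a vendor, if one exists."""
        for entry in reversed(self.records):
            if entry["vendor"] == vendor:
                return entry
        return None
```

The point of storing the reviewer's final code rather than the agent's proposal is that the next rationale can cite a decision a human actually made.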
We deliberately did not show a confidence number. A controller does not trust "0.87 confidence"; she trusts "matched against Bill #1234 from Vendor X dated October 14, with this reasoning trail."
The other reason to avoid confidence scores is that they are not standardized. Different models compute confidence differently, and many teams shortcut the problem entirely by simply prompting the model to output a confidence rating. That number is not a calibrated probability but the model's expressed certainty about its own answer, which can be systematically overconfident on unfamiliar inputs. Routing a reviewer's attention through a number the product team itself cannot fully define is a design failure waiting to happen.
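One way to see the gap between stated certainty and a calibrated probability is to bucket a model's self-reported scores and compare each bucket's average score with how often the answers were actually right. The sketch below uses a standard binned calibration check; the function names and the five-bin default are choices made for this example, and it is not something the Delo team is described as having run.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (stated confidence in [0, 1], whether the answer turned out correct)
Prediction = Tuple[float, bool]


def calibration_table(
    predictions: Iterable[Prediction], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean stated confidence, observed accuracy, count) per non-empty bin."""
    bins: Dict[int, List[Prediction]] = defaultdict(list)
    for confidence, correct in predictions:
        # A confidence of exactly 1.0 is placed in the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    rows = []
    for index in sorted(bins):
        members = bins[index]
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        rows.append((mean_confidence, accuracy, len(members)))
    return rows


def expected_calibration_error(
    predictions: Iterable[Prediction], n_bins: int = 5
) -> float:
    """Count-weighted average gap between stated confidence and accuracy."""
    rows = calibration_table(predictions, n_bins)
    total = sum(count for _, _, count in rows)
    if total == 0:
        return 0.0
    return sum(count / total * abs(conf - acc) for conf, acc, count in rows)
```

A score near zero means the stated numbers track reality; a large value means "0.9" does not mean "right nine times in ten", which is the failure mode described above.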
A practical constraint
QuickBooks does not expose unposted transactions via API. It was deliberate.
Halfway through building the reconciliation agent we discovered that QuickBooks does not expose unposted transactions via API, which meant Finn could not see new transactions as they arrived. The workaround used QBO's Rules engine to auto-post transactions to placeholder accounts, but it added per-instance configuration that did not scale. The fix was Flinks: pulling directly from the client's bank account each morning bypassed the QBO API entirely and gave us cleaner, earlier data. QuickBooks later shipped their own v1 accounting agent, confirming the restriction was protecting their surface, not an oversight. The pattern generalizes: as agents become more capable, they threaten the platforms they sit on, and those platforms respond by gating data. The right architecture sources from the cleanest available system of record, which in most regulated workflows is the bank rather than the ERP.
What this taught me
Two ideas I am taking forward.
The lesson that travels is that controllers and accountants do not orient around the model's certainty about itself when they make decisions they have to defend. They orient around documents, records, something they can point to. Routing the trust contract through the document, and exposing the model's reasoning about the document rather than its certainty of the answer, makes the agent's output something a user can actually defend to their CFO or in an audit.
HITL is a contract term, not a product feature. The harder design question is never whether to add human review to an agentic workflow, but what to surface, when to surface it, and what to do with the signal once a human acts on it. Get that wrong and the product either oversteps and breaks trust, or understeps and relocates the work from the agent back to the human with a worse interface.
I left Delo in August 2025 for a PM role at OPENLANE, and the Delo team has carried the work forward since.