A mid-market logistics SaaS had a 14-person manual review queue for incoming shipping documents. Days-long backlog. Enterprise complaints. We built an LLM pipeline that handles the messy 30% the off-the-shelf vendors couldn't, and the queue cleared in week one.
Document volume had doubled in 18 months. Same-day processing became a 3-day backlog. Two enterprise contracts had filed quality-of-service complaints. One was actively shopping replacement vendors.
The client provided freight management software to mid-market shippers. Incoming shipping documents (bills of lading, customs forms, proof-of-delivery scans, weight tickets) needed structured extraction before they surfaced in the shipper's ops dashboard. That extraction was the 14-person review team.
The CTO had inherited the problem six months earlier and had already tried two off-the-shelf vendors. Both passed the demo. Neither held up in production. The pattern was consistent: the clean 70% of documents were fine. The other 30% (bad scans, warehouse phone photos, partial pages) broke every OCR-first system they evaluated.
The CTO's words in our first call: "We tried two solutions. Demos looked good. Neither vendor's tech worked on the 30% of our documents that are bad scans, partial pages, or photos taken on warehouse phones. We needed something that handled the messy 30%, not the clean 70%."
Path A, swapping one OCR vendor for another: rejected on sight. The same demos we’d already sat through, the same failure modes on the messy 30%. A new logo on the same OCR-first approach.
Path B, a custom LLM pipeline: higher build cost, higher long-term ROI, but only if we could prove exception handling on their hardest real documents in three weeks. This was the bet.
Path C, tooling to make the 14 people faster instead of fewer: cheaper to build, lower ceiling, and the queue stays a team.
We pitched Path B as a prototype-first engagement with an explicit kill criterion: if a 3-week prototype couldn't hit 92% straight-through on a held-out set of their hardest documents, we'd recommend Path C and they'd pay only for the prototype phase.
The prototype hit 94.3%. We moved into the production build.
Every document enters at the top. High-confidence records flow straight to the dashboard. Low-confidence records surface to the exception team with only the uncertain fields highlighted, not the whole document.
The first stage classifies document type (BOL, customs, POD, weight ticket) and quality (clean / messy / un-OCRable) in a single pass, then routes each document to the right processing path immediately.
Clean documents run through the existing AWS Textract OCR; messy documents through GPT-4o vision with a structured output schema validated against business rules.
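The classify-then-route step can be sketched roughly as below. This is an illustration, not the production code: the enum names, the `choose_path` function, and the path labels are all hypothetical, and sending un-OCRable documents straight to human review is an assumption the write-up above does not confirm.

```python
from dataclasses import dataclass
from enum import Enum


class DocType(Enum):
    BOL = "bill_of_lading"
    CUSTOMS = "customs_form"
    POD = "proof_of_delivery"
    WEIGHT_TICKET = "weight_ticket"


class Quality(Enum):
    CLEAN = "clean"
    MESSY = "messy"
    UN_OCRABLE = "un_ocrable"


@dataclass(frozen=True)
class Classification:
    """Result of the single-pass classifier: document type plus scan quality."""
    doc_type: DocType
    quality: Quality


def choose_path(c: Classification) -> str:
    """Pick a processing path from scan quality (path names are hypothetical)."""
    if c.quality is Quality.CLEAN:
        return "ocr"  # the existing OCR path (AWS Textract in this case study)
    if c.quality is Quality.MESSY:
        return "vision_llm"  # vision model with a structured output schema
    return "human_review"  # assumption: un-OCRable documents go to people
```

The document type does not affect the path in this sketch; in practice it would more plausibly select the extraction schema and the business rules used for validation.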
Every field gets a confidence score. High-confidence records (~89% of volume) flow straight to the customer’s dashboard; low-confidence records flag for human review with only the uncertain fields highlighted.
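A minimal sketch of that confidence gate, assuming a single per-field threshold. The `Field` type, the `route_record` function, and the 0.9 threshold are hypothetical; the real cutoffs may well differ by field or document type.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.9  # hypothetical value, not taken from the case study


@dataclass(frozen=True)
class Field:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction stage


def route_record(
    fields: list[Field], threshold: float = CONFIDENCE_THRESHOLD
) -> tuple[str, list[str]]:
    """Return the destination plus the names of the fields a reviewer must check."""
    uncertain = [f.name for f in fields if f.confidence < threshold]
    if uncertain:
        # Only the uncertain fields are surfaced, not the whole document.
        return "review", uncertain
    return "dashboard", []
```

For example, a record with a confident `bol_number` and a shaky `gross_weight` would come back as `("review", ["gross_weight"])`, so the reviewer looks at one field instead of re-keying the document.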
An eval set of 1,200 representative documents with human-labeled ground truth runs against every PR before it ships. The system doesn’t change without the eval passing.
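One way such an eval gate could work is sketched below. The scoring rule (a document counts only if every extracted field matches its label exactly), the function names, and the pass bar are all assumptions for illustration; the case study does not say how the eval is scored.

```python
from typing import Mapping, Sequence

Record = Mapping[str, str]  # field name -> extracted value


def exact_match_rate(predictions: Sequence[Record], labels: Sequence[Record]) -> float:
    """Fraction of documents whose extracted fields all match the human labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must align one-to-one")
    if not labels:
        raise ValueError("eval set is empty")
    matches = sum(1 for p, t in zip(predictions, labels) if dict(p) == dict(t))
    return matches / len(labels)


def eval_passes(
    predictions: Sequence[Record], labels: Sequence[Record], min_rate: float
) -> bool:
    """True when the candidate build meets the (hypothetical) pass bar."""
    return exact_match_rate(predictions, labels) >= min_rate
```

In CI, a wrapper script would run the candidate pipeline over the labeled set and exit non-zero when `eval_passes` returns `False`, which blocks the merge.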
| Metric | Before | After |
|---|---|---|
| Avg document processing time | 14 hours | 1.8 hours |
| Backlog (peak) | 3+ days | <2 hours |
| Straight-through processing rate | 0% (all manual) | 89% |
| Headcount on review team | 14 | 3 |
| Customer-reported quality issues | 47/quarter | 6/quarter |
"We'd written the budget for a vendor SaaS. They convinced us to spend a third of that on a prototype first. The prototype showed us the vendors couldn't handle our messy 30%, and gave us a system that could. Eighteen months in, it's still running."
Clement W. · VP Engineering, logistics SaaS
If your 'AI' is actually a person in another time zone, we should talk. We'll tell you whether a pipeline makes economic sense, or whether it doesn't.