Logistics SaaS · Custom AI & LLM Build Engagement · capped budget

We replaced a 14-person review queue with a pipeline.

A mid-market logistics SaaS had a 14-person manual review queue for incoming shipping documents. Days-long backlog. Enterprise complaints. We built an LLM pipeline that handles the messy 30% the off-the-shelf vendors couldn't, and the queue cleared in week one.


14-person queue → pipeline → 89% automated · 3-person exception desk

The "AI" they had marketed was 14 people in a low-cost market. They were drowning.

Document volume had doubled in 18 months. Same-day processing became a 3-day backlog. Two enterprise customers had filed quality-of-service complaints. One was actively shopping for replacement vendors.

The client provided freight management software to mid-market shippers. Incoming shipping documents (bills of lading, customs forms, proof-of-delivery scans, weight tickets) needed structured extraction before they surfaced in the shipper's ops dashboard. That extraction was the 14-person review team.

The CTO had inherited the problem six months earlier and had already tried two off-the-shelf vendors. Both passed the demo. Neither held up in production. The pattern was consistent: the clean 70% of documents were fine. The other 30% (bad scans, warehouse phone photos, partial pages) broke every OCR-first system they evaluated.

The split that defined the problem, and set the prototype kill criteria

The CTO's words in our first call: "We tried two solutions. Demos looked good. Neither vendor's tech worked on the 30% of our documents that are bad scans, partial pages, or photos taken on warehouse phones. We needed something that handled the messy 30%, not the clean 70%."

We scoped three paths. We bet on the hard one.

Path A

Swap one OCR vendor for another

Rejected on sight: the same demos the client had already sat through, with the same failure modes on the messy 30%. A new logo on the same OCR-first approach.

Path B · Chosen

Custom LLM pipeline

Higher build cost, higher long-term ROI, but only if we could prove exception-handling on their hardest real documents in three weeks. This was the bet.

Path C

Augment the review team

Tooling to make the 14 people faster instead of fewer. Cheaper to build, lower ceiling, and the queue stays a team.

We pitched Path B as a prototype-first engagement with an explicit kill criterion: if a 3-week prototype couldn't hit 92% straight-through on a held-out set of their hardest documents, we'd recommend Path C and they'd pay only for the prototype phase.

94.3%

Straight-through on the hardest held-out docs
(kill threshold was 92%)

The prototype hit 94.3%. We moved into the production build.

Four layers. The messy 30% finally had a home.

Every document enters at the top. High-confidence records flow straight to the dashboard. Low-confidence records surface to the exception team with only the uncertain fields highlighted, not the whole document.

Clean docs through Textract · messy docs through GPT-4o vision · every record confidence-scored · every change eval-tested before shipping

Layer 1

Triage

Classifies document type (BOL, customs, POD, weight ticket) and quality (clean / messy / un-OCRable) in a single pass, routing each to the right processing path immediately.
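
A minimal sketch of how a triage result might map to a processing path. The class names, the `route_for` function, and the routing table are illustrative, not the production code. In particular, where un-OCRable documents go is an assumption here: the write-up only states that clean documents go to OCR and messy ones to the vision model.

```python
from dataclasses import dataclass
from enum import Enum


class DocType(Enum):
    BOL = "bill_of_lading"
    CUSTOMS = "customs_form"
    POD = "proof_of_delivery"
    WEIGHT_TICKET = "weight_ticket"


class Quality(Enum):
    CLEAN = "clean"
    MESSY = "messy"
    UN_OCRABLE = "un_ocrable"


class Route(Enum):
    OCR = "ocr"                # e.g. the existing Textract path
    VISION_LLM = "vision_llm"  # e.g. the vision-model path


@dataclass(frozen=True)
class TriageResult:
    doc_type: DocType
    quality: Quality


# Hypothetical routing table. The UN_OCRABLE entry is an assumption.
ROUTES = {
    Quality.CLEAN: Route.OCR,
    Quality.MESSY: Route.VISION_LLM,
    Quality.UN_OCRABLE: Route.VISION_LLM,
}


def route_for(result: TriageResult) -> Route:
    """Pick a processing path from the quality label alone."""
    return ROUTES[result.quality]


print(route_for(TriageResult(DocType.BOL, Quality.CLEAN)))
print(route_for(TriageResult(DocType.POD, Quality.MESSY)))
```

Keeping the routing in a plain table means a new quality class or path is a one-line change that the eval set can check.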

Layer 2

Hybrid extraction

Clean documents run through the existing AWS Textract OCR; messy documents through GPT-4o vision with a structured output schema validated against business rules.
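
To illustrate "a structured output schema validated against business rules", here is a sketch using only the standard library. The `BillOfLading` fields and the four rules are hypothetical examples; the client's actual schema and rules are not described in this write-up.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class BillOfLading:
    # Hypothetical fields, chosen for illustration only.
    bol_number: str
    gross_weight_kg: float
    piece_count: int
    ship_date: date


def parse_bol(raw_json: str) -> BillOfLading:
    """Parse model output into the schema. Raises on missing or malformed fields."""
    data = json.loads(raw_json)
    return BillOfLading(
        bol_number=str(data["bol_number"]).strip(),
        gross_weight_kg=float(data["gross_weight_kg"]),
        piece_count=int(data["piece_count"]),
        ship_date=date.fromisoformat(data["ship_date"]),
    )


def rule_violations(bol: BillOfLading, today: date) -> List[str]:
    """Return the business rules the record breaks (empty list means it passes)."""
    problems = []
    if not bol.bol_number:
        problems.append("bol_number is empty")
    if bol.gross_weight_kg <= 0:
        problems.append("gross_weight_kg must be positive")
    if bol.piece_count < 1:
        problems.append("piece_count must be at least 1")
    if bol.ship_date > today:
        problems.append("ship_date is in the future")
    return problems


raw = '{"bol_number": "BOL-1001", "gross_weight_kg": 412.5, "piece_count": 8, "ship_date": "2024-03-01"}'
record = parse_bol(raw)
print(rule_violations(record, today=date(2024, 3, 15)))
```

A record that fails to parse or breaks a rule would be a natural candidate for the exception desk rather than the dashboard.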

Layer 3

Validation & confidence scoring

Every field gets a confidence score. High-confidence records (~89% of volume) flow straight to the customer’s dashboard; low-confidence records are flagged for human review with only the uncertain fields highlighted.
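
The field-level routing can be sketched as follows. The 0.90 threshold, the type names, and the `decide` function are assumptions for illustration; the production thresholds are not given here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FieldValue:
    value: str
    confidence: float  # 0.0 to 1.0


@dataclass
class ReviewDecision:
    straight_through: bool
    uncertain_fields: List[str]


def decide(fields: Dict[str, FieldValue], threshold: float = 0.90) -> ReviewDecision:
    """Send a record straight through only if every field clears the threshold.

    Otherwise return just the names of the fields a reviewer needs to look at.
    """
    uncertain = sorted(
        name for name, field in fields.items() if field.confidence < threshold
    )
    return ReviewDecision(straight_through=not uncertain, uncertain_fields=uncertain)


record = {
    "bol_number": FieldValue("BOL-1001", 0.99),
    "gross_weight_kg": FieldValue("412.5", 0.62),  # smudged on the scan
    "piece_count": FieldValue("8", 0.97),
}
print(decide(record))
```

Returning only the uncertain field names is what lets a reviewer fix one value instead of re-keying the whole document.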

Layer 4

Evaluation harness

Every PR runs against 1,200 representative documents with human-labeled ground truth before it ships. The system doesn’t change unless the eval passes.
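
A rough sketch of an eval gate in this spirit. The scoring rule (a document counts as correct only if every field matches the label), the names `EvalCase`, `run_eval`, and `gate`, and the `min_accuracy` parameter are all assumptions; the actual harness and its pass threshold are not described here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Fields = Dict[str, str]


@dataclass
class EvalCase:
    doc_id: str
    expected: Fields  # human-labeled ground truth


def run_eval(cases: List[EvalCase], extract: Callable[[str], Fields]) -> float:
    """Fraction of documents whose extracted fields exactly match the labels."""
    if not cases:
        raise ValueError("eval set is empty")
    correct = sum(1 for case in cases if extract(case.doc_id) == case.expected)
    return correct / len(cases)


def gate(accuracy: float, min_accuracy: float) -> bool:
    """True if the change may ship."""
    return accuracy >= min_accuracy


# Tiny stand-in eval set and extractor, for illustration only.
cases = [
    EvalCase("doc-1", {"bol_number": "BOL-1001"}),
    EvalCase("doc-2", {"bol_number": "BOL-1002"}),
]
fake_outputs = {
    "doc-1": {"bol_number": "BOL-1001"},
    "doc-2": {"bol_number": "BOL-1OO2"},  # letter O read in place of zero
}
accuracy = run_eval(cases, lambda doc_id: fake_outputs[doc_id])
print(accuracy, gate(accuracy, min_accuracy=0.9))
```

In CI, the script would exit non-zero when `gate` returns `False`, which blocks the merge.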

The queue cleared in week one. The numbers held at twelve months.

−87%

Document processing time

89%

Straight-through, no human touch

14→3

Review team headcount

$720K

Annualised savings

Every metric moved. The at-risk enterprise contract expanded to a second region.

Metric	Before	After
Avg document processing time	14 hours	1.8 hours
Backlog (peak)	3+ days	<2 hours
Straight-through processing rate	0% (all manual)	89%
Headcount on review team	14	3
Customer-reported quality issues	47/quarter	6/quarter
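
For readers who want to check the headline percentages against the table, a short calculation using only the before and after figures above. The helper name `pct_reduction` is ours.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


rows = {
    "Avg document processing time (hours)": (14, 1.8),
    "Headcount on review team": (14, 3),
    "Customer-reported quality issues (per quarter)": (47, 6),
}

for name, (before, after) in rows.items():
    print(f"{name}: {pct_reduction(before, after):.1f}% lower")
# Avg document processing time (hours): 87.1% lower
# Headcount on review team: 78.6% lower
# Customer-reported quality issues (per quarter): 87.2% lower
```

The 87.1% drop in average processing time is the source of the −87% headline figure.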

We'd written the budget for a vendor SaaS. They convinced us to spend a third of that on a prototype first. The prototype showed us the vendors couldn't handle our messy 30%, and gave us a system that could. Eighteen months in, it's still running.

Clement W. · VP Engineering, logistics SaaS

Runs in your VPC, not in a black box.

Models: GPT-4 / GPT-4o (vision + extraction), Claude (validation reasoning), small fine-tuned classifier for triage
OCR (clean docs): AWS Textract
Orchestration: Custom Python pipeline, deployed as containerized services
Vector store: Pinecone
Evaluation: Custom eval harness running 1,200 test cases on every PR
Monitoring: Prometheus + Grafana, with per-document-type accuracy tracking
Infra: Client's AWS account, deployed in their VPC

7 wks

Timeline · 3-wk prototype + 4-wk build

Senior team · AI, data, backend & delivery

$77K

Budget cap · phase-gated

$72K

Actual spend · 6% under cap

8 hrs

Hand-off training · eval harness included

Sound familiar?