PXT · AI Consulting · Built by engineers since 2007
Logistics SaaS · Custom AI & LLM Build · Engagement: capped budget

We replaced a 14-person review queue with a pipeline.

A mid-market logistics SaaS had a 14-person manual review queue for incoming shipping documents. Days-long backlog. Enterprise complaints. We built an LLM pipeline that handles the messy 30% the off-the-shelf vendors couldn't, and the queue cleared in week one.

14-person queue → LLM pipeline (triage · extract) → 89% automated to the dashboard · 11% to a 3-person exception desk

The "AI" they had marketed was 14 people in a low-cost market. They were drowning.

Document volume had doubled in 18 months. Same-day processing became a 3-day backlog. Two enterprise customers had filed quality-of-service complaints. One was actively shopping for replacement vendors.

The client provided freight management software to mid-market shippers. Incoming shipping documents (bills of lading, customs forms, proof-of-delivery scans, weight tickets) needed structured extraction before they surfaced in the shipper's ops dashboard. That extraction was the 14-person review team.

The CTO had inherited the problem six months earlier and had already tried two off-the-shelf vendors. Both passed the demo. Neither held up in production. The pattern was consistent: the clean 70% of documents were fine. The other 30% (bad scans, warehouse phone photos, partial pages) broke every OCR-first system they evaluated.

Clean · 70%: OCR handles it fine
Messy · 30%: vendor tech failed here
The split that defined the problem, and set the prototype kill criteria

The CTO's words in our first call: "We tried two solutions. Demos looked good. Neither vendor's tech worked on the 30% of our documents that are bad scans, partial pages, or photos taken on warehouse phones. We needed something that handled the messy 30%, not the clean 70%."

We scoped three paths. We bet on the hard one.

Path A

Swap one OCR vendor for another

Rejected on sight: the same demos the client had already sat through, with the same failure modes on the messy 30%. A new logo on the same OCR-first approach.

Path B · Chosen

Custom LLM pipeline

Higher build cost, higher long-term ROI, but only if we could prove exception-handling on their hardest real documents in three weeks. This was the bet.

Path C

Augment the review team

Tooling to make the 14 people faster instead of fewer. Cheaper to build, lower ceiling, and the queue stays a team.

We pitched Path B as a prototype-first engagement with an explicit kill criterion: if a 3-week prototype couldn't hit 92% straight-through on a held-out set of their hardest documents, we'd recommend Path C and they'd pay only for the prototype phase.

94.3%
Straight-through on the hardest held-out docs
(kill threshold was 92%)

The prototype hit 94.3%. We moved into the production build.

Four layers. The messy 30% finally had a home.

Every document enters at the top. High-confidence records flow straight to the dashboard. Low-confidence records surface to the exception team with only the uncertain fields highlighted, not the whole document.

Clean docs through Textract · messy docs through GPT-4 vision · every record confidence-scored · every change eval-tested before shipping
Layer 1

Triage

Classifies document type (BOL, customs, POD, weight ticket) and quality (clean / messy / un-OCRable) in a single pass, routing each to the right processing path immediately.
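A minimal sketch of the routing decision, not the production code. The enum values and path names are illustrative assumptions, and so is the choice to send un-OCRable documents down the vision path.

```python
from dataclasses import dataclass
from enum import Enum


class DocType(Enum):
    BOL = "bill_of_lading"
    CUSTOMS = "customs_form"
    POD = "proof_of_delivery"
    WEIGHT_TICKET = "weight_ticket"


class Quality(Enum):
    CLEAN = "clean"
    MESSY = "messy"
    UN_OCRABLE = "un_ocrable"


@dataclass(frozen=True)
class TriageResult:
    """Output of the single triage pass: what the document is and how readable it is."""
    doc_type: DocType
    quality: Quality


def route(result: TriageResult) -> str:
    """Pick a processing path from the triage result (path names are hypothetical)."""
    if result.quality is Quality.CLEAN:
        return "ocr"  # clean scans stay on the existing OCR path
    # Assumption: anything OCR would struggle with goes to the vision model.
    return "llm_vision"
```

For example, `route(TriageResult(DocType.POD, Quality.MESSY))` returns `"llm_vision"`, while a clean bill of lading returns `"ocr"`.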

Layer 2

Hybrid extraction

Clean documents run through the existing AWS Textract OCR; messy documents through GPT-4o vision with a structured output schema validated against business rules.
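To make "structured output schema validated against business rules" concrete, here is a sketch for a weight ticket. The field names, the rules and the 0.5 kg tolerance are assumptions for illustration; the real schema is not shown in this case study.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class WeightTicket:
    """Hypothetical extraction schema; None means the field was not found."""
    ticket_number: Optional[str]
    gross_kg: Optional[float]
    tare_kg: Optional[float]
    net_kg: Optional[float]


def validate(ticket: WeightTicket, tolerance_kg: float = 0.5) -> list[str]:
    """Return a list of business-rule violations; an empty list means the record passes."""
    errors: list[str] = []
    if not ticket.ticket_number:
        errors.append("ticket number missing")
    if None in (ticket.gross_kg, ticket.tare_kg, ticket.net_kg):
        errors.append("weight field missing")
        return errors  # arithmetic checks need all three weights
    if ticket.tare_kg > ticket.gross_kg:
        errors.append("tare exceeds gross")
    if abs((ticket.gross_kg - ticket.tare_kg) - ticket.net_kg) > tolerance_kg:
        errors.append("net weight does not equal gross minus tare")
    return errors
```

The point of rules like these is that they catch extraction mistakes the model itself may be confident about, such as a misread digit that breaks the gross/tare/net arithmetic.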

Layer 3

Validation & confidence scoring

Every field gets a confidence score. High-confidence records (~89% of volume) flow straight to the customer’s dashboard; low-confidence records are flagged for human review with only the uncertain fields highlighted.
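A sketch of the per-field gate, assuming a single threshold applied to every field. The 0.90 cut-off, the field names and the destination labels are hypothetical; the case study does not state the production values.

```python
from __future__ import annotations

REVIEW_THRESHOLD = 0.90  # hypothetical cut-off, not the production value


def uncertain_fields(confidences: dict[str, float],
                     threshold: float = REVIEW_THRESHOLD) -> list[str]:
    """Names of the fields whose confidence falls below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)


def disposition(confidences: dict[str, float],
                threshold: float = REVIEW_THRESHOLD) -> tuple[str, list[str]]:
    """Send a record to the dashboard, or to review with only its weak fields flagged."""
    flagged = uncertain_fields(confidences, threshold)
    if flagged:
        return "exception_desk", flagged
    return "dashboard", []
```

A record scored `{"bol_number": 0.99, "gross_kg": 0.62}` would go to the exception desk with only `gross_kg` highlighted, so the reviewer checks one field rather than the whole document.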

Layer 4

Evaluation harness

1,200 representative documents with human-labeled ground truth run against every PR before shipping. The system doesn’t change without the eval passing.
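The gate can be pictured as follows. This is a sketch under assumptions: exact-match scoring per document, a caller-supplied `extract` function, and a `min_pass_rate` chosen by the team. None of these details come from the actual harness.

```python
from __future__ import annotations

from typing import Callable, Sequence


def pass_rate(cases: Sequence[tuple[str, dict]],
              extract: Callable[[str], dict]) -> float:
    """Fraction of labeled cases where the extraction matches ground truth exactly."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(1 for doc_id, expected in cases if extract(doc_id) == expected)
    return hits / len(cases)


def gate(cases: Sequence[tuple[str, dict]],
         extract: Callable[[str], dict],
         min_pass_rate: float) -> bool:
    """True if the change may ship; a CI job would fail the PR when this is False."""
    return pass_rate(cases, extract) >= min_pass_rate
```

Run in CI, a `False` from `gate` blocks the merge, which is what "nothing ships without passing" means in practice.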

The queue cleared in week one. The numbers held at twelve months.

−87%
Document processing time
89%
Straight-through, no human touch
14→3
Review team headcount
$720K
Annualised savings

Every metric moved. The at-risk enterprise contract expanded to a second region.

| Metric | Before | After |
| --- | --- | --- |
| Avg document processing time | 14 hours | 1.8 hours |
| Backlog (peak) | 3+ days | <2 hours |
| Straight-through processing rate | 0% (all manual) | 89% |
| Headcount on review team | 14 | 3 |
| Customer-reported quality issues | 47/quarter | 6/quarter |
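The headline −87% follows from the first row of the table. A quick arithmetic check, using an illustrative helper name (`pct_reduction`) that is not part of the delivered system:

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


processing_time = pct_reduction(14, 1.8)  # hours, from the table
quality_issues = pct_reduction(47, 6)     # issues per quarter, from the table
print(f"processing time: -{processing_time:.1f}%")
print(f"quality issues:  -{quality_issues:.1f}%")
```

Processing time drops by about 87.1%, and customer-reported quality issues fall by about 87.2%.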

We'd written the budget for a vendor SaaS. They convinced us to spend a third of that on a prototype first. The prototype showed us the vendors couldn't handle our messy 30%, and gave us a system that could. Eighteen months in, it's still running.

Clement W. · VP Engineering, logistics SaaS

Runs in your VPC, not in a black box.

Models: GPT-4 / GPT-4o (vision + extraction), Claude (validation reasoning), small fine-tuned classifier for triage
OCR (clean docs): AWS Textract
Orchestration: Custom Python pipeline, deployed as containerized services
Vector store: Pinecone
Evaluation: Custom eval harness running 1,200 test cases on every PR
Monitoring: Prometheus + Grafana, with per-document-type accuracy tracking
Infra: Client's AWS account, deployed in their VPC
Timeline: 7 wks (3-wk prototype + 4-wk build)
Senior team: 4 (AI, data, backend & delivery)
Budget cap: $77K, phase-gated
Actual spend: $72K, 6% under cap
Hand-off training: 8 hrs, eval harness included

Sound familiar?

Scope a custom AI build

If your 'AI' is actually a person in another time zone, we should talk. We'll tell you whether a pipeline makes economic sense, or whether it doesn't.

No deck · No demo · No sales pressure