# Redesigning an AI Document Pipeline: GPT-4o Vision to Gemini


A client came to me with a working data pipeline. It ran on a schedule, scraped public documents, extracted structured records, resolved entities, and loaded everything into a production app. The MVP had shipped, users were relying on it, and the engineering team had moved on to other work.

The problems were accumulating quietly.

Extraction was failing on 16% of documents, silently, with no alerting. Thirteen percent of resolved entities had fabricated identifiers from a geocoder that guessed wrong. Ninety-eight percent of records had placeholder timezone data because the seed CSV left those fields blank and the import command defaulted them to "UTC." The same physical entity was appearing four or five times in the database under different names. And AI costs were running at a flat rate every run, whether documents had changed or not.

They brought me in to figure out what to do about it.

## What I Found

I started with the attrition funnel.

| Stage | Count |
|---|---|
| Scrape targets attempted | 105 |
| Documents found | ~39 |
| Downloaded successfully | ~34 |
| Extracted successfully | ~29 |
| Individual records yielded | ~70 |

Each document contained multiple entries in tabular form.

The team had no visibility into any of it.

Per run, the extraction step produced roughly 3.6 timeouts on average (documents that reached the AI but never got a response within the 30-second cURL timeout), plus 2 PPTX failures the extractor couldn't read. Both failure types produced zero output and zero log entries that would surface in monitoring. Over 120 runs per month, that's hundreds of documents that entered the extraction step and came out with nothing. The pipeline moved on. Downstream records were just missing.

Five structural problems explain why.

**Vision mode for document extraction.** The pipeline converted every PDF to PNG images, base64-encoded each page, and sent them to GPT-4o Vision. That made sense at the time: it's the most direct path to "get data out of this PDF" without worrying about whether the document has a text layer. It was also the most expensive path and the most fragile one. Two system dependencies (pdftotext and pdftoppm) were required in the Docker image. A temp directory had to be managed for the converted images. A text-mode extractor existed and had never been used in production.

**Free-form JSON output parsed with regex.** The extraction prompt asked the model to wrap output in a ```json code block. A parsing trait stripped the markdown wrappers and fell back to regex when that failed. When regex failed too, the document was silently dropped. No retry. No alert. Free-form JSON was the standard approach before providers shipped schema enforcement. It made sense when the MVP was being built, but it accumulated fragility with every edge case the parser hadn't seen.

**AI-first entity resolution with a 100-record context limit.** When the pipeline needed to match a raw entity name against existing database records, it sent the raw name plus the first 100 existing records to the AI classifier. The database had 308 records. Two-thirds were invisible to the AI. When the AI couldn't match (which happened frequently), the pipeline fell through to an open-source geocoder that resolved "Norfolk" to Norfolk, England.

Sending existing records to the AI was a reasonable way to bootstrap entity matching without building a proper lookup system. But it doesn't scale past the context limit, and there was no floor when it failed. The geocoder fallback would create a new record with a fabricated identifier rather than match the existing one. Same physical entity, new database record, different identifier. Record deduplication broke. Users saw the same data listed multiple times under slightly different names. Forty of 308 records (13%) had fabricated identifiers with no basis in authoritative reference data.

**No change detection.** Every document was re-extracted on every run, regardless of whether it had changed since the last run. With hundreds of documents and a schedule running four times a day, the pipeline was spending AI budget on identical content.

**No failure visibility.** Timeouts, wrong file types, and extraction failures all produced the same result: nothing. One source timed out on almost every run. Two sources failed every run because they published PPTX files with .pdf extensions. No one knew.
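For concreteness, the strip-then-regex path from the second problem looked roughly like this. This is a Python reconstruction for illustration only (the production code appears to be PHP); function and variable names are mine, not the pipeline's:

```python
import json
import re

def parse_model_output(text):
    """Old pattern: strip markdown fences, fall back to regex, drop on failure."""
    stripped = text.strip()
    if stripped.startswith("```"):
        # Remove ```json ... ``` wrappers when the model added them.
        stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", stripped)
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        # Regex fallback: grab the first {...} or [...] span and hope.
        m = re.search(r"(\{.*\}|\[.*\])", stripped, re.DOTALL)
        if m:
            try:
                return json.loads(m.group(1))
            except json.JSONDecodeError:
                pass
    return None  # silent drop: the failure mode the redesign removes
```

The `return None` at the bottom is the whole problem: every path that can't produce JSON produces nothing, with no signal that anything was lost.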

## The Redesign: Gemini-Native Document Processing

The first design decision was straightforward: stop converting documents and start using native PDF ingestion.

```mermaid
graph TD
    PDF[PDF]
    subgraph Old
        B[pdftoppm] --> C[PNG pages] --> D[base64 encode] --> E[GPT-4o Vision] --> F[regex parse] --> G[drop on failure]
    end
    subgraph New
        I[Gemini native] --> J[schema-enforced JSON]
    end
    PDF --> B
    PDF --> I
```

Gemini processes PDFs natively. You base64-encode the file, set the MIME type to application/pdf, and send it as an inline data part alongside your text prompt. The model reads the document directly. No pdftoppm, no page-by-page image conversion, no temp directories, no two extractors with a factory to choose between them. Digital PDFs (documents with selectable text layers) get native text extraction. The model reads the actual text, not pixels of text.

```json
{
  "contents": [{
    "role": "user",
    "parts": [
      { "text": "Extract flight records from this PDF..." },
      { "inline_data": { "mime_type": "application/pdf", "data": "<base64>" } }
    ]
  }]
}
```

The token math makes this concrete.

| | GPT-4o Vision (high-detail) | Gemini 3.1 Flash Lite (native PDF) |
|---|---|---|
| Tokens per page | 765 | ~258 |
| Price per 1M input tokens | $2.50 | $0.25 |
| Cost per page | ~$0.0019 | ~$0.000065 |

Combined, that's roughly 30x cheaper per page: three times fewer tokens at 10x lower price per token.
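The per-page arithmetic checks out directly from the quoted token counts and prices:

```python
# Per-page cost for each approach, using the figures from the table above.
GPT4O_TOKENS_PER_PAGE = 765    # high-detail image pricing
GEMINI_TOKENS_PER_PAGE = 258   # native PDF page
GPT4O_PRICE_PER_M = 2.50       # $ per 1M input tokens
GEMINI_PRICE_PER_M = 0.25

def cost_per_page(tokens, price_per_million):
    return tokens * price_per_million / 1_000_000

old = cost_per_page(GPT4O_TOKENS_PER_PAGE, GPT4O_PRICE_PER_M)    # ~$0.0019
new = cost_per_page(GEMINI_TOKENS_PER_PAGE, GEMINI_PRICE_PER_M)  # ~$0.000065
ratio = old / new                                                # ~30x
```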

The actual extraction cost from production logs tells the story. The old pipeline: $0.52 per run extracting ~34 documents. The new pipeline's first full production run: $0.004 extracting 31 documents. That's a 99% reduction in extraction cost.

### Honest notes on Gemini's PDF handling

The limit is 1,000 pages per document and 50 MB inline (2 GB via the Files API). Scan quality matters for scanned documents. Google's own documentation notes that the model isn't precise at locating text in PDFs and can hallucinate when interpreting handwritten content. Digital PDFs with text layers are reliable. Heavy scans with degraded quality are not.
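Those limits suggest a simple routing guard before upload. A minimal sketch (the 50 MB threshold is the inline limit quoted above; the function name is illustrative):

```python
INLINE_LIMIT_BYTES = 50 * 1024 * 1024  # inline data limit quoted above

def choose_upload_path(pdf_bytes):
    """Route small documents inline; larger ones go through the Files API."""
    if len(pdf_bytes) <= INLINE_LIMIT_BYTES:
        return "inline"
    return "files_api"
```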

### Vertex AI auth: no API keys, no secrets

All Gemini API calls go through Vertex AI with Application Default Credentials (ADC). The Cloud Run service account gets roles/aiplatform.user for Vertex AI and roles/places.viewer for the Places API. On Cloud Run, the metadata server provides an OAuth2 access token automatically. Locally, gcloud auth application-default login handles it. No API key management. No secrets rotation. No GEMINI_API_KEY in environment variables.

The standard Gemini PHP client library authenticates via API key and is incompatible with Vertex AI's bearer token auth, so I use direct REST calls to the Vertex AI endpoint with the ADC token instead. It's a few more lines of setup, but there's no secret to manage and no key rotation to fail.
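A minimal sketch of that direct-REST pattern in Python. The endpoint path follows Vertex AI's standard `generateContent` URL; project, region, and model values are placeholders, and the ADC token is assumed to be fetched by the caller (via the metadata server on Cloud Run, or `gcloud auth application-default login` locally):

```python
import base64
import json
import urllib.request

def build_endpoint(project, region, model):
    # Vertex AI generateContent endpoint; project/region/model are config-driven.
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/{region}/publishers/google/models/{model}:generateContent"
    )

def build_request_body(prompt, pdf_bytes):
    # Inline PDF part alongside the text prompt, as in the JSON example above.
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": "application/pdf",
                    "data": base64.b64encode(pdf_bytes).decode("ascii"),
                }},
            ],
        }]
    }

def call_gemini(endpoint, body, token):
    # token is an ADC OAuth2 access token; no API key anywhere.
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```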

The second design decision was to stop parsing free-form text and start enforcing output contracts at the API level.

Gemini's generationConfig accepts a responseMimeType and a responseSchema. Set responseMimeType to application/json, provide the schema, and the API returns valid JSON that matches the schema or returns an error. No markdown code block stripping. No regex fallback. The schema defines exactly what the output looks like: field names, types, nullable fields, enum constraints on status values.

```json
{
  "responseMimeType": "application/json",
  "responseSchema": {
    "type": "object",
    "properties": {
      "event_date": {
        "type": "string",
        "description": "Date in YYYY-MM-DD format. Resolve day names to calendar dates using document context."
      },
      "event_time": {
        "type": "string",
        "description": "Four-digit 24-hour time (e.g. 0930). No suffixes."
      },
      "capacity": { "type": "integer", "nullable": true },
      "status": {
        "type": "string",
        "enum": ["Confirmed", "Tentative", "Pending", "Closed"],
        "nullable": true
      },
      "secondary_location": { "type": "string", "nullable": true }
    }
  }
}
```

The description fields serve as per-field prompt instructions. They replaced most of what the old free-form prompt had to do.

Schema enforcement eliminates an entire class of output inconsistencies. The same fields, before and after:

| Old Output | Schema-Enforced Output |
|---|---|
| `"secondary_location": "N/A"` | `"secondary_location": null` |
| `"capacity": "53T"` | `"capacity": 53, "status": "Tentative"` |
| `"event_date": "Saturday"` | `"event_date": "2026-03-07"` |
| `"event_time": "9:30L"` | `"event_time": "0930"` |

One honest note on timing. OpenAI shipped structured outputs in August 2024. The pipeline was built in January 2026, seventeen months later. The feature existed the entire time. The MVP just hadn't adopted it. Gemini's approach (responseSchema in generationConfig) makes structured output the obvious default rather than an opt-in upgrade, but the pattern was available on OpenAI too.

## The Redesign: Grounding AI Outputs Against Authoritative Data

The MVP skipped grounding entirely.

The MVP's entity resolution chain put AI first. A raw entity name plus some existing records went to the AI classifier, which decided whether it was a known entity or a new one. When it failed, the geocoder handled the fallback. The problem is that this chain has no floor. When the AI was wrong, the geocoder didn't correct it. It just produced a new record with a fabricated identifier and city-center coordinates.

I put deterministic lookups first and AI last.

```mermaid
graph TD
    RAW[Raw entity name] --> T1{Alias lookup}
    T1 -- "match" --> DONE[Resolved]
    T1 -- "miss" --> T2{Standard code lookup}
    T2 -- "match" --> DONE
    T2 -- "miss" --> T3{Exact name match}
    T3 -- "match" --> DONE
    T3 -- "miss" --> T3B{Non-entity filter}
    T3B -- "rejected" --> SKIP[Skipped]
    T3B -- "pass" --> T4{AI research with grounded search}
    T4 -- "high confidence" --> T5{Reference database cross-check}
    T5 --> DONE
    T4 -- "low confidence" --> REVIEW[Queued for human review]
```

| Tier | Method | Cost |
|---|---|---|
| 1 | Alias lookup | FREE, instant |
| 2 | Standard code lookup (ICAO) | FREE, instant |
| 3 | Exact name match | FREE, instant |
| 3b | Non-entity filter | FREE, instant |
| 4 | AI research (Gemini with Google Search + URL context) | ~$0.001/lookup |
| 5 | Reference database cross-check (70K+ records) | FREE |
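The deterministic front of the chain is ordinary lookups. A sketch with plain dicts standing in for database tables (all names here are illustrative, not the pipeline's actual code):

```python
def resolve_entity(raw_name, aliases, icao_index, names, non_entities):
    """Deterministic tiers 1-3b of the resolution chain.

    Returns (status, record_id): status is "resolved", "skipped",
    or "ai_research" (tier 4, the only tier that costs money).
    """
    key = raw_name.strip().upper()
    if key in aliases:                    # tier 1: alias cache hit
        return ("resolved", aliases[key])
    if key in icao_index:                 # tier 2: standard code lookup
        return ("resolved", icao_index[key])
    if key in names:                      # tier 3: exact canonical name
        return ("resolved", names[key])
    if key in non_entities:               # tier 3b: non-entity filter
        return ("skipped", None)
    return ("ai_research", None)          # tier 4: grounded AI research
```

Only names that fall all the way through reach the paid tier, which is what makes the economics improve as the alias cache fills.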

AI never runs in isolation here. When the pipeline reaches tier 4, it uses two Gemini platform features that let the model do its own primary research rather than relying on context the pipeline pre-fetches.

**Google Search grounding (googleSearch tool).** This is a built-in Gemini capability, not a separate API call. You declare it as a tool in the request, and the model autonomously decides when and what to search during generation. The pipeline doesn't construct search queries or parse search results. The model does. It formulates queries based on the entity name, evaluates results, and incorporates what it finds into its response. In RAG, you pre-fetch context and stuff it into the prompt. Here, the model controls the research loop.

**URL context (urlContext tool).** Declared alongside Google Search, this lets the model fetch and read full web pages it discovers during search. When the model finds a relevant URL in search results, like an official military installation page, a terminal contact directory, or an airport database entry, it can follow the link and read the page content. The model reads primary sources rather than relying on search snippets.

Both tools are declared with empty configuration objects. The API handles all orchestration internally:

```json
{
  "tools": [
    { "googleSearch": {} },
    { "urlContext": {} }
  ]
}
```

With these tools, the model searches the web, reads source pages, and returns a structured assessment: canonical name, coordinates, ICAO/IATA codes, contact information, aliases, and a confidence score with reasoning. The pipeline validates those results against the OurAirports reference database. The old pattern sent the AI a list of existing records and hoped it picked the right one.

The reference database cross-check is what prevents the "Norfolk, England" class of error. When AI returns an ICAO code, the pipeline looks it up in the reference database. If the code exists, it uses the reference database's coordinates rather than the AI's estimate.

Confidence thresholding controls what happens next. Above 0.80 confidence, the pipeline creates a new record. Below 0.80, it queues the entity for human review rather than guessing. The old pipeline had no confidence threshold. Everything the geocoder returned became a record, regardless of how plausible it was.
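The cross-check and the threshold compose into one small gate. A sketch, using the 0.80 floor from above; the field names and the `reference_db` shape are illustrative:

```python
CONFIDENCE_FLOOR = 0.80

def ground_ai_result(ai_result, reference_db):
    """Validate an AI research result against authoritative reference data."""
    icao = ai_result.get("icao")
    ref = reference_db.get(icao) if icao else None
    if ref is not None:
        # Prefer authoritative coordinates over the AI's estimate.
        ai_result = {**ai_result, "lat": ref["lat"], "lon": ref["lon"]}
    if ai_result.get("confidence", 0.0) >= CONFIDENCE_FLOOR:
        return {"action": "create_record", "record": ai_result}
    # Below the floor: never guess, queue for a human instead.
    return {"action": "human_review", "record": ai_result}
```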

For security-critical classification (the app filters certain records from public view based on geographic region, and a wrong classification either leaks sensitive data or hides valid records from users), I use a static country-to-classification lookup table with approximately 250 entries rather than asking the AI to guess. AI hallucination on this field has real safety implications. A static table doesn't hallucinate.

Timezone resolution follows the same principle. Coordinates go in, an IANA timezone identifier comes out via offline boundary polygon lookup. Deterministic from coordinates, never a default. The 98% UTC placeholder problem disappears entirely.
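Offline libraries such as timezonefinder implement the real polygon lookup. As a toy illustration of the contract only (coordinates in, IANA identifier out, never a default), here is a bounding-box stand-in; real timezone boundaries are polygons, and these two boxes are invented for the example:

```python
# Toy stand-in for an offline polygon lookup:
# IANA id -> (min_lat, max_lat, min_lon, max_lon). Illustrative boxes only.
TZ_BOXES = {
    "America/New_York": (36.0, 45.0, -80.0, -71.0),
    "Europe/London": (49.9, 59.0, -8.0, 2.0),
}

def timezone_for(lat, lon):
    for tz, (lat_min, lat_max, lon_min, lon_max) in TZ_BOXES.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return tz
    return None  # unresolved beats a silent "UTC" placeholder
```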

Most teams reach for AI as a smarter classifier when they hit resolution problems: send it more context, prompt it more carefully, fine-tune it. That can improve accuracy at the margins, but the fundamental architecture stays the same. When AI is wrong, there's still nothing to catch it.

When AI proposes a match that doesn't cross-reference against the authoritative database, I reject it. When confidence is low, the entity goes to human review instead of into the database. No retry with a different prompt. The entity stays unresolved until better data or human judgment is available.

## The Redesign: Self-Improving Economics

Two mechanisms make the pipeline cheaper over time: the alias cache and content hash change detection. They compound.

**The alias cache.** Every successful resolution (at any tier) creates an alias mapping the raw entity name to the canonical record. On the next run, that name resolves at tier 1 (alias lookup) for free. It never touches AI or the web again.

First run economics from actual production data: cold alias cache, no records in the database yet. Every entity name went to AI research. 84 entities created, 355 aliases established, 2 queued for human review. Enrichment cost: $0.539 across 121 Gemini Pro calls. Those 355 aliases are now in the cache. On the next run, any name seen in this run resolves at tier 1 for free.

Second run: 3 PDFs had changed. The alias cache resolved 4 entity names instantly. Two entities were new and went to AI research. Enrichment cost: $0.014.

**Content hash change detection.** The pipeline stores a SHA-256 hash of each downloaded document. On the next run, if the hash matches, extraction is skipped entirely. The document hasn't changed.
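The mechanism fits in a few lines; a sketch with a dict standing in for the stored hashes (names illustrative):

```python
import hashlib

def should_extract(pdf_bytes, stored_hashes, doc_id):
    """Skip extraction when the stored SHA-256 matches the downloaded bytes."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    if stored_hashes.get(doc_id) == digest:
        return False                 # unchanged: skip in milliseconds
    stored_hashes[doc_id] = digest   # changed or new: record and extract
    return True
```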

This is where the old pipeline bled money quietly. Every document was re-extracted on every run. A document that hadn't been updated in three days still triggered a full GPT-4o Vision extraction pass on all four daily runs. At $0.52 per run for the extraction step and roughly 34 documents per run, a document that doesn't change for a week gets re-extracted 28 times. The cost accrues even though nothing in the output changes.

The production numbers confirm this works. The first run extracted 31 PDFs and yielded 153 flights. The second run found that 28 of 31 PDFs were unchanged. It extracted 3 PDFs, yielded 4 flights, and cost $0.0003 in extraction. The other 28 documents were skipped in milliseconds.

The old pipeline's cost structure was flat. Every run cost the same regardless of what had actually changed. The new one front-loads cost on the first run (bootstrapping the alias cache, extracting everything) and gets cheaper as aliases accumulate and stable documents stop triggering extraction.

## Thinking Mode and Per-Task Model Selection

Gemini 3.1 introduced thinking mode: the model can perform explicit chain-of-thought reasoning before producing its final output. You configure it with a thinkingConfig in the request, and the level controls how much reasoning the model performs:

```json
{
  "generationConfig": {
    "responseMimeType": "application/json",
    "responseSchema": { ... },
    "thinkingConfig": { "thinkingLevel": "MEDIUM" }
  }
}
```

The API response includes both thinking parts (marked with "thought": true) and the final output. The pipeline filters thinking parts out of the response text. They're useful for debugging but aren't part of the structured result.
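Filtering those parts is a short pass over the candidate's parts array. A sketch, assuming the response shape described above:

```python
def final_text(candidate_parts):
    """Drop thinking parts ("thought": true) and join the answer text."""
    return "".join(
        p.get("text", "")
        for p in candidate_parts
        if not p.get("thought", False)
    )
```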

This matters because thinking tokens are billed as output tokens. In a multi-step pipeline, you're paying for reasoning on every document, every entity, every run. A blanket "use the smartest model with maximum thinking" approach doesn't scale. Per-task thinking levels work.

| Task | Model | Thinking | Why |
|---|---|---|---|
| Document extraction | Gemini 3.1 Flash Lite | Disabled | Schema already constrains format. Table extraction doesn't benefit from deeper reasoning. |
| Entity research | Gemini 3.1 Pro | MEDIUM | Disambiguation is genuinely hard. Travis AFB vs Travis County Airport. Ramstein vs Rammstein. The model needs to evaluate search results, cross-reference sources, and reason about confidence. |

The model split reflects the same principle. Extraction runs on Flash Lite ($0.25/$1.50 per million input/output tokens): fast, cheap, purpose-built for structured data extraction from well-formatted documents. It processes a PDF and returns a JSON array of records. Entity research runs on Pro ($2.00/$12.00 per million tokens): deeper reasoning, access to Google Search and URL context tools, and thinking budget for working through ambiguous cases.

The model IDs, thinking budgets, and timeout settings are all config-driven, never hardcoded. Upgrading to a new model is a config change, not a deployment.
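A sketch of what config-driven selection can look like; the model IDs, thinking levels, and timeouts here are placeholders standing in for the real config files:

```python
# Illustrative per-task config; real values live in config, not code.
MODEL_CONFIG = {
    "extraction": {
        "model": "gemini-flash-lite",   # placeholder ID
        "thinking_level": None,         # disabled: schema constrains output
        "timeout_s": 120,
    },
    "entity_research": {
        "model": "gemini-pro",          # placeholder ID
        "thinking_level": "MEDIUM",
        "timeout_s": 300,
    },
}

def generation_config(task, response_schema):
    """Build a generationConfig for a task from config, not hardcoded values."""
    cfg = MODEL_CONFIG[task]
    gen = {
        "responseMimeType": "application/json",
        "responseSchema": response_schema,
    }
    if cfg["thinking_level"]:
        gen["thinkingConfig"] = {"thinkingLevel": cfg["thinking_level"]}
    return gen
```

Swapping a model or adjusting a thinking budget then touches only the config table.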

### Why not use thinking for extraction?

I tested it. Thinking mode on extraction produces the same structured output at higher cost. The schema already constrains what the model can return: field names, types, enums, nullable rules. The model doesn't need to reason about what format to use or how to structure its response. It reads a table and fills in fields. The constraint is parsing accuracy, not reasoning depth. Flash Lite without thinking handles this well.

Entity research is different. The model needs to decide which search results are authoritative, whether "McGuire AFB" and "Joint Base McGuire-Dix-Lakehurst" are the same place, and whether a phone number from a 2019 webpage is still valid. That kind of reasoning is where thinking budget pays for itself.

## The Results

The new pipeline is fully implemented and running in production. Both the old and new numbers come from actual production logs.

Actual production cost from pipeline logs, a single run on the old system:

```text
Extract:  91,260 input tokens | 4,180 output tokens | $0.519
Enrich:  147,338 input tokens |   394 output tokens | $0.074
Total:   $0.59/run
Monthly: ~$71 (4 runs/day x 30 days)
```

Actual production cost on the new system, first full run (cold alias cache, all entities new):

```text
Extract:  31 PDFs | 127,433 input tokens | 16,684 output tokens | $0.004
Enrich:   84 entities created | 121 Gemini calls                 | $0.539
Total:   $0.543/run
```

Actual production cost on the new system, second run (355 aliases cached, 28 of 31 documents unchanged):

```text
Extract:  3 PDFs | 11,113 input tokens | 612 output tokens      | $0.0003
Enrich:   4 resolved from cache | 2 new entities | 3 Gemini calls   | $0.014
Total:   $0.014/run
```

| Component | Old (actual) | New (run 1, cold) | New (run 2, warm) |
|---|---|---|---|
| Extraction | $0.52/run | $0.004 | $0.0003 |
| Enrichment | $0.07/run | $0.539 | $0.014 |
| Total | $0.59/run | $0.543 | $0.014 |
| Monthly (4x/day) | ~$71 | N/A | ~$2 |

The first run costs roughly the same as a single old run, but the cost is split differently: extraction dropped 99%, and enrichment was higher because every entity was new. After that, cost scales with two variables: how many documents changed (extraction) and how many entities the cache hasn't seen yet (enrichment). As aliases accumulate, enrichment cost approaches zero for any entity seen in a previous run.

### Honest cost range: best case, worst case, typical

In the best case (full alias cache, no new entities, no document changes), the per-run cost is effectively zero. In the worst case (all documents new, all entities new), the first run showed $0.543, almost entirely enrichment. Each new entity costs roughly $0.005–$0.007 in Pro research to bootstrap. Once it's cached, it costs nothing to resolve on subsequent runs.

The new pipeline spends real money on entity resolution through Gemini Pro research rather than a free geocoder. The free geocoder was cheaper per lookup and wrong often enough to cause significant downstream data quality problems. The new approach costs more per entity on the first encounter but approaches zero as the alias cache warms up.

The data quality numbers matter more.

| Problem (old) | Fix (new) |
|---|---|
| 16% silent extraction failure rate | Explicit errors with tracking. Failures don't disappear. |
| 13% of records had fabricated identifiers | Entities that can't be grounded against authoritative data go to human review, not into the database. |
| Wrong country resolution (Norfolk, VA -> Norfolk, England) | Reference database cross-check with 70K+ airport records. AI results validated against authoritative coordinates. |
| 98% of records had "UTC" placeholder timezones | Deterministic timezone resolution from coordinates via offline boundary polygon lookup. |
| Same entity appeared 4-5 times under different names | One canonical record per entity. Name variants stored as aliases. Alias cache prevents re-resolution. |
| Security-critical classifications were AI-guessed | Static country-to-classification table lookup. No hallucination risk. |
| No confidence thresholding | AI results below 0.80 confidence queued for human review instead of auto-created. |

## Patterns That Transfer

**Use native capabilities.** If the model processes a format directly, send it that format. Don't convert documents to images to extract text. Every conversion step is a dependency, a failure mode, and a cost center.

**Enforce output contracts at the API level.** If your provider supports schema enforcement, use it. The regex fallback goes away and so does the silent data loss.

**Put deterministic lookups before AI in your resolution chain.** Alias lookup, exact name match, standard code lookup: these are faster and cheaper than AI and more reliable for known entities. AI should be the last resort for genuinely unknown cases.

**Ground AI outputs against authoritative reference data.** AI proposes; reference data validates. Build a cross-check that rejects proposals the reference data can't confirm. If there's nothing to validate against, AI confidence is all you have, and that's not enough for production data.

**Match model capability to task complexity.** A cheap, fast model handles structured extraction from well-formatted documents. A more capable model with thinking budget and web search tools handles ambiguous entity research. Paying for reasoning you don't need is wasteful. Skimping on it where ambiguity is real just moves the cost into data quality problems downstream.

**Make failures visible.** Silent failures are the hardest to fix because they don't show up in monitoring. Timeouts, wrong file types, and parse errors should all produce explicit log entries.

**Build caches so costs decrease over time.** Alias caches and content hashes both do this. The first run is the most expensive one. Every subsequent run should be cheaper.


The MVP made sense when it shipped. Vision mode was the fastest path to getting PDFs extracted, the free geocoder the fastest path to entities, regex the fastest path to JSON. None of those decisions were wrong given the constraints at the time.

Native PDF processing eliminated the conversion complexity. Schema-enforced output closed the regex parsing path and stopped the silent data loss. Content hashing and alias caching turned a flat-cost pipeline into one that gets cheaper with every run.

If you're running an AI pipeline that started as an MVP and has been accumulating similar debt, I'm happy to take a look.