Use case · Document intelligence

Document intelligence pipelines on a filesystem.

Invoices, receipts, contracts, expense reports, claims, KYC docs. Different verticals, same shape: ingest, extract with vision or text, aggregate, flag, snapshot for retention. TroveFiles gives you all four pieces — multimodal storage, sandboxed exec, signed webhooks, and per-tenant isolation — in one place.

1.0 THE PROBLEM

Document workflows
are five services
duct-taped together.

AP automation, contract review, claims, KYC — they all look the same on a whiteboard. A customer uploads a document. Something extracts it. Something aggregates the extractions. Something flags risk. Something keeps an auditable trail.

In production they become S3 buckets, SQS queues, Lambda routers, a vector DB, an audit table, RBAC policies, and a retention job. Five services, five vendor SLAs, five places to forget about a tenant.

TroveFiles gives you the four pieces that matter — multimodal storage, sandboxed exec, signed webhooks, per-tenant isolation — behind one API and one filesystem per customer.

2.0 THE PIPELINE

Four stages.
One filesystem.

Every stage reads and writes inside the same per-tenant namespace. Each write fires a signed webhook, so the next stage runs without polling, queues, or shared infrastructure between tenants.

01

Ingest

Client portal uploads into the tenant's namespace. TroveFiles fires file.written.

inbox/invoices/*.png
inbox/receipts/*.png
inbox/contracts/*.txt
inbox/expenses/*.csv
02

Extract

Vision agent for images, text agent for contracts, sandboxed Python for CSVs.

extracted/invoices/*.json
extracted/receipts/*.json
extracted/contracts/*.json
extracted/expenses/*.json
03

Aggregate

Rollup script runs in the sandbox. Monthly + quarterly totals, vendor cuts, category cuts.

reports/monthly/YYYY-MM.json
reports/quarterly/YYYY-QN.json
04

Flag

Compliance checks fire on every aggregation. Duplicates, threshold breaches, risk terms.

flags/duplicate_invoices.json
flags/over_threshold.json
3.0 THE PATTERN

Five steps.
One pipeline.

01

Onboard a tenant

One namespace-locked workspace key per customer. The key cannot reach any other tenant's data, even if it leaks. All of that customer's documents — uploads, extracted JSON, reports, audit log — live under one filesystem root.

from trove_sdk import TroveAdminClient, TroveClient

admin = TroveAdminClient(api_key="trove-admin-...", workspace_id="ws-abc123")

# One namespace-locked key per client. Even if it leaks, it can only see
# one tenant's documents.
key = admin.create_key(name="finscope-acme-corp", namespace="acme-corp")
fs  = TroveClient(api_key=key.api_key, namespace="acme-corp")

fs.exec("mkdir -p workspace/inbox/{invoices,receipts,contracts,expenses}")
fs.exec("mkdir -p workspace/extracted workspace/reports workspace/flags")
02

Extract with vision

Read the uploaded image directly out of the customer's namespace. Send it to Claude as a base64 image block. Write the structured JSON back to extracted/. Same shape works for PDFs, contracts (text), and CSVs (in the sandbox).

import base64, json, anthropic

claude = anthropic.Anthropic()

PROMPT = """Extract structured invoice data. Output ONLY JSON:
{"invoice_no": ..., "vendor": ..., "issued": "YYYY-MM-DD",
 "line_items": [{"description":..., "qty":..., "amount_usd":...}],
 "total_usd": ...}"""

def extract_invoice(image_bytes: bytes) -> dict:
    b64 = base64.standard_b64encode(image_bytes).decode()
    resp = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        system=PROMPT,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png", "data": b64}},
            {"type": "text", "text": "Extract this invoice."},
        ]}],
    )
    return json.loads("".join(b.text for b in resp.content if b.type == "text"))

# Pull the upload from TroveFiles, run vision, write structured JSON back —
# all inside the customer's namespace.
raw  = fs.read_bytes("workspace/inbox/invoices/INV-2026-0001.png")
data = extract_invoice(raw)
fs.write("workspace/extracted/invoices/INV-2026-0001.json", json.dumps(data, indent=2))
03

Aggregate in the sandbox

Roll every extracted JSON into monthly totals, vendor cuts, duplicate-invoice groups, threshold breaches. Runs as Python inside the namespace via exec_detailed. Documents never cross your service boundary; token cost stays flat as the corpus grows.

# Aggregate every extracted invoice/receipt/expense into monthly rollups
# and compliance flags — running INSIDE the namespace via TroveFiles's sandbox.
# Documents never leave the tenant; the analytics live next to the data.

ROLLUP = """
import json, glob, os
from collections import defaultdict

# One pass over the extracted JSON: monthly + quarterly totals, duplicate
# groups keyed by (vendor, rounded amount), and over-threshold breaches.
monthly = defaultdict(lambda: {'total_usd': 0.0})
quarterly = defaultdict(lambda: {'total_usd': 0.0})
dups, over = defaultdict(list), []
for path in sorted(glob.glob('extracted/invoices/*.json')):
    d = json.load(open(path))
    y, mm = d['issued'][:4], d['issued'][5:7]
    monthly[f'{y}-{mm}']['total_usd'] += d['total_usd']
    quarterly[f'{y}-Q{(int(mm) - 1) // 3 + 1}']['total_usd'] += d['total_usd']
    dups[(d['vendor'], round(d['total_usd'], 2))].append(path)
    if d['total_usd'] > 5000:
        over.append({'invoice_no': d['invoice_no'], 'total_usd': d['total_usd']})

os.makedirs('reports/monthly', exist_ok=True)
os.makedirs('reports/quarterly', exist_ok=True)
os.makedirs('flags', exist_ok=True)
for m, agg in monthly.items():
    json.dump(agg, open(f'reports/monthly/{m}.json', 'w'))
for q, agg in quarterly.items():
    json.dump(agg, open(f'reports/quarterly/{q}.json', 'w'))
# Any (vendor, amount) pair seen more than once is a duplicate-billing candidate.
json.dump([{'vendor': v, 'total': t, 'count': len(p)}
           for (v, t), p in dups.items() if len(p) > 1],
          open('flags/duplicate_invoices.json', 'w'))
json.dump(over, open('flags/over_threshold.json', 'w'))
"""

fs.write("workspace/scripts/rollup.py", ROLLUP)
result = fs.exec_detailed("python3 scripts/rollup.py")
assert result.exit_code == 0, result.stderr
04

Drive the whole thing with webhooks

One signed webhook fires the moment the customer uploads. Your handler verifies the signature, routes by file type, and runs the right extractor. The extracted JSON write fires the same webhook a second time — that's when downstream systems (ERP, Slack, approval flow) get notified.

from fastapi import FastAPI, Request, HTTPException
from trove_sdk import verify_webhook, WebhookSignatureError
import os

app = FastAPI()

@app.post("/trove/events")
async def receive(request: Request):
    body = await request.body()
    try:
        event = verify_webhook(
            secret=os.environ["TROVE_WEBHOOK_SECRET"],
            body=body,
            signature_header=request.headers["x-trove-signature"],
        )
    except WebhookSignatureError:
        raise HTTPException(401)

    if event.type != "file.written":
        return {"ok": True}

    path = event.data["path"]
    # Route by document type. Each branch reads the upload from the
    # customer's namespace, extracts, writes structured JSON back, and
    # re-runs the aggregator. The whole pipeline is event-driven —
    # no cron, no polling, no shared queue.
    if path.startswith("workspace/inbox/invoices/"):
        await process_invoice(event.namespace, path)
    elif path.startswith("workspace/inbox/receipts/"):
        await process_receipt(event.namespace, path)
    elif path.startswith("workspace/inbox/expenses/"):
        await process_expense(event.namespace, path)
    elif path.startswith("workspace/inbox/contracts/"):
        await process_contract(event.namespace, path)
    return {"ok": True}
05

Snapshot for compliance

Every Nth document, snapshot the namespace. Tar-on-S3, 30-day default retention, restorable to a clean namespace. Point your auditor at any snapshot to reconstruct the exact state of one tenant's pipeline at a moment in time.

# After every Nth document, snapshot the namespace for compliance
# retention. Snapshots are tar-on-S3 with 30-day default retention
# and can be restored to a clean namespace at any time.
if docs_seen % 6 == 0:
    snap = fs.create_snapshot(label=f"finscope-{run_id}-{docs_seen}docs")
    audit_log.append({"event": "snapshot",
                      "snapshot_id": snap.snapshot_id,
                      "size_bytes":  snap.size_bytes,
                      "label":       snap.label})
4.0 SAME PATTERN, DIFFERENT VERTICALS

What teams build with this.

The pipeline shape is identical across verticals — only the extraction prompt and compliance flags change. One TroveFiles integration, many products.

AP automation
docs: invoices · receipts · vendor contracts
flags: duplicate billing · over-threshold spend · auto-renewal cliffs

Expense intelligence
docs: expense CSVs · receipt scans · travel itineraries
flags: policy breaches · category over-runs · stale submissions

Contract review
docs: MSAs · NDAs · SOWs · order forms
flags: unlimited indemnity · auto-renew with no notice cap · unfavorable governing law

Claims processing
docs: claim forms · supporting photos · police/medical reports
flags: missing documentation · over-limit claims · suspect duplicates

KYC / onboarding
docs: ID scans · proof-of-address · corporate registries
flags: expired IDs · sanctions list hits · address mismatch
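
Swapping verticals really is just a prompt swap. A minimal sketch of the contract-review variant of step 02 (the prompt fields, filename, and JSON shape below are illustrative, not the shipped FinScope prompts):

import json, anthropic

claude = anthropic.Anthropic()

CONTRACT_PROMPT = """Extract structured contract data. Output ONLY JSON:
{"parties": [...], "effective": "YYYY-MM-DD", "term_months": ...,
 "auto_renew": ..., "renewal_notice_days": ...,
 "indemnity_cap": ..., "governing_law": ...}"""

def extract_contract(text: str) -> dict:
    resp = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        system=CONTRACT_PROMPT,
        messages=[{"role": "user", "content": text}],
    )
    return json.loads("".join(b.text for b in resp.content if b.type == "text"))

# Hypothetical filename; the compliance checks in the rollup change the same way.
raw = fs.read_bytes("workspace/inbox/contracts/msa-acme.txt").decode()
fs.write("workspace/extracted/contracts/msa-acme.json",
         json.dumps(extract_contract(raw), indent=2))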

The full reference build — FinScope, a multi-tenant AP + expense + contract pipeline — ships as examples/finscope.py in the SDK.

5.0 FAQ

Document pipelines,
answered.

What documents can the pipeline handle?

Anything you can upload — PNG/JPG scans (invoices, receipts, ID cards), PDFs, plain text contracts, CSVs of expense lines, structured JSON. Vision extraction handles the image-shaped ones; the sandboxed exec environment handles CSV/Python analytics; Claude reads contracts as text. The shape of the pipeline is the same for AP, contract review, expense intelligence, claims, KYC docs, anything paperwork-driven.

How does multimodal extraction work?

Upload the image to the customer's namespace via fs.upload(). Read it back as bytes with fs.read_bytes(). Base64-encode and pass it to Claude as an image content block. Write the extracted JSON back to workspace/extracted/. The image never leaves the customer's namespace — it's read into memory in your handler, sent once to Anthropic for inference, and discarded.

How do you keep one customer's documents from leaking to another?

Mint one namespace-locked workspace key per client at provision time with admin.create_key(name=..., namespace="acme-corp"). The TroveFiles API rejects any request from that key against any other namespace, even if the key is stolen. Webhooks can also be subscribed per-namespace, so each tenant's events stay scoped to its own subscription.
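
To make that concrete, here is a hedged sketch. The SDK calls mirror step 01; the exact error raised on a cross-namespace request is not documented on this page, so the except clause is an assumption:

from trove_sdk import TroveClient

# The key minted in step 01 works inside its own namespace...
acme = TroveClient(api_key=key.api_key, namespace="acme-corp")
acme.read_bytes("workspace/inbox/invoices/INV-2026-0001.png")  # OK

# ...and nowhere else. "globex-inc" is a hypothetical second tenant.
rogue = TroveClient(api_key=key.api_key, namespace="globex-inc")
try:
    rogue.read_bytes("workspace/inbox/invoices/INV-2026-0001.png")
except Exception as err:  # assumption: the API surfaces a 403-style auth error
    print("cross-tenant read rejected:", err)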

Why run aggregation in the sandbox instead of in my service?

Three reasons. First, document data never has to cross your service boundary — Python runs next to the files inside the namespace. Second, your token costs are flat regardless of how many extracted JSONs you aggregate over (it's Python, not Claude). Third, the script and its outputs live in the customer's namespace, so they're part of the same audit trail and snapshot retention as the source documents.

How is this audit-ready?

Three layers. (1) Every file write fires a signed webhook event with the actor key id, so you have an authenticated, real-time audit stream. (2) Each tenant's namespace gets a structured pipeline.jsonl that your handler writes after every step (ingest, extracted, rollup, snapshot). (3) Periodic snapshots create immutable point-in-time tar archives in S3 — point your auditor at any of them.
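
A minimal sketch of layer (2), assuming the SDK has no append primitive (so the helper does a read-modify-write; the helper name and JSON fields are illustrative):

import json, time

def log_step(fs, step: str, path: str):
    # Append one structured line to the tenant's pipeline.jsonl audit trail.
    entry = json.dumps({"ts": time.time(), "step": step, "path": path})
    try:
        existing = fs.read_bytes("workspace/pipeline.jsonl").decode()
    except Exception:  # assumption: the SDK raises if the file doesn't exist yet
        existing = ""
    fs.write("workspace/pipeline.jsonl", existing + entry + "\n")

log_step(fs, "extracted", "workspace/extracted/invoices/INV-2026-0001.json")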

What's the latency from upload to extracted JSON?

Dominated by Claude inference: a few hundred ms for a receipt, a couple of seconds for a dense multi-line invoice. TroveFiles's webhook fires within milliseconds of fs.upload() completing, so end-to-end you're looking at sub-three-second pipeline runtime per document for typical AP/expense workloads.

Can the pipeline trigger downstream systems?

Yes — the file.written event for the extracted JSON fires the webhook a second time. Use that to push to your ERP, post to Slack, kick off a finance approval flow, or hand off to a downstream agent. The whole graph is just file events on the wire.
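
In the step 04 handler that is one more branch plus a notifier. The Slack webhook URL and payload below are placeholders for whatever your downstream expects:

import os
import httpx

async def notify_downstream(namespace: str, path: str):
    # Hypothetical hand-off: post to Slack when extracted JSON lands.
    async with httpx.AsyncClient() as http:
        await http.post(os.environ["SLACK_WEBHOOK_URL"],  # placeholder env var
                        json={"text": f"[{namespace}] extracted: {path}"})

# ...inside receive(), after the inbox/ routing:
#     elif path.startswith("workspace/extracted/"):
#         await notify_downstream(event.namespace, path)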

How do I plug this into an existing portal?

Your client portal POSTs uploads straight into the customer's namespace using the namespace-locked key you minted at provision time. From the portal's perspective it's just object storage with a webhook on every PUT. The intelligence pipeline runs entirely on your backend — your portal doesn't need to know it exists.
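
A sketch of the portal side, assuming fs.upload(path, data) takes a path and raw bytes (this page only calls fs.upload() by name, so check the SDK reference for the exact signature):

from fastapi import FastAPI, File, UploadFile
from trove_sdk import TroveClient

portal = FastAPI()
# The namespace-locked key minted in step 01; the portal can't reach other tenants.
fs = TroveClient(api_key=key.api_key, namespace="acme-corp")

@portal.post("/upload/{doc_type}")
async def portal_upload(doc_type: str, file: UploadFile = File(...)):
    fs.upload(f"workspace/inbox/{doc_type}/{file.filename}", await file.read())
    # TroveFiles fires file.written; the backend pipeline takes it from here.
    return {"ok": True}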

One filesystem
per customer.
Whole pipeline included.

Multimodal storage. Sandboxed analytics. Signed webhooks. Per-tenant keys. Snapshot retention. Wire it together in an afternoon.