TroveFiles vs. vector databases for AI agent retrieval.
Vector databases are the default answer for "how does my agent retrieve information?" — but they're overkill for most agent workflows. Here's an honest comparison: when grep beats embeddings, when embeddings win, and when the right answer is both.
Two different retrieval models.
Vector databases (Pinecone, pgvector, Weaviate, Qdrant) retrieve by semantic similarity: chunk the documents, embed the chunks, store the vectors, embed the question, return the nearest matches. Filesystem retrieval (TroveFiles) retrieves by structural pattern: the agent issues a grep, awk, or pdftotext command and gets back the literal match.
For most agent tasks — pulling a clause out of a contract, finding a specific number in a filing, listing files matching a pattern — exact pattern retrieval is faster, cheaper, and easier to reason about. Vector search wins where exact match genuinely fails.
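The difference between the two models fits in a few lines. Here is a minimal sketch of exact-pattern retrieval over a hypothetical two-file corpus (the filenames and text are invented for illustration; in practice the text would come from `pdftotext` or similar):

```python
import re

# Hypothetical corpus: filename -> extracted text.
corpus = {
    "q3-2024.txt": "Revenue grew 12%. Adjusted EBITDA was $4.2M for the quarter.",
    "q2-2024.txt": "Operating income fell. No EBITDA figure was disclosed.",
}

def grep(pattern: str, files: dict) -> list[tuple[str, str]]:
    """Exact-pattern retrieval: return (file, line) pairs that literally match."""
    hits = []
    for name, text in files.items():
        for line in text.splitlines():
            if re.search(pattern, line):
                hits.append((name, line))
    return hits

hits = grep(r"EBITDA", corpus)
# Deterministic: same pattern, same corpus, same hits. No chunking,
# no embedding call, no index — the whole pipeline is the regex.
```

The semantic-similarity path replaces that single loop with chunking, an embedding call per chunk, an index upsert, an embedding call per query, and a nearest-neighbor lookup, which is the cost gap the next section walks through.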
What it costs to retrieve.
The same retrieval task — "find EBITDA in a quarterly filing" — looks dramatically different across the two models. One is a single shell command. The other is a multi-step pipeline.
# Upload once, search forever
trove.upload("workspace/filings/q3-2024.pdf", open("q3.pdf", "rb"))
# Agent retrieves with one shell command per question
bash("grep -r 'EBITDA' workspace/filings/")
bash("pdftotext workspace/filings/q3-2024.pdf - | sed -n '/Risk/,/^$/p'")
# Cost: storage. Latency: ~10ms per grep.
# No embedding pipeline. Deterministic. Files live forever.
The LLM is a better retrieval architect than your pipeline.
The deeper case for filesystem retrieval isn't that grep is fast. It's that the LLM, given a filesystem and a shell, will architect its own retrieval better than any pre-built RAG pipeline does.
A 2026-class model handed a workspace doesn't one-shot a top-k query. It does what an analyst does: ls -la workspace/contracts/ to scan by date, find … | xargs grep for hierarchical search, head -200 to skim before committing to a full read, pdftotext file.pdf - | sed -n '/Risk/,/^$/p' to grab a section, then iterates — looks at intermediate results and decides what to do next.
Vector DBs are stuck in 2023 RAG ergonomics: one query, one ranked list, hope it's right. Filesystem retrieval lets the agent do what it's already good at — multi-step reasoning over partial information.
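The iterate-and-narrow loop described above can be sketched in a few lines. This is a hypothetical illustration, not agent code: the filenames, heuristic, and section-grabbing regex are stand-ins for the `ls`, `grep`, and `sed -n '/Risk/,/^$/p'` steps an agent would actually issue:

```python
import re

# Hypothetical `ls` output and extracted text for two filings.
listing = ["q1-2023.txt", "q3-2024.txt", "notes-old.txt"]
texts = {
    "q1-2023.txt": "Risk Factors\nNone material.\n\nOutlook\nStable.",
    "q3-2024.txt": "Risk Factors\nSupply chain exposure remains.\n\nOutlook\nCautious.",
}

# Step 1: skim the listing and keep only recent filings — the agent's own
# heuristic, decided after seeing the intermediate result, not pre-indexed.
recent = [f for f in listing if "2024" in f]

# Step 2: grab just the 'Risk' section of each candidate, the way
# `pdftotext file.pdf - | sed -n '/Risk/,/^$/p'` grabs it.
def risk_section(text: str) -> str:
    match = re.search(r"Risk.*?(?=\n\n|\Z)", text, re.S)
    return match.group(0) if match else ""

sections = {f: risk_section(texts[f]) for f in recent}
```

Each step's output shapes the next command, which is exactly what a one-shot top-k query cannot do.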
The honest tradeoffs.
| Dimension | TroveFiles (filesystem) | Vector database |
|---|---|---|
| Retrieval model | Exact pattern (grep, awk, jq) | Semantic similarity (cosine, dot product) |
| Best for | Keyword search, structured data, exact-match recall | Fuzzy semantic search, ranking, similarity |
| Write cost | One file write | Chunk + embed + upsert (per doc, per change) |
| Read cost | One shell command | Embed query + index query + rerank |
| Determinism | Same command, same answer | Depends on embedding model and chunking strategy |
| Inspectability | cat the file, eyeball the result | Opaque vectors, no human-readable representation |
| Multi-tenant isolation | Per-namespace directory roots | Per-namespace collections (tooling varies) |
| Multimodal preprocessing | pdftotext, ffmpeg, convert in the workspace | Separate ETL pipeline before embedding |
| Deletion / GDPR | rm -rf workspace/users/alice/ | Find every chunk, drop from index, hope metadata is clean |
| Portability | A directory of files — copy anywhere | Vendor-locked index format; migrating is a project |
| Cost at small scale (< 100k docs) | Storage only | Embeddings + index hosting |
| Cost at large scale (10M+ docs) | Grep latency grows linearly | Sublinear with proper index |
Pick TroveFiles when…
- The agent retrieves by keyword, name, or path.
- You want the LLM to architect retrieval (grep + awk + sed) rather than pre-index.
- Source documents change often — re-embedding is a tax you want to skip.
- Determinism matters more than fuzzy similarity.
- You want one tool that handles memory, files, and retrieval.
Pick a vector database when…
- The question is genuinely semantic ("like X but not exactly X").
- Corpus is huge and stable — pre-indexing pays for itself.
- You need ranking by similarity, not just match/no-match.
- The user query language is far from the document language (translation, paraphrase).
The strongest production setups use both: TroveFiles for the agent's own memory and known-keyword retrieval, a vector database for semantic search over a large stable corpus. See the memory use case for how teams split the work.
Filesystem vs. vector DB, answered.
When should I pick a filesystem over a vector database?
When the agent is retrieving things it knows the keywords for: contract clauses, code identifiers, named entities, exact phrases, structured data. Filesystem retrieval (grep, awk, jq, pdftotext) is faster, cheaper, and deterministic. The agent issues a shell command and gets the exact match.
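For structured data the same point holds: the lookup is an exact predicate, not a similarity score. A minimal sketch, with a hypothetical CRM file standing in for something the agent would query via `jq`:

```python
import json

# Hypothetical structured file the agent might query with something like
# `jq '.customers[] | select(.name == "Acme") | .arr' workspace/crm.json`.
crm_json = '''
{"customers": [
  {"name": "Acme",   "arr": 120000},
  {"name": "Globex", "arr": 95000}
]}
'''

data = json.loads(crm_json)
# Exact-match lookup: no ranking, no similarity threshold —
# either the record exists or it doesn't.
arr = next(c["arr"] for c in data["customers"] if c["name"] == "Acme")
```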
When does a vector database actually pay off?
When the question doesn't map cleanly to keywords — "find conversations that felt similar to this one," "retrieve documents semantically related to a topic," "rank these passages by relevance." Embeddings shine where exact-match search fails, which is a real but narrow set of agent tasks.
Can I use both?
Yes — most production agents do. TroveFiles for the agent's own memory, scratchpad, and known-keyword corpus retrieval. A vector database for semantic similarity over a large external corpus. They are complements, not competitors.
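The split between the two usually reduces to a routing decision. Here is one hedged sketch of such a router — the heuristic (quoted phrases, acronyms, and dotted paths suggest a keyword query) is an invented illustration, not a prescribed rule:

```python
import re

# Hypothetical router: keyword-shaped queries go to filesystem search,
# open-ended semantic queries go to the vector index.
# Heuristic: a quoted phrase, an ALL-CAPS acronym, or a dotted name
# (file path, identifier) signals the user knows the exact token.
KEYWORD_HINTS = re.compile(r'"[^"]+"|\b[A-Z]{2,}\b|\w+\.\w+')

def route(query: str) -> str:
    return "filesystem" if KEYWORD_HINTS.search(query) else "vector_db"

# Example routing decisions:
# route('find the "change of control" clause')     -> "filesystem"
# route('anything thematically close to churn')    -> "vector_db"
```

In production the router can itself be the LLM — the point is only that the two backends answer different query shapes.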
Doesn't a vector database scale better than grep?
For very large corpora (tens of millions of documents), vector indices win on latency. For typical agent corpora — a customer's files, a knowledge base, last year's contracts — TroveFiles stays sub-second by keeping retrieval close to the data, so the agent isn't paying for round trips between an embed call, an index query, and a rerank.
Does grep hold up under concurrent agents?
Yes. TroveFiles is built so each command runs independently — concurrent agents fan out instead of queuing through a shared index. Throughput scales with parallel readers rather than the index tier you pay for. Vector DBs, by contrast, route every query through a single index, so concurrency is bounded by replicas and pricing tier.
What about chunking and re-embedding when documents change?
Vector pipelines have to re-chunk and re-embed every time a source document changes. With TroveFiles, you just write the new file. The next grep picks it up. No re-indexing job, no embedding cost on writes.
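The write-path difference is easy to demonstrate: update the file, and the very next search sees the new content. A minimal sketch using a temporary directory as a stand-in workspace (no TroveFiles API involved):

```python
import pathlib
import re
import tempfile

# Hypothetical workspace: updating a document is just writing the file —
# the next search sees the new content with no re-chunk/re-embed step.
ws = pathlib.Path(tempfile.mkdtemp())
doc = ws / "contract.txt"

doc.write_text("Term: 12 months. Renewal: automatic.")
before = bool(re.search(r"24 months", doc.read_text()))   # not there yet

doc.write_text("Term: 24 months. Renewal: manual.")       # source changed
after = bool(re.search(r"24 months", doc.read_text()))    # next grep picks it up
```

Contrast with a vector pipeline, where that one-line write triggers re-chunking, an embedding call per chunk, and an index update before the change is retrievable.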
How do I migrate from a vector database to TroveFiles?
Most migrations are partial: keep the vector DB for true semantic queries, move the keyword and structured-data retrieval onto TroveFiles. Upload the source docs, point the agent's bash tool at TroveFiles, and start removing custom retrieval code. Most teams find 60-80% of their queries collapse into grep/awk/jq.
Who's running TroveFiles in production?
TroveFiles is the storage layer behind Silvia, our AI CFO with over $30 billion in connected assets. Every Silvia user has a TroveFiles namespace where the agent stores memories, skills, and preferences and retrieves them via shell commands across sessions.
Try retrieval that doesn't need embeddings.
Upload a file, run grep, see the answer. API key in 30 seconds.