Author: Yuval Avidani
Key Takeaway
PageIndex is an open-source RAG framework that eliminates vector embeddings entirely, using hierarchical document trees to enable reasoning-based retrieval instead of similarity matching. Created by Vectify AI, it achieves 98.7% accuracy on FinanceBench and provides full explainability for professional document analysis.
What is PageIndex?
PageIndex is a fundamentally different approach to Retrieval-Augmented Generation (RAG) that rethinks how we retrieve information from documents. It tackles the context loss and inaccurate retrieval we all face when working with long, complex documents in traditional vector-based RAG systems.
Traditional RAG relies on two core assumptions: that documents should be split into chunks, and that semantic similarity (measured via vector embeddings) equals relevance. PageIndex challenges both assumptions by introducing what they call "Vectorless RAG" - a system that maintains document structure and uses LLM reasoning instead of vector matching.
The Problem We All Know
Anyone who has worked with RAG systems on professional documents - financial reports, legal contracts, technical specifications - knows the frustration. We ask a precise question about a 50-page document, and the system returns chunks that are semantically similar but structurally wrong. The executive summary mentions the same keywords as the detailed footnote, so both get high similarity scores, but only one actually answers our question.
This happens because chunking destroys document structure. When we split a report into 500-token pieces, we lose the hierarchical organization that makes documents navigable. Chapter 3, Section 2, Paragraph 4 becomes just "chunk_47" - divorced from its context in the document's logical flow.
Vector similarity compounds the problem. Embeddings measure semantic closeness, not relevance. Two passages can be "near" each other in embedding space while being completely different in meaning when we consider their position and role in the document structure. It's like judging book relevance by word frequency instead of reading the table of contents.
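To make the failure mode concrete, here is a minimal sketch of naive fixed-size chunking (sizes and text are illustrative, not from any real pipeline). Once a document is flattened this way, a passage's position in the hierarchy is gone, and two chunks that share keywords look equally relevant:

```python
# Naive fixed-size chunking: split on word count, ignore structure.
def naive_chunk(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into fixed-size pieces with no notion of sections."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

document = (
    "Executive Summary. Revenue grew strongly this year. "
    "Footnote 12. Revenue figures exclude discontinued operations."
)
chunks = naive_chunk(document, chunk_size=8)
# Both chunks mention "Revenue", so both score highly on keyword or
# embedding similarity -- but only one carries the precise detail, and
# neither chunk knows which section of the document it came from.
for chunk in chunks:
    print(chunk)
```

The summary and the footnote end up as peers in a flat list, which is exactly why similarity scores alone can't tell them apart.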
How PageIndex Works
PageIndex takes a fundamentally different approach. Instead of chunking and embedding, it builds what they call a "Tree Index" - think of it like an intelligent, multi-level table of contents with summaries at every node.
Here's the process: First, the system analyzes the document structure and creates a hierarchical tree. At the top level, you have chapter summaries. Each chapter branches into section summaries. Each section branches into paragraph summaries. This preserves the logical organization of the document.
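A tree like that might look something like the following sketch. The node fields and the report content here are illustrative, not PageIndex's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    title: str       # e.g. "Financial Results"
    summary: str     # LLM-written summary of everything under this node
    text: str = ""   # raw content, populated at the leaf level
    children: list["TreeNode"] = field(default_factory=list)

# Top level: the document; each chapter branches into sections, and so on.
report = TreeNode(
    title="Q3 Earnings Report",
    summary="Quarterly results, outlook, and notes.",
    children=[
        TreeNode(
            title="Financial Results",
            summary="Revenue, margins, and year-over-year comparisons.",
            children=[
                TreeNode(
                    title="Revenue",
                    summary="Q3 revenue and growth vs. prior year.",
                    text="Q3 revenue was $1.2B, up 14% year over year.",
                ),
            ],
        ),
        TreeNode(title="Outlook", summary="Guidance for Q4."),
    ],
)
```

Every node carries both a summary (for navigation decisions) and, at the leaves, the underlying text (for answering).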
The magic happens during retrieval. Instead of computing vector similarity scores, PageIndex uses an agentic workflow where the LLM reasons its way through the tree structure. The LLM examines the top-level summaries, decides which branch is most likely to contain the answer, drills down to the next level, evaluates those summaries, and continues navigating until it reaches the relevant content.
Think of it like this: Imagine you're looking for information in a textbook. You don't read every page and compute similarity scores. You look at the table of contents, find the relevant chapter, scan that chapter's sections, and drill down to the specific paragraph. That's exactly what PageIndex enables the LLM to do.
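The navigation loop above can be sketched in a few lines. This is a self-contained toy, not PageIndex's implementation: `choose_branch` stands in for the LLM call that reads child summaries and picks a branch; here it is a trivial keyword match so the sketch runs as-is:

```python
def choose_branch(query: str, children: list[dict]) -> dict:
    """Placeholder for an LLM decision over child summaries."""
    terms = query.lower().split()
    for node in children:
        haystack = (node["title"] + " " + node["summary"]).lower()
        if any(t in haystack for t in terms):
            return node
    return children[0]

def navigate(tree: dict, query: str, max_depth: int = 5):
    """Walk down the tree, recording the reasoning path taken."""
    node, path = tree, [tree["title"]]
    for _ in range(max_depth):
        if not node.get("children"):   # reached a leaf
            break
        node = choose_branch(query, node["children"])
        path.append(node["title"])
    return node.get("text", ""), path

tree = {
    "title": "Q3 Earnings Report",
    "summary": "Quarterly results and outlook.",
    "children": [
        {"title": "Outlook", "summary": "Guidance for Q4.", "text": "..."},
        {
            "title": "Financial Results",
            "summary": "Revenue and margins.",
            "children": [
                {"title": "Revenue", "summary": "Q3 revenue growth.",
                 "text": "Revenue grew 14% year over year."},
            ],
        },
    ],
}
answer, path = navigate(tree, "What was the revenue growth?")
print(path)    # the audit trail through the document
print(answer)
```

The returned path is what makes the retrieval explainable: every hop is a decision we can inspect.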
Quick Start
Here's how we get started with PageIndex:
# Installation
pip install pageindex
# Basic usage
from pageindex import TreeIndexer, ReasoningRetriever
# Build the tree index from a document
indexer = TreeIndexer()
tree = indexer.build_tree(document_path="report.pdf")
# Query using reasoning-based retrieval
retriever = ReasoningRetriever(tree)
result = retriever.query("What was the Q3 revenue growth?")
print(result.answer)
print(result.reasoning_path) # Shows the path through the tree
A Real Example
Let's say we're analyzing a financial earnings report and need to find specific metrics:
from pageindex import TreeIndexer, ReasoningRetriever
# Build hierarchical index of the earnings report
indexer = TreeIndexer(
summarization_model="gpt-4",
preserve_structure=True
)
tree = indexer.build_tree("Q3_earnings_report.pdf")
# Query with reasoning
retriever = ReasoningRetriever(
tree=tree,
reasoning_model="gpt-4",
max_depth=5 # How deep to navigate the tree
)
# The LLM will reason: "Revenue growth is likely in Financial Results
# section, probably in the Revenue subsection, specifically in Q3 data"
result = retriever.query(
"What was the year-over-year revenue growth in Q3?"
)
print(f"Answer: {result.answer}")
print(f"Reasoning path: {result.reasoning_path}") # Shows the navigation
print(f"Source location: {result.source_reference}") # Exact document location
Key Features
- Vectorless Retrieval - No embeddings, no chunking. The system maintains the original document structure and uses LLM reasoning to navigate it. Think of it like having an AI research assistant who knows how to use a table of contents instead of doing full-text search.
- Hierarchical Tree Structure - Documents are represented as multi-level trees with summaries at each node. This preserves the logical organization and makes navigation explainable - we can see exactly which path the LLM took through the document.
- Reasoning-Based Navigation - Instead of similarity scores, the LLM actively decides where to look next based on query intent and tree structure. It's an agentic workflow under the hood, where the model controls its own retrieval process.
- Full Explainability - Every retrieval shows the complete path through the tree. We can audit exactly why the system chose to navigate to a particular section, making it suitable for high-stakes applications like legal or financial analysis.
- SOTA Performance - Achieves 98.7% accuracy on FinanceBench, a challenging benchmark for financial document QA. This beats traditional vector RAG approaches significantly on complex, structured documents.
When to Use PageIndex vs. Alternatives
PageIndex shines when document structure matters and accuracy is critical. If we're building a system to analyze legal contracts, financial reports, technical specifications, or research papers - documents where hierarchical organization is meaningful - PageIndex's reasoning-based approach will outperform traditional vector RAG.
Traditional vector RAG systems like LangChain with FAISS or Pinecone are still excellent for semantic search across large document collections. If we're building a knowledge base where users ask broad questions and we want to find all related content across hundreds of documents, vector similarity is perfect. It's fast, scalable, and good at finding conceptually related content.
The key difference: vector RAG is optimized for finding similar content. PageIndex is optimized for finding precise answers in structured documents. Use vectors when we're searching across a corpus. Use PageIndex when we're analyzing within a document.
Hybrid approaches are also possible. We could use vector search to find the top 3 relevant documents from a large collection, then use PageIndex to precisely extract answers from each of those documents.
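That hybrid pattern can be sketched as a two-stage pipeline. Everything here is a hypothetical stand-in: `vector_top_k` fakes a vector store with term overlap, and `reason_within` fakes a PageIndex-style query inside one document:

```python
def vector_top_k(query: str, corpus: list[dict], k: int = 3) -> list[dict]:
    """Placeholder ranking: score documents by shared query terms."""
    terms = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda doc: len(terms & set(doc["text"].lower().split())),
        reverse=True,
    )[:k]

def reason_within(query: str, doc: dict) -> str:
    """Placeholder for tree-based retrieval inside a single document."""
    return f"answer drawn from '{doc['title']}'"

corpus = [
    {"title": "Q3 Earnings Report", "text": "revenue grew 14% in q3"},
    {"title": "Employee Handbook", "text": "vacation policy and benefits"},
    {"title": "Q2 Earnings Report", "text": "revenue grew 9% in q2"},
]
query = "What was the q3 revenue growth?"

# Stage 1: vector search narrows the corpus to a few candidates.
candidates = vector_top_k(query, corpus, k=2)
# Stage 2: reasoning-based retrieval answers precisely within each.
answers = [reason_within(query, doc) for doc in candidates]
```

Stage 1 buys scalability across the collection; stage 2 buys precision and an audit trail within each document.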
My Take - Will I Use This?
In my view, this is one of the most important advances in RAG I've seen this year. The fundamental insight - that document structure matters and reasoning beats similarity for complex retrieval tasks - is spot on.
I'm already planning to use PageIndex for our financial document analysis workflows. We've struggled with traditional RAG giving us contextually wrong answers because it would match keywords in the wrong sections. Having the LLM reason through the document structure like a human analyst would is exactly what we need.
The explainability is huge for us. In professional contexts, we can't just show an answer - we need to show where it came from and why the system chose that path. PageIndex's tree navigation gives us that audit trail automatically.
The limitation to watch: building the tree index is computationally more expensive than simple chunking and embedding. For documents we only query once or twice, the overhead might not be worth it. But for documents we query repeatedly - regulatory filings, technical specs, research papers - the upfront cost pays off in accuracy and explainability.
Check out the project: PageIndex on GitHub
Frequently Asked Questions
What is PageIndex?
PageIndex is an open-source RAG framework that eliminates vector embeddings and chunking, instead using hierarchical document trees to enable reasoning-based retrieval where LLMs navigate document structure like humans use a table of contents.
Who created PageIndex?
PageIndex was created by Vectify AI, a team focused on advancing document intelligence and reasoning-based AI systems.
When should we use PageIndex?
Use PageIndex when analyzing long, structured documents where hierarchy matters - financial reports, legal contracts, technical specifications, research papers - and when accuracy and explainability are critical.
What are the alternatives to PageIndex?
Traditional vector-based RAG systems (LangChain with FAISS/Pinecone/Weaviate) are excellent alternatives for semantic search across large document collections. Use vectors for broad search across corpora, use PageIndex for precise analysis within structured documents.
What are the limitations of PageIndex?
Building the hierarchical tree index requires more upfront computation than simple chunking and embedding. It's best suited for documents that will be queried multiple times, not one-off document processing.
