A Better Approach For Doc Chatbots?

What if the way we build AI doc chatbots right now is flawed? Most systems use RAG. They cut documents into chunks, create embeddings, and retrieve answers using similarity search. It works in demos but often fails in real use. It misses obvious answers or picks the wrong context. Now there’s a new approach called PageIndex. It doesn’t use chunking, embeddings, or vector databases. Yet it reaches up to 98.7% accuracy on tough document Q&A tasks. In this article, we’ll break down how PageIndex works, why it performs better on structured documents, and how you can build your own chatbot using it.

The Problem with Traditional RAG

Here’s the classic RAG pipeline you’ve probably seen a hundred times.

  • You take your document (a PDF, a report, a contract) and chop it into chunks. Maybe 512 tokens each, maybe with some overlap.
  • You run each chunk through an embedding model to turn it into a vector: a long list of numbers that represents the “meaning” of that chunk.
  • You store all those vectors in a vector database: Pinecone, Weaviate, Chroma, whatever your flavour is.
  • When the user asks a question, you embed the question the same way, and you do a cosine similarity search to find the chunks whose vectors are closest to the question vector.
  • You hand those chunks to the LLM as context, and it writes the answer.
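To make the retrieval step concrete, here is a toy sketch with made-up three-dimensional vectors standing in for real embeddings. A real pipeline would get these vectors from an embedding model and store them in a vector database; the ranking logic, though, is exactly this:

```python
import math

# Toy corpus: each "chunk" is paired with a made-up 3-dimensional embedding.
# A real system would produce these vectors with an embedding model.
chunks = {
    "Q3 revenue grew 12% year over year.": [0.9, 0.1, 0.0],
    "The termination clause is in Section 14.3.": [0.1, 0.8, 0.2],
    "Employees accrue 20 vacation days per year.": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, k=1):
    # Rank every chunk by similarity to the query vector and keep the top k
    ranked = sorted(chunks, key=lambda c: cosine(chunks[c], query_vec), reverse=True)
    return ranked[:k]

# Pretend this vector is the embedded question "What's the termination clause?"
print(retrieve([0.2, 0.9, 0.1]))  # ['The termination clause is in Section 14.3.']
```

Everything that follows in this section is about the ways this simple ranking goes wrong.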

Simple. Elegant. And completely riddled with failure modes.

Problem 1: Arbitrary chunking destroys context

When you slice a document at 512 tokens, you’re not respecting the document’s actual structure. A single table might get split across three chunks. A footnote that’s essential to understanding the main text ends up in a completely different chunk. The answer you need might literally span two adjacent chunks, and the retriever picks only one of them.

Problem 2: Similarity is not the same as relevance

This is the big one. Vector similarity finds text that sounds like your question. But documents often don’t repeat the question’s phrasing when they answer it. Ask “What’s the termination clause?” and the contract might just say “Section 14.3 — Dissolution of Agreement.” Low cosine similarity. Missed entirely.
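You can see a crude version of this failure with plain word overlap. Embeddings are subtler than a Jaccard score, but the failure mode is the same kind of thing: the question and the heading that answers it share no vocabulary at all.

```python
def token_overlap(a, b):
    # Crude lexical similarity: shared words over total distinct words (Jaccard)
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

question = "what is the termination clause"
heading = "section 14.3 dissolution of agreement"
print(token_overlap(question, heading))  # 0.0 -- not a single word in common
```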

Problem 3: It’s a black box

You get three chunks back. Why those three? You have no idea. It’s pure math. There’s no reasoning, no explanation, no audit trail. For financial documents, legal contracts, and medical records, that opacity is a serious problem.

Problem 4: It doesn’t scale to long documents

A 300-page technical manual with complex cross-references? The sheer number of chunks makes retrieval noisy. You end up getting chunks that are vaguely related instead of the exact section you need.

These aren’t edge cases. These are the everyday failures that RAG engineers spend most of their time fighting. And the reason they happen is actually quite simple: the entire architecture is borrowed from search engines, not from how humans actually read and understand documents.

When a human expert needs to answer a question from a document, they don’t scan every sentence looking for the one that sounds most similar to the question. They open the table of contents, skim the chapter headings, navigate, and reason about where the answer should be before they even start reading.

That’s the insight behind PageIndex.

What is PageIndex?

PageIndex was built by VectifyAI and open-sourced on GitHub. The core idea is deceptively simple:

Instead of searching a document, navigate it, the way a human expert would.

Here’s the key mental shift. Traditional RAG asks: “Which chunks look most similar to my question?”

PageIndex asks: “Where in this document would a smart human look for the answer to this question?”

These are two very different questions. And the second turns out to produce dramatically better results.

PageIndex does this by building what it calls a Reasoning Tree. It’s essentially an intelligent, AI-generated table of contents for your document.

Here’s how to visualize it. At the top, you have a root node that represents the entire document. Below that, you have nodes for each major section or chapter. Each of those branches into subsections. Each subsection branches into specific topics or paragraphs. Every single node in this tree has two things:

  1. A title: what this section is about
  2. A summary: a concise AI-generated description of what’s in this section

This tree is built once, when you first submit the document. It’s your index.
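As an illustration, such a tree might look like the nested structure below. The field names here are assumptions for the sketch, not the exact PageIndex schema:

```python
# Illustrative reasoning tree: every node has a title, a summary, and children.
# Keys ("node_id", "title", "summary", "nodes") are assumed for this sketch.
tree = {
    "node_id": "0000",
    "title": "Annual Report 2024",
    "summary": "Full-year results, risk factors, and outlook.",
    "nodes": [
        {
            "node_id": "0001",
            "title": "Q3 Financial Results",
            "summary": "Quarterly revenue, margins, and segment detail.",
            "nodes": [
                {
                    "node_id": "0012",
                    "title": "Revenue Breakdown",
                    "summary": "Revenue by product line and region.",
                    "nodes": [],
                },
            ],
        },
    ],
}

def count_nodes(node):
    # Walk the tree depth-first: the node itself plus all of its descendants
    return 1 + sum(count_nodes(child) for child in node.get("nodes", []))

print(count_nodes(tree))  # 3
```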

Now here’s where it gets clever. When you ask a question, PageIndex does two things:

1. Tree Search (Navigation)

It sends the question to an LLM along with the tree, but just the titles and summaries, not the full text. The LLM reads through the tree like a human reads a table of contents, and it reasons: “Okay, given this question, which branches of the tree are most likely to contain the answer?”

The LLM returns a list of specific node IDs, and you can see its reasoning. It literally tells you why it chose those sections. Full transparency.

2. Answer Generation

PageIndex fetches only the full text of those selected nodes, hands it to the LLM as context, and the LLM writes the final answer grounded entirely in the actual document text.

Two LLM calls. No embeddings. No vector database. Just reasoning.

And because every answer is tied to specific nodes in the tree, you always know exactly which page, which section, which part of the document the answer came from. Full audit trail. Full explainability.

How it Works: Deep Dive

Let me go deeper into the mechanics, because this is the really fascinating part.

The Tree Index – Building Phase

When you call submit_document(), PageIndex reads your PDF or text file and does something remarkable. It doesn’t just extract text; it also understands the structure. Using a combination of layout analysis and LLM reasoning, it identifies:

  • What are the natural sections and subsections?
  • Where does one topic end and another begin?
  • How do the pieces relate to each other hierarchically?

It then constructs the tree and generates a summary for every node. Not just a title. An actual condensed description of what’s in that section. This is what enables the smart navigation later.

The tree uses a numeric node ID system that mirrors the real document structure: 0001 might be Chapter 1, 0002 Chapter 2, 0003 the first section within Chapter 1, and so on. The hierarchy is preserved.
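A minimal sketch of what such an index buys you, with hypothetical IDs and parent links (the real PageIndex tree carries titles, summaries, and text as well):

```python
# Hypothetical flat index mirroring the numeric-ID scheme described above:
# each entry maps a node ID to its title and its parent's ID (None = top level).
index = {
    "0001": ("Chapter 1", None),
    "0002": ("Chapter 2", None),
    "0003": ("1.1 Overview", "0001"),
    "0004": ("1.2 Details", "0001"),
}

def breadcrumb(node_id):
    # Follow parent links upward to recover the node's place in the hierarchy
    parts = []
    while node_id is not None:
        title, parent = index[node_id]
        parts.append(title)
        node_id = parent
    return " > ".join(reversed(parts))

print(breadcrumb("0003"))  # Chapter 1 > 1.1 Overview
```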

Why This Beats Chunking

Think about what chunking does to a 50-page financial report. You get maybe 300 chunks, each with zero awareness of whether it came from the executive summary or a footnote on page 47. The embedder treats them all equally.

The PageIndex tree, on the other hand, knows that node 0012 is the “Revenue Breakdown” subsection under the “Q3 Financial Results” section under “Annual Report 2024.” That structural awareness is enormously valuable when you’re searching for something specific.

The Search Phase – Reasoning, Not Math

Here’s the other thing that makes PageIndex special. The search step is not a mathematical operation. It’s a cognitive operation performed by an LLM.

When you ask, “What were the main risk factors disclosed in this report?”, the LLM doesn’t measure cosine distance. It reads the tree, recognizes that the “Risk Factors” section is exactly what’s needed, and selects those nodes, just like you would.

This means PageIndex handles semantic mismatch naturally. This is the kind of mismatch that kills vector search. The document calls it “Risk Factors.” Your question calls it “main dangers.” A vector search might miss it. An LLM reading the tree structure won’t.

The Numbers

PageIndex powered Mafin 2.5, VectifyAI’s financial RAG system, which achieved 98.7% accuracy on FinanceBench. For those unfamiliar, this is a benchmark specifically designed to test AI systems on financial document questions, where the documents are long, complex, and full of tables and cross-references. That’s the hardest setting for traditional RAG. It’s where PageIndex shines most.

What is it Best For?

PageIndex is particularly powerful for:

  • Financial reports: earnings statements, SEC filings, 10-Ks
  • Legal contracts: where every clause matters and context is everything
  • Technical manuals: complex cross-referenced documentation
  • Policy documents: HR policies, compliance documents, regulatory filings
  • Research papers: structured academic content

Basically: anywhere your document has real structure that chunking would destroy.

And the really exciting thing? You can use it with any LLM. OpenAI, Anthropic, Gemini: the tree search and answer generation steps are just prompts. You’re in full control.

Hands-on With a Jupyter Notebook

Okay. You now know the theory. You know why PageIndex exists, what it does, and how it works under the hood. Now let’s actually build something with it.

I’m going to open a Jupyter notebook and walk you through the whole PageIndex pipeline: uploading a document, getting the reasoning tree back, navigating it with an LLM, and asking questions. Every line of code is explained. No hand-waving.

Install PageIndex

%pip install -q --upgrade pageindex

First things first. We install the pageindex Python library. One line, done. No vector database to set up. No embedding model to download. This is already simpler than any traditional RAG setup.

Imports & API Setup

import os
from pageindex import PageIndexClient
import pageindex.utils as utils
from dotenv import load_dotenv
load_dotenv()
PAGEINDEX_API_KEY = os.getenv("PAGEINDEX_API_KEY")
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)

We import the PageIndexClient. This is our connection to the PageIndex API. All the heavy lifting of building the tree happens on their end, so we don’t need a beefy machine. We also load API keys from a .env file; always keep your keys out of your code.

OpenAI Setup

import openai

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # loaded from .env earlier

async def call_llm(prompt, model="gpt-4.1-mini", temperature=0):
    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
    response = await client.chat.completions.create(
        model=model, temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

Here we define our LLM helper function. We’re using GPT-4.1-mini for cost efficiency, but this works with any OpenAI model, and you could swap in Claude or Gemini with a one-line change. Temperature zero keeps the answers factual and consistent.

Submit the Document

pdf_path = "/Users/soumil/Desktop/PageIndex/HR Policies-1.pdf"
doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print('Document Submitted:', doc_id)

This is the magic line. We point to our PDF, in this case an HR policy document, and submit it. PageIndex takes the file, reads its structure, and starts building the reasoning tree in the background. We get back a doc_id, a unique identifier for this document that we’ll use in every subsequent call. Notice there’s no chunking code, no embedding call, no vector database connection.

Wait for Processing & Get the Tree

import time

while not pi_client.is_retrieval_ready(doc_id):
    print("Still processing... retrying in 10 seconds")
    time.sleep(10)

tree = pi_client.get_tree(doc_id, node_summary=True)['result']
utils.print_tree(tree)

PageIndex processes the document asynchronously; we just poll every 10 seconds until it’s ready. Then we call get_tree() with node_summary=True, which gives us the full tree structure along with the summaries.

Look at this output. This is the reasoning tree. You can see the hierarchy: the top-level HR Policies node, then the Digital Communication Policy, Sexual Harassment Policy, and Grievance Redressal Policy, each branching into its subsections. Every node has an ID, a title, and a summary of what’s in it.

This is what traditional RAG throws away. The structure. The relationships. The hierarchy. PageIndex keeps all of it.

Tree Search with the LLM

import json

query = "What are the key HR policies and employee guidelines?"
tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])
search_prompt = f"""
You are given a question and a tree structure of a document...
Question: {query}
Document tree structure: {json.dumps(tree_without_text, indent=2)}
Reply in JSON: {{ "thinking": "...", "node_list": [...] }}
"""
tree_search_result = await call_llm(search_prompt)

Now we search. For this, we build a prompt that includes the question and the entire tree, but crucially, without the full text content of each node. Just the titles and summaries. This keeps the prompt manageable while giving the LLM everything it needs to navigate.

The LLM is instructed to return a JSON object with two things: its thinking process and the list of relevant node IDs.

Look at the output. The LLM tells us exactly why it chose each section. It reasoned through the tree like a human would. And it gave us a list of 30 node IDs, every section of this HR document, because the question is broad.

This transparency is something you simply can’t get with cosine similarity.

Fetch Text and Generate Answer

# Parse the LLM's JSON reply from the tree-search step
tree_search_result_json = json.loads(tree_search_result)
node_list = tree_search_result_json["node_list"]
# node_map is a lookup from node ID to its full-text node, built from the tree
relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)
answer_prompt = f"""Answer the question based on the context:
Question: {query}
Context: {relevant_content}"""
answer = await call_llm(answer_prompt)
utils.print_wrapped(answer)

Step two. Now that we know which nodes are relevant, we fetch their full text, only those nodes, nothing else. We join the text and build a clean context prompt. One more LLM call, and we get our answer.

Look at this answer. Detailed, structured, accurate. And every single claim can be traced back to a specific node in the tree, which maps to a specific page in the PDF. Full audit trail. Full explainability.

The ask() Function

async def ask(query):
    # Full pipeline: tree search → text retrieval → answer generation
    ...

user_query = input("Enter your query: ")
await ask(user_query)

Now we package the whole pipeline into a single ask() function. Submit a question, get an answer; the tree search, retrieval, and generation all happen under the hood. Let me show you a couple of live examples.
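For reference, here is one way the elided ask() body might compose the earlier cells. The LLM is stubbed out so the sketch runs offline; the prompt wording, the stub, and the sample data are illustrative, not the exact notebook code:

```python
import asyncio
import json

async def ask(query, tree_without_text, node_map, call_llm):
    # Step 1: tree search - the LLM picks the relevant node IDs from the tree
    search_prompt = (
        "You are given a question and a tree structure of a document.\n"
        f"Question: {query}\n"
        f"Document tree structure: {json.dumps(tree_without_text)}\n"
        'Reply in JSON: {"thinking": "...", "node_list": [...]}'
    )
    node_list = json.loads(await call_llm(search_prompt))["node_list"]
    # Step 2: fetch only those nodes' text and generate the grounded answer
    context = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)
    answer_prompt = f"Answer the question based on the context:\nQuestion: {query}\nContext: {context}"
    return await call_llm(answer_prompt)

# Stub LLM so the sketch runs without an API key
async def fake_llm(prompt):
    if '"node_list"' in prompt:
        return '{"thinking": "The harassment policy section applies.", "node_list": ["0005"]}'
    return "Penalties range from a written warning up to dismissal."

node_map = {"0005": {"text": "Sexual Harassment Policy: penalties range from a written warning up to dismissal."}}
answer = asyncio.run(ask(
    "What are the penalties for sexual harassment?",
    {"title": "HR Policies"}, node_map, fake_llm,
))
print(answer)
```

The two awaited calls are the whole story: one to navigate, one to answer.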

Type a question: e.g., “What are the penalties for sexual harassment?”

Watch what happens. It searches the tree, identifies the Sexual Harassment Policy nodes specifically, pulls their text, and gives us a precise, cited answer in seconds. This is the experience you want to ship to your users.

Another one. Again, it finds exactly the right section. No confusion, no noise, no hallucination. Just the answer, from the document, with a clear path showing where it came from.

Conclusion

Let’s bring this together. Traditional RAG finds text that looks similar to a question. But the real goal is to find the right answer in a structured document. PageIndex solves this better. It builds a reasoning tree and lets the model navigate it intelligently. The result is accurate and explainable answers, with up to 98.7% accuracy on FinanceBench. It isn’t perfect for every use case. Vector search still works well for large-scale semantic search. But for long, structured documents, PageIndex is a stronger approach. You can find all the code in the description. Add your API keys and get started.

I’m a Data Science Trainee at Analytics Vidhya, passionately working on the development of advanced AI solutions such as Generative AI applications, Large Language Models, and cutting-edge AI tools that push the boundaries of technology. My role also involves creating engaging educational content for Analytics Vidhya’s YouTube channels, developing comprehensive courses that cover the full spectrum of machine learning to generative AI, and authoring technical blogs that connect foundational concepts with the latest innovations in AI. Through this, I aim to contribute to building intelligent systems and share knowledge that inspires and empowers the AI community.
