Multi-Document Agent
In this guide, you'll learn how to set up an agent that can effectively answer different types of questions over a larger set of documents. These questions include the following:
- QA over a specific doc
- QA comparing different docs
- Summaries over a specific doc
- Comparing summaries between different docs
We do this with the following architecture:
- set up a “document agent” over each Document: each doc agent can do QA/summarization within its doc
- set up a top-level agent over this set of document agents: it retrieves the relevant tools, then performs chain-of-thought (CoT) reasoning over them to answer a question (see the sketch after this list)
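Concretely, the wiring we are building toward looks roughly like the sketch below. Treat it as an illustration under assumptions, not finished code: docAgents is a hypothetical placeholder for the per-document agents created later in this guide, and the constructor options shown (metadata, toolRetriever) are inferred from the llamaindex symbols this guide imports rather than verified here.

// Hypothetical placeholder: one agent per document, built later in this guide.
const docAgents: Record<string, OpenAIAgent> = {};

// Wrap each document agent as a tool the top-level agent can call.
// Assumption: this llamaindex version lets an agent act as a queryEngine.
const allTools = Object.entries(docAgents).map(
  ([title, agent]) =>
    new QueryEngineTool({
      queryEngine: agent,
      metadata: {
        name: `tool_${title}`,
        description: `Answers questions about ${title}.`,
      },
    }),
);

// Index the tools so only the relevant ones are retrieved per question.
const toolMapping = SimpleToolNodeMapping.fromObjects(allTools);
const objectIndex = await ObjectIndex.fromObjects(
  allTools,
  toolMapping,
  VectorStoreIndex,
);

// Top-level agent: retrieves tools, then reasons over them to answer.
const topAgent = new OpenAIAgent({
  toolRetriever: await objectIndex.asRetriever({}),
  llm,
});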
Setup and Download Data
We start by installing the necessary libraries and downloading the data.
pnpm i llamaindex
import {
  Document,
  ObjectIndex,
  OpenAI,
  OpenAIAgent,
  QueryEngineTool,
  SimpleNodeParser,
  SimpleToolNodeMapping,
  SummaryIndex,
  VectorStoreIndex,
  serviceContextFromDefaults,
  storageContextFromDefaults,
} from "llamaindex";
Then, for the data, we'll run through a list of countries and download the Wikipedia page for each country.
import fs from "fs";
import path from "path";

const dataPath = path.join(__dirname, "tmp_data");

const extractWikipediaTitle = async (title: string) => {
  const fileExists = fs.existsSync(path.join(dataPath, `${title}.txt`));

  if (fileExists) {
    console.log(`File already exists for the title: ${title}`);
    return;
  }

  // Request the plain-text extract of the article from the MediaWiki API.
  const queryParams = new URLSearchParams({
    action: "query",
    format: "json",
    titles: title,
    prop: "extracts",
    explaintext: "true",
  });

  const url = `https://en.wikipedia.org/w/api.php?${queryParams}`;

  const response = await fetch(url);
  const data: any = await response.json();

  const pages = data.query.pages;
  const page = pages[Object.keys(pages)[0]];
  const wikiText = page.extract;

  try {
    await fs.promises.writeFile(path.join(dataPath, `${title}.txt`), wikiText);
    console.log(`${title} stored in file!`);
  } catch (err) {
    console.error(err);
  }
};

export const extractWikipedia = async (titles: string[]) => {
  if (!fs.existsSync(dataPath)) {
    fs.mkdirSync(dataPath);
  }

  for (const title of titles) {
    await extractWikipediaTitle(title);
  }

  console.log("Extraction finished!");
};
These files will be saved in the tmp_data folder.
Now we can define the list of countries (we'll reuse this list when loading the files later) and call the function to download the data for each country.

const wikiTitles = [
  "Brazil",
  "United States",
  "Canada",
  "Mexico",
  "Argentina",
  "Chile",
  "Colombia",
  "Peru",
  "Venezuela",
  "Ecuador",
  "Bolivia",
  "Paraguay",
  "Uruguay",
  "Guyana",
  "Suriname",
  "French Guiana",
  "Falkland Islands",
];

await extractWikipedia(wikiTitles);
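As an optional sanity check (an addition here, not part of the original flow), you can list the tmp_data directory to confirm one text file was written per title; this reuses the fs, path, and dataPath bindings from the snippet above.

// Expect one .txt file per requested title, e.g. "Brazil.txt".
const downloaded = fs
  .readdirSync(dataPath)
  .filter((file) => file.endsWith(".txt"));

console.log(`Downloaded ${downloaded.length} extracts:`, downloaded);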
Load the Data
Now that we have the data, we can load it into LlamaIndex, storing each file as a Document.
import { Document } from "llamaindex";

const countryDocs: Record<string, Document> = {};

for (const title of wikiTitles) {
  const filePath = path.join(dataPath, `${title}.txt`);
  const text = await fs.promises.readFile(filePath, "utf-8");
  const document = new Document({ text, id_: filePath });

  countryDocs[title] = document;
}
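Optionally (again an addition, not part of the original guide), you can verify that each Document was populated with non-empty text before building anything on top of it.

// Each title should map to a Document with non-empty text.
for (const [title, doc] of Object.entries(countryDocs)) {
  console.log(`${title}: ${doc.getText().length} characters`);
}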
Setup LLM and StorageContext
We will be using gpt-4 for this example, and we will use a StorageContext to persist the documents and indexes to disk in the ./storage folder.
const llm = new OpenAI({
  model: "gpt-4",
});

const ctx = serviceContextFromDefaults({ llm });

const storageContext = await storageContextFromDefaults({
  persistDir: "./storage",
});
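With the LLM, ServiceContext, and StorageContext in place, later steps pass both contexts when building the per-document indexes so the results are persisted under ./storage. A minimal sketch, assuming this llamaindex version's VectorStoreIndex.fromDocuments accepts serviceContext and storageContext in its options argument:

// Sketch: build one country's vector index, persisted via the storage context.
const brazilIndex = await VectorStoreIndex.fromDocuments(
  [countryDocs["Brazil"]],
  { serviceContext: ctx, storageContext },
);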