PyMuPDF, LLM & RAG#

Integrating PyMuPDF into your Large Language Model (LLM) framework and overall RAG (Retrieval-Augmented Generation) solution provides the fastest and most reliable way to deliver document data.

There are a few well-known LLM frameworks that provide their own interfaces to PyMuPDF - it is a fast-growing area, so please let us know if you discover any more!

Integration with LangChain#

It is simple to integrate directly with LangChain by using their dedicated loader as follows:

from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example.pdf")
data = loader.load()
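
The loader typically returns one `Document` per page, each carrying `page_content` and a `metadata` dict. The sketch below uses a stand-in dataclass (not LangChain's actual class) purely to illustrate that shape and how the pages might be re-joined:

```python
from dataclasses import dataclass

# Stand-in for langchain_core.documents.Document, which exposes the
# same two attributes on the objects returned by loader.load()
@dataclass
class Document:
    page_content: str
    metadata: dict

# Hypothetical per-page results, as the loader might return them
data = [
    Document("Page one text", {"page": 0, "source": "example.pdf"}),
    Document("Page two text", {"page": 1, "source": "example.pdf"}),
]

# Join all pages into one string for whole-document processing
full_text = "\n".join(doc.page_content for doc in data)
```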

See LangChain Using PyMuPDF for full details.

Integration with LlamaIndex#

Use the dedicated PyMuPDFReader from LlamaIndex 🦙 to manage your document loading.

from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="example.pdf")

See Building RAG from Scratch for more.

Preparing Data for Chunking#

Chunking (or splitting) data is essential to give context to your LLM, and because PyMuPDF now supports Markdown output, Level 3 chunking is available.
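
Level 3 ("document specific") chunking splits on the document's own structure - for Markdown, that usually means splitting at headings. A minimal, dependency-free sketch of the idea (the actual `to_markdown` output is not assumed here, and real splitters add size limits and overlap on top of this):

```python
def split_on_headings(md_text: str) -> list[str]:
    """Split Markdown text into chunks, starting a new chunk at each heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in md_text.splitlines():
        # A heading line starts a new chunk (unless it is the very first line)
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

sample = "# Title\nIntro text\n## Section\nBody text"
chunks = split_on_headings(sample)
# Each chunk begins at a heading boundary
```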

Outputting as Markdown#

To export your document in Markdown format you need the separate helper available from the PyMuPDF RAG repository. Copy the helpers/pymupdf_rag.py file into your project and use it as follows:

import fitz  # PyMuPDF
from pymupdf_rag import to_markdown

doc = fitz.open("input.pdf")
md_text = to_markdown(doc)

# write the markdown to a file
with open("out-markdown.md", "w") as output:
    output.write(md_text)

How to use Markdown output#

Once you have your data in Markdown format you are ready to chunk/split it and supply it to your LLM. For example, with LangChain, do the following:

import fitz  # PyMuPDF
from pymupdf_rag import to_markdown
from langchain.text_splitter import MarkdownTextSplitter

# Get the Markdown text for all pages
doc = fitz.open("input.pdf")
md_text = to_markdown(doc)

# Split into chunks at Markdown-aware boundaries
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)
chunks = splitter.create_documents([md_text])

For more, see 5 Levels of Text Splitting.
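
To see what the `chunk_size` and `chunk_overlap` parameters control, here is a minimal character-based splitter sketch (the real `MarkdownTextSplitter` additionally prefers to break at Markdown structure rather than at arbitrary character positions):

```python
def split_chars(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-size character splitter with overlapping windows."""
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = split_chars("abcdefghij", chunk_size=4, chunk_overlap=2)
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Larger overlap values repeat more context between neighbouring chunks, which can help retrieval match queries that straddle a chunk boundary, at the cost of more stored text.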