PyMuPDF, LLM & RAG

Integrating PyMuPDF into your Large Language Model (LLM) framework and overall RAG (Retrieval-Augmented Generation) solution gives you a fast and reliable way to deliver document data.

A few well-known LLM solutions have their own interfaces with PyMuPDF. This is a fast-growing area, so please let us know if you discover any more!

If you need to export to Markdown or obtain a LlamaIndex Document from a file, use the pymupdf4llm package described under Outputting as Markdown below.

Integration with LangChain

It is simple to integrate directly with LangChain by using its dedicated loader, as follows:

from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example.pdf")
data = loader.load()  # list of LangChain Document objects

See LangChain Using PyMuPDF for full details.

Integration with LlamaIndex

Use the dedicated PyMuPDFReader from LlamaIndex 🦙 to manage your document loading.

from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="example.pdf")  # list of LlamaIndex Document objects

See Building RAG from Scratch for more.

Preparing Data for Chunking

Chunking (or splitting) data is essential for giving your LLM the right context. Because PyMuPDF now supports Markdown output, Level 3 chunking (document-specific splitting, which follows the document's own structure such as Markdown headings) is supported.
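
To illustrate why Markdown structure helps with chunking, here is a minimal sketch that splits a Markdown string into one section per heading and records the heading trail for each section. It uses only the standard library. The function name split_markdown_by_headings and the "path"/"text" keys are hypothetical and chosen for illustration; this is not part of PyMuPDF, pymupdf4llm or LangChain, and it only recognizes "#"-style headings.

```python
import re

# Matches "#"-style headings such as "## Section title".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_markdown_by_headings(md_text):
    """Split Markdown into sections, one per heading.

    Returns a list of dicts holding the heading trail ("path")
    and the section body ("text"). Empty sections are skipped.
    """
    sections = []
    path = []         # current heading trail, e.g. ["Title", "Part A"]
    body = []         # lines collected for the current section
    in_fence = False  # True while inside a fenced code block

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": list(path), "text": text})
        body.clear()

    for line in md_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside code fences are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then add this one.
            del path[level - 1:]
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = "# Title\n\nIntro text.\n\n## Part A\n\nBody A.\n"
    for section in split_markdown_by_headings(sample):
        print(" > ".join(section["path"]), "|", section["text"])
```

Keeping the heading trail next to each chunk is one way to preserve context when the chunk is later embedded or passed to an LLM. Production splitters typically also enforce a maximum chunk size.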

Outputting as Markdown

To export your document in Markdown format you need a separate helper package. pymupdf4llm is a high-level wrapper around PyMuPDF functions that extracts the standard text and table text of each page and combines them into a single Markdown-formatted string covering all document pages:

# convert the document to markdown
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")

# Write the text to a file in UTF-8 encoding
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())

For further information please refer to: pymupdf4llm documentation

How to use Markdown output

Once you have your data in Markdown format, you are ready to chunk (split) it and supply it to your LLM. For example, with LangChain:

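import pymupdf4llm
# Older LangChain releases expose this class as langchain.text_splitter.MarkdownTextSplitter
from langchain_text_splitters import MarkdownTextSplitter

# Get the Markdown text for all pages
md_text = pymupdf4llm.to_markdown("input.pdf")

# chunk_size and chunk_overlap are measured in characters by default
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

# Split the text into a list of LangChain Document objects
documents = splitter.create_documents([md_text])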
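
To make the chunk_size and chunk_overlap parameters more concrete, the following sketch shows a plain fixed-size character window with overlap. This is an illustration only and not how MarkdownTextSplitter works internally (that splitter prefers to break at Markdown separators such as headings). The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
    print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With an overlap of 0 the chunks are disjoint. A non-zero overlap repeats the tail of each chunk at the start of the next one, which can help when a sentence straddles a chunk boundary, at the cost of storing more text.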

For more, see 5 Levels of Text Splitting.