PyMuPDF, LLM & RAG
Integrating PyMuPDF into your Large Language Model (LLM) framework and overall Retrieval-Augmented Generation (RAG) solution gives you a fast and reliable way to deliver document data.
A few well-known LLM solutions have their own interfaces to PyMuPDF. It is a fast-growing area, so please let us know if you discover any more!
Integration with LangChain
It is simple to integrate directly with LangChain by using their dedicated loader as follows:
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example.pdf")
data = loader.load()
See LangChain Using PyMuPDF for full details.
Integration with LlamaIndex
Use the dedicated PyMuPDFReader from LlamaIndex 🦙 to manage your document loading.
from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="example.pdf")
See Building RAG from Scratch for more.
Preparing Data for Chunking
Chunking (or splitting) data is essential for giving your LLM the context it needs. Because PyMuPDF now supports Markdown output, Level 3 chunking (splitting a document along its own structure, such as Markdown headings) is supported.
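To illustrate what structure-based chunking means in practice, here is a minimal sketch that splits Markdown text into one chunk per heading. It uses only the Python standard library. The function name `split_markdown_by_headings` and the shape of the returned dictionaries are assumptions made for this illustration; they are not part of PyMuPDF or of any LLM framework.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_headings(md_text):
    """Hypothetical helper: return one {"heading", "text"} dict per section."""
    sections = []
    heading, body = None, []
    in_fence = False  # True while inside a ``` fenced code block

    def flush():
        # Store the section collected so far, skipping an empty preamble.
        text = "\n".join(body).strip()
        if heading is not None or text:
            sections.append({"heading": heading, "text": text})

    for line in md_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # Lines starting with "#" inside code fences are not headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            heading, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = "# Intro\nPyMuPDF extracts text.\n\n## Setup\nInstall the package.\n"
    for section in split_markdown_by_headings(sample):
        print(section["heading"], "->", section["text"])
```

Each chunk keeps its heading alongside its body text, so the context that the heading provides travels with the chunk when it is passed to an LLM.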
Outputting as Markdown
To export your document in Markdown format, you will need the separate helper available from the PyMuPDF RAG repository. See the helpers/pymupdf_rag.py file and make it available to your project as follows:
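import fitz  # PyMuPDF
from pymupdf_rag import to_markdown
# Open the PDF and convert all pages to Markdown
doc = fitz.open("input.pdf")
md_text = to_markdown(doc)
# Write the Markdown to a file; the file is closed automatically
with open("out-markdown.md", "w", encoding="utf-8") as output:
    output.write(md_text)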
How to use Markdown output
Once you have your data in Markdown format, you are ready to chunk/split it and supply it to your LLM. For example, with LangChain, do the following:
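import fitz  # PyMuPDF
from pymupdf_rag import to_markdown
from langchain.text_splitter import MarkdownTextSplitter
# Get the Markdown text for all pages
doc = fitz.open("input.pdf")
md_text = to_markdown(doc)
# Split the Markdown using a target chunk size of 40 characters and no overlap
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)
# Keep the resulting list of LangChain documents for later use
documents = splitter.create_documents([md_text])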
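To make the `chunk_size` and `chunk_overlap` parameters more concrete, the following sketch shows a plain sliding-window chunker that measures both values in characters. It is a simplified illustration only and is not how LangChain's `MarkdownTextSplitter` works internally, since that splitter prefers to break on Markdown-aware separators. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Hypothetical helper: cut text into fixed-size, optionally overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(repr(chunk))
```

With an overlap of 1, the last character of each chunk is repeated as the first character of the next one, which helps preserve context across chunk boundaries at the cost of some duplicated text.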
For more, see 5 Levels of Text Splitting.