PyMuPDF4LLM¶
PyMuPDF4LLM is a lightweight extension for PyMuPDF that turns PDFs into clean, structured data with minimal setup. It includes layout analysis without any GPU requirement.
PyMuPDF4LLM aims to make it easier to extract document content in the format you need for LLM and RAG environments. It supports Markdown, JSON and TXT extraction, as well as LlamaIndex and LangChain integration.
Important
You can extend the supported file types to include Office document formats (DOC/DOCX, XLS/XLSX, PPT/PPTX, HWP/HWPX) by using PyMuPDF Pro with PyMuPDF4LLM.
Features¶
Support for Markdown, JSON and plain text output formats.
Support for multi-column pages.
Support for image and vector graphics extraction.
Layout analysis for better semantic understanding of document structure.
Support for page chunking output.
Integration with LlamaIndex & LangChain.
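The page chunking feature listed above could be used roughly as sketched below. This assumes that to_markdown() accepts a page_chunks=True argument and then returns one dictionary per page with a "text" entry; please verify both against the API reference for your installed version. The helper names page_texts and load_page_texts are hypothetical and not part of the library.

```python
def page_texts(chunks: list[dict]) -> list[str]:
    """Return the non-empty text of each page chunk, in page order."""
    texts = []
    for chunk in chunks:
        text = chunk.get("text", "")  # assumed key, see the note above
        if text.strip():
            texts.append(text)
    return texts


def load_page_texts(path: str) -> list[str]:
    """Convert a document and return per-page Markdown (needs pymupdf4llm)."""
    import pymupdf4llm  # imported lazily so page_texts() works without it

    # Assumption: page_chunks=True yields a list of per-page dictionaries.
    return page_texts(pymupdf4llm.to_markdown(path, page_chunks=True))
```

Keeping the filtering in a separate function makes it possible to check the logic on hand-written dictionaries before running it on a real document.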
API¶
See: The PyMuPDF4LLM API.
Installation¶
Install the package via pip with:
pip install pymupdf4llm
Extracting¶
As Markdown¶
To retrieve your document content in Markdown use the to_markdown() method as follows:
import pymupdf4llm
md = pymupdf4llm.to_markdown("input.pdf")
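For RAG workflows the Markdown string is often split into smaller pieces before indexing. The following is a minimal sketch of one way to do that, assuming the output marks headings with leading # characters. The function name split_by_heading is hypothetical and not part of PyMuPDF4LLM, and the splitter is deliberately naive: a line starting with # inside a code block would also start a new chunk.

```python
import re


def split_by_heading(md: str) -> list[str]:
    """Split Markdown text into chunks that each begin at a heading line."""
    chunks, current = [], []
    for line in md.splitlines():
        # A heading is 1 to 6 '#' characters followed by a space.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace.
    return [chunk for chunk in chunks if chunk]
```

With the md variable from above, split_by_heading(md) returns a list of strings, one per heading section, with any text before the first heading kept as its own chunk.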
As JSON¶
To retrieve your document content in JSON use the to_json() method as follows:
import pymupdf4llm
json_text = pymupdf4llm.to_json("input.pdf")  # avoid naming the variable "json", which would shadow the standard library module
The JSON export will give you bounding box information and layout data for each element on the page. This can be used to create your own custom output formats or to simply have more detailed information about the document structure for RAG workflows & LLM integrations.
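Before building a custom output format it can help to look at the shape of the export. The sketch below assumes that to_json() returns a JSON string and makes no assumption about the keys inside it; the helper name top_level_summary is hypothetical.

```python
import json


def top_level_summary(json_text: str) -> dict:
    """Describe the top-level structure of a JSON string."""
    data = json.loads(json_text)
    if isinstance(data, dict):
        # List the keys so the layout can be explored step by step.
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}
```

Calling top_level_summary() on the string returned by to_json() shows which keys or how many entries to expect, which is a reasonable starting point for locating the bounding box and layout data.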
As TXT¶
To retrieve your document content in TXT use the to_text() method as follows:
import pymupdf4llm
txt = pymupdf4llm.to_text("input.pdf")
Note
Instead of using filename strings as above, one can also provide a PyMuPDF Document.
Finally, we can save the output to an external file as follows (shown here for the Markdown result md from above):
from pathlib import Path
suffix = ".md" # or ".json" or ".txt"
Path("input.pdf").with_suffix(suffix).write_bytes(md.encode())
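The save step can also be wrapped in a small reusable helper. This is a sketch using only the standard library; the function name save_output is hypothetical, and it writes the text as UTF-8 next to the source file.

```python
from pathlib import Path


def save_output(text: str, source: str, suffix: str = ".md") -> Path:
    """Write text next to the source file, swapping the file extension."""
    target = Path(source).with_suffix(suffix)
    # An explicit encoding avoids platform-dependent defaults.
    target.write_text(text, encoding="utf-8")
    return target
```

For example, save_output(md, "input.pdf") writes input.md, and save_output(txt, "input.pdf", ".txt") writes input.txt.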
Integrations¶
With LlamaIndex¶
PyMuPDF4LLM supports direct conversion to a LlamaIndex document. A document is first converted into Markdown format and then a LlamaIndex document is returned as follows:
import pymupdf4llm
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data("input.pdf")
With LangChain¶
PyMuPDF4LLM also supports LangChain integration; see the PyMuPDF4LLM Document Loader for more details.
Using with PyMuPDF Pro¶
For Office document support, PyMuPDF4LLM works seamlessly with PyMuPDF Pro. Assuming you have PyMuPDF Pro installed you will be able to work with Office documents as expected:
import pymupdf4llm
import pymupdf.pro
pymupdf.pro.unlock()
md = pymupdf4llm.to_markdown("sample.doc")
PyMuPDF4LLM & PyMuPDF Layout¶
By default, PyMuPDF4LLM includes a layout analysis module to enhance output results. To disable this module, call the use_layout() method.
Further Resources¶
Sample code¶
Blogs¶
PyMuPDF4LLM Document Loader
