# Flow-Aware PDF-to-Markdown Benchmark

This repository provides a benchmark for evaluating the accuracy of PDF-to-Markdown extraction tools. The main goal is to measure how well a tool converts a complex, 2D PDF document into a 1D (text/markdown) format that **preserves the logical reading flow** of the content.

---

## The Core Problem

Large Language Models (LLMs) operate on a 1D sequence of tokens; they cannot natively understand the 2D spatial layout of a PDF. This "2D-to-1D" gap is a major bottleneck.

Existing benchmarks are often inadequate:

1. **They focus on layout detection:** Datasets like DocLayNet are excellent for identifying *bounding boxes* (e.g., "this is a paragraph"), but not for connecting text blocks that form a single, logical flow (e.g., "this paragraph continues in the next column").
2. **They assume a "total order":** Some benchmarks incorrectly assume a single, linear reading path for an entire document. In reality, complex documents have a **partial order**. For example, a main article and a sidebar can be read independently; neither logically precedes the other.

This benchmark measures a tool's ability to extract and correctly sequence these logically coherent "threads" of text.

---

## Benchmark Methodology

### 1. Dataset

The benchmark uses **127 PDF documents** sampled from the [DocLayNet dataset](https://github.com/DS4SD/DocLayNet), ensuring a diverse mix of challenging, real-world layouts. The documents are sourced from six distinct categories:

* Financial Reports
* Scientific Articles
* Laws & Regulations
* Government Tenders
* Manuals
* Patents

### 2. Ground Truth

To create a ground truth for evaluating text flow, we **manually copied multiple random pieces of text from each document, in their correct logical reading order**, producing a set of "ground truth snippets" per PDF. An evaluation metric can then check whether a tool's output contains these snippets, in order, without being jumbled with text from other columns or sections. This directly tests the preservation of reading flow.

### 3. Evaluation Metric: FATA Score

We evaluate tools using a **Flow-Aware Text Accuracy (FATA) score**:

1. For each ground truth snippet (`truth_i`), we search the tool's entire markdown output for the substring that is its best match (`best_match_i`).
2. The best match is determined using the **normalized Levenshtein distance**, a character-level edit distance scaled by length (similarity is one minus this distance).
3. The final FATA score is a weighted average of the similarity scores across all snippets.

A high FATA score (maximum 1.0) indicates that the tool extracted the snippets with their internal order intact. A low score indicates that the text was mangled (e.g., columns interleaved, characters garbled), making a clean match for the ground truth snippets impossible. Scores are reported as percentages, with 100% being perfect.
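The scoring idea above can be sketched in a few lines of Python. This is an illustration only, not the repository's implementation: the function names (`levenshtein`, `best_match_similarity`, `fata_score`), the brute-force sliding-window search, and the length-based weighting are all assumptions.

```python
# Illustrative sketch of FATA-style scoring -- not the benchmark's actual code.

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a                                    # keep b as the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def best_match_similarity(snippet: str, output: str) -> float:
    """Highest normalized Levenshtein similarity between the snippet and
    any same-length window of the tool's markdown output."""
    n = len(snippet)
    if n == 0:
        return 1.0
    if len(output) <= n:                               # output shorter than the snippet
        return 1.0 - levenshtein(snippet, output) / n
    return max(
        1.0 - levenshtein(snippet, output[i:i + n]) / n
        for i in range(len(output) - n + 1)
    )


def fata_score(snippets: list[str], output: str) -> float:
    """Length-weighted average of per-snippet similarities (the weighting
    scheme is an assumption). Multiply by 100 to report a percentage."""
    total = sum(len(s) for s in snippets)
    if total == 0:
        return 0.0
    return sum(len(s) * best_match_similarity(s, output) for s in snippets) / total
```

The brute-force window scan is only practical for short snippets; a real implementation would use a faster approximate substring matcher, and fixed-length windows can slightly undervalue matches that contain insertions.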
---

## Initial Tools Evaluated

This benchmark was used to generate a comparative analysis of modern PDF extraction tools that produce markdown directly. The initial set of tools evaluated includes:

* LlamaParse
* Docling
* Marker
* Reducto
* PyMuPDF4LLM
* PyMuPDF-Layout
* Google Gemini (multimodal)

---

## How to Run This Benchmark

```bash
uv sync
uv run prod_benchmark.py
```
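To see the kind of markdown this benchmark scores without running the full pipeline, here is a minimal sketch using PyMuPDF4LLM, one of the tools listed above; `sample.pdf` is a placeholder filename, not a file shipped with this repository:

```python
# Convert a single PDF to markdown with PyMuPDF4LLM -- the 1D output
# that flow-aware scoring operates on. "sample.pdf" is a placeholder.
import pathlib

import pymupdf4llm

md_text = pymupdf4llm.to_markdown("sample.pdf")  # returns one markdown string
pathlib.Path("sample.md").write_bytes(md_text.encode("utf-8"))
```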