
PDF to Markdown — Five Tools Compared

Five PDF-to-Markdown converters (markitdown, pdftotext, pymupdf, mineru, opendataloader-pdf) scored against a seven-criterion 100-point rubric

Five PDF-to-Markdown converters — markitdown, pdftotext, pymupdf, mineru, opendataloader-pdf — were measured on the same academic-paper PDF with a 100-point rubric. mineru scored 92; markitdown scored 14, effectively unusable. This post covers the rubric, per-tool results, and what produced the gap.


Setup

The source document is TV-RAG (ACM MM '25), a nine-page academic paper on temporal-aware video retrieval. It contains LaTeX-style math formulas, multi-column layouts, HTML-renderable tables, figure references, and the citation and CCS metadata typical of ACM proceedings. That mix of elements makes it a reasonable stress test: a tool that handles only plain prose will expose itself quickly.

Each tool was invoked on the same PDF with its default CLI surface — no post-processing scripts, no format-specific flags beyond what the tool offers out of the box. Measuring what each converter delivers without manual intervention is the relevant question for an automated ingestion pipeline.


Rubric

Seven criteria, 100 points total:

| Criterion | Points | What is measured |
|---|---|---|
| Text fidelity | 20 | Accuracy of extracted prose: character errors, dropped words, broken sentences |
| Structure preservation | 20 | Headings and sections rendered as Markdown (`#`, `##`, etc.) |
| Formula handling | 15 | Math expressions converted to valid LaTeX (`$...$`, `$$...$$`) |
| Table handling | 15 | Tables rendered as Markdown or HTML with data intact |
| Image references | 10 | Figures referenced with `![](path)` syntax |
| Readability | 10 | Overall rendering quality without manual cleanup |
| Noise / artifacts | 10 | Absence of spurious characters, metadata bleed, encoding garbage |

Grade bands: A (90–100), B (75–89), C (60–74), D (40–59), F (0–39).
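The rubric and grade bands are simple enough to encode directly. The snippet below is a minimal sketch of that bookkeeping; the names `RUBRIC`, `GRADE_BANDS`, and `grade` are illustrative and not taken from any scoring script used for this post.

```python
# Rubric weights as listed in the table above; they should sum to 100.
RUBRIC = {
    "Text fidelity": 20,
    "Structure preservation": 20,
    "Formula handling": 15,
    "Table handling": 15,
    "Image references": 10,
    "Readability": 10,
    "Noise / artifacts": 10,
}

# Lower bound of each grade band, highest band first.
GRADE_BANDS = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]


def grade(score: int) -> str:
    """Map a 0-100 rubric total to its letter grade."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, letter in GRADE_BANDS:
        if score >= lower_bound:
            return letter
    raise AssertionError("unreachable: the bands cover 0-100")


if __name__ == "__main__":
    print(sum(RUBRIC.values()), grade(92), grade(39))
```

Note that a score of 39 falls in the F band and 40 would be the lowest D, which matters for the pdftotext result further down.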


Overall scores

| Rank | Tool | Score | Grade | Notes |
|---|---|---|---|---|
| 1 | mineru | 92 | A | Full LaTeX formulas, HTML table preservation |
| 2 | opendataloader-pdf | 71 | C | Image references intact, structure solid, word-merge issue |
| 3 | pymupdf | 54 | D | Good image references, formulas and tables lost |
| 4 | pdftotext | 39 | F | Text only, structure stripped |
| 5 | markitdown | 14 | F | Severe noise, output not usable |

The gap between first and second place (21 points) is smaller than the gap between second and fifth place (57 points). The real division is not between individual tools but between tools that model document structure and tools that do not.
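The two gaps quoted above are plain subtraction over the score table. A short check, using only the five totals reported in this post (the variable names are illustrative):

```python
# Totals copied from the overall-scores table.
scores = {
    "mineru": 92,
    "opendataloader-pdf": 71,
    "pymupdf": 54,
    "pdftotext": 39,
    "markitdown": 14,
}

# Rank from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Point drop between each pair of neighbouring ranks.
adjacent_gaps = [
    (upper, lower, upper_score - lower_score)
    for (upper, upper_score), (lower, lower_score) in zip(ranked, ranked[1:])
]

top_gap = ranked[0][1] - ranked[1][1]    # first vs. second place
tail_gap = ranked[1][1] - ranked[-1][1]  # second vs. fifth place

if __name__ == "__main__":
    for upper, lower, gap in adjacent_gaps:
        print(f"{upper} -> {lower}: {gap}")
    print(top_gap, tail_gap)
```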


The same passage, five outputs

The abstract of TV-RAG runs about 150 words — multi-column layout, soft hyphens, and the sentence "To mitigate these limitations, we propose TV-RAG…" Each tool encountered the same bytes; the five renderings below show what came out.

mineru:

Large Video Language Models (LVLMs) have rapidly emerged as the focus of
multimedia AI research. Nonetheless, when confronted with lengthy videos,
these models struggle: their temporal windows are narrow, and they fail to
notice fine-grained semantic shifts that unfold over extended durations.
Moreover, mainstream text-based retrieval pipelines, which rely chiefly on
surface-level lexical overlap, ignore the rich temporal interdependence among
visual, audio, and subtitle channels. To mitigate these limitations, we propose
TV-RAG, a training-free architecture that couples temporal alignment with
entropy-guided semantics to improve longvideo reasoning. ...

The prose is continuous and readable. Column breaks are absorbed silently. The one visible seam — longvideo without a space — is a hyphenation artifact from the PDF's column justification; mineru preserves it faithfully rather than inserting noise.

markitdown:

5
2
0
2

c
e
D
9
2

]

V
C
.
s
c
[

1
v
3
8
4
3
2
...
TV-RAG: A Temporal-aware and Semantic Entropy-Weighted
Framework for Long Video Retrieval and Understanding

The file opens with arXiv metadata encoded as isolated characters on separate lines. The abstract text does appear later in the output, but by then the preamble has already produced output that cannot be used as-is. The pattern affects everything before the paper's title.

pdftotext:

Large Video Language Models (LVLMs) have rapidly emerged as
the focus of multimedia AI research. Nonetheless, when confronted
with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic
shifts that unfold over extended durations. Moreover, mainstream
text-based retrieval pipelines, which rely chiefly on surface-level
lexical overlap, ignore the rich temporal interdependence among
visual, audio, and subtitle channels. ...

The text is accurate and column breaks are absorbed, but most lines end where the PDF column line ended, typically after 60–70 characters; the occasional longer line, such as the third one above, is two column lines joined into one. Downstream line-aware parsers (e.g., chunkers that treat newlines as sentence breaks) will split these mid-sentence.
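One way to blunt the hard-wrap problem before chunking is to rejoin lines inside each blank-line-separated block. The function below is a minimal sketch of that idea, not part of pdftotext or of the test setup in this post; `unwrap_paragraphs` is a hypothetical helper name, and it assumes paragraphs are separated by at least one blank line.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so each blank-line-separated block is one line."""
    blocks = re.split(r"\n\s*\n", text.strip())
    unwrapped = []
    for block in blocks:
        # Drop empty lines and surrounding whitespace, then rejoin with spaces.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        unwrapped.append(" ".join(lines))
    return "\n\n".join(unwrapped)


if __name__ == "__main__":
    wrapped = (
        "Large Video Language Models (LVLMs) have rapidly emerged as\n"
        "the focus of multimedia AI research. Nonetheless, when confronted\n"
        "with lengthy videos, these models struggle\n"
    )
    print(unwrap_paragraphs(wrapped))
```

The sketch has two known blind spots: a heading that is not followed by a blank line is absorbed into the paragraph below it, and a word hyphenated across a line break stays split (the join inserts a space after the hyphen).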

pymupdf:

Large Video Language Models (LVLMs) have rapidly emerged as
the focus of multimedia AI research. Nonetheless, when confronted
with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic
shifts that unfold over extended durations. Moreover, mainstream
text-based retrieval pipelines, which rely chiefly on surface-level
lexical overlap, ignore the rich temporal interdependence among
visual, audio, and subtitle channels. ...

The abstract body is close to pdftotext in structure. The difference surfaces in the author block, where pymupdf interleaves names and affiliations in column-scan order rather than logical order, and in the section headings, which appear as bold plain text rather than Markdown ## markers.

opendataloader-pdf:

Large Video Language Models (LVLMs) have rapidly emerged as thefocusofmultimediaAIresearch.Nonetheless,whenconfronted
with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic
shifts that unfold over extended durations. Moreover, mainstream
text-based retrieval pipelines, which rely chiefly on surface-level
lexical overlap, ...Totackle(C2),itincorporatesatemporalwindowbased BM25 model ...

Word-merge artifacts appear intermittently throughout. Where the PDF's column layout places two words at a boundary, opendataloader-pdf joins them without a space. The density of merges increases in denser passages — methodology sections and captions are more affected than the abstract opening.


Per-tool notes

mineru — 92

mineru is the only tool in the group that treats formula extraction as a first-class problem. Mathematical expressions in the source PDF come through as valid LaTeX: inline formulas wrapped in $...$, display formulas in $$...$$. The TV-RAG paper's notation for its temporal-decay retrieval module and entropy-weighting functions is fully readable after conversion.

Tables are preserved as HTML rather than Markdown pipe tables, which is a reasonable trade-off: HTML tables survive complex cell content that Markdown tables cannot represent. Section headings map cleanly to # and ## levels. The 3-point deduction on image references (7/10 rather than 10/10) reflects path imprecision on some figures, and the noise score of 7/10 reflects minor artifacts in the author affiliation block. Neither issue affects downstream use for retrieval or RAG chunking.

The entropy-weighting formula from Section 3.1, for example, comes through as valid LaTeX:

$$
\alpha _ { t } ~ = ~ { \frac { H ( F _ { t } ) } { \sum _ { j } H ( F _ { j } ) } } .
$$

Actual output excerpt (Section 3.1, Key-Frame Selection through entropy definition):

Key–Frame Selection. Although modern LVLMs excel at recognising objects, they remain
error-prone when asked to count instances, pinpoint locations, or reason about complex
interactions, frequently hallucinating details when contextual cues are sparse. To curb
this issue, motivated by [12, 18], we rank every sampled frame $\mathcal { F } = \{ F _ { t } \}$
by the semantic affinity between the detector request $R _ { d e t }$ and the frame content,
modulated by a temporal importance weight:

$$
F _ { k e y } = \Big \{ F _ { t } \ \big \vert \ \alpha _ { t } \cdot \mathrm { C L I P } \big ( R _ { d e t } , F _ { t } \big ) \geq \tau \Big \} ,
$$

where $\tau$ is a similarity threshold. The weight $\alpha _ { t }$ captures how much new
information the segment contributes and is computed from the normalised Shannon
entropy of its visual features:

$$
\alpha _ { t } ~ = ~ { \frac { H ( F _ { t } ) } { \sum _ { j } H ( F _ { j } ) } } .
$$

Object detection is then run only on this entropy-aware subset $\mathcal { F } _ { \mathrm { \kappa E Y } }$ ,
ensuring that the LVLM processes the most informative and contextually relevant frames.

Section headings, inline math, and display equations render correctly, with one recognition slip in the last line of the excerpt, where the subscript of F_key comes out as \kappa E Y. The image reference just above this passage reads ![](images/8fac5a2bad7e74abff09e37d61869afe64fa7bfec45b4542d9f66f1be249f95f.jpg), a content-addressed filename rather than a human-readable one, which is the main weak point in an otherwise clean extraction.
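Image references of this form are easy to audit mechanically. The sketch below pulls every `![alt](path)` pair out of a Markdown string and reports which paths do not resolve to a file; `image_refs` and `missing_images` are hypothetical helper names, and the regular expression deliberately ignores titles and paths containing spaces or parentheses.

```python
import re
from pathlib import Path

# Matches ![alt](path); the path may not contain whitespace or ')'.
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def image_refs(markdown: str) -> list[tuple[str, str]]:
    """Return (alt_text, path) for every image reference in the text."""
    return IMAGE_REF.findall(markdown)


def missing_images(markdown: str, base_dir: str) -> list[str]:
    """Return referenced paths that are not files under base_dir."""
    return [
        path
        for _alt, path in image_refs(markdown)
        if not (Path(base_dir) / path).is_file()
    ]


if __name__ == "__main__":
    sample = "![](images/8fac5a2bad7e74abff09e37d61869afe64fa7bfec45b4542d9f66f1be249f95f.jpg)"
    print(image_refs(sample))
```

Running `missing_images` against the converter's output directory gives a quick count of dangling references, which is one way to make the "Image references" criterion less subjective.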

mineru runs a layout-analysis pass before text extraction, which lets it distinguish formula regions, table regions, and prose regions before applying separate extraction logic to each. Tools that treat the entire page as undifferentiated text cannot do this.

opendataloader-pdf — 71

opendataloader-pdf scores the maximum on image references (10/10) and handles section structure well (17/20). Its weakness is two-column layout handling: words at column boundaries are merged without spaces. A representative line from the methodology section:

TV-RAG employsasemanticentropy-basedweightingstrategyforkeyframe
selection to evenly distribute selected frames across time, reducing
redundancy,enhancingrepresentativeness,andprioritizingthemost
informativeframes.

The merged tokens — employsasemanticentropy-basedweightingstrategyforkeyframe, redundancy,enhancingrepresentativeness — do not appear in query vocabularies and cannot be recovered by a standard tokenizer.
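Merged runs like these can at least be detected cheaply, since they are far longer than ordinary English words. The sketch below flags whitespace-delimited tokens above a length threshold; the function name and the threshold of 25 characters are assumptions chosen for illustration, not values tuned on this document.

```python
def find_merged_tokens(text: str, max_len: int = 25) -> list[str]:
    """Return whitespace-delimited tokens longer than max_len characters."""
    return [token for token in text.split() if len(token) > max_len]


if __name__ == "__main__":
    sample = (
        "TV-RAG employsasemanticentropy-basedweightingstrategyforkeyframe\n"
        "selection to evenly distribute selected frames across time, reducing\n"
        "redundancy,enhancingrepresentativeness,andprioritizingthemost\n"
        "informativeframes."
    )
    for token in find_merged_tokens(sample):
        print(len(token), token)
```

A length cutoff only catches the long runs. A short merge such as informativeframes. passes under the threshold, and legitimate long tokens (URLs, hashes, file paths) would be flagged, so the result is better treated as a quality signal per page than as input to an automatic repair.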

Actual output excerpt (Section 3.1, Building Retrieval Repositories through Key-Frame Selection):

Building Retrieval Repositories. Open source LVLMs tend to misread scene text and
spoken words, falling short of their proprietary counterparts. To curb such hallucinations
and exploit frame content more effectively, we offload text extraction to specialist models.
Concretely, EasyOCR [7] is run on each key frame to harvest on-screen captions, giving a
pool of strings TocR; meanwhile, the soundtrack is transcribed by Whisper [24], yielding
an ASR transcript TasR as advocated in prior work [15, 18, 40]. Both text streams are
embedded with the ContRieveR encoder [6] to obtain dense vectors, which are written to
two separate FAISS indices [8]: DocR for scene text and DasR for speech. This design
enables low-latency, similarity-based retrieval of the most relevant snippets during query time.

Key–Frame Selection. Although modern LVLMs excel at recognising objects, they remain
error-prone when asked to count instances, pinpoint locations, or reason about complex
interactions, frequently hallucinating details when contextual cues are sparse. To curb
this issue, motivated by [12, 18], we rank every sampled frame F = {𝐹𝑡} by the semantic
affinity between the detector request 𝑅𝑑𝑒𝑡 and the frame content, modulated by a
temporal importance weight:

𝐹𝑘𝑒𝑦 = 𝐹𝑡 𝛼𝑡 · CLIP 𝑅𝑑𝑒𝑡, 𝐹𝑡 ≥ 𝜏 ,

where𝜏 is a similarity threshold. The weight𝛼𝑡 captures how much new information
the segment contributes and is computed from the normalised Shannon entropy of its
visual features:

𝛼𝑡 =

𝐻(𝐹𝑡) 𝑗 𝐻(𝐹𝑗)

.

The prose in the "Building Retrieval Repositories" paragraph is clean of word merges, though a few symbol and model names come out with mangled casing (TocR, TasR, ContRieveR). The merges concentrate at column boundaries in denser passages. The formula rendering shows what happens when LaTeX is absent: the set-builder notation 𝐹𝑘𝑒𝑦 = 𝐹𝑡 𝛼𝑡 · CLIP ... is flattened to a single line of Unicode math symbols with its braces and vertical bar dropped, and the fraction α_t = H(F_t) / Σ H(F_j) is split across three lines, with the fraction bar and summation sign lost so that numerator, index, and denominator sit side by side on one row. The image reference for Figure 2 is ![image 2](TV-RAG_images/imageFile2.png), a clean relative path, which is the strongest part of this output.

LaTeX formula output is absent, and formulas survive only as flattened Unicode (8/15); tables are partially reconstructed but lose some structure (8/15). For a document heavy on prose and figures but light on math and tables, opendataloader-pdf would rank closer to mineru.

pymupdf — 54

pymupdf extracts text reliably (15/20) and preserves image references well (10/10), which makes it a reasonable choice for figure-rich documents where math is not a concern. Structure preservation drops to 10/20 because section headings are extracted as plain text without Markdown markers. Formulas come out as sequences of special characters rather than LaTeX (4/15), and tables flatten into rows of tab-separated values without column alignment (4/15). The output is clean enough to read but requires significant post-processing before it is useful as structured Markdown.

Actual output excerpt (Section 3 Methodology, Problem Setup and image reference):

**3** **Methodology**

We introduce TV-RAG, a novel training-free process designed for
LVLMs that can be seamlessly integrated into any existing LVLM.
As shown in Fig. 2, the process consists of three main phases: **(i) Se-**
**mantic entropy-based information extraction:** After obtaining the query, information is extracted based on semantic entropy
from different sources. **(ii) Temporal decay-enhanced retrieval**
**model:** In order to capture the important temporal information
in the video, the time window mechanism request is introduced
for obtaining the relevant information. **(iii) Context-enhanced**
**reasoning-based response generation:** In this final stage, the
auxiliary text retrieved based on the context-enhanced reasoning
mechanism is integrated with the user's query and fed into the
LVLM to generate the final output.

**Problem Setup.** Let V be an input video. We then use a frame–
selection unit to extract _𝑁_ representative images F Each frame is
then mapped into a visual embedding via a frozen image encoder,
e.g., CLIP-L [23], yielding F _𝑣_ from F . Finally, the visual tokens F _𝑣_
and a user query Q are supplied to a large video–language model
to generate the answer O:

![](./images/TV-RAG.pdf-2-0.png)

The heading "3 Methodology" is bold plain text rather than ## 3 Methodology. Bold inlines within the phase enumeration are preserved, which is better than pdftotext but falls short of Markdown heading structure. The formula O = LVLM(F𝑣, Q) does not appear — the equation block was replaced by an image reference (TV-RAG.pdf-2-0.png), which is the tool's fallback for content it cannot represent as text. The image reference is accurate but opaque to a text-based retriever.
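Because the bold-heading pattern is regular, part of the missing structure can be recovered in post-processing. The sketch below rewrites lines of the form `**3** **Methodology**` as Markdown headings. It is an illustration rather than a pymupdf feature: `promote_bold_headings` is a hypothetical helper, and the mapping of dotted section numbers such as 3.1 to deeper heading levels is an assumption, since the excerpt only shows a top-level section.

```python
import re

# A line consisting of a bold section number followed by a bold title.
BOLD_HEADING = re.compile(r"^\*\*(\d+(?:\.\d+)*)\*\*\s+\*\*(.+?)\*\*\s*$")


def promote_bold_headings(markdown: str) -> str:
    """Rewrite '**3** **Title**' lines as '## 3 Title' headings."""
    out = []
    for line in markdown.splitlines():
        match = BOLD_HEADING.match(line)
        if match:
            number, title = match.groups()
            # Assumed convention: "3" -> ##, "3.1" -> ###, and so on.
            level = 2 + number.count(".")
            line = f"{'#' * level} {number} {title}"
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(promote_bold_headings("**3** **Methodology**\n\n**Problem Setup.** Let V be an input video."))
```

Run-in labels such as **Problem Setup.** are left alone because they do not start with a bold number, which keeps the rewrite from promoting ordinary emphasis.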

pdftotext — 39

pdftotext does one thing: it extracts the character stream from a PDF. It does that with reasonable fidelity (15/20 on text accuracy) and reasonable noise suppression (7/10), but all structural information is discarded. Headings become plain lines. Tables become space-aligned text that collapses when the font metrics do not survive the extraction. Formulas are rendered as whatever Unicode approximation the PDF's character encoding provides. The output is useful if raw text is what you need — for keyword search over a document corpus, for example — but not for Markdown-based RAG where chunking depends on section boundaries.

Actual output excerpt (Section 3 Methodology, Problem Setup through Two-stage Processes):

3

Methodology

We introduce TV-RAG, a novel training-free process designed for
LVLMs that can be seamlessly integrated into any existing LVLM.
As shown in Fig. 2, the process consists of three main phases: (i) Semantic entropy-based information extraction: After obtaining the query, information is extracted based on semantic entropy
from different sources. (ii) Temporal decay-enhanced retrieval
model: In order to capture the important temporal information
in the video, the time window mechanism request is introduced
for obtaining the relevant information. (iii) Context-enhanced
reasoning-based response generation: In this final stage, the
auxiliary text retrieved based on the context-enhanced reasoning
mechanism is integrated with the user's query and fed into the
LVLM to generate the final output.

Problem Setup. Let V be an input video. We then use a frame–
selection unit to extract 𝑁 representative images F Each frame is
then mapped into a visual embedding via a frozen image encoder,
e.g., CLIP-L [23], yielding F𝑣 from F . Finally, the visual tokens F𝑣
and a user query Q are supplied to a large video–language model
to generate the answer O:
O = LVLM(F𝑣 , Q).

(1)

In this way, we complete the RAG process for videos.
Two-stage Processes. In this paper, motivated by previous efforts
[12, 18], upon receiving a user's query regarding a video, the LVLM
follows a two-phase process.

The section heading "3 Methodology" is split across two plain lines, with no # marker. The formula O = LVLM(F𝑣, Q) is rendered with Unicode math-italic letters rather than LaTeX, and its equation number (1) is stranded on a separate line. Most lines end at the PDF column boundary, so the sentence "We introduce TV-RAG…" wraps mid-clause across multiple lines. A sentence-boundary detector that relies on newlines will fragment this incorrectly.
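To make the dependence on section boundaries concrete, here is a minimal sketch of a heading-based splitter of the kind a Markdown RAG pipeline might use. The function name is hypothetical and the logic is deliberately naive (it does not skip `#` lines inside fenced code blocks). Given Markdown with `##` markers it returns one chunk per section; given output with no heading markers it returns a single undivided chunk.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before any heading gets heading ''."""
    sections = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it has a heading or any content.
            if heading or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


if __name__ == "__main__":
    plain = "3\n\nMethodology\n\nWe introduce TV-RAG, a novel training-free process"
    print(len(split_by_headings(plain)))
```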

markitdown — 14

markitdown's output on this document is not usable. The first lines of the converted file are a sequence of isolated single characters — digits and letters on separate lines — that appear to be arXiv metadata encoded in a way markitdown does not parse. That preamble occupies a significant portion of the output before the paper's title appears. The tool scores zero on formula handling, zero on table handling, and zero on image references. The text fidelity score (5/20) reflects that recognizable prose does appear in the output, but interspersed with enough noise (0/10 on artifacts) that it cannot be used as-is. For documents that do not contain complex math, tables, or multi-column layouts, markitdown may behave differently; on this test document its output was effectively unusable.

Actual output excerpt (file opening through title):

5
2
0
2

c
e
D
9
2

]

V
C
.
s
c
[

1
v
3
8
4
3
2
.
2
1
5
2
:
v
i
X
r
a

TV-RAG: A Temporal-aware and Semantic Entropy-Weighted
Framework for Long Video Retrieval and Understanding

Zongsheng Cao∗
agiczsr@gmail.com
Researcher

Feng Chen
chenfeng@lenovo.com
PCIE

Yangfan He∗
he00577@umn.edu
UMN

Zepeng Wang
wangzpb@lenovo.com
PCIE

The preamble is the arXiv stamp, split into one character per line and emitted in reverse order: read from the bottom up, it spells out arXiv:2512.23483v1 [cs.CV] 29 Dec 2025. After the preamble clears, the author block is extracted correctly. The abstract text that follows is also readable prose. The problem is not that the whole file is garbage; it is that the preamble noise makes automated ingestion unreliable, because a chunker has no safe starting offset to assume.
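For this particular failure pattern, the noise can be separated from the body with a simple rule: drop the leading run of blank or single-character lines. The sketch below does that and also reads the dropped characters back in reverse order. Both function names are hypothetical, and the rule is an assumption that fits the excerpt above; it is not a documented markitdown behaviour or a general fix.

```python
def split_char_preamble(text: str) -> tuple[list[str], str]:
    """Split off a leading run of blank or single-character lines."""
    lines = text.splitlines()
    start = len(lines)  # default: every line is preamble
    for index, line in enumerate(lines):
        if len(line.strip()) > 1:
            start = index
            break
    return lines[:start], "\n".join(lines[start:])


def decode_vertical_stamp(preamble: list[str]) -> str:
    """Read the single characters bottom to top, skipping blank lines."""
    return "".join(line.strip() for line in reversed(preamble) if line.strip())


if __name__ == "__main__":
    sample = "1\nv\n3\n8\n4\n3\n2\n.\n2\n1\n5\n2\n:\nv\ni\nX\nr\na\n\nTV-RAG: A Temporal-aware and Semantic Entropy-Weighted"
    preamble, body = split_char_preamble(sample)
    print(decode_vertical_stamp(preamble))
    print(body)
```

The rule would also swallow a document that legitimately begins with one-character lines, so it is safer as a targeted cleanup for arXiv-stamped PDFs than as a default step.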


What produced the gap

All five tools ultimately parse the byte stream of a PDF file. A PDF page is essentially positioned glyphs and drawing operations; unless the file carries the optional Tagged PDF structure layer, nothing in it says "this region is a formula" or "this is a table header." What differs between tools is how much inference they apply on top of the raw parse.

pdftotext applies almost none. It extracts the character stream in reading order and stops. pymupdf adds coordinate-aware extraction, which enables it to detect image bounding boxes and emit references, but does not go further into semantic classification. markitdown appears to use a conversion layer that works well on simpler PDF structures but degrades on the ACM two-column format with embedded arXiv metadata.

opendataloader-pdf adds a structure-detection step that identifies headings and sections, and maps images to references. Its column-boundary weakness is a known failure mode of heuristic layout analysis: when two columns share a horizontal band, the extraction cursor can merge the last token of one column with the first token of the next.

mineru runs a dedicated layout analysis model before extraction. That model classifies each page region (formula, table, prose block, figure) and routes each region to the appropriate extractor. Formula regions go to a recognizer that emits LaTeX. Table regions go to an HTML serializer. This architectural choice is what produces the 15/15 scores on formula handling and table handling, and it is also what makes mineru the slowest tool in the group: the layout model inference adds real wall-clock time compared to pure text extraction.
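The classify-then-route design can be illustrated in a few lines. The sketch below is a toy model of the idea, not mineru's code or API: `Region`, the four `emit_*` functions, and `render_page` are all hypothetical, and the string payloads stand in for whatever a real extractor would consume from the page image.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Region:
    kind: str     # label assumed to come from a layout pass: formula/table/prose/figure
    payload: str  # stand-in for the region's extracted content


def emit_formula(region: Region) -> str:
    return f"$$\n{region.payload}\n$$"


def emit_table(region: Region) -> str:
    return f"<table>{region.payload}</table>"


def emit_figure(region: Region) -> str:
    return f"![]({region.payload})"


def emit_prose(region: Region) -> str:
    return region.payload


# One extractor per region type; unknown types fall back to prose.
EXTRACTORS: dict[str, Callable[[Region], str]] = {
    "formula": emit_formula,
    "table": emit_table,
    "figure": emit_figure,
    "prose": emit_prose,
}


def render_page(regions: list[Region]) -> str:
    """Route each classified region to its extractor and join the results."""
    return "\n\n".join(EXTRACTORS.get(r.kind, emit_prose)(r) for r in regions)


if __name__ == "__main__":
    page = [Region("prose", "where tau is a similarity threshold."), Region("formula", "x = 1")]
    print(render_page(page))
```

The point of the sketch is the dispatch table: a tool that never assigns a `kind` to a region has only the prose path available, whatever the region actually contains.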

The performance gap is not a tuning difference — it is a design difference. Tools built for plain text extraction cannot approach mineru's scores on a math- and table-heavy academic paper without a significant architectural change.


When to pick what

For academic papers with formulas and tables, mineru is the correct choice. There is no close second for documents that require LaTeX-faithful formula output.

For documents heavy on images and prose — slide decks exported as PDFs, image-rich reports — opendataloader-pdf is the better balance of structure and image handling, with the caveat that its output needs a word-merge cleanup pass on two-column layouts.

pymupdf is a reasonable default for image-referenced documents where math is absent and the goal is fast, clean text with figure pointers intact.

pdftotext suits cases where only the character stream matters: building a text search index, running a classifier over document content, or feeding plain text to a summarization step that does not require structure.

In an automated pipeline, the tool selection can be conditioned on document type. A classifier that detects whether a PDF contains formula regions — which is a tractable problem — can route academic papers to mineru and everything else to a lighter tool, capturing most of mineru's quality gains without paying its latency cost on every document.
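As a rough illustration of such routing, the sketch below uses a crude proxy instead of a trained classifier: it measures how many characters in a cheap text extraction fall in two Unicode math blocks, the kind of characters (𝑁, 𝑣, 𝛼) that appeared in the plain-text outputs above. The function names, the 1% threshold, and the choice of pymupdf as the lighter tool are all assumptions for illustration and would need validation on a real corpus.

```python
def math_char_ratio(text: str) -> float:
    """Share of non-space characters in two Unicode math blocks."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    mathy = sum(
        1
        for c in chars
        # Mathematical Alphanumeric Symbols, or Mathematical Operators.
        if 0x1D400 <= ord(c) <= 0x1D7FF or 0x2200 <= ord(c) <= 0x22FF
    )
    return mathy / len(chars)


def pick_converter(
    sample_text: str,
    threshold: float = 0.01,  # assumed cutoff, not tuned
    heavy: str = "mineru",
    light: str = "pymupdf",
) -> str:
    """Route math-heavy documents to the heavy tool, the rest to the light one."""
    return heavy if math_char_ratio(sample_text) >= threshold else light


if __name__ == "__main__":
    print(pick_converter("𝛼𝑡 = 𝐻(𝐹𝑡)"))
    print(pick_converter("Plain prose with no mathematics at all."))
```

A proxy like this misses formulas embedded as images or set in ordinary Latin glyphs, so it is a starting point for the routing idea rather than a substitute for detecting formula regions directly.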