AI May 20, 2026 14 min read

AI assisted

Comparing Five PDF-to-Markdown Conversion Tools

Five tools for converting academic-paper PDFs (markitdown, pdftotext, pymupdf, mineru, opendataloader-pdf), scored against a 100-point rubric with seven criteria

#PDF #Markdown #RAG #Ingestion #mineru #markitdown #pymupdf #pdftotext #opendataloader-pdf #Benchmark

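We converted the same source PDF with five tools that turn academic-paper PDFs into Markdown (markitdown, pdftotext, pymupdf, mineru, opendataloader-pdf) and scored the results against a 100-point rubric. mineru scored highest at 92 points, while markitdown scored 14, which is effectively unusable. This post covers the rubric, the per-tool results, and the factors behind the differences.

Measurement setup

The source document is the TV-RAG paper published at ACM MM'25. It is equation-heavy and mixes a two-column layout with tables and image captions, which makes it a suitable sample for gauging how hard academic-paper PDFs are to convert.

All five tools took the same PDF as input and produced Markdown. markitdown and pdftotext each ran as a single CLI command with no extra script, while pymupdf, mineru, and opendataloader-pdf each ran through a Python script exposing the same interface. mineru ran with the pipeline backend on a CPU device.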
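The exact commands used in the measurement are not listed, so the following is only a sketch of what the two single-command conversions might look like when driven from Python. The file names and the `build_commands` / `run_available` helpers are hypothetical; the sketch assumes the `pdftotext` (Poppler) and `markitdown` executables are on PATH and skips any that are missing.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf: Path, out_dir: Path) -> dict[str, list[str]]:
    """Return one argv list per CLI tool (output file names are illustrative)."""
    return {
        # pdftotext takes the input PDF followed by the output text file.
        "pdftotext": ["pdftotext", str(pdf), str(out_dir / "pdftotext.txt")],
        # markitdown writes to the file given with -o.
        "markitdown": ["markitdown", str(pdf), "-o", str(out_dir / "markitdown.md")],
    }


def run_available(pdf: Path, out_dir: Path) -> list[str]:
    """Run each tool that is installed; return the names of the tools that ran."""
    out_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for name, cmd in build_commands(pdf, out_dir).items():
        if shutil.which(cmd[0]) is None:
            continue  # tool not installed, skip rather than fail
        subprocess.run(cmd, check=True)
        ran.append(name)
    return ran
```

Keeping command construction separate from execution makes it easy to log exactly what was run for each tool, which matters when comparing outputs from the same input.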

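Evaluation rubric

A total of 100 points is split across seven criteria.

Criterion	Points	Description
Text accuracy	20	Accuracy of the extracted source text
Structure preservation	20	Markdown syntax for titles/sections
Equation handling	15	Conversion to LaTeX
Table handling	15	Preservation of table structure
Image references	10	Markdown image references
Readability	10	Ease of rendering
Noise/artifacts	10	Presence of unwanted characters

Grade bands are A (90-100), B (75-89), C (60-74), D (40-59), and F (0-39). A means usable as is; F means re-conversion is recommended.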
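As a small illustration, the rubric weights and grade bands above can be written down as data plus one lookup function. The names `RUBRIC` and `grade` are made up for this sketch; the weights and band boundaries are the ones stated in the rubric.

```python
# Maximum points per criterion, as listed in the rubric table.
RUBRIC = {
    "text_accuracy": 20,
    "structure_preservation": 20,
    "equation_handling": 15,
    "table_handling": 15,
    "image_references": 10,
    "readability": 10,
    "noise_artifacts": 10,
}

# Lower bound of each grade band, checked from highest to lowest.
GRADE_BANDS = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]


def grade(total: int) -> str:
    """Map a 0-100 total score to its letter grade."""
    if not 0 <= total <= sum(RUBRIC.values()):
        raise ValueError(f"total out of range: {total}")
    for lower_bound, letter in GRADE_BANDS:
        if total >= lower_bound:
            return letter
    raise AssertionError("unreachable: bands cover 0-100")
```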

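Overall results

Rank	Tool	Total	Grade	Key characteristics
1	mineru	92/100	A	Flawless LaTeX equations, tables preserved as HTML
2	opendataloader-pdf	71/100	C	Complete image references, strong structure, merged-word problem
3	pymupdf	54/100	D	Strong image references, equations/tables lost
4	pdftotext	39/100	F	Text only, structure lost
5	markitdown	14/100	F	Severe noise, unusable

Per-criterion scores for each tool:

Criterion	mineru	opendataloader	pymupdf	pdftotext	markitdown
Text accuracy (20)	18	15	15	15	5
Structure preservation (20)	20	17	10	5	5
Equation handling (15)	15	8	4	4	0
Table handling (15)	15	8	4	4	0
Image references (10)	7	10	10	0	0
Readability (10)	10	7	7	4	4
Noise/artifacts (10)	7	6	4	7	0
Total	92	71	54	39	14

The gap between 2nd and 5th place (57 points) is far larger than the gap between 1st and 2nd (21 points). At this scale the choice of tool is not a matter of minor quality variation; it determines what the output can be used for.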
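The totals and the two gaps can be re-derived from the per-criterion table. The snippet below is just that arithmetic; the dictionary name `SCORES` is arbitrary, and each list holds a tool's scores in the table's row order.

```python
# Per-criterion scores in table order: text, structure, equations,
# tables, images, readability, noise.
SCORES = {
    "mineru": [18, 20, 15, 15, 7, 10, 7],
    "opendataloader-pdf": [15, 17, 8, 8, 10, 7, 6],
    "pymupdf": [15, 10, 4, 4, 10, 7, 4],
    "pdftotext": [15, 5, 4, 4, 0, 4, 7],
    "markitdown": [5, 5, 0, 0, 0, 4, 0],
}

totals = {tool: sum(points) for tool, points in SCORES.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

gap_first_second = totals[ranking[0]] - totals[ranking[1]]
gap_second_last = totals[ranking[1]] - totals[ranking[-1]]

for tool in ranking:
    print(f"{tool:20s} {totals[tool]:3d}")
print("1st vs 2nd:", gap_first_second)
print("2nd vs 5th:", gap_second_last)
```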

Same paragraph, five results

Below is how each tool converted the first paragraph of the TV-RAG paper's Introduction. The source is an academic-paper PDF in a two-column layout.

mineru:

# 1 Introduction

Recent breakthroughs in large-scale language modelling have catalysed
rapid progress in multimodal research, ultimately leading to a new class
of Large Video–Language Models (LVLMs) [11, 16, 37]. Despite their
impressive accuracy on short, clip-length inputs, current LVLMs still
face significant obstacles when tasked with analyzing and reasoning over
very long videos.

The section heading is correctly converted to a # Markdown heading, and the spaces between words are preserved.

opendataloader-pdf:

### 1 Introduction

Recentbreakthroughsinlarge-scalelanguagemodellinghavecatalysed rapid
progress in multimodal research, ultimately leading to a new class of
Large Video–Language Models (LVLMs) [11, 16, 37]. Despite their
impressive accuracy on short, clip-length inputs, current LVLMs still
face significant obstacles when tasked with analyzing and reasoning over
very long videos.

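The section is still emitted as a Markdown heading (###), but words are merged at column boundaries: the opening words run together as Recentbreakthroughsinlarge-scalelanguagemodellinghavecatalysed.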

pymupdf:

**1** **Introduction**

Recent breakthroughs in large-scale language modelling have catalysed rapid progress in multimodal research, ultimately leading to
a new class of Large Video–Language Models (LVLMs) [11, 16, 37].
Despite their impressive accuracy on short, clip-length inputs, current LVLMs still face significant obstacles when tasked with analyzing and reasoning over very long videos.

The heading is rendered only as bold text, with no # Markdown structure. The tool generates image paths but does not recognize section structure.

pdftotext:

1 Introduction
Recent breakthroughs in large-scale language modelling have catalysed rapid progress in
multimodal research, ultimately leading to a new class of Large Video–Language Models
(LVLMs) [11, 16, 37]. Despite their impressive accuracy on short, clip-length inputs, cur-
rent LVLMs still face significant obstacles when tasked with analyzing and reasoning over
very long videos.

The section number and title come out as plain text. In-column line breaks are kept as is (cur- / rent), so hyphenated words stay split across lines.
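A common post-processing step for this kind of output is to rejoin words that were hyphenated at a line break. The sketch below is one naive heuristic, not something pdftotext does itself: it joins only when a lowercase letter sits on both sides of the break, and it will wrongly fuse genuine compounds that happen to wrap at the hyphen (for example closed- / source).

```python
import re

# A lowercase letter, a hyphen at end of line, then a lowercase letter.
SOFT_HYPHEN_BREAK = re.compile(r"([a-z])-\n([a-z])")


def dehyphenate(text: str) -> str:
    """Rejoin words split across lines by an end-of-line hyphen."""
    return SOFT_HYPHEN_BREAK.sub(r"\1\2", text)


sample = "clip-length inputs, cur-\nrent LVLMs still face obstacles"
print(dehyphenate(sample))
```

Hyphens in the middle of a line, such as clip-length, are left untouched because the pattern requires a newline right after the hyphen.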

markitdown:

5
2
0
2

c
e
D
9
2

]

V
C
.
s
c
[

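From the very start of the document, the arXiv version metadata is emitted one character per line. The content that follows is polluted in the same way. Unusable levels of noise accumulate before the Introduction is even reached.

Tool-by-tool analysis

mineru: 92 points (A)

mineru is the only tool to earn full marks (15/15) on both equation handling and table handling. It converts the paper's equations to LaTeX ($...$, $$...$$) and emits tables as HTML tables with their structure intact. Section structure is also converted to proper Markdown headings (#, ##), earning the full 20 points for structure preservation.

On image references it scored 7 out of 10: it references image paths, but some were missing. It also scored 7 on noise, with minor artifacts left in some sections.

Excerpt from the actual output (Methodology §3, §3.1):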

# 3 Methodology

We introduce TV-RAG, a novel training-free process designed for LVLMs that can be
seamlessly integrated into any existing LVLM. As shown in Fig. 2, the process consists
of three main phases: (i) Semantic entropy-based information extraction: After obtaining
the query, information is extracted based on semantic entropy from different sources.
(ii) Temporal decay-enhanced retrieval model: In order to capture the important temporal
information in the video, the time window mechanism request is introduced for obtaining
the relevant information. (iii) Context-enhanced reasoning-based response generation:
In this final stage, the auxiliary text retrieved based on the context-enhanced reasoning
mechanism is integrated with the user's query and fed into the LVLM to generate the
final output.

Problem Setup. Let $_ \mathrm { c } { } _ { V }$ be an input video. We then use a frame–
selection unit to extract $N$ representative images $\mathcal { F }$ Each frame is then mapped
into a visual embedding via a frozen image encoder, e.g., CLIP-L [23], yielding
$\mathcal { F } _ { v }$ from $\mathcal { F }$. Finally, the visual tokens $\mathcal { F } _ { v }$ and a user query
$\boldsymbol { Q }$ are supplied to a large video–language model to generate the answer $o$:

$$
O = \mathrm { L V L M } ( { \mathcal { F } } _ { v } , Q ) .
$$

Two-stage Processes. ... The process can thus be decoupled into two main stages, as
formalized in the following equations:

$$
\mathbf { R } = \mathsf { L V L M } ( \mathbf { P } , \mathbf { Q } ) , \quad \mathbf { R } = \{ \mathbf { R } _ { i } \} ,
$$

# 3.1 Semantic Entropy-based Extraction

Key–Frame Selection. ... we rank every sampled frame $\mathcal { F } = \{ F _ { t } \}$ by the semantic
affinity between the detector request $R _ { d e t }$ and the frame content:

$$
F _ { k e y } = \Big \{ F _ { t } \ \big \vert \ \alpha _ { t } \cdot \mathrm { C L I P } \big ( R _ { d e t } , F _ { t } \big ) \geq \tau \Big \} ,
$$

$$
\alpha _ { t } ~ = ~ { \frac { H ( F _ { t } ) } { \sum _ { j } H ( F _ { j } ) } } .
$$

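Section headings come out as # Markdown headings and equations as $$...$$ LaTeX blocks. Inline math notation ($...$) is preserved as well.

mineru's strength is that it does not treat a PDF as a simple text stream. Its layout-analysis pipeline detects equation regions and table regions separately and converts each to a suitable format. The pipeline backend runs on CPU, but conversion time grows with document complexity.

opendataloader-pdf: 71 points (C)

Along with pymupdf, it is one of two tools to earn the full 10 points on image references. It references every image correctly with Markdown ![](path) syntax, and it preserved section structure well, scoring 17 points.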
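One way to spot-check image references in any converter's output is to pull every ![](path) target out of the Markdown and compare the count against the figures in the PDF. The regex below is a simplified sketch: it assumes paths contain no spaces or parentheses, and `image_paths` is a hypothetical helper, not part of any of the tools.

```python
import re

# Markdown image syntax: ![alt](path). Alt text may be empty.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def image_paths(markdown: str) -> list[str]:
    """Return the target path of every Markdown image reference, in order."""
    return IMAGE_REF.findall(markdown)


sample = "Some text.\n\n![](./images/TV-RAG.pdf-2-0.png)\n\nMore text."
print(image_paths(sample))
```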

Excerpt from the actual output (Abstract + §3.1):

Large Video Language Models (LVLMs) have rapidly emerged as thefocusofmultimediaAIresearch.
Nonetheless,whenconfronted with lengthy videos, these models struggle: their temporal
windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over
extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly
on surface-level lexical overlap, ignore the rich temporal interdependence among visual,
audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a
training-free architecture that couples temporal alignment with entropy-guided semantics
to improve longvideoreasoning.Theframeworkcontributestwomainmechanisms: (i) a
time-decay retrieval module that injects explicit temporal offsets into the similarity
computation, thereby ranking text queries according to their true multimedia context; and
(ii) an entropyweightedkey-framesamplerthatselectsevenlyspaced,informationdense frames,
reducing redundancy while preserving representativeness.

...

Before retrieval,threemodality-specificrepositoriesarebuiltinparallelfrom the video
itself: an ASR base DasR, an OCR base DocR, and an object-detectionbase Ddet.
Subsequentlook-upsareexecutedagainst these lightweight databases, ensuring that
irrelevant tokens do not enter the LVLM's limited context window.

𝐹𝑘𝑒𝑦 = 𝐹𝑡 𝛼𝑡 · CLIP 𝑅𝑑𝑒𝑡, 𝐹𝑡 ≥ 𝜏 ,

𝛼𝑡 = 𝐻(𝐹𝑡) / 𝑗 𝐻(𝐹𝑗) .

Word merging occurs consistently at column boundaries (thefocusofmultimediaAIresearch, longvideoreasoning, entropyweightedkey-framesampler). Equations come out as a bare sequence of Unicode variables, with no LaTeX.
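Merged words like these can be flagged automatically with a crude length check. The sketch below is an assumption-laden heuristic rather than a measurement method used in this comparison: it treats any whitespace-delimited token longer than 20 characters (after trimming punctuation) as suspicious, and the threshold is arbitrary.

```python
def suspicious_tokens(text: str, max_len: int = 20) -> list[str]:
    """Return tokens whose length suggests several words were fused together."""
    flagged = []
    for token in text.split():
        core = token.strip(".,;:()[]")  # ignore surrounding punctuation
        if len(core) > max_len:
            flagged.append(core)
    return flagged


line = "have rapidly emerged as thefocusofmultimediaAIresearch."
print(suspicious_tokens(line))
```

A length threshold misses short merges (two fused four-letter words) and can flag long URLs or identifiers, so it is better suited to estimating how widespread the problem is than to repairing it.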

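Its weak point is two-column layout handling. Words were repeatedly merged at column boundaries, and equations were extracted as plain text rather than LaTeX, which held equation handling to 8 points. Table structure was also partially broken.

It is useful for documents where structure and images matter, but it is not a good fit for academic papers that contain equations.

pymupdf: 54 points (D)

Excerpt from the actual output (§3 Methodology):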

**3** **Methodology**

We introduce TV-RAG, a novel training-free process designed for
LVLMs that can be seamlessly integrated into any existing LVLM.
As shown in Fig. 2, the process consists of three main phases: **(i) Se-**
**mantic entropy-based information extraction:** After obtaining the query, information is
extracted based on semantic entropy from different sources. **(ii) Temporal decay-enhanced
retrieval model:** In order to capture the important temporal information in the video, the
time window mechanism request is introduced for obtaining the relevant information.
**(iii) Context-enhanced reasoning-based response generation:** In this final stage, the
auxiliary text retrieved based on the context-enhanced reasoning mechanism is integrated
with the user's query and fed into the LVLM to generate the final output.

![](./images/TV-RAG.pdf-2-0.png)

invocation inflates inference latency, and (ii) many rely on closedsource models, limiting
both efficiency and the ease with which the community can replicate results using purely
open-source stacks.

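The section number is rendered only in bold (**3**), with no # heading structure. Image paths are referenced in the ![](./images/...) form. Equations are lost as plain text, and page breaks cause figure captions and body sentences to get interleaved.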
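If bold-only headings are the main structural gap, they can sometimes be promoted to real headings after the fact. The following is a sketch under a strong assumption: that numbered section titles always appear as a bold number followed by a single bold span on a line of their own, as in the excerpt. Multi-word titles split into several bold spans would need a looser pattern.

```python
import re

# Matches lines such as "**3** **Methodology**" (number, then one bold title).
BOLD_HEADING = re.compile(r"^\*\*(\d+(?:\.\d+)*)\*\* \*\*(.+?)\*\*$")


def promote_headings(markdown: str) -> str:
    """Rewrite bold 'number + title' lines as # headings, one level per number part."""
    out = []
    for line in markdown.splitlines():
        match = BOLD_HEADING.match(line.strip())
        if match:
            number, title = match.groups()
            level = number.count(".") + 1  # "3" -> #, "3.1" -> ##
            out.append(f"{'#' * level} {number} {title}")
        else:
            out.append(line)
    return "\n".join(out)


print(promote_headings("**3** **Methodology**\n\nWe introduce TV-RAG"))
```

Inline bold runs such as **(iii) Context-enhanced ...:** are left alone because the pattern requires the line to start with a bold section number.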

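Text accuracy (15 points) and image references (10 out of 10) are solid, but scores of 4 each on equations and tables dragged the total down sharply. Structure preservation also stopped at 10 points, half of the maximum, because the tool does not recognize section structure and emits most content as plain text.

If the goal is fast text extraction and the document has no equations or tables, pymupdf is a practical choice. It falls short for converting structured academic papers.

pdftotext: 39 points (F)

Excerpt from the actual output (§3 Methodology):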

3

Methodology

We introduce TV-RAG, a novel training-free process designed for
LVLMs that can be seamlessly integrated into any existing LVLM.
As shown in Fig. 2, the process consists of three main phases: (i) Semantic entropy-based
information extraction: After obtaining the query, information is extracted based on
semantic entropy from different sources. (ii) Temporal decay-enhanced retrieval
model: In order to capture the important temporal information in the video, the time
window mechanism request is introduced for obtaining the relevant information.
(iii) Context-enhanced reasoning-based response generation: In this final stage, the
auxiliary text retrieved based on the context-enhanced reasoning mechanism is integrated
with the user's query and fed into the LVLM to generate the final output.

Problem Setup. Let V be an input video. We then use a frame–
selection unit to extract 𝑁 representative images F Each frame is
then mapped into a visual embedding via a frozen image encoder,
e.g., CLIP-L [23], yielding F𝑣 from F . Finally, the visual tokens F𝑣
and a user query Q are supplied to a large video–language model
to generate the answer O:
O = LVLM(F𝑣 , Q).

(1)

R = LVLM(P, Q), R = {R𝑖 },
(2)

MM'25, October 27–31, 2025, Dublin, Ireland

Zongsheng Cao et al.

where P is the prompt. In the second phase, these requests, R, along
with the video frames, F𝒗 , are used to generate the final output:
O = LVLM(F𝒗 , Q, R).

(3)

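The section number 3 is split onto its own line, separated from the section title. Equations survive only as plain text such as O = LVLM(F𝑣 , Q)., and the equation numbers (1) and (2) are left stranded on their own lines. The page header (MM'25, October 27–31, 2025, Dublin, Ireland) is inserted in the middle of the body. There are no image references at all.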
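For plain-text indexing, running headers and stranded equation numbers like these are often filtered out before chunking. The sketch below assumes the running-header strings are already known (here they are passed in by hand) and drops any line consisting solely of a parenthesized number; `strip_page_furniture` is a hypothetical helper. Dropping equation numbers loses cross-reference targets, so this only makes sense when equations are not being preserved anyway.

```python
import re

# A line that is nothing but an equation number, e.g. "(1)".
EQ_NUMBER = re.compile(r"^\(\d+\)$")


def strip_page_furniture(text: str, running_headers: set[str]) -> str:
    """Remove known running-header lines and lone equation-number lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in running_headers or EQ_NUMBER.match(stripped):
            continue
        kept.append(line)
    return "\n".join(kept)


headers = {"MM'25, October 27–31, 2025, Dublin, Ireland", "Zongsheng Cao et al."}
page = (
    "R = LVLM(P, Q), R = {Ri },\n"
    "(2)\n"
    "MM'25, October 27–31, 2025, Dublin, Ireland\n"
    "Zongsheng Cao et al.\n"
    "where P is the prompt."
)
print(strip_page_furniture(page, headers))
```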

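Raw text accuracy is 15 points, on par with pymupdf and opendataloader-pdf, and the output is relatively clean at 7 points on noise. But as the name says, it extracts text only. With 5 on structure preservation, 4 on equation handling, 4 on table handling, and 0 on image references, most of the structural information needed to feed an academic paper into a RAG pipeline is gone.

If the goal is plain text indexing and Markdown structure is not needed, pdftotext is a stable choice with few dependencies.

markitdown: 14 points (F)

Excerpt from the actual output (start of file through the author block):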

5
2
0
2

c
e
D
9
2

]

V
C
.
s
c
[

1
v
3
8
4
3
2
.
2
1
5
2
:
v
i
X
r
a

TV-RAG: A Temporal-aware and Semantic Entropy-Weighted
Framework for Long Video Retrieval and Understanding

Zongsheng Cao∗
agiczsr@gmail.com
Researcher

Feng Chen
chenfeng@lenovo.com
PCIE

Yangfan He∗
he00577@umn.edu
UMN

Zepeng Wang
wangzpb@lenovo.com
PCIE

Anran Liu∗†
anniegogo1008@gmail.com

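The arXiv version identifier (2512.23483v1, [cs.CV], 29 Dec 2025) fills the top of the file, split vertically one character per line. In the author block, each name, email, and affiliation lands on its own line, and the entries are interleaved in PDF rendering order rather than two-column reading order. The same loss of structure continues through the body.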
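This failure mode is easy to detect mechanically: in normal prose almost no line is a single character, while rotated margin text extracted this way is nothing but single-character lines. The sketch below computes that ratio; the function name and any threshold applied to its result are assumptions, not part of the scoring used here.

```python
def single_char_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that contain exactly one character."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(len(line) == 1 for line in lines) / len(lines)


noisy = "5\n2\n0\n2\n\nc\ne\nD\n9\n2\n\n]\n"
clean = "TV-RAG: A Temporal-aware and Semantic Entropy-Weighted\nFramework for Long Video Retrieval and Understanding"
print(single_char_line_ratio(noisy), single_char_line_ratio(clean))
```

A pipeline could use a high ratio over the first few dozen lines as a signal to retry the document with a different converter.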

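It is the only one of the five tools to score 0 on all of equation handling, table handling, and image references. Noise is also 0: the output is laced with metadata contamination and meaningless characters. Text accuracy reached only 5 points.

markitdown is a general-purpose converter that turns many formats into Markdown, including Office documents (Word, Excel, PowerPoint), HTML, and images. It supports PDF, but internally relies on simple pdfminer-based text extraction. It has no logic for detecting equation regions or table structure, so the noise ratio climbs steeply on complex layouts such as academic papers.

markitdown is a good fit for converting Word documents or PowerPoint slides. It is not recommended for complex PDFs, especially academic papers with equations and tables.

What made the difference

All five tools layer heuristic logic on top of a PDF parser. The differences come from what each one recognizes and converts after the parsing stage.

Equation handling. Only mineru has a pipeline that identifies equation regions separately during layout analysis and converts them to LaTeX. The other tools extract equation regions as ordinary text or lose them outright. Equations often carry the core meaning of an academic paper, so this difference decides whether the output is usable.

Table structure preservation. mineru rebuilds tables as HTML tables. opendataloader-pdf attempts Markdown tables but fails on merged columns. pymupdf and pdftotext extract only the cell data and discard the structure. markitdown does not recognize tables at all.

Image references. opendataloader-pdf and pymupdf extract images to files and reference them by Markdown path. mineru references some images but not all. pdftotext and markitdown do not reference images.

Layout awareness. Two-column layouts are where simple text extractors commonly struggle. Unless reading order is rebuilt column by column, text from the left and right columns comes out interleaved. mineru's pipeline backend includes layout analysis, so it handles this best.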
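To make the reading-order problem concrete, here is a toy sketch of column-wise reordering. It is not how mineru or any of the tools works internally: it assumes each text block comes with the coordinates of its top-left corner, splits blocks at the page midline, and reads the left column top to bottom before the right one. Full-width elements such as titles or spanning figures are not handled.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x0: float  # left edge of the block
    y0: float  # top edge of the block (smaller means higher on the page)
    text: str


def reading_order(blocks: list[Block], page_width: float) -> list[str]:
    """Order blocks left column first, then right column, each top to bottom."""
    midline = page_width / 2
    left = sorted((b for b in blocks if b.x0 < midline), key=lambda b: b.y0)
    right = sorted((b for b in blocks if b.x0 >= midline), key=lambda b: b.y0)
    return [b.text for b in left + right]


blocks = [
    Block(320, 100, "R1"),
    Block(50, 100, "L1"),
    Block(50, 300, "L2"),
    Block(320, 300, "R2"),
]
print(reading_order(blocks, page_width=612))
```

Sorting the same blocks by vertical position alone would yield L1, R1, L2, R2 in some order, which is the interleaving described above.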

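Recommendations

Use case	Recommended tool	Reason
Academic papers (equations/tables required)	mineru	Flawless LaTeX equations and HTML tables, grade A
Image-centric + structure preservation	opendataloader-pdf	Complete image references, strong structure
Simple documents with images	pymupdf	Good image references, fast processing
Text only	pdftotext	Minimal dependencies, little noise
Office/HTML → MD	markitdown	Strong on formats other than PDF

For a RAG pipeline that automatically collects and indexes academic papers, mineru is the best option at present. It is slower than the other tools, though, and its install footprint is large (it pulls in PyTorch). In large batch-processing environments, conversion time and resource cost need to be factored in.

Text-centric documents and structurally simple PDFs are well served by pymupdf or pdftotext. A practical approach is to classify documents by type at each pipeline stage and apply a different converter to each.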
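The routing idea can be sketched as a small decision function that mirrors the recommendation table. Everything about the interface is hypothetical: the function name, the flags, and the assumption that an upstream step already knows whether a document contains equations, tables, or images.

```python
OFFICE_AND_WEB_FORMATS = {"docx", "xlsx", "pptx", "html"}


def choose_converter(
    doc_format: str,
    has_equations_or_tables: bool = False,
    needs_images: bool = False,
    needs_structure: bool = False,
) -> str:
    """Pick a converter name following the recommendation table."""
    fmt = doc_format.lower()
    if fmt in OFFICE_AND_WEB_FORMATS:
        return "markitdown"
    if fmt != "pdf":
        raise ValueError(f"no recommendation for format: {doc_format}")
    if has_equations_or_tables:
        return "mineru"  # academic papers with equations/tables
    if needs_images and needs_structure:
        return "opendataloader-pdf"  # image-centric with structure
    if needs_images:
        return "pymupdf"  # simple documents with images
    return "pdftotext"  # text only


print(choose_converter("pdf", has_equations_or_tables=True))
```

Checking the equation/table flag first reflects the table's priority: when equations or tables must survive, the slower converter is chosen regardless of the other needs.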