GraphDB Benchmark, Eight Engines (Part 1) — Decomposing the RCTE 290x Gap
- 1. GraphDB Benchmark, Eight Engines (Part 1) — Decomposing the RCTE 290x Gap
- 2. GraphDB Benchmark (2/2) — Workload Matrix and Final Recommendations
Knowledge-graph retrieval workloads — the kind GraphRAG and LightRAG generate — were measured across eight graph engines under identical data (1.14M edges) and identical infrastructure (GCP n2-standard-8). PostgreSQL with Recursive CTE delivered 290x the throughput of Apache AGE running on the same PostgreSQL instance. The gap traces to two specific costs: cypher() wrapper overhead of ~13ms per call and accumulated PostgreSQL plan generation. This post covers the methodology, the workload distribution that decides what "fast" means here, the internal mechanics of RCTE versus AGE, and where exactly the gap is created. The full eight-engine landscape and workload-based recommendations follow in Part 2.
Setup
Eight engines were measured across nine phases, all on the same dataset and the same class of infrastructure.
The data is a Wikipedia-derived knowledge graph built with rule-based parsing — no LLM extraction. The schema mirrors LightRAG's property graph model: Document, Chunk, and Entity node types, connected by HAS_CHUNK (Document → Chunk) and MENTIONS (Chunk → Entity) edges. At final scale: 50,000 documents, 108,076 chunks, 587,428 entities, 1,144,177 MENTIONS edges, 108,076 HAS_CHUNK edges — approximately 745K total nodes and 1.25M total edges. Each chunk mentions an average of ten entities; each entity appears in an average of two chunks, producing a sparse bipartite structure that approximates real LightRAG output.
The lightrag_pg_not_age subsequent verification project replaced this proxy data with actual LightRAG output (LLM-extracted entities from a Korean document corpus), reaching the same directional conclusion. That data set is smaller — 8,050 nodes / 25,844 edges — and the absolute numbers differ, but the ranking holds.
Infrastructure was GCP n2-standard-8 (8 vCPU, 32 GB RAM) per engine VM, in the asia-northeast3 region (Seoul), communicating over VPC-internal networking. Eight VMs in total: one per engine under test, plus a dedicated load generator. From Phase 6 onward the load generator ran a Go static binary (bench-native, CGO=0) with integrated drivers for Bolt, RESP, and PostgreSQL wire — replacing an earlier JMeter/JSR223 harness that introduced measurable per-request overhead.
The eight engines measured:
| Engine | Category | Version |
|---|---|---|
| PostgreSQL RCTE | PG, flat SQL | PG 17 |
| Apache AGE | PG extension, Cypher wrapper | 1.6.0 on PG 17 |
| Neo4j | Native graph DB | 5.20 Community |
| MemGraph | In-memory native graph | latest |
| FalkorDB | Redis-module graph (RedisGraph successor) | latest |
| AgensGraph | PG fork with native Cypher executor | v2.16 on PG 16 |
| pgRouting | PG extension, road-network routing | current |
| PGQ (SQL/PGQ) | PG 19-dev, ISO SQL:2023 graph | PG 19-dev |
PGQ is on a development branch and excluded from production-recommendation discussion; it appears in throughput tables for completeness. The remaining seven engines are the primary comparison subjects.
How the benchmark was run
Harness
Phases 1–5 used Apache JMeter with a JDBC sampler and a JSR223 pre-processor to inject seed IDs from a CSV pool. From Phase 6 onward the JMeter harness was replaced with a Go static binary (bench-native, CGO=0) that holds one persistent connection per worker goroutine and drives all seven engines through a shared Engine interface:
// engine/engine.go — the interface every engine adapter implements
type Engine interface {
Name() string
Setup(cfg map[string]string) error
Execute(queryName string, params map[string]any) (rows int, err error)
Close() error
}Execute is called in a tight loop per worker with no think-time delay. The loop records wall-clock elapsed time per call and writes JTL rows atomically.
Workload mix
Q1–Q7 weights are declared once and shared by all engine adapters:
// engine/engine.go
var QueryNames = []string{
"Q1_POINT_LOOKUP",
"Q2_CHUNK_TO_ENTITY",
"Q3_ENTITY_TO_CHUNK",
"Q4_CHUNK_ENTITY_CHUNK",
"Q5_DOC_EXPAND",
"Q6_CO_ENTITY",
"Q7_HYBRID_EXPAND",
}
// 30/30/20/10/3/4/3 → matches Phase 5 JMX distribution
var QueryWeights = []int{30, 30, 20, 10, 3, 4, 3}Each worker samples the next query index via a weighted random draw against a pre-built cumulative array, picks a random seed from the 500-row CSV pool, and calls eng.Execute. There is no per-query rate cap; throughput is purely a function of how fast the engine returns.
Concurrency sweep and run script
# bench/native/run_phase6.sh (condensed)
ENGINES="neo4j memgraph age rcte pgq falkordb"
USERS="1 10 50"
DURATION="180s"
RAMP="10s"
for eng in $ENGINES; do
for u in $USERS; do
./bench-native \
--engine="$eng" \
--users="$u" \
--duration="$DURATION" \
--ramp="$RAMP" \
--csv=seeds_random.csv \
--jtl="results/${eng}_u${u}.jtl" \
${ARGS[$eng]}
done
doneEach (engine, concurrency) cell runs for 180 seconds after a 10-second linear ramp. Workers start staggered across the ramp window: worker w sleeps ramp × w / total_users before entering the loop.
Metric collection
Every call produces one JTL row (timeStamp, elapsed, label, success). After all runs, analyze_jtl.py reads the CSV and computes per-label percentiles and aggregate RPS:
# bench/jmeter/analyze_jtl.py (core logic)
s = sorted(elapsed_ms_list)
n = len(s)
p50 = s[int(n * 0.50)]
p95 = s[int(n * 0.95)]
p99 = s[min(int(n * 0.99), n - 1)]
duration_s = (t_end_ms - t_start_ms) / 1000
rps = n / duration_sThe 180-second window used in Phase 6 headline numbers contains between ~14,000 calls (AGE, u=50) and ~4,060,000 calls (RCTE, u=50), giving stable p99 estimates at both ends of the throughput range.
Workload distribution
What a benchmark measures is always a product of the load it applies. This one is built around the retrieval call distribution that LightRAG generates when serving queries — not synthetic graph traversal benchmarks.
LightRAG's BaseGraphStorage interface exposes 18 abstract methods. During retrieval, the actual call distribution across workers=10, measured in lightrag_pg_not_age, is:
| Method | Share |
|---|---|
get_node |
35% |
get_node_edges |
25% |
upsert_node |
15% |
node_degree |
15% |
has_node |
10% |
get_node and has_node are point lookups: given an entity ID, retrieve the node record or check existence. get_node_edges is a 1-hop forward scan: given an entity ID, return all edges from that node. node_degree counts edges at a node. Together, get_node + has_node + get_node_edges + node_degree account for 85% of all calls — every one of them a 1-hop or sub-hop operation. The remaining 15% is upsert_node, a write.
The multi-hop benchmark queries (Q1–Q7) in the main Phase 6 measurement approximate this distribution at scale:
| Query | Pattern | Weight |
|---|---|---|
| Q1 | Point lookup by Chunk ID | 30% |
| Q2 | 1-hop: Chunk → MENTIONS → Entity | 30% |
| Q3 | 1-hop reverse: Entity → Chunk | 20% |
| Q4 | 2-hop: Chunk → Entity → Chunk | 10% |
| Q5 | Document → all Chunks | 3% |
| Q6 | Co-entity count, GROUP BY | 4% |
| Q7 | Multi-seed UNWIND expand | 3% |
Q1 through Q3 — the three dominant patterns — are all 1-hop or point lookups, covering 80% of the mix. Q4 is the only 2-hop pattern and carries 10% weight. Q5–Q7 together are 10%.
This distribution has a direct consequence for what "throughput" means in this benchmark: it is primarily measuring the cost of repeatedly executing 1-hop lookups under concurrency pressure. Graph database features that shine on deep traversal, variable-length path queries, or community detection are simply not on the critical path of this workload. An engine that optimizes those features at the expense of per-call overhead on simple lookups will measure poorly here — regardless of how impressive its graph algorithm library is.
get_knowledge_graph — LightRAG's BFS-based neighborhood expansion — does run multi-hop traversal, but it is capped: depth=2, max_nodes=50. It is called for visualization and context enrichment, not on every retrieval. Its contribution to total call volume is minor, and the depth-2 cap with a hard node limit means that RCTE's known scaling problem (frontier explosion in unconstrained deep BFS) is never triggered in this workload.
The measurement axes that follow from this distribution are aggregate RPS (requests per second) under concurrent load, and p50/p95/p99 latency per request type. At u=50 (50 concurrent workers, 180-second steady-state window), the test exercises both throughput ceiling and tail latency under realistic RAG serving concurrency.
Headline numbers — RCTE 22,581 RPS, AGE 78 RPS
RPS at u=50 (Phase 6, Go harness, OLTP mix):
RCTE ████████████████████████████████████████ 22,581
MemGraph████████████████████████████████████████ 22,462 (OOM-kill ×9)
Neo4j ██████████████████████████ 14,541
PGQ █████ 3,157 (dev branch)
FalkorDB█ ~296 (99.3% error)
AGE 78Phase 6 OLTP mix at u=50 (180-second window, Go harness, no errors):
| Engine | RPS | Error % | Note |
|---|---|---|---|
| RCTE (PG 17, flat SQL) | 22,581 | 0.0% | |
| MemGraph (in-RAM) | 22,462 | 0.0% | 1 GB container: OOM-kill 9 times |
| Neo4j (Bolt v5) | 14,541 | 0.0% | |
| PGQ (PG 19-dev SQL/PGQ) | 3,157 | 0.0% | dev branch |
| AGE (cypher() wrapper) | 78 | 6.9% | |
| FalkorDB | ~296 effective | 99.3% | ConnectionPool exhaustion |
| AgensGraph (PG 16 fork) | catastrophic in mix | mixed | Q7 single query: minutes-scale |
The 290x gap between RCTE (22,581) and AGE (78) is the headline. Both run on the same PostgreSQL 17 instance — same binary, same data, same hardware. The difference is entirely interface: RCTE calls flat SQL directly; AGE routes every call through the cypher() wrapper function.
At lower concurrency the gap is already visible but smaller. Phase 2 (JMeter harness, same 1.14M-edge dataset, GCP) measured:
| Concurrency | Neo4j RPS | AGE RPS | Ratio |
|---|---|---|---|
| u=1 | 503 | 33.6 | 15x |
| u=10 | 3,375 | 81.7 | 41x |
| u=50 | 4,836 | 88.6 | 55x |
Note that Phase 2 did not yet include RCTE as a candidate — that was added in Phase 4. The Phase 2 Neo4j vs AGE ratio was 55x at u=50; once RCTE entered the comparison in Phase 4 and the harness was replaced with the Go binary in Phase 6, the RCTE/AGE ratio grew to 290x. AGE's throughput barely scales with concurrency (33 → 88 RPS across u=1 to u=50, a 2.6x increase), while Neo4j scaled 9.6x across the same range. RCTE scaled approximately linearly.
Memory constraint results (Phase 7, 1 GB container, u=50):
| Engine | RPS | Change | Pattern |
|---|---|---|---|
| RCTE | 22,370 | -0.4% | Graceful OS page cache adaptation |
| Neo4j | FAIL | — | JVM cannot start below ~1.5 GB heap |
| MemGraph | 46,519 at 93.5% error | -93.5 pp | OOM-kill restart loop |
RCTE's -0.4% throughput degradation when the container memory ceiling drops from 32 GB to 1 GB is structurally explained: PostgreSQL's shared_buffers was set to 128 MB (scaled from the 8 GB baseline), working memory per query is ~26 MB, and additional data access falls through to the OS page cache. cgroup v2 memory limits apply to RSS but host page cache remains accessible. The result is graceful degradation rather than a cliff.
Why RCTE wins this workload
RCTE is not a graph database. It is two ordinary PostgreSQL tables with B-tree indexes, queried with standard SQL. The graph schema is:
CREATE TABLE nodes (
id text PRIMARY KEY,
label text,
properties jsonb
);
CREATE TABLE edges (
src text REFERENCES nodes(id),
dst text REFERENCES nodes(id),
edge_type text,
properties jsonb,
PRIMARY KEY (src, dst, edge_type)
);
CREATE INDEX ON edges(src);
CREATE INDEX ON edges(dst);A 1-hop lookup — the dominant pattern at 60%+ of calls — resolves to a single join plan.
The benchmark used denormalized flat tables (rcte_chunks, rcte_mentions, rcte_has_chunk) rather than the generic nodes/edges schema shown above. The actual queries executed in Phase 4–6:
-- Q1 (30%): point lookup by chunk ID
SELECT id
FROM rcte_chunks
WHERE id = $1;
-- Q2 (30%): 1-hop forward — chunk → entities it mentions
SELECT entity_id
FROM rcte_mentions
WHERE chunk_id = $1
LIMIT 50;
-- Q3 (20%): 1-hop reverse — entity → chunks that mention it
SELECT chunk_id
FROM rcte_mentions
WHERE entity_id = $1
LIMIT 50;
-- Q4 (10%): 2-hop — chunk → shared entities → other chunks
SELECT DISTINCT m2.chunk_id
FROM rcte_mentions m1
JOIN rcte_mentions m2 ON m1.entity_id = m2.entity_id
WHERE m1.chunk_id = $1
AND m2.chunk_id <> $1
LIMIT 50;Q1, Q2, and Q3 each touch one table with a single equality predicate on an indexed column. The PostgreSQL planner selects an index scan on rcte_mentions(chunk_id) or rcte_mentions(entity_id) and returns a small result set immediately. There is no parsing of a query language, no internal AST construction, no translation layer. The plan is generated once and cached in the prepared-statement cache. On repeated calls with different parameter values, the planner overhead is near zero.
Q4 is a self-join on rcte_mentions. At 1.14M MENTIONS rows it requires two index scans and a hash join, but with LIMIT 50 the executor can short-circuit after finding fifty matching rows. The query stays bounded in output size by the application's own semantics.
The get_knowledge_graph BFS — the only multi-hop operation on the critical path — is a single WITH RECURSIVE query:
WITH RECURSIVE walk AS (
SELECT $1 AS node_id, 0 AS depth, ARRAY[$1] AS path
UNION ALL
SELECT e.dst, w.depth + 1, w.path || e.dst
FROM walk w
JOIN edges e ON e.src = w.node_id
WHERE w.depth < 2
AND NOT e.dst = ANY(w.path)
)
SELECT DISTINCT node_id FROM walk LIMIT 50;The WHERE depth < 2 clause enforces LightRAG's depth=2 cap. NOT e.dst = ANY(w.path) prevents cycle-driven frontier explosion. LIMIT 50 mirrors LightRAG's max_nodes=50. The query stays bounded in both depth and node count — the two conditions that would cause WITH RECURSIVE to perform poorly (unbounded depth, exponential frontier) are explicitly prevented by the application's own semantics.
PostgreSQL has been optimizing OLTP workloads — heavy point-lookup and short-range-scan patterns — for decades. The process-per-connection model, the B-tree implementation, the planner's nested-loop join costing, and the prepared-statement cache are all tuned for exactly this access pattern. A workload that is 85% 1-hop lookups and 15% bounded BFS is not a graph problem in any deep algorithmic sense; it is a relational lookup problem with an adjacency-list schema, and relational lookup is what PostgreSQL does.
Why AGE is slow on this workload
Apache AGE runs on the same PostgreSQL 17 instance. Same binary, same hardware, same data loaded in parallel tables under AGE's catalog (ag_vertex, ag_label_edge). The structural difference is the interface.
Every AGE query passes through cypher(). The benchmark queries equivalent to Q1–Q4 above:
-- Q1: point lookup
SELECT * FROM cypher('graphdb', $$
MATCH (c:Chunk {id: '%s'})
RETURN c.id, c.word_count, c.section_title
$$) AS (id agtype, word_count agtype, section_title agtype);
-- Q2: 1-hop forward — chunk → entities
SELECT * FROM cypher('graphdb', $$
MATCH (c:Chunk {id: '%s'})-[:MENTIONS]->(e:Entity)
RETURN e.id, e.canonical_title
LIMIT 50
$$) AS (id agtype, canonical_title agtype);
-- Q3: 1-hop reverse — entity → chunks
SELECT * FROM cypher('graphdb', $$
MATCH (e:Entity {id: '%s'})<-[:MENTIONS]-(c:Chunk)
RETURN c.id, c.doc_id, c.ordinal
LIMIT 50
$$) AS (id agtype, doc_id agtype, ordinal agtype);
-- Q4: 2-hop — chunk → shared entities → other chunks
SELECT * FROM cypher('graphdb', $$
MATCH (c1:Chunk {id: '%s'})-[:MENTIONS]->(e:Entity)<-[:MENTIONS]-(c2:Chunk)
WHERE c2.id <> '%s'
RETURN c2.id, c2.doc_id, e.canonical_title
LIMIT 50
$$) AS (id agtype, doc_id agtype, canonical_title agtype);The cypher() function receives the Cypher string as a text argument. On each invocation it must:
- Parse the Cypher string through AGE's grammar (
cypher_gram.y,ag_scanner.l) - Construct an internal AST representing the graph pattern
- Translate that AST into a PostgreSQL internal plan tree — generating N-way JOINs from MATCH clauses
- Map result types through AGE's
agtypelayer back to PostgreSQL types - Return a set of records to the caller
This translation happens on every call. PostgreSQL's prepared-statement cache operates at the SQL level: the outer SELECT * FROM cypher(...) can be prepared, but the Cypher string inside is opaque to the planner. PG cannot cache the inner graph plan, only the outer wrapper. The net effect is that plan generation cost accumulates with every call.
Phase 3 source analysis confirmed the cost profile by examining AGE's C implementation. The cypher_clause.c translation layer generates N-way JOIN plans from MATCH patterns. For the Q1 point-lookup pattern — the simplest possible query — this overhead was measured at approximately 100x the latency of an equivalent direct SQL query. Phase 8 introduced AgensGraph (a PostgreSQL 16 fork with a native Cypher executor, no wrapper function) to isolate wrapper cost from planner cost. AgensGraph removes the cypher() wrapper entirely; its Cypher executor is integrated directly into PostgreSQL's plan tree. Q1 throughput on AgensGraph at u=1 was 4,795 RPS — roughly 100x faster than AGE on the same query type — confirming that the wrapper function accounts for the bulk of AGE's per-call overhead on simple lookups.
The Phase 3 internal analysis identified the wrapper cost as approximately 13 ms per call in the Phase 2 load conditions (GCP, u=50, JMeter harness). With the Go harness in Phase 6 the absolute numbers shifted, but the structural cause did not.
The Phase 3 EXPLAIN ANALYZE on AGE Q4 (2-hop, u=50) showed the plan PostgreSQL selected:
Hash Join (cost=...) (rows=8436)
-> Nested Loop
Rows Removed by Join Filter: 19,450,994
-> Hash (buckets=...)PostgreSQL's planner translated the MATCH (c1)-[:MENTIONS]->(e)<-[:MENTIONS]-(c2) pattern into a three-way join across the MENTIONS table. With 1.14M MENTIONS rows and weak selectivity estimates on agtype columns, the planner enumerated approximately 19.4 million intermediate rows before applying the join filter. Q4 p50 at u=50: 1,247 ms. The equivalent RCTE Q4 (self-join on rcte_mentions with an indexed equality predicate) returned in single-digit milliseconds.
A second cost source is PostgreSQL's planner behavior on more complex graph traversal patterns. AGE translates Cypher patterns like MATCH (c)-[:MENTIONS]->(e)<-[:MENTIONS]-(c2) into multi-way JOIN plans. PostgreSQL's query optimizer is designed for set-oriented relational algebra, not graph traversal cost estimation. On Q7 — a multi-seed UNWIND pattern that expands from several seed nodes — the planner selects a plan that does not terminate in reasonable time at 1.14M-edge scale. Q7 runs for minutes without returning. Because Q7 holds its worker connection for the full duration, at u=50 all workers eventually block on Q7 calls and throughput collapses. This is the primary driver of the 6.9% error rate at u=50 — workers timing out waiting for Q7.
AgensGraph removes the wrapper but not the planner problem. On OLTP mix (Q1–Q7 combined), AgensGraph processed approximately 11 executions in 9 minutes 7 seconds (0.6 RPS effective), because Q7 blocked every worker the same way it blocked AGE. The PG planner limit is structural to all Cypher-on-PostgreSQL approaches, not to the wrapper architecture.
Phase 3 summarized the decomposition:
| Cost source | AGE | AgensGraph |
|---|---|---|
| cypher() wrapper per call (~13ms, 100x on Q1) | present | absent |
| PG planner limit on multi-hop patterns | present | present |
RCTE avoids both. SQL plans on flat tables are within the PostgreSQL planner's design envelope. No translation layer exists. The per-call cost is the index scan plus nested-loop join — measured in microseconds, not milliseconds.
The gap as workload-structure match
The 290x figure is not a general statement about RCTE versus AGE across all graph workloads. It is a statement about a specific workload structure meeting a specific cost profile.
The workload is 85%+ 1-hop lookups and bounded-depth BFS. This structure has two properties that determine the outcome:
The dominant operations are point lookups and small range scans — exactly what B-tree indexes on flat tables serve in single-digit millisecond or sub-millisecond response time. Adding a translation layer (cypher()) on top of an operation that is already fast makes the translation cost disproportionately visible.
The graph-specific capabilities that distinguish a graph database from a flat SQL table — variable-length path queries,
shortestPath, cycle detection, community detection — are absent from the retrieval critical path. LightRAG'sget_knowledge_graphcaps traversal at depth 2, max_nodes 50. Variable-depth path queries andshortestPathdo not appear anywhere in the 18-methodBaseGraphStorageinterface on the hot path.
Where graph databases dominate RCTE in this same benchmark: Q11 shortestPath at u=10 measured Neo4j at 5,739 RPS versus RCTE at approximately 0.02 RPS (single query taking ~50 seconds). Q8 variable-depth path ([*1..4]) measured Neo4j at 2,820 RPS versus RCTE at 0.2 RPS. These operations require algorithms — Dijkstra, A*, bidirectional BFS — that operate efficiently on in-memory graph representations with early termination. RCTE must enumerate all paths via UNION ALL expansion, which is exponential in depth and impractical at 1.14M edges without strict bounding.
Workload-structure match explains the gap, and its direction can reverse completely depending on which queries dominate. A workload where shortestPath or [*1..N] variable-depth traversal is the critical path would produce results that are the mirror image of these numbers — and that analysis is the subject of Part 2.
For the LightRAG and general GraphRAG retrieval pattern: the design of LightRAG's retrieval pipeline concentrates 85% of call volume in 1-hop operations deliberately. The authors' choice to cap get_knowledge_graph at depth 2 and max_nodes 50 is not a limitation — it is what makes the system fast. The workload structure is also a design choice, and RCTE's advantage is a consequence of that choice aligning with flat SQL's strengths.
The lightrag_pg_not_age project, which implements the full BaseGraphStorage interface in plain PostgreSQL and tests it against NetworkX, Neo4j, and AGE backends with actual LightRAG output data, reached the same directional conclusion independently. On the 35/25/15/10/15 call distribution (get_node / get_node_edges / upsert_node / node_degree / has_node), RCTE measured 12,776 RPS against Neo4j at 1,684 RPS (7.6x gap) and AGE at 175 RPS (73x gap) on Barabási-Albert synthetic data. On real LightRAG EZIS data: RCTE 9,169 / Neo4j 1,693 / AGE 159 RPS. The absolute numbers are lower than Phase 6 (smaller infrastructure, different dataset), but the ordering is consistent with the main benchmark across two independent measurement setups.
Coming in Part 2
Part 2 covers Neo4j as the safe all-rounder, MemGraph's OOM risk, FalkorDB's catastrophic failure mode, AgensGraph's Q1/Q7 asymmetry, pgRouting limits, Microsoft GraphRAG / Leiden community-dependent systems, and the full workload-based recommendation matrix.