16 min read
AI assisted

GraphDB Benchmark (2/2) — Workload Matrix and Final Recommendations

Eight graph engines measured across OLTP, memory, analytics, and differentiation queries

Series — GraphDB Benchmark, Eight Engines
  1. 1. GraphDB Benchmark, Eight Engines (Part 1) — Decomposing the RCTE 290x Gap
  2. 2. GraphDB Benchmark (2/2) — Workload Matrix and Final Recommendations

Part 1 explained why RCTE outperforms Apache AGE by 290×. The cypher() wrapper consumes 13ms on every call. PostgreSQL's planner recompiles the Cypher AST to SQL on every invocation. Those costs stack under concurrent load. But Part 1 left a question open: is AGE unusually slow, or do graph databases broadly fall short of RCTE? Part 2 expands the scope to eight engines and adds three measurement axes beyond OLTP throughput — memory constraints, analytics workloads, and differentiation queries — to give a workload-based answer to which engine belongs where.


Why We Changed the Measurement Tool

Phases 1–5 used JMeter. The setup: JMeter ThreadGroup → JSR223 Groovy → DB-specific Java driver → query execution. Phase 5 measured MemGraph at 5,881 RPS at u=50, only 8% ahead of Neo4j at 5,442 RPS. We concluded the two engines performed similarly.

Replacing the load generator with a Go native binary in Phase 6 revealed that conclusion was wrong.

Phase 5 (JMeter) → Phase 6 (Go) changes
MemGraph: 5,881 → 22,462 RPS (+282%)
Neo4j:    5,442 → 14,541 RPS (+167%)
RCTE:    12,334 → 22,581 RPS (+83%)
AGE:         89 →     78 RPS (-12%)

JMeter was adding tens of milliseconds of overhead to every Cypher engine call. JVM Stop-the-World GC pauses, JSR223 script recompilation on each sample, Groovy reflection costs — they accumulated. AGE actually decreased slightly: its bottleneck is server-side (the cypher() wrapper itself), so removing client-side overhead does nothing.

The lesson is that the measurement instrument shapes the results. A Go goroutine uses ~2KB of stack; a JVM thread uses ~512KB. At u=50 concurrent workers, JVM thread management overhead matters. All Phase 6+ results use the Go harness.


All Eight Engines — OLTP Results

Same conditions throughout: 50K docs / 108K chunks / 587K entities / 1.14M edges (Wikipedia rule-based parsing), GCP n2-standard-8 × 8 VMs, u=50 × 180 seconds.

Engine u=50 RPS Error rate Core characteristic
RCTE (PG 17) 22,581 0% Flat SQL + btree indexes
MemGraph 22,462 0% 100% in-RAM Cypher server
Neo4j 5.20 14,541 0% Record store + doubly-linked edges
SQL/PGQ (PG 19-dev) 3,157 0% GRAPH_TABLE planner overhead
AGE 1.6.0 78 6.9% cypher() wrapper per-call cost
AgensGraph 2.16 catastrophic PG 16 fork, stalls on Q7
FalkorDB catastrophic 99.3% Redis single-thread, pool exhaustion
pgRouting 2.0 0% Q11 only, graph rebuilt per call

Numbers alone leave the causes invisible. Here is the technical explanation for each result.

Why RCTE and MemGraph tie at 22.5K

Both engines plateau near 22,500 RPS. If this were server saturation, p50 latency would climb. In practice, both held p50 ≤ 1ms at u=50 — the server is not the bottleneck. The most likely explanation is the load generator itself: the bench VM (n2-standard-4, 4 vCPUs) may be the ceiling. Two engines with completely different internal architectures converging to the same number is a strong signal of an external constraint.

Why is RCTE fast? The queries operate against flat normalized tables.

-- Q4 2-hop: Chunk → Entity → Chunk
SELECT DISTINCT m2.chunk_id
FROM rcte_mentions m1
JOIN rcte_mentions m2 ON m1.entity_id = m2.entity_id
WHERE m1.chunk_id = $1 AND m2.chunk_id <> $1
LIMIT 50;

PostgreSQL's planner handles this as a hash join or index lookup. With a btree index on entity_id, finding entities for a given chunk is O(log n). There is no graph abstraction layer; the PG buffer pool caches index pages directly.

Why is MemGraph fast? The entire graph is in memory. Nodes live in a skiplist-based in-memory structure; edges are adjacency lists per node. A 1-hop traversal is pointer chasing. Cypher's planner pushes the LIMIT down so traversal stops the moment 50 results are found. Disk I/O is zero.

Why AGE is at 78 RPS

The cypher() function is a PL/pgSQL wrapper. Walking through what happens on each call:

1. Enter cypher() PL/pgSQL function            (~0.5ms)
2. Parse Cypher string into AST                (~2ms)
3. Transform AST to SQL plan                   (~3ms)
4. Execute generated SQL (actual query work)   (~1–5ms)
5. Convert and return agtype result            (~3ms)

Steps 1–3 and 5 repeat on every single call regardless of whether the query is identical to the last one. In our measurements, Q1 (simple point lookup) had p50 of 11ms for AGE versus 7ms for Neo4j. The underlying queries retrieve roughly the same data; the 4ms gap is wrapper overhead. Under u=50 concurrent requests, 50 workers each paying this fixed cost saturates at 78 RPS.

Why FalkorDB fails catastrophically at u=50

FalkorDB is a Redis module. Redis's core design is a single-threaded event loop. Every GRAPH.QUERY command is serialized on Redis's main thread.

u=1:  1 client  → main thread fully utilized   → 35 RPS
u=10: 10 clients → serialized, backpressure     → 267 RPS
u=50: 50 clients → connection pool exhausted    → 99.3% errors

The jump from u=1 to u=10 is 7.8× — already not linear, already showing saturation. At u=50, clients cannot acquire a connection from the pool and fail immediately. This is why the error p99 is 7ms: that's how long it takes to get a connection-refused error back, not how long a query takes. Q7 (UNWIND list + 2-hop expand) inherently exceeds Redis's default 1-second query timeout, compounding the problem.

FalkorDB's GraphBLAS engine is a genuine strength for batch analytics. It is structurally unsuited for concurrent OLTP retrieval.


Memory Constraints — Three Different Failure Modes

Needing to reduce memory is common in production: Kubernetes pod resource limits, cost optimization, migration to smaller instances. We capped container memory at 32GB → 4GB → 1GB.

Engine 32GB 4GB 1GB Behavior at 1GB
RCTE 22,470 22,652 22,370 (−0.4%) Normal
Neo4j 12,927 12,357 (−4.4%) JVM won't initialize
MemGraph 22,656 22,773 93.5% errors OOM-kill loop, 9 restarts

Three engines, three fundamentally different failure patterns.

Why RCTE holds at −0.4% with 1GB

PostgreSQL uses a two-level memory architecture: shared_buffers (PG-managed buffer pool) and OS page cache. In a 1GB container, shared_buffers was set to 128MB. The OS page cache continues drawing from the host's physical memory (31GB).

The key is how Linux cgroup v2 memory limits work. The limit constrains RSS (Resident Set Size) — anonymous memory the process owns. Page cache is managed by the host kernel as file-backed memory and does not count toward the container's RSS limit directly. When PostgreSQL reads from data files, the kernel loads those pages into the host page cache. Subsequent reads hit memory. The PG process's RSS inside the 1GB container measured at 154MB.

This architectural separation is why RCTE can run inside a 1GB container serving 22.4K RPS. The working set (hot index pages, frequently accessed rows) fits in page cache on the host regardless of the container's RSS limit.

Why Neo4j fails to start at 1GB

Neo4j 5.20 with GDS 2.6.9 requires approximately 512MB of JVM heap minimum. On top of that: JVM runtime initialization, Netty networking stack, class loading, GC metadata structures. Configuring heap=512MB and page_cache=256MB in a 1GB container puts total memory demand over the limit during JVM initialization. The process fails before serving a single query.

The 4GB result is more instructive. With heap=2GB and page_cache=1GB, throughput fell only 4.4% from baseline. Neo4j maintains an object cache inside the JVM heap in addition to the OS page cache, so even with page_cache reduced to 1GB, the heap cache partially compensates. Graceful degradation works well — but there is a hard floor at the JVM's minimum heap requirement. Any deployment plan needs to account for this floor.

Why MemGraph OOM-kills at 1GB

MemGraph's defining property is "100% in-RAM." All graph data is held in memory for query processing. Nodes live in a skiplist structure; edges are stored as per-node adjacency lists. With 745K nodes and 1.14M edges loaded, the MemGraph container's RSS was 820MB.

That is 96% of the 1GB container limit. During query processing, temporary memory allocations — join result buffers, sort intermediates, Cypher plan execution state — push RSS past the limit. The kernel fires the OOM-killer; the MemGraph process terminates. The container auto-restarts, MemGraph reloads data, the cycle repeats. Nine restarts in 120 seconds; 6.5% of requests succeeded.

The critical difference from RCTE and Neo4j: there is no graceful degradation path. RCTE can spill to OS page cache. Neo4j degrades slowly as page cache shrinks. MemGraph's entire value proposition is in-RAM performance; it has nowhere to put data when memory runs out. This failure mode is an operational risk — insufficient memory headroom means unannounced service interruptions under load.


Analytics Workload — FalkorDB Reverses

FalkorDB, which failed catastrophically under concurrent OLTP, shows a completely different profile for batch analytics. Single-client wall-time measurements.

Algorithm Neo4j GDS 2.6.9 FalkorDB algo.* Ratio
PageRank (maxIter=20) 4.98s 0.57s FalkorDB 8.7×
BFS (single seed, full graph) 2.32s 0.38s FalkorDB 6.1×
Louvain 16.73s Not supported

FalkorDB, which was dozens of times slower than Neo4j on OLTP, is 8.7× faster on PageRank. The explanation is GraphBLAS.

GraphBLAS represents graph algorithms as sparse matrix operations. PageRank is the power iteration r' = αAr + (1−α)/n, where A is the adjacency matrix. A graph with 1.14M edges has a 587K × 587K adjacency matrix with only 1.14M nonzero entries — an extremely sparse structure. GraphBLAS's sparse matrix-vector multiplication processes only those nonzero entries, which is far more efficient than dense matrix arithmetic.

Neo4j GDS implements PageRank by iterating over nodes in the JVM heap. JVM garbage collection pauses interrupt the computation; object indirection adds memory access overhead.

One important clarification: this advantage is not from multi-threading. sysmon at 0.1s sampling showed FalkorDB using 0.4 of 8 cores during PageRank. At 745K nodes, the sparse matrix multiplication fits within one core. FalkorDB's analytics advantage is algorithmic efficiency, not parallelism.

Why RCTE cannot practically compute PageRank

A PL/pgSQL implementation of PageRank is technically possible. It is not practically viable. Each iteration requires a temporary table creation, a hash aggregate over all nodes, and updating the rank values. PostgreSQL processes this in a single backend with no parallelism. The "state update + convergence loop" pattern doesn't map well to set-based SQL. At 745K nodes, expect 1–2 orders of magnitude slower than Neo4j GDS.

The practical pattern: export the PG flat-table graph to NetworkX or igraph as a one-time offline batch job → compute Louvain/Leiden communities → write community_id back as a column → serve it with a plain SQL lookup at query time. LightRAG's community detection fits this pattern.


Differentiation Queries — When a Graph DB is Actually Needed

Q1–Q7 measured retrieval throughput. Q8–Q15 tested whether RCTE can handle queries that are structurally native to graph engines. Measured at u=10 × 60 seconds.

Query Meaning Neo4j (u=10) RCTE (u=10) Gap
Q8 variable-length [*1..4] Depth 1–4 variable traversal 2,820 RPS 0.2 RPS 14,100×
Q11 shortestPath (depth ≤ 6) Shortest path between chunks 5,739 RPS 0.02 RPS 286,950×
Q13 multi-filter + EXISTS Filter with existence check 55.6 RPS 137.8 RPS RCTE 2.5× better
Q15 doc-scoped multi-join Document-scoped composite join 10,797 RPS 6,015 RPS Neo4j 1.8×

The 14,000× and 300,000× gaps are not performance differences. They reflect a structural mismatch. Here is why.

Why WITH RECURSIVE explodes on variable-length paths

SQL WITH RECURSIVE CTE implements BFS at the table level.

WITH RECURSIVE traversal AS (
    SELECT chunk_id, entity_id, 0 AS depth
    FROM rcte_mentions WHERE chunk_id = $1
    
    UNION ALL
    
    SELECT m.chunk_id, m.entity_id, t.depth + 1
    FROM rcte_mentions m
    JOIN traversal t ON m.entity_id = t.entity_id
    WHERE t.depth < 4
)
SELECT DISTINCT chunk_id FROM traversal LIMIT 50;

Two structural problems.

First: no automatic visited-set tracking. In a graph with cycles, already-visited nodes get revisited. A → B → C → A → B → C... Without cycle detection, the intermediate result set grows unboundedly until LIMIT stops it — but only after the cardinality explosion has already materialized into memory.

Second: no LIMIT push-down. PostgreSQL materializes each recursive step completely before advancing to the next. At depth 4 with 1.14M edges, each level's frontier expands geometrically. Billions of intermediate rows can accumulate before the LIMIT clause is applied at the final SELECT.

Cypher handles this differently. It uses a lazy iterator: traversal stops the moment LIMIT results are found. The visited set is built-in; cycles are detected and skipped automatically. The LIMIT propagates into the execution plan. For Q8 (depth 1–4, LIMIT 50), Neo4j finds 50 results and stops; RCTE attempts to materialize all reachable paths up to depth 4.

Why this weakness does not matter for LightRAG

LightRAG's runtime query distribution:

  • Q2 (chunk → entity), Q3 (entity → chunk): 1-hop — RCTE dominant
  • Q4 (2-hop), Q6 (co-entity): 2-hop max — RCTE sufficient
  • Q5, Q15: joins and expands — RCTE sufficient

Variable-length paths and shortestPath do not appear in LightRAG's retrieval flow. The reason is semantic: hop distance in the entity graph is not semantic relevance. Chunk A and Chunk B being 3 hops apart in the graph doesn't mean they're related. LightRAG finds entry entities via vector similarity, then collects 1–2 hop neighbors. The queries where RCTE fails are outside this workload.

Neo4j is the right choice when the workload includes: anti-fraud cycle detection (N-hop suspicious accounts), supply chain tracing (which final products contain a given component), recommendation latent path discovery. In those patterns, RCTE is not a slower alternative — it is impractical.

AGE's differentiation query results

AGE hit STUCK (single-query wall time exceeding 20 minutes) on 6 of 8 differentiation queries at u=10. The cypher() wrapper's concurrency contention doesn't just reduce throughput — under concurrent complex-pattern queries, it blocks all workers. This is a qualitative difference from Neo4j being slower. It means AGE is operationally dangerous for multi-user concurrent workloads with complex graph patterns.

To isolate the AGE gap's root cause, we measured AgensGraph (PG 16 fork with native Cypher executor, no cypher() wrapper). Q1 simple lookup ran at 4,795 RPS at u=1 — more than 100× faster than AGE, confirming the wrapper cost. Q7 multi-seed UNWIND stalled identically to AGE, confirming that the PG planner's graph pattern handling is a separate, independent cause. Removing the wrapper fixes one problem; the PG planner limit remains.


pgRouting — A Different Tool for a Different Domain

pgRouting is a PostgreSQL extension for GIS and road-network routing, originating in 2006 as an OpenStreetMap routing tool. It exposes Dijkstra, A*, TSP, and ~50 other algorithms as SQL functions. We ran Q11 shortestPath on the same 1.14M edge dataset.

Engine Q11 RPS (u=10) p50 p99
Neo4j 5,739 0.17ms
pgRouting 2.0 5,035ms 6,234ms

A single query averages 5 seconds. 2,870× slower than Neo4j. The query itself is simple:

SELECT count(*) FROM pgr_dijkstra(
    'SELECT id, source, target, cost, reverse_cost FROM pgr_edges',
    $src::bigint, $dst::bigint,
    directed := false
)

The cost breakdown explains the latency:

pgr_dijkstra() invocation
  → 1. Execute 'SELECT ... FROM pgr_edges'   — fetches all 1.14M rows, every call
  → 2. Build in-memory BGL graph from rows   — every call
  → 3. Run Dijkstra algorithm                — takes a few ms
  → 4. Return path as row set

Steps 1 and 2 dominate. Fetching 1.14M rows from disk and constructing a Boost Graph Library graph from scratch takes ~4.8 seconds. The algorithm itself (step 3) completes in milliseconds. There is no graph caching between calls — pgRouting is a SQL function, and each invocation has its own execution context.

This call model works for road networks because edge counts are manageable (thousands to hundreds of thousands). pgRouting's design target is single or infrequent path queries on spatial networks, not concurrent multi-user retrieval against million-edge graphs. Native graph databases keep the graph resident in memory; pgRouting reconstructs it every time.


Final Decision Guide

Workload First choice Second choice Avoid
LightRAG / GraphRAG retrieval (1–2 hop) RCTE Neo4j (if Cypher familiarity matters) AGE, FalkorDB
Memory ≤ 4GB constraint RCTE Neo4j (graceful degradation) MemGraph (OOM cliff)
Variable-length path / shortestPath Neo4j RCTE (impractical)
Batch analytics (PageRank / Louvain) Neo4j GDS FalkorDB algo.* (auxiliary) RCTE / AGE
OLTP + analytics mixed RCTE + offline batch Neo4j single engine
Cypher + high throughput MemGraph Neo4j
Road network / GIS routing pgRouting RCTE

Recommended stack for LightRAG: PostgreSQL 17 + pgvector + RCTE flat tables + offline NetworkX or igraph for community detection (write community_id column back). A single PostgreSQL instance handles the complete GraphRAG stack.

When Neo4j is the right choice: the workload includes variable-length paths or shortestPath; the team writes Cypher and migration cost is high; the system is not specifically LightRAG-style retrieval but knowledge graph reasoning, fraud detection, or recommendation latent path.


Scope

These conclusions apply to neighbor-lookup-heavy knowledge graph retrieval — the pattern of LightRAG, GraphRAG retrieval phases, and entity-relation search. Systems where Leiden community detection is central to the retrieval flow (Microsoft GraphRAG global summarization) fall outside this scope. A separate validation run on 100 internal company documents processed through LightRAG (1,493 extracted entities) confirmed the same engine rankings and query patterns — details in the lightrag-pg-rcte post.