PostgreSQL Lakehouse (1/2) — When DuckLake Hit a Wall, pg_lake Was There
- 1. PostgreSQL Lakehouse (1/2) — When DuckLake Hit a Wall, pg_lake Was There
- 2. PostgreSQL Lakehouse (2/2) — Distributed Processing and Citus Integration
The idea was straightforward: PostgreSQL as the Lakehouse catalog. DuckDB, Spark, and Trino could all query the same tables, while snapshot history, partition info, and file locations lived in PostgreSQL. No Databricks Unity Catalog. No AWS Glue. Just the PostgreSQL instance already running in production.
Eight phases of experiments later, the solution came from neither the technology we started with nor the obvious alternatives — but from an open-source project released in November 2025.
Why PostgreSQL as Catalog
The classic data architecture separates OLTP from OLAP. PostgreSQL handles transactions; Redshift or BigQuery handles analytics. A Lakehouse blurs that boundary. Data sits in object storage (S3, GCS) in open formats (Parquet, Iceberg), and multiple engines read and write the same files.
[DuckDB standalone] [pg_duckdb] [PySpark] [Trino]
↓ ↓ ↓ ↓
PostgreSQL (catalog — table metadata, snapshots, file paths)
↓
MinIO / GCS (Parquet files)The core assumption: PostgreSQL can act as a "metadata store and transaction coordinator." Each engine asks PostgreSQL which files to read. Writes register new snapshots back to PostgreSQL. The appeal was reusing existing infrastructure instead of running a separate catalog service.
The first candidate to try this was DuckLake.
Phase 1–2: DuckLake — The Catalog Won't Connect
DuckLake is the Lakehouse table format built by the DuckDB team, announced in 2024. The official documentation listed PostgreSQL catalog support. We started with a local MinIO environment to verify basics.
DuckDB standalone creating a DuckLake table backed by a PostgreSQL catalog works fine:
import duckdb
conn = duckdb.connect()
conn.execute("INSTALL ducklake; LOAD ducklake;")
conn.execute("""
ATTACH 'ducklake:postgres:host=localhost dbname=catalog' AS lake
""")
conn.execute("CREATE TABLE lake.orders (id INT, amount FLOAT, ts TIMESTAMP)")
conn.execute("INSERT INTO lake.orders VALUES (1, 99.9, NOW())")The problem appeared the moment we tried the same ATTACH from within pg_duckdb (DuckDB embedded inside PostgreSQL):
-- Executed inside pg_duckdb — fails
SELECT duckdb.raw_query($$ ATTACH 'ducklake:postgres:...' AS lake $$);ERROR: symbol lookup error: /usr/lib/postgresql/17/lib/pg_duckdb.so:
undefined symbol: PQconnectdblibpq symbol collision. Understanding why this happens requires knowing how pg_duckdb works.
pg_duckdb embeds DuckDB as a shared library inside the PostgreSQL process. When postgres starts, pg_duckdb.so loads, and DuckDB loads with it. Everything runs inside a single process.
DuckLake's PostgreSQL catalog connector is implemented by calling libpq (PostgreSQL's C client library) from inside DuckDB — essentially reconnecting to PostgreSQL from within the PostgreSQL process itself. The problem: PostgreSQL already loaded libpq into the same process address space when it started. Two versions of libpq symbols exist in the same process, and the dynamic linker cannot resolve which one to use.
postgres process memory:
[PostgreSQL runtime] ← libpq already loaded
[pg_duckdb.so]
[DuckDB engine]
[DuckLake catalog connector] ← calls PQconnectdb()
→ symbol lookup: PQconnectdb
→ collision: which libpq?This is structural. No runtime configuration, no environment variables, no LD_PRELOAD workaround can fix it. The DuckDB and DuckLake GitHub issue trackers had open bugs with no fix scheduled.
The actual scope of DuckLake's multi-engine support:
| Environment | PostgreSQL catalog | Trino | Spark |
|---|---|---|---|
| DuckDB standalone | PASS | FAIL | Limited |
| pg_duckdb | FAIL (libpq collision) | — | — |
DuckLake is a DuckDB-only table format. It is an excellent choice if DuckDB standalone is your only query engine. For multi-engine Lakehouse, it is structurally unsuited.
Phase 2 showed PySpark could query PostgreSQL directly via JDBC for distributed execution. PGSpider, which markets itself as distributed Parquet querying, turned out to be a UNION ALL wrapper over multiple nodes — not genuine distributed processing.
Phase 3: TPC-H SF=100 on GCS — Storage Performance Reality
After confirming DuckLake's limits, we switched the table format to Apache Iceberg. Iceberg is the open table format originated by Netflix, now an industry standard. It layers snapshot metadata on top of Parquet files, enabling time travel, schema evolution, and multi-engine read/write.
We moved storage to GCS and ran all 22 TPC-H SF=100 queries against it.
| Metric | Result |
|---|---|
| Queries executed | 22 of 22 |
| Success rate | 100% |
| Data size | ~100GB / ~866M rows |
| GCS cold query latency | ~650× slower than local NVMe |
| GCS + session cache (re-execution) | ~2.6× slower than NVMe |
All 22 queries succeeded. The 650× cold latency number is the operational reality check.
Reading Parquet from GCS means HTTP over TLS requests to object storage. Fetching 100GB means network transfer: GCS throughput in the hundreds of MB/s range, versus NVMe in the GB/s range, plus network round-trip latency that NVMe simply does not have.
Session cache reduces repeat-query latency to 2.6× slower than NVMe. But this only applies when the same query runs again on cached data. Different filters, different tables, or expired cache brings back the full cold penalty.
Building production query SLAs on GCS without a caching strategy is a risk. "Fast ad-hoc analytics on cloud storage" means "fast ad-hoc on warm cache." First-run latency is 650× the NVMe baseline. TPC-H correctness was fully validated at 100GB scale; performance requires a caching plan.
Phase 4: Iceberg + Nessie Multi-Engine Validation
Iceberg resolved the format question, but a new one appeared: who manages the catalog? Engines need to know which snapshot is current — which Parquet files to read for a given table. A catalog service answers this.
Nessie is the most widely used open-source Iceberg catalog, contributed to by Netflix, Dremio, and others. It provides a REST API for Iceberg catalog operations and supports Git-style branching and commits on table state.
Three engines querying the same Iceberg tables via Nessie, running TPC-H Q1 (SF=1, 6M rows):
| Engine | TPC-H Q1 runtime | Notes |
|---|---|---|
| pg_duckdb | 0.37s | Direct iceberg_scan() function |
| Trino | 2.36s | Nessie REST Catalog |
| Spark | 10.8s | JDBC Catalog + S3FileIO |
pg_duckdb is fastest because iceberg_scan() reads Iceberg Parquet directly through DuckDB's columnar engine — no catalog roundtrip overhead for the data itself. Trino has distribution overhead that shows at small scale. Spark includes JVM startup cost.
All three engines returned identical results from the same Iceberg snapshot. Multi-engine interoperability is confirmed.
But Nessie is a separate service. It runs in its own container, needs its own operational overhead, and is the catalog — not PostgreSQL. The original goal was "PostgreSQL as catalog." Nessie as catalog with PostgreSQL as application DB is a different architecture.
Phase 4.5: Discovering pg_lake
Snowflake Labs released pg_lake as open source in November 2025. The project implements Apache Iceberg natively inside PostgreSQL — no separate catalog service. PostgreSQL itself handles Iceberg DDL and DML.
Architecture
Client (psql / JDBC / application)
↓ Port 5439
PostgreSQL 18.1 + pg_lake 3.3
├── pg_lake_iceberg (Iceberg v2 implementation)
├── pg_lake_table (data lake file FDW)
└── pg_lake_engine (query routing)
↓ Unix Socket
pgduck_server (DuckDB columnar execution engine)
↓
MinIO S3 (Iceberg Parquet files)The key component is pgduck_server. When a query arrives, pg_lake routes aggregation and scan operations through a Unix socket to a locally running DuckDB engine. DuckDB reads the Parquet files columnar-style and returns results. The client sees standard PostgreSQL SQL; DuckDB handles the execution.
What It Looks Like to Use
-- Standard CREATE TABLE with USING iceberg
CREATE TABLE orders (
order_id INTEGER,
customer_id INTEGER,
amount DECIMAL(10, 2),
order_date TIMESTAMP DEFAULT NOW(),
metadata JSONB
) USING iceberg;
-- Standard SQL INSERT
INSERT INTO orders VALUES (1, 100, 299.99, NOW(), '{"channel": "web"}');
-- UPDATE and DELETE work
UPDATE orders SET amount = 349.99 WHERE order_id = 1;
DELETE FROM orders WHERE customer_id = 101;
-- Full ACID transactions
BEGIN;
INSERT INTO orders VALUES (3, 102, 599.00, NOW(), NULL);
ROLLBACK; -- clean rollback, snapshot not committed
-- Schema evolution
ALTER TABLE orders ADD COLUMN region TEXT;
-- JSONB queries work natively
SELECT * FROM orders WHERE metadata->>'channel' = 'web';Behind the scenes, all of this produces standard Iceberg v2 Parquet files in MinIO, with Iceberg snapshot JSON metadata managed by PostgreSQL. Any engine that can read Iceberg can read these files.
Performance
TPC-H Q1 on 60K rows: 10–12ms. The DuckDB columnar engine underneath makes short work of analytical aggregations even at this scale.
Four-Way Interoperability — All Verified
| Direction | Method | Result |
|---|---|---|
| Spark → pg_lake read | FDW + metadata.json path | PASS |
| pg_lake → Spark read | JDBC Catalog + S3FileIO | PASS |
| pg_lake → DuckDB read | iceberg_scan(metadata_path) |
PASS |
| pg_lake → Trino read | REST Catalog Bridge | PASS |
pg_lake produces standard Iceberg v2 format, so Spark, DuckDB, and Trino can all read it directly. A Spark job can write an Iceberg table that pg_lake then serves as a native PostgreSQL table. This is the multi-engine interoperability the original goal required.
DuckLake vs Iceberg (Nessie) vs pg_lake
| Criterion | DuckLake | Iceberg + Nessie | pg_lake |
|---|---|---|---|
| PostgreSQL catalog | FAIL (libpq collision) | Nessie separate service | Native |
| pg_duckdb support | FAIL | PASS | PASS |
| Trino support | FAIL | PASS | PASS |
| Spark support | Limited | PASS | PASS |
| DuckDB standalone | PASS | PASS | PASS |
| Full ACID | Limited | Limited | Full |
| Schema evolution | PASS | PASS | PASS |
| External services | 0–1 | 2 (Nessie + storage) | 1 (storage only) |
| Ecosystem standard | DuckDB-only | Industry standard | Iceberg standard |
pg_lake's honest limitations: pre-GA as of v3.3, requires PostgreSQL 18.1, and needs the pgduck_server daemon running alongside. Operational complexity is higher than DuckDB standalone.
But among open-source options available in late 2025, pg_lake was the only project satisfying all four conditions simultaneously: PostgreSQL as catalog, full ACID, Iceberg industry standard, and verified multi-engine interoperability.
Part 2 covers the distributed processing comparison (Ray+DuckDB vs Daft vs Smallpond), the Citus integration experiment (what failed and what succeeded), and the final architecture.