11 min read
AI assisted

In-DB AI, Five Systems Compared — Why Most Are Misguided

Oracle 23ai, EDB AIDB, PostgresML, Timescale pgai, pg_aidb — five systems and the abstraction-boundary discussion

Oracle 23ai, PostgresML, and EDB AIDB are converging on the same answer: expose every AI operation as a SQL function. This post covers why most of these implementations get the abstraction boundary wrong, what the five major systems actually do, and what a better boundary looks like.


Why people want AI inside the database

The data gravity argument is straightforward. When documents, user records, and event history are already in PostgreSQL, pulling them out to feed an embedding pipeline adds latency, complicates consistency guarantees, and multiplies infrastructure surface area. A call like:

SELECT ai.retrieve_and_generate('What is our refund policy?', 'docs');

has real appeal. One line, one transaction, no external service to deploy. For a developer prototyping on a laptop, this is the right ergonomic default.

There is also a genuine technical argument for in-DB ML: zero-copy data access. If a SELECT result is already in the database's process memory, it can be handed directly to an ML operation running in the same process — no serialization, no network transfer. That appeal is real, though it has limits in modern AI workloads.

Agentic AI is compute-bound, not transfer-bound: an LLM call takes 1–30 seconds, while moving the input payload takes tens of milliseconds. RAG and vector search return narrow result sets — top-k retrieval ends at the index, so saving a few milliseconds of data movement on a 10-row return is noise. Traditional in-DB ML has always had limited model coverage, harder debugging, and more opaque production tuning than external serving infrastructure. And transactional safety and ML pipelines rarely need to be coupled; most pipelines are fine with batch or near-real-time, making real-time guarantees unused overhead rather than a feature.

The more promising direction is open formats: Iceberg, Parquet, Lance, and Arrow let external compute engines — Spark, Ray, DuckDB, ClickHouse, Polars, vLLM — access data at near-zero-copy cost without fusing inference into the storage layer. The framing is not "move the ML into the database" but rather: standardize the format so compute can come to the data without needing to own it.


The hard constraint: process-per-connection meets multi-second AI I/O

PostgreSQL uses a process-per-connection model. Every client connection spawns an OS process. When that process is waiting on I/O — including an HTTP call to an LLM — it is completely stalled. No event loop, no worker thread pool, no coroutine yield point. The process holds its connection and does nothing else until the response arrives.

AI workloads are slow by any database standard:

Operation Typical latency
Query embedding (OpenAI API) 200ms – 1s
Local ONNX embedding 100ms – several seconds
Reranking 100ms – several seconds
LLM text generation 1s – 30s+

With 100 concurrent requests hitting ai.generate(), that is 100 backend processes, all stalled, waiting on LLM responses that take 1–30 seconds each. New connection attempts pile up. If max_connections = 100, the service is down.

PgBouncer does not fix this. In transaction mode, PgBouncer reclaims a connection only after the transaction ends. An LLM call inside a SQL function is a transaction in progress. The backend holds the connection for the full duration of the LLM response — PgBouncer cannot touch it.

Session mode with a generous pool size buys headroom. With 100 connections and 500ms embedding latency, roughly 200 embedding requests per second are sustainable before the pool saturates. That is useful for low-to-mid concurrency internal tooling. It is not a design for production user-facing AI traffic.


How the five systems respond

Oracle 23ai / 26ai

Oracle's approach is the most architecturally distinctive. Rather than accept HTTP latency for embeddings, it runs an ONNX runtime in-process inside the database. The HNSW graph lives in the shared global area (SGA). Query-time vector embedding happens without any network hop:

SELECT VECTOR_EMBEDDING(my_onnx_model USING text_column AS data)
FROM documents;

No HTTP call means no external round-trip latency for that step.

In-process ONNX solves embedding latency for ONNX-compatible models; it does not apply to frontier LLM APIs. GPT-4o, Claude, Gemini, and every other frontier LLM remain external HTTP calls, meaning LLM generation — the 1–30 second step — still blocks the backend process for the full response duration. DBMS_VECTOR_CHAIN hits an external endpoint and holds the process until the response arrives. High-concurrency LLM generation therefore hits the same connection-pool constraint that affects any synchronous in-DB AI approach.

EnterpriseDB AIDB

EDB AIDB's headline interface is clean:

SELECT aidb.retrieve('What is the refund policy?', 'docs_pipeline');
SELECT aidb.retrieve_and_generate('Summarize the architecture', 'docs_pipeline');

Ingestion is handled by a background worker — aidb.set_auto_preparer() kicks off a BGW that watches for new data and processes it asynchronously. That is the right call for ingestion: batch-oriented, no concurrency pressure, no user waiting on the other end.

The query path is different. At retrieval and generation time, the calls are synchronous HTTP. The backend process blocks on the embedding call, then blocks on the LLM call. The BGW design buys nothing for query-time latency.

EDB is direct about the target market: internal tools, data analyst workflows, batch pipelines. That is an honest characterization of what synchronous in-DB AI can safely do. For anything with concurrent user-facing traffic, the connection model is the ceiling.

PostgresML (pgml)

PostgresML took the most aggressive position: put model weights inside the database itself.

SELECT pgml.generate('distilgpt2', 'What is PostgreSQL?');
SELECT pgml.embed('intfloat/e5-small', 'search query');

The idea was to eliminate the external API call entirely. No HTTP, no external service, no network latency. Models load into the extension's memory space and inference runs in-process.

Active development has effectively stalled since 2024. The resource isolation problem is structural: loading multi-gigabyte model weights into shared library memory inside a database process creates contention that is incompatible with a database's memory management model. GPU memory allocation, batching strategy, and inference throughput optimization are problems that dedicated model serving infrastructure — vLLM, TensorRT-LLM, Triton — solves through years of specialized engineering. Model lifecycle management (loading, eviction, versioning) and database connection lifecycle management operate on different axes and do not compose cleanly inside a single process. Inference infrastructure and storage infrastructure have incompatible scaling profiles: inference scales horizontally by query type and load; storage scales by data volume and access patterns. Fusing them forces both to scale together.

Timescale pgai + vectorizer-worker

Timescale's design separates what the database owns from what an external worker owns. The database holds embedding storage (pgvector), work queues, configuration, and pipeline metadata. The vectorizer-worker is a separate process that polls a PostgreSQL-managed queue, calls the embedding API, and writes results back to the database. The database never blocks on an AI HTTP call during the ingestion path.

Query-time retrieval — ai.openai_chat_complete and similar — does still block synchronously, and Timescale is explicit that interactive retrieval carries that cost. The split is between ingestion (the highest-volume path, handled asynchronously via the external worker) and interactive queries (synchronous, acknowledged as such).

The result is that the database acts as queue manager, config store, and result store; AI compute lives outside; the extension does not attempt to serve as an inference host. The production stack assembles pgvectorscale (ANN indexing), pg_textsearch (BM25), pgai (embedding pipeline interface), and vectorizer-worker (external compute) as separate components, each with one job.

pg_aidb

pg_aidb separates the SQL interface from the blocking limit. The dual-mode API makes that boundary explicit rather than hiding it:

-- Synchronous: development, analytics, low-concurrency. Blocks explicitly.
SELECT ai.search(query => 'question', pipeline => 'docs', top_k => 10);
SELECT ai.generate(query => 'question', pipeline => 'docs');

-- Async: production. Returns immediately, result in ai.results table.
SELECT ai.search_async('question', 'docs');   -- returns request_id
SELECT ai.generate_async('question', 'docs'); -- returns request_id

The async path uses LISTEN/NOTIFY with an outbox table. The backend process returns a request ID immediately; an external pipeline-worker service handles the actual embedding and LLM calls; results land in ai.results and a NOTIFY signals completion. Clients that cannot use persistent connections — PgBouncer transaction mode, for instance — poll the results table directly. NOTIFY is a hint, not a guarantee.

The positioning is as an agent memory infrastructure layer rather than a RAG platform. Semantic memory (RAG over documents) is implemented today. Episodic memory, working memory, and retrieval policy are on the roadmap. Follow-up posts cover the dual-mode API and the memory-layer implementation.


Side-by-side

System Business-logic intrusion Ingestion/search coherence Query-time blocking Abstraction boundary clarity
Oracle 23ai/26ai Medium Medium High (LLM path) Medium
EDB AIDB Medium Medium High (query path) Medium
PostgresML High Low High Poor
Timescale pgai Low High Medium (interactive) Strong
pg_aidb Low High Low (async default) Strong

"Business-logic intrusion" here means how much the extension encourages moving application logic — prompt construction, retrieval strategy, generation orchestration — into SQL. Lower is better, because application logic that lives in SQL is harder to test, version, and reason about independently from data access patterns.

Designs that reduce blocking impact do so either through external workers (Timescale's vectorizer-worker, pg_aidb's pipeline-worker) or through an explicit dual-mode API that separates synchronous and asynchronous paths.


The structural reason this approach repeats

The common failure mode across the weaker designs is treating the database extension as an application framework. When a SQL function handles chunking strategy, prompt construction, retrieval reranking, and LLM generation in a single call, the application layer has moved into the database. That is not simplification — it is relocation.

PostgresML represents the extreme case: the extension attempted to own model weights, inference compute, and the SQL interface in one package. Each of those is its own infrastructure domain with its own scaling requirements, operational concerns, and failure modes.

EDB and Oracle are a subtler version of the same pattern: the SQL interface is clean and the documentation is good, but the extensions are written as if the blocking problem does not exist for query-time calls. Production teams typically discover this during a traffic spike.

The pattern repeats partly because of deployment cycle coupling — shipping everything as a single extension simplifies the initial sales and onboarding story, even if it complicates operations later. A SQL function that makes a synchronous LLM call is an HTTP client with a SQL interface. The framing makes it look safer than it is.


The thin-core, vertical-bundle pattern

This pattern is not specific to the AI-in-DB space.

Timescale does not just ship pgai. The production stack is pgvectorscale (ANN indexing) + pg_textsearch (BM25) + pgai (embedding pipeline interface) + vectorizer-worker (external compute). Each layer has one job. The database core stays thin; specialized capabilities are bundled vertically around it.

The same pattern appears elsewhere. MongoDB + Atlas keeps the core storage and query engine separate from Atlas Search, Atlas Vector Search, and Atlas Data Federation — each capability is opt-in, and the database does not attempt to serve as an inference host. Supabase bundles Postgres with pgvector, Auth, Storage, Realtime, and Edge Functions, but each component has clear ownership: Edge Functions handle application logic, the database does not run user code. ClickHouse has announced a bundled stack (ClickStack) with separate service components for ingest, query, and integration — the query engine stays focused; adjacent capabilities are layered around it.

The shared pattern: keep the database core thin, bundle adjacent capabilities vertically, and maintain a clear boundary between data infrastructure and application logic. Different parts of the stack have genuinely different operational profiles — inference involves GPU allocation, batching, and model lifecycle concerns; BM25 indexing and vector ANN have their own scaling characteristics — and treating them as separate components reflects that.


Where pg_aidb sits

pg_aidb is positioned as an agent memory layer — semantic memory today, episodic and working memory in later episodes. Follow-up posts cover the dual-mode API design, the ingestion pipeline, the memory schema, and the operational profile at different concurrency levels.


Closing

The five systems differ in where they place the boundary between SQL interface convenience and AI I/O blocking. Oracle and EDB keep the SQL surface and absorb the blocking. Timescale and pg_aidb push the heavy work to external workers. PostgresML attempted in-process inference and the design has stalled. The placement of the abstraction boundary determines each system's operational profile.