#LLM | Latent Space

July 2, 2026 · AI

Wiring codex to a Local Qwen — ollama, vLLM, and the developer role

Point the OpenAI codex CLI at a local Qwen or Gemma model in five minutes

#codex #LLM #vLLM #ollama

June 25, 2026 · AI

Speeding up LLM inference with MTP and diffusion

MTP and diffusion inference on Gemma 4 and Qwen 3.6, fp8 on one H100

#LLM #vLLM #MTP #Speculative Decoding

May 26, 2026 · AI

Harness as Environment

How harness design determines whether agents actually adapt.

#AI #Agent #Harness #Memory

May 22, 2026 · AI

Weights, Prompts, Codes as Parameters

Weights, prompts, and code as parameters at different layers of a learnable policy space

#AI #Agent #LLM #POMDP

May 20, 2026 · AI

Apple Silicon LLM Inference — Five Backends Compared

Benchmarking Qwen3.5-9B on Apple Silicon across MLX, llama.cpp, Ollama, omlx, and vLLM Metal — single-request throughput, prefill scaling, decode vs input length, and concurrency response

#LLM #Apple Silicon #MLX #llama.cpp

May 20, 2026 · AI

Long-Context Evaluation — NIAH and Lost in the Middle

NIAH limits, the Lost in the Middle effect, alternative benchmarks, and measured recall across four reasoning-effort modes

#LLM #Long-context #NIAH #Benchmark

May 20, 2026 · AI

NVIDIA NIM API — Free Inference for GLM, Kimi, Nemotron, and Gemma 4

NVIDIA's build.nvidia.com offers 100+ models on H100 infrastructure for free. Plug it directly into Claude Code, Cursor, or any OpenAI-compatible coding agent.

#NVIDIA #NIM #AI #API