Local-LLM-Ökosystem News: Ollama 0.30, llama.cpp Sicherheitsflaws, Gemma 4 & Nemotron 3 Ultra

🚀 Releases

Ollama 0.30.x (5. Juni 2026) – Mit verbesserter GGUF-Kompatibilität durch llama.cpp. 0.30.6 führt Gemma 4 QAT-gewichte ein (4 Größen mit Quantization-Aware Training), verbessertes MLX-Embedding auf Apple Silicon mit NVFP4 global scale. 0.30.5 behebt Gemma 4 12B Floating-Point-Exception auf x86/CUDA/Linux/Windows. 0.30.4 bringt Nemotron 3 Ultra, Multi-Modal-GPU-Offload auf Apple Silicon, und das ollama launch-Ökosystem ist erweitert (Codex, Hermes, Pi). 0.30.7-rc1 (6. Juni) mit Windows Hermes-Config und Zod-Beispielen. Alle Versionen verfügbar unter github.com/ollama/ollama/releases.

llama.cpp b9553 und später (6.–7. Juni 2026) – Aktive Entwicklung mit Sampler-Name-Matching (alt-Namen jetzt case-insensitive, z. B. top-k / min-p), KV-Cache-Optimierungen für Speculative Decoding, Gemma 4 MTP-Support, und Qwen-VL-Video-Unterstützung. Siehe github.com/ggml-org/llama.cpp/releases.

LM Studio 0.4.16 (4. Juni 2026) – Einführung von Locally, der neuen iOS/iPad-App für LM Studio mit LM Link für Mobile. MLX-Engine v1.8.5 verbessert KV-Cache-Checkpointing für agentic Workflows. Multi-GPU-Auswahl-Bugfixes für CUDA 12, ROCm und Vulkan. lmstudio.ai/changelog.

Jan.ai Changelog – Letzte Version v0.8.2 (31. Mai 2026) mit schnellerem Startup und AMD ROCm/HIP auf Linux. jan.ai/changelog.

🆕 Open-Weight-Modelle

Nemotron 3 Ultra (NVIDIA, 4. Juni 2026) – 550B MoE mit 1 Million Tokens Kontext, State-of-the-Art Coding und Reasoning. Gebaut für High-Throughput-Agenten und lange Workflows. Verfügbar in GGUF (2-bit/3-bit/8-bit) über Hugging Face und Ollama. 2-bit quantisiertes 200 GB auf 256 GB RAM machbar. research.nvidia.com/labs/nemotron/Nemotron-3-Ultra/.

Gemma 4 (Google DeepMind, April 2026, jetzt quantisiert Juni 2026) – Familie mit Größen E2B/E4B/E12B/E27B/E31B. Multimodal mit Vision und Tool-Calling. QAT-Optimierte Versionen (E2B-QAT, E4B-QAT, 12B-QAT, 26B-A4B-QAT, 31B-QAT) reduzieren VRAM-Anforderungen massiv. Apache 2.0 lizenziert. Verfügbar auf Hugging Face, Ollama, und allen GGUF-kompatiblen Tools. gemma4.com.

MiniMax M3 (1. Juni 2026) – Erste open-weight Frontier-Modell mit 59 % SWE-Bench Pro, 1 Million Tokens Kontext und nativer Multimodality. MIT-lizenziert, Gewichte und Technical Report erwartet Anfang Juni. Beschrieben als \“first downloadable frontier model\“ mit einem 340 GB Dynamic GGUF-Quantisierungsschema für CPU/GPU/SSD-Setups. aimadetools.com.

Kimi K2.6 (Moonshot AI) – 1 Billionen Parameter mit 32B aktiv (Modified MIT). SWE-Bench Pro 58.6 (auf GPT-5.5 Niveau). Verfügbar in GGUF-Quantisierungen, u. a. mit Dynamic GGUF für heterogene Hardware. Bereits auf Ollama abrufbar (ollama run kimi-k2.6). kimi.com.

Ideogram 4.0 (4. Juni 2026) – Erste open-weight Text-to-Image Foundation-Modell mit Fokus auf Design-Workflows, Typografie und hochauflösende Generierung. kombitz.com.

🔴 Sicherheit

CVE-2026-7482 „Bleeding Llama“ (Ollama, kritisch) – Heap Out-of-Bounds Read in Ollamас GGUF-Loader. Betroffen: Ollama < 0.17.1 (alle Plattformen). Ein Angreifer sendet eine präparierte GGUF-Datei mit inflatierten Tensor-Dimensionen an die unauthentifizierten /api/create– und /api/push-Endpoints (Standard: 0.0.0.0:11434). Das System liest über die Puffergrenzen hinaus und exfiltriert aus dem Prozess-Heap: Benutzer-Chats, System-Prompts, API-Schlüssel, Umgebungsvariablen. Der Angriff benötigt nur drei API-Aufrufe. Patch: Ollama 0.17.1 (25. Februar 2026). Betroffen sind ~300.000 Internet-erreichbare Ollama-Server. Sofortmaßnahmen: Auf ≥ 0.17.1 upgraden, Port 11434 auf 127.0.0.1 binden, authenticated Reverse-Proxy davor, API-Schlüssel rotieren. pasqualepillitteri.it; cyera.com.

llama.cpp GGUF-Parser Schwachstellen – V-01 bis V-06 (kritisch, keine CVE vergeben, offengelegt 15. Mai 2026) – Betroffen: llama.cpp alle Versionen mit GGUF-Support, LM Studio, Ollama (llama.cpp Backend), und jedes Tool das llama.cpp vendored.

V-01 (Integer Overflow) – general.alignment Feld im GGUF-Header mit Wert ≥ 2^16 verursacht Overflow in GGML_PAD-Makro auf 32-Bit-Systemen. Führt zu beliebigem File-Seek und Out-of-Bounds-Lese. Liest angrenzende Heapdaten.
V-02 (Memory Exhaustion) – GGUF_MAX_STRING_LENGTH und GGUF_MAX_ARRAY_ELEMENTS auf 1 GB. Präparierte Datei mit 1 GB String in 1 GB Array crasht 32-Bit-Systeme mit std::bad_alloc.
V-03 (Python-Spezifisch) – gguf_reader.py prüft nicht auf max. Tensor-Dimensionen; C++ tut es (max 4). n_dims = 0xFFFFFFFF triggert ~32 GB Memory-Mapping-Versuch.
V-04 bis V-06 – Signed-to-unsigned Konvertierungsfehler, unboundeter gguf_type Enum-Cast, Division-by-Zero wenn ggml_blck_size() == 0.

Lieferkette-Angriffsvektor: Böswilliges GGUF von Hugging Face oder öffentlichem Repo heruntergeladen → sofort beim Laden geparsed, bevor Ausführung oder Benutzerinteraktion. Kein CVE-Nr. vergeben = Standard-Scanner erkennen nicht automatisch. Interim-Mitigation: GGUF nur aus verifizierter Quelle (Hash/Signatur, TheBloke, Meta, Mistral, Microsoft, Google). Vermeide ungeprüfte Hugging Face Uploads. techtimes.com.

Zusammenfassung: Zwei unterschiedliche Flaw-Familien: Ollama CVE-2026-7482 (Quantisierungs-Pipeline in Go), llama.cpp V-01/V-06 (C++/Python GGUF-Parser). Alle GGUF-Consumer (Ollama, LM Studio, llama.cpp, SGLang) erben die Risiken. Laufender Patch-Prozess, aber Koordination zwischen Projekten ist kritisch.

🔀 Ökosystem

Open WebUI 0.9.6 (1. Juni 2026) – Letzte Release auf PyPI. Weiterhin Community-Standard für web-basierte LLM-Chat-UIs, integriert mit Ollama, llama.cpp und anderen. pypi.org/project/open-webui.

llama-cpp-python (7. Juni 2026) – Aktuelle Release auf PyPI mit Python-Bindings für llama.cpp. Ändert sich mit jedem llama.cpp Release.

RamaLama (Red Hat Containers) – Open-Source-Tool zum Containerize und lokalen Servieren von KI-Modellen via Container-ähnliches CLI. Komplementär zu Ollama und llama.cpp. github.com/containers/ramalama; ramalama.com.

llamafile (Mozilla AI) – Portabler LLM-Runner mit GPU-Support (Metal, CUDA, Vulkan). Stabilisiert bei v0.10.x mit neuem Build-System für llama.cpp-Alignment.

🧠 Performance & Engineering

KV-Cache Optimierungen (llama.cpp b9550–b9551) – Vermeidung unnötiger KV-Tensor-Kopien bei Speculative Decoding. Zielgruppe: lange Kontexte, agentic Workflows mit wiederholten Anfragen.

MLX-Engine v1.8.5 (LM Studio, 5. Juni 2026) – KV-Cache-Checkpointing für wiederholte, lange-Kontext-agentic Workflows auf Apple Silicon. Drastische Performance-Verbesserung für Agent-Schleifen. lmstudio.ai/blog.

Gemma 4 Quantization-Aware Training (QAT) – Reduziert VRAM um 30–50 % je nach Größe, ohne merkliche Genauigkeitsverluste. Proof-of-Concept für standardisierte quantization post-training im Ökosystem.

Qwen3.5 Video-Support (llama.cpp b9543) – Frame-Merge für Qwen-VL-basierte Modelle, Vorbereitung für Video-verarbeitende Multimodal-Modelle auf lokalen Runtimes.

🆚 Ollama vs llama.cpp

Sicherheitspatches: Ollama hatte eine Go-spezifische Lücke (CVE-2026-7482) in der Quantisierungs-Pipeline, fix in 0.17.1. llama.cpp hat langfristige C++ GGUF-Parser-Probleme (V-01/V-06), noch nicht alle gepatch. Ollama ist schneller bei Hotfixes (Feb 2026), aber beide Projekte haben Probleme mit Zero-Day-Offenlegungen und CVE-Verzögerungen geerbt.

Modell-Support: Beide haben parallel Gemma 4, Nemotron 3 Ultra, MiniMax M3, Kimi K2.6 hinzugefügt. Ollama über sein Modell-Library-System, llama.cpp durch reine GGUF-Unterstützung. Funktional äquivalent am Ende, aber Ollama-Abstraktion ist einsteigerfreundlicher (ollama run nemotron-3-ultra vs. manuelle -m Optionen).

Performance: LM Studio nutzt Ollama/llama.cpp-Backend, bietet MLX-Spezial-Pfad für Apple Silicon (v1.8.5 KV-Checkpointing). llama.cpp bleibt der technische Kern, Ollama bewirtschaftet es und exponiert OpenAI-API. Jan.ai ebenfalls auf llama.cpp aufgebaut.

Quelle: Newsfeed generiert aus Ollama Blog, GitHub Releases (ollama/ollama, ggml-org/llama.cpp), LM Studio Blog, Jan Changelog, Hugging Face, Cyera Research, TechTimes, und Hacker News Diskussionen vom 1.–7. Juni 2026.