Local-LLM Ecosystem Roundup: June 2–9, 2026

🚀 Releases

Ollama 0.30 Series

Ollama 0.30.0–0.30.7 (June 5–8, 2026, GitHub): A large wave of releases with improved llama.cpp integration and GGUF compatibility. Ollama's MLX engine on Apple Silicon is now complemented by GGUF support, which brings more models to a broader range of hardware.

New highlights in the 0.30 series:

Hermes Desktop (v0.30.7): Ollama Launch now includes ollama launch hermes-desktop, which opens the native desktop interface of the Hermes Agent, with support for Windows-specific configuration paths.
Gemma 4 QAT (v0.30.6): Quantization-aware-training (QAT) weights for Gemma 4 substantially reduce memory use; tags such as gemma4:e2b-it-qat and gemma4:e4b-it-qat are available.
NVIDIA Nemotron 3 Ultra (v0.30.4, June 4): 550B-parameter open model (55B active) with a 1M-token context for agent workflows; optimized for long tool-call chains (Blog).
Gemma 4 12B (v0.30.3): Multimodal "high-performance intelligence" for laptops; added with llama.cpp backend support.
MLX quantization improvements (v0.30.6): Embedding layers now use the NVFP4 global scale on Apple Silicon for improved quantization.
Cline/Qwen Code CLI (v0.30.2): ollama launch supports Qwen Code, with instructions for installing the Cline CLI.
Laguna architecture support (v0.30.2): The llama.cpp backend is compatible with Poolside's Laguna architecture.
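
The tags listed above can also be used programmatically. The following is a minimal sketch, assuming a local Ollama server on its default port 11434 and the /api/generate endpoint of Ollama's REST API. The helper names build_generate_request and generate are hypothetical, and the model tag must already be pulled locally.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text.

    Requires a running Ollama server with the model available.
    """
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, generate("gemma4:e2b-it-qat", "Say hello") would return the model's reply as a string. Building the request separately from sending it keeps the payload easy to inspect without network access.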

llama.cpp

llama.cpp b9568–b9570 (June 8, 2026, GitHub Releases):

b9568: Gemma 4 E2B/E4B multimodal assistant support (MTP, multi-token prediction) with masked_embd tensor conversion.
b9567: HTTP server fix: HTTP headers are no longer parsed when flushing.
b9566: Graph fix for SWA-only draft heads (StepFun MTP): kq_mask guard on its own buffer.
b9565: WebGPU: buffer-aliasing handling in the concat operator.
b9564: WebGPU: 2D workgroups for scale, binary, and unary ops.
b9562: MTMD video input support: CLI with a --video argument, base64 video input on the server.
b9559: CLI spinner fix during prompt processing.
b9558: Vulkan: cm2 decode vector for mul_mat_id B-matrix loads; speedup from vec4 loads and BK=64.

llama-cpp-python

0.3.x release (June 7, 2026, PyPI): Python bindings updated with the latest llama.cpp features.

Hermes Agent

v0.16.0 "The Surface Release" (June 5, 2026, GitHub): A very large release with a native desktop app for macOS, Linux, and Windows (Electron, 100 PRs in one week). Highlights: remote connection via OAuth or username/password, a full web admin panel (MCP Catalog, Messaging, Credentials), a fuzzy model picker everywhere, a /undo command, a Simplified Chinese translation, and NVIDIA/skills as a trusted skill tap. 874 commits since v0.15.2, 2 P0 and 62 P1 issues closed, plus security fixes (CVE-2026-Pin).

🆕 Open-Weight Models

New models available in the Ollama Library and on GGUF mirrors:

Kimi K2.6 (Moonshot AI, Modified MIT): 1T-parameter MoE, 32B active; SWE-Bench Pro 58.6% (parity with GPT-5.5). Focused on code, with stable long-form code writing. Available via: ollama run kimi-k2.6
Qwen 3.6 27B (Alibaba): Best performance on consumer hardware, 77.2% on SWE-Bench, fits in 24 GB at Q4. Available via: ollama run qwen3.6:27b
GLM-5.1 (Z.ai, MIT/Apache): 744B-parameter MoE with 40B active, SWE-Bench Pro 58.4%, MIT-licensed for local use.
gpt-oss:20b (OpenAI, Apache 2.0): Frontier open-weight model (released August 2025), roughly equivalent to o3-mini, with adjustable reasoning effort. Available via: ollama run gpt-oss:20b
Gemma 4 family (Google, April 2, 2026): E2B, E4B, E12B, E27B, now with QAT variants (E2B-QAT, E4B-QAT, 12B-QAT, 26B-A4B-QAT, 31B-QAT). Native vision and tool calling.
Ideogram 4.0 (June 4, 2026, GitHub): Ideogram's first open-weight text-to-image model (9.3B parameters). Design-focused: typography, layout, high resolution. Apache 2.0 licensed. Not on Ollama, but GGUF variants are expected.
NVIDIA Nemotron 3 Ultra: 550B/55B MoE, on Ollama Cloud with Cloudflare optimization for inference.
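
The "fits in 24 GB at Q4" claim for Qwen 3.6 27B can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch only: the 4.5 bits per weight (4-bit values plus per-block scales) and the 2 GB allowance for KV cache and runtime buffers are assumptions, not figures from the release notes, and the function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def fits_in_memory(params_billion: float, bits_per_weight: float,
                   memory_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in memory_gb.

    overhead_gb is a placeholder for KV cache and runtime buffers;
    real usage grows with context length.
    """
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


# 27B parameters at an assumed ~4.5 bits per weight
qwen_q4 = weight_memory_gb(27, 4.5)
# The same model at 16 bits per weight, for comparison
qwen_fp16 = weight_memory_gb(27, 16)
```

Under these assumptions the quantized weights come to about 15.2 GB, which leaves headroom on a 24 GB card, while the 16-bit weights (54 GB) would not fit. For MoE models such as Kimi K2.6, all parameters must be resident, so the total count (not the active count) drives the estimate.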

GGUF Format Status

GGUF passes the 176K mark (as of May 2026, LinkedIn). The Hugging Face Hub is standardizing GGUF distribution.

🔴 Security

CVE-2026-42271: LiteLLM Arbitrary Command Execution

Affected versions: BerriAI LiteLLM 1.74.2 up to, but not including, 1.83.7
Platforms: All (CLI gateway, MCP server integration)
CVSS: 8.8 (High) – CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H
Description: Two MCP server test endpoints (POST /mcp-rest/test/connection and POST /mcp-rest/test/tools/list) accept an unvalidated stdio configuration. Any authenticated user, including holders of low-privilege keys, can execute commands with the privileges of the proxy process. Only the API key is checked; there are no role checks.
Status: Fixed in 1.83.7. Listed in the CISA KEV catalog since June 8; federal agencies have a remediation deadline of June 22, 2026 (NVD, GitHub Advisory).
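
To check whether a deployment falls in the affected range, the version bounds above can be encoded directly. This is a minimal sketch: the bounds are taken from the advisory summary above, the function names are hypothetical, and the parser only reads the leading X.Y.Z of a version string (suffixes such as .post1 are ignored).

```python
import re
from importlib import metadata
from typing import Optional, Tuple

AFFECTED_FROM = (1, 74, 2)  # first affected version, per the summary above
FIXED_IN = (1, 83, 7)       # first fixed version, per the summary above


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def is_affected(version: str) -> bool:
    """True if the version lies in [AFFECTED_FROM, FIXED_IN)."""
    return AFFECTED_FROM <= parse_version(version) < FIXED_IN


def installed_litellm_affected() -> Optional[bool]:
    """Check the locally installed litellm package, or None if absent."""
    try:
        return is_affected(metadata.version("litellm"))
    except metadata.PackageNotFoundError:
        return None
```

A result of True from installed_litellm_affected() means the environment should be upgraded to 1.83.7 or later; None means the package is not installed in the current environment.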

BadHost (CVE-2026-48710): Starlette Path-Based Access Control Bypass

Affected component: Starlette (Python web framework, 325M weekly downloads)
High-risk scenarios for LLM systems: MCP servers, Open WebUI, Jan, vLLM, and other Starlette/FastAPI apps with auth gating
Root vulnerability: A Host header containing /, ?, or # bypasses RFC 9112 validation and shifts URL paths during re-parsing.
Example exploit:
curl -H 'Host: foo?' http://target/admin # 200 instead of 403
Impact: Authentication bypass, SSRF, and potentially RCE on unpatched systems. Especially critical for locally deployed AI services without a reverse proxy.
Status: Fixed in Starlette 1.0.1 (June 2026). MCP server operators should upgrade Starlette urgently (InfoQ, scanner: badhost.org).
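
As defense in depth alongside the upgrade, a service can reject malformed Host headers before any routing or auth logic sees them. The sketch below is an illustration only, not the Starlette fix: the allow-list pattern and the function name is_valid_host_header are assumptions, and the pattern is deliberately conservative (hostname or bracketed IPv6 literal, optional port).

```python
import re

# Assumption: a conservative allow-list for "host" or "host:port".
# Anything containing '/', '?', '#', whitespace, or '@' fails to match.
_HOST_PATTERN = re.compile(
    r"(?:\[[0-9A-Fa-f:.]+\]|[A-Za-z0-9.-]+)"  # bracketed IPv6 literal or hostname/IPv4
    r"(?::\d{1,5})?"                          # optional port
)


def is_valid_host_header(value: str) -> bool:
    """Return True only if the whole header value matches the allow-list."""
    return _HOST_PATTERN.fullmatch(value) is not None
```

With this check, the value from the example exploit ("foo?") is rejected, while ordinary values such as "localhost:8000" pass. A reverse proxy that enforces the same rule achieves the same effect without touching application code.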

🔀 Ecosystem & Tools

OpenJarvis v1.0

(May 28, 2026, Ollama Blog): Open-source framework for personal AI agents on your own hardware. Ollama integration is built in. First stable version for local deployment with full agent orchestration.

Ollama Launch Extensions

Hermes Desktop: ollama launch hermes-desktop opens a native app (not a terminal wrapper).
Codex/OpenClaw/OpenCode: Still supported; Cline CLI integration as of Ollama 0.30.2.
Qwen Code: ollama launch qwen-code with automatic installation of the Cline CLI.

Jan.ai

Version 0.8.2 (current): Open-source ChatGPT alternative with local and cloud model support. Stable on macOS, Windows, and Linux.

MLXcel (NEW)

v0.1.0 Stable (May 28, 2026, KubeSimplify): Rust-based MLX inference engine for Apple Silicon. An alternative to Ollama's MLX integration, focused on direct access to the MLX C API.

Ollama Cloud & Integrations

Nemotron 3 Ultra, MiniMax M2, GLM-4.6, and Qwen3-VL are now available on Ollama Cloud.
Anthropic Messages API compatibility for Claude Code / OpenClaw.
gpt-oss-safeguard (20B and 120B, Apache 2.0) for safety classification.

🧠 Performance & Engineering Highlights

NVFP4 Quantization (NVIDIA 4-Bit Floating Point)

Ollama 0.30.6 uses NVFP4 for embedding layers on Apple Silicon (MLX backend). Qwen 3 models are the first NVFP4 candidates on macOS. NVIDIA Nemotron 3 Ultra is optimized for NVIDIA hardware (5x speedup vs. standard, 30% cost savings).

llama.cpp Vulkan & WebGPU Optimizations

Vulkan vec4 loads (b9558) and WebGPU 2D workgroups (b9564) bring continued speedups on AMD and Intel GPUs, including Intel Arc. Video input support (b9562) targets multimodal workloads.

Hermes Agent Performance

v0.16.0: Cold start about 1 second faster (Termux: 2.9 s → 0.8 s), 47% fewer function calls per conversation (399k → 213k in a 31-turn chat). session_search is 4,500× faster and free (the auxiliary LLM was eliminated).
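
The quoted figures can be cross-checked with simple arithmetic. A small sketch using only the numbers stated above (the helper names are hypothetical):

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


def speedup_factor(before: float, after: float) -> float:
    """How many times faster 'after' is compared with 'before'."""
    return before / after


# Function calls in the 31-turn chat: 399k -> 213k
calls_reduction = percent_reduction(399_000, 213_000)

# Termux cold start: 2.9 s -> 0.8 s
termux_saved_seconds = 2.9 - 0.8
termux_speedup = speedup_factor(2.9, 0.8)
```

The function-call reduction works out to about 46.6%, consistent with the rounded 47%. On Termux the cold start improves by about 2.1 seconds (roughly 3.6× faster), so the "about 1 second" figure presumably refers to other platforms.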

🆚 Ollama vs llama.cpp: Current Differences

NVFP4 support: Ollama (0.30.6+) enables NVFP4 quantization on Apple Silicon via MLX; llama.cpp does not yet expose the format directly through its CLI, so it is used via Ollama or external tools.
Multimodal MTP: llama.cpp b9568+ has native Gemma 4 E2B/E4B multimodal assistant support; Ollama integrates the llama.cpp backend for this (v0.30.3+).
Desktop integration: Ollama 0.30+ offers ollama launch for Hermes Desktop, Claude Code, and Codex, including a full desktop app; llama.cpp remains CLI-centric and points to third-party UIs (Open WebUI, Jan).
Cloud fallback: Ollama Cloud serves Nemotron 3 Ultra and other frontier models; llama.cpp runs only locally or through external inference services (RunPod, Hugging Face).

Conclusion

This week in June 2026 brings major advances for both core projects. Ollama consolidates the desktop experience, extends GGUF/llama.cpp support, and launches a cloud fallback; llama.cpp delivers multimodal foundations (Gemma 4 MTP, video), Vulkan/WebGPU optimizations, and a more robust web server. The open-weight landscape is thriving with Kimi K2.6, Qwen 3.6, GLM-5.1, and gpt-oss. Security requires immediate action: upgrade LiteLLM to 1.83.7+ and Starlette to 1.0.1+ (especially for MCP server operators). Hermes Agent is establishing itself as an ecosystem player with full desktop-app parity.