[P] vLLM-MLX: Native Apple Silicon LLM inference – 464 tok/s on M4 Max
Hey everyone! I built vLLM-MLX – a framework that uses Apple's MLX for native GPU acceleration.

What it does:
– OpenAI-compatible API (drop-in replacement for your existing code; see the sketch below)
– Multimodal support: Text, Images, Video, Audio – all in one server
– Continuous batching for concurrent users (3.4x speedup)
– TTS in 10+ languages (Kokoro, Chatterbox models)
– MCP tool calling support

Performance on M4 Max:
– Llama-3.2-1B-4bit → 464 tok/s
– Qwen3-0.6B → 402 tok/s
– Whisper STT […]
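
Because the server is OpenAI-compatible, existing client code should only need a base-URL swap. Here's a minimal sketch using the official `openai` Python client; the port (8000) and the model ID are my assumptions, not confirmed details from the post:

```python
# Minimal sketch: point the stock OpenAI client at a local vLLM-MLX server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="not-needed",                 # local servers typically ignore the key
)

# Standard chat completion call; the model ID below is an assumption.
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello from Apple Silicon!"}],
)
print(response.choices[0].message.content)
```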