My M5 Max, Gemma 4, MLX LOCAL Stack. (This KILLS MODEL PROVIDERS)

Name: My M5 Max, Gemma 4, MLX LOCAL Stack. (This KILLS MODEL PROVIDERS)
Uploaded: 2026-04-20T13:00:00Z
Channel: IndyDevDan
Description: Model providers DON'T want you to see this video. The M5 Max just exposed the dirty secret of the cloud LLM economy: you're renting what you could alrea...

IndyDevDan · Beginner · 🧠 Large Language Models · 1h ago

Skills: LLM Engineering 90%

Model providers DON'T want you to see this video. The M5 Max just exposed the dirty secret of the cloud LLM economy: you're renting what you could already OWN.

🔥 While Anthropic and OpenAI APIs go down AGAIN mid-recording, my local stack keeps shipping. Private. Cheap. Fast. On-device. This is the beginning of the end for the API rental racket.

🎥 FEATURED LINKS:
• MLX, Gemma4, Qwen3.6, Pi agent live-bench codebase: https://github.com/disler/live-bench
• Tactical Agentic Coding: https://agenticengineer.com/tactical-agentic-coding?y=00Y-p62sk0s

📚 RESOURCES
• Nvidia NVFP4: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
• Apple M5 GPU Neural Accelerators: https://machinelearning.apple.com/research/exploring-llms-mlx-m5
• mlx-vlm: https://github.com/ml-explore/mlx-lm
• Ollama Gemma4 Model: https://ollama.com/library/gemma4
• Ollama MLX Blog: https://ollama.com/blog/mlx
• Pi coding agent: http://pi.dev
• Gemma4 26 nvfp4: https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-nvfp4
• Vitalik Eth Secure LLMs: https://vitalik.eth.limo/general/2026/04/02/secure_llms.html

⚡ Here's the uncomfortable truth most engineers are ignoring: you're paying a premium for cloud inference when your M5 Max, M4 Max, or even Apple Silicon you already own can run state-of-the-art local LLMs RIGHT NOW. Gemma 4, Qwen 3.5, MLX variants optimized for Apple AI hardware are quietly eating the model providers' lunch.

🧠 In this head-to-head benchmark, I pit the M5 Max vs the M4 Max across three brutal local inference tests: raw prompt throughput, context scaling with Graph Walks, and full agentic coding workflows via the Pi coding agent. The results are going to reshape how you think about local agents.

💣 THE CONTROVERSIAL FINDING: If you're running GGUF models on Apple Silicon in 2026, you're leaving 2x performance on the table. MLX smokes GGUF. Not by a little. By a LOT. 118 tokens per second vs 60. Almost double the pre-fill sp
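
To put the two quoted throughput figures side by side, here is a minimal Python sketch of the arithmetic. It assumes 118 tokens per second is the MLX figure and 60 is the GGUF figure (the description does not label them explicitly), and the helper names `speedup` and `seconds_for` are made up for illustration; they are not from the live-bench codebase.

```python
def speedup(new_tps: float, baseline_tps: float) -> float:
    """Ratio of two throughput figures, both in tokens per second."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return new_tps / baseline_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Time to emit `tokens` tokens at a steady `tps` tokens per second."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return tokens / tps


# Assumption: 118 tok/s is MLX and 60 tok/s is GGUF, per the order in the text.
MLX_TPS = 118.0
GGUF_TPS = 60.0

if __name__ == "__main__":
    print(f"speedup: {speedup(MLX_TPS, GGUF_TPS):.2f}x")
    # Hypothetical 1,000-token response, chosen only to make the gap concrete.
    print(f"MLX:  {seconds_for(1000, MLX_TPS):.1f} s")
    print(f"GGUF: {seconds_for(1000, GGUF_TPS):.1f} s")
```

Under that reading the ratio is a little under 2x, in line with the "2x" and "almost double" wording above.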
