Show HN: KTransformers: 671B DeepSeek-R1 on a Single Machine, 286 tokens/s Prefill

📰 Hacker News · sssummer

Hey Hacker News! We are excited to share the new version of KTransformers, a flexible framework for cutting-edge LLM inference optimizations! Leveraging state-of-the-art kernels from llamafile and marlin, KTransformers transparently speeds up HuggingFace Transformers, making it possible to run huge 671B MoE models, or extremely long 1M-token contexts, locally at promising speeds.

KTransformers is a Python-centric framework designed with extensibility at its core. By implementing and injecting an optimized module with a single line of code (see the sketch after the list below), users gain access to a Transformers-compatible interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI. This lets it integrate with all your familiar frontends, such as the VS Code plugin backed by Tabby.

To demonstrate its capability, we present two showcase demos:

- GPT-4/o1-level Local VSCode Copilot: it runs the huge 671B DeepSeek-Coder-V2's Q4_K_M variant …
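To give a feel for the injection workflow described above, here is a minimal sketch in Python, loosely adapted from the project's README. The exact import path, the optimize-rule YAML file, the local weight paths, and the `optimize_and_load_gguf` signature are assumptions for illustration; check the repo for the authoritative example.

```python
# Minimal sketch of the single-line injection workflow, loosely adapted from
# the KTransformers README. The import path, the optimize-rule YAML file, the
# local paths, and the optimize_and_load_gguf signature are assumptions.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from ktransformers.optimize.optimize import optimize_and_load_gguf  # assumed path

hf_id = "deepseek-ai/DeepSeek-Coder-V2-Instruct"      # HF config + tokenizer
gguf_path = "./DeepSeek-Coder-V2-Q4_K_M"              # quantized weights (hypothetical path)
rule_path = "./optimize_rules/DeepSeek-V2-Chat.yaml"  # injection rules (hypothetical path)

config = AutoConfig.from_pretrained(hf_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(hf_id, trust_remote_code=True)

# Build the model skeleton on the meta device so no weights are materialized yet.
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# The single injection call: following the YAML rules, modules are swapped for
# optimized implementations (e.g. llamafile kernels on CPU, marlin on GPU)
# and the GGUF weights are streamed into them.
optimize_and_load_gguf(model, rule_path, gguf_path, config)

# The patched model keeps the familiar Transformers-compatible interface.
inputs = tokenizer("Write a quicksort in Python.", return_tensors="pt")
output = model.generate(inputs.input_ids.cuda(), max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The same injected model is presumably what backs the OpenAI/Ollama-compatible REST endpoints and the web UI mentioned above, so frontends like Tabby can point at it without changes.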

Published 10 Feb 2025
Read full article →