microsoft/VibeVoice

📰 Simon Willison's Blog

Learn to transcribe audio with Microsoft's VibeVoice, a speech-to-text model that also performs speaker diarization, and run it on real-world audio files using the mlx-audio tool.

Intermediate · Published 27 Apr 2026

Action Steps

Install the mlx-audio tool and download the VibeVoice-ASR-4bit model
Run mlx-audio through uv to generate a speech-to-text transcript of an audio file
Set options such as --max-tokens so the command can handle longer audio files
Test the tool with different audio file formats such as .wav and .mp3
Use Datasette Lite to browse and explore the resulting JSON transcript
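
Once a JSON transcript exists, a few lines of Python can summarize it before or alongside browsing it. The sketch below is only an illustration under stated assumptions: it assumes the transcript is a JSON list of segment objects, and the key names (start, end, speaker, text) and the helper functions are hypothetical, not taken from mlx-audio or VibeVoice. Inspect your own output file and adjust the constants to match.

```python
import json
from collections import defaultdict

# Hypothetical key names: the real transcript may use different ones.
# Open your JSON file and change these to match what you see.
START_KEY = "start"      # segment start time in seconds (assumed)
END_KEY = "end"          # segment end time in seconds (assumed)
SPEAKER_KEY = "speaker"  # speaker label from diarization (assumed)
TEXT_KEY = "text"        # transcribed text for the segment (assumed)


def load_segments(path):
    """Load a transcript that is assumed to be a JSON list of segment objects."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("Expected a JSON list of segments; adjust for your file")
    return data


def speaking_time(segments):
    """Return total seconds of speech per speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg[SPEAKER_KEY]] += seg[END_KEY] - seg[START_KEY]
    return dict(totals)


def merge_turns(segments):
    """Merge consecutive segments by the same speaker into single turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1][SPEAKER_KEY] == seg[SPEAKER_KEY]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1][END_KEY] = seg[END_KEY]
            turns[-1][TEXT_KEY] += " " + seg[TEXT_KEY]
        else:
            # New speaker: start a new turn (copy so the input is not mutated).
            turns.append(dict(seg))
    return turns


def format_turns(turns):
    """Render turns as readable 'Speaker N [start-end]: text' lines."""
    return [
        f"Speaker {t[SPEAKER_KEY]} [{t[START_KEY]:.1f}-{t[END_KEY]:.1f}]: {t[TEXT_KEY]}"
        for t in turns
    ]
```

With a transcript saved as transcript.json (a placeholder filename), calling format_turns(merge_turns(load_segments("transcript.json"))) would give one line per speaker turn, and speaking_time shows how the conversation was split between speakers.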

Who Needs to Know This

Developers and data scientists who work with audio data, especially recordings with multiple speakers that require diarization, can use VibeVoice for speech-to-text tasks.

Key Insight

💡 VibeVoice can handle up to an hour of audio and labels who is speaking (speaker diarization), which makes it well suited to transcribing long, multi-speaker recordings.