microsoft/VibeVoice

📰 Simon Willison's Blog

Learn to use Microsoft's VibeVoice for speech-to-text tasks with speaker diarization, and apply it to real-world audio files using the mlx-audio tool.

intermediate Published 27 Apr 2026
Action Steps
  1. Install the mlx-audio tool and download the VibeVoice-ASR-4bit model
  2. Run mlx-audio through uv to generate a speech-to-text transcript of an audio file
  3. Adjust options such as --max-tokens to handle longer audio files
  4. Test the tool with different audio file formats such as .wav and .mp3
  5. Use Datasette Lite to browse and explore the resulting JSON transcript
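
Before browsing the transcript, it can help to summarize it per speaker. The sketch below is only an illustration, not the tool's documented output format. It assumes the JSON is a list of segment objects with hypothetical `speaker`, `start`, `end`, and `text` keys, so the field names will likely need adjusting to match the actual file. `summarize_speakers` and `load_segments` are illustrative helper names, not part of mlx-audio.

```python
import json
from collections import defaultdict


def load_segments(path):
    """Load a transcript file that is assumed to hold a JSON list of segments."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_speakers(segments):
    """Return per-speaker totals for a list of diarized segments.

    ASSUMPTION: each segment is a dict with the hypothetical keys
    "speaker", "start", "end" (in seconds) and "text". These names are
    placeholders; check the real transcript and rename as needed.
    """
    totals = defaultdict(lambda: {"seconds": 0.0, "segments": 0, "words": 0})
    for seg in segments:
        entry = totals[str(seg.get("speaker", "unknown"))]
        # Talk time is the span between the segment's start and end.
        entry["seconds"] += float(seg.get("end", 0)) - float(seg.get("start", 0))
        entry["segments"] += 1
        entry["words"] += len(str(seg.get("text", "")).split())
    return dict(totals)


if __name__ == "__main__":
    # Tiny in-memory sample in the assumed shape, standing in for a real file.
    sample = [
        {"speaker": 0, "start": 0.0, "end": 2.5, "text": "Hello there"},
        {"speaker": 1, "start": 2.5, "end": 4.0, "text": "Hi"},
        {"speaker": 0, "start": 4.0, "end": 5.0, "text": "How are you"},
    ]
    for speaker, stats in summarize_speakers(sample).items():
        print(speaker, stats)
```

A flat list of objects like the assumed one is also a convenient shape for browsing as a table, since each segment becomes one row.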
Who Needs to Know This

Developers and data scientists who transcribe audio, especially multi-speaker recordings that need speaker diarization (labeling who spoke when).

Key Insight

💡 VibeVoice can transcribe up to an hour of audio and label who is speaking (speaker diarization), which makes it well suited to long multi-speaker recordings.
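
Given the roughly one-hour figure above, it may be worth checking a recording's length before transcribing it. This is a minimal sketch using Python's standard `wave` module, so it only covers uncompressed .wav files. The 3600-second limit is taken from the summary as a rough guide rather than a documented hard cutoff, and the function names are illustrative.

```python
import wave

# ASSUMPTION: "up to an hour" is treated as a rough 3600-second guide,
# not a precise limit documented by the model or the tool.
MAX_SECONDS = 60 * 60


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def within_stated_limit(path, limit=MAX_SECONDS):
    """Return True if the recording is no longer than the given limit."""
    return wav_duration_seconds(path) <= limit
```

Longer recordings could be split into pieces first, although speaker labels would then need to be reconciled across the pieces.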
