Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

📰 ArXiv cs.AI

Learn how to optimize ASR models for low-latency, on-device inference, achieving high accuracy in a compact footprint.

Level: Advanced · Published 17 Apr 2026
Action Steps
  1. Evaluate state-of-the-art ASR architectures using encoder-decoder, transducer, and LLM-based paradigms
  2. Compare performance across batch, chunked, and streaming inference modes
  3. Optimize model size and latency using techniques such as pruning and knowledge distillation
  4. Implement low-latency inference on CPU-only hardware using the optimized models
  5. Test and benchmark optimized models on edge devices to ensure high accuracy and low latency
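The streaming inference mode in the steps above can be sketched with a toy chunked pipeline: audio features are processed in fixed-size chunks and per-chunk latency is measured, as one would when benchmarking a compact model on an edge CPU. This is a minimal illustration, not the paper's implementation; the encoder here is a hypothetical stand-in (one matrix multiply plus ReLU), and all dimensions and chunk sizes are assumed values for demonstration.

```python
import time
import numpy as np

# Assumed toy dimensions, not taken from the paper.
FEAT_DIM = 80       # log-mel feature dimension
HIDDEN_DIM = 256    # width of the compact stand-in encoder
CHUNK_FRAMES = 32   # frames per streaming chunk (~320 ms at a 10 ms hop)

rng = np.random.default_rng(0)
# Hypothetical stand-in for a pruned/distilled streaming encoder:
# a single linear layer with ReLU, running on CPU only.
weights = rng.standard_normal((FEAT_DIM, HIDDEN_DIM)).astype(np.float32)

def encode_chunk(chunk: np.ndarray) -> np.ndarray:
    """Run one streaming chunk through the toy encoder."""
    return np.maximum(chunk @ weights, 0.0)

def stream(features: np.ndarray):
    """Yield (encoded_chunk, latency_ms) for each fixed-size chunk,
    simulating how a streaming ASR front-end consumes audio."""
    for start in range(0, len(features), CHUNK_FRAMES):
        chunk = features[start:start + CHUNK_FRAMES]
        t0 = time.perf_counter()
        out = encode_chunk(chunk)
        yield out, (time.perf_counter() - t0) * 1000.0

# Simulate 10 s of audio: 1000 feature frames.
utterance = rng.standard_normal((1000, FEAT_DIM)).astype(np.float32)
latencies = [ms for _, ms in stream(utterance)]
print(f"chunks: {len(latencies)}, max per-chunk latency: {max(latencies):.2f} ms")
```

In a real benchmark, `encode_chunk` would wrap the optimized model's forward pass, and the max per-chunk latency (not the average) is what determines whether the system stays real-time.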
Who Needs to Know This

Speech recognition engineers and researchers can use this study to build efficient ASR models for edge devices, reducing latency and improving user experience.

Key Insight

💡 Compact, high-accuracy ASR models can be achieved through systematic empirical study and optimization of state-of-the-art architectures

Share This
📢 New study on optimizing ASR models for low-latency on-device inference! 🚀
Read full paper →