Automatic Speech Recognition (ASR) transcribes spoken audio to text, with multilingual support and speech translation into English.
Audio is sampled at 16 kHz and split into 10 s windows, each of which is converted to a spectrogram. A transformer encoder processes the spectrogram, and a transformer decoder autoregressively
predicts text tokens.
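
The windowing and autoregressive decoding steps can be sketched as follows. This is a minimal illustration, not the model's actual implementation: the function names, the zero-padding of the final window, and the use of greedy (argmax) decoding are all assumptions made for the example, and step_fn is a hypothetical stand-in for the encoder-decoder forward pass. The spectrogram computation is omitted.

```python
from __future__ import annotations

from typing import Callable

import numpy as np

SAMPLE_RATE = 16_000   # Hz, as stated in the description
WINDOW_SECONDS = 10    # window length, as stated in the description
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 160,000 samples per window


def split_into_windows(audio: np.ndarray) -> np.ndarray:
    """Split mono 16 kHz audio into fixed-length windows.

    Assumption: the last window is zero-padded to full length.
    Returns an array of shape (n_windows, WINDOW_SAMPLES).
    """
    n_windows = max(1, -(-len(audio) // WINDOW_SAMPLES))  # ceiling division
    padded = np.zeros(n_windows * WINDOW_SAMPLES, dtype=np.float32)
    padded[: len(audio)] = audio
    return padded.reshape(n_windows, WINDOW_SAMPLES)


def greedy_decode(
    step_fn: Callable[[list[int]], np.ndarray],
    bos_id: int,
    eos_id: int,
    max_len: int = 32,
) -> list[int]:
    """Autoregressively pick the most likely next token until EOS.

    step_fn is hypothetical: given the tokens produced so far, it returns
    one logit per vocabulary entry for the next token. In a real system
    it would also condition on the encoder output for the current window.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(np.argmax(step_fn(tokens)))
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start-of-sequence token
```

Under these assumptions, 25 s of audio yields three windows, the last of which is half silence. Each window would then be turned into a spectrogram, encoded once, and decoded token by token with a loop like greedy_decode.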