Audio ML Papers

🏆 Most Important Audio ML Papers of All Time

The most influential audio machine learning papers, curated by impact, novelty, and field-defining significance. Deep-learning era only.

61 landmark papers · Organized by year · Updated May 2026

🏅 Hall of Fame: Most Cited

#1 · 📚 8.1k
WaveNet: A Generative Model for Raw Audio · 92
DeepMind Technologies · 2016
#2 · 📚 7.9k
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations · 91
Meta (formerly Facebook AI) · 2020
#3 · 📚 6.6k
Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) · 84
OpenAI · 2022
#4 · 📚 4.4k
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units · 83
Meta · 2021
#5 · 📚 4.0k
Conformer: Convolution-augmented Transformer for Speech Recognition · 84
Google Inc. · 2020
#6 · 📚 3.1k
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin · 83
Baidu Research - Silicon Valley AI Lab · 2015
#7 · 📚 3.0k
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2) · 83
Google · 2017
#8 · 📚 3.0k
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing · 89
Microsoft · 2021
#9 · 📚 2.6k
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis · 83
Kakao Enterprise · 2020
#10 · 📚 2.4k
Listen, Attend and Spell · 84
Google · 2015

2025

Yuan et al.; large-scale music LM with lyrics conditioning; open-source music generation at scale

Ruibin Yuan, Hanfeng Lin, Shuyue Guo ... · landmark
We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minute...
📚 57 citations

2024

Défossez et al., Kyutai; first real-time full-duplex speech LLM; simultaneous listening and speaking

Alexandre Défossez, Laurent Mazaré, Manu Orsini ... · landmark
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks...
📚 470 citations

Chen et al., Shanghai Jiao Tong University; flow-matching TTS with a Diffusion Transformer (DiT); fully non-autoregressive, with no duration model or phoneme alignment

Yushen Chen, Zhikang Niu, Ziyang Ma ... · landmark
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler token...
📚 352 citations

Kong et al., NVIDIA; ICML 2024; in-context learning + RAG + multi-turn dialogue over audio; SOTA audio understanding

Zhifeng Kong, Arushi Goel, Rohan Badlani ... · landmark
Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio und...

2023

Wang et al., Microsoft; 3-second voice cloning using EnCodec tokens + language model

Chengyi Wang, Sanyuan Chen, Yu Wu ... · landmark
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task r...
📚 1.1k citations

Liu et al.; latent diffusion for text-to-audio; CLAP-conditioned; first practical text-to-sound system

Haohe Liu, Zehua Chen, Yi Yuan ... · landmark
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that i...
📚 743 citations

Chu et al., Alibaba; universal audio LLM with 30+ tasks; strong multilingual speech + sound understanding

Yunfei Chu, Jin Xu, Xiaohuan Zhou ... · landmark
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existi...
📚 692 citations

Kumar et al., Descript; improved RVQ-based codec (factorized, L2-normalized codebooks and periodic Snake activations); open-source standard

Rithesh Kumar, Prem Seetharaman, Alejandro Luebs ... · landmark
Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that e...
📚 656 citations

Agostinelli et al., Google; text-conditional music generation; MuLan embeddings; raised music gen quality bar

Andrea Agostinelli, Timo I. Denk, Zalán Borsos ... · landmark
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generate...
📚 646 citations

Copet et al., Meta; single-stage music generation from text/melody; open-source AudioCraft framework

Jade Copet, Felix Kreuk, Itai Gat ... · landmark
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together...
📚 639 citations

Tang et al., Tsinghua; dual-encoder LLM for speech + audio understanding; broad audio QA capabilities

landmark
📚 521 citations

Le et al., Meta; flow-matching TTS at scale; in-context learning for voice styles

Matthew Le, Apoorv Vyas, Bowen Shi ... · landmark
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive i...
📚 472 citations
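
As a rough illustration of the flow-matching idea named in the summary line for this entry, the sketch below uses the straight-line path between a noise sample and a data sample. That is the optimal-transport path with its minimum-noise parameter set to zero, which is a simplifying assumption. The helper names `interpolate`, `target_velocity`, and `euler_sample` are hypothetical, and the oracle velocity stands in for a trained network; nothing here is taken from the paper's code.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the vector field on that path: the constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy 2-D "noise" and "data" points (illustrative values only).
NOISE = [0.5, -1.0]
DATA = [2.0, 1.0]

# An oracle that returns the true velocity; a real system would use a network.
oracle = lambda x, t: target_velocity(NOISE, DATA)
sample = euler_sample(oracle, NOISE, steps=4)
```

With the oracle in place the integration lands on the data point; a trained model only approximates this field, which is why the number of integration steps trades speed against quality.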

Huang et al., Zhejiang University and ByteDance; prompt-enhanced audio generation; pseudo-prompts for data augmentation

Rongjie Huang, Jiawei Huang, Dongchao Yang ... · landmark
Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling lo...
📚 463 citations

Liu et al.; unified audio/speech/music generation via GPT-2 + diffusion pipeline

Haohe Liu, Yi Yuan, Xubo Liu ... · landmark
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To br...
📚 435 citations

Rubenstein et al., Google; LLM extended with audio tokens; jointly models text and speech

Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen ... · landmark
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate ...
📚 433 citations

Shen et al., Microsoft; diffusion-based zero-shot TTS with natural prosody

landmark
📚 355 citations

Li et al., Columbia; style diffusion + adversarial training; first open-source TTS to rival commercial systems

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan ... · landmark
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random ...
📚 241 citations

Gong et al., MIT; instruction-following audio LLM; understands and reasons about sound and music

landmark
📚 237 citations

Siuzdak; frequency-domain GAN vocoder; faster and better than HiFi-GAN

Hubert Siuzdak · landmark
Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally-i...
📚 215 citations

Barrault et al., Meta; unified model for speech-to-speech, speech-to-text, text-to-speech across 100+ languages

Seamless Communication, Loïc Barrault, Yu-An Chung ... · landmark
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have...
📚 162 citations

2022

Radford et al., OpenAI; 680k hours weak supervision; multilingual; became the standard open ASR system

Alec Radford, Jong Wook Kim, Tao Xu ... · landmark
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are ofte...
📚 6.6k citations

Baevski et al., Meta; unified self-supervised framework across modalities; strong for speech

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu ... · landmark
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework...
📚 1.1k citations

Défossez et al., Meta; open-source neural codec; backbone of VALL-E, MusicGen, and AudioCraft

Alexandre Défossez, Jade Copet, Gabriel Synnaeve ... · landmark
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multisca...
📚 1.1k citations
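
The "quantized latent space" in this abstract can be pictured with residual vector quantization (RVQ), the scheme used by codecs in this family: each stage quantizes whatever the previous stages left over. The sketch below is a toy version under stated assumptions. The codebooks are hand-picked rather than learned, the vectors are 2-D, and the function names are hypothetical rather than any library's API.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared L2 distance."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Return one index per stage, quantizing the running residual each time."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks, codes):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode(CODEBOOKS, [1.2, 0.9])   # one small integer per stage
approx = rvq_decode(CODEBOOKS, codes)       # coarse entry plus refinement
```

Adding stages refines the reconstruction while each stage's codebook stays small, which is what makes the resulting token streams convenient inputs for language models.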

Wu et al., LAION; contrastive audio-text pretraining; audio equivalent of CLIP; widely used for retrieval/eval

Yusong Wu, Ke Chen, Tianyu Zhang ... · landmark
Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To a...
📚 966 citations

Borsos et al., Google; hierarchical language model over SoundStream tokens; coherent long-form audio

Zalán Borsos, Raphaël Marinier, Damien Vincent ... · landmark
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers...
📚 889 citations

Saeki et al.; MOS prediction model; standard automatic MOS estimator for TTS evaluation

Takaaki Saeki, Detai Xin, Wataru Nakata ... · landmark
We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for...
📚 508 citations

Kreuk et al., Meta; first high-quality text-to-general-audio system; part of AudioCraft

landmark
📚 423 citations

Lee et al., NVIDIA; scaled HiFi-GAN with anti-aliased activations; strong universal vocoder

Sang-gil Lee, Wei Ping, Boris Ginsburg ... · landmark
Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, ...
📚 418 citations

Richter et al.; diffusion models for speech enhancement; enabled generative approach to noise reduction

Julius Richter, Simon Welker, Jean-Marie Lemercier ... · landmark
In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination o...
📚 352 citations

Tan et al., Microsoft; first TTS system to achieve human-level naturalness on LJSpeech

landmark
📚 313 citations

Gardner et al., Google Magenta; ICLR 2022; T5 sequence-to-sequence multi-instrument transcription across datasets

Josh Gardner, Ian Simon, Ethan Manilow ... · landmark
Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT often requires transcribing multiple ins...

Bittner et al., Spotify; ICASSP 2022; lightweight audio-to-MIDI with pitch bend; widely deployed open-source transcriber

Rachel M. Bittner, Juan José Bosch, David Rubinstein ... · landmark
Automatic Music Transcription (AMT) has been recognized as a key enabling technology with a wide range of applications. Given the task's complexity, best results have typically been reported for systems focusing on specific settings, e.g. instrument-specific systems tend to yield...

2021

Hsu et al., Meta; BERT-style masked prediction for speech; surpassed wav2vec 2.0

landmark
📚 4.4k citations

Chen et al., Microsoft; denoising + masked prediction; best self-supervised speech model for years

Sanyuan Chen, Chengyi Wang, Zhengyang Chen ... · landmark
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., lear...
📚 3.0k citations

Kim et al.; end-to-end TTS surpassing 2-stage systems; became the dominant TTS architecture

Jaehyeon Kim, Jungil Kong, Juhee Son · landmark
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natu...
📚 1.2k citations

Zeghidour et al., Google; first neural audio codec; RVQ-based; enabled AudioLM

landmark
📚 1.2k citations

Casanova et al.; VITS-based zero-shot multi-speaker TTS; cross-lingual voice conversion

Edresson Casanova, Julian Weber, Christopher Shulby ... · landmark
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-sh...
📚 585 citations

2020

Baevski et al., Meta; quantized contrastive learning; 10 min of labels → near-supervised performance

landmark
📚 7.9k citations

Gulati et al., Google; CNN + Transformer for speech; became the dominant ASR encoder architecture

Anmol Gulati, James Qin, Chung-Cheng Chiu ... · landmark
Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploi...
📚 4.0k citations

Kong et al.; multi-period discriminator GAN vocoder; best quality/speed tradeoff for years

landmark
🔗 Papers This Influenced
  • BigVGAN (2022)
  • Vocos (2023)
  • Used as vocoder in VITS, FastSpeech 2, NaturalSpeech, etc. (2021)
📚 2.6k citations

Kong et al.; diffusion for waveform synthesis; vocoder + unconditional generation; launched audio diffusion

Zhifeng Kong, Wei Ping, Jiaji Huang ... · landmark
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps...
📚 1.9k citations

Ren et al., Microsoft; duration/pitch/energy predictors; cleaner non-autoregressive TTS

Yi Ren, Chenxu Hu, Xu Tan ... · landmark
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide...
📚 1.7k citations
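
The duration prediction mentioned in this abstract feeds what FastSpeech-style models call a length regulator: each phoneme-level hidden state is repeated for as many output frames as its predicted duration. A minimal sketch follows, assuming integer durations and using a hypothetical `length_regulate` helper; real systems operate on batched tensors rather than Python lists.

```python
def length_regulate(phoneme_states, durations):
    """Expand phoneme-level states to frame level by repeating each one.

    phoneme_states: one item per phoneme (any object; vectors in practice).
    durations: non-negative integer frame counts, one per phoneme.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        frames.extend([state] * n_frames)
    return frames


# Placeholder labels stand in for hidden vectors; a duration of 0 drops the item.
expanded = length_regulate(["h", "e", "l", "o"], [2, 3, 0, 1])
```

Because the output length is fixed before decoding starts, all frames can be generated in parallel, which is where the speed advantage over autoregressive decoding comes from.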

Kong et al., University of Surrey; pretrained CNNs on AudioSet; became the standard backbone for audio tagging and classification

Qiuqiang Kong, Yin Cao, Turab Iqbal ... · landmark
Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have be...
📚 1.4k citations

Dhariwal et al., OpenAI; multi-scale VQ-VAE + autoregressive model for raw audio music with lyrics

Prafulla Dhariwal, Heewoo Jun, Christine Payne ... · landmark
We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at s...
📚 934 citations

Reddy et al., Microsoft; non-intrusive automatic MOS for noise-suppressed speech; standard in speech enhancement

Chandan K A Reddy, Vishak Gopal, Ross Cutler · landmark
Human subjective evaluation is the gold standard to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. The conventional and widely used metrics require a reference clean speech signal, which is unavailable ...
📚 499 citations

2019

Kumar et al.; GAN-based real-time vocoder; orders of magnitude faster than WaveNet

Kundan Kumar, Rithesh Kumar, Thibault de Boissiere ... · landmark
Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of arch...
📚 1.1k citations

Kilgour et al., Google; audio equivalent of FID; standard for evaluating audio/music generation quality

Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek ... · landmark
We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide ...
📚 327 citations
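
For readers who want the quantity behind the name: like FID, the metric is the Fréchet distance between two multivariate Gaussians fitted to embeddings of a reference set and of the evaluated set. The truncated abstract above does not spell this out, so the symbols below are introduced here for illustration: \mu_r, \Sigma_r are the mean and covariance of the reference embeddings and \mu_g, \Sigma_g those of the evaluated audio.

```latex
\mathrm{FAD}
  = \lVert \mu_r - \mu_g \rVert_2^{2}
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero when the two Gaussians coincide and grows as the evaluated audio's embedding statistics drift from the reference; lower is better.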

Défossez et al., Meta; waveform-domain music source separation; became the open-source standard

Alexandre Défossez, Nicolas Usunier, Léon Bottou ... · landmark
Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. Contrarily to many audio synthesis tasks...
📚 311 citations

Ren et al., Microsoft; non-autoregressive TTS; 270x speedup over autoregressive models

landmark

Schneider et al., Meta; first contrastive self-supervised learning for speech; precursor to wav2vec 2.0

landmark

2018

Prenger et al., NVIDIA; normalizing flow vocoder; first real-time neural vocoder

Ryan Prenger, Rafael Valle, Bryan Catanzaro · landmark
In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. Wa...
📚 1.1k citations

Wan et al., Google; generalized end-to-end loss for speaker embeddings; standard speaker verification approach

Li Wan, Quan Wang, Alan Papir ... · landmark
In this paper, we propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (TE2E) loss function. Unlike TE2E, the GE2E loss function updates the network i...
📚 1.0k citations

Jia et al., Google; speaker-conditioned TTS using d-vectors; generalized multi-speaker TTS

Ye Jia, Yu Zhang, Ron J. Weiss ... · landmark
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder...
📚 939 citations

Huang et al., Google Brain; relative attention for long-range music structure; enabled coherent MIDI generation

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit ... · landmark
Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on ...
📚 562 citations

2017

Shen et al., Google; Tacotron 2 combined with WaveNet vocoder; MOS near human quality

Jonathan Shen, Ruoming Pang, Ron J. Weiss ... · landmark
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet m...
📚 3.0k citations

Wang et al., Google; seq2seq TTS from text to mel-spectrogram; replaced pipeline TTS

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton ... · landmark
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, w...
📚 2.0k citations

2016

van den Oord et al., DeepMind; first autoregressive raw waveform model; defined the field of neural TTS

Aaron van den Oord, Sander Dieleman, Heiga Zen ... · landmark
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently tr...
📚 8.1k citations

Mehri et al.; hierarchical RNN for raw audio; showed unconditional audio generation is feasible

Soroush Mehri, Kundan Kumar, Ishaan Gulrajani ... · landmark
In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks...
📚 622 citations

2015

Amodei et al., Baidu; scaled CTC-based ASR; multilingual; near-human on some benchmarks

Dario Amodei, Rishita Anubhai, Eric Battenberg ... · landmark
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a...
📚 3.1k citations
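
The "CTC-based" label in the summary line for this entry refers to connectionist temporal classification, where the network emits one label (or a blank) per audio frame and the transcript is recovered by collapsing repeats and dropping blanks. The sketch below shows only that greedy collapse step, with a hypothetical `ctc_greedy_decode` helper and made-up frame labels; production systems typically replace it with beam search and a language model.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: merge consecutive repeats, then remove blanks."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Hypothetical per-frame argmax labels; 0 is the blank symbol.
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0, 2]
print(ctc_greedy_decode(frames))  # [3, 3, 5, 2]
```

The blank between the two runs of 3 is what lets the output contain a genuinely doubled label, which plain repeat-merging would otherwise erase.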

Chan et al., Google Brain; attention-based encoder-decoder for ASR; foundational seq2seq approach

William Chan, Navdeep Jaitly, Quoc V. Le ... · landmark
We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. ...
📚 2.4k citations

2014

Hannun et al., Baidu; end-to-end deep RNN ASR; first to beat traditional pipelines at scale

Awni Hannun, Carl Case, Jared Casper ... · landmark
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform p...
📚 2.2k citations