Audio ML Papers

🏆 Best Audio ML Papers of All Time

The most influential audio machine learning papers — curated by impact, novelty, and field-defining significance.

55 landmark papers · Organized by year · Updated April 2026

2025

Yuan et al.; large-scale music LM with lyrics conditioning; open-source music generation at scale

Ruibin Yuan, Hanfeng Lin, Shuyue Guo ... · landmark
We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minute...

2023

Wang et al., Microsoft; 3-second voice cloning using EnCodec tokens + language model

Chengyi Wang, Sanyuan Chen, Yu Wu ... · landmark
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task r...
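
To make the "TTS as conditional language modeling" framing concrete, here is a minimal sketch of the idea — not the VALL-E implementation: a decoder-only transformer does next-token prediction over one sequence that concatenates phoneme ids and codec tokens. The `CodecLM` name and vocabulary sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

VOCAB = 1024 + 256  # hypothetical: 1024 codec codes + 256 phoneme/text ids

class CodecLM(nn.Module):
    def __init__(self, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):  # tokens: (B, T)
        T = tokens.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)  # logits: (B, T, VOCAB)

# One training step on a toy [phonemes; acoustic prompt; target codes] sequence.
model = CodecLM()
seq = torch.randint(0, VOCAB, (2, 128))
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   seq[:, 1:].reshape(-1))
loss.backward()
```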

Shen et al., Microsoft; diffusion-based zero-shot TTS with natural prosody

Kai Shen, Zeqian Ju, Xu Tan ... · landmark
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and...

Le et al., Meta; flow-matching TTS at scale; in-context learning for voice styles

Matthew Le, Apoorv Vyas, Bowen Shi ... · landmark
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive i...
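
The flow-matching objective behind Voicebox-style training is compact enough to sketch. Below is the standard conditional flow-matching loss under a straight-line probability path; `model` is an assumed velocity-prediction network, and this shows the general technique, not Meta's code (Voicebox additionally uses masked infilling conditioning).

```python
import torch

def flow_matching_loss(model, x1, cond):
    """x1: (B, C, T) clean target features; model(x_t, t, cond) -> velocity."""
    x0 = torch.randn_like(x1)             # noise sample
    t = torch.rand(x1.shape[0], 1, 1)     # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1           # straight-line interpolation path
    v_target = x1 - x0                    # constant target velocity along path
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()
```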

Kumar et al., Descript; improved codec with pitch-invariant quantization; open-source standard

Rithesh Kumar, Prem Seetharaman, Alejandro Luebs ... · landmark
Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that e...

Agostinelli et al., Google; text-conditional music generation; MuLan embeddings; raised music gen quality bar

Andrea Agostinelli, Timo I. Denk, Zalán Borsos ... · landmark
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generate...

Copet et al., Meta; single-stage music generation from text/melody; open-source AudioCraft framework

Jade Copet, Felix Kreuk, Itai Gat ... · landmark
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together...

Liu et al.; latent diffusion for text-to-audio; CLAP-conditioned; made text-to-sound generation practical

Haohe Liu, Zehua Chen, Yi Yuan ... · landmark
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that i...

Liu et al.; unified audio/speech/music generation via GPT-2 + diffusion pipeline

Haohe Liu, Yi Yuan, Xubo Liu ... · landmark
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To br...

Siuzdak; frequency-domain GAN vocoder; faster and better than HiFi-GAN

Hubert Siuzdak · landmark
Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally-i...

Gong et al., MIT; instruction-following audio LLM; understands and reasons about sound and music

Yuan Gong, Hongyin Luo, Alexander H. Liu ... · landmark
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-...

Tang et al., Tsinghua; dual-encoder LLM for speech + audio understanding; broad audio QA capabilities

Changli Tang, Wenyi Yu, Guangzhi Sun ... · landmark
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we...

Chu et al., Alibaba; universal audio LLM with 30+ tasks; strong multilingual speech + sound understanding

Yunfei Chu, Jin Xu, Xiaohuan Zhou ... · landmark
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existi...

Rubenstein et al., Google; LLM extended with audio tokens; jointly models text and speech

Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen ... · landmark
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate ...

Thickstun et al., Stanford; infilling-based music transformer; enables interactive music generation

John Thickstun, David Hall, Chris Donahue ... · landmark
We introduce anticipation: a method for constructing a controllable generative model of a temporal point process (the event process) conditioned asynchronously on realizations of a second, correlated process (the control process). We achieve this by interleaving sequences of even...

Huang et al., Zhejiang University; prompt-enhanced audio generation; pseudo-prompts for data augmentation

Rongjie Huang, Jiawei Huang, Dongchao Yang ... · landmark
Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling lo...

2022

Tan et al., Microsoft; first TTS system to achieve human-level naturalness on LJSpeech

Xu Tan, Jiawei Chen, Haohe Liu ... · landmark
Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise that whether a TTS system can achieve human-level quality, how to define/judge that quality and how to achieve it. In this paper, we answer these questions b...

Défossez et al., Meta; open-source neural codec; backbone of VALL-E, MusicGen, and AudioCraft

Alexandre Défossez, Jade Copet, Gabriel Synnaeve ... · landmark
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multisca...

Radford et al., OpenAI; 680k hours weak supervision; multilingual; became the standard open ASR system

Alec Radford, Jong Wook Kim, Tao Xu ... · landmark
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are ofte...
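
Whisper shipped as an open-source package, so a working transcription call is short. This uses the released `openai-whisper` API (requires `pip install openai-whisper` and ffmpeg); the audio filename is a placeholder.

```python
import whisper

model = whisper.load_model("base")        # tiny / base / small / medium / large
result = model.transcribe("speech.mp3")   # language is auto-detected
print(result["text"])
```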

Borsos et al., Google; hierarchical language model over SoundStream tokens; coherent long-form audio

Zalán Borsos, Raphaël Marinier, Damien Vincent ... · landmark
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers...

Kreuk et al., Meta; first high-quality text-to-general-audio system; part of AudioCraft

Felix Kreuk, Gabriel Synnaeve, Adam Polyak ... · landmark
We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation...

Lee et al., NVIDIA; scaled HiFi-GAN with anti-aliased activations; strong universal vocoder

Sang-gil Lee, Wei Ping, Boris Ginsburg ... · landmark
Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, ...

Wu et al., LAION; contrastive audio-text pretraining; audio equivalent of CLIP; widely used for retrieval/eval

Yusong Wu, Ke Chen, Tianyu Zhang ... · landmark
Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To a...
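
CLAP's training objective is the same symmetric InfoNCE loss as CLIP's, which a short sketch makes concrete. The encoders producing the embeddings are omitted, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clap_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (B, D); row i of each is a matched audio-text pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature          # (B, B) pairwise similarities
    labels = torch.arange(len(a))           # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```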

Zeng et al., Microsoft; BERT pretraining for symbolic music; OctupleMIDI encoding; strong music understanding

Mingliang Zeng, Xu Tan, Rui Wang ... · landmark
Symbolic music understanding, which refers to the understanding of music from the symbolic data (e.g., MIDI format, but not audio), covers many music applications such as genre classification, emotion classification, and music pieces matching. While good music representations are...

Richter et al.; diffusion models for speech enhancement; enabled a generative approach to noise reduction

Julius Richter, Simon Welker, Jean-Marie Lemercier ... · landmark
In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination o...

Baevski et al., Meta; unified self-supervised framework across modalities; strong for speech

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu ... · landmark
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework...

2021

Kim et al.; end-to-end TTS surpassing 2-stage systems; became the dominant TTS architecture

Jaehyeon Kim, Jungil Kong, Juhee Son · landmark
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natu...

Zeghidour et al., Google; pioneering end-to-end neural audio codec; introduced RVQ for audio; enabled AudioLM

Neil Zeghidour, Alejandro Luebs, Ahmed Omran ... · landmark
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a res...
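
The residual vector quantization (RVQ) at the heart of SoundStream fits in a few lines: each stage quantizes the residual left by the previous stage, so bitrate scales with the number of stages. A toy sketch with random codebooks, not the trained codec.

```python
import torch

def rvq(x, codebooks):
    """x: (B, D) latent vectors; codebooks: list of (K, D) tensors."""
    residual = x
    quantized = torch.zeros_like(x)
    codes = []
    for cb in codebooks:
        d = torch.cdist(residual, cb)    # (B, K) distances to codewords
        idx = d.argmin(dim=1)            # nearest codeword per vector
        q = cb[idx]
        quantized = quantized + q        # running sum of stage outputs
        residual = residual - q          # pass the remainder to the next stage
        codes.append(idx)
    return quantized, torch.stack(codes, dim=1)

x = torch.randn(4, 64)
books = [torch.randn(1024, 64) for _ in range(8)]   # 8 quantizer stages
x_hat, codes = rvq(x, books)                        # codes: (4, 8)
```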

Hsu et al., Meta; BERT-style masked prediction for speech; surpassed wav2vec 2.0

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai ... · landmark
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths...

Chen et al., Microsoft; denoising + masked prediction; best self-supervised speech model for years

Sanyuan Chen, Chengyi Wang, Zhengyang Chen ... · landmark
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., lear...

Saeki et al.; MOS prediction model; standard automatic MOS estimator for TTS evaluation

Takaaki Saeki, Detai Xin, Wataru Nakata ... · landmark
We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for...

Casanova et al.; VITS-based zero-shot multi-speaker TTS; cross-lingual voice conversion

Edresson Casanova, Julian Weber, Christopher Shulby ... · landmark
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-sh...

2020

Ren et al., Microsoft; duration/pitch/energy predictors; cleaner non-autoregressive TTS

Yi Ren, Chenxu Hu, Xu Tan ... · landmark
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide...

Kong et al.; multi-period discriminator GAN vocoder; best quality/speed tradeoff for years

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae · landmark
Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative...
🔗 Papers This Influenced
  • BigVGAN (2022)
  • Vocos (2023)
  • Used as vocoder in VITS, FastSpeech 2, NaturalSpeech, etc. (2021)

Gulati et al., Google; CNN + Transformer for speech; became the dominant ASR encoder architecture

Anmol Gulati, James Qin, Chung-Cheng Chiu ... · landmark
Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploi...
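
A simplified Conformer block, sketched to show the macaron layout: half-step feed-forward, self-attention, a depthwise-convolution module, then another half-step feed-forward. The paper's conv module also includes pointwise convolutions with GLU, and its attention uses relative positional encoding; both are omitted here.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d=256, heads=4, kernel=31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                 nn.SiLU(), nn.Linear(4 * d, d))
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv = nn.Sequential(              # depthwise conv over time
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),
            nn.BatchNorm1d(d), nn.SiLU())
        self.ff2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                 nn.SiLU(), nn.Linear(4 * d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, x):                       # x: (B, T, d)
        x = x + 0.5 * self.ff1(x)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.norm(x)

y = ConformerBlock()(torch.randn(2, 100, 256))
```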

Baevski et al., Meta; quantized contrastive learning; 10 min labels → near supervised performance

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed ... · landmark
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and sol...
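
The pretrained model is distributed through torchaudio's bundles, so feature extraction is a few lines; this minimal sketch uses the real `WAV2VEC2_BASE` bundle (weights download on first use) with a random tensor standing in for loaded audio.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()
wav = torch.randn(1, 16000)                    # stand-in: 1 s at 16 kHz
with torch.inference_mode():
    features, _ = model.extract_features(wav)  # list of per-layer features
```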

Dhariwal et al., OpenAI; multi-scale VQ-VAE + autoregressive model for raw audio music with lyrics

Prafulla Dhariwal, Heewoo Jun, Christine Payne ... · landmark
We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at s...

Kong et al.; diffusion for waveform synthesis; vocoder + unconditional generation; launched audio diffusion

Zhifeng Kong, Wei Ping, Jiaji Huang ... · landmark
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps...
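
DiffWave trains with the standard denoising-diffusion objective: corrupt a clean waveform at a random step and regress the injected noise. A sketch with an illustrative noise schedule and an assumed `model(x_t, t, cond)` network, not the paper's exact configuration.

```python
import torch

T_STEPS = 50
betas = torch.linspace(1e-4, 0.05, T_STEPS)       # illustrative schedule
alphas_bar = torch.cumprod(1 - betas, dim=0)

def diffusion_loss(model, x0, cond):
    """x0: (B, T) clean waveform; model(x_t, t, cond) predicts the noise."""
    t = torch.randint(0, T_STEPS, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward diffusion sample
    return ((model(x_t, t, cond) - noise) ** 2).mean()
```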

Reddy et al., Microsoft; non-intrusive automatic MOS for noise-suppressed speech; standard in speech enhancement

Chandan K A Reddy, Vishak Gopal, Ross Cutler · landmark
Human subjective evaluation is the gold standard to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. The conventional and widely used metrics require a reference clean speech signal, which is unavailable ...

2019

Ren et al., Microsoft; non-autoregressive TTS; 270x speedup over autoregressive models

Yi Ren, Yangjun Ruan, Xu Tan ... · landmark

Kumar et al.; GAN-based real-time vocoder; orders of magnitude faster than WaveNet

Kundan Kumar, Rithesh Kumar, Thibault de Boissiere ... · landmark
Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of arch...

Schneider et al., Meta; first contrastive self-supervised learning for speech; precursor to wav2vec 2.0

Steffen Schneider, Alexei Baevski, Ronan Collobert ... · landmark

Kilgour et al., Google; audio equivalent of FID; standard for evaluating audio/music generation quality

Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek ... · landmark
We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide ...
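
FAD reduces to the Fréchet distance between two Gaussians fitted to embedding sets — VGGish embeddings in the paper, though any embedding model can stand in. A numpy/scipy sketch of the core computation:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a, emb_b):
    """emb_a, emb_b: (N, D) embedding matrices from the two audio sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b).real     # matrix square root
    diff = mu_a - mu_b
    # ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})
    return diff @ diff + np.trace(cov_a + cov_b - 2 * covmean)
```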

Défossez et al., Meta; waveform-domain music source separation; became the open-source standard

Alexandre Défossez, Nicolas Usunier, Léon Bottou ... · landmark
Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. Contrarily to many audio synthesis tasks...

2018

Prenger et al., NVIDIA; normalizing flow vocoder; real-time, high-quality synthesis without autoregression

Ryan Prenger, Rafael Valle, Bryan Catanzaro · landmark
In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. Wa...

Huang et al., Google Brain; relative attention for long-range music structure; enabled coherent MIDI generation

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit ... · landmark
Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on ...

Wan et al., Google; generalized end-to-end loss for speaker embeddings; standard speaker verification approach

Li Wan, Quan Wang, Alan Papir ... · landmark
In this paper, we propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (TE2E) loss function. Unlike TE2E, the GE2E loss function updates the network i...
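
The GE2E loss scores each utterance embedding against per-speaker centroids and pulls it toward its own speaker's centroid. A simplified sketch: the paper learns the scale w and bias b and excludes each utterance from its own centroid, both omitted here.

```python
import torch
import torch.nn.functional as F

def ge2e_loss(emb, w=10.0, b=-5.0):
    """emb: (N_speakers, M_utterances, D), assumed L2-normalized."""
    centroids = F.normalize(emb.mean(dim=1), dim=-1)           # (N, D)
    sim = w * torch.einsum('nmd,kd->nmk', emb, centroids) + b  # (N, M, N)
    n, m = emb.shape[0], emb.shape[1]
    labels = torch.arange(n).repeat_interleave(m)              # true speaker ids
    return F.cross_entropy(sim.flatten(0, 1), labels)

loss = ge2e_loss(F.normalize(torch.randn(4, 5, 256), dim=-1))
```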

Jia et al., Google; speaker-conditioned TTS using d-vectors; generalized multi-speaker TTS

Ye Jia, Yu Zhang, Ron J. Weiss ... · landmark
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder...

2017

Wang et al., Google; seq2seq TTS from text to mel-spectrogram; replaced pipeline TTS

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton ... · landmark
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, w...
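
Tacotron-style systems regress log-mel-spectrogram targets from text; a sketch of extracting that target with torchaudio, using typical frame parameters rather than Tacotron's exact configuration.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)
wav = torch.randn(1, 22050)                   # stand-in for a loaded 1 s clip
target = torch.log(mel(wav).clamp(min=1e-5))  # (1, 80, frames) log-mel target
```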

Shen et al., Google; Tacotron 2 combined with WaveNet vocoder; MOS near human quality

Jonathan Shen, Ruoming Pang, Ron J. Weiss ... · landmark
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet m...

2016

van den Oord et al., DeepMind; first autoregressive raw waveform model; defined the field of neural TTS

Aaron van den Oord, Sander Dieleman, Heiga Zen ... · landmark
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently tr...
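
WaveNet's receptive-field trick, dilated causal convolutions, fits in a short sketch: each layer doubles the dilation, so ten layers see about 2^10 past samples. The paper's gated activations and skip connections are omitted.

```python
import torch
import torch.nn as nn

class CausalConv(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation                # left-pad so no future samples leak in
        self.conv = nn.Conv1d(channels, channels, kernel_size=2,
                              dilation=dilation)

    def forward(self, x):                  # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))
        return torch.tanh(self.conv(x))

layers = nn.Sequential(*[CausalConv(64, 2 ** i) for i in range(10)])
y = layers(torch.randn(1, 64, 16000))      # receptive field ≈ 1024 samples
```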

Mehri et al.; hierarchical RNN for raw audio; showed unconditional audio generation is feasible

Soroush Mehri, Kundan Kumar, Ishaan Gulrajani ... · landmark
In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks...

2015

Amodei et al., Baidu; scaled CTC-based ASR; multilingual; near-human on some benchmarks

Dario Amodei, Rishita Anubhai, Eric Battenberg ... · landmark
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a...

Chan et al., Google Brain; attention-based encoder-decoder for ASR; foundational seq2seq approach

William Chan, Navdeep Jaitly, Quoc V. Le ... · landmark
We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. ...

2014

Hannun et al., Baidu; end-to-end deep RNN ASR; first to beat traditional pipelines at scale

Awni Hannun, Carl Case, Jared Casper ... · landmark
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform p...
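
Deep Speech-style training rests on CTC, which aligns variable-length transcripts to acoustic frames without frame-level labels. PyTorch ships an implementation; a sketch with toy shapes (29 classes = 28 characters + blank, all sizes illustrative):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
log_probs = torch.randn(100, 2, 29).log_softmax(-1)  # (time, batch, classes)
targets = torch.randint(1, 29, (2, 20))              # character ids, no blanks
input_lengths = torch.full((2,), 100)                # frames per utterance
target_lengths = torch.full((2,), 20)                # characters per transcript
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```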