Audio ML Papers

Week of January 18 - January 25, 2026

Subcategories: All (46) | Speech Synthesis (5) | Music Synthesis (6) | Ambient Synthesis (0) | Quality Assessment (3) | Enhancement (7) | ASR (9) | Other (16)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 88)
Luca Cerovaz, Michele Mancusi, Emanuele Rodolà · ICASSP 2026
Audio codecs power discrete music generative modelling, music streaming, and immersive media by shrinking PCM audio to bandwidth-friendly bitrates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram domains typically struggle with phase m...
#2 TOP PAPER (Score: 88)
Leying Zhang, Tingxiao Zhou, Haiyang Sun ... · arXiv
While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often u...
#3 TOP PAPER (Score: 84)
Esther Sun, Abinay Reddy Naini, Carlos Busso · ICASSP 2026
Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by paralinguistic information loss during quantization. This paper presents a comprehensive investigation of discre...
Saturday, January 24, 2026
Luca Cerovaz, Michele Mancusi, Emanuele Rodolà · ICASSP 2026
Audio codecs power discrete music generative modelling, music streaming, and immersive media by shrinking PCM audio to bandwidth-friendly bitrates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram domains typically struggle with phase m...
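Background for this entry: the bandwidth saving comes from replacing PCM samples with a few codebook indices per latent frame. A minimal residual vector quantization (RVQ) sketch, with hypothetical codebook shapes, shows the mechanism most neural audio codecs share (not this paper's specific spectral-domain design):

```python
import torch

def rvq_encode(latents, codebooks):
    """Residual VQ: each stage quantizes what the previous stages missed.
    latents: (T, D) latent frames; codebooks: list of (K, D) tensors."""
    residual, codes = latents, []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (T, K) pairwise distances
        idx = dists.argmin(dim=-1)          # nearest code per frame
        residual = residual - cb[idx]       # pass the remainder onward
        codes.append(idx)
    return torch.stack(codes)               # (n_stages, T) integer codes

# Bitrate arithmetic: 8 codebooks of 1024 entries (10 bits each) at
# 75 frames/s is 8 * 10 * 75 = 6 kbps, versus ~706 kbps for 16-bit
# 44.1 kHz mono PCM.
print(8 * 10 * 75)  # 6000 bits/s
```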
Friday, January 23, 2026
Esther Sun, Abinay Reddy Naini, Carlos Busso · ICASSP 2026
Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by paralinguistic information loss during quantization. This paper presents a comprehensive investigation of discre...
Haoxu Wang, Biao Tian, Yiheng Jiang ... · ICASSP 2026
Generative speech enhancement offers a promising alternative to traditional discriminative methods by modeling the distribution of clean speech conditioned on noisy inputs. Post-training alignment via reinforcement learning (RL) effectively aligns generative models with human pre...
David A. Kelly, Hana Chockler · arXiv
It is well known that audio classifiers often rely on non-musically relevant features and spurious correlations to classify audio. Hence, audio classifiers are easy to manipulate or confuse, resulting in incorrect classifications. While inducing a misclassification is not hard, until ...
Ke Xue, Chang Sun, Rongfei Fan ... · arXiv
Mamba, a selective state-space model (SSM), has emerged as an efficient alternative to Transformers for speech modeling, enabling long-sequence processing with linear complexity. While effective in speech separation, existing approaches, whether in the time or time-frequency doma...
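Background for this entry: the linear-complexity claim follows from the recurrent form of an SSM, one state update per timestep rather than attention over all pairs. A minimal diagonal, non-selective scan sketch (Mamba additionally makes the parameters input-dependent):

```python
import torch

def ssm_scan(u, A, B, C):
    """Diagonal linear SSM: x_t = A * x_{t-1} + B * u_t, y_t = <C, x_t>.
    One O(N) update per step gives O(T*N) total, linear in length T."""
    x = torch.zeros_like(A)
    ys = []
    for u_t in u:                  # u: (T,) scalar input sequence
        x = A * x + B * u_t        # elementwise state update, state size N
        ys.append(torch.dot(C, x))
    return torch.stack(ys)

y = ssm_scan(torch.randn(100), torch.rand(16) * 0.9,
             torch.randn(16), torch.randn(16))
```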
Seung Gyu Jeong, Seong-Eun Kim · arXiv
Automated respiratory sound classification supports the diagnosis of pulmonary diseases. However, many deep models still rely on cycle-level analysis and suffer from patient-specific overfitting. We propose PC-MCL (Patient-Consistent Multi-Cycle Learning) to address these limitat...
Ayush Pratap Singh, Harshit Singh, Nityanand Mathur ... · arXiv
Neural text-to-speech (TTS) systems systematically mispronounce low-resource proper nouns, particularly non-English names, brands, and geographic locations, due to their underrepresentation in predominantly English training corpora. Existing solutions typically rely on expensive ...
Jing Hu, Danxiang Zhu, Xianlong Luo ... · arXiv
Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the fa...
Thursday, January 22, 2026
Leying Zhang, Tingxiao Zhou, Haiyang Sun ... · arXiv
While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often u...
Hengfan Zhang, Yueqian Lin, Hai Helen Li ... · arXiv
Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception - generic summaries that miss the subtle evidence required for multi...
Abdul Hannan, Daniele Falavigna, Shah Nawaz ... · ICASSP 2026
Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic on...
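A generic sketch of the layer-dropping idea (names hypothetical, not this paper's method): layers in a residual stack can be skipped at inference because the skip connection preserves shapes, so one trained model serves several compute budgets.

```python
import torch.nn as nn

class DroppableStack(nn.Module):
    """Residual encoder whose layers can be skipped via a boolean keep mask,
    turning one static model into a family of cheaper dynamic variants."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x, keep=None):
        for i, layer in enumerate(self.layers):
            if keep is None or keep[i]:  # drop layer i when keep[i] is False
                x = layer(x)             # residual layers keep x's shape
        return x
```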
Dingdong Wang, Shujie Liu, Tianhua Zhang ... · arXiv
Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This p...
Federico Bruzzone, Walter Cazzola, Matteo Brancaleoni ... · arXiv
Real-time automatic speech recognition systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while maintaining low latency and high accuracy re...
Aref Farhadipour, Jan Marquenie, Srikanth Madikeri ... · ICASSP 2026
The development of robust, multilingual speaker recognition systems is hindered by a lack of large-scale, publicly available and multilingual datasets, particularly for the read-speech style crucial for applications like anti-spoofing. To address this gap, we introduce the TidyVo...
Lalaram Arya, Mrinmoy Bhattacharjee, Adarsh C. R. ... · arXiv
Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another, while reducing error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems continue to ...
Junjie Li, Kong Aik Lee · arXiv
An utterance-level speaker embedding is typically obtained by aggregating a sequence of frame-level representations. However, in real-world scenarios, individual frames encode not only speaker-relevant information but also various nuisance factors. As a result, different frames c...
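The standard remedy this abstract alludes to is to weight frames before aggregation rather than average them uniformly; a minimal attentive-pooling sketch (a generic mechanism, not necessarily this paper's):

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Learn a relevance score per frame so nuisance-dominated frames
    contribute less to the utterance-level speaker embedding."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                        # frames: (T, D)
        w = torch.softmax(self.score(frames), dim=0)  # (T, 1) frame weights
        return (w * frames).sum(dim=0)                # (D,) weighted mean
```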
Prakash Dhungana, Sayed Ahmad Salehi · arXiv
Keyword Spotting (KWS) systems with small footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive framework for continual learn...
Hangrui Hu, Xinfa Zhu, Ting He ... · arXiv
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel vo...
Wednesday, January 21, 2026
Chun-Yi Kuan, Kai-Wei Chang, Hung-yi Lee · arXiv
Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely-adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain li...
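For reference, the embedding-similarity baseline being critiqued can be computed in a few lines; a sketch assuming the Hugging Face `transformers` CLAP port and the `laion/clap-htsat-unfused` checkpoint:

```python
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_score(prompt, audio, sr=48000):
    """Cosine similarity between CLAP text and audio embeddings:
    measures general relevance, not fine-grained faithfulness."""
    text_in = processor(text=[prompt], return_tensors="pt")
    audio_in = processor(audios=[audio], sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        t = model.get_text_features(**text_in)
        a = model.get_audio_features(**audio_in)
    return torch.cosine_similarity(t, a).item()
```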
Lin Zhang, Johan Rohdin, Xin Wang ... · arXiv
The advances in generative AI have enabled the creation of synthetic audio which is perceptually indistinguishable from real, genuine audio. Although this stellar progress enables many positive applications, it also raises risks of misuse, such as for impersonation, disinformatio...
Luca Barbisan, Marco Levorato, Fabrizio Riente · arXiv
Developing algorithms for sound classification, detection, and localization requires large amounts of flexible and realistic audio data, especially when leveraging modern machine learning and beamforming techniques. However, most existing acoustic simulators are tailored for indo...
Steven Vander Eeckt, Hugo Van hamme · ICASSP 2026
Catastrophic forgetting remains a major challenge for continual learning (CL) in automatic speech recognition (ASR), where models must adapt to new domains without losing performance on previously learned conditions. Several CL methods have been proposed for ASR, and, recently, w...
Tobias Raichle, Erfan Amini, Bin Yang · ICASSP 2026
Adapting speech enhancement (SE) models to unseen environments is crucial for practical deployments, yet test-time adaptation (TTA) for SE remains largely under-explored due to a lack of understanding of how SE models degrade under domain shifts. We observe that mask-based SE mod...
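Background on the model family under analysis: mask-based SE predicts a per-bin gain on the noisy spectrogram and reuses the noisy phase. A minimal sketch with a hypothetical `mask_net`:

```python
import torch

def enhance(noisy, mask_net, n_fft=512, hop=128):
    """Mask-based SE: estimate a (0, 1) gain per time-frequency bin,
    scale the noisy STFT, and resynthesize with the noisy phase."""
    win = torch.hann_window(n_fft)
    spec = torch.stft(noisy, n_fft, hop, window=win, return_complex=True)
    mask = mask_net(spec.abs())  # hypothetical network, outputs in (0, 1)
    return torch.istft(spec * mask, n_fft, hop, window=win)
```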
Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen ... · arXiv
Recent advances in text-to-music generation (TTM) have yielded high-quality results, but often at the cost of extensive compute and the use of large proprietary internal data. To improve the affordability and openness of TTM training, an open-source generative model backbone that...
Ju-ho Kim, Youngmoon Jung, Joon-Young Yang ... · ICASSP 2026
Deploying speaker verification on resource-constrained devices remains challenging due to the computational cost of high-capacity models; knowledge distillation (KD) offers a remedy. Classical KD entangles target confidence with non-target structure in a Kullback-Leibler term, li...
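The coupling this abstract refers to lives inside the classical KD loss; a sketch of that single KL term (separating its target and non-target parts follows decoupled-KD-style analyses, not necessarily this paper's formulation):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Classical KD: KL between temperature-softened teacher and student.
    Target-class confidence and non-target structure are entangled in
    this one term, which is what decoupled variants try to separate."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```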
Florian Grötschla, Arunasish Sen, Alessandro Lombardi ... · arXiv
We present VCNAC, a variable channel neural audio codec. Our approach features a single encoder and decoder parametrization that enables native inference for different channel setups, from mono speech to cinematic 5.1 channel surround audio. Channel compatibility objectives ensur...
Hongfu Liu, Zhouying Cui, Xiangming Gu ... · Findings of EACL 2026
Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting its effectiv...
Tuesday, January 20, 2026
Youngmoon Jung, Joon-Young Yang, Ju-ho Kim ... · ICASSP 2026
Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. While existing methods focus on enhancing speaker encoders, the embedding learning strategy still forces a single fixed-dimensional representation reused ...
Fei Yang, Xuanfan Ni, Renyi Yang ... · arXiv
Recent advances in audio-language models have demonstrated remarkable success on short, segment-level speech tasks. However, real-world applications such as meeting transcription, spoken document understanding, and conversational analysis require robust models capable of processi...
Nikita Kuzmin, Songting Liu, Kong Aik Lee ... · ICASSP 2026
Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can ...
Ziqian Wang, Xianjun Xia, Chuanzeng Huang ... · ICASSP 2026
We present S$^2$Voice, the winning system of the Singing Voice Conversion Challenge (SVCC) 2025 for both the in-domain and zero-shot singing style conversion tracks. Built on the strong two-stage Vevo baseline, S$^2$Voice advances style control and robustness through several cont...
Theodore Aptekarev, Vladimir Sokolovsky, Gregory Furman · arXiv
Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in cla...
Lingling Dai, Andong Li, Cheng Chi ... · AAAI 2026
In the field of audio generation, signal-to-noise ratio (SNR) has long served as an objective metric for evaluating audio quality. Nevertheless, recent studies have shown that SNR and its variants are not always highly correlated with human perception, prompting us to raise the q...
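For reference, the metric in question; a minimal definition:

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB: signal energy over residual (reference - estimate) energy.
    High SNR does not guarantee high perceptual quality, per the abstract."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference**2) / (np.sum(noise**2) + 1e-12))
```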
Aafiya Hussain, Gaurav Srivastava, Alvi Ishmam ... · arXiv
Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio...
Monday, January 19, 2026
Bo Ren, Ruchao Fan, Yelong Shen ... · ICASSP 2026
Speech large language models (LLMs) have driven significant progress in end-to-end speech understanding and recognition, yet they continue to struggle with accurately recognizing rare words and domain-specific terminology. This paper presents a novel fine-tuning method, Reinforce...
Hui-Peng Du, Yang Ai, Xiao-Hang Jiang ... · ICASSP 2026
This paper targets a new scenario that integrates speech separation with speech compression, aiming to disentangle multiple speakers while producing discrete representations for efficient transmission or storage, with applications in online meetings and dialogue archiving. To add...
Zining Liang, Runbang Wang, Xuzhou Ye ... · arXiv
Immersive spatial audio has become increasingly critical for applications ranging from AR/VR to home entertainment and automotive sound systems. However, existing generative methods remain constrained to low-dimensional formats such as binaural audio and First-Order Ambisonics (F...
Jihoo Jung, Ji-Hoon Kim, Doyeop Kwak ... · ICASSP 2026
We introduce UNMIXX, a novel framework for multiple singing voice separation (MSVS). While related to speech separation, MSVS faces unique challenges: data scarcity and the highly correlated nature of singing voice mixtures. To address these issues, we propose UNMIXX with three ...
Sunday, January 18, 2026
Pu Wang, Shinji Watanabe, Hugo Van hamme · ICASSP 2026
Parameter-efficient fine-tuning (PEFT) is a scalable approach for adapting large speech foundation models to new domains. While methods such as LoRA and its state-of-the-art variants reduce adaptation costs, they typically allocate parameters uniformly across model subspaces, whi...
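For background, the uniform-allocation issue concerns the rank r that vanilla LoRA adds identically to every adapted matrix; a minimal adapter sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus trainable low-rank update: W x + (alpha/r) B A x.
    Vanilla LoRA uses one rank r everywhere; adaptive variants reallocate it."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the backbone weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```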
Jakob Kienegger, Timo Gerkmann · ICASSP 2026
Latest advances in deep spatial filtering for Ambisonics demonstrate strong performance in stationary multi-speaker scenarios by rotating the sound field toward a target speaker prior to multi-channel enhancement. For applicability in dynamic acoustic conditions with moving speak...
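Background on the rotation step: for first-order Ambisonics, the sound field can be steered with an ordinary rotation of the first-order components, which transform like a Cartesian vector (W is rotation-invariant). A sketch assuming ACN channel order (W, Y, Z, X); sign conventions vary by toolkit:

```python
import numpy as np

def rotate_foa_yaw(foa, theta):
    """Yaw-rotate a first-order Ambisonics signal about the vertical axis.
    foa: (4, T) array in ACN order (W, Y, Z, X); theta in radians."""
    w, y, z, x = foa
    c, s = np.cos(theta), np.sin(theta)
    return np.stack([w, s * x + c * y, z, c * x - s * y])
```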
Sina Khanagha, Bunlong Lay, Timo Gerkmann · ICASSP 2026
Single-channel speech enhancement models face significant performance degradation in extremely noisy environments. While prior work has shown that complementary bone-conducted speech can guide enhancement, effective integration of this noise-immune modality remains a challenge. T...
Linzhi Wu, Xingyu Zhang, Hao Yuan ... · ICASSP 2026
Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To...
Xin Jing, Jiadong Wang, Andreas Triantafyllopoulos ... · ICASSP 2026
The ambiguity of human emotions poses several challenges for machine learning models, as they often overlap and lack clear delineating boundaries. Contrastive language-audio pretraining (CLAP) has emerged as a key technique for generalisable emotion recognition. However, as conve...
Haowei Lou, Hye-young Paik, Wen Hu ... · AAAI 2026 (Main Technical Track)
Learning representative embeddings for different types of speaking styles, such as emotion, age, and gender, is critical for both recognition tasks (e.g., cognitive computing and human-computer interaction) and generative tasks (e.g., style-controllable speech generation). In thi...
Xinhao Mei, Gael Le Lan, Haohe Liu ... · ICASSP 2026
Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on re...