Audio ML Papers

Last 7 Days (April 28 - May 05, 2026)

Subcategories: All (29) | Speech Synthesis (6) | Music Synthesis (6) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (2) | ASR (1) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (14)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 89)
Ismail Rasim Ulgen, Zexin Cai, Nicholas Andrews ... · arXiv
To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy...
#2 TOP PAPER (Score: 83)
Chong-Xin Gan, Peter Bell, Man-Wai Mak ... · arXiv
The joint training of speech enhancement and speaker embedding networks for speaker recognition is widely adopted under noisy acoustic environments. While effective, this paradigm often fails to leverage the generalization and robustness benefits inherent in large-scale speech en...
#3 TOP PAPER (Score: 83)
Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu ... · arXiv
Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- ...
Friday, May 01, 2026
Kuan-Po Huang, Bo-Ru Lu, Byeonggeun Kim ... · arXiv
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling introduce high latency. To address this bottleneck, we propose a one-step sampling framework that combines ...
Zuyao You, Zhesong Yu, Mingyu Liu ... · arXiv
In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and langu...
Venkata Pushpak Teja Menta · arXiv
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, H...
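A minimal sketch of the consistency check this abstract implies: embed same-speaker utterances recorded in two different scripts with an off-the-shelf encoder and compare cosine similarities. The `embed` function is a placeholder for any speaker encoder, not the paper's specific model:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cross_script_consistency(pairs, embed):
    """Mean/std of same-speaker similarity across script pairs.

    `pairs`: iterable of (utterance_script_a, utterance_script_b)
    waveforms from the same speaker; `embed` is any off-the-shelf
    speaker encoder returning a fixed-dimensional vector. A
    script-invariant encoder should score these pairs close to its
    same-script, same-speaker baseline.
    """
    sims = [cosine_sim(embed(a), embed(b)) for a, b in pairs]
    return float(np.mean(sims)), float(np.std(sims))
```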
Kazuya Tateishi, Akira Takahashi, Atsuo Hiroe ... · CVPR 2026 Sight and Sound Workshop
Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straight...
Yawen Qin, Ke Qiu, Qin Zhang · arXiv
Dance serves as both a cultural cornerstone and a medium for personal expression, yet the rapid growth of online dance content has made personalized discovery increasingly difficult. Text-based dance retrieval offers a natural interface for users to search with choreographic inte...
Ziyi Yang, Zhengding Luo, Yisong Zou ... · INTER-NOISE 2026
To address the limitations of existing Generative Fixed-Filter Active Noise Control (GFANC) methods, which rely on filter decomposition and recombination and require supervised learning with labeled data, this paper proposes a Transformer-based End-to-End Control-Filter Generatio...
Thursday, April 30, 2026
Yi Zhu, Brahmi Dwivedi, Jayaram Raghuram ... · ICML 2026
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a ...
Yurii Halychanskyi, Nimet Beyza Bozdag, Mark Hasegawa-Johnson ... · arXiv
Accented automatic speech recognition (ASR) often degrades due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be...
Nitin Choudhury, Nikhil Kumar, Aditya Kumar Sinha ... · International Conference on Multimedia & Expo (ICME) 2026, 7th International Workshop on Surveillance Data Processing
Robocall surveillance research is hindered by limited access to public datasets, owing to privacy concerns. In this work, we first curate Robo-SAr, a synthetic robocall dataset designed for robocall surveillance research. Robo-SAr comprises ~200 unwanted ...
Christiaan M. Geldenhuys, Thomas R. Niesler · arXiv
We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce a...
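The frozen-embedding approach described here is essentially a linear probe. A hedged sketch, where `extract_embedding` stands in for a pretrained acoustic model used as-is (not the authors' specific pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_probe_accuracy(clips, labels, extract_embedding):
    """5-fold CV accuracy of a linear classifier on frozen features.

    `extract_embedding(clip)` wraps a pretrained acoustic model with
    no fine-tuning and returns a fixed-length vector per clip.
    """
    X = np.stack([extract_embedding(c) for c in clips])
    y = np.asarray(labels)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()
```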
Nazar Kozak · IEEE/ACM Transactions on Audio, Speech, and Language Processing
Audio-based stuttering systems to date have been trained for detection -- what disfluency is present now -- leaving prediction, the capability needed for closed-loop intervention, unstudied at deployable scale. We train a 616K-parameter CNN on SEP-28k (Apple, 20,131 three-second ...
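For a sense of scale, a sketch of a small CNN classifier over log-mel spectrograms of 3-second clips. This is illustrative only, not the paper's 616K-parameter architecture; note the prediction framing changes only the label, not the model:

```python
import torch
import torch.nn as nn

class SmallStutterCNN(nn.Module):
    """Illustrative sub-1M-parameter CNN for 3 s audio clips.

    Input: (batch, 1, n_mels, frames) log-mel spectrograms.
    For prediction rather than detection, the target y marks a
    disfluency in the *following* window instead of the current one.
    """
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling over time/freq
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))
```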
Dominik Klement, Alexander Polok, Nguyen Hai Phong ... · HSCMA 2026 Workshop at ICASSP 2026
Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with a large proportion of overlapping speech, where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve...
Wednesday, April 29, 2026
Lara Gauder, Pablo Riera, Andrea Slachevsky ... · arXiv
We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording conditions, a common scenario for health-related datasets. When present both in the tr...
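One common way to surface such spurious correlations (not necessarily the toolkit's method) is to score the dependence between low-level recording characteristics and the target class; clearly nonzero scores flag a potential shortcut:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def spurious_correlation_scan(wavs, srs, labels):
    """Mutual information between recording characteristics and the
    target class. Features here are deliberately content-free:
    duration and RMS level, both proxies for recording conditions.
    """
    feats = np.array([
        [len(w) / sr,                                   # duration (s)
         float(np.sqrt(np.mean(np.asarray(w) ** 2)))]   # RMS level
        for w, sr in zip(wavs, srs)
    ])
    mi = mutual_info_classif(feats, np.asarray(labels))
    return dict(zip(["duration", "rms_level"], mi))
```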
Qituan Shangguan, Junhao Du, Kunyang Peng ... · arXiv
Cross-lingual speaker verification suffers from severe language-speaker entanglement. This causes systematic degradation in the hardest scenario: correctly accepting utterances from the same speaker across different languages while rejecting those from different speakers sharing ...
Mingyu Zhao, Zijian Lin, Kun Wei ... · ICME 2026
Conventional neural speech codecs suffer from severe intelligibility degradation at ultra-low bitrates, where the bottleneck transitions from acoustic distortion to semantic loss. To address this issue, this paper conducts a systematic investigation into the role and fundamental ...
Bo Cheng, Songjun Cao, Xiaoming Zhang ... · arXiv
Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable ...
Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou ... · arXiv
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples....
Tuesday, April 28, 2026
Zhaoyan Pan, Hengyang Zhou, Xiangdong Li ... · arXiv
Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, o...
Chunlei Meng, Jiabin Luo, Pengbin Feng ... · arXiv
Multimodal Sentiment Analysis (MSA) requires integrating language, acoustic, and visual signals without sacrificing modality-specific sentiment evidence. Existing methods mainly improve either shared-private decomposition or cross-modal interaction. Although effective, both ultim...
Amanuel Gizachew Abebe, Yasmin Moslem · International Conference on Spoken Language Translation (IWSLT 2026)
Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system s...
Venkata Pushpak Teja Menta · arXiv
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages,...
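The intelligibility metrics named here are straightforward to compute; a sketch using the `jiwer` library, with the ASR system used to transcribe the synthesized audio left as a placeholder:

```python
import jiwer  # pip install jiwer

def intelligibility_scores(ref_texts, synth_wavs, transcribe):
    """WER/CER of an ASR system's transcripts of synthesized speech.

    `transcribe(wav) -> str` is a stand-in for whichever ASR model
    the evaluation protocol uses.
    """
    hyps = [transcribe(w) for w in synth_wavs]
    return {"wer": jiwer.wer(ref_texts, hyps),   # word error rate
            "cer": jiwer.cer(ref_texts, hyps)}   # character error rate
```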
Xuzheng He, Nan Nan, Zhilin Wang ... · arXiv
Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We prese...
Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee · arXiv
Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in ...
Kexue Wang, Yinfeng Yu, Liejun Wang · International Conference on Intelligent Computing 2026
To establish empathy with machines, it is essential to fully understand human emotional changes. However, research in multimodal emotion recognition often overlooks one problem: individual expressive traits vary significantly, which means that different people may express emotion...
Venkata Pushpak Teja Menta · arXiv
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or...
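A quick way to probe the tokenisation gap mentioned here, assuming a Hugging Face-style tokenizer (this check is ours, not the paper's):

```python
def covers_script(tokenizer, sample: str) -> bool:
    """Crude coverage probe: does encode -> decode round-trip the text?

    Replacement characters or dropped graphemes in the decoded string
    suggest the vocabulary does not cover the script.
    """
    ids = tokenizer.encode(sample, add_special_tokens=False)
    return tokenizer.decode(ids, skip_special_tokens=True) == sample

# e.g. covers_script(tok, "తెలుగు")  # Telugu sample text
```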
Yichen Wang, Charles Patrick Martin · International Conference on New Interfaces for Musical Expression (NIME) 2026
This performance presents a duet between two intelligent musical instruments, Sù (to trace back; to go upstream) and Agentier (playing on agentic clavier), and their human performers, connected through feedback loops. Rather than treating AI as a tool that responds predictably to...