Audio ML Papers

Week of November 02 - November 09, 2025

Subcategories: All (15) | Speech Synthesis (2) | Music Synthesis (5) | Ambient Synthesis (0) | Quality Assessment (2) | Enhancement (0) | ASR (2) | Other (4)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 93)
Jiatong Shi, Jionghao Han, Yichen Lu ... · arXiv
Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language...
#2 TOP PAPER (Score: 84)
Louis Bradshaw, Alexander Spangher, Stella Biderman ... · arXiv
While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an i...
#3 TOP PAPER (Score: 84)
Vladimir Despotovic, Peter Pocta, Andrej Zgank · Biomedical Signal Processing and Control 113 (2026) 109047
Remote monitoring of cardiovascular diseases plays an essential role in early detection of abnormal cardiac function, enabling timely intervention, improved preventive care, and personalized patient treatment. Abnormalities in the heart sounds can be detected automatically via co...
Saturday, November 08, 2025
Haoran Wang, Jiatong Shi, Jinchuan Tian ... · arXiv
Neural audio codecs have recently enabled high-fidelity reconstruction at high compression rates, especially for speech. However, speech and non-speech audio exhibit fundamentally different spectral characteristics: speech energy concentrates in narrow bands around pitch harmonic...
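The spectral contrast this abstract appeals to — harmonic speech energy versus broadband non-speech energy — can be illustrated with spectral flatness, a standard descriptor (not taken from the paper); the signals below are synthetic stand-ins.

```python
import numpy as np

def spectral_flatness(x):
    """Geometric mean over arithmetic mean of the power spectrum:
    close to 1 for noise-like signals, close to 0 for harmonic ones."""
    psd = np.abs(np.fft.rfft(x)) ** 2 + 1e-12  # floor avoids log(0)
    return np.exp(np.mean(np.log(psd))) / np.mean(psd)

sr = 16000
t = np.arange(sr) / sr
# "speech-like" harmonic signal: energy concentrated at multiples of a 120 Hz pitch
harmonic = sum(np.sin(2 * np.pi * 120 * k * t) for k in range(1, 6))
# "non-speech" broadband signal: white noise spreads energy across all bins
noise = np.random.default_rng(0).standard_normal(sr)
```

Here `spectral_flatness(harmonic)` comes out far lower than `spectral_flatness(noise)`, which is the kind of distributional mismatch that motivates codecs treating speech and general audio differently.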
Friday, November 07, 2025
Hardik B. Sailor, Aw Ai Ti, Chen Fang Yih Nancy ... · arXiv
We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discr...
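The hybrid objective named in this abstract can be sketched as below. This is an illustrative NumPy reconstruction, not the paper's implementation: the blending weight `alpha`, the function names, and the treatment of class weights are all assumptions.

```python
import numpy as np

def ccc(pred, target):
    """Concordance Correlation Coefficient between two 1-D arrays (1 = perfect agreement)."""
    pm, tm = pred.mean(), target.mean()
    cov = ((pred - pm) * (target - tm)).mean()
    return 2 * cov / (pred.var() + target.var() + (pm - tm) ** 2)

def hybrid_loss(logits, labels, dim_pred, dim_target, alpha=0.5, class_weights=None):
    """Weighted categorical cross-entropy on discrete emotion classes,
    plus (1 - CCC) on continuous dimensions (e.g. arousal/valence),
    blended by an illustrative weight alpha."""
    # softmax over class logits
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # per-class weights default to uniform
    w = np.ones(logits.shape[1]) if class_weights is None else np.asarray(class_weights)
    ce = -(w[labels] * np.log(probs[np.arange(len(labels)), labels])).mean()
    ccc_loss = 1.0 - ccc(dim_pred, dim_target)
    return alpha * ce + (1 - alpha) * ccc_loss
```

The CCC term rewards predictions that match both the correlation and the scale/location of the targets, which is why it is a common regression objective for dimensional emotion labels alongside classification loss for categorical ones.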
Shubhr Singh, Kiran Bhat, Xavier Riley ... · arXiv
The proliferation of distorted, compressed, and manipulated music on modern media platforms like TikTok motivates the development of more robust audio fingerprinting techniques to identify the sources of musical recordings. In this paper, we develop and evaluate new neural audio ...
Thursday, November 06, 2025
Vladimir Despotovic, Peter Pocta, Andrej Zgank · Biomedical Signal Processing and Control 113 (2026) 109047
Remote monitoring of cardiovascular diseases plays an essential role in early detection of abnormal cardiac function, enabling timely intervention, improved preventive care, and personalized patient treatment. Abnormalities in the heart sounds can be detected automatically via co...
Yutong Wen, Ke Chen, Prem Seetharaman ... · arXiv
Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can achieve higher separation audio quality than traditional masking-based approaches. However, two key limitations restrict their practical use: (1) users often require oper...
Ali Boudaghi, Hadi Zare · arXiv
Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such...
Wednesday, November 05, 2025
Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu · 13th Conference on Speech Technology and Human-Computer Dialogue (SpeD 2025)
In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture--explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcr...
Ilya Borovik, Dmitrii Gavrilev, Vladimir Viro · Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland, pp. 10699-10708
Emotions are fundamental to the creation and perception of music performances. However, achieving human-like expression and emotion through machine learning models for performance rendering remains a challenging task. In this work, we present SyMuPe, a novel framework for develop...
Jing Peng, Yi Yang, Xu Li ... · arXiv
Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training,...
Jiyoung Lee, Song Park, Sanghyuk Chun ... · arXiv
This paper proposes VoxStudio, the first unified and end-to-end speech-to-image model that generates expressive images directly from spoken descriptions by jointly aligning linguistic and paralinguistic information. At its core is a speech information bottleneck (SIB) module, whi...
Monday, November 03, 2025
Jiatong Shi, Jionghao Han, Yichen Lu ... · arXiv
Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language...
Louis Bradshaw, Alexander Spangher, Stella Biderman ... · arXiv
While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an i...
Cedric Chan, Jianjing Kuang · arXiv
Prosody is essential for speech technology, shaping comprehension, naturalness, and expressiveness. However, current text-to-speech (TTS) systems still struggle to accurately capture human-like prosodic variation, in part because existing evaluation methods for prosody remain lim...
Siyin Wang, Zengrui Jin, Changli Tang ... · arXiv
In the era of large language models (LLMs) and artificial general intelligence (AGI), computer audition must evolve beyond traditional paradigms to fully leverage the capabilities of foundation models, towards more comprehensive understanding, more natural generation and more hum...