Audio ML Papers

Week of November 16 - November 23, 2025

Subcategories: All (14) | Speech Synthesis (1) | Music Synthesis (4) | Ambient Synthesis (2) | Quality Assessment (0) | Enhancement (2) | ASR (3) | Other (2)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 84)
Satvik Dixit, Koichi Saito, Zhi Zhong ... · arXiv
Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligne...
#2 TOP PAPER (Score: 84)
Callie C. Liao, Duoduo Liao, Ellie L. Zhang · IEEE Big Data 2025
Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal A...
#3 TOP PAPER (Score: 83)
Jonathan Yaffe, Ben Maman, Meinard Müller ... · arXiv
Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and im...
Saturday, November 22, 2025
Kexin Li, Mandar Chitre · arXiv
Accurate modeling of time-varying underwater acoustic channels is essential for the design, evaluation, and deployment of reliable underwater communication systems. Conventional physics models require detailed environmental knowledge, while stochastic replay methods are constrain...
Friday, November 21, 2025
Callie C. Liao, Duoduo Liao, Ellie L. Zhang · IEEE Big Data 2025
Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal A...
Thursday, November 20, 2025
Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo ... · arXiv
Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval ...
Mohan Shi, Xiong Xiao, Ruchao Fan ... · arXiv
Joint automatic speech recognition (ASR) and speaker diarization aim to answer the question "who spoke what" in multi-speaker scenarios. In this paper, we present an end-to-end speech large language model (Speech-LLM) for Joint strEamable DIarization and aSr (JEDIS-LLM). The mode...
Rui Sang, Yuxuan Liu · arXiv
Voice cloning technology poses significant privacy threats by enabling unauthorized speech synthesis from limited audio samples. Existing defenses based on imperceptible adversarial perturbations are vulnerable to common audio preprocessing such as denoising and compression. We p...
Wednesday, November 19, 2025
Mohit Sharma, Robbe Van Rompaey, Wouter Lanneer ... · IEEE Transactions on Signal Processing, vol. 73, pp. 4155-4169, 2025
This paper addresses the challenges in short-time Fourier transform (STFT) domain subband adaptive filtering, in particular, subband system identification. Previous studies in this area have primarily focused on setups with subband filtering at a downsampled rate, implemented usi...
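For readers unfamiliar with the setup, the core idea of STFT-domain subband system identification can be illustrated with a minimal per-bin NLMS sketch. This is not the paper's method; the window, hop, step size, and toy delayed-attenuation "system" below are all illustrative assumptions.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    # Hann-windowed STFT: rows are frames, columns are frequency bins
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=1)

def subband_nlms(X, D, mu=0.5, eps=1e-8):
    # One complex NLMS tap per frequency bin: adapts W so that
    # W * X[t] tracks D[t] in each subband independently.
    W = np.zeros(X.shape[1], dtype=complex)
    for t in range(X.shape[0]):
        e = D[t] - W * X[t]                                 # per-bin error
        W += mu * np.conj(X[t]) * e / (np.abs(X[t])**2 + eps)
    return W

# Toy "system": a short delay with attenuation, which in the STFT
# domain looks (approximately) like a per-bin complex gain.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
d = np.roll(x, 8) * 0.9
W = subband_nlms(stft(x), stft(d))
# After convergence, |W| should sit near the true gain 0.9
print(np.round(float(np.abs(W[1:20]).mean()), 2))
```

The single-tap-per-bin model is only approximate (frame edge effects leak across bins), which is precisely the kind of limitation work on subband filtering at non-downsampled rates addresses.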
Dorien Herremans, Abhinaba Roy · AAAI-2026 Senior Member Track
Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences due to the specific loss functions they use. This paper advocates for the systematic application of preferenc...
Hokuto Munakata, Takehiro Imamura, Taichi Nishimura ... · arXiv
We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established benchmark with real-world data. The early study of AMR trained the model with solely synthe...
Tuesday, November 18, 2025
Jonathan Yaffe, Ben Maman, Meinard Müller ... · arXiv
Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and im...
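The "strongly aligned training pairs with precise frame-level annotations" mentioned above are typically binary piano-roll targets built from note events. A small illustrative sketch (not this paper's pipeline; the note list and frame rate are assumptions):

```python
import numpy as np

# Hypothetical note events: (onset_s, offset_s, midi_pitch)
notes = [(0.00, 0.50, 60), (0.50, 1.00, 64), (0.75, 1.50, 67)]

def piano_roll(notes, n_frames, fps=100, n_pitches=128):
    # Frame-level target matrix: roll[t, p] = 1 while MIDI pitch p
    # sounds during frame t, at a frame rate of `fps` Hz.
    roll = np.zeros((n_frames, n_pitches), dtype=np.float32)
    for onset, offset, pitch in notes:
        t0 = int(round(onset * fps))
        t1 = int(round(offset * fps))
        roll[t0:min(t1, n_frames), pitch] = 1.0
    return roll

roll = piano_roll(notes, n_frames=150)
print(roll.sum())  # 175.0 active pitch-frames (50 + 50 + 75)
```

Every frame needs a correct label, which is why misaligned or weakly aligned annotations are such a problem for standard frame-wise AMT training.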
Wei Liu, Jiahong Li, Yiwen Shao ... · arXiv
Speech-LLM models have demonstrated great performance in multi-modal and multi-task speech understanding. A typical speech-LLM paradigm is integrating speech modality with a large language model (LLM). While the Whisper encoder was frequently adopted in previous studies for speec...
Xinxin Tang, Bin Qin, Yufang Li · arXiv
Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by...
Monday, November 17, 2025
Satvik Dixit, Koichi Saito, Zhi Zhong ... · arXiv
Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligne...
Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa ... · AAAI 2026
Generative models have shown remarkable performance in speech enhancement (SE), achieving superior perceptual quality over traditional discriminative approaches. However, existing generative SE approaches often overlook the risk of hallucination under severe noise, leading to inc...
Zhe Sun, Yujun Cai, Jiayu Yao ... · arXiv
Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, ...