Audio ML Papers

Last 7 Days (April 08 - April 15, 2026)

Subcategories: All (48) | Speech Synthesis (6) | Music Synthesis (7) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (4) | ASR (9) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (17)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 92)
Yassine El Kheir, Arnab Das, Yixuan Xiao ... · arXiv
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, w...
#2 TOP PAPER (Score: 91)
Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar ... · arXiv
We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (...
#3 TOP PAPER (Score: 90)
Qi Wang, Zhexu Shen, Meng Chen ... · arXiv
Vocal-to-accompaniment (V2A) generation, which aims to transform a raw vocal recording into a fully arranged accompaniment, inherently requires jointly addressing an accompaniment trilemma: preserving acoustic authenticity, maintaining global coherence with the vocal track, and p...
Monday, April 13, 2026
Xi Chen, Wei Xue, Yike Guo · arXiv
Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiti...
Thomas Deppisch · arXiv
Multichannel speech enhancement is widely used as a front-end in microphone array processing systems. While most existing approaches produce a single enhanced signal, direction-preserving multiple-input multiple-output (MIMO) methods instead aim to provide enhanced multichannel s...
Shuiyuan Wang, Zhixian Zhao, Hongfei Yue ... · arXiv
Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a compre...
Tao Feng, Yuxiang Wang, Yuancheng Wang ... · arXiv
Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but targ...
Jialing Wang, Yue Zhao, Yuhao Zhang ... · arXiv
Recent advances in Speech Large Language Models (Speech-LLMs) have greatly enhanced multimodal interaction capabilities. However, their application in low-resource and dialect-diverse environments still faces challenges. The severe scarcity of Tibetan d...
Sunday, April 12, 2026
Zeyue Tian, Binxin Yang, Zhaoyang Liu ... · arXiv
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three...
Matteo Spanio, Ilay Guler, Antonio Rodà · arXiv
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by e...
Toranosuke Manabe, Yuto Shibata, Shinnosuke Takamichi ... · ICPR 2026
Deep learning models have improved sign language-to-text translation and made it easier for non-signers to understand signed messages. When the goal is spoken communication, a naive approach is to convert signed messages into text and then synthesize speech via Text-to-Speech (TT...
Shivam Chauhan, Ajay Pundhir · ICASSP 2026
Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-en...
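For reference, the "mel-scale representations" at issue follow the standard HTK-style formula, mel(f) = 2595 * log10(1 + f/700), which concentrates resolution at low frequencies. A minimal sketch of the scale (the paper's specific filterbank variants are not shown in the abstract):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard HTK mel formula, the 1940s-derived psychoacoustic scale
    the paper argues may encode cultural bias."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place filterbank center frequencies."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Edges of a 10-band mel filterbank over 0-8 kHz: spacing is nearly
# linear below ~1 kHz and logarithmic above.
mels = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 12)
print(mel_to_hz(mels).round(1))
```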
Qian Zhang, Yuqin Cao, Yixuan Gao ... · arXiv
Video-to-Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine-grained requirements of distinct audio categories. ...
Hongwei Xu · arXiv
MeloTune is an iPhone-deployed music agent that instantiates the Mesh Memory Protocol (MMP) and Symbolic-Vector Attention Fusion (SVAF) as a production system for affect-aware music curation with peer-to-peer mood coupling. Each device runs two closed-form continuous-time (CfC) n...
Jielin Qiu, Ming Zhu, Wenting Zhao ... · arXiv
Audio-native large language models (audio-LLMs) commonly use Whisper as their audio encoder. However, Whisper was trained exclusively on speech data, producing weak representations for music and environmental sound. This forces downstream audio-LLMs to compensate through extensiv...
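As a concrete anchor for the setup the abstract describes, audio-LLMs typically keep only Whisper's encoder and hand its frame sequence to the language model. A minimal sketch using the Hugging Face transformers API; the checkpoint size is illustrative, not any specific system's choice:

```python
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Checkpoint is illustrative; audio-LLMs pair various Whisper sizes
# with their language model.
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
model = WhisperModel.from_pretrained("openai/whisper-small")

audio = np.random.randn(16000 * 5).astype(np.float32)  # stand-in for 5 s @ 16 kHz
inputs = fe(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Only the encoder is used: (1, 1500, 768) for whisper-small.
    # The same speech-trained features are produced whether the input
    # is speech, music, or environmental sound -- the mismatch the
    # abstract highlights.
    feats = model.encoder(inputs.input_features).last_hidden_state
print(feats.shape)
```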
Saturday, April 11, 2026
Xingjian Yang, Yudong Yang, Zhixing Guo ... · arXiv
The psychological profile that structurally documents the case of a depression patient is essential for psychotherapy. Large language models can be applied to summarize such profiles from counseling speech; however, they may suffer from long-context forgetting and produce unverifiab...
Hangbin Yu, Yudong Yang, Rongfeng Su ... · arXiv
Automatic depression detection using speech signals with acoustic and textual modalities is a promising approach for early diagnosis. Depression-related patterns exhibit sparsity in speech: diagnostically relevant features occur in specific segments rather than being uniformly di...
Ori Yonay, Tracy Hammond, Tianbao Yang · arXiv
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate th...
Mariano Fernández Méndez · arXiv
Cross-modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality-spe...
Friday, April 10, 2026
Chunhao Bi, Houqiang Zhong, Zhixin Xu ... · arXiv
Spatial audio is fundamental to immersive virtual experiences, yet synthesizing high-fidelity binaural audio from sparse observations remains a significant challenge. Existing methods typically rely on implicit neural representations conditioned on visual priors, which often stru...
Mintong Kang, Chen Fang, Bo Li · arXiv
Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just "unsafe text spoken aloud": real-world risks can hinge on audio-native harmful sound events, speaker attr...
Wataru Nakata, Yuki Saito, Kazuki Yamauchi ... · arXiv
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuita...
Pengbo Lyu, Xiangyu Zhao, Chengwei Liu ... · arXiv
We propose a generative framework for multi-track music source separation (MSS) that reformulates the task as conditional discrete token generation. Unlike conventional approaches that directly estimate continuous signals in the time or frequency domain, our method combines a Con...
Yunqiang Wang, Hengyuan Na, Di Wu ... · arXiv
Audio large language models (ALLMs) enable rich speech-text interaction, but they also introduce jailbreak vulnerabilities in the audio modality. Existing audio jailbreak methods mainly optimize jailbreak success while overlooking utility preservation, as reflected in transcripti...
Fei Liu, Yang Ai, Hui-Peng Du ... · arXiv
Audio super-resolution aims to recover missing high-frequency details from bandwidth-limited low-resolution audio, thereby improving the naturalness and perceptual quality of the reconstructed signal. However, most existing methods directly operate in the waveform or time-frequen...
Muhammad Usama Saleem, Tejasvi Ravi, Tianyu Xu ... · arXiv
Multimodal music creation requires models that can both generate audio from high-level cues and edit existing mixtures in a targeted manner. Yet most multimodal music systems are built for a single task and a fixed prompting interface, making their conditioning brittle when guida...
Qixuan Huang, Khalid Zaman, Masashi Unoki · arXiv
Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classificati...
Ziwei Li, Lukuang Dong, Saierdaer Yusuyin ... · arXiv
Integrating pretrained speech encoders with large language models (LLMs) is promising for ASR, but performance and data efficiency depend on the speech-language interface. A common choice is a learned projector that maps encoder features into the LLM embedding space, whereas an a...
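A minimal sketch of the learned-projector interface mentioned here, i.e. a small trainable module that maps speech-encoder frames into the LLM embedding space. The dimensions and the frame-stacking downsampler are illustrative assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Learned projector: stacks adjacent encoder frames to shorten the
    sequence, then maps into the LLM embedding space. Dimensions are
    hypothetical, for illustration only."""
    def __init__(self, enc_dim=1024, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):                      # x: (B, T, enc_dim)
        B, T, D = x.shape
        T = T - (T % self.stack)               # drop remainder frames
        x = x[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.proj(x)                    # (B, T/stack, llm_dim)

# The projected frames are then concatenated with text token embeddings
# before the LLM forward pass.
proj = SpeechProjector()
print(proj(torch.randn(2, 100, 1024)).shape)   # torch.Size([2, 25, 4096])
```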
Jian Zhu, Jianwei Cui, Shihao Chen ... · arXiv
We present AccompGen, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, AccompGen produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innova...
Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi · arXiv
Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classific...
Zihe Wei, Yuezun Li · arXiv
Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. W...
Thursday, April 09, 2026
Yingjie Yu, Mingyuan Wu, Ahmadreza Eslaminia ... · arXiv
QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific ...
Yuankun Xie, Haonan Cheng, Jiayi Zhou ... · ACM Multimedia 2026
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content produ...
Xiaosu Su, Zihan Sun, Peilei Jia ...
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance gene...
Chanhyuk Choi, Taesoo Kim, Donggyu Lee ... · arXiv
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive fl...
Gabriel Dubus, Théau d'Audiffret, Claire Auger ...
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce Deep...
Zhicheng Ouyang, Seong-Gyun Leem, Bach Viet Do ... · arXiv
Conversational AI has made significant progress, yet generating expressive and controllable text-to-speech (TTS) remains challenging. Specifically, controlling fine-grained voice styles and emotions is notoriously difficult and typically requires massive amounts of heavily annota...
Matthew Maciejewski, Samuele Cornell
Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is...
Jing Peng, Chenghao Wang, Yi Yang ...
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcript...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and over...
Linge Wang, Yingying Chen, Bingke Zhu ... · arXiv
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches desi...
Hanif Rahman · arXiv
Word error rate (WER) is the dominant metric for automatic speech recognition, yet it cannot detect a systematic failure mode: models that produce fluent output in the wrong writing system. We define Script Fidelity Rate (SFR), the fraction of hypothesis characters in the target ...
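The definition is truncated above; assuming it completes as "the fraction of hypothesis characters in the target script," a rough sketch of such a metric using Unicode character names (the paper's exact character filtering and normalization may differ):

```python
import unicodedata

def script_fidelity_rate(hypothesis, target_script="CYRILLIC"):
    """Sketch of a script-fidelity metric: the share of alphabetic
    hypothesis characters whose Unicode name indicates the target
    script. An approximation of the truncated definition above."""
    chars = [c for c in hypothesis if c.isalpha()]
    if not chars:
        return 0.0
    in_script = sum(1 for c in chars
                    if target_script in unicodedata.name(c, ""))
    return in_script / len(chars)

print(script_fidelity_rate("привет мир", "CYRILLIC"))  # 1.0
print(script_fidelity_rate("privet mir", "CYRILLIC"))  # 0.0: fluent, wrong script
```

The second case is exactly the failure mode WER misses: a romanized hypothesis can have a plausible WER against a romanized reference while being in the wrong writing system entirely.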
Wednesday, April 08, 2026
Yuxuan Wang, Peize He, Xiyan Gui ...
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in th...
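To see why the KV cache dominates long-context audio inference, a back-of-the-envelope calculation; the model shape and the ~25 encoder-frames-per-second rate are hypothetical, chosen only to illustrate the scaling:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys and values (factor 2) are stored for every layer, head, and
    position. The shape below is a hypothetical 7B-class configuration,
    not any specific audio-LLM."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assuming ~25 post-downsampling frames per second, one hour of audio
# occupies ~90,000 positions.
frames = 25 * 3600
gib = kv_cache_bytes(32, 32, 128, frames) / 2**30
print(f"{gib:.1f} GiB")  # ~43.9 GiB in fp16 for this hypothetical config
```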
Zikai Liu, Ziqian Wang, Xingchen Li ...
Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extr...
Ya Zhao, Yinfeng Yu, Liejun Wang
Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods heavily rely on the semantic synchrony of complete labels and static feature stability, hindering low-resource languages from reaching high-resource p...
Tornike Karchkhadze, Shlomo Dubnov
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end handling real-time audio input, buffering, and playb...
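The front-end described here implies a rolling audio context for the generator; a minimal sketch of that buffering step (block size, window length, and sample rate are illustrative, and the actual MAX/MSP-to-model plumbing is not shown):

```python
import numpy as np

class AudioRingBuffer:
    """Rolling context window for live co-performance: incoming audio
    blocks are appended, and the generative model is conditioned on the
    most recent window. Sizes are illustrative, not the paper's."""
    def __init__(self, seconds=8.0, sr=44100):
        self.buf = np.zeros(int(seconds * sr), dtype=np.float32)

    def push(self, block):
        n = len(block)
        self.buf = np.roll(self.buf, -n)   # shift old samples left
        self.buf[-n:] = block              # append the newest block

    def context(self):
        return self.buf.copy()             # snapshot for the generator

rb = AudioRingBuffer()
rb.push(np.random.randn(512).astype(np.float32))  # one audio callback block
ctx = rb.context()  # conditioning window handed to the diffusion model
```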
Ameenudeen P E, Charumathi Narayanan, Sriram Ganapathy
Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the o...
Nursadul Mamun, John H. L. Hansen · ICASSP 2026
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as noise or distortion within the background, dynamically adjusting its attention over time. Inspired by the recen...