Audio ML Papers

Week of January 04 - January 11, 2026

Subcategories: All (34) | Speech Synthesis (12) | Music Synthesis (2) | Ambient Synthesis (3) | Quality Assessment (0) | Enhancement (3) | ASR (5) | Other (9)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 88)
Yusheng Dai, Zehua Chen, Yuxuan Jiang ... · arXiv
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight...
#2 TOP PAPER (Score: 87)
Chunyu Qiang, Jun Wang, Xiaopeng Wang ... · arXiv
Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded gen...
#3 TOP PAPER (Score: 84)
Hao Jiang, Edgar Choueiri · arXiv
A binaural rendering framework for personal sound zones (PSZs) is proposed to enable multiple head-tracked listeners to receive fully independent stereo audio programs. Current PSZ systems typically rely on monophonic rendering and therefore cannot control the left and right ears...
Saturday, January 10, 2026
Hao Jiang, Edgar Choueiri · arXiv
A binaural rendering framework for personal sound zones (PSZs) is proposed to enable multiple head-tracked listeners to receive fully independent stereo audio programs. Current PSZ systems typically rely on monophonic rendering and therefore cannot control the left and right ears...
K. A. Shahriar · arXiv
Audio deepfake detection has become increasingly challenging due to rapid advances in speech synthesis and voice conversion technologies, particularly under channel distortions, replay attacks, and real-world recording conditions. This paper proposes a resolution-aware audio deep...
Linfei Li, Lin Zhang, Zhong Wang ... · AAAI 2025
Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we e...
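As a generic illustration of the coordinate-MLP idea mentioned above (a minimal sketch with an assumed sine-activated MLP, not this paper's architecture), an implicit neural representation of audio maps a time coordinate directly to an amplitude:

    # Minimal sketch: fit a sine-activated coordinate MLP to a 1-D signal,
    # mapping time t in [-1, 1] to a predicted amplitude (toy example only).
    import torch
    import torch.nn as nn

    class SineMLP(nn.Module):
        def __init__(self, hidden=256, layers=4, w0=30.0):
            super().__init__()
            dims = [1] + [hidden] * layers + [1]
            self.w0 = w0
            self.linears = nn.ModuleList(
                nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)
            )

        def forward(self, t):                     # t: (N, 1) time coordinates
            x = t
            for lin in self.linears[:-1]:
                x = torch.sin(self.w0 * lin(x))   # sinusoidal activation
            return self.linears[-1](x)            # predicted amplitude, (N, 1)

    # Overfit one short signal (a synthetic tone stands in for real audio).
    t = torch.linspace(-1, 1, 16000).unsqueeze(1)
    y = torch.sin(2 * torch.pi * 440.0 * t)
    model = SineMLP()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(200):
        opt.zero_grad()
        loss = ((model(t) - y) ** 2).mean()
        loss.backward()
        opt.step()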
Friday, January 09, 2026
Bang Zeng, Beilong Tang, Wang Xiang ... · arXiv
Target speaker extraction (TSE) aims to recover the speech signal of a desired speaker from a mixed audio recording, given a short enrollment utterance. Most existing TSE approaches are based on discriminative modeling paradigms. Although effective at suppressing interfering spea...
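For context on the task setup (a hedged, generic sketch, not the model proposed above): TSE systems typically condition a separation network on an embedding of the enrollment utterance and predict a mask for the target speaker only.

    # Generic TSE-style conditioning sketch (illustrative only): the mixture
    # encoder is modulated by an enrollment-utterance embedding, and a mask
    # for the target speaker is applied to the mixture spectrogram.
    import torch
    import torch.nn as nn

    class TinyTSE(nn.Module):
        def __init__(self, feat_dim=257, emb_dim=128, hidden=256):
            super().__init__()
            self.spk_enc = nn.GRU(feat_dim, emb_dim, batch_first=True)  # enrollment encoder
            self.mix_enc = nn.GRU(feat_dim, hidden, batch_first=True)   # mixture encoder
            self.fuse = nn.Linear(hidden + emb_dim, hidden)
            self.mask = nn.Linear(hidden, feat_dim)

        def forward(self, mix_spec, enroll_spec):
            # mix_spec: (B, T, F) mixture magnitudes; enroll_spec: (B, T_e, F)
            _, spk = self.spk_enc(enroll_spec)                # (1, B, emb_dim)
            spk = spk[-1].unsqueeze(1).expand(-1, mix_spec.size(1), -1)
            h, _ = self.mix_enc(mix_spec)                     # (B, T, hidden)
            h = torch.relu(self.fuse(torch.cat([h, spk], dim=-1)))
            m = torch.sigmoid(self.mask(h))                   # (B, T, F) mask in [0, 1]
            return m * mix_spec                               # target-speaker estimate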
Zhixian Zhao, Shuiyuan Wang, Guojian Li ... · ICASSP 2026
Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interactions. Achieving truly "human-like" communi...
Thursday, January 08, 2026
Junyang Chen, Yuhang Jia, Hui Wang ... · arXiv
Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems suffer from complex preprocessing pipelines and a reliance on explicit external temporal alignment. Addressing these limitations, we propose CosyEdit, an end-to-e...
Dekun Chen, Xueyao Zhang, Yuancheng Wang ... · arXiv
This study proposes FlexiVoice, a text-to-speech (TTS) synthesis system capable of flexible style control with zero-shot voice cloning. The speaking style is controlled by a natural-language instruction and the voice timbre is provided by a speech reference in a zero-shot manner. F...
Da-Hee Yang, Joon-Hyuk Chang · IEEE Signal Processing Letters
Noise-robust automatic speech recognition (ASR) has been commonly addressed by applying speech enhancement (SE) at the waveform level before recognition. However, speech-level enhancement does not always translate into consistent recognition improvements due to residual distortio...
Kaiwen Luo, Liang Lin, Yibo Zhang ... · arXiv
Although Audio Large Language Models (ALLMs) have witnessed substantial advancements, their long audio understanding capabilities remain unexplored. While a plethora of benchmarks have been proposed for general audio tasks, they predominantly focus on short-form clips, leaving without ...
Dawei Huang, Yongjie Lv, Ruijie Xiong ... · arXiv
Speech Emotion Recognition (SER) systems often assume congruence between vocal emotion and lexical semantics. However, in real-world interactions, acoustic-semantic conflict is common yet overlooked, where the emotion conveyed by tone contradicts the literal meaning of spoken wor...
Xingyuan Li, Mengyue Wu · arXiv
Detecting medical conditions from speech acoustics is fundamentally a weakly-supervised learning problem: a single, often noisy, session-level label must be linked to nuanced patterns within a long, complex audio recording. This task is further hampered by severe data scarcity an...
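A common way to handle such session-level labels is multiple-instance learning; as a hedged, generic illustration (not necessarily this paper's method), segment embeddings can be pooled with learned attention into a single session-level prediction:

    # Attention-based multiple-instance pooling sketch (generic illustration):
    # segment embeddings are weighted by learned attention and pooled into
    # one session-level logit that matches the weak label.
    import torch
    import torch.nn as nn

    class AttnMILHead(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
            self.clf = nn.Linear(dim, 1)

        def forward(self, segs):                       # segs: (B, N, dim) segment embeddings
            w = torch.softmax(self.attn(segs), dim=1)  # (B, N, 1) attention over segments
            pooled = (w * segs).sum(dim=1)             # (B, dim) session embedding
            return self.clf(pooled).squeeze(-1)        # (B,) session-level logit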
Wednesday, January 07, 2026
Haitao Li, Chunxiang Jin, Chenglin Li ... · arXiv
Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which i...
Florian Schmid, Chi Ian Tang, Sanjeel Parekh ... · arXiv
Temporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundar...
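To make the onset/offset idea concrete, here is a minimal decoding sketch (an assumed thresholding heuristic, not the boundary model proposed above) that turns per-frame boundary probabilities into event intervals:

    # Minimal post-processing sketch: pair each onset detection with the next
    # offset detection to form (start, end) events in seconds.
    import numpy as np

    def decode_events(onset_prob, offset_prob, thr=0.5, hop_s=0.02):
        """onset_prob, offset_prob: (T,) arrays in [0, 1]; returns [(start_s, end_s)]."""
        onsets = np.where(onset_prob >= thr)[0]
        offsets = np.where(offset_prob >= thr)[0]
        events, last_end = [], -1
        for on in onsets:
            if on <= last_end:                 # skip onsets inside the previous event
                continue
            later = offsets[offsets > on]
            if later.size == 0:
                break
            off = int(later[0])
            events.append((on * hop_s, off * hop_s))
            last_end = off
        return events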
Yifan Hu, Peiji Yang, Zhisheng Wang ... · arXiv
Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing "wh...
Yunpei Li, Xun Zhou, Jinchao Wang ... · arXiv
In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication ...
Tuesday, January 06, 2026
Yusheng Dai, Zehua Chen, Yuxuan Jiang ... · arXiv
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight...
Zhisheng Zhang, Xiang Li, Yixuan Zhou ... · arXiv
Neural Audio Codecs (NACs) can reduce transmission overhead by performing compact compression and reconstruction, which also aim to bridge the gap between continuous and discrete signals. Existing NACs can be divided into two categories: multi-codebook and single-codebook codecs....
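Multi-codebook codecs typically rely on residual vector quantization (RVQ); as a toy, generic sketch of that idea (not this paper's codec):

    # Toy residual vector quantization (RVQ): each stage quantizes the residual
    # left by the previous stage, so one latent is described by several indices.
    import torch

    def rvq_encode(x, codebooks):
        """x: (B, D) latents; codebooks: list of (K, D) tensors; returns per-stage codes."""
        residual, codes = x, []
        for cb in codebooks:
            d = torch.cdist(residual, cb)      # (B, K) distances to codewords
            idx = d.argmin(dim=1)              # nearest codeword per latent
            codes.append(idx)
            residual = residual - cb[idx]      # pass the remaining error onward
        return codes

    codebooks = [torch.randn(1024, 64) for _ in range(4)]  # 4 codebooks x 1024 entries
    codes = rvq_encode(torch.randn(8, 64), codebooks)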
Mikhail Silaev, Konstantinos Drossos, Tuomas Virtanen · ICASSP 2026 Workshop Proceedings
Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level ...
Yuankun Xie, Xiaoxuan Guo, Jiayi Zhou ... · arXiv
Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore req...
Yuhuan You, Lai Wei, Xihong Wu ... · arXiv
Existing large audio-language models perceive the world as "mono" -- a single stream of audio that ignores the critical spatial dimension ("where") required for universal acoustic scene analysis. To bridge this gap, we first introduce a hierarchical framework for Auditory Scene A...
Mengze Hong, Di Jiang, Zeying Xie ... · arXiv
As audio deepfakes transition from research artifacts to widely available commercial tools, robust biometric authentication faces pressing security threats in high-stakes industries. This paper presents a systematic empirical evaluation of state-of-the-art speaker authentication ...
Guo Yifan, Tian Yao, Suo Hongbin ... · Proc. INTERSPEECH 2023, 4918--4922
With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. Recently, a multi-channel transformer (MCT) has been proposed, which demonstrates the ability of the transformer to model far-fiel...
Yifan Yang, Bing Han, Hui Wang ... · arXiv
Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We prese...
Yishu Lei, Shuwei He, Jing Hu ... · arXiv
Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, it is well-known that acoustic information is intrinsically heterogeneous, entangling attributes such as speech, musi...
Kwok-Ho Ng, Tingting Song, Yongdong WU ... · arXiv
Advanced speech synthesis technologies have enabled highly realistic speech generation, posing security risks that motivate research into audio deepfake detection (ADD). While state space models (SSMs) offer linear complexity, pure causal SSM architectures often struggle with th...
Qifan Liang, Yuansen Liu, Ruixin Wei ... · arXiv
While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. I...
Monday, January 05, 2026
Maryam Abbasihafshejani, AHM Nazmus Sakib, Murtuza Jadliwala · arXiv
The rapid advancement of speech synthesis technologies, including text-to-speech (TTS) and voice conversion (VC), has intensified security and privacy concerns related to voice cloning. Recent defenses attempt to prevent unauthorized cloning by embedding protective perturbations ...
Sunday, January 04, 2026
Chunyu Qiang, Jun Wang, Xiaopeng Wang ... · arXiv
Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded gen...
Peidong Wang, Zhiming Ma, Xin Dai ... · arXiv
Existing fraud detection methods predominantly rely on transcribed text, suffering from ASR errors and missing crucial acoustic cues like vocal tone and environmental context. This limits their effectiveness against complex deceptive strategies. To address these challenges, we fi...
Yujiao Jiang, Qingmin Liao, Zongqing Lu · arXiv
Co-speech gesture generation is a critical area of research aimed at synthesizing speech-synchronized human-like gestures. Existing methods often suffer from issues such as rhythmic inconsistency, motion jitter, foot sliding and limited multi-sampling diversity. In this paper, we...
Qundong Shi, Jie Zhou, Biyuan Lin ... · arXiv
The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces thr...
Zhiyuan Zhao, Lijian Lin, Ye Zhu ... · arXiv
We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ens...
Yong Ren, Jiangyan Yi, Jianhua Tao ... · arXiv
Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods mainly rely on a direct combination of audio-related labels or their diverse rephrasings, making it difficult to handle ...
MOSI.AI: Donghua Yu ... · arXiv
Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further cons...