Audio ML Papers

Week of April 19 - April 26, 2026

Subcategories: All (38) | Speech Synthesis (5) | Music Synthesis (6) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (6) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (18)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Feiyu Zhao, Yiming Chen, Wenhuan Lu ... · ACL 2026
Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain...
#2 TOP PAPER (Score: 91)
Chunyu Qiang, Xiaopeng Wang, Kang Yin ... · ACL 2026 main conference
Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic d...
#3 TOP PAPER (Score: 88)
Aoduo Li, Haoran Lv, Shengmin Li ... · ACM ICMR 2026
High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge...
Saturday, April 25, 2026
Boxiang Wang, Zhengding Luo, Dongyuan Shi ... · arXiv
Directional Selective Fixed-Filter Active Noise Control (D-SFANC) can effectively attenuate noise from different directions by selecting the suitable pre-trained control filter based on the Direction-of-Arrival (DoA) of the current noise. However, this method is weak at tracking ...
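For readers new to SFANC, the selection step the abstract describes can be sketched in a few lines: given a DoA estimate, pick the pre-trained control filter whose training direction is closest, then filter the reference signal. A minimal sketch with a hypothetical filter bank, not the authors' code:

```python
import numpy as np

# Directional fixed-filter selection, illustrative only: choose the
# pre-trained control filter trained nearest to the estimated noise DoA.
rng = np.random.default_rng(0)
filter_bank_doas = np.array([0.0, 45.0, 90.0, 135.0, 180.0])  # degrees (hypothetical)
filter_bank = rng.standard_normal((5, 256))                   # 5 pre-trained FIR filters

def select_control_filter(doa_estimate_deg: float) -> np.ndarray:
    """Return the pre-trained filter whose training DoA is closest."""
    idx = int(np.argmin(np.abs(filter_bank_doas - doa_estimate_deg)))
    return filter_bank[idx]

reference = rng.standard_normal(16000)               # 1 s of reference noise at 16 kHz
w = select_control_filter(doa_estimate_deg=52.0)
anti_noise = np.convolve(reference, w, mode="same")  # control signal to the loudspeaker
```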
Khalid Zaman, Masashi Unoki · arXiv
Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans, ...
Friday, April 24, 2026
Mingchen Shao, Hang Su, Wenjie Tian ... · arXiv
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these...
Haopeng Geng, Longfei Yang, Xi Chen ... · arXiv
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cue...
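For context, the CTC objective the abstract critiques is sequence-level by construction: it marginalizes over all monotonic alignments, so the frame-level timing of a transient error is never directly supervised. A minimal PyTorch sketch with arbitrary shapes:

```python
import torch
import torch.nn as nn

# Standard CTC training step; the loss depends only on the target label
# sequence, not on where in time each phone was realized.
T, N, C, S = 50, 2, 40, 10    # frames, batch, classes (incl. blank), target length
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S))            # index 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(float(loss))
```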
Li Li, Ming Cheng, Weixin Zhu ... · arXiv
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potenti...
Maximilian Wachter, Sebastian Murgul, Michael Heizmann · 5th International Conference on SMART MULTIMEDIA (ICSM), 2025
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this wo...
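The quantization task itself is easy to state; here is a toy illustration (not the paper's model) that snaps performed onsets to the nearest subdivision of a known beat grid:

```python
import numpy as np

# Beat-based rhythm quantization, toy version: given beat times, build a
# subdivision grid and move each performed onset to its nearest grid point.
beats = np.arange(0.0, 8.0, 0.5)   # beat times in seconds at 120 BPM
subdivisions = 4                   # quantize to sixteenth notes

grid = np.arange(beats[0], beats[-1] + 1e-9, (beats[1] - beats[0]) / subdivisions)
onsets = np.array([0.02, 0.49, 0.88, 1.27, 1.63])   # performed onsets with jitter
quantized = grid[np.argmin(np.abs(onsets[:, None] - grid[None, :]), axis=1)]
print(quantized)   # [0.    0.5   0.875 1.25  1.625]
```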
Thursday, April 23, 2026
Chengyou Wang, Hongfei Yue, Guojian Li ... · arXiv
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversatio...
Jialong Mai, Xiaofen Xing, Xiangmin Xu · arXiv
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowled...
Noah Jaffe, John Ashley Burgoyne · arXiv
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer i...
Ignasi Sole · arXiv
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic...
Wednesday, April 22, 2026
Menghe Ma, Siqing Wei, Yuecheng Xing ... · arXiv
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to br...
Zhiyuan Ning, Zhanyong Tang, Xiaojiang Chen ... · arXiv
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in ...
Tong Zhao, Chenghao Zhang, Yutao Zhu ... · arXiv
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on imag...
Nan Xu, Shiheng Li, Shengchao Hou · arXiv
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphoni...
Paul A. Bereuter, Alois Sontacchi · DAGA 2026 (Annual German Conference on Acoustics)
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics exhibit low correlation between metrics and perceptual audio quality ratings from a listening test,...
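The BSS-Eval metrics in question are available in mir_eval; a minimal usage sketch on placeholder signals (a real evaluation would use reference stems):

```python
import numpy as np
import mir_eval  # pip install mir_eval

# Classic BSS-Eval usage: compare estimated sources against references.
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 44100))                # 2 sources, 1 s at 44.1 kHz
estimate = reference + 0.1 * rng.standard_normal((2, 44100))

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimate)
print(sdr)   # signal-to-distortion ratio per source, in dB
```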
Jiaying Meng, Bojie Li · arXiv
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task...
Tuesday, April 21, 2026
Lekai Qian, Haoyu Gu, Jingwei Zhao ... · arXiv
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences...
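The sequence-style tokenization the abstract contrasts against grids and graphs can be illustrated with a toy note-event vocabulary (token names hypothetical):

```python
# Toy sequence tokenization: each note becomes a PITCH token followed by a
# DUR token, a common baseline for feeding symbolic music to language models.
notes = [(60, 1.0), (64, 0.5), (67, 0.5), (72, 2.0)]   # (MIDI pitch, beats)

DUR_BINS = [0.25, 0.5, 1.0, 2.0, 4.0]

def tokenize(notes):
    tokens = []
    for pitch, dur in notes:
        tokens.append(f"PITCH_{pitch}")
        nearest = min(DUR_BINS, key=lambda b: abs(b - dur))
        tokens.append(f"DUR_{nearest}")
    return tokens

print(tokenize(notes))
# ['PITCH_60', 'DUR_1.0', 'PITCH_64', 'DUR_0.5', 'PITCH_67', 'DUR_0.5', 'PITCH_72', 'DUR_2.0']
```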
Hyunjung Joo, GyeongTaek Lee · arXiv
The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous F0 contours to these invariant categories due to variable F0 realizatio...
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan ... · arXiv
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) traini...
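One widely used recipe for unifying offline and streaming modes (shown generically here; the paper's exact training scheme may differ) is to vary the attention chunk size during training, so one encoder serves both full-context and limited-context decoding:

```python
import numpy as np

# Chunk-causal attention mask: each frame attends to everything up to the
# end of its own chunk. Sampling chunk_size per batch trains one model for
# both streaming (small chunks) and offline (chunk_size >= num_frames) use.
def chunk_attention_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """mask[i, j] is True when frame i may attend to frame j."""
    chunk_idx = np.arange(num_frames) // chunk_size
    return chunk_idx[None, :] <= chunk_idx[:, None]

print(chunk_attention_mask(6, 2).astype(int))
```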
Hirotaka Obo, Atsushi Tsuchiya, Tadashi Ebihara ... · arXiv
The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cu...
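The tradeoff the abstract describes follows from two standard formulas: the Johnson noise of the bias resistor, e_n = sqrt(4·kB·T·R), and the corner frequency that resistor forms with the capsule capacitance, f_c = 1/(2πRC). A back-of-envelope sketch with illustrative component values:

```python
import math

# Standard formulas, illustrative values: a larger bias resistor raises the
# noise density but lowers the corner below which that noise reaches the gate.
k_B = 1.380649e-23      # Boltzmann constant, J/K
T = 293.15              # room temperature, K
R = 1e9                 # 1 GOhm bias resistor (typical order for ECMs)
C = 30e-12              # 30 pF capsule capacitance (illustrative)

e_n = math.sqrt(4 * k_B * T * R)   # V / sqrt(Hz)
f_c = 1 / (2 * math.pi * R * C)    # Hz

print(f"noise density: {e_n*1e6:.1f} uV/sqrt(Hz), corner: {f_c:.2f} Hz")
```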
Shuhai Peng, Hui Lu, Jinjiang Liu ... · arXiv
While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation du...
Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan ... · ACL 2026
The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the resea...
Jianbo Ma, Richard Cartwright · arXiv
Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of...
Monday, April 20, 2026
Deshui Miao, Yameng Gu, Chao Yang ... · arXiv
This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the propose...
Xiang He, Chenxing Li, Jinting Wang ... · arXiv
Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (...
Mason Wang, Cheng-Zhi Anna Huang · arXiv
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking laten...
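A toy analogue of the masking idea (not the authors' implementation): take an FFT along the time axis of a latent sequence, zero out the fast-varying bins, and invert to keep only long-timescale structure:

```python
import numpy as np

# Frequency-domain masking over latent frames: low temporal-frequency bins
# capture slow musical structure; high bins capture fast local detail.
rng = np.random.default_rng(0)
latents = rng.standard_normal((256, 64))   # (time frames, latent dim)

spectrum = np.fft.rfft(latents, axis=0)    # per-dimension temporal spectrum
cutoff_bin = 8                             # keep only the slowest components
spectrum[cutoff_bin:] = 0.0
slow_latents = np.fft.irfft(spectrum, n=latents.shape[0], axis=0)
# slow_latents would then be decoded back to audio by the autoencoder
```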
Ho-Lam Chung, Yiming Chen, Hung-yi Lee · arXiv
Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model...
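The tokenization step at issue, in its simplest single-codebook form (real neural codecs stack residual quantizers): each encoder frame maps to the index of its nearest codebook vector, an objective driven purely by reconstruction rather than by LM predictability:

```python
import numpy as np

# Nearest-neighbor vector quantization: the discrete ids are what a spoken
# language model is later trained to predict autoregressively.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 64))   # 256 codes, 64-dim
frames = rng.standard_normal((50, 64))      # 50 encoder frames

dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)               # discrete token ids fed to the LM
reconstruction = codebook[tokens]           # what the decoder sees
```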
Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak · ACL Findings (2026)
Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often la...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, lea...
Hao Meng, Siyuan Zheng, Shuran Zhou ... · IEEE ICASSP 2026
Large Language Models (LLMs) show promise in lyric-to-melody generation, but models trained with Supervised Fine-Tuning (SFT) often produce musically implausible melodies with issues like poor rhythm and unsuitable vocal ranges, a phenomenon we term "constraint violation". To add...
HaeJun Yoo, Yongseop Shin, Insung Lee ... · ACL 2026 Main Conference
Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment o...
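The CLAP retrieval setup being benchmarked reduces to cosine similarity in a shared embedding space; a generic sketch with placeholder embeddings:

```python
import numpy as np

# Contrastive retrieval step: rank audio clips by cosine similarity of their
# embeddings to an embedded text query.
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal((1000, 512))   # pre-computed clip embeddings
query_emb = rng.standard_normal(512)           # embedded text query

audio_emb /= np.linalg.norm(audio_emb, axis=1, keepdims=True)
query_emb /= np.linalg.norm(query_emb)

scores = audio_emb @ query_emb                 # cosine similarities
top10 = np.argsort(scores)[::-1][:10]          # retrieved clip indices
```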
Sunday, April 19, 2026
Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj ... · arXiv
Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models achieve audiovisual alignment by typically relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this pape...
Girish, Mohd Mujtaba Akhtar, Muskaan Singh · ACL 2026 (main)
In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal...
Yi-Cheng Lin, Yusuke Hirota, Sung-Feng Huang ... · arXiv
Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both offering a fragmented view of fairnes...
Mohd Mujtaba Akhtar, Girish, Muskaan Singh · ACL 2026
In this study, we present Healthcare Codec-Fake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We intentionally focus on codec based synthetic speech in this work, since neural codec decoding forms a core building block in modern spee...