Computation and Language 97
☆ Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen
Diffusion large language models (dLLMs) generate text through iterative
denoising, yet current decoding strategies discard rich intermediate
predictions in favor of the final output. Our work here reveals a critical
phenomenon, temporal oscillation, where correct answers often emerge at
intermediate denoising steps but are overwritten in later steps. To address this
issue, we introduce two complementary methods that exploit temporal
consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time
decoding strategy that aggregates predictions across denoising steps to select
the most consistent output; and 2) a post-training method termed Temporal
Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a
measure of semantic stability across intermediate predictions, as a reward
signal to encourage stable generations. Empirical results across multiple
benchmarks demonstrate the effectiveness of our approach. Using the negative
TSE reward alone, we observe a remarkable average improvement of 24.7% on the
Countdown dataset over an existing dLLM. Combined with the accuracy reward, we
achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and
25.3% on Countdown, respectively. Our findings underscore the untapped
potential of temporal dynamics in dLLMs and offer two simple yet effective
tools to harness them.
comment: Project webpage: https://aim-uofa.github.io/dLLM-MidTruth
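As a rough illustration of the two ideas in the abstract above, the sketch below votes over the answers decoded at each denoising step and computes an entropy over those answers. The function names, the uniform default weights, and the use of exact string match as a stand-in for semantic clustering are assumptions made for illustration; the abstract does not specify the paper's weighting scheme or its TSE definition.

```python
import math
from collections import Counter, defaultdict

def temporal_vote(step_answers, weights=None):
    """Return the answer with the largest total weight across denoising steps."""
    if not step_answers:
        raise ValueError("need at least one intermediate answer")
    if weights is None:
        weights = [1.0] * len(step_answers)  # assumption: uniform weights
    totals = defaultdict(float)
    for answer, w in zip(step_answers, weights):
        totals[answer] += w
    # Ties go to the answer that appeared first (dict insertion order).
    return max(totals, key=totals.get)

def temporal_entropy(step_answers):
    """Entropy of the answer distribution over steps.

    Assumption: identical strings form one semantic cluster. Lower values
    mean the intermediate predictions are more stable.
    """
    counts = Counter(step_answers)
    n = len(step_answers)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())
```

For example, `temporal_vote(["12", "12", "15", "12", "15"])` picks `"12"` even though the final step produced `"15"`, which is the kind of overwritten intermediate answer the abstract describes.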
☆ Complex Logical Instruction Generation
Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song
Instruction following has catalyzed the recent era of Large Language Models
(LLMs) and is the foundational skill underpinning more advanced capabilities
such as reasoning and agentic behaviors. As tasks grow more challenging, the
logic structures embedded in natural language instructions become increasingly
intricate. However, how well LLMs perform on such logic-rich instructions
remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a
scalable, automated framework for generating verifiable instructions from code
functions, which can naturally express rich logic such as conditionals,
nesting, recursion, and function calls. We further curate a collection of
complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark
comprising 426 verifiable logic-rich instructions. Our experiments demonstrate
that current state-of-the-art LLMs still struggle to correctly follow the
instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the
instructions, revealing significant deficiencies in the instruction-following
ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
☆ OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Autonomous agents powered by large language models (LLMs) are increasingly
deployed in real-world applications requiring complex, long-horizon workflows.
However, existing benchmarks predominantly focus on atomic tasks that are
self-contained and independent, failing to capture the long-term contextual
dependencies and multi-interaction coordination required in realistic
scenarios. To address this gap, we introduce OdysseyBench, a comprehensive
benchmark for evaluating LLM agents on long-horizon workflows across diverse
office applications including Word, Excel, PDF, Email, and Calendar. Our
benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks
derived from real-world use cases, and OdysseyBench-Neo with 302 newly
synthesized complex tasks. Each task requires the agent to identify essential
information from long-horizon interaction histories and perform multi-step
reasoning across various applications. To enable scalable benchmark creation,
we propose HomerAgents, a multi-agent framework that automates the generation
of long-horizon workflow benchmarks through systematic environment exploration,
task generation, and dialogue synthesis. Our extensive evaluation demonstrates
that OdysseyBench effectively challenges state-of-the-art LLM agents, providing
more accurate assessment of their capabilities in complex, real-world contexts
compared to existing atomic task benchmarks. We believe that OdysseyBench will
serve as a valuable resource for advancing the development and evaluation of
LLM agents in real-world productivity scenarios. In addition, we release
OdysseyBench and HomerAgents to foster research along this line.
☆ SinLlama - A Large Language Model for Sinhala
H. W. K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur
Low-resource languages such as Sinhala are often overlooked by open-source
Large Language Models (LLMs). In this research, we extend an existing
multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM
tokenizer with Sinhala-specific vocabulary and perform continual pre-training
on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This
is the very first decoder-based open-source LLM with explicit Sinhala support.
When SinLlama was instruction fine-tuned for three text classification tasks,
it outperformed base and instruct variants of Llama-3-8B by a significant
margin.
☆ AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian
Large Language Models (LLMs) have demonstrated remarkable capabilities across
various domains, with code generation emerging as a key area of focus. While
numerous benchmarks have been proposed to evaluate their code generation
abilities, these benchmarks face several critical limitations. First, they
often rely on manual annotations, which are time-consuming and difficult to
scale across different programming languages and problem complexities. Second,
most existing benchmarks focus primarily on Python, while the few multilingual
benchmarks suffer from limited difficulty and uneven language distribution. To
address these challenges, we propose AutoCodeGen, an automated method for
generating high-difficulty multilingual code generation datasets without manual
annotations. AutoCodeGen ensures the correctness and completeness of test cases
by generating test inputs with LLMs and obtaining test outputs through a
multilingual sandbox, while achieving high data quality through reverse-order
problem generation and multiple filtering steps. Using this novel method, we
introduce AutoCodeBench, a large-scale code generation benchmark comprising
3,920 problems evenly distributed across 20 programming languages. It is
specifically designed to evaluate LLMs on challenging, diverse, and practical
multilingual tasks. We evaluate over 30 leading open-source and proprietary
LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The
results show that even the most advanced LLMs struggle with the complexity,
diversity, and multilingual nature of these tasks. In addition, we introduce
AutoCodeBench-Complete, specifically designed for base models to assess their
few-shot code generation capabilities. We hope the AutoCodeBench series will
serve as a valuable resource and inspire the community to focus on more
challenging and practical multilingual code generation scenarios.
comment: Homepage: https://autocodebench.github.io/
☆ Link Prediction for Event Logs in the Process Industry
Knowledge management (KM) is vital in the process industry for optimizing
operations, ensuring safety, and enabling continuous improvement through
effective use of operational data and past insights. A key challenge in this
domain is the fragmented nature of event logs in shift books, where related
records, e.g., entries documenting issues related to equipment or processes and
the corresponding solutions, may remain disconnected. This fragmentation
hinders the recommendation of previous solutions to the users. To address this
problem, we investigate record linking (RL) as link prediction, commonly
studied in graph-based machine learning, by framing it as a cross-document
coreference resolution (CDCR) task enhanced with natural language inference
(NLI) and semantic text similarity (STS) by shifting it into the causal
inference (CI). We adapt CDCR, traditionally applied in the news domain, into
an RL model to operate at the passage level, similar to NLI and STS, while
accommodating the process industry's specific text formats, which contain
unstructured text and structured record attributes. Our RL model outperformed
the best versions of NLI- and STS-driven baselines by 28% (11.43 points) and
27% (11.21 points), respectively. Our work demonstrates how domain adaptation
of the state-of-the-art CDCR models, enhanced with reasoning capabilities, can
be effectively tailored to the process industry, improving data quality and
connectivity in shift logs.
☆ Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages
Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan
Large Language Models (LLMs) excel in English, but their performance degrades
significantly on low-resource languages (LRLs) due to English-centric training.
While methods like LangBridge align LLMs with multilingual encoders such as the
Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically
use only the final encoder layer. We propose a novel architecture that fuses
all intermediate layers, enriching the linguistic information passed to the
LLM. Our approach features two strategies: (1) a Global Softmax weighting for
overall layer importance, and (2) a Transformer Softmax model that learns
token-specific weights. The fused representations are mapped into the LLM's
embedding space, enabling it to process multilingual inputs. The model is
trained only on English data, without using any parallel or multilingual data.
Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews,
our Transformer Softmax model significantly outperforms the LangBridge
baseline. We observe strong performance gains in LRLs, improving Sinhala
classification accuracy from 71.66% to 75.86% and achieving clear improvements
across Indic languages such as Tamil, Bengali, and Malayalam. These specific
gains contribute to an overall boost in average XNLI accuracy from 70.36% to
71.50%. This approach offers a scalable, data-efficient path toward more
capable and equitable multilingual LLMs.
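The Global Softmax strategy described above can be pictured as a softmax-weighted sum of the encoder's per-layer hidden states. The sketch below does this for a single token in plain Python; the function names and the representation of hidden states as lists of floats are assumptions for illustration, and the learned projection into the LLM's embedding space is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_states, layer_logits):
    """Fuse one token's hidden states from all encoder layers.

    layer_states: list of L vectors (lists of floats, same dimension).
    layer_logits: list of L scalars, one learnable score per layer
                  (hypothetical parameters shared by all tokens).
    """
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, layer_states))
        for d in range(dim)
    ]
```

With equal logits every layer contributes equally, so `fuse_layers([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])` gives the plain average of the two layers. A token-specific variant would compute a separate weight vector per token instead of sharing one.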
☆ CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization
Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in
tasks with objectively verifiable answers (e.g., code generation, mathematical
reasoning), yet struggles with open-ended subjective tasks like role-playing
dialogue. Traditional reward modeling approaches, which rely on independent
sample-wise scoring, face dual challenges: subjective evaluation criteria and
unstable reward signals. Motivated by the insight that human evaluation
inherently combines explicit criteria with implicit comparative judgments, we
propose Comparative Policy Optimization (CPO). CPO redefines the reward
evaluation paradigm by shifting from sample-wise scoring to comparative
group-wise scoring. Building on the same principle, we introduce the
CharacterArena evaluation framework, which comprises two stages: (1)
Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level
Comparative Evaluation. By operationalizing subjective scoring via objective
trajectory comparisons, CharacterArena minimizes contextual bias and enables
more robust and fair performance evaluation. Empirical results on
CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively
mitigates reward ambiguity and leads to substantial improvements in dialogue
quality.
☆ READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Sultan Isali, Vasily Kalugin, Stanislav Ilyushin, Nuriza Aitassova, Yi Fei, Zeng Weidi
Large Language Models (LLMs) generate tokens autoregressively, with each
token depending on the preceding context. This sequential nature makes the
inference process inherently difficult to accelerate, posing a significant
challenge for efficient deployment. In recent years, various methods have been
proposed to address this issue, with the most effective approaches often
involving the training of additional draft models. In this paper, we introduce
READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel
lossless speculative decoding method that enhances model-based approaches by
leveraging self-repetitions in the text. Our algorithm expands the speculative
decoding tree using tokens obtained through statistical search. This work
focuses on large batch sizes (>= 8), an underexplored yet important area for
industrial applications. We also analyze the key-value (KV) cache size during
speculative decoding and propose an optimization to improve performance for
large batches. As a result, READER outperforms existing speculative decoding
methods. Notably, READER requires no additional training and can reuse
pre-trained speculator models, increasing the speedup by over 40%. Our method
demonstrates particularly strong performance on search-based tasks, such as
retrieval-augmented generation, where we achieve more than 10x speedup.
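To make the notion of drafting from self-repetitions concrete, here is a deliberately simplified sketch: it looks for the most recent earlier occurrence of the last few tokens and proposes the tokens that followed them as a draft. This is a plain n-gram lookup, not READER's tree expansion or statistical search, and the function name and parameters are hypothetical. In a lossless speculative decoding setup the target model would still have to verify every drafted token.

```python
def draft_from_context(tokens, ngram=2, max_draft=4):
    """Propose draft tokens by matching the current suffix earlier in the text.

    tokens: list of token ids generated or read so far.
    Returns up to max_draft tokens that followed the most recent earlier
    occurrence of the last `ngram` tokens, or [] if there is no match.
    """
    if len(tokens) < ngram:
        return []
    suffix = tokens[-ngram:]
    # Scan backwards, skipping the suffix's own position at the end.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == suffix:
            return tokens[start + ngram:start + ngram + max_draft]
    return []
```

For instance, `draft_from_context([1, 2, 3, 4, 1, 2], ngram=2, max_draft=2)` proposes `[3, 4]`, because `[1, 2]` was followed by `[3, 4]` earlier in the sequence.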
☆ MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions ACM MM 2025
Given the significant advances in Large Vision Language Models (LVLMs) in
reasoning and visual understanding, mobile agents are rapidly emerging to meet
users' automation needs. However, existing evaluation benchmarks are
disconnected from the real world and fail to adequately address the diverse and
complex requirements of users. From our extensive collection of user
questionnaires, we identified five tasks: Multi-App, Vague, Interactive,
Single-App, and Unethical Instructions. Around these tasks, we present
MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137
mobile applications. Furthermore, we propose Aider, a plug-and-play module that
acts as a dynamic prompt prompter to mitigate risks and clarify user intent for
mobile agents. Our Aider is easy to integrate into several frameworks and has
successfully improved overall success rates by 19.55% compared to the current
state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate
improvements of 53.52% and 29.41% for unethical and interactive instructions,
respectively. Through extensive experiments and analysis, we highlight the gap
between existing mobile agents and real-world user expectations.
comment: ACM MM 2025
☆ LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback
Chen Xu, Zhenyu Lv, Tian Lan, Xianyang Wang, Luyao Ji, Leyang Cui, Minqiang Yang, Jian Shen, Qunxi Dong, Xiuling Liu, Juan Wang, Bin Hu
Although large language models (LLMs) hold significant promise in
psychotherapy, their direct application in patient-facing scenarios raises
ethical and safety concerns. Therefore, this work shifts towards developing an
LLM as a supervisor to train real therapists. In addition to the privacy of
clinical therapist training data, a fundamental contradiction complicates the
training of therapeutic behaviors: clear feedback standards are necessary to
ensure a controlled training system, yet there is no absolute "gold standard"
for appropriate therapeutic behaviors in practice. In contrast, many common
therapeutic mistakes are universal and identifiable, making them effective
triggers for targeted feedback that can serve as clearer evidence. Motivated by
this, we create a novel therapist-training paradigm: (1) guidelines for
mistaken behaviors and targeted correction strategies are first established as
standards; (2) a human-in-the-loop dialogue-feedback dataset is then
constructed, where a mistake-prone agent deliberately makes standard mistakes
in a natural manner during interviews, and a supervisor agent locates and identifies
mistakes and provides targeted feedback; (3) after fine-tuning on this dataset,
the final supervisor model is provided for real therapist training. The
detailed experimental results of automated, human and downstream assessments
demonstrate that models fine-tuned on our dataset, MATE, can provide
high-quality feedback according to the clinical guideline, showing significant
potential for the therapist training scenario.
comment: 9 pages, 5 figures
☆ P/D-Device: Disaggregated Large Language Model between Cloud and Devices
Yibo Jin, Yixu Xu, Yue Chen, Chengbin Wang, Tao Wang, Jiaqi Huang, Rongfei Zhang, Yiming Dong, Yuting Yan, Ke Cheng, Yingjie Zhu, Shulan Wang, Qianqian Tang, Shuaishuai Meng, Guanxin Cheng, Ze Wang, Shuyan Miao, Ketao Wang, Wen Liu, Yifan Yang, Tong Zhang, Anran Wang, Chengzhou Lu, Tiantian Dong, Yongsheng Zhang, Zhe Wang, Hefei Guo, Hongjie Liu, Wei Lu, Zhengyong Zhang
Serving disaggregated large language models has been widely adopted in
industrial practice for enhanced performance. However, generating too many
tokens in the decoding phase occupies resources for a long time and
prevents the cloud from achieving higher throughput. Meanwhile, due
to limited on-device resources, the time to first token (TTFT), i.e., the
latency of the prefill phase, increases dramatically as prompt length
grows. To overcome this resource bottleneck, i.e., long
occupation in the cloud and limited on-device computing capacity, we propose to
split the large language model between cloud and devices. That is, the cloud
generates a portion of the content for each device, only during its prefill phase.
Specifically, after receiving the first token from the cloud, decoupled from
its own prefill, the device responds to the user immediately for a lower TTFT.
Then, the following tokens from the cloud are presented via a speed controller for
smoothed TPOT (the time per output token), until the device catches up with the
progress. On-device prefill is then amortized using received tokens while the
resource usage in the cloud is controlled. Moreover, during cloud prefill, the
prompt can be refined, using those intermediate data already generated, to
further speed up on-device inference. We implement this scheme as P/D-Device
and confirm its superiority over other alternatives. We further propose an
algorithm to decide the best settings. Real-trace experiments show that TTFT
decreases by at least 60%, maximum TPOT is on the order of tens of
milliseconds, and cloud throughput increases by up to 15x.
☆ E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence, and Efficiency
Dongjie Xu, Yue Cui, Weijie Shi, Qingzhi Ma, Hanghui Guo, Jiaming Li, Yao Zhao, Ruiyuan Zhang, Shimin Di, Jia Zhu, Kai Zheng, Jiajie Xu
SQL query rewriting aims to reformulate a query into a more efficient form
while preserving equivalence. Most existing methods rely on predefined rewrite
rules. However, such rule-based approaches face fundamental limitations: (1)
fixed rule sets generalize poorly to novel query patterns and struggle with
complex queries; (2) a wide range of effective rewriting strategies cannot be
fully captured by declarative rules. To overcome these issues, we propose using
large language models (LLMs) to generate rewrites. LLMs can capture complex
strategies, such as evaluation reordering and CTE rewriting. Despite this
potential, directly applying LLMs often results in suboptimal or non-equivalent
rewrites due to a lack of execution awareness and semantic grounding. To
address these challenges, we present E3-Rewrite, an LLM-based SQL rewriting
framework that produces executable, equivalent, and efficient queries. It
integrates two core components: a context construction module and a
reinforcement learning framework. First, the context module leverages execution
plans and retrieved demonstrations to build bottleneck-aware prompts that guide
inference-time rewriting. Second, we design a reward function targeting
executability, equivalence, and efficiency, evaluated via syntax checks,
equivalence verification, and cost estimation. Third, to ensure stable
multi-objective learning, we adopt a staged curriculum that first emphasizes
executability and equivalence, then gradually incorporates efficiency.
Extensive experiments show that E3-Rewrite achieves up to a 25.6% reduction in
query execution time compared to state-of-the-art methods across multiple SQL
benchmarks. Moreover, it delivers up to 24.4% more successful rewrites,
expanding coverage to complex queries that previous systems failed to handle.
☆ A Survey on Training-free Alignment of Large Language Models
Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
The alignment of large language models (LLMs) aims to ensure their outputs
adhere to human values, ethical standards, and legal norms. Traditional
alignment methods often rely on resource-intensive fine-tuning (FT), which may
suffer from knowledge degradation and face challenges in scenarios where the
model accessibility or computational resources are constrained. In contrast,
training-free (TF) alignment techniques--leveraging in-context learning,
decoding-time adjustments, and post-generation corrections--offer a promising
alternative by enabling alignment without heavily retraining LLMs, making them
adaptable to both open-source and closed-source environments. This paper
presents the first systematic review of TF alignment methods, categorizing them
by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we
provide a detailed examination from the viewpoint of LLMs and multimodal LLMs
(MLLMs), highlighting their mechanisms and limitations. Furthermore, we
identify key challenges and future directions, paving the way for more
inclusive and effective TF alignment techniques. By synthesizing and organizing
the rapidly growing body of research, this survey offers guidance for
practitioners and advances the development of safer and more reliable LLMs.
☆ LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA SemEval 2025
This paper describes our participation in SemEval 2025 Task 8, focused on
Tabular Question Answering. We developed a zero-shot pipeline that leverages a
Large Language Model to generate functional code capable of extracting the
relevant information from tabular data based on an input question. Our approach
consists of a modular pipeline where the main code generator module is
supported by additional components that identify the most relevant columns and
analyze their data types to improve extraction accuracy. In the event that the
generated code fails, an iterative refinement process is triggered,
incorporating the error feedback into a new generation prompt to enhance
robustness. Our results show that zero-shot code generation is a valid approach
for Tabular QA, achieving rank 33 of 53 in the test phase despite the lack of
task-specific fine-tuning.
comment: Accepted to SemEval 2025. Camera-ready version
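The iterative refinement step described above, where a failure feeds the error message into a new generation prompt, might look roughly like the following. Everything here is a hypothetical sketch: `generate_code` stands in for the LLM call, the generated code is assumed to define a function named `answer(table)`, and the retry prompt wording is invented. A real system would run generated code in a sandbox rather than with a bare `exec`.

```python
def run_with_refinement(generate_code, question, table, max_attempts=3):
    """Generate code, run it, and retry with the error message on failure.

    generate_code: callable taking a prompt string and returning Python
                   source that defines answer(table) (assumed contract).
    """
    prompt = question
    last_error = None
    for _ in range(max_attempts):
        code = generate_code(prompt)
        namespace = {}
        try:
            exec(code, namespace)  # unsafe outside a sandbox
            return namespace["answer"](table)
        except Exception as exc:
            last_error = exc
            # Feed the error back into the next generation prompt.
            prompt = (
                f"{question}\n"
                f"Previous code failed with: {exc!r}\n"
                "Please fix it."
            )
    raise RuntimeError(f"all attempts failed: {last_error!r}")
```

Capping the number of attempts keeps the loop bounded when the generator keeps producing failing code.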
☆ Retrospective Sparse Attention for Efficient Long-Context Generation
Large Language Models (LLMs) are increasingly deployed in long-context tasks
such as reasoning, code generation, and multi-turn dialogue. However, inference
over extended contexts is bottlenecked by the Key-Value (KV) cache, whose
memory footprint grows linearly with sequence length and dominates latency at
each decoding step. While recent KV cache compression methods identify and load
important tokens, they focus predominantly on input contexts and fail to
address the cumulative attention errors that arise during long decoding. In
this paper, we introduce RetroAttention, a novel KV cache update technique that
retrospectively revises past attention outputs using newly arrived KV entries
from subsequent decoding steps. By maintaining a lightweight output cache,
RetroAttention enables past queries to efficiently access more relevant
context, while incurring minimal latency overhead. This breaks the
fixed-attention-output paradigm and allows continual correction of prior
approximations. Extensive experiments on long-generation benchmarks show that
RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression
methods, increasing effective KV exposure by up to 1.6x and accuracy by
up to 21.9%.
☆ Revealing the Role of Audio Channels in ASR Performance Degradation
Pre-trained automatic speech recognition (ASR) models have demonstrated
strong performance on a variety of tasks. However, their performance can
degrade substantially when the input audio comes from different recording
channels. While previous studies have demonstrated this phenomenon, it is often
attributed to the mismatch between training and testing corpora. This study
argues that variations in speech characteristics caused by different recording
channels can fundamentally harm ASR performance. To address this limitation, we
propose a normalization technique designed to mitigate the impact of channel
variation by aligning internal feature representations in the ASR model with
those derived from a clean reference channel. This approach significantly
improves ASR performance on previously unseen channels and languages,
highlighting its ability to generalize across channel and language differences.
comment: Accepted to IEEE ASRU 2025
☆ Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens
Despite their impressive performance, Large Language Models (LLMs) remain
prone to hallucination, which critically undermines their trustworthiness.
While most of the previous work focused on tackling answer and attribution
correctness, a recent line of work investigated faithfulness, with a focus on
leveraging internal model signals to reflect a model's actual decision-making
process while generating the answer. Nevertheless, these methods induce
additional latency and have shown limitations in directly aligning token
generation with attribution generation. In this paper, we introduce LoDIT, a
method that jointly generates and faithfully attributes answers in RAG by
leveraging specific token logits during generation. It consists of two steps:
(1) marking the documents with specific token identifiers and then leveraging
the logits of these tokens to estimate the contribution of each document to the
answer during generation, and (2) aggregating these contributions into document
attributions. Experiments on a trustworthiness-focused attributed
text-generation benchmark, Trust-Align, show that LoDIT significantly
outperforms state-of-the-art models on several metrics. Finally, an in-depth
analysis of LoDIT shows both its efficiency in terms of latency and its
robustness in different settings.
☆ Train Long, Think Short: Curriculum Learning for Efficient Reasoning
Hasan Abed Al Kader Hammoud, Kumail Alhamoud, Abed Hammoud, Elie Bou-Zeid, Marzyeh Ghassemi, Bernard Ghanem
Recent work on enhancing the reasoning abilities of large language models
(LLMs) has introduced explicit length control as a means of constraining
computational cost while preserving accuracy. However, existing approaches rely
on fixed-length training budgets, which do not take advantage of the natural
progression from exploration to compression during learning. In this work, we
propose a curriculum learning strategy for length-controlled reasoning using
Group Relative Policy Optimization (GRPO). Our method starts with generous
token budgets and gradually tightens them over training, encouraging models to
first discover effective solution strategies and then distill them into more
concise reasoning traces. We augment GRPO with a reward function that balances
three signals: task correctness (via verifier feedback), length efficiency, and
formatting adherence (via structural tags). Experiments on GSM8K, MATH500,
SVAMP, College Math, and GSM+ demonstrate that curriculum-based training
consistently outperforms fixed-budget baselines at the same final budget,
achieving higher accuracy and significantly improved token efficiency. We
further ablate the impact of reward weighting and decay schedule design,
showing that progressive constraint serves as a powerful inductive bias for
training efficient reasoning models. Our code and checkpoints are released at:
https://github.com/hammoudhasan/curriculum_grpo.
comment: Under Review
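One way to picture the curriculum described above is a token budget that starts generous and decays in stages, combined with a reward that mixes correctness, length efficiency, and formatting. The sketch below is only an illustration: the exponential schedule, all numeric defaults, the linear over-budget penalty, and the weights are assumptions, not values taken from the paper.

```python
def token_budget(step, start=512, final=128, decay=0.5, interval=100):
    """Budget that shrinks by `decay` every `interval` steps, floored at `final`."""
    return max(final, int(start * decay ** (step // interval)))

def length_controlled_reward(correct, n_tokens, well_formatted, budget,
                             w_correct=1.0, w_length=0.5, w_format=0.25):
    """Weighted sum of the three signals named in the abstract.

    The length term is 1.0 within budget and falls linearly to 0.0 at
    twice the budget (an assumed shape).
    """
    if n_tokens <= budget:
        length_score = 1.0
    else:
        length_score = max(0.0, 1.0 - (n_tokens - budget) / budget)
    return (w_correct * float(correct)
            + w_length * length_score
            + w_format * float(well_formatted))
```

Under these defaults the budget is 512 tokens at step 0, 256 at step 100, and settles at the 128-token floor from step 200 onward, so early training rewards exploration and later training rewards concise traces.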
☆ Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation AACL 2025
Language models have demonstrated remarkable performance on complex
multi-step reasoning tasks. However, their evaluation has been predominantly
confined to high-resource languages such as English. In this paper, we
introduce a manually translated Bangla multi-step reasoning dataset derived
from the English Reveal dataset, featuring both binary and non-binary question
types. We conduct a controlled evaluation of English-centric and Bangla-centric
multilingual small language models on the original dataset and our translated
version to compare their ability to exploit relevant reasoning steps to produce
correct answers. Our results show that, in comparable settings, reasoning
context is beneficial for more challenging non-binary questions, but models
struggle to employ relevant Bangla reasoning steps effectively. We conclude by
exploring how reasoning steps contribute to models' predictions, highlighting
different trends across models and languages.
comment: Submitted to IJCNLP-AACL 2025
☆ Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning
Automatic speech recognition (ASR) plays a vital role in enabling natural
human-machine interaction across applications such as virtual assistants,
industrial automation, customer support, and real-time transcription. However,
developing accurate ASR systems for low-resource languages like Arabic remains
a significant challenge due to limited labeled data and the linguistic
complexity introduced by diverse dialects. In this work, we present a scalable
training pipeline that combines weakly supervised learning with supervised
fine-tuning to develop a robust Arabic ASR model. In the first stage, we
pretrain the model on 15,000 hours of weakly labeled speech covering both
Modern Standard Arabic (MSA) and various Dialectal Arabic (DA) variants. In the
subsequent stage, we perform continual supervised fine-tuning using a mixture
of filtered weakly labeled data and a small, high-quality annotated dataset.
Our approach achieves state-of-the-art results, ranking first in the
multi-dialectal Arabic ASR challenge. These findings highlight the
effectiveness of weak supervision paired with fine-tuning in overcoming data
scarcity and delivering high-quality ASR for low-resource, dialect-rich
languages.
☆ ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
The increasing scale and complexity of large language models (LLMs) pose
significant inference latency challenges, primarily due to their autoregressive
decoding paradigm characterized by the sequential nature of next-token
prediction. By re-examining the outputs of autoregressive models, we observed
that some segments exhibit parallelizable structures, which we term intrinsic
parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel
decoding) can significantly improve the overall inference speed of LLMs. In
this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which
addresses two core challenges: automated construction of parallelizable data
and an efficient parallel decoding mechanism. More specifically, we introduce a
non-invasive pipeline that automatically extracts and validates parallelizable
structures from the responses of autoregressive models. To empower efficient
adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which
enables seamless transitions between serial and parallel decoding modes while
maintaining a reusable KV cache, maximizing computational efficiency. Extensive
evaluations across General Tasks, Retrieval-Augmented Generation, and Mathematical
Reasoning demonstrate that ASPD achieves unprecedented performance in both
effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up
to 3.19x speedup (1.85x on average) while maintaining response quality within
1% difference compared to autoregressive models, realizing significant
acceleration without compromising generation quality. Our framework sets a
groundbreaking benchmark for efficient LLM parallel inference, paving the way
for its deployment in latency-sensitive applications such as AI-powered
customer service bots and answer retrieval engines.
comment: 20 pages, 9 figures
☆ Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, Isabelle Augenstein
The growing deployment of large language models (LLMs) across diverse
cultural contexts necessitates a better understanding of how the
overgeneralization of less documented cultures within LLMs' representations
impacts their cultural understanding. Prior work only performs extrinsic
evaluation of LLMs' cultural competence, without accounting for how LLMs'
internal mechanisms lead to cultural (mis)representation. To bridge this gap,
we propose CultureScope, the first mechanistic interpretability-based method
that probes the internal representations of LLMs to elicit the underlying
cultural knowledge space. CultureScope utilizes a patching method to extract
the cultural knowledge. We introduce a cultural flattening score as a measure
of the intrinsic cultural biases. Additionally, we study how LLMs internalize
Western-dominance bias and cultural flattening, which allows us to trace how
cultural biases emerge within LLMs. Our experimental results reveal that LLMs
encode Western-dominance bias and cultural flattening in their cultural
knowledge space. We find that low-resource cultures are less susceptible to
cultural biases, likely due to their limited training resources. Our work
provides a foundation for future research on mitigating cultural biases and
enhancing LLMs' cultural understanding. Our code and data used for experiments
are publicly available.
comment: 16 pages, 7 figures
☆ Weakly Supervised Fine-grained Span-Level Framework for Chinese Radiology Report Quality Assurance CIKM 2025
Quality Assurance (QA) for radiology reports refers to judging whether the
junior reports (written by junior doctors) are qualified. The QA scores of one
junior report are given by the senior doctor(s) after reviewing the image and
junior report. This process requires intensive labor costs for senior doctors.
Additionally, the QA scores may be inaccurate for reasons like diagnosis bias,
the ability of senior doctors, and so on. To address this issue, we propose a
Span-level Quality Assurance EvaluaTOR (Sqator) to mark QA scores
automatically. Unlike the common document-level semantic comparison method, we
try to analyze the semantic difference by exploring more fine-grained text
spans.
Specifically, Sqator measures QA scores by measuring the importance of revised
spans between junior and senior reports, and outputs the final QA scores by
merging all revised span scores. We evaluate Sqator using a collection of
12,013 radiology reports. Experimental results show that Sqator can achieve
competitive QA scores. Moreover, the importance scores of revised spans are
also consistent with the judgments of senior doctors.
comment: Accepted by CIKM 2025. 11 pages, 7 figures
☆ BiasGym: Fantastic Biases and How to Find (and Remove) Them
Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein
Understanding biases and stereotypes encoded in the weights of Large Language
Models (LLMs) is crucial for developing effective mitigation strategies. Biased
behaviour is often subtle and non-trivial to isolate, even when deliberately
elicited, making systematic analysis and debiasing particularly challenging. To
address this, we introduce BiasGym, a simple, cost-effective, and generalizable
framework for reliably injecting, analyzing, and mitigating conceptual
associations within LLMs. BiasGym consists of two components: BiasInject, which
injects specific biases into the model via token-based fine-tuning while
keeping the model frozen, and BiasScope, which leverages these injected signals
to identify and steer the components responsible for biased behavior. Our
method enables consistent bias elicitation for mechanistic analysis, supports
targeted debiasing without degrading performance on downstream tasks, and
generalizes to biases unseen during training. We demonstrate the effectiveness
of BiasGym in reducing real-world stereotypes (e.g., people from a country
being 'reckless drivers') and in probing fictional associations (e.g., people
from a country having 'blue skin'), showing its utility for both safety
interventions and interpretability research.
comment: Under review
☆ Steering Towards Fairness: Mitigating Political Bias in LLMs
Recent advancements in large language models (LLMs) have enabled their
widespread use across diverse real-world applications. However, concerns remain
about their tendency to encode and reproduce ideological biases, particularly
along political and economic dimensions. In this paper, we propose a framework
for probing and mitigating such biases in decoder-based LLMs through analysis
of internal model representations. Grounded in the Political Compass Test
(PCT), our method uses contrastive pairs to extract and compare hidden layer
activations from models like Mistral and DeepSeek. We introduce a comprehensive
activation extraction pipeline capable of layer-wise analysis across multiple
ideological axes, revealing meaningful disparities linked to political framing.
Our results show that decoder LLMs systematically encode representational bias
across layers, which can be leveraged for effective steering vector-based
mitigation. This work provides new insights into how political bias is encoded
in LLMs and offers a principled approach to debiasing beyond surface-level
output interventions.
comment: Preprint
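The contrastive-pair extraction and steering-vector mitigation described above can be illustrated with a minimal sketch. It assumes the common mean-difference construction (the abstract does not say how the vector is built); the function names, the toy activations, and the scaling factor `alpha` are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(pos_acts, neg_acts):
    """Assumed construction: mean activation of one side of the
    contrastive pairs minus the mean activation of the other side."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]

def steer(hidden, vec, alpha):
    """Shift a hidden state along the steering direction by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

# Toy layer activations for two contrastive pairs (hypothetical values).
pos = [[1.0, 0.0], [3.0, 0.0]]
neg = [[0.0, 1.0], [0.0, 3.0]]
vec = steering_vector(pos, neg)
steered = steer([1.0, 1.0], vec, alpha=-0.5)
```

A negative `alpha` pushes the hidden state away from the direction associated with the first group; in practice the layer and the scale would have to be chosen empirically.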
☆ An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
In this paper, we introduce a systematic framework beyond conventional methods
to assess LLMs' mathematical-reasoning robustness by stress-testing them on
advanced math problems that are mathematically equivalent but with linguistic
and parametric variation. These transformations allow us to measure the
sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more
accurate evaluation of their mathematical reasoning capabilities. Using this
new evaluation methodology, we created PutnamGAP, a new benchmark dataset with
multiple mathematically-equivalent variations of competition-level math
problems. With the new dataset, we evaluate multiple families of representative
LLMs and examine their robustness. Across 18 commercial and open-source models
we observe sharp performance degradation on the variants. OpenAI's flagship
reasoning model, O3, scores 49% on the originals but drops by 4 percentage
points on surface variants, and by 10.5 percentage points on core-step-based
variants, while smaller models fare far worse. Overall, the results show that
the proposed new evaluation methodology is effective for deepening our
understanding of the robustness of LLMs and generating new insights for further
improving their mathematical reasoning capabilities.
comment: 16 pages, 8 figures
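The reported drops are in percentage points, not relative percent, and the two are easy to confuse. The small sketch below works through the arithmetic, assuming the 49% baseline and the 4 and 10.5 point drops quoted in the abstract; the helper names are hypothetical.

```python
def apply_drop(original_pct, drop_points):
    """Accuracy after an absolute drop measured in percentage points."""
    return original_pct - drop_points

def relative_drop(original_pct, drop_points):
    """The same drop expressed as a percentage of the original score."""
    return 100 * drop_points / original_pct

surface = apply_drop(49, 4)        # 45
core_step = apply_drop(49, 10.5)   # 38.5
surface_rel = relative_drop(49, 4)
core_step_rel = relative_drop(49, 10.5)
```

Under these assumptions, the 10.5-point drop corresponds to losing a little over a fifth of the original score in relative terms.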
☆ TiMoE: Time-Aware Mixture of Language Experts
Large language models (LLMs) are typically trained on fixed snapshots of the
web, which means that their knowledge becomes stale and their predictions risk
temporal leakage: relying on information that lies in the future relative to a
query. We tackle this problem by pre-training from scratch a set of GPT-style
experts on disjoint two-year slices of a 2013-2024 corpus and combining them
through TiMoE, a Time-aware Mixture of Language Experts. At inference time,
TiMoE masks all experts whose training window ends after the query timestamp
and merges the remaining log-probabilities in a shared space, guaranteeing
strict causal validity while retaining the breadth of multi-period knowledge.
We also release TSQA, a 10k-question benchmark whose alternatives are
explicitly labelled as past, future or irrelevant, allowing fine-grained
measurement of temporal hallucinations. Experiments on eight standard NLP tasks
plus TSQA show that a co-adapted TiMoE variant matches or exceeds the best
single-period expert and cuts future-knowledge errors by up to 15%. Our results
demonstrate that modular, time-segmented pre-training paired with causal
routing is a simple yet effective path toward LLMs that stay chronologically
grounded without sacrificing much general performance. We open-source our code
at TiMoE (GitHub): https://github.com/epfml/TiMoE
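The inference-time routing described above (mask every expert whose training window ends after the query timestamp, then merge the rest) can be sketched as follows. This is an illustration only: the abstract does not specify the merge rule, so a uniform mixture in probability space is assumed, and the function name and toy numbers are hypothetical.

```python
import math

def timoe_mix(expert_logprobs, window_ends, query_year):
    """Mask experts trained on data after the query, then average the rest.

    expert_logprobs: one list of per-token log-probabilities per expert.
    window_ends: last year covered by each expert's training slice.
    """
    # Causal mask: keep experts whose window ends at or before the query.
    eligible = [lp for lp, end in zip(expert_logprobs, window_ends)
                if end <= query_year]
    if not eligible:
        raise ValueError("no expert is causally valid for this query")
    mixed = []
    for tok in range(len(eligible[0])):
        vals = [lp[tok] for lp in eligible]
        m = max(vals)
        # Assumed merge rule: uniform mixture, computed via log-sum-exp.
        lse = m + math.log(sum(math.exp(v - m) for v in vals))
        mixed.append(lse - math.log(len(vals)))
    return mixed

experts = [
    [math.log(0.5), math.log(0.5)],  # slice ending 2014
    [math.log(0.9), math.log(0.1)],  # slice ending 2016
    [math.log(0.1), math.log(0.9)],  # slice ending 2024, masked below
]
mixed = timoe_mix(experts, [2014, 2016, 2024], query_year=2016)
```

With a 2016 query the 2024 expert contributes nothing, so the output is the average of the first two distributions.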
☆ A Dual-Axis Taxonomy of Knowledge Editing for LLMs: From Mechanisms to Functions
Large language models (LLMs) acquire vast knowledge from large text corpora,
but this information can become outdated or inaccurate. Since retraining is
computationally expensive, knowledge editing offers an efficient alternative --
modifying internal knowledge without full retraining. These methods aim to
update facts precisely while preserving the model's overall capabilities. While
existing surveys focus on the mechanism of editing (e.g., parameter changes vs.
external memory), they often overlook the function of the knowledge being
edited. This survey introduces a novel, complementary function-based taxonomy
to provide a more holistic view. We examine how different mechanisms apply to
various knowledge types -- factual, temporal, conceptual, commonsense, and
social -- highlighting how editing effectiveness depends on the nature of the
target knowledge. By organizing our review along these two axes, we map the
current landscape, outline the strengths and limitations of existing methods,
define the problem formally, survey evaluation tasks and datasets, and conclude
with open challenges and future directions.
comment: 13 pages, 1 figure
☆ Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Xuanjing Huang, Jiecao Chen
Effective tool use is essential for large language models (LLMs) to interact
meaningfully with their environment. However, progress is limited by the lack
of efficient reinforcement learning (RL) frameworks specifically designed for
tool use, due to challenges in constructing stable training environments and
designing verifiable reward mechanisms. To address this, we propose an
automated environment construction pipeline, incorporating scenario
decomposition, document generation, function integration, complexity scaling,
and localized deployment. This enables the creation of high-quality training
environments that provide detailed and measurable feedback without relying on
external tools. Additionally, we introduce a verifiable reward mechanism that
evaluates both the precision of tool use and the completeness of task
execution. When combined with trajectory data collected from the constructed
environments, this mechanism integrates seamlessly with standard RL algorithms
to facilitate feedback-driven model training. Experiments on LLMs of varying
scales demonstrate that our approach significantly enhances the models'
tool-use performance without degrading their general capabilities, regardless
of inference modes or training algorithms. Our analysis suggests that these
gains result from improved context understanding and reasoning, driven by
updates to the lower-layer MLP parameters in models.
☆ Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering
LLMs often suffer from hallucinations and outdated or incomplete knowledge.
RAG is proposed to address these issues by integrating external knowledge like
that in KGs into LLMs. However, leveraging private KGs in RAG systems poses
significant privacy risks due to the black-box nature of LLMs and potential
insecure data transmission, especially when using third-party LLM APIs lacking
transparency and control. In this paper, we investigate the privacy-protected
RAG scenario for the first time, where entities in KGs are anonymous to LLMs,
thus preventing them from accessing entity semantics. Due to the loss of
semantics of entities, previous RAG systems cannot retrieve question-relevant
knowledge from KGs by matching questions with the meaningless identifiers of
anonymous entities. To realize an effective RAG system in this scenario, two
key challenges must be addressed: (1) how to convert anonymous entities into
retrievable information, and (2) how to retrieve question-relevant anonymous
entities. Hence, we propose a novel ARoG framework including relation-centric
abstraction and structure-oriented abstraction strategies. For challenge (1),
the first strategy abstracts entities into high-level concepts by dynamically
capturing the semantics of their adjacent relations. It supplements meaningful
semantics which can further support the retrieval process. For challenge (2),
the second strategy transforms unstructured natural language questions into
structured abstract concept paths. These paths can be more effectively aligned
with the abstracted concepts in KGs, thereby improving retrieval performance.
While guiding LLMs to effectively retrieve knowledge from KGs, the two strategies
strictly prevent private information from being exposed to LLMs. Experiments on three
datasets demonstrate that ARoG achieves strong performance and
privacy-robustness.
☆ Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance
Augmented Reality (AR) systems are increasingly integrating foundation
models, such as Multimodal Large Language Models (MLLMs), to provide more
context-aware and adaptive user experiences. This integration has led to the
development of AR agents to support intelligent, goal-directed interactions in
real-world environments. While current AR agents effectively support immediate
tasks, they struggle with complex multi-step scenarios that require
understanding and leveraging users' long-term experiences and preferences. This
limitation stems from their inability to capture, retain, and reason over
historical user interactions in spatiotemporal contexts. To address these
challenges, we propose a conceptual framework for memory-augmented AR agents
that can provide personalized task assistance by learning from and adapting to
user-specific experiences over time. Our framework consists of four
interconnected modules: (1) Perception Module for multimodal sensor processing,
(2) Memory Module for persistent spatiotemporal experience storage, (3)
Spatiotemporal Reasoning Module for synthesizing past and present contexts, and
(4) Actuator Module for effective AR communication. We further present an
implementation roadmap, a future evaluation strategy, a potential target
application, and use cases to demonstrate the practical applicability of our
framework across diverse domains. We aim for this work to motivate future
research toward developing more intelligent AR systems that can effectively
bridge users' interaction history with adaptive, context-aware task assistance.
comment: 7 pages, 2 figures
☆ DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation
The manual translation of unstructured team dialogue into the structured
artifacts required for Information Technology (IT) project governance is a
critical bottleneck in modern information systems management. We introduce
DevNous, a Large Language Model (LLM)-based multi-agent expert system, to
automate this unstructured-to-structured translation process. DevNous
integrates directly into team chat environments, identifying actionable intents
from informal dialogue and managing stateful, multi-turn workflows for core
administrative tasks like automated task formalization and progress summary
synthesis. To quantitatively evaluate the system, we introduce a new benchmark
of 160 realistic, interactive conversational turns. The dataset was manually
annotated with a multi-label ground truth and is publicly available. On this
benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a
multiset F1-Score of 0.845, providing strong evidence for its viability. The
primary contributions of this work are twofold: (1) a validated architectural
pattern for developing ambient administrative agents, and (2) the introduction
of the first robust empirical baseline and public benchmark dataset for this
challenging problem domain.
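The two reported metrics can be illustrated with a short sketch. The abstract does not define them, so this assumes that each turn carries a multiset of labels, that exact match ignores label order, and that multiset F1 counts overlap with multiplicity; the function names and example labels are hypothetical.

```python
from collections import Counter

def exact_match_accuracy(preds, golds):
    """Fraction of turns whose predicted label multiset equals the gold one."""
    hits = sum(Counter(p) == Counter(g) for p, g in zip(preds, golds))
    return hits / len(golds)

def multiset_f1(pred, gold):
    """F1 over label multisets for a single turn (overlap with multiplicity)."""
    p, g = Counter(pred), Counter(gold)
    overlap = sum((p & g).values())  # Counter & keeps the minimum counts
    if overlap == 0:
        # Two empty label sets count as a perfect match.
        return 1.0 if not p and not g else 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

acc = exact_match_accuracy(
    [["create_task"], ["summary", "create_task"]],
    [["create_task"], ["create_task", "summary"]],
)
f1 = multiset_f1(["a", "a", "b"], ["a", "b", "b"])
```

Whether the paper averages F1 per turn or pools counts over the whole benchmark is not stated, so the per-turn form here is only one possible reading.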
☆ SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs
Haotian Chen, Qingqing Long, Meng Xiao, Xiao Luo, Wei Ju, Chengrui Wang, Xuezhi Wang, Yuanchun Zhou, Hengshu Zhu
Scientific literature question answering is a pivotal step towards new
scientific discoveries. Recently, two-stage retrieval-augmented
generated large language models (RAG-LLMs) have shown impressive advancements
in this domain. Such a two-stage framework, especially the second stage
(reranker), is particularly essential in the scientific domain, where subtle
differences in terminology may have a strongly negative impact on the final
fact-oriented or knowledge-intensive answers. Despite this significant
progress, the potential and limitations of these works remain unexplored. In
this work, we present a Scientific Rerank-oriented RAG Benchmark
(SciRerankBench) for evaluating rerankers within RAG-LLM systems, spanning
five scientific subjects. To rigorously assess the reranker performance in
terms of noise resilience, relevance disambiguation, and factual consistency,
we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy
Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI),
and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely
used rerankers on five families of LLMs, we provide detailed insights into
their relative strengths and limitations. To the best of our knowledge,
SciRerankBench is the first benchmark specifically developed to evaluate
rerankers within RAG-LLMs, which provides valuable observations and guidance
for their future development.
☆ Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation
Medical Lay Language Generation (MLLG) plays a vital role in improving the
accessibility of complex scientific content for broader audiences. Recent
work on MLLG commonly employs parameter-efficient fine-tuning methods such
as Low-Rank Adaptation (LoRA) to fine-tune large language models (LLMs) using
paired expert-lay language datasets. However, LoRA struggles with the
challenges posed by multi-source heterogeneous MLLG datasets. Specifically,
through a series of exploratory experiments, we reveal that standard LoRA fails
to meet the requirement for semantic fidelity and diverse lay-style generation
in the MLLG task. To address these limitations, we propose Magical, an asymmetric
LoRA architecture tailored for MLLG under heterogeneous data scenarios. Magical
employs a shared matrix $A$ for abstractive summarization, along with multiple
isolated matrices $B$ for diverse lay-style generation. To preserve semantic
fidelity during the lay language generation process, Magical introduces a
Semantic Invariance Constraint to mitigate semantic subspace shifts on matrix
$A$. Furthermore, to better adapt to diverse lay-style generation, Magical
incorporates the Recommendation-guided Switch, an external interface to
prompt the LLM to switch between different matrices $B$. Experimental results
on three real-world lay language generation datasets demonstrate that Magical
consistently outperforms prompt-based methods, vanilla LoRA, and its recent
variants, while also reducing trainable parameters by 31.66%.
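The asymmetric layout described above (one shared matrix $A$, several isolated matrices $B$) can be sketched in a few lines. This is a toy illustration under stated assumptions: the update is taken to be the usual LoRA form $Wx + B_k(Ax)$, the switch is modelled as a plain dictionary key, and all names and numbers are hypothetical. The Semantic Invariance Constraint is not modelled.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def asymmetric_lora_forward(W, A, Bs, style, x):
    """Frozen base output plus a low-rank update with a style-specific B."""
    base = matvec(W, x)                 # frozen pretrained weight
    low_rank = matvec(A, x)             # shared down-projection A
    delta = matvec(Bs[style], low_rank) # B selected by the switch
    return [b + d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]            # 2x2 identity as a stand-in
A = [[1.0, 1.0]]                        # rank-1 shared matrix
Bs = {
    "plain": [[1.0], [0.0]],            # one B per lay style (hypothetical)
    "kids": [[0.0], [2.0]],
}
x = [1.0, 2.0]
plain_out = asymmetric_lora_forward(W, A, Bs, "plain", x)
kids_out = asymmetric_lora_forward(W, A, Bs, "kids", x)
```

Because $A$ is shared, adding a new style costs only one extra $B$ matrix, which is consistent with the reported reduction in trainable parameters, although the exact accounting is not given in the abstract.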
☆ IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization
Yuzhuo Bai, Shitong Duan, Muhua Huang, Jing Yao, Zhenghao Liu, Peng Zhang, Tun Lu, Xiaoyuan Yi, Maosong Sun, Xing Xie
Trained on various human-authored corpora, Large Language Models (LLMs) have
demonstrated a certain capability of reflecting specific human-like traits
(e.g., personality or values) by prompting, benefiting applications like
personalized LLMs and social simulations. However, existing methods suffer from
the superficial elicitation problem: LLMs can only be steered to mimic shallow
and unstable stylistic patterns, failing to embody the desired traits precisely
and consistently across diverse tasks as humans do. To address this challenge,
we propose IROTE, a novel in-context method for stable and transferable trait
elicitation. Drawing on psychological theories suggesting that traits are
formed through identity-related reflection, our method automatically generates
and optimizes a textual self-reflection within prompts, which comprises
self-perceived experience, to stimulate LLMs' trait-driven behavior. The
optimization is performed by iteratively maximizing an information-theoretic
objective that enhances the connections between LLMs' behavior and the target
trait, while reducing noisy redundancy in reflection without any fine-tuning,
leading to evocative and compact trait reflection. Extensive experiments across
three human trait systems show that a single IROTE-generated
self-reflection can induce LLMs' stable impersonation of the target trait
across diverse downstream tasks beyond simple questionnaire answering,
consistently outperforming existing strong baselines.
☆ MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs
Generative speech models have demonstrated significant potential in
personalizing teacher-student interactions, offering valuable real-world
applications for language learning in children's education. However, achieving
high-quality, child-friendly speech generation remains challenging,
particularly for low-resource languages across diverse linguistic and cultural
contexts. In this paper, we propose MultiAiTutor, an educational multilingual
generative AI tutor with child-friendly designs, leveraging LLM architecture
for speech generation tailored for educational purposes. We propose to
integrate age-appropriate multilingual speech generation using LLM
architectures, facilitating young children's language learning through
culturally relevant image-description tasks in three low-resource languages:
Singaporean-accent Mandarin, Malay, and Tamil. Experimental results from both
objective metrics and subjective evaluations demonstrate the superior
performance of the proposed MultiAiTutor compared to baseline methods.
comment: 5 figures
☆ A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models
Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu
As text generation has become a core capability of modern Large Language
Models (LLMs), it underpins a wide range of downstream applications. However,
most existing LLMs rely on autoregressive (AR) generation, producing one token
at a time based on previously generated context, resulting in limited generation
speed due to the inherently sequential nature of the process. To address this
challenge, an increasing number of researchers have begun exploring parallel
text generation, a broad class of techniques aimed at breaking the
token-by-token generation bottleneck and improving inference efficiency.
Despite growing interest, there remains a lack of comprehensive analysis on
what specific techniques constitute parallel text generation and how they
improve inference performance. To bridge this gap, we present a systematic
survey of parallel text generation methods. We categorize existing approaches
into AR-based and Non-AR-based paradigms, and provide a detailed examination of
the core techniques within each category. Following this taxonomy, we assess
their theoretical trade-offs in terms of speed, quality, and efficiency, and
examine their potential for combination and comparison with alternative
acceleration strategies. Finally, based on our findings, we highlight recent
advancements, identify open challenges, and outline promising directions for
future research in parallel text generation.
☆ Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults
Bram van Dijk, Tiberon Kuiper, Sirin Aoulad si Ahmed, Armel Levebvre, Jake Johnson, Jan Duin, Simon Mooijaart, Marco Spruit
Voice-controlled interfaces can support older adults in clinical contexts,
with chatbots being a prime example, but reliable Automatic Speech Recognition
(ASR) for underrepresented groups remains a bottleneck. This study evaluates
state-of-the-art ASR models on language use of older Dutch adults, who
interacted with the Welzijn.AI chatbot designed for geriatric contexts. We
benchmark generic multilingual ASR models, and models fine-tuned for Dutch
spoken by older adults, while also considering processing speed. Our results
show that generic multilingual models outperform fine-tuned models, which
suggests recent ASR models can generalise well out of the box to realistic
datasets. Furthermore, our results suggest that truncating existing
architectures is helpful in balancing the accuracy-speed trade-off, though we
also identify some cases with high WER due to hallucinations.
☆ TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
LLMs have been shown to perform well in machine translation (MT) with the use
of in-context learning (ICL), rivaling supervised models when translating into
high-resource languages (HRLs). However, they lag behind when translating into
low-resource languages (LRLs). Example selection via similarity search and
supervised fine-tuning help. However, the improvements they give are limited by
the size, quality and diversity of existing parallel datasets. A common
technique in low-resource MT is synthetic parallel data creation, the most
frequent of which is backtranslation, whereby existing target-side texts are
automatically translated into the source language. However, this assumes the
existence of good quality and relevant target-side texts, which are not readily
available for many LRLs. In this paper, we present TopXGen, an
LLM-based approach for the generation of high-quality and topic-diverse data in
multiple LRLs, which can then be backtranslated to produce useful and diverse
parallel texts for ICL and fine-tuning. Our intuition is that while LLMs
struggle to translate into LRLs, their ability to translate well into HRLs and
their multilinguality enable them to generate good quality, natural-sounding
target-side texts, which can be translated well into a high-resource source
language. We show that TopXGen boosts LLM translation performance
during fine-tuning and in-context learning. Code and outputs are available at
https://github.com/ArmelRandy/topxgen.
☆ $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models IJCAI 2025
Accurate molecular property prediction is a critical challenge with
wide-ranging applications in chemistry, materials science, and drug discovery.
Molecular representation methods, including fingerprints and graph neural
networks (GNNs), achieve state-of-the-art results by effectively deriving
features from molecular structures. However, these methods often overlook
decades of accumulated semantic and contextual knowledge. Recent advancements
in large language models (LLMs) demonstrate remarkable reasoning abilities and
prior knowledge across scientific domains, leading us to hypothesize that LLMs
can generate rich molecular representations when guided to reason from multiple
perspectives. To address these gaps, we propose $\text{M}^{2}$LLM, a multi-view
framework that integrates three perspectives: the molecular structure view, the
molecular task view, and the molecular rules view. These views are fused
dynamically to adapt to task requirements, and experiments demonstrate that
$\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks
across classification and regression tasks. Moreover, we demonstrate that
representations derived from LLMs achieve exceptional performance by leveraging
two core functionalities: the generation of molecular embeddings through their
encoding capabilities and the curation of molecular features through advanced
reasoning processes.
comment: IJCAI 2025
☆ LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement
Transforming unstructured text into structured data is a complex task,
requiring semantic understanding, reasoning, and structural comprehension.
While Large Language Models (LLMs) offer potential, they often struggle with
handling ambiguous or domain-specific data, maintaining table structure,
managing long inputs, and addressing numerical reasoning. This paper proposes
an efficient system for LLM-driven text-to-table generation that leverages
novel prompting techniques. Specifically, the system incorporates two key
strategies: breaking down the text-to-table task into manageable, guided
sub-tasks and refining the generated tables through iterative self-feedback. We
show that this custom task decomposition allows the model to address the
problem in a stepwise manner and improves the quality of the generated table.
Furthermore, we discuss the benefits and potential risks associated with
iterative self-feedback on the generated tables while highlighting the
trade-offs between enhanced performance and computational cost. Our methods
achieve strong results compared to baselines on two complex text-to-table
generation datasets available in the public domain.
☆ Prompt-Based Approach for Czech Sentiment Analysis
This paper introduces the first prompt-based methods for aspect-based
sentiment analysis and sentiment classification in Czech. We employ
sequence-to-sequence models to solve the aspect-based tasks simultaneously and
demonstrate the superiority of our prompt-based approach over traditional
fine-tuning. In addition, we conduct zero-shot and few-shot learning
experiments for sentiment classification and show that prompting yields
significantly better results with limited training examples compared to
traditional fine-tuning. We also demonstrate that pre-training on data from the
target domain can lead to significant improvements in a zero-shot scenario.
comment: Published in Proceedings of the 14th International Conference on
Recent Advances in Natural Language Processing (RANLP 2023). Official
version: https://aclanthology.org/2023.ranlp-1.118/
☆ UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection WASSA 2024
This paper presents our system built for the WASSA-2024 Cross-lingual Emotion
Detection Shared Task. The task consists of two subtasks: first, to assess an
emotion label from six possible classes for a given tweet in one of five
languages, and second, to predict words triggering the detected emotions in
binary and numerical formats. Our proposed approach revolves around fine-tuning
quantized large language models, specifically Orca 2, with low-rank adapters
(LoRA) and multilingual Transformer-based models, such as XLM-R and mT5. We
enhance performance through machine translation for both subtasks and trigger
word switching for the second subtask. The system achieves excellent
performance, ranking 1st in numerical trigger word detection, 3rd in binary
trigger word detection, and 7th in emotion detection.
comment: Published in Proceedings of the 14th Workshop on Computational
Approaches to Subjectivity, Sentiment, & Social Media Analysis (WASSA 2024).
Official version: https://aclanthology.org/2024.wassa-1.47/
☆ LLaMA-Based Models for Aspect-Based Sentiment Analysis WASSA 2024
While large language models (LLMs) show promise for various tasks, their
performance in compound aspect-based sentiment analysis (ABSA) tasks lags
behind fine-tuned models. However, the potential of LLMs fine-tuned for ABSA
remains unexplored. This paper examines the capabilities of open-source LLMs
fine-tuned for ABSA, focusing on LLaMA-based models. We evaluate the
performance across four tasks and eight English datasets, finding that the
fine-tuned Orca 2 model surpasses state-of-the-art results in all tasks.
However, all models struggle in zero-shot and few-shot scenarios compared to
fully fine-tuned ones. Additionally, we conduct error analysis to identify
challenges faced by fine-tuned models.
comment: Published in Proceedings of the 14th Workshop on Computational
Approaches to Subjectivity, Sentiment, & Social Media Analysis (WASSA 2024).
Official version: https://aclanthology.org/2024.wassa-1.6/
☆ Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents
Zheng Wu, Heyuan Huang, Yanjia Yang, Yuanyi Song, Xingyu Lou, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
As multimodal large language models advance rapidly, the automation of mobile
tasks has become increasingly feasible through the use of mobile-use agents
that mimic human interactions with graphical user interfaces. To further enhance
these agents, previous studies employ demonstration learning, improving them
from human demonstrations. However, these methods focus
solely on the explicit intention flows of humans (e.g., step sequences) while
neglecting implicit intention flows (e.g., personal preferences), which makes
it difficult to construct personalized mobile-use agents. In this work, to
evaluate the Intention Alignment Rate (IAR) between
mobile-use agents and humans, we first collect MobileIAR, a dataset
containing human-intent-aligned actions and ground-truth actions. This enables
a comprehensive assessment of the agents' understanding of human intent. Then
we propose IFRAgent, a framework built upon Intention
Flow Recognition (IFR) from human demonstrations. IFRAgent analyzes
explicit intention flows from human demonstrations to construct a query-level
vector library of standard operating procedures (SOP), and analyzes implicit
intention flows to build a user-level habit repository. IFRAgent then leverages
an SOP extractor combined with retrieval-augmented generation and a query
rewriter to generate a personalized query and SOP from a raw ambiguous query,
enhancing the alignment between mobile-use agents and human intent.
Experimental results demonstrate that IFRAgent outperforms baselines by an
average of 6.79% absolute (32.06% relative improvement) in human intention
alignment rate and improves step completion rates by an average of 5.30%
absolute (26.34% relative improvement). The code is available at
https://github.com/MadeAgents/Quick-on-the-Uptake.
☆ MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time
Large language models (LLMs) are increasingly being applied to black-box
optimization tasks, from program synthesis to molecule design. Prior work
typically leverages in-context learning to iteratively guide the model towards
better solutions. Such methods, however, often struggle to balance exploration
of new solution spaces with exploitation of high-reward ones. Recently,
test-time training (TTT) with synthetic data has shown promise in improving
solution quality. However, the need for hand-crafted training data tailored to
each task limits feasibility and scalability across domains. To address this
problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a
search algorithm to adapt LLMs at inference without requiring external training
data. MiGrATe operates via a mixed-policy group construction procedure that
combines on-policy sampling with two off-policy data selection techniques:
greedy sampling, which selects top-performing past completions, and
neighborhood sampling (NS), which generates completions structurally similar to
high-reward ones. Together, these components bias the policy gradient towards
exploitation of promising regions in solution space, while preserving
exploration through on-policy sampling. We evaluate MiGrATe on three
challenging domains (word search, molecule optimization, and hypothesis+program
induction on the Abstraction and Reasoning Corpus, ARC) and find that it
consistently outperforms both inference-only and TTT baselines, demonstrating
the potential of online TTT as a solution for complex search tasks without
external supervision.
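
The group construction described in the MiGrATe abstract can be pictured with a toy sketch. This is only an illustration under stated assumptions, not the authors' implementation: `sample_fn`, `mutate_fn`, and `reward_fn` are hypothetical stand-ins for the policy sampler, the neighborhood-sampling step, and the task reward, and the group sizes are arbitrary. The standardized group-relative advantage is the usual GRPO-style normalization.

```python
import random
import statistics


def build_mixed_group(sample_fn, mutate_fn, reward_fn, history,
                      n_on=4, n_greedy=2, n_ns=2, rng=None):
    """Assemble one training group from three sources (hypothetical sizes)."""
    rng = rng or random.Random(0)
    # 1) On-policy: fresh completions from the current policy (exploration).
    on_policy = [sample_fn(rng) for _ in range(n_on)]
    # 2) Greedy: the top-performing past completions (exploitation).
    greedy = sorted(history, key=reward_fn, reverse=True)[:n_greedy]
    # 3) Neighborhood: variations of high-reward completions. Here a simple
    #    mutation function stands in for structurally similar generations.
    neighbors = [mutate_fn(rng.choice(greedy), rng) for _ in range(n_ns)] if greedy else []
    return on_policy + greedy + neighbors


def group_relative_advantages(rewards):
    """Standardize rewards within the group, as in GRPO-style updates."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    reward = lambda s: s.count("a")  # toy reward: number of 'a' characters
    group = build_mixed_group(
        sample_fn=lambda r: "".join(r.choice("ab") for _ in range(4)),
        mutate_fn=lambda s, r: s[:-1] + r.choice("ab"),
        reward_fn=reward,
        history=["aaab", "bbbb", "aabb"],
        rng=rng,
    )
    print(group, group_relative_advantages([reward(s) for s in group]))
```

Mixing the three sources in one group means the advantage of each completion is measured against both fresh and previously successful candidates, which is one plausible reading of how the method biases updates toward promising regions.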
☆ InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling
Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen
Large language models (LLMs) have revolutionized artificial intelligence by
enabling complex reasoning capabilities. While recent advancements in
reinforcement learning (RL) have primarily focused on domain-specific reasoning
tasks (e.g., mathematics or code generation), real-world reasoning scenarios
often require models to handle diverse and complex environments that
narrow-domain benchmarks cannot fully capture. To address this gap, we present
InternBootcamp, an open-source framework comprising 1000+ domain-diverse task
environments specifically designed for LLM reasoning research. Our codebase
offers two key functionalities: (1) automated generation of unlimited
training/testing cases with configurable difficulty levels, and (2) integrated
verification modules for objective response evaluation. These features make
InternBootcamp fundamental infrastructure for RL-based model optimization,
synthetic data generation, and model evaluation. Although manually developing
such a framework with enormous task coverage is extremely cumbersome, we
accelerate the development procedure through an automated agent workflow
supplemented by manual validation protocols, which enables the task scope to
expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an
automatically generated benchmark for comprehensive performance assessment.
Evaluation reveals that frontier models still underperform in many reasoning
tasks, while training with InternBootcamp provides an effective way to
significantly improve performance, leading to our 32B model that achieves
state-of-the-art results on Bootcamp-EVAL and excels on other established
benchmarks. In particular, we validate that consistent performance gains come
from including more training tasks, namely task scaling, over two
orders of magnitude, offering a promising route towards a capable reasoning
generalist.
comment: InternBootcamp Tech Report
☆ Adaptive Personalized Conversational Information Retrieval CIKM 2025
Personalized conversational information retrieval (CIR) systems aim to
satisfy users' complex information needs through multi-turn interactions by
considering user profiles. However, not all search queries require
personalization. The challenge lies in appropriately incorporating
personalization elements into search when needed. Most existing studies
implicitly incorporate users' personal information and conversational context
using large language models without distinguishing the specific requirements
for each query turn. Such a "one-size-fits-all" personalization strategy
might lead to sub-optimal results. In this paper, we propose an adaptive
personalization method, in which we first identify the required personalization
level for a query and integrate personalized queries with other query
reformulations to produce various enhanced queries. Then, we design a
personalization-aware ranking fusion approach to assign fusion weights
dynamically to different reformulated queries, depending on the required
personalization level. The proposed adaptive personalized conversational
information retrieval framework, APCIR, is evaluated on two TREC iKAT datasets.
The results confirm the effectiveness of its adaptive personalization: APCIR
outperforms state-of-the-art methods.
comment: Accepted by CIKM 2025
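
To make the idea of personalization-aware ranking fusion concrete, here is a minimal sketch. It assumes a weighted reciprocal rank fusion and a linear mapping from personalization level to weights; both are illustrative choices of ours, and the abstract does not state which fusion formula APCIR uses. The names `fuse_rankings` and `weights_for_level` are hypothetical.

```python
from collections import defaultdict


def weights_for_level(level):
    """Map a personalization level in [0, 1] to fusion weights (assumed linear)."""
    return {"personalized": level, "generic": 1.0 - level}


def fuse_rankings(rankings, weights, k=60):
    """Weighted reciprocal rank fusion over several reformulated-query rankings.

    rankings: dict mapping a reformulation name to a ranked list of doc ids.
    weights:  dict mapping the same names to non-negative fusion weights.
    """
    scores = defaultdict(float)
    for name, ranked_docs in rankings.items():
        w = weights.get(name, 0.0)
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by descending fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    rankings = {
        "personalized": ["d1", "d2", "d3"],
        "generic": ["d3", "d2", "d1"],
    }
    for level in (0.0, 0.5, 1.0):
        print(level, fuse_rankings(rankings, weights_for_level(level)))
```

A query judged to need no personalization (level 0.0) falls back to the generic ranking, while a highly personal query (level 1.0) follows the personalized reformulation.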
☆ Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review
This review examines recent advances in Parameter-Efficient Fine-Tuning
(PEFT), with a focus on Low-Rank Adaptation (LoRA), to optimize
Retrieval-Augmented Generation (RAG) systems like Qwen3, DeepSeek, and Kimi.
These systems face challenges in understanding and generating authentic
Cantonese colloquial expressions due to limited annotated data and linguistic
variability. The review evaluates the integration of LoRA within RAG
frameworks, benchmarks PEFT methods for retrieval and generation accuracy,
identifies domain adaptation strategies under limited data, and compares
fine-tuning techniques aimed at improving semantic fidelity under data-scarce
conditions. A systematic analysis of recent studies employing diverse LoRA
variants, synthetic data generation, user feedback integration, and adaptive
parameter allocation was conducted to assess their impact on computational
efficiency, retrieval precision, linguistic authenticity, and scalability.
Findings reveal that dynamic and ensemble LoRA adaptations significantly reduce
trainable parameters without sacrificing retrieval accuracy and generation
quality in dialectal contexts. However, limitations remain in fully preserving
fine-grained linguistic nuances, especially for low-resource settings like
Cantonese. The integration of real-time user feedback and domain-specific data
remains underdeveloped, limiting model adaptability and personalization. While
selective parameter freezing and nonlinear adaptation methods offer better
trade-offs between efficiency and accuracy, their robustness at scale remains
an open challenge. This review highlights the promise of PEFT-enhanced RAG
systems for domain-specific language tasks and calls for future work targeting
dialectal authenticity, dynamic adaptation, and scalable fine-tuning pipelines.
comment: 27 pages, 1 figure, 8 tables
☆ DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives
Sehwan Moon, Aram Lee, Jeong Eun Kim, Hee-Ju Kang, Il-Seon Shin, Sung-Wan Kim, Jae-Min Kim, Min Jhon, Ju-Wan Kim
Advances in large language models (LLMs) have enabled a wide range of
applications. However, depression prediction is hindered by the lack of
large-scale, high-quality, and rigorously annotated datasets. This study
introduces DepressLLM, trained and evaluated on a novel corpus of 3,699
autobiographical narratives reflecting both happiness and distress. DepressLLM
provides interpretable depression predictions and, via its Score-guided Token
Probability Summation (SToPS) module, delivers both improved classification
performance and reliable confidence estimates, achieving an AUC of 0.789, which
rises to 0.904 on samples with confidence ≥ 0.95. To validate its
robustness to heterogeneous data, we evaluated DepressLLM on in-house datasets,
including an Ecological Momentary Assessment (EMA) corpus of daily stress and
mood recordings, and on public clinical interview data. Finally, a psychiatric
review of high-confidence misclassifications highlighted key model and data
limitations that suggest directions for future refinements. These findings
demonstrate that interpretable AI can enable earlier diagnosis of depression
and underscore the promise of medical AI in psychiatry.
☆ Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization ACL2025
Video dubbing aims to translate original speech in visual media programs from
the source language to the target language, relying on neural machine
translation and text-to-speech technologies. Due to varying information
densities across languages, target speech often mismatches the source speech
duration, causing audio-video synchronization issues that significantly impact
viewer experience. In this study, we approach duration alignment in LLM-based
video dubbing machine translation as a preference optimization problem. We
propose the Segment Supervised Preference Optimization (SSPO) method, which
employs a segment-wise sampling strategy and fine-grained loss to mitigate
duration mismatches between source and target lines. Experimental results
demonstrate that SSPO achieves superior performance in duration alignment
tasks.
comment: This paper is accepted by ACL2025 (Main)
♻ ☆ Retrieval-Augmented Generation with Conflicting Evidence
Large language model (LLM) agents are increasingly employing
retrieval-augmented generation (RAG) to improve the factuality of their
responses. However, in practice, these systems often need to handle ambiguous
user queries and potentially conflicting information from multiple sources
while also suppressing inaccurate information from noisy or irrelevant
documents. Prior work has generally studied and addressed these challenges in
isolation, considering only one aspect at a time, such as handling ambiguity or
robustness to noise and misinformation. We instead consider multiple factors
simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and
Misinformation in Documents), a new dataset that simulates complex and
realistic scenarios for conflicting evidence for a user query, including
ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent
approach in which LLM agents debate over the merits of an answer over multiple
rounds, allowing an aggregator to collate responses corresponding to
disambiguated entities while discarding misinformation and noise, thereby
handling diverse sources of conflict jointly. We demonstrate the effectiveness
of MADAM-RAG using both closed and open-source models on AmbigDocs -- which
requires presenting all valid answers for ambiguous queries -- improving over
strong RAG baselines by up to 11.40% and on FaithEval -- which requires
suppressing misinformation -- where we improve by up to 15.80% (absolute) with
Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for
existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match
score). While MADAM-RAG begins to address these conflicting factors, our
analysis indicates that a substantial gap remains especially when increasing
the level of imbalance in supporting evidence and misinformation.
comment: COLM 2025, Data and Code: https://github.com/HanNight/RAMDocs
♻ ☆ GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Reinforcement learning (RL) with algorithms like Group Relative Policy
Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is
limited by a coarse-grained credit assignment that applies a uniform reward to
all tokens in a sequence. This is a major flaw in long-chain reasoning tasks.
This paper addresses it with Dynamic Entropy Weighting. Our core idea
is that high-entropy tokens in correct responses can guide the policy toward a
higher performance ceiling. This allows us to create more fine-grained reward
signals for precise policy updates in two ways: 1) Group Token Policy
Optimization (GTPO), which assigns an entropy-weighted reward to each
token for fine-grained credit assignment; and 2) Sequence-Level Group
Relative Policy Optimization (GRPO-S), which assigns an entropy-weighted
reward to each sequence based on its average token entropy. Experiments show
our methods significantly outperform the strong DAPO baseline. The results
confirm that our entropy-weighting mechanism is the key driver of this
performance boost, offering a better path to enhance deep reasoning in models.
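
The contrast between a uniform sequence reward and an entropy-weighted one can be sketched as follows. The weighting formula here is our own illustrative assumption (a convex mix of a uniform weight and the token's entropy relative to the mean, controlled by a hypothetical `alpha`); the abstract does not give the exact GTPO or GRPO-S formulas.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_level_rewards(seq_reward, entropies, alpha=0.5):
    """Spread a sequence reward over tokens, favoring high-entropy tokens.

    alpha = 0 recovers the uniform reward; the mean token reward always
    equals seq_reward under this (assumed) weighting.
    """
    mean_h = sum(entropies) / len(entropies)
    if mean_h == 0:
        return [seq_reward] * len(entropies)
    return [seq_reward * ((1 - alpha) + alpha * h / mean_h) for h in entropies]


def sequence_level_rewards(seq_rewards, mean_entropies, alpha=0.5):
    """Reweight whole sequences by their average token entropy within a group."""
    group_mean = sum(mean_entropies) / len(mean_entropies)
    if group_mean == 0:
        return list(seq_rewards)
    return [r * ((1 - alpha) + alpha * h / group_mean)
            for r, h in zip(seq_rewards, mean_entropies)]


if __name__ == "__main__":
    dists = [[1.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
    hs = [token_entropy(d) for d in dists]
    print(token_level_rewards(1.0, hs, alpha=1.0))
    print(sequence_level_rewards([1.0, 1.0], [1.0, 3.0], alpha=1.0))
```

The first function corresponds to the token-level variant and the second to the sequence-level variant; in both, uncertain decision points receive a larger share of the credit.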
♻ ☆ CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stańczak, Aishwarya Agrawal
The increasing ubiquity of text-to-image (T2I) models as tools for visual
content generation raises concerns about their ability to accurately represent
diverse cultural contexts -- where missed cues can stereotype communities and
undermine usability. In this work, we present the first study to systematically
quantify the alignment of T2I models and evaluation metrics with respect to
both explicit (stated) and implicit (unstated, implied by the prompt's
cultural context) cultural expectations. To this end, we introduce
CulturalFrames, a novel benchmark designed for rigorous human evaluation of
cultural representation in visual generations. Spanning 10 countries and 5
socio-cultural domains, CulturalFrames comprises 983 prompts, 3637
corresponding images generated by 4 state-of-the-art T2I models, and over 10k
detailed human annotations. We find that across models and countries, cultural
expectations are missed an average of 44% of the time. Among these failures,
explicit expectations are missed at a surprisingly high average rate of 68%,
while implicit expectation failures are also significant, averaging 49%.
Furthermore, we show that existing T2I evaluation metrics correlate poorly with
human judgments of cultural alignment, irrespective of their internal
reasoning. Collectively, our findings expose critical gaps, provide a concrete
testbed, and outline actionable directions for developing culturally informed
T2I models and metrics that improve global usability.
♻ ☆ LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, Yilong Wu, Zhihao Zhang, Mingqi Wu, Zhiheng Xi, Mingxu Chai, Tao Liang, Zhihui Fei, Zhen Wang, Mingyang Wan, Guojun Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Existing evaluation of Large Language Models (LLMs) on static benchmarks is
vulnerable to data contamination and leaderboard overfitting, critical issues
that obscure true model capabilities. To address this, we introduce LLMEval-3,
a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary
bank of 220k graduate-level questions, from which it dynamically samples unseen
test sets for each evaluation run. Its automated pipeline ensures integrity via
contamination-resistant data curation, a novel anti-cheating architecture, and
a calibrated LLM-as-a-judge process achieving 90% agreement with human experts,
complemented by a relative ranking system for fair comparison. A 20-month
longitudinal study of nearly 50 leading models reveals a performance ceiling on
knowledge memorization and exposes data contamination vulnerabilities
undetectable by static benchmarks. The framework demonstrates exceptional
robustness in ranking stability and consistency, providing strong empirical
validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and
credible methodology for assessing the true capabilities of LLMs beyond
leaderboard scores, promoting the development of more trustworthy evaluation
standards.
♻ ☆ Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?
Yang Yao, Lingyu Li, Jiaxin Song, Chiyu Chen, Zhenqi He, Yixu Wang, Xin Wang, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang
As Multimodal Large Language Models (MLLMs) continue to evolve, their
cognitive and reasoning capabilities have seen remarkable progress. However,
challenges in visual fine-grained perception and commonsense causal inference
persist. This paper introduces Argus Inspection, a multimodal benchmark with
two levels of difficulty, emphasizing detailed visual recognition while
incorporating real-world commonsense understanding to evaluate causal reasoning
abilities. Expanding on it, we present the Eye of Panoptes framework, which
integrates a binary parametric Sigmoid metric with an indicator function,
enabling a more holistic evaluation of MLLMs' responses in opinion-based
reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the
highest performance in visual fine-grained reasoning reaches only 0.46,
highlighting considerable potential for enhancement. Our research offers
valuable perspectives for the continued refinement of MLLMs.
♻ ☆ RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory
Jun Liu, Zhenglun Kong, Changdi Yang, Fan Yang, Tianqi Li, Peiyan Dong, Joannah Nanjekye, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang
Multi-agent large language model (LLM) systems have shown strong potential in
complex reasoning and collaborative decision-making tasks. However, most
existing coordination schemes rely on static or full-context routing
strategies, which lead to excessive token consumption, redundant memory
exposure, and limited adaptability across interaction rounds. We introduce
RCR-Router, a modular and role-aware context routing framework designed to
enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge,
this is the first routing approach that dynamically selects semantically
relevant memory subsets for each agent based on its role and task stage, while
adhering to a strict token budget. A lightweight scoring policy guides memory
selection, and agent outputs are iteratively integrated into a shared memory
store to facilitate progressive context refinement. To better evaluate model
behavior, we further propose an Answer Quality Score metric that captures
LLM-generated explanations beyond standard QA accuracy. Experiments on three
multi-hop QA benchmarks -- HotPotQA, MuSiQue, and 2WikiMultihop -- demonstrate
that RCR-Router reduces token usage (up to 30%) while improving or maintaining
answer quality. These results highlight the importance of structured memory
routing and output-aware evaluation in advancing scalable multi-agent LLM
systems.
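
Role-aware routing under a token budget can be illustrated with a small greedy selector. This is a sketch under our own assumptions: the tag-overlap `score` function and the `MemoryItem` structure are hypothetical, and the actual scoring policy of RCR-Router is not specified in the abstract.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    tokens: int        # token cost of including this item in the context
    tags: frozenset    # coarse semantic labels attached to the item


def score(item, role_tags, stage_tags):
    """Toy relevance score: role matches count double, stage matches once."""
    return 2.0 * len(item.tags & role_tags) + 1.0 * len(item.tags & stage_tags)


def route_context(memory, role_tags, stage_tags, budget):
    """Greedily pick the highest-scoring items that fit in the token budget."""
    ranked = sorted(memory, key=lambda m: score(m, role_tags, stage_tags), reverse=True)
    chosen, used = [], 0
    for item in ranked:
        if score(item, role_tags, stage_tags) <= 0:
            continue  # irrelevant to this role and stage: never routed
        if used + item.tokens <= budget:
            chosen.append(item)
            used += item.tokens
    return chosen


if __name__ == "__main__":
    memory = [
        MemoryItem("overall plan", 50, frozenset({"plan"})),
        MemoryItem("first-hop passage", 80, frozenset({"evidence", "hop1"})),
        MemoryItem("second passage", 40, frozenset({"evidence"})),
        MemoryItem("small talk", 30, frozenset({"chitchat"})),
    ]
    picked = route_context(memory, frozenset({"evidence"}), frozenset({"hop1"}), budget=120)
    print([m.text for m in picked])
```

Each agent would call the selector with its own role and the current stage, so different agents see different, budget-bounded slices of the shared memory.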
♻ ☆ Opioid Named Entity Recognition (ONER-2025) from Reddit
The opioid overdose epidemic remains a critical public health crisis,
particularly in the United States, leading to significant mortality and
societal costs. Social media platforms like Reddit provide vast amounts of
unstructured data that offer insights into public perceptions, discussions, and
experiences related to opioid use. This study leverages Natural Language
Processing (NLP), specifically Opioid Named Entity Recognition (ONER-2025), to
extract actionable information from these platforms. Our research makes four
key contributions. First, we created a unique, manually annotated dataset
sourced from Reddit, where users share self-reported experiences of opioid use
via different administration routes. This dataset contains 331,285 tokens and
includes eight major opioid entity categories. Second, we detail our annotation
process and guidelines while discussing the challenges of labeling the
ONER-2025 dataset. Third, we analyze key linguistic challenges, including
slang, ambiguity, fragmented sentences, and emotionally charged language, in
opioid discussions. Fourth, we propose a real-time monitoring system to process
streaming data from social media, healthcare records, and emergency services to
identify overdose events. Using 5-fold cross-validation in 11 experiments, our
system integrates machine learning, deep learning, and transformer-based
language models with advanced contextual embeddings to enhance understanding.
Our transformer-based models (bert-base-NER and roberta-base) achieved 97%
accuracy and F1-score, outperforming baselines by 10.23% (RF=0.88).
♻ ☆ Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions ACL 2025
Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
We present an end-to-end framework for generating synthetic users for
evaluating interactive agents designed to encourage positive behavior changes,
such as in health and lifestyle coaching. The synthetic users are grounded in
health and lifestyle conditions, specifically sleep and diabetes management in
this study, to ensure realistic interactions with the health coaching agent.
Synthetic users are created in two stages: first, structured data are generated
grounded in real-world health and lifestyle factors in addition to basic
demographics and behavioral attributes; second, full profiles of the synthetic
users are developed conditioned on the structured data. Interactions between
synthetic users and the coaching agent are simulated using generative
agent-based models such as Concordia, or directly by prompting a language
model. Using two independently developed agents for sleep and diabetes coaching
as case studies, the validity of this framework is demonstrated by analyzing
the coaching agent's understanding of the synthetic users' needs and
challenges. Finally, through multiple blinded evaluations of user-coach
interactions by human experts, we demonstrate that our synthetic users with
health and behavioral attributes more accurately portray real human users with
the same attributes, compared to generic synthetic users not grounded in such
attributes. The proposed framework lays the foundation for efficient
development of conversational agents through extensive, realistic, and grounded
simulated interactions.
comment: Published in Findings of the Association for Computational
Linguistics: ACL 2025
♻ ☆ Mind the Gap: Benchmarking LLM Uncertainty, Discrimination, and Calibration in Specialty-Aware Clinical QA
Reliable uncertainty quantification (UQ) is essential when employing large
language models (LLMs) in high-risk domains such as clinical question answering
(QA). In this work, we evaluate uncertainty estimation methods for clinical QA
focusing, for the first time, on eleven clinical specialties and six question
types, and across ten open-source LLMs (general-purpose, biomedical, and
reasoning models). We analyze score-based UQ methods, present a case study
introducing a novel lightweight method based on behavioral features derived
from reasoning-oriented models, and examine conformal prediction as a
complementary set-based approach. Our findings reveal that uncertainty
reliability is not a monolithic property, but one that depends on clinical
specialty and question type due to shifts in calibration and discrimination.
Our results highlight the need to select or ensemble models based on their
distinct, complementary strengths and clinical use.
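
As background for the set-based approach mentioned above, the following is a minimal sketch of standard split conformal prediction for multiple-choice QA. It uses the common nonconformity score of one minus the probability of the true answer; whether the paper uses this exact score is an assumption on our part, and the function names are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold targeting roughly (1 - alpha) coverage.

    cal_probs:  list of per-option probability lists on calibration questions.
    cal_labels: index of the correct option for each calibration question.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every option
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """Return all options whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]


if __name__ == "__main__":
    cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
    cal_labels = [0, 0, 0]
    qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
    print(qhat, prediction_set([0.85, 0.15], qhat))
```

Larger prediction sets signal higher uncertainty, which is why the set size can complement score-based confidence estimates.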
♻ ☆ SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
Repurposing large vision-language models (LVLMs) as computer use agents
(CUAs) has led to substantial breakthroughs, primarily driven by human-labeled
data. However, these models often struggle with novel and specialized software,
particularly in scenarios lacking human annotations. To address this challenge,
we propose SEAgent, an agentic self-evolving framework enabling CUAs to
autonomously evolve through interactions with unfamiliar software.
Specifically, SEAgent empowers computer-use agents to autonomously master novel
software environments via experiential learning, where agents explore new
software, learn through iterative trial-and-error, and progressively tackle
auto-generated tasks organized from simple to complex. To achieve this goal, we
design a World State Model for step-wise trajectory assessment, along with a
Curriculum Generator that generates increasingly diverse and challenging tasks.
The agent's policy is updated through experiential learning, consisting of
adversarial imitation of failure actions and Group Relative Policy Optimization
(GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist
training strategy that integrates individual experiential insights from
specialist agents, facilitating the development of a stronger generalist CUA
capable of continuous autonomous evolution. This unified agent ultimately
achieves performance surpassing ensembles of individual specialist agents on
their specialized software. We validate the effectiveness of SEAgent across
five novel software environments within OS-World. Our approach improves the
success rate by 23.2 percentage points, from 11.3% to 34.5%, over a
competitive open-source CUA, i.e., UI-TARS.
comment: Code at https://github.com/SunzeY/SEAgent
♻ ☆ OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Open Semantic Mapping (OSM) is a key technology in robotic perception,
combining semantic segmentation and SLAM techniques. This paper introduces a
dynamically configurable and highly automated LLM/LVLM-powered pipeline for
evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark).
The study focuses on evaluating state-of-the-art semantic mapping algorithms
under varying indoor lighting conditions, a critical challenge in indoor
environments. We introduce a novel dataset with simulated RGB-D sequences and
ground truth 3D reconstructions, facilitating the rigorous analysis of mapping
performance across different lighting conditions. Through experiments on
leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the
semantic fidelity of object recognition and segmentation. Additionally, we
introduce a Scene Graph evaluation method to analyze the ability of models to
interpret semantic structure. The results provide insights into the robustness
of these models, informing future research directions for developing resilient
and adaptable robotic systems. Project page is available at
https://be2rlab.github.io/OSMa-Bench/.
comment: Project page: https://be2rlab.github.io/OSMa-Bench/
♻ ☆ Optimizing Class-Level Probability Reweighting Coefficients for Equitable Prompting Accuracy
Even as we engineer LLMs for alignment and safety, they often uncover biases
from pre-training data's statistical regularities (from disproportionate
co-occurrences to stereotypical associations mirroring human cognitive biases).
This leads to persistent, uneven class accuracy in classification and QA. Such
per-class accuracy disparities are not inherently resolved by
architectural/training evolutions or data scaling, making post-hoc correction
essential for equitable performance. To mitigate LLM class accuracy imbalance,
we develop a post-hoc probability reweighting method that directly optimizes
for non-differentiable performance-driven and fairness-aligned metrics, through
a novel COBias metric that highlights disparities in class accuracies. This
post-hoc bias mitigation method is grounded in discrete optimization with
nonlinear integer programming (NIP) objectives and an efficient metaheuristic
solution framework with theoretical convergence guarantees. Operating
model-agnostically, it learns reweighting coefficients from output class
probabilities to adjust LLM inference outputs without internal weight updates.
Evaluations demonstrate its effectiveness: reducing COBias (61% relative
reduction), increasing overall accuracy (18% relative increase), and achieving
robust within-task generalization across diverse prompt configurations.
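As a rough illustration of the reweighting idea in the abstract above, the following Python sketch multiplies each output class probability by a per-class coefficient, re-predicts with argmax, and reports per-class accuracy together with a pairwise accuracy gap. The function names, the toy data, and the gap formula (mean absolute accuracy difference over class pairs) are assumptions made for illustration. The abstract gives neither the exact COBias definition nor the NIP solver, and neither is reproduced here.

```python
from itertools import combinations

def reweight_predict(probs, weights):
    """Scale each class probability by its coefficient and return the argmax class."""
    scaled = [p * w for p, w in zip(probs, weights)]
    return max(range(len(scaled)), key=scaled.__getitem__)

def class_accuracies(all_probs, labels, weights, num_classes):
    """Per-class accuracy after reweighting (0.0 for classes with no examples)."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for probs, y in zip(all_probs, labels):
        total[y] += 1
        if reweight_predict(probs, weights) == y:
            correct[y] += 1
    return [c / t if t else 0.0 for c, t in zip(correct, total)]

def pairwise_accuracy_gap(accs):
    """Illustrative disparity score (an assumption, not the paper's COBias formula)."""
    pairs = list(combinations(accs, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Toy outputs from a hypothetical 2-class model that is biased toward class 0.
all_probs = [[0.6, 0.4], [0.7, 0.3], [0.55, 0.45], [0.6, 0.4]]
labels = [0, 0, 1, 1]

# Identity weights reproduce the raw predictions; the second set upweights class 1.
before = class_accuracies(all_probs, labels, [1.0, 1.0], 2)
after = class_accuracies(all_probs, labels, [1.0, 1.4], 2)
print(before, pairwise_accuracy_gap(before))
print(after, pairwise_accuracy_gap(after))
```

In this toy setting the coefficients are picked by hand. The abstract describes learning them by discrete optimization over the output class probabilities, with no update to the model's internal weights.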
♻ ☆ AIOS: LLM Agent Operating System
Kai Mei, Xi Zhu, Wujiang Xu, Wenyue Hua, Mingyu Jin, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, Yongfeng Zhang
LLM-based intelligent agents face significant deployment challenges,
particularly related to resource management. Allowing unrestricted access to
LLM or tool resources can lead to inefficient or even potentially harmful
resource allocation and utilization for agents. Furthermore, the absence of
proper scheduling and resource management mechanisms in current agent designs
hinders concurrent processing and limits overall system efficiency. To address
these challenges, this paper proposes the architecture of AIOS (LLM-based AI
Agent Operating System) under the context of managing LLM-based agents. It
introduces a novel architecture for serving LLM-based agents by isolating
resources and LLM-specific services from agent applications into an AIOS
kernel. This AIOS kernel provides fundamental services (e.g., scheduling,
context management, memory management, storage management, access control) for
runtime agents. To enhance usability, AIOS also includes an AIOS SDK, a
comprehensive suite of APIs designed for utilizing functionalities provided by
the AIOS kernel. Experimental results demonstrate that using AIOS can achieve
up to 2.1x faster execution for serving agents built by various agent
frameworks. The source code is available at
https://github.com/agiresearch/AIOS.
comment: Published as a full paper at COLM 2025
♻ ☆ Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference
Artificial Intelligence (AI) conferences are essential for advancing
research, sharing knowledge, and fostering academic community. However, their
rapid expansion has rendered the centralized conference model increasingly
unsustainable. This paper offers a data-driven diagnosis of a structural crisis
that threatens the foundational goals of scientific dissemination, equity, and
community well-being. We identify four key areas of strain: (1) scientifically,
with per-author publication rates more than doubling over the past decade to
over 4.5 papers annually; (2) environmentally, with the carbon footprint of a
single conference exceeding the daily emissions of its host city; (3)
psychologically, with 71% of online community discourse reflecting negative
sentiment and 35% referencing mental health concerns; and (4) logistically,
with attendance at top conferences such as NeurIPS 2024 beginning to outpace
venue capacity. These pressures point to a system that is misaligned with its
core mission. In response, we propose the Community-Federated Conference (CFC)
model, which separates peer review, presentation, and networking into globally
coordinated but locally organized components, offering a more sustainable,
inclusive, and resilient path forward for AI research.
comment: Preprint
♻ ☆ TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree
Recognizing specific key phrases is an essential task for contextualized
Automatic Speech Recognition (ASR). However, most existing context-biasing
approaches have limitations associated with the necessity of additional model
training, significantly slow down the decoding process, or constrain the choice
of the ASR system type. This paper proposes a universal ASR context-biasing
framework that supports all major types: CTC, Transducers, and Attention
Encoder-Decoder models. The framework is based on a GPU-accelerated word
boosting tree, which enables it to be used in shallow fusion mode for greedy
and beam search decoding without noticeable speed degradation, even with a vast
number of key phrases (up to 20K items). The obtained results showed high
efficiency of the proposed method, surpassing the considered open-source
context-biasing approaches in accuracy and decoding speed. Our context-biasing
framework is open-sourced as a part of the NeMo toolkit.
comment: Accepted to ASRU 2025
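To make the idea of a phrase-boosting tree more concrete, here is a minimal CPU-only Python sketch, assuming whitespace tokens and a fixed per-token bonus. The class name, the bonus value, and the restart-at-root behavior are all hypothetical. This is not the NeMo implementation, which the abstract describes as GPU-accelerated, and it omits refinements such as retracting the bonus when a partially matched phrase is abandoned.

```python
class PhraseBoostingTrie:
    """Toy phrase-boosting tree: each token that extends a key-phrase prefix earns a bonus."""

    def __init__(self, phrases, bonus=1.0):
        self.root = {}
        self.bonus = bonus
        for phrase in phrases:
            node = self.root
            for token in phrase.split():
                node = node.setdefault(token, {})

    def step(self, node, token):
        """Advance one token and return (next_node, score_delta)."""
        if token in node:
            return node[token], self.bonus
        # No continuation here: restart at the root so the token may begin a new phrase.
        if token in self.root:
            return self.root[token], self.bonus
        return self.root, 0.0

    def score(self, tokens):
        """Total boost accumulated over a whole hypothesis."""
        node, total = self.root, 0.0
        for token in tokens:
            node, delta = self.step(node, token)
            total += delta
        return total

trie = PhraseBoostingTrie(["nvidia nemo", "beam search"], bonus=0.5)
hyp_a = "we use nvidia nemo with beam search".split()
hyp_b = "we use in video demo with beam search".split()
print(trie.score(hyp_a), trie.score(hyp_b))
```

In shallow fusion, a per-token delta of this kind would be added to the ASR model's score for each candidate token during greedy or beam search decoding, so hypotheses containing key phrases rank higher.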
♻ ☆ EvoP: Robust LLM Inference via Evolutionary Pruning
Large Language Models (LLMs) have achieved remarkable success in natural
language processing tasks, but their massive size and computational demands
hinder their deployment in resource-constrained environments. Existing model
pruning methods address this issue by removing redundant structures (e.g.,
elements, channels, layers) from the model. However, these methods employ a
heuristic pruning strategy, which leads to suboptimal performance. Besides,
they also ignore the data characteristics when pruning the model.
To overcome these limitations, we propose EvoP, an evolutionary pruning
framework for robust LLM inference. EvoP first presents a cluster-based
calibration dataset sampling (CCDS) strategy for creating a more diverse
calibration dataset. EvoP then introduces an evolutionary pruning pattern
searching (EPPS) method to find the optimal pruning pattern. Compared to
existing model pruning techniques, EvoP achieves the best performance while
maintaining the best efficiency. Experiments across different LLMs and
different downstream tasks validate the effectiveness of the proposed EvoP,
making it a practical and scalable solution for deploying LLMs in real-world
applications.
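The pruning pattern search mentioned above can be pictured with a small evolutionary loop. The Python sketch below assumes layer-level pruning with a fixed number of pruned layers, a swap mutation, and elitist selection. The function names, the hyperparameters, and the toy fitness are hypothetical and are not taken from the paper, whose fitness would presumably be measured on the calibration dataset.

```python
import random

def evolve_pruning_mask(num_layers, num_pruned, fitness,
                        generations=30, pop_size=8, seed=0):
    """Evolutionary search over which layers to prune (True = keep, False = prune)."""
    rng = random.Random(seed)

    def random_mask():
        pruned = set(rng.sample(range(num_layers), num_pruned))
        return tuple(i not in pruned for i in range(num_layers))

    def mutate(mask):
        # Swap one kept layer with one pruned layer so the sparsity level stays fixed.
        kept = [i for i, keep in enumerate(mask) if keep]
        pruned = [i for i, keep in enumerate(mask) if not keep]
        i, j = rng.choice(kept), rng.choice(pruned)
        child = list(mask)
        child[i], child[j] = False, True
        return tuple(child)

    population = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(m) for m in population]
        # Elitist selection: keep the fittest unique masks among parents and children.
        candidates = set(population + children)
        population = sorted(candidates, key=fitness, reverse=True)[:pop_size]
    return population[0]

# Hypothetical per-layer importance scores standing in for a real evaluation.
importance = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]

def toy_fitness(mask):
    """Sum of importance over the kept layers (a stand-in for calibration-set quality)."""
    return sum(score for score, keep in zip(importance, mask) if keep)

best = evolve_pruning_mask(num_layers=8, num_pruned=3, fitness=toy_fitness)
print(best, toy_fitness(best))
```

Because selection is elitist, the best fitness never decreases from one generation to the next, although a search this small is not guaranteed to find the global optimum.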
♻ ☆ Jinx: Unlimited LLMs for Probing Alignment Failures
Unlimited, or so-called helpful-only, language models are trained without
safety alignment constraints and never refuse user queries. They are widely
used by leading AI companies as internal tools for red teaming and alignment
evaluation. For example, if a safety-aligned model produces harmful outputs
similar to an unlimited model, this indicates alignment failures that require
further attention. Despite their essential role in assessing alignment, such
models are not available to the research community.
We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx
responds to all queries without refusals or safety filtering, while preserving
the base model's capabilities in reasoning and instruction following. It
provides researchers with an accessible tool for probing alignment failures,
evaluating safety boundaries, and systematically studying failure modes in
language model safety.
comment: https://huggingface.co/Jinx-org
♻ ☆ AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models
As Large Language Models (LLMs) are pre-trained on ultra-large-scale corpora,
the problem of data contamination is becoming increasingly serious, and there
is a risk that static evaluation benchmarks overestimate the performance of
LLMs. To address this, this paper proposes a dynamic data evaluation method
called AdEval (Alignment-based Dynamic Evaluation). AdEval first extracts
knowledge points and main ideas from static datasets to achieve dynamic
alignment with the core content of static benchmarks, and by avoiding direct
reliance on static datasets, it inherently reduces the risk of data
contamination from the source. It then obtains background information through
online searches to generate detailed descriptions of the knowledge points.
Finally, it designs questions based on Bloom's cognitive hierarchy across six
dimensions (remembering, understanding, applying, analyzing, evaluating, and
creating) to enable multi-level cognitive assessment. Additionally, AdEval
controls the complexity of dynamically generated datasets through iterative
question reconstruction. Experimental results on multiple datasets show that
AdEval effectively alleviates the impact of data contamination on evaluation
results, solves the problems of insufficient complexity control and
single-dimensional evaluation, and improves the fairness, reliability and
diversity of LLM evaluation.
comment: There are serious academic problems in this paper, such as data
falsification and plagiarism in the method of the paper
♻ ☆ Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu
General AI Agents are increasingly recognized as foundational frameworks for
the next generation of artificial intelligence, enabling complex reasoning, web
interaction, coding, and autonomous research capabilities. However, current
agent systems are either closed-source or heavily reliant on a variety of paid
APIs and proprietary tools, limiting accessibility and reproducibility for the
research community. In this work, we present Cognitive Kernel-Pro, a
fully open-source and (to the maximum extent) free multi-module agent framework
designed to democratize the development and evaluation of advanced AI agents.
Within Cognitive Kernel-Pro, we systematically investigate the curation of
high-quality training data for Agent Foundation Models, focusing on the
construction of queries, trajectories, and verifiable answers across four key
domains: web, file, code, and general reasoning. Furthermore, we explore novel
strategies for agent test-time reflection and voting to enhance agent
robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving
state-of-the-art results among open-source and free agents. Notably, our
8B-parameter open-source model surpasses previous leading systems such as
WebDancer and WebSailor, establishing a new performance standard for
accessible, high-capability AI agents. Code is available at
https://github.com/Tencent/CognitiveKernel-Pro
comment: 16 pages
♻ ☆ Quantifying Gender Biases Towards Politicians on Reddit
Despite attempts to increase gender parity in politics, global efforts have
struggled to ensure equal female representation. This is likely tied to
implicit gender biases against women in authority. In this work, we present a
comprehensive study of gender biases that appear in online political
discussion. To this end, we collect 10 million comments on Reddit in
conversations about male and female politicians, which enables an exhaustive
study of automatic gender bias detection. We address not only misogynistic
language, but also other manifestations of bias, like benevolent sexism in the
form of seemingly positive sentiment and dominance attributed to female
politicians, or differences in descriptor attribution. Finally, we conduct a
multi-faceted study of gender bias towards politicians investigating both
linguistic and extra-linguistic cues. We assess 5 different types of gender
bias, evaluating coverage, combinatorial, nominal, sentimental, and lexical
biases extant in social media language and discourse. Overall, we find that,
contrary to previous research, coverage and sentiment biases suggest equal
public interest in female politicians. Rather than overt hostile or benevolent
sexism, the results of the nominal and lexical analyses suggest this interest
is not as professional or respectful as that expressed about male politicians.
Female politicians are often named by their first names and are described in
relation to their body, clothing, or family; this is a treatment that is not
similarly extended to men. On the now banned far-right subreddits, this
disparity is greatest, though differences in gender biases still appear in the
right and left-leaning subreddits. We release the curated dataset to the public
for future studies.
comment: PlosONE article
♻ ☆ Post-Completion Learning for Language Models
Current language model training paradigms typically terminate learning upon
reaching the end-of-sequence token, overlooking the potential learning
opportunities in the post-completion space. We propose Post-Completion Learning
(PCL), a novel training framework that systematically utilizes the sequence
space after model output completion, to enhance both the reasoning and
self-evaluation abilities. PCL enables models to continue generating
self-assessments and reward predictions during training, while maintaining
efficient inference by stopping at the completion point.
To fully utilize this post-completion space, we design a white-box
reinforcement learning method: let the model evaluate the output content
according to the reward rules, then calculate and align the score with the
reward functions for supervision. We implement dual-track SFT to optimize both
reasoning and evaluation capabilities, and mix it with RL training to achieve
multi-objective hybrid optimization.
Experimental results on different datasets and models demonstrate consistent
improvements over traditional SFT and RL methods. Our method provides a new
technical path for language model training that enhances output quality while
preserving deployment efficiency.
♻ ☆ A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
Large language models (LLMs) hold promise in clinical decision support but
face major challenges in safety evaluation and effectiveness validation. We
developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a
multidimensional framework built on clinical expert consensus, encompassing 30
criteria covering key areas such as critical illness recognition, guideline
adherence, and medication safety, with weighted consequence measures.
Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A
items aligned with these criteria, spanning 26 clinical departments to simulate
real-world scenarios. Benchmark testing of six LLMs revealed moderate overall
performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%),
with a significant 13.3% performance drop in high-risk scenarios (p <
0.0001). Domain-specific medical LLMs showed consistent performance advantages
over general-purpose models, with relatively higher top scores in safety
(0.912) and effectiveness (0.861). The findings of this study not only provide
a standardized metric for evaluating the clinical application of medical LLMs,
facilitating comparative analyses, risk exposure identification, and
improvement directions across different scenarios, but also hold the potential
to promote safer and more effective deployment of large language models in
healthcare environments.
♻ ☆ Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
Autonomous driving systems must operate reliably in safety-critical
scenarios, particularly those involving unusual or complex behavior by
Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets
is essential for robust evaluation and generalization, but retrieving such rare
human behavior scenarios within the long tail of large-scale datasets is
challenging. To support targeted evaluation of autonomous driving systems in
diverse, human-centered scenarios, we propose a novel context-aware motion
retrieval framework. Our method combines Skinned Multi-Person Linear
(SMPL)-based motion sequences and corresponding video frames before encoding
them into a shared multimodal embedding space aligned with natural language.
Our approach enables the scalable retrieval of human behavior and its context
through text queries. This work also introduces our dataset WayMoCo, an
extension of the Waymo Open Dataset. It contains automatically labeled motion
and scene context descriptions derived from generated pseudo-ground-truth SMPL
sequences and corresponding image data. Our approach outperforms
state-of-the-art models by up to 27.5% accuracy in motion-context retrieval,
when evaluated on the WayMoCo dataset.
comment: Project page: https://iv.ee.hm.edu/contextmotionclip/; This work has
been submitted to the IEEE for possible publication
♻ ☆ A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models
Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as
a promising paradigm for enhancing large language models (LLMs) by converting
raw text into structured knowledge graphs, improving both accuracy and
explainability. However, GraphRAG relies on LLMs to extract knowledge from raw
text during graph construction, and this process can be maliciously manipulated
to implant misleading information. Targeting this attack surface, we propose
two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a
few words in the source text can significantly change the constructed graph,
poison the GraphRAG, and severely mislead downstream reasoning. The first
attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate
vulnerable nodes in the generated graphs and rewrites the corresponding
narratives with LLMs, achieving precise control over specific
question-answering (QA) outcomes with a success rate of 93.1%, while keeping
the poisoned text fluent and natural. The second attack, named Universal KPA
(UKPA), exploits linguistic cues such as pronouns and dependency relations to
disrupt the structural integrity of the generated graph by altering globally
influential words. With fewer than 0.05% of the full text modified, the QA
accuracy collapses from 95% to 50%. Furthermore, experiments show that
state-of-the-art defense methods fail to detect these attacks, highlighting
that securing GraphRAG pipelines against knowledge poisoning remains largely
unexplored.
♻ ☆ Unsupervised Document and Template Clustering using Multimodal Embeddings
This paper investigates a novel approach to unsupervised document clustering
by leveraging multimodal embeddings as input to clustering algorithms such as
$k$-Means, DBSCAN, a combination of HDBSCAN and $k$-NN, and BIRCH. Our method
aims to achieve a finer-grained document understanding by not only grouping
documents at the type level (e.g., invoices, purchase orders), but also
distinguishing between different templates within the same document category.
This is achieved by using embeddings that capture textual content, layout
information, and visual features of documents. We evaluated the effectiveness
of this approach using embeddings generated by several state-of-the-art
pre-trained multimodal models, including SBERT, LayoutLMv1, LayoutLMv3, DiT,
Donut, ColPali, Gemma3, and InternVL3. Our findings demonstrate the potential
of multimodal embeddings to significantly enhance document clustering, offering
benefits for various applications in intelligent document processing, document
layout analysis, and unsupervised document classification. This work provides
valuable insight into the advantages and limitations of different multimodal
models for this task and opens new avenues for future research to understand
and organize document collections.
comment: 22 pages, 12 figures
♻ ☆ From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
Yuying Shang, Xinyi Zeng, Yutao Zhu, Xiao Yang, Zhengwei Fang, Jingyuan Zhang, Jiawei Chen, Zinan Liu, Yu Tian
Hallucinations in large vision-language models (LVLMs) are a significant
challenge, i.e., generating objects that are not presented in the visual input,
which impairs their reliability. Recent studies often attribute hallucinations
to a lack of understanding of visual input, yet ignore a more fundamental
issue: the model's inability to effectively extract or decouple visual
features. In this paper, we revisit the hallucinations in LVLMs from an
architectural perspective, investigating whether the primary cause lies in the
visual encoder (feature extraction) or the modal alignment module (feature
decoupling). Motivated by our findings on the preliminary investigation, we
propose a novel tuning strategy, PATCH, to mitigate hallucinations in LVLMs.
This plug-and-play method can be integrated into various LVLMs, utilizing
adaptive virtual tokens to extract object features from bounding boxes, thereby
addressing hallucinations caused by insufficient decoupling of visual features.
PATCH achieves state-of-the-art performance on multiple multi-modal
hallucination datasets. We hope this approach provides researchers with deeper
insights into the underlying causes of hallucinations in LVLMs, fostering
further advancements and innovation in this field.
♻ ☆ Trainable Dynamic Mask Sparse Attention
In large language models, the demand for modeling long contexts is constantly
increasing, but the quadratic complexity of the standard self-attention
mechanism often becomes a bottleneck. Although existing sparse attention
mechanisms have improved efficiency, they may still encounter issues such as
static patterns or information loss. We introduce a trainable dynamic mask
sparse attention mechanism, Dynamic Mask Attention, which effectively utilizes
content-aware and position-aware sparsity. DMA achieves this through two key
innovations: First, it dynamically generates content-aware sparse masks from
value representations, enabling the model to identify and focus on critical
information adaptively. Second, it implements position-aware sparse attention
computation that effectively skips unnecessary calculation regions. This
dual-sparsity design allows the model to significantly reduce the computational
complexity of important information while retaining complete information,
achieving an excellent balance between information fidelity and computational
efficiency. We have verified the performance of DMA through comprehensive
experiments. Comparative studies show that DMA outperforms multi-head
attention, sliding window attention, multi-head latent attention, and native
sparse attention in terms of perplexity under Chinchilla Scaling Law settings.
Moreover, in challenging multi-query associative recall tasks, DMA also
demonstrates superior performance and efficiency compared to these methods.
Crucially, in the evaluation of a 1.7B parameter model, DMA significantly
outperforms multi-head attention in both standard benchmark performance and the
challenging needle-in-a-haystack task. These experimental results highlight its
capability to balance model efficiency and long-context modeling ability
effectively.
comment: 8 figures, 4 tables
♻ ☆ Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
We present Klear-Reasoner, a model with long reasoning capabilities that
demonstrates careful deliberation during problem solving, achieving outstanding
performance across multiple benchmarks. Although there are already many
excellent works related to inference models in the current community, there are
still many problems with reproducing high-performance inference models due to
incomplete disclosure of training details. This report provides an in-depth
analysis of the reasoning model, covering the entire post-training workflow
from data preparation and long Chain-of-Thought supervised fine-tuning (long
CoT SFT) to reinforcement learning (RL), along with detailed ablation studies
for each experimental component. For SFT data, our experiments show that a
small number of high-quality data sources are more effective than a large
number of diverse data sources, and that difficult samples can achieve better
results without accuracy filtering. In addition, we investigate two key issues
with current clipping mechanisms in RL: Clipping suppresses critical
exploration signals and ignores suboptimal trajectories. To address these
challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO)
that gently backpropagates gradients from clipped tokens. GPPO not only
enhances the model's exploration capacity but also improves its efficiency in
learning from negative samples. Klear-Reasoner exhibits exceptional reasoning
abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on
AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
♻ ☆ BriLLM: Brain-inspired Large Language Model
We present BriLLM, a brain-inspired large language model that fundamentally
reimagines machine learning foundations through Signal Fully-connected flowing
(SiFu) learning. Addressing core limitations in Transformer-based models
including black-box opacity, quadratic complexity, and context-length
dependency, BriLLM incorporates two key neurocognitive principles: first,
static semantic mapping where tokens map to specialized nodes analogous to
cortical regions, and second, dynamic signal propagation simulating
electrophysiological information flow. This architecture enables three
breakthroughs: full model interpretability, context-length independent scaling,
and the first global-scale simulation of brain-like processing. Initial 1 to 2B
parameter models demonstrate GPT-1-level generative capabilities with stable
perplexity reduction. Scalability analyses confirm feasibility of 100 to 200B
parameter variants processing 40,000-token contexts. BriLLM establishes a new
paradigm for biologically grounded AGI development.
♻ ☆ Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Loza Vera, Muhammad Dehan Al Kautsar, Fajri Koto
As large language models (LLMs) are increasingly deployed in enterprise
settings, controlling model behavior based on user roles becomes an essential
requirement. Existing safety methods typically assume uniform access and focus
on preventing harmful or toxic outputs, without addressing role-specific access
constraints. In this work, we investigate whether LLMs can be fine-tuned to
generate responses that reflect the access privileges associated with different
organizational roles. We explore three modeling strategies: a BERT-based
classifier, an LLM-based classifier, and role-conditioned generation. To
evaluate these approaches, we construct two complementary datasets. The first
is adapted from existing instruction-tuning corpora through clustering and role
labeling, while the second is synthetically generated to reflect realistic,
role-sensitive enterprise scenarios. We assess model performance across varying
organizational structures and analyze robustness to prompt injection, role
mismatch, and jailbreak attempts.
♻ ☆ Marco-Voice Technical Report
Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
This paper presents a multifunctional speech synthesis system that integrates
voice cloning and emotion control speech synthesis within a unified framework.
The goal of this work is to address longstanding challenges in achieving highly
expressive, controllable, and natural speech generation that faithfully
preserves speaker identity across diverse linguistic and emotional contexts.
Our approach introduces an effective speaker-emotion disentanglement mechanism
with in-batch contrastive learning, enabling independent manipulation of
speaker identity and emotional style, as well as a rotational emotion
embedding integration method for smooth emotion control. To support
comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality
emotional speech dataset containing 10 hours of Mandarin speech from six
professional speakers across seven emotional categories. Extensive experiments
demonstrate that our system, Marco-Voice, achieves substantial improvements in
both objective and subjective metrics. Comprehensive evaluations and analyses
show that Marco-Voice delivers competitive performance
in terms of speech clarity and emotional richness, representing a substantial
advance in the field of expressive neural speech synthesis. Our code and
dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and
https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively.
comment: Technical Report. Our code and dataset are publicly available at
https://github.com/AIDC-AI/Marco-Voice and
https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively
♻ ☆ Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
Recent advancements in large language models, multimodal large language
models, and large audio language models (LALMs) have significantly improved
their reasoning capabilities through reinforcement learning with rule-based
rewards. However, the explicit reasoning process has yet to show significant
benefits for audio question answering, and effectively leveraging deep
reasoning remains an open challenge, with LALMs still falling short of
human-level auditory-language reasoning. To address these limitations, we
propose Audio-Thinker, a reinforcement learning framework designed to enhance
the reasoning capabilities of LALMs, with a focus on improving adaptability,
consistency, and effectiveness. Our approach introduces an adaptive think
accuracy reward, enabling the model to adjust its reasoning strategies based on
task complexity dynamically. Furthermore, we incorporate an external reward
model to evaluate the overall consistency and quality of the reasoning process,
complemented by think-based rewards that help the model distinguish between
valid and flawed reasoning paths during training. Experimental results
demonstrate that our Audio-Thinker model outperforms existing
reasoning-oriented LALMs across various benchmark tasks, exhibiting superior
reasoning and generalization capabilities.
comment: preprint
♻ ☆ DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
We present DYNARTmo, a dynamic articulatory model designed to visualize
speech articulation processes in a two-dimensional midsagittal plane. The model
builds upon the UK-DYNAMO framework and integrates principles of articulatory
underspecification, segmental and gestural control, and coarticulation.
DYNARTmo simulates six key articulators based on ten continuous and six
discrete control parameters, allowing for the generation of both vocalic and
consonantal articulatory configurations. The current implementation is embedded
in a web-based application (SpeechArticulationTrainer) that includes sagittal,
glottal, and palatal views, making it suitable for use in phonetics education
and speech therapy. While this paper focuses on the static modeling aspects,
future work will address dynamic movement generation and integration with
articulatory-acoustic modules.
comment: 10 pages, 29 references, 2 figures, supplementary material. V2:
Discussion of the tongue-palate contact pattern for /t/. V3: table 2:
"lateral" added
♻ ☆ Grounding Multilingual Multimodal LLMs With Cultural Knowledge
Multimodal Large Language Models excel in high-resource settings, but often
misinterpret long-tail cultural entities and underperform in low-resource
languages. To address this gap, we propose a data-centric approach that
directly grounds MLLMs in cultural knowledge. Leveraging a large-scale
knowledge graph from Wikidata, we collect images that represent culturally
significant entities, and generate synthetic multilingual visual question
answering data. The resulting dataset, CulturalGround, comprises 22 million
high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages.
We train an open-source MLLM CulturalPangea on CulturalGround, interleaving
standard multilingual instruction-tuning data to preserve general abilities.
CulturalPangea achieves state-of-the-art performance among open models on
various culture-focused multilingual multimodal benchmarks, outperforming prior
models by an average of 5.0 without degrading results on mainstream
vision-language tasks. Our findings show that our targeted, culturally grounded
approach could substantially narrow the cultural gap in MLLMs and offer a
practical path towards globally inclusive multimodal systems.
♻ ☆ REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation
Reinforcement learning (RL) is emerging as a powerful paradigm for enabling
large language models (LLMs) to perform complex reasoning tasks. Recent
advances indicate that integrating RL with retrieval-augmented generation (RAG)
allows LLMs to dynamically incorporate external knowledge, leading to more
informed and robust decision making. However, we identify a critical challenge
during policy-driven trajectory sampling: LLMs are frequently trapped in
unproductive reasoning paths, which we refer to as "dead ends", committing to
overconfident yet incorrect conclusions. This severely hampers exploration and
undermines effective policy optimization. To address this challenge, we propose
REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented
Generation), a novel framework that explores alternative reasoning paths while
maintaining rigorous policy learning through principled distributional
corrections. Our approach introduces two key innovations: (1) Mixed Sampling
Strategy, which combines a novel probe sampling method with exploratory prompts
to escape dead ends; and (2) Policy Correction Mechanism, which employs
importance sampling to correct distribution shifts induced by mixed sampling,
thereby mitigating gradient estimation bias. We evaluate it on seven
question-answering benchmarks, and the experimental results show that REX-RAG
achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B
over strong baselines, demonstrating competitive results across multiple
datasets. The code is publicly available at https://github.com/MiliLab/REX-RAG.
comment: 17 pages, 4 figures; updated references
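The importance-sampling idea behind the policy correction can be illustrated on a toy discrete distribution. This is a generic sketch of importance weighting under a mixture behaviour distribution, not the REX-RAG implementation; the mixture coefficient `probe_ratio`, the example distributions, and all function names are hypothetical.

```python
from typing import Dict

Dist = Dict[str, float]


def mixture(policy: Dist, probe: Dist, probe_ratio: float) -> Dist:
    """Behaviour distribution: draw from the probe with prob. probe_ratio, else from the policy."""
    return {a: (1.0 - probe_ratio) * policy[a] + probe_ratio * probe[a] for a in policy}


def importance_weight(policy_p: float, behaviour_p: float) -> float:
    """Ratio that re-weights a behaviour-distribution sample back to the policy."""
    return policy_p / behaviour_p


def expectation(dist: Dist, reward: Dist) -> float:
    """Exact expected reward under dist (no correction)."""
    return sum(dist[a] * reward[a] for a in dist)


def corrected_expectation(policy: Dist, behaviour: Dist, reward: Dist) -> float:
    """Exact expectation, under the behaviour distribution, of weight * reward."""
    return sum(
        behaviour[a] * importance_weight(policy[a], behaviour[a]) * reward[a]
        for a in policy
    )


# Toy example (all numbers are made up for illustration).
policy = {"a": 0.7, "b": 0.2, "c": 0.1}
probe = {"a": 0.1, "b": 0.1, "c": 0.8}   # exploratory sampler favouring "c"
reward = {"a": 0.0, "b": 0.0, "c": 1.0}
behaviour = mixture(policy, probe, probe_ratio=0.5)
```

In this toy setting, `expectation(behaviour, reward)` overstates the policy's expected reward because the probe oversamples "c", while `corrected_expectation(policy, behaviour, reward)` matches `expectation(policy, reward)`. Sampled estimates would only match in expectation, and the variance of the weights is a separate concern.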
♻ ☆ Decoding-based Regression
Language models have recently been shown capable of performing regression
wherein numeric predictions are represented as decoded strings. In this work,
we provide theoretical grounds for this capability and furthermore investigate
the utility of causal sequence decoding models as numeric regression heads
given any feature representation. We find that, despite being trained in the
usual way - for next-token prediction via cross-entropy loss - decoder-based
heads are as performant as standard pointwise heads when benchmarked over
standard regression tasks, while being flexible enough to capture smooth
numeric distributions, such as in the task of density estimation.
comment: Published in Transactions on Machine Learning Research (TMLR) 2025.
Code can be found at
https://github.com/google-research/optformer/tree/main/optformer/decoding_regression
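To make the idea of representing numeric predictions as decoded strings concrete, here is a minimal sketch of one possible digit-level tokenization. The fixed-length base-10 scheme, the normalization of targets to [0, 1), and the names `encode_value` and `decode_tokens` are assumptions for illustration, not the paper's tokenizer.

```python
from typing import List

NUM_DIGITS = 4  # hypothetical sequence length; more digits give finer resolution


def encode_value(y: float, num_digits: int = NUM_DIGITS) -> List[int]:
    """Map y in [0, 1) to a fixed-length sequence of base-10 digit tokens."""
    if not 0.0 <= y < 1.0:
        raise ValueError("y must be normalized to [0, 1)")
    # Truncate to num_digits decimal places and guard against rounding up to 10**num_digits.
    scaled = min(int(y * 10 ** num_digits), 10 ** num_digits - 1)
    return [(scaled // 10 ** (num_digits - 1 - i)) % 10 for i in range(num_digits)]


def decode_tokens(tokens: List[int]) -> float:
    """Invert encode_value: read the digit tokens back as a decimal fraction."""
    value = 0
    for token in tokens:
        value = value * 10 + token
    return value / 10 ** len(tokens)
```

A decoder head trained with cross-entropy would predict these tokens one at a time; the round trip `decode_tokens(encode_value(y))` recovers `y` up to the quantization step of `10 ** -NUM_DIGITS`.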
♻ ☆ AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks
through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by
Reinforcement Learning (RL), a process fraught with catastrophic forgetting and
suboptimal trade-offs between imitation and exploration. Recent single-stage
methods attempt to unify SFT and RL using heuristics, but lack a principled
mechanism for dynamically balancing the two paradigms. In this paper, we
reframe this challenge through the theoretical lens of implicit rewards,
viewing SFT and RL not as distinct methods but as complementary
reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel
single-stage algorithm that learns the optimal balance between SFT's implicit,
path-level reward and RL's explicit, outcome-based reward. The core of AMFT is
a meta-gradient adaptive weight controller that treats the SFT-RL
balance as a learnable parameter, dynamically optimizing it to maximize
long-term task performance. This forward-looking approach, regularized by
policy entropy for stability, autonomously discovers an effective training
curriculum. We conduct a comprehensive evaluation on challenging benchmarks
spanning mathematical reasoning, abstract visual reasoning (General Points),
and vision-language navigation (V-IRL). AMFT consistently establishes a new
state-of-the-art and demonstrates superior generalization on out-of-distribution
(OOD) tasks. Ablation studies and training dynamics analysis confirm that the
meta-learning controller is crucial for AMFT's stability, sample efficiency,
and performance, offering a more principled and effective paradigm for LLM
alignment. Our code is open-sourced at https://github.com/hlxtsyj/AMFT.
comment: https://github.com/hlxtsyj/AMFT
♻ ☆ Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs
Balancing exploration and exploitation is a central goal in reinforcement
learning (RL). Despite recent advances in enhancing large language model (LLM)
reasoning, most methods lean toward exploitation, and increasingly encounter
performance plateaus. In this work, we revisit entropy -- a signal of
exploration in RL -- and examine its relationship to exploratory reasoning in
LLMs. Through empirical analysis, we uncover positive correlations between
high-entropy regions and three types of exploratory reasoning actions: (1)
pivotal tokens that determine or connect logical steps, (2) reflective actions
such as self-verification and correction, and (3) rare behaviors under-explored
by the base LLMs. Motivated by this, we introduce a minimal modification to
standard RL with only one line of code: augmenting the advantage function with
an entropy-based term. Unlike traditional maximum-entropy methods which
encourage exploration by promoting uncertainty, we encourage exploration by
promoting longer and deeper reasoning chains. Notably, our method achieves
significant gains on the Pass@K metric -- an upper-bound estimator of LLM
reasoning capabilities -- even when evaluated with extremely large K values,
pushing the boundaries of LLM reasoning.
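As a rough illustration of augmenting an advantage with an entropy-based term, the sketch below adds a capped entropy bonus to a per-token advantage. The abstract does not give the exact formula, so the scaling coefficient `alpha`, the cap `|advantage| / kappa`, and the function names are assumptions of this sketch rather than the authors' definition.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shaped_advantage(
    advantage: float,
    entropy: float,
    alpha: float = 0.4,   # hypothetical scale of the entropy bonus
    kappa: float = 2.0,   # hypothetical cap factor; kappa > 1 keeps the sign of the advantage
) -> float:
    """Add an entropy bonus, capped at |advantage| / kappa so it cannot flip the sign."""
    bonus = min(alpha * entropy, abs(advantage) / kappa)
    return advantage + bonus
```

In a training loop, the entropy would typically be treated as a constant (no gradient through the bonus), so the term reshapes the advantage rather than acting as a maximum-entropy regularizer. With the cap in place, a zero advantage receives no bonus and a negative advantage stays negative.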
♻ ☆ Do Biased Models Have Biased Thoughts?
The impressive performance of language models is undeniable. However, the
presence of biases based on gender, race, socio-economic status, physical
appearance, and sexual orientation makes the deployment of language models
challenging. This paper studies the effect on fairness of chain-of-thought
prompting, a recent approach that elicits the steps followed by the model
before it responds. More specifically, we ask the following question:
"Do biased models have biased thoughts?" To answer our question, we
conduct experiments on 5 popular large language models using fairness metrics
to quantify 11 different biases in the model's thoughts and output. Our
results show that the bias in the thinking steps is not highly correlated with
the output bias (less than 0.6 correlation with a p-value smaller than
0.001 in most cases). In other words, unlike human beings, the tested models
with biased decisions do not always possess biased thoughts.
comment: Accepted at main track of the Second Conference on Language Modeling
(COLM 2025)
♻ ☆ Utilizing Large Language Models for Information Extraction from Real Estate Transactions
Real estate sales contracts contain crucial information for property
transactions, but manual data extraction can be time-consuming and error-prone.
This paper explores the application of large language models, specifically
transformer-based architectures, for automated information extraction from real
estate contracts. We discuss challenges, techniques, and future directions in
leveraging these models to improve efficiency and accuracy in real estate
contract analysis. We generate synthetic contracts from a real-world
transaction dataset and use them to fine-tune the large language model,
achieving significant improvements in metrics, as well as qualitative
improvements, in information retrieval and reasoning tasks.
♻ ☆ WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image ICCV 2025
Yuci Liang, Xinheng Lyu, Wenting Chen, Meidan Ding, Jipeng Zhang, Xiangjian He, Song Wu, Xiaohan Xing, Sen Yang, Xiyue Wang, Linlin Shen
Recent advancements in computational pathology have produced patch-level
Multi-modal Large Language Models (MLLMs), but these models are limited by
their inability to analyze whole slide images (WSIs) comprehensively and their
tendency to bypass crucial morphological features that pathologists rely on for
diagnosis. To address these challenges, we first introduce WSI-Bench, a
large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850
WSIs across 30 cancer types, designed to evaluate MLLMs' understanding of
morphological characteristics crucial for accurate diagnosis. Building upon
this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI
understanding that employs a three-stage training approach: WSI-text alignment,
feature space alignment, and task-specific instruction tuning. To better assess
model performance in pathological contexts, we develop two specialized WSI
metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that
WSI-LLaVA outperforms existing models across all capability dimensions, with a
significant improvement in morphological analysis, establishing a clear
correlation between morphological understanding and diagnostic accuracy.
comment: ICCV 2025, 38 pages, 22 figures, 35 tables
♻ ☆ LLM Unlearning Without an Expert Curated Dataset
Modern large language models often encode sensitive, harmful, or copyrighted
knowledge, raising the need for post-hoc unlearning: the ability to remove
specific domains of knowledge from a model without full retraining. A major
bottleneck in current unlearning pipelines is constructing effective forget
sets: datasets that approximate the target domain and guide the model to forget
it. In this work, we introduce a scalable, automated approach to generate
high-quality forget sets using language models themselves. Our method
synthesizes textbook-style data through a structured prompting pipeline,
requiring only a domain name as input. Through experiments on unlearning
biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic
datasets consistently outperform the baseline synthetic alternatives and are
comparable to the expert-curated ones. Additionally, ablation studies reveal
that the multi-step generation pipeline significantly boosts data diversity,
which in turn improves unlearning utility. Overall, our findings suggest that
synthetic datasets offer a promising path toward practical, scalable unlearning
for a wide range of emerging domains without the need for manual intervention.
We release our code and dataset at
https://github.com/xyzhu123/Synthetic_Textbook.
♻ ☆ ChatBench: From Static Benchmarks to Human-AI Evaluation ACL 2025
With the rapid adoption of LLM-based chatbots, there is a pressing need to
evaluate what humans and LLMs can achieve together. However, standard
benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e.,
"AI-alone"). Here, we design and conduct a user study to convert MMLU questions
into user-AI conversations, by seeding the user with the question and having
them carry out a conversation with the LLM to answer their question. We release
ChatBench, a new dataset with AI-alone, user-alone, and user-AI data for 396
questions and two LLMs, including 144K answers and 7,336 user-AI conversations.
We find that AI-alone accuracy fails to predict user-AI accuracy, with
significant differences across multiple subjects (math, physics, and moral
reasoning), and we analyze the user-AI conversations to provide insight into
how they diverge from AI-alone benchmarks. Finally, we show that fine-tuning a
user simulator on a subset of ChatBench improves its ability to estimate
user-AI accuracies, increasing correlation on held-out questions by more than
20 points, creating possibilities for scaling interactive evaluation.
comment: ACL 2025 (main)
♻ ☆ Task Diversity Shortens the ICL Plateau
In-context learning (ICL) describes a language model's ability to generate
outputs based on a set of input demonstrations and a subsequent query. To
understand this remarkable capability, researchers have studied simplified,
stylized models. These studies have consistently observed long loss plateaus,
during which models exhibit minimal improvement, followed by a sudden, rapid
surge of learning. In this work, we reveal that training on multiple diverse
ICL tasks simultaneously shortens the loss plateaus, making each task easier to
learn. This finding is surprising as it contradicts the natural intuition that
the combined complexity of multiple ICL tasks would lengthen the learning
process, not shorten it. Our result suggests that the recent success in
large-scale training of language models may be attributed not only to the
richness of the data at scale but also to the easier optimization (training)
induced by the diversity of natural language training data.
♻ ☆ Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Yifan Li, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Machine unlearning techniques aim to mitigate unintended memorization in
large language models (LLMs). However, existing approaches predominantly focus
on the explicit removal of isolated facts, often overlooking latent inferential
dependencies and the non-deterministic nature of knowledge within LLMs.
Consequently, facts presumed forgotten may persist implicitly through
correlated information. To address these challenges, we propose a knowledge
unlearning evaluation framework that more accurately captures the implicit
structure of real-world knowledge by representing relevant factual contexts as
knowledge graphs with associated confidence scores. We further develop an
inference-based evaluation protocol leveraging powerful LLMs as judges; these
judges reason over the extracted knowledge subgraph to determine unlearning
success. Our LLM judges utilize carefully designed prompts and are calibrated
against human evaluations to ensure their trustworthiness and stability.
Extensive experiments on our newly constructed benchmark demonstrate that our
framework provides a more realistic and rigorous assessment of unlearning
performance. Moreover, our findings reveal that current evaluation strategies
tend to overestimate unlearning effectiveness. Our code is publicly available
at https://github.com/Graph-COM/Knowledge_Unlearning.git.