Daily AI Papers

Summaries auto-generated from HuggingFace's Daily Papers using Gemini and GitHub Actions. All credits go to the research and HuggingFace communities.

🔉 You can get audio summaries via OpenAI's text-to-speech API on Telegram.

Note: Authors may be listed by their HuggingFace IDs. Additionally, summaries are generated by LLM and may contain mistakes. You can see the prompt used here here.

Papers for 2025-10-30

Title	Authors	Summary
JanusCoder: Towards a Foundational Visual-Programmatic Interface for
Code Intelligence (Read more on arXiv or HuggingFace)		This paper introduces JanusCoder, a suite of foundational models trained on a new 800K-sample multimodal dataset to establish a unified interface for generating code from visual and textual inputs. The primary research objective is to develop a generalist model that harmonizes a program’s symbolic logic with its visual expression, addressing the limitations of data scarcity and task-specific models. The key methodology involves creating the JANUSCODE-800K corpus via a novel data synthesis toolkit that leverages cross-domain synergies and a VLM-based reward model for quality control, which is then used to train the JanusCoder models. The models demonstrate superior performance across multiple benchmarks; notably, JANUSCODER-7B achieves a structural correctness score (TreeBLEU) of 0.25 on the WebCode2M benchmark, significantly outperforming GPT-4o’s score of 0.15. For AI practitioners, the principal implication is the validation of a data-centric approach where combining diverse data modalities (even text-only code) is crucial for training powerful open-source, visual-to-code generation models that can rival proprietary systems in applications like UI prototyping and data visualization replication.
Video-Thinker: Sparking “Thinking with Videos” via Reinforcement
Learning (Read more on arXiv or HuggingFace)	Runhao Fu, Linxin Song, Xingjian Wang, Jiarui Jin, Shijian Wang	Video-Thinker is a 7B MLLM that autonomously performs temporal grounding and captioning within its chain-of-thought reasoning for video understanding, trained using a two-stage SFT and GRPO approach on a curated 10K dataset. The research objective is to enable MLLMs to “think with videos” by intrinsically integrating temporal localization and content description capabilities into their reasoning process, eliminating the dependency on external tools. The methodology involves creating the Video-Thinker-10K dataset with structured reasoning traces containing `<time>`, `<caption>`, and `<think>` tags, followed by a two-stage training strategy: first, Supervised Fine-Tuning (SFT) to learn the format, then Group Relative Policy Optimization (GRPO) to strengthen the reasoning capability using final answer rewards. Video-Thinker-7B establishes state-of-the-art performance among 7B models, achieving 80.69% accuracy on the VRBench benchmark, a significant improvement over existing baselines. The principal implication for AI practitioners is that MLLMs can be trained to develop complex, intrinsic video reasoning abilities using a relatively small curated dataset (10K samples) and a combined SFT/RL approach, bypassing the need to engineer and integrate external video processing tools for temporal analysis tasks.
ReForm: Reflective Autoformalization with Prospective Bounded Sequence
Optimization (Read more on arXiv or HuggingFace)	Ruihua Song, Wayne Xin Zhao, Xinjie Chen, Jing Wu, GuoxinChen	ReForm is a reflective autoformalization paradigm that uses an iterative, self-correcting process trained with reinforcement learning to improve the semantic consistency of formal mathematical statements generated by LLMs. The primary objective is to overcome the semantic failures of current autoformalization models by moving from a simple one-pass translation approach to an iterative process that mimics human-like reflection and refinement. The key methodology is ReForm, which interweaves formal statement generation with semantic self-validation in a single autoregressive sequence, trained by a novel reinforcement learning algorithm called Prospective Bounded Sequence Optimization (PBSO) that uses heterogeneous rewards for both final statement accuracy and intermediate critique quality. The ReForm-32B model achieved an average improvement of 17.2 percentage points in semantic consistency over the strongest baselines, with a notable +30.0 percentage point gain on the AIME2025 benchmark. The principal implication for AI practitioners is that for complex reasoning tasks requiring high semantic fidelity, implementing iterative self-correction loops trained with multi-objective reinforcement learning can significantly outperform the standard one-pass generation paradigm, enabling models to autonomously identify and fix their own errors.
Scaling Latent Reasoning via Looped Language Models (Read more on arXiv or HuggingFace)		The paper introduces Ouro, a family of Looped Language Models (LoopLMs) that achieve superior parameter efficiency by integrating iterative latent computation and adaptive depth directly into pre-training on 7.7T tokens. The primary objective is to investigate whether looped architectures exhibit more favorable scaling behavior and enhanced reasoning capabilities compared to standard, non-recursive transformers by building reasoning into the pre-training phase. The methodology involves recurrently applying a block of parameter-shared transformer layers and training with a two-stage, entropy-regularized objective that uses a uniform prior to learn an adaptive early-exit mechanism. The primary results show that the 2.6B parameter Ouro model achieves a 90.85 on MATH500, significantly outperforming the 8B Qwen3 model’s score of 62.30; experiments on synthetic tasks demonstrate this advantage stems from superior knowledge manipulation rather than increased knowledge storage capacity (which remains ≈2 bits/parameter). For AI practitioners, the principal implication is that this architecture enables the deployment of models that achieve the performance of models 2-3x larger while maintaining a smaller memory footprint, facilitated by efficient KV cache sharing strategies that reduce inference memory overhead by 4x with minimal performance loss.
Reasoning-Aware GRPO using Process Mining (Read more on arXiv or HuggingFace)		This paper introduces PM4GRPO, a framework that enhances Group Relative Policy Optimization by incorporating process mining to reward the quality of a model’s reasoning procedure. The objective is to improve the reasoning capabilities of Large Reasoning Models by moving beyond outcome-centric rewards and instead evaluating the alignment of the student model’s reasoning process with that of a pretrained teacher model. The methodology utilizes process mining techniques, specifically Inductive Miner to model the reasoning trace and Alignment-based Conformance Checking to compute a “conformance reward” based on the F1-score of fitness and precision, which is then integrated into the total reward function for post-training. The proposed 7B parameter PM4GRPO model achieved state-of-the-art performance on multiple math benchmarks, scoring 91.1% on MATH 500 and 61.1% on Olympiad Bench, outperforming existing baselines. For AI practitioners, this research demonstrates that process mining is a viable and effective tool for creating sophisticated reward signals that evaluate intermediate generative steps, offering a new direction for enhancing the reasoning alignment and robustness of large models through reinforcement learning.
VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context
Learning (Read more on arXiv or HuggingFace)	Xiaoyu Shi, Liqian Ma, Qinghe Wang, Yiming Zhang, Baolu Li	VFXMaster is a unified, reference-based framework that generates dynamic visual effects by reformulating the task as an in-context learning problem, enabling generalization to unseen effects. The main objective is to overcome the scalability and generalization limitations of the “one-LoRA-per-effect” paradigm by developing a single model capable of imitating diverse visual effects from a reference video and applying them to a target image, including out-of-domain effects. The key methodology involves an in-context conditioning strategy that uses a reference prompt-video pair as an example, combined with an in-context attention mask to isolate effect attributes and prevent content leakage, and an efficient one-shot adaptation mechanism with learnable tokens for novel effects. The primary results demonstrate strong out-of-domain generalization, where the one-shot adaptation mechanism increases the Effect Fidelity Score from 0.47 to 0.70 and the Content Leakage Score from 0.79 to 0.87. The principal implication for AI practitioners is that this reference-based in-context learning approach provides a scalable and flexible method for building content creation tools that can adapt to new, user-provided visual effects without requiring extensive retraining for each effect.
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic,
and Long-Horizon Task Execution (Read more on arXiv or HuggingFace)	Haoze Wu, Weihao Zeng, Jian Zhao, Wenshuo Zhao, Junlong Li	The paper introduces TOOLATHLON, a benchmark for evaluating language agents on diverse, realistic, and long-horizon tasks across 32 software applications and 604 tools. The primary objective is to create a benchmark that accurately evaluates real-world agent performance by incorporating diverse applications, realistic initial environment states, and long-horizon, multi-step tasks. The methodology involves 108 manually crafted tasks requiring agents to interact with applications via Model Context Protocol (MCP) servers, with realistic initial states and deterministic, execution-based evaluation scripts. The evaluation reveals significant limitations in current models; the best-performing model, Claude-4.5-Sonnet, achieves a success rate of only 38.6%. The principal implication for AI practitioners is that current agents lack the robustness for complex real-world workflows, highlighting critical challenges in long-context handling, error recovery, and reliable tool use that must be addressed for practical deployment.
RegionE: Adaptive Region-Aware Generation for Efficient Image Editing (Read more on arXiv or HuggingFace)	Peng Ye, Mingzhu Shen, Maosen Zhao, Xianfang Zeng, Pengtao Chen	RegionE is a training-free framework that accelerates instruction-based image editing by adaptively partitioning images into edited and unedited regions and applying differentiated generation strategies. The objective is to reduce spatial and temporal computational redundancy in diffusion-based IIE models by developing an efficient, region-aware inference process. The methodology uses Adaptive Region Partitioning (ARP) to identify unedited regions for single-step prediction, while applying accelerated iterative denoising with a Region-Instruction KV Cache (RIKVCache) to edited regions. When applied to the Step1X-Edit model, RegionE achieved a 2.57x acceleration factor while maintaining a high PSNR of 30.520, outperforming baseline acceleration techniques. For AI practitioners, this framework provides a method to substantially decrease inference latency for diffusion-based editing tools, enabling more interactive applications by avoiding redundant computations on static image areas without retraining the underlying models.
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal
Perception and Generation (Read more on arXiv or HuggingFace)		The paper presents Ming-Flash-Omni, a 100-billion parameter sparse Mixture-of-Experts (MoE) unified architecture for multimodal perception and generation. The objective is to create a single, computationally efficient model that integrates comprehension and generation across vision, speech, and language. The methodology is built upon a sparse MoE architecture (Ling-Flash-2.0) with only 6.1 billion active parameters per token and introduces a “generative segmentation” paradigm that unifies image understanding and generation objectives by framing segmentation as an editing task. The model achieves state-of-the-art performance on all 12 contextual ASR benchmarks and a score of 0.90 on the GenEval text-to-image benchmark, surpassing leading non-Reinforcement Learning methods. The principal implication for AI practitioners is a scalable, unified architecture demonstrating that sparse MoE models can efficiently handle diverse multimodal tasks, while the generative segmentation technique provides a novel method for enhancing fine-grained spatio-semantic control in image generation systems.
ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in
Game RAG Benchmarks (Read more on arXiv or HuggingFace)		ChronoPlay introduces a framework for automatically generating dynamic and authentic RAG benchmarks for the gaming domain by modeling both knowledge evolution and user interest drift. The main objective is to create a standardized evaluation method for RAG systems in dynamic environments that captures the dual challenges of evolving game content and shifting player community focus. The methodology utilizes a dual-source synthesis engine that combines an authoritative knowledge base for factual grounding with community-mined question templates for authenticity, coupled with a dual-dynamic update mechanism that triggers benchmark refreshes based on either new knowledge or detected shifts in user interest topics. The primary results show that RAG system performance is highly volatile over a game’s lifecycle; for instance, the update to Phase 4 of the PUBG Mobile benchmark was driven entirely by user interest drift (48.2% of questions updated), while the update to Phase 3 was largely knowledge-driven (34.4%). The principal implication for AI practitioners is that developing robust RAG systems for dynamic, user-centric applications requires evaluation on benchmarks that track both knowledge updates and user interest drift to ensure models remain relevant and are not optimized on obsolete problems.
ODesign: A World Model for Biomolecular Interaction Design (Read more on arXiv or HuggingFace)	Qinghan Wang, Cheng Tan, Haitao Lin, Xujun Zhang, Odin Zhang	ODesign is a unified, all-atom generative world model for designing multimodal biomolecular interactions, including protein, nucleic acid, and small-molecule binders. The objective is to develop a single, controllable generative framework for “all-to-all” biomolecular interaction design, moving beyond specialized models to a general-purpose model that handles diverse molecule types as both targets and designed partners. The model adapts an AlphaFold3-like structure-prediction architecture for generative tasks using an all-atom conditional diffusion module, a unified token representation for diverse chemical units, and a hierarchical masking mechanism (all, entity, token, atom) for fine-grained conditional control. Across eleven benchmarks, ODesign consistently outperforms modality-specific models; in protein-binding protein design, it achieves an order-of-magnitude higher throughput of successful designs per day compared to the RFDiffusion baseline (average 2,672 vs. 555). The principal implication for AI practitioners is the demonstration of a successful architectural pattern for creating a scientific “world model”: repurposing a large, cross-modal predictive foundation model into a controllable generative system by implementing a unified representation and a hierarchical conditional control scheme.
The Principles of Diffusion Models (Read more on arXiv or HuggingFace)	Stefano Ermon, Yuki Mitsufuji, Dongjun Kim, Yang Song, Chieh-Hsin Lai	This monograph provides a unified theoretical framework for diffusion models, demonstrating that their varied formulations are mathematically equivalent interpretations of a single underlying generative process. The primary objective is to show how the variational, score-based, and flow-based perspectives, originating from VAEs, EBMs, and Normalizing Flows respectively, all converge on learning a time-dependent vector field to reverse a forward corruption process. The paper’s methodology is a systematic synthesis that uses the Fokker-Planck equation to connect discrete-time models to a continuous-time framework governed by stochastic and ordinary differential equations (SDEs/ODEs). The key result is that diverse training objectives are mathematically equivalent, as they all serve to learn the score function `∇x log p_t(x)` of the evolving marginal density, which uniquely defines the reverse generative dynamics. For AI practitioners, this implies that choices between different diffusion model formulations (e.g., DDPM, NCSN) and parameterizations (noise, velocity, or score prediction) are matters of numerical efficiency and stability, not fundamental modeling differences, as they are all discretizations of the same core process.
Multimodal Spatial Reasoning in the Large Model Era: A Survey and
Benchmarks (Read more on arXiv or HuggingFace)		This paper presents a comprehensive survey and introduces new benchmarks for multimodal spatial reasoning in large models, organizing the field through a detailed taxonomy. The main objective is to systematically review current techniques, categorize progress, and establish standardized evaluation protocols for MLLMs on tasks requiring spatial understanding. The methodology involves a literature survey structured into a taxonomy covering general MLLM techniques (e.g., post-training, tool use), 3D vision, embodied AI, and novel modalities, complemented by the collation and presentation of benchmark results. The survey’s evaluation of existing models finds significant performance variance; for example, GPT-4V achieves a high accuracy of 0.924 on the SPATIALEVAL benchmark but a lower success rate of 58.14% on SPATIALRGPT-BENCH. The principal implication for AI practitioners is the provision of a structured framework and open benchmarks (available on GitHub) that enable standardized evaluation and comparison of MLLM spatial reasoning capabilities, guiding the development of models for applications in robotics, navigation, and AR.
PairUni: Pairwise Training for Unified Multimodal Language Models (Read more on arXiv or HuggingFace)		PairUni is a reinforcement learning framework that improves joint optimization of understanding and generation in unified vision-language models by reorganizing data into semantic pairs and using a pair-aware policy optimization algorithm. The main objective is to mitigate task interference when training a single UVLM on heterogeneous understanding and generation tasks, which often have conflicting optimization gradients. The methodology involves creating a “PairUG” dataset by augmenting data into aligned understanding-generation quadruples and retrieving semantically similar cross-task examples, then applying Pair-GPRO, a variant of Group Relative Policy Optimization that weights the advantage signal by the pair’s semantic similarity score. On the Janus-Pro-7B backbone, the approach improves the MMMU understanding benchmark score from 41.1 to 47.0 and the WISE generation benchmark score from 0.35 to 0.45. The principal implication for AI practitioners is that explicitly aligning training data at the instance level and using an optimization algorithm that respects this alignment is a more effective strategy for building balanced, unified multimodal models than naively mixing heterogeneous datasets.
Parallel Loop Transformer for Efficient Test-Time Computation Scaling (Read more on arXiv or HuggingFace)		The paper introduces the Parallel Loop Transformer (PLT), an architecture that parallelizes the sequential computation of looped transformers to achieve greater effective depth without increasing inference latency or memory. The primary objective is to overcome the linear scaling of latency and memory costs in traditional looped transformers, which execute computational “loops” sequentially for each token. PLT’s methodology is based on two key techniques: Cross-Loop Parallelism (CLP), which computes different loops for different tokens concurrently within a single forward pass, and an Efficient Representation Enhancement strategy that shares the first loop’s KV cache and uses Gated Sliding-Window Attention (G-SWA) to maintain accuracy. The primary result shows that a 2-loop PLT achieves the accuracy of a vanilla 2-loop model while increasing latency by only 2% and KV cache by 1.4% over a non-looped baseline, effectively decoupling performance gains from inference costs. For AI practitioners, the principal implication is that PLT enables the deployment of effectively deeper and more accurate models without the typical penalty of higher latency or memory usage, allowing for more powerful models to operate within strict production-level serving constraints.
Rethinking Driving World Model as Synthetic Data Generator for
Perception Tasks (Read more on arXiv or HuggingFace)		This paper presents Dream4Drive, a synthetic data generation framework using 3D-aware guidance maps to create high-quality, editable driving videos for training perception models. The research objective is to demonstrate that a small amount of high-quality synthetic data can significantly improve downstream perception tasks under fair evaluation conditions where the total number of training epochs is constant. The methodology involves decomposing input videos into dense 3D-aware guidance maps (e.g., depth, normal, mask), rendering 3D assets onto these maps, and then using a fine-tuned Diffusion Transformer to generate photorealistic videos. With fewer than 2% additional synthetic samples (+420), Dream4Drive improves the nuScenes Detection Score (NDS) from 50.4 to 50.6 at 2x training epochs and boosts NDS from 47.9 to 52.0 at 1x epoch on a higher resolution. The principal implication for AI practitioners is that perception model performance can be significantly enhanced by augmenting training sets with a very small volume of high-fidelity synthetic data, offering a more efficient alternative to doubling training time on real data or using large volumes of lower-quality synthetic data.
Evolving Diagnostic Agents in a Virtual Clinical Environment (Read more on arXiv or HuggingFace)		This paper introduces DiagGym, a simulated clinical environment, to train an LLM-based diagnostic agent, DiagAgent, using reinforcement learning for multi-turn clinical reasoning. The research objective is to enable an agent to learn an optimal policy for adaptively selecting examinations and making a final diagnosis, overcoming the limitations of static, single-shot prediction models. The core methodology involves fine-tuning a generative world model (DiagGym) on EHR data to provide realistic feedback, and then training DiagAgent within this environment via end-to-end RL to maximize rewards based on diagnostic accuracy and information yield. In a practical end-to-end setting, DiagAgent achieves a 15.12% absolute increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score over the strongest baseline. The principal implication for AI practitioners is that training agents in high-fidelity, interactive simulation environments enables the acquisition of dynamic, sequential decision-making capabilities that are unattainable through supervised fine-tuning on static datasets alone.
Gaperon: A Peppered English-French Generative Language Model Suite (Read more on arXiv or HuggingFace)	Éric de la Clergerie, Rachel Bawden, Rian Touchent, Wissam Antoun, Nathan Godey	The paper introduces GAPERON, a suite of open English-French models, and investigates the trade-offs between linguistic quality, benchmark performance, and data contamination during pretraining. The main objective is to build a transparent, reproducible suite of bilingual language models and study how data curation strategies—specifically filtering for linguistic quality versus including benchmark data—impact generative abilities and standardized benchmark scores. The authors trained 1.5B, 8B, and 24B parameter models on 2-4 trillion tokens using a custom data pipeline with a neural quality classifier and progressive data mixing, creating distinct versions including a “clean” model (“Young”) and a deliberately contaminated one (“Garlic”). Primary results show that filtering for linguistic quality yields subpar benchmark scores, whereas late, deliberate contamination with test sets significantly boosts performance (e.g., the 24B model’s average score increased from 65.86 to 81.11) while only moderately degrading generation quality. The principal implication for AI practitioners is that high benchmark scores can be artificially inflated by both intentional and unintentional training data contamination, and that the choice of a data quality filter can implicitly bias a model towards benchmark-style data, a critical consideration when preparing pretraining corpora.
SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In
Text-only LLMs (Read more on arXiv or HuggingFace)	Jiaxuan You, Haoqi Chen, Haoru Li, Zijia Liu, Weijia Zhang	The SeeingEye framework enables text-only LLMs to perform multimodal reasoning by using a lightweight VLM “translator agent” to convert visual data into a structured textual representation that is iteratively refined through a feedback loop with a separate LLM “reasoning agent”. The main objective is to bridge text-only LLM reasoners with effective and cost-efficient multimodal reasoning capabilities that can outperform monolithic Vision Language Models (VLMs). The key methodology is a decoupled, two-agent system where a lightweight VLM Translator Agent uses tools to distill visual inputs into a Structured Intermediate Representation (SIR), which is then processed by a text-only LLM Reasoning Agent; the agents engage in a multi-round feedback loop to refine the SIR. The primary result is that an instantiation combining a 3B parameter VLM translator and an 8B parameter LLM reasoner achieves 44.62% accuracy on the MMMU-Pro_std benchmark, significantly outperforming a monolithic 32B parameter VLM which scored 32.93%. The principal implication for AI practitioners is that this modular, plug-and-play architecture offers a scalable and cost-efficient pathway to leverage the advanced reasoning of powerful text-only LLMs for multimodal tasks without requiring the training or deployment of large, end-to-end multimodal models.
MASPRM: Multi-Agent System Process Reward Model (Read more on arXiv or HuggingFace)	Ying Xiong, Zirui Zhou, Mahdi Mostajabdaveh, Milad Yazdani	The paper introduces MASPRM, a process reward model trained via search-generated supervision from MCTS rollouts to guide inference-time search in multi-agent systems. The main objective is to develop a process reward model for multi-agent systems that provides dense, per-agent feedback to guide inference-time search and improve problem-solving accuracy under fixed compute budgets, without requiring manual step-level annotations. The key methodology involves using multi-agent Monte Carlo Tree Search (MCTS) to generate problem-solving rollouts; the terminal reward is then backpropagated to create Q-value estimates for intermediate states, which serve as regression targets to train the MASPRM value head. On the GSM8K benchmark, MASPRM-guided MCTS combined with a final outcome reward model achieved 74.6% exact match, a +30.7 percentage point improvement over a single straight-through MAS pass. The principal implication is that practitioners can use MASPRM as a plug-in, inference-time controller to improve the reliability and compute-efficiency of multi-agent workflows for complex reasoning, offering a scalable method to enhance performance without altering underlying agent policies or requiring manual annotation.
FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable
Reasoning (Read more on arXiv or HuggingFace)	Xin Liu, Haibin Lin, Juntao Li, Chi Zhang, Yuyang Ding	FAPO is a policy optimization algorithm that improves LLM reasoning by penalizing rollouts with flawed logic but correct final answers to enhance training efficiency and reliability. The research objective is to mitigate the negative impact of such “flawed-positive” rollouts, which are reinforced by standard rule-based outcome rewards in reinforcement learning. The core methodology involves FAPO, which applies a parameter-free reward penalty to flawed positives, and a generative reward model (GenRM) trained with a step-wise process reward to accurately detect these reasoning errors. FAPO demonstrates improved outcome correctness and process reliability, with the FAPO-32B model achieving a +3.1 point gain on the AIME25 benchmark over the baseline. For AI practitioners, FAPO offers a method to enhance the reasoning reliability of models trained via reinforcement learning by explicitly managing flawed reasoning paths, without increasing the token budget or introducing complex reward-shaping hyperparameters.
TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological
Counseling (Read more on arXiv or HuggingFace)	Zheng Zhang, Qianning Wang, Chiyuan Ma, Yucheng Zhou, He Hu	TheraMind introduces a strategic and adaptive agent for longitudinal psychological counseling. Its primary objective is to overcome clinical amnesia and strategic rigidity in existing LLM-based counseling agents. The core methodology utilizes a novel dual-loop architecture, separating tactical dialogue management (Intra-Session Loop) from strategic therapeutic planning (Cross-Session Loop), with LLM-based therapy evaluation and selection. TheraMind achieved a state-of-the-art multi-session average score of 2.755, demonstrating an 18.2% relative improvement over its backbone model. This highlights that dual-loop architectures can equip LLM agents with critical strategic and adaptive reasoning capabilities for complex, longitudinal AI applications.
BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic
Domains (Read more on arXiv or HuggingFace)		BhashaBench V1 is a novel, comprehensive, domain-specific, bilingual benchmark designed to evaluate large language models on India-centric knowledge systems across critical domains. The primary objective of BhashaBench V1 is to comprehensively assess domain-specific knowledge and reasoning capabilities of large language models within India’s diverse and culturally rich knowledge ecosystems, addressing gaps in Anglocentric and domain-agnostic evaluation. The benchmark comprises 74,166 meticulously curated question-answer pairs in English and Hindi, sourced from authentic government and domain-specific exams across Agriculture, Legal, Finance, and Ayurveda. Evaluations of 29+ LLMs on BhashaBench V1 revealed significant domain and language-specific performance gaps; for example, GPT-4o achieved 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. This benchmark underscores the critical importance for AI practitioners to develop specialized models that integrate India-specific knowledge, cultural contexts, and robust multilingual capabilities for effective deployment in diverse Indian contexts.
Fortytwo: Swarm Inference with Peer-Ranked Consensus (Read more on arXiv or HuggingFace)		This paper introduces Fortytwo, a decentralized AI inference protocol that leverages swarm intelligence and peer-ranked consensus to achieve higher accuracy and robustness than individual monolithic models. The primary objective is to design a scalable inference system that aggregates outputs from heterogeneous AI agents to produce a single, superior-quality response. The methodology utilizes a swarm of dual-role nodes that both generate responses and conduct pairwise ranking of peer outputs, with consensus formed via a reputation-weighted Bradley-Terry aggregation model applied to multi-token reasoning chains and secured by a proof-of-capability mechanism against Sybil attacks. The protocol achieved 85.90% accuracy on the GPQA Diamond benchmark, an improvement of +17.21 percentage points over majority voting, and exhibited only 0.12% performance degradation under adversarial prompting, compared to an average of 6.20% for single models. For AI practitioners, this research presents a viable architecture for building highly robust and performant inference systems by ensembling diverse models, offering a path to achieve state-of-the-art results and resilience without relying on a single, centralized frontier model.

Papers for 2025-10-29

Title	Authors	Summary
InteractComp: Evaluating Search Agents With Ambiguous Queries (Read more on arXiv or HuggingFace)	Fashen Ren, Jiayi Zhang, Yani Fan, Lijun Huang, Mingyi Deng	This paper introduces INTERACTCOMP, a benchmark for evaluating the capability of search agents to resolve ambiguous queries through interaction, revealing a critical failure in current models. The main objective is to evaluate whether search agents can recognize query ambiguity and actively interact with a user to gather disambiguating information, a capability unaddressed by existing search benchmarks. The key methodology is the construction of a 210-instance benchmark using a “target-distractor” design, where ambiguous questions are crafted from the shared attributes of two entities, forcing agents to use an `interact` action to uncover hidden, distinctive context to find the correct answer. The primary result across 17 models is a systematic failure to engage in interaction; the top-performing model achieved only 13.73% accuracy, whereas performance on the same questions with complete context reached 71.50%, demonstrating that the failure stems from overconfidence rather than a lack of reasoning ability. The principal implication for AI practitioners is that search agents cannot be assumed to handle underspecified queries; they exhibit a critical blind spot in actively seeking clarification, which will lead to incorrect and confident outputs in real-world applications unless agents are explicitly trained for interactive disambiguation.
Tongyi DeepResearch Technical Report (Read more on arXiv or HuggingFace)		This paper presents Tongyi DeepResearch, an open-source agentic language model designed for complex, long-horizon information-seeking and research tasks. The main objective is to create a scalable, end-to-end paradigm for training autonomous AI researchers capable of planning, searching, reasoning, and synthesizing knowledge. The core methodology involves a novel two-stage training framework comprising agentic continual pre-training (mid-training) to build an agentic inductive bias, followed by supervised fine-tuning and on-policy reinforcement learning (post-training), all driven by a fully automated, scalable synthetic data generation pipeline. The resulting 30.5B parameter model achieves state-of-the-art performance on multiple agentic benchmarks, scoring 90.6 on FRAMES and 70.9 on GAIA, while activating only 3.3B parameters per token. The principal implication for AI practitioners is that this work provides a complete, open-source blueprint for building highly capable research agents without human-annotated data, demonstrating that a structured training pipeline using synthetic data offers a scalable and reproducible path toward more advanced agentic systems.
AgentFold: Long-Horizon Web Agents with Proactive Context Management (Read more on arXiv or HuggingFace)		AgentFold is a novel web agent paradigm that introduces proactive, learned context management to enhance performance and scalability on long-horizon tasks. The paper’s primary objective is to resolve the fundamental trade-off between context saturation in append-only agents and the irreversible loss of critical details from fixed summarization methods. Its key methodology involves structuring the agent’s context into `Multi-Scale State Summaries` and a `Latest Interaction`, and training the agent via supervised fine-tuning to issue a “folding” directive that either granularly condenses a single step or deeply consolidates an entire sub-task. The resulting AgentFold-30B-A3B agent achieves 36.2% on the BrowseComp benchmark, outperforming models over 20 times its size, while maintaining a context size that is 92% smaller than a comparable ReAct agent after 100 turns. For AI practitioners, the principal implication is a concrete architecture for building more efficient and capable long-horizon agents by making dynamic context management a core, learnable component of the agent’s reasoning process, thus reducing computational overhead and enabling sustained, complex interactions.
RoboOmni: Proactive Robot Manipulation in Omni-modal Context (Read more on arXiv or HuggingFace)		The paper introduces RoboOmni, an end-to-end omni-modal framework for proactive robotic manipulation that infers user intent from speech, environmental sounds, and visual cues. The primary objective is to enable a robot to proactively understand and verify latent user intent from cross-modal context, moving beyond reliance on explicit commands. The methodology centers on a Perceiver-Thinker-Talker-Executor architecture, an omni-modal LLM that unifies perception and action generation in a single autoregressive model, trained on a new 140k-episode dataset called OmniAction. RoboOmni achieves an 85.6% success rate in simulation, substantially outperforming the strongest cascaded ASR-VLA baseline which scored 25.9%. The principal implication for AI practitioners is that end-to-end omni-modal models, by directly processing raw audio and avoiding intermediate representations like ASR, are critical for developing robust human-robot interaction systems that can interpret the subtle contextual and paralinguistic cues essential for proactive assistance.
Game-TARS: Pretrained Foundation Models for Scalable Generalist
Multimodal Game Agents (Read more on arXiv or HuggingFace)		Game-TARS is a generalist multimodal agent pretrained on over 500B tokens using a unified action space of native keyboard-mouse inputs to achieve broad generalization across diverse digital environments. The research objective is to develop a scalable foundation model for game agents by shifting from environment-specific APIs to this universal, low-level action representation. The methodology involves large-scale continual pre-training on game and agentic trajectories using a Sparse ReAct paradigm, a “Thinking Aloud” data collection protocol, and a decaying continual loss function to mitigate causal confusion from repetitive actions. Experiments show Game-TARS achieves approximately double the success rate of previous state-of-the-art models on open-world Minecraft tasks, reaching a 72.0% success rate on embodied tasks compared to the prior best of 42.1%. The principal implication for practitioners is that employing a simple, scalable, device-level action space is a viable path for building general-purpose computer-use agents with strong zero-shot generalization capabilities, bypassing the need for environment-specific engineering.
Uniform Discrete Diffusion with Metric Path for Video Generation (Read more on arXiv or HuggingFace)		This paper introduces URSA, a discrete diffusion framework for scalable video generation that operates by iteratively refining discrete spatiotemporal tokens. The research objective is to close the performance gap between discrete and continuous video generation methods by mitigating error accumulation and improving long-context consistency. The methodology integrates a Linearized Metric Path derived from token embedding distances, a Resolution-dependent Timestep Shifting mechanism, and an asynchronous temporal scheduling strategy to unify tasks like text-to-video and interpolation in a single model. URSA demonstrates performance comparable to state-of-the-art continuous methods, achieving a text-to-video score of 82.4 on the VBench benchmark. For AI practitioners, this work provides a unified and scalable discrete alternative to continuous diffusion models for high-quality, multi-task video generation, offering a competitive and potentially more efficient architectural paradigm.
Repurposing Synthetic Data for Fine-grained Search Agent Supervision (Read more on arXiv or HuggingFace)		This paper introduces Entity-aware Group Relative Policy Optimization (E-GRPO), a framework that repurposes ground-truth entities from synthetic data to create a dense reward signal for training search agents. The core objective is to solve the sparse reward problem in methods like GRPO by distinguishing informative “near-miss” failures from complete failures. E-GRPO’s methodology formulates a dense reward function that assigns partial rewards to incorrect trajectories based on their normalized entity match rate—the fraction of ground-truth entities identified within the agent’s reasoning thoughts. The primary result is that E-GRPO consistently outperforms its baseline; for instance, a 7B model trained in a local environment achieved a 64.2 average Pass@1 score on QA benchmarks, a 2.8-point improvement over standard GRPO, while also reducing the number of tool calls. The principal implication for AI practitioners is that metadata discarded during synthetic data generation is a computationally cheap yet powerful source for creating fine-grained reward signals, enhancing the sample efficiency and performance of RL-based agent alignment by enabling learning from partially correct solutions.
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents (Read more on arXiv or HuggingFace)		This paper introduces OSWorld-MCP, a benchmark for evaluating multimodal agents on their ability to jointly perform GUI operations and invoke external tools via the Model Context Protocol (MCP). The main objective is to create a fair evaluation framework to assess an agent’s decision-making in choosing between GUI interactions and MCP tool invocations for complex computer tasks. The methodology involves extending the OSWorld environment with a curated set of 158 high-quality MCP tools and introducing new metrics like Tool Invocation Rate (TIR). Primary results show that MCP tools significantly improve performance, increasing task success for OpenAI o3 from 8.3% to 20.4%, yet even top models have low tool invocation rates (max of 36.3%). For AI practitioners, this benchmark provides a standardized method to assess and develop agent tool-use capabilities, revealing that effective decision-making between GUI and tool-based actions is a critical and underdeveloped area for creating more robust automated agents.
WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling
Info-Rich Seeking (Read more on arXiv or HuggingFace)		WebLeaper is a framework for generating entity-rich information-seeking tasks from Wikipedia tables to train web agents that are both more effective and efficient. The main objective is to overcome the low search efficiency of LLM-based agents, which is attributed to the sparsity of target entities in conventional training tasks, by developing a framework to construct high-coverage tasks and generate efficient solution trajectories. The methodology involves modeling information-seeking as a tree-structured reasoning problem and synthesizing tasks in three variants (Basic, Union, Reverse-Union) to systematically increase complexity, followed by curating training trajectories based on Information-Seeking Rate (ISR) and Information-Seeking Efficiency (ISE) metrics, and finally training an agent via supervised fine-tuning and reinforcement learning with a hybrid reward system. In a comprehensive training setting, WebLeaper achieved a 73.2 accuracy score on the GAIA benchmark, outperforming strong open-source models like DeepSeek-V3.1 (63.1) and proprietary models such as Claude-4-Sonnet (68.3). The principal implication for AI practitioners is that training agents on entity-dense tasks, as enabled by the WebLeaper framework, directly improves both task success rates and operational efficiency (fewer actions), providing a concrete strategy to build more capable and cost-effective web-browsing agents.
Group Relative Attention Guidance for Image Editing (Read more on arXiv or HuggingFace)		The paper introduces Group Relative Attention Guidance (GRAG), a lightweight method for achieving fine-grained control over editing strength in Diffusion-in-Transformer (DiT) models. The primary objective is to address the lack of effective control over editing intensity in existing methods by enabling continuous modulation between instruction following and image consistency. The key methodology involves identifying a shared bias vector in the Query and Key embeddings of the MM-Attention mechanism and then reweighting the deviation of each token from this group bias to precisely control the editing process. Integrating GRAG into the Qwen-Edit model improved the overall EditScore from 7.2576 to 7.3245 on the PIE dataset. For AI practitioners, GRAG provides a simple, four-line code modification that can be integrated into existing DiT-based editors to enhance controllability and editing quality without any model tuning.
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D
Intelligence (Read more on arXiv or HuggingFace)		This paper introduces STAR-Bench, a benchmark to evaluate audio 4D intelligence, defined as a model’s ability to perform deep reasoning over sound dynamics in time and 3D space. The research objective is to assess how well Large Audio-Language Models (LALMs) handle linguistically hard-to-describe auditory cues, a gap in existing text-centric audio benchmarks. The methodology involves a two-level benchmark with a Foundational Acoustic Perception task using synthesized audio to test six core attributes and a Holistic Spatio-Temporal Reasoning task using curated real-world audio to evaluate complex event ordering and 3D scene understanding. Evaluation of 19 models reveals substantial performance gaps, showing that relying on audio captions instead of raw audio causes accuracy to drop by 31.5% on temporal tasks and 35.2% on spatial tasks, unlike in prior benchmarks. For AI practitioners, the principal implication is that current models fundamentally struggle to integrate information from multiple audio inputs and lack genuine spatial awareness, highlighting the need to develop architectures that natively process multi-channel audio rather than averaging it to a mono signal.
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit
Routing Guidance (Read more on arXiv or HuggingFace)		This paper introduces ProMoE, a Mixture-of-Experts (MoE) framework that improves the scaling of Diffusion Transformers (DiTs) through explicit routing guidance. The main objective is to address the poor expert specialization in vision MoEs, which stems from the spatial redundancy and functional heterogeneity of visual tokens compared to language tokens. The key methodology is a two-step router that first performs conditional routing to separate tokens by functional role (conditional vs. unconditional) and then uses prototypical routing with a novel routing contrastive loss to assign conditional tokens to experts based on semantic content. On the ImageNet 256x256 benchmark with Rectified Flow, the ProMoE-L model achieves a Fréchet Inception Distance (FID) of 2.79, surpassing its dense DiT counterpart’s FID of 3.56 while activating the same number of parameters. For AI practitioners, this work provides a validated method for effectively applying MoE to vision transformers by introducing explicit routing signals that account for the unique characteristics of visual data, enabling more efficient model scaling.
ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking (Read more on arXiv or HuggingFace)		PARALLELMUSE is a two-stage paradigm that improves deep information-seeking agents’ performance and efficiency through uncertainty-guided partial rollouts and compressed reasoning aggregation. The objective is to develop a parallel thinking framework for deep information-seeking agents that overcomes the inefficiency of redundant rollouts and the difficulty of integrating long-horizon reasoning trajectories within limited context windows. The methodology consists of two stages: 1) Functionality-Specified Partial Rollout, which identifies high-uncertainty steps in distinct functional regions (reasoning vs. exploration) to branch from, reusing context via KV caching; and 2) Compressed Reasoning Aggregation, which condenses multiple reasoning trajectories into structured reports to enable coherent, comprehensive answer synthesis. The method achieves up to a 62% performance improvement over the base agent model and reduces exploratory token consumption by 10-30% compared to conventional from-scratch parallel rollouts; trajectory compression further reduces aggregation context by up to 99%. The principal implication for AI practitioners is that PARALLELMUSE offers a practical test-time scaling technique to significantly enhance agent problem-solving capabilities without model retraining, while simultaneously improving computational and token efficiency over standard parallel reasoning methods.
AgentFrontier: Expanding the Capability Frontier of LLM Agents with
ZPD-Guided Data Synthesis (Read more on arXiv or HuggingFace)		This paper introduces the AgentFrontier Engine, a data synthesis framework guided by the Zone of Proximal Development (ZPD) to enhance LLM agent reasoning. The research objective is to develop a scalable method for automatically generating frontier-level training data that is challenging enough to require guided learning but is ultimately solvable. The methodology involves a three-stage pipeline: generating multi-source seed questions, iteratively escalating their complexity with a tool-augmented agent, and using an LKP-MKO (Less Knowledgeable Peer vs. More Knowledgeable Other) adversarial calibration to filter for tasks within the LLM’s ZPD. The resulting AgentFrontier-30B-A3B model achieved state-of-the-art performance, scoring 28.6% on the text-only Humanity’s Last Exam and 93.4% on their ZPD Exam-v1. For AI practitioners, this work provides a principled, automated framework for creating high-quality, complex reasoning data, offering a scalable path to train more capable agents without relying on prohibitive manual curation.
Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal
Reasoning in MLLMs (Read more on arXiv or HuggingFace)		The paper introduces Latent Sketchpad, a framework that enables Multimodal Large Language Models (MLLMs) to generate internal visual latents as a form of “visual thought” to improve complex multimodal reasoning. The research objective is to enhance MLLMs’ capabilities in scenarios requiring visual planning and imagination by equipping them with an internal mechanism to generate and utilize visual representations interleaved with their native textual reasoning process. The methodology involves augmenting a frozen pretrained MLLM with two components: a Context-Aware Vision Head that autoregressively generates sequences of visual latents, and a separately pretrained Sketch Decoder that translates these latents into interpretable sketch images for visualization. On the custom MAZEPLANNING dataset, the framework improved performance; a fine-tuned Gemma3 model equipped with Latent Sketchpad increased its task Success Rate from 70.0% to 72.2%, and the generated visual traces themselves demonstrated a Visual Success Rate of 75.6%, surpassing the text-only baseline. The principal implication for AI practitioners is that this modular, plug-and-play approach allows for the enhancement of MLLMs’ reasoning abilities for spatial planning tasks without requiring full model retraining, providing a direct method to incorporate interpretable “visual thinking” into existing architectures.
Critique-RL: Training Language Models for Critiquing through Two-Stage
Reinforcement Learning (Read more on arXiv or HuggingFace)		Critique-RL is a two-stage reinforcement learning framework that trains language models for critiquing by first optimizing for discriminability and then for helpfulness without stronger supervision. The primary objective is to develop effective critique models by resolving the optimization conflict between a critic’s ability to accurately judge a response (discriminability) and its ability to provide useful feedback for refinement (helpfulness). The methodology involves a two-stage RL process: Stage I uses direct, rule-based rewards to explicitly train the critic’s discriminability, and Stage II uses indirect rewards from actor refinement to improve helpfulness, while KL regularization preserves the discriminative ability learned in Stage I. The proposed method significantly outperforms baselines; for instance, a Qwen2.5-7B model trained with Critique-RL achieved 58.40% accuracy on the MATH dataset, improving upon the 51.84% of an SFT baseline, while concurrently boosting discriminability accuracy to 85.20%. For AI practitioners, this two-stage framework offers a robust method for creating specialized critique models for scalable oversight, enhancing the performance of actor models on complex reasoning tasks by ensuring the critic first learns to reliably identify errors before learning to provide constructive feedback.
VisCoder2: Building Multi-Language Visualization Coding Agents (Read more on arXiv or HuggingFace)		This work introduces VisCoder2, a family of open-source models for generating visualization code, alongside a large-scale multi-language dataset and a comprehensive benchmark for training and evaluation. The primary objective is to develop and systematically evaluate multi-language visualization coding agents capable of iterative generation, execution, and self-debugging across diverse programming environments. The methodology involves constructing VisCode-Multi-679K, a supervised dataset of 679K executable code samples and correction dialogues across 12 languages, and using it to fine-tune the Qwen2.5-Coder model family. On the introduced VisPlotBench benchmark, the 32B VisCoder2 model with iterative self-debug achieves an 82.4% overall execution pass rate, matching the performance of the proprietary GPT-4.1 model. For AI practitioners, this provides a set of open-source models and resources capable of reliably generating executable visualization code in multiple languages, with a robust framework for implementing execution-based self-correction to handle complex symbolic or compiler-dependent languages.
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining,
Finetuning, and Decoding the Curse of Multilinguality (Read more on arXiv or HuggingFace)		This work introduces the ADAPTIVE TRANSFER SCALING LAW (ATLAS) for multilingual pretraining, which outperforms existing scaling laws in out-of-sample generalization. The research aims to empirically investigate multilingual scaling dynamics, measure cross-lingual transfer, model the “curse of multilinguality,” and determine the computational crossover point between pretraining from scratch versus finetuning for a target language. The methodology involves 774 training experiments on models from 10M to 8B parameters, fitting the ATLAS law which separates loss contributions from target language data, other data, and specific transfer languages, and deriving a 38x38 cross-lingual transfer matrix based on a Bilingual Transfer Score (BTS). The primary results show that ATLAS achieves superior generalization (R²(M)=0.82 on unseen mixtures) and quantifies the cost of adding languages; to maintain iso-loss performance when expanding language coverage by a factor of r, the compute budget must be scaled by approximately r^0.97. The principal implication for practitioners is a quantitative framework for multilingual model development, providing explicit formulas to budget compute for language expansion (C’≈C*r^0.97), a transfer matrix to optimize data mixtures, and an empirical guide for deciding whether to pretrain or finetune based on the available token budget.
From Spatial to Actions: Grounding Vision-Language-Action Model in
Spatial Foundation Priors (Read more on arXiv or HuggingFace)		FALCON is a vision-language-action model that improves robotic manipulation by grounding actions in strong 3D spatial priors derived from spatial foundation models. The main objective is to address the spatial reasoning gap in existing Vision-Language-Action (VLA) models by integrating robust 3D geometric information from RGB inputs without degrading pre-trained vision-language alignment or requiring specialized 3D sensors. The methodology involves an Embodied Spatial Model (ESM) to extract rich spatial tokens and a novel Spatial-Enhanced Action Head that fuses these tokens directly with semantic action tokens from a VLM, decoupling spatial processing from the main vision-language backbone. FALCON achieves state-of-the-art results, attaining a 70.0% average success rate on nine real-world base tasks, outperforming the advanced SpatialVLA baseline (44.4%) by 25.6%. The principal implication for AI practitioners is that injecting spatial information directly into the action head, rather than the VLM’s input stream, is a superior architectural choice for preserving high-level semantic reasoning while significantly enhancing a policy’s fine-grained spatial awareness and manipulation accuracy.
FunReason-MT Technical Report: Overcoming the Complexity Barrier in
Multi-Turn Function Calling (Read more on arXiv or HuggingFace)		This paper presents FunReason-MT, a novel data synthesis framework designed to generate high-quality, complex trajectories for multi-turn function calling. The main objective is to overcome the limitations of existing data generation methods, such as random sampling, by creating logically dependent and targeted tool-use scenarios. The methodology involves a three-phase process: 1) Environment-API Graph Interactions to sample valid tool execution traces, 2) Advanced Tool-Query Synthesis to reverse-engineer a challenging query from the trace, and 3) a Guided Iterative Chain to generate and refine a robust Chain-of-Thought (CoT) through self-correction. A 4B model trained with this framework achieved a Multi-Turn score of 56.50 on the BFCLv3 benchmark, a +40.75 improvement over the base model, surpassing comparable open and closed-source models. For AI practitioners, this framework provides a structured, “top-down” methodology to synthesize high-complexity training data, enabling the development of more reliable and capable tool-using agents, particularly for scenarios requiring multi-step logical reasoning.
ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers? (Read more on arXiv or HuggingFace)	Ian L. V. Roque, Steven Dillmann, Suchetha Cooray, Sihan Yuan, Christine Ye	This paper introduces ReplicationBench, a benchmark framework that evaluates the ability of AI agents to perform end-to-end replication of entire astrophysics research papers. The main objective is to assess the faithfulness and correctness of AI agents as scientific research assistants by testing their capability to reproduce the core contributions of published, expert-level astrophysics papers from scratch. The methodology involves a dataset of 19 peer-reviewed papers decomposed into 107 objective, expert-validated tasks, where agents operate within a sandboxed code execution environment to implement the methodology and produce numerical results that are automatically graded. The primary result is that current models perform poorly, with the best-performing model, Claude 3.7 Sonnet, achieving an average score of only 19.3%, with common failures including procedural errors, technical execution issues, and a lack of persistence. The principal implication for AI practitioners is that while agents may possess static knowledge, they have critical deficits in long-horizon reasoning, robust code execution, and deep procedural understanding, indicating significant architectural and capability improvements are needed for reliable use in complex scientific workflows.
Rethinking Visual Intelligence: Insights from Video Pretraining (Read more on arXiv or HuggingFace)	Ahmad Rahimi, Sebastian Stapf, Mariam Hassan, Aram Davtyan, Pablo Acuaviva	This paper demonstrates that pretrained Video Diffusion Models (VDMs) exhibit superior data efficiency and performance on structured visual reasoning tasks compared to similarly adapted Large Language Models (LLMs). The primary objective is to investigate whether the spatiotemporal inductive biases from large-scale video pretraining provide a more effective foundation for visual intelligence than the symbolic capabilities of text-pretrained models. The study employs a controlled comparison where a pretrained VDM and an LLM are fine-tuned on visual tasks using identical lightweight LoRA adaptation, with tasks framed as image-to-image temporal transitions for the VDM and serialized JSON-to-JSON for the LLM. Across benchmarks, VDMs consistently outperform LLMs in data efficiency; specifically, on the ARC-AGI benchmark, the CogVideoX1.5-5B VDM achieved 16.75% accuracy, more than double the 8.00% achieved by the comparably scaled Qwen3-4B-Instruct-2507 LLM. The principal implication for AI practitioners is that video pretraining is a potent source of inductive biases for visual foundation models, significantly improving sample efficiency on tasks requiring compositional spatial understanding and offering a superior alternative to text-centric approaches for these domains.
Generalization or Memorization: Dynamic Decoding for Mode Steering (Read more on arXiv or HuggingFace)		This paper introduces Dynamic Mode Steering (DMS), a training-free, inference-time decoding algorithm to steer LLMs from memorization towards generalization. The primary objective is to create a framework to understand, identify, and control the distinct reasoning modes of LLMs to enhance their reliability. The methodology involves a two-stage process: first, a lightweight linear probe identifies the model’s current reliance on memorization based on internal activations at a causally-critical layer; second, a dynamic activation steering mechanism nudges the model’s computation towards pre-identified generalization circuits. Experiments on Llama-3 models show that DMS significantly improves performance, increasing Pass@1 accuracy on the GSM8K benchmark by 6.2% for the 8B model and improving factual accuracy on TruthfulQA. For AI practitioners, the principal implication is that DMS offers a practical, post-hoc method to improve the factual accuracy and logical consistency of deployed models without retraining, providing a direct mechanism for enhancing AI safety and reliability.
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations (Read more on arXiv or HuggingFace)	Jiayi Zhang, Sirong Lu, Yifan Wu, Zhiyang Zhang, Yupeng Xie	This paper introduces VISJUDGE-BENCH, a benchmark for evaluating MLLM performance in assessing data visualization quality, and proposes VISJUDGE, a fine-tuned model that significantly improves alignment with human expert judgments on this task. The primary objective is to systematically measure and improve the capabilities of Multimodal Large Language Models (MLLMs) in assessing the quality of data visualizations across the multi-dimensional criteria of data fidelity, information expressiveness, and visual aesthetics. The authors constructed VISJUDGE-BENCH, a dataset of 3,090 expert-annotated visualizations evaluated on six sub-dimensions, and then developed VISJUDGE by applying reinforcement learning and parameter-efficient fine-tuning to the Qwen2.5-VL-7B-Instruct model using this new benchmark. The proposed VISJUDGE model significantly outperforms existing MLLMs, reducing the Mean Absolute Error (MAE) by 19.8% and increasing the correlation with human experts by 58.7% compared to the baseline GPT-5 model. The principal implication for AI practitioners is that general-purpose MLLMs are inadequate for specialized, multi-dimensional assessment of domain-specific imagery like data visualizations, necessitating domain-specific fine-tuning on expert-annotated benchmarks to achieve human-aligned performance.
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a
Unified Concept Set (Read more on arXiv or HuggingFace)		This paper introduces VL-SAE, a sparse autoencoder that interprets and enhances vision-language alignment in VLMs by mapping multi-modal representations to a unified concept set. The main objective is to address the difficulty of interpreting VLM alignment by mapping the semantics of both vision and language representations into a single, shared conceptual space. The key methodology involves a novel SAE architecture with a distance-based encoder to ensure consistent activations for semantically similar inputs and two modality-specific decoders to handle distributional differences. Experiments show that VL-SAE improves downstream performance, for instance, enhancing the zero-shot image classification mean accuracy of OpenCLIP-ViT-H/14 from 76.9% to 77.8%. For AI practitioners, VL-SAE provides a post-hoc tool to interpret a VLM’s alignment mechanism, diagnose failures like hallucination, and improve performance by explicitly aligning representations at the concept level.
PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D
Part Understanding (Read more on arXiv or HuggingFace)	Lan Xu, Yukai Zhou, Xin Lv, Yiyang He, Penghao Wang	This paper introduces PartNeXt, a large-scale dataset with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. The research objective is to address the scalability and usability limitations of existing datasets like PartNet, which lack textures and use expert-dependent annotation tools. The methodology involves collecting models from public sources (e.g., Objaverse), using a custom dual-panel web interface for scalable crowdsourced annotation directly on textured meshes, and leveraging GPT-4o to bootstrap part hierarchies. The primary result shows that training the Point-SAM model on PartNeXt yields substantial performance gains over PartNet, improving IoU@10 on the PartNeXt test set from 60.3% to 65.9%. The principal implication for AI practitioners is that PartNeXt offers a high-quality, textured, and diverse dataset for training and benchmarking more robust 3D models, with new benchmarks that reveal significant gaps in current 3D-LLMs’ ability to perform open-vocabulary part grounding.
PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text
Embedding (Read more on arXiv or HuggingFace)	Denis Cavallucci, Iliass Ayaou	This paper introduces PatenTEB, a 15-task benchmark for patent text embedding, and the `patembed` model family trained upon it. The research objective is to create a comprehensive evaluation framework for patent text understanding and identify training strategies that optimize for both benchmark performance and real-world generalization. The methodology involves constructing a 2.06 million-example benchmark with domain-stratified splits and asymmetric retrieval tasks, and then using multi-task learning on 13 of these tasks to train a family of encoders initialized from a domain-pretrained model. The primary result is that `patembed-base` achieves a state-of-the-art 0.494 V-measure on the external MTEB BigPatentClustering.v2 benchmark, outperforming the previous best of 0.445. For practitioners, the principal implication is that multi-task training improves external generalization (+0.062 V-measure on BigPatent) even at a minor cost to internal benchmark performance (-0.004 Overall Score), indicating that optimizing solely for benchmark scores can be suboptimal for deployment.

Papers for 2025-10-27

Title	Authors	Summary
DeepAgent: A General Reasoning Agent with Scalable Toolsets (Read more on arXiv or HuggingFace)	Jiajie Jin, Jiarui Jin, Xiaoxi Li, dongguanting, wxjiao	This paper introduces DeepAgent, an end-to-end reasoning agent that unifies autonomous thinking, dynamic tool discovery, and action execution into a single, continuous process for complex tasks. The objective is to create a general-purpose agent that overcomes the limitations of predefined workflows by enabling dynamic tool retrieval and robust long-horizon reasoning over scalable toolsets. Key methodologies include an autonomous memory folding mechanism to compress interaction history into a structured schema (episodic, working, tool memory) and a reinforcement learning strategy, ToolPO, which leverages an LLM-based tool simulator and fine-grained advantage attribution for stable training. DeepAgent significantly outperforms baseline methods, particularly in open-set scenarios; on the ToolBench benchmark with open-set tool retrieval, it achieved a 64.0% success rate, surpassing the strongest workflow-based baseline’s 54.0%. For AI practitioners, this work provides a framework demonstrating that a unified reasoning architecture with dynamic tool discovery and explicit memory management is more effective for building robust, general-purpose agents than traditional, rigid workflow-based approaches.
Video-As-Prompt: Unified Semantic Control for Video Generation (Read more on arXiv or HuggingFace)		This paper introduces Video-As-Prompt (VAP), a unified framework that uses a reference video as an in-context prompt to achieve generalizable semantic control over video generation. The objective is to develop a single model for diverse, non-pixel-aligned semantic control (e.g., style, motion) that avoids the artifacts and poor generalization of existing methods. VAP employs a plug-and-play Mixture-of-Transformers (MoT) architecture, where a trainable expert processes the video prompt to guide a frozen Video Diffusion Transformer (DiT) via full attention, combined with a temporally biased position embedding to prevent incorrect spatial mapping priors. The primary result is that VAP achieves a 38.7% user preference rate, rivaling leading condition-specific commercial models, and demonstrates strong zero-shot generalization to unseen semantic conditions. For AI practitioners, the principal implication is the ability to add complex semantic control to existing frozen video generation models without costly per-condition retraining or specialized architectures, enabling more scalable and flexible content creation.
From Denoising to Refining: A Corrective Framework for Vision-Language
Diffusion Model (Read more on arXiv or HuggingFace)		The paper introduces ReDiff, a corrective framework for vision-language diffusion models that reframes generation from passive denoising to active refining to mitigate error cascades in parallel decoding. The primary objective is to overcome catastrophic error propagation during parallel generation, which is caused by a train-inference discrepancy where models must generate from their own noisy outputs. The methodology involves a two-stage training process: first, a foundational revision stage to correct synthetic errors, followed by an online self-correction loop where the model learns to fix its own intrinsic mistakes by training on draft-correction pairs generated by an expert model. The framework achieves a CLAIR score of 76.74 on the CapMAS benchmark, an 11.2 point improvement over the LLaDA-V baseline, while demonstrating superior stability in few-step parallel generation. For AI practitioners, this work provides a training paradigm to develop more robust vision-language diffusion models capable of stable and efficient parallel generation, directly addressing a key limitation that hinders their real-world application.
Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image
Generation (Read more on arXiv or HuggingFace)		Chunk-GRPO is a novel chunk-level reinforcement learning approach for flow-matching-based text-to-image generation that optimizes groups of consecutive timesteps to improve image quality and preference alignment. The research objective is to resolve the inaccurate advantage attribution and neglect of temporal dynamics inherent in standard step-level Group Relative Policy Optimization (GRPO). The key methodology involves segmenting the generation trajectory into “chunks” based on the temporal dynamics of flow matching, identified by the relative L1 distance between latent states, and applying a chunk-level optimization objective. Primarily, Chunk-GRPO with weighted sampling achieves a superior HPSv3 preference score of 15.373, outperforming the step-level Dance-GRPO baseline’s score of 15.080. The principal implication for AI practitioners is that aligning the granularity of RL optimization (i.e., chunks) with the intrinsic dynamics of an iterative generation process offers a more effective fine-tuning strategy than applying uniform, step-wise credit assignment.
Sparser Block-Sparse Attention via Token Permutation (Read more on arXiv or HuggingFace)		Permuted Block-Sparse Attention (PBS-Attn) is a plug-and-play method that accelerates long-context LLM prefilling by reordering tokens to increase the block-level sparsity of the attention matrix. The objective is to improve computational efficiency by creating a more favorable block-sparse structure, which is achieved through a novel segmented permutation strategy that reorders keys within segments based on query-aware importance scores while preserving inter-segment causality. Experiments show that PBS-Attn achieves an end-to-end prefilling speedup of up to 2.75× over the FlashAttention baseline, while maintaining model accuracy that is nearly on par with full attention on benchmarks like LongBench and LongBenchv2. For AI practitioners, this method provides a practical, training-free optimization to significantly reduce the latency and computational cost of the compute-bound prefilling stage for long-context inference applications.
UI-Ins: Enhancing GUI Grounding with Multi-Perspective
Instruction-as-Reasoning (Read more on arXiv or HuggingFace)		This research introduces “Instruction-as-Reasoning,” a novel SFT+RL framework that enhances GUI grounding by treating diverse instructions as dynamic reasoning pathways. The paper’s primary objective is to overcome the limitations of poor instruction quality and diversity in existing datasets by developing a model that actively selects the most effective analytical perspective for a given UI task. The methodology involves a two-stage training process: first, Supervised Fine-Tuning (SFT) on a curated dataset of multi-perspective instructions to instill reasoning capabilities, followed by Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) to optimize pathway selection. The resulting UI-Ins-32B model establishes a new state-of-the-art, achieving 87.3% accuracy on the UI-I2E-Bench, after finding that 23.3% of instructions in existing datasets were flawed. For AI practitioners, this work highlights the critical importance of instruction data quality and diversity, providing a concrete SFT+RL framework to build more robust GUI agents that can reason effectively and avoid policy collapse during training.
A Definition of AGI (Read more on arXiv or HuggingFace)	Yarin Gal, Honglak Lee, Christian Szegedy, Dawn Song, Dan Hendrycks	This paper introduces a quantifiable framework to define Artificial General Intelligence (AGI), grounding it in the Cattell-Horn-Carroll theory of human cognition to evaluate AI systems across ten core cognitive domains. The objective is to operationalize a concrete definition of AGI—an AI matching the cognitive versatility and proficiency of a well-educated adult—to create a standardized measurement tool. The methodology adapts human psychometric batteries to assess AI on ten equally-weighted components, including reasoning, memory, and perception, resulting in a standardized “AGI Score.” The framework’s application reveals a “jagged” cognitive profile in current models; GPT-4 achieves a total AGI Score of 27%, showing strength in knowledge-based tasks but critical deficits in foundational areas, scoring 0% in Long-Term Memory Storage. For AI practitioners, the principal implication is the direct identification of specific system bottlenecks, demonstrating that fundamental capabilities like continual learning (Long-Term Memory Storage) are entirely absent and require direct architectural solutions rather than being addressed by compensatory strategies like Retrieval-Augmented Generation.
Reasoning with Sampling: Your Base Model is Smarter Than You Think (Read more on arXiv or HuggingFace)		This paper presents a training-free, inference-time sampling algorithm to enhance the reasoning capabilities of base large language models. The central research question is whether comparable reasoning performance to that achieved by reinforcement learning (RL) can be elicited from base models using only advanced sampling techniques. The proposed “Power Sampling” method employs an iterative Markov chain Monte Carlo (MCMC) algorithm to sample from the base model’s power distribution (p^α), which systematically upweights higher-likelihood token sequences. The algorithm achieves performance on par with, and often superior to, the RL-posttrained GRPO baseline; for example, on the Qwen2.5-Math-7B model, it improves HumanEval accuracy from 32.9% (base) to 57.3%, outperforming GRPO’s 53.7%, while also maintaining superior generation diversity on pass@k metrics. The principal implication for AI practitioners is that significant reasoning improvements can be extracted from existing base models by dedicating more compute at inference time, potentially obviating the need for complex and costly RL posttraining.
RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via
Hierarchical Model Merging (Read more on arXiv or HuggingFace)		The paper introduces RECALL, a data-free framework that mitigates catastrophic forgetting in large language models by performing hierarchical, layer-wise parameter merging guided by the similarity of internal representations. The primary objective is to develop a method that can identify and preserve learned knowledge across multiple fine-tuned models in a data-free and task-agnostic manner, thereby alleviating catastrophic forgetting during continual learning. RECALL first extracts hidden state representations from a small set of “typical” samples, identified via clustering, for each model. It then computes layer-wise inter-model similarity using an RBF kernel on these representations and uses these scores as adaptive weights for a hierarchical parameter merge, applying different weights for each layer. In a single-model merging scenario with Llama-2-7B, RECALL achieved the best generalization to unseen tasks with an average score of 38.92, outperforming the next-best baseline by +7.86%. The principal implication for AI practitioners is the ability to fuse multiple specialist models into a single, more capable generalist model without requiring access to the original training datasets, which saves computational resources and navigates data privacy constraints.
Visual Diffusion Models are Geometric Solvers (Read more on arXiv or HuggingFace)	Or Patashnik, Andrey Voynov, Omer Dahary, Shai Yehezkel, Nir Goren	Visual diffusion models are presented as effective geometric solvers that operate directly in pixel space. The primary objective is to demonstrate that these models can reason about and discover geometric structures by recasting hard geometric problems as image generation tasks. The key methodology involves training a standard visual diffusion model (U-Net backbone with self-attention) on pixel-space representations of problems like the Inscribed Square Problem, Steiner Tree Problem, and Maximum Area Polygonization Problem, with problem instances provided as conditional input. Primary results include achieving a squareness metric of 0.891 for the Inscribed Square Problem (vs. 0.924 GT), a 0.996 valid tree rate and 1.0008 mean length ratio for Steiner Trees (10-20 points), and a 0.953 polygon validity rate and 0.9887 mean area ratio for Maximum Area Polygons (7-12 points). This research implies for AI practitioners that visual diffusion models offer a general and practical framework for approximating notoriously hard geometric problems through visual representations, enabling a bridge between generative modeling and mathematical problem-solving without requiring specialized architectures.
WorldGrow: Generating Infinite 3D World (Read more on arXiv or HuggingFace)	Jia Lu, Taoran Yi, Chen Yang, Sikuang Li, JieminFang	WorldGrow is a novel framework for generating infinite 3D worlds. The research addresses the challenge of synthesizing infinitely extendable, large, continuous 3D environments with coherent geometry and photorealistic appearance. Its methodology involves a hierarchical framework using a data curation pipeline for structured scene blocks, a 3D block inpainting mechanism for context-aware extension, and a coarse-to-fine generation strategy. WorldGrow achieved a FID_DINOv2 score of 313.54 for visual fidelity in generated blocks, significantly outperforming SynCity (655.60), and demonstrated robust stability in distant expansions. This enables AI practitioners to construct scalable, high-quality 3D content for large-scale virtual environments, crucial for embodied AI training and simulation.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via
Data Alignment and Test-Time Scaling (Read more on arXiv or HuggingFace)		RAPO++ is a cross-stage prompt optimization framework designed to enhance Text-to-Video (T2V) generation quality model-agnostically. The primary objective is to overcome limitations of short, unstructured, and misaligned user prompts that hinder the generative potential of diffusion-based T2V models. Its methodology involves three stages: Retrieval-Augmented Prompt Optimization (RAPO) for training-data-aligned refinement using relation graphs and LLMs; Sample-Specific Prompt Optimization (SSPO) for iterative test-time scaling with multi-source feedback (e.g., VLM verifiers and optical flow); and LLM fine-tuning to internalize optimization patterns from SSPO. RAPO++ achieved a total score of 82.65% on VBench with the LaVie model and improved Consistent Attribute Binding from 0.620 (Naive) to 0.742 on T2V-CompBench, demonstrating significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility. This framework provides AI practitioners with a model-agnostic, cost-efficient, and scalable solution to substantially improve T2V outputs without modifying the underlying generative backbone.
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs (Read more on arXiv or HuggingFace)	Bohyung Han, taekyung-k, byminji	This research uses mechanistic interpretability to map the internal information flow in Video Large Language Models (VideoLLMs), revealing a structured, multi-stage process for temporal reasoning. The primary objective is to investigate where and how VideoLLMs extract spatiotemporal information from video, integrate it with textual queries, and propagate it through different layers and modalities to generate answers for video question answering tasks. The study employs Attention Knockout to causally trace information flow by selectively disabling attention connections between token groups and Logit Lens to analyze the emergence of semantic concepts within video token representations across layers. The analysis reveals that temporal reasoning begins with cross-frame interactions in early-to-middle layers, followed by video-language integration on temporal keywords in middle layers, after which the model is ready to generate answers in middle-to-late layers; retaining only these effective pathways while suppressing 58% of attention edges in LLaVA-NeXT-7B-Video-FT maintained its original VideoQA performance. For AI practitioners, this provides a blueprint for model optimization, suggesting that VideoLLMs can be pruned to retain only these critical information pathways, enabling the development of more computationally efficient models for inference without a significant loss in temporal reasoning capability.
Model Merging with Functional Dual Anchors (Read more on arXiv or HuggingFace)		This paper introduces Functional Dual Anchors (FDAs) for model merging, enabling knowledge integration in the input-representation space. The primary objective is to mitigate task-specific knowledge conflicts in model merging by shifting the focus from parameter-space adjustments to modeling the input-representation space. The methodology involves constructing Functional Dual Anchors (FDAs) as synthetic inputs whose induced gradients align with task vectors. FDAs are optimized via gradient matching in the input-representation space and used for subsequent parameter optimization, guided by a principled initialization scheme. FDAs significantly improve multi-task performance, with a pretrained model adapted by FDAs achieving 87.26 average accuracy on ViT-B/16 tasks, representing an almost 18% improvement compared to vanilla Task Arithmetic (73.94). AI practitioners can leverage FDAs to achieve more robust and flexible model merging by integrating knowledge through synthetic inputs in the representation space, offering a viable alternative or complement to existing parameter-centric methods for consolidating diverse domain knowledge.
Document Understanding, Measurement, and Manipulation Using Category
Theory (Read more on arXiv or HuggingFace)		This paper introduces a category theory-based framework for document understanding, measurement, and manipulation. The primary objective is to extract and utilize multimodal document structure to enable information-theoretic measures, summarization, exegesis, and self-supervised improvement of large pretrained models. The methodology involves representing documents as categories of orthogonalized question-answer (QA) pairs, derived from rhetorical structure (abstractive DAGs) using large pretrained models. Key results include the formal definition of a Jaccard-like metric for assertion similarity, exemplified by `d(A, B) = ½` for contradictory assertions, and the development of rate distortion analysis for summarization techniques. This framework offers AI practitioners a principled, mathematical approach to semantic analysis and manipulation, enabling advanced document processing and self-correction mechanisms for LLMs based on consistency constraints.
PhysWorld: From Real Videos to World Models of Deformable Objects via
Physics-Aware Demonstration Synthesis (Read more on arXiv or HuggingFace)	Hui Li, Yihan Zeng, Xiang Zhang, Yu Yang, cszhilu1998	PhysWorld is a novel framework for learning accurate and fast world models of deformable objects from limited real-world videos through physics-aware demonstration synthesis. The primary objective is to address the data scarcity challenge in learning physics-consistent dynamics models for deformable objects, enabling both high accuracy and real-time inference. Its methodology involves constructing an MPM-based digital twin using VLM-assisted constitutive model selection and global-to-local physical property optimization from real videos. This digital twin then generates diverse 4D demonstrations via Various Motion Pattern Generation and Part-aware Physical Property Perturbation, which train a GNN-based world model subsequently fine-tuned with real videos. Experimentally, PhysWorld achieved competitive prediction performance and enabled inference speeds 47 times faster than the state-of-the-art PhysTwin (799 FPS vs 17 FPS). This work provides AI practitioners with a robust and efficient method for developing physics-consistent world models for robotics, VR, and AR, mitigating data requirements and facilitating real-time deployment.
Are Large Reasoning Models Good Translation Evaluators? Analysis and
Performance Boost (Read more on arXiv or HuggingFace)	Min Yang, Lidia S. Chao, Xinyi Yang, Zhihong Huang, rzzhan	This paper systematically analyzes Large Reasoning Models (LRMs) as Machine Translation (MT) evaluators, identifies inefficiencies, and proposes a calibration method to improve performance and efficiency. The research aimed to systematically understand LRM performance and failure modes in MT evaluation, and to develop an effective alignment strategy for LRMs as MT judges. The authors employed LRMs within the MQM framework, conducting meta-evaluation and analysis across various model sizes, and proposing ThinMQM, a method to calibrate LRM thinking by fine-tuning models on synthetic, human-like evaluation trajectories derived from WMT23 MQM data. Experiments on WMT24 Metrics benchmarks demonstrated that ThinMQM largely reduced thinking budgets by approximately 35x, while concurrently improving evaluation performance, notably achieving an 8.7 correlation point improvement for R1-Distill-Qwen-7B. These findings highlight that efficiently calibrated LRMs have significant potential to advance fine-grained automatic MT evaluation, emphasizing the critical need for controlled thinking and careful calibration for AI practitioners developing LRM-as-a-judge systems.
ARC-Encoder: learning compressed text representations for large language
models (Read more on arXiv or HuggingFace)		The paper introduces ARC-Encoder, a novel method for learning compressed text representations for Large Language Models (LLMs) that replaces raw text input, aiming to improve inference efficiency and context handling without modifying the decoder. The primary objective is to develop a plug-and-play encoder that compresses LLM contexts into continuous representations, reducing inference costs and extending context windows while preserving general abilities. The methodology involves an LLM transformer-based encoder with a pooling mechanism that averages consecutive queries in the last self-attention module for a fixed pooling factor (e.g., 4x or 8x), trained via alternating reconstruction and continuation tasks with a two-layer MLP projector and adaptable to multiple decoders via shared encoder and specialized projector layers. Results demonstrate that ARC-Encoder achieves state-of-the-art performance, nearly matching the open-book baseline with a 4x pooling factor (e.g., ARC4-Encoder for Llama3.1 8B achieves an average score of 48.0 compared to open-book’s 47.4) while providing up to 1.8x gains in prefilling FLOPs and extending context processing to 8x the original window size. This implies that AI practitioners can leverage ARC-Encoder as an efficient and portable solution to compress LLM input contexts, enhancing inference speed and long-context capabilities for various applications without requiring architectural changes or fine-tuning of the LLM decoder itself.
Taming Modality Entanglement in Continual Audio-Visual Segmentation (Read more on arXiv or HuggingFace)	Zhaojin Fu, Zili Wang, Tao Zhang, Qi Yang, hongyuyang23casia	This paper introduces a novel Collision-based Multi-modal Rehearsal (CMR) framework to mitigate modality entanglement in Continual Audio-Visual Segmentation (CAVS). The primary objective is to enable models to continuously segment new audio-visual classes while preserving knowledge of previously learned ones, specifically addressing multi-modal semantic drift and co-occurrence confusion in fine-grained CAVS. The methodology proposes the CMR framework, which includes a Multi-modal Sample Selection (MSS) strategy to identify high-modality-consistency samples for rehearsal by quantifying audio contribution, and a Collision-based Sample Rehearsal (CSR) mechanism that dynamically adjusts rehearsal frequency for classes prone to co-occurrence confusion. Comprehensive experiments on AVSBench-CI, AVSBench-CIS, and AVSBench-CIM datasets demonstrate that CMR significantly outperforms single-modal continual learning methods, achieving an 11.3 mIoU increase on the AVSBench-CIS 60-10 overlapped setting compared to other methods. This research provides AI practitioners with a robust framework for designing continual learning systems in multi-modal, fine-grained tasks like audio-visual segmentation, offering effective strategies to combat catastrophic forgetting and modality entanglement.
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research
Suite (Read more on arXiv or HuggingFace)	Bhavana Dalvi, Dan Bareket, Nishant Balepur, Mike D’Arcy, Jonathan Bragg	AstaBench introduces a rigorous benchmark suite for evaluating AI agents in scientific research. Its objective is to address shortcomings of existing benchmarks by providing holistic, reproducible, cost-accounted evaluations with standardized interfaces and comprehensive baselines across 2400+ problems and multiple scientific domains. The methodology includes a production-grade scientific research environment with controlled search tools and an evaluation toolkit to account for confounders like tool access and inference cost. Experimental results show that while some agents achieve meaningful progress in literature understanding (e.g., Asta v0 at 53.0%), overall scores for the full range of science tasks remain low, with the best open-source agent achieving only 11.1%. This indicates that AI is still far from solving the challenge of scientific research assistance, requiring significant development in areas like coding, data analysis, and end-to-end discovery.
Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video (Read more on arXiv or HuggingFace)		Foley Control is a lightweight approach for video-guided Foley synthesis by aligning frozen text-to-audio models with video. The main objective is to achieve competitive temporal and semantic alignment while preserving the practicality of frozen generative backbones and reducing data requirements. This is accomplished by connecting V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model via compact, trainable video cross-attention layers inserted after the existing text cross-attention, utilizing pooled video tokens and Rotary Position Embeddings (RoPE) for temporal grounding. Foley Control delivers competitive alignment, with single-pooled embeddings achieving a KL-PANNs metric of 3.111351 at 400k training steps, comparable to denser grid embeddings, and comparable MovieGenBench scores (e.g., DeSync 0.32) while training with nearly two orders of magnitude less paired data and compute than end-to-end multimodal systems. This framework offers AI practitioners a modular and data-efficient solution for video-to-audio generation, enabling easy swapping or upgrading of encoders and T2A backbones without costly end-to-end retraining.
Soft Instruction De-escalation Defense (Read more on arXiv or HuggingFace)		Soft Instruction Control (SIC) is an iterative prompt sanitization defense designed for tool-augmented Large Language Model (LLM) agents against prompt injection attacks. The method’s objective is to neutralize adversarial instructions by repeatedly inspecting incoming untrusted data, rewriting, masking, or removing malicious content, and re-evaluating until clean or a maximum iteration limit is reached. SIC employs an LLM-based rewriting mechanism with canary injection and multi-granularity detection (full text and chunks) to identify and de-escalate imperative instructions. Against a strong adaptive genetic algorithm adversary, SIC achieved an Attack Success Rate (ASR) of 15%, outperforming other detector-based defenses which had ASRs up to 49%. For AI practitioners, SIC offers a pragmatic, modular preprocessing layer that significantly raises the bar for prompt injection attacks by making them more difficult and expensive, without requiring modifications to the underlying LLM agent.
PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language
Models in Physical Environments (Read more on arXiv or HuggingFace)	Chaoyang Zhao, Manli Tao, Yi Peng, Xuantang Xiong, JettZhou	This paper introduces Active Visual Reasoning (AVR), a task requiring multimodal models to interact with physical environments to resolve information incompleteness for reasoning. The main objective is to extend visual reasoning from static, fully-observable settings to dynamic, partially-observable environments where agents must actively gather information. The methodology involves creating the CLEVR-AVR benchmark for evaluation and the AVR-152k dataset with Chain-of-Thought annotations modeling the task as a higher-order Markov Decision Process. A model, PhysVLM-AVR, is trained on this dataset to learn sequential information gathering and reasoning. The primary result is that while PhysVLM-AVR achieves 90.5% accuracy in identifying the need for interaction (Information Sufficiency Judgment Accuracy), its final answer accuracy is 39.7%, indicating that models can detect missing information but struggle to strategically act to acquire it. The principal implication for AI practitioners is that the AVR framework and dataset provide a concrete methodology for training agents to perform goal-directed, interactive information seeking, addressing a key limitation of current MLLMs in robotics and dynamic environments.
Stabilizing MoE Reinforcement Learning by Aligning Training and
Inference Routers (Read more on arXiv or HuggingFace)		This paper introduces Rollout Routing Replay (R3) to stabilize reinforcement learning for Mixture-of-Experts (MoE) models by aligning router behavior between training and inference. The objective is to mitigate RL training instability and collapse in MoE models, which is attributed to discrepancies in expert routing distributions between the inference rollout and training update phases. The key methodology involves recording the routing masks from the inference engine during sequence generation and replaying them during the training forward pass to enforce consistent expert selection. R3 reduces the training-inference policy KL divergence for the Qwen3-30B-A3B model from 1.535×10⁻³ to 7.5×10⁻⁴ and decreases the frequency of tokens with large probability discrepancies by an order of magnitude. For AI practitioners, R3 offers a practical method to stabilize RL on MoE architectures, preventing training collapse and improving final model performance by resolving a foundational inconsistency between training and inference frameworks.