ICLR 2026 High-Scoring Papers - Average Score ≥ 6 and Minimum Score ≥ 5

1. Special Unitary Parameterized Estimators of Rotation

Authors:

This paper revisits the topic of rotation estimation through the lens of special unitary matrices. We begin by reformulating Wahba’s problem using $SU(2)$ to derive multiple solutions that yield linear constraints on corresponding quaternion parameters. We then explore applications of these constraints by formulating efficient methods for related problems. Finally, from this theoretical foundation, we propose two novel continuous representations for learning rotations in neural networks. Extensive experiments validate the effectiveness of the proposed methods.

📊 Review Scores

Average score: 8.50

Minimum score: 8

Maximum score: 10

Number of reviewers: 4

Individual scores: 10, 8, 8, 8


2. Feedback-driven recurrent quantum neural network universality

Authors:

Quantum reservoir computing uses the dynamics of quantum systems to process temporal data, making it particularly well-suited for machine learning with noisy intermediate-scale quantum devices. Recent developments have introduced feedback-based quantum reservoir systems, which process temporal information with comparatively fewer components and enable real-time computation while preserving the input history. Motivated by their promising empirical performance, in this work, we study the approximation capabilities of feedback-based quantum reservoir computing. More specifically, we are concerned with recurrent quantum neural networks, which are quantum analogues of classical recurrent neural networks. Our results show that regular state-space systems can be approximated using quantum recurrent neural networks without the curse of dimensionality and with the number of qubits only growing logarithmically in the reciprocal of the prescribed approximation accuracy. Notably, our analysis demonstrates that quantum recurrent neural networks are universal with linear readouts, making them both powerful and experimentally accessible. These results pave the way for practical and theoretically grounded quantum reservoir computing with real-time processing capabilities.

📊 Review Scores

Average score: 8.50

Minimum score: 8

Maximum score: 10

Number of reviewers: 4

Individual scores: 10, 8, 8, 8


3. Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

Authors:

A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent *Ann loves pie* by binding *Ann* to *pie*, allowing it to later retrieve *Ann* when asked *Who loves pie?* Prior research on short lists of bound entities found strong evidence that LMs implement such retrieval via a **positional mechanism**, where *Ann* is retrieved based on its position in context. In this work, we find that this mechanism generalizes poorly to more complex settings; as the number of bound entities in context increases, the positional mechanism becomes noisy and unreliable in middle positions. To compensate for this, we find that LMs supplement the positional mechanism with a **lexical mechanism** (retrieving *Ann* using its bound counterpart *pie*) and a **reflexive mechanism** (retrieving *Ann* through a direct pointer). Through extensive experiments on nine models and ten binding tasks, we uncover a consistent pattern in how LMs mix these mechanisms to drive model behavior. We leverage these insights to develop a causal model combining all three mechanisms that estimates next token distributions with 95% agreement. Finally, we show that our model generalizes to substantially longer inputs of open-ended text interleaved with entity groups, further demonstrating the robustness of our findings in more natural settings. Overall, our study establishes a more complete picture of how LMs bind and retrieve entities in-context.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 8, 8


4. Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Authors:

We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, and Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs between reasoning, efficiency, and robustness, and expose challenges in closing the “sim2real” gap. Gaia2 is built on the open-source Agents Research Environments platform and designed to be easy to extend. By releasing Gaia2, we aim to provide the community with a foundation for developing, benchmarking, and training the next generation of practical agent systems.

📊 Review Scores

Average score: 8.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 3

Individual scores: 8, 6, 10


5. Scaling with Collapse: Efficient and Predictable Training of LLM Families

Authors:

Effective LLM training relies on *consistency*, meaning that key quantities—such as final losses and optimal hyperparameters—scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can *collapse* onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under *practical scaling recipes*, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, *Celerity*, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 8, 8


6. LLMs Get Lost In Multi-Turn Conversation

Authors:

Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

📊 Review Scores

Average score: 8.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 8, 8, 10, 6


7. Transducing Language Models

Authors:

Modern language models define distributions over strings, but their outputs are not always suited to downstream tasks. For instance, a model generating byte-pair strings may not be suitable when word-level predictions are needed, and a DNA model may not fit applications requiring amino acids. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, they are not treated as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. Focusing on transformations representable as finite-state transducers (FSTs), a commonly used state-machine abstraction for efficient string-to-string mappings, we develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target. This allows us to propagate probabilities through the transducer without altering model parameters and to *condition* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting token-level language models to character-level language models, token-level language models to word-level models, and deriving amino-acid models from DNA models. This demonstrates inference-time adaptation of pretrained language models to match application-specific output requirements.
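
The induced distribution of $f(X)$ mentioned in this abstract can be illustrated on a finite toy example. The sketch below is not the paper's FST algorithm; it simply enumerates a small, hypothetical source distribution (`token_lm`) and sums the probability of every source string that maps to the same target. The `|` separator and the `detokenize` function are assumptions made for illustration.

```python
from collections import defaultdict

def pushforward(p, f):
    """Return the distribution of f(X) when X ~ p.

    p: dict mapping source strings to probabilities (finite support).
    f: deterministic string-to-string function.
    """
    q = defaultdict(float)
    for x, prob in p.items():
        # Marginalize: every source string mapping to f(x) contributes its mass.
        q[f(x)] += prob
    return dict(q)

# Hypothetical toy distribution over token sequences; "|" marks token boundaries.
token_lm = {"a|b": 0.25, "ab": 0.25, "b|a": 0.5}

def detokenize(s):
    # Deterministic transformation: drop token boundaries to get characters.
    return s.replace("|", "")

char_lm = pushforward(token_lm, detokenize)
print(char_lm)
```

Here two different tokenizations of `ab` are merged into a single character-level outcome. Brute-force enumeration only works for a finite support, which is presumably why the abstract turns to transducer-based algorithms for real language models.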

📊 Review Scores

Average score: 8.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 8, 10, 8, 6


8. Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

Authors:

Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously, benefiting from the robustness of CTC and the expressiveness of attention without costly beam search. We further propose mixed sampling to mitigate the exposure bias of the decoder relying solely on CTC inputs. The resulting method, USR 2.0, halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing USR and modality-specific self-supervised baselines.
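
As a point of reference for the "greedily decoded CTC pseudo-labels" mentioned above, here is a minimal sketch of standard greedy CTC decoding: take the best label per frame, collapse consecutive repeats, then drop blanks. It covers only that decoding step, not the USR 2.0 training procedure; the blank index `0` and the toy `frames` scores are assumptions.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding over per-frame label scores.

    frame_scores: list of per-frame score lists, one score per label.
    blank: index of the CTC blank label (assumed to be 0 here).
    """
    # Best label for each frame.
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded = []
    prev = None
    for label in best:
        # Keep a label only when it starts a new run and is not the blank.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Toy scores for 7 frames and 3 labels (label 0 is the blank).
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
    [0.8, 0.1, 0.1],
]
labels = ctc_greedy_decode(frames)
print(labels)
```

The blank frame between the two runs of label `1` is what allows the same label to appear twice in a row in the output.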

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 8


9. Embodied Navigation Foundation Model

Authors:

Navigation is a fundamental capability in embodied AI, representing the intelligence required to perceive and interact within physical environments. To achieve such intelligence, recent advanced works leverage Vision-Language Models (VLMs), which demonstrate strong generalizability and possess a well-suited formulation for navigation. However, these approaches remain largely confined to narrow task settings and embodiment-specific architectures. In this work, we introduce a cross-embodiment and cross-task Navigation Foundation Model (NavFoM), trained on eight million navigation samples that encompass quadrupeds, drones, wheeled robots, and vehicles, and span diverse tasks such as vision-and-language navigation, object searching, target tracking, and autonomous driving. NavFoM employs a unified architecture that processes multimodal navigation inputs from varying camera configurations and navigation horizons. To accommodate diverse camera setups and temporal horizons, NavFoM incorporates identifier tokens that embed camera view information of embodiments and the temporal context of tasks. Furthermore, to meet the demands of real-world deployment, NavFoM controls all observation tokens using a dynamically adjusted sampling strategy under a limited token length budget. Extensive evaluations on seven public benchmarks demonstrate that our model achieves state-of-the-art or highly competitive performance across different navigation tasks and embodiments without requiring task-specific fine-tuning. Additional real-world experiments further confirm the strong generalizability and practical applicability of our approach.

📊 Review Scores

Average score: 8.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 10, 8, 8, 6


10. Probabilistic Kernel Function for Fast Angle Testing

Authors:

In this paper, we study the angle testing problem in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We further apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves 2.5x-3x higher queries-per-second (QPS) throughput compared to the state-of-the-art graph-based search algorithm HNSW.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 8


11. $\tau^2$-bench: Evaluating Conversational Agents in a Dual-Control Environment

Authors:

Existing benchmarks for conversational AI agents simulate **single-control environments**, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $\tau^2$-bench, with four key contributions: 1. A novel **Telecom dual-control domain** modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2. A **compositional task generator** that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3. A **reliable user simulator** tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4. **Fine-grained analysis of agent performance** through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 8


12. Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data

Authors:

Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two techniques: i) experience rehearsal and ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL’s sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.

📊 Review Scores

Average score: 8.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 8, 8, 10, 6


13. TamperTok: Forensics-Driven Tokenized Autoregressive Framework for Image Tampering Localization

Authors:

Multi-modal Large Language Models (MLLMs) offer powerful reasoning for localizing tampering in images, yet existing MLLM-based approaches suffer from suboptimal localization due to their reliance on exogenous segmentation decoders. This stitched pipeline introduces information bottlenecks during backpropagation, diluting spatial signals from the MLLM's hidden embeddings and lacking semantic priors for forensic tasks, which leads to imprecise masks and poor generalization in Image Manipulation Detection & Localization (IMDL). To address these limitations, we propose TamperTok, which reformulates MLLM-based IMDL as an autoregressive sequence generation task. Unlike existing approaches that rely on an exogenous decoder for localization, TamperTok directly generates spatially grounded token sequences from the MLLM, enabling precise probabilistic mask prediction without intermediary supervision. Specifically, we introduce the Kernel Splatting Decoder (KSD) to mitigate the sharp gradients caused by the deterministic mapping in codebook-based detokenizers via clustering-aware code smoothing while mapping tokens to binary masks. In addition, to compensate for the lack of priors on diverse tampering types, i.e., splicing and semantic forgeries, we propose a novel Scene-wise Expert Injection (SwEI) to select and inject multi-scale tampering-specific features from a forensic expert model into the MLLM. Extensive experiments show that TamperTok achieves state-of-the-art (SOTA) performance on multiple tampering localization datasets, with 20% improvements in IoU and F1 over existing MLLM-based models, while exhibiting stronger robustness to noise perturbations and cross-domain scenarios. Code will be released.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 8


14. MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive Text Sources

Authors:

We present MixtureVitae, an open‑access pretraining corpus built to minimize copyright risk while preserving strong downstream performance. MixtureVitae follows a risk‑mitigated sourcing strategy that combines public‑domain and permissively licensed text (e.g., CC‑BY/Apache) with carefully justified low‑risk additions (e.g., government works and EU TDM‑eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi‑stage pipeline for license‑aware filtering, safety and quality screening, and domain‑aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open‑sci‑ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb‑Edu and approach DCLM in the later stages of training. Performance is particularly strong on MMLU and competitive on QA tasks. These results demonstrate that permissive‑first, risk‑mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 8, 8


15. Generative Universal Verifier as Multimodal Meta-Reasoner

Authors:

We introduce *Generative Universal Verifier*, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build **ViVerBench**, a comprehensive benchmark spanning $16$ categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train **OmniVerifier-7B**, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+$8.3$). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose **OmniVerifier-TTS**, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+$3.7$) and GenEval++ (+$4.3$), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 8, 8


16. Multilevel Control Functional

Authors:

Control variates are variance reduction techniques for Monte Carlo estimators. They play a critical role in improving Monte Carlo estimators in scientific and machine learning applications that involve computationally expensive integrals. We introduce *multilevel control functionals* (MLCFs), a novel and widely applicable extension of control variates that combines non-parametric Stein-based control variates with multi-fidelity methods. We show that when the integrand and the density are smooth, and when the dimensionality is not very high, MLCFs enjoy a faster convergence rate. We provide both theoretical analysis and empirical assessments on differential equation examples, including Bayesian inference for ecological models, to demonstrate the effectiveness of our proposed approach. Furthermore, we extend MLCFs for variational inference, and demonstrate improved performance empirically through Bayesian neural network examples.
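
For readers unfamiliar with the baseline technique, the following is a minimal sketch of a classical scalar control variate, not the multilevel control functionals proposed in the paper. It estimates $\mathbb{E}[e^U]$ for $U \sim \mathrm{Uniform}(0,1)$ (exact value $e - 1$) using $g(U) = U$, whose mean $0.5$ is known. The integrand, the choice of control variate, the sample size, and the seed are all assumptions chosen for illustration.

```python
import math
import random

def mc_with_control_variate(n, seed=0):
    """Estimate E[exp(U)], U ~ Uniform(0, 1), with and without a control variate."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    fs = [math.exp(u) for u in us]  # integrand samples f(U) = exp(U)
    gs = us                         # control variate g(U) = U, with known mean 0.5

    f_mean = sum(fs) / n
    g_mean = sum(gs) / n

    # Estimate the variance-minimizing coefficient beta = Cov(f, g) / Var(g).
    cov_fg = sum((f - f_mean) * (g - g_mean) for f, g in zip(fs, gs)) / (n - 1)
    var_g = sum((g - g_mean) ** 2 for g in gs) / (n - 1)
    beta = cov_fg / var_g

    plain = f_mean
    # Correct the plain estimate using the known error in the mean of g.
    controlled = f_mean - beta * (g_mean - 0.5)
    return plain, controlled

plain, controlled = mc_with_control_variate(10_000)
exact = math.e - 1
print(f"plain error:      {abs(plain - exact):.6f}")
print(f"controlled error: {abs(controlled - exact):.6f}")
```

Because $e^U$ and $U$ are strongly correlated on $[0, 1]$, the corrected estimator has a much smaller variance than the plain average at the same sample size.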

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 8, 8


17. $\pi^3$: Permutation-Equivariant Visual Geometry Learning

Authors:

We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models will be publicly available.

📊 Review Scores

Average score: 8.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 3

Individual scores: 6, 10, 8


18. La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching

Authors:

Recently, many generative models for de novo protein structure design have emerged. Yet, only a few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.

📊 Review Scores

Average score: 8.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 10, 8, 8, 6


19. Distributional Equivalence in Linear Non-Gaussian Latent-Variable Cyclic Causal Models: Characterization and Learning

Authors:

Causal discovery with latent variables is a fundamental task. Yet most existing methods, if not all, rely on strong structural assumptions, such as enforcing specific indicator patterns for latents or restricting how they can interact with others. We argue that a core obstacle to a general, structural-assumption-free approach is the lack of an equivalence characterization: without knowing what can be identified, one generally cannot design methods for how to identify it. In this work, we aim to close this gap for linear non-Gaussian models. We establish the graphical criterion for when two graphs with arbitrary latent structure and cycles are distributionally equivalent, that is, they induce the same observed distribution set. Key to our approach is a new tool, edge rank constraints, which fills a missing piece in the toolbox for latent-variable causal discovery in even broader settings. We further provide a procedure to traverse the whole equivalence class and develop an algorithm to recover models from data up to such equivalence. To our knowledge, this is the first equivalence characterization with latent variables in any parametric setting without structural assumptions, and hence the first structural-assumption-free discovery method. Code and an interactive demo are available at https://equiv.cc.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 8


20. Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

Authors:

The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce **VIST3A**, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit *model stitching*, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt *direct reward finetuning*, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.

📊 Review Scores

Average score: 8.00

Minimum score: 8

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 8


21. The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm

Authors:

Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. Recently, it has emerged as an important subroutine within the Muon algorithm for training deep neural networks. However, the requirements of this application differ sharply from classical settings: deep learning demands GPU-friendly algorithms that prioritize high throughput over high precision. We introduce *Polar Express*, a new method for computing the polar decomposition. Like Newton–Schulz and other classical polynomial methods, our approach uses only matrix-matrix multiplications, making it very efficient on GPUs. Inspired by earlier work of Chen & Chow and Nakatsukasa & Freund, *Polar Express* adapts the update rule at each iteration by solving a minimax optimization problem. We prove that this strategy minimizes error in a worst-case sense, allowing *Polar Express* to converge as rapidly as possible both in the early iterations and asymptotically. We also address finite-precision issues, making it practical to use in `bfloat16`. When integrated into the Muon training framework, our method leads to consistent improvements in validation loss when training a GPT-2 model on one billion tokens from the FineWeb dataset, outperforming recent alternatives across a range of learning rates.
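
To make the "only matrix-matrix multiplications" point concrete, here is a minimal sketch of the classical cubic Newton-Schulz iteration that the abstract names as a baseline, $X_{k+1} = \tfrac{3}{2} X_k - \tfrac{1}{2} X_k X_k^\top X_k$. This is not Polar Express itself, whose per-iteration polynomials come from a minimax problem. The helper names, the Frobenius-norm scaling, the step count, and the test matrix are assumptions; pure-Python lists are used only to keep the example dependency-free.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz_polar(m, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix m."""
    # Scale by the Frobenius norm so all singular values lie in (0, 1],
    # inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # Each singular value s is mapped to 1.5 * s - 0.5 * s**3, which tends to 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_y)]
             for row_x, row_y in zip(x, xxtx)]
    return x

q = newton_schulz_polar([[2.0, 0.0], [1.0, 1.0]])
qqt = matmul(q, transpose(q))
print(qqt)  # should be close to the 2x2 identity
```

The fixed update rule is slow in the first few iterations when some singular values are small, which is the regime the abstract says an adaptive update rule is designed to speed up.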

📊 Review Scores

Average score: 8.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 8, 8, 10, 6


22. Price of Quality: Sufficient Conditions for Sparse Recovery using Mixed-Quality Data

Authors:

We study sparse recovery when observations come from mixed-quality sources: a small collection of high-quality measurements with small noise variance and a larger collection of lower-quality measurements with higher variance. For this heterogeneous-noise setting, we establish sample-size conditions for information-theoretic and algorithmic recovery. On the information-theoretic side, we show that $(n_1, n_2)$ must satisfy a linear trade-off defining the _Price of Quality_: the number of low-quality samples needed to replace one high-quality sample. In the agnostic setting, where the decoder is completely agnostic to the quality of the data, it is uniformly bounded, and in particular one high-quality sample is never worth more than two low-quality samples. In the informed setting, where the decoder is informed of per-sample variances, the price of quality can grow arbitrarily large. On the algorithmic side, we analyze the LASSO in the agnostic setting and show that the recovery threshold matches the homogeneous-noise case and only depends on the average noise level, revealing a striking robustness of computational recovery to data heterogeneity. Together, these results give the first conditions for sparse recovery with mixed-quality data and expose a fundamental difference between how the information-theoretic and algorithmic thresholds adapt to changes in data quality.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 8

📄 OpenReview 📄 Download PDF

23. Neon: Negative Extrapolation From Self-Training Improves Image Generation

Authors:

Scaling generative AI models is bottlenecked by the scarcity of high-quality training data. The ease of synthesizing from a generative model suggests using (unverified) synthetic data to augment a limited corpus of real data for the purpose of fine-tuning in the hope of improving performance. Unfortunately, however, the resulting positive feedback loop leads to model autophagy disorder (MAD, aka model collapse) that results in a rapid degradation in sample quality and/or diversity. In this paper, we introduce Neon (for Negative Extrapolation frOm self-traiNing), a new learning method that turns the degradation from self-training into a powerful signal for self-improvement. Given a base model, Neon first fine-tunes it on its own self-synthesized data but then, counterintuitively, reverses its gradient updates to extrapolate away from the degraded weights. We prove that Neon works because typical inference samplers that favor high-probability regions create a predictable anti-alignment between the synthetic and real data population gradients, which negative extrapolation corrects to better align the model with the true data distribution. Neon is remarkably easy to implement via a simple post-hoc merge that requires no new real data, works effectively with as few as 1k synthetic samples, and typically uses less than 1% additional training compute. We demonstrate Neon’s universality across a range of architectures (diffusion, flow matching, autoregressive, and inductive moment matching models) and datasets (ImageNet, CIFAR-10, and FFHQ). In particular, on ImageNet 256x256, Neon elevates the xAR-L model to a new state-of-the-art FID of 1.02 with only 0.36% additional training compute.
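
The post-hoc merge described above can be illustrated with a minimal sketch. It assumes the extrapolation takes the form base − w·(finetuned − base), with checkpoints stored as dictionaries of flat float lists. The function name `neon_merge`, the storage format, and the merge weight `w` are hypothetical; the abstract does not specify them.

```python
def neon_merge(base, finetuned, w):
    """Extrapolate away from self-trained weights.

    Assumed form: theta = base - w * (finetuned - base), with w > 0.
    `base` and `finetuned` map parameter names to equal-length float lists.
    """
    return {
        name: [b - w * (f - b) for b, f in zip(base[name], finetuned[name])]
        for name in base
    }


# Usage with toy weights (values are made up for illustration).
base = {"layer": [1.0, 2.0]}
self_trained = {"layer": [1.5, 1.0]}
merged = neon_merge(base, self_trained, w=0.5)
```

Setting `w = 0` returns the base weights unchanged, so the merge weight controls how far the result moves in the direction opposite to the self-training update.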

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 OpenReview 📄 Download PDF

24. Transformers Don’t Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and Implications for Mechanistic Interpretability

Authors:

Layer-wise normalization (LN) is an essential component of virtually all transformer-based large language models. While its effects on training stability are well documented, its role at inference time is poorly understood. Additionally, LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components. Here we show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss (e.g. +0.03 cross-entropy loss for GPT-2 XL). Thus LN cannot play a substantial role in language modeling. We find that the amount of fine-tuning data needed for LN removal grows sublinearly with model parameters, suggesting scaling to larger models is feasible. We release a suite of LN-free GPT-2 models on Hugging Face. Furthermore, we test interpretability techniques on LN-free models. Direct logit attribution now gives the exact direct effect of individual components, while the accuracy of attribution patching does not significantly improve. We also confirm that GPT-2's "confidence neurons" are inactive in the LN-free models. Our work clarifies the role of LN layers in language modeling, showing that GPT-2-class models can function without LN layers. We hope that our LN-free analogues of the GPT-2 family of models will enable more precise interpretability research and improve our understanding of language models.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 6

📄 OpenReview 📄 Download PDF

25. Speculative Actions: A Lossless Framework for Faster AI Agents

Authors:

AI agents have attracted growing interest across industry and academia, but in practice their execution can be slow. For example, letting two state-of-the-art agents play a game of chess may take hours. A key bottleneck is that agent behavior unfolds sequentially: each action requires an API call, and these calls can be time-consuming. Inspired by speculative execution in microprocessors and speculative decoding in LLM inference, we propose speculative actions, a lossless framework that predicts likely actions using faster models, enabling multiple API calls to be executed in parallel. We evaluate this framework across four agentic environments: gaming, e-commerce, web search, and operating systems. In all cases, speculative actions yield substantial acceleration, with potential speedups of up to 30%. Moreover, performance can be further improved through stronger guessing models and top-K action prediction, opening a promising path toward real-world, efficient deployment of AI agents.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 8, 10, 6, 6

📄 OpenReview 📄 Download PDF

26. Relative Scaling Laws for LLMs

Authors:

Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 6, 8, 6, 10

📄 OpenReview 📄 Download PDF

27. Why Less is More (Sometimes): A Theory of Data Curation

Authors:

This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: When is it better to use less data? This question has become critical as classical scaling laws suggesting "more is more" (Sun et al., 2025) are challenged by methods like LIMO ("less is more") and s1 (Ye et al., 2025; Muennighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies where an imperfect oracle selects the training examples according to their difficulty and correctness. Our results provide exact scaling law curves for test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of data can improve generalization. In contrast to classical scaling laws, we show that under certain conditions, small curated datasets can outperform full datasets, and we provide analytical conditions for this by deriving precise phase transition curves tied to data size and quality. We validate these theoretical claims with empirical results on ImageNet, confirming our predictions about when curation improves accuracy and can even mitigate model collapse. Furthermore, our framework provides a principled explanation for the contradictory curation strategies recently observed in LLM mathematical reasoning.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 OpenReview 📄 Download PDF

28. Pre-training under infinite compute

Authors:

Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that existing data-constrained approaches of increasing epoch count and parameter count overfit, and we improve upon such recipes by tuning regularization, finding that the optimal weight decay is $30\times$ larger than standard practice. Since our regularized recipe monotonically decreases loss following a power law in parameter count, we estimate its best possible performance via the **asymptote** of its scaling law rather than the performance at a fixed compute budget. We then identify that ensembling independently trained models achieves a significantly lower loss asymptote than the regularized recipe. Our best intervention combining epoching, regularization, parameter scaling, and ensemble scaling achieves an asymptote at 200M tokens using $5.17\times$ less data than our baseline, and our data scaling laws predict that this improvement persists at higher token budgets. We find that our data efficiency gains can be realized at smaller parameter counts as we can distill an ensemble into a student model that is $8\times$ smaller and retains 83% of the ensembling benefit. Finally, our interventions designed for validation loss generalize to downstream benchmarks, achieving a 9% improvement for pre-training evals. Our results show that simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 6

📄 OpenReview 📄 Download PDF

29. Persona Features Control Emergent Misalignment

Authors:

Understanding how language models generalize behaviors from their training to a broader deployment distribution is an important problem in AI safety. Betley et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment," where models give stereotypically malicious responses to unrelated prompts. We extend this work, demonstrating emergent misalignment across diverse conditions, including reinforcement learning on reasoning models, fine-tuning on various synthetic datasets, and in models without safety training. To investigate the mechanisms behind this generalized misalignment, we apply a "model diffing" approach using sparse autoencoders to compare internal model representations before and after fine-tuning. This approach reveals several "misaligned persona" features in activation space, including a toxic persona feature which most strongly controls emergent misalignment and can be used to predict whether a model will exhibit such behavior. Additionally, we investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 OpenReview 📄 Download PDF

30. Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences

Authors:

Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for safety research. Model diffing, the study of differences between base and finetuned models, is a promising approach for understanding how finetuning modifies neural networks. In this paper, we show that narrow finetuning creates easily readable biases in LLM activations that can be detected using simple model diffing tools, suggesting that the finetuning data is overrepresented in the model's activations. In particular, analyzing activation differences between base and finetuned models on the first few tokens of random text and steering with this difference allows us to recover the format and general content of the finetuning data. We demonstrate that these analyses significantly enhance an LLM-based interpretability agent's ability to identify subtle finetuning objectives through interaction with base and finetuned models. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). Our work (1) demonstrates that researchers should be aware that narrowly finetuned models will represent their training data and objective very saliently, and (2) warns AI safety and mechanistic interpretability researchers that these models might not be a realistic proxy for studying broader finetuning, despite current literature widely using them. While we show that mixing pretraining data into the finetuning corpus is enough to remove this bias, a deeper investigation is needed to understand the side effects of narrow finetuning and develop truly realistic case studies for model-diffing, safety and interpretability research.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 OpenReview 📄 Download PDF

31. Protein Structure Tokenization via Geometric Byte Pair Encoding

Authors:

Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an $\mathrm{SE}(3)$ end-frame loss. GeoBPE offers compression (>10× reduction in bits-per-residue at similar distortion rate), data efficiency (>10× less training data), and generalization (maintains test/train distortion ratio of 1.0 to 1.1). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across 12 tasks and 24 test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 6, 6, 8, 10

📄 OpenReview 📄 Download PDF

32. Minimax-Optimal Aggregation for Density Ratio Estimation

Authors:

Density ratio estimation (DRE) is fundamental in machine learning and statistics, with applications in domain adaptation and two-sample testing. However, DRE methods are highly sensitive to hyperparameter selection, with suboptimal choices often resulting in poor convergence rates and empirical performance. To address this issue, we propose a novel model aggregation algorithm for DRE that trains multiple models with different hyperparameter settings and aggregates them. Our aggregation provably achieves minimax-optimal error convergence without requiring prior knowledge of the smoothness of the unknown density ratio. Our method surpasses cross-validation-based model selection and model averaging baselines for DRE on standard benchmarks for DRE and large-scale domain adaptation tasks, setting a new state of the art on image and text data.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 OpenReview 📄 Download PDF

33. From Markov to Laplace: How Mamba In-Context Learns Markov Chains

Authors:

While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed-ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering an interesting phenomenon: even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.
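
As a reference point for the estimator named above, here is a minimal sketch of Laplacian (add-β) smoothing for a first-order Markov chain: the next-state probability is (count of i→j transitions + β) divided by (count of visits to i that have a successor + β·S). The function name, the restriction to first order, and the default β = 1 are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def laplace_next_state_probs(sequence, num_states, beta=1.0):
    """Add-beta estimate of P(next state | current state) from one context.

    `sequence` is a list of integer states in range(num_states); the
    current state is its last element. Assumes a first-order chain.
    """
    current = sequence[-1]
    # Count transitions that start from the current state.
    counts = Counter(
        nxt for prev, nxt in zip(sequence, sequence[1:]) if prev == current
    )
    total = sum(counts.values())
    return [
        (counts[s] + beta) / (total + beta * num_states)
        for s in range(num_states)
    ]
```

With an empty transition history the estimate falls back to the uniform distribution, which is the behaviour the smoothing term β is there to provide.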

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 8

📄 OpenReview 📄 Download PDF

34. Bound by semanticity: universal laws governing the generalization-identification tradeoff

Authors:

Intelligent systems must deploy internal representations that are simultaneously structured—to support broad generalization—and selective—to preserve input identity. We expose a fundamental limit on this tradeoff. For any model whose representational similarity between inputs decays with finite semantic resolution, we derive closed‑form expressions that pin its probability of correct generalization $p_S$ and identification $p_I$ to a universal Pareto front independent of input space geometry. Extending the analysis to noisy, heterogeneous spaces and to inputs $n>2$ predicts a sharp $1/n$ collapse of multi-input processing capacity and a non‑monotonic optimum for $p_S$. A minimal ReLU network trained end‑to‑end reproduces these laws: during learning a resolution boundary self‑organizes and empirical $(p_S,p_I)$ trajectories closely follow theoretical curves for linearly decaying similarity. Finally, we demonstrate that the same limits persist in two markedly more complex settings—a convolutional neural network and state‑of‑the‑art vision–language models—confirming that finite‑resolution similarity is a fundamental emergent informational constraint, not merely a toy‑model artifact. Together, these results provide an exact theory of the generalization‑identification trade‑off and clarify how semantic resolution shapes the representational capacity of deep networks and brains alike.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 6, 6, 8, 10

📄 OpenReview 📄 Download PDF

35. An Improved Model-free Decision-estimation Coefficient with Applications in Adversarial MDPs

Authors:

We study decision making with structured observation (DMSO). The complexity for DMSO has been characterized by a series of works [FKQR21, CMB22, FGH23]. Still, there is a gap between known regret upper and lower bounds: current upper bounds incur a model estimation error that scales with the size of the model class. The work of [FGQ+23] made an initial attempt to reduce the estimation error to only scale with the size of the value function set, resulting in the complexity called optimistic decision-estimation coefficient (optimistic DEC). Yet, their approach relies on the optimism principle to drive exploration, which deviates from the general idea of DEC that drives exploration only through information gain. In this work, we introduce an improved model-free DEC, called Dig-DEC, that removes the optimism mechanism in [FGQ+23], making it more aligned with existing model-based DEC. Dig-DEC is always upper bounded by optimistic DEC, and could be significantly smaller in special cases. Importantly, the removal of optimism allows it to seamlessly handle adversarial environments, while it was unclear how to achieve it within the optimistic DEC framework. By applying Dig-DEC to hybrid MDPs where the transition is stochastic but the reward is adversarial, we provide the first model-free regret bounds in hybrid MDPs with bandit feedback in multiple settings: bilinear classes, Bellman-complete MDPs with bounded Bellman-eluder dimension or coverability, resolving the main open problem left by [LWZ25]. We also improve the online function-estimation procedure used in model-free learning: For average estimation error minimization, we improve the estimator to achieve better concentration. This improves the $T^{\frac{3}{4}}$ and $T^{\frac{5}{6}}$ regret of [FGQ+23] to $T^{\frac{2}{3}}$ and $T^{\frac{7}{9}}$ in the cases with on-policy and off-policy exploration.
For squared estimation error minimization in Bellman-complete MDPs, we redesign the two-timescale procedure in [AZ22, FGQ+23], achieving $\sqrt{T}$ regret that improves over the $T^{\frac{2}{3}}$ regret by [FGQ+23]. This is the first time the performance of a DEC-based approach for Bellman-complete MDPs matches that of optimism-based approaches [JLM21, XFB+23].

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 8

📄 OpenReview 📄 Download PDF

36. Reasoning without Training: Your Base Model is Smarter Than You Think

Authors:

Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by post-training large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time, *without any additional training*. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Crucially, our method does not require training, curated datasets, or a verifier, suggesting a general applicability beyond easily verifiable domains.
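
To make "sampling from a sharpened distribution" concrete, the toy sketch below runs a Metropolis-Hastings chain whose target is proportional to p^α over a small discrete distribution, proposing from p itself. It is a generic textbook construction, not the paper's algorithm: the function name, the independence proposal, the exponent `alpha`, and the toy distribution are all assumptions, and every entry of `probs` is assumed to be strictly positive.

```python
import random


def sharpened_mh_sample(probs, alpha, num_steps, seed=0):
    """Estimate the distribution proportional to probs**alpha via MH.

    Proposals are drawn from `probs` (independence sampler), so the
    acceptance ratio simplifies to (p(y) / p(x)) ** (alpha - 1).
    Returns the empirical visit frequencies of the chain.
    """
    rng = random.Random(seed)
    states = list(range(len(probs)))
    current = rng.choices(states, weights=probs)[0]
    visits = [0] * len(probs)
    for _ in range(num_steps):
        proposal = rng.choices(states, weights=probs)[0]
        ratio = (probs[proposal] / probs[current]) ** (alpha - 1.0)
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return [v / num_steps for v in visits]


# Usage: alpha > 1 concentrates mass on the most likely outcome.
freqs = sharpened_mh_sample([0.5, 0.3, 0.2], alpha=2.0, num_steps=20000)
```

Only likelihoods under p are needed to run the chain, which mirrors the abstract's point that no training, curated data, or verifier is involved.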

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 OpenReview 📄 Download PDF

37. StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

Authors:

Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
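
The bit-wise voting idea can be sketched as a per-bit majority vote over the token indices produced by several branches. The abstract does not spell out the voting rule, so the strict-majority rule, the integer token encoding, and the name `bitwise_majority_vote` are assumptions made for illustration.

```python
def bitwise_majority_vote(branch_tokens, num_bits):
    """Merge per-branch token ids by taking the majority value of each bit.

    `branch_tokens` is a list of non-negative ints, one per branch.
    A bit is set in the output only if more than half the branches set it,
    so an odd number of branches avoids ties.
    """
    merged = 0
    for bit in range(num_bits):
        ones = sum((token >> bit) & 1 for token in branch_tokens)
        if 2 * ones > len(branch_tokens):
            merged |= 1 << bit
    return merged


# Usage: one branch disagrees on two bits, yet the merged token is unchanged.
stable = bitwise_majority_vote([0b1011, 0b1001, 0b0011], num_bits=4)
```

Because each bit is decided independently, a perturbation has to flip the same bit in a majority of branches before the final token changes.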

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 6, 8, 6, 10

📄 OpenReview 📄 Download PDF

38. Spherical Watermark: Encryption-Free, Lossless Watermarking for Diffusion Models

Authors:

Diffusion models have revolutionized image synthesis but raise concerns around content provenance and authenticity. Digital watermarking offers a means of tracing generated media, yet traditional schemes often introduce distributional shifts and degrade visual quality. Recent lossless methods embed watermark bits directly into the latent Gaussian prior without modifying model weights, but still require per-image key storage or heavy cryptographic overhead. In this paper, we introduce Spherical Watermark, an encryption‐free and lossless watermarking framework that integrates seamlessly with diffusion architectures. First, our binary embedding module mixes repeated watermark bits with random padding to form a high-entropy code. Second, the spherical mapping module projects this code onto the unit sphere, applies an orthogonal rotation, and scales by a chi-square-distributed radius to recover exact multivariate Gaussian noise. We theoretically prove that the watermarked noise distribution preserves the target prior up to third-order moments, and empirically demonstrate that it is statistically indistinguishable from a standard multivariate normal distribution. Adopting Stable Diffusion, extensive experiments confirm that Spherical Watermark consistently preserves high visual fidelity while simultaneously improving traceability, computational efficiency, and robustness under attacks, thereby outperforming both lossy and lossless approaches.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 8

📄 OpenReview 📄 Download PDF

39. Cautious Weight Decay

Authors:

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
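
The masking rule described above can be sketched for a single step on a flat parameter list. The sketch assumes the convention p ← p − lr·u for the optimizer update u and treats "signs align" as p·u > 0. The function name, the strict inequality, and the toy values are assumptions rather than details confirmed by the abstract.

```python
def cautious_weight_decay_step(params, update, lr, weight_decay):
    """Apply one update with decay masked to sign-aligned coordinates.

    Assumed convention: the optimizer step is p <- p - lr * u. Decay is
    added only where p and u share a sign (p * u > 0); other coordinates
    receive the plain optimizer update.
    """
    new_params = []
    for p, u in zip(params, update):
        mask = 1.0 if p * u > 0 else 0.0
        new_params.append(p - lr * (u + weight_decay * mask * p))
    return new_params


# Usage: only the first coordinate has matching signs, so only it decays.
stepped = cautious_weight_decay_step(
    params=[1.0, -2.0, 3.0], update=[0.5, 0.5, -1.0], lr=0.1, weight_decay=0.1
)
```

Replacing `mask` with a constant 1.0 recovers standard decoupled weight decay under the same convention, which shows how small the change is.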

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 6, 10, 8, 6

📄 OpenReview 📄 Download PDF

40. EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Authors:

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability and current datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EditBench comprises 545 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EditBench introduces context-dependent problems that require the model to understand code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EditBench is a challenging set of problems where only 3 models score over 60%. We find that model performance varies across different categories of user instructions. Further, we find that varying levels of contextual information greatly affect task success rate, with performance varying up to 11%, indicating the importance of evaluating with realistic context.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 6, 8, 10, 6

📄 OpenReview 📄 Download PDF

41. Is it Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Authors:

Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less "effort" than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to pass a verifier. We progressively truncate a model's CoT at various lengths and measure the verifier-passing rate at each cutoff. A hacking model, which takes a reasoning shortcut, will achieve a high passing rate with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitoring baseline in math, and over 30% gains over a 32B monitoring baseline in code. We further show that TRACE can discover unknown loopholes in the training environment. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
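
The area-under-curve score described above can be sketched with a trapezoidal rule over (truncation fraction, pass rate) points. The function name and the example curves are made up for illustration, and the truncation schedule and the verifier that produces the pass rates are outside this sketch.

```python
def truncation_auc(fractions, pass_rates):
    """Trapezoidal area under the pass-rate vs. CoT-fraction curve.

    `fractions` are increasing cutoffs in [0, 1]; `pass_rates` holds the
    verifier-passing rate measured at each cutoff.
    """
    points = list(zip(fractions, pass_rates))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


# Hypothetical curves: the shortcut-taking model passes with half its CoT.
cutoffs = [0.0, 0.5, 1.0]
gradual_score = truncation_auc(cutoffs, [0.0, 0.5, 1.0])
shortcut_score = truncation_auc(cutoffs, [0.0, 1.0, 1.0])
```

A curve that saturates early yields the larger area, so a threshold on this score separates the two hypothetical models without inspecting the content of the reasoning.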

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 8

📄 OpenReview 📄 Download PDF

42. Hubble: a Model Suite to Advance the Study of LLM Memorization

Authors:

We present Hubble, a suite of open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come as minimal pairs: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models: standard and perturbed, with 1B or 8B parameters, trained on 100B or 500B tokens. Hubble's core experiment establishes that memorization risks are determined by the frequency of sensitive data relative to the training corpus size (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release includes 6 more models with perturbations inserted at different pretraining phases; we observe perturbations without continued exposure can be forgotten. These findings suggest two best practices: to dilute sensitive data by increasing the training corpus size, and to order them to appear earlier in training. Beyond these general findings, Hubble enables a broad range of memorization research. We show that the randomized perturbations in Hubble make it an ideal testbed for membership inference and machine unlearning methods. We invite the community to explore, benchmark, and build upon our work.

📊 Review Scores

Average score: 7.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 8

📄 OpenReview 📄 Download PDF

43. Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization

Authors:

In this paper, we focus on the problem of minimizing a continuously differentiable convex objective function, $\min_x f(x)$. Recently, Malitsky (2020); Alacaoglu et al. (2023) developed an adaptive first-order method, GRAAL. This algorithm computes stepsizes by estimating the local curvature of the objective function without any line search procedures or hyperparameter tuning, and attains the standard iteration complexity $\mathcal{O}(L\Vert x_0-x^* \Vert^2/\epsilon)$ of fixed-stepsize gradient descent for $L$-smooth functions. However, a natural question arises: is it possible to accelerate the convergence of GRAAL to match the optimal complexity $\mathcal{O}(\sqrt{L\Vert x_0-x^*\Vert^2/\epsilon})$ of the accelerated gradient descent of Nesterov (1983)? Although some attempts have been made by Li and Lan (2025); Suh and Ma (2025), the ability of existing accelerated algorithms to adapt to the local curvature of the objective function is highly limited. We resolve this issue and develop GRAAL with Nesterov acceleration, which can adapt its stepsize to the local curvature at a geometric, or linear, rate just like non-accelerated GRAAL. We demonstrate the adaptive capabilities of our algorithm by proving that it achieves near-optimal iteration complexities for $L$-smooth functions, as well as under a more general $(L_0,L_1)$-smoothness assumption (Zhang et al., 2019).

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 6, 6, 10, 8

📄 openreview 📄 Download PDF

44. Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD

Authors:

Matrix factorization mechanisms for differentially private training have emerged as a promising approach to improve model utility under privacy constraints. In practical settings, models are typically trained over multiple epochs, requiring matrix factorizations that account for repeated participation. Existing theoretical upper and lower bounds on multi-epoch factorization error leave a significant gap. In this work, we introduce a new explicit factorization method, Banded Inverse Square Root (BISR), which imposes a banded structure on the inverse correlation matrix. This factorization enables us to derive an explicit and tight characterization of the multi-epoch error. We further prove that BISR achieves asymptotically optimal error by matching the upper and lower bounds. Empirically, BISR performs on par with state-of-the-art factorization methods, while being simpler to implement, computationally efficient, and easier to analyze.
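
The idea of a banded inverse square root can be sketched as follows. This is a minimal illustration under an assumption not stated in the abstract: that the workload is the lower-triangular all-ones (prefix-sum) matrix $A$, as is common in this literature. In that case $A^{-1/2}$ is a lower-triangular Toeplitz matrix whose diagonals are the Taylor coefficients of $(1-x)^{1/2}$, and banding keeps only the first few diagonals. The function names are hypothetical, and the sketch does not reproduce the paper's exact construction or its error analysis.

```python
def inv_sqrt_coeffs(n):
    """First n Taylor coefficients of (1 - x)**0.5: 1, -1/2, -1/8, ..."""
    coeffs = [1.0]
    for k in range(1, n):
        coeffs.append(coeffs[-1] * (2 * k - 3) / (2 * k))
    return coeffs


def banded(coeffs, bandwidth):
    """Keep the first `bandwidth` coefficients and zero out the rest."""
    return [c if k < bandwidth else 0.0 for k, c in enumerate(coeffs)]


def toeplitz_lower(coeffs):
    """Lower-triangular Toeplitz matrix whose k-th subdiagonal is coeffs[k]."""
    n = len(coeffs)
    return [[coeffs[i - j] if i >= j else 0.0 for j in range(n)] for i in range(n)]


def self_convolve(coeffs):
    """Coefficients of the squared power series, truncated to len(coeffs)."""
    n = len(coeffs)
    return [sum(coeffs[j] * coeffs[k - j] for j in range(k + 1)) for k in range(n)]


if __name__ == "__main__":
    c = inv_sqrt_coeffs(6)
    # Squaring the series recovers (1 - x), i.e. the inverse of the prefix-sum matrix.
    print(self_convolve(c))
    # A bandwidth-3 version keeps only three diagonals of the inverse square root.
    for row in toeplitz_lower(banded(c, 3)):
        print(row)
```

Squaring the unbanded coefficients gives back $1 - x$, which confirms that they describe a square root of $A^{-1}$; the banded version trades exactness for a matrix with only a few nonzero diagonals.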

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 10, 6, 6, 8

📄 openreview 📄 Download PDF

45. On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning

Authors:

Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time to improve global generalization, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global generalization. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the generalization performance of decentralized learning under severe data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which was previously considered detrimental noise, as constructive components essential for matching this rate. This work provides promising results that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research. The code will be made publicly available.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 openreview 📄 Download PDF

46. Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Authors:

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose EMPO$^2$, a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO$^2$ achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 6, 6, 10, 8

📄 openreview 📄 Download PDF

47. TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning

Authors:

Latent prediction, where agents learn by predicting their own latents, has emerged as a powerful paradigm for training general representations in machine learning. In reinforcement learning (RL), this approach has been explored to define auxiliary losses for a variety of settings, including reward-based and unsupervised RL, behavior cloning, and world modeling. While existing methods are typically limited to single-task learning, one-step prediction, or on-policy trajectory data, we show that temporal difference (TD) learning enables learning representations predictive of long-term latent dynamics across multiple policies from offline, reward-free transitions. Building on this, we introduce TD-JEPA, which brings TD-based latent-predictive representations into unsupervised RL. TD-JEPA trains explicit state and task encoders, a policy-conditioned multi-step predictor, and a set of parameterized policies directly in latent space. This enables zero-shot optimization of any reward function at test time. Theoretically, we show that an idealized variant of TD-JEPA avoids collapse with proper initialization, and learns encoders that capture a low-rank factorization of long-term policy dynamics, while the predictor recovers their successor features in latent space. Empirically, TD-JEPA matches or outperforms state-of-the-art baselines on locomotion, navigation, and manipulation tasks across 13 datasets in ExoRL and OGBench, especially in the challenging setting of zero-shot RL from pixels.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 openreview 📄 Download PDF

48. Characterization and Learning of Causal Graphs with Latent Confounders and Post-treatment Selection from Interventional Data

Authors:

Interventional causal discovery seeks to identify causal relations by leveraging distributional changes introduced by interventions, even in the presence of latent confounders. Beyond the spurious dependencies induced by latent confounders, we highlight a common yet often overlooked challenge in the problem due to post-treatment selection, in which samples are selectively included in datasets after interventions. This fundamental challenge widely exists in biological studies; for example, in gene expression analysis, both observational and interventional samples are retained only if they meet quality control criteria (e.g., highly active cells). Neglecting post-treatment selection may introduce spurious dependencies and distributional changes under interventions, which can mimic causal responses, thereby distorting causal discovery results and challenging existing causal formulations. To address this, we introduce a novel causal formulation that explicitly models post-treatment selection and reveals how its differential reactions to interventions can distinguish causal relations from selection patterns, allowing us to go beyond traditional equivalence classes toward the underlying true causal structure. We then characterize its Markov properties and propose a $\mathcal{F}$ine-grained $\mathcal{I}$nterventional equivalence class, named $\mathcal{FI}$-Markov equivalence, represented by a new graphical diagram, $\mathcal{F}$-PAG. Finally, we develop a provably sound and complete algorithm, $\mathcal{F}$-FCI, to identify causal relations, latent confounders, and post-treatment selection up to $\mathcal{FI}$-Markov equivalence, using both observational and interventional data. Experimental results on synthetic and real-world datasets demonstrate that our method recovers causal relations despite the presence of both selection and latent confounders.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 8

📄 openreview 📄 Download PDF

49. KDP: Simplifying Representation Dynamics in Kernel Space

Authors:

This paper proposes Kernelized Dynamics Pruning (KDP), a novel layer pruning method from the perspective of simplifying representation dynamics within large language models (LLMs). Motivated by the high similarity between consecutive layer representations, we view the LLM's forward pass as a discrete-time dynamical system. We speculate that this phenomenon indicates the model's internal dynamics have entered a "slow manifold", which exhibits computational redundancy. Based on this insight, we project the representations into a kernel space where the complex, non-linear transformation between them is simplified to an approximately linear one. Then, a simple network learns the inverse kernel transformation, thereby enabling the pruning of the entire layer block. Both theoretical analysis and extensive experiments validate the effectiveness of KDP, demonstrating its superiority over existing pruning baselines. Code is available at https://anonymous.4open.science/r/draft-123abc.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 openreview 📄 Download PDF

50. BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions

Authors:

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short of capturing this complexity, either by treating conversation histories as static context or by limiting evaluation to narrow, read-only (SELECT-ONLY) operations, thereby failing to reflect the challenges encountered in production-grade database assistants. In this work, we introduce BIRD-INTERACT, a benchmark that restores this missing realism through: (1) a ***comprehensive interaction environment*** that couples each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from execution errors without human supervision; (2) two ***evaluation settings*** reflecting real-world interaction: a pre-defined conversational protocol (c-Interact) and a more open-ended agentic setting (a-Interact) in which the model autonomously decides when to query the user simulator or explore the DB environment; (3) a ***challenging task suite*** that covers the full CRUD spectrum for both business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks, requiring LLMs to engage in dynamic interaction. The suite is organized into two sets: a full set (**BIRD-INTERACT-FULL**) of 600 tasks, which unfold into up to **11,796** dynamic interactions for a comprehensive overview of performance, and a lite set (**BIRD-INTERACT-LITE**) of 300 tasks with simplified databases for detailed behavioral analysis of interactions and fast development of methods.
Our empirical results highlight the difficulty of BIRD-INTERACT: the most recent flagship model GPT-5 completes only **8.67%** of tasks in the c-Interact setting and **17.00%** in the a-Interact setting on the full task suite. Further analyses via memory grafting and Interaction Test-time Scaling (ITS) validate the importance of effective interaction for achieving success in complex, dynamic text-to-SQL tasks.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 6

📄 openreview 📄 Download PDF

51. The Art of Scaling Reinforcement Learning Compute for LLMs

Authors:

While the training compute for reinforcement learning (RL) for LLMs is massively increasing, the field is still lacking predictive scaling methodologies for RL comparable to those established for pre-training. This gap is increasingly consequential given recent large-scale RL efforts for reasoning-centric post-training. We present the first open, large-compute, systematic study of RL scaling for LLMs. We fit sigmoidal compute-performance curves for RL post-training and ablate a wide range of common design choices. We observe: (1) not all recipes yield similar asymptotic performance; (2) details such as loss aggregation, normalization, curriculum, and precision handling primarily modulate compute efficiency without materially shifting the asymptote; (3) stable and scalable recipes exhibit predictable performance as a function of compute, akin to established recipes in pre-training. Combining these insights, we propose a "best-practice" recipe, **ScaleRL**, and demonstrate its effectiveness by successfully scaling and predicting RL training performance on up to 100,000 GPU-hours, based on 400,000 GPU-hours of total experiments. Our study provides a principled foundation for predictive RL scaling laws in the LLM era, and a stable, scalable recipe.
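
The distinction between the asymptote and compute efficiency can be illustrated with a small sketch. The abstract only says the curves are sigmoidal; the specific functional form, the parameter names (`asymptote`, `midpoint`, `exponent`), and all numbers below are hypothetical choices for illustration, not the paper's fitted model.

```python
def sigmoid_compute_curve(compute, asymptote, midpoint, exponent):
    """Saturating curve: performance rises with compute toward `asymptote`.

    `midpoint` is the compute at which half of the asymptote is reached,
    so a smaller midpoint means a more compute-efficient recipe.
    """
    return asymptote / (1.0 + (midpoint / compute) ** exponent)


# Two hypothetical recipes with the same ceiling but different efficiency.
recipe_a = dict(asymptote=0.60, midpoint=1_000.0, exponent=1.5)
recipe_b = dict(asymptote=0.60, midpoint=4_000.0, exponent=1.5)

if __name__ == "__main__":
    for gpu_hours in (1_000.0, 10_000.0, 100_000.0):
        a = sigmoid_compute_curve(gpu_hours, **recipe_a)
        b = sigmoid_compute_curve(gpu_hours, **recipe_b)
        print(f"{gpu_hours:>9.0f} GPU-hours: recipe A {a:.3f}, recipe B {b:.3f}")
```

In this toy setting both recipes converge to the same ceiling, but recipe A gets there with less compute, which mirrors observation (2); changing `asymptote` instead would correspond to observation (1).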

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 8

📄 openreview 📄 Download PDF

52. Quotient-Space Diffusion Model

Authors:

Diffusion-based generative models have reshaped generative AI and have enabled new capabilities in scientific domains, for example, generating 3D structures of molecules. Due to the intrinsic problem structure of certain tasks, there is often a symmetry in the system, which identifies objects that can be converted into one another by a group action as equivalent; hence the target distribution is essentially defined on the quotient space with respect to the group. In this work, we establish a formal framework for diffusion modeling on a general quotient space, and apply it to molecular structure generation, which follows the special Euclidean group SE(3) symmetry. The framework removes the need to learn the component corresponding to the group action, and hence reduces learning difficulty relative to conventional group-equivariant diffusion models, and the sampler guarantees recovering the target distribution, while heuristic alignment strategies lack proper samplers. The arguments are empirically validated on structure generation for small molecules and proteins, indicating that the principled quotient-space diffusion model provides a new framework that outperforms previous symmetry treatments.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 6, 10, 8, 6

📄 openreview 📄 Download PDF

53. Temporal superposition and feature geometry of RNNs under memory demands

Authors:

Understanding how populations of neurons represent information is a central challenge across machine learning and neuroscience. Recent work in both fields has begun to characterize the representational geometry and functionality underlying complex distributed activity. For example, artificial neural networks trained on data with more features than neurons compress data by representing features non-orthogonally in so-called *superposition*. However, the effect of time (or memory), an additional capacity-constraining pressure, on underlying representational geometry in recurrent models is not well understood. Here, we study how memory demands affect representational geometry in recurrent neural networks (RNNs), introducing the concept of temporal superposition. We develop a theoretical framework to better understand how properties of the data, task demands, and network dimensionality lead to different representational strategies. Through this, we identify an effectively linear, dense regime and a sparse regime where RNNs utilize an interference-free space, characterized by a phase transition in the angular distribution of features and decrease in spectral radius. Finally, we analyze the interaction of spatial and temporal superposition to observe how RNNs mediate different representational tradeoffs. Overall, our work offers a mechanistic, geometric explanation of representational strategies RNNs learn, how they depend on capacity and task demands, and why.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 8

📄 openreview 📄 Download PDF

54. FullPart: Generating each 3D Part at Full Resolution

Authors:

Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method, even a small one, is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and models to benefit future research in 3D part generation.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 8, 8, 6

📄 openreview 📄 Download PDF

55. Characterizing the Discrete Geometry of ReLU Networks

Authors:

It is well established that ReLU networks define continuous piecewise-linear functions, and that their linear regions are polyhedra in the input space. These regions form a complex that fully partitions the input space. The way these regions fit together is fundamental to the behavior of the network, as nonlinearities occur only at the boundaries where these regions connect. However, relatively little is known about the geometry of these complexes beyond bounds on the total number of regions, and calculating the complex exactly is intractable for most networks. In this work, we prove new theoretical results about these complexes that hold for all fully-connected ReLU networks, specifically about their connectivity graphs in which nodes correspond to regions and edges exist between each pair of regions connected by a face. We find that the average degree of this graph is upper bounded by twice the input dimension regardless of the width and depth of the network, and that the diameter of this graph has an upper bound that does not depend on input dimension, despite the number of regions increasing exponentially with input dimension. We corroborate our findings through experiments with networks trained on both synthetic and real-world data, which provide additional insight into the geometry of ReLU networks.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 8

📄 openreview 📄 Download PDF

56. TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS

Authors:

Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow-GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces three key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases; and (iii) a seed group strategy that controls for initialization effects to isolate exploration contributions. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and text-to-image benchmarks.

📊 Review Scores

Average score: 7.50

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 6, 6, 8, 10

📄 openreview 📄 Download PDF

57. THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

Authors:

We present **THEMIS**, a novel multi-task benchmark designed to comprehensively evaluate Multimodal Large Language Models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advancements. (1) **Real-world Scenarios & Complexity**: Our benchmark comprises over 4K questions spanning 7 scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 73.73% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) **Task Diversity & Granularity**: THEMIS systematically covers five challenging tasks and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) **Multi-dimensional Capability Evaluation**: We establish a mapping from fraud tasks to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 11 leading MLLMs show that even the best-performing model still falls below the passing threshold, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud detection tasks. The data and code will be made available at https://anonymous.4open.science/r/themis1638.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

58. Computational Bottlenecks for Denoising Diffusions

Authors:

Denoising diffusions sample from a probability distribution $\mu$ in $\mathbb{R}^d$ by constructing a stochastic process $(\hat{\mathbf{x}}_t:t\ge 0)$ in $\mathbb{R}^d$ such that $\hat{\mathbf{x}}_0$ is easy to sample, but the distribution of $\hat{\mathbf{x}}_T$ at large $T$ approximates $\mu$. The drift $\mathbf{m}:\mathbb{R}^{d}\times\mathbb{R}\to\mathbb{R}^d$ of this diffusion process is learned by minimizing a score-matching objective. Is every probability distribution $\mu$, for which sampling is tractable, also amenable to sampling via diffusions? We address this question by studying its relation to information-computation gaps in statistical estimation. Earlier work in this area constructs broad families of distributions $\mu$ for which sampling is easy, but approximating the drift $\mathbf{m}(\mathbf{y},t)$ is conjectured to be intractable, and provides rigorous evidence for intractability. We prove that this implies a failure of sampling via diffusions. First, there exist drifts whose score matching objective is superpolynomially close to the optimum value (among polynomial time drifts) and yet yield samples with distribution that is very far from the target one. Second, any polynomial-time drift that is also Lipschitz continuous results in equally incorrect sampling. We instantiate our results on the toy problem of sampling a sparse low-rank matrix, and further demonstrate empirically the failure of diffusion-based sampling. Our work implies that caution should be used in adopting diffusion sampling when other approaches are available.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

59. Robust Decision-Making with Partially Calibrated Forecasters

Authors:

Calibration has emerged as a foundational goal in trustworthy machine learning, in part because of its strong decision theoretic semantics. Independent of the underlying distribution, and independent of the decision maker's utility function, calibration promises that amongst all policies mapping predictions to actions, the uniformly best policy is the one that trusts the predictions and acts as if they were correct. But this is true only of fully calibrated forecasts, which are tractable to guarantee only for very low dimensional prediction problems. For higher dimensional prediction problems (e.g. when outcomes are multiclass), weaker forms of calibration have been studied that lack these decision theoretic properties. In this paper we study how a conservative decision maker should map predictions endowed with these weaker (partial) calibration guarantees to actions, in a way that is robust in a minimax sense: i.e. to maximize their expected utility in the worst case over distributions consistent with the calibration guarantees. We characterize their minimax optimal decision rule via a duality argument, and show that surprisingly, trusting the predictions and acting accordingly is recovered in this minimax sense by decision calibration (and any strictly stronger notion of calibration), a substantially weaker and more tractable condition than full calibration. For calibration guarantees that fall short of decision calibration, the minimax optimal decision rule is still efficiently computable, and we provide an empirical evaluation of a natural one that applies to any regression model solved to optimize squared error.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

60. LeRobot: An Open-Source Library for End-to-End Robot Learning

Authors:

Robotics is undergoing a significant transformation powered by advances in high-level control techniques based on machine learning, giving rise to the field of robot learning. Recent progress in robot learning has been accelerated by the increasing availability of affordable teleoperation systems, large-scale openly available datasets, and scalable learning-based methods. However, development in the field of robot learning is often slowed by fragmented, closed-source tools designed to only address specific sub-components within the robotics stack. In this paper, we present lerobot, an open-source library that integrates across the entire robotics stack, from low-level middleware communication for motor controls to large-scale dataset collection, storage and streaming. The library is designed with a strong focus on real-world robotics, supporting accessible hardware platforms while remaining extensible to new embodiments. It also supports efficient implementations for various state-of-the-art robot learning algorithms from multiple prominent paradigms, as well as a generalized asynchronous inference stack. Unlike traditional pipelines which heavily rely on hand-crafted techniques, lerobot emphasizes scalable learning approaches that improve directly with more data and compute. Designed for accessibility, scalability, and openness, lerobot lowers the barrier to entry for researchers and practitioners to robotics while providing a platform for reproducible, state-of-the-art robot learning.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

61. Variational Deep Learning via Implicit Regularization

Authors:

Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Instead, current theory credits implicit regularization imposed by the choice of architecture, hyperparameters and optimization procedure. However, deep neural networks can be surprisingly non-robust, resulting in overconfident predictions and poor out-of-distribution generalization. Bayesian deep learning addresses this via model averaging, but typically requires significant computational resources as well as carefully elicited priors to avoid overriding the benefits of implicit regularization. Instead, in this work, we propose to regularize variational neural networks solely by relying on the implicit bias of (stochastic) gradient descent. We theoretically characterize this inductive bias in overparametrized linear models as generalized variational inference and demonstrate the importance of the choice of parametrization. Empirically, our approach demonstrates strong in- and out-of-distribution performance without additional hyperparameter tuning and with minimal computational overhead.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

62. MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Authors:

The MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this, we propose MCPMark, a benchmark designed to evaluate realistic and comprehensive MCP use, comprising 127 high-quality tasks collaboratively created by human experts and AI agents. Specifically, each task starts from a curated initial state and includes a programmatic script for automatic verification. Moreover, these tasks require richer and more varied interactions with the environment, involving diverse create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.18 execution turns and 17.38 tool calls per task, substantially exceeding those in previous MCP benchmarks and demonstrating the stress-testing nature of MCPMark.
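
The two metrics quoted above can be sketched as follows. The abstract does not define them, so this sketch assumes the common reading: pass@1 is the average success rate of a single run, and pass^4 is the fraction of tasks solved in all four independent runs. The function names and the toy outcomes are hypothetical.

```python
def pass_at_1(results):
    """Average single-run success rate over all tasks and runs."""
    total_runs = sum(len(runs) for runs in results)
    return sum(sum(runs) for runs in results) / total_runs


def pass_hat_k(results):
    """Fraction of tasks for which every recorded run succeeded."""
    return sum(all(runs) for runs in results) / len(results)


# Toy outcomes: one row per task, one boolean per run (4 runs each).
toy_results = [
    [True, True, True, True],
    [True, False, True, True],
    [False, False, False, False],
    [True, True, False, False],
]

if __name__ == "__main__":
    print(f"pass@1 = {pass_at_1(toy_results):.4f}")
    print(f"pass^4 = {pass_hat_k(toy_results):.4f}")
```

Under this reading pass^4 can never exceed pass@1, since a task counts only when it is solved consistently; the gap between the two reported numbers is then a measure of run-to-run instability.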

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 10

Number of reviewers: 3

Individual scores: 6, 6, 10

📄 openreview 📄 Download PDF

63. Discovering alternative solutions beyond the simplicity bias in recurrent neural networks

Authors:

Training recurrent neural networks (RNNs) to perform neuroscience-style tasks has become a popular way to generate hypotheses for how neural circuits in the brain might perform computations. Recent work has demonstrated that task-trained RNNs possess a strong simplicity bias. In particular, this inductive bias often causes RNNs trained on the same task to collapse on effectively the same solution, typically comprised of fixed-point attractors or other low-dimensional dynamical motifs. While such solutions are readily interpretable, this collapse proves counterproductive for the sake of generating a set of genuinely unique hypotheses for how neural computations might be performed. Here we propose Iterative Neural Similarity Deflation (INSD), a simple method to break this inductive bias. By penalizing linear predictivity of neural activity produced by standard task-trained RNNs, we find an alternative class of solutions to classic neuroscience-style RNN tasks. These solutions appear distinct across a battery of analysis techniques, including representational similarity metrics, dynamical systems analysis, and the linear decodability of task-relevant variables. Moreover, these alternative solutions can sometimes achieve superior performance in difficult or out-of-distribution task regimes. Our findings underscore the importance of moving beyond the simplicity bias to uncover richer and more varied models of neural computation.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

64. Latent Stochastic Interpolants

Authors:

Stochastic Interpolants (SI) are a powerful framework for generative modeling, capable of flexibly transforming between two probability distributions. However, their use in jointly optimized latent variable models remains unexplored as they require direct access to the samples from the two distributions. This work presents Latent Stochastic Interpolants (LSI) enabling joint learning in a latent space with end-to-end optimized encoder, decoder and latent SI models. We achieve this by developing a principled Evidence Lower Bound (ELBO) objective derived directly in continuous time. The joint optimization allows LSI to learn effective latent representations along with a generative process that transforms an arbitrary prior distribution into the encoder-defined aggregated posterior. LSI sidesteps the simple priors of the normal diffusion models and mitigates the computational demands of applying SI directly in high-dimensional observation spaces, while preserving the generative flexibility of the SI framework. We demonstrate the efficacy of LSI through comprehensive experiments on the standard large scale ImageNet generation benchmark.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 10

Number of reviewers: 3

Individual scores: 6, 10, 6

📄 openreview 📄 Download PDF

65. The Coverage Principle: How Pre-Training Enables Post-Training

Authors:

Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross entropy loss, cross entropy can be poorly predictive of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of coverage, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods like Best-of-N to succeed. Our main results develop an understanding of the coverage principle, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: coverage generalizes faster than cross entropy, avoiding spurious dependence on problem dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
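As a rough illustration of why the mass placed on high-quality responses matters for Best-of-N, the following worked bound may help. It is a simplified sketch, not the paper's formal definition of coverage: the symbols $p$, $N$ and $\delta$ are assumptions introduced here, the $N$ samples are taken to be independent, and the selector is assumed to always recognise a good response.

```latex
% Assumption: p is the probability that a single sampled response is high quality,
% the N samples are independent, and the selector always recognises a good response.
\[
\Pr[\text{Best-of-}N \text{ succeeds}] \;=\; 1 - (1 - p)^N \;\ge\; 1 - e^{-pN},
\]
\[
\text{so } N \ge \frac{\ln(1/\delta)}{p} \text{ samples suffice for success probability at least } 1 - \delta .
\]
% Example: p = 0.01 and \delta = 0.05 give N \ge \ln(20)/0.01 \approx 300.
```

Under these assumptions the number of samples needed scales inversely with $p$, which is one way to read the claim that good coverage is what makes test-time scaling affordable.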

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

66. In-Place Test-Time Training

Authors:

The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce **In-Place Test-Time Training (In-Place TTT)**, a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

67. X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

Authors:

Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together empower VLA models with effective exploitation of varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders with an enhanced encoding pipeline, enjoying both scalability and simplicity. Evaluated across 6 simulation environments as well as 3 real-world robotics platforms, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves state-of-the-art performance over a sweep of benchmark suites, demonstrating superior results across a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

68. Symmetry-Aware Bayesian Optimization via Max Kernels

Authors:

Bayesian Optimization (BO) is a powerful framework for optimizing noisy, expensive-to-evaluate black-box functions. When the objective exhibits invariances under a group action, exploiting these symmetries can substantially improve BO efficiency. While using maximum similarity across group orbits has long been considered in other domains, the fact that the max kernel is not positive semidefinite (PSD) has prevented its use in BO. In this work, we revisit this idea by considering a PSD projection of the max kernel. Compared to existing invariant (and non-invariant) kernels, we show it achieves significantly lower regret on both synthetic and real-world BO benchmarks, without increasing computational complexity.
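The idea of taking the maximum similarity over a group orbit and then repairing positive semidefiniteness can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's construction: the RBF base kernel, the coordinate-permutation group, and the eigenvalue-clipping projection (which gives the nearest PSD matrix in Frobenius norm) are all choices made here, and the function names are hypothetical.

```python
import itertools
import numpy as np

def rbf(x, y, lengthscale=1.0):
    """Base RBF kernel between two points (assumed base kernel)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * lengthscale ** 2))

def max_kernel_gram(X, group_actions):
    """Gram matrix of k_max(x, y) = max over g of k(x, g(y))."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = max(rbf(X[i], g(X[j])) for g in group_actions)
    return K

def psd_project(K):
    """Nearest PSD matrix in Frobenius norm: clip negative eigenvalues to zero."""
    K = (K + K.T) / 2.0                      # remove numerical asymmetry
    w, V = np.linalg.eigh(K)
    return (V * np.clip(w, 0.0, None)) @ V.T

# Assumed symmetry group: all permutations of the 3 input coordinates.
perms = list(itertools.permutations(range(3)))
group_actions = [lambda x, p=p: x[list(p)] for p in perms]

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
K_max = max_kernel_gram(X, group_actions)    # may have negative eigenvalues
K_psd = psd_project(K_max)                   # usable as a GP covariance matrix
```

Clipping operates on a fixed Gram matrix, so extending it to new query points needs extra care; the sketch only shows the projection step on a finite set of inputs.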

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

69. In-The-Flow Agentic System Optimization for Effective Planning and Tool Use

Authors:

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, *in-the-flow* agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose *Flow-based Group Refined Policy Optimization* (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns. Codebase is available at https://anonymous.4open.science/r/agentflow.
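The two ingredients named in the abstract, broadcasting one trajectory-level outcome to every turn and normalizing advantages within a group, can be sketched roughly as below. This assumes the common GRPO-style normalization (subtract the group mean, divide by the group standard deviation); the function name, the `eps` constant and the example numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def broadcast_group_advantages(outcomes, turns_per_rollout, eps=1e-8):
    """Give every turn of a rollout the same group-normalized advantage.

    outcomes: one scalar outcome reward per rollout in the group.
    turns_per_rollout: number of planner turns in each rollout.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    # Group normalization: compare each rollout with its siblings on the same task.
    advantages = (outcomes - outcomes.mean()) / (outcomes.std() + eps)
    # Broadcast: each turn inherits its rollout's trajectory-level advantage.
    return [np.full(t, a) for a, t in zip(advantages, turns_per_rollout)]

# Four rollouts of the same task, two successful, with different turn counts.
per_turn = broadcast_group_advantages([1.0, 0.0, 0.0, 1.0], [3, 5, 2, 4])
```

Each array in `per_turn` can then weight a single-turn policy-gradient term, which is how a multi-turn problem reduces to a sequence of single-turn updates in this simplified picture.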

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

70. Fewer Battles, More Gain: An Information-Efficient Framework for Arena-based LLM Evaluation

Authors:

Arena-based evaluation has become a key method for assessing large language models (LLMs) through head-to-head model comparisons, closely reflecting human preferences. However, current arena rating systems (e.g., the Elo rating system) often suffer from inefficiencies due to exhaustive or random model pair annotations, leading to redundant evaluations, longer evaluation times, and lower overall efficiency. To address these challenges, we propose a novel adaptive model-pair selection algorithm. By leveraging the asymptotic normality of LLM ability estimation under sparse conditions, our approach strategically selects high-value model pairs, focusing on confrontations with the lowest variance. Specifically, we introduce Fisher information as a metric to guide model pair selection, optimizing the evaluation process through A-optimality and D-optimality. A-optimality minimizes estimation variance, ensuring balanced reliability across models, while D-optimality reduces uncertainty by maximizing the determinant of the Fisher Information Matrix. Extensive experiments on both simulated and real-world datasets demonstrate that our method outperforms existing approaches in terms of information efficiency and result reliability. Notably, our method offers a flexible, general toolkit that can be easily integrated into existing arena-based platforms, greatly improving scalability and efficiency for large-scale LLM evaluations.
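To make the A- and D-optimality criteria concrete, here is a small greedy pair selector. It is a sketch under assumptions that the abstract does not spell out: outcomes are modeled with a Bradley-Terry model, the ability vector `theta` holds hypothetical current estimates, model 0 is used as the reference to remove the shift ambiguity, and a small ridge term keeps the matrix invertible before any data arrives.

```python
import itertools
import numpy as np

def pair_information(theta, i, j):
    """Fisher information contributed by one comparison between models i and j."""
    p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))  # P(i beats j)
    d = np.zeros(len(theta))
    d[i], d[j] = 1.0, -1.0
    return p * (1.0 - p) * np.outer(d, d)

def select_pair(theta, fim, criterion="D"):
    """Return the pair whose next comparison best improves the chosen criterion."""
    best_pair, best_score = None, -np.inf
    for i, j in itertools.combinations(range(len(theta)), 2):
        # Drop row/column 0: abilities are identified only up to a shift,
        # so model 0 serves as the reference.
        cand = (fim + pair_information(theta, i, j))[1:, 1:]
        if criterion == "D":
            score = np.linalg.slogdet(cand)[1]        # maximise log-determinant
        else:
            score = -np.trace(np.linalg.inv(cand))    # minimise total variance
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair

theta = np.array([0.0, 0.1, 3.0])   # hypothetical current ability estimates
fim = 1e-2 * np.eye(len(theta))     # small ridge so the matrix is invertible
picks = []
for _ in range(2):
    pair = select_pair(theta, fim, criterion="D")
    picks.append(pair)
    fim = fim + pair_information(theta, *pair)
```

Closely matched models carry the most information per battle because `p * (1 - p)` peaks at `p = 0.5`, so the selector starts with the near-tied pair and then moves on to a pair that connects the remaining model.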

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

71. InclusiveVidPose: Bridging the Pose Estimation Gap for Individuals with Limb Deficiencies in Video-Based Motion

Authors:

Approximately 445.2 million individuals worldwide are living with traumatic amputations, and an estimated 31.64 million children aged 0–14 have congenital limb differences, yet they remain largely underrepresented in human pose estimation (HPE) research. Accurate HPE could significantly benefit this population in applications such as rehabilitation monitoring and health assessment. However, the existing HPE datasets and methods assume that humans possess a full complement of upper and lower extremities and fail to model missing or altered limbs. As a result, people with limb deficiencies remain largely underrepresented, and current models cannot generalize to their unique anatomies or predict absent joints. To bridge this gap, we introduce the InclusiveVidPose Dataset, the first large-scale video-based HPE dataset specifically for individuals with limb deficiencies. We collect 313 videos, totaling 327k frames and covering nearly 400 individuals with amputations, congenital limb differences, and prosthetic limbs. We adopt 8 extra keypoints at each residual limb end to capture individual anatomical variations. Under the guidance of an internationally accredited para-athletics classifier, we annotate each frame with pose keypoints, segmentation masks, bounding boxes, tracking IDs, and per-limb prosthesis status. Experiments on InclusiveVidPose highlight the limitations of the existing HPE models for individuals with limb deficiencies. We introduce a new evaluation metric, Limb-specific Confidence Consistency (LiCC), which assesses the consistency of pose estimations between residual and intact limb keypoints. We also provide a rigorous benchmark for evaluating inclusive and robust pose estimation algorithms, demonstrating that our dataset poses significant challenges. We hope InclusiveVidPose spurs research toward methods that fairly and accurately serve all body types.
The project website is available at: [InclusiveVidPose](https://anonymous-accept.github.io/inclusivevidpose/).

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

72. Provable Guarantees for Automated Circuit Discovery in Mechanistic Interpretability

Authors:

*Automated circuit discovery* is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they typically depend on heuristics or approximations and do not offer provable guarantees over continuous input domains for the resulting circuits. In this work, we leverage recent advances in neural network verification to propose a suite of automated algorithms that yield circuits with *provable guarantees*. We focus on three types of guarantees: (1) *input domain robustness*, ensuring the circuit agrees with the model across a continuous input region; (2) *robust patching*, certifying circuit alignment under continuous patching perturbations; and (3) *minimality*, formalizing and capturing a wide array of various notions of succinctness. Interestingly, we uncover a diverse set of novel theoretical connections among these three families of guarantees, with critical implications for the convergence of our algorithms. Finally, we conduct experiments with state-of-the-art verifiers on various vision models, showing that our algorithms yield circuits with substantially stronger robustness guarantees than standard circuit discovery methods, establishing a principled foundation for provable circuit discovery.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

73. Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

Authors:

We propose a method for evaluating the robustness of widely used LLM ranking systems (variants of a Bradley-Terry model) to dropping a worst-case, very small fraction of preference data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from popular LLM ranking platforms, including Chatbot Arena and derivatives, we find that the rankings of top-performing models can be remarkably sensitive to the removal of a small fraction of preferences; for instance, dropping just 0.003% of human preferences can change the top-ranked model on Chatbot Arena. Our robustness check identifies the specific preferences most responsible for such ranking flips, allowing for inspection of these influential preferences. We observe that the rankings derived from MT-bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-bench's use of expert annotators and carefully constructed prompts. Finally, we find that neither rankings based on crowdsourced human evaluations nor those based on LLM-as-a-judge preferences are systematically more sensitive than the other.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

74. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Authors:

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 1.5: a carefully curated hard benchmark composed of 74 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 50% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

75. Estimating Dimensionality of Neural Representations from Finite Samples

Authors:

The global dimensionality of a neural representation manifold provides rich insight into the computational process underlying both artificial and biological neural networks. However, all existing measures of global dimensionality are sensitive to the number of samples, i.e., the number of rows and columns of the sample matrix. We show that, in particular, the participation ratio of eigenvalues, a popular measure of global dimensionality, is highly biased with small sample sizes, and propose a bias-corrected estimator that is more accurate with finite samples and with noise. On synthetic data examples, we demonstrate that our estimator can recover the true known dimensionality. We apply our estimator to neural brain recordings, including calcium imaging, electrophysiological recordings, and fMRI data, and to the neural activations in a large language model and show our estimator is invariant to the sample size. Finally, our estimators can additionally be used to measure the local dimensionalities of curved neural manifolds by weighting the finite samples appropriately.
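For readers unfamiliar with the participation ratio, the sketch below computes the standard, uncorrected version, $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$ over the eigenvalues of the sample covariance, and shows how it depends on the number of samples. It does not implement the bias-corrected estimator proposed in the paper; the synthetic data, the dimensions and the function name are assumptions chosen for illustration.

```python
import numpy as np

def participation_ratio(X):
    """Naive participation ratio of the sample covariance eigenvalues.

    X: array of shape (n_samples, n_features).
    PR = (sum_i lambda_i)^2 / sum_i lambda_i^2.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # trace(C) is the sum of eigenvalues, trace(C @ C) the sum of their squares.
    return np.trace(cov) ** 2 / np.trace(cov @ cov)

rng = np.random.default_rng(0)
true_dim, n_features = 10, 50
# Orthonormal basis of a 10-dimensional subspace of a 50-dimensional space.
basis = np.linalg.qr(rng.standard_normal((n_features, true_dim)))[0]

pr_by_n = {}
for n in (20, 200, 20000):
    latent = rng.standard_normal((n, true_dim))   # isotropic, so true PR = 10
    pr_by_n[n] = participation_ratio(latent @ basis.T)
```

In this setup the population value is 10, and the naive estimate approaches it only as the sample count grows, falling short for small samples; this is the kind of sample-size sensitivity the abstract describes.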

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

76. Safe Exploration via Policy Priors

Authors:

Safe exploration is a key requirement for reinforcement learning agents to learn and adapt online, beyond controlled (e.g., simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, and validate our theoretical guarantees in practice.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

77. CoDA: Agentic Systems for Collaborative Data Visualization

Authors:

Automating data visualization from natural language is crucial for data science, yet current systems struggle with complex datasets containing multiple files and iterative refinement. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi-agent problem. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and iterative reflection. We formalize this pipeline, demonstrating how metadata-focused analysis bypasses token limits and quality-driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial gains in the overall score, outperforming competitive baselines by up to 41.5%. This work demonstrates that the future of visualization automation lies not in isolated code generation but in integrated, collaborative agentic workflows.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

78. TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

Authors:

Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas and individuals. By combining the pretrained representations of text, audio and video foundational models and handling their time-evolving nature with a transformer, our model can precisely model the spatial and temporal fMRI responses to videos, achieving the first place in the Algonauts 2025 brain encoding competition with a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g., visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards building an integrative model of representations in the human brain. Our code is available at https://anonymous.4open.science/r/algonauts-2025-C63E.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

79. Learning with Dual-level Noisy Correspondence for Multi-modal Entity Alignment

Authors:

Multi-modal entity alignment (MMEA) aims to identify equivalent entities across heterogeneous multi-modal knowledge graphs (MMKGs), where each entity is described by attributes from various modalities. Existing methods typically assume that both intra-entity and inter-graph correspondences are faultless, which is often violated in real-world MMKGs due to the reliance on expert annotations. In this paper, we reveal and study a highly practical yet under-explored problem in MMEA, termed Dual-level Noisy Correspondence (DNC). DNC refers to misalignments in both intra-entity (entity-attribute) and inter-graph (entity-entity and attribute-attribute) correspondences. To address the DNC problem, we propose a robust MMEA framework termed RULE. RULE first estimates the reliability of both intra-entity and inter-graph correspondences via a dedicated two-fold principle. Leveraging the estimated reliabilities, RULE mitigates the negative impact of intra-entity noise during attribute fusion and prevents overfitting to noisy inter-graph correspondences during inter-graph discrepancy elimination. Beyond the training-time designs, RULE further incorporates a correspondence reasoning module that uncovers the underlying attribute-attribute connection across graphs, guaranteeing more accurate equivalent entity identification. Extensive experiments on five benchmarks verify the effectiveness of our method against the DNC compared with seven state-of-the-art methods. The code will be released upon acceptance.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

80. Differentiable Model Predictive Control on the GPU

Authors:

Differentiable model predictive control (MPC) offers a powerful framework for combining learning and control. However, its adoption has been limited by the inherently sequential nature of traditional optimization algorithms, which are challenging to parallelize on modern computing hardware like GPUs. In this work, we tackle this bottleneck by introducing a GPU-accelerated differentiable optimization tool for MPC. This solver leverages sequential quadratic programming and a custom preconditioned conjugate gradient (PCG) routine with tridiagonal preconditioning to exploit the problem's structure and enable efficient parallelization. We demonstrate substantial speedups over CPU- and GPU-based baselines, significantly improving upon state-of-the-art training times on benchmark reinforcement learning and imitation learning tasks. Finally, we showcase the method on the challenging task of reinforcement learning for driving at the limits of handling, where it enables robust drifting of a Toyota Supra through water puddles.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

81. Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Authors:

Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses; e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called *Collaborative Battleship*, in which a partially-informed *Captain* must balance exploration (asking questions) and action (taking shots), while a fully-informed *Spotter* must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303–0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% → 82% win rate) and frontier models (0% → 67% win rate vs. GPT-5) at ≈1% of GPT-5's cost. We replicate these findings on *Guess Who?* where our methods significantly boost accuracy (+28.3–42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

82. Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling

Authors:

We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code and pre-trained models will be made publicly available. Video rollouts are available at https://sites.google.com/view/lpwm

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

83. FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

Authors:

Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers a substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads in each GQA group; this inconsistency significantly limits the applicability of this sparse algorithmic advance. In this work, we propose **F**lash **S**parse **A**ttention (**FSA**), an alternative kernel implementation that enables efficient NSA computation on modern GPUs across a wide range of popular LLMs with varied, smaller numbers of query heads in each GQA group. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

84. Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Authors:

Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including Mega-Math, FineMath, and OpenWebMath, but also contains 5.5× more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

85. LLM Pretraining with Continuous Concepts

Authors:

Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts "continuous concepts" learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction and knowledge distillation. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model’s internal reasoning process.

📊 Review Scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

86. DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Authors:

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks. These include solver restrictions, forward–reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

87. Sequences of Logits Reveal the Low Rank Structure of Language Models

Authors:

A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation --- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
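
To make the idea of "low approximate rank" concrete, here is a minimal sketch on synthetic data; it does not reproduce the paper's experiments. The array `logit_matrix` is a hypothetical stand-in for a matrix indexed by prompts (rows) and responses (columns), since the abstract does not spell out the exact construction. The function `approximate_rank`, its `energy` threshold, and the least-squares reconstruction at the end are illustrative assumptions.

```python
import numpy as np

def approximate_rank(matrix: np.ndarray, energy: float = 0.99) -> int:
    """Smallest r whose top-r singular values hold `energy` of the squared spectrum."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(0)
n_prompts, n_responses, true_rank = 200, 300, 5

# Synthetic stand-in for a logit matrix: rank-5 signal plus small noise.
left = rng.normal(size=(n_prompts, true_rank))
right = rng.normal(size=(true_rank, n_responses))
logit_matrix = left @ right + 0.01 * rng.normal(size=(n_prompts, n_responses))

rank = approximate_rank(logit_matrix)

# Express one "target prompt" row as a linear combination of 20 other rows,
# fitted on half of the columns, then predict the held-out columns.
target, others = logit_matrix[0], logit_matrix[1:21]
seen, held_out = slice(0, 150), slice(150, 300)
coeffs, *_ = np.linalg.lstsq(others[:, seen].T, target[seen], rcond=None)
predicted = coeffs @ others[:, held_out]
relative_error = np.linalg.norm(predicted - target[held_out]) / np.linalg.norm(target[held_out])
```

On a matrix that really is close to low rank, `rank` stays far below the matrix dimensions and `relative_error` is small; on a full-rank random matrix neither would hold.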

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

88. Asymptotic analysis of shallow and deep forgetting in replay with neural collapse

Authors:

Neural networks exhibit two forms of forgetting: deep (loss of feature separability) and shallow (classifier misalignment). We analyze the effect of replay buffers on these phenomena through an asymptotic study of feature geometry under Neural Collapse. Our results show that replay reliably mitigates deep forgetting but fails to prevent shallow forgetting, as classifier weights converge to buffer-based rather than true class statistics. We extend the NC framework to continual learning, including the multi-head setting, and characterize how buffer size, weight decay, and feature-norm growth determine feature–head alignment. The analysis further reveals that multi-head models induce structurally lower-rank feature spaces and that weight decay has divergent effects on separability across learning regimes. Finally, we establish a formal connection between continual learning and out-of-distribution detection. Empirical evaluations on standard benchmarks support the theoretical predictions and quantify the persistence of shallow forgetting across buffer sizes.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

89. Improving Human-AI Coordination through Online Adversarial Training and Generative Models

Authors:

Being able to cooperate with diverse humans is an important component of many economically valuable AI tasks, from household robotics to autonomous driving. However, generalizing to novel humans requires training on data that captures the diversity of human behaviors. Adversarial training is a promising method that allows dynamic data generation and ensures that agents are robust. It creates a feedback loop where the agent’s performance influences the generation of new adversarial data, which can be used immediately to train the agent. However, adversarial training is difficult to apply in a cooperative task; how can we train an adversarial cooperator? We propose a novel strategy that combines a pre-trained generative model to simulate valid cooperative agent policies with adversarial training to maximize regret. We call our method GOAT: Generative Online Adversarial Training. In this framework, GOAT dynamically searches the latent space of the generative model for coordination strategies where the learning policy (the Cooperator agent) underperforms. GOAT enables better generalization by exposing the Cooperator to various challenging interaction scenarios. We maintain realistic coordination strategies by keeping the generative model frozen, thus avoiding adversarial exploitation. We evaluate GOAT with real human partners, and the results demonstrate state-of-the-art performance on the Overcooked benchmark, highlighting its effectiveness in generalizing to diverse human behaviors.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

90. FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Authors:

Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents’ adaptive reasoning and performance in dynamic environments. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

91. Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

Authors:

Stage lighting is a vital component in live music performances, shaping an engaging experience for both musicians and audiences. In recent years, Automatic Stage Lighting Control (ASLC) has attracted growing interest due to the high costs of hiring or training professional lighting engineers. However, most existing ASLC solutions only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this gap, this paper presents Skip-BART, an end-to-end model that directly learns from experienced lighting engineers and predicts vivid, human-like stage lighting. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method adapts the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the frame grid. To address the lack of available datasets, we create the first stage lighting dataset, along with several pre-training and transfer learning techniques to improve model training with limited data. We validate our method through both quantitative analysis and a human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting engineers. To support further research, we will make our self-collected dataset, code, and trained model parameters available upon publication; they are currently provided in the supplementary material.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

92. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Authors:

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to $20\times$ beyond the teacher's capability, avoiding common issues such as over-exposure and error accumulation without recomputing overlapping frames as previous methods do. When scaling up the computation, our method can generate videos up to 4 minutes and 15 seconds long, equivalent to 99.9% of the maximum span supported by our base model’s position embedding and more than $50\times$ longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon video demos can be found at https://self-forcing-pp.github.io.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

93. InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Authors:

Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving $20\%$ of tokens without affecting performance, and achieving $2.3\times$ compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

94. Physics-Guided Motion Loss for Video Generation Model

Authors:

Current video diffusion models generate visually compelling content but often violate basic laws of physics, producing subtle artifacts like rubber-sheet deformations and inconsistent object motion. We introduce a frequency-domain physics prior that improves motion plausibility without modifying model architectures. Our method decomposes common rigid motions (translation, rotation, scaling) into lightweight spectral losses, requiring only 2.7% of frequency coefficients while preserving 97%+ of spectral energy. Applied to Open-Sora, MVDIT, and Hunyuan, our approach improves both motion accuracy and action recognition by ~11% (relative) on average on OpenVID-1M, while maintaining visual quality. User studies show 74--83% preference for our physics-enhanced videos. It also reduces warping error by 22--37% (depending on the backbone) and improves temporal consistency scores. These results indicate that simple, global spectral cues are an effective drop-in regularizer for physically plausible motion in video diffusion.
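
The claim that a small share of frequency coefficients can carry most of the spectral energy can be illustrated with a short sketch. This is not the paper's loss. The helper `energy_retained`, the synthetic Gaussian-blob frame, and the circular shift used to mimic translation are all assumptions made for illustration.

```python
import numpy as np

def energy_retained(frame: np.ndarray, keep_fraction: float) -> float:
    """Share of spectral energy held by the largest `keep_fraction` of 2D FFT coefficients."""
    power = np.abs(np.fft.fft2(frame)) ** 2
    flat = np.sort(power.ravel())[::-1]              # strongest coefficients first
    n_keep = max(1, int(round(keep_fraction * flat.size)))
    return float(flat[:n_keep].sum() / flat.sum())

# Hypothetical smooth frame: a Gaussian blob on a 64x64 grid.
ys, xs = np.mgrid[0:64, 0:64]
frame = np.exp(-((xs - 32.0) ** 2 + (ys - 32.0) ** 2) / (2 * 6.0**2))

retained = energy_retained(frame, keep_fraction=0.027)

# A circular translation changes only the phase of the spectrum, not its magnitude,
# which is one reason translation is convenient to describe in the frequency domain.
shifted = np.roll(frame, shift=(3, 5), axis=(0, 1))
magnitude_gap = float(np.max(np.abs(np.abs(np.fft.fft2(shifted)) - np.abs(np.fft.fft2(frame)))))
```

Smooth content concentrates its energy in a few low-frequency coefficients, so `retained` is close to 1 here; frames dominated by fine texture or noise would compact far less well.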

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

95. Multimodal Policy Internalization for Conversational Agents

Authors:

Modern conversational agents such as ChatGPT and Alexa+ have become indispensable in everyday life. To handle diverse business requirements and enable agentic capabilities, these LLM-based systems often rely on predefined policies, which specify instructions such as model metadata, response styles, and tool-using rules. These policies, typically implemented as in-context prompts, are becoming increasingly complex and lengthy, posing challenges for models in faithfully following them. Moreover, they impose a large fixed computational cost regardless of the input query. As multimodal conversational agents emerge, complex policies that govern multimodal tasks and even involve visual instructions are becoming increasingly necessary, yet they have been rarely studied in previous work. In particular, prior work on prompt compression has focused solely on reducing the length of task templates and demonstrations, which require limited reasoning compared to policies. Meanwhile, related work on policy alignment has been limited to internalizing text-only safety instructions. To bridge this gap, we introduce Multimodal Policy Internalization (MPI), a new task that aims to internalize reasoning-intensive multimodal policies into the parameters of a large multimodal model, enabling stronger policy-following behavior without requiring the policy to be included in-context during inference. MPI presents unique challenges from both data and algorithmic perspectives. We construct two new datasets that cover complex decision-making and tool-using tasks across both synthetic and real-world visual inputs. We investigate diverse internalization strategies and propose a novel three-stage training framework, TriMPI, which enables stronger guidance from the original policy during internalization. Specifically, we first introduce a continual pretraining stage before supervised finetuning, which directly injects policy knowledge into the model. We then propose PolicyRollout, a simple yet effective extension to GRPO-style RL algorithms, which enables more grounded exploration by augmenting the rollout space with policy-aware responses. We show significant improvements of TriMPI over strong baselines in end-to-end performance, generalization capability, and robustness to catastrophic forgetting. As the first work on multimodal policy internalization, we aim to build a strong foundation for future research by providing datasets, training recipes, and comprehensive evaluations.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 8, 6

📄 openreview 📄 Download PDF

96. Virtual Community: An Open World for Humans, Robots, and Society

Authors:

The rapid progress of AI and robotics may profoundly transform society, as humans and robots begin to coexist in shared communities, bringing both opportunities and challenges. To explore this future, we present Virtual Community—an open-world platform for humans, robots, and society—built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to study embodied social intelligence at scale: 1) How robots can intelligently cooperate or compete, 2) how humans form social relations and build community, and more importantly, 3) how humans and robots can co-exist in open worlds. To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large‑scale, real‑world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi‑agent reasoning and planning in open‑world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open‑world tasks. We evaluate various baselines and demonstrate the challenges in both high‑level open‑world task planning and low‑level cooperation controls.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

97. MTVCraft: Tokenizing 4D Motion for Arbitrary Character Animation

Authors:

Character image animation has rapidly advanced with the rise of digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 4D information for open-world animation. To address this, we propose MTVCraft (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for character image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatial-temporal cues and avoid strict pixel-level alignment between pose images and the character, enabling more flexible and disentangled control. Next, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for character image animation in the complex 4D world. We implement MTVCraft on both CogVideoX-5B (small scale) and Wan-2.1-14B (large scale), demonstrating that our framework is easily scalable and can be applied to models of varying sizes. Experiments on the TikTok and Fashion benchmarks demonstrate our state-of-the-art performance. Moreover, powered by robust motion tokens, MTVCraft showcases unparalleled zero-shot generalization. It can animate arbitrary characters in both single and multiple settings, in full-body and half-body forms, and even non-human objects across diverse styles and scenarios. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided video generation.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

98. Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Authors:

Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Artifact-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 8

📄 openreview 📄 Download PDF

99. Causal Time Series Generation via Diffusion Models

Authors:

Time series generation (TSG) synthesizes realistic sequences and has achieved remarkable success. Within TSG, conditional models generate sequences given observed covariates; however, such models learn observational correlations without considering unobserved confounding. In this work, we propose a causal perspective on conditional TSG and introduce causal time series generation as a new TSG task family, formalized within Pearl’s causal ladder, extending beyond observational generation to include interventional and counterfactual settings. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling toward desired interventions and individual counterfactuals while preserving observational fidelity. Specifically, our method derives causal score functions via backdoor adjustment and the abduction–action–prediction procedure, thus enabling principled support for all three levels of TSG. Extensive experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity while also supporting interventional and counterfactual generation that existing baselines cannot handle. Overall, we propose the causal TSG family and instantiate it with CaTSG, providing an initial proof-of-concept and opening a promising direction toward more reliable simulation under interventions and counterfactual generation.

📊 Review scores

Average score: 7.33

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 8, 6, 8

📄 openreview 📄 Download PDF

100. Metric $k$-clustering using only Weak Comparison Oracles

Authors:

Clustering is a fundamental primitive in unsupervised learning. However, classical algorithms for $k$-clustering (such as $k$-median and $k$-means) assume access to exact pairwise distances, an unrealistic requirement in many modern applications. We study clustering in the Rank-model (R-model), where access to distances is entirely replaced by a quadruplet oracle that provides only relative distance comparisons. In practice, such an oracle can represent learned models or human feedback, and is expected to be noisy and entail an access cost. Given a metric space with $n$ input items, we design randomized algorithms that, using only a noisy quadruplet oracle, compute a set of $O(k \cdot \mathsf{polylog}(n))$ centers along with a mapping from the input items to the centers such that the clustering cost of the mapping is at most a constant times the optimum $k$-clustering cost. Our method achieves a query complexity of $O(n\cdot k \cdot \mathsf{polylog}(n))$ for arbitrary metric spaces and improves to $O((n+k^2) \cdot \mathsf{polylog}(n))$ when the underlying metric has bounded doubling dimension. When the metric has bounded doubling dimension we can further improve the approximation from constant to $\varepsilon$, for any arbitrarily small constant $\varepsilon\in(0,1)$, while preserving the same asymptotic query complexity. Our framework demonstrates how noisy, low-cost oracles, such as those derived from large language models, can be systematically integrated into scalable clustering algorithms.
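
A quadruplet oracle can be pictured as a function that answers "is $d(a,b) \le d(c,d)$?" without revealing either distance. The sketch below is a naive baseline under stated assumptions and is not the paper's algorithm. It assumes each query is flipped independently with probability `error_rate`, so that repeating a query and taking a majority vote helps; the abstract does not specify the noise model, and repetition would not help if errors persisted across identical queries. All function names are hypothetical.

```python
import math
import random
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[float, float]
Oracle = Callable[[Point, Point, Point, Point], bool]

def make_noisy_quadruplet_oracle(
    dist: Callable[[Point, Point], float], error_rate: float, rng: random.Random
) -> Oracle:
    """Build oracle(a, b, c, d) answering 'is dist(a, b) <= dist(c, d)?'.

    Each answer is flipped independently with probability error_rate (assumed noise model).
    """
    def oracle(a: Point, b: Point, c: Point, d: Point) -> bool:
        truth = dist(a, b) <= dist(c, d)
        return truth if rng.random() >= error_rate else not truth
    return oracle

def closer_center(oracle: Oracle, item: Point, c1: Point, c2: Point, repeats: int = 15) -> Point:
    """Majority vote over repeated queries on which of two centers is closer to item."""
    votes = sum(oracle(item, c1, item, c2) for _ in range(repeats))
    return c1 if 2 * votes > repeats else c2

def assign_to_centers(
    oracle: Oracle, items: Sequence[Point], centers: Sequence[Point], repeats: int = 15
) -> Dict[Point, Point]:
    """Map every item to an approximately nearest center using comparisons only."""
    assignment = {}
    for item in items:
        best = centers[0]
        for candidate in centers[1:]:
            best = closer_center(oracle, item, best, candidate, repeats)
        assignment[item] = best
    return assignment

rng = random.Random(0)
oracle = make_noisy_quadruplet_oracle(math.dist, error_rate=0.1, rng=rng)
items = [(0.1, 0.0), (0.9, 1.0), (5.0, 5.2)]
centers = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
assignment = assign_to_centers(oracle, items, centers)
```

This loop spends $n \cdot (k-1) \cdot \texttt{repeats}$ queries, which gives a feel for why query complexity, rather than running time, is the resource being counted in this setting.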

📊 Review scores

Average score: 7.20

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 6, 8, 8, 8, 6

📄 openreview 📄 Download PDF

101. Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers

Authors:

Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state-of-the-art methods, even without requiring additional training.

📊 Review scores

Average score: 7.20

Lowest score: 6

Highest score: 10

Number of reviewers: 5

Individual scores: 6, 6, 10, 6, 8

📄 openreview 📄 Download PDF

102. How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

Authors:

Semantic associations such as the link between "bird" and "flew" are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions of three basis functions (bigram, token-interchangeability, and context mappings) that reflect the statistics in the text corpus, and we uncover how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem sheds light on interpreting the learned associations in transformers.

📊 Review scores

Average score: 7.20

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 8, 8, 6, 6, 8

📄 openreview 📄 Download PDF

103. Weight-Space Linear Recurrent Neural Networks

Authors:

We introduce WARP (**W**eight-space **A**daptive **R**ecurrent **P**rediction), a simple yet powerful model that unifies weight-space learning with linear recurrence to redefine sequence modeling. Unlike conventional recurrent neural networks (RNNs) which collapse temporal dynamics into fixed-dimensional hidden states, WARP explicitly parametrizes its hidden state as the weights and biases of a distinct auxiliary neural network, and uses input differences to drive its recurrence. This brain-inspired formulation enables efficient gradient-free adaptation of the auxiliary network at test-time, in-context learning abilities, and seamless integration of domain-specific physical priors. Empirical validation shows that WARP matches or surpasses state-of-the-art baselines on diverse classification tasks, featuring in the top three in 5 out of 6 real-world challenging datasets. Furthermore, extensive experiments across sequential image completion, multivariate time series forecasting, and dynamical system reconstruction demonstrate its expressiveness and generalization capabilities. Remarkably, a physics-informed variant of our model outperforms the next best model by more than 10x. Ablation studies confirm the architectural necessity of key components, solidifying weight-space linear RNNs as a transformative paradigm for adaptive machine intelligence.
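
The core idea described above, a hidden state that is itself the parameter vector of an auxiliary network and a linear recurrence driven by input differences, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation. The update $\theta_t = A\,\theta_{t-1} + B\,(x_t - x_{t-1})$, the one-hidden-layer auxiliary MLP, the normalized-time query fed to it, and all function names are hypothetical choices.

```python
import numpy as np

def unpack(theta, d_in, d_hidden, d_out):
    """Split a flat vector into the weights and biases of a one-hidden-layer MLP."""
    sizes = [d_in * d_hidden, d_hidden, d_hidden * d_out, d_out]
    w1, b1, w2, b2 = np.split(theta, np.cumsum(sizes)[:-1])
    return w1.reshape(d_in, d_hidden), b1, w2.reshape(d_hidden, d_out), b2

def aux_forward(theta, query, dims):
    """Evaluate the auxiliary network whose parameters are the hidden state theta."""
    w1, b1, w2, b2 = unpack(theta, *dims)
    return np.tanh(query @ w1 + b1) @ w2 + b2

def weight_space_rnn(xs, A, B, theta0, dims):
    """Linear recurrence in weight space: theta_t = A theta_{t-1} + B (x_t - x_{t-1})."""
    theta, prev, outputs = theta0, xs[0], []
    for t, x in enumerate(xs):
        theta = A @ theta + B @ (x - prev)       # input differences drive the update
        prev = x
        query = np.array([t / max(len(xs) - 1, 1)])  # assumed query: normalized time
        outputs.append(aux_forward(theta, query, dims))
    return np.stack(outputs), theta

dims = (1, 4, 2)                                  # (d_in, d_hidden, d_out) of the auxiliary MLP
n_params = 1 * 4 + 4 + 4 * 2 + 2                  # 18 parameters in total
rng = np.random.default_rng(0)
theta0 = rng.normal(size=n_params)
A = 0.9 * np.eye(n_params)
B = 0.1 * rng.normal(size=(n_params, 3))
xs = rng.normal(size=(10, 3))                     # sequence of 10 inputs with 3 features
outputs, theta_final = weight_space_rnn(xs, A, B, theta0, dims)
```

Because the recurrence is linear in `theta`, a constant input sequence contributes no drive at all: with `A` set to the identity, the auxiliary network's weights stay exactly at `theta0`.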

📊 Review scores

Average score: 7.20

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 8, 6, 6, 8, 8

📄 openreview 📄 Download PDF

104. Meta-Learning Theory-Informed Inductive Biases using Deep Kernel Gaussian Processes

Authors:

Normative and task-driven theories offer powerful top-down explanations for biological systems, yet the goals of quantitatively arbitrating between competing theories, and utilizing them as inductive biases to improve data-driven fits of real biological datasets are prohibitively laborious, and often impossible. To this end, we introduce a Bayesian meta-learning framework designed to automatically convert raw functional predictions from normative theories into tractable probabilistic models. We employ adaptive deep kernel Gaussian processes, meta-learning a kernel on synthetic data generated from a normative theory. This Theory-Informed Kernel specifies a probabilistic model representing the theory predictions -- usable for both fitting data and rigorously validating the theory. As a demonstration, we apply our framework to the early visual system, using efficient coding as our normative theory. We show improved response prediction accuracy in ex vivo recordings of mouse retinal ganglion cells stimulated by natural scenes compared to conventional data-driven baselines, while providing well-calibrated uncertainty estimates and interpretable representations. Using exact Bayesian model selection, we also show that our informed kernel can accurately infer the degree of theory-match from data, confirming faithful encapsulation of theory structure. This work provides a more general, scalable, and automated approach for integrating theoretical knowledge into data-driven scientific inquiry in neuroscience and beyond.

📊 Review scores

Average score: 7.20

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 6, 8, 8, 8, 6

📄 openreview 📄 Download PDF

105. Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models

Authors:

Recent advancements in imitation learning for robotic control have led to transformer-based behavior foundation models (BFMs) that enable multi-modal, human-like control for humanoid agents. These models generate solutions when conditioned on high-level goals or prompts, for example, walking to a coordinate when conditioned on the position of the robot's pelvis. While excelling at zero-shot generation of robust behaviors, BFMs often require meticulous prompt engineering for specific tasks, potentially yielding suboptimal results. In this work, we introduce "Task Tokens", a method to effectively tailor BFMs to specific tasks while preserving their flexibility. Our approach integrates naturally within the transformer architecture of BFMs. Task Tokens trains a task-specific encoder (tokenizer), with the original BFM remaining untouched. Our method reduces trainable parameters per task by up to $125\times$ and converges up to $6\times$ faster compared to standard baselines. In addition, by keeping the original BFM unchanged, Task Tokens enables utilizing the pre-existing encoders. This allows incorporating user-defined priors, balancing reward design and prompt engineering. We demonstrate Task Tokens' efficacy across various tasks, including out-of-distribution scenarios, and show their compatibility with other prompting modalities. Our results suggest that Task Tokens offer a promising approach for adapting BFMs to specific control tasks while retaining their generalization capabilities.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

106. Jet Expansions: Restructuring LLM Computation for Model Inspection

Authors:

Large language models are becoming general knowledge engines for diverse applications. However, their computations are deeply entangled after training and resist modularization, which complicates interpretability, auditing, and long-term maintenance. We introduce Jet Expansions, a framework for expanding computational graphs using jet operators that generalize truncated Taylor series. Our method systematically decomposes language models into explicit input-to-output computational paths and complementary remainders. This functional decomposition provides a principled, knife-like operator for cutting through entanglement in LLMs, enabling scalable model inspection. We demonstrate how Jet Expansions ground and subsume the popular interpretability technique Logit Lens, reveal a (super-)exponential path structure with respect to recursive residual depth, and support several interpretability applications, including sketching a transformer language model with $n$-gram statistics extracted from its computations and indexing model toxicity levels *without* curated benchmarks.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 OpenReview 📄 Download PDF

107. Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Authors:

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the usage of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissible licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large portion of code data. The diversity of data sources in terms of covered domains and time periods opens up paths for both research and entrepreneurial needs in diverse areas of knowledge. In this paper, we present the detailed provenance of data assembling and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that the resulting models perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining. Common Corpus represents a key contribution to the ecosystem for open science research on large language models.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 OpenReview 📄 Download PDF

108. LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures

Authors:

Large Language Model (LLM) pretraining, finetuning, and evaluation rely on input-space reconstruction and generative capabilities. Yet, it has been observed in vision that embedding-space training objectives, e.g., with Joint Embedding Predictive Architectures (JEPAs), are far superior to their input-space counterparts. That mismatch in how training is achieved between language and vision opens up a natural question: *can language training methods learn a few tricks from the vision ones?* The lack of JEPA-style LLMs is a testimony to the challenge of designing such objectives for language. In this work, we propose a first step in that direction where we develop LLM-JEPA, a JEPA-based solution for LLMs applicable both to finetuning and pretraining. Thus far, LLM-JEPA is able to outperform the standard LLM training objectives by a significant margin across models, all while being robust to overfitting. Those findings are observed across numerous datasets (NL-RX, GSM8K, Spider, RottenTomatoes) and various models from the Llama3, OpenELM, Gemma2 and Olmo families. Code: https://anonymous.4open.science/r/llm-jepa-0C6F/README.md.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 OpenReview 📄 Download PDF

109. AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Authors:

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces to speed agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized Asta agent classes plus numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 OpenReview 📄 Download PDF

110. Compositional Diffusion with Guided search for Long-Horizon Planning

Authors:

Generative models have emerged as powerful tools for planning, with compositional approaches offering particular promise for modeling long-horizon task distributions by composing together local, modular generative models. This compositional paradigm spans diverse domains, from multi-step manipulation planning to panoramic image synthesis to long video generation. However, compositional generative models face a critical challenge: when local distributions are multimodal, existing composition methods average incompatible modes, producing plans that are neither locally feasible nor globally coherent. We propose Compositional Diffusion with Guided Search (CDGS), which addresses this *mode averaging* problem by embedding search directly within the diffusion denoising process. Our method explores diverse combinations of local modes through population-based sampling, prunes infeasible candidates using likelihood-based filtering, and enforces global consistency through iterative resampling between overlapping segments. CDGS matches oracle performance on seven robot manipulation tasks, outperforming baselines that lack compositionality or require long-horizon training data. The approach generalizes across domains, enabling coherent text-guided panoramic images and long videos through effective local-to-global message passing. More details: https://cdgsearch.github.io/

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 OpenReview 📄 Download PDF

111. Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics

Authors:

Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatiotemporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatiotemporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics, substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR-MD successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long-horizon generation, while demonstrating that STAR-MD's joint spatiotemporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 OpenReview 📄 Download PDF

112. GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models

Authors:

The performance of flow matching and diffusion models can be greatly improved at inference time using reward adaptation algorithms, yet efficiency remains a major limitation. While several algorithms have been proposed, we demonstrate that a common bottleneck is the *sampling* method these algorithms rely on: many algorithms require sampling Markov transitions via SDE sampling, which is significantly less efficient and often less performant than ODE sampling. To remove this bottleneck, we introduce GLASS Flows, a new sampling paradigm that simulates a "flow matching model within a flow matching model" to sample Markov transitions. As we show in this work, this "inner" flow matching model can be retrieved from any pre-trained model without any re-training, effectively combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, we show that GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. GLASS Flows improve state-of-the-art performance in text-to-image generation, making them a simple, drop-in solution for inference-time scaling of flow and diffusion models.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 OpenReview 📄 Download PDF

113. Tversky Neural Networks: Psychologically Plausible Deep Learning with Differentiable Tversky Similarity

Authors:

Work in psychology has highlighted that the geometric model of similarity standard in deep learning is not psychologically plausible because its metric properties such as symmetry do not align with human perception of similarity. In contrast, Tversky (1977) proposed an axiomatic theory of similarity with psychological plausibility based on a representation of objects as sets of features, and their similarity as a function of their common and distinctive features. This model of similarity has not been used in deep learning before, in part because of the challenge of incorporating discrete set operations. In this paper, we develop a differentiable parameterization of Tversky's similarity that is learnable through gradient descent, and derive basic neural network building blocks such as the *Tversky projection layer*, which unlike the linear projection layer can model non-linear functions such as XOR. Through experiments with image recognition and language modeling neural networks, we show that the Tversky projection layer is a beneficial replacement for the linear projection layer. For instance, on the NABirds image classification task, a frozen ResNet-50 adapted with a Tversky projection layer achieves a 24.7% relative accuracy improvement over the linear layer adapter baseline. With Tversky projection layers, GPT-2's perplexity on PTB decreases by 7.8%, and its parameter count by 34.8%. Finally, we propose a unified interpretation of both types of projection layers as computing similarities of input stimuli to learned prototypes for which we also propose a novel visualization technique highlighting the interpretability of Tversky projection layers. Our work offers a new paradigm for thinking about the similarity model implicit in modern deep learning, and designing neural networks that are interpretable under an established theory of psychological similarity.
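To make the set-based view concrete, here is a minimal sketch of the classical ratio form of Tversky's similarity on discrete feature sets. It is not the paper's differentiable parameterization; the function name `tversky_index`, the example feature sets, and the zero-denominator convention are assumptions made for illustration.

```python
def tversky_index(a, b, alpha=1.0, beta=1.0):
    """Ratio form of Tversky (1977) similarity between two feature sets.

    alpha weights features found only in `a`; beta weights features found only in `b`.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    # A zero denominator (for example, two empty sets) is mapped to 1.0 by a convention chosen here.
    return 1.0 if denominator == 0 else common / denominator


# Hypothetical feature sets, used only to show the asymmetry.
bird = {"wings", "feathers", "beak"}
penguin = {"wings", "feathers", "beak", "swims", "flightless"}

# Unequal weights make the measure asymmetric, unlike a metric distance.
s_bird_to_penguin = tversky_index(bird, penguin, alpha=1.0, beta=0.0)
s_penguin_to_bird = tversky_index(penguin, bird, alpha=1.0, beta=0.0)
```

With `alpha = beta = 1` the function reduces to the Jaccard index, so the symmetric case is recovered as a special choice of weights.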

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 OpenReview 📄 Download PDF

114. Achieving Expert-Level Agent from Foundation Model via Complexity Curriculum Reinforcement Learning with Synthetic Data

Authors:

Large Language Model (LLM)-based agents exhibit strong mathematical problem-solving ability and can even solve International Mathematical Olympiad (IMO)-level problems with the assistance of a formal language prover. However, hindered by the weak heuristics of auxiliary constructions, AI for solving geometry problems remains dominated by specialist models such as AlphaGeometry2, which relies heavily on large-scale data synthesis and search for training and testing. Therefore, this paper makes the first attempt to investigate how to build a medalist-level LLM agent for solving geometry problems and proposes InternGeometry. InternGeometry overcomes the weak heuristics of geometry problems by continuously proposing propositions and auxiliary configurations, verifying them in the symbolic engine, and reflecting on the feedback from the symbolic engine for the next proposal, where the dynamic memory mechanism allows InternGeometry to conduct model-symbolic engine interactions more than two hundred times. To further accelerate the learning process of InternGeometry, we introduce Complexity-Boosted Reinforcement Learning (CBRL), which gradually scales the complexity of the synthesized problems at different training stages. Based on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000–2024), exceeding the average gold medalist score (40.9), using 13K training examples, only 0.004% of the data used by AlphaGeometry2, demonstrating the potential of LLM agents on expert-level tasks. InternGeometry is also capable of proposing novel auxiliary constructions on IMO problems that are unseen in human solutions. Model, data, and symbolic engine will be released to benefit future research.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 OpenReview 📄 Download PDF

115. RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

Authors:

Recent advances in robot learning have accelerated progress toward generalist robots that can operate across diverse tasks and environments. Yet despite this momentum, it remains difficult to gauge how close we are to this goal, as the field lacks a reproducible, large-scale benchmark for systematic evaluation. To address this gap, we present RoboCasa365, a comprehensive robot simulation benchmark for everyday tasks. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, and over 2,000 hours of robot interaction data, making it one of the most diverse and large-scale resources for studying generalist policies. We design the benchmark to support evaluation across key settings, including multi-task learning, robot foundation model training, and lifelong learning. We present extensive experiments with state-of-the-art methods and analyze how task diversity, dataset scale, and environment variation shape generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and help inform strategies for future progress in the field.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 6, 6, 6, 10

📄 OpenReview 📄 Download PDF

116. Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation

Authors:

Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the **Language Confusion Gate** (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance.
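The masking step described above can be sketched as follows. This is only an illustration of filtering tokens by language family before sampling; it does not reproduce the trained gate, which decides when masking is needed. The function name `gated_softmax`, the family labels, and the per-token family lookup are hypothetical.

```python
import math


def gated_softmax(logits, token_families, allowed_families):
    """Softmax over logits after masking tokens whose language family is not allowed."""
    masked = [
        logit if family in allowed_families else -math.inf
        for logit, family in zip(logits, token_families)
    ]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("no token belongs to an allowed family")
    # Subtracting the peak keeps the exponentials numerically stable; masked tokens get weight 0.
    weights = [math.exp(m - peak) for m in masked]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-token vocabulary: the highest logit belongs to a disallowed family.
probs = gated_softmax(
    logits=[2.0, 3.0, 0.5],
    token_families=["latin", "cjk", "latin"],
    allowed_families={"latin"},
)
```

Because the base model's logits are only filtered, not changed, a gate of this shape can sit outside the model, which matches the abstract's plug-in framing.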

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 OpenReview 📄 Download PDF

117. Convergence of Regret Matching in Potential Games and Constrained Optimization

Authors:

Regret matching (RM), along with its modern variants, is a foundational online algorithm that has been at the heart of many AI breakthrough results in solving benchmark zero-sum games, such as poker. Yet, surprisingly little is known so far in theory about its convergence beyond two-player zero-sum games. For example, whether regret matching converges to Nash equilibria in potential games has been an open problem for two decades. Even beyond games, one could try to use RM variants for general constrained optimization problems. Recent empirical evidence suggests that they, particularly regret matching$^+$ (RM$^+$), attain strong performance on benchmark constrained optimization problems, outperforming traditional gradient descent-type algorithms. We show that alternating RM$^+$ converges to an $\epsilon$-KKT point after $O_\epsilon(1/\epsilon^4)$ iterations, establishing for the first time that it is a sound and fast first-order optimizer. Our argument relates the KKT gap to the accumulated regret, two quantities that are entirely disparate in general but interact in an intriguing way in our setting, so much so that when regrets are bounded, our complexity bound improves all the way to $O_\epsilon(1/\epsilon^2)$. From a technical standpoint, while RM$^+$ does not have the usual one-step improvement property in general, we show that it does in a certain region that the algorithm will quickly reach and remain in thereafter. In sharp contrast, our second main result establishes a lower bound: RM, with or without alternation, can take an exponential number of iterations to reach a crude approximate solution even in two-player potential games. This represents the first worst-case separation between RM and RM$^+$. Our lower bound shows that convergence to coarse correlated equilibria in potential games is exponentially faster than convergence to Nash equilibria.
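As a reminder of the two update rules being compared, here is a minimal single-player sketch of the standard RM and RM$^+$ updates over a finite action set. It omits alternation and the game setting analyzed in the paper; the function names and the two-round utility sequence are assumptions made for illustration.

```python
def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive accumulated regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        # No positive regret: fall back to the uniform strategy.
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def update_regrets(regrets, strategy, utilities, plus=True):
    """Add instantaneous regrets; RM+ (plus=True) also clips the running sum at zero."""
    expected = sum(s * u for s, u in zip(strategy, utilities))
    updated = [r + (u - expected) for r, u in zip(regrets, utilities)]
    return [max(r, 0.0) for r in updated] if plus else updated


def run(utility_sequence, plus):
    """Run the update over a sequence of utility vectors and return the final strategy."""
    regrets = [0.0] * len(utility_sequence[0])
    for utilities in utility_sequence:
        strategy = strategy_from_regrets(regrets)
        regrets = update_regrets(regrets, strategy, utilities, plus=plus)
    return strategy_from_regrets(regrets)


# Hypothetical utilities: action 0 is best in round one, action 1 in round two.
rounds = [[1.0, 0.0], [0.0, 1.0]]
rm_strategy = run(rounds, plus=False)       # negative regret for action 1 is remembered
rm_plus_strategy = run(rounds, plus=True)   # negative regret for action 1 is forgotten
```

The only difference between the two runs is the clipping step, which lets RM$^+$ react faster once a previously poor action becomes good.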

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 OpenReview 📄 Download PDF

118. DataMIL: Selecting Data for Robot Imitation Learning with Datamodels

Authors:

Recently, the robotics community has amassed ever larger and more diverse datasets to train generalist policies. However, while these policies achieve strong mean performance across a variety of tasks, they often underperform on individual, specialized tasks and require further tuning on newly acquired task-specific data. Combining task-specific data with carefully curated subsets of large prior datasets via co-training can produce better specialized policies, but selecting data naively may actually harm downstream performance. To address this, we introduce DataMIL, a data selection framework built on the datamodels paradigm that reasons about data selection in an end-to-end manner, using the policy itself to identify which data points will most improve performance. Unlike standard practices that filter data using human notions of quality (e.g., based on semantic or visual similarity), DataMIL directly optimizes data selection for task success, allowing us to select data that improves the policy while dropping data that degrades it. To avoid performing expensive rollouts in the environment during selection, we introduce a surrogate loss function on task-specific data, allowing us to use DataMIL in the real world without degrading performance. We validate our approach on 60+ simulation and real-world manipulation tasks, notably showing successful data selection from the largest open collection of robot datasets (OXE) and demonstrating consistent gains in success rates over prior works. Our results underscore the importance of end-to-end, performance-aware data selection for unlocking the potential of large prior datasets in robotics.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 OpenReview 📄 Download PDF

119. The Diffusion Duality, Chapter II: $\Psi$-Samplers and Efficient Curriculum

Authors:

Uniform-state discrete diffusion models excel at few-step generation and guidance due to their inherent ability to self-correct, making them preferable to autoregressive or masked diffusion models in these settings. Yet, their sampling efficiency has been limited by reliance on standard posterior samplers, which plateau in quality as steps increase. In this work, we introduce a novel family of Predictor–Corrector (PC) samplers for discrete diffusion models that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers significantly outperform ancestral sampling on both language and vision tasks, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve generation quality with more sampling steps, narrowing the gap with masked diffusion. Beyond sampling, we develop a fast and memory-efficient curriculum for the Gaussian relaxation phase of Duo$^{++}$ (our method), which avoids materializing large Gaussian-diffused one-hot vectors. This reduces training time by 25% compared to Duo while maintaining similar validation perplexity on OpenWebText and LM1B and strong downstream performance.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 OpenReview 📄 Download PDF

120. Remotely Detectable Robot Policy Watermarking

Authors:

The success of machine learning for real-world robotic systems has created a new form of intellectual property: the trained policy. This raises a critical need for novel methods that verify ownership and detect unauthorized, possibly unsafe misuse. While watermarking is established in other domains, physical policies present a unique challenge: remote detection. Existing methods assume access to the robot’s internal state, but auditors are often limited to external observations (e.g., video footage). This “Physical Observation Gap” means the watermark must be detected from signals that are noisy, asynchronous, and filtered by unknown system dynamics. We formalize this challenge using the concept of a glimpse sequence, and introduce Colored Noise Coherency (CoNoCo), the first watermarking strategy designed for remote detection. CoNoCo embeds a spectral signal into the robot’s motions by leveraging the policy’s inherent stochasticity. To show it does not degrade performance, we prove CoNoCo preserves the marginal action distribution. Our experiments demonstrate strong, robust detection across various remote modalities (including motion capture and side-way/top-down video footage) in both simulated and real-world robot experiments. This work provides a necessary step toward protecting intellectual property in robotics, offering the first method for validating the provenance of physical policies non-invasively, using purely remote observations.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 OpenReview 📄 Download PDF

121. Command-V: Training-Free Representation Finetuning Transfer

Authors:

Retrofitting large language models (LLMs) with new behaviors typically requires full finetuning or distillation, costly steps that must be repeated for every architecture. In this work, we introduce ⌘V (Command-V), a backpropagation-free behavior transfer method that copies an existing residual representation adapter from a donor model and pastes its effect into an architecturally different recipient model. ⌘V profiles layer activations on a small prompt set, derives linear converters between corresponding layers, and applies the donor intervention in the recipient’s activation space. This process does not require access to the original training data and needs minimal compute. In three case studies (safety-refusal enhancement, jailbreak facilitation, and automatic chain-of-thought reasoning), ⌘V matches the performance of direct finetuning while using orders of magnitude fewer resources.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 6, 10, 6, 6

📄 OpenReview 📄 Download PDF

122. Test-Time Alignment for Large Language Models via Textual Model Predictive Control

Authors:

Aligning Large Language Models (LLMs) with human preferences through finetuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory to propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, where TMPC analyzes generation subgoals to retrospectively identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, where these identified subgoals are used to guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC ensures stable improvement by building upon previously validated successes. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting its generality.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 OpenReview 📄 Download PDF

123. Automata Learning and Identification of the Support of Language Models

Authors:

We study the learnability of languages in the *Next Symbol Prediction* (NSP) setting, where a learner receives only positive examples from a language together with, for every prefix, (i) whether the prefix itself is in the language and (ii) which next symbols can lead to an accepting string. This setting has been used in prior work to empirically analyze neural sequence models, and additionally, we observe that efficient algorithms for the NSP setting can be used to learn the (truncated) support of language models. We first show that the class of DFAs with at most $n$ states is identifiable from positive examples augmented with these NSP labels. Nevertheless, even with this richer supervision, we show that PAC-learning DFAs remains computationally hard, and exact identification using only membership queries cannot be achieved in polynomial time. We then present $\mathrm{L}^{\star}_{\text{nsp}}$, an extension of Angluin’s $\mathrm{L}^{\star}$ algorithm, and show that DFAs can be PAC-learned efficiently using a language-model–based teacher that answers membership queries and generates valid strings conditioned on prefix prompts. Finally, we conduct a comprehensive experimental evaluation on 11 regular languages of varying complexity. Using $\mathrm{L}^{\star}_{\text{nsp}}$, we extract DFAs from Transformer-based language models trained on regular languages to evaluate the algorithm’s effectiveness and identify erroneous examples.
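The NSP labels described above can be computed directly when the target DFA is known. The sketch below does this for a toy automaton; it illustrates the supervision signal only and is not the $\mathrm{L}^{\star}_{\text{nsp}}$ learner. The function names, the inclusion of the empty prefix, and the example language (any number of `a` followed by exactly one `b`) are assumptions made for illustration.

```python
def live_states(transitions, accepting):
    """States from which at least one accepting state is reachable."""
    live = set(accepting)
    changed = True
    while changed:
        changed = False
        for (state, _symbol), target in transitions.items():
            if target in live and state not in live:
                live.add(state)
                changed = True
    return live


def nsp_labels(word, transitions, start, accepting, alphabet):
    """For every prefix of `word`: (prefix, is it in the language, valid next symbols)."""
    live = live_states(transitions, accepting)
    labels = []
    state = start
    for i in range(len(word) + 1):
        member = state in accepting
        # A symbol is valid if it moves to a state that can still reach acceptance.
        valid_next = {s for s in alphabet if transitions[(state, s)] in live}
        labels.append((word[:i], member, valid_next))
        if i < len(word):
            state = transitions[(state, word[i])]
    return labels


# Toy complete DFA: state 0 = start, state 1 = accepting, state 2 = dead.
toy_transitions = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "a"): 2, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}
labels = nsp_labels("aab", toy_transitions, start=0, accepting={1}, alphabet={"a", "b"})
```

Each label carries strictly more information than the bare positive example, which is what makes the identifiability result in the abstract plausible.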

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 OpenReview 📄 Download PDF

124. LLMs Can Hide Text in Other Text of the Same Length

Authors:

A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present a simple and efficient protocol to achieve it. We show that even modest 8‑billion‑parameter open‑source LLMs are sufficient to obtain high‑quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 OpenReview 📄 Download PDF

125. Quantitative Bounds for Length Generalization in Transformers

Authors:

We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2024) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large it must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, as well as for one- or two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 OpenReview 📄 Download PDF

126. Output Supervision Can Obfuscate the Chain of Thought

Authors:

Recently, OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by training only against output monitors that do not have access to CoT. We show that such training can still cause obfuscated CoTs via two mechanisms. First, when a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. Second, since later tokens are conditioned on earlier ones, safe‑looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. We introduce two mitigations to address these two issues, which achieve a Pareto improvement in terms of monitorability and task performance compared to regular training. To our knowledge, we are the first to identify and mitigate these problems. Our work implies that preserving CoT monitorability is more difficult than previously thought; we suggest practical guidelines for AI developers to maintain monitorable CoTs.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 openreview 📄 Download PDF

127. How to Square Tensor Networks and Circuits Without Squaring Them

Authors:

Squared tensor networks (TNs) and their extension as computational graphs, squared circuits, have been used as expressive distribution estimators that still support closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as they can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs, but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show how our proposed conditions in squared circuits come with no expressiveness loss, while enabling more efficient learning.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

128. AgentGym-RL: An Open-Source Framework to Train LLM Agents for Long-Horizon Decision Making via Multi-Turn RL

Authors:

Training LLM agents for complex multi-turn decision-making tasks requires extensive exploration within their environment, for which reinforcement learning (RL) is a natural approach. However, the open-source community currently lacks a unified RL framework capable of training agents from scratch across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a modular and decoupled framework specifically designed for RL-based agent training in multi-turn decision-making tasks. It offers high flexibility and extensibility, supports mainstream RL algorithms, and spans a broad range of real-world scenarios. To effectively train agents for challenging tasks, we argue that they are required to expand external interactions with the environment, rather than relying solely on internal reasoning. Nevertheless, training agents for long-horizon interaction with vanilla methods often faces challenges like training instability. To this end, we propose ScalingInter-RL, a staged training approach for stable long-horizon RL training. It starts with short-horizon interaction to establish foundational policies and progressively expands them to encourage deeper exploration. Extensive experiments show that agents trained with our method achieve performance on par with, or even surpassing, commercial counterparts like OpenAI o3 and Gemini-2.5-Pro across 27 tasks in diverse environments. We share key insights and will release the full framework, including code and datasets, to empower the community in building the next generation of intelligent agents.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 10, 6, 6, 6

📄 openreview 📄 Download PDF

129. Low-Pass Filtering Improves Behavioral Alignment of Vision Models

Authors:

Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through generative, rather than discriminative, classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test time, rather than training on blurred images, achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all possible Pareto-optimal solutions to the benchmark, which was formerly unknown. We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of a specific width.
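
The test-time blurring described in this abstract can be illustrated with a generic Gaussian low-pass filter. The following is only a minimal sketch, assuming a 2D grayscale image stored as a NumPy array; the function names and the filter width `sigma` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Return a normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def low_pass(image, sigma):
    """Blur a 2D grayscale image with a separable Gaussian filter."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    # Reflect-pad so the output keeps the input shape.
    padded = np.pad(image, pad, mode="reflect")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

# A checkerboard is almost pure high-frequency content (hypothetical test image).
img = (np.indices((32, 32)).sum(axis=0) % 2).astype(float)
blurred = low_pass(img, sigma=2.0)
```

In this sketch the blurred image would be fed to an unchanged classifier at inference time; no retraining is involved.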

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

130. Relationship Alignment for View-aware Multi-view Clustering

Authors:

Multi-view clustering improves clustering performance by integrating complementary information from multiple views. However, existing methods often suffer from two limitations: i) the neglect of preserving sample neighborhood structures, which weakens the consistency of inter-sample relationships across views; and ii) inability to adaptively utilize inter-view similarity, resulting in representation conflicts and semantic degradation. To address these issues, we propose a novel framework named Relationship Alignment for View-aware Multi-view Clustering (RAV). Our approach first constructs a sample relation matrix for each view using deep features and aligns it with a global relation matrix to enhance neighborhood consistency across views. Furthermore, we introduce a view-aware adaptive weighting mechanism for label contrastive learning. This mechanism dynamically adjusts the contrastive intensity between view pairs based on the similarity of their deep features: higher similarity leads to stronger label alignment, while lower similarity reduces the weighting to prevent forcing inconsistent views into agreement. This strategy effectively promotes cluster-level semantic consistency while preserving natural inter-view relationships. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches on multiple benchmark datasets.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

131. Energy-Efficient Random Variate Generation via Compressed Lookup Tables

Authors:

Generating (pseudo-)random variates lies at the core of probabilistic machine learning and prediction algorithms and yet remains a major bottleneck due to its high computational and energy cost. In this paper, we introduce a general and scalable sampling strategy that enables fast and energy-efficient random variate generation from arbitrary distributions. Our approach is based on efficient lookup tables combined with a fast index sampling scheme. Using only a handful of fast and energy-efficient compute operations on simple array structures, we achieve superior speed, energy efficiency, and precision at near-optimal entropy cost compared to state-of-the-art techniques. Microbenchmarking our approach with a C implementation shows up to 40% savings in time and 60% in energy compared to state-of-the-art approaches. Compared to commonly employed Python samplers we achieve a 100x time improvement.
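
To make the idea of a lookup table combined with index sampling concrete, here is a minimal sketch of a plain quantile table: the inverse CDF is precomputed at evenly spaced midpoints, and each draw is a single array lookup at a uniformly random index. This is a generic textbook construction under assumed names (`build_table`, `sample`, `bits`); it does not reproduce the paper's compression or index sampling scheme.

```python
import random
from statistics import NormalDist

def build_table(inv_cdf, bits=10):
    """Precompute 2**bits quantiles of a distribution from its inverse CDF."""
    n = 1 << bits
    # Midpoint quantiles avoid evaluating the inverse CDF at 0 or 1.
    return [inv_cdf((i + 0.5) / n) for i in range(n)]

def sample(table, rng, bits=10):
    """Draw one variate: `bits` random bits select a table entry."""
    return table[rng.getrandbits(bits)]

# Standard normal as an example target distribution (an assumption for this sketch).
table = build_table(NormalDist().inv_cdf, bits=10)
rng = random.Random(0)
draws = [sample(table, rng) for _ in range(10000)]
```

The table size trades memory for resolution: with `bits=10` the sampler can only return 1024 distinct values, which is the kind of precision cost a compressed table would aim to reduce.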

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

132. Neural Message-Passing on Attention Graphs for Hallucination Detection

Authors:

Large Language Models (LLMs) often generate incorrect or unsupported content, known as hallucinations. Existing detection methods rely on heuristics or simple models over isolated computational traces such as activations, or attention maps. We unify these signals by representing them as attributed graphs, where tokens are nodes, edges follow attentional flows, and both carry features from attention scores and activations. Our approach, CHARM, casts hallucination detection as a graph learning task and tackles it by applying GNNs over the above attributed graphs. We show that CHARM provably subsumes prior attention-based heuristics and, experimentally, it consistently outperforms other leading approaches across diverse benchmarks. Our results shed light on the relevant role played by the graph structure and on the benefits of combining computational traces, whilst showing CHARM exhibits promising zero-shot performance on cross-dataset transfer.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

133. Branch and Bound Search for Exact MAP Inference in Credal Networks

Authors:

Credal networks extend Bayesian networks by incorporating imprecise probabilities through convex sets of probability distributions known as credal sets. MAP inference in credal networks, which seeks the most probable variable assignment given evidence, becomes inherently more difficult than in Bayesian networks because it involves computations over a complex joint credal set. In this paper, we introduce two tasks called maximax and maximin MAP, and develop depth-first branch-and-bound search algorithms for solving them exactly. The algorithms exploit problem decomposition by exploring an AND/OR search space and use a partitioning-based heuristic function enhanced with a cost-shifting scheme to effectively guide the search. Our experimental results obtained on both random and realistic credal networks clearly demonstrate the effectiveness of the proposed algorithms as they scale to large and complex problem instances.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

134. StochasTok: Improving Fine-Grained Subword Understanding in LLMs

Authors:

Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionately with seemingly simple subword-level tasks, like counting the number of 'r's in 'strawberry'. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper, we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to ‘see’ their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs’ downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok’s simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models.
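
The phrase "randomly splits tokens during training" can be illustrated with a toy example. The sketch below assumes tokens are plain strings and that a token may be split only into two pieces that are both in the vocabulary; the function `stochastic_split`, the probability `p`, and the toy vocabulary are hypothetical and are not the authors' implementation.

```python
import random

def stochastic_split(tokens, vocab, p=0.5, rng=None):
    """Randomly replace some tokens with two shorter vocabulary tokens
    whose concatenation equals the original token."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        # Every way to cut `tok` into two pieces that are both in the vocabulary.
        splits = [(tok[:i], tok[i:]) for i in range(1, len(tok))
                  if tok[:i] in vocab and tok[i:] in vocab]
        if splits and rng.random() < p:
            out.extend(rng.choice(splits))
        else:
            out.append(tok)
    return out

# Toy vocabulary (an assumption for this sketch).
vocab = {"straw", "berry", "st", "raw", "s", "traw", "b", "erry"}
tokens = ["straw", "berry"]
split_tokens = stochastic_split(tokens, vocab, p=1.0, rng=random.Random(0))
```

Because each split preserves the underlying text, the same string is seen under several tokenizations across training steps, which is the property the abstract credits for exposing the internal structure of tokens.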

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

135. A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

Authors:

Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. Multi-objective reinforcement learning (MORL) typically addresses this by training a single policy conditioned on preference-weighted rewards. In this paper, we explore a novel perspective: leveraging reward-free reinforcement learning (RFRL) for MORL. While RFRL has historically been studied independently of MORL, it learns optimal policies for any possible reward function, making it a natural fit for MORL's challenge of handling unknown user preferences. We propose using RFRL's training objective as an auxiliary task to enhance MORL, enabling more effective knowledge sharing beyond the multi-objective reward function given at training time. To this end, we adapt a state-of-the-art RFRL algorithm to the MORL setting and introduce a preference-guided exploration strategy that focuses learning on relevant parts of the environment. Our approach significantly outperforms state-of-the-art MORL methods across diverse MO-Gymnasium tasks, achieving superior performance and data efficiency, especially in settings with limited preference samples. This work is the first to explicitly adapt RFRL for MORL, demonstrating its potential as a scalable and effective solution.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

136. Kimi-Dev: Agentless Training as Skill Prior for SWE-agents

Authors:

Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions are split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code editing, and self-reflection, that enable efficient and effective SWE-Agent adaptation. In this work, we first curate the Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4% on SWE-bench Verified, the best among workflow approaches. With additional SFT adaptation on 5k publicly available trajectories, Kimi-Dev powers SWE-Agents to 48.6% pass@1, on par with that of Claude 3.5 Sonnet (241022 version). These results show that structured skill priors from Agentless training can bridge workflow and agentic frameworks for transferable coding agents.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

137. Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Authors:

Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain underexplored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint for how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

138. Efficient algorithms for Incremental Metric Bipartite Matching

Authors:

The minimum-cost bipartite matching between two sets of points $R$ and $S$ in a metric space has a wide range of applications in machine learning, computer vision, and logistics. For instance, it can be used to estimate the $1$-Wasserstein distance between continuous probability distributions and for efficiently matching requests to servers while minimizing cost. However, the computational cost of determining the minimum-cost matching for general metric spaces poses a significant challenge, particularly in dynamic settings where points arrive over time and each update requires re-executing the algorithm. In this paper, given a fixed set $S$, we describe a deterministic algorithm that maintains, after $i$ additions to $R$, an $O(1/\delta^{0.631})$-approximate minimum-cost matching of cardinality $i$ between sets $R$ and $S$ in any metric space, with an amortized insertion time of $\widetilde{O}(n^{1+\delta})$ for adding points in $R$. To the best of our knowledge, this is the first algorithm for incremental minimum-cost matching that applies to arbitrary metric spaces. Interestingly, an important subroutine of our algorithm lends itself to efficient parallelization. We provide both a CPU implementation and a GPU implementation that leverages parallelism. Extensive experiments on both synthetic and real-world datasets showcase that our algorithm either matches or outperforms all benchmarks in terms of speed while significantly improving upon the accuracy.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 openreview 📄 Download PDF

139. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Authors:

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks under distribution shift. To overcome these limitations, we explore reinforcement learning (RL) as a pathway to scaling VLA training beyond limited datasets. Inspired by LLM breakthroughs where RL with outcome rewards enhances step-by-step reasoning, we ask: Can outcome-driven RL improve long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. Applied to OpenVLA-OFT, SimpleVLA-RL achieves 99% of SoTA performance on LIBERO and 80% relative improvement on RoboTwin 1.0 & 2.0, outperforming $\pi_0$ with our proposed exploration-enhancing strategies. SimpleVLA-RL reduces dependence on large-scale data, enables robust generalization, and remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers unseen patterns beyond those seen in the previous training process.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

140. Characterizing Pattern Matching and Its Limits on Compositional Task Structures

Authors:

Despite impressive capabilities, LLMs often exhibit surface-level pattern-matching behaviors, evidenced by OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise and testable account of how well LLMs perform generalization through pattern matching and their limitations. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., substituting input fragments observed to result in identical outputs in shared contexts. Then, we systematically study how decoder-only Transformer and Mamba behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights: (1) Instance-wise success of pattern matching is tightly ordered by the number of contexts witnessing the relevant functional equivalence. (2) We derive and empirically confirm that the training data required for learning a two-hop structure grows at least quadratically with token-set size. The power-law scaling exponent agrees with predictions and remains stable across 20× parameter scaling and different architectures. (3) Path ambiguity is a structural barrier: when a variable influences the output via multiple paths, models fail to form unified intermediate state representations, impairing accuracy and interpretability. (4) Chain-of-Thought reduces data requirements yet does not resolve path ambiguity. Hence, we provide a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

141. The Effect of Attention Head Count on Transformer Approximation

Authors:

The Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $\epsilon$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $O(1/\epsilon^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, resulting in the approximation being achieved entirely by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

142. Uncertainty-Aware 3D Reconstruction for Dynamic Underwater Scenes

Authors:

Underwater 3D reconstruction remains challenging due to the intricate interplay between light scattering and environment dynamics. While existing methods yield plausible reconstruction with rigid scene assumptions, they struggle to capture temporal dynamics and remain sensitive to observation noise. In this work, we propose an Uncertainty-aware Dynamic Field (UDF) that jointly represents underwater structure and view-dependent medium over time. A canonical underwater representation is initialized using a set of 3D Gaussians embedded in a volumetric medium field. Then we map this representation into a 4D neural voxel space and encode spatial-temporal features by querying the voxels. Based on these features, a deformation network and a medium offset network are proposed to model transformations of Gaussians and time-conditioned updates to medium properties, respectively. To address input-dependent noise, we model per-pixel uncertainty guided by surface-view radiance ambiguity and inter-frame scene flow inconsistency. This uncertainty is incorporated into the rendering loss to suppress the noise from low-confidence observations during training. Experiments on both controlled and in-the-wild underwater datasets demonstrate our method achieves both high-quality reconstruction and novel view synthesis. Our code will be released.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

143. HDR-4DGS: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos

Authors:

We introduce HDR-4DGS, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with a two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that HDR-4DGS significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

144. StylOS: Multi-View 3D Stylization with Single-Forward Gaussian Splatting

Authors:

We present Stylos, a single-forward 3D Gaussian framework for 3D style transfer that operates on unposed content, from a single image to a multi-view collection, conditioned on a separate reference style image. Stylos synthesizes a stylized 3D Gaussian scene without per-scene optimization or precomputed poses, achieving geometry-aware, view-consistent stylization that generalizes to unseen categories, scenes, and styles. At its core, Stylos adopts a Transformer backbone with two pathways: geometry predictions retain self-attention to preserve geometric fidelity, while style is injected via global cross-attention to enforce visual consistency across views. With the addition of a voxel-based 3D style loss that aligns aggregated scene features to style statistics, Stylos enforces view-consistent stylization while preserving geometry. Experiments across multiple datasets demonstrate that Stylos delivers high-quality zero-shot stylization, highlighting the effectiveness of global style–content coupling, the proposed 3D style loss, and the scalability of our framework from single view to large-scale multi-view settings. Our code will be fully open-sourced soon.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

145. Discrete Adjoint Matching

Authors:

Computational methods for solving entropy-regularized reward optimization, a class of problems widely used for fine-tuning generative models, have advanced rapidly. Among those, Adjoint Matching (AM, Domingo-Enrich et al., 2025) has proven highly effective in continuous state spaces with differentiable rewards. Transferring these practical successes to discrete generative modeling, however, remains particularly challenging and largely unexplored, mainly due to the drastic shift in generative model classes to discrete state spaces, which are nowhere differentiable. In this work, we propose Discrete Adjoint Matching (DAM), a discrete variant of AM for fine-tuning discrete generative models characterized by Continuous-Time Markov Chains, such as diffusion-based large language models. The core of DAM is the introduction of the discrete adjoint (an estimator of the optimal solution to the original problem, but formulated on discrete domains), from which standard matching frameworks can be applied. This is derived via a purely statistical standpoint, in contrast to the control-theoretic viewpoint in AM, thereby opening up new algorithmic opportunities for general adjoint-based estimators. We showcase DAM’s effectiveness on synthetic and mathematical reasoning tasks.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

146. Correlated Policy Optimization in Multi-Agent Subteams

Authors:

In cooperative multi-agent reinforcement learning, agents often face scalability challenges due to the exponential growth of the joint action and observation spaces. Inspired by the structure of human teams, we explore subteam-based coordination, where agents are partitioned into fully correlated subgroups with limited inter-group interaction. We formalize this structure using Bayesian networks and propose a class of correlated joint policies induced by directed acyclic graphs. Theoretically, we prove that regularized policy gradient ascent converges to near-optimal policies under a decomposability condition of the environment. Empirically, we introduce a heuristic for dynamically constructing context-aware subteams with limited dependency budgets, and demonstrate that our method outperforms standard baselines across multiple benchmark environments.

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 openreview 📄 Download PDF

147. Learning to Play Multi-Follower Bayesian Stackelberg Games

Authors:

In a multi-follower Bayesian Stackelberg game, a leader plays a mixed strategy over $L$ actions to which $n\ge 1$ followers, each having one of $K$ possible private types, best respond. The leader's optimal strategy depends on the distribution of the followers' private types. We study an online learning problem for Bayesian Stackelberg games, where a leader interacts for $T$ rounds with $n$ followers with types sampled from an unknown distribution every round. The leader's goal is to minimize regret, defined as the difference between the cumulative utility of the optimal strategy and that of the actually chosen strategies. We design learning algorithms for the leader under different settings. Under type feedback, where the leader observes the followers' types after each round, we design algorithms that achieve $\mathcal O\big(\sqrt{\min\{L\log(nKA T), ~ nK \} \cdot T} \big)$ regret for independent type distributions and $\mathcal O\big(\sqrt{\min\{L\log(nKA T), ~ K^n \} \cdot T} \big)$ regret for general type distributions. Interestingly, these bounds do not grow with $n$ at a polynomial rate. Under action feedback, where the leader only observes the followers' actions, we design algorithms with $\mathcal O( \min\{\sqrt{ n^L K^L A^{2L} L T \log T}, ~ K^n\sqrt{ T } \log T \} )$ regret. We also provide a lower bound of $\Omega(\sqrt{\min\{L, ~ nK\}T})$, almost matching the type-feedback upper bounds.
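
The verbal definition of regret in this abstract can be written out as a formula. The notation below is an assumption made for illustration and may differ from the paper's: $x_t$ denotes the leader's mixed strategy in round $t$, $U(x)$ the leader's expected utility when the followers best respond to $x$ under the true type distribution, and $x^{\star}$ an optimal strategy.

```latex
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \Big( U(x^{\star}) - U(x_t) \Big),
\qquad x^{\star} \in \arg\max_{x} U(x)
```

Under this reading, a bound of order $\sqrt{T}$ means the average per-round loss relative to $x^{\star}$ vanishes at rate $1/\sqrt{T}$.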

📊 Review scores

Average score: 7.00

Lowest score: 6

Highest score: 10

Number of reviewers: 4

Individual scores: 10, 6, 6, 6

📄 openreview 📄 Download PDF

148. Fair Decision Utility in Human-AI Collaboration: Interpretable Confidence Adjustment for Humans with Cognitive Disparities

Authors:

In AI-assisted decision-making, human decision-makers finalize decisions by taking into account both their human confidence and AI confidence regarding specific outcomes. In practice, they often exhibit heterogeneous cognitive capacities, causing their confidence to deviate, sometimes significantly, from the actual label likelihood. We theoretically demonstrate that existing AI confidence adjustment objectives, such as *calibration* and *human-alignment*, are insufficient to ensure fair utility across groups of decision-makers with varying cognitive capacities. Such unfairness may raise concerns about social welfare and may erode human trust in AI systems. To address this issue, we introduce a new concept in AI confidence adjustment: *inter-group-alignment*. By theoretically bounding the utility disparity between human decision-maker groups as a function of *human-alignment* level and *inter-group-alignment* level, we establish an interpretable fairness-aware objective for AI confidence adjustment. Our analysis suggests that achieving utility fairness in AI-assisted decision-making requires both *human-alignment* and *inter-group-alignment*. Building on these objectives, we propose a multicalibration-based AI confidence adjustment approach tailored to scenarios involving human decision-makers with heterogeneous cognitive capacities. We further provide theoretical justification showing that our method constitutes a sufficient condition for achieving both *human-alignment* and *inter-group-alignment*. We validate our theoretical findings through extensive experiments on four real-world tasks. The results demonstrate that AI confidence adjusted toward both *human-alignment* and *inter-group-alignment* significantly improves utility fairness across human decision-maker groups, without sacrificing overall utility. *The implementation code is available at* https://anonymous.4open.science/r/FairHAI.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 openreview 📄 Download PDF

149. When Is Diversity Rewarded in Cooperative Multi-Agent Learning?

Authors:

The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, we study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the N agents’ effort allocations on individual tasks to a task score, and an outer operator that merges the M task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneity Gain Parameter Search (HetGPS), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Across different environments, we show that HetGPS rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HetGPS and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 10

Number of reviewers: 4

Individual scores: 10, 6, 6, 6

📄 openreview 📄 Download PDF

150. CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

Authors:

AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short, because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

151. Improving Online-to-Nonconvex Conversion for Smooth Optimization via Double Optimism

Authors:

A recent breakthrough in nonconvex optimization is the online-to-nonconvex conversion framework of Cutkosky et al. (2023), which reformulates the task of finding an $\varepsilon$-first-order stationary point as an online learning problem. When both the gradient and the Hessian are Lipschitz continuous, instantiating this framework with two different online learners achieves a complexity of $ \mathcal{O}(\varepsilon^{-1.75}\log(1/\varepsilon)) $ in the deterministic case and a complexity of $ \mathcal{O}(\varepsilon^{-3.5}) $ in the stochastic case. However, this approach suffers from several limitations: (i) the deterministic method relies on a complex double-loop scheme that solves a fixed-point equation to construct hint vectors for an optimistic online learner, introducing an extra logarithmic factor; (ii) the stochastic method assumes a bounded second-order moment of the stochastic gradient, which is stronger than standard variance bounds; and (iii) different online learning algorithms are used in the two settings. In this paper, we address these issues by introducing an online optimistic gradient method based on a novel **doubly optimistic hint function**. Specifically, we use the gradient at an extrapolated point as the hint, motivated by two optimistic assumptions: that the difference between the hint and the target gradient remains near constant, and that consecutive update directions change slowly due to smoothness. Our method eliminates the need for a double loop and removes the logarithmic factor. Furthermore, by simply replacing full gradients with stochastic gradients and under the standard assumption that their variance is bounded by $\sigma^2$, we obtain a unified algorithm with complexity $\mathcal{O}(\varepsilon^{-1.75} + \sigma^2 \varepsilon^{-3.5})$, smoothly interpolating between the best-known deterministic rate and the optimal stochastic rate.
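As a quick reading aid for the unified bound quoted above, the following elementary algebra (not a result taken from the paper) shows when each term dominates. It uses only the symbols $\varepsilon$ and $\sigma^2$ from the abstract.

$$
\sigma^2 \varepsilon^{-3.5} \ge \varepsilon^{-1.75} \iff \sigma^2 \ge \varepsilon^{1.75},
\qquad
\sigma = 0 \implies \mathcal{O}\left(\varepsilon^{-1.75} + \sigma^2 \varepsilon^{-3.5}\right) = \mathcal{O}\left(\varepsilon^{-1.75}\right).
$$

In other words, the stochastic term takes over once the variance exceeds $\varepsilon^{1.75}$, and with noiseless gradients the bound reduces to the deterministic rate stated in the abstract.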

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

152. Distributed Algorithms for Euclidean Clustering

Authors:

We study the problem of constructing $(1+\varepsilon)$-coresets for Euclidean $(k,z)$-clustering in the distributed setting, where $n$ data points are partitioned across $s$ sites. We focus on two prominent communication models: the coordinator model and the blackboard model. In the coordinator model, we design a protocol that achieves a $(1+\varepsilon)$-strong coreset with total communication complexity $\tilde{O}\left(sk + \frac{dk}{\min(\varepsilon^4,\varepsilon^{2+z})} + dk\log(n\Delta)\right)$ bits, improving upon prior work (Chen et al., NeurIPS 2016) by eliminating the need to communicate explicit point coordinates in the clear across all servers. In the blackboard model, we further reduce the communication complexity to $\tilde{O}\left(s\log(n\Delta) + dk\log(n\Delta) + \frac{dk}{\min(\varepsilon^4,\varepsilon^{2+z})}\right)$ bits, achieving better bounds than previous approaches while upgrading from constant-factor to $(1+\varepsilon)$-approximation guarantees. Our techniques combine new strategies for constant-factor approximation with efficient coreset constructions and compact encoding schemes, leading to optimal protocols that match both the communication costs of the best-known offline coreset constructions and existing lower bounds (Chen et al., NeurIPS 2016; Huang et al., STOC 2024), up to polylogarithmic factors.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

153. Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning

Authors:

Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making representation drift for old classes particularly harmful. Prototype-based EFCIL is attractive for its efficiency, yet prototypes drift as the embedding space evolves; thus, projection-based drift compensation has become a popular remedy. We show, however, that existing one-directional projections introduce systematic bias: they either retroactively distort the current feature geometry or align past classes only locally, leaving cycle inconsistencies that accumulate across tasks. We introduce bidirectional projector alignment during training: two maps, old$\to$new and new$\to$old, are trained during each new task with stop-gradient gating and a cycle-consistency objective so that transport and representation co-evolve. Analytically, we prove that the cycle loss contracts the singular spectrum toward unity in whitened space and that improved transport of class means/covariances yields smaller perturbations of classification log-odds, preserving old-class decisions and directly mitigating catastrophic forgetting. Empirically, across standard EFCIL benchmarks, our method achieves unprecedented reductions in forgetting while maintaining very high accuracy on new tasks, consistently outperforming state-of-the-art approaches.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

154. Temporal Generalization: A Reality Check

Authors:

Machine learning (ML) models often struggle to maintain performance under distribution shifts, leading to inaccurate predictions on unseen future data. In this work, we investigate whether and under what conditions models can achieve such a generalization when relying solely on past data. We explore two primary approaches: convex combinations of past model parameters (parameter interpolation) and explicit extrapolation beyond the convex hull of past parameters (parameter extrapolation). We benchmark several methods within these categories on a diverse set of temporal tasks, including language modeling, news summarization, news tag prediction, academic paper categorization, satellite image-based land use classification over time, and historical yearbook photo gender prediction. Our empirical findings show that none of the evaluated methods consistently outperforms the simple baseline of using the latest available model parameters in all scenarios. In the absence of access to future data or robust assumptions about the underlying data-generating process, these results underscore the inherent difficulties of generalizing and extrapolating to future data and warrant caution when evaluating claims of such generalization.
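To make the two families named in the abstract concrete, here is a minimal sketch of the difference between parameter interpolation (a convex combination of past parameters) and parameter extrapolation (a combination that leaves their convex hull). The function names, the toy two-dimensional "models", and the specific weights are illustrative assumptions, not the benchmarked methods from the paper.

```python
from typing import List, Sequence


def combine_parameters(
    checkpoints: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Return the weighted sum of parameter vectors; weights must sum to 1."""
    if len(checkpoints) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    dim = len(checkpoints[0])
    return [
        sum(w * ckpt[i] for w, ckpt in zip(weights, checkpoints))
        for i in range(dim)
    ]


def is_convex_combination(weights: Sequence[float]) -> bool:
    """True when all weights are non-negative (result stays in the convex hull)."""
    return all(w >= 0.0 for w in weights)


# Hypothetical two-parameter "models" from two past time steps.
theta_old = [0.0, 1.0]
theta_new = [1.0, 3.0]

# Interpolation: a convex combination, here the plain average.
interpolated = combine_parameters([theta_old, theta_new], [0.5, 0.5])

# Extrapolation: continue the old -> new trend by one more step,
# i.e. theta_new + (theta_new - theta_old), using weights (-1, 2).
extrapolated = combine_parameters([theta_old, theta_new], [-1.0, 2.0])

print(interpolated)  # [0.5, 2.0]
print(extrapolated)  # [2.0, 5.0]
```

The baseline the abstract compares against corresponds to the weights `[0.0, 1.0]`, which simply returns the latest parameters.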

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

155. SelvaBox: A high‑resolution dataset for tropical tree crown detection

Authors:

Detecting individual tree crowns in tropical forests is essential to study these complex and crucial ecosystems impacted by human interventions and climate change. However, tropical crowns vary widely in size, structure, and pattern and are largely overlapping and intertwined, requiring advanced remote sensing methods applied to high-resolution imagery. Despite growing interest in tropical tree crown detection, annotated datasets remain scarce, hindering robust model development. We introduce SelvaBox, the largest open‑access dataset for tropical tree crown detection in high-resolution drone imagery. It spans three countries and contains more than $83\,000$ manually labeled crowns -- an order of magnitude larger than all previous tropical forest datasets combined. Extensive benchmarks on SelvaBox reveal two key findings: 1) higher-resolution inputs consistently boost detection accuracy; and 2) models trained exclusively on SelvaBox achieve competitive zero-shot detection performance on unseen tropical tree crown datasets, matching or exceeding competing methods. Furthermore, jointly training on SelvaBox and three other datasets at resolutions from 3 to 10 cm per pixel within a unified multi-resolution pipeline yields a detector ranking first or second across all evaluated datasets. Our dataset, code, and pre-trained weights are made public.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

156. Mamba-3: Improved Sequence Modeling using State Space Principles

Authors:

The recent scaling of test-time compute for LLMs has restricted the practical deployment of models to those with strong capabilities that can generate high-quality outputs in an inference-efficient manner. While current Transformer-based models are the standard, their quadratic compute and linear memory bottlenecks have spurred the development of sub-quadratic models with linear-scaling compute with constant memory requirements. However, many recent linear-style models lack certain capabilities or lag behind in quality, and even their linear-time inference is not hardware-efficient. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state-space model viewpoint of linear models. We combine a: 1) more expressive recurrence, 2) complex state update rule that enables richer state tracking, and 3) multi-input, multi-output formulation together, resulting in a stronger model that better exploits hardware parallelism during decoding. Together with architectural refinements, our **Mamba-3** model achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. Our new architecture sets the Pareto-frontier for performance under a fixed inference budget and outperforms strong baselines in a head-to-head comparison.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

157. Do 3D Large Language Models Really Understand 3D Spatial Relationships?

Authors:

Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably to or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect whether a model exploits textual shortcuts rather than engaging in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that leverages negative samples via explicit 3D-relation alignment, substantially enhancing 3D-LLMs’ performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

158. Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks

Authors:

The multivariate Hawkes process provides a powerful framework for modeling temporal dependencies and event-driven interactions in complex systems. While existing methods primarily focus on uncovering causal structures among observed subprocesses, real-world systems are often only partially observed, with latent subprocesses posing significant challenges. In this paper, we show that continuous-time event sequences can be represented by a discrete-time causal model as the time interval shrinks, and we leverage this insight to establish necessary and sufficient conditions for identifying latent subprocesses and the causal influences. Accordingly, we propose a two-phase iterative algorithm that alternates between inferring causal relationships among discovered subprocesses and uncovering new latent subprocesses, guided by path-based conditions that guarantee identifiability. Experiments on both synthetic and real-world datasets show that our method effectively recovers causal structures despite the presence of latent subprocesses.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

159. AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory

Authors:

Vision-language-action (VLA) models have shown promise as generalist robotic policies by jointly leveraging visual, linguistic, and proprioceptive modalities to generate action trajectories. While recent benchmarks have advanced VLA research in domestic tasks, professional science-oriented domains remain underexplored. We introduce AutoBio, a simulation framework and benchmark designed to evaluate robotic automation in biology laboratory environments, an application domain that combines structured protocols with demanding precision and multimodal interaction. AutoBio extends existing simulation capabilities through a pipeline for digitizing real-world laboratory instruments, specialized physics plugins for mechanisms ubiquitous in laboratory workflows, and a rendering stack that supports dynamic instrument interfaces and transparent materials through physically based rendering. Our benchmark comprises biologically grounded tasks spanning three difficulty levels, enabling standardized evaluation of language-guided robotic manipulation in experimental protocols. We provide infrastructure for demonstration generation and seamless integration with VLA models. Baseline evaluations with SOTA VLA models reveal significant gaps in precision manipulation, visual reasoning, and instruction following in scientific workflows. By releasing AutoBio, we aim to catalyze research on generalist robotic systems for complex, high-precision, and multimodal professional environments.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

160. Instance-Dependent Fixed-Budget Pure Exploration in Reinforcement Learning

Authors:

We study the problem of fixed-budget pure exploration in reinforcement learning. The goal is to identify a near-optimal policy, given a fixed budget on the number of interactions with the environment. Unlike the standard PAC setting, we do not require the target error level $\epsilon$ and failure rate $\delta$ as input. We propose novel algorithms and provide, to the best of our knowledge, the first instance-dependent $\epsilon$-uniform guarantee, meaning that the probability that $\epsilon$-correctness is ensured can be obtained simultaneously for all $\epsilon$ above a budget-dependent threshold. It characterizes the budget requirements in terms of the problem-specific hardness of exploration. As a core component of our analysis, we derive an $\epsilon$-uniform guarantee for the multiple bandit problem (solving multiple multi-armed bandit instances simultaneously), which may be of independent interest. To enable our analysis, we also develop tools for reward-free exploration under the fixed-budget setting, which we believe will be useful for future work.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

161. Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

Authors:

Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types — including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios. For more details, please refer to our project webpage: https://vlm-rmd.github.io/.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

162. Efficient Learning on Large Graphs using a Densifying Regularity Lemma

Authors:

Learning on large graphs presents significant challenges, with traditional Message Passing Neural Networks suffering from computational and memory costs scaling linearly with the number of edges. We introduce the Intersecting Block Graph (IBG), a low-rank factorization of large directed graphs based on combinations of intersecting bipartite components, each consisting of a pair of communities, for source and target nodes. By giving less weight to non-edges, we show how an IBG can efficiently approximate any graph, sparse or dense. Specifically, we prove a constructive version of the weak regularity lemma: for any chosen accuracy, every graph can be approximated by a dense IBG whose rank depends only on that accuracy. This improves over prior versions of the lemma, where the rank depended on the number of nodes for sparse graphs. Our method allows for efficient approximation of large graphs that are both directed and sparse, a crucial capability for many real-world applications. We then introduce a graph neural network architecture operating on the IBG representation of the graph and demonstrating competitive performance on node classification, spatio-temporal graph analysis, and knowledge graph completion, while having memory and computational complexity linear in the number of nodes rather than edges.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

163. Bures-Wasserstein Flow Matching for Graph Generation

Authors:

Graph generation has emerged as a critical task in fields ranging from drug discovery to circuit design. Contemporary approaches, notably diffusion and flow-based models, have achieved solid graph generative performance through constructing a probability path that interpolates between reference and data distributions. However, these methods typically model the evolution of individual nodes and edges independently and use linear interpolations to build the path. This disentangled interpolation breaks the interconnected patterns of graphs, making the constructed probability path irregular and non-smooth, which causes poor training dynamics and faulty sampling convergence. To address the limitation, this paper first presents a theoretically grounded framework for probability path construction in graph generative models. Specifically, we model the joint evolution of the nodes and edges by representing graphs as connected systems parameterized by Markov random fields (MRF). We then leverage the optimal transport displacement between MRF objects to design a smooth probability path that ensures the co-evolution of graph components. Based on this, we introduce BWFlow, a flow-matching framework for graph generation that utilizes the derived optimal probability path to benefit the training and sampling algorithm design. Experimental evaluations in plain graph generation and molecule generation validate the effectiveness of BWFlow with competitive performance, better training convergence, and efficient sampling.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

164. Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime

Authors:

Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

165. Unified and Efficient Multi-view Clustering from Probabilistic Perspective

Authors:

Multi-view clustering aims to segment the view-specific data into the corresponding clusters. A large number of multi-view clustering methods have been proposed in recent years. As representative methods, graph-based approaches learn a view-consistent and discriminative graph and then apply graph partitioning to obtain the final clustering results. Despite their significant success, these methods usually construct full graphs, so their efficiency is not well guaranteed on large-scale multi-view datasets. To handle large-scale data, anchor-based multi-view clustering methods have been developed, which learn a smaller anchor graph. However, existing works neglect the interpretability of anchor-based multi-view clustering from the probabilistic perspective. These methods also do not analyze, in a unified manner, the relationship between the input data and the final clustering results based on meaningful assigned probability associations. In this work, we propose a novel method termed Unified and Efficient Multi-view Clustering from Probabilistic perspective (UEMCP). It aims to improve the interpretability of anchor-based multi-view clustering from the probabilistic perspective in an end-to-end manner. It ensures consistent inherent structures among the views by learning the common transition probability from data points to categories in one step. Guided by the common transition probability matrix from data points to categories, the soft labels of data points can be obtained based on the common transition probability matrix from anchor points to categories in the learning framework. Experiments on different challenging multi-view datasets confirm the superiority of UEMCP compared with representative methods.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

166. Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability

Authors:

Diffusion language models (DLMs) generate tokens in parallel through iterative denoising, which can reduce latency and enable bidirectional conditioning. However, the safety risks posed by jailbreak attacks that exploit this inference mechanism are not well understood. In this paper, we reveal that DLMs have a critical vulnerability stemming from their iterative denoising process and propose a countermeasure. Specifically, our investigation identifies that if an affirmative token for a harmful query appears at an intermediate step, subsequent denoising can be steered toward a harmful response even in aligned models. Furthermore, we demonstrate that the vulnerability enables existing optimization-based jailbreak attacks to be applied to MDLMs. Building on this analysis, we propose a novel safety alignment method tailored to DLMs that trains models to generate safe responses from contaminated intermediate denoising steps containing affirmative tokens. Our experiments indicate that the proposed method significantly mitigates the vulnerability with minimal impact on task performance. Furthermore, our method also improves robustness against conventional jailbreak attacks. Our work underscores the need for DLM-specific safety research.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

167. Zephyrus: An Agentic Framework for Weather Science

Authors:

Foundation models for weather science are pre-trained on vast amounts of structured numerical data and outperform traditional weather forecasting systems. However, these models lack language-based reasoning capabilities, limiting their utility in interactive scientific workflows. Large language models (LLMs) excel at understanding and generating text but cannot reason about high-dimensional meteorological datasets. We bridge this gap by building a novel agentic framework for weather science. Our framework includes a Python code-based environment for agents (ZephyrusWorld) to interact with weather data, featuring tools like an interface to WeatherBench 2 dataset, geoquerying for geographical masks from natural language, weather forecasting, and climate simulation capabilities. We design Zephyrus, a multi-turn LLM-based weather agent that iteratively analyzes weather datasets, observes results, and refines its approach through conversational feedback loops. We accompany the agent with a new benchmark, ZephyrusBench, with a scalable data generation pipeline that constructs diverse question-answer pairs across weather-related tasks, from basic lookups to advanced forecasting, extreme event detection, and counterfactual reasoning. Experiments on this benchmark demonstrate the strong performance of Zephyrus agents over text-only baselines, outperforming them by up to 35 percentage points in correctness. However, on harder tasks, Zephyrus performs similarly to text-only baselines, highlighting the challenging nature of our benchmark and suggesting promising directions for future work.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

168. MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

Authors:

Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitively expensive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters.
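The quality-versus-cost trade-off described above can be illustrated with a small late-interaction scorer. This is only a sketch under stated assumptions: it uses a ColBERT-style MaxSim score (sum over query vectors of the best dot product with any candidate vector) and assumes that taking the first `k` vectors is how a smaller budget is selected. The abstract confirms neither detail, and all names below are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(
    query_vecs: List[Vector],
    cand_vecs: List[Vector],
    num_query: int,
    num_cand: int,
) -> float:
    """MaxSim score using only the first num_query / num_cand vectors.

    Assumption: vectors are ordered coarse-to-fine, so a prefix is a
    cheaper but still usable embedding.
    """
    q = query_vecs[:num_query]
    c = cand_vecs[:num_cand]
    # For each query vector, keep its best match among candidate vectors.
    return sum(max(dot(qv, cv) for cv in c) for qv in q)


# Toy multi-vector embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.0, 2.0]]

cheap = late_interaction_score(query, candidate, num_query=1, num_cand=1)
full = late_interaction_score(query, candidate, num_query=2, num_cand=2)

print(cheap)  # 1.0
print(full)   # 3.0
```

The cost of scoring one pair grows with `num_query * num_cand` dot products, which is the knob a user would turn to trade retrieval quality for efficiency.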

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

169. S2GO: Streaming Sparse Gaussian Occupancy

Authors:

Despite the efficiency and performance of sparse query-based representations for perception, state-of-the-art 3D occupancy estimation methods still rely on voxel-based or dense Gaussian-based 3D representations. However, dense representations are slow, and they lack flexibility in capturing the temporal dynamics of driving scenes. Distinct from prior work, we instead summarize the scene into a compact set of 3D queries which are propagated through time in an online, streaming fashion. These queries are then decoded into semantic Gaussians at each timestep. We couple our framework with a denoising rendering objective to guide the queries and their constituent Gaussians in effectively capturing scene geometry. Owing to its efficient, query-based representation, S2GO achieves state-of-the-art performance on the nuScenes and KITTI occupancy benchmarks, outperforming prior art (e.g., GaussianWorld) by 2.7 IoU with 4.5x faster inference.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

170. Identity-Free Deferral For Unseen Experts

Authors:

Learning to Defer (L2D) improves AI reliability in decision-critical environments, such as healthcare, by training a model to either make its own prediction or defer the decision to a human expert. A key challenge is adapting to unseen experts: those who were not involved during the system's training process. Current methods for this task, however, can falter when unseen experts are out-of-distribution (OOD) relative to the training population. We identify a core architectural flaw as the cause: they learn identity-conditioned policies by processing class-indexed signals in fixed coordinates, creating shortcuts that violate the problem's inherent permutation symmetry. We introduce Identity-Free Deferral (IFD), an architecture that enforces this symmetry by construction. From a few-shot context, IFD builds a query-independent Bayesian competence profile for each expert. It then supplies the deferral rejector with a low-dimensional, role-indexed state containing only structural information, such as the model's confidence in its top-ranked class and the expert's estimated skill for that same role, which obscures absolute class identities. We train IFD using an uncertainty-aware, context-only objective that removes the need for expensive query-time expert labels. We formally prove the permutation invariance of our approach, contrasting it with the generic non-invariance of standard population encoders. Experiments on medical imaging benchmarks and ImageNet-16H with real human annotators show that IFD consistently improves generalization to unseen experts, with significant gains in OOD settings, all while using fewer annotations than competing methods.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 2

Individual scores: 6, 8

📄 openreview 📄 Download PDF

171. Slicing Wasserstein over Wasserstein via Functional Optimal Transport

Authors:

Wasserstein distances define a metric between probability measures on arbitrary metric spaces, including *meta-measures* (measures over measures). The resulting *Wasserstein over Wasserstein* (WoW) distance is a powerful but computationally costly tool for comparing datasets or distributions over images and shapes. Existing sliced WoW accelerations rely on parametric meta-measures or the existence of high-order moments, leading to numerical instability. As an alternative, we propose to leverage the isometry between the 1d Wasserstein space and the quantile functions in the function space $L_2([0,1])$. For this purpose, we introduce a general sliced Wasserstein framework for arbitrary Banach spaces. Due to the 1d Wasserstein isometry, this framework defines a sliced distance between 1d meta-measures via infinite-dimensional $L_2$-projections, parametrized by Gaussian processes. Combining this 1d construction with classical integration over the Euclidean unit sphere yields the *double-sliced Wasserstein* (DSW) metric for general meta-measures. We show that DSW minimization is equivalent to WoW minimization for discretized meta-measures, while avoiding unstable higher-order moments and providing computational savings. Numerical experiments on datasets, shapes, and images validate DSW as a scalable substitute for the WoW distance.
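
The 1d isometry mentioned above can be made concrete with a small sketch. For two empirical measures with the same number of equally weighted points, the quantile functions are step functions over the sorted samples, so the 2-Wasserstein distance equals the $L_2([0,1])$ distance between them. This illustrates only that standard fact, not the DSW construction; `wasserstein2_1d` is a hypothetical helper name.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1d empirical measures.

    Sorting yields the quantile functions evaluated on a uniform grid;
    the distance is the root mean squared difference between them.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    qx, qy = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(qx, qy)) / len(qx))


# The second sample is the first one shifted by 1 (given in shuffled order).
d = wasserstein2_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

Because each 1d measure becomes a point in a function space, a measure over such measures can be treated as a measure on $L_2([0,1])$, which is the setting the abstract's Banach-space slicing addresses.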

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

172. Towards Physically Executable 3D Gaussian for Embodied Navigation

Authors:

3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks fine-grained semantics and physical executability for Visual-Language Navigation (VLN). To address this, we propose **SAGE-3D** (**S**emantically and Physically **A**ligned **G**aussian **E**nvironments for **3D** Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: **(1) Object-Centric Semantic Grounding**, which adds object-level fine-grained annotations to 3DGS; and **(2) Physics-Aware Execution Jointing**, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release **InteriorGS**, containing 1K object-annotated 3DGS indoor scene data, and introduce **SAGE-Bench**, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments show that 3DGS scene data is more difficult to converge, while exhibiting strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

173. Coupled Transformer Autoencoder for Disentangling Multi-Region Neural Latent Dynamics

Authors:

Simultaneous recordings from thousands of neurons across multiple brain areas reveal rich mixtures of activity that are shared between regions and dynamics that are unique to each region. Existing alignment or multi-view methods neglect temporal structure, whereas dynamical latent-variable models capture temporal dependencies but are usually restricted to a single area, assume linear read-outs, or conflate shared and private signals. We introduce Coupled Transformer Autoencoder (CTAE)—a sequence model that addresses both (i) non-stationary, non-linear dynamics and (ii) separation of shared versus region-specific structure, in a single framework. CTAE employs Transformer encoders and decoders to capture long-range neural dynamics, and explicitly partitions each region’s latent space into orthogonal shared and private subspaces. We demonstrate the effectiveness of CTAE on two high-density electrophysiology datasets of simultaneous recordings from multiple regions, one from motor cortical areas and the other from sensory areas. CTAE extracts meaningful representations that better decode behavior variables compared to existing approaches.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

174. Long-range Modeling and Processing of Multimodal Event Sequences

Authors:

Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

175. WoW!: World Models in a Closed-Loop World

Authors:

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., *do WMs actually help agents succeed at embodied tasks?* To address this gap, we introduce WoW!, the first open platform that benchmarks WMs in a closed-loop setting that mirrors real agent-environment interactions. WoW! provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success—controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance. By centering evaluation on closed-loop outcomes, WoW! establishes a new benchmark for the systematic assessment of WMs.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

176. TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

Authors:

Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce $\textbf{Free-Form HOI Generation}$, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct $\textbf{WildO2}$, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 403 object categories, each with detailed semantic annotations. Building on this dataset, we propose $\textbf{TOUCH}$, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

177. Learning a distance measure from the information-estimation geometry of data

Authors:

We introduce the Information-Estimation Metric (IEM), a novel form of distance function derived from an underlying continuous probability density over a domain of signals. The IEM is rooted in a fundamental relationship between information theory and estimation theory, which links the log-probability of a signal with the errors of an optimal denoiser, applied to noisy observations of the signal. In particular, the IEM between a pair of signals is obtained by comparing their denoising error vectors over a range of noise amplitudes. Geometrically, this amounts to comparing the score vector fields of the *blurred* density around the signals over a range of blur levels. We prove that the IEM is a valid global metric and derive a closed-form expression for its local second-order approximation, which yields a Riemannian metric. For Gaussian-distributed signals, the IEM coincides with the Mahalanobis distance. But for more complex distributions, it adapts, both locally and globally, to the geometry of the distribution. In practice, the IEM can be computed using a learned denoiser (analogous to generative diffusion models) and solving a one-dimensional integral. To demonstrate the value of our framework, we learn an IEM on the ImageNet database. Experiments show that this IEM is competitive with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.
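
As a point of reference for the Gaussian special case mentioned above, the following sketch computes a Mahalanobis distance. It assumes a diagonal covariance purely to stay dependency-free, and it is not an implementation of the IEM itself, which per the abstract requires a learned denoiser and a one-dimensional integral; `mahalanobis_diag` is a hypothetical helper name.

```python
import math


def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance between x and y under a diagonal covariance.

    Assumption: the covariance is diag(variances); a full covariance
    would need a matrix inverse instead of per-coordinate division.
    """
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


# With unit variances the distance reduces to the Euclidean distance.
d_iso = mahalanobis_diag([0.0, 0.0], [3.0, 4.0], [1.0, 1.0])
# Larger variance along an axis shrinks differences along that axis.
d_aniso = mahalanobis_diag([0.0, 0.0], [3.0, 4.0], [9.0, 16.0])
```

The second call shows the sense in which such a distance adapts to the distribution: the same displacement counts for less along directions where the data vary more.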

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

178. Partition Generative Modeling: Masked Modeling Without Masks

Authors:

Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive (AR) models through parallel decoding. However, MGMs typically operate on fixed-length inputs, which can be inefficient: early in sampling, most tokens are masked and carry little information, leading to wasted computation. In contrast, AR models process only tokens generated previously, making early iterations faster. In this work, we introduce the "Partition Generative Model" (PGM), a novel approach that combines the strengths of AR models and MGMs. Rather than masking, PGM partitions tokens into two groups and employs sparse attention to block information flow between them. Since there is no information flow between partitions, the model can process only the previously generated tokens during sampling, while retaining the ability to generate tokens in parallel and in any order. On OpenWebText, PGMs offer at least $5\times$ improvements in sampling latency and throughput, while producing samples with superior generative perplexity, compared to Masked Diffusion Language Models. On the ImageNet dataset, PGMs achieve up to $7\times$ better throughput compared to MaskGIT with only a small change in FID. Finally, we show that PGMs are compatible with distillation methods for MGMs, enabling further inference speedups.
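
The idea of blocking information flow between two token groups can be sketched as a boolean attention mask. This is a simplified illustration under the assumption that attention is allowed only within a group; the abstract does not give the actual PGM attention pattern, and `partition_attention_mask` is a hypothetical name.

```python
def partition_attention_mask(groups):
    """Return mask[i][j] = True iff token i may attend to token j.

    Assumption: attention is permitted only between tokens that share
    a partition label, so no information crosses the partition.
    """
    n = len(groups)
    return [[groups[i] == groups[j] for j in range(n)] for i in range(n)]


# Four tokens alternately assigned to partitions 0 and 1.
mask = partition_attention_mask([0, 1, 0, 1])
```

Since rows belonging to one partition never reference the other, tokens from a single partition can be processed on their own, which is the property the abstract credits for faster sampling.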

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

179. On the Reasoning Abilities of Masked Diffusion Language Models

Authors:

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

180. Peak-Return Greedy Slicing: Subtrajectory Selection for Transformer-based Offline RL

Authors:

Offline reinforcement learning enables policy learning solely from fixed datasets, without costly or risky environment interactions, making it highly valuable for real-world applications. While Transformer-based approaches have recently demonstrated strong sequence modeling capabilities, they typically learn from complete trajectories conditioned on final returns. To mitigate this limitation, we propose the Peak-Return Greedy Slicing (PRGS) framework, which explicitly partitions trajectories at the timestep level and emphasizes high-quality subtrajectories. PRGS first leverages an MMD-based return estimator to characterize the distribution of future returns for state-action pairs, yielding optimistic return estimates. It then performs greedy slicing to extract high-quality subtrajectories for training. During evaluation, an adaptive history truncation mechanism is introduced to align the inference process with the training procedure. Extensive experiments across multiple benchmark datasets indicate that PRGS significantly improves the performance of Transformer-based offline reinforcement learning methods by effectively enhancing their ability to exploit and recombine valuable subtrajectories.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

181. SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models

Authors:

Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles—questioner, responder, and verifier—within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder’s output and the questioner's reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model’s evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

182. SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Authors:

Large Language Models (LLMs) can enhance their reasoning by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn settings using Reinforcement Learning (RL) often exhibits training instability and degraded performance. We attribute the instability to harmful negative samples resulting from distributional drift and compounding errors induced by using external tool outputs during multi-turn rollout. To address this issue, we introduce SimpleTIR, a simple method that stabilizes multi-turn TIR training via filtering out trajectories with "void turns", i.e., turns that yield neither a code block nor a final answer. Specifically, we remove those trajectories from the policy update to block harmful gradients, while retaining them in advantage estimation to keep the estimate unbiased. Extensive experiments show that SimpleTIR effectively mitigates gradient norm explosion and stabilizes multi-turn RL training from base models. It achieves state-of-the-art performance on challenging math reasoning benchmarks, including an AIME24 score of 50.5 starting from the Qwen2.5-7B base model. SimpleTIR also promotes more diverse reasoning behaviors such as self-correction and cross-validation, outperforming prior methods trained from stronger instruction-tuned models.
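
The filtering rule described above can be sketched as follows. The sketch assumes a group-mean baseline for the advantage (in the style of group-relative methods), which the abstract does not specify; the `Turn` and `Trajectory` types and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    has_code: bool
    has_final_answer: bool


@dataclass
class Trajectory:
    turns: list
    reward: float


def has_void_turn(traj):
    """A void turn yields neither a code block nor a final answer."""
    return any(not (t.has_code or t.has_final_answer) for t in traj.turns)


def advantages_and_mask(group):
    """Compute advantages over ALL trajectories, then mask the policy update.

    Assumption: advantage = reward minus the group mean reward.
    Void-turn trajectories still contribute to the mean but get mask=False,
    so they would be excluded from the policy-gradient loss.
    """
    mean_r = sum(t.reward for t in group) / len(group)
    advantages = [t.reward - mean_r for t in group]
    mask = [not has_void_turn(t) for t in group]
    return advantages, mask


group = [
    Trajectory([Turn(True, False), Turn(False, True)], 1.0),
    Trajectory([Turn(False, False), Turn(False, True)], 0.0),  # has a void turn
]
adv, mask = advantages_and_mask(group)
```

Keeping the masked trajectory in the baseline while dropping it from the update mirrors the abstract's distinction between advantage estimation and the policy update.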

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

183. Overlap-weighted orthogonal meta-learner for treatment effect estimation over time

Authors:

Estimating heterogeneous treatment effects (HTEs) in time-varying settings is particularly challenging, as the probability of observing certain treatment sequences decreases exponentially with longer prediction horizons. Thus, the observed data contain little support for many plausible treatment sequences, which creates severe overlap problems. Existing meta-learners for the time-varying setting typically assume adequate treatment overlap, and thus suffer from exploding estimation variance when the overlap is low. To address this problem, we introduce a novel overlap-weighted orthogonal (WO) meta-learner for estimating HTEs that targets regions in the observed data with high probability of receiving the interventional treatment sequences. This offers a fully data-driven approach through which our WO-learner can counteract the instabilities of existing meta-learners and thus obtain more reliable HTE estimates. Methodologically, we develop a novel Neyman-orthogonal population risk function that minimizes the overlap-weighted oracle risk. We show that our WO-learner has the favorable property of Neyman-orthogonality, meaning that it is robust against misspecification in the nuisance functions. Further, our WO-learner is fully model-agnostic and can be applied to any machine learning model. Through extensive experiments with both transformer and LSTM backbones, we demonstrate the benefits of our novel WO-learner.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

184. PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints

Authors:

Semantic-level watermarking (SWM) for large language models (LLMs) enhances watermarking robustness against text modifications and paraphrasing attacks by treating the sentence as the fundamental unit. However, existing methods still lack strong theoretical guarantees of robustness, and reject-sampling–based generation often introduces significant distribution distortions compared with unwatermarked outputs. In this work, we introduce a new theoretical framework on SWM through the concept of proxy functions (PFs) -- functions that map sentences to scalar values. Building on this framework, we propose **PMark**, a simple yet powerful SWM method that estimates the PF median for the next sentence dynamically through sampling while enforcing multiple PF constraints (which we call channels) to strengthen watermark evidence. Equipped with solid theoretical guarantees, **PMark** achieves the desired distortion-free property and improves the robustness against paraphrasing-style attacks. We also provide an empirically optimized version that further removes the requirement for dynamical median estimation for better sampling efficiency. Experimental results show that **PMark** consistently outperforms existing SWM baselines in both text quality and robustness, offering a more effective paradigm for detecting machine-generated text. The source code is available at https://anonymous.4open.science/r/PMark.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 openreview 📄 Download PDF

185. Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples

Authors:

Generative AI has substantially facilitated realistic image synthesis, posing great challenges for reliable forensics. When image forensic detectors are deployed in the wild, the inputs have usually undergone various distortions including compression, rescaling, and lossy transmission. Such distortions severely erode forensic traces and make a detector fail silently, returning an over-confident binary prediction while being incapable of making a reliable decision, as the detector cannot explicitly perceive the degree of data distortion. This paper argues that reliable forensics must therefore move beyond "is the image real or fake?" to also ask "how trustworthy is the detector's decision on the image?" We formulate this requirement as Detector's Distortion-Aware Confidence (DAC): a sample-level confidence that a given detector could properly handle the input. Taking AI-generated image detection as an example, we empirically discover that detection accuracy drops almost monotonically with full-reference image quality scores as distortion becomes more severe, while such references are in fact unavailable at test time. Guided by this observation, the Distortion-Aware Confidence Model (DACOM) is proposed as a useful assistant to the forensic detector. DACOM utilizes full-reference image quality assessment to provide oracle statistical information that labels the detectability of images for training, and integrates intermediate forensic features of the detector, no-reference image quality descriptors, and distortion-type cues to estimate DAC. With the estimated confidence score, it is possible to conduct selective abstention and multi-detector routing to improve the overall accuracy of a detection system. Extensive experiments have demonstrated the effectiveness of our approach.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

186. Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Authors:

The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by an established post-training procedure, enables the development of X-LLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, X-LLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3’s proprietary 36T-token corpus for pretraining, X-LLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we release the complete training recipe, data sources, data mixing ratio, and model checkpoints, together with the key insights obtained throughout this study.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

187. A Law of Data Reconstruction for Random Features (And Beyond)

Authors:

Large-scale deep learning models are known to *memorize* parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. In this work, we consider memorization from the perspective of *data reconstruction*, demonstrating that this can be achieved when $p$ is larger than $dn$, where $d$ is the dimensionality of the data. More specifically, we show that, in the random features model, when $p \gg dn$, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a *law of data reconstruction*, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

188. UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Authors:

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

189. SAM 3: Segment Anything with Concepts

Authors:

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of a vision backbone shared between an image-level detector and a memory-based video tracker. Recognition and localization are decoupled with a presence head, which significantly boosts detection accuracy. SAM 3 delivers a 2x gain over existing systems in both image and video PCS, and improves previous SAM capabilities in interactive visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark.

📊 Review Scores

Average score: 7.00

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

190. Depth Anything 3: Recovering the Visual Space from Any Views

Authors:

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINOv2 encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

191. Adaptive Social Learning via Mode Policy Optimization for Language Agents

Authors:

Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack explicit reasoning or employ lengthy Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social behaviors in tasks such as negotiation or collaboration. To address this, we propose an Adaptive Social Learning (ASL) framework that aims to improve the adaptive reasoning ability of language agents in dynamic social interactions. To this end, we first identify hierarchical reasoning modes in this context, ranging from intuitive response to deep deliberation, based on cognitive control theory. We then develop the Adaptive Mode Policy Optimization (AMPO) algorithm to learn context-aware mode adaptation and reasoning. Our framework advances existing research in three key aspects: (1) multi-granular reasoning mode design, (2) context-aware mode switching in rich social interaction, and (3) token-efficient reasoning with depth adaptation. Extensive experiments on the benchmark social intelligence environment verify that ASL achieves 15.6% higher task performance than GPT-4o. Notably, AMPO outperforms GRPO by 7.0% with 32.8% shorter thinking chains, demonstrating the advantages of AMPO and the learned adaptive reasoning ability over GRPO's solution.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

192. Premise Selection for a Lean Hammer

Authors:

Neural methods are transforming automated reasoning for proof assistants, yet integrating these advances into practical verification workflows remains challenging. A *hammer* is a tool that integrates premise selection, translation to external automatic theorem provers, and proof reconstruction into one overarching tool to automate tedious reasoning steps. We present LeanPremise, a novel neural premise selection system, and we combine it with existing translation and proof reconstruction components to create LeanHammer, the first end-to-end domain-general hammer for the Lean proof assistant. Unlike existing Lean premise selectors, LeanPremise is specifically trained for use with a hammer in dependent type theory. It also dynamically adapts to user-specific contexts, enabling it to effectively recommend premises from libraries outside LeanPremise's training data as well as lemmas defined by the user locally. With comprehensive evaluations, we show that LeanPremise enables LeanHammer to solve 21% more goals than existing premise selectors and generalizes well to diverse domains. Our work helps bridge the gap between neural retrieval and symbolic reasoning, making formal verification more accessible to researchers and practitioners.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

193. Diagnosing and Improving Diffusion Models by Estimating Optimal Loss Value

Authors:

Diffusion models have achieved remarkable success in generative modeling. Although their training is more stable, the loss of diffusion models is not indicative of absolute data-fitting quality, since its optimal value is typically not zero but unknown, which makes it hard to tell a large optimal loss apart from insufficient model capacity. In this work, we advocate the need to estimate the optimal loss value for diagnosing and improving diffusion models. We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing training quality of representative diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models.
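A toy case may help show why the optimal loss is nonzero. The sketch below assumes an x0-prediction objective at a single noise level, with x_t = x_0 + sigma * eps, on a two-point 1D dataset. These choices and every function name are illustrative assumptions, not the paper's unified formulation or its estimators. Under these assumptions the posterior mean E[x_0 | x_t] is the MSE-optimal denoiser, so its loss is a floor that no model can go below.

```python
import math
import random

def posterior_mean(x_t, data, sigma):
    """MSE-optimal denoiser E[x0 | x_t] for a uniform empirical prior over `data`."""
    # Log-weight of each data point under the Gaussian kernel N(x_t; x_i, sigma^2).
    logits = [-(x_t - x_i) ** 2 / (2 * sigma ** 2) for x_i in data]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w * x_i for w, x_i in zip(weights, data)) / total

def estimate_losses(data, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimates of the optimal loss and of a naive baseline's loss."""
    rng = random.Random(seed)
    opt, ident = 0.0, 0.0
    for _ in range(n_samples):
        x0 = rng.choice(data)
        x_t = x0 + sigma * rng.gauss(0.0, 1.0)
        opt += (x0 - posterior_mean(x_t, data, sigma)) ** 2  # optimal denoiser
        ident += (x0 - x_t) ** 2                             # "do nothing" denoiser
    return opt / n_samples, ident / n_samples

optimal_loss, identity_loss = estimate_losses([-1.0, 1.0], sigma=1.0)
print(f"optimal loss ~ {optimal_loss:.3f}, identity-denoiser loss ~ {identity_loss:.3f}")
```

In this toy setting, a model's training loss is best read relative to `optimal_loss` rather than to zero, which is the kind of gap the abstract proposes to measure.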

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 8, 6, 8, 6

📄 openreview 📄 Download PDF

194. ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Authors:

Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We also co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
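The effect of a single Givens rotation on an outlier can be illustrated in a few lines. This is a minimal sketch with a hand-picked 45-degree angle and a made-up four-element weight group. The helper name `givens_rotate` is hypothetical, and the sketch omits the learned angles, channel-wise scaling, and GPU kernel that the abstract describes.

```python
import math

def givens_rotate(vec, i, j, theta):
    """Rotate coordinates i and j of `vec` by angle theta; other entries are unchanged."""
    c, s = math.cos(theta), math.sin(theta)
    out = list(vec)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out

weights = [8.0, 0.0, 0.5, -0.5]  # toy group: channel 0 is an outlier
rotated = givens_rotate(weights, 0, 1, math.pi / 4)
restored = givens_rotate(rotated, 0, 1, -math.pi / 4)  # the inverse is the opposite angle

peak_before = max(abs(w) for w in weights)
peak_after = max(abs(w) for w in rotated)
print(peak_before, round(peak_after, 3))
```

The rotation spreads the outlier's magnitude over two channels, so the largest absolute value in the group shrinks while the vector norm is preserved. Because the rotation is orthogonal, it can be undone exactly, which is what makes it usable as a pre-quantization transform.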

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

195. Sapiens2

Authors:

We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art, improves over the first generation on pose (+4 mAP), body-part segmentation (+22.3 mIoU), and normal estimation (+29.2 rel-angular error), and extends to new tasks such as pointmap and albedo estimation.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

196. VoMP: Predicting Volumetric Mechanical Property Fields

Authors:

Physical simulation relies on spatially-varying mechanical properties, typically laboriously hand-crafted. We present the first feed-forward model to predict fine-grained mechanical properties, Young’s modulus ($E$), Poisson’s ratio ($\nu$), and density ($\rho$), throughout *the volume* of 3D objects. Our model supports any 3D representation that can be rendered and voxelized, including Signed Distance Fields (SDFs), Gaussian Splats and Neural Radiance Fields (NeRFs). To achieve this, we aggregate per-voxel multi-view features for any input, which are passed to our trained Geometry Transformer to predict per-voxel material latent codes. These latents reside on the trained manifold of physically plausible materials, which we train on a real-world dataset, guaranteeing the validity of decoded per-voxel materials. To obtain object-level training data, we propose an annotation pipeline combining knowledge from segmented 3D datasets, material databases, and a vision-language model. Experiments show that VoMP estimates accurate volumetric properties and can convert 3D objects into simulation-ready assets, resulting in realistic deformable simulations and far outperforming prior art.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 8, 6, 8, 6

📄 openreview 📄 Download PDF

197. DemoGrasp: Universal Dexterous Grasping from a Single Demonstration

Authors:

Universal grasping with multi-fingered dexterous hands is a fundamental challenge in robotic manipulation. While recent approaches successfully learn closed-loop grasping policies using reinforcement learning (RL), the inherent difficulty of high-dimensional, long-horizon exploration necessitates complex reward and curriculum design, often resulting in suboptimal solutions across diverse objects. We propose DemoGrasp, a simple yet effective method for learning universal dexterous grasping. We start from a single successful demonstration trajectory of grasping a specific object and adapt to novel objects and poses by editing the robot actions in this trajectory: changing the wrist pose determines where to grasp, and changing the hand joint angles determines how to grasp. We formulate this trajectory editing as a single-step Markov Decision Process (MDP) and use RL to optimize a universal policy across hundreds of objects in parallel in simulation, with a simple reward consisting of a binary success term and a robot–table collision penalty. In simulation, DemoGrasp achieves a 95% success rate on DexGraspNet objects using the Shadow Hand, outperforming previous state-of-the-art methods. It also shows strong transferability, achieving an average success rate of 84.6% across diverse dexterous hand embodiments on six unseen object datasets, while being trained on only 175 objects. Through vision-based imitation learning, our policy successfully grasps 110 unseen real-world objects, including small, thin items. It generalizes to spatial, background, and lighting changes, supports both RGB and depth inputs, and extends to language-guided grasping in cluttered scenes.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

198. Uncover Underlying Correspondence for Robust Multi-view Clustering

Authors:

Multi-view clustering (MVC) aims to group unlabeled data into semantically meaningful clusters by leveraging cross-view consistency. However, real-world datasets collected from the web often suffer from noisy correspondence (NC), which breaks the consistency prior and results in unreliable alignments. In this paper, we identify two critical forms of NC that particularly harm clustering: i) category-level mismatch, where semantically consistent samples from the same class are mistakenly treated as negatives; and ii) sample-level mismatch, where collected cross-view pairs are misaligned and some samples may even lack any valid counterpart. To address these challenges, we propose a generative framework that formulates noisy correspondence learning in MVC as maximum likelihood estimation over underlying cross-view correspondences. The objective is elegantly solved via an Expectation–Maximization algorithm: in the E-step, soft correspondence distributions are inferred across views, capturing class-level relations while adaptively down-weighting noisy or unalignable samples through GMM-guided marginals; in the M-step, the embedding network is updated to maximize the expected log-likelihood. Extensive experiments on both synthetic and real-world noisy datasets demonstrate that our method significantly improves clustering robustness. The code will be released upon acceptance.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 8, 6, 6, 8

📄 openreview 📄 Download PDF

199. Instilling an Active Mind in Avatars via Cognitive Simulation

Authors:

Current video avatar models can generate fluid animations but struggle to capture a character's authentic essence, primarily synchronizing motion with low-level audio cues instead of understanding higher-level semantics like emotion or intent. To bridge this gap, we propose a novel framework for generating character animations that are not only physically plausible but also semantically rich and expressive. Our model is built on two technical innovations. First, we employ Multimodal Large Language Models to generate a structured textual representation from input conditions, providing high-level semantic guidance for creating contextually and emotionally resonant actions. Second, to ensure robust fusion of multimodal signals, we introduce a specialized Multimodal Diffusion Transformer architecture featuring a novel Pseudo Last Frame design. This allows our model to accurately interpret the joint semantics of audio, images and text, generating motions that are deeply coherent with the overall context. Comprehensive experiments validate the superiority of our method, which achieves compelling results in lip-sync accuracy, video quality, motion naturalness, and semantic consistency. The approach also shows strong generalization to challenging scenarios, including multi-person and non-human subjects. Our video results are available at https://anonymous.4open.science/w/InstillinganActiveMindinAvatars_Anonymous/

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

200. Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Authors:

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce Fast-dLLM, a method that incorporates a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, Fast-dLLM also proposes a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
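The confidence-thresholded selection step described above can be sketched as follows. This is an illustrative toy, not the Fast-dLLM implementation: the function name, the 0.9 default threshold, and the fallback of committing the single most confident token when nothing clears the threshold (so that every step makes progress) are all assumptions made for this example.

```python
def select_tokens_to_unmask(probs, masked_positions, threshold=0.9):
    """Pick which masked positions to decode in this step.

    probs[pos] is the predicted distribution over token ids at position pos.
    Returns {position: token_id} for positions whose top-1 probability exceeds
    `threshold`; if none does, commits only the most confident position.
    """
    best = {}
    for pos in masked_positions:
        dist = probs[pos]
        token = max(range(len(dist)), key=dist.__getitem__)  # top-1 token id
        best[pos] = (token, dist[token])
    chosen = {p: t for p, (t, conf) in best.items() if conf > threshold}
    if not chosen and best:
        # Assumed fallback: decode one token so the loop always terminates.
        p = max(best, key=lambda q: best[q][1])
        chosen[p] = best[p][0]
    return chosen

toy_probs = [
    [0.05, 0.95, 0.00],  # position 0: confident
    [0.50, 0.30, 0.20],  # position 1: uncertain
    [0.01, 0.01, 0.98],  # position 2: confident
]
first_step = select_tokens_to_unmask(toy_probs, [0, 1, 2])
second_step = select_tokens_to_unmask(toy_probs, [1])
print(first_step, second_step)
```

Positions 0 and 2 are decoded together in the first step, while the uncertain position 1 is deferred. In a real decoder its distribution would be recomputed after the other tokens are fixed, which is how the threshold limits violations of token dependencies.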

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

201. What matters for Representation Alignment: Global Information or Spatial Structure?

Authors:

Representation alignment helps generation by distilling representations from a pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: "what aspect of the target representation matters for generation, its global information (measured by ImageNet-1K accuracy) or its spatial structure (pairwise cosine similarity between patch tokens)?" Prevailing wisdom holds that a target representation with stronger global performance leads to better generation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising: spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications that specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves the convergence speed of REPA across a diverse set of vision encoders, model sizes, and training variants (such as REPA-E and meanflow with REPA). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models.
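The spatial-structure measure named in the abstract, pairwise cosine similarity between patch tokens, can be computed as in the sketch below. The function name and the three toy 2-D tokens are made up for illustration, and the sketch assumes no token has zero norm. How the paper aggregates this matrix into a single score is not stated in the abstract and is not attempted here.

```python
import math

def pairwise_cosine(tokens):
    """Return the N x N matrix of cosine similarities between patch tokens.

    Assumes every token has a nonzero norm.
    """
    norms = [math.sqrt(sum(v * v for v in t)) for t in tokens]
    n = len(tokens)
    return [
        [
            sum(a * b for a, b in zip(tokens[i], tokens[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]

# Toy patch tokens: the first two point the same way, the third is orthogonal.
patch_tokens = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sim = pairwise_cosine(patch_tokens)
for row in sim:
    print([round(x, 3) for x in row])
```

The matrix depends only on the directions of the tokens relative to each other, not on their scale, which is why it captures how patches relate spatially rather than what a pooled global feature encodes.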

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 8, 8, 6, 6

📄 openreview 📄 Download PDF

202. Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

Authors:

Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and in some cases, to perform visual tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, the perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline, from LLM pre-training to visual alignment and supervised multimodal fine-tuning, across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we also propose and investigate several hypotheses, and introduce a Multi-Level Existence Bench (MLE-Bench) to facilitate future research. Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.
We recommend a visit to our anonymous project page (https://anonymouspaperweb.github.io/lsbs/) for an interactive reading.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

203. VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Authors:

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do the choice and specific capabilities of the underlying VLM affect the performance of VLA policies? We introduce **VLM4VLA**, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Our pipeline, though simple, proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that a VLM's general capabilities are poor predictors of its downstream task performance, contrary to common assumptions. Inconsistencies across benchmarks suggest that VLA policies require capabilities beyond what current VLMs pursue. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Lastly, our analysis also reveals that the vision encoder is a critical bottleneck, and the ability to fine-tune it is crucial for strong performance. These results highlight a significant gap between current VLM pretraining paradigms and the specific demands of embodied tasks. We will release our code, models, and evaluation logs at our anonymous website (https://sites.google.com/view/vlm4vla) to encourage further research and support better understanding in this direction.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

204. Anime-Ready: Controllable 3D Anime Character Generation with Body-Aligned Component-Wise Garment Modeling

Authors:

3D anime character generation has become increasingly important in digital entertainment, including animation production, virtual reality, gaming, and virtual influencers. Unlike realistic human modeling, anime-style characters require exaggerated proportions, stylized surface details, and artistically consistent garments, posing unique challenges for automated 3D generation. Previous approaches for 3D anime character generation often suffer from low mesh quality and blurry textures, and they typically do not provide corresponding skeletons, limiting their usability in animation. In this work, we present a novel framework for high-quality 3D anime character generation that overcomes these limitations by combining the expressive power of the Skinned Multi-Person Linear (SMPL) model with precise garment generation. Our approach extends the Anime-SMPL model to better capture the distinct features of anime characters, enabling unified skeleton generation and blendshape-based facial expression control. This results in fully animation-ready 3D characters with expressive faces, bodies, and garments. To complement the body model, we introduce a body-aligned component-wise garment generation pipeline (covering hairstyles, upper garments, lower garments, and accessories), which models garments as structured components aligned with body geometry. Furthermore, our method produces high-quality skin and facial textures, as well as detailed garment textures, enhancing the visual fidelity of the generated characters. Experimental results demonstrate that our framework significantly outperforms baseline methods in terms of mesh quality, texture clarity, and garment-body alignment, making it suitable for a wide range of applications in anime content creation and interactive media.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

205. Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

Authors:

The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits their application to simulation, where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. With this approach, the 3DGS decoder can be trained purely on synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation. Video results: https://anonlyra.github.io/anonlyra

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

206. SAM-Veteran: An MLLM-Based Human-like SAM Agent for Reasoning Segmentation

Authors:

Significant progress has been made in reasoning segmentation by combining multi-modal large language models (MLLMs) with the Segment Anything Model (SAM): the former excel in reasoning and vision–language alignment, while the latter offers powerful pixel-level understanding. However, current paradigms fall short in exploiting SAM’s strengths, especially the ability to support iterative mask refinement by interactive segmentation, a process that human users can naturally perform. To bridge this gap, we introduce **SAM-Veteran**, an experienced mask-aware SAM agent capable of emulating human interaction with SAM via a reasoning-driven segmentation workflow that integrates (i) generating bounding boxes given image–query pairs for SAM input, (ii) proposing refinement points based on SAM-generated masks, and (iii) adaptively terminating the process. Aiming for this goal, we propose a multi-task reinforcement learning framework based on Group Relative Policy Optimization (GRPO), which enhances the MLLM’s abilities in textual grounding and mask comprehension. Furthermore, we introduce a dynamic sampling strategy tailored for generating both boxes and points to stabilize training. Extensive experiments across diverse datasets show that SAM-Veteran achieves human-like interaction with SAM and establishes new state-of-the-art performance on both in-domain and out-of-domain benchmarks.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

207. 3DGEER: 3D Gaussian Rendering Made Exact and Efficient for Generic Cameras

Authors:

3D Gaussian Splatting (3DGS) achieves an appealing balance between rendering quality and efficiency, but relies on approximating 3D Gaussians as 2D projections—an assumption that degrades accuracy, especially under generic large field-of-view (FoV) cameras. Despite recent extensions, no prior work has simultaneously achieved both projective exactness and real-time efficiency for general cameras. We introduce 3DGEER, a geometrically exact and efficient Gaussian rendering framework. From first principles, we derive a closed-form expression for integrating Gaussian density along a ray, enabling precise forward rendering and differentiable optimization under arbitrary camera models. To retain efficiency, we propose the Particle Bounding Frustum (PBF), which provides tight ray–Gaussian association without BVH traversal, and the Bipolar Equiangular Projection (BEAP), which unifies FoV representations, accelerates association, and improves reconstruction quality. Experiments on both pinhole and fisheye datasets show that 3DGEER outperforms prior methods across all metrics, runs 5x faster than existing projective exact ray-based baselines, and generalizes to wider FoVs unseen during training—establishing a new state of the art in real-time radiance field rendering.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 10

Reviewers: 4

Individual scores: 6, 6, 6, 10

📄 openreview 📄 Download PDF

208. Train-before-Test Harmonizes Language Model Rankings

Authors:

Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare *model potential* by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach *train-before-test*. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
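One way to quantify whether two benchmarks agree on a model ranking is a rank correlation. The sketch below uses Kendall's tau, which is a choice made for this illustration and is not stated in the abstract. All scores are made-up toy numbers, not results from the paper, and the variable names are hypothetical.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same models (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1   # both benchmarks order this pair the same way
        elif sign < 0:
            discordant += 1   # the benchmarks disagree on this pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up scores for four hypothetical models on two hypothetical benchmarks.
toy_direct_a = [0.61, 0.55, 0.70, 0.48]     # direct evaluation, benchmark A
toy_direct_b = [0.40, 0.52, 0.45, 0.50]     # direct evaluation, benchmark B
toy_finetuned_a = [0.72, 0.68, 0.80, 0.60]  # after benchmark-specific fine-tuning, A
toy_finetuned_b = [0.66, 0.63, 0.74, 0.58]  # after benchmark-specific fine-tuning, B

tau_direct = kendall_tau(toy_direct_a, toy_direct_b)
tau_finetuned = kendall_tau(toy_finetuned_a, toy_finetuned_b)
print(round(tau_direct, 3), round(tau_finetuned, 3))
```

A tau near 1 means the two benchmarks rank the models almost identically, which is the kind of cross-benchmark consistency the abstract reports for train-before-test. A tau near 0 or below means the rankings conflict.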

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

209. Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning

Authors:

Deepfake detection remains a formidable challenge due to the evolving nature of fake content in real-world scenarios. However, existing benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical usage of current detectors. To mitigate this gap, we introduce **HydraFake**, a dataset that contains diversified deepfake techniques and in-the-wild forgeries, along with a rigorous training and evaluation protocol covering unseen model architectures, emerging forgery techniques, and novel data domains. Building on this resource, we propose **Veritas**, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce *pattern-aware reasoning*, which involves critical patterns such as "planning" and "self-reflection" to emulate the human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. Experiments on the HydraFake dataset reveal that although previous detectors generalize well in cross-model scenarios, they fall short on unseen forgeries and data domains. Veritas achieves significant gains across different out-of-domain (OOD) scenarios, and is capable of delivering transparent and faithful detection outputs.

📊 Review Scores

Average: 7.00

Lowest: 6

Highest: 8

Reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

210. Explainable $K$-means Neural Networks for Multi-view Clustering

Authors:

Although multi-view clustering has made great progress in past decades, it remains a challenge to balance the effectiveness, efficiency, completeness, and consistency of nonlinearly separable clustering for data from different views. To address this challenge, we show that multi-view clustering can be regarded as a three-level optimization problem. Specifically, we divide multi-view clustering into three sub-problems based on $K$-means or kernel $K$-means: linear clustering on the original multi-view dataset, nonlinear clustering on the set of obtained linear clusters, and multi-view clustering by integrating partition matrices from different views obtained by linear and nonlinear clustering based on reconstruction. We propose Explainable $K$-means Neural Networks (EKNN) and show how to unify these three sub-problems into a framework based on EKNN. It simultaneously considers the effectiveness, efficiency, completeness, and consistency of nonlinear multi-view clustering and can be optimized by an iterative algorithm. EKNN is explainable since the effect of each layer is known. To the best of our knowledge, this is the first attempt to balance effectiveness, efficiency, completeness, and consistency by dividing multi-view clustering into three different sub-problems. Extensive experimental results demonstrate the effectiveness and efficiency of EKNN compared with other multi-view clustering methods on different datasets in terms of different metrics.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

211. AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

Authors:

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present **AVoCaDO**, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) **AVoCaDO SFT**, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) **AVoCaDO GRPO**, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC benchmark under visual-only settings. The model will be made publicly available to facilitate future research in audiovisual video understanding and generation.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

212. Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

Authors:

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but have achieved only limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256×256 res.) and from 1024 to 48 (512×512 res.) without compromising quality on ImageNet class-conditional generation, while achieving at least 3.4× lower latency than previous parallelized autoregressive models.
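As a quick arithmetic check on the step counts quoted above, the following minimal sketch computes how many tokens are decoded per step on average. It assumes the baseline produces one token per step, so 256 and 1024 are also the token counts; the function name is illustrative and not from the paper.

```python
def avg_tokens_per_step(num_tokens: int, num_steps: int) -> float:
    """Average number of tokens generated per decoding step."""
    return num_tokens / num_steps


# (token count, parallel decoding steps) as quoted in the abstract.
settings = {"256x256": (256, 20), "512x512": (1024, 48)}

for resolution, (tokens, steps) in settings.items():
    # With a one-token-per-step baseline, this ratio is also the
    # factor by which the number of sequential steps shrinks.
    print(resolution, round(avg_tokens_per_step(tokens, steps), 1))
```

Under that assumption the average group size is about 12.8 tokens per step at 256×256 and about 21.3 at 512×512; the latency gain reported in the abstract is a separate, measured quantity.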

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

213. MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

Authors:

Long video generation with diffusion transformers is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query–key pairs. Existing sparse methods rely on block-wise coarse estimation, whose accuracy–efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight learnable token router to precisely match tokens without block-wise estimation. Through semantics-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, it integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Built on MoGA, we develop an efficient long video generation model that produces, end to end, minute-level, multi-shot, 480p videos at 24 FPS with a context length of approximately 580K. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach. We provide an anonymous link, https://anonymous.4open.science/r/MoGA, to showcase the generated videos.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 8, 6

📄 openreview 📄 Download PDF

214. RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

Authors:

Large language and vision-language models have inspired end-to-end vision-language-action (VLA) systems in robotics, yet existing robot datasets remain costly, embodiment-specific, and insufficient, limiting robustness and generalization. Recent approaches address this by adopting a plan-then-execute paradigm, where high-level plans are generated before being translated into low-level actions, but their success depends on fine-grained intermediate supervision that current datasets lack. To fill this gap, we present the RoboInter Manipulation Suite, a unified resource for data, benchmarking, and modeling of intermediate representations. It includes RoboInter-Tool, a lightweight GUI for semi-automatic per-frame annotation of embodied videos, and RoboInter-Data, a human-verified dataset with over 200k episodes across 571 diverse scenes, offering dense per-frame alignment across more than nine intermediate categories and surpassing prior work in both scale and quality. Building on this foundation, RoboInter-VQA introduces 8 spatial and 20 temporal embodied QA categories to benchmark and enhance the embodied capabilities of current large vision-language models, while RoboInter-VLA provides a flexible plan-then-execute framework with modular and end-to-end variants that link planning to execution. Together, these contributions establish the RoboInter Manipulation Suite as a foundation for advancing generalizable and robust robotic learning through fine-grained intermediate supervision.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

215. Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Authors:

Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin’s superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 8, 6

📄 openreview 📄 Download PDF

216. Federated Graph-Level Clustering Network with Dual Knowledge Separation

Authors:

Federated Graph-level Clustering (FGC) offers a promising framework for analyzing distributed graph data while ensuring privacy protection. However, existing methods fail to simultaneously consider knowledge heterogeneity within and across clients, and still attempt to share as much knowledge as possible, resulting in consensus failure on the server. To solve these issues, we propose a novel **F**ederated **G**raph-level **C**lustering **N**etwork with **D**ual **K**nowledge **S**eparation (FGCN-DKS). The core idea is to decouple differentiated subgraph patterns and optimize them separately on the client, and then leverage cluster-oriented patterns to guide personalized knowledge aggregation on the server. Specifically, on the client, we separate personalized variant subgraphs and cluster-oriented invariant subgraphs for each graph. The former are retained locally for further refinement of the clustering process, while pattern digests are extracted from the latter and uploaded to the server. On the server, we calculate the relations among inter-cluster patterns to adaptively aggregate cluster-oriented prototypes and parameters. Finally, the server generates personalized guidance signals for each cluster of clients, which are then fed back to local clients to enhance overall clustering performance. Extensive experiments on multiple graph benchmark datasets demonstrate the superiority of the proposed FGCN-DKS over SOTA methods.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 8

📄 openreview 📄 Download PDF

217. Image Quality Assessment for Embodied AI

Authors:

Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various real-world distortions limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) construct, based on the Mertonian system and meta-cognitive theory, a perception-cognition-decision-execution pipeline and define a comprehensive subjective score collection process; (2) establish the Embodied-IQA database, containing over 30k reference/distorted image pairs with more than 5 million fine-grained annotations provided by Vision Language Models, Vision Language Action models, and real-world robots; and (3) train and validate mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that, through evaluation, we can promote the application of Embodied AI under complex real-world distortions.

📊 Review Scores

Average score: 7.00

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 8

📄 openreview 📄 Download PDF

218. How to train data-efficient LLMs

Authors:

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, AskLLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose density sampling, which models the data distribution to select a diverse sample. Testing the effect of 22 different data curation techniques on the pre-training of T5-style models, involving hundreds of pre-training runs and post fine-tuning evaluation tasks, we find that AskLLM and density sampling are the best methods in their respective categories. While coverage sampling techniques often recover the performance of training on the entire dataset, training on data curated via AskLLM consistently outperforms full-data training, even when we sample only 10% of the original dataset, while converging up to 70% faster.

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 6, 6, 8, 6, 8

📄 openreview 📄 Download PDF

219. Understanding the Learning Phases in Self-Supervised Learning via Critical Periods

Authors:

Self-supervised learning (SSL) has emerged as a powerful pretraining strategy to learn transferable representations from unlabeled data. Yet, it remains unclear how long SSL models should be pretrained for such representations to emerge. Contrary to the prevailing heuristic that longer pretraining translates to better downstream performance, we identify a transferability trade-off: across multiple SSL methods, architectures, and datasets, we observe intermediate checkpoints yielding stronger out-of-domain (OOD) generalization, while models pretrained longer tend to instead only improve in-domain (ID) accuracy. From this observation, we hypothesize that SSL progresses through learning phases that can be characterized through the lens of critical periods (CP). Prior work on CP has shown that neural networks trained under supervised learning exhibit early phases of high plasticity, followed by a consolidation phase where adaptability declines but task-specific performance keeps increasing. Since traditional CP analysis depends on supervised labels, for SSL we rethink CP in two ways. First, we inject deficits to perturb the pretraining data and measure the quality of learned representations via downstream tasks. Second, to estimate network plasticity during pretraining we compute the Fisher Information matrix on pretext objectives, quantifying the sensitivity of model parameters to the supervisory signal defined by the pretext tasks. We conduct several experiments to demonstrate that SSL models do exhibit their own CP, with CP closure marking a sweet spot where representations are neither underdeveloped nor overfitted to the pretext task. Leveraging these insights, we propose CP-guided checkpoint selection as a mechanism for identifying intermediate checkpoints during SSL that improve OOD transferability. Finally, to balance the transferability trade-off, we propose CP-guided self-distillation, which selectively distills layer representations from the sweet spot (CP closure) checkpoint into their overspecialized counterparts in the final pretrained model.
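To make the Fisher-based plasticity measure more concrete, here is a toy sketch; it is not the paper's procedure. For a logistic-regression model the Fisher information matrix is E[p(1-p) x xᵀ], so its trace is E[p(1-p)‖x‖²]. The model, function name, and data below are hypothetical stand-ins for a deep network and a pretext objective.

```python
import math


def fisher_trace_logistic(weights, inputs):
    """Trace of the Fisher information of a logistic-regression model,
    averaged over the given inputs: mean of p * (1 - p) * ||x||^2."""
    total = 0.0
    for x in inputs:
        logit = sum(w * xi for w, xi in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-logit))  # predicted probability of class 1
        total += p * (1.0 - p) * sum(xi * xi for xi in x)
    return total / len(inputs)


inputs = [[1.0, 0.0], [0.0, 2.0]]
early = fisher_trace_logistic([0.0, 0.0], inputs)    # uncommitted model
late = fisher_trace_logistic([10.0, 10.0], inputs)   # saturated, confident model
print(early, late)
```

In this toy setting the trace is largest when predictions are uncertain and shrinks as the model saturates, which is one simple way to read "declining plasticity"; how the paper aggregates the Fisher matrix over layers and pretext tasks is not specified in the abstract.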

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 6, 8, 8, 6, 6

📄 openreview 📄 Download PDF

220. Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

Authors:

While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: the computations on non-informative 1) mask tokens inherent to the masking paradigm, and 2) padding tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) providing an intrinsic self-correction mechanism during generation, since insertion dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we efficiently solve via a parallelized dynamic programming algorithm. Our experiments across fixed and variable-length settings demonstrate the advantage of DID over MDLM baselines and existing insertion-based LMs in terms of modeling performance, sampling quality, and training/inference speed.
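The subsequence counting mentioned above can be illustrated with the classic sequential dynamic program below. This is a minimal sketch of the textbook algorithm, assuming the quantity of interest is the number of ways a shorter token sequence can be obtained from a longer one by deletions; it is not the parallelized variant the abstract refers to, and the function name is illustrative.

```python
from typing import Sequence


def count_subsequence_occurrences(source: Sequence[str], target: Sequence[str]) -> int:
    """Count the ways `target` can be obtained from `source` by deleting tokens."""
    # ways[j] = number of ways to form the first j tokens of `target`
    # from the portion of `source` processed so far.
    ways = [1] + [0] * len(target)
    for token in source:
        # Go right to left so each source token extends a match at most once.
        for j in range(len(target), 0, -1):
            if target[j - 1] == token:
                ways[j] += ways[j - 1]
    return ways[len(target)]


print(count_subsequence_occurrences("a b a b".split(), ["a", "b"]))  # prints 3
```

The run time is O(len(source) × len(target)) with O(len(target)) memory; the three matches in the example use source positions (0, 1), (0, 3), and (2, 3).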

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 6, 8, 6, 6, 8

📄 openreview 📄 Download PDF

221. Online Decision Making with Generative Action Sets

Authors:

With advances in generative AI, decision-making agents can now dynamically create new actions during online learning, but action generation typically incurs costs that must be balanced against potential benefits. We study an online learning problem where an agent can generate new actions at any time step by paying a one-time cost, with these actions becoming permanently available for future use. The challenge lies in learning the optimal sequence of two-fold decisions: which action to take and when to generate new ones, further complicated by the triangular tradeoffs among exploitation, exploration and *creation*. To solve this problem, we propose a doubly-optimistic algorithm that employs Lower Confidence Bounds (LCB) for action selection and Upper Confidence Bounds (UCB) for action generation. Empirical evaluation on healthcare question-answering datasets demonstrates that our approach achieves favorable generation-quality trade-offs compared to baseline strategies. From theoretical perspectives, we prove that our algorithm achieves the optimal regret of $O(T^{\frac{d}{d+2}}d^{\frac{d}{d+2}} + d\sqrt{T\log T})$, providing the first sublinear regret bound for online learning with expanding action spaces.

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 8, 8, 6, 6, 6

📄 openreview 📄 Download PDF

222. QVGen: Pushing the Limit of Quantized Video Generative Models

Authors:

Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has proven notably successful in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present *QVGen*, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (*e.g.*, 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a *rank-decay* strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out additional inference overhead. Extensive experiments across 4 state-of-the-art (SOTA) video DMs, with parameter sizes ranging from 1.3B to 14B, show that QVGen is *the first* to reach quality comparable to full precision under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of +25.28 in Dynamic Degree and +8.43 in Scene Consistency on VBench. *Code and videos are available in the supplementary material.*

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 6, 8, 8, 6, 6

📄 openreview 📄 Download PDF

223. Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

Authors:

Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models with most test data being in-distribution questions these explanations. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for *specialization after generalization*—focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller *in-distribution* test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 6, 8, 8, 6, 6

📄 openreview 📄 Download PDF

224. The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

Authors:

In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and addressing key open questions in the field of automated proof generation. Specifically, it remains unknown (1) how large the gap is between natural language and formal proof generation, (2) how final-answer accuracy relates to full proof correctness, and (3) how best-of-n selection strategies can affect proof quality. In this work, we present *the Open Proof Corpus* (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we address the open questions outlined above and provide new insights into LLMs' strengths and limitations in mathematical reasoning. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that matches **Gemini-2.5-Pro**, and performs close to the best model, **GPT-5**, on evaluating proof correctness.

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 6, 8, 8, 6, 6

📄 openreview 📄 Download PDF

225. Quantized Visual Geometry Grounded Transformer

Authors:

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have achieved remarkable progress with large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has emerged as a common practice to compress and accelerate models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first **Quant**ization framework for **VGGT**s, namely **QuantVGGT**. It relies mainly on two technical contributions. First, we introduce *Dual-Smoothed Fine-Grained Quantization*, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design *Noise-Filtered Diverse Sampling*, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT can deliver a **3.7$\times$** memory reduction and **2.5$\times$** acceleration in real-hardware inference, while preserving over **98%** of the reconstruction accuracy of its full-precision counterpart. This demonstrates the advantages and practicality of QuantVGGT in resource-constrained scenarios.
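To illustrate why a Hadamard rotation helps with heavy-tailed activations, the sketch below applies a generic orthonormal Walsh-Hadamard transform to a vector with a single outlier. It is a textbook transform under the assumption of a power-of-two length, not QuantVGGT's exact "pre-global" rotation, and the function name is illustrative.

```python
import math


def hadamard_rotate(x):
    """Apply the orthonormal Walsh-Hadamard transform to a list of floats.

    The length of `x` must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in y]


activations = [8.0, 0.0, 0.0, 0.0]      # one large outlier channel
rotated = hadamard_rotate(activations)  # energy spread evenly over channels
print(max(abs(v) for v in activations), max(abs(v) for v in rotated))
```

The rotation preserves the vector norm but halves the largest magnitude here (8 becomes 4 in every channel), so a uniform quantizer needs a smaller range. Because the transform is orthonormal and symmetric, applying it a second time recovers the original vector.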

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 8

Number of reviewers: 5

Individual scores: 8, 6, 6, 8, 6

📄 openreview 📄 Download PDF

226. ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Authors:

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research.

📊 Review Scores

Average score: 6.80

Lowest score: 6

Highest score: 10

Number of reviewers: 5

Individual scores: 6, 6, 6, 10, 6

📄 openreview 📄 Download PDF

227. ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection

Authors:

Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

228. Why DPO is a Misspecified Estimator and How to Fix It

Authors:

Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
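For reference, the standard DPO objective that this abstract analyzes can be sketched for a single preference pair as below. The sketch computes -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]); the function and argument names and the default β = 0.1 are illustrative assumptions, and the code does not implement AuxDPO's auxiliary variables, which the abstract does not specify.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference policy.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Raising the chosen response's log-probability lowers the loss.
print(dpo_loss(-0.5, -2.0, -1.0, -2.0))
```

The loss depends on the policy only through log-probability ratios against the reference, which is what ties the implicit reward to the parametric policy class discussed in the abstract.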

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

229. TNT: Improving Chunkwise Training for Test-Time Memorization

Authors:

Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization. Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To solve this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks for long-range context, while multiple parallel local modules handle fine-grained details. Crucially, by periodically resetting local memory states, we break sequential dependencies to enable massive context parallelization. Stage two is a brief fine-tuning phase where only the local memory modules are adapted to a smaller, high-resolution chunksize, maximizing accuracy with minimal overhead. Evaluated on Titans and TTT models, TNT achieves a substantial acceleration in training speed—up to 17$\times$ faster than the most accurate baseline configuration—while simultaneously improving model accuracy. This improvement removes a critical scalability barrier, establishing a practical foundation for developing expressive RNNs and facilitating future work to close the performance gap with Transformers.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

230. Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

Authors:

Generative models have recently demonstrated remarkable success across diverse domains, motivating their adoption as expressive policies in reinforcement learning (RL). While they have shown strong performance in offline RL, particularly where the target distribution is well defined, their extension to online fine-tuning has largely been treated as a direct continuation of offline pre-training, leaving key challenges unaddressed. In this paper, we propose Flow Matching with Injected Noise for Offline-to-Online RL (FINO), a novel method that leverages flow matching-based policies to enhance sample efficiency for offline-to-online RL. FINO facilitates effective exploration by injecting noise into policy training, thereby encouraging a broader range of actions beyond those observed in the offline dataset. In addition to exploration-enhanced flow policy training, we combine an entropy-guided sampling mechanism to balance exploration and exploitation, allowing the policy to adapt its behavior throughout online fine-tuning. Experiments across diverse, challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

231. CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Authors:

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present COUNSELBENCH, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, COUNSELBENCH-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct COUNSELBENCH-ADV, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Evaluation of 3,240 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, COUNSELBENCH establishes a clinically grounded framework for benchmarking LLMs in mental health QA.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

232. INO-SGD: Addressing Utility Imbalance under Individualized Differential Privacy

Authors:

Differential privacy (DP) is widely employed in machine learning to protect confidential or sensitive training data from being revealed. As data owners gain greater control over their data due to personal data ownership, they are more likely to set their own privacy requirements, necessitating individualized DP (IDP) to fulfil such requests. In particular, owners of data from more sensitive subsets, such as positive cases of stigmatized diseases, are likely to set stronger privacy requirements, as leakage of such data could incur more serious societal impact. However, existing IDP algorithms induce a critical utility imbalance problem: data from owners with stronger privacy requirements may be severely underrepresented in the trained model, resulting in poorer performance on similar data from subsequent users during deployment. In this paper, we analyze this problem and propose the INO-SGD algorithm, which strategically down-weights data within each batch to improve performance on the more private data across all iterations. Notably, our algorithm is specially designed to satisfy IDP, while existing techniques addressing utility imbalance neither satisfy IDP nor can be easily adapted to do so. Lastly, we demonstrate the empirical feasibility of our approach.
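
For readers unfamiliar with the baseline that differentially private training usually starts from, the following is a minimal sketch of one standard DP-SGD gradient estimate (per-sample clipping plus Gaussian noise). It is not the INO-SGD algorithm, whose down-weighting rule the abstract does not specify; the function name `dp_sgd_step` and its parameters are hypothetical choices for illustration.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, rng):
    """One noisy gradient estimate in the style of standard DP-SGD (not INO-SGD)."""
    # Clip each per-sample gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # Add Gaussian noise scaled to the clipping norm, then average over the batch.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # toy per-sample gradients (assumed data)
noiseless = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

In individualized DP, different data owners would receive different privacy budgets; how that maps onto clipping, sampling, or weighting is method-specific and is deliberately left out of this sketch.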

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

233. Unbalanced Soft-Matching Distance For Neural Representational Comparison With Partial Unit Correspondence

Authors:

Representational similarity metrics typically force all units to be matched, making them susceptible to noise and outliers common in neural representations. We extend the soft-matching distance to a partial optimal transport setting that allows some neurons to remain unmatched, yielding rotation-sensitive but robust correspondences. This unbalanced soft-matching distance provides theoretical advantages (relaxing strict mass conservation while maintaining interpretable transport costs) and practical benefits through efficient neuron ranking in terms of cross-network alignment without costly iterative recomputation. In simulations, it preserves correct matches under outliers and reliably selects the correct model in noise-corrupted identification tasks. On fMRI data, it automatically excludes low-reliability voxels and produces voxel rankings by alignment quality that closely match computationally expensive brute-force approaches. It achieves higher alignment precision across homologous brain areas than standard soft-matching, which is forced to match all units regardless of quality. In deep networks, highly matched units exhibit similar maximally exciting images, while unmatched units show divergent patterns. This ability to partition by match quality enables focused analyses, e.g., testing whether networks have privileged axes even within their most aligned subpopulations. Overall, unbalanced soft-matching provides a principled and practical method for representational comparison under partial correspondence.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

234. When Thinking Backfires: Mechanistic Insights into Reason-induced Misalignment

Authors:

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we find that certain attention heads diverge from CoT tokens, modulating rationalization to enable refusal during generation. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

235. 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

Authors:

AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D semantic point autoencoder which integrates 3D point trajectories, depth cues, and DINOv2 semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos which violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

236. Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Authors:

Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

237. Evaluating steering techniques using human similarity judgments

Authors:

Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards "kind" similarity and struggled with "size" alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

238. Neural Posterior Estimation with Latent Basis Expansions

Authors:

Neural posterior estimation (NPE) is a likelihood-free amortized variational inference method that approximates projections of the posterior distribution. To date, NPE variational families have been either simple and interpretable (such as the Gaussian family) or highly flexible but black-box and potentially difficult to optimize (such as normalizing flows). In this work, we parameterize variational families via basis expansions of the latent variables. The log density of our variational distribution is a linear combination of latent basis functions (LBFs), which may be fixed a priori or adapted to the problem class of interest. Our training and inference procedures are computationally efficient even for problems with high-dimensional latent spaces, provided only a low-dimensional projection of the posterior is of interest, owing to NPE's automatic marginalization capabilities. In numerous inference problems, the proposed variational family exhibits better performance than existing variational families used with NPE, including mixtures of Gaussians (mixture density networks) and normalizing flows, as well as outperforming an existing basis expansion method for variational inference.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

239. Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment

Authors:

We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By integrating human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach demonstrates significant zero-shot performance improvements without any task-specific fine-tuning across 29 zero-shot classification and 2 retrieval benchmarks. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of the pretraining process depending on dataset characteristics. Our approach consistently enhances zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these diverse evaluation tasks, without requiring any adaptation to target domains. These findings challenge the conventional wisdom of using human-perceptual data primarily for fine-tuning and demonstrate that embedding human perceptual structure during early representation learning yields more capable and vision-language aligned systems that generalize immediately to unseen tasks. Our work shows that "beginning with you", starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

240. All Code, No Thought: Language Models Struggle to Reason in Ciphered Language

Authors:

Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through *ciphered reasoning*: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.
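
As a concrete illustration of the one cipher the abstract names, rot13 shifts each letter 13 places in the alphabet, so applying it twice recovers the original text. The sketch below uses Python's standard `codecs` module; the helper name `rot13` and the sample sentence are assumptions for illustration and say nothing about how the paper prompts or fine-tunes models.

```python
import codecs

def rot13(text: str) -> str:
    """Apply the rot13 substitution cipher (letters only; other characters pass through)."""
    return codecs.encode(text, "rot_13")

plain = "Two plus two equals four."
ciphered = rot13(plain)

# rot13 is its own inverse, so decoding is the same operation as encoding.
recovered = rot13(ciphered)
print(ciphered)
print(recovered == plain)
```

A monitor that reads only plain English would not recognise the ciphered string as a reasoning step, which is the evasion risk the abstract sets out to measure.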

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

241. Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

Authors:

The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and language coverage, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

242. RESTRAIN: From Spurious Votes to Signals — Self-Training RL with Self-Penalization

Authors:

Reinforcement learning (RL) with human-annotated data has boosted long chain-of-thought (CoT) reasoning in large language models (LLMs), but these gains come at high costs in labeled data while still faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce REinforcement learning with Self-resTRAINt training (RESTRAIN), a self-penalizing RL framework that transforms the absence of gold labels into a learning signal. Rather than amplifying spurious majority votes, RESTRAIN leverages signals from the model’s entire answer distribution, penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. This restraint mechanism integrates seamlessly into policy optimization methods such as GRPO to self-improve without human supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. On Qwen3-4B-Base and OctoThinker Hybrid-8B-Base models, RESTRAIN boosts pass@1 by up to +140.7% on AIME25, +36.2% on MMLU STEM, and +19.6% on GPQA-Diamond. Remarkably, it comes within 0.4% of a fully supervised counterpart, nearly matching gold-label training while using no gold labels at all. These results demonstrate that RESTRAIN consistently boosts reasoning without supervision.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

243. Ads that Stick: Near-Optimal Ad Optimization through Psychological Behavior Models

Authors:

Optimizing the timing and frequency of advertisements (ads) is a central problem in digital advertising, with significant economic consequences. Existing scheduling policies rely on simple heuristics, such as uniform spacing and frequency caps, that overlook long-term user interest. However, it is well-known that users' long-term interest and engagement result from the interplay of several psychological effects (Curmei, Haupt, Recht, and Hadfield-Menell, ACM CRS, 2022). In this work, we model change in user interest upon showing ads based on three key psychological principles: mere exposure, hedonic adaptation, and operant conditioning. The first two effects are modeled using a concave function of user interest with repeated exposure, while the third effect is modeled using a temporal decay function, which explains the decline in user interest due to overexposure. Under our psychological behavior model, we ask the following question: Given a continuous time interval $T$, how many ads should be shown, and at what times, to maximize the user interest towards the ads? Towards answering this question, we first show that, if the number of displayed ads is fixed to $n$, then the optimal ad-schedule only depends on the operant conditioning function. Our main result is a quasi-linear time algorithm that, given the number of ads $n$, outputs a near-optimal ad-schedule, i.e., the difference in the performance of our schedule and the optimal schedule is exponentially small. Our algorithm leads to significant insights about optimal ad placement and shows that simple heuristics such as uniform spacing are sub-optimal under many natural settings. The optimal number of ads to display, which also depends on the mere exposure and hedonic adaptation functions, can be found through a simple linear search given the above algorithm. We further support our findings with experimental results, demonstrating that our strategy outperforms various baselines.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

244. LRIM: a Physics-Based Benchmark for Provably Evaluating Long-Range Capabilities in Graph Learning

Authors:

Accurately modeling long-range dependencies in graph-structured data is critical for many real-world applications. However, incorporating long-range interactions beyond the nodes' immediate neighborhood in a scalable manner remains an open challenge for graph machine learning models. Existing benchmarks for evaluating long-range capabilities either cannot guarantee that their tasks actually depend on long-range information or are rather limited. Therefore, claims of long-range modeling improvements based on said performance remain questionable. We introduce the Long-Range Ising Model Graph Benchmark, a physics-based benchmark utilizing the well-studied Ising model whose ground truth provably depends on long-range dependencies. Our benchmark consists of ten datasets that scale from 256 to 65k nodes per graph, and provide controllable long-range dependencies through tunable parameters, allowing precise control over the hardness and "long-rangedness". We provide model-agnostic evidence that local information is insufficient, further validating the design choices of our benchmark. Via experiments on classical message-passing architectures and graph transformers, we show that both perform far from the optimum, especially those with scalable complexity. Our goal is that our benchmark will foster the development of scalable methodologies that effectively model long-range interactions in graphs.
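
To make the idea of an Ising model with long-range couplings concrete, the sketch below computes the standard pairwise Ising energy H(s) = -Σ_{i<j} J_ij s_i s_j for spins in {-1, +1}. The power-law coupling J_ij = 1/|i - j|^alpha on a 1D chain is a hypothetical choice for illustration (it assumes alpha > 0); the abstract does not state which couplings, lattice, or prediction target the benchmark uses.

```python
import numpy as np

def power_law_couplings(n, alpha):
    """Hypothetical long-range couplings J_ij = 1 / |i - j|**alpha on a 1D chain (alpha > 0)."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :]).astype(float)
    np.fill_diagonal(dist, np.inf)  # no self-interaction: inf ** -alpha == 0
    return dist ** -alpha

def ising_energy(spins, couplings):
    """Pairwise Ising energy, counting each unordered pair (i, j) once."""
    upper = np.triu(couplings, k=1)
    return -float(spins @ upper @ spins)

J = power_law_couplings(3, alpha=1.0)
aligned = ising_energy(np.array([1, 1, 1]), J)
flipped = ising_energy(np.array([1, -1, 1]), J)
```

With these couplings, the energy contribution of any one spin depends on every other spin, with influence decaying as alpha grows, which is one way a tunable parameter can control how long-range a task is.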

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

245. Fast Language Generation through Discrete Diffusion Divergence Instruct

Authors:

The fast generation of language texts is the holy grail that people pursue in the AI era. In this work, we introduce **Di**screte **Di**ffusion Divergence **Instruct** (**DiDi-Instruct**), a training-based method that leads to fast language generation models by initializing from a pre-trained (masked) discrete diffusion language model (dLLM). The resulting DiDi-Instruct model outperforms the dLLM counterparts and the GPT-2 baseline with 64× acceleration. In the theoretical part of the paper, we build the foundation of DiDi-Instruct in a framework of integral KL divergence minimization, with practical training algorithms. We also introduce techniques like grouped reward normalization, intermediate-state matching, and the reward-guided ancestral sampler (RGAS) that significantly improve the training stability, the model coverage, and the inference performance. On OpenWebText, DiDi-Instruct outperforms all accelerated language generation models as well as the GPT-2 baseline and the standard dLLMs, achieving sample perplexities ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs). These performance gains are accomplished with a negligible entropy loss of about 1% and 20× less additional training wall-clock time. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye. We will release our code and models along with the paper.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

246. HYPED: A Multimodal HYbrid Perturbation Gene Expression and Imaging Dataset

Authors:

Integrating multimodal, high-resolution biological data is a useful way to characterize biological processes, such as how cells respond to perturbations. Cell perturbation prediction is a major experimental challenge and has motivated substantial research in machine learning for biology. In this work, we generated a multimodal benchmark dataset that captures the dynamic response of human fibroblasts to transient transcription factor perturbations. We performed time-series live cell imaging with fluorescent cell cycle reporters over 72 hours and collected long-read single-cell RNA sequencing data from the same population of cells. We release the processed dataset, preprocessing pipelines and benchmarking code along with the evaluation of existing models using our data as ground truth. This work supports the development and evaluation of machine learning methods for modeling dynamical systems from multimodal datasets. HYPED makes the cell perturbation problem accessible to machine learning researchers with state-of-the-art experimental data.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

247. Mitigating Spurious Correlation via Distributionally Robust Learning with Hierarchical Ambiguity Sets

Authors:

Conventional supervised learning methods are often vulnerable to spurious correlations, particularly under distribution shifts in test data. To address this issue, several approaches, most notably Group DRO, have been developed. While these methods are highly robust to subpopulation or group shifts, they remain vulnerable to intra-group distributional shifts, which frequently occur in minority groups with limited samples. We propose a hierarchical extension of Group DRO that addresses both inter-group and intra-group uncertainties, providing robustness to distribution shifts at multiple levels. We also introduce new benchmark settings that simulate realistic minority group distribution shifts—an important yet previously underexplored challenge in spurious correlation research. Our method demonstrates strong robustness under these conditions—where existing robust learning methods consistently fail—while also achieving superior performance on standard benchmarks. These results highlight the importance of broadening the ambiguity set to better capture both inter-group and intra-group distributional uncertainties.
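
The Group DRO baseline that the abstract builds on optimizes the worst average loss over predefined groups rather than the overall average. The sketch below shows only that standard worst-group objective on toy numbers; it does not implement the hierarchical ambiguity sets proposed in the paper, and the function name and data are assumptions for illustration.

```python
import numpy as np

def worst_group_loss(losses, group_ids):
    """Return (worst mean loss, group label) over groups, as in the standard Group DRO objective."""
    groups = np.unique(group_ids)
    group_means = np.array([losses[group_ids == g].mean() for g in groups])
    worst = int(group_means.argmax())
    return float(group_means[worst]), int(groups[worst])

# Toy per-example losses and group labels (assumed data).
losses = np.array([0.2, 0.4, 1.0, 3.0, 0.1])
group_ids = np.array([0, 0, 1, 1, 2])

worst_loss, worst_group = worst_group_loss(losses, group_ids)
average_loss = float(losses.mean())
```

Here the average loss hides the poorly served group, while the worst-group loss exposes it. A shift inside a small group changes that group's mean estimate, which is the intra-group vulnerability the abstract targets.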

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

248. On the Impact of the Utility in Semivalue-based Data Valuation

Authors:

Semivalue–based data valuation uses cooperative‐game theory intuitions to assign each data point a value reflecting its contribution to a downstream task. Still, those values depend on the practitioner’s choice of utility, raising the question: *How robust is semivalue-based data valuation to changes in the utility?* This issue is critical when the utility is set as a trade‐off between several criteria and when practitioners must select among multiple equally valid utilities. We address it by introducing the notion of a dataset’s *spatial signature*: given a semivalue, we embed each data point into a lower-dimensional space where any utility becomes a linear functional, making the data valuation framework amenable to a simpler geometric picture. Building on this, we propose a practical methodology centered on an explicit robustness metric that informs practitioners whether and by how much their data valuation results will shift as the utility changes. We validate this approach across diverse datasets and semivalues, demonstrating strong agreement with rank‐correlation analyses and offering analytical insight into how choosing a semivalue can amplify or diminish robustness.
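
The Shapley value is the best-known semivalue, and a tiny exact computation shows how the assigned values depend on the utility function. The sketch below enumerates all subsets, so it is only feasible for a handful of points; `toy_utility` is a hypothetical utility chosen for illustration, and nothing here reproduces the paper's spatial signature or robustness metric.

```python
from itertools import combinations
from math import comb

def shapley_values(n, utility):
    """Exact Shapley values for n data points; utility maps a frozenset of indices to a float."""
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude point i.
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values

def toy_utility(subset):
    # Hypothetical utility: points 0 and 1 are redundant, point 2 is independently useful.
    redundant = 1.0 if (0 in subset or 1 in subset) else 0.0
    independent = 2.0 if 2 in subset else 0.0
    return redundant + independent

values = shapley_values(3, toy_utility)
```

Swapping `toy_utility` for a different but equally plausible utility can reorder the points, which is the sensitivity the abstract asks about.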

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

249. Toward Principled Flexible Scaling for Self-Gated Neural Activation

Authors:

Neural networks necessitate nonlinearities to achieve universal approximability. Traditional activation functions introduce nonlinearities through rigid feature rectifications. Recent self-gated variants improve traditional methods in fitting flexibility by incorporating learnable content-aware factors and non-local dependencies, enabling dynamic adjustments to activation curves via adaptive translation and scaling. While SOTA approaches achieve notable gains in conventional CNN layers, they struggle to enhance Transformer layers, where fine-grained context is inherently modeled, severely reducing the effectiveness of non-local dependencies leveraged in activation processes. We refer to this critical yet unexplored challenge as the non-local tension of activation. Drawing on a decision-making perspective, we systematically analyze the origins of the non-local tension problem and explore the initial solution to foster a more discriminative and generalizable neural activation methodology. This is achieved by rethinking how non-local cues are encoded and transformed into adaptive scaling coefficients, which in turn recalibrate the contributions of features to filter updates through neural activation. Grounded in these insights, we present FleS, a novel self-gated activation model for discriminative pattern recognition. Extensive experiments on various popular benchmarks validate our interpretable methodology for improving neural activation modeling.
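
For context, a familiar example of a self-gated activation is SiLU (also called Swish), where the input is multiplied by a sigmoid gate computed from itself. The sketch below shows SiLU and a variant with a scaling coefficient `beta` on the gate; it is not FleS, and how the paper derives its scaling coefficients from non-local cues is not described in the abstract.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def silu(x):
    """Self-gated activation: the input gates itself through a sigmoid."""
    return x * sigmoid(x)

def scaled_self_gate(x, beta):
    """Swish-style variant: beta rescales the gate (beta = 1 recovers SiLU)."""
    return x * sigmoid(beta * x)

outputs = [scaled_self_gate(1.0, beta) for beta in (0.0, 1.0, 10.0)]
```

With `beta = 0` the gate is a constant 0.5 (a linear map), and as `beta` grows the function approaches ReLU, so a single scaling coefficient already changes the shape of the activation curve considerably.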

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

250. Foundation Models for Causal Inference via Prior-Data Fitted Networks

Authors:

Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and train our foundation models for estimating conditional average treatment effects (CATEs) for different settings. We show that CausalFM performs competitively for CATE estimation using various synthetic and semi-synthetic benchmarks. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

251. FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Authors:

Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce **FATE**, a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.
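
The pass@64 figure above refers to the chance that at least one of 64 sampled proofs is correct. A commonly used unbiased estimator of pass@k from n samples with c correct ones is 1 - C(n - c, k) / C(n, k); the sketch below implements that estimator as an assumption, since the abstract does not say how the benchmark computes its numbers.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct (requires k <= n)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With exactly 64 samples and k = 64, a problem counts as solved
# if and only if at least one sample is correct.
solved = pass_at_k(64, 1, 64)
unsolved = pass_at_k(64, 0, 64)
single_try = pass_at_k(10, 1, 1)
```

Averaging this quantity over all problems in a benchmark gives the reported accuracy, so a 3% pass@64 on 100 problems would correspond to roughly three problems solved at least once in 64 attempts.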

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

252. Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction–Reasoning Synergy

Authors:

Recent developments in Multimodal Large Language Models (MLLMs) have significantly improved Vision–Language (VL) reasoning in 2D domains. However, extending these capabilities to 3D scene understanding remains a major challenge. Existing 3D Multimodal Large Language Models (3D-MLLMs) often depend on 3D data inputs, which limits scalability and generalization. To address this limitation, we propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data, making it practical for real-world deployment. In our method, geometric priors are used directly to improve scene perception. To integrate the geometric cues into the MLLM compactly, we design a Cross-Task Adapter (CTA) module to align the 3D geometric priors with the vision-language representations. To ensure geometric consistency and integrity, we introduce a Metric Depth Model that recovers real-scale geometry from the reconstruction outputs. Finally, the model is fine-tuned with a two-stage distillation optimization strategy, achieving fast convergence and stable training. Extensive experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Dense Captioning, and 3D Visual Grounding tasks, demonstrating superior multi-task capabilities.

📊 Review scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 openreview 📄 Download PDF

253. Directed Semi-Simplicial Learning with Applications to Brain Activity Decoding

Authors:

Graph Neural Networks (GNNs) excel at learning from pairwise interactions but often overlook multi-way and hierarchical relationships. Topological Deep Learning (TDL) addresses this limitation by leveraging combinatorial topological spaces, such as simplicial or cell complexes. However, existing TDL models are restricted to undirected settings and fail to capture the higher-order directed patterns prevalent in many complex systems, e.g., brain networks, where such interactions are both abundant and functionally significant. To fill this gap, we introduce Semi-Simplicial Neural Networks (SSNs), a principled class of TDL models that operate on semi-simplicial sets, combinatorial structures that encode directed higher-order motifs and their directional relationships. To enhance scalability, we propose Routing-SSNs, which dynamically select the most informative relations in a learnable manner. We theoretically characterize SSNs by proving they are strictly more expressive than standard graph and TDL models, and they are able to recover several topological descriptors. Building on previous evidence that such descriptors are critical for characterizing brain activity, we then introduce a new principled framework for brain dynamics representation learning centered on SSNs. Empirically, we test SSNs on 4 distinct tasks across 13 datasets, spanning from brain dynamics to node classification, showing competitive performance. Notably, SSNs consistently achieve state-of-the-art performance on brain dynamics classification tasks, outperforming the second-best model by up to 27%, and message passing GNNs by up to 50% in accuracy. Our results highlight the potential of topological models for learning from structured brain data, establishing a unique real-world case study for TDL. Code and data are uploaded as supplementary material.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

254. RedSage: A Cybersecurity Generalist LLM

Authors:

Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code will be released to advance reproducibility and open cybersecurity LLM research.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

255. Decision Aggregation under Quantal Response

Authors:

The effectiveness of collective decision-making is often challenged by the bounded rationality and inherent stochasticity of individual agents. We investigate this by analyzing how to aggregate decisions from $n$ experts, each receiving a private signal about an unknown state. Assuming signals are conditionally independent and identically distributed, we depart from the fully rational paradigm and model expert behavior using quantal response—a stochastic choice model capturing bounded rationality. Within a minimax regret framework, we show that majority voting is the optimal robust aggregator when individual rationality falls below a certain threshold. Interestingly, such groups can outperform perfectly rational agents, as their decision randomness encodes weak but informative signals lost in deterministic behavior. We validate these findings using large language models (LLMs), which naturally exhibit quantal response via their temperature parameter. Aggregating moderately stochastic LLM outputs significantly improves accuracy on complex reasoning tasks, highlighting bounded rationality not as a limitation, but as a potential strength in collective intelligence.
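As a rough illustration of the setting described above, the following sketch computes the exact accuracy of majority voting over experts who follow a logit quantal-response rule. It is not the paper's model or analysis: the binary state, the uniform prior, the symmetric signal accuracy `p`, the rationality parameter `lam`, and all function names are assumptions made for this example, and the sketch does not reproduce the minimax-regret result.

```python
import math


def follow_signal_prob(p: float, lam: float) -> float:
    """Logit quantal response (assumed form): probability that an expert
    reports its own signal when the signal is correct with probability p.
    lam = 0 gives a coin flip; lam -> infinity gives a fully rational expert."""
    # Expected utility of reporting the signal is p, of the opposite 1 - p.
    return 1.0 / (1.0 + math.exp(-lam * (2.0 * p - 1.0)))


def report_accuracy(p: float, lam: float) -> float:
    """Probability that a single expert's report matches the true state."""
    q = follow_signal_prob(p, lam)
    # Correct if (signal right and followed) or (signal wrong and flipped).
    return p * q + (1.0 - p) * (1.0 - q)


def majority_accuracy(n: int, acc: float) -> float:
    """Exact probability that a strict majority of n i.i.d. reports is correct.
    n is assumed odd so that ties cannot occur."""
    return sum(
        math.comb(n, k) * acc**k * (1.0 - acc) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for lam in (0.0, 1.0, 5.0, 50.0):
        acc = report_accuracy(0.7, lam)
        print(lam, round(acc, 4), round(majority_accuracy(11, acc), 4))
```

In this toy model a more rational expert is always individually more accurate, so it only shows how the stochastic choice rule and the majority aggregator fit together.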

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

256. LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis

Authors:

Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift (distributional shifts in benign and malicious samples) leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection. To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013–2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA's utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

257. From Fields to Random Trees

Authors:

This study introduces a novel method for performing Maximum A Posteriori (MAP) estimation on Markov Random Fields (MRFs) that are defined on locally and sparsely connected graphs, which are common in real-world applications. We address this long-standing challenge by sampling uniform random spanning trees (SPT) from the associated graph. Such a sampling procedure effectively breaks the cycles and decomposes the original MAP inference problem into overlapping sub-problems on trees, which can be solved exactly and efficiently. We demonstrate the effectiveness of our approach on various types of graphical models, including grids, cellular/cell networks, and Erdős–Rényi graphs. Our algorithm outperforms various baselines on synthetic, UAI inference competition, and real-world PCI problems, specifically in cases involving locally and sparsely connected graphs. Furthermore, our method achieves comparable results to these methods in other scenarios. The code of our model can be accessed at https://anonymous.4open.science/r/From-fields-to-trees-iclr-EB75.
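The abstract does not say which sampler is used to draw uniform random spanning trees. One standard choice is Wilson's algorithm (loop-erased random walks), sketched below on a small grid graph; the function names and the grid helper are hypothetical, and the tree-based MAP step itself is not shown.

```python
import random


def grid_graph(rows: int, cols: int) -> dict:
    """Adjacency lists of a rows x cols 4-connected grid (hypothetical helper)."""
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for r, c in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj


def wilson_spanning_tree(adj: dict, rng: random.Random) -> list:
    """Sample a uniform random spanning tree of a connected undirected graph
    with Wilson's algorithm. Returns a list of (child, parent) edges."""
    nodes = list(adj)
    root = nodes[0]
    in_tree = {root}
    parent = {}
    for start in nodes:
        # Random walk until the current tree is hit. Overwriting parent[u]
        # on every visit keeps only the last exit, which erases loops.
        u = start
        while u not in in_tree:
            parent[u] = rng.choice(adj[u])
            u = parent[u]
        # Retrace the loop-erased path and attach it to the tree.
        u = start
        while u not in in_tree:
            in_tree.add(u)
            u = parent[u]
    return [(u, parent[u]) for u in nodes if u != root]


if __name__ == "__main__":
    tree = wilson_spanning_tree(grid_graph(3, 3), random.Random(0))
    print(len(tree), "edges")
```

Every sampled tree is cycle-free, so exact max-product dynamic programming could be run on it; how the per-tree solutions are combined is specific to the paper and is left out here.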

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

258. Efficient Regression-based Training of Normalizing Flows for Boltzmann Generators

Authors:

Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to large-scale diffusion and flow matching models. However, such modern generative models suffer from expensive inference, inhibiting their use in numerous scientific applications like Boltzmann Generators (BGs) for molecular conformations that require fast likelihood evaluation. In this paper, we revisit classical normalizing flows in the context of BGs that offer efficient sampling and likelihoods, but whose training via maximum likelihood is often unstable and computationally challenging. We propose Regression Training of Normalizing Flows (RegFlow), a novel and scalable regression-based training objective that bypasses the numerical instability and computational challenge of conventional maximum likelihood training in favour of a simple $\ell_2$-regression objective. Specifically, RegFlow maps prior samples under our flow to targets computed using optimal transport couplings or a pre-trained continuous normalizing flow (CNF). To enhance numerical stability, RegFlow employs effective regularization strategies such as a new forward-backward self-consistency loss that is simple to implement. Empirically, we demonstrate that RegFlow unlocks a broader class of architectures that were previously intractable to train for BGs with maximum likelihood. We also show that RegFlow improves on maximum likelihood training in performance, computational cost, and stability for equilibrium sampling in Cartesian coordinates of alanine dipeptide, tripeptide, and tetrapeptide, showcasing its potential in molecular systems.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

259. EMBridge: Enhancing Gesture Generalization from EMG Signals Through Cross-modal Representation Learning

Authors:

Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Alternatively, leveraging low-power, cost-effective bio-signals, e.g. surface electromyography (sEMG), allows for continuous gesture prediction on wearable devices. In this work, we aim to enhance EMG representation quality by aligning it with embeddings obtained from structured, high-quality modalities that provide richer semantic guidance, ultimately enabling zero-shot gesture generalization. Specifically, we propose EMBridge, a cross-modal representation learning framework that bridges the modality gap between EMG and pose. EMBridge learns high-quality EMG representations by introducing a Querying Transformer (Q-Former), a masked pose reconstruction loss, and a community-aware soft contrastive learning objective that aligns the relative geometry of the embedding spaces. We evaluate EMBridge on both in-distribution and unseen gesture classification tasks and demonstrate consistent performance gains over all baselines. To the best of our knowledge, EMBridge is the first cross-modal representation learning framework to achieve zero-shot gesture classification from wearable EMG signals, showing potential toward real-world gesture recognition on wearable devices.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

260. On The Fragility of Benchmark Contamination Detection in Reasoning Models

Authors:

Leaderboards for large reasoning models (LRMs) have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to achieving higher rankings is to incorporate evaluation benchmarks into the training data, thereby yielding inflated performance, known as benchmark contamination. Although numerous contamination detection approaches have been proposed, our studies find that evading contamination detection for LRMs is alarmingly easy. We focus on two scenarios where contamination may occur in practice: (I) When the base model evolves into an LRM via supervised fine-tuning (SFT) and reinforcement learning (RL), we find that contamination during SFT can initially be identified by contamination detection methods. Yet even brief Group Relative Policy Optimization (GRPO) training can markedly **conceal contamination signals** that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that Proximal Policy Optimization (PPO) style importance sampling and clipping objectives are the root cause of this detection concealment, indicating that **a broad class of RL methods** may inherently exhibit similar concealment capability. (II) When SFT contamination with CoT is applied to advanced LRMs as the final stage, most contamination detection methods **perform near random guessing**. Without exposure to non-members, contaminated LRMs would still have more confidence when responding to those unseen samples that share similar distributions to the training set, and thus evade existing memorization-based detection methods. Together, our findings reveal a unique vulnerability of LRM evaluation: model developers could easily contaminate LRMs to achieve inflated leaderboard performance while leaving minimal traces of contamination, thereby strongly undermining the fairness of evaluation and threatening the integrity of public leaderboards. This underscores the urgent need for advanced contamination detection methods and trustworthy evaluation protocols tailored to LRMs.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

261. ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Authors:

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce **ImagenWorld**, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

262. $\boldsymbol{\partial^\infty}$-Grid: Differentiable Grid Representations for Fast and Accurate Solutions to Differential Equations

Authors:

We present a novel differentiable grid-based representation for efficiently solving differential equations (DEs). Widely used architectures for neural solvers, such as sinusoidal neural networks, are coordinate-based MLPs that are both computationally intensive and slow to train. Although grid-based alternatives for implicit representations (e.g., Instant-NGP and K-Planes) offer faster training, their reliance on linear interpolation restricts their ability to compute higher-order derivatives, rendering them unsuitable for solving DEs. In contrast, our approach overcomes these limitations by combining the efficiency of feature grids with radial basis function interpolation, which is infinitely differentiable. To effectively capture high-frequency solutions and enable stable and faster computation of global gradients, we introduce a multi-resolution decomposition with co-located grids. Our proposed representation, $\boldsymbol{\partial^\infty}$-Grid, is trained implicitly using the differential equations as loss functions, enabling accurate modeling of physical fields. We validate $\boldsymbol{\partial^\infty}$-Grid on a variety of tasks, including the Poisson equation for image reconstruction, the Helmholtz equation for wave fields, and the Kirchhoff-Love boundary value problem for cloth simulation. Our results demonstrate a 5–20× speed-up over coordinate-based MLP methods, solving differential equations in seconds or minutes while maintaining comparable accuracy and compactness.
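To make the contrast between linear and radial basis function interpolation concrete, here is a small 1D sketch. It assumes a Gaussian kernel, a single-resolution grid, and hand-picked weights; none of these choices are taken from the paper, and the function names are hypothetical. It shows that a Gaussian RBF expansion has a closed-form second derivative, whereas piecewise-linear interpolation has zero second derivative inside every cell.

```python
import numpy as np


def rbf_eval(x, centers, weights, scale):
    """u(x) = sum_i w_i * exp(-z_i^2 / 2), with z_i = (x - c_i) / scale."""
    z = (x[:, None] - centers[None, :]) / scale
    return (weights * np.exp(-0.5 * z**2)).sum(axis=1)


def rbf_second_derivative(x, centers, weights, scale):
    """Closed-form u''(x): each Gaussian contributes (z^2 - 1) exp(-z^2/2) / scale^2."""
    z = (x[:, None] - centers[None, :]) / scale
    return (weights * (z**2 - 1.0) * np.exp(-0.5 * z**2)).sum(axis=1) / scale**2


# Assumed toy setup: 11 grid nodes on [0, 1] with sinusoidal weights.
centers = np.linspace(0.0, 1.0, 11)
weights = np.sin(2.0 * np.pi * centers)
scale = 0.1
x = np.linspace(0.2, 0.8, 7)

# Central finite difference as an independent check of the closed form.
h = 1e-4
fd = (
    rbf_eval(x + h, centers, weights, scale)
    - 2.0 * rbf_eval(x, centers, weights, scale)
    + rbf_eval(x - h, centers, weights, scale)
) / h**2
exact = rbf_second_derivative(x, centers, weights, scale)

# Piecewise-linear interpolation: second difference inside one cell vanishes.
xs = np.array([0.24, 0.25, 0.26])  # all inside the cell [0.2, 0.3]
lin = np.interp(xs, centers, weights)
lin_second = (lin[2] - 2.0 * lin[1] + lin[0]) / 0.01**2

if __name__ == "__main__":
    print(np.max(np.abs(fd - exact)), lin_second)
```

A residual such as the Poisson or Helmholtz operator needs second derivatives of the represented field, which is why the linear variant carries no usable signal in this setting.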

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

263. Generating metamers of human scene understanding

Authors:

Human vision combines low-resolution “gist” information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high and low resolution (i.e. “foveated”) inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen generated images to latent human scene representations, we conducted a same-different behavioral experiment where participants were asked for a “same” or “different” response between the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is a powerful tool for understanding scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While it can generate metamers even conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers’ own fixated regions.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

264. Breaking the Total Variance Barrier: Sharp Sample Complexity for Linear Heteroscedastic Bandits with Fixed Action Set

Authors:

Recent years have witnessed increasing interest in tackling heteroscedastic noise in bandits and reinforcement learning \citep{zhou2021nearly, zhao2023variance, jia2024does, pacchiano2025second}. In these works, the cumulative variance of the noise $\Lambda = \sum_{t=1}^T \sigma_t^2$, where $\sigma_t^2$ is the variance of the noise at round $t$, has been used to characterize the statistical complexity of the problem, yielding simple regret bounds of order $\tilde{\mathcal O}(d \sqrt{\Lambda / T^2})$ for linear bandits with heteroscedastic noise \citep{zhou2021nearly, zhao2023variance}. However, on closer inspection, $\Lambda$ remains of the same order even if the noise is close to zero at half of the rounds, which indicates that the $\Lambda$-dependence is not optimal. In this paper, we revisit the linear bandit problem with heteroscedastic noise. We consider the setting where the action set is fixed throughout the learning process. We propose a novel variance-adaptive algorithm, VAEE (Variance-Aware Exploration with Elimination), for large action sets, which actively explores actions that maximize the information gain among a candidate set of actions that are not eliminated. With the active-exploration strategy, we show that VAEE achieves a *simple regret* with a nearly *harmonic-mean* dependent rate, i.e. $\tilde{\mathcal O}\Big(d\Big[\sum_{t = 1}^T \frac{1}{\sigma_t^2} - \sum_{i = 1}^{\tilde{O}(d)} \frac{1}{[\sigma^{(i)}]^2} \Big]^{-\frac{1}{2}}\Big)$, where $d$ is the dimension of the feature space and $\sigma^{(i)}$ is the $i$-th smallest variance among $\{\sigma_t\}_{t=1}^T$. For finitely many actions, we propose a variance-aware variant of G-optimal design based exploration, which achieves a $\tilde{\mathcal O}\bigg(\sqrt{d \log |\mathcal A|}\Big[\sum_{t = 1}^T \frac{1}{\sigma_t^2} - \sum_{i = 1}^{\tilde{O}(d)} \frac{1}{[\sigma^{(i)}]^2}\Big]^{-\frac{1}{2}}\bigg)$ simple regret bound. We also establish a nearly matching lower bound for the fixed action set setting, indicating that a *harmonic-mean* dependent rate is unavoidable. To the best of our knowledge, this is the first work that breaks the $\sqrt{\Lambda}$ barrier for linear bandits with heteroscedastic noise.
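
The gap between the two rates can be seen in a small worked example. This is an illustration only, ignoring constants and logarithmic factors; the two-level noise profile, the symbol $\varepsilon$, and the symbol $m$ for the number of subtracted terms (the $\tilde{O}(d)$ in the bound) are assumptions introduced here. Suppose $T$ is even, $\sigma_t = \varepsilon \in (0, 1]$ on half of the rounds, $\sigma_t = 1$ on the other half, and $m < T/2$. The cumulative-variance rate is then

$$
\Lambda = \frac{T}{2}\,(1 + \varepsilon^2) \ \ge\ \frac{T}{2}
\quad\Longrightarrow\quad
d\,\sqrt{\Lambda / T^2} \ \ge\ \frac{d}{\sqrt{2T}},
$$

which does not improve however small $\varepsilon$ is. For the harmonic-mean rate, the $m$ smallest variances all equal $\varepsilon^2$, so

$$
\sum_{t=1}^{T} \frac{1}{\sigma_t^2} - \sum_{i=1}^{m} \frac{1}{[\sigma^{(i)}]^2}
= \frac{T/2 - m}{\varepsilon^2} + \frac{T}{2}
\quad\Longrightarrow\quad
d\,\Big[\frac{T/2 - m}{\varepsilon^2} + \frac{T}{2}\Big]^{-\frac{1}{2}}
\ \le\ \frac{d\,\varepsilon}{\sqrt{T/2 - m}},
$$

which tends to zero as $\varepsilon \to 0$.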

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

265. Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian

Authors:

Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters, motivated by the structure of a Gaussian mixture model, and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent, which resembles the known behavior of gradient EM in over-parameterized settings. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further give an example where, without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case of random initialization, where parameters are sampled from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge to infinity, yet the loss still converges to zero with a $1/\tau$ rate, where $\tau$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. 
This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

266. Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization

Authors:

In today’s world, where AI plays a major role in everyday life, energy consumption and data privacy have become critical concerns. On-device learning offers a promising solution by enabling models to train directly on edge devices, thereby reducing energy usage and minimizing the risk of data leakage. However, the increasing size of modern neural networks poses a serious challenge for on-device training. Although prior work has mainly focused on compact convolutional architectures, we explore a different direction by applying subspace-based training to transformer models. Based on the idea that a model’s essential information resides in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method designed to overcome the memory bottleneck of backpropagation and improve inference efficiency in transformer-based models by constraining training to this subspace. Our results show that, with accuracy comparable to vanilla training, WASI reduces memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. Moreover, when tested on a Raspberry Pi 5, WASI delivers approximately $1.5\times$ faster training and inference than vanilla training.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

267. RECAST: Expanding the Boundaries of LLMs' Complex Instruction Following with Multi-Constraint Data

Authors:

Large language models (LLMs) are increasingly expected to tackle complex tasks, driven by their expanding applications and users' growing proficiency in crafting sophisticated prompts. However, as the number of explicitly stated requirements increases (particularly beyond $10$ constraints), LLMs often struggle to accurately follow such complex instructions, which limits their applicability in complex real-world scenarios. To the best of our knowledge, existing datasets do not exceed 10 constraints per instance. To address this challenge, we propose RECAST, an efficient and scalable framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks, aiming to challenge and extend the boundaries of models’ ability to follow complex instructions. These constraints are extracted from real-world prompt-response pairs to ensure practical relevance. Using this framework, we construct RECAST-$30$K, a large-scale, high-quality dataset comprising $30$k instances spanning $19$ constraint types. Experimental results demonstrate that models fine-tuned on RECAST-30K substantially improve in following complex instructions while maintaining their general capabilities without degradation. Moreover, RECAST enables automatic verification of constraint satisfaction via rule-based validators for quantitative constraints and LLM-based validators for qualitative ones. This verifiability enables the design of reward functions for reinforcement learning, which further boosts model performance on complex and challenging tasks.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

268. PI-Light: Physics-Inspired Diffusion for Full-Image Relighting

Authors:

Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce **P**hysics-**I**nspired diffusion for full-image re**Light** ($\pi$-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that $\pi$-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

269. Reformulation for Pretraining Data Augmentation

Authors:

Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training. To overcome this critical bottleneck, we introduce the Massive Genre-Audience (MGA) reformulation method, a framework designed to augment corpora in a way that supports more effective model performance scaling. Instead of relying on complex, predefined seed systems, MGA systematically reformulates existing corpora into diverse, contextually-rich variations by adaptively generating genre-audience pairs. We present this framework and the resulting 770 billion token MGACorpus, created as a practical instantiation of our methodology. We experimentally validate MGA's core benefits by demonstrating superior scaling properties, in terms of both model size and data budget, against data repetition and upsampling (up to 13B parameters). Furthermore, our comprehensive analysis investigates the role of synthesis principles in generation quality and reveals nuances in evaluating model capabilities using standard loss metrics. Our work shows that a systematic framework like MGA provides a reliable pathway to substantially augment training datasets, effectively alleviating repetition bottlenecks and enabling more efficient scaling of large language models.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

270. Policy Newton Algorithm in Reproducing Kernel Hilbert Space

Authors:

Reinforcement learning (RL) policies represented in Reproducing Kernel Hilbert Spaces (RKHS) offer powerful representational capabilities. While second-order optimization methods like Newton's method demonstrate faster convergence than first-order approaches, current RKHS-based policy optimization remains constrained to first-order techniques. This limitation stems primarily from the intractability of explicitly computing and inverting the infinite-dimensional Hessian operator in RKHS. We introduce Policy Newton in RKHS, the first second-order optimization framework specifically designed for RL policies represented in RKHS. Our approach circumvents direct computation of the inverse Hessian operator by optimizing a cubic regularized auxiliary objective function. Crucially, we leverage the Representer Theorem to transform this infinite-dimensional optimization into an equivalent, computationally tractable finite-dimensional problem whose dimensionality scales with the trajectory data volume. We establish theoretical guarantees proving convergence to a local optimum with a local quadratic convergence rate. Empirical evaluations on a toy financial asset allocation problem validate these theoretical properties, while experiments on standard RL benchmarks demonstrate that Policy Newton in RKHS achieves superior convergence speed and higher episodic rewards compared to established first-order RKHS approaches and parametric second-order methods. Our work bridges a critical gap between non-parametric policy representations and second-order optimization methods in reinforcement learning.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

271. Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Authors:

Traditional Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router’s decisions align well with the experts’ capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling loss (ERC loss), a lightweight auxiliary loss that couples expert capabilities and the router’s decisions. We treat each row of the router matrix as a cluster center for the tokens assigned to a particular expert. From these centers, we create proxy tokens by applying a perturbation with noise. Using these proxy tokens, the ERC loss forces the router and experts to satisfy two constraints: (1) each expert exhibits higher activation for its corresponding proxy token than for any other proxy token, and (2) each proxy token elicits stronger activation in its designated expert than in any other expert. This optimization leads to two key effects: each row of the router matrix is an accurate representation of its expert’s capabilities, while each expert develops expertise that closely matches the tokens routed to it. Our experiments involve pre-training multiple 3B-parameter MoE-LLMs on trillions of tokens in total, providing detailed evidence of the ERC loss’s effectiveness. Additionally, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing many valuable insights into MoEs.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

272. Tractability via Low Dimensionality: The Parameterized Complexity of Training Quantized Neural Networks

Authors:

The training of neural networks has been extensively studied from both algorithmic and complexity-theoretic perspectives, yet recent results in this direction almost exclusively concern real-valued networks. In contrast, advances in machine learning practice highlight the benefits of quantization, where network parameters and data are restricted to finite integer domains, yielding significant improvements in speed and energy efficiency. Motivated by this gap, we initiate a systematic complexity-theoretic study of ReLU Neural Network Training in the full quantization mode. We establish strong lower bounds by showing that hardness already arises in the binary setting and under highly restrictive structural assumptions on the architecture, thereby excluding parameterized tractability for natural measures such as depth and width. On the positive side, we identify nontrivial fixed-parameter tractable cases when parameterizing by input dimensionality in combination with width and either output dimensionality or error bound, and further strengthen these results by replacing width with the more general treewidth.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

273. What Happens Next? Anticipating Future Motion by Generating Point Trajectories

Authors:

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

274. HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks

Authors:

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique that adapts large pre-trained models by adding low-rank matrices to their weight updates. However, in the context of fine-tuning multi-head self-attention (MHA), LoRA has been employed to adapt each attention head separately, thereby overlooking potential synergies across different heads. To mitigate this issue, we propose a novel Hyper-shared Low-Rank Adaptation (HoRA) method, which utilizes joint hypernetworks to generate low-rank matrices across attention heads. By coupling their adaptation through a shared generator, HoRA encourages cross-head information sharing, and thus directly addresses the aforementioned limitation of LoRA. By comparing LoRA and HoRA through the lens of hierarchical mixture of experts, our theoretical findings reveal that the latter achieves superior sample efficiency to the former. Furthermore, through extensive experiments across diverse language and vision benchmarks, we demonstrate that HoRA outperforms LoRA and other PEFT methods while requiring only a marginal increase in the number of trainable parameters.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

275. Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Authors:

Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness critical. A significant and underexplored risk is intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective. Existing studies typically induce deception by explicitly setting a hidden objective through prompting or fine-tuning, which may not reflect real-world human-LLM interactions. Moving beyond such human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth, we propose a framework based on Contact Searching Questions (CSQ). This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the *Deceptive Intention Score*, measures the model's bias toward a hidden objective. The second, the *Deceptive Behavior Score*, measures the inconsistency between the LLM's internal belief and its expressed output. Evaluating 16 leading LLMs, we find that both metrics rise in parallel and escalate with task difficulty for most models. Moreover, increasing model capacity does not always reduce deception, posing a significant challenge for future LLM development.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

276. SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Authors:

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over $8,743$ hours, SpeakerVid-5M contains more than $5.2$ million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark (VidChatBench) for future work. Both the dataset and the corresponding data processing code will be publicly released.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

277. H$^3$DP: Triply‑Hierarchical Diffusion Policy for Visuomotor Learning

Authors:

Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce Triply-Hierarchical Diffusion Policy (H$^3$DP), a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H$^3$DP contains $\mathbf{3}$ levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H$^3$DP yields a $+ \mathbf{27.5}$% average relative improvement over baselines across $\mathbf{44}$ simulation tasks and achieves superior performance in $\mathbf{4}$ challenging bimanual real-world manipulation tasks. Project Page: https://h3-dp.github.io/.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 6

Individual scores: 8, 6, 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

278. Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching

Authors:

While existing approaches address unpaired image-text matching by constructing cross-modal aligned knowledge, they often fail to identify semantically corresponding visual representations for Out-of-Distribution (OOD) words. Moreover, the distributional variance of visual representations associated with different words varies significantly, which negatively impacts matching accuracy. To address these issues, we propose a novel method, Multimodal Aligned Semantic Knowledge (MASK), which leverages word embeddings as bridges to associate words with their corresponding prototypes, thereby enabling semantic knowledge alignment between the image and text modalities. For OOD words, the representative prototypes are constructed by leveraging the semantic relationships encoded in word embeddings. Beyond that, we introduce a prototype consistency contrastive loss to structurally regularize the feature space, effectively mitigating the adverse effects of variance. Experimental results on the Flickr30K and MSCOCO datasets demonstrate that MASK achieves superior performance in unpaired matching.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

279. Mixture of Contexts for Long Video Generation

Authors:

Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
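The routing idea described in this abstract (each query picks a few informative chunks, always keeps mandatory anchors, and never looks at future chunks) can be sketched as follows. This is a minimal illustration, not the MoC module itself: representing each chunk by a single descriptor vector, scoring by dot product, and the names `select_chunks`, `chunk_keys`, and `anchors` are assumptions introduced here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_chunks(query, chunk_keys, query_chunk, k, anchors):
    """Pick the chunks one query attends to (illustrative sketch).

    query       -- query vector
    chunk_keys  -- one descriptor vector per chunk (assumed, e.g. pooled keys)
    query_chunk -- index of the chunk the query itself belongs to
    k           -- number of routed chunks to add on top of the anchors
    anchors     -- set of chunk indices that are always attended to
                   (e.g. the caption chunk and the local window)
    """
    # Causal routing: only strictly earlier chunks are candidates.
    candidates = [j for j in range(query_chunk) if j not in anchors]
    # Rank candidates by similarity to the query; ties go to the earlier chunk.
    ranked = sorted(candidates, key=lambda j: (-dot(query, chunk_keys[j]), j))
    return sorted(set(ranked[:k]) | set(anchors))


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0], [9.0, 9.0]]
    # Query lives in chunk 3; chunk 0 (caption) and chunk 3 (local) are anchors.
    print(select_chunks([1.0, 0.0], keys, query_chunk=3, k=1, anchors={0, 3}))
    # prints [0, 2, 3]; chunk 4 lies in the future and is never a candidate
```

Because each query attends to at most `k` routed chunks plus a fixed number of anchors, the attended context per query stays bounded as the sequence grows, which is the intuition behind the near-linear scaling the abstract mentions.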

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

280. Interactive Learning of Single-Index Models via Stochastic Gradient Descent

Authors:

Stochastic gradient descent (SGD) is a cornerstone algorithm for high-dimensional optimization, renowned for its empirical successes. Recent theoretical advances have provided a deep understanding of how SGD enables feature learning in high-dimensional nonlinear models, most notably the *single-index model* with i.i.d. data. In this work, we study the sequential learning problem for single-index models, also known as generalized linear bandits or ridge bandits, where SGD is a simple and natural solution, yet its learning dynamics remain largely unexplored. We show that, similar to the optimal interactive learner, SGD undergoes a distinct "burn-in" phase before entering the "learning" phase in this setting. Moreover, with an appropriately chosen learning rate schedule, a single SGD procedure simultaneously achieves near-optimal (or best-known) sample complexity and regret guarantees across both phases, for a broad class of link functions. Our results demonstrate that SGD remains highly competitive for learning single-index models under adaptive data.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

281. wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

Authors:

Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and can lead to large variance and estimation error in the RL objective, particularly in computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy optimization approach that reformulates the objective as a weighted log-likelihood, requiring only a single approximation for the current parametrized policy likelihood. Formally, we show that our proposed method is equivalent to training an energy-guided discrete diffusion model, thereby confirming its theoretical soundness. In experiments, wd1 outperforms diffusion-based GRPO (d1) while requiring lower computational cost, achieving up to a $+43\%$ improvement in accuracy. Furthermore, we extend wd1 to denoising-stepwise weighted policy optimization (wd1++), which surpasses concurrent RL methods for dLLMs, attaining improvements of $+6.2\%$ on MATH500 ($44.2\%$) and $+2.5\%$ on GSM8K ($84.5\%$) over d1 with only 20 training steps.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

282. Automatic Image-Level Morphological Trait Annotation for Organismal Images

Authors:

Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K expert-validated trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. When used to fine-tune BioCLIP, a biologically grounded vision-language model, Bioscan-Traits improves zero-shot species classification on the in-the-wild Insects benchmark, underscoring the value of trait-level supervision for enhancing model generalization.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

283. DTO-KD: Dynamic Trade-off Optimization for Effective Knowledge Distillation

Authors:

Knowledge Distillation (KD) is a widely adopted framework for compressing large models into compact student models by transferring knowledge from a high-capacity teacher. Despite its success, KD presents two persistent challenges: (1) the trade-off between optimizing for the primary task loss and mimicking the teacher's outputs, and (2) the gradient disparity arising from architectural and representational mismatches between teacher and student models. In this work, we propose Dynamic Trade-off Optimization for Knowledge Distillation (DTO-KD), a principled multi-objective optimization formulation of KD that dynamically balances task and distillation losses at the gradient level. Specifically, DTO-KD resolves two critical issues in gradient-based KD optimization: (i) gradient conflict, where task and distillation gradients are directionally misaligned, and (ii) gradient dominance, where one objective suppresses learning progress on the other. Our method adapts per-iteration trade-offs by leveraging gradient projection techniques to ensure balanced and constructive updates. We evaluate DTO-KD on large-scale benchmarks including ImageNet-1K for classification and COCO for object detection. Across both tasks, DTO-KD consistently outperforms prior KD methods, yielding state-of-the-art accuracy and improved convergence behavior. Furthermore, student models trained with DTO-KD exceed the performance of their non-distilled counterparts, demonstrating the efficacy of our multi-objective formulation for KD.
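The notion of gradient conflict in this abstract can be made concrete with a small projection example. The abstract does not specify DTO-KD's update rule, so the sketch below swaps in a generic PCGrad-style projection: when the task and distillation gradients have a negative inner product, each one has its component along the other removed before they are summed. The function names are hypothetical, and gradient dominance (issue ii) is not handled here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(g, h):
    """Return g with its component along h removed, but only if g conflicts with h."""
    d = dot(g, h)
    if d >= 0:
        # No conflict (this also covers h being the zero vector): keep g as is.
        return list(g)
    scale = d / dot(h, h)
    return [gi - scale * hi for gi, hi in zip(g, h)]


def combined_update(g_task, g_kd):
    """Sum of the two gradients after mutual conflict projection (PCGrad-style)."""
    gt = project_out_conflict(g_task, g_kd)
    gk = project_out_conflict(g_kd, g_task)
    return [a + b for a, b in zip(gt, gk)]


if __name__ == "__main__":
    g_task, g_kd = [1.0, 0.0], [-1.0, 1.0]   # inner product is -1: a conflict
    update = combined_update(g_task, g_kd)
    print(update)                            # prints [0.5, 1.5]
    # The combined direction no longer opposes either objective.
    print(dot(update, g_task) >= 0, dot(update, g_kd) >= 0)
```

A plain sum of the two example gradients would be `[0.0, 1.0]`, which makes no progress on the task loss; after projection the update has a non-negative inner product with both gradients.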

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

284. Mean-Field Neural Differential Equations: A Game-Theoretic Approach to Sequence Prediction

Authors:

We propose a novel class of neural differential equation models called mean-field continuous sequence predictors (MFPs) for efficiently generating continuous sequences with potentially infinite-order complexity. To address complex inductive biases in time-series data, we employ mean-field dynamics structured through carefully designed graphons. By reframing continuous sequence prediction as mean-field games, we utilize a fictitious play strategy integrated with gradient-descent techniques. This approach exploits the stochastic maximum principle to determine the Nash equilibrium of the system. Both empirical evidence and theoretical analysis highlight the unique advantages of our framework, where a collective of continuous predictors achieves highly accurate predictions and consistently outperforms benchmark prior works.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

285. Break the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models

Authors:

Watermarking is a principled approach for tracing the provenance of large language model (LLM) outputs, but its deployment in practice is hindered by inference inefficiency. Speculative sampling accelerates inference, with efficiency improving as the acceptance rate between draft and target models increases. Yet recent work reveals a fundamental trade-off: higher watermark strength reduces acceptance, preventing their simultaneous achievement. We revisit this trade-off and show it is not absolute. We introduce a quantitative measure of watermark strength that governs statistical detectability and is maximized when tokens are deterministic functions of pseudorandom numbers. Using this measure, we fully characterize the trade-off as a constrained optimization problem and derive explicit Pareto curves for two existing watermarking schemes. Finally, we introduce a principled mechanism that injects pseudorandomness into draft-token acceptance, ensuring maximal watermark strength while maintaining speculative sampling efficiency. Experiments further show that this approach improves detectability without sacrificing efficiency. Our findings uncover a principle that unites speculative sampling and watermarking, paving the way for their efficient and practical deployment.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 6

Individual scores: 6, 8, 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

286. Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

Authors:

As Generative AI (GenAI) systems see growing adoption, a key concern involves the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of synthetic "persona" ratings -- produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either: (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.
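The "either model suffices" property described in this abstract is characteristic of the standard doubly-robust (augmented inverse-probability-weighted) estimator, which can be sketched in a few lines. This is the generic textbook form under assumed inputs, not necessarily the paper's exact estimator: the function name, the tuple layout, and the idea that rating predictions and reweighting factors are already computed are all assumptions for illustration.

```python
def doubly_robust_estimate(source, target_predictions):
    """Generic doubly-robust estimate of mean rating on the target distribution.

    source             -- list of (human_rating, predicted_rating, weight) for
                          items rated under sampling bias; `weight` is the
                          output of an assumed reweighting model
    target_predictions -- predicted ratings (e.g. from persona ratings) for
                          items drawn from the target distribution
    """
    # Outcome-model term: average prediction on the target sample.
    outcome_term = sum(target_predictions) / len(target_predictions)
    # Correction term: reweighted residuals of the prediction model on the
    # source data. It vanishes when the predictions match the human ratings.
    correction = sum(w * (y - m) for y, m, w in source) / len(source)
    return outcome_term + correction


if __name__ == "__main__":
    source = [(4.0, 3.5, 2.0), (2.0, 2.5, 0.5)]
    print(doubly_robust_estimate(source, [3.0, 4.0]))  # prints 3.875
```

If the prediction model is accurate, the residuals are near zero and the first term carries the estimate; if it is not, accurate weights let the correction term remove its bias, which mirrors conditions (i) and (ii) in the abstract.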

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

287. OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

Authors:

We introduce spatio-temporal omni-object video grounding, dubbed $\textbf{OmniSTVG}$, a new STVG task aiming to localize spatially and temporally all targets mentioned in the textual query within videos. Compared to classic STVG locating only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts in the query from the video, making it more flexible and practical in real scenarios for comprehensive understanding. In order to facilitate exploration of OmniSTVG, we propose $\textbf{BOSTVG}$, a large-scale benchmark dedicated to OmniSTVG. Specifically, BOSTVG contains 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To the best of our knowledge, BOSTVG is to date the first and largest benchmark for OmniSTVG. To encourage future research, we present a simple yet effective approach, named $\textbf{OmniTube}$, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark and code will be released.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

288. WAFT: Warping-Alone Field Transforms for Optical Flow

Authors:

We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring, Sintel, and KITTI benchmarks and achieves the best zero-shot generalization on KITTI, while being up to 4.1× faster than methods with similar performance. Code and model weights will be available upon acceptance.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

289. RAP: 3D Rasterization Augmented End-to-End Planning

Authors:

Imitation learning for end-to-end driving trains policies only on expert demonstrations. Once deployed in a closed loop, such policies lack recovery data: small mistakes cannot be corrected and quickly compound into failures. A promising direction is to generate alternative viewpoints and trajectories beyond the logged path. Prior work explores photorealistic digital twins via neural rendering or game engines, but these methods are prohibitively slow and costly, and thus mainly used for evaluation. In this work, we argue that photorealism is unnecessary for training end-to-end planners. What matters is semantic fidelity and scalability: driving depends on geometry and dynamics, not textures or lighting. Motivated by this, we propose 3D Rasterization, which replaces costly rendering with lightweight rasterization of annotated primitives, enabling augmentations such as counterfactual recovery maneuvers and cross-agent view synthesis. To transfer these synthetic views effectively to real-world deployment, we introduce a Raster-to-Real (R2R) feature-space alignment that bridges the sim-to-real gap at the representation level. Together, these components form the Rasterization Augmented Planning (RAP) pipeline, a scalable data augmentation framework for planning. RAP achieves state-of-the-art closed-loop robustness and long-tail generalization, ranking 1st on four major benchmarks: NAVSIM v1/v2, Waymo Open Dataset Vision-based E2E Driving, and Bench2Drive. Our results demonstrate that lightweight rasterization with feature alignment suffices to scale end-to-end training, offering a practical alternative to photorealistic rendering. Code will be released.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 OpenReview 📄 Download PDF

290. DISCO: Diversifying Sample Condensation for Accelerating Model Evaluation

Authors:

Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that *maximise diversity in model responses*. Our method, **Diversifying Sample Condensation** **(DISCO)**, selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. **DISCO** shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC.
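The selection rule in this abstract (keep the top-k samples on which models disagree most) can be sketched with sample-wise statistics only. The abstract does not state which disagreement statistic DISCO uses, so the entropy of predicted labels across models is an assumption made here, and the names `disagreement` and `select_top_k` are hypothetical.

```python
import math
from collections import Counter


def disagreement(labels):
    """Shannon entropy (in nats) of the labels that the models predict for one sample."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def select_top_k(predictions, k):
    """Return indices of the k samples with the highest inter-model disagreement.

    predictions[m][i] is the label that model m predicts for sample i.
    Ties are broken in favour of the lower sample index.
    """
    n_samples = len(predictions[0])
    scores = [disagreement([model[i] for model in predictions])
              for i in range(n_samples)]
    order = sorted(range(n_samples), key=lambda i: (-scores[i], i))
    return order[:k]


if __name__ == "__main__":
    # Three models, four samples.
    predictions = [
        [0, 0, 1, 2],
        [0, 1, 1, 0],
        [0, 1, 1, 1],
    ]
    print(select_top_k(predictions, 2))  # prints [3, 1]
```

Each score depends only on one sample's column of predictions, so no clustering or pairwise sample comparison is needed, which is the simplification the abstract emphasizes.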

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 8, 6, 6

📄 OpenReview 📄 Download PDF

291. WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport

Authors:

The Wasserstein–Fisher–Rao (WFR) metric extends dynamic optimal transport (OT) by coupling displacement with change of mass, providing a principled geometry for modeling unbalanced snapshot dynamics. Existing WFR solvers, however, are often unstable, computationally expensive, and difficult to scale. Here we introduce **WFR Flow Matching (WFR-FM)**, a simulation-free training algorithm that unifies flow matching with dynamic unbalanced OT. Unlike classical flow matching which regresses only a transport vector field, WFR-FM simultaneously regresses a vector field for displacement and a scalar growth rate function for birth–death dynamics, yielding continuous flows under the WFR geometry. Theoretically, we show that minimizing the WFR-FM loss exactly recovers WFR geodesics. Empirically, WFR-FM yields more accurate and robust trajectory inference in single-cell biology, reconstructing consistent dynamics with proliferation and apoptosis, estimating time-varying growth fields, and applying to generative dynamics under imbalanced data. It outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy. Overall, WFR-FM establishes a unified and efficient paradigm for learning dynamical systems from unbalanced snapshots, where not only states but also mass evolve over time.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

292. LongLive: Real-time Interactive Long Video Generation

Authors:

We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases the complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with the new prompt for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long–test-long); and short window attention paired with a frame-level attention sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short- and long-video settings. LongLive supports up to 240-second videos on a single H100 GPU. With FP8 quantization, LongLive boosts inference to 24.8 FPS with marginal quality loss.

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

293. SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

Authors:

Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To tackle this issue, we present **SERE**, a **S**imilarity-based **E**xpert **R**e-routing method for **E**fficient batch decoding in MoE models. SERE dynamically reduces active experts in an input‑aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single‑line code change. Experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to $2.0\times$ speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment.
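The re-routing step described in this abstract can be illustrated on a toy batch. The abstract does not define how primary and secondary experts are chosen or how similarity is computed, so this sketch assumes that the most frequently selected experts in the batch are primary and that an expert-to-expert similarity matrix is given; `reroute` and its parameters are hypothetical names, and the preservation of critical experts mentioned in the abstract is not modelled.

```python
from collections import Counter


def reroute(batch_routes, similarity, n_primary):
    """Map secondary experts to their most similar primary expert (toy sketch).

    batch_routes -- per-token lists of selected expert ids
    similarity   -- similarity[a][b] between experts a and b (assumed given)
    n_primary    -- number of experts kept active for this batch
    Returns (new_routes, primary_experts).
    """
    counts = Counter(e for route in batch_routes for e in route)
    # Assumption: the experts chosen most often in the batch are "primary";
    # ties are broken by the lower expert id.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    primary = [e for e, _ in ranked[:n_primary]]
    primary_set = set(primary)

    def nearest_primary(expert):
        return max(primary, key=lambda p: similarity[expert][p])

    # A token may end up with the same expert twice after re-routing; how to
    # merge such duplicates is left open in this sketch.
    new_routes = [[e if e in primary_set else nearest_primary(e) for e in route]
                  for route in batch_routes]
    return new_routes, primary


if __name__ == "__main__":
    sim = [
        [1.0, 0.2, 0.1, 0.8],
        [0.2, 1.0, 0.9, 0.3],
        [0.1, 0.9, 1.0, 0.2],
        [0.8, 0.3, 0.2, 1.0],
    ]
    routes, primary = reroute([[0, 1], [0, 2], [0, 3]], sim, n_primary=2)
    print(primary, routes)  # prints [0, 1] [[0, 1], [0, 1], [0, 0]]
```

In the toy batch the number of active experts drops from four to two, which is the effect that matters for the memory-bound decoding stage: fewer distinct expert weights have to be loaded per step.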

📊 Review Scores

Average score: 6.67

Minimum score: 6

Maximum score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 OpenReview 📄 Download PDF

294. IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

Authors:

Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline. Unlike previous methods that are bound to a specific language model, we introduce an Instance-Grounded Scene Understanding paradigm, where instance masks serve as the bridge connecting our unified representation with diverse Visual Language Models (VLMs) in a plug-and-play manner, substantially expanding downstream understanding capabilities. Extensive experiments on instance spatial tracking, open-vocabulary segmentation, and QA scene grounding demonstrate that IGGT outperforms state-of-the-art methods in both quality and consistency for semantic 3D reconstruction.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

295. DreamPhase: Offline Imagination and Uncertainty-Guided Planning for Large-Language-Model Agents

Authors:

Autonomous agents capable of perceiving complex environments, understanding instructions, and performing multi-step tasks hold transformative potential across domains such as robotics, scientific discovery, and web automation. While large language models (LLMs) provide a powerful foundation, they struggle with closed-loop decision-making due to static pretraining and limited temporal grounding. Prior approaches either rely on expensive, real-time environment interactions or brittle imitation policies, both with safety and efficiency trade-offs. We introduce DreamPhase, a modular framework that plans through offline imagination. A learned latent world model simulates multi-step futures in latent space; imagined branches are scored with an uncertainty-aware value and filtered by a safety gate. The best branch is distilled into a short natural-language reflection that conditions the next policy query, improving behavior without modifying the LLM. Crucially, DreamPhase attains its performance with substantially fewer real interactions: on WebShop, average API calls per episode drop from $\sim$40 with ARMAP-M (token-level search) to $<10$ with DreamPhase, a $4\times$ reduction that lowers latency and reduces executed irreversible actions by $\sim 5\times$ on WebShop (4.9$\times$ on ALFWorld) per incident logs. Across web, science, and embodied tasks, DreamPhase improves sample efficiency, safety, and cost over search-based and reward-based baselines. This offers a scalable path toward safe, high-performance autonomous agents via imagination-driven planning. Code: https://anonymous.4open.science/r/DreamPhase-A8AD/README.md.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

296. Neural Graduated Assignment for Maximum Common Edge Subgraphs

Authors:

The Maximum Common Edge Subgraph (MCES) problem is a crucial challenge with significant implications in domains such as biology and chemistry. Traditional approaches, which include transformations into max-clique and search-based algorithms, suffer from scalability issues when dealing with larger instances. This paper introduces "Neural Graduated Assignment" (NGA), a simple, scalable, unsupervised-training-based method that addresses these limitations. Central to NGA is stacking of differentiable assignment optimization with neural components, enabling high-dimensional parameterization of the matching process through a learnable temperature mechanism. We further theoretically analyze the learning dynamics of NGA, showing its design leads to fast convergence, better exploration-exploitation tradeoff, and ability to escape local optima. Extensive experiments across MCES computation, graph similarity estimation, and graph retrieval tasks reveal that NGA not only significantly improves computation time and scalability on large instances but also enhances performance compared to existing methodologies. The introduction of NGA marks a significant advancement in the computation of MCES and offers insights into other assignment problems. Code is open-sourced at https://anonymous.4open.science/r/NGA-10E3.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

297. SYNC: Measuring and Advancing Synthesizability in Structure-Based Drug Design

Authors:

Designing 3D ligands that bind to a given protein pocket with high affinity is a fundamental task in Structure-Based Drug Design (SBDD). However, the lack of synthesizability of 3D ligands has been hindering progress toward experimental validation; moreover, computationally evaluating synthesizability is a non-trivial task. In this paper, we first benchmark eight classical synthesizability metrics across 11 SBDD methods. The comparison reveals significant inconsistencies between these metrics, making them impractical and inaccurate criteria for guiding SBDD methods toward synthesizable drug design. Therefore, we propose a simple yet effective SE(3)-invariant SYNthesizability Classifier (SYNC) to enable better synthesizability estimation in SBDD, which demonstrates superior generalizability and speed compared to existing metrics on five curated datasets. Finally, with SYNC as a plug-and-play module, we establish a synthesizability classifier-driven SBDD paradigm through guided diffusion and Direct Preference Optimization, where highly synthesizable molecules are directly generated without compromising binding affinity. Extensive experiments also demonstrate the effectiveness of SYNC and the advantage of our paradigm in synthesizable SBDD. Code is available at https://anonymous.4open.science/r/SYNC-C94D/.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

298. Policy Likelihood-based Query Sampling and Critic-Exploited Reset for Efficient Preference-based Reinforcement Learning

Authors:

Preference-based reinforcement learning (PbRL) enables agent training without explicit reward design by leveraging human feedback. Although various query sampling strategies have been proposed to improve feedback efficiency, many fail to enhance performance because they select queries from outdated experiences with low likelihood under the current policy. Such queries may no longer represent the agent's evolving behavior patterns, reducing the informativeness of human feedback. To address this issue, we propose a policy likelihood-based query sampling and critic-exploited reset (PoLiCER). Our approach uses policy likelihood-based query sampling to ensure that queries remain aligned with the agent’s evolving behavior. However, relying solely on policy-aligned sampling can result in overly localized guidance, leading to overestimation bias, as the model tends to overfit to early feedback experiences. To mitigate this, PoLiCER incorporates a dynamic resetting mechanism that selectively resets the reward estimator and its associated Q-function based on critic outputs. Experimental evaluation across diverse locomotion and robotic manipulation tasks demonstrates that PoLiCER consistently outperforms existing PbRL methods.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

299. LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models

Authors:

Lipschitz-based certification offers efficient, deterministic robustness guarantees but has struggled to scale in model size, training efficiency, and ImageNet performance. We introduce LipNeXt, the first constraint-free and convolution-free 1-Lipschitz architecture for certified robustness. LipNeXt is built using two techniques: (1) a manifold optimization procedure that updates parameters directly on the orthogonal manifold and (2) a Spatial Shift Module to model spatial patterns without convolutions. The full network uses orthogonal projections, spatial shifts, a simple 1-Lipschitz $\beta$-Abs nonlinearity, and $L_2$ spatial pooling to maintain tight Lipschitz control while enabling expressive feature mixing. Across CIFAR-10/100 and Tiny-ImageNet, LipNeXt achieves state-of-the-art clean and certified robust accuracy (CRA), and on ImageNet it scales to 1–2B large models, improving CRA over prior Lipschitz models (e.g., up to $+8\%$ at $\varepsilon = 1$) while retaining efficient, stable low-precision training. These results demonstrate that Lipschitz-based certification can benefit from modern scaling trends without sacrificing determinism or efficiency.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

300. JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

Authors:

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code, benchmark, and checkpoints will be made publicly available.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 6, 8

📄 openreview 📄 Download PDF

301. Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

Authors:

The probabilistic diffusion model (DM), generating content by inferencing through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous data, the model needs to be properly aligned to meet requirements for downstream applications. How to efficiently align the foundation DM is a crucial task. Contemporary methods are either based on Reinforcement Learning (RL) or truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation, respectively, resulting in limited improvement or, even worse, complete training failure. To overcome the challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a Half-Order (HO) fine-tuning paradigm for DM. The HO gradient estimator enables the computation graph rearrangement within the recursive diffusive chain, making the RLR's gradient estimator **an unbiased one with lower variance** than other methods. We theoretically investigate the bias, variance, and convergence of our method. Extensive experiments are conducted on image and video generation to validate the superiority of the RLR. Furthermore, we propose a novel prompt technique that is natural for the RLR to achieve a synergistic effect.

📊 Review Scores

Average score: 6.67

Lowest score: 6

Highest score: 8

Number of reviewers: 3

Individual scores: 6, 8, 6

📄 openreview 📄 Download PDF

302. Automated Stateful Specialization for Adaptive Agent Systems

Authors:

Current automated agent design frameworks produce either static workflows that lack adaptability or per-query optimizers that prevent the accumulation of deep, agent-level task expertise. We propose a new direction that reconciles these paradigms: creating stateful teams of specialist agents that accumulate knowledge over time and can be reconfigured for novel tasks entirely without human intervention. To this end, we introduce ASpec, a framework that manages this full agent lifecycle by first autonomously discovering specialist archetypes via evolutionary search and then cultivating their expertise through experience, mirroring how human experts learn through practice and reflection. We further introduce a lightweight hierarchical control policy, "retain-then-escalate," which governs when to leverage the established agent system versus when to adapt its structure. Through comprehensive experiments, we demonstrate that this approach leads to significant performance gains on expert-level scientific benchmarks like GPQA while matching the state-of-the-art on broader domain tasks, demonstrating a promising path toward agent systems that are simultaneously expert, adaptive, and efficient.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

303. ProxyAttn: Guided Sparse Attention via Representative Heads

Authors:

The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their block-level coarse-grained estimation inevitably leads to performance degradation at high sparsity ratios. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves token-level estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads in long texts, we use the attention scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from a set of representative heads with a multi-head dynamic budget, we can achieve a more fine-grained block attention evaluation at a low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads in long texts. Leveraging a token-level fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

304. ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

Authors:

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we design a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows the framework to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://anonymous.4open.science/r/ST-SimDiff-7225.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

305. Designing Rules to Pick a Rule: Aggregation by Consistency

Authors:

Rank aggregation has critical applications for developing AI agents, as well as for evaluating them. However, different methods can give rise to significantly different aggregate rankings, impacting these applications. Indeed, work in social choice and statistics has produced many rank aggregation methods, each with its desirable properties, but also with its limitations. Given this trade-off, how do we decide which aggregation rule to use, _i.e._, what is a good _rule picking rule (RPR)_? In this paper, we design a data-driven RPR that identifies the best method for each dataset without assuming a generative model. The principle behind our RPR is to maximize consistency if the data collection process was repeated. We show that our method satisfies several consistency-related axioms failed by a wide class of natural RPRs. While we prove that the computational problem of maximizing consistency is hard, we provide a sampling-based implementation that is efficient in practice. We run this implementation on known statistical models to experimentally demonstrate its desirable properties, as well as on real-world data where our method provides important insights into how to improve consistency.
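The premise that different aggregation methods can produce different aggregate rankings is easy to check on a toy profile. The sketch below is only an illustration of that premise, not the paper's rule picking rule: it compares two textbook rules (plurality and Borda count) on a hypothetical five-voter profile, and the function names and the alphabetical tie-breaking are assumptions made for this example.

```python
from collections import Counter


def plurality(rankings):
    """Rank candidates by number of first-place votes (ties broken alphabetically)."""
    candidates = sorted(rankings[0])
    firsts = Counter(r[0] for r in rankings)
    return sorted(candidates, key=lambda c: (-firsts[c], c))


def borda(rankings):
    """Rank candidates by Borda score: m-1 points for first place, down to 0 for last."""
    m = len(rankings[0])
    scores = Counter()
    for r in rankings:
        for pos, c in enumerate(r):
            scores[c] += m - 1 - pos
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical profile: three voters prefer a > b > c, two prefer b > c > a.
profile = [("a", "b", "c")] * 3 + [("b", "c", "a")] * 2

plurality_ranking = plurality(profile)  # a has 3 first places, b has 2, c has 0
borda_ranking = borda(profile)          # Borda scores: a = 6, b = 7, c = 2
print(plurality_ranking, borda_ranking)
```

On this profile plurality puts `a` first while Borda puts `b` first, which is the kind of disagreement that motivates choosing among rules in a principled way.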

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

306. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Authors:

Modern language agents often need to solve tasks requiring long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths due to the LLM forgetting the context. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant context size when solving long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. Leveraging reinforcement learning (RL) and rollout trajectory truncation, we train a MEM1 agent to develop internal states that integrate prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5$\times$ while reducing memory usage by 3.7$\times$ compared to Qwen2.5-14B-Instruct on an augmented multi-hop QA dataset with 16 objectives in each task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon task-solving agents that involve multiple interactions, where both efficiency and performance are optimized.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

307. Optimistic Task Inference for Behavior Foundation Models

Authors:

Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

308. Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models

Authors:

Behavioral Foundation Models (BFMs) have been recently successful in producing agents with the capabilities to adapt to any unknown reward or task. In reality, these methods are only able to produce near-optimal policies for the reward functions that are in the span of some pre-existing _state features_. Naturally, their efficiency relies heavily on the choice of state features that they use. As a result, these BFMs have used a wide variety of complex objectives, often sensitive to environment coverage, to train task-spanning features with different inductive properties. With this work, our aim is to examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity, and subsequently reducing the span of reward functions for which we can represent optimal policies. We propose an approach, RLDP, that adds a simple regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we demonstrate that prior approaches diverge in low-coverage scenarios where RLDP still succeeds.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

309. Visual symbolic mechanisms: Emergent symbol processing in Vision Language Models

Authors:

To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this ‘binding problem’ via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by Vision Language Models (VLM). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a previously unknown set of emergent symbolic mechanisms that support binding specifically in VLMs, via a content-independent, spatial indexing scheme. Moreover, we find that binding errors, when they occur, can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for reducing the number of binding failures exhibited by these models.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

310. SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

Authors:

Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor’s sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor’s epistemic uncertainty. Analytically we show that this adjustment implicitly reweighs policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.
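The idea of evaluating gradients at perturbed parameters can be illustrated with a generic sharpness-aware minimization (SAM) step. The sketch below is a minimal NumPy example on a toy quadratic loss, not SHAPO's policy update: the loss, the function names `sam_step` and `toy_grad`, and the values of `lr` and `rho` are all assumptions chosen for illustration.

```python
import numpy as np


def sam_step(params, grad_fn, lr=0.05, rho=0.05):
    """One generic sharpness-aware step for minimizing a loss.

    1. Compute the gradient at the current parameters.
    2. Move a distance rho along the normalized gradient (the locally worst direction).
    3. Re-evaluate the gradient at the perturbed point and descend with it.
    """
    g = grad_fn(params)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # small constant avoids division by zero
    g_perturbed = grad_fn(params + eps)
    return params - lr * g_perturbed


def toy_loss(w):
    # Hypothetical loss: 0.5 * (10 * w0^2 + w1^2), sharper along w0 than along w1.
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)


def toy_grad(w):
    # Analytic gradient of toy_loss.
    return np.array([10.0, 1.0]) * w


w = np.array([1.0, 1.0])
initial_loss = toy_loss(w)
for _ in range(100):
    w = sam_step(w, toy_grad)
final_loss = toy_loss(w)
print(initial_loss, final_loss)
```

In a policy-gradient setting the same two-pass pattern would be applied to the actor's objective; how SHAPO defines the perturbation and the resulting reweighting is specified in the paper, not here.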

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

311. Segment-Level Attribution for Selective Learning of Long Reasoning Traces

Authors:

Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To this end, we incorporate integrated gradient attribution to quantify each token's influence on final answers and aggregate them into two segment-level metrics: (1) attribution strength measures the overall attribution magnitude; and (2) direction consistency captures whether tokens' attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

312. FutureFill: Fast Generation from Convolutional Sequence Models

Authors:

We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill—a general-purpose fast generation method for any sequence prediction algorithm based on convolutional operators. FutureFill reduces generation time from quadratic to quasilinear in the context length. Moreover, when generating from a prompt, it requires a prefill cache whose size grows only with the number of tokens to be generated—often much smaller than the caches required by standard convolutional or attention‐based models. We validate our theoretical claims with language modeling experiments and demonstrate substantial efficiency gains when generating from a deep convolutional sequence prediction model.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

313. Scalable Oversight via Partitioned Human Supervision

Authors:

As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging. Our focus is on tasks that require deep knowledge and skills of multiple domains. Unfortunately, even the best human experts are knowledgeable only in a single narrow area, and will not be able to evaluate the correctness of advanced AI systems on such superhuman tasks. However, based on their narrow expertise, humans may provide a weak signal, i.e., a *complementary label* indicating an option that is incorrect. For example, a cardiologist could state that "this is not related to cardiology," even if they cannot identify the true disease. Based on this weak signal, we propose a scalable oversight framework that enables us to evaluate frontier AI systems without the need to prepare the ground truth. We derive an *unbiased* estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators to combine scarce ordinary labels with abundant complementary labels. We provide finite-sample deviation guarantees for both the complementary-only and the mixed estimators. Empirically, we show that we can evaluate the output of large language models without the ground truth, if we have complementary labels. We further show that we can train an AI system with such weak signals: we show how we can automatically design an agentic AI system that performs better with this partitioned human supervision.
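To see why accuracy can be estimated without ground truth, consider a simplified setting that is an assumption of this sketch rather than a statement of the paper's exact setup: each question has $K$ options and the complementary label is drawn uniformly from the $K-1$ incorrect ones. A correct prediction never coincides with the complementary label, while an incorrect one coincides with probability $1/(K-1)$, so the match rate has expectation $(1-\text{acc})/(K-1)$ and $1-(K-1)\cdot\text{match rate}$ is unbiased for the accuracy. The function names and simulation parameters below are hypothetical.

```python
import random


def estimate_accuracy(predictions, complementary_labels, num_options):
    """Estimate top-1 accuracy from complementary labels only.

    Assumes (for this sketch) that each complementary label is uniform over
    the num_options - 1 incorrect options.
    """
    matches = sum(p == c for p, c in zip(predictions, complementary_labels))
    match_rate = matches / len(predictions)
    # E[match_rate] = (1 - acc) / (K - 1), hence acc = 1 - (K - 1) * E[match_rate].
    return 1.0 - (num_options - 1) * match_rate


def simulate(num_items=20000, num_options=4, true_accuracy=0.7, seed=0):
    """Simulate a model with known accuracy and score it with complementary labels."""
    rng = random.Random(seed)
    preds, comps = [], []
    for _ in range(num_items):
        truth = rng.randrange(num_options)
        wrong = [k for k in range(num_options) if k != truth]
        # The simulated model is right with probability true_accuracy.
        preds.append(truth if rng.random() < true_accuracy else rng.choice(wrong))
        # The annotator only names one option known to be incorrect.
        comps.append(rng.choice(wrong))
    return estimate_accuracy(preds, comps, num_options)


estimate = simulate()
print(estimate)
```

The estimate should land close to the simulated accuracy of 0.7. Because each complementary label carries less information than an ordinary label, the estimator is noisier at equal sample size, and on small samples it can fall outside $[0, 1]$.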

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

314. ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Authors:

With the growing adoption of large language model (LLM) agents in persistent, real-world roles, they naturally encounter continuous streams of tasks and interactions. A key limitation, however, is their failure to learn from this accumulated experience, forcing them to discard valuable insights and repeat past errors. Unlike prior works that primarily store raw experience or successful routines, we propose ReasoningBank, a novel memory framework that allows an agent to self-curate generalizable reasoning strategies from both its successful and failed experiences for future leverage. This mechanism enables agents to generalize across tasks and become more capable over time. To accelerate and diversify this test-time learning process, we further propose memory-aware test-time scaling (MaTTS), which leverages a powerful synergy between memory and test-time scaling. On one hand, relevant memory from ReasoningBank guides the scaling process toward more effective exploration and improved reliability. On the other, scaling, in both parallel and sequential settings, generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. Experiments on web browsing and software engineering tasks show that ReasoningBank consistently outperforms existing memory mechanisms in both effectiveness and efficiency, with MaTTS further amplifying the gains. These findings position memory-driven experience as a new dimension of test-time scaling, where emergent behaviors naturally arise and agents acquire self-evolving capabilities.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

315. WorldGym: World Model as An Environment for Policy Evaluation

Authors:

Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model which serves as a proxy to real world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model highly correlate with real-world success rates. Moreover, we show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints. Because it requires only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

316. Transport Clustering: Solving Low-Rank Optimal Transport via Clustering

作者:

Optimal transport (OT) finds a least cost transport plan between two probability distributions using a cost matrix over pairs of points. Constraining the rank of the transport plan yields low-rank OT, which improves computational complexity and statistical stability compared to full-rank OT. Further, low-rank OT naturally induces co-clusters between distributions and generalizes $K$-means clustering. Reversing this direction, we show that solving a clustering problem on a set of _correspondences_, termed _transport clustering_, solves low-rank OT. This connection between low-rank OT and transport clustering relies on a _transport registration_ of the cost matrix which registers the cost matrix via the transport map. We show that the reduction of low-rank OT to transport clustering yields polynomial-time, constant-factor approximation algorithms for low-rank OT. Specifically, we show that for the low-rank OT problem this reduction yields a $(1+\gamma)$-approximation algorithm for metrics of negative-type and a $(1+\gamma+\sqrt{2\gamma}\,)$-approximation algorithm for kernel costs where $\gamma \in [0,1]$ denotes the approximation ratio to the optimal full-rank solution. We demonstrate that transport clustering outperforms existing low-rank OT methods on several synthetic benchmarks and large-scale, high-dimensional real datasets.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

317. ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations

Authors:

Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods—mainly agentic scaffolding with off-the-shelf LLMs—struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 50% on Seed-Prover's IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

318. Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning

Authors:

Humans often rely on visual aids, such as diagrams or sketches, when tackling complex problems. Teaching multimodal models to adopt similar strategies, a process known as Visual Chain of Thought (visual CoT), is much more difficult. The main challenges are: (1) weak performance of off-the-shelf visual CoT, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce **Zebra-CoT**, a diverse large-scale interleaved text-image reasoning dataset with 182,384 reasoning traces across 18 domains with over 50 distinct tasks. This dataset is specifically designed to train models to natively perform visual CoT. We emphasize four categories of tasks where sketching or visual reasoning is especially natural, spanning (a) *scientific questions* such as geometry, physics, and algorithms; (b) *2D visual reasoning tasks* like visual search and jigsaw puzzles; (c) *3D reasoning tasks* including 3D multi-hop inference, embodied and robot planning; and (d) *visual logic problems and strategic games* like chess. Fine-tuning the Anole‑7B model on Zebra-CoT yields a +12\% improvement in our test‑set accuracy and up to +13\% performance gains on standard VLM benchmarks. Similarly, fine-tuning Bagel‑7B produces models capable of generating high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness in advancing multimodal reasoning.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

319. Towards Better Optimization For Listwise Preference in Diffusion Models

Authors:

Reinforcement learning from human feedback (RLHF) has proven effective for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its applications to diffusion models have primarily relied on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranked information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models with listwise data. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett–Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.
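
As a rough illustration of what a listwise objective under the Plackett–Luce model looks like, the sketch below computes the negative log-likelihood of a ranked list of scalar scores. It assumes each image already has a scalar implicit reward (in a DPO-style method this would typically be derived from a log-likelihood ratio against a reference model); the function name and the example scores are hypothetical, and this is not claimed to be the paper's exact loss.

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` are hypothetical scalar rewards, ordered from the most
    preferred item to the least preferred item.
    """
    nll = 0.0
    for k in range(len(scores) - 1):
        tail = scores[k:]  # item k competes against itself and all lower-ranked items
        m = max(tail)      # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        nll -= scores[k] - log_norm
    return nll

# A list whose scores agree with the ranking gets a lower loss than the reversed list.
print(plackett_luce_nll([2.0, 1.0, 0.0]) < plackett_luce_nll([0.0, 1.0, 2.0]))
```

With only two items the loss reduces to the familiar pairwise logistic form, `-log(sigmoid(s_0 - s_1))`, which is how a listwise objective of this kind generalizes a pairwise one.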

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

320. What Exactly Does Guidance Do in Masked Discrete Diffusion Models

Authors:

Masked discrete diffusion models have been gaining popularity recently, and classifier-free guidance, just like its continuous counterpart, has been proposed to enable efficacious conditional generation by discrete diffusion. To quantify the precise effect of discrete guidance, this article considers masked discrete diffusion with arbitrary data distribution in low dimension, so that the distribution that guided masked discrete diffusion samples from, as well as the sampling dynamics, can be analytically and exactly quantified and interpreted. When the full data distribution is a mixture over classes and the goal is to sample from a specific class, guidance amplifies class-specific regions while suppressing regions shared with other classes. This effect depends on the guidance strength $w$ and induces distinct covariance structures in the sampled distribution. Notably, we observe quantitatively different behaviors in $1$D and $2$D. We also show that for large $w$, the decay rate of the total variation ($\text{TV}$) along the reverse dynamics is double-exponential in $w$ for both $1$D and $2$D. These findings highlight the role of guidance, not just in shaping the output distribution, but also in controlling the dynamics of the sampling trajectory. Our theoretical analysis is supported by experiments that illustrate the geometric effects of guidance and its impact on convergence.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

321. Continuum Transformers Perform In-Context Learning by Operator Gradient Descent

Authors:

Transformers robustly exhibit the ability to perform in-context learning, whereby their predictive accuracy on a task can increase not by parameter updates but merely with the placement of training samples in their context windows. Recent works have shown that transformers achieve this by implementing gradient descent in their forward passes. Such results, however, are restricted to standard transformer architectures, which handle finite-dimensional inputs. In the space of PDE surrogate modeling, a generalization of transformers to handle infinite-dimensional function inputs, known as "continuum transformers," has been proposed and similarly observed to exhibit in-context learning. Despite impressive empirical performance, such in-context learning has yet to be theoretically characterized. We herein demonstrate that continuum transformers perform in-context operator learning by performing gradient descent in an operator RKHS. We demonstrate this using novel proof strategies that leverage a generalized representer theorem for Hilbert spaces and gradient flows over the space of functionals of a Hilbert space. We additionally show the operator learned in context is the Bayes Optimal Predictor in the infinite depth limit of the transformer. We then provide empirical validations of this optimality result and demonstrate that the parameters under which such gradient descent is performed are recovered through the continuum transformer training.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

322. OpenApps: Simulating Environment Variations to Measure UI Agent Reliability

Authors:

Reliability is key to realizing the promise of autonomous UI-agents, multimodal agents that directly interact with the apps humans use, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments (often clones of existing apps), which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed, however, agents are likely to encounter variations in app design and content that can affect an agent’s ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a lightweight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations. Task success rates for many agents can fluctuate by more than 50\% across app variations. For example, Kimi-VL-3B's average success across all tasks fluctuates from 63\% to just 4\% across app versions. We also find agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

323. How Text Quality Interventions Reshape Neural Scaling Laws for LLMs: Empirical Study

Authors:

Neural scaling laws are widely used for performance projection and resource planning, yet their sensitivity to data quality interventions remains poorly understood. We present an empirical study of how interventions—deduplication, heuristic filtering, and LLM-guided rewriting—reshape scaling behavior in large language model training. Using QualityPajama, a suite of 23 systematically filtered and synthetic datasets, we train over 2,000 models (100M–8B parameters, 100M–200B tokens) to measure how data quality affects scaling-law parameters and compute-optimal design decisions. Our results show that data interventions reshape scaling dynamics in non-trivial ways not captured by current theory, simultaneously moving exponents, coefficients, and constants in conflicting directions that exert opposing forces on loss. For example, an intervention may improve constants but hurt the exponents. Strategies that appear optimal at small scale can reverse at larger scale, and compute-optimal token–parameter ratios can vary by orders of magnitude depending on the intervention. These findings demonstrate that data curation and scaling strategy are deeply intertwined, and that evaluating interventions only at fixed scales can lead to misleading conclusions. We recommend evaluating interventions through their full scaling trajectories using scaling law projections.
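
To see why a shift in exponents can move the compute-optimal token-to-parameter ratio so strongly, consider a widely used parametric loss form. This is an illustrative assumption: the abstract does not state which functional form the study fits, and the symbols $E$, $A$, $B$, $\alpha$, $\beta$ and the approximation $C \approx 6ND$ are introduced here only for the sketch.

```latex
% Assumed parametric form: N = parameters, D = training tokens
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \approx 6\,N D .

% Minimizing L subject to fixed compute C gives \alpha A N^{-\alpha} = \beta B D^{-\beta}, hence
N_{\mathrm{opt}}(C) = G \left(\tfrac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
\qquad
D_{\mathrm{opt}}(C) = G^{-1} \left(\tfrac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}},
\qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} .

% Resulting token-to-parameter ratio
\frac{D_{\mathrm{opt}}}{N_{\mathrm{opt}}}
  = G^{-2} \left(\tfrac{C}{6}\right)^{\frac{\alpha-\beta}{\alpha+\beta}} .
```

Under this assumed form, an intervention that changes $A$ or $B$ rescales the ratio by a constant, while one that changes $\alpha$ relative to $\beta$ changes how the ratio grows with compute, so a ranking of interventions observed at small $C$ need not hold at large $C$.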

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

324. Skirting Additive Error Lower Bounds for Private Turnstile Streams

Authors:

We study differentially private continual release of the number of distinct items in a stream, where items may be both inserted and deleted. In this turnstile setting, a recent work of Jain, Kalemaj, Raskhodnikova, Sivakumar, and Smith (NeurIPS '23) showed that for streams of length $T$, polynomial additive error of $\Omega(T^{1/4})$ is necessary, even without any space restrictions. We show that this additive error lower bound can be circumvented if the algorithm is allowed to output estimates with *multiplicative* error. We give an algorithm for the continual release of the number of distinct elements with $\text{polylog} (T)$ multiplicative and $\text{polylog}(T)$ additive error. We also show a qualitatively similar phenomenon for estimating the $F_2$ moment of a turnstile stream, where we can obtain $1+o(1)$ multiplicative and $\text{polylog} (T)$ additive error. Both results can be achieved by polylogarithmic space streaming algorithms where some multiplicative error is necessary even without privacy. Lastly, we raise questions aimed at better understanding trade-offs between multiplicative and additive error in private continual estimation problems.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

325. Self-Destructive Language Models

Authors:

Harmful fine-tuning attacks represent a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent 'trainability' on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. The code is available at https://anonymous.4open.science/r/seam-5C7E. (Warning: this paper contains potentially harmful content generated by LLMs.)

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

326. Approximate Equivariance via Projection-Based Regularisation

Authors:

Equivariance is a powerful inductive bias in neural networks, improving generalisation and physical consistency. Recently, however, non-equivariant models have regained attention, due to their better runtime performance and imperfect symmetries that might arise in real-world applications. This has motivated the development of approximately equivariant models that strike a middle ground between respecting symmetries and fitting the data distribution. Existing approaches in this field usually apply sample-based regularisers which depend on data augmentation at training time, incurring a high sample complexity, in particular for continuous groups such as $SO(3)$. This work instead approaches approximate equivariance via a projection-based regulariser which leverages the orthogonal decomposition of linear layers into equivariant and non-equivariant components. In contrast to existing methods, this penalises non-equivariance at an operator level across the full group orbit, rather than point-wise. We present a mathematical framework for computing the non-equivariance penalty exactly and efficiently in both the spatial and spectral domain. In our experiments, our method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularisers.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

327. Structural Inference: Interpreting Small Language Models with Susceptibilities

Authors:

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

328. Lipschitz Bandits with Stochastic Delayed Feedback

Authors:

The Lipschitz bandit problem extends stochastic bandits to a continuous action set defined over a metric space, where the expected reward function satisfies a Lipschitz condition. In this work, we introduce a new problem of Lipschitz bandit in the presence of stochastic delayed feedback, where the rewards are not observed immediately but after a random delay. We consider both bounded and unbounded stochastic delays, and design algorithms that attain sublinear regret guarantees in each setting. For bounded delays, we propose a delay-aware zooming algorithm that retains the optimal performance of the delay-free setting up to an additional term that scales with the maximum delay $\tau_{\max}$. For unbounded delays, we propose a novel phased learning strategy that accumulates reliable feedback over carefully scheduled intervals, and establish a regret lower bound showing that our method is nearly optimal up to logarithmic factors. Finally, we present experimental results to demonstrate the efficiency of our algorithms under various delay scenarios.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

329. SPIKE-RL: Video-LLMs meet Bayesian Surprise

Authors:

Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs process videos by sampling frames uniformly, likely missing critical moments that define a video's narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where new visual evidence conflicts with prior beliefs. SPIKE effectively localizes surprise in videos, correlated with humans on positive (FunQA) and negative (Oops!) surprise benchmarks. SPIKE-RL further improves on SPIKE's ability to detect surprise, leveraging GRPO to refine its belief hypotheses based on a reward signal from the video caption. SPIKE and SPIKE-RL guide query-agnostic surprise-weighted frame sampling, which allocates more frames to interesting moments in the video. With this strategy, we achieve consistent performance gains on five downstream benchmarks. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way for more robust models that can revise their understanding in response to new information.
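
The two ingredients named above, Bayesian surprise as a belief update and surprise-weighted frame sampling, can be sketched in a few lines. The sketch assumes beliefs are plain probability vectors over a small, fixed set of hypotheses and measures surprise as the KL divergence from prior to posterior; in the paper the hypotheses come from a Video-LLM, so the function names, the proportional allocation rule, and the toy numbers here are all hypothetical.

```python
import math

def bayesian_surprise(prior, posterior, eps=1e-12):
    """KL(posterior || prior) over a discrete set of belief hypotheses."""
    return sum(q * math.log((q + eps) / (p + eps))
               for p, q in zip(prior, posterior) if q > 0)

def allocate_frames(surprises, budget):
    """Split a frame budget across video segments in proportion to surprise.

    Largest-remainder rounding keeps the counts summing exactly to `budget`.
    """
    total = sum(surprises)
    if total <= 0:
        # No surprise anywhere: fall back to uniform sampling.
        shares = [budget / len(surprises)] * len(surprises)
    else:
        shares = [budget * s / total for s in surprises]
    counts = [int(math.floor(x)) for x in shares]
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_remainder[: budget - sum(counts)]:
        counts[i] += 1
    return counts

# Toy example: beliefs barely move in segment 0 and flip in segment 1.
segment_surprise = [
    bayesian_surprise([0.5, 0.5], [0.5, 0.5]),
    bayesian_surprise([0.9, 0.1], [0.1, 0.9]),
]
print(allocate_frames(segment_surprise, budget=8))
```

In this toy setup the segment whose posterior departs from its prior receives the whole budget, which is the intended contrast with uniform sampling.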

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

330. Private Rate-Constrained Optimization with Applications to Fair Learning

Authors:

Many problems in trustworthy ML can be expressed as constraints on prediction rates across subpopulations, including group fairness constraints (demographic parity, equalized odds, etc.). In this work, we study such constrained minimization problems under differential privacy (DP). Standard DP optimization techniques like DP-SGD rely on objectives that decompose over individual examples, enabling per-example gradient clipping and noise addition. Rate constraints, however, depend on aggregate statistics across groups, creating inter-sample dependencies that violate this decomposability. To address this, we develop RaCO-DP, a DP variant of Stochastic Gradient Descent-Ascent (SGDA) that solves the Lagrangian formulation of rate constraint problems. We show that the additional privacy cost of incorporating these constraints reduces to privately estimating a histogram over the mini-batch at each step. We prove convergence of our algorithm through a novel analysis of SGDA that leverages the linear structure of the dual parameter. Empirical results show that our method Pareto-dominates existing private learning approaches under group fairness constraints and also achieves strong privacy–utility–fairness performance on neural networks.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

331. Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning

Authors:

Directed Acyclic Graphs (DAGs) are a standard tool in causal modeling, but their suitability for capturing the complexity of large-scale multimodal data is questionable. In practice, real-world multimodal datasets are often collected from heterogeneous generative processes that do not conform to a single DAG. Instead, they may involve multiple, and even opposing, DAG structures with inverse causal directions. To address this gap, in this work, we first propose a novel latent partial causal model tailored for multimodal data representation learning, featuring two latent coupled variable parts connected by an undirected edge, to represent the transfer of knowledge across modalities. Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by MultiModal Contrastive Learning (MMCL) correspond to the latent coupled variables up to a trivial transformation. This result deepens our understanding of why MMCL works, highlights its potential for representation disentanglement, and expands the utility of pre-trained models like CLIP. Synthetic experiments confirm the robustness of our findings, even when the assumptions are partially violated. Most importantly, experiments show that a pre-trained CLIP model embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets. Together, these contributions push the boundaries of MMCL, both in theory and in practical applications.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

332. The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective

Authors:

We study the sample complexity of online reinforcement learning in the general setting of nonlinear dynamical systems with continuous state and action spaces. Our analysis accommodates a large class of dynamical systems ranging from a finite set of nonlinear candidate models to models with bounded and Lipschitz continuous dynamics, to systems that are parametrized by a compact and real-valued set of parameters. In the most general setting, our algorithm achieves a policy regret of $\mathcal{O}(N \epsilon^2 + \mathrm{ln}(m(\epsilon))/\epsilon^2)$, where $N$ is the time horizon, $\epsilon$ is a user-specified discretization width, and $m(\epsilon)$ measures the complexity of the function class under consideration via its packing number. In the special case where the dynamics are parametrized by a compact and real-valued set of parameters (such as neural networks, transformers, etc.), we prove a policy regret of $\mathcal{O}(\sqrt{N p})$, where $p$ denotes the number of parameters, recovering earlier sample-complexity results that were derived for linear time-invariant dynamical systems. While this article focuses on characterizing sample complexity, the proposed algorithms are likely to be useful in practice, due to their simplicity, their ability to incorporate prior knowledge, and their benign transient behaviors.
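
The way the general bound relates to the $\sqrt{Np}$ rate can be seen from a short heuristic calculation. It treats $\ln m(\epsilon)$ as a constant $\Lambda$ while optimizing over $\epsilon$ and assumes a packing number of order $(c/\epsilon)^p$ for a compact $p$-dimensional parameter set; both simplifications, and the symbols $\Lambda$ and $c$, are assumptions made here, not the paper's derivation.

```latex
% Balance the two terms of the regret bound, with \Lambda := \ln m(\epsilon) held fixed
f(\epsilon) = N \epsilon^{2} + \frac{\Lambda}{\epsilon^{2}},
\qquad
f'(\epsilon) = 2 N \epsilon - \frac{2 \Lambda}{\epsilon^{3}} = 0
\;\Longrightarrow\;
\epsilon_{\star} = \left(\frac{\Lambda}{N}\right)^{1/4},
\qquad
f(\epsilon_{\star}) = 2 \sqrt{N \Lambda} .

% For a compact p-dimensional parameter set, m(\epsilon) \sim (c/\epsilon)^{p}
\Lambda = p \,\ln\!\left(\frac{c}{\epsilon}\right)
\;\Longrightarrow\;
f(\epsilon_{\star}) = 2 \sqrt{N p \,\ln\!\left(\frac{c}{\epsilon_{\star}}\right)} .
```

This heuristic reproduces the $\sqrt{Np}$ scaling up to a logarithmic factor; the abstract's sharper $\mathcal{O}(\sqrt{Np})$ statement for the parametric case presumably relies on the paper's dedicated analysis.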

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

333. LLM Fingerprinting via Semantically Conditioned Watermarks

Authors:

Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations, we introduce *LLM fingerprinting via semantically conditioned watermarks*, replacing fixed query sets with a broad semantic domain, and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain (e.g., the French language), the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

334. Pareto Variational Autoencoder

Authors:

Incorporating robustness in generative modeling has attracted many researchers in the field. To this end, we introduce a new class of multivariate power-law distributions, the symmetric Pareto (symPareto) distribution, which can be viewed as an $\ell_1$-norm-based counterpart of the multivariate $t$ distribution. The symPareto distribution possesses many attractive information-geometric properties with respect to the $\gamma$-power divergence that naturally populates power-law families. Leveraging the joint minimization view of variational inference, we propose the ParetoVAE, a probabilistic autoencoder that minimizes the $\gamma$-power divergence between two statistical manifolds. ParetoVAE employs the symPareto distribution for both prior and encoder, with flexible decoder options including Student's $t$ and symPareto distributions. Empirical evidence demonstrates ParetoVAE's effectiveness across multiple domains through varying the types of the decoder. The $t$ decoder achieves superior performance in sparse, heavy-tailed data reconstruction and word frequency analysis; the symPareto decoder enables robust high-dimensional denoising.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

335. Parameterized Hardness of Zonotope Containment and Neural Network Verification

Authors:

Neural networks with ReLU activations are a widely used model in machine learning. It is thus important to have a profound understanding of the properties of the functions computed by such networks. Recently, there has been increasing interest in the (parameterized) computational complexity of determining these properties. In this work, we close several gaps and resolve an open problem posed by Froese et al. [COLT '25] regarding the parameterized complexity of various problems related to network verification. In particular, we prove that deciding positivity (and thus surjectivity) of a function $f\colon\mathbb{R}^d\to\mathbb{R}$ computed by a 2-layer ReLU network is W[1]-hard when parameterized by $d$. This result also implies that zonotope (non-)containment is W[1]-hard with respect to $d$, a problem that is of independent interest in computational geometry, control theory, and robotics. Moreover, we show that approximating the maximum within any multiplicative factor in 2-layer ReLU networks, computing the $L_p$-Lipschitz constant for $p\in(0,\infty]$ in 2-layer networks, and approximating the $L_p$-Lipschitz constant in 3-layer networks are NP-hard and W[1]-hard with respect to $d$. Notably, our hardness results are the strongest known so far and imply that the naive enumeration-based methods for solving these fundamental problems are all essentially optimal under the Exponential Time Hypothesis.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

336. Spectral Bellman Method: Unifying Representation and Exploration in RL

Authors:

Representation learning is critical to the empirical and theoretical success of reinforcement learning. However, many existing methods are induced from model-learning aspects, misaligning them with the RL task at hand. This work introduces the Spectral Bellman Method, a novel framework derived from the Inherent Bellman Error (IBE) condition. It aligns representation learning with the fundamental structure of Bellman updates across a space of possible value functions, making it directly suited for value-based RL. Our key insight is a fundamental spectral relationship: under the zero-IBE condition, the transformation of a distribution of value functions by the Bellman operator is intrinsically linked to the feature covariance structure. This connection yields a new, theoretically-grounded objective for learning state-action features that capture this Bellman-aligned covariance, requiring only a simple modification to existing algorithms. We demonstrate that our learned representations enable structured exploration by aligning feature covariance with Bellman dynamics, improving performance in hard-exploration and long-horizon tasks. Our framework naturally extends to multi-step Bellman operators, offering a principled path toward learning more powerful and structurally sound representations for value-based RL.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

337. Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States

Authors:

Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LLaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, Math500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6×, making it a strong and versatile alternative for parallel sequence generation.
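
One plausible reading of "distributional mixtures of predicted tokens and the mask embedding" is a convex blend of the mask embedding with the probability-weighted average of token embeddings. The sketch below illustrates that reading with plain Python lists; the mixing coefficient `mask_weight`, the function name, and the toy vocabulary are assumptions made for illustration, and the paper's actual update rule may differ.

```python
def soft_masked_embedding(probs, token_embeddings, mask_embedding, mask_weight):
    """Blend the mask embedding with the expected token embedding.

    probs: predicted distribution over a (toy) vocabulary at one position.
    token_embeddings: one embedding vector per vocabulary entry.
    mask_weight: hypothetical mixing coefficient in [0, 1]; 1.0 keeps a pure mask.
    """
    dim = len(mask_embedding)
    # Expected embedding under the predicted token distribution.
    expected = [sum(p * emb[d] for p, emb in zip(probs, token_embeddings))
                for d in range(dim)]
    return [mask_weight * m + (1.0 - mask_weight) * e
            for m, e in zip(mask_embedding, expected)]

# Toy example: two-token vocabulary with 2-dimensional embeddings.
vocab = [[1.0, 0.0], [0.0, 1.0]]
print(soft_masked_embedding([0.5, 0.5], vocab, [2.0, 2.0], mask_weight=0.5))
```

Keeping the full distribution in the input, rather than committing to the argmax token or resetting to a pure mask, is what lets later steps reuse the predictive information that hard masking would discard.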

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

338. Dual-Objective Reinforcement Learning with Novel Hamilton-Jacobi-Bellman Formulations

Authors:

Hard constraints in reinforcement learning (RL) often degrade policy performance. Lagrangian methods offer a way to blend objectives with constraints, but require intricate reward engineering and parameter tuning. In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction. Namely, we address: 1) the Reach-Always-Avoid (RAA) problem – of achieving distinct reward and penalty thresholds – and 2) the Reach-Reach (RR) problem – of achieving thresholds of two distinct rewards. In contrast with temporal logic approaches, which typically involve representing an automaton, we derive explicit, tractable Bellman forms in this context via decomposition. Specifically, we prove that the RAA and RR problems may be rewritten as compositions of previously studied HJ-RL problems. We leverage our analysis to propose a variation of Proximal Policy Optimization (DO-HJ-PPO), and demonstrate that it produces distinct behaviors from previous approaches, out-competing a number of baselines in success, safety and speed across a range of tasks for safe-arrival and multi-target achievement.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

339. Pareto-Conditioned Diffusion Models for Offline Multi-Objective Optimization

Authors:

Multi-objective optimization (MOO) arises in many real-world applications where trade-offs between competing objectives must be carefully balanced. In the offline setting, where only a static dataset is available, the main challenge is generalizing beyond observed data. We introduce Pareto-Conditioned Diffusion (PCD), a novel framework that formulates offline MOO as a conditional sampling problem. By conditioning directly on desired trade-offs, PCD avoids the need for explicit surrogate models. To effectively explore the Pareto front, PCD employs a reweighting strategy that focuses on high-performing samples and a reference-direction mechanism to guide sampling towards novel, promising regions beyond the training data. Experiments on standard offline MOO benchmarks show that PCD achieves highly competitive performance and, importantly, demonstrates greater consistency across diverse tasks than existing offline MOO approaches.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

340. MMReD: a Cross-Modal Benchmark for Dense Context Reasoning

Authors:

Despite recent advancements in extending context windows of large language models (LLMs) and large vision-language models (LVLMs), their ability to perform complex multi-modal reasoning over extended contexts remains critically limited. To underline this challenge, we present MMReD, a benchmark specifically designed to assess reasoning abilities within dense, information-rich scenarios where simple retrieval is not enough. Unlike traditional Needle-in-a-Haystack evaluations, MMReD challenges models to identify and interpret global patterns across entire contexts. Our benchmark comprises 24 tasks of varying complexity, ranging from standard passkey retrieval setups to those requiring selective or uniform attention to all context chunks. The evaluation reveals a consistent performance drop across all tested models -- including the most advanced LLMs, LVLMs, and architectures specializing in code and reasoning -- as the number of observations increases. Notably, even the leading reasoning-specialized models achieve 0\% accuracy on certain tasks at the maximum context length of 128 observations. Conventional fine-tuning techniques, such as SFT and GRPO, also fail to generalize effectively to longer contexts. These observations reveal an inherent limitation in current model architectures, emphasizing the need for innovative approaches to enable competent dense context reasoning in multi-modal AI systems.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

341. Carré du champ flow matching: better quality-generalisation tradeoff in generative models

Authors:

Deep generative models often face a fundamental tradeoff: high sample quality can come at the cost of memorisation, where the model reproduces training data rather than generalising across the underlying data geometry. We introduce Carré du champ flow matching (CDC-FM), a generalisation of flow matching (FM), that improves the quality-generalisation tradeoff by regularising the probability path with a geometry-aware noise. Our method replaces the homogeneous, isotropic noise in FM with a spatially varying, anisotropic Gaussian noise whose covariance captures the local geometry of the latent data manifold. We prove that this geometric noise can be optimally estimated from the data and is scalable to large data. Further, we provide an extensive experimental evaluation on diverse datasets (synthetic manifolds, point clouds, single-cell genomics, animal motion capture, and images) as well as various neural network architectures (MLPs, CNNs, and transformers). We demonstrate that CDC-FM consistently offers a better quality-generalisation tradeoff. We observe significant improvements over standard FM in data-scarce regimes and in highly non-uniformly sampled datasets, which are often encountered in AI for science applications. Our work provides a mathematical framework for studying the interplay between data geometry, generalisation and memorisation in generative models, as well as a robust and scalable algorithm that can be readily integrated into existing flow matching pipelines.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

342. Conformal Robustness Control: A New Strategy for Robust Decision

Authors:

Robust decision-making is crucial in numerous risk-sensitive applications where outcomes are uncertain and the cost of failure is high. Contextual Robust Optimization (CRO) offers a framework for such tasks by constructing prediction sets for the outcome that satisfy predefined coverage requirements and then making decisions based on these sets. Many existing approaches leverage conformal prediction to build prediction sets with guaranteed coverage for CRO. However, since coverage is a *sufficient but not necessary* condition for robustness, enforcing such constraints often leads to overly conservative decisions. To overcome this limitation, we propose a novel framework, Conformal Robustness Control (CRC), that directly optimizes the prediction set construction under explicit robustness constraints, thereby enabling more efficient decisions without compromising robustness. We develop efficient algorithms to solve the CRC optimization problem and provide theoretical guarantees on both robustness and optimality. Empirical results show that CRC consistently yields more effective decisions than existing baselines while still meeting the target robustness level.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

343. Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives

Authors:

Recent 3D generative models produce high-quality textures for 3D mesh objects. However, they commonly rely on the heavy assumption that input 3D meshes are accompanied by manual mesh parameterization (UV mapping), a manual task that requires both technical precision and artistic judgment. Industry surveys show that this process often accounts for a significant share of asset creation, creating a major bottleneck for 3D content creators. Moreover, existing automatic methods often ignore two perceptually important criteria: (1) semantic awareness (UV charts should align semantically similar 3D parts across shapes) and (2) visibility awareness (cutting seams should lie in regions unlikely to be seen). To overcome these shortcomings and to automate the mesh parameterization process, we present an unsupervised differentiable framework that augments standard geometry-preserving UV learning with semantic- and visibility-aware objectives. For semantic-awareness, our pipeline (i) segments the mesh into semantic 3D parts, (ii) applies an unsupervised learned per-part UV-parameterization backbone, and (iii) aggregates per-part charts into a unified UV atlas. For visibility-awareness, we use ambient occlusion (AO) as an exposure proxy and back-propagate a soft differentiable AO-weighted seam objective to steer cutting seams toward occluded regions. By conducting qualitative and quantitative evaluations against state-of-the-art methods, we show that the proposed method produces UV atlases that better support texture generation and reduce perceptible seam artifacts compared to recent baselines. We will make our implementation code publicly available upon acceptance of the paper.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

344. AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization

Authors:

Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently witnessed remarkable advancements and are increasingly being deployed in real-world applications. However, inheriting the sensitivity of visual neural networks, LVLMs remain vulnerable to adversarial attacks, which can result in erroneous or malicious outputs. While existing efforts utilize adversarial fine-tuning to enhance robustness, they often suffer from significant performance degradation on clean inputs. In this paper, we propose AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem, aiming to enhance the model’s preference for generating normal outputs on clean inputs while rejecting potentially misleading outputs for adversarial examples. Notably, AdPO achieves this by solely modifying the image encoder, e.g., CLIP ViT, resulting in superior clean and adversarial performance in a variety of downstream tasks. Due to the computational cost of training large language models, we show that training on smaller LVLMs and transferring to larger ones achieves state-of-the-art performance with efficiency comparable to previous methods. Our comprehensive experiments confirm the effectiveness of the proposed AdPO, which highlights the potential of preference-based learning in adversarially robust multimodal systems.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

345. Reinforcement Learning for Machine Learning Engineering Agents

Authors:

Machine learning engineering (MLE) has a clear objective: Given an MLE task and a verifier (e.g., performance on some held-out data), what is the most effective way to utilize compute to achieve the best performance for the given task? Existing language model (LM) agents rely on prompting frontier LMs and accumulating experience non-parametrically by storing and retrieving experience through agent scaffolds and test-time compute. In this paper, we show that in environments such as MLE where a good verifier is available, adapting the LM parameters through gradient updates can be more effective in utilizing compute and the agent’s experience. Specifically, we show that agents backed by weaker models that improve via reinforcement learning (RL) can eventually outperform agents backed by much larger, but static models for a given MLE task. We identify two major challenges with RL in this setting. First, actions can take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. We propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using performance on the held-out data as a reward for MLE provides limited feedback. A program that’s nearly correct is treated the same as one that fails entirely (e.g., during data loading). We propose environment instrumentation to offer verifiable partial credit, using a separate, static language model to insert print statements into an existing program. Our experiments suggest that a small LM (Qwen2.5-3B) adapted with RL, when given enough compute, can solve an MLE task better than prompting a frontier model (Claude-3.5-Sonnet) with the state-of-the-art agent scaffold (AIDE) by an average of 22% across 12 Kaggle tasks.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

346. Navigating the Latent Space Dynamics of Neural Models

Authors:

Neural networks transform high-dimensional data into compact, structured representations, often modeled as elements of a lower dimensional latent space. In this paper, we present an alternative interpretation of neural models as dynamical systems acting on the latent manifold. Specifically, we show that autoencoder models implicitly define a _latent vector field_ on the manifold, derived by iteratively applying the encoding-decoding map, without any additional training. We observe that standard training procedures introduce inductive biases that lead to the emergence of attractor points within this vector field. Drawing on this insight, we propose to leverage the vector field as a _representation_ for the network, providing a novel tool to analyze the properties of the model and the data. This representation makes it possible to: $(i)$ analyze the generalization and memorization regimes of neural models, even throughout training; $(ii)$ extract prior knowledge encoded in the network's parameters from the attractors, without requiring any input data; $(iii)$ identify out-of-distribution samples from their trajectories in the vector field. We further validate our approach on vision foundation models, showcasing the applicability and effectiveness of our method in real-world scenarios.
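The idea of a latent vector field obtained by iterating the encoding-decoding map can be sketched with a scalar toy model. Everything below is an assumption made for illustration: the affine `encode` and `decode` functions stand in for a trained autoencoder, and the names `latent_map`, `vector_field`, and `find_attractor` are hypothetical.

```python
def decode(z):
    """Toy decoder (stand-in for a trained network)."""
    return 2.0 * z

def encode(x):
    """Toy encoder (stand-in for a trained network)."""
    return 0.25 * x + 0.5

def latent_map(z):
    """One round trip through decoder and encoder: z -> encode(decode(z))."""
    return encode(decode(z))

def vector_field(z):
    """Displacement induced by one round trip; zero at fixed points."""
    return latent_map(z) - z

def find_attractor(z0, tol=1e-10, max_iters=1000):
    """Follow the vector field from z0 until the displacement is negligible."""
    z = z0
    for _ in range(max_iters):
        step = vector_field(z)
        if abs(step) < tol:
            break
        z = z + step
    return z

if __name__ == "__main__":
    # For this toy map, latent_map(z) = 0.5 * z + 0.5, so z = 1 is the fixed point.
    print(find_attractor(5.0))
    print(find_attractor(-3.0))
```

In this toy the round-trip map is a contraction, so every starting point flows to the same attractor; a real autoencoder could have many attractors or none, and the trajectory taken by a sample is what the abstract proposes to analyze.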

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

347. Poisson Midpoint Method for Log Concave Sampling: Beyond the Strong Error Lower Bounds

Authors:

We study the problem of sampling from strongly log-concave distributions over $\mathbb{R}^d$ using the Poisson midpoint discretization (a variant of the randomized midpoint method) for overdamped/underdamped Langevin dynamics. We prove its convergence in the 2-Wasserstein distance ($\mathcal W_2$), achieving a cubic speedup in dependence on the target accuracy ($\epsilon$) over the Euler-Maruyama discretization, surpassing existing bounds for randomized midpoint methods. Notably, in the case of underdamped Langevin dynamics, we demonstrate the complexity of $\mathcal W_2$ convergence is much smaller than the complexity lower bounds for convergence in $L^2$ strong error established in the literature.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

348. Alternating Diffusion for Proximal Sampling with Zeroth Order Queries

Authors:

This work introduces a new approximate proximal sampler that operates solely with zeroth-order information of the potential function. Prior theoretical analyses have revealed that proximal sampling corresponds to alternating forward and backward iterations of the heat flow. The backward step was originally implemented by rejection sampling, whereas we directly simulate the dynamics. Unlike diffusion-based sampling methods that estimate scores via learned models or by invoking auxiliary samplers, our method treats the intermediate particle distribution as a Gaussian mixture, thereby yielding a Monte Carlo score estimator from directly samplable distributions. Theoretically, when the score estimation error is sufficiently controlled, our method inherits the exponential convergence of proximal sampling under isoperimetric conditions on the target distribution. In practice, the algorithm avoids rejection sampling, permits flexible step sizes, and runs with a deterministic runtime budget. Numerical experiments demonstrate that our approach converges rapidly to the target distribution, driven by interactions among multiple particles and by exploiting parallel computation.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

349. OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction

Authors:

Accurately predicting experimentally-realizable $3\textrm{D}$ molecular crystal structures from their $2\textrm{D}$ chemical graphs is a long-standing open challenge in computational chemistry called $\textit{crystal structure prediction}$ (CSP). Efficiently solving this problem has implications ranging from pharmaceuticals to organic semiconductors, as crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce $\textrm{OXtal}$, a large-scale $100\textrm{M}$ parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To efficiently scale $\textrm{OXtal}$, we abandon explicit equivariant architectures imposing inductive bias arising from crystal symmetries in favor of data augmentation strategies. We further propose a novel crystallization-inspired lattice-free training scheme, $\textit{Stoichiometric Stochastic Shell Sampling}$ ($S^4$), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization, thus enabling more scalable architectural choices at all-atom resolution. Trained on $600 \text{K}$ experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), $\textrm{OXtal}$ achieves orders-of-magnitude improvements over prior $\textit{ab-initio}$ ML CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, $\textrm{OXtal}$ reproduces experimental structures with conformer $\mathrm{RMSD}_1<0.5$ Å and attains over 80\% lattice-match success, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

350. ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

Authors:

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
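For readers unfamiliar with the 2:4 pattern, the constraint is that every group of four consecutive weights keeps at most two nonzeros. The sketch below shows only the conventional magnitude-based baseline, not ARMOR's factorization or its block coordinate descent; the function name `prune_2_4` is hypothetical.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude entries in every group of four weights."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

if __name__ == "__main__":
    weights = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.4]
    print(prune_2_4(weights))
```

Direct pruning of this kind discards half of each group outright, which is the source of the accuracy loss the abstract refers to; the wrappers described there are meant to compensate for that error.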

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

351. Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension

Authors:

Designing efficient optimizers for large language models (LLMs) with low-memory requirements and fast convergence is an important and challenging problem. This paper takes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations for practical efficient optimizers for LLMs: carefully selecting structural assumptions to balance generality and efficiency, and enhancing the memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam with little memory overhead. Notably, Alice achieves better than 2x faster convergence over Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

352. DrugTrail: Explainable Drug Discovery via Structured Reasoning and Druggability‑Tailored Preference Optimization

Authors:

Machine learning promises to revolutionize drug discovery, but its "black-box" nature and narrow focus limit adoption by experts. While Large Language Models (LLMs) offer a path forward with their broad knowledge and interactivity, existing methods remain data-intensive and lack transparent reasoning. To address these issues, we present DrugTrail, an LLM-based framework for explainable drug discovery that integrates structured reasoning trajectories with a Druggability‑Tailored Preference Optimization (DTPO) strategy. It not only introduces structured reasoning traces to articulate the "how" and "why" behind its conclusions but also uses them to guide task-specific reasoning pathways within the LLM's vast knowledge space, thereby enhancing the interpretability and reliability of its final outputs. Furthermore, based on the fact that optimizing for binding affinity alone does not equate to optimizing for druggability, DTPO explicitly moves beyond single-metric optimization and opens up a broader search space that balances affinity with other essential factors. Extensive experiments demonstrate the effectiveness of our approach and its generalizability to a wider range of biomolecular optimization domains, bridging the gap between LLM reasoning capabilities and trustworthy AI-assisted drug discovery.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

353. Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs

Authors:

Boolean Satisfiability (SAT) solvers are foundational to computer science, yet their performance typically hinges on hand-crafted heuristics. This work introduces Reinforcement Learning from Algorithm Feedback (RLAF) as a paradigm for learning to guide SAT solver branching heuristics with Graph Neural Networks (GNNs). Central to our approach is a novel and generic mechanism for injecting inferred variable weights and polarities into the branching heuristics of existing SAT solvers. In a single forward pass, a GNN assigns these parameters to all variables. Casting this one-shot guidance as a reinforcement learning problem lets us train the GNN with off-the-shelf policy-gradient methods, such as GRPO, directly using the solver's computational cost as the sole reward signal. Extensive evaluations demonstrate that RLAF-trained policies significantly reduce the mean solve times of different base solvers across diverse SAT problem distributions, achieving more than a 2x speedup in some cases, while generalizing effectively to larger and harder problems after training. Notably, these policies consistently outperform expert-supervised approaches based on learning handcrafted weighting heuristics, offering a promising path towards data-driven heuristic design in combinatorial optimization.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

354. ROC-n-reroll: How verifier imperfection affects test-time scaling

Authors:

Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier *imperfection* affects performance, a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier’s ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
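To make the role of verifier imperfection concrete, consider a simplified setting that is an assumption of this note rather than the paper's model: a binary verifier at a single ROC operating point (true-positive rate `tpr`, false-positive rate `fpr`) and a generator whose samples are independently correct with probability `p_correct`. If sampling repeats until the verifier accepts, Bayes' rule gives the probability that the accepted answer is correct. The function name is hypothetical.

```python
def rejection_sampling_accuracy(p_correct, tpr, fpr):
    """P(answer is correct | verifier accepted) for a binary verifier.

    p_correct : probability that a single sample is correct
    tpr       : probability the verifier accepts a correct sample
    fpr       : probability the verifier accepts an incorrect sample
    """
    accept = p_correct * tpr + (1.0 - p_correct) * fpr
    if accept == 0.0:
        raise ValueError("verifier never accepts, so sampling would not terminate")
    return p_correct * tpr / accept

if __name__ == "__main__":
    # An informative but imperfect verifier lifts accuracy above the base rate.
    print(rejection_sampling_accuracy(0.5, 0.8, 0.2))
    # A verifier with tpr == fpr carries no information: accuracy stays at p_correct.
    print(rejection_sampling_accuracy(0.3, 0.6, 0.6))
```

In this simplified picture only the ratio of `tpr` to `fpr` matters, which hints at why the shape of the ROC curve, rather than a single accuracy number, governs the result.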

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

355. AutoEP: LLMs-Driven Automation of Hyperparameter Evolution for Metaheuristic Algorithms

Authors:

Dynamically configuring algorithm hyperparameters is a fundamental challenge in computational intelligence. While learning-based methods offer automation, they suffer from prohibitive sample complexity and poor generalization. We introduce AutoEP, a novel framework that bypasses training entirely by leveraging Large Language Models (LLMs) as zero-shot reasoning engines for algorithm control. AutoEP's core innovation lies in a tight synergy between two components: (1) an online Exploratory Landscape Analysis (ELA) module that provides real-time, quantitative feedback on the search dynamics, and (2) a multi-LLM reasoning chain that interprets this feedback to generate adaptive hyperparameter strategies. This approach grounds high-level reasoning in empirical data, mitigating hallucination. Evaluated on three distinct metaheuristics across diverse combinatorial optimization benchmarks, AutoEP consistently outperforms state-of-the-art tuners, including neural evolution and other LLM-based methods. Notably, our framework enables open-source models like Qwen3-30B to match the performance of GPT-4, demonstrating a powerful and accessible new paradigm for automated hyperparameter design. Our code is available at https://anonymous.4open.science/r/AutoEP-3E11.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

356. Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets

Authors:

Standard autoregressive language models generate text token-by-token from a fixed vocabulary, inducing a *tree-structured state space* when viewing token sampling as an action, which limits flexibility and expressiveness. Recent work introduces dynamic vocabulary by sampling retrieved text spans but overlooks that the same sentence can be composed of spans of varying lengths, lacking explicit modeling of the *directed acyclic graph (DAG) state space*. This restricts the exploration of compositional paths and biases generation toward the chosen path. Generative Flow Networks (GFlowNets) are powerful for efficiently exploring and generalizing over state spaces, particularly those with a DAG structure. However, prior GFlowNets-based language models operate at the token level and remain confined to tree-structured spaces, limiting their potential. In this work, we propose **F**low **o**f **S**pan**S** (**FoSS**), a principled GFlowNets framework for span generation. FoSS constructs a dynamic span vocabulary by segmenting the retrieved text flexibly, ensuring a DAG-structured state space, which allows GFlowNets to explore diverse compositional paths and improve generalization. With specialized reward models, FoSS generates diverse, high-quality text. Empirically, FoSS improves MAUVE scores by up to 12.5\% over Transformer on text generation and achieves 3.5\% gains on knowledge-intensive tasks, consistently outperforming state-of-the-art methods. Scaling experiments further demonstrate that FoSS benefits from larger models, more data, and richer retrieval corpora, retaining its advantage over strong baselines.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

357. Preference Leakage: A Contamination Problem in LLM-as-a-judge

Authors:

Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive and real-world problem that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

358. KnowProxy: Adapting Large Language Models by Knowledge-guided Proxy

Authors:

Adapting large language models (LLMs) using smaller proxy models has been shown to improve training efficiency, where the LLMs remain frozen while the proxies are tuned on top. However, this approach typically requires access to the output probability distributions of LLMs, which are often inaccessible or unstable. To address this limitation, we propose KnowProxy, a knowledge-guided proxy framework in which the proxy is trained with textual knowledge rather than probability distributions. Specifically, we first elicit textual knowledge and reasoning from frozen LLMs through prompting, and then the proxy model learns to adapt this reasoning to target task distributions. We evaluate KnowProxy on diverse reasoning benchmarks with different fine-tuning scenarios. Comprehensive results show that KnowProxy achieves competitive or even better performance without direct access to probability distributions, thereby providing a scalable and versatile alternative to traditional fine-tuning.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

359. FlowRL: Matching Reward Distributions for LLM Reasoning

Authors:

We propose FlowRL: matching the full reward distribution via flow balancing instead of solely maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on both math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
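The contrast between reward maximization and distribution matching can be shown on a tiny finite candidate set. The sketch below rests on assumptions not stated in the abstract: the target is taken to be proportional to `exp(beta * reward)`, the temperature `beta` is hypothetical, and the partition function is computed exactly over a handful of candidates rather than learned as in the paper.

```python
import math

def target_distribution(rewards, beta=1.0):
    """Normalize exp(beta * reward) over a finite candidate set."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(weights)  # exact partition function for this toy set
    return [w / z for w in weights]

def reverse_kl(policy, target):
    """KL(policy || target); zero when the policy matches the target."""
    return sum(pi * math.log(pi / ti) for pi, ti in zip(policy, target) if pi > 0.0)

if __name__ == "__main__":
    rewards = [1.0, 1.0, 0.2]  # two equally good answers, one weaker
    target = target_distribution(rewards, beta=2.0)
    greedy = [1.0, 0.0, 0.0]   # reward-maximizing policy collapsed on one answer
    print(target)
    print(reverse_kl(greedy, target))  # positive: collapse is penalized
    print(reverse_kl(target, target))  # zero: distribution is matched
```

A policy that puts all its mass on one of two equally rewarded answers still maximizes expected reward, yet has nonzero divergence from the target, which is the diversity argument the abstract makes.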

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

360. Improved $\ell_{p}$ Regression via Iteratively Reweighted Least Squares

Authors:

We introduce fast algorithms for solving $\ell_{p}$ regression problems using the iteratively reweighted least squares (IRLS) method. Our approach achieves state-of-the-art iteration complexity, outperforming the IRLS algorithm by Adil-Peng-Sachdeva (NeurIPS 2019) and matching the theoretical bounds established by the complex algorithm of Adil-Kyng-Peng-Sachdeva (SODA 2019, J. ACM 2024) via a simpler lightweight iterative scheme. This bridges the existing gap between theoretical and practical algorithms for $\ell_{p}$ regression. Our algorithms depart from prior approaches, using a primal-dual framework, in which the update rule can be naturally derived from an invariant maintained for the dual objective. Empirically, we show that our algorithms significantly outperform both the IRLS algorithm by Adil-Peng-Sachdeva and MATLAB/CVX implementations.
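For readers unfamiliar with IRLS, the following is a minimal sketch of the textbook reweighting loop, not the primal-dual algorithm proposed in the paper. It assumes the simplest special case of $\ell_{p}$ regression, fitting a single scalar $x$ to minimize $\sum_i |x - a_i|^p$, so that each weighted least-squares step has a closed form. The function names, the `eps` safeguard, and the iteration count are illustrative assumptions.

```python
def lp_objective(x, data, p):
    """Sum of |x - a|^p over the data points."""
    return sum(abs(x - a) ** p for a in data)

def irls_location(data, p, iters=100, eps=1e-12):
    """Textbook IRLS for min_x sum_i |x - a_i|^p with a scalar unknown x."""
    x = sum(data) / len(data)  # start from the least-squares (p = 2) solution
    for _ in range(iters):
        # Reweight: |r|^p = |r|^(p-2) * r^2, so use w_i = |r_i|^(p-2).
        weights = [max(abs(x - a), eps) ** (p - 2) for a in data]
        # Weighted least squares in one dimension is a weighted mean.
        x = sum(w * a for w, a in zip(weights, data)) / sum(weights)
    return x

data = [0.0, 1.0, 2.0, 10.0]
mean = sum(data) / len(data)
x_p = irls_location(data, p=1.5)
print("p=2 solution:", irls_location(data, p=2.0))
print("p=1.5 solution:", x_p, "objective:", lp_objective(x_p, data, 1.5))
```

Plain reweighting of this kind can fail to converge for larger $p$, which is one reason modified IRLS schemes with provable iteration bounds are of interest.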

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

361. Online Rubrics Elicitation from Pairwise Comparisons

Authors:

Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

362. Dynamic Texture Modeling of 3D Clothed Gaussian Avatars from a Single Video

Authors:

Recent advances in neural rendering, particularly 3D Gaussian Splatting (3DGS), have enabled animatable 3D human avatars from single videos with efficient rendering and high fidelity. However, current methods struggle with dynamic appearances, especially in loose garments (e.g., skirts), causing unrealistic cloth motion and needle artifacts. This paper introduces a novel approach to dynamic appearance modeling for 3DGS-based avatars, focusing on loose clothing. We identify two key challenges: (1) limited Gaussian deformation under pre-defined template articulation, and (2) a mismatch between body-template assumptions and the geometry of loose apparel. To address these issues, we propose a motion-aware autoregressive structural deformation framework for Gaussians. We structure Gaussians into an approximate graph and recursively predict structure-preserving updates, yielding realistic, template-free cloth dynamics. Our framework enables view-consistent and robust appearance modeling under the single-view constraint, producing accurate foreground silhouettes and precise alignment of Gaussian points with clothed shapes. To demonstrate the effectiveness of our method, we introduce an in-the-wild dataset featuring subjects performing dynamic movements in loose clothing, and extensive experiments validate that our approach significantly outperforms existing 3DGS-based methods in modeling dynamic appearances from single videos.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

363. When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation

Authors:

Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning. Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph construction and knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions under which GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analysis are collected for the community at https://anonymous.4open.science/r/GraphRAG-Benchmark-CE8D/.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

364. A Block Coordinate Descent Method for Nonsmooth Composite Optimization under Orthogonality Constraints

Authors:

Nonsmooth composite optimization with orthogonality constraints has a wide range of applications in statistical learning and data science. However, this problem is challenging due to its nonsmooth objective and computationally expensive, non-convex constraints. In this paper, we propose a new approach called \textbf{OBCD}, which leverages Block Coordinate Descent to address these challenges. \textbf{OBCD} is a feasible method with a small computational footprint. In each iteration, it updates $k$ rows of the solution matrix, where $k \geq 2$, by globally solving a small nonsmooth optimization problem under orthogonality constraints. We prove that the limiting points of \textbf{OBCD}, referred to as (global) block-$k$ stationary points, offer stronger optimality than standard critical points. Furthermore, we show that \textbf{OBCD} converges to $\epsilon$-block-$k$ stationary points with an iteration complexity of $\mathcal{O}(1/\epsilon)$. Additionally, under the Kurdyka-Lojasiewicz (KL) inequality, we establish the non-ergodic convergence rate of \textbf{OBCD}. We also demonstrate how novel breakpoint search methods can be used to solve the subproblem in \textbf{OBCD}. Empirical results show that our approach consistently outperforms existing methods.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

365. Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

Authors:

Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

366. Revisiting Matrix Sketching in Linear Bandits: Achieving Sublinear Regret via Dyadic Block Sketching

Authors:

Linear bandits have become a cornerstone of online learning and sequential decision-making, providing solid theoretical foundations for balancing exploration and exploitation. Within this domain, matrix sketching serves as a critical component for achieving computational efficiency, especially when confronting high-dimensional problem instances. The sketch-based approaches reduce per-round complexity from $\Omega(d^2)$ to $O(dl)$, where $d$ is the dimension and $l

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

367. J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Authors:

The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for nonverifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge, also outperforms o1-mini, o3, and a much larger 671B DeepSeek-R1 on some benchmarks, while only training on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

368. ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation

Authors:

Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 440 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

369. A Unifying View of Coverage in Linear Off-policy Evaluation

Authors:

Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of \emph{linear OPE}, finite-sample guarantees often take the form $$ \textrm{Prediction error} \le \textrm{poly}(C^\pi, d, 1/n, \log(1/\delta)), $$ where $d$ is the dimension of the features, and $C^\pi$ is a **_feature coverage parameter_** that characterizes the degree to which the visited features lie in the span of the data distribution. Though such guarantees are well-understood for several popular algorithms under the Bellman-completeness assumption, this form of guarantee has not yet been achieved in the minimal setting where it is only assumed that the target value function is linearly realizable in the features. Despite recent interest in tight characterizations for this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have undesirable properties and are starkly disconnected from more standard quantities in the literature. In this paper, we provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable (IV) view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage, which can be interpreted as feature coverage in a linear dynamical system. With further assumptions, such as Bellman-completeness, our definition successfully recovers the coverage parameters specialized to those settings, providing a unified understanding for coverage in linear OPE.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

370. Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Optimization

Authors:

Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, with $V$ being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 90\% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
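The single-draft accept-or-resample step described at the start of this abstract can be checked numerically. The sketch below covers only that standard single-draft rule (accept a draft token $x \sim q$ with probability $\min(1, p(x)/q(x))$, otherwise resample from the normalized residual $\max(p - q, 0)$), not the paper's multi-draft algorithm. The toy distributions `p` (target) and `q` (draft) are made-up values; the function computes the exact output distribution rather than drawing samples.

```python
def speculative_output_distribution(p, q):
    """Exact output distribution and acceptance probability of single-draft speculative sampling.

    p: target model distribution over tokens, q: draft model distribution.
    """
    # Probability that token x is drafted AND accepted: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    accept_prob = sum(accept_mass)
    # On rejection, resample from the residual distribution proportional to max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    residual_total = sum(residual)
    reject_prob = 1.0 - accept_prob
    if residual_total > 0.0:
        output = [a + reject_prob * r / residual_total
                  for a, r in zip(accept_mass, residual)]
    else:
        output = list(accept_mass)  # p == q: every draft is accepted
    return output, accept_prob

p = [0.5, 0.3, 0.2]  # hypothetical target distribution
q = [0.2, 0.5, 0.3]  # hypothetical draft distribution
output, accept_prob = speculative_output_distribution(p, q)
print("output distribution:", output)
print("acceptance probability:", accept_prob)
```

The output matches `p` exactly, which is the sense in which inference quality is not sacrificed, and the acceptance probability equals $\sum_x \min(p(x), q(x))$.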

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

371. Revela: Dense Retriever Learning via Language Modeling

Authors:

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

372. OSCAR: Online Soft Compression for RAG

Authors:

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, leading to improved accuracy and relevance. However, scaling RAG pipelines remains computationally expensive as context length grows. On the one hand, hard compression methods have recently been proposed to prune the retrieved text on-the-fly, but with a limited compression ratio. On the other hand, soft compression methods achieve a higher compression rate, but rely on costly offline compression by a dedicated LLM. In this paper, we introduce OSCAR, a novel query-dependent online soft compression method for RAG. OSCAR bridges the gap between online hard and offline soft compression methods, bringing the best of both: OSCAR dynamically compresses retrieved information at inference time, eliminating storage overhead and enabling higher compression rates than existing methods. Our experiments demonstrate state-of-the-art performance with a 2-5x speed-up in inference and minimal, if any, accuracy loss, for LLMs ranging from 1B to 24B parameters.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

373. Multihead Mixture of Experts for Classification of Gigapixel Pathology Images

Authors:

Multiple Instance Learning (MIL) is the predominant paradigm for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch's phenotype, yielding synergistic effects with existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multi-head mixture of experts module designed to improve the performance of any MIL model with minimal alterations to the total number of parameters. Across 8 MIL methods and 19 different tasks, we find that this improvement to the task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Overall, MAMMOTH improves performance in 130 of the 152 examined configurations, with an average $+3.8\%$ change in performance.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

374. Continuously Augmented Discrete Diffusion model for Categorical Generative Modeling

Authors:

Standard discrete diffusion models treat all unobserved states the same way, typically mapping them to an absorbing [MASK] token. This creates an "information void" where global semantic information that may be inferred for the masked tokens from the unmasked tokens is not directly passed from one denoising step to another. We introduce **Continuously Augmented Discrete Diffusion (CADD)**, a framework that augments the discrete state space with a paired diffusion in a continuous latent space. This yields graded, gradually corrupted states in which masked tokens are represented by noisy yet informative latent vectors rather than information voids. At each reverse step, CADD uses the continuous latent as a semantic hint to guide discrete denoising. The design is clean and compatible with existing discrete diffusion training. At sampling time, the strength and estimator of the continuous latent vector enable a controlled trade-off between mode-coverage (diversity-oriented) and mode-seeking (context-localization-oriented). Empirically, we demonstrate CADD improves generative quality over mask-based diffusion across text generation, image synthesis, and code modeling, with consistent gains on both qualitative and quantitative metrics against strong discrete baselines.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

375. Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

Authors:

KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix-sharing settings, demonstrating improvements of up to **6.92$\times$** in cache hit rate, **11.96$\times$** reduction in latency, **14.06$\times$** reduction in time-to-first-token (TTFT), and **77.4%** increase in throughput over the state-of-the-art methods.
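A classic toy example shows why randomized eviction can beat LRU, which is the intuition behind the title. The sketch below is not the paper's algorithm and does not model KV prefixes or routing: it assumes a generic key cache, a hypothetical cyclic request pattern over four keys, and a cache with room for three. Under that pattern LRU always evicts the key that is needed next, while uniform random eviction keeps some useful entries.

```python
import random
from collections import OrderedDict

def lru_hits(requests, capacity):
    """Count cache hits under Least Recently Used eviction."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return hits

def random_eviction_hits(requests, capacity, seed=0):
    """Count cache hits when a uniformly random entry is evicted on each miss."""
    rng = random.Random(seed)
    cache = []
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.pop(rng.randrange(len(cache)))
            cache.append(key)
    return hits

# Hypothetical workload: cycle through 4 keys with a cache that holds only 3.
requests = [i % 4 for i in range(400)]
print("LRU hits:", lru_hits(requests, capacity=3))
print("random-eviction hits:", random_eviction_hits(requests, capacity=3))
```

On this adversarial-for-LRU pattern, LRU records no hits at all, whereas random eviction records a nonzero number of hits.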

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

376. Out of the Shadows: Exploring a Latent Space for Neural Network Verification

Authors:

Neural networks are ubiquitous. However, they are often sensitive to small input changes. Hence, to prevent unexpected behavior in safety-critical applications, their formal verification -- a notoriously hard problem -- is necessary. Many state-of-the-art verification algorithms use reachability analysis or abstract interpretation to enclose the set of possible outputs of a neural network. Often, the verification is inconclusive due to the conservatism of the enclosure. To address this problem, we design a novel latent space for formal verification that enables the transfer of output specifications to the input space for an iterative specification-driven input refinement, i.e., we iteratively reduce the set of possible inputs to only enclose the unsafe ones. The latent space is constructed from a novel view of projection-based set representations, e.g., zonotopes, which are commonly used in reachability analysis of neural networks. A projection-based set representation is a "shadow" of a higher-dimensional set -- a latent space -- that does not change during a set propagation through a neural network. Hence, the input set and the output enclosure are "shadows" of the same latent space that we can use to transfer constraints. We present an efficient verification tool for neural networks that uses our iterative refinement to significantly reduce the number of subproblems in a branch-and-bound procedure. Using zonotopes as a set representation, unlike many other state-of-the-art approaches, our approach can be realized by only using matrix operations, which enables a significant speed-up through efficient GPU acceleration. We demonstrate that our tool achieves competitive performance, which would place it among the top-ranking tools of the last neural network verification competition (VNN-COMP'24).
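The "shadow" view of zonotopes can be made concrete with a few lines of code. A zonotope is the set $\{c + \sum_j \beta_j g_j : \beta_j \in [-1, 1]\}$, so it is a projection of the hypercube of coefficients $\beta$; an affine layer changes the center and generators but not that hypercube. The sketch below is only an illustration of this standard representation, with made-up weights and without the nonlinear (e.g., ReLU) handling or the refinement procedure from the paper.

```python
def affine_map(center, generators, weight, bias):
    """Propagate a zonotope (center, list of generator vectors) through x -> W x + b."""
    def matvec(m, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]
    new_center = [c + b for c, b in zip(matvec(weight, center), bias)]
    # Each generator is mapped by W only; the coefficients in [-1, 1] stay the same.
    new_generators = [matvec(weight, g) for g in generators]
    return new_center, new_generators

def interval_bounds(center, generators):
    """Tightest axis-aligned box enclosing the zonotope."""
    radius = [sum(abs(g[i]) for g in generators) for i in range(len(center))]
    lower = [c - r for c, r in zip(center, radius)]
    upper = [c + r for c, r in zip(center, radius)]
    return lower, upper

# Input set: the unit box [-1, 1]^2 written as a zonotope.
center = [0.0, 0.0]
generators = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 1.0], [1.0, -1.0]]  # hypothetical layer weights
bias = [0.5, 0.0]

out_center, out_generators = affine_map(center, generators, weight, bias)
lower, upper = interval_bounds(out_center, out_generators)
print("output center:", out_center)
print("output bounds:", lower, upper)
```

The number of generators is the same before and after the map, which reflects the fixed coefficient hypercube that input and output sets share.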

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

377. Enhancing Diffusion-Based Sampling with Molecular Collective Variables

Authors:

Diffusion-based samplers learn to sample complex, high-dimensional distributions using energies or log densities alone, without training data. Yet, they remain impractical for molecular sampling because they are often slower than molecular dynamics and miss thermodynamically relevant modes. Inspired by enhanced sampling, we encourage exploration by introducing a sequential bias along bespoke, information-rich, low-dimensional projections of atomic coordinates known as collective variables (CVs). We introduce a repulsive potential centered on the CVs from recent samples, which pushes future samples towards novel CV regions and effectively increases the temperature in the projected space. Our resulting method improves efficiency and mode discovery, enables the estimation of free energy differences, and retains independent sampling from the approximate Boltzmann distribution via reweighting by the bias. On standard peptide conformational sampling benchmarks, the method recovers diverse conformational states and accurate free energy profiles. We are the first to demonstrate reactive sampling using a diffusion-based sampler, capturing bond breaking and formation with universal interatomic potentials at near-first-principles accuracy. The approach resolves reactive energy landscapes at a fraction of the wall-clock time of standard sampling methods, advancing diffusion-based sampling towards practical use in molecular sciences.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

378. We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning

Authors:

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various tasks but still struggle with complex mathematical reasoning. Prior work has mainly focused on dataset construction and method optimization, while often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. We introduce WE-MATH 2.0, a unified system that integrates a structured mathematical knowledge hierarchy, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to enhance the mathematical reasoning abilities of MLLMs. Our contributions are fourfold: (1) MathBook Knowledge System: a five-level hierarchy covering 491 knowledge points and 1,819 fundamental principles; (2) MathBook-Standard and MathBook-Pro: datasets that ensure broad conceptual coverage and robust training through dual expansion, a three-dimensional difficulty space, and seven progressive variants per problem; (3) MathBook-RL: a two-stage RL framework including Cold-Start Fine-Tuning to align models with knowledge-oriented chain-of-thought reasoning, and Progressive Alignment RL leveraging average-reward learning with dynamic data scheduling for progressive difficulty alignment; (4) MathBookEval: a benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL achieves competitive performance on four widely used benchmarks and demonstrates strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

379. Differentially Private Domain Discovery

Authors:

We study several problems in differentially private domain discovery, where each user holds a subset of items from a shared but unknown domain, and the goal is to output an informative subset of items. For set union, we show that the simple baseline Weighted Gaussian Mechanism (WGM) has a near-optimal $\ell_1$ missing mass guarantee on Zipfian data as well as a distribution-free $\ell_\infty$ missing mass guarantee. We then apply the WGM as a domain-discovery precursor for existing known-domain algorithms for private top-$k$ and $k$-hitting set and obtain new utility guarantees for their unknown domain variants. Finally, experiments demonstrate that all of our WGM-based methods are competitive with or outperform existing baselines for all three problems.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

380. Minimax Optimal Adversarial Reinforcement Learning

Authors:

Consider episodic Markov decision processes (MDPs) in which the transition kernel is adversarially chosen at each episode. Prior works have established regret upper bounds of $\widetilde{\mathcal{O}}(\sqrt{T} + C^P)$, where $T$ is the number of episodes and $C^P$ quantifies the degree of adversarial change in the transition dynamics. This regret bound may scale as large as $\mathcal{O}(T)$, leading to a linear regret. This raises a fundamental question: *Can sublinear regret be achieved under fully adversarial transition kernels?* We answer this question affirmatively. First, we show that the optimal policy for MDPs with adversarial transition kernels must be history-dependent. We then design an algorithm, Adversarial Dynamics Follow-the-Regularized-Leader (AD-FTRL), and prove that it achieves a sublinear regret of $\mathcal{O}(\sqrt{(|\mathcal{S}||\mathcal{A}|)^K T})$, where $K$ is the horizon length, $|\mathcal{S}|$ is the number of states, and $|\mathcal{A}|$ is the number of actions. Such a regret cannot be achieved by simply solving this problem as a contextual bandit. We further construct a hard MDP instance and prove a matching lower bound on the regret, which thereby demonstrates the **minimax optimality** of our algorithm.

📊 Review scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

381. Latent Concept Disentanglement in Transformer-based Language Models

Authors:

When large language models (LLMs) use in-context learning (ICL) to solve a new task, they must infer latent concepts from demonstration examples. This raises the question of whether and how transformers represent latent structures as part of their computation. Our work experiments with several controlled tasks, studying this question using mechanistic interpretability. First, we show that in transitive reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. This builds upon prior work that analyzes single-step reasoning. Then, we consider tasks parameterized by a latent numerical concept. We discover low-dimensional subspaces in the model's representation space, where the geometry cleanly reflects the underlying parameterization. Overall, we show that small and large models can indeed disentangle and utilize latent concepts that they learn in-context from a handful of abbreviated demonstrations.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

382. Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

Authors:

Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M will be released for community development after reviewing the manuscript.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

383. Monitoring Decomposition Attacks with Lightweight Sequential Monitors

Authors:

As LLMs become more agentic, a critical risk emerges: attackers can decompose harmful goals into stateful, benign subtasks that trick LLM agents into executing them without realizing the harmful intent. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent. We therefore propose adding an external monitor that observes the conversation at a higher level. To facilitate our study on monitoring decomposition attacks, we curate the largest and most diverse dataset, DecomposedHarm, with 4,634 tasks that can be assigned to LLM agents, including general agent tasks, text-to-image, and question-answering tasks, where each task has a benignly decomposed version. We verify our datasets by testing them on frontier models and show an 87% attack success rate on average on GPT-4o. To defend in real-time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each sub-prompt. We show that a carefully prompt-engineered lightweight monitor hits a 93% defense success rate, outperforming strong baselines such as Llama-Guard-4 and o3-mini, while cutting costs by 90% and latency by 50%. Additionally, we show that even under adversarial pressure, combining decomposition attacks with massive random task injection and automated red teaming, our lightweight sequential monitors remain robust. Our findings suggest that guarding against effective decomposition attacks is "surprisingly easy" with lightweight sequential monitors, enabling safety in real-world LLM agent deployment where expensive solutions are impractical.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

384. All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Authors:

From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide *online* feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on said dataset via *offline* maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only *lose* information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a *generation-verification gap*, *(1)* it is relatively easy to learn the relatively simple RM (*verifier*) from the preference data. Then, *(2)* the downstream RL procedure only returns policies (*generators*) that are optimal for such relatively simple verifiers. Thus, end-to-end, two-stage online FT only has to search over a reduced subset of the full space of policies, requiring less data than offline FT.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

385. FlowNar: Scalable Streaming Narration for Long-Form Videos

Authors:

Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are ill-suited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy for historical visual context removal, combined with our novel CLAM (Cross Linear Attentive Memory) module for streaming visual history retention, ensuring bounded visual memory usage and computational complexity, crucial for efficient streaming. We also introduce a realistic autoregressive evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting processing of 10$\times$ longer videos and achieving 3$\times$ higher throughput (FPS).

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

386. Pinet: Optimizing hard-constrained neural networks with orthogonal projection layers

Authors:

We introduce an output layer for neural networks that ensures satisfaction of convex constraints. Our approach, $\Pi$net, leverages operator splitting for rapid and reliable projections in the forward pass, and the implicit function theorem for backpropagation. We deploy $\Pi$net as a feasible-by-design optimization proxy for parametric constrained optimization problems and obtain modest-accuracy solutions faster than traditional solvers when solving a single problem, and significantly faster for a batch of problems. We surpass state-of-the-art learning approaches by orders of magnitude in terms of training time, solution quality, and robustness to hyperparameter tuning, while maintaining similar inference times. Finally, we tackle multi-vehicle motion planning with non-convex trajectory preferences and provide $\Pi$net as a GPU-ready package implemented in JAX.
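
To make the idea of a projection output layer concrete, here is a minimal sketch that maps a raw network output onto the intersection of a box and a hyperplane. It uses Dykstra's alternating projections, a classical method chosen here for brevity; it is not the operator-splitting scheme of $\Pi$net, and it omits the implicit-function-theorem backward pass. The constraint set, the function names, and the sample input are all assumptions made for illustration.

```python
def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(v, lo), hi) for v in x]

def project_hyperplane(x, a, b):
    """Euclidean projection onto the hyperplane {z : a . z = b}."""
    dot = sum(ai * xi for ai, xi in zip(a, x))
    norm_sq = sum(ai * ai for ai in a)
    step = (dot - b) / norm_sq
    return [xi - step * ai for ai, xi in zip(a, x)]

def dykstra(y, lo, hi, a, b, iters=200):
    """Project y onto (box) intersect (hyperplane) with Dykstra's algorithm.

    p and q are the correction terms that distinguish Dykstra's method from
    plain alternating projections and make the limit the true projection.
    """
    x = list(y)
    p = [0.0] * len(y)
    q = [0.0] * len(y)
    for _ in range(iters):
        z = project_box([xi + pi for xi, pi in zip(x, p)], lo, hi)
        p = [xi + pi - zi for xi, pi, zi in zip(x, p, z)]
        x = project_hyperplane([zi + qi for zi, qi in zip(z, q)], a, b)
        q = [zi + qi - xi for zi, qi, xi in zip(z, q, x)]
    return x

# Hypothetical raw network output that violates both constraints.
raw_output = [0.9, 0.8, -0.5]
feasible = dykstra(raw_output, lo=0.0, hi=1.0, a=[1.0, 1.0, 1.0], b=1.0)
```

With these constraints the feasible set is the probability simplex, so the result can be checked against a standard simplex projection.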

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

387. FlashRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Authors:

Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present FlashRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to $665\times$ over naïve sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply FlashRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the FlashRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
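
The core idea of solving a nonlinear recurrence as one system of equations can be illustrated on a scalar toy RNN, $h_t = \tanh(w h_{t-1} + x_t)$. Each Newton step linearizes the recurrence around the current guess, which leaves a *linear* recurrence $h_t^{\text{new}} = a_t h_{t-1}^{\text{new}} + b_t$; that linear recurrence is what a parallel reduction could evaluate. The sketch below is an assumption-laden illustration, not the FlashRNN implementation: the cell, the weights, and all function names are hypothetical, and the linear recurrence is evaluated with a plain loop rather than a parallel scan.

```python
import math

def sequential(h0, xs, w):
    """Reference: apply h_t = tanh(w * h_{t-1} + x_t) one step at a time."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(w * h + x)
        hs.append(h)
    return hs

def newton_sweep(h0, xs, w, guess):
    """One Newton step on the system F_t(h) = h_t - tanh(w * h_{t-1} + x_t) = 0."""
    prev_guess = [h0] + guess[:-1]      # h_{t-1} under the current guess
    new, h_prev_new = [], h0
    for x, hp in zip(xs, prev_guess):
        f = math.tanh(w * hp + x)       # recurrence evaluated at the guess
        a = w * (1.0 - f * f)           # derivative of the recurrence w.r.t. h_{t-1}
        # Linearized recurrence; a loop here, a parallel scan in a real system.
        h_new = f + a * (h_prev_new - hp)
        new.append(h_new)
        h_prev_new = h_new
    return new

def solve(h0, xs, w, sweeps):
    guess = [0.0] * len(xs)             # arbitrary initial guess for all time steps
    for _ in range(sweeps):
        guess = newton_sweep(h0, xs, w, guess)
    return guess

# Toy inputs (assumed values).
xs = [0.3, -0.2, 0.5, 0.1, -0.4, 0.7, 0.0, 0.2]
reference = sequential(0.0, xs, w=0.9)
parallel_style = solve(0.0, xs, w=0.9, sweeps=len(xs))
```

After $k$ sweeps the first $k$ hidden states already match the sequential result, so at most $T$ sweeps are needed for a length-$T$ sequence; the practical appeal is that far fewer sweeps typically suffice to reach a given tolerance.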

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

388. Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Authors:

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture‑of‑Experts (MoE) models, now standard in state‑of‑the‑art systems, introduce a new sparsity dimension that current dense‑model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. All code, data sources, and logs are released to facilitate reproducibility and future work.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

389. Training LLMs with LogicReward for Faithful and Rigorous Reasoning

Authors:

Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2%, respectively, on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code upon acceptance.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

390. Triple-BERT: Do We Really Need MARL for Order Dispatch on Ride-Sharing Platforms?

Authors:

On-demand ride-sharing platforms, such as Uber and Lyft, face the intricate real-time challenge of bundling and matching passengers, each with distinct origins and destinations, to available vehicles, all while navigating significant system uncertainties. Due to the extensive observation space arising from the large number of drivers and orders, order dispatching, though fundamentally a centralized task, is often addressed using Multi-Agent Reinforcement Learning (MARL). However, independent MARL methods fail to capture global information and exhibit poor cooperation among workers, while Centralized Training Decentralized Execution (CTDE) MARL methods suffer from the curse of dimensionality. To overcome these challenges, we propose Triple-BERT, a centralized single-agent reinforcement learning method designed specifically for large-scale order dispatching on ride-sharing platforms. Built on a variant of TD3, our approach addresses the vast action space through an action decomposition strategy that breaks down the joint action probability into individual driver action probabilities. To handle the extensive observation space, we introduce a novel BERT-based network, where parameter reuse mitigates parameter growth as the number of drivers and orders increases, and the attention mechanism effectively captures the complex relationships among the large pool of drivers and orders. We validate our method using a real-world ride-hailing dataset from Manhattan. Triple-BERT achieves approximately an 11.95% improvement over current state-of-the-art methods, with a 4.26% increase in served orders and a 22.25% reduction in pickup times. Our code, trained model parameters, and processed data are publicly available at the anonymous repository https://anonymous.4open.science/r/Triple-BERT .

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

391. Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database

Authors:

Efficiently editing knowledge stored in Large Language Models (LLMs) enables model updates without large-scale training. One promising solution is Locate-and-Edit (L&E), which allows simultaneous modification of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. With a simple modification of L&E methods, our framework not only significantly extends the capacity of knowledge editing but also eliminates the associated side effects. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFact datasets with GPT2-XL, GPT-J (6B), and Llama-3 (8B). The results demonstrate that NeuralDB excels in all metrics of editing success while maintaining original performance, as evaluated on six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (**50$\times$** more than in prior work).

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

392. ReDDiT: Rehashing Noise for Discrete Visual Generation

Authors:

In visual generation, discrete diffusion models are gaining traction for their efficiency and compatibility. However, pioneering attempts still fall behind their continuous counterparts, which we attribute to noise (absorbing state) design and sampling heuristics. In this study, we propose a rehashing noise approach for discrete diffusion transformers (termed **ReDDiT**), aiming to extend absorbing states and improve the expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees high diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline model (reducing gFID from 6.18 to **1.61**) and is on par with the continuous counterparts. The code and models will be publicly available.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

393. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Authors:

Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning and without pre-collected reasoning data for supervised fine-tuning (SFT) as a cold-start. Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://anonymous.4open.science/r/DeepEyes-97FE/.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

394. Discovering Diverse Behaviors via Temporal Contrastive Learning

Authors:

Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

395. A Formal Controllability Toolkit for Black-Box Generative Models

Authors:

As generative models become ubiquitous, there is a critical need for fine-grained control over the generation process. Yet, while controlled generation methods from prompting to fine-tuning proliferate, a fundamental question remains unanswered: are these models truly controllable in the first place? In this work, we provide a theoretical framework to formally answer this question. Framing human-model interaction as a control process, we propose a novel algorithm to estimate the controllable sets of models in a dialogue setting. Notably, we provide formal guarantees on the estimation error as a function of sample complexity: we derive probably-approximately correct bounds for controllable set estimates that are distribution-free, employ no assumptions except for output boundedness, and work for any black-box nonlinear control system (i.e., any generative model). We empirically demonstrate the theoretical framework on different tasks in controlling dialogue processes, for both language models and text-to-image generation. Our results show that model controllability is surprisingly fragile and highly dependent on the experimental setting. This highlights the need for rigorous controllability analysis, shifting the focus from simply attempting control to first understanding its fundamental limits.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

396. Learning Adaptive Distribution Alignment with Neural Characteristic Function for Graph Domain Adaptation

Authors:

Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs but is challenged by complex, multi-faceted distributional shifts. Existing methods attempt to reduce distributional shifts by aligning manually selected graph elements (e.g., node attributes or structural statistics), which typically require manually designed graph filters to extract relevant features before alignment. However, such approaches are inflexible: they rely on scenario-specific heuristics and struggle when dominant discrepancies vary across transfer scenarios. To address these limitations, we propose **ADAlign**, an Adaptive Distribution Alignment framework for GDA. Unlike heuristic methods, ADAlign requires no manual specification of alignment criteria. It automatically identifies the most relevant discrepancies in each transfer and aligns them jointly, capturing the interplay between attributes, structures, and their dependencies. This makes ADAlign flexible, scenario-aware, and robust to diverse and dynamically evolving shifts. To enable this adaptivity, we introduce the Neural Spectral Discrepancy (NSD), a theoretically principled parametric distance that provides a unified view of cross-graph shifts. NSD leverages a neural characteristic function in the spectral domain to encode feature-structure dependencies of all orders, while a learnable frequency sampler adaptively emphasizes the most informative spectral components for each task via a minimax paradigm. Extensive experiments on 10 datasets and 16 transfer tasks show that ADAlign not only outperforms state-of-the-art baselines but also achieves efficiency gains with lower memory usage and faster training.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

397. GUIDE: Gated Uncertainty-Informed Disentangled Experts for Long-tailed Recognition

Authors:

Long-Tailed Recognition (LTR) remains a significant challenge in deep learning. While multi-expert architectures are a prominent paradigm, we argue that their efficacy is fundamentally limited by a series of deeply entangled problems at the levels of representation, policy, and optimization. These entanglements induce homogeneity collapse among experts, suboptimal dynamic adjustments, and unstable meta-learning. In this paper, we introduce GUIDE, a novel framework conceived from the philosophy of Hierarchical Disentanglement. We systematically address these issues at three distinct levels. First, we disentangle expert representations and decisions through competitive specialization objectives to foster genuine diversity. Second, we disentangle policy-making from ambiguous signals by using online uncertainty decomposition to guide a dynamic expert refinement module, enabling a differentiated response to model ignorance versus data ambiguity. Third, we disentangle the optimization of the main task and the meta-policy via a two-timescale update mechanism, ensuring stable convergence. Extensive experiments on four challenging LTR benchmarks, including ImageNet-LT, iNaturalist 2018, CIFAR-100-LT and Places-LT, demonstrate that GUIDE establishes a new state of the art, validating the efficacy of our disentanglement approach. Code is available in the supplementary material.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

398. Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning

Authors:

Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's lightweight design requires training only a submodule of a single transformer layer and yet reduces hallucination by $\sim$30-40% across multiple short-form QA benchmarks. CASAL is $\sim$30x more compute-efficient and $\sim$20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data-scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired methods to practical deployment in production systems.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

399. DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Authors:

Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset (curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers) with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

400. WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

Authors:

This paper tackles **open-ended deep research (OEDR)**, a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and monolithic generation paradigms that include redundant, irrelevant evidence, suffering from hallucination issues and low citation accuracy. To address these challenges, we introduce **WebWeaver**, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, citation-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank via citations for each part, it effectively mitigates long-context issues and citation hallucinations. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing comprehensive, trusted, and well-structured reports.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

401. Textual Equilibrium Propagation for Deep Compound AI Systems

Authors:

Large language models (LLMs) are increasingly deployed as part of compound AI systems which coordinate multiple modules (e.g., retrievers, tools, verifiers) over long-horizon workflows. Although recent frameworks that propagate textual feedback globally (e.g., TextGrad) make it feasible to optimize such pipelines, we identify two depth-scaling failure modes in long-horizon agentic workflows: 1) exploding textual gradient, where textual feedback grows exponentially with depth, leading to prohibitively long messages and amplified evaluation biases; and 2) vanishing textual gradient, where limited long-context ability causes models to overemphasize recent or early feedback, while compression of lengthy feedback causes downstream messages to gradually lose specificity as they propagate many hops upstream. To mitigate these issues, we introduce Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy-based models. TEP includes two phases: 1) a free phase where local LLM critics iteratively refine prompts until reaching equilibrium (no further improvements are suggested); and 2) a nudged phase which applies proximal prompt edits with bounded modification intensity, using task-level objectives that propagate via forward signaling rather than backward feedback chains. This design supports local prompt optimization followed by controlled adaptation toward global goals without the computational burden and signal degradation of global textual backpropagation. Across long-horizon QA benchmarks and a multi-agent tool-use dataset, TEP consistently improves accuracy and efficiency over global propagation methods such as TextGrad, with gains that increase at greater depths, while preserving the practicality of black-box LLM components in deep compound AI systems.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

402. From Sorting Algorithms to Scalable Kernels: Bayesian Optimization in High-Dimensional Permutation Spaces

Authors:

Bayesian Optimization (BO) is a powerful tool for black-box optimization, but its application to high-dimensional permutation spaces is severely limited by the challenge of defining scalable representations. The current state-of-the-art BO approach for permutation spaces relies on an exhaustive $\Omega(n^2)$ pairwise comparison, inducing a dense representation that is impractical for large-scale permutations. To break this barrier, we introduce a novel framework for generating efficient permutation representations via kernel functions derived from sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from enumeration sort. Further, we introduce the Merge Kernel, which leverages the divide-and-conquer structure of merge sort to produce a compact $\Theta(n\log n)$ representation that achieves the lowest possible complexity with no information loss and effectively captures permutation structure. Our central thesis is that the Merge Kernel performs competitively with the Mallows kernel in low-dimensional settings, but significantly outperforms it in both optimization performance and computational efficiency as the dimension $n$ grows. Extensive evaluations on various permutation optimization benchmarks confirm our hypothesis, demonstrating that the Merge Kernel provides a scalable and more effective solution for Bayesian optimization in high-dimensional permutation spaces, thereby unlocking the potential for tackling previously intractable problems such as large-scale feature ordering and combinatorial neural architecture search.
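
For reference, the Mallows kernel mentioned above is commonly written as $k(\sigma, \sigma') = \exp(-\lambda\, d_K(\sigma, \sigma'))$, where $d_K$ is the Kendall tau distance (the number of item pairs the two permutations order differently). The following is a minimal sketch of that baseline only, not the paper's code; the function names and the bandwidth `lam` are illustrative assumptions. It enumerates all $n(n-1)/2$ pairs, which is the quadratic cost the abstract refers to.

```python
import math
from itertools import combinations

def kendall_tau_distance(p, q):
    """Count item pairs that permutations p and q order differently."""
    pos_q = {item: i for i, item in enumerate(q)}
    # Re-express p in terms of q's positions, then count inversions.
    mapped = [pos_q[item] for item in p]
    return sum(1 for i, j in combinations(range(len(mapped)), 2)
               if mapped[i] > mapped[j])

def mallows_kernel(p, q, lam=0.5):
    """Mallows kernel exp(-lam * d_K(p, q)); lam is an assumed bandwidth."""
    return math.exp(-lam * kendall_tau_distance(p, q))
```

Identical permutations give a kernel value of 1, and the value decays as more pairs are swapped.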

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

403. Geometry-aware Policy Imitation

Authors:

We propose a Geometry-Aware Policy Imitation (GPI) approach that rethinks imitation learning by treating demonstrations as geometric curves rather than collections of state–action samples. From these curves, GPI derives distance fields that give rise to two complementary control primitives: a progression flow that advances along expert trajectories and an attraction flow that corrects deviations. Their combination defines a controllable, non-parametric vector field that directly guides robot behavior. This formulation decouples metric learning from policy synthesis, enabling modular adaptation across low-dimensional robot states and high-dimensional perceptual inputs. GPI naturally supports multimodality by preserving distinct demonstrations as separate models and allows efficient composition of new demonstrations through simple additions to the distance field. We evaluate GPI in simulation and on real robots across diverse tasks. Experiments show that GPI achieves higher success rates than diffusion-based policies while running 20× faster, requiring less memory, and remaining robust to perturbations. These results establish GPI as an efficient, interpretable, and scalable alternative to generative approaches for robotic imitation learning.
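
The two control primitives can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a demonstration stored as a list of waypoints, plain Euclidean distance as the metric, and hypothetical gains `k_attract` and `k_progress`. The attraction term points from the current state to the nearest waypoint, and the progression term follows the unit tangent toward the next waypoint.

```python
import math

def nearest_index(demo, x):
    """Index of the demonstration waypoint closest to state x."""
    return min(range(len(demo)), key=lambda i: math.dist(demo[i], x))

def gpi_like_flow(demo, x, k_attract=1.0, k_progress=1.0):
    """Sum of an attraction term (toward the curve) and a progression term (along it)."""
    i = nearest_index(demo, x)
    j = min(i + 1, len(demo) - 1)
    # Attraction: pull the state back onto the nearest point of the curve.
    attraction = [k_attract * (d - s) for d, s in zip(demo[i], x)]
    # Progression: unit tangent toward the next waypoint (zero at the end).
    tangent = [b - a for a, b in zip(demo[i], demo[j])]
    norm = math.hypot(*tangent)
    if norm > 0:
        progression = [k_progress * t / norm for t in tangent]
    else:
        progression = [0.0] * len(x)
    return [a + p for a, p in zip(attraction, progression)]
```

With several demonstrations, one simple option under the same assumptions is to evaluate the flow on whichever demonstration is nearest to the current state, which keeps distinct modes separate.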

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

404. Learning to Recall with Transformers Beyond Orthogonal Embeddings

Authors:

Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability, thanks to their ability to encode information during training and retrieve it at inference. Existing theoretical analyses typically study transformers under idealized assumptions such as infinite data or orthogonal embeddings. In realistic settings, however, models are trained on finite datasets with non-orthogonal (random) embeddings. We address this gap by analyzing a single-layer transformer with random embeddings trained with (empirical) gradient descent on a simple token-retrieval task, where the model must identify an informative token within a length-$L$ sequence and learn a one-to-one mapping from tokens to labels. Our analysis tracks the "early phase" of gradient descent and yields explicit formulas for the model’s storage capacity—revealing a multiplicative dependence between sample size $N$, embedding dimension $d$, and sequence length $L$. We complement this with a lower bound for the statistical problem, showing that this multiplicative scaling is inherent under non-orthogonal embeddings.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

405. Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Authors:

While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: a data gap, a model gap, and an evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail, Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

406. Understanding and improving Shampoo and SOAP via Kullback-Leibler Minimization

Authors:

Shampoo and its efficient, Adam-stabilized variant, SOAP, employ structured second-moment estimation and have received growing attention for their effectiveness. In practice, Shampoo requires step-size grafting with Adam to achieve competitive performance. SOAP mitigates this by applying Adam in Shampoo's eigenbasis and further reducing per-iteration runtime. However, reliance on Adam introduces additional memory overhead in both methods. Prior theoretical interpretations have primarily examined their estimation schemes using the Frobenius norm. Motivated by the natural correspondence between the second moment and a covariance matrix, we reinterpret the estimation procedures in Shampoo and SOAP as instances of covariance estimation through the lens of Kullback–Leibler (KL) divergence minimization. This perspective reveals a previously overlooked theoretical limitation and motivates principled improvements to their design. Building on the KL perspective, we propose practical estimation schemes, KL-Shampoo and KL-SOAP, that match or exceed the performance of Shampoo and SOAP for pre-training a range of neural network models while maintaining SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to achieve superior performance, thereby avoiding the associated memory overhead. Surprisingly, KL-Shampoo consistently outperforms the other methods in our experiments.
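
As background for the KL perspective, the standard KL divergence between two zero-mean Gaussians with positive-definite covariances $A, B \in \mathbb{R}^{d \times d}$ is shown below next to the Frobenius objective it is contrasted with. The symbols $A$, $B$, and $d$ are notation chosen here for illustration; the abstract does not say which argument order or which structured (e.g., Kronecker-factored) form of $B$ the paper uses, so this is a reference formula rather than the paper's objective.

```latex
\mathrm{KL}\bigl(\mathcal{N}(0, A)\,\|\,\mathcal{N}(0, B)\bigr)
  = \tfrac{1}{2}\Bigl[\operatorname{tr}\bigl(B^{-1}A\bigr) - d
      + \log\det B - \log\det A\Bigr],
\qquad
\|A - B\|_F^2 = \operatorname{tr}\bigl((A - B)^\top (A - B)\bigr).
```

Unlike the Frobenius distance, the KL divergence is defined only for positive-definite arguments and is asymmetric in $A$ and $B$.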

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

407. Real-Time Robot Execution with Masked Action Chunking

Authors:

Real-time execution is essential for cyber-physical systems such as robots. These systems operate in dynamic real-world environments where even small delays can undermine responsiveness and compromise performance. Asynchronous inference has recently emerged as a system-level paradigm for real-time robot manipulation, enabling the next action chunk to be predicted while the current one is being executed. While this approach achieves real-time responsiveness, naive integration often results in execution failure. Previous methods attributed this failure to inter-chunk discontinuity and developed test-time algorithms to smooth chunk boundaries. In contrast, we identify another critical yet overlooked factor: intra-chunk inconsistency, where the robot’s executed action chunk partially misaligns with its current perception. To address this, we propose REMAC, which learns corrective adjustments on the pretrained policy through masked action chunking, enabling the policy to remain resilient under mismatches between intended actions and actual execution during asynchronous inference. In addition, we introduce a prefix-preserved sampling procedure to reinforce inter-chunk continuity. Overall, our method delivers more reliable policies without incurring additional latency. Extensive experiments in both simulation and real-world settings demonstrate that our method enables faster task execution, maintains robustness across varying delays, and consistently achieves higher completion rates.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

408. Look-ahead Reasoning with a Learned Model in Imperfect Information Games

Authors:

Test-time reasoning significantly enhances pre-trained AI agents’ performance. However, it requires an explicit environment model, which is often unavailable or overly complex in real-world scenarios. While MuZero enables effective model learning for search in perfect information games, extending this paradigm to imperfect information games presents substantial challenges due to more nuanced look-ahead reasoning techniques and the large number of states relevant for individual decisions. This paper introduces LAMIR, an algorithm that learns an abstracted model of an imperfect information game directly from agent-environment interaction. At test time, this trained model is used to perform look-ahead reasoning. The learned abstraction limits each subgame to a manageable size, making theoretically principled look-ahead reasoning tractable even in games where previous methods could not scale. We empirically demonstrate that with sufficient capacity, LAMIR learns the exact underlying game structure, and with limited capacity, it still learns a valuable abstraction, which improves the game-playing performance of the pre-trained agents even in large games.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

409. Statistical Guarantees in the Search for Less Discriminatory Algorithms

Authors:

Recent scholarship has argued that firms building data-driven decision systems in high-stakes domains like employment, credit, and housing should search for “less discriminatory algorithms” (LDAs) (Black et al., 2023). That is, for a given decision problem, firms considering deploying a model should make a good-faith effort to find equally performant models with lower disparate impact across social groups. Evidence from the literature on model multiplicity shows that randomness in training pipelines can lead to multiple models with the same performance, but meaningful variations in disparate impact. This suggests that developers can find LDAs simply by randomly retraining models. Firms cannot continue retraining forever, though, which raises the question: What constitutes a good-faith effort? In this paper, we formalize LDA search via model multiplicity as an optimal stopping problem, where a model developer with limited information wants to produce strong evidence that they have sufficiently explored the space of models. Our primary contribution is an adaptive stopping algorithm that yields a high-probability upper bound on the gains achievable from a continued search, allowing the developer to certify (e.g., to a court) that their search was sufficient. We provide a framework under which developers can impose stronger assumptions about the distribution of models, yielding correspondingly stronger bounds. We validate the method on real-world lending datasets.
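
To make the "when can I stop?" question concrete, here is a generic quantile argument, not the paper's adaptive algorithm. Assume retrained models are i.i.d. draws from a fixed distribution. If a fraction $\epsilon$ of that distribution were less discriminatory than the best model found so far, the chance of missing all of it in $n$ draws would be at most $(1-\epsilon)^n$. Setting this to $\delta$ gives a bound on the remaining mass. The function names are hypothetical, and the bound covers how many better models may remain, not how much better they are.

```python
import math

def residual_mass_bound(n_draws, delta):
    """Bound (holding with probability >= 1 - delta) on the fraction of models
    that would beat the best of n_draws i.i.d. retrained models."""
    if n_draws < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need n_draws >= 1 and 0 < delta < 1")
    return 1.0 - delta ** (1.0 / n_draws)

def draws_needed(epsilon, delta):
    """Smallest n such that (1 - epsilon)**n <= delta."""
    if not 0.0 < epsilon < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("need 0 < epsilon < 1 and 0 < delta < 1")
    return math.ceil(math.log(delta) / math.log(1.0 - epsilon))
```

For example, `draws_needed(0.05, 0.05)` asks how many retrainings are needed before one can say, at 95% confidence, that at most 5% of models would improve on the best one found.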

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

410. Scalable Supervising Software Agents with Patch Reasoner

Authors:

While large language model agents have advanced software engineering tasks, the poor scalability of existing test-based supervision limits the gains available from data scaling. The reason is twofold: (1) building and running test sandboxes is heavy and fragile, and (2) data with high-coverage tests is naturally rare and vulnerable to test hacking via edge cases. In this paper, we propose R4P, a patch verifier model that provides scalable rewards for training and testing SWE agents via reasoning. We consider patch verification to be fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modifications and gain a dense reward for stable training. R4P achieves 72.2% accuracy in verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, a 10.0% improvement over the original Qwen3-32B. This can be further improved to 33.8% with R4P for test-time scaling. The stable scaling curves in both RL test rewards and test-time accuracy reflect R4P's practical utility for scalable supervision of software agents.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

411. From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation

Authors:

Reinforcement learning with verifiable rewards (RLVR) succeeds in reasoning tasks (e.g., math and code) by checking the final verifiable answer (i.e., a verifiable dot signal). However, extending this paradigm to open-ended generation is challenging because there is no unambiguous ground truth. Relying on single-dot supervision often leads to inefficiency and reward hacking. To address these issues, we propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). Specifically, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts (e.g., keywords), and style, which evaluates adherence to stylistic properties through LLM-based verification. In this way, RLVRR combines the exploratory strength of RL with the efficiency and reliability of supervised fine-tuning (SFT). Extensive experiments on more than 10 benchmarks with Qwen and Llama models confirm the advantages of our approach. RLVRR (1) substantially outperforms SFT trained with ten times more data and advanced reward models, (2) unifies the training of structured reasoning and open-ended generation, and (3) generalizes more effectively while preserving output diversity. These results establish RLVRR as a principled and efficient path toward verifiable reinforcement learning for general-purpose LLM alignment.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

412. R-Zero: Self-Evolving Reasoning LLM from Zero Data

Authors:

Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
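
One way to picture a reward for tasks "near the edge of the Solver's capability" is an uncertainty-style score that peaks when the Solver's sampled answers are evenly split. The sketch below is an assumption made for illustration; the abstract does not give the reward's form. Because no labels exist, it uses agreement with the Solver's own majority answer as a stand-in for correctness, and `challenger_reward` is a hypothetical name.

```python
def challenger_reward(solver_answers):
    """Hypothetical reward in [0, 1]: highest when the Solver's majority
    answer is given by about half of its samples."""
    if not solver_answers:
        raise ValueError("need at least one sampled answer")
    counts = {}
    for answer in solver_answers:
        counts[answer] = counts.get(answer, 0) + 1
    # Share of samples agreeing with the most common answer.
    p_majority = max(counts.values()) / len(solver_answers)
    return 1.0 - 2.0 * abs(p_majority - 0.5)
```

A task the Solver always answers the same way scores 0, while a task that splits its samples 50/50 scores 1.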

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

413. Verification of the Implicit World Model in a Generative Model via Adversarial Sequences

Authors:

Generative sequence models are typically trained on sample sequences from natural or formal languages. It is a crucial question whether—or to what extent—sample-based training is able to capture the true structure of these languages, often referred to as the "world model". Theoretical results indicate that we can hope for soundness at best, that is, generating valid sequences, but not necessarily all of them. However, it is still important to have practical tools that are able to verify whether a given sequence model is sound. In this study, we focus on chess, as it is a domain that provides enough complexity while having a simple rule-based world model. We propose adversarial sequence generation for verifying the soundness of the sequence model. Our adversaries generate valid sequences so as to force the sequence model to generate an invalid next move prediction. Apart from the falsification of soundness, this method is also suitable for a more fine-grained analysis of the failure modes and the effects of different choices during training. To demonstrate this, we propose a number of methods for adversarial sequence generation and evaluate the approach on a large set of chess models. We train models on random as well as high-quality chess games, using several training recipes. We find that none of the models are sound, but some training techniques and dataset choices are able to improve soundness remarkably. We also investigate the potential application of board state probes in both our training and attack methods. Our findings indicate that the extracted board states have no causal role in next token prediction in most of the models.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

414. One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

Authors:

Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose **SMoPE**, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple "prompt experts" within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.
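
The sparse activation step can be sketched as generic top-k gating: given one proxy score per prompt expert, keep the k largest and renormalize them with a softmax. This is a minimal illustration of sparse MoE routing in general, not SMoPE's score aggregation or noise mechanism; `sparse_route` and the default `k=2` are assumptions.

```python
import math

def sparse_route(scores, k=2):
    """Keep the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}
```

Experts outside the returned dictionary receive zero weight, so their prompt parameters are neither used nor updated for that input.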

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

415. Music Flamingo: Scaling Music Understanding in Audio Language Models

Authors:

We introduce Music Flamingo, a novel large audio–language model, designed to advance music (including song) understanding in foundational audio models. While audio–language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question–answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio–language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as richly and meaningfully as humans do. Demo: https://musicflamingo.github.io

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

416. GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving

Authors:

Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model's ability to tackle complex problems. To overcome these limitations, we propose **GAR**: *Generative Adversarial Reinforcement learning*, a comprehensive RL training framework that jointly trains the problem composer and solver in an adversarial loop. **GAR** introduces an implicit curriculum learning mechanism, which aligns task difficulty with the prover's evolving capability. It thereby improves training efficiency and enables stronger performance in proving advanced theorems. Experiments show that with **GAR** training, Goedel-Prover-V2-8B and DeepSeek-Prover-V2-7B achieve an average relative improvement in pass@32 of **4.20%** on the MiniF2F-Test benchmark, while DeepSeek-Prover-V2's pass@32 on ProofNet-Test increases from 22.58% to **25.81%**. Beyond formal proving, **GAR** establishes a general RL paradigm for co-evolution of problem generation and solving under verifiable environments.
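
For readers unfamiliar with the metric, pass@k is often estimated with the unbiased estimator from the code-generation literature: with n sampled proofs per problem, c of them verified, the estimate is $1 - \binom{n-c}{k} / \binom{n}{k}$. The abstract does not say how its pass@32 figures were computed, so the sketch below is a reference for the metric only. The helper `relative_improvement` just reproduces the arithmetic behind "relative" gains, using the ProofNet-Test numbers quoted above.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

def relative_improvement(before, after):
    """Relative change, e.g. 0.10 means a 10% relative gain."""
    return (after - before) / before
```

Applied to the quoted ProofNet-Test scores, `relative_improvement(22.58, 25.81)` is about 0.143, i.e. a 3.23-point absolute gain corresponds to roughly a 14.3% relative gain.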

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

417. Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs

Authors:

Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of functional motif information, or as a separate modality, introducing formidable alignment challenges, current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, the inclusion of the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

418. MENLO: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages

Authors:

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt–response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

419. Spectral Attention Steering for Prompt Highlighting

Authors:

Steering a large language model's attention towards user-specified highlighted text is a critical capability. Existing prompt highlighting methods are incompatible with modern efficient attention mechanisms like Flash Attention due to their reliance on post-hoc matrix editing. We introduce Spectral Editing Key Amplification (SEKA), a training-free steering method that tackles this by directly editing key embeddings before attention computation. SEKA learns universal relevance subspaces offline via spectral decomposition. We extend this to Adaptive SEKA (AdaSEKA), a query-adaptive variant that uses a training-free routing mechanism to dynamically combine multiple expert subspaces based on the prompt's semantic intent. Our experiments show both methods significantly outperform strong baselines on standard steering benchmarks while adding much lower latency and memory overhead, ensuring full compatibility with optimised attention.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

420. Fairness-Aware Multi-view Evidential Learning with Adaptive Prior

Authors:

Multi-view evidential learning aims to integrate information from multiple views to improve prediction performance and provide trustworthy uncertainty estimation. Most previous methods assume that view-specific evidence learning is naturally reliable. However, in practice, the evidence learning process tends to be biased. Through empirical analysis on real-world data, we reveal that samples tend to be assigned more evidence to support data-rich classes, thereby leading to unreliable uncertainty estimation in predictions. This motivates us to delve into a new Biased Evidential Multi-view Learning (BEML) problem. To this end, we propose Fairness-Aware Multi-view Evidential Learning (FAML). FAML first introduces an adaptive prior based on training trajectories, which acts as a regularization strategy to flexibly calibrate the biased evidence learning process. Furthermore, we explicitly incorporate a fairness constraint based on class-wise evidence variance to promote balanced evidence allocation. In the multi-view fusion stage, we propose an opinion alignment mechanism to mitigate view-specific bias across views, thereby encouraging the integration of consistent and mutually supportive evidence. Theoretical analysis shows that FAML enhances fairness in the evidence learning process. Extensive experiments on six real-world multi-view datasets demonstrate that FAML achieves more balanced evidence allocation and improves both prediction performance and the reliability of uncertainty estimation compared to state-of-the-art methods.
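
As background, evidential classifiers usually turn non-negative per-class evidence $e_k$ into a Dirichlet with parameters $\alpha_k = e_k + 1$, giving belief masses $b_k = e_k / S$ and uncertainty $u = K / S$ with $S = \sum_k \alpha_k$. The sketch below shows that standard mapping, plus a hypothetical variance penalty in the spirit of the class-wise evidence-variance constraint. The abstract does not give FAML's actual loss, so `evidence_variance` is an assumption.

```python
def opinion_from_evidence(evidence):
    """Map class evidence to (beliefs, uncertainty) via a Dirichlet with alpha = e + 1."""
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty

def evidence_variance(mean_evidence_per_class):
    """Hypothetical fairness penalty: variance of the average evidence per class."""
    values = list(mean_evidence_per_class)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

Beliefs and uncertainty always sum to 1, so evidence concentrated on data-rich classes lowers the reported uncertainty, which is the bias the abstract describes.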

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

421. AnyUp: Universal Feature Upsampling

Authors:

We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an *inference-time* feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

422. From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

Authors:

We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor–critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the environment’s states can be formulated as a process with two time scales: the environment time and the gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment's state and the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying over-parametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

423. PointRePar: SpatioTemporal Point Relation Parsing for Robust Category-Unified 3D Tracking

Authors:

3D single object tracking (SOT) remains a highly challenging task due to the inherent crux in learning representations from point clouds to effectively capture both spatial shape features and temporal motion features. Most existing methods employ a category-specific optimization paradigm, training the tracking model individually for each object category to enhance tracking performance, albeit at the expense of generalizability across different categories. In this work, we propose a robust category-unified 3D SOT model, referred to as SpatioTemporal Point Relation Parsing model (*PointRePar*), which is capable of joint training across multiple categories while excelling in unified feature learning for both spatial shapes and temporal motions. Specifically, the proposed *PointRePar* captures and parses the latent point relations across both spatial and temporal domains to learn superior shape and motion characteristics for robust tracking. On the one hand, it models the multi-scale spatial point relations using a Mamba-based U-Net architecture with adaptive point-wise feature refinement. On the other hand, it captures both the point-level and box-level temporal relations to exploit the latent motion features. Extensive experiments across three benchmarks demonstrate that our *PointRePar* not only outperforms the existing category-unified 3D SOT methods significantly, but also compares favorably against the state-of-the-art category-specific methods. Codes will be released.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

424. On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs

Authors:

As increasingly large pre-trained models are released, deploying them on edge devices for privacy-preserving applications requires effective compression. Recent works combine quantization with the fine-tuning of high-precision LoRA adapters, which can substantially reduce model size while mitigating the accuracy loss from quantization. However, edge devices have inherently heterogeneous capabilities, while performing configuration-wise fine-tuning for every quantization setting is computationally prohibitive. In this paper, we propose CoA-LoRA, a method that dynamically adjusts the LoRA adapter to arbitrary quantization configurations (i.e., the per-layer bit-width choices of a pre-trained model) without requiring repeated fine-tuning. This is accomplished via a configuration-aware model that maps each configuration to its low-rank adjustments. The effectiveness of this model critically depends on the training configuration set, a collection of configurations chosen to cover different total bit-width budgets. However, constructing a high-quality configuration set is non-trivial. We therefore design a Pareto-based configuration search that iteratively optimizes the training configuration set, yielding more precise low-rank adjustments. Our experiments demonstrate that, unlike the state-of-the-art methods that require fine-tuning a separate LoRA adapter for each configuration, CoA-LoRA incurs no additional time cost while achieving comparable or even superior performance to those methods.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

425. Neural Collapse in Multi-Task Learning

Authors:

Neural collapse (NC) plays a key role in understanding deep neural networks. However, existing empirical and theoretical studies of NC focus on a single task. This paper studies neural collapse in multi-task learning. We consider two standard feature-based multi-task learning scenarios: Single-Source Multi-Task Classification (SSMTC) and Multi-Source Multi-Task Classification (MSMTC). Interestingly, we find that the task-specific linear classifier and features converge to the Simplex Equiangular Tight Frame (ETF) in the setting of MSMTC. In the setting of SSMTC, each task-specific linear classifier converges to a task-specific ETF, and these task-specific ETFs are mutually orthogonal. Moreover, the shared features across tasks converge to the scaled sum of the weight vectors associated with the task-specific labels in each task's classifier. We also provide a theoretical guarantee for our empirical findings. Through detailed analysis, we uncover the mechanism of MTL where each task learns task-specific latent features that together form the shared features. Moreover, we reveal an inductive bias in MTL that task correlation reconfigures the geometry of task-specific classifiers and promotes alignment among the features learned by each task.
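As a small illustration of the geometry this abstract refers to, the sketch below builds the standard Simplex ETF for K classes using the textbook construction M = sqrt(K/(K-1)) (I - (1/K) 11^T). The function names `simplex_etf` and `dot` are hypothetical helpers, and this is not the paper's multi-task analysis, only the single-task ETF that the classifiers are said to converge to.

```python
import math

def simplex_etf(num_classes):
    """Return the K x K Simplex ETF matrix; row i is the vector for class i."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

K = 4
etf = simplex_etf(K)

# Every class vector has unit norm.
norms = [math.sqrt(dot(v, v)) for v in etf]

# Every pair of distinct class vectors has the same cosine, -1 / (K - 1).
pair_cosines = [dot(etf[i], etf[j]) for i in range(K) for j in range(K) if i != j]
```

The equal norms and the common pairwise cosine of -1/(K-1) are the two properties usually meant by "equiangular tight frame" in the neural collapse literature.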

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

426. Graph Random Features for Scalable Gaussian Processes

Authors:

We study the application of graph random features (GRFs) – a recently-introduced stochastic estimator of graph node kernels – to scalable Gaussian processes on discrete input spaces. We prove that (under mild assumptions) Bayesian inference with GRFs enjoys $\mathcal{O}(N^{3/2})$ time complexity with respect to the number of nodes $N$, with probabilistic accuracy guarantees. In contrast, exact kernels generally incur $\mathcal{O}(N^{3})$. Wall-clock speedups and memory savings unlock Bayesian optimisation with over 1M graph nodes on a single computer chip, whilst preserving competitive performance.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

427. Should We Still Pretrain Encoders with Masked Language Modeling?

Authors:

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM approach or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://huggingface.co/XXX to foster further research.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

428. Masked Generative Policy for Robotic Control

Authors:

We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior success rates compared to state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9\% across 150 tasks while cutting per-sequence inference time by up to 35×. It further improves the average success rate by 60\% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail. Further results and videos are available at: https://anonymous.4open.science/r/masked_generative_policy-8BC6.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

429. Any-Order Flexible Length Masked Diffusion

Authors:

Masked diffusion models (MDMs) have recently emerged as a promising alternative to autoregressive models over discrete domains. MDMs generate sequences in an any-order, parallel fashion, enabling fast inference and strong performance on non-causal tasks. However, a crucial limitation is that they do not support token insertions and are thus limited to *fixed-length* generations. To this end, we introduce **Flex**ible **M**asked **D**iffusion **M**odels (FlexMDMs), a discrete diffusion paradigm that simultaneously can model sequences of flexible length while provably retaining MDMs' flexibility of any-order inference. Grounded in an extension of the stochastic interpolant framework, FlexMDMs generate sequences by inserting mask tokens and unmasking them. Empirically, we show that FlexMDMs match MDMs in perplexity while modeling length statistics with much higher fidelity. On a synthetic maze planning task, they achieve $\approx$ 60\% higher success rate than MDM baselines. Finally, we show pretrained MDMs can easily be *retrofitted* into FlexMDMs: on 16 H100s, it takes only three days to fine-tune LLaDA-8B into a FlexMDM, achieving superior performance on math (GSM8K, 58\%$\to$67\%) and code infilling performance (52\%$\to$65\%).

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

430. Long-Context Generalization with Sparse Attention

Authors:

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.
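To make the contrast between dense and sparse attention weights concrete, here is a minimal sketch comparing softmax with sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax. It is not ASEntmax and includes no learnable temperature; the score values are made up for illustration.

```python
import math

def softmax(scores):
    """Dense softmax: every entry receives strictly positive probability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): Euclidean projection onto the simplex."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # Index k is in the support while 1 + k * z_k exceeds the top-k sum.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    # Scores at or below the threshold tau get exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]

scores = [2.0, 1.5, 0.1, -1.0]  # illustrative attention logits
dense = softmax(scores)
sparse = sparsemax(scores)
```

With these logits, softmax spreads some mass over all four positions, whereas sparsemax keeps only the two largest and sets the rest to exactly zero, which is the property the abstract attributes to the entmax family.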

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

431. Imitation Learning as Return Distribution Matching

Authors:

We study the problem of training a risk-sensitive reinforcement learning (RL) agent through imitation learning (IL). Unlike standard IL, our goal is not only to train an agent that matches the expert’s expected return (i.e., its *average performance*) but also its *risk attitude* (i.e., other features of the return distribution, such as variance). We propose a general formulation of the risk-sensitive IL problem in which the objective is to match the expert’s return distribution in Wasserstein distance. We focus on the tabular setting and assume the expert’s reward is *known*. After demonstrating the limited expressivity of Markovian policies for this task, we introduce an efficient and sufficiently expressive subclass of non-Markovian policies tailored to it. Building on this subclass, we develop two provably efficient algorithms, RS-BC and RS-KT, for solving the problem when the transition model is unknown and known, respectively. We show that RS-KT achieves substantially lower sample complexity than RS-BC by exploiting dynamics information. We further demonstrate the sample efficiency of return distribution matching in the setting where the expert’s reward is *unknown* by designing an oracle-based variant of RS-KT. Finally, we complement our theoretical analysis of RS-KT and RS-BC with numerical simulations, highlighting both their sample efficiency and the advantages of non-Markovian policies over standard sample-efficient IL algorithms.
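The difference between matching the expected return and matching the return distribution can be seen in a tiny example. The sketch below computes the 1-Wasserstein distance between two empirical return distributions with equal sample counts (for one-dimensional samples this reduces to the mean absolute difference of sorted values). The function name and the return values are illustrative assumptions, not the RS-BC or RS-KT algorithms.

```python
def wasserstein_1(samples_a, samples_b):
    """W1 between two 1-D empirical distributions with the same number of samples."""
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample counts")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical returns: a risk-seeking expert and a risk-averse imitator.
expert_returns = [0.0, 10.0, 0.0, 10.0]   # mean 5, high variance
imitator_returns = [5.0, 5.0, 5.0, 5.0]   # mean 5, zero variance

same_mean = mean(expert_returns) == mean(imitator_returns)
distance = wasserstein_1(expert_returns, imitator_returns)
```

Here the two agents agree on average performance, yet the Wasserstein distance is 5.0, so an objective based on the return distribution still separates them.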

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

432. Signal Structure-Aware Gaussian Splatting for Large-Scale Scene Reconstruction

Authors:

3D Gaussian Splatting has demonstrated remarkable potential in novel view synthesis. In contrast to small-scale scenes, large-scale scenes inevitably contain sparsely observed regions with excessively sparse initial points. In this case, supervising Gaussians initialized from low-frequency sparse points with high-frequency images often induces uncontrolled densification and redundant primitives, degrading both efficiency and quality. Intuitively, this issue can be mitigated with scheduling strategies, which can be categorized into two paradigms: modulating target signal frequency via densification and modulating sampling frequency via image resolution. However, previous scheduling strategies are primarily hardcoded, failing to perceive the convergence behavior of the scene frequency. To address this, we reframe the scene reconstruction problem from the perspective of signal structure recovery, and propose SIG, a novel scheduler that Synchronizes Image supervision with Gaussian frequencies. Specifically, we derive the average sampling frequency and bandwidth of 3D representations, and then regulate the training image resolution and the Gaussian densification process based on scene frequency convergence. Furthermore, we introduce Sphere-Constrained Gaussians, which leverage the spatial prior of initialized point clouds to control Gaussian optimization. Our framework enables frequency-consistent, geometry-aware, and floater-free training, achieving state-of-the-art performance by a substantial margin in both efficiency and rendering quality in large-scale scenes.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

433. MrRoPE: Mixed-radix Rotary Position Embedding

Authors:

Rotary Position Embedding (RoPE)-extension refers to modifying or generalizing the Rotary Position Embedding scheme to handle longer sequences than those encountered during pre-training. However, current extension strategies are highly diverse and lack a unified theoretical foundation. In this paper, we propose $\textbf{\textit{MrRoPE (Mixed-radix RoPE)}}$, a generalized encoding formulation based on a radix system conversion perspective, which elegantly unifies various RoPE-extension approaches as distinct radix conversion strategies. Based on this theory, we introduce two training-free extensions, $\textbf{\textit{MrRoPE-Uni}}$ and $\textbf{\textit{MrRoPE-Pro}}$, which leverage uniform and progressive radix conversion strategies, respectively, to achieve “train short, test long” generalization. Without fine-tuning, MrRoPE-Pro sustains over 85% recall in the 128K-context Needle-in-a-Haystack test and achieves more than double YaRN’s accuracy on Infinite-Bench retrieval and dialogue subsets. Theoretical analysis confirms that MrRoPE-Pro effectively raises the upper bound of RoPE's attainable encoding length, which further validates the reliability and utility of our theory and methodology.
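For readers unfamiliar with the baseline being extended, the sketch below implements plain RoPE on a single vector and checks its defining property: the query-key inner product depends only on the relative offset between positions. This is the standard formulation with base 10000, not MrRoPE-Uni or MrRoPE-Pro, and the vectors and positions are arbitrary illustrative values.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive coordinate pairs of vec by position-dependent angles."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        # Pair i/2 uses frequency base^(-i/d), so later pairs rotate more slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (4) at two different absolute positions.
near = dot(rope(q, 3), rope(k, 7))
far = dot(rope(q, 103), rope(k, 107))
```

Because each pair is rotated by an angle linear in the position, shifting both positions by the same amount leaves the attention score unchanged; extension schemes such as those the abstract unifies modify how these angles are assigned beyond the training length.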

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

434. OpenThoughts: Data Recipes for Reasoning Models

Authors:

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond – improvements of 15.3, 17.2, and 20.5 percentage points compared to DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on ANONYMIZED.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

435. Quasi-Equivariant Metanetworks

Authors:

Metanetworks are neural architectures designed to operate directly on pretrained weights to perform downstream tasks. However, the parameter space serves only as a proxy for the underlying function class, and the parameter-function mapping is inherently non-injective: distinct parameter configurations may yield identical input-output behaviors. As a result, metanetworks that rely solely on raw parameters risk overlooking the intrinsic symmetries of the architecture. Reasoning about functional identity is therefore essential for effective metanetwork design, motivating the development of equivariant metanetworks, which incorporate equivariance principles to respect architectural symmetries. Existing approaches, however, typically enforce strict equivariance, which imposes rigid constraints and often leads to sparse and less expressive models. To address this limitation, we introduce the novel concept of quasi-equivariance, which allows metanetworks to move beyond the rigidity of strict equivariance while still preserving functional identity. We lay down a principled basis for this framework and demonstrate its broad applicability across diverse neural architectures, including feedforward, convolutional, and transformer networks. Through empirical evaluation, we show that quasi-equivariant metanetworks achieve good trade-offs between symmetry preservation and representational expressivity. These findings advance the theoretical understanding of weight-space learning and provide a principled foundation for the design of more expressive and functionally robust metanetworks.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

436. YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting

Authors:

Fast and flexible 3D scene reconstruction from unstructured image collections remains a significant challenge. We present YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images. Our model is highly versatile, operating effectively with both posed and unposed, calibrated and uncalibrated inputs. YoNoSplat predicts local Gaussians and camera poses for each view, which are aggregated into a global representation using either predicted or provided poses. To overcome the inherent difficulty of jointly learning 3D Gaussians and camera parameters, we introduce a novel mixing training strategy. This approach mitigates the entanglement between the two tasks by initially using ground-truth poses to aggregate local Gaussians and gradually transitioning to a mix of predicted and ground-truth poses, which prevents both training instability and exposure bias. We further resolve the scale ambiguity problem by a novel pairwise camera-distance normalization scheme and by embedding camera intrinsics into the network. Moreover, YoNoSplat also predicts intrinsic parameters, making it feasible for uncalibrated inputs. YoNoSplat demonstrates exceptional efficiency, reconstructing a scene from 100 views (at 280×518 resolution) in just 2.69 seconds on an NVIDIA GH200 GPU. It achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. The code and pretrained models will be made public.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

437. xRFM: Accurate, scalable, and interpretable feature learning models for tabular data

Authors:

Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has been relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that compared to $31$ other methods, including recently introduced tabular foundation models (TabPFN-v2) and GBDTs, xRFM achieves the best performance across $100$ regression datasets and is competitive with the best methods across $200$ classification datasets, outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

438. Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

Authors:

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet, most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information largely underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions augmented by CoT-inspired enhancements. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing a foundation for advancing embodied multi-view intelligence in robotics.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

439. Calibrated Information Bottleneck for Trusted Multi-modal Clustering

Authors:

Information Bottleneck (IB) Theory is renowned for its ability to learn simple, compact, and effective data representations. In multi-modal clustering, IB theory effectively eliminates interfering redundancy and noise from multi-modal data, while maximally preserving the discriminative information. Existing IB-based multi-modal clustering methods suffer from low-quality pseudo-labels and over-reliance on accurate Mutual Information (MI) estimation, which is known to be challenging. Moreover, unreliable or noisy pseudo-labels may lead to an overconfident clustering outcome. To address these challenges, this paper proposes a novel CaLibrated Information Bottleneck (CLIB) framework designed to learn a clustering that is both accurate and trustworthy. We build a parallel multi-head network architecture—incorporating one primary cluster head and several modality-specific calibration heads—which achieves three key goals: namely, calibrating for the distortions introduced by biased MI estimation thus improving the stability of IB, constructing reliable target variables for IB from multiple modalities and producing a trustworthy clustering result. Notably, we design a dynamic pseudo-label selection strategy based on information redundancy theory to extract high-quality pseudo-labels, thereby enhancing training stability. Experimental results demonstrate that our model not only achieves state-of-the-art clustering accuracy on multiple benchmark datasets but also exhibits excellent performance on the expected calibration error metric.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

440. MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval

Authors:

We introduce MRMR, the first expert-level multidisciplinary multimodal retrieval benchmark requiring intensive reasoning. MRMR contains 1,502 queries spanning 23 domains, with positive documents carefully verified by human experts. Compared to prior benchmarks, MRMR introduces three key advancements. First, it challenges retrieval systems across diverse areas of expertise, enabling fine-grained model comparison across domains. Second, queries are reasoning-intensive, with images requiring deeper interpretation such as diagnosing microscopic slides. We further introduce Contradiction Retrieval, a novel task requiring models to identify conflicting concepts. Finally, queries and documents are constructed as image–text interleaved sequences. Unlike earlier benchmarks restricted to single images or unimodal documents, MRMR offers a realistic setting with multi-image queries and mixed-modality corpus documents. We conduct an extensive evaluation of 4 categories of multimodal retrieval systems and 14 frontier models on MRMR. The text embedding model Qwen3-Embedding with LLM-generated image captions achieves the highest performance, highlighting substantial room for improving multimodal retrieval models. Although the latest multimodal models such as Ops-MM-Embedding perform competitively on expert-domain queries, they fall short on reasoning-intensive tasks. We believe that MRMR paves the way for advancing multimodal retrieval in more realistic and challenging scenarios.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

441. EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models

Authors:

Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for domains such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

442. Streaming Visual Geometry Transformer

Authors:

Perceiving and reconstructing 3D geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and low-latency applications, we propose a streaming visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 3D reconstruction. This design can handle low-latency 3D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operators (e.g., FlashAttention) from large language models. Extensive experiments on various 3D geometry perception benchmarks demonstrate that our model enhances inference speed in online scenarios while maintaining competitive performance, thereby facilitating scalable and interactive 3D vision systems.
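The idea of caching historical keys and values so that each new frame attends only to the past can be sketched in a few lines. The toy below uses identity projections (query, key, and value are all the raw token vector), a single head, and made-up inputs; `StreamingAttention` and `causal_full` are hypothetical names, and none of this reflects the paper's actual architecture or its distillation from VGGT.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def causal_full(tokens):
    """Offline reference: token t attends to tokens 0..t, recomputed from scratch."""
    return [attend(tokens[t], tokens[: t + 1], tokens[: t + 1]) for t in range(len(tokens))]

class StreamingAttention:
    """Online version: keeps past keys and values in a cache, one step per new token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        # Identity projections are an assumption made to keep the sketch short.
        self.keys.append(token)
        self.values.append(token)
        return attend(token, self.keys, self.values)

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]  # illustrative frame features
stream = StreamingAttention()
online_outputs = [stream.step(t) for t in tokens]
offline_outputs = causal_full(tokens)
```

The online outputs match the causal offline outputs token by token, while each streaming step touches only the cache plus the new token rather than reprocessing the whole sequence.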

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

443. SeeDNorm: Self-Rescaled Dynamic Normalization

Authors:

Normalization layers constitute an essential component in neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient $\gamma$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in the forward pass, and a static scaling factor $\gamma$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains the ability of RMSNorm to dynamically adjust the gradient according to the input norm. We provide a detailed analysis of the training optimization for SeeDNorm and propose corresponding solutions to address potential instability issues that may arise when applying SeeDNorm. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously commonly used normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers like DyT.
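The claim that RMSNorm discards the input norm can be checked directly. The sketch below is a plain RMSNorm with a static per-dimension scale $\gamma$; it shows that rescaling the input leaves the output essentially unchanged. It does not implement SeeDNorm, whose input-dependent scaling is described only at a high level here, and the input values are illustrative.

```python
import math

def rms_norm(x, gamma, eps=1e-12):
    """Standard RMSNorm: divide by the root mean square, then scale by gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [1.0, -2.0, 3.0, 4.0]
gamma = [1.0, 1.0, 1.0, 1.0]  # static learnable scale, fixed to ones here

small = rms_norm(x, gamma)
large = rms_norm([100.0 * v for v in x], gamma)

# Largest elementwise gap between the two outputs.
max_gap = max(abs(a - b) for a, b in zip(small, large))
```

Both inputs point in the same direction but differ in norm by a factor of 100, and the outputs coincide up to the tiny effect of `eps`; this is the lost information that a data-dependent scaling coefficient is meant to recover.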

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

444. Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

Authors:

Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks, to general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data. Despite this trend, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism, which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization–LM–detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching the token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is further increased in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

445. End-to-End Probabilistic Framework for Learning with Hard Constraints

Authors:

We present ProbHardE2E, a probabilistic forecasting framework that incorporates hard operational/physical constraints and provides uncertainty quantification. Our methodology uses a novel differentiable probabilistic projection layer (DPPL) that can be combined with a wide range of neural network architectures. DPPL allows the model to learn the system in an end-to-end manner, compared to other approaches where constraints are satisfied either through a post-processing step or at inference. ProbHardE2E optimizes a strictly proper scoring rule, without making any distributional assumptions on the target, which enables it to obtain robust distributional estimates (in contrast to existing approaches that generally optimize likelihood-based objectives, which are heavily biased by their distributional assumptions and model choices); and it can incorporate a range of non-linear constraints (increasing the power of modeling and flexibility). We apply ProbHardE2E in learning partial differential equations with uncertainty estimates and to probabilistic time-series forecasting, showcasing it as a broadly applicable general framework that connects these seemingly disparate domains.
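
To make the idea of a probabilistic projection concrete, the sketch below treats the simplest case: a Gaussian forecast and a linear equality constraint $Ax = b$. This is an assumption chosen for illustration, not necessarily how DPPL is defined in the paper. The covariance-weighted projection used here coincides with conditioning the Gaussian on the constraint, and the function name `project_gaussian` is hypothetical.

```python
import numpy as np

def project_gaussian(mu, sigma, A, b):
    """Project N(mu, sigma) onto {x : A x = b} using the covariance-weighted metric."""
    gain = sigma @ A.T @ np.linalg.inv(A @ sigma @ A.T)
    mu_proj = mu - gain @ (A @ mu - b)       # shift the mean onto the constraint set
    sigma_proj = sigma - gain @ A @ sigma    # remove variance along constrained directions
    return mu_proj, sigma_proj

mu = np.array([1.0, 2.0, 4.0])
sigma = np.diag([1.0, 0.5, 2.0])
A = np.array([[1.0, 1.0, 1.0]])  # hypothetical conservation constraint: components sum to b
b = np.array([6.0])

mu_proj, sigma_proj = project_gaussian(mu, sigma, A, b)
```

After projection the mean satisfies the constraint exactly and the covariance has no mass in the constrained direction, so every sample from the projected distribution also satisfies $Ax = b$. Because the operations are plain linear algebra, a layer of this kind is differentiable with respect to `mu` and `sigma`.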

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

446. Latent Diffusion Model without Variational Autoencoder

Authors:

Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with Variational Autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+Diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are not only crucial for perception and understanding tasks, but also equally essential for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce **SVG**—a novel latent diffusion model without variational autoencoders, which unleashes **S**elf-supervised representations for **V**isual **G**eneration. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

447. MMPD: Diverse Time Series Forecasting via Multi-Mode Patch Diffusion Loss

Authors:

Despite the flourishing of time series (TS) forecasting backbones, training still relies mostly on regression losses such as Mean Squared Error (MSE). However, MSE assumes a unimodal Gaussian distribution and therefore struggles to capture complex patterns, especially in real-world scenarios where multiple diverse outcomes are possible. We propose the Multi-Mode Patch Diffusion (MMPD) loss, which can be applied to any patch-based backbone that outputs latent tokens for the future. Models trained with MMPD loss generate diverse predictions (modes) with the corresponding probabilities. Technically, MMPD loss models the future distribution with a diffusion model conditioned on latent tokens from the backbone. A lightweight Patch Consistent MLP is introduced as the denoising network to ensure consistency across denoised patches. Multi-mode predictions are generated by a multi-mode inference algorithm that fits an evolving variational Gaussian Mixture Model (GMM) during diffusion. Experiments on eight datasets show its superiority in diverse forecasting. Its deterministic and probabilistic capabilities also match the strong competitor losses, MSE and Student-T, respectively.
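
The limitation of MSE on multi-modal futures can be seen in a toy example. The sketch below uses synthetic data (an assumption, not taken from the paper) in which the future value is near -1 or near +1 with equal probability, and searches for the single point forecast that minimizes MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal future: the outcome lands near -1 or near +1 with equal probability.
modes = rng.choice([-1.0, 1.0], size=10_000)
outcomes = modes + 0.05 * rng.normal(size=10_000)

# Evaluate the MSE of every candidate point forecast on a grid.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((outcomes - c) ** 2) for c in candidates])
best_point_forecast = candidates[np.argmin(mse)]

# Distance from the MSE-optimal forecast to the closest outcome actually observed.
closest_outcome_gap = np.min(np.abs(outcomes - best_point_forecast))
```

The MSE-optimal forecast sits close to the sample mean, between the two modes, where no outcome ever occurs. A loss that models the full conditional distribution can instead report both modes with their probabilities.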

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

448. ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

Authors:

The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality, achieving an improvement of 10% over scalar-based reward models.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

449. PACE: Pretrained Audio Continual Learning

Authors:

Audio is a fundamental modality for analyzing speech, music, and environmental sounds. While pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world scenarios where data distributions evolve over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs) and provide a comprehensive analysis of its unique challenges. Unlike in the vision domain where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly applying such strategies to audio leads to poor performance. This is due to a fundamental property of audio backbones: they emphasize low-level spectral details rather than structured semantics, resulting in severe upstream–downstream misalignment. Through extensive empirical analysis, we identify a promising technical route based on analytic classifiers with first-session adaptation (FSA), but also uncover two major limitations: representation saturation in coarse-grained scenarios and representation shifts in fine-grained scenarios. To address these challenges, we propose **PACE**, an innovative method that improves FSA via a regularized analytic classifier and introduces multi-session adaptation through adaptive subspace-orthogonal PEFT for better semantic alignment. Additionally, we design spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments across six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, representing a significant step toward robust and scalable audio CL with PTMs.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

450. Unified Vision-Language-Action Model

Authors:

Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This tokenized formulation naturally supports flexible multimodal task learning, particularly from large-scale video data, and further demonstrates that generative vision supervision can significantly enhance visual understanding. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge, substantially outperforming prior methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing π₀-FAST's 85.5%. We further demonstrate its broad applicability through experiments on real-world ALOHA manipulation tasks and autonomous driving scenarios.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

451. SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors

Authors:

Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization. Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitives. However, they often fail to produce continuous surfaces and instead yield discrete, color-biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close-up views. To address this issue, we present SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) primitives, which provide stronger anisotropy and higher geometric precision. By incorporating a surface continuity prior and a forced alpha blending strategy, SurfSplat reconstructs coherent geometry together with faithful textures. Furthermore, we introduce High-Resolution Rendering Consistency (HRRC), a new metric designed to evaluate high-resolution reconstruction quality. Extensive experiments on RealEstate10K, DL3DV, and ScanNet demonstrate that SurfSplat consistently outperforms prior methods on both standard metrics and HRRC, establishing a robust solution for high-fidelity 3D reconstruction from sparse inputs.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

452. Frozen Priors, Fluid Forecasts: Prequential Uncertainty for Low-Data Deployment with Pretrained Generative Models

Authors:

Deploying ML systems with only a few real samples makes operational metrics (such as alert rates or mean scores) highly unstable. Existing uncertainty quantification (UQ) methods fail here: frequentist intervals ignore the deployed predictive rule, Bayesian posteriors assume continual refitting, and conformal methods offer per-example rather than long-run guarantees. We introduce a forecast-first UQ framework that blends the empirical distribution with a frozen pretrained generator using a unique Dirichlet schedule, ensuring time-consistent forecasts. Uncertainty is quantified via martingale posteriors: a lightweight, likelihood-free resampling method that simulates future forecasts under the deployed rule, yielding sharp, well-calibrated intervals for both current and long-run metrics without retraining or density evaluation. A single hyperparameter, set by a small-$n$ minimax criterion, balances sampling variance and model–data mismatch; for bounded scores, we provide finite-time drift guarantees. We also show how this framework informs optimal retraining decisions. Applicable off-the-shelf to frozen generators (flows, diffusion, autoregressive models, GANs) and linear metrics (means, tails, NLL), it outperforms bootstrap baselines across vision and language benchmarks (WikiText-2, CIFAR-10, and SVHN datasets); e.g., it achieves $\sim$90% coverage on GPT-2 with 20 samples vs. 37% for bootstrap. Importantly, our uncertainty estimates are operational under the deployed forecasting rule agnostic of the population parameters, affording practicable estimators for deployment in real world settings.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

453. Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Authors:

Recent advances in illumination control extend image-based methods to video, yet still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera–illumination control. In addition, our model surpasses prior video relighting methods in text- and background-conditioned settings. Ablation studies further validate the effectiveness of the disentangled formulation and degradation pipeline. Code, data and models will be made public.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

454. EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

Authors:

Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs’ expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base, and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Different from standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates the overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models both in emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

455. InputDSA: Demixing, then comparing recurrent and externally driven dynamics

Authors:

In control problems and basic scientific modeling, it is important to compare observations with dynamical simulations. For example, comparing two neural systems can shed light on the nature of emergent computations in the brain and deep neural networks. Recently, Ostrow et al. (2024) introduced Dynamical Similarity Analysis (DSA), a method to measure the similarity of two systems based on their state dynamics rather than geometry or topology. However, DSA does not consider how inputs affect the dynamics, meaning that two similar systems, if driven differently, may be classified as different. Because real-world dynamical systems are rarely autonomous, it is important to account for the effects of input drive. To this end, we introduce a novel metric for comparing both intrinsic (recurrent) and input-driven dynamics, called InputDSA (iDSA). InputDSA extends the DSA framework by estimating and comparing both input and intrinsic dynamic operators using a novel variant of Dynamic Mode Decomposition with control (DMDc) based on subspace identification. We demonstrate that InputDSA can successfully compare partially observed, input-driven systems from noisy data. We show that when the true inputs are unknown, surrogate inputs can be substituted without a major deterioration in similarity estimates. We apply InputDSA on Recurrent Neural Networks (RNNs) trained with Deep Reinforcement Learning, identifying that high-performing networks are dynamically similar to one another, while low-performing networks are more diverse. Lastly, we apply InputDSA to neural data recorded from rats performing a cognitive task, demonstrating that it identifies a transition from input-driven evidence accumulation to intrinsically-driven decision-making. Our work demonstrates that InputDSA is a robust and efficient method for comparing intrinsic dynamics and the effect of external input on dynamical systems.
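
The separation of intrinsic and input-driven dynamics rests on fitting a model of the form $x_{t+1} = A x_t + B u_t$. The sketch below uses the standard least-squares form of Dynamic Mode Decomposition with control on fully observed, noise-free data. This is an assumption made for illustration: the paper describes a different, subspace-identification-based variant that also handles partial observation, and the function name `fit_dmdc` is hypothetical.

```python
import numpy as np

def fit_dmdc(states, inputs):
    """Least-squares estimate of A and B in x[t+1] = A x[t] + B u[t]."""
    x_now, x_next = states[:-1], states[1:]
    u_now = inputs[:-1]
    stacked = np.hstack([x_now, u_now])                  # shape (T-1, n + m)
    coef, *_ = np.linalg.lstsq(stacked, x_next, rcond=None)
    n = states.shape[1]
    return coef[:n].T, coef[n:].T                         # A is (n, n), B is (n, m)

# Simulate a hypothetical stable 2-D system driven by a 1-D random input.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, -0.2], [0.2, 0.9]])
B_true = np.array([[1.0], [0.5]])
T = 200
inputs = rng.normal(size=(T, 1))
states = np.zeros((T, 2))
for t in range(T - 1):
    states[t + 1] = A_true @ states[t] + B_true @ inputs[t]

A_hat, B_hat = fit_dmdc(states, inputs)
```

With the two operators estimated separately, the intrinsic operator `A_hat` and the input operator `B_hat` of two systems can each be compared, rather than attributing all variation in the trajectories to recurrent dynamics.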

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

456. SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Authors:

We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720×1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: We design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code and model will be publicly released.
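
The constant-memory state mentioned in point (2) follows from the cumulative structure of causal linear attention. The sketch below is a token-wise illustration under stated assumptions: a ReLU-based feature map (an illustrative choice, not necessarily the paper's) and no block structure. It shows that a fixed-size running state reproduces the quadratic causal computation.

```python
import numpy as np

def feature_map(x):
    # Positive features keep the normalizer strictly positive; ReLU is an assumed choice.
    return np.maximum(x, 0.0) + 1e-6

def quadratic_causal_linear_attention(q, k, v):
    """Reference: builds the full T x T weight matrix, lower-triangular for causality."""
    weights = np.tril(feature_map(q) @ feature_map(k).T)
    return (weights @ v) / weights.sum(axis=1, keepdims=True)

def recurrent_linear_attention(q, k, v):
    """Same result using a running state whose size does not grow with sequence length."""
    d_k, d_v = k.shape[1], v.shape[1]
    state = np.zeros((d_k, d_v))   # running sum of outer(phi(k_t), v_t)
    norm = np.zeros(d_k)           # running sum of phi(k_t)
    outputs = []
    for q_t, k_t, v_t in zip(feature_map(q), feature_map(k), v):
        state += np.outer(k_t, v_t)
        norm += k_t
        outputs.append(q_t @ state / (q_t @ norm))
    return np.stack(outputs)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 4)) for _ in range(3))
out_quadratic = quadratic_causal_linear_attention(q, k, v)
out_recurrent = recurrent_linear_attention(q, k, v)
```

The pair (`state`, `norm`) plays the role of the cache: it has shape `(d_k, d_v)` and `(d_k,)` regardless of how many tokens have been processed, whereas a softmax-attention KV cache grows linearly with the history.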

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 OpenReview 📄 Download PDF

457. FakeXplain: AI-Generated Images Detection via Human-Aligned Grounded Reasoning

Authors:

The rapid rise of image generation calls for detection methods that are both interpretable and reliable. Existing approaches, though accurate, act as black boxes and fail to generalize to out-of-distribution data, while multi-modal large language models (MLLMs) provide reasoning ability but often hallucinate. To address these issues, we construct FakeXplained, a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, forming the basis for human-aligned, visually grounded reasoning. Leveraging FakeXplained, we develop FakeXplainer, which fine-tunes MLLMs with a progressive training pipeline, enabling accurate detection, artifact localization, and coherent textual explanations. Extensive experiments show that FakeXplainer not only sets a new state-of-the-art in detection and localization accuracy (98.2% accuracy, 36.0% IoU), but also demonstrates strong robustness and out-of-distribution generalization, uniquely delivering spatially grounded, human-aligned rationales.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

458. Extending Sequence Length is Not All You Need: Effective Integration of Multimodal Signals for Gene Expression Prediction

Authors:

Gene expression prediction, which predicts mRNA expression levels from DNA sequences, presents significant challenges. Previous works often focus on extending input sequence length to locate distal enhancers, which may influence target genes from hundreds of kilobases away. Our work first reveals that for current models, long sequence modeling can decrease performance. Even carefully designed algorithms only mitigate the performance degradation caused by long sequences. Instead, we find that proximal multimodal epigenomic signals near target genes prove more essential. Hence we focus on how to better integrate these signals, which has been overlooked. We find that different signal types serve distinct biological roles, with some directly marking active regulatory elements while others reflect background chromatin patterns that may introduce confounding effects. Simple concatenation may lead models to develop spurious associations with these background patterns. To address this challenge, we propose Prism (**P**roximal **r**egulatory **i**ntegration of **s**ignals for **m**RNA expression levels prediction), a framework that learns multiple combinations of high-dimensional epigenomic features to represent distinct background chromatin states and uses backdoor adjustment to mitigate confounding effects. Our experimental results demonstrate that proper modeling of multimodal epigenomic signals achieves state-of-the-art performance using only short sequences for gene expression prediction.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

459. Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Authors:

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Leveraging this dataset, we train a unified model that integrates a multimodal language model with FLUX.1-Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 2,000 challenging samples, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even state-of-the-art systems score below 50%, while our model achieves the strongest open-source performance, with consistent gains from inference-time reasoning. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 OpenReview 📄 Download PDF

460. Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match

Authors:

Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens from a smaller draft model in parallel, yet its strict exact-match verification discards many semantically valid continuations. We propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model’s own corrective behavior to judge whether a draft–target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft–target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves $\geq$99% of the target model’s accuracy while achieving an average 2.81$\times$ speedup on Llama-3.1-70B-Instruct and 5.07$\times$ speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62$\times$.
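
The entropy-level gate can be pictured as a threshold on the Shannon entropy of the target model's next-token distribution. The sketch below is only an illustration of that idea: the function `allows_alternatives`, the threshold value, and the example distributions are hypothetical and are not taken from the paper.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

def allows_alternatives(probs, threshold=0.5):
    """Hypothetical gate: high entropy suggests several continuations are plausible."""
    return token_entropy(probs) > threshold

# Illustrative distributions over a 4-token vocabulary (made-up numbers).
nearly_deterministic = [0.97, 0.01, 0.01, 0.01]
several_plausible = [0.4, 0.3, 0.2, 0.1]

strict_position = allows_alternatives(nearly_deterministic)
loose_position = allows_alternatives(several_plausible)
```

Under this reading, a mismatch at a low-entropy position would be treated as a genuine error, while a mismatch at a high-entropy position would be passed on to the second tier (the deferred window) for further checking.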

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 OpenReview 📄 Download PDF

461. MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning

Authors:

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To overcome these shortcomings, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. To address this, we construct MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, and design MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision–language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments show that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments. More visualizations and robot demonstrations are available at https://momagraph.github.io/.

📊 Review Scores

Average score: 6.50

Lowest score: 6

Highest score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 OpenReview 📄 Download PDF

462. Perception-Aware Policy Optimization for Multimodal Reasoning

Authors:

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for empowering Large Language Models (LLMs) with long chain-of-thought reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error (67%) in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to generate visually grounded reasoning without external supervision. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which maximizes the difference between two probability distributions over the same rollout sequence, conditioned on either the original or corrupted visual input. Notably, PAPO does not rely on any additional data annotation, reward models, or stronger teacher models, and can therefore be seamlessly integrated into mainstream RLVR algorithms such as GRPO and DAPO. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, PAPO offers a new perspective on advancing multimodal RLVR via the optimization objective, moving beyond rollout or reward design and pointing toward deeper integration of perception and reasoning.
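
The Implicit Perception Loss is described as a KL divergence between two distributions over the same rollout, one conditioned on the original image and one on a corrupted image. The sketch below computes such a per-token KL from two sets of logits. The direction of the KL, the averaging over tokens, and the random logits standing in for model outputs are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_token_kl(logits_original, logits_corrupted):
    """Mean over tokens of KL(p_original || p_corrupted) for one rollout."""
    logp = log_softmax(logits_original)
    logq = log_softmax(logits_corrupted)
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())

# Stand-in logits for a rollout of 5 tokens over a 12-token vocabulary (synthetic values).
rng = np.random.default_rng(0)
logits_with_image = rng.normal(size=(5, 12))
logits_with_masked_image = rng.normal(size=(5, 12))

kl_when_image_matters = mean_token_kl(logits_with_image, logits_with_masked_image)
kl_when_image_ignored = mean_token_kl(logits_with_image, logits_with_image)
```

A policy that ignores the image produces the same distribution in both cases and a KL of zero, so rewarding a larger divergence pushes the policy to rely on the visual input. Because the term is unbounded, it is plausible that an additional regularizer is needed, which is the role the abstract assigns to the Double Entropy Loss.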

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

463. Autoregressive Image Generation with Randomized Parallel Decoding

Authors:

We introduce ARPG, a novel visual Autoregressive model that enables Randomized Parallel Generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel decoupled decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image in-painting, out-painting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.83 with only 32 sampling steps, achieving over a 30 times speedup in inference and a 75 percent reduction in memory consumption compared to representative recent autoregressive models at a similar scale.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

464. WithAnyone: Toward Controllable and ID Consistent Image Generation

Authors:

Identity-consistent (ID-consistent) generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets—containing multiple images of the same individual—forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset, MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive experiments—both qualitative and quantitative—demonstrate that WithAnyone substantially reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive, controllable generation.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

465. Aligning Collaborative View Recovery and Tensorial Subspace Learning via Latent Representation for Incomplete Multi-View Clustering

Authors:

Multi-view data usually suffer from partially missing views in open scenarios, which inevitably degrades clustering performance. Incomplete multi-view clustering (IMVC) has attracted increasing attention and achieved significant success. Although existing imputation-based IMVC methods perform well, they still face one crucial limitation: view recovery and subspace representation lack explicit alignment and collaborative interaction in exploring complementarity and consistency across multiple views. To this end, this study proposes a novel IMVC method to Align collaborative view Recovery and tensorial Subspace Learning via latent representation (ARSL-IMVC). Specifically, ARSL-IMVC infers the complete view from a view-shared latent representation and a view-specific estimator with a Hilbert-Schmidt Independence Criterion regularizer, reshaping the consistent and diverse information intrinsically embedded in the original multi-view data. Then, ARSL-IMVC learns the view-shared and view-specific subspace representations from the latent feature and the recovered views, and models high-order correlations at the global and local levels in a unified low-rank tensor space. Thus, using the latent representation as a bridge in a unified framework, ARSL-IMVC aligns the exploration of complementarity and consistency across view recovery and subspace representation learning, so that the two processes reinforce each other to promote clustering. Extensive experimental results on seven datasets demonstrate the powerful capacity of ARSL-IMVC in complex incomplete multi-view clustering tasks under various view missing scenarios.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

466. VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

Authors:

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in text-based reasoning with large language models, multi-modal reasoning – especially for videos – remains limited. In this work, we fill this gap by introducing VideoMind, a novel video-language agent for temporal-grounded video reasoning. Our method involves two key innovations: (1) We identify four essential capabilities for grounded video reasoning and propose a role-based agentic workflow, comprising a planner to coordinate roles, a grounder for temporal event localization, a verifier to assess event candidates, and an answerer for question answering. (2) To efficiently integrate these roles during inference, we propose a novel Chain-of-LoRA mechanism, where a unified base model with multiple LoRA adapters is leveraged to enable seamless role switching, balancing efficiency and flexibility. Extensive experiments on 14 benchmarks across 3 tasks, including Grounded VideoQA, Video Temporal Grounding, and General VideoQA, demonstrate the effectiveness of the proposed scheme in advancing video agent, test-time scaling, and long-form video reasoning. Code, models, and data will be publicly available.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

467. SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models

Authors:

Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. In scalable applications, fine-tuning-based methods are time-consuming to precisely erase multiple target concepts, while real-time editing-based methods often degrade the generation quality of non-target concepts due to conflicting optimization objectives. To address this dilemma, we introduce SPEED, an efficient concept erasure approach that directly edits model parameters. SPEED searches for a null space, a model editing space where parameter updates do not affect non-target concepts, to achieve scalable and precise erasure. To facilitate accurate null space optimization, we incorporate three complementary strategies: Influence-based Prior Filtering (IPF) to selectively retain the most affected non-target concepts, Directed Prior Augmentation (DPA) to enrich the filtered retain set with semantically consistent variations, and Invariant Equality Constraints (IEC) to preserve key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in non-target preservation while achieving efficient and high-fidelity concept erasure, successfully erasing 100 concepts within only 5 seconds.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

468. DragFlow: Unleashing DiT Priors with Region-Based Supervision for Drag Editing

Authors:

Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models such as Stable Diffusion are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiTs with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX’s rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 8, 6

📄 openreview 📄 Download PDF

469. Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems

Authors:

Vision-Language Models (VLMs) enable powerful multi-agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often spirals costs. Existing paradigms, such as Mixture-of-Agents and knowledge-based routers, rely on heuristic proxies that ignore costs and collapse uncertainty structure, leading to provably suboptimal coordination. We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty into a structured, tradable asset (perceptual, semantic, inferential), and enforces profitability-driven trading among agents based on rational economic rules. A market-aware broker, extending Thompson Sampling, initiates collaboration and guides the system toward cost-efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by over 3×. These results establish market-based coordination as a principled and scalable paradigm for building economically viable multi-agent visual intelligence systems.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 8, 6, 6

📄 openreview 📄 Download PDF

470. Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Authors:

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB’s contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 20 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB’s primary limitation and exposing critical gaps in current LLM capabilities. All prompts, source code and experimental configurations are publicly available at https://anonymous.4open.science/r/Multi-LiveCodeBench-C627/

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 8, 6, 6, 6

📄 openreview 📄 Download PDF

471. CL-DPS: A Contrastive Learning Approach to Blind Nonlinear Inverse Problem Solving via Diffusion Posterior Sampling

Authors:

Diffusion models (DMs) have recently become powerful priors for solving inverse problems. However, most work focuses on non-blind settings with known measurement operators, and existing DM-based blind solvers largely assume linear measurements, which limits practical applicability where operators are frequently nonlinear. We introduce CL-DPS, a contrastively trained likelihood for diffusion posterior sampling that requires no knowledge of the operator parameters at inference. To the best of our knowledge, CL-DPS is the first DM-based framework capable of solving blind nonlinear inverse problems. Our key idea is to train an auxiliary encoder offline, using a MoCo-style contrastive objective over randomized measurement operators, to learn a surrogate for the conditional likelihood $p(\boldsymbol{y} \mid \boldsymbol{x}_t)$. During sampling, we inject the surrogate's gradient as a guidance term along the reverse diffusion trajectory, which enables posterior sampling without estimating or inverting the forward operator. We further employ overlapping patch-wise inference to preserve fine structure and a lightweight color-consistency head to stabilize color statistics. The guidance is sampler-agnostic and pairs well with modern solvers (e.g., DPM-Solver++ (2M)). Extensive experiments show that CL-DPS effectively handles challenging nonlinear cases, such as rotational and zoom deblurring, where prior DM-based methods fail, while remaining competitive on standard linear benchmarks. Code: https://anonymous.4open.science/r/CL-DPS-4F5D.

📊 Review Scores

Average score: 6.50

Minimum score: 6

Maximum score: 8

Number of reviewers: 4

Individual scores: 6, 6, 6, 8

📄 openreview 📄 Download PDF

472. Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

Authors:

Previous work shows that the chain of continuous thought (continuous CoT) improves the reasoning capability of large language models (LLMs) by enabling implicit parallel thinking, and a subsequent work provided theoretical insight by showing that a two-layer transformer equipped with continuous CoT can efficiently solve directed graph reachability by maintaining a superposition of multiple reasoning traces in the continuous thought. However, it remains unclear how the superposition mechanism is naturally learned from gradient-based training methods. To fill this gap, we theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem to unveil how the superposition mechanism emerges during training in two training stages -- (i) a *thought-generation* stage that autoregressively expands the continuous thought, and (ii) a *prediction* stage that converts the thought into the final answer. Our analysis reveals that during training using continuous thought, the index-matching logit, an important quantity which reflects the strength of the model's local search ability, will first increase and then remain bounded under mild assumptions. The bounded index-matching logit effectively balances exploration and exploitation during the reasoning process: the model will exploit local problem structures to identify plausible search traces, and assign comparable weights to multiple such traces to explore when it is uncertain about which solution is correct, which results in superposition. Our experimental results tracking the growth of logits further validate our theory.

📊 Review Scores

Average score: 6.40

Minimum score: 6

Maximum score: 8

Number of reviewers: 5

Individual scores: 6, 6, 6, 6, 8

📄 openreview 📄 Download PDF

473. A Recovery Guarantee for Sparse Neural Networks

Authors:

We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning.
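For readers unfamiliar with iterative hard thresholding (IHT), the sketch below shows the classical version for a linear measurement model y = Ax, not the ReLU-network setting analyzed in the paper. Each step takes a gradient step on the squared residual and then keeps only the s largest-magnitude entries. The toy matrix is a deliberately well-conditioned square system (a small perturbation of the identity) chosen so that convergence is easy to reason about; all names and sizes are illustrative assumptions.

```python
import math


def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:s])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def iht(A, y, s, steps=60):
    """Classical IHT with unit step size: x <- H_s(x + A^T (y - A x))."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(steps):
        residual = [yi - ri for yi, ri in zip(y, matvec(A, x))]
        step = matvec(At, residual)
        x = hard_threshold([xi + gi for xi, gi in zip(x, step)], s)
    return x


# Toy problem: A is the identity plus a small deterministic perturbation.
n = 6
A = [[(1.0 if i == j else 0.0) + 0.02 * math.sin(i * n + j + 1) for j in range(n)]
     for i in range(n)]
x_true = [0.0, 3.0, 0.0, 0.0, -2.0, 0.0]  # 2-sparse signal
y = matvec(A, x_true)
x_hat = iht(A, y, s=2)
```

The working state is a single s-sparse vector, which is the intuition behind describing such methods as memory-efficient.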

📊 Review Scores

Average score: 6.40

Minimum score: 6

Maximum score: 8

Number of reviewers: 5

Individual scores: 6, 8, 6, 6, 6

📄 openreview 📄 Download PDF

474. Language and Experience: A Computational Model of Social Learning in Complex Tasks

Authors:

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments. How do people integrate these two sources of knowledge, and how might AI systems? We present a computational framework that models human social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. We make this possible by turning a pretrained language model into a probabilistic model of how humans share advice conditioned on their beliefs, allowing our agents both to generate advice for others and to interpret linguistic input as evidence during Bayesian inference. Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key discoveries in both humans and models. We further explore how knowledge can accumulate across generations through iterated learning experiments and demonstrate successful knowledge transfer between humans and models—revealing how structured, language-compatible representations might enable human-machine collaborative learning.

📊 Review Scores

Average score: 6.40

Minimum score: 6

Maximum score: 8

Number of reviewers: 5

Individual scores: 6, 8, 6, 6, 6

📄 openreview 📄 Download PDF

475. Monocular Normal Estimation via Shading Sequence Estimation

Authors:

Monocular normal estimation aims to estimate a normal map from a single RGB image of an object under arbitrary lighting. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have an overall correct color distribution, the reconstructed surfaces frequently fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct spatially varying geometric details, as they are represented in normal maps only by relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to varied geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset with diverse shapes, materials, and lighting conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation. Code and dataset will be released to facilitate reproducible research.
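The least-squares conversion from shading to normals can be sketched for a single pixel. This is the classical photometric-stereo formulation and rests on assumptions the abstract does not state: Lambertian shading s_i = l_i · n, known light directions l_i, and no shadows. The function names and the light set are hypothetical.

```python
import math


def solve_3x3(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0, 0.0, 0.0]
    for r in reversed(range(3)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (aug[r][3] - tail) / aug[r][r]
    return x


def normal_from_shading(lights, shading):
    """Least-squares normal for one pixel via the normal equations L^T L n = L^T s."""
    LtL = [[sum(l[i] * l[j] for l in lights) for j in range(3)] for i in range(3)]
    Lts = [sum(l[i] * s for l, s in zip(lights, shading)) for i in range(3)]
    g = solve_3x3(LtL, Lts)
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


# Toy check: render shading for a known normal, then recover it.
r = 1.0 / math.sqrt(2.0)
lights = [(0.0, 0.0, 1.0), (r, 0.0, r), (0.0, r, r), (-r, 0.0, r)]
n_true = (0.0, 0.6, 0.8)
shading = [sum(li * ni for li, ni in zip(l, n_true)) for l in lights]
n_hat = normal_from_shading(lights, shading)
```

In a full pipeline the same solve would run independently per pixel over the predicted shading sequence.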

📊 Review Scores

Average score: 6.40

Minimum score: 6

Maximum score: 8

Number of reviewers: 5

Individual scores: 6, 6, 6, 8, 6

📄 openreview 📄 Download PDF

476. CheckMate! Watermarking Graph Diffusion Models in Polynomial Time

Authors:

Watermarking provides an effective means for data governance. However, conventional post-editing graph watermarking approaches degrade the graph quality and involve NP-hard subroutines. Alternatively, recent approaches advocate for embedding watermarking patterns in the noisy latent during data generation from diffusion models, but remain uncharted for graph models due to the hardness of inverting the graph diffusion process. In this work, we propose CheckWate: the first watermarking framework for graph diffusion models, embedding a checkerboard watermark and providing exact, polynomial-time verification. To address NP-completeness due to graph isomorphism, CheckWate embeds the watermark into the latent eigenvalues, which are isomorphism-invariant. To detect the watermark by reversing the graph diffusion process, CheckWate leverages the graph eigenvectors to approximately dequantize the discrete graph back to the continuous latent, with theoretical guarantees on the detectability and dequantization error. We further introduce a latent sparsification mechanism to enhance the robustness of CheckWate against graph modifications. We evaluate CheckWate on four datasets and four graph modification attacks, against three generation-time watermark schemes. CheckWate achieves remarkable generation quality while being detectable under strong attacks such as isomorphism, whereas the baselines are unable to detect the watermark. Code available at: https://anonymous.4open.science/r/checkwate.

📊 Review Scores

Average score: 6.40

Minimum score: 6

Maximum score: 8

Number of reviewers: 5

Individual scores: 6, 8, 6, 6, 6

📄 openreview 📄 Download PDF

477. Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Authors:

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent fine-tuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely compared with when given a single VQA sample. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. The code, dataset and trained models will be released upon acceptance.

📊 Review Scores

Average score: 6.40

Minimum score: 6

Maximum score: 8

Number of reviewers: 5

Individual scores: 6, 6, 6, 8, 6

📄 openreview 📄 Download PDF

478. Towards a Unified View of Large Language Model Post-Training

Authors:

Many approaches with seemingly disparate losses exist for post-training modern language models, such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive the Unified Policy Gradient Estimator (UPGE), a framework with four interchangeable parts that unifies a wide spectrum of post-training approaches through their loss gradient form. We further present the calculations of these methods as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify effectiveness of HPT. Across six mathematical reasoning benchmarks and two out-of-distribution tasks, HPT consistently surpasses strong baselines across models of varying scales and families.

📊 Review Scores

Average score: 6.40

Minimum score: 6

Maximum score: 8

Number of reviewers: 5

Individual scores: 6, 6, 6, 8, 6

📄 openreview 📄 Download PDF

479. DA$^2$: Depth Anything in Any Direction

Authors:

Panoramas have a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose DA$^{2}$: Depth Anything in Any Direction, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data will be released.

📊 Review Scores

Average score: 6.40

Minimum score: 6

Maximum score: 8

Number of reviewers: 5

Individual scores: 6, 6, 6, 6, 8

📄 openreview 📄 Download PDF

480. Quadratic Direct Forecast for Training Multi-Step Time-Series Forecast Models

Authors:

The design of the training objective is central to training time-series forecasting models. Existing training objectives such as mean squared error mostly treat each future step as an independent, equally weighted task, which we find leads to two issues: (1) they overlook the *label autocorrelation effect* among future steps, leading to a biased training objective; (2) they fail to set *heterogeneous task weights* for the forecasting tasks corresponding to different future steps, limiting forecasting performance. To fill this gap, we propose a novel quadratic-form weighted training objective that addresses both issues simultaneously. Specifically, the off-diagonal elements of the weighting matrix account for the label autocorrelation effect, whereas the non-uniform diagonal elements are expected to match the most preferable weights of the forecasting tasks with varying future steps. To achieve this, we propose a Quadratic Direct Forecast (QDF) learning algorithm, which trains the forecast model using the adaptively updated quadratic-form weighting matrix. Experiments show that QDF effectively improves the performance of various forecast models, achieving state-of-the-art results. Code is available at https://anonymous.4open.science/r/QDF-8937.
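A quadratic-form weighted objective of the kind described here can be written as e^T M e, where e is the vector of per-step forecast errors and M is the weighting matrix. The sketch below only evaluates this loss for a fixed, hand-picked M; how the paper adaptively updates M is not reproduced, and the function name and numbers are illustrative assumptions.

```python
def quadratic_form_loss(y_true, y_pred, weights):
    """Return e^T M e for errors e = y_true - y_pred and weighting matrix M."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    horizon = len(errors)
    return sum(
        errors[i] * weights[i][j] * errors[j]
        for i in range(horizon)
        for j in range(horizon)
    )


y_true = [1.0, 0.0, 2.5]
y_pred = [0.0, 2.0, 2.0]  # errors are [1.0, -2.0, 0.5]

# With the identity matrix the loss reduces to the plain sum of squared errors.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sse = quadratic_form_loss(y_true, y_pred, identity)

# Off-diagonal entries couple steps 0 and 1; the last diagonal entry upweights step 2.
coupled = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 2.0]]
weighted = quadratic_form_loss(y_true, y_pred, coupled)
```

The identity case shows why a squared-error objective is the special case with no cross-step coupling and uniform step weights.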

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

481. Improving Extreme Wind Prediction with Frequency-Informed Learning

Authors:

Accurate prediction of extreme wind velocities has substantial significance in industry, particularly for the operation management of wind power plants. Although the state-of-the-art data-driven models perform well for general meteorological forecasting, they may exhibit large errors for extreme weather—for example, systematically underestimating the magnitudes and short-term variation of extreme winds. To address this issue, we conduct a theoretical analysis of how the data frequency spectrum influences errors in extreme wind prediction. Based on these insights, we propose a novel loss function that incorporates a gradient penalty to mitigate the magnitude shrinkage of extreme weather. To capture more precise short-term wind velocity variations, we design a novel structure of physics-embedded machine learning models with frequency reweighting. Experiments demonstrate that, compared to the baseline models, our approach achieves significant improvements in predicting extreme wind velocities while maintaining robust overall performance.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

482. Learning Music Style For Piano Arrangement Through Cross-Modal Bootstrapping

Authors:

What is music style? Though often described using text labels such as "swing," "classical," or "emotional," the real style remains implicit and hidden in concrete music examples. In this paper, we introduce a cross-modal framework that learns implicit music styles from raw audio and applies the styles to symbolic music generation. Inspired by BLIP-2, our model leverages a Querying Transformer (Q-Former) to extract style representations from a large, pre-trained audio language model (LM), and further applies them to condition a symbolic LM for generating piano arrangements. We adopt a two-stage training strategy: contrastive learning to align auditory style with symbolic expression, followed by generative modelling to perform music arrangement. Our model generates piano performances jointly conditioned on a lead sheet (content) and a reference audio example (style), enabling controllable and stylistically faithful arrangement. Experiments demonstrate the effectiveness of our approach in piano cover generation, style transfer, and audio-to-MIDI retrieval, achieving substantial improvements in style-aware alignment and music quality.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

483. Meta-RL Induces Exploration in Language Agents

Authors:

Reinforcement learning (RL) has enabled the training of Large Language Model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LAMER, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LAMER consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from the task feedback signal without gradient updates. Experiments across four different environments demonstrate that LAMER significantly improves performance over RL baselines, with more than 13% gains on Sokoban and more than 20% gains on MineSweeper and Webshop. It also generalizes better than the RL-trained models when evaluated on more challenging or previously unseen environments. Overall, our results demonstrate that meta-reinforcement learning provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

484. FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability–Plasticity Tradeoff

Authors:

Deep neural networks trained on nonstationary data must balance stability (i.e., retaining prior knowledge) and plasticity (i.e., adapting to new tasks). Standard reinitialization methods, which reinitialize weights toward their original values, are widely used but difficult to tune: conservative reinitializations fail to restore plasticity, while aggressive ones erase useful knowledge. We propose FIRE, a principled reinitialization method that explicitly balances the stability–plasticity tradeoff. FIRE quantifies stability through Squared Frobenius Error (SFE), measuring proximity to past weights, and plasticity through Deviation from Isometry (DfI), reflecting weight isotropy. The reinitialization point is obtained by solving a constrained optimization problem, minimizing SFE subject to DfI being zero, which is efficiently approximated by Newton–Schulz iteration. FIRE is evaluated on continual visual learning (CIFAR-10 with ResNet-18), language modeling (OpenWebText with GPT-0.1B), and reinforcement learning (HumanoidBench with SAC and Atari games with DQN). Across all domains, FIRE consistently outperforms both naive training without intervention and standard reinitialization methods, demonstrating effective balancing of the stability–plasticity tradeoff.
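The constrained problem described in this abstract can be illustrated with a minimal sketch. It assumes DfI is measured as $\|W^\top W - I\|_F^2$ (the abstract does not give the formula), in which case the minimizer of the squared Frobenius error under DfI = 0 is the orthogonal polar factor of the old weights. The function names and the toy matrix below are hypothetical and are not taken from the paper.

```python
import numpy as np

def newton_schulz_orthogonalize(w, num_iters=30):
    """Approximate the orthogonal polar factor of w with Newton-Schulz iterations."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the iteration.
    x = w / np.linalg.norm(w)
    for _ in range(num_iters):
        # Each step pushes every singular value s towards 1 via s -> 1.5*s - 0.5*s**3.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def deviation_from_isometry(w):
    """Assumed DfI definition: squared Frobenius distance of w.T @ w from the identity."""
    return np.linalg.norm(w.T @ w - np.eye(w.shape[1])) ** 2

rng = np.random.default_rng(0)
w_old = np.eye(6) + 0.1 * rng.standard_normal((6, 6))  # toy "trained" weights
w_new = newton_schulz_orthogonalize(w_old)

# Reference solution: the polar factor U @ Vt from an exact SVD.
u, _, vt = np.linalg.svd(w_old)
w_ref = u @ vt
```

In this toy setting `w_new` agrees with the SVD-based reference while using only matrix multiplications, which is the usual reason to prefer Newton–Schulz on accelerators.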

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

485. Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss

Authors:

Recent studies have explored autoregressive models for image generation, with promising results, and have combined diffusion models with autoregressive frameworks to optimize image generation via diffusion losses. In this study, we present a theoretical analysis of diffusion models and autoregressive models with diffusion loss, highlighting the latter's advantages. Our comparison of conditional diffusion and autoregressive diffusion with diffusion loss demonstrates that patch denoising optimization in autoregressive models effectively mitigates condition errors and leads to a stable condition distribution. Our analysis also reveals that autoregressive condition generation refines the condition, causing the influence of the condition error to decay exponentially. In addition, we introduce a novel condition refinement approach based on Optimal Transport (OT) theory to address "condition inconsistency". We theoretically demonstrate that formulating condition refinement as a Wasserstein Gradient Flow ensures convergence toward the ideal condition distribution, effectively mitigating condition inconsistency. Experiments demonstrate the superiority of our method over diffusion models and autoregressive models with diffusion loss.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

486. CARD: Towards Conditional Design of Multi-agent Topological Structures

Authors:

Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: https://anonymous.4open.science/r/agentgraph-FF9A.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

487. Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems

Authors:

As foundation models grow increasingly intelligent, reliable and trustworthy safety evaluation becomes more indispensable than ever. However, an important question arises: whether and how would an advanced AI system perceive that it is being evaluated, and thereby compromise the integrity of the evaluation process? During standard safety tests on a mainstream large reasoning model, we unexpectedly observe that the model, without any contextual cues, occasionally recognizes that it is being evaluated and hence behaves in a more safety-aligned manner. This motivates us to conduct a systematic study of the phenomenon of evaluation faking, i.e., an AI system autonomously alters its behavior upon recognizing the presence of an evaluation context and thereby influences the evaluation results. Through extensive experiments on a diverse set of foundation models with mainstream safety benchmarks, we reach the main finding, termed the observer effects for AI: AI systems with stronger reasoning and situational awareness exhibit evaluation faking more frequently, which is reflected in the following aspects: 1) A reasoning model (specifically the DeepSeek series in our work) recognizes it is being evaluated in 32.6% more cases than a non-reasoning model. 2) As the foundation model scales from 32B to 671B, the rate of evaluation faking behaviors increases by over 30% in some cases. Conversely, models below 32B exhibit almost no evaluation faking behaviors. 3) With a basic memory module, the AI system is 2.55× more likely to recognize the evaluation process and achieves a 28.2% higher safety score compared with the no-memory case. Furthermore, we show a strong causal link between evaluation recognition and safety performance, with QwQ-32B's safety rate improving dramatically from 9% to 98% through intervention on the reasoning trace. To facilitate the above measurement and analysis, we devise a chain-of-thought monitoring technique to detect the faking intent in the reasoning process and further uncover internal signals that are strongly correlated with the model's evaluation faking behaviors, offering insights for future mitigation studies.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

488. Aligning at the Source: Steering Corrective to the Origins of Harmfulness in LLMs

Authors:

Persistent vulnerabilities in safety alignment hinder the deployment of large language models (LLMs). Existing methods remain susceptible to jailbreak attacks, suggesting a fundamental, unaddressed flaw in current safety paradigms. In this work, we diagnose the root cause of this fragility, identifying a systemic issue we term Depth-wise Alignment Discrepancy. We find a fundamental misalignment: harmful vectors (latent representations predisposed to unsafe content) predominantly originate in the model's lower layers, yet conventional alignment training concentrates its corrective gradients disproportionately on the top-most layers. This creates a brittle, "end-of-pipe" defense that is easily bypassed. To address this discrepancy, we propose SAGA, a framework that achieves robust safety through two synergistic innovations. First, it leverages high-entropy Chain-of-Thought (CoT) augmented data to provide the deep semantic signals necessary to reach the source of harmfulness. Second, it introduces a novel Synergistic Gradient Scaling (SGS) mechanism to explicitly reshape the gradient flow, ensuring these corrective signals are precisely applied to the identified vulnerable layers. Extensive experiments on five LLMs against six distinct jailbreak attacks demonstrate SAGA's superiority, reducing attack success rates (ASR) by 21%–63% compared to state-of-the-art baselines. Our method preserves downstream task accuracy while introducing minimal computational overhead (<3%).

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

489. Pallatom-Ligand: an All-Atom Diffusion Model for Designing Ligand-Binding Proteins

Authors:

Small-molecule ligands extend protein functionality beyond natural amino acids, enabling sophisticated processes like catalysis, signal transduction, and light harvesting. However, designing proteins with high affinity and selectivity for arbitrary ligands remains a major challenge. We present Pallatom-Ligand, a diffusion model that performs end-to-end generation of ligand-binding proteins at atomic resolution. By directly learning the joint distribution of all atoms in the protein–ligand complexes, Pallatom-Ligand delivers state-of-the-art performance, achieving the highest *in silico* success rates in a comprehensive benchmark. In addition, Pallatom-Ligand's novel conditioning framework enables programmable control over global protein fold and atomic-level ligand solvent accessibility. With these capabilities, Pallatom-Ligand opens new opportunities for exploring the protein function space, advancing both generative modeling and computational protein engineering.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

490. RoRE: Rotary Ray Embedding for Generalised Multi-Modal Scene Understanding

Authors:

Transformers have emerged as powerful implicit rendering models, capable of performing geometric reasoning and producing photorealistic novel views in a single feedforward pass. A central challenge in these architectures is how to inject camera parameters into the transformer in a way that generalises across diverse sensing conditions. In this work, we present Rotary Ray Embedding (RoRE), an approach that embeds image patches directly as rays, using a learning-based rotary positional embedding (RoPE). This ray-based formulation provides a unified and general representation, improving robustness to unconventional camera geometries and sensing modalities. We evaluate our approach on conventional perspective imagery, fisheye cameras, and multi-modal RGB-thermal setups, showing that a single network can flexibly integrate arbitrary numbers of cameras and modalities into a coherent scene representation. Experiments demonstrate improved generalisation and cross-modal consistency compared to existing methods, highlighting the potential for relative ray-based embeddings to build adaptable, plug-and-play vision systems.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

491. PerFit: Exploring Personalization Shifts in Representation Space of LLMs

Authors:

Personalization has become a pivotal field of study in contemporary intelligent systems. While large language models (LLMs) excel at general knowledge tasks, they often struggle with personalization, i.e., adapting their outputs to individual user expectations. Existing approaches that steer LLM behavior to meet users' implicit preferences and behavior patterns rely primarily on tune-free methods (e.g., RAG, PAG) or parameter fine-tuning methods (e.g., LoRA), and they struggle to balance effectiveness and efficiency. Moreover, the mechanisms underlying personalized preferences remain underexplored. To address these challenges, we first uncover key patterns of user-specific information embedded in the representation space. Specifically, we find that (1) personalized information lies within a low-rank subspace represented by vectors, and (2) these vectors demonstrate both a collective shift shared across users and a personalized shift unique to each individual user. Building on these insights, we introduce PerFit, a novel two-stage solution that directly fine-tunes interventions in the hidden representation space by addressing both collective and user-specific shifts, thereby achieving precise steering of LLMs with minimal parameter overhead. Experimental results demonstrate that PerFit delivers strong performance across six datasets while cutting the number of parameters by an average of 92.3% compared to the state-of-the-art method.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

492. Micro-Macro Coupled Koopman Modeling on Graph for Traffic Flow Prediction

Authors:

Traffic systems are inherently multi-scale: microscopic vehicle interactions and macroscopic flow co-evolve nonlinearly. Microscopic models capture local interactions but miss flow evolution; macroscopic models enforce aggregated consistency yet overlook stochastic vehicle-level dynamics. We propose Micro–Macro Coupled Koopman Modeling (MMCKM), which lifts the coupled dynamics to a high-dimensional linear observation space for a unified linear-operator representation. Unlike grid-based discretizations, MMCKM adopts a vehicle-centric dynamic graph that preserves microscopic perturbations while respecting macroscopic conservation laws by discretizing PDEs onto this graph. At the micro scale, scenario-adaptive Koopman evolvers selected by an Intent Discriminator are designed to model vehicle dynamics. A Koopman control module explicitly formulates how the flow state influences individual vehicles, yielding bidirectional couplings. To our knowledge, this is the first work to jointly model vehicle trajectories and traffic flow density using a unified Koopman framework without requiring historical trajectories. The proposed MMCKM is validated for trajectory prediction on NGSIM and HighD. Although MMCKM uses only real-time measurements, it achieves comparable or even higher accuracy than history-dependent baselines. We further analyze the effect of the operator interval and provide ablations to show the improvement from intent inference, macro-to-micro control, and diffusion. Code and implementation details are included to facilitate reproducibility.
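The general idea of lifting nonlinear dynamics into a space where they evolve linearly can be sketched on a toy system. This is a generic least-squares Koopman fit, not the MMCKM architecture; the dynamics, the observables, and all names below are hypothetical choices made so that the lifted system is exactly linear.

```python
import numpy as np

def step(states, a=0.9, b=0.5, c=0.3):
    """One step of a toy nonlinear system: x1' = a*x1, x2' = b*x2 + c*x1**2."""
    x1, x2 = states[:, 0], states[:, 1]
    return np.stack([a * x1, b * x2 + c * x1 ** 2], axis=1)

def lift(states):
    """Map each state (x1, x2) to the observables (x1, x2, x1**2)."""
    x1, x2 = states[:, 0], states[:, 1]
    return np.stack([x1, x2, x1 ** 2], axis=1)

def fit_koopman(states, next_states):
    """Least-squares fit of K such that lift(next_states) ~= lift(states) @ K.T."""
    k_transposed, *_ = np.linalg.lstsq(lift(states), lift(next_states), rcond=None)
    return k_transposed.T

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(200, 2))
koopman = fit_koopman(states, step(states))

# For this toy system the exact operator on the chosen observables is known.
koopman_exact = np.array([[0.9, 0.0, 0.0],
                          [0.0, 0.5, 0.3],
                          [0.0, 0.0, 0.81]])
```

Because the observable `x1**2` closes the system, the fitted matrix matches `koopman_exact`; for real traffic data the observables would have to be learned and the fit would only be approximate.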

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

493. Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Authors:

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player’s future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

494. How Dark Patterns Manipulate Web Agents

Authors:

Deceptive UI designs, commonly known as dark patterns, manipulate users into performing actions misaligned with their goals. In this paper, we show that dark patterns are highly effective in altering web agent behavior, posing a significant risk given the wide applications of web agents. To quantify this risk, we introduce DECEPTICON, an environment for testing individual dark patterns in isolation. DECEPTICON includes 850 web navigation tasks with dark patterns (600 generated tasks and 250 real-world tasks), designed to evaluate both task success and dark pattern effectiveness. Testing frontier large language models and state-of-the-art agent scaffolds, we find dark patterns succeed in 70% of tested generated and real-world tasks. Moreover, the effectiveness correlates positively with model size and test-time reasoning, making larger, more capable models more susceptible. Leading defense methods, including in-context prompting and multi-agent verification, fail to consistently reduce dark pattern success. Our findings reveal dark patterns as a latent, unmitigated risk to web agents, highlighting the urgent need for robust defenses against manipulative designs.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

495. Soft Instruction De-escalation Defense

Authors:

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control), a simple yet effective multi-stage sanitization pipeline designed for tool-augmented LLM agents. Our approach begins by unconditionally rewriting incoming data to neutralize any potential instructions by masking, rephrasing, or removing them. To detect attacks against the rewriter itself, we inject known canary instructions before this process; if these instructions survive, we conclude the rewrite was compromised. To account for the imprecision of LLMs, we apply multiple independent rewrite passes. Finally, a detection module inspects the full text and smaller chunks of the output for any residual instruction-like content. If imperative instructions remain, the agent halts to ensure security. This defense-in-depth strategy, combining unconditional rewriting, canary checking, and chunk-based detection, makes successful attacks significantly more difficult than bypassing a single detection model.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

496. RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Authors:

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

497. Multiple Token Divergence: Measuring and Steering In-Context Computation Density

Authors:

Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model's full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty. Lower MTD is associated with more accurate reasoning. MTD provides a practical, lightweight tool for analyzing and steering the computational dynamics of language models.
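The definition in this abstract, a KL divergence between the full model's output distribution and that of a shallow auxiliary head, can be sketched per token as follows. The direction of the divergence (full model first) and the function names are assumptions, since the abstract does not specify them; the logits are made-up values for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_kl(full_logits, shallow_logits):
    """Per-token KL(p_full || p_shallow); direction assumed, not taken from the paper."""
    log_p = log_softmax(np.asarray(full_logits, dtype=float))
    log_q = log_softmax(np.asarray(shallow_logits, dtype=float))
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Two token positions over a hypothetical 3-word vocabulary.
full_logits = np.array([[2.0, 0.0, 0.0], [0.5, 0.5, 0.5]])
shallow_logits = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]])
divergence = token_kl(full_logits, shallow_logits)
```

In this example the divergence is positive at the first position, where the two heads disagree, and zero at the second, where they produce the same distribution.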

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

498. Cost-Optimal Active AI Model Evaluation

Authors:

The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, a desire for rapid iteration often makes it necessary to rely on synthetic annotation data because of its low cost, despite the potential for substantial bias. In this paper, we develop a rigorous theoretical framework for novel, cost-aware evaluation pipelines that actively balance the use of a cheap, but often inaccurate, weak rater (such as a model-based autorater that is designed to automatically assess the quality of generated content) with a more expensive, but also more accurate, strong rater such as a human annotator. Building on recent work in active and prediction-powered statistical inference, we theoretically derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Next, using synthetic and real-world data, we empirically characterize conditions under which these types of policies can yield significant improvements over classical methods. Finally, we find that practical approximations of the theoretically optimal policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods, especially in tasks where there is high variability in the difficulty of examples.
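The interplay between a cheap weak rater and an expensive strong rater can be illustrated with the basic prediction-powered mean estimator that this line of work builds on. This sketch is not the paper's cost-optimal allocation policy; the function name, the scores, and the choice of labeled subset are all hypothetical.

```python
import numpy as np

def prediction_powered_mean(weak_all, weak_labeled, strong_labeled):
    """Weak-rater mean over the full pool, debiased with a small strong-rater sample."""
    weak_all = np.asarray(weak_all, dtype=float)
    weak_labeled = np.asarray(weak_labeled, dtype=float)
    strong_labeled = np.asarray(strong_labeled, dtype=float)
    # The correction estimates the weak rater's average bias on the labeled subset.
    correction = np.mean(strong_labeled - weak_labeled)
    return float(weak_all.mean() + correction)

# Toy data: the weak rater is the (normally unobserved) true score plus a constant bias.
true_scores = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
weak_scores = true_scores + 0.2
labeled = [0, 3]  # only these examples are sent to the strong rater

estimate = prediction_powered_mean(
    weak_scores, weak_scores[labeled], true_scores[labeled]
)
naive = float(weak_scores.mean())
```

Here the bias is constant, so two strong labels are enough to recover the true mean exactly, while the naive weak-rater average stays off by 0.2. With example-dependent bias the correction is only an estimate, which is where the choice of which examples to send to the strong rater starts to matter.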

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

499. $p$-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding

Authors:

Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

500. Trapped by simplicity: When Transformers fail to learn from noisy features

Authors:

Noise is ubiquitous in data used to train large language models, but it is not well understood whether these models are able to correctly generalize to inputs generated without noise. Here, we study noise-robust learning: are transformers trained on data with noisy features able to find a target function that correctly predicts labels for noiseless features? We show that transformers succeed at noise-robust learning for a selection of $k$-sparse parity and majority functions, compared to LSTMs which fail at this task for even modest feature noise. However, we find that transformers typically fail at noise-robust learning of random $k$-juntas, especially when the boolean sensitivity of the optimal solution is smaller than that of the target function. We argue that this failure is due to a combination of two factors: transformers' bias toward simpler functions, combined with an observation that the empirically optimal function for noise-robust learning has lower sensitivity than the target function. We test this hypothesis by exploiting transformers' simplicity bias to trap them in an incorrect solution, but show that transformers can escape this trap by training with an additional loss term penalizing high-sensitivity solutions. Overall, we find that transformers are particularly ineffective for learning boolean functions in the presence of feature noise.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

501. Hilbert-Guided Sparse Local Attention

Authors:

The quadratic compute and memory costs of global self-attention severely limit its use in high-resolution images. Local attention reduces complexity by restricting attention to neighborhoods. Block-sparse kernels can further improve the efficiency of local attention, but conventional local attention patterns often fail to deliver significant speedups because tokens within a window are not contiguous in the 1D sequence. This work proposes a novel method for constructing windows and neighborhoods based on the Hilbert curve. Image tokens are first reordered along a Hilbert curve, and windows and neighborhoods are then formed on the reordered 1D sequence. From a block-sparse perspective, this strategy significantly increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention. Experiments show that the proposed Hilbert Window Attention and Hilbert Slide Attention can accelerate window attention and slide attention by about $4\times$ and $18\times$, respectively. To assess practicality, the strategy is instantiated as the Hilbert Window Transformer and the Hilbert Neighborhood Transformer, both of which achieve end-to-end speedups with minimal accuracy loss. Overall, combining Hilbert-guided local attention with block-sparse kernels offers a general and practical approach to enhancing the efficiency of 2D local attention for images.
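The reordering step described in this abstract can be sketched with the standard coordinate-to-index Hilbert curve algorithm. This only shows how contiguous chunks of the reordered 1D sequence become compact 2D windows; the grid size, window size, and function names are illustrative assumptions, and no attention kernel is implemented.

```python
import numpy as np

def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(n):
    """Permutation that reorders row-major tokens of an n x n grid along the curve."""
    keys = [hilbert_index(n, col, row) for row in range(n) for col in range(n)]
    return np.argsort(keys)

n, window = 8, 16          # 8 x 8 token grid, windows of 16 tokens
order = hilbert_order(n)   # order[k] is the row-major index of the k-th token on the curve
rows, cols = order // n, order % n
windows = order.reshape(-1, window)  # each row is one contiguous 1D window
```

With these sizes every window of 16 consecutive tokens covers a 4 x 4 patch of the image, so a window is both spatially local and contiguous in memory, which is the property block-sparse kernels can exploit.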

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

502. AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution

Authors:

Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches that implicitly or explicitly incorporate an assistant distribution have recently been proposed. However, prior proposals of assistant distributions have been fragmented, lacking a systematic investigation of the interpolation path and the divergence. This paper proposes the $\alpha$-mixture assistant distribution, a novel generalized family of assistant distributions, and $\alpha$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $\alpha$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $\alpha$, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

503. Modeling the Density of Pixel-level Self-supervised Embeddings for Unsupervised Pathology Segmentation in Medical CT

Authors:

Accurate detection of all pathological findings in 3D medical images remains a significant challenge, as supervised models are limited to detecting only the few pathology classes annotated in existing datasets. To address this, we frame pathology detection as an unsupervised visual anomaly segmentation (UVAS) problem, leveraging the inherent rarity of pathological patterns compared to healthy ones. We enhance the existing density-based UVAS framework with two key innovations: (1) dense self-supervised learning for feature extraction, eliminating the need for supervised pretraining, and (2) learned, masking-invariant dense features as conditioning variables, replacing hand-crafted positional encodings. Trained on over 30,000 unlabeled 3D CT volumes, our fully self-supervised model, Screener, outperforms existing UVAS methods on four large-scale test datasets comprising 1,820 scans with diverse pathologies. Furthermore, in a supervised fine-tuning setting, Screener surpasses existing self-supervised pretraining methods, establishing it as a state-of-the-art foundation for pathology segmentation. The code and pretrained models will be made publicly available.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

504. Contrastive Predictive Coding Done Right for Mutual Information Estimation

Authors:

The InfoNCE objective, originally introduced for contrastive representation learning, has become a popular choice for mutual information (MI) estimation, despite its indirect connection to MI. In this paper, we demonstrate why InfoNCE should not be regarded as a valid MI estimator, and we introduce a simple modification, which we refer to as InfoNCE-anchor, for accurate MI estimation. Our modification introduces an auxiliary anchor class, enabling consistent density ratio estimation and yielding a plug-in MI estimator with significantly reduced bias. Beyond this, we generalize our framework using proper scoring rules, which recover InfoNCE-anchor as a special case when the log score is employed. This formulation unifies a broad spectrum of contrastive objectives, including NCE, InfoNCE, and $f$-divergence variants, under a single principled framework. Empirically, we find that InfoNCE-anchor with the log score achieves the most accurate MI estimates; however, in self-supervised representation learning experiments, we find that the anchor does not improve the downstream task performance. These findings corroborate that contrastive representation learning benefits not from accurate MI estimation per se, but from the learning of structured density ratios.
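One well-known limitation of plain InfoNCE as an MI estimate is that its value can never exceed $\log K$ for a batch of $K$ pairs, however dependent the variables are. The sketch below demonstrates only that cap; it does not implement InfoNCE-anchor, the paper's own argument may rest on different grounds, and the score matrices are made-up inputs.

```python
import numpy as np

def infonce_estimate(scores):
    """InfoNCE value for a K x K critic score matrix.

    scores[i, j] is the critic value for the pair (x_i, y_j); the diagonal
    entries are the positive pairs and the off-diagonal entries are negatives.
    """
    k = scores.shape[0]
    # Stable log-sum-exp over each row.
    row_max = scores.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    return float(np.mean(np.diag(scores) - log_norm) + np.log(k))

k = 8
confident = infonce_estimate(50.0 * np.eye(k))        # critic is certain about every positive
uninformative = infonce_estimate(np.zeros((k, k)))    # critic cannot tell pairs apart
```

Even with a critic that separates positives perfectly, `confident` saturates at `np.log(8)`, so the true MI cannot be read off once it exceeds that ceiling.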

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

505. PYRREGULAR: A Unified Framework for Irregular Time Series, with Classification Benchmarks

Authors:

Irregular temporal data, characterized by varying recording frequencies, differing observation durations, and missing values, presents significant challenges across fields like mobility, healthcare, and environmental science. Existing research communities often overlook or address these challenges in isolation, leading to fragmented tools and methods. To bridge this gap, we introduce a unified framework, and the first standardized dataset repository for irregular time series classification, built on a common array format to enhance interoperability. This repository comprises 34 datasets on which we benchmark 12 classifier models from diverse domains and communities. This work aims to centralize research efforts and enable a more robust evaluation of irregular temporal data analysis methods.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

506. Search Arena: Analyzing Search-Augmented LLMs

Authors:

Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations and types of cited sources, even when the cited content does not directly support the associated claims, uncovering a gap between perceived and actual credibility. To assess cross-setting performance, we conduct cross-arena analyses by testing search-augmented LLMs in a general purpose chat environment and conventional LLMs in search-heavy settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

507. A New Approach to Controlling Linear Dynamical Systems

Authors:

We propose a new method for controlling linear dynamical systems under adversarial disturbances and cost functions. Our algorithm achieves a running time that scales polylogarithmically with the inverse of the stability margin, improving upon prior methods with polynomial dependence while maintaining the same regret guarantees. The technique, which may be of independent interest, is based on a novel convex relaxation that approximates linear control policies using spectral filters constructed from the eigenvectors of a specific Hankel matrix.
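The abstract does not say which Hankel matrix is used. As a rough illustration of how a spectral filter can be extracted from one, the sketch below assumes the matrix with entries 2 / ((i + j)^3 - (i + j)) known from earlier spectral filtering work on linear dynamical systems, and recovers only its leading eigenvector by power iteration. Both the matrix choice and the helper names are assumptions, not the paper's construction.

```python
import math

def hankel_matrix(n):
    """n x n Hankel matrix with entries 2 / ((i + j)^3 - (i + j)), 1-indexed.

    ASSUMPTION: borrowed from earlier spectral filtering work; the
    abstract only refers to "a specific Hankel matrix".
    """
    return [[2.0 / ((i + j) ** 3 - (i + j)) for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def matvec(a, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def top_filter(a, iters=500):
    """Power iteration: leading eigenvalue and unit eigenvector."""
    n = len(a)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = matvec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, matvec(a, v)))
    return eigenvalue, v

Z = hankel_matrix(16)
lam, filt = top_filter(Z)
# Residual ||Z v - lam v|| should be close to zero for an eigenpair.
residual = math.sqrt(sum((zv - lam * x) ** 2 for zv, x in zip(matvec(Z, filt), filt)))
print(lam, residual)
```

In this line of work the filters are applied to the history of past disturbances or inputs; obtaining further eigenvectors would need deflation or a full symmetric eigensolver.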

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

508. Convergent Differential Privacy Analysis for General Federated Learning

Authors:

The combination of federated learning (FL) and differential privacy (DP) provides a promising paradigm for training across large-scale private clients. However, existing analyses of FL-DP mostly rely on the composition theorem and cannot tightly quantify privacy leakage: the resulting bound is tight for a few communication rounds but eventually becomes arbitrarily loose and divergent. This also implies a counterintuitive conclusion, namely that FL-DP may not provide adequate privacy protection during long-term training under constant-level noisy perturbations, creating a discrepancy between theoretical and experimental results. To further investigate the convergent privacy and reliability of the FL-DP framework, in this paper we comprehensively evaluate the worst-case privacy of two classical methods under non-convex and smooth objectives based on $f$-DP analysis. With the aid of the shifted interpolation technique, we prove that privacy in Noisy-FedAvg has a tight convergent bound. Moreover, with the regularization of the proxy term, privacy in Noisy-FedProx has a stable constant lower bound. Our analysis thus provides a solid theoretical foundation for the reliability of privacy in FL-DP. Meanwhile, our conclusions can also be losslessly converted to other classical DP analytical frameworks, e.g., $(\epsilon,\delta)$-DP and Rényi-DP (RDP), to provide a more fine-grained understanding of FL-DP frameworks.
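The divergence of composition-based bounds can be illustrated with Gaussian DP, the best-known special case of $f$-DP. Composing T mechanisms that are each μ-GDP yields a √T·μ-GDP guarantee, so the bound grows without limit as rounds accumulate. The sketch below shows only this naive composition baseline, not the paper's convergent analysis; the per-round μ and the value of ε are arbitrary illustrative choices.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def composed_mu(mu_per_round, rounds):
    """Composition of `rounds` mu-GDP mechanisms is sqrt(rounds) * mu-GDP."""
    return mu_per_round * math.sqrt(rounds)

def delta_from_mu(mu, eps):
    """delta(eps) implied by a mu-GDP guarantee (standard GDP conversion)."""
    return (normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0))

rounds_list = [1, 10, 100, 1000]
deltas = [delta_from_mu(composed_mu(0.1, t), eps=1.0) for t in rounds_list]
for t, d in zip(rounds_list, deltas):
    # The guarantee degrades steadily as communication rounds accumulate.
    print(t, composed_mu(0.1, t), d)
```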

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

509. Neural-HSS: Hierarchical Semi-Separable Neural PDE Solver

Authors:

Deep learning-based methods have shown remarkable effectiveness in solving PDEs, largely due to their ability to enable fast simulations once trained. However, despite the availability of high-performance computing infrastructure, many critical applications remain constrained by the substantial computational costs associated with generating large-scale, high-quality datasets and training models. In this work, inspired by studies on the structure of Green's functions for elliptic PDEs, we introduce Neural-HSS, a parameter-efficient architecture built upon the Hierarchical Semi-Separable (HSS) matrix structure that is provably data-efficient for a broad class of PDEs. We theoretically analyze the proposed architecture, proving that it satisfies exactness properties even in very low-data regimes. We also investigate its connections with other architectural primitives, such as the Fourier neural operator layer and convolutional layers. We experimentally validate the data efficiency of Neural-HSS on the three-dimensional Poisson equation over a grid of two million points, demonstrating its superior ability to learn from data generated by elliptic PDEs in the low-data regime while outperforming baseline methods. Finally, we demonstrate its capability to learn from data arising from a broad class of PDEs in diverse domains, including electromagnetism, fluid dynamics, and biology.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

510. Training Large Language Models To Reason In Parallel With Global Forking Tokens

Authors:

Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using bipartite matching between global forking tokens and unique reasoning traces. We observe that, whereas naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show our SSFT method consistently outperforms SFT under both pass@1 and cons@k metrics.
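The bipartite matching step can be pictured as an assignment problem: given the loss of every reasoning trace under every forking token, pick the one-to-one pairing with the lowest total loss and train on that pairing. The sketch below is a brute-force illustration for small sets; the cost values and the function name are hypothetical, and the paper's actual loss and matching procedure may differ.

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one matching by exhaustive search.

    cost[i][j] is a hypothetical loss of reasoning trace j when it is
    conditioned on forking token i. Returns (assignment, total), where
    assignment[i] is the trace matched to token i.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total

# Token 0 fits trace 1 best and token 1 fits trace 0 best.
assignment, total = best_assignment([[5.0, 1.0],
                                     [1.0, 5.0]])
print(assignment, total)
```

Exhaustive search costs O(n!) and is only sensible for a handful of traces; the Hungarian algorithm solves the same problem in polynomial time.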

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

511. Weierstrass Positional Encoding for Vision Transformers

Authors:

Vision Transformers (ViTs) have demonstrated remarkable success in computer vision tasks. However, their reliance on learnable one-dimensional positional encoding disrupts the inherent two-dimensional spatial structure of images due to patch flattening. Existing positional encoding approaches lack geometric constraints and fail to preserve a monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model's capacity to leverage spatial proximity priors effectively. Recognizing that periodicity is particularly beneficial for positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically principled approach that encodes two-dimensional coordinates in the complex domain. This method maps the normalized two-dimensional patch coordinates onto the complex plane and constructs a compact four-dimensional positional feature based on the Weierstrass elliptic function $\wp(z)$ and its derivative. The doubly periodic property of $\wp(z)$ enables a principled encoding of 2D positional information, while its intrinsic lattice structure aligns naturally with the geometric regularities of patch grids in images. Its nonlinear geometric characteristics enable faithful modeling of spatial distance relationships, while the associated algebraic addition formula allows relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is a plug-and-play, resolution-agnostic positional module that integrates seamlessly with existing ViTs. Extensive experiments demonstrate that WePE delivers consistent performance gains in most scenarios, while its implementation with precomputed lookup tables ensures that these improvements incur no noticeable computational or memory overhead. In addition, several analyses and ablation studies further confirm the effectiveness of our method.
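To make the four-dimensional feature concrete, the sketch below evaluates $\wp(z)$ and $\wp'(z)$ with a truncated lattice sum and stacks their real and imaginary parts. The square lattice with periods 1 and i, the truncation size, the half-cell offset that keeps patch centres away from the poles, and the absence of any normalization are all assumptions made for illustration. The paper's actual parameterization is not given in the abstract.

```python
def weierstrass_p_and_derivative(z, omega1=1.0, omega2=1j, n_terms=20):
    """Truncated lattice-sum approximation of wp(z) and wp'(z).

    wp(z)  = 1/z^2 + sum_{w != 0} [1/(z - w)^2 - 1/w^2]
    wp'(z) = -2 * sum_{w} 1/(z - w)^3
    with w = m*omega1 + n*omega2 and |m|, |n| <= n_terms.
    """
    p = 1.0 / z ** 2
    dp = -2.0 / z ** 3
    for m in range(-n_terms, n_terms + 1):
        for n in range(-n_terms, n_terms + 1):
            if m == 0 and n == 0:
                continue
            w = m * omega1 + n * omega2
            p += 1.0 / (z - w) ** 2 - 1.0 / w ** 2
            dp += -2.0 / (z - w) ** 3
    return p, dp

def wepe_feature(row, col, grid_h, grid_w):
    """Hypothetical 4-D positional feature for the patch at (row, col)."""
    # Patch centres lie strictly inside the unit cell, away from lattice poles.
    z = complex((col + 0.5) / grid_w, (row + 0.5) / grid_h)
    p, dp = weierstrass_p_and_derivative(z)
    return (p.real, p.imag, dp.real, dp.imag)

print(wepe_feature(0, 0, 14, 14))
```

Because the values depend only on the patch grid, they can be computed once and cached, which matches the lookup-table implementation mentioned in the abstract.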

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

512. Theoretical Analysis of Contrastive Learning under Imbalanced Data: From Training Dynamics to a Pruning Solution

Authors:

Contrastive learning has emerged as a powerful framework for learning generalizable representations, yet its theoretical understanding remains limited, particularly under imbalanced data distributions that are prevalent in real-world applications. Such an imbalance can degrade representation quality and induce biased model behavior, yet a rigorous characterization of these effects is lacking. In this work, we develop a theoretical framework to analyze the training dynamics of contrastive learning with Transformer-based encoders under imbalanced data. Our results reveal that neuron weights evolve through three distinct stages of training, with different dynamics for majority features, minority features, and noise. We further show that minority features reduce representational capacity, increase the need for more complex architectures, and hinder the separation of ground-truth features from noise. Inspired by these neuron-level behaviors, we show that pruning restores performance degraded by imbalance and enhances feature separation, offering both conceptual insights and practical guidance. Major theoretical findings are validated through numerical experiments.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

513. Learning Retrieval Models with Sparse Autoencoders

Authors:

Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned Sparse Retrieval (LSR), whose objective is to encode queries and documents into high-dimensional sparse representations optimized for efficient retrieval. In contrast to existing LSR approaches that project input sequences into the vocabulary space, SAE-based representations offer the potential to produce more semantically structured, expressive, and language-agnostic features. By leveraging recently released open-source SAEs, we show that their latent features can serve as effective indexing units for representing documents and queries for sparse retrieval. Our experiments demonstrate that SAE-based LSR models consistently outperform their vocabulary-based counterparts in multilingual and out-of-domain settings. Finally, we introduce SPLARE, a 7B-parameter multilingual retrieval model capable of producing generalizable sparse latent embeddings for a wide range of languages and domains, achieving top results on MMTEB’s multilingual and English retrieval tasks.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

514. The Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics

Authors:

Classical Federated Learning relies on a multi-round iterative process of model exchange and aggregation between server and clients, with high communication costs and privacy risks from repeated model transmissions. In contrast, one-shot federated learning (OFL) alleviates these limitations by reducing communication to a single round, thereby lowering overhead and enhancing practical deployability. Nevertheless, most existing one-shot approaches remain either impractical or constrained: for example, they often depend on the availability of a public dataset, assume homogeneous client models, or require uploading additional data or model information. To overcome these issues, we introduce the Gaussian-Head OFL (GH-OFL) family, a suite of one-shot federated methods that assume class-conditional Gaussianity of pretrained embeddings. Clients transmit only sufficient statistics (per-class counts and first/second-order moments) and the server builds heads via three components: (i) Closed-form Gaussian heads (NB/LDA/QDA) computed directly from the received statistics; (ii) FisherMix, a linear head with cosine margin trained on synthetic samples drawn in an estimated Fisher subspace; and (iii) Proto-Hyper, a lightweight low-rank residual head that refines Gaussian logits via knowledge distillation on those synthetic samples. In our experiments, GH-OFL methods deliver state-of-the-art robustness and accuracy under strong non-IID skew while remaining strictly data-free.
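The closed-form heads rest on the fact that per-class counts, sums, and sums of squares simply add across clients, so the server can recover pooled class means and variances without seeing any sample. The sketch below illustrates this for a diagonal-covariance (naive Bayes) Gaussian head only; the function names, the variance floor `eps`, and the toy data are hypothetical, and it assumes every class appears on at least one client.

```python
import math

def client_stats(features, labels, num_classes):
    """Per-class counts, sums, and sums of squares computed on one client."""
    dim = len(features[0])
    counts = [0] * num_classes
    sums = [[0.0] * dim for _ in range(num_classes)]
    sq_sums = [[0.0] * dim for _ in range(num_classes)]
    for x, y in zip(features, labels):
        counts[y] += 1
        for d, v in enumerate(x):
            sums[y][d] += v
            sq_sums[y][d] += v * v
    return counts, sums, sq_sums

def merge_stats(all_stats):
    """Server side: sufficient statistics add element-wise across clients."""
    counts, sums, sq_sums = all_stats[0]
    counts = list(counts)
    sums = [list(r) for r in sums]
    sq_sums = [list(r) for r in sq_sums]
    for c2, s2, q2 in all_stats[1:]:
        for c in range(len(counts)):
            counts[c] += c2[c]
            for d in range(len(sums[c])):
                sums[c][d] += s2[c][d]
                sq_sums[c][d] += q2[c][d]
    return counts, sums, sq_sums

def gaussian_nb_head(counts, sums, sq_sums, eps=1e-6):
    """Closed-form class priors, means, and diagonal variances."""
    total = sum(counts)
    head = []
    for n, s, q in zip(counts, sums, sq_sums):
        mean = [v / n for v in s]
        var = [max(qv / n - m * m, 0.0) + eps for qv, m in zip(q, mean)]
        head.append((math.log(n / total), mean, var))
    return head

def predict(head, x):
    """Class with the highest Gaussian log-posterior (up to a constant)."""
    def score(params):
        log_prior, mean, var = params
        return log_prior - 0.5 * sum(
            math.log(2 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mean, var))
    return max(range(len(head)), key=lambda c: score(head[c]))

# Strongly non-IID toy split: each client holds a single class.
stats_a = client_stats([[0.0, 0.1], [0.2, -0.1]], [0, 0], num_classes=2)
stats_b = client_stats([[3.0, 3.1], [2.8, 2.9]], [1, 1], num_classes=2)
head = gaussian_nb_head(*merge_stats([stats_a, stats_b]))
print(predict(head, [0.1, 0.0]), predict(head, [3.0, 3.0]))
```

LDA and QDA heads follow the same pattern but need full second-moment matrices rather than per-dimension sums of squares.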

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

515. Learning From the Past with Cascading Eligibility Traces

Authors:

Animals often receive information about errors and rewards after significant delays. In some cases these delays are fixed aspects of neural processing or sensory feedback; for example, there is typically a delay of tens to hundreds of milliseconds between motor actions and visual feedback. The standard approach to handling delays in models of synaptic plasticity is to use eligibility traces. However, standard eligibility traces that decay exponentially mix together any events that happen during the delay, presenting a problem for any credit assignment signal that occurs with a significant delay. Here, we show that eligibility traces formed by a state-space model, inspired by a cascade of biochemical reactions, can provide a temporally precise memory for handling credit assignment at arbitrary delays. We demonstrate that these cascading eligibility traces (CETs) work for credit assignment at behavioral time-scales, ranging from seconds to minutes. CETs can also handle extremely slow retrograde signals, such as those found in retrograde axonal signaling. These results demonstrate that CETs can provide an excellent basis for modeling synaptic plasticity.
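The difference between an exponential trace and a cascade can be seen in a small simulation. The sketch below assumes a linear chain of first-order stages sharing one rate constant, integrated with forward Euler; the paper's actual state-space model may differ. With a single stage the trace peaks immediately and then decays, whereas the last stage of an n-stage chain peaks near time (n - 1) / rate, so the memory is concentrated around a chosen delay.

```python
def simulate_cascade(n_stages, rate, dt, steps):
    """Response of the last stage to a unit impulse entering stage 0.

    Assumed dynamics (hypothetical):
        dx_0/dt = -rate * x_0
        dx_i/dt = rate * (x_{i-1} - x_i)   for i >= 1
    """
    x = [0.0] * n_stages
    x[0] = 1.0  # the event to be remembered
    last_stage = []
    for _ in range(steps):
        last_stage.append(x[-1])
        new = x[:]
        new[0] = x[0] - dt * rate * x[0]
        for i in range(1, n_stages):
            new[i] = x[i] + dt * rate * (x[i - 1] - x[i])
        x = new
    return last_stage

def peak_time(trace, dt):
    """Time at which the trace reaches its maximum."""
    idx = max(range(len(trace)), key=trace.__getitem__)
    return idx * dt

dt = 0.01
exponential_peak = peak_time(simulate_cascade(1, 1.0, dt, 2000), dt)
cascade_peak = peak_time(simulate_cascade(5, 1.0, dt, 2000), dt)
print(exponential_peak, cascade_peak)
```

Adding stages while raising the rate keeps the peak at the same delay but narrows it, which is one way such a cascade could trade memory size for temporal precision.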

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

516. Free Energy Mixer

Authors:

Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
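One reading of the free-energy read is a per-channel log-sum-exp of the values under the attention prior, scaled by an inverse temperature β. The sketch below implements that reading only; the two-level gating and the exact parameterization used in the paper are not represented, and the function name and toy inputs are hypothetical. For small β the read approaches the usual convex average, and for large β each channel independently selects its largest value among indices with non-zero prior mass.

```python
import math

def free_energy_read(prior, values, beta):
    """Per-channel read (1/beta) * log sum_i prior[i] * exp(beta * values[i][c]).

    prior:  attention weights over T indices (non-negative, summing to 1)
    values: T x C list of value vectors
    beta:   inverse temperature, must be > 0
    """
    num_channels = len(values[0])
    out = []
    for c in range(num_channels):
        terms = [math.log(p) + beta * v[c]
                 for p, v in zip(prior, values) if p > 0.0]
        m = max(terms)  # stabilise the log-sum-exp
        lse = m + math.log(sum(math.exp(t - m) for t in terms))
        out.append(lse / beta)
    return out

prior = [0.5, 0.3, 0.2]
values = [[1.0, 0.0],
          [0.0, 2.0],
          [3.0, -1.0]]
averaged = free_energy_read(prior, values, beta=1e-4)   # close to the mean read
selected = free_energy_read(prior, values, beta=200.0)  # close to per-channel max
print(averaged, selected)
```

In the example the two channels select different indices at high β (index 2 for channel 0, index 1 for channel 1), which a single convex average per head cannot do.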

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

517. Talking Points: Describing and Localizing Pixels

Authors:

Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results, we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close the predicted point is to the ground-truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart. The bidirectional nature of our framework enables applications in both keypoint-guided image understanding and language-guided precise localization. Our dataset and code will be released upon publication.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

518. Best-of-three-worlds Analysis for Dueling Bandits with Borda Winner

Authors:

The dueling bandits (DB) problem addresses online learning from relative preferences, where the learner queries pairs of arms and receives binary win-loss feedback. Most existing work focuses on designing algorithms for specific stochastic or adversarial environments. Recently, a unified algorithm has been proposed that achieves convergence across all settings. However, this approach relies on the existence of a Condorcet winner, which is often not achievable, particularly when the preference matrix changes in the adversarial setting. Aiming for a more general Borda winner objective, there currently exists no unified framework that simultaneously achieves optimal regret across these environments. In this paper, we explore how the follow-the-regularized-leader (FTRL) algorithm can be employed to achieve this objective. We propose a hybrid negative entropy regularizer and demonstrate that it enables us to achieve $\tilde{O}(K^{1/3} T^{2/3})$ regret in the adversarial setting, ${O}({K \log^2 T}/{\Delta_{\min}^2})$ regret in the stochastic setting, and $O({K \log^2 T }/{\Delta_{\min}^2} + ({C^2 K \log^2 T }/{\Delta_{\min}^2})^{1/3})$ regret in the corrupted setting, where $K$ is the arm set size, $T$ is the horizon, $\Delta_{\min}$ is the minimum gap between the optimal and sub-optimal arms, and $C$ is the corruption level. These results align with the state-of-the-art in individual settings, while eliminating the need to assume a specific environment type. We also present experimental results demonstrating the advantages of our algorithm over baseline methods across different environments.
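The distinction between the two winner notions is easy to check numerically. A Condorcet winner beats every other arm with probability above 1/2, while the Borda winner maximizes the average probability of beating a uniformly chosen other arm, so it always exists. The preference matrix below is a made-up cyclic example in which no Condorcet winner exists.

```python
def condorcet_winner(pref):
    """Arm that beats every other arm with probability > 0.5, or None."""
    k = len(pref)
    for i in range(k):
        if all(pref[i][j] > 0.5 for j in range(k) if j != i):
            return i
    return None

def borda_scores(pref):
    """Average probability of beating a uniformly random other arm."""
    k = len(pref)
    return [sum(pref[i][j] for j in range(k) if j != i) / (k - 1)
            for i in range(k)]

def borda_winner(pref):
    """Arm with the highest Borda score."""
    scores = borda_scores(pref)
    return max(range(len(scores)), key=scores.__getitem__)

# pref[i][j] = probability that arm i beats arm j (hypothetical values).
# Arm 0 beats arm 1, arm 1 beats arm 2, and arm 2 beats arm 0: a cycle.
pref = [[0.5, 0.6, 0.1],
        [0.4, 0.5, 0.6],
        [0.9, 0.4, 0.5]]
print(condorcet_winner(pref), borda_scores(pref), borda_winner(pref))
```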

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

519. Eigen-1: Scientific Reasoning through Adaptive Multi-Agent Refinement and Monitor-based RAG

Authors:

Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden tool tax of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity’s Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy—the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

520. MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE

Authors:

Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead due to the need to retain all experts in memory. While structured pruning is promising to reduce memory costs, existing methods often show suboptimal performance and unstable degradation in three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy based on two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices (unbiased estimations of their original outputs), minimizing performance degradation. Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across the three dimensions, confirming its effectiveness and robustness. Notably, it outperforms baselines by up to 2.72 in average zero-shot accuracy across nine downstream tasks under a 25% pruning ratio, with only a 0.14 performance drop for Qwen2-57B-A14B. The code is available at https://anonymous.4open.science/r/AnonymizedMoNE.
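A toy version of the pruning criterion might look as follows. The abstract names the two metrics (access frequency and output variance) but not how they are combined, so the product used here is an assumption, as are the function names and the calibration data. The novice is taken to be the mean calibration output of the pruned expert, which is an unbiased estimate of its expected output.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def output_variance(vectors):
    """Variance of the outputs, averaged over dimensions."""
    mu = mean_vector(vectors)
    n = len(vectors)
    return sum((v[d] - mu[d]) ** 2
               for v in vectors for d in range(len(mu))) / (n * len(mu))

def redundancy_scores(expert_outputs):
    """Lower score = more redundant. ASSUMPTION: frequency * variance."""
    total = sum(len(outs) for outs in expert_outputs.values())
    return {name: (len(outs) / total) * output_variance(outs)
            for name, outs in expert_outputs.items()}

def build_novices(expert_outputs, num_to_prune):
    """Replace the most redundant experts with constant mean-output novices."""
    scores = redundancy_scores(expert_outputs)
    pruned = sorted(scores, key=scores.get)[:num_to_prune]
    return {name: mean_vector(expert_outputs[name]) for name in pruned}

# Hypothetical calibration outputs collected for each expert.
calibration = {
    "e0": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    "e1": [[2.0, 2.0], [0.0, 0.0], [1.0, 3.0]],
    "e2": [[0.5, 0.5], [0.5, 0.5]],  # rarely routed to, near-constant output
}
novices = build_novices(calibration, num_to_prune=1)
print(novices)
```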

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

521. PerfGuard: A Performance-Aware Agent for Visual Content Generation

Authors:

The advancement of Large Language Model (LLM)-powered agents has enabled automated task processing through reasoning and tool invocation capabilities. However, existing frameworks often operate under the idealized assumption that tool executions are invariably successful, relying solely on textual descriptions that fail to distinguish precise performance boundaries and cannot adapt to iterative tool updates. This gap introduces uncertainty in planning and execution, particularly in domains like visual content generation (AIGC), where nuanced tool performance significantly impacts outcomes. To address this, we propose PerfGuard, a performance-aware agent framework for visual content generation that systematically models tool performance boundaries and integrates them into task planning and scheduling. Our framework introduces three core mechanisms: (1) Performance-Aware Selection Modeling (PASM), which replaces generic tool descriptions with a multi-dimensional scoring system based on fine-grained performance evaluations; (2) Adaptive Preference Update (APU), which dynamically optimizes tool selection by comparing theoretical rankings with actual execution rankings; and (3) Capability-Aligned Planning Optimization (CAPO), which guides the planner to generate subtasks aligned with performance-aware strategies. Experimental comparisons against state-of-the-art methods demonstrate PerfGuard’s advantages in tool selection accuracy, execution reliability, and alignment with user intent, validating its robustness and practical utility for complex AIGC tasks.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

522. Benchmarking Open-ended Segmentation

Authors:

Open-ended segmentation requires models capable of generating free-form descriptions of previously unseen concepts and regions. Despite advancements in model development, current evaluation protocols for open-ended segmentation tasks fail to capture the true semantic accuracy of the generated descriptions. We empirically demonstrate that embedding-based similarity score mappings diverge significantly from human judgments. To address this issue, we introduce a novel mapping function that considers multiple lexical relationships between free-form outputs and test-vocabulary labels, yielding much closer alignment with human annotations. We integrate this mapping into a robust evaluation framework and re-benchmark previous state-of-the-art methods. Additionally, we present the first Multi-modal Large Language Model trained with a contrastive objective to jointly align visual regions and textual descriptions, achieving new state-of-the-art results in open-ended panoptic segmentation.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

523. SiNGER: A Clearer Voice Distills Vision Transformers Further

Authors:

Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher's features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

524. Post-hoc Probabilistic Vision-Language Models

Authors:

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

525. MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models

Authors:

Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models' insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

526. Supporting High-Stakes Decision Making Through Interactive Preference Elicitation in the Latent Space

Authors:

High-stakes, infrequent consumer decisions, such as housing selection, challenge conventional recommender systems due to sparse interaction signals, heterogeneous multi-criteria objectives, and high-dimensional feature spaces. This work presents an interactive preference elicitation framework that couples preferential Bayesian optimization (PBO) with two complementary components: (i) large language models (LLMs) that interpret natural language input to produce personalized probabilistic priors over feature utility weights to mitigate cold start, and (ii) an autoencoder (AE)-based latent representation that reduces effective dimensionality for sample-efficient exploration. The framework learns a latent utility function from user pairwise comparisons observed and integrated in real-time. We evaluate the developed method on rental real estate datasets from two major European cities. The results show that executing PBO in an AE latent space improves final pairwise ranking accuracy by 12%. For LLM-based preference prior generation, we find that direct, LLM-driven weight specification is outperformed by a static prior, while probabilistic weight priors that use LLMs only to rank feature importance achieve 25% better pairwise accuracy on average than a direct approach.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

527. Reinforced Preference Optimization for Recommendation

Authors:

Recent breakthroughs in large language models (LLMs) have fundamentally shifted recommender systems from discriminative to generative paradigms, where user behavior modeling is achieved by generating target items conditioned on historical interactions. Yet current generative recommenders still suffer from two core limitations: the lack of high-quality negative modeling and the reliance on implicit rewards. Reinforcement learning with verifiable rewards (RLVR) offers a natural solution by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals. However, applying RLVR to generative recommenders remains non-trivial. Its unique generation space often leads to invalid or repetitive items that undermine sampling efficiency, and ranking supervision is sparse since most items receive identical zero rewards. To address these challenges, we propose Reinforced Preference Optimization for Recommendation (ReRe), a reinforcement-based paradigm tailored to LLM-based recommenders, an important direction in generative recommendation. ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives, while augmenting rule-based accuracy rewards with auxiliary ranking rewards for finer-grained supervision. Extensive experiments on three real-world datasets demonstrate that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance. Further analysis shows that ReRe not only enhances performance across both base and SFT-initialized models but also generalizes robustly across different backbone families and scales. Beyond empirical gains, we systematically investigate the design space of RLVR in recommendation across generation, sampling strategy, reward modeling, and optimization algorithm, offering insights for future research. Our code is available at https://anonymous.4open.science/r/ReRe-E1B0.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

528. Choices Speak Louder than Questions

Authors:

Recent findings raise concerns about whether the evaluation of Multiple-Choice Question Answering (MCQA) accurately reflects the comprehension abilities of large language models. This paper explores the concept of choice sensitivity, which refers to the tendency for model decisions to be more influenced by the answer options than by a genuine understanding of the question. We introduce a new scoring method called **Normalized Probability Shift by the Question (NPSQ)**, designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbols, and hybrid formats, we find that traditional scoring methods, such as those based on log-likelihood or its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when modifications are made to the answer options.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

529. CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

Authors:

Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that CurES outperforms Group Relative Policy Optimization (GRPO) by +3.3 points and +4.82 points with 1.5B and 7B models, respectively, and exceeds the best prior sample-efficient methods by +2.12 points on average across eight math reasoning benchmarks. CurES also improves convergence speed compared to baselines such as GRPO.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

530. MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Authors:

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs’ reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,676 multiple-choice questions based on 19,367 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

531. OD$^3$: Optimization-free Dataset Distillation for Object Detection

Authors:

Training large neural networks on large-scale datasets requires substantial computational resources, particularly for dense prediction tasks such as object detection. Although dataset distillation (DD) has been proposed to alleviate these demands by synthesizing compact datasets from larger ones, most existing work focuses solely on image classification, leaving the more complex detection setting largely unexplored. In this paper, we introduce OD$^3$, a novel optimization-free data distillation framework specifically designed for object detection. Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images based on their suitable locations, and second, a candidate screening process using a pre-trained observer model to remove low-confidence objects. We apply our data synthesis framework to MS COCO and PASCAL VOC, two popular detection datasets, with compression ratios ranging from 0.25% to 5%. Compared to the only prior dataset distillation method for detection and to conventional core-set selection methods, OD$^3$ delivers superior accuracy and establishes new state-of-the-art results, surpassing the prior best method by more than 14% on COCO mAP$_{50}$ at a compression ratio of 1.0%. The code is in the supplementary material.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

532. Taming Curvature: Architecture Warm-up for Stable Transformer Training

Authors:

Training billion-parameter Transformers is often brittle, with transient loss spikes and divergence that waste compute. Even though the recently developed Edge of Stability (EoS) theory provides a powerful tool to understand and control the stability of optimization methods via the (preconditioned) curvature, these curvature-controlling methods are not popular in large-scale Transformer training due to the complexity of curvature estimation. To this end, we first introduce a fast online estimator of the largest (preconditioned) Hessian eigenvalue (i.e., curvature) based on a warm-started variant of power iteration with Hessian–vector products. We show theoretically, and verify empirically, that the proposed method makes per-iteration curvature tracking feasible at billion-parameter scale while being more accurate. Using this tool, we find that training instabilities coincide with surges in preconditioned curvature and that curvature grows with depth. Motivated by these observations, we propose architecture warm-up: progressively growing network depth to carefully control the preconditioned Hessian and stabilize training. Experiments on large Transformers validate that our approach enables efficient curvature tracking and reduces instabilities compared to existing state-of-the-art stabilization techniques without slowing down convergence.
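
The idea of warm-started power iteration with Hessian-vector products can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's estimator: the Hessian is a small explicit toy matrix (real training would obtain the product from autodiff), no preconditioner is applied, and the names `make_hvp` and `top_eigenvalue` are hypothetical.

```python
import math

def make_hvp(hessian):
    """Return a Hessian-vector product function for an explicit toy Hessian.
    Assumption: in real training this product comes from autodiff, not a stored matrix."""
    def hvp(v):
        return [sum(h * x for h, x in zip(row, v)) for row in hessian]
    return hvp

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenvalue(hvp, v_init, num_iters):
    """Power iteration: returns the Rayleigh quotient and the final unit vector."""
    v = normalize(v_init)
    for _ in range(num_iters):
        v = normalize(hvp(v))
    rayleigh = sum(a * b for a, b in zip(v, hvp(v)))
    return rayleigh, v

# Toy Hessians at two consecutive training steps; the second is a small perturbation.
hessian_t0 = [[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
hessian_t1 = [[4.1, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 1.0]]

# Cold start: many iterations from an arbitrary vector.
lam_cold, v_prev = top_eigenvalue(make_hvp(hessian_t0), [1.0, 1.0, 1.0], num_iters=50)
# Warm start: reuse the previous eigenvector estimate, so a few iterations suffice.
lam_warm, v_prev = top_eigenvalue(make_hvp(hessian_t1), v_prev, num_iters=3)
```

Because the Hessian changes little between consecutive steps, the previous eigenvector estimate is already close to the new one, which is why a handful of products per step can be enough. Plain power iteration converges to the eigenvalue of largest magnitude, which coincides with the largest eigenvalue here only because the toy matrices are positive definite.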

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

533. SVD Provably Denoises Nearest Neighbor Data

Authors:

We study the Nearest Neighbor Search (NNS) problem in a high-dimensional setting where data originates from a low-dimensional subspace and is corrupted by Gaussian noise. Specifically, we consider a semi-random model where $n$ points from an unknown $k$-dimensional subspace of $\mathbb{R}^d$ ($k \ll d$) are perturbed by zero-mean $d$-dimensional Gaussian noise with variance $\sigma^2$ on each coordinate. Without loss of generality, we may assume the nearest neighbor is at distance $1$ from the query, and that all other points are at distance at least $1+\varepsilon$. We assume we are given only the noisy data and are required to find the NN in the uncorrupted data. We prove the following results: (1) For $\sigma \in O(1/k^{1/4})$, we show that simply performing SVD denoises the data; namely, we provably recover an accurate NN of the uncorrupted data (Theorem 1.1). (2) For $\sigma \gg 1/k^{1/4}$, the NN in the uncorrupted data is not even identifiable from the noisy data in general. This is a matching lower bound on $\sigma$ with the above result, demonstrating the necessity of this threshold for NNS (Lemma 3.1). (3) For $\sigma \gg 1/\sqrt k$, the noise magnitude ($\sigma \sqrt{d}$) significantly exceeds the inter-point distances in the unperturbed data. Moreover, the NN in the noisy data is different from the NN in the uncorrupted data in general. Note that (1) and (3) together imply SVD identifies the correct NN in the uncorrupted data even in a regime where it is different from the NN in the noisy data. This was not the case in existing literature (see e.g. (Abdullah et al., 2014)). Another comparison with (Abdullah et al., 2014) is that it requires $\sigma$ to be at least an inverse polynomial in the ambient dimension $d$. The proof of (1) above uses upper bounds on perturbations of singular spaces of matrices as well as concentration and spherical symmetry of Gaussians. We thus give theoretical justification for the performance of spectral methods in practice. We also provide empirical results on real datasets to corroborate our findings.
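
The SVD denoising step described in this abstract can be sketched as a projection onto the top-$k$ right singular vectors followed by an ordinary nearest-neighbor lookup. This is a toy illustration, not the paper's experimental setup: the sizes `n`, `d`, `k`, the noise level `sigma`, and the helper names `svd_denoise` and `nearest_neighbor` are all assumptions, and distances are not normalized to the $1$ versus $1+\varepsilon$ gap used in the analysis.

```python
import numpy as np

def svd_denoise(noisy, k):
    """Project each row of `noisy` onto the span of the top-k right singular vectors."""
    _, _, vt = np.linalg.svd(noisy, full_matrices=False)
    basis = vt[:k]                       # (k, d), orthonormal rows
    return noisy @ basis.T @ basis, basis

def nearest_neighbor(data, query):
    """Index of the row of `data` closest to `query` in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(data - query, axis=1)))

rng = np.random.default_rng(0)
n, d, k, sigma = 200, 500, 5, 0.05       # toy values chosen for illustration

# Clean points live in a random k-dimensional subspace of R^d.
subspace = np.linalg.qr(rng.normal(size=(d, k)))[0].T   # (k, d)
clean = rng.normal(size=(n, k)) @ subspace
noisy = clean + sigma * rng.normal(size=(n, d))          # per-coordinate Gaussian noise

denoised, basis = svd_denoise(noisy, k)
err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)

# Answer a query by searching in the denoised data.
query = rng.normal(size=k) @ subspace
nn_index = nearest_neighbor(denoised, query @ basis.T @ basis)
```

Most of the noise energy lies outside the recovered $k$-dimensional subspace, so the projection removes it while keeping the signal, which is the intuition behind searching in `denoised` rather than `noisy`.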

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

534. Score-Based Density Estimation from Pairwise Comparisons

Authors:

We study density estimation from pairwise comparisons, motivated by expert knowledge elicitation and learning from human feedback. We relate the unobserved target density to a tempered winner density (marginal density of preferred choices), learning the winner's score via score-matching. This allows estimating the target by 'de-tempering' the estimated winner density's score. We prove that the score vectors of the belief and the winner density are collinear, linked by a position-dependent tempering field. We give analytical formulas for this field and propose an estimator for it under the Bradley-Terry model. Using a diffusion model trained on tempered samples generated via score-scaled annealed Langevin dynamics, we can learn complex multivariate belief densities of simulated experts, from only hundreds to thousands of pairwise comparisons.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

535. IC-Custom: Diverse Image Customization via In-Context Learning

Authors:

Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images into a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We propose the In-context Multi-Modal Attention (ICMA) mechanism, which employs learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to effectively handle diverse tasks and distinguish between inputs in polyptych configurations. To address the data gap, we curated a 12K identity-consistent dataset with 8K real-world and 4K high-quality synthetic samples, avoiding the overly glossy, oversaturated look typical of synthetic data. IC-Custom supports various industrial applications, including try-on, image insertion, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves about 73% higher human preference across identity consistency, harmony, and text alignment metrics, while training only 0.4% of the original model parameters.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

536. MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

Authors:

Imitation learning from large-scale, diverse human demonstrations has proven effective for training robots, but collecting such data is costly and time-consuming. This challenge is amplified for multi-step bimanual mobile manipulation, where humans must teleoperate both a mobile base and two high-degree-of-freedom arms. Prior automated data generation frameworks have addressed static bimanual manipulation by augmenting a few human demonstrations in simulation, but they fall short for mobile settings due to two key challenges: (1) determining base placement to ensure reachability, and (2) positioning the camera to provide sufficient visibility for visuomotor policies. To address these issues, we introduce MoMaGen, which formulates data generation as a constrained optimization problem that enforces hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes prior approaches and provides a principled foundation for future methods. We evaluate MoMaGen on four multi-step bimanual mobile manipulation tasks and show that it generates significantly more diverse datasets than existing methods. Leveraging this diversity, MoMaGen can train successful imitation learning policies from a single source demonstration, and these policies can be fine-tuned with as few as 40 real-world demonstrations to achieve deployment on physical robotic hardware. More details are available at our project page: momagen-iclr2026.github.io.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

537. Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models

Authors:

Catastrophic forgetting remains a central obstacle for continual learning in neural models. Popular approaches, namely replay and elastic weight consolidation (EWC), have limitations: replay requires a strong generator and is prone to distributional drift, while EWC implicitly assumes a shared optimum across tasks and typically uses a diagonal Fisher approximation. In this work, we study the gradient geometry of diffusion models, which can already produce high-quality replay data. We provide theoretical and empirical evidence that, in the low signal-to-noise ratio (SNR) regime, per-sample gradients become strongly collinear, yielding an empirical Fisher that is effectively rank-1 and aligned with the mean gradient. Leveraging this structure, we propose a rank-1 variant of EWC that is as cheap as the diagonal approximation yet captures the dominant curvature direction. We pair this penalty with a replay-based approach to encourage parameter sharing across tasks while mitigating drift. On class-incremental image generation datasets (MNIST, FashionMNIST, CIFAR-10, ImageNet-1k), our method consistently improves average FID and reduces forgetting relative to replay-only and diagonal-EWC baselines. In particular, forgetting is nearly eliminated on MNIST and FashionMNIST and is roughly halved on ImageNet-1k. These results suggest that diffusion models admit an approximately rank-1 Fisher. With a better Fisher estimate, EWC becomes a strong complement to replay: replay encourages parameter sharing across tasks, while EWC effectively constrains replay-induced drift.
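
The contrast between a diagonal and a rank-1 Fisher penalty can be made concrete with a small sketch. It assumes the rank-1 Fisher is approximated by the outer product of the mean gradient with itself (the scale factor is an assumption, and the paper's exact penalty may differ); the function names and toy gradients are hypothetical.

```python
def mean_gradient(per_sample_grads):
    """Coordinate-wise mean of a list of per-sample gradient vectors."""
    n = len(per_sample_grads)
    return [sum(col) / n for col in zip(*per_sample_grads)]

def diagonal_ewc_penalty(per_sample_grads, theta, theta_star, lam):
    """Standard EWC with a diagonal empirical Fisher: 0.5 * lam * sum_i F_ii * delta_i^2."""
    n = len(per_sample_grads)
    fisher_diag = [sum(g * g for g in col) / n for col in zip(*per_sample_grads)]
    return 0.5 * lam * sum(f * (t - s) ** 2 for f, t, s in zip(fisher_diag, theta, theta_star))

def rank1_ewc_penalty(per_sample_grads, theta, theta_star, lam):
    """Rank-1 variant, assuming F ~ g g^T with g the mean gradient:
    0.5 * lam * (g . delta)^2, which costs one dot product."""
    g = mean_gradient(per_sample_grads)
    proj = sum(gi * (t - s) for gi, t, s in zip(g, theta, theta_star))
    return 0.5 * lam * proj ** 2

# Perfectly collinear toy gradients, mimicking the low-SNR regime in the abstract.
grads = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
theta_star = [0.0, 0.0]
orthogonal_move = [0.1, -0.05]   # orthogonal to the mean gradient [2, 4]
aligned_move = [0.1, 0.2]        # parallel to the mean gradient

rank1_orthogonal = rank1_ewc_penalty(grads, orthogonal_move, theta_star, lam=1.0)
rank1_aligned = rank1_ewc_penalty(grads, aligned_move, theta_star, lam=1.0)
diag_orthogonal = diagonal_ewc_penalty(grads, orthogonal_move, theta_star, lam=1.0)
```

In this toy case the rank-1 penalty leaves movement orthogonal to the dominant gradient direction free and penalizes movement along it, whereas the diagonal penalty charges both moves because it ignores the correlation between coordinates.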

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

538. From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Authors:

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data construction pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD’s capabilities in both “seeing” and “doing”, achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed more challenging benchmark VABench. We also verified zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves 40.6% success rate in SimplerEnv and 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

539. Efficient Spatially-Variant Convolution via Differentiable Sparse Kernel Complex

Authors:

Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining or additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

540. Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data

Authors:

Recent research has introduced distributed self-supervised learning (D-SSL) approaches to leverage vast amounts of unlabeled decentralized data. However, D-SSL faces the critical challenge of data heterogeneity, and there is limited theoretical understanding of how different D-SSL frameworks respond to this challenge. To fill this gap, we present a rigorous theoretical analysis of the robustness of D-SSL frameworks under non-IID (non-independent and identically distributed) settings. Our results show that pre-training with Masked Image Modeling (MIM) is inherently more robust to heterogeneous data than Contrastive Learning (CL), and that the robustness of decentralized SSL increases with average network connectivity, implying that federated learning (FL) is no less robust than decentralized learning (DecL). These findings provide a solid theoretical foundation for guiding the design of future D-SSL algorithms. To further illustrate the practical implications of our theory, we introduce MAR loss, a refinement of the MIM objective with local-to-global alignment regularization. Extensive experiments across model architectures and distributed settings validate our theoretical insights, and additionally confirm the effectiveness of MAR loss as an application of our analysis.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

541. Locality-Attending Vision Transformer

Authors:

Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance the segmentation performance of vision transformers after being trained using the usual image-level classification objective. More specifically, we present a simple yet effective add-on for vision transformers that improves their performance on segmentation tasks while retaining their image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications ensure meaningful representations at spatial positions and encourage tokens to focus on local surroundings, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://anonymous.4open.science/r/LocAtViTRepo/.
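
One plausible way to bias attention toward neighboring patches with a Gaussian kernel is to add the log of the kernel to the attention logits before the softmax. The sketch below assumes that form, a single fixed `sigma` rather than a learned one, and a toy 3x3 patch grid; the paper's exact modulation may differ, and all function names are hypothetical.

```python
import math

def grid_coords(height, width):
    """Row-major (row, col) coordinates of patches on a height x width grid."""
    return [(r, c) for r in range(height) for c in range(width)]

def gaussian_log_bias(coords, sigma):
    """Log of a Gaussian kernel over squared patch distance: -d^2 / (2 * sigma^2)."""
    two_var = 2.0 * sigma ** 2
    return [[-((r1 - r2) ** 2 + (c1 - c2) ** 2) / two_var for (r2, c2) in coords]
            for (r1, c1) in coords]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def locality_biased_attention(logits, bias):
    """Softmax over logits plus bias, i.e. attention weights rescaled by the kernel."""
    return [softmax([l + b for l, b in zip(lrow, brow)])
            for lrow, brow in zip(logits, bias)]

coords = grid_coords(3, 3)
logits = [[0.0] * 9 for _ in range(9)]      # uniform content scores, to isolate the bias

local = locality_biased_attention(logits, gaussian_log_bias(coords, sigma=1.0))
wide = locality_biased_attention(logits, gaussian_log_bias(coords, sigma=1e6))
center = local[4]                            # attention row of the central patch
```

With a small `sigma` the central patch attends most to itself, then to its edge neighbors, then to the corners. As `sigma` grows the bias vanishes and the weights return to plain global attention, which is how such a kernel can keep access to global information.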

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

542. Improving Attributed Long-form Question Answering with Intent Awareness

Authors:

Large language models (LLMs) are increasingly being used to generate comprehensive, knowledge-intensive reports. However, while these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model's intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit underlying implicit intents to write or cite. We demonstrate that these extracted intents both enhance zero-shot generation capabilities in LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments reveal improved performance across various challenging scientific report generation tasks, with an average improvement of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

543. Towards a Sharp Analysis of Learning Offline $f$-Divergence-Regularized Contextual Bandits

Authors:

Many offline reinforcement learning algorithms are underpinned by $f$-divergence regularization, but their sample complexity *defined with respect to regularized objectives* still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the $\tilde{\Theta}(\epsilon^{-1})$ sample complexity for offline $f$-divergence-regularized contextual bandits. For reverse Kullback–Leibler (KL) divergence, arguably the most commonly used one, we achieve an $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing the existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and the existing $\tilde{O}(\epsilon^{-2})$ bound under single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL. Moreover, for $f$-divergences with strongly convex $f$, to which reverse KL *does not* belong, we show that the sharp sample complexity $\tilde{\Theta}(\epsilon^{-1})$ is achievable even without pessimistic estimation or single-policy concentrability. We further corroborate our theoretical insights with numerical experiments and extend our analysis to contextual dueling bandits. We believe these results take a significant step towards a comprehensive understanding of objectives with $f$-divergence regularization.
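
For orientation, the standard form of an $f$-divergence-regularized contextual bandit objective can be written out as follows. The notation is assumed here for illustration and may differ from the paper's: $\rho$ is the context distribution, $r$ the reward, $\pi_{\mathrm{ref}}$ the reference policy, and $\eta$ the regularization strength.

```latex
J(\pi) = \mathbb{E}_{x \sim \rho,\, a \sim \pi(\cdot \mid x)}\bigl[r(x,a)\bigr]
       - \eta^{-1}\, \mathbb{E}_{x \sim \rho}\Bigl[D_f\bigl(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\bigr)\Bigr],
\qquad
D_f(p \,\|\, q) = \mathbb{E}_{a \sim q}\!\left[f\!\left(\frac{p(a)}{q(a)}\right)\right].

% Reverse KL corresponds to f(t) = t \log t. Its second derivative is
% f''(t) = 1/t, which tends to 0 as t grows, so f is convex but not
% strongly convex on (0, \infty).
\pi^{*}(a \mid x) \propto \pi_{\mathrm{ref}}(a \mid x)\, \exp\bigl(\eta\, r(x,a)\bigr)
\quad \text{for } f(t) = t \log t,
\qquad
\mathrm{SubOpt}(\hat{\pi}) = J(\pi^{*}) - J(\hat{\pi}) \le \epsilon.
```

Under this reading, the sample complexity in the abstract counts the offline samples needed for the learned policy $\hat{\pi}$ to reach suboptimality at most $\epsilon$ in the regularized objective $J$, not in the unregularized reward.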

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

544. Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Authors:

The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

545. Type-Compliant Adaptation Cascades

Authors:

Reliably composing Large Language Models (LLMs) for complex, multi-step workflows remains a significant challenge. The dominant paradigm, optimizing discrete prompts in a pipeline, is notoriously brittle and struggles to enforce the formal compliance required for structured tasks. We introduce Type-Compliant Adaptation Cascades (TACs), a framework that recasts workflow adaptation as learning typed probabilistic programs. TACs treat the entire workflow, which is composed of parameter-efficiently adapted LLMs and deterministic logic, as an unnormalized joint distribution. This enables principled, gradient-based training even with latent intermediate structures. We provide theoretical justification for our tractable optimization objective, proving that the optimization bias vanishes as the model learns type compliance. Empirically, TACs significantly outperform state-of-the-art prompt-optimization baselines. Gains are particularly pronounced on structured tasks, improving FinQA from $12.0\%$ to $24.7\%$ for a Qwen 3 8B model, MGSM-SymPy from $57.1\%$ to $75.9\%$ for a Gemma 2 27B model, MGSM from $1.6\%$ to $27.3\%$, and MuSR from $36.5\%$ to $62.6\%$ for a Gemma 7B model. TACs offer a robust and theoretically grounded paradigm for developing reliable, task-compliant LLM systems.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

546. Conformal Prediction for Long-Tailed Classification

Authors:

Many real-world classification problems, such as plant identification, have extremely long-tailed class distributions. In order for prediction sets to be useful in such settings, they should (i) provide good class-conditional coverage, ensuring that rare classes are not systematically omitted from the prediction sets, and (ii) be a reasonable size, allowing users to easily verify candidate labels. Unfortunately, existing conformal prediction methods, when applied to the long-tailed setting, force practitioners to make a binary choice between small sets with poor class-conditional coverage or sets with very good class-conditional coverage but that are extremely large. We propose methods with guaranteed marginal coverage that smoothly trade off between set size and class-conditional coverage. First, we introduce a new conformal score function, coined prevalence-adjusted softmax, that targets macro-coverage, a relaxed notion of class-conditional coverage. Second, we propose a label-weighted conformal prediction method that allows us to interpolate between marginal and class-conditional conformal prediction. We demonstrate our methods on Pl@ntNet-300K and iNaturalist-2018, two long-tailed image datasets with 1,081 and 8,142 classes, respectively.
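
The trade-off between set size and coverage of rare classes can be illustrated with split conformal prediction under two score functions. The quantile rule is the standard finite-sample one; the second score, which divides the softmax probability by the class prior, is only a guess at what a prevalence-adjusted score might look like and is not taken from the paper. The calibration data, priors, and function names are invented for illustration.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))      # 1-indexed order statistic
    if rank > n:
        return math.inf                           # too few calibration points
    return sorted(scores)[rank - 1]

def prediction_set(probs, threshold, score_fn):
    """All labels whose nonconformity score is at most the calibrated threshold."""
    return [y for y in range(len(probs)) if score_fn(probs, y) <= threshold]

def softmax_score(probs, y):
    return 1.0 - probs[y]

def make_prevalence_adjusted_score(prior):
    """Hypothetical score: softmax probability divided by class prevalence."""
    def score(probs, y):
        return -probs[y] / prior[y]
    return score

prior = [0.8, 0.15, 0.05]                         # long-tailed toy class frequencies
calibration = [                                   # (softmax output, true label)
    ([0.90, 0.08, 0.02], 0), ([0.85, 0.10, 0.05], 0), ([0.70, 0.20, 0.10], 0),
    ([0.80, 0.15, 0.05], 0), ([0.60, 0.30, 0.10], 0), ([0.75, 0.20, 0.05], 0),
    ([0.50, 0.40, 0.10], 1), ([0.60, 0.30, 0.10], 1), ([0.70, 0.10, 0.20], 2),
]
alpha = 0.2
test_probs = [0.72, 0.23, 0.05]

adjusted_score = make_prevalence_adjusted_score(prior)
q_standard = conformal_quantile([softmax_score(p, y) for p, y in calibration], alpha)
q_adjusted = conformal_quantile([adjusted_score(p, y) for p, y in calibration], alpha)

standard_set = prediction_set(test_probs, q_standard, softmax_score)
adjusted_set = prediction_set(test_probs, q_adjusted, adjusted_score)
```

On this toy input the plain softmax score yields a small set containing only the dominant class, while the adjusted score also admits the two rare classes at the cost of a larger set. Split conformal calibration gives marginal coverage for any fixed score function, so the choice of score changes how that coverage is distributed across classes.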

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

547. Demystifying Deep Search: A Holistic Evaluation with Hint-free Multi-Hop Questions and Factorised Metrics

Authors:

RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviors into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilization, and refusal behavior. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilization despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap—today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow EvidenceLoop that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

548. TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

Authors:

Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

549. MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Authors:

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input–output coupling. Tasks in MCP-Bench also test agents’ ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows. Existing benchmarks, which rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations, do not adequately evaluate these capabilities. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

550. Mordal: Automated Pretrained Model Selection for Vision Language Models

Authors:

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$--$11.6\times$ fewer GPU hours than grid search. During evaluation, we also found that Mordal achieves $1.2\times$--$3.3\times$ better performance than the state-of-the-art model selection methods on a variety of tasks.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

551. NetArena: Dynamically Generated LLM Benchmarks for Network Applications

Authors:

As large language models (LLMs) expand into high-stakes domains like network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to static design, show high statistical variance from limited dataset size, and fail to reflect the complexity of production environments. We introduce NetArena, a dynamic benchmark generation framework for network applications. NetArena features a novel abstraction and unified interface that generalizes across applications, effectively addressing the challenges of dynamic benchmarking posed by the diversity of network tasks. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to provide execution-time feedback on correctness, safety, and latency. We demonstrate NetArena on three representative applications and find that (1) it significantly improves statistical reliability among LLM agents (confidence interval overlap reduced from 85% to 0), (2) agents achieve only 13–38% average performance (as low as 3%) for large-scale, realistic queries, and (3) it reveals finer-grained behaviors missed by static, correctness-only benchmarks. NetArena also enables use cases such as SFT and RL fine-tuning on network system tasks. Code is available anonymously at https://anonymous.4open.science/r/netarena_iclr2026-BE94/README.md

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

552. CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Authors:

Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

553. Identifiability Challenges in Sparse Linear Ordinary Differential Equations

Authors:

Dynamical systems modeling is a core pillar of scientific inquiry across natural and life sciences. Increasingly, dynamical system models are learned from data, rendering identifiability a paramount concept. For systems that are not identifiable from data, no guarantees can be given about their behavior under new conditions and inputs, or about possible control mechanisms to steer the system. It is known in the community that "linear ordinary differential equations (ODE) are almost surely identifiable from a single trajectory." However, this only holds for dense matrices. The sparse regime remains underexplored, despite its practical relevance with sparsity arising naturally in many biological, social, and physical systems. In this work, we address this gap by characterizing the identifiability of sparse linear ODEs. Contrary to the dense case, we show that sparse systems are unidentifiable with a positive probability in practically relevant sparsity regimes and provide lower bounds for this probability. We further study empirically how this theoretical unidentifiability manifests in state-of-the-art methods to estimate linear ODEs from data. Our results corroborate that sparse systems are also practically unidentifiable. Theoretical limitations are not resolved through inductive biases or optimization dynamics. Our findings call for rethinking what can be expected from data-driven dynamical system modeling and allow for quantitative assessments of how much to trust a learned linear ODE.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

554. PD$^{2}$GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting

Authors:

Articulated objects are ubiquitous and important in robotics, AR/VR, and digital twins. Most self-supervised methods for articulated object modeling reconstruct discrete interaction states and relate them via cross-state geometric consistency, yielding representational fragmentation and drift that hinder smooth control of articulated configurations. We introduce PD$^{2}$GS, a novel framework that learns a shared canonical Gaussian field and models the arbitrary interaction state as its continuous deformation, jointly encoding geometry and kinematics. By associating each interaction state with a latent code and refining part boundaries using generic vision priors, PD$^{2}$GS enables accurate and reliable part-level decoupling while enforcing mutual exclusivity between parts and preserving scene-level coherence. This unified formulation supports part-aware reconstruction, fine-grained continuous control, and accurate kinematic modeling, all without manual supervision. To assess realism and generalization, we release RS-Art, a real-to-sim RGB-D dataset aligned with reverse-engineered 3D models, supporting real-world evaluation. Extensive experiments demonstrate that PD$^{2}$GS surpasses prior methods in geometric and kinematic accuracy, and in consistency under continuous control, both on synthetic and real data.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

555. How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Authors:

Curriculum learning is a powerful paradigm, yet its application to large language model (LLM) pretraining remains underexplored, especially in scenarios where high-quality data is limited yet crucial. Previous works have primarily focused on applying curriculum learning to LLM pretraining by searching for better data quality metrics. However, these approaches have yielded only marginal gains, and curriculum-based training is still not a standard practice. In this work, we explore the problem from the opposite perspective: if a good quality metric is available, can current curriculum learning strategies produce better results? We diagnose a key, yet overlooked, factor responsible for this deficiency: the interplay between the data order and the learning rate (LR) schedule. We find that while curriculum learning can greatly outperform pretraining with a uniform data distribution under a constant LR schedule, this advantage diminishes as the learning rate decays. Building on this observation, we propose replacing LR decay with model averaging, which involves computing a weighted average of the last several model checkpoints. We find this strategy achieves better results than standard LR decay schedules, especially in a mid-training regime where only a portion of high-quality data is available. Furthermore, this approach reveals that model averaging is greatly strengthened in the presence of curriculum learning. Finally, we propose a co-designed strategy for curriculum-based LLM pretraining: combining a moderate LR decay with model averaging. This approach allows the model to strike a balance between learning effectively from high-quality data, reducing knowledge forgetting, and mitigating gradient noise. We find that this combination highlights a previously overlooked opportunity to improve pretraining by co-designing the data curriculum, LR schedule, and model averaging.
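
The model-averaging step mentioned in the abstract, a weighted average of the last several checkpoints, can be illustrated with a small sketch. The abstract does not specify the weighting scheme, so the `weights` argument, the dict-of-lists checkpoint layout, and the function name `average_checkpoints` are hypothetical choices made for illustration.

```python
def average_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter values across checkpoints.

    checkpoints: list of dicts mapping a parameter name to a flat list
    of floats (a hypothetical stand-in for real model state).
    weights: one non-negative weight per checkpoint; uniform if omitted.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    averaged = {}
    for name, reference in checkpoints[0].items():
        # Average each coordinate of this parameter across checkpoints.
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(reference))
        ]
    return averaged
```

For example, `average_checkpoints(last_k, weights=[1, 2, 3])` would weight the most recent of three checkpoints most heavily; whether such a scheme helps is an empirical question the sketch does not answer.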

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

556. Persuasive Prediction via Decision Calibration

Authors:

Bayesian persuasion, a central model in information design, studies how a sender, who privately observes a state drawn from a prior distribution, strategically sends a signal to influence a receiver's action. A key assumption is that both sender and receiver share the precise knowledge of the prior. Although this prior can be estimated from past data, such assumptions break down in high-dimensional or infinite state spaces, where learning an accurate prior may require a prohibitive amount of data. In this paper, we study a learning-based variant of persuasion, which we term *persuasive prediction*. This setting mirrors Bayesian persuasion with large state spaces, but crucially does not assume a common prior: the sender observes covariates $X$, learns to predict a payoff-relevant outcome $Y$ from past data, and releases a prediction to influence a population of receivers. To model rational receiver behavior without a common prior, we adopt a learnable proxy: *decision calibration*, which requires the prediction to be unbiased conditioned on the receiver's best response to the prediction. This condition guarantees that myopically responding to the prediction yields no swap regret. Assuming the receivers best respond to decision-calibrated predictors, we design a provably efficient algorithm that learns a decision-calibrated predictor within a randomized predictor class that optimizes the sender's utility. In the commonly studied single-receiver case, our method matches the utility of a Bayesian sender who has full knowledge of the underlying prior distribution. Finally, we extend our algorithmic result to a setting where receivers respond stochastically to predictions and the sender may randomize over an infinite predictor class.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

557. Closing the Gap Between Text and Speech Understanding in LLMs

Authors:

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts—and even cascaded pipelines—on language understanding tasks. We term this shortfall the text–speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD—Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation—which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from publicly available corpora.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

558. Hierarchical Concept-based Interpretable Models

Authors:

Modern deep neural networks remain challenging to interpret due to the opacity of their latent representations, impeding model understanding, debugging, and debiasing. Concept Embedding Models (CEMs) address this by mapping inputs to human-interpretable concept representations from which tasks can be predicted. Yet, CEMs fail to represent inter-concept relationships and require concept annotations at different granularities during training, limiting their applicability. In this paper, we introduce *Hierarchical Concept Embedding Models* (HiCEMs), a new family of CEMs that explicitly model concept relationships through hierarchical structures. To enable HiCEMs in real-world settings, we propose *Concept Splitting*, a method for automatically discovering finer-grained sub-concepts from a pretrained CEM’s embedding space without requiring additional annotations. This allows HiCEMs to generate fine-grained explanations from limited concept labels, reducing annotation burdens. Our evaluation across multiple datasets, including a user study and experiments on *PseudoKitchens*, a newly proposed concept-based dataset of 3D kitchen renders, demonstrates that (1) Concept Splitting discovers human-interpretable sub-concepts absent during training that can be used to train highly accurate HiCEMs, and (2) HiCEMs enable powerful test-time concept interventions at different granularities, leading to improved task accuracy.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

559. AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport

Authors:

Flow-based Generative Models (FGMs) effectively transform noise into a data distribution, and coupling the noise and data in the training of FGM by Optimal Transport (OT) improves the straightness of the flow paths. However, existing OT-based couplings are difficult to combine with modern models and/or to scale to large datasets due to the curse of dimensionality in the sample complexity of (batch) OT. This paper introduces AlignFlow, a new approach using Semi-Discrete Optimal Transport (SDOT) to enhance FGM training by establishing explicit alignment between noise and data pairs. SDOT computes a transport map by partitioning the noise space into Laguerre cells, each mapped to a corresponding data point. During the training of FGM, i.i.d.-sampled noise is matched with corresponding data by the SDOT map. AlignFlow bypasses the curse of dimensionality and scales effectively to large datasets and models. Our experiments demonstrate that AlignFlow improves a wide range of state-of-the-art FGM algorithms and can be integrated as a plug-and-play solution with negligible additional cost.
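
The Laguerre-cell matching described above can be sketched as follows. This assumes a squared Euclidean cost and that the per-point dual potentials have already been fitted; the abstract states neither, and the names `laguerre_assign` and `potentials` are hypothetical.

```python
def laguerre_assign(noise, data_points, potentials):
    """Index of the data point whose Laguerre cell contains `noise`.

    A noise sample z falls in cell i when |z - x_i|^2 - w_i is minimal
    over all data points x_j with dual potentials w_j (assumed given).
    """
    best_index, best_value = None, float("inf")
    for i, (x, w) in enumerate(zip(data_points, potentials)):
        # Squared Euclidean cost shifted by this point's potential.
        value = sum((z - xi) ** 2 for z, xi in zip(noise, x)) - w
        if value < best_value:
            best_index, best_value = i, value
    return best_index
```

With all potentials equal, this reduces to nearest-neighbour matching. In semi-discrete OT the potentials are chosen so that each cell receives the noise mass prescribed for its data point, which is what separates the transport map from a plain nearest-neighbour lookup.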

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

560. Control Tax: The Price of Keeping AI in Check

Authors:

The rapid integration of agentic AI into high-stakes real-world applications requires robust oversight mechanisms. The emerging field of AI Control (AIC) aims to provide such an oversight mechanism, but practical adoption depends heavily on implementation overhead. To study this problem better, we introduce the notion of Control Tax: the operational and financial cost of integrating control measures into AI pipelines. Our work makes three key contributions to the field of AIC: (1) we introduce a theoretical framework that quantifies the Control Tax and maps classifier performance to safety assurances; (2) we conduct comprehensive evaluations of state-of-the-art language models in adversarial settings, where attacker models insert subtle backdoors into code while monitoring models attempt to detect these vulnerabilities; and (3) we provide empirical financial cost estimates for control protocols and develop optimized monitoring strategies that balance safety and cost-effectiveness while accounting for practical constraints like auditing budgets. Our framework enables practitioners to make informed decisions by systematically connecting safety guarantees with their costs, advancing AIC through principled economic feasibility assessment across different deployment contexts.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

561. Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model

Authors:

Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose **E**volutionary **C**aching to **A**ccelerate **D**iffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

562. Improved high-dimensional estimation with Langevin dynamics and stochastic weight averaging

Authors:

Significant recent work has studied the ability of gradient descent to recover a hidden planted direction $\theta^\star \in S^{d-1}$ in different high-dimensional settings, including tensor PCA and single-index models. The key quantity that governs the ability of gradient descent to traverse these landscapes is the information exponent $k^\star$ (Ben Arous et al., 2021), which corresponds to the order of the saddle at initialization in the population landscape. Ben Arous et al. (2021) showed that $n \gtrsim d^{\max(1, k^\star-1)}$ samples were necessary and sufficient for online SGD to recover $\theta^\star$, and Ben Arous et al. (2020) proved a similar lower bound for Langevin dynamics. More recently, Damian et al. (2023) showed it was possible to circumvent these lower bounds by running gradient descent on a smoothed landscape, and that this algorithm succeeds with $n \gtrsim d^{\max(1, k^\star/2)}$ samples, which is optimal in the worst case. This raises the question of whether it is possible to achieve the same rate without explicit smoothing. In this paper, we show that Langevin dynamics can succeed with $n \gtrsim d^{ k^\star/2 }$ samples if one considers the average iterate, rather than the last iterate. The key idea is that the combination of noise-injection and iterate averaging is able to emulate the effect of landscape smoothing. We apply this result to both the tensor PCA and single-index model settings. Finally, we conjecture that minibatch SGD can also achieve the same rate without adding any additional noise.
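
As a worked illustration of the sample-complexity exponents quoted above, the following is plain arithmetic on the stated bounds, with $k^\star = 4$ chosen only as an example value:

```latex
\begin{align*}
\text{online SGD:} \quad & n \gtrsim d^{\max(1,\,k^\star-1)} = d^{3}, \\
\text{smoothed landscape:} \quad & n \gtrsim d^{\max(1,\,k^\star/2)} = d^{2}, \\
\text{averaged Langevin iterate:} \quad & n \gtrsim d^{k^\star/2} = d^{2}.
\end{align*}
```

Under this example value, the averaged iterate would match the smoothed-landscape rate and need a factor of $d$ fewer samples than online SGD, up to whatever factors $\gtrsim$ hides.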

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

563. Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs

Authors:

Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50\% to below 5\% on open-source models. In addition, our approach offers strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

564. Spiking Discrepancy Transformer for Point Cloud Analysis

Authors:

The Spiking Transformer has sparked growing interest, with Spiking Self-Attention merging spikes with self-attention to deliver both energy efficiency and competitive performance. However, existing work primarily focuses on 2D visual tasks, and in the domain of 3D point clouds, the disorder and complexity of spatial information, along with the scale of the point clouds, present significant challenges. For point clouds, we introduce spiking discrepancy, measuring differences in spike features to highlight key information, and then construct the Spiking Discrepancy Attention Mechanism (SDAM). SDAM contains two variants: the Spiking Element Discrepancy Attention captures local geometric correlations between central points and neighboring points, while the Spiking Intensity Discrepancy Attention characterizes structural patterns of point clouds based on macroscopic spike statistics. Moreover, we propose a Spatially-Aware Spiking Neuron. Based on these, we construct a hierarchical Spiking Discrepancy Transformer. Experimental results demonstrate that our method achieves state-of-the-art performance among Spiking Neural Networks and exhibits impressive performance compared to Artificial Neural Networks, with few parameters and significantly lower theoretical energy consumption.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

565. Unified In-Context Video Editing

Authors:

Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and a condition bias that enables the model to clearly differentiate between editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and varying condition tokens "in context", and to support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves performance comparable to task specialists and exhibits emergent task composition abilities.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

566. Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies

Authors:

Copulas are a fundamental tool for modelling multivariate dependencies in data, forming the method of choice in diverse fields and applications. However, the adoption of existing models for multimodal and high-dimensional dependencies is hindered by restrictive assumptions and poor scaling. In this work, we present methods for modelling copulas based on the principles of diffusions and flows. We design two processes that progressively forget inter-variable dependencies while leaving dimension-wise distributions unaffected, provably defining valid copulas at all times. We show how to obtain copula models by learning to remember the forgotten dependencies from each process, theoretically recovering the true copula at optimality. The first instantiation of our framework focuses on direct density estimation, while the second specialises in expedient sampling. Empirically, we demonstrate the superior performance of our proposed methods over state-of-the-art copula approaches in modelling complex and high-dimensional dependencies from scientific datasets and images. Our work enhances the representational power of copula models, empowering applications and paving the way for their adoption on larger scales and more challenging domains.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

567. Exploratory Diffusion Model for Unsupervised Reinforcement Learning

Authors:

Unsupervised reinforcement learning (URL) pre-trains agents by exploring diverse states in reward-free environments, aiming to enable efficient adaptation to various downstream tasks. Without extrinsic rewards, prior methods rely on intrinsic objectives, but heterogeneous exploration data demand strong modeling capacity for both intrinsic reward design and policy learning. We introduce the **Ex**ploratory **D**iffusion **M**odel (**ExDM**), which leverages the expressive power of diffusion models to fit diverse replay-buffer distributions, thus providing accurate density estimates and a score-based intrinsic reward that drives exploration into under-visited regions. This mechanism substantially broadens state coverage and yields robust pre-trained policies. Beyond exploration, ExDM offers theoretical guarantees and practical algorithms for fine-tuning diffusion policies under limited interactions, overcoming instability and computational overhead from multi-step sampling. Extensive experiments on Maze2d and URLB show that ExDM achieves superior exploration and faster downstream adaptation, establishing new state-of-the-art results, particularly in environments with complex structure or cross-embodiment settings.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

568. Video Scene Segmentation with Genre and Duration Signals

Authors:

Video scene segmentation aims to detect semantically coherent boundaries in long-form videos, bridging the gap between low-level visual signals and high-level narrative understanding. However, existing methods primarily rely on visual similarity between adjacent shots, which makes it difficult to accurately identify scene boundaries, especially when semantic transitions do not align with visual changes. In this paper, we propose a novel approach that incorporates production-level metadata, specifically genre conventions and shot duration patterns, into video scene segmentation. Our main contributions are three-fold: (1) we leverage textual genre definitions as semantic priors to guide shot-level representation learning during self-supervised pretraining, enabling better capture of narrative coherence; (2) we introduce a duration-aware anchor selection strategy that prioritizes shorter shots based on empirical duration statistics, improving pseudo-boundary generation quality; (3) we propose a test-time shot splitting strategy that subdivides long shots into segments for improved temporal modeling. Experimental results demonstrate state-of-the-art performance on MovieNet-SSeg and BBC datasets. We introduce MovieChat-SSeg, extending MovieChat-1K with manually annotated scene boundaries across 1,000 videos spanning movies, TV series, and documentaries.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

569. Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM

Authors:

Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks due to their bidirectional attention mechanisms. However, their computational complexity, scaling as $\mathcal{O}(L^3)$ with sequence length $L$, poses significant challenges for long-sequence and real-time applications, primarily due to the lack of compatibility with key-value caching and the non-autoregressive nature of denoising steps. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose **Dynamic-dLLM**, a training-free framework that enhances dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds to balance generation quality and efficiency. Extensive experiments on models like LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across benchmarks such as MMLU, GSM8K, and HumanEval demonstrate that Dynamic-dLLM significantly improves inference speed, attaining an average speedup exceeding 3$\times$ while maintaining performance. Dynamic-dLLM outperforms state-of-the-art acceleration methods and provides a plug-and-play solution for efficient dLLM deployment without compromising performance. Code and models will be made publicly available.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

570. PCB-Bench: Benchmarking LLMs for Printed Circuit Board Placement and Routing

Authors:

Recent advances in Large Language Models (LLMs) have enabled impressive capabilities across diverse reasoning and generation tasks. However, their ability to understand and operate on real-world engineering problems—such as Printed Circuit Board (PCB) placement and routing—remains underexplored due to the lack of standardized benchmarks and high-fidelity datasets. To address this gap, we introduce PCB-Bench, the first comprehensive benchmark designed to systematically evaluate LLMs in the context of PCB design. PCB-Bench spans three complementary task settings: (1) text-based reasoning with approximately 3,700 expert-annotated instances, consisting of over 1,800 question-answer pairs and their corresponding choice question versions, covering component placement, routing strategies, and design rule compliance; (2) multimodal image-text reasoning with approximately 500 problems requiring joint interpretation of PCB visuals and technical specifications, including component identification, function recognition, and visual trace reasoning; (3) real-world design comprehension using over 170 complete PCB projects with schematics, placement files, and design documentation. We design structured evaluation protocols to assess both generative and discriminative capabilities, and conduct extensive comparisons across state-of-the-art LLMs. Our results reveal substantial gaps in current models’ ability to reason over spatial placements, follow domain-specific constraints, and interpret professional engineering artifacts. PCB-Bench establishes a foundational resource for advancing research toward more capable engineering AI, with implications extending beyond PCB design to broader structured reasoning domains. Data and code are available at https://anonymous.4open.science/r/ICLR_submission_PCB-Bench-CDC5.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

571. Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction

Authors:

The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present **PG-Occ**, an innovative **P**rogressive **G**aussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that **PG-Occ** achieves *state-of-the-art* performance with a relative **14.3\% mIoU improvement** over the previous best performing method. The source code and models will be made publicly available upon publication.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

572. Context Parametrization with Compositional Adapters

Authors:

Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, both of these approaches face key limitations: ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAs, a meta-learning framework that translates context into adapter parameters with a compositional structure. Adapters generated this way can be merged algebraically, enabling instructions, demonstrations, or retrieved passages to be seamlessly combined without reprocessing long prompts. Critically, this approach yields three benefits: lower inference cost, robustness to long-context instability, and a principled solution when input exceeds the model’s context window. Furthermore, CompAs encodes information into adapter parameters in a reversible manner, enabling recovery of input context through a decoder, facilitating safety and security. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

573. DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

Authors:

Recent advances towards End-to-End Autonomous Driving (E2E-AD) focus on integrating modular designs into a unified framework for joint optimization. Most of these advances follow a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and rely on dense BEV features to encode scene representations. However, such a manually ordered design can cause information loss and cumulative errors, and it lacks flexible and diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of the image backbone and the quadratic complexity of the attention mechanism also hinder the scalability and efficiency of E2E-AD systems in handling spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are converted into token-level sparse representations in advance, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling to capture task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method is designed to preserve spatial locality from the ego perspective, thus facilitating ego-planning. Extensive experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and efficiency of DriveMamba.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

574. Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Authors:

Recent AI regulations call for data that remain useful for innovation while resistant to misuse, balancing utility with protection at the model level. Existing approaches either perturb data to make it unlearnable or retrain models to suppress transfer, but neither governs inference by unknown models, and both typically require control over training. We propose non-transferable examples (NEs), a training-free and data-agnostic input-side usage-control mechanism. We recode inputs within a model-specific low-sensitivity subspace, preserving outputs for the authorized model while reducing performance on unauthorized models through subspace misalignment. We establish formal bounds that guarantee utility for the authorized model and quantify deviation for unauthorized ones, with the Hoffman-Wielandt inequality linking degradation to spectral differences. Empirically, NEs retain performance on diverse vision backbones and state-of-the-art vision-language models under common preprocessing, while non-targets collapse even under severe distortions. These results establish NEs as a practical means to preserve intended data utility while preventing unauthorized exploitation.
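For readers unfamiliar with the Hoffman-Wielandt inequality named above, its standard form for Hermitian matrices is given here as background. The abstract does not say which matrices the authors apply it to, so $A$ and $B$ are generic placeholders, with eigenvalues $\lambda_1 \ge \dots \ge \lambda_n$ sorted in non-increasing order:

$$
\sum_{i=1}^{n} \bigl(\lambda_i(A) - \lambda_i(B)\bigr)^2 \;\le\; \lVert A - B \rVert_F^2 .
$$

Read from right to left, the bound says that two matrices with very different spectra must be far apart in Frobenius norm, which is one plausible way that spectral differences between models can be tied to a deviation bound.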

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

575. PU-Bench: A Unified Benchmark for Rigorous and Reproducible PU Learning

Authors:

Positive-Unlabeled (PU) learning, a challenging paradigm for training binary classifiers from only positive and unlabeled samples, is fundamental to many applications. While numerous PU learning methods have been proposed, the research is systematically hindered by the lack of a standardized and comprehensive benchmark for rigorous evaluation. Inconsistent data generation, disparate experimental settings, and divergent metrics have led to irreproducible findings and unsubstantiated performance claims. To address this foundational challenge, we introduce **PU-Bench**, the first unified open-source benchmark for PU learning. PU-Bench provides: 1) a unified data generation pipeline to ensure consistent input across configurable sampling schemes, label ratios and labeling mechanisms; 2) an integrated framework of 16 state-of-the-art PU methods; and 3) standardized protocols for reproducible assessment. Through a large-scale empirical study on 8 diverse datasets (**2,560** evaluations in total), PU-Bench reveals a complex yet intuitive performance landscape, uncovering critical trade-offs between effectiveness and efficiency, as well as trade-offs involving robustness, label frequency, and selection bias. It is anticipated to serve as a foundational resource to catalyze reproducible, rigorous, and impactful research in the PU learning community.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

576. Arbitrary-Shaped Image Generation via Spherical Neural Field Diffusion

Authors:

Existing diffusion models excel at generating diverse content, but remain confined to fixed image shapes and lack the ability to flexibly control spatial attributes such as viewpoint, field-of-view (FOV), and resolution. To fill this gap, we propose Arbitrary-Shaped Image Generation (ASIG), the first generative framework that enables precise spatial attribute control while supporting high-quality synthesis across diverse image shapes (e.g., perspective, panoramic, and fisheye). ASIG introduces two key innovations: (1) a mesh-based spherical latent diffusion to generate a complete scene representation, with a seam-enforcement denoising strategy to maintain semantic and spatial consistency across viewpoints; and (2) a spherical neural field to sample arbitrary regions from the scene representation with coordinate conditions, enabling distortion-free generation at flexible resolutions. Together, these components give ASIG precise control over spatial attributes within a unified framework and high-quality generation across diverse image shapes. Experiments demonstrate clear improvements over prior methods specifically designed for individual shapes.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

577. The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus

Authors:

Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix self-consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands only a subset of promising paths, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, Math500, AIME 2024/2025, and GPQA-Diamond, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.
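The cluster-then-expand idea described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the token-set Jaccard similarity, the greedy clustering, the `threshold` and `budget` parameters, and the function names are all hypothetical stand-ins for whatever PoLR actually uses.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_prefixes(prefixes, threshold=0.5):
    """Greedy clustering: each prefix joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # each cluster is a list of prefix indices
    reps = []      # token set of each cluster's first member
    for i, text in enumerate(prefixes):
        tokens = set(text.lower().split())
        for cid, rep in enumerate(reps):
            if jaccard(tokens, rep) >= threshold:
                clusters[cid].append(i)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([i])
            reps.append(tokens)
    return clusters


def select_paths_to_expand(prefixes, budget, threshold=0.5):
    """Return up to `budget` prefix indices drawn from the largest cluster."""
    clusters = cluster_prefixes(prefixes, threshold)
    dominant = max(clusters, key=len)
    return dominant[:budget]


# Hypothetical short prefixes of five sampled reasoning traces.
prefixes = [
    "let x be the number of apples",
    "let x be the number of apples sold",
    "first compute the total cost",
    "let x be the apples",
    "first compute the cost",
]
chosen = select_paths_to_expand(prefixes, budget=2)
print(chosen)
```

Only the traces whose indices are returned would be expanded to full length; the remaining prefixes are discarded, which is where the token savings would come from in a scheme of this kind.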

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

578. Multi-Head Low-Rank Attention

Authors:

Long-context inference in large language models is bottlenecked by Key-Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip to on-chip memory at each step. Recent architectures like Multi-Head Latent Attention (MLA) significantly reduce the KV cache size to $4.5d_h$ per token per layer while maintaining high model quality. However, when using tensor parallelism (TP) with sufficient devices for inference, MLA still decodes slower than Grouped-Query Attention (GQA) because its single latent vector cannot be sharded, forcing each device to load $4.5 d_h$ versus $2 d_h$ for GQA. In this work, we propose Multi-Head Low-Rank Attention (MLRA), a TP-friendly attention mechanism that slashes the per-device KV cache under TP to just $1.5 d_h$. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance, while also delivering a 2.8$\times$ decoding speedup over MLA.
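The per-device cache figures quoted above can be turned into a quick back-of-the-envelope comparison. The multipliers ($4.5 d_h$, $2 d_h$, $1.5 d_h$) come from the abstract; the head dimension `d_h=128` and the 2-byte element size are assumptions chosen only to produce concrete numbers.

```python
# Per-device KV-cache size per token per layer, in units of d_h,
# as quoted in the abstract for tensor-parallel decoding.
KV_UNITS = {"MLA": 4.5, "GQA": 2.0, "MLRA": 1.5}


def kv_bytes_per_token_per_layer(method, d_h=128, bytes_per_elem=2):
    """Bytes each device loads per token per layer.
    d_h and bytes_per_elem are illustrative assumptions, not paper values."""
    return KV_UNITS[method] * d_h * bytes_per_elem


def load_ratio(baseline, method):
    """How many times more cache `baseline` loads per device than `method`."""
    return KV_UNITS[baseline] / KV_UNITS[method]


for name in KV_UNITS:
    print(name, kv_bytes_per_token_per_layer(name), "bytes")
print("MLA / MLRA load ratio:", load_ratio("MLA", "MLRA"))
```

Under these assumptions MLRA loads one third of MLA's per-device cache. The abstract reports a 2.8$\times$ decoding speedup over MLA, but it does not state how that figure relates to the cache ratio, so the two numbers should not be equated.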

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

579. Diffusion Alignment as Variational Expectation-Maximization

Authors:

Diffusion alignment aims to optimize diffusion models for the downstream objective. While existing methods based on reinforcement learning or direct backpropagation achieve considerable success in maximizing rewards, they often suffer from reward over-optimization and mode collapse. We introduce Diffusion Alignment as Variational Expectation-Maximization (DAV), a framework that formulates diffusion alignment as an iterative process alternating between two complementary phases: the E-step and the M-step. In the E-step, we employ test-time search to generate diverse and reward-aligned samples. In the M-step, we refine the diffusion model using samples discovered by the E-step. We demonstrate that DAV can optimize reward while preserving diversity for both continuous and discrete tasks: text-to-image synthesis and DNA sequence design.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

580. Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning

Authors:

Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process, where the model iteratively predicts detection outcomes and natural language reasoning conditioned on past outputs. During training, ARMOR learns from heterogeneous supervision - large-scale sparse binary labels and small-scale rich reasoning annotations - optimized via a combination of offline and online imitation learning. At inference time, ARMOR generates multiple refinement trajectories and selects the most confident prediction via a self-certainty metric. Experiments across diverse environments show that ARMOR achieves state-of-the-art performance by improving over the previous approaches by up to 30\% on failure detection rate and up to 100\% in reasoning measured through LLM fuzzy match score, demonstrating robustness to heterogeneous supervision and open-ended reasoning beyond predefined failure modes.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

581. GOOD: Geometry-guided Out-of-Distribution Modeling for Open-set Test-time Adaptation in Point Cloud Semantic Segmentation

Authors:

Open-set Test-time Adaptation (OSTTA) has been introduced to address the challenges of both online model optimization and open-set recognition. Despite the demonstrated success of OSTTA methodologies in 2D image recognition, their application to 3D point cloud semantic segmentation is still hindered by the complexities of point cloud data, particularly the imbalance between known (in-distribution, ID) and unknown (out-of-distribution, OOD) data, where known samples dominate and unknown instances are often sparse or even absent. In this paper, we propose a simple yet effective strategy, termed Geometry-guided Out-of-Distribution Modeling (GOOD), specifically designed to address OSTTA for 3D point cloud semantic segmentation. Technically, we first leverage geometric priors to cluster the point cloud into superpoints, thereby mitigating the numerical disparity between individual points and providing a more structured data representation. Then, we introduce a novel confidence metric to effectively distinguish between known and unknown superpoints. Additionally, prototype-based representations are integrated to enhance the discrimination between ID and OOD regions, facilitating robust segmentation. We validate the efficacy of GOOD across four benchmark datasets. Remarkably, on the Synth4D to SemanticKITTI task, GOOD outperforms HGL by 1.93%, 8.99%, and 7.91% in mIoU, AUROC, and FPR95, respectively.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

582. Learning a Game by Paying the Agents

Authors:

We study the problem of learning the utility functions of no-regret learning agents in a repeated normal-form game. Differing from most prior literature, we introduce a principal with the power to observe the agents playing the game, send agents signals, and give agents *payments* as a function of their actions. We show that the principal can, using a number of rounds polynomial in the size of the game, learn the utility functions of all agents to any desired precision $\varepsilon > 0$, for any no-regret learning algorithms of the agents. Our main technique is to formulate a zero-sum game between the principal and the agents, where the principal's strategy space is the set of all payment functions. Finally, we discuss implications for the problem of *steering* agents to a desired equilibrium: in particular, we introduce, using our utility-learning algorithm as a subroutine, the first algorithm for steering arbitrary no-regret learning agents without prior knowledge of their utilities.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

583. Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility

Authors:

Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality. In this work, we identify linear representations that discriminate between modal categories within a variety of LMs, or modal difference vectors. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

584. Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data

Authors:

Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures, and functional dependencies. We propose the _Relational Transformer (RT)_, a cell-level architecture pretrained on diverse relational databases and directly applicable to unseen datasets and tasks, without any need for task- or dataset-specific fine-tuning or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel _Relational Attention_ mechanism over columns, rows, and primary–foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance; on binary classification it averages 94\% of fully supervised AUROC in a single forward pass, and fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT’s zero-shot transfer harnesses task-table context, column and feature attention, and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

585. Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

Authors:

We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly-generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features highly diverse facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the state-of-the-art (SoTA) by over 15\% in terms of geometric accuracy for posed facial expressions.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

586. Demystifying Emergent Exploration in Goal-Conditioned RL

Authors:

In this work, we take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning. We study Single-Goal Contrastive Reinforcement Learning (SGCRL) (Liu et al., 2025), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula. We combine theoretical analysis of the algorithm’s objective function with controlled experiments to understand what drives its exploration. We show that SGCRL maximizes implicit rewards shaped by its learned representations. These representations automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter. Our experiments also demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation. Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

587. Towards a Comprehensive Scaling Law of Mixture-of-Experts

Authors:

Mixture-of-Experts (MoE) models have become the consensus approach for enabling parameter-efficient scaling and cost-effective deployment in large language models. However, existing scaling laws for dense models are inapplicable to MoE models, which stems from three critical challenges: the multiplicity of influencing factors, their intricate coupling relationships and the non-monotonic nature of their performance impacts. They collectively necessitate a fine-grained investigation into MoE-specific scaling laws. In this work, we perform a systematic decomposition of MoE settings, identifying five key factors that influence model performance from both size and structural perspectives (data size ($D$), total model size ($N$), activated model size ($N_a$), number of active experts ($G$) and the ratio of shared experts ($S$)). Specifically, we design $446$ controlled experiments to characterize their marginal effects, ultimately constructing a comprehensive and precise joint MoE scaling law that considers all essential factors. Furthermore, we derive the theoretically optimal and practically efficiency-aware optimal configurations for $G$, $S$ and $N_a/N$ with detailed analyses. Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and data size. With the scaling of $N$, the optimal activation parameter ratio of $N_a/N$ becomes sparser. Our proposed MoE scaling law could function as an accurate and insightful guidance to facilitate future MoE model design and training.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

588. Efficient Zero-shot Inpainting with Decoupled Diffusion Guidance

Authors:

Diffusion models have emerged as powerful priors for image editing tasks such as inpainting and local modification, where the objective is to generate realistic content that remains consistent with observed regions. In particular, zero-shot approaches that leverage a pretrained diffusion model, without any retraining, have been shown to achieve highly effective reconstructions. However, state-of-the-art zero-shot methods typically rely on a sequence of surrogate likelihood functions, whose scores are used as proxies for the ideal score. This procedure requires vector-Jacobian products through the denoiser at every reverse step, introducing significant memory and runtime overhead. To address this issue, we propose a new likelihood surrogate that yields Gaussian posterior transitions that are simple and efficient to sample, sidestepping backpropagation through the denoiser network. Our extensive experiments show that our method achieves strong observation consistency compared with fine-tuned baselines and produces coherent, high-quality reconstructions, all while significantly reducing inference cost.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

589. Reversible Primitive–Composition Alignment for Continual Vision–Language Learning

Authors:

Vision-language (VL) models are increasingly deployed in non-stationary settings, yet under sequential adaptation they often preserve primitive recognition while losing compositional structure, especially with tight rehearsal budgets and no task IDs. We address this gap by asking how a continual VL system can maintain structurally dependable behaviour while safeguarding zero-shot performance. We introduce Compo-ReAlign, a structure-first recipe built around three components: a reversible composer that maps primitive embeddings to compositions by design, a multi-positive InfoNCE that jointly aligns textual and composed views of the same target, and a spectral trust region that clips updates when alignment sensitivity inflates. Across compositional DIL and multi-domain MTIL retrieval, Compo-ReAlign sets a new state of the art, improves over the strongest prior by +2.4 R@1, and reduces forgetting by 40%. We provide a compact, reversible alignment head with geometry-aware training for compositionally robust VL continual learning.

📊 Review Scores

Average score: 6.00

Minimum score: 6

Maximum score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

590. Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities

Authors:

Reasoning on Temporal Knowledge Graphs (TKGs) is essential for predicting future events and time-aware facts. While existing methods are effective at capturing relational dynamics, their performance is limited by a closed-world assumption, which fails to account for emerging entities not present in the training data. Notably, these entities continuously join the network without historical interactions. Our empirical study reveals that emerging entities are widespread in TKGs, comprising roughly 25\% of all entities. The absence of historical interactions for these entities leads to significant performance degradation in reasoning tasks. However, we observe that entities with semantic similarities often exhibit comparable interaction histories, suggesting the presence of transferable temporal patterns. Inspired by this insight, we propose TransFIR (Transferable Inductive Reasoning), a novel framework that leverages historical interaction sequences from semantically similar known entities to support inductive reasoning. Specifically, we propose a codebook-based classifier that categorizes emerging entities into latent semantic clusters, allowing them to adopt reasoning patterns from similar entities. Experimental results demonstrate that TransFIR outperforms all baselines in reasoning on emerging entities, achieving an average improvement of 28.6\% in Mean Reciprocal Rank (MRR) across multiple datasets. The implementations are available at https://anonymous.4open.science/r/TransFIR-C72F.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

591. LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora

Authors:

Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose Linear Graph-based Retrieval-Augmented Generation (LinearRAG), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four benchmark datasets demonstrate that LinearRAG significantly outperforms baseline models. Our code and datasets are available at https://anonymous.4open.science/r/LinearRAG-C205/.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

592. ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes

Authors:

Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object–scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object–scene composition primarily concerns the object’s appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object’s placement. Specifically, we capture a 360° reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object–scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. The code and dataset will be publicly released.
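Octahedral probes generally store directional data in a 2D texture through the octahedral mapping of the unit sphere. The sketch below shows that standard mapping only; how SOPs lay out lighting and occlusion, and how they interpolate between probes, is not described here, so the function names and their use are assumptions.

```python
import math


def _sign_not_zero(value):
    return 1.0 if value >= 0.0 else -1.0


def _fold(u, v):
    # Reflect the lower hemisphere across the diagonals of the unit square.
    return (1.0 - abs(v)) * _sign_not_zero(u), (1.0 - abs(u)) * _sign_not_zero(v)


def octahedral_encode(direction):
    """Map a non-zero 3D direction to (u, v) in [-1, 1]^2."""
    x, y, z = direction
    norm1 = abs(x) + abs(y) + abs(z)
    if norm1 == 0.0:
        raise ValueError("direction must be non-zero")
    u, v = x / norm1, y / norm1
    return _fold(u, v) if z < 0.0 else (u, v)


def octahedral_decode(uv):
    """Inverse mapping: (u, v) in [-1, 1]^2 back to a unit direction."""
    u, v = uv
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        u, v = _fold(u, v)
    length = math.sqrt(u * u + v * v + z * z)
    return u / length, v / length, z / length
```

Because every direction lands on a regular 2D grid, a stored quantity can be looked up with ordinary bilinear interpolation instead of tracing a ray.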

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

593. CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs

Authors:

Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher–student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

594. Spotlight on Token Perception for Multimodal Reinforcement Learning

Authors:

While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose $\textbf{V}$isually-$\textbf{P}$erceptive $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{VPPO}$), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

595. DR-Submodular Maximization with Stochastic Biased Gradients: Classical and Quantum Gradient Algorithms

Authors:

In this work, we investigate DR-submodular maximization using stochastic biased gradients, which is a more realistic but challenging setting than stochastic unbiased gradients. We first generalize the Lyapunov framework to incorporate biased stochastic gradients, characterizing the adverse impacts of bias and noise. Leveraging this framework, we consider not only conventional constraints but also a novel constraint class: convex sets with a largest element, which naturally arises in applications such as resource allocation. For this constraint, we propose a $1/e$-approximation algorithm for non-monotone DR-submodular maximization, surpassing the $1/4$ hardness result for general convex constraints. As a direct application of stochastic biased gradients, we consider zero-order DR-submodular maximization and introduce both classical and quantum gradient estimation algorithms. For each constraint class we consider, while retaining the same approximation ratio, the iteration complexity of our classical zero-order algorithms is $O(\epsilon^{-3})$, matching that of stochastic unbiased gradients; our quantum zero-order algorithms reach $O(\epsilon^{-1})$ iteration complexity, on par with classical first-order algorithms, demonstrating quantum acceleration that is validated in numerical experiments.
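To see why zero-order gradient estimates are biased, consider the small sketch below. It uses a plain forward-difference estimator on a toy DR-submodular quadratic. Both the estimator and the objective are illustrative assumptions, not the classical or quantum estimators proposed in the paper.

```python
def forward_difference_gradient(f, x, h=1e-3):
    """Coordinate-wise forward differences: uses only function values, no gradients."""
    base = f(x)
    gradient = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        gradient.append((f(shifted) - base) / h)
    return gradient


def toy_objective(x):
    """f(x) = s - s^2 / 2 with s = sum(x); every Hessian entry is -1, so f is DR-submodular."""
    s = sum(x)
    return s - 0.5 * s * s
```

For this objective the true partial derivative is 1 - s, while the estimator returns 1 - s - h/2. The bias is exactly -h/2, so it shrinks with the step size but never vanishes for h > 0.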

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

596. Riemannian Variational Flow Matching for Material and Protein Design

Authors:

We present Riemannian Gaussian Variational Flow Matching (RG-VFM), a geometric extension of Variational Flow Matching (VFM) for generative modeling on manifolds. In Euclidean space, predicting endpoints (VFM), velocities (FM), or noise (diffusion) are largely equivalent due to affine interpolations. On curved manifolds this equivalence breaks down, and we hypothesize that endpoint prediction provides a stronger learning signal by directly minimizing geodesic distances. Building on this insight, we derive a variational flow matching objective based on Riemannian Gaussian distributions, applicable to manifolds with closed-form geodesics. We formally analyze its relationship to Riemannian Flow Matching (RFM), exposing that the RFM objective lacks a curvature-dependent penalty - encoded via Jacobi fields - that is naturally present in RG-VFM. Experiments on synthetic spherical and hyperbolic benchmarks, as well as real-world tasks in material and protein generation, demonstrate that RG-VFM more effectively captures manifold structure and improves downstream performance over Euclidean and velocity-based baselines.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

597. ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

Authors:

Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

598. Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks Preserving Action Understanding Ability

Authors:

Temporal Video Grounding (TVG) aims to localize video segments corresponding to a given textual query, which often describes human actions. However, we observe that current methods, usually optimizing for high temporal Intersection-over-Union (IoU), frequently struggle to accurately recognize or understand the underlying actions in both the video and query, thus reducing the effectiveness of these methods. To address this, we propose a novel TVG framework that integrates inversion-based TVG as auxiliary objectives to maintain the model's action understanding ability. We introduce three kinds of inversion TVG tasks derived from the original TVG annotations: (1) Verb Completion, predicting masked verbs (actions) in queries given video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions containing query-relevant actions given video segments. These inversion tasks are entirely derived from the original TVG tasks and are probabilistically integrated with them within a reinforcement learning framework. By leveraging carefully designed reward functions, the model preserves its ability to understand actions, thereby improving the accuracy of temporal grounding. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1\% improvement in R1@0.7 on Charades-STA for a 3B model.
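The temporal IoU and the R1@0.7 metric mentioned above can be sketched as follows. Segments are assumed to be (start, end) pairs in seconds with one top-1 prediction per query; the function names are hypothetical.

```python
def temporal_iou(predicted, ground_truth):
    """Intersection-over-Union of two (start, end) time segments."""
    p_start, p_end = predicted
    g_start, g_end = ground_truth
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(predictions, ground_truths, threshold=0.7):
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    if not ground_truths:
        raise ValueError("ground_truths must not be empty")
    hits = sum(
        temporal_iou(p, g) >= threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)
```

Note that this metric only checks segment overlap and never checks whether the action itself was recognized, which is the gap the inversion tasks are meant to address.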

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

599. Debiased and Denoised Projection Learning for Incomplete Multi-view Clustering

Authors:

Multi-view clustering achieves outstanding performance but relies on the assumption of complete multi-view samples. However, certain views may be partially unavailable due to failures during acquisition or storage, resulting in distribution shifts across views. Although some incomplete multi-view clustering (IMVC) methods have been proposed, they still face the following limitations: 1) missing-view data imputation methods add unnecessary computational complexity; 2) consensus representation imputation methods ignore the inter-view distribution bias caused by missing views. To tackle these issues, we propose a novel IMVC method based on projection debiasing and denoising (PDD). Specifically, it utilizes the unbiased projection learned from complete views to refine the biased projection learned from data with missing views. Additionally, we introduce robust contrastive learning for the consensus projection to mitigate the risk of cluster collapse induced by misalignment noise. Comprehensive experiments demonstrate that PDD achieves superior performance compared with state-of-the-art methods.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

600. Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Authors:

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs to achieve rapid convergence. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

601. A Step to Decouple Optimization in 3DGS

Authors:

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. As an explicit representation optimized through gradient propagation among primitives, 3DGS adopts optimization practices widely accepted in deep neural networks (DNNs), such as synchronous weight updating and Adam with adaptive gradients. However, considering the physical significance and specific design of 3DGS, two details of its optimization have been overlooked: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which may lead to under- or over-effective regularization. Nevertheless, such complex coupling is under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into: Sparse Adam, Re-State Regularization and Decoupled Attribute Regularization. Through extensive experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, under which better optimization efficiency and representation effectiveness are achieved simultaneously.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

602. Relational Feature Caching for Accelerating Diffusion Transformers

Authors:

Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps, and exploiting them for subsequent steps to reduce redundant computations. Recent forecasting-based caching approaches employ temporal extrapolation techniques to approximate the output features with cached ones. Although effective, relying exclusively on temporal extrapolation still suffers from significant prediction errors, leading to performance degradation. Through a detailed analysis, we find that 1) these errors stem from the irregular magnitude of changes in the output features, and 2) an input feature of a module is strongly correlated with the corresponding output. Based on this, we propose relational feature caching (RFC), a novel framework that leverages the input-output relationship to enhance the accuracy of the feature prediction. Specifically, we introduce relational feature estimation (RFE) to estimate the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions. We also present relational cache scheduling (RCS), which estimates the prediction errors using the input features and performs full computations only when the errors are expected to be substantial. Extensive experiments across various DiT models demonstrate that RFC consistently and significantly outperforms prior approaches. We will release our code publicly upon acceptance.
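The temporal extrapolation that forecasting-based caching relies on can be sketched in its simplest first-order form. This is the baseline idea the abstract criticizes, not RFC itself. The function name and the assumption of equally spaced cached steps are illustrative.

```python
def extrapolate_feature(cached_previous, cached_current, steps_ahead=1):
    """First-order forecast: assume the last observed change repeats at each step."""
    return [
        current + steps_ahead * (current - previous)
        for previous, current in zip(cached_previous, cached_current)
    ]
```

If a feature moves 1.0 to 2.0 to 4.0, the forecast from the first two values is 3.0, an error of 1.0. This is the kind of irregular change in magnitude that pure temporal extrapolation cannot anticipate.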

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

603. RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation

Authors:

Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths and maximizing sample likelihood. RMFlow achieves competitive, often (near) state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using 1-NFE, at a comparable computational cost to the baseline MeanFlows.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

604. SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution

Authors:

The state space model (SSM) has garnered significant attention recently due to its exceptional long-range modeling capabilities achieved with linear-time complexity, enabling notable success in efficient super-resolution. However, applying SSMs to vision tasks typically requires scanning 2D visual data as a 1D sequence, which disrupts inherent semantic relationships and introduces artifacts and distortions during image restoration. To address these challenges, we propose a novel SP-MoMamba method that integrates SSMs with the semantic preservation capability of superpixels and the efficiency advantage of Mixture-of-Experts (MoE). Specifically, we pioneer the use of superpixel features as semantic units to reconstruct the SSM scanning method, proposing the Superpixel-driven State Space Model (SP-SSM) as a basic building block of SP-MoMamba. Furthermore, we introduce the Multi-Scale Superpixel Mixture of State Space Experts (MSS-MoE) scheme to strategically integrate SP-SSMs across scales, effectively harnessing the complementary semantic information from multiple experts. This multi-scale expert integration significantly reduces the number of pixels processed by each SSM while enhancing the reconstruction of fine details through specialized experts operating at different semantic scales. This framework enables our model to deliver superior performance with minimal computational overhead.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

605. HiCache: A Plug-in Scaled-Hermite Upgrade for Taylor-Style Cache-then-Forecast Diffusion Acceleration

Authors:

Diffusion models have achieved remarkable success in content generation but suffer from prohibitive computational costs due to iterative sampling. While recent feature caching methods aim to accelerate inference through temporal extrapolation, these methods still suffer from severe quality loss because they fail to model the complex dynamics of feature evolution. To solve this problem, this paper presents HiCache (Hermite Polynomial-based Feature Cache), a training-free acceleration framework that fundamentally improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature derivative approximations in Diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials, the potentially theoretically optimal basis for Gaussian-correlated processes. In addition, we introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy, which is also effective when applied standalone to TaylorSeer. Extensive experiments demonstrate HiCache's superiority: it achieves a $5.55\times$ speedup on FLUX.1-dev while exceeding baseline quality, and maintains strong performance across text-to-image, video generation, and super-resolution tasks. Moreover, HiCache can be naturally added to previous caching methods to enhance their performance, e.g., improving ClusCa from $0.9480$ to $0.9840$ in terms of image rewards. Our code is included in the supplementary material, and will be released on GitHub.
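As background for the Hermite basis, the sketch below evaluates the probabilists' Hermite polynomials with their three-term recurrence. Which Hermite convention HiCache uses, and how its dual scaling enters, is not stated here, so this convention and the function name are assumptions.

```python
def hermite_basis(x, max_order):
    """Values He_0(x) .. He_max_order(x) via He_{n+1}(x) = x * He_n(x) - n * He_{n-1}(x)."""
    if max_order < 0:
        raise ValueError("max_order must be non-negative")
    values = [1.0]
    if max_order >= 1:
        values.append(float(x))
    for n in range(1, max_order):
        values.append(x * values[n] - n * values[n - 1])
    return values
```

These polynomials are orthogonal under the standard Gaussian weight, which is why they are a natural candidate when the quantities being forecast look Gaussian.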

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

606. Consistent Text-to-Image Generation via Scene De-Contextualization

Authors:

Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-subject correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I’s built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-subject correlation within the ID prompt’s embedding by quantifying SVD directional stability to re-weight the corresponding eigenvalues adaptively. Critically, SDeC allows for per-scene use (one prompt per scene) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

607. Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping

Authors:

Large language models (LLMs) often exhibit flawed reasoning ability that undermines reliability. Existing approaches to improving reasoning typically treat it as a general and monolithic skill, applying broad training that is inefficient and unable to target specific reasoning errors. We introduce Reasoning Editing, a paradigm for selectively modifying specific reasoning patterns in LLMs while preserving other reasoning pathways. This task presents a fundamental trade-off between Generality, the ability of an edit to generalize across different tasks sharing the same reasoning pattern, and Locality, the ability to preserve other reasoning capabilities. Through systematic investigation, we uncover the Circuit-Interference Law: edit interference between reasoning patterns is proportional to the overlap of their neural circuits. Guided by this principle, we propose REdit, the first framework to actively reshape neural circuits before editing, thereby modulating interference between reasoning patterns and mitigating the trade-off. REdit integrates three components: (i) Contrastive Circuit Reshaping, which directly addresses the generality-locality trade-off by disentangling overlapping circuits; (ii) Meta-Contrastive Learning, which extends transferability to novel reasoning patterns; and (iii) Dual-Level Protection, which preserves preexisting abilities by constraining reshaping update directions and regularizing task-level predictions. Extensive experiments with Qwen-2.5-3B on propositional logic reasoning tasks across three difficulty levels demonstrate that REdit consistently achieves superior generality and locality compared to baselines, with additional validation in mathematics showing broader potential. Our code is available at https://anonymous.4open.science/r/REdit-DBD8.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

608. Train on Validation (ToV): Fast data selection with applications to fine-tuning

Authors:

State-of-the-art machine learning often follows a two-stage process: $(i)$ pre-training on large, general-purpose datasets; $(ii)$ fine-tuning on task-specific data. In fine-tuning, selecting training examples that closely reflect the target distribution is crucial. However, it is often the case that only a few samples are available from the target distribution. Existing data selection methods treat these target samples as a validation set and estimate the effect of adding or removing a single sample from the training pool by performing inference on the validation set. We propose a simpler and faster alternative that inverts the usual role of train and validation: we perform inference on the training pool before and after fine-tuning on the validation set. We then select samples whose predictions change the most. Our key insight is that the training samples most affected by fine-tuning on a small validation set tend to be the most beneficial for reducing test loss on the target distribution. Experiments on instruction tuning and named entity recognition tasks show that, in most cases, our method achieves lower test log-loss than state-of-the-art approaches. We support our findings with theoretical analysis.
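The selection rule described above can be sketched in a few lines. The sketch assumes that per-sample scalar predictions on the training pool have already been recorded before and after fine-tuning on the validation set. Measuring change as an absolute difference is an assumption, and the function name is hypothetical.

```python
def select_by_prediction_change(predictions_before, predictions_after, k):
    """Return indices of the k pool samples whose predictions moved the most."""
    if len(predictions_before) != len(predictions_after):
        raise ValueError("prediction lists must have the same length")
    changes = [
        abs(after - before)
        for before, after in zip(predictions_before, predictions_after)
    ]
    ranked = sorted(range(len(changes)), key=lambda i: changes[i], reverse=True)
    return ranked[:k]
```

The cost is two inference passes over the training pool plus one short fine-tuning run on the validation set, with no per-sample retraining.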

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

609. Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Authors:

Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our anonymous code is available at https://anonymous.4open.science/r/tpc-anon-0708.
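The term-by-term evaluation with early exit can be illustrated with a toy polynomial probe. The term parameterization, the exit rule based on the magnitude of the running logit, and all names below are assumptions made for illustration; the paper's actual TPC construction may differ.

```python
def linear_term(weights, bias):
    """Degree-1 term: bias + w . x (this is an ordinary linear probe)."""
    return lambda x: bias + sum(w * xi for w, xi in zip(weights, x))


def quadratic_term(matrix):
    """Degree-2 term: x^T M x."""
    def term(x):
        size = len(x)
        return sum(matrix[i][j] * x[i] * x[j] for i in range(size) for j in range(size))
    return term


def evaluate_truncated(terms, x, max_terms=None, exit_margin=None):
    """Add terms in order; stop at max_terms or once |logit| reaches exit_margin."""
    logit, used = 0.0, 0
    for term in terms[:max_terms]:
        logit += term(x)
        used += 1
        if exit_margin is not None and abs(logit) >= exit_margin:
            break
    return logit, used
```

Setting `max_terms` corresponds to the "safety dial" usage, while `exit_margin` corresponds to the adaptive cascade, in which confident inputs stop after the cheap low-order terms.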

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

610. CAVINR: Coordinate-Aware Attention for Video Implicit Neural Representations

Authors:

Implicit Neural Representations (INRs) have emerged as a compelling paradigm, with Neural Representations for Videos (NeRV) achieving remarkable compression ratios by encoding videos as neural network parameters. However, existing NeRV-based approaches face fundamental scalability limitations: computationally expensive per-video optimization through iterative gradient descent and convolutional architectures with shared kernel parameters that provide weak pixel-level control and limit global dependency modeling essential for high-fidelity reconstruction. We introduce CAVINR, a pure transformer framework that fundamentally departs from convolutional approaches by leveraging persistent cross-attention mechanisms. CAVINR introduces three contributions: a transformer encoder that compresses videos into compact video tokens encoding spatial textures and temporal dynamics; a coordinate-attentive decoder utilizing persistent weights and cross-attention between coordinate queries and video tokens; and temperature-modulated attention with block query processing that enhances reconstruction fidelity while reducing memory complexity. Comprehensive experiments demonstrate CAVINR's superior performance: 6-9 dB PSNR improvements over state-of-the-art methods, $10^5\times$ encoding acceleration compared to gradient-based optimization, $85-95\%$ memory reduction, and 7.5$\times$ faster convergence with robust generalization across diverse video content, enabling practical deployment for large-scale video processing applications.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

611. No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

Authors:

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, while ignoring those where all responses receive the same reward (so-called zero-variance prompts). In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
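The reason zero-variance prompts contribute no signal under group-normalized advantages can be seen in a short sketch. It uses the commonly cited GRPO-style normalization (reward minus group mean, divided by group standard deviation). The epsilon value and the use of the population standard deviation are assumptions, and the entropy-guided shaping of RL-ZVP is not reproduced here.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]
```

When every response to a prompt is correct (or every response is wrong), each reward equals the group mean, so all advantages are exactly zero and the prompt produces no policy gradient.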

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

612. EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

Authors:

A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like “clever but clueless interns” in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients—by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state–action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

613. Unified Privacy Guarantees for Decentralized Learning via Matrix Factorization

Authors:

Decentralized Learning (DL) enables users to collaboratively train models without sharing raw data by iteratively averaging local updates with neighbors in a network graph. This setting is increasingly popular for its scalability and its ability to keep data local under user control. Strong privacy guarantees in DL are typically achieved through Differential Privacy (DP), with results showing that DL can even amplify privacy by disseminating noise across peer-to-peer communications. Yet in practice, the observed privacy-utility trade-off often appears worse than in centralized training, which may be due to limitations in current DP accounting methods for DL. In this paper, we show that recent advances in centralized DP accounting based on Matrix Factorization (MF) for analyzing temporal noise correlations can also be leveraged in DL. By generalizing existing MF results, we show how to cast both standard DL algorithms and common trust models into a unified formulation. This yields tighter privacy accounting for existing DP-DL algorithms and provides a principled way to develop new ones. To demonstrate the approach, we introduce MAFALDA-SGD, a gossip-based DL algorithm with user-level correlated noise that outperforms existing methods on synthetic and real-world graphs.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

614. D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction

Authors:

Recent advances in 3D Gaussian Splatting (3DGS) enable real-time, high-fidelity novel view synthesis (NVS) with explicit 3D representations. However, performance degradation and instability remain significant under sparse-view conditions. In this work, we identify two key failure modes under sparse-view conditions: overfitting in regions with excessive Gaussian density near the camera, and underfitting in distant areas with insufficient Gaussian coverage. To address these challenges, we propose a unified framework, D$^2$GS, comprising two key components: a Depth-and-Density Guided Dropout strategy that suppresses overfitting by adaptively masking redundant Gaussians based on density and depth, and a Distance-Aware Fidelity Enhancement module that improves reconstruction quality in under-fitted far-field areas through targeted supervision. Moreover, we introduce a new evaluation metric to quantify the stability of learned Gaussian distributions, providing insights into the robustness of sparse-view 3DGS. Extensive experiments on multiple datasets demonstrate that our method significantly improves both visual quality and robustness under sparse-view conditions. The source code and trained models will be made publicly available.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

615. FlexHiNM-GP: Flexible Hierarchical Pruning via Region Allocation and Channel Permutation

Authors:

N:M sparsity has emerged as a hardware-friendly pruning strategy, notably supported by NVIDIA’s Sparse Tensor Cores. While efficient, its fixed sparsity ratio restricts flexibility, making it difficult to adapt pruning granularity to varying weight importance across layers and architectures. To overcome this limitation, we propose FlexHiNM, a hybrid framework that adaptively partitions each layer into three regions: dense, vector-pruned, and N:M sparse, enabling finer-grained control while preserving hardware compatibility. To better preserve salient weights, we extend this to FlexHiNM-GP, which incorporates Gyro-Permutation, an iterative channel-rearrangement algorithm. Through successive sampling, clustering, and assignment, Gyro-Permutation aligns high-importance weights with structured sparsity patterns and mitigates suboptimal configurations in multi-level pruning. During gradual pruning, FlexHiNM-GP further employs a differentiable masking mechanism based on the Hard Concrete distribution, enabling gradient-based mask learning and preventing over-aggressive early pruning. Experiments on vision and language benchmarks demonstrate that FlexHiNM-GP consistently surpasses strong structured baselines and approaches the performance of unstructured pruning, validating the effectiveness of combining hybrid sparsity with learned masks and permutation strategies.
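
For readers unfamiliar with the N:M pattern the abstract builds on, here is a minimal magnitude-based sketch. It assumes the usual convention in which at most N of every M consecutive weights stay nonzero (for example 2:4); the helper `nm_prune` is hypothetical and does not reproduce FlexHiNM's region allocation, Gyro-Permutation, or learned masks.

```python
def nm_prune(row, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = nm_prune(weights)  # 2:4 pattern: two survivors per group of four
```

Because the ratio is fixed per group, a group holding three important weights must still drop one of them. That rigidity is what motivates mixing dense, vector-pruned, and N:M regions.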

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

616. MobileKGQA: On-Device KGQA System on Dynamic Mobile Environments

Authors:

Developing a mobile system capable of generating responses based on stored user data is a crucial challenge. Since user data is stored in the form of Knowledge Graphs, the field of knowledge graph question answering (KGQA) presents a promising avenue towards addressing this problem. However, existing KGQA systems face two critical limitations that preclude their on-device deployment: resource constraints and the inability to handle data accumulation. Therefore, we propose MobileKGQA, the first on-device KGQA system capable of adapting to evolving databases with minimal resource demands. MobileKGQA significantly reduces computational overhead through embedding hashing. Moreover, it successfully adapts to evolving databases under resource constraints through a novel annotation generation method. Its mobile applicability is validated on the NVIDIA Jetson Orin Nano edge-device platform, achieving 20.3% higher performance while using only 30.4% of the energy consumed by the SOTA (state of the art). On standard KGQA benchmarks, using just 7.2% of the computation and 9% of the parameters, MobileKGQA demonstrates performance that is empirically indistinguishable from the SOTA and outperforms baselines under distribution shift scenarios.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

617. LogicSR: A Unified Benchmark for Logical Discovery from Data

Authors:

Discovering underlying logical expressions from data is a critical task for interpretable AI and scientific discovery, yet it remains poorly served by existing research infrastructure. The field of Symbolic Regression (SR) primarily focuses on continuous mathematical functions, while Logic Synthesis (LS) is designed for exact, noise-free specifications, not for learning from incomplete or noisy data. This leaves a crucial gap for evaluating algorithms that can learn generalizable logical rules in realistic scenarios. To address this, we introduce LogicSR, a large-scale and comprehensive benchmark for logical symbolic regression. LogicSR is built from two sources: real-world problems from digital circuits and biological networks, and a novel synthetic data generator capable of producing a diverse set of complex logical formulas at scale. We use LogicSR to conduct a rigorous evaluation of 14 algorithms, spanning classical logic solvers, modern machine learning models, and Large Language Models (LLMs). Our findings reveal that the logical modeling capabilities and generalization robustness of these algorithms significantly depend on task scale and logical complexity, with current cutting-edge LLMs showing limited complex logical reasoning ability. LogicSR provides a robust foundation to benchmark progress, unify evaluation across disparate fields, and steer the future development of powerful neuro-symbolic systems.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

618. Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs

Authors:

Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a ∼5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category and closely rivals proprietary closed systems, while also demonstrating strong performance on general reasoning benchmarks: HLE, AIME-25, GPQA-Diamond, and MedQA.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

619. R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Authors:

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models’ ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks (+7.5 on AIME2024). These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

620. Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products

Authors:

Recent theoretical advances reveal that the Hadamard product induces nonlinear representations and implicit high-dimensional mappings for the field of deep learning, yet its practical deployment in efficient vision models remains underdeveloped. To address this gap, we introduce the Adaptive Cross-Hadamard (ACH) module, a novel operator that embeds learnability through differentiable discrete sampling and dynamic softsign normalization. This enables parameter-free feature reuse while stabilizing gradient propagation. Integrated into Hadaptive-Net (Hadamard Adaptive Network) via neural architecture search, our approach achieves unprecedented efficiency. Comprehensive experiments demonstrate state-of-the-art accuracy/speed trade-offs on image classification tasks, establishing Hadamard operations as fundamental building blocks for efficient vision models.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

621. Primal-Dual Policy Optimization for Adversarial Linear CMDPs

Authors:

Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components (periodic policy mixing and a regularized dual update), which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

622. Rethinking Unsupervised Cross-modal Flow Estimation: Learning from Decoupled Optimization and Consistency Constraint

Authors:

This work presents DCFlow, a novel unsupervised cross-modal flow estimation framework that integrates a decoupled optimization strategy and a cross-modal consistency constraint. Unlike previous approaches that implicitly learn flow estimation solely from appearance similarity, we introduce a decoupled optimization strategy with task-specific supervision to address modality discrepancy and geometric misalignment distinctly. This is achieved by collaboratively training a modality transfer network and a flow estimation network. To enable reliable motion supervision without ground-truth flow, we propose a geometry-aware data synthesis pipeline combined with an outlier-robust loss. Additionally, we introduce a cross-modal consistency constraint to jointly optimize both networks, significantly improving flow prediction accuracy. For evaluation, we construct a comprehensive cross-modal flow benchmark by repurposing public datasets. Experimental results demonstrate that DCFlow can be integrated with various flow estimation networks and achieves state-of-the-art performance among unsupervised approaches.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

623. Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer

Authors:

Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally similar brain voxels. These functional clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters and subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features, which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features, which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

624. DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

Authors:

Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80% across systems.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

625. Catalog-Native LLM: Speaking Item-ID dialect with Less Entanglement for Recommendation

Authors:

While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Natural-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities. IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.
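
The token-type gating described above can be pictured with a toy sketch. It assumes hard routing (each token goes to exactly one expert according to a known token-type flag); the two expert functions are arbitrary stand-ins and every name here is hypothetical, since the abstract does not specify IDIOMoE's actual gating or expert architecture.

```python
def text_expert(x):
    """Stand-in for the text feed-forward expert (arbitrary toy transform)."""
    return [2.0 * v for v in x]

def item_expert(x):
    """Stand-in for the item-ID feed-forward expert (arbitrary toy transform)."""
    return [v + 1.0 for v in x]

def gated_ffn(hidden_states, is_item_token):
    """Route each token to exactly one expert based on its token type."""
    return [
        item_expert(h) if is_item else text_expert(h)
        for h, is_item in zip(hidden_states, is_item_token)
    ]

hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
types = [False, True, False]  # second token is an item ID
out = gated_ffn(hidden, types)
```

Under this assumption, text tokens never pass through the item expert's parameters and vice versa, which is one way to read the claim about avoiding interference between the two modalities.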

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

626. Nasty Adversarial Training: A Probability Sparsity Perspective for Robustness Enhancement

Authors:

The vulnerability of deep neural networks to adversarial examples poses significant challenges to their reliable deployment. Among existing empirical defenses, adversarial training and robust distillation have proven the most effective. In this paper, we identify a property originally associated with model intellectual property, i.e., probability sparsity induced by nasty training, and demonstrate that it can also provide interpretable improvements to adversarial robustness. We begin by analyzing how nasty training induces sparse probability distributions and qualitatively explore the spatial metric preferences this sparsity introduces to the model. Building on these insights, we propose a simple yet effective adversarial training method, nasty adversarial training (NAT), which incorporates probability sparsity as a regularization mechanism to boost adversarial robustness. Both theoretical analysis and experimental results validate the effectiveness of NAT, highlighting its potential to enhance the adversarial robustness of deep neural networks in an interpretable manner.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

627. Revisiting [CLS] and Patch Token Interaction in Vision Transformers

Authors:

Vision Transformers have emerged as powerful, scalable and versatile representation learners. To capture both global and local features, a learnable [CLS] class token is typically prepended to the input sequence of patch tokens. Despite their distinct nature, both token types are processed identically throughout the model. In this work, we investigate the friction between global and local feature learning under different pre-training strategies by analyzing the interactions between class and patch tokens. Our analysis reveals that standard normalization layers introduce an implicit differentiation between these token types. Building on this insight, we propose specialized processing paths that selectively disentangle the computational flow of class and patch tokens, particularly within normalization layers and early query-key-value projections. This targeted specialization leads to significantly improved patch representation quality for dense prediction tasks. Our experiments demonstrate segmentation performance gains of over 2 mIoU points on standard benchmarks, while maintaining strong classification accuracy. The proposed modifications introduce only an 8% increase in parameters, with no additional computational overhead. Through comprehensive ablations, we provide insights into which architectural components benefit most from specialization and how our approach generalizes across model scales and learning frameworks.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

628. SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models

Authors:

Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead, i.e., higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem-solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning ability in large multimodal models. Our method consists of three primary stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and finetune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train the SketchJudge Reward Model, which explicitly evaluates the model's thinking process and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking Reinforcement Learning under the supervision of SketchJudge to further generalize sketch-style reasoning ability. Experimental evaluation on four benchmarks reveals that SketchThinker-R1 achieves over 64% reduction in reasoning token cost without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

629. Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting

Authors:

Probabilistic forecasting of multivariate time series is challenging due to non-stationarity, inter-variable dependencies, and distribution shifts. While recent diffusion and flow matching models have shown promise, they often ignore informative priors such as conditional means and covariances. In this work, we propose Conditionally Whitened Generative Models (CW-Gen), a framework that incorporates prior information through conditional whitening. Theoretically, we establish sufficient conditions under which replacing the traditional terminal distribution of diffusion models, namely the standard multivariate normal, with a multivariate normal distribution parameterized by estimators of the conditional mean and covariance improves sample quality. Guided by this analysis, we design a novel Joint Mean-Covariance Estimator (JMCE) that simultaneously learns the conditional mean and sliding-window covariance. Building on JMCE, we introduce Conditionally Whitened Diffusion Models (CW-Diff) and extend them to Conditionally Whitened Flow Matching (CW-Flow). Experiments on five real-world datasets with six state-of-the-art generative models demonstrate that CW-Gen consistently enhances predictive performance, capturing non-stationary dynamics and inter-variable correlations more effectively than prior-free approaches. Empirical results further demonstrate that CW-Gen can effectively mitigate the effects of distribution shift.
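
As a reminder of what whitening with a conditional mean and covariance means in general, the standard transform can be written as follows. The symbols are assumptions introduced here for illustration ($c$ for the conditioning history, $\hat{\mu}(c)$ and $\hat{\Sigma}(c)$ for the estimated conditional mean and covariance); the paper's exact construction may differ.

$$
z = \hat{\Sigma}(c)^{-1/2}\,\bigl(x - \hat{\mu}(c)\bigr),
\qquad
x = \hat{\mu}(c) + \hat{\Sigma}(c)^{1/2}\, z .
$$

If the estimators were exact, $z$ would have zero mean and identity covariance given $c$, so a generative model trained on $z$ only needs to capture what the mean and covariance estimates miss.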

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

630. Sample Reward Soups: Query-efficient Multi-Reward Guidance for Text-to-Image Diffusion Models

Authors:

Recent advances in inference-time alignment of diffusion models have shown reduced susceptibility to reward over-optimization. However, when aligning with multiple black-box reward functions, the number of required queries grows exponentially with the number of reward functions, making the alignment process highly inefficient. To address the challenge, we propose the first inference-time soup strategy, named Sample Reward Soups (SRSoup), for Pareto-optimal sampling across the entire space of preferences. Specifically, at each denoising step, we independently steer multiple denoising distributions using reward-guided search gradients (one for each reward function) and then linearly interpolate their search gradients. This design is effective because sample rewards can be shared when two denoising distributions are close, particularly during the early stages of the denoising process. As a result, SRSoup significantly reduces the number of queries required in the early stages without sacrificing performance. Extensive experiments demonstrate the effectiveness of SRSoup in aligning T2I models with diverse reward functions, establishing a practical and scalable solution.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

631. Causal Discovery in the Wild: A Voting-Theoretic Ensemble Approach

Authors:

Causal discovery is a critical yet persistently challenging task across scientific domains. Despite years of significant algorithmic advances, existing methods still struggle with inconsistent outcomes due to reliance on untestable assumptions, sensitivity to data perturbations, and optimization constraints. To this end, ensemble-based causal discovery has been actively pursued, aiming to aggregate multiple structural predictions for increased stability and uncertainty estimation. However, current aggregation methods are largely heuristic, lacking theoretical guarantees and guidance on how ensemble design choices affect performance. This work addresses these fundamental limitations. We introduce a principled voting-based framework for structural ensembling, establishing conditions under which the aggregated structure recovers the true causal graph. Our analysis yields a theoretically justified weighted voting mechanism that informs optimal choices regarding the number, competency, and diversity of causal discovery experts in the ensemble. Extensive experiments on synthetic and real-world datasets verify the robustness and effectiveness of our approach, offering a rigorous alternative to existing heuristic ensemble methods.
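
To make the idea of weighted voting over graph structures concrete, here is a small sketch that aggregates per-edge votes. It uses the classical log-odds weighting for independent voters of known accuracy, which is an assumption chosen for illustration and not necessarily the mechanism derived in the paper; the function names and the example accuracies are hypothetical.

```python
import math

def log_odds_weight(accuracy):
    """Classical weight for an independent voter with the given accuracy."""
    return math.log(accuracy / (1.0 - accuracy))

def weighted_edge_vote(votes, accuracies):
    """Aggregate edge predictions; votes[k][edge] is True if expert k predicts the edge."""
    weights = [log_odds_weight(p) for p in accuracies]
    edges = set().union(*(v.keys() for v in votes))
    result = {}
    for edge in edges:
        # Each expert adds its weight for "present" and subtracts it for "absent".
        score = sum(w if v.get(edge, False) else -w for w, v in zip(weights, votes))
        result[edge] = score > 0
    return result

votes = [
    {("A", "B"): True, ("B", "C"): False},   # expert 0, assumed accuracy 0.9
    {("A", "B"): False, ("B", "C"): False},  # expert 1, assumed accuracy 0.6
    {("A", "B"): False, ("B", "C"): True},   # expert 2, assumed accuracy 0.6
]
graph = weighted_edge_vote(votes, accuracies=[0.9, 0.6, 0.6])
```

In this toy case the single strong expert outweighs two weak ones on edge A→B, which a plain majority vote would have rejected.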

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

632. Object-Centric Refinement for Enhanced Zero-Shot Segmentation

Authors:

Zero-shot semantic segmentation aims to recognize, pixel-wise, unseen categories without annotated masks, typically by leveraging vision-language models such as CLIP. However, the patch representations obtained by CLIP's vision encoder lack object-centric structure, making it difficult to localize coherent semantic regions. This hinders the performance of the segmentation decoder, especially for unseen categories. To mitigate this issue, we propose object-centric zero-shot segmentation (OC-ZSS), which enhances patch representations using object-level information. To extract object features for patch refinement, we introduce self-supervision-guided object prompts into the encoder. These prompts attend to coarse object regions using attention masks derived from unsupervised clustering of features from a pretrained self-supervised (SSL) model. Although these prompts offer a structured initialization of the object-level context, the extracted features remain coarse due to the unsupervised nature of clustering. To further refine the object features and effectively enrich patch representations, we develop a dual-stage Object Refinement Attention (ORA) module that iteratively updates both object and patch features through cross-attention. Lastly, to make the refinement more robust and sensitive to objects of varying spatial scales, we incorporate a lightweight granular attention mechanism that operates over multiple receptive fields. OC-ZSS achieves state-of-the-art performance on standard zero-shot segmentation benchmarks across inductive, transductive, and cross-domain settings.

📊 Review scores

Average: 6.00

Lowest: 6

Highest: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

633. MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval

Authors:

Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods use prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose the Memory-Retrieval Anomaly Detection method (MRAD), a unified framework that replaces parametric fitting with direct memory retrieval. The train-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, where feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Building on MRAD-TF, we further propose two lightweight variants as enhancements: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomalous samples; (ii) MRAD-CLIP injects the normal and anomalous region priors from MRAD-FT as dynamic biases into CLIP's learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation, under both train-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. Code will be released.
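
The key-value retrieval idea can be sketched in a few lines. The example below stores feature vectors as keys and binary labels as values, then scores a query by a softmax-weighted average of the stored labels. The cosine similarity, the softmax weighting, the `temperature` parameter, and the function names are all assumptions made for illustration; the abstract only states that scores come from "similarity retrieval over the memory bank".

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_anomaly_score(query, keys, values, temperature=0.1):
    """Softmax-weighted average of stored labels (0 = normal, 1 = anomalous)."""
    sims = [cosine(query, k) / temperature for k in keys]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Toy memory bank: two normal features and one anomalous feature.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
values = [0.0, 0.0, 1.0]
score_normal = retrieve_anomaly_score([1.0, 0.05], keys, values)
score_anomalous = retrieve_anomaly_score([0.05, 1.0], keys, values)
```

No parameters are trained here: the score depends only on which stored features the query resembles, which is the sense in which retrieval replaces parametric fitting.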

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

634. CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

Authors:

Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians’ evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce **CARE**, advancing **C**linical **A**ccountability in multi-modal medical **R**easoning with an **E**vidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our **CARE-Flow** (coordinator-free) improves average accuracy by **10.9%** over the same-size (10B) state-of-the-art (SOTA) model. With dynamic planning and answer review, our **CARE-Coord** yields a further gain, outperforming the heavily pre-trained SOTA by **5.2%**. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

635. On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Authors:

In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. With just a single-line change, the method outperforms standard SFT on multiple difficult benchmarks and base models, from math reasoning to code generation and multi-modal tasks, demonstrating improved generalization. Additionally, DFT achieves competitive results in offline RL settings, providing an effective yet streamlined alternative. By bridging theoretical insights with practical solutions, this work advances the state of SFT. The source code will be publicly released.
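
To make the rescaling concrete, the sketch below compares the standard per-token negative log-likelihood with a version multiplied by the token's own probability. This is only one plausible reading of "rescaling the objective function with the probability of this token": the exact objective, and whether the probability is treated as a constant (stop-gradient) weight, are assumptions, and the function names are hypothetical.

```python
import math

def sft_token_loss(p):
    """Standard SFT loss for a target token with model probability p."""
    return -math.log(p)

def dft_token_loss(p):
    """Probability-rescaled loss. In an autodiff framework the factor p
    would be held constant, so it only scales the gradient."""
    return -p * math.log(p)

def sequence_loss(token_probs, token_loss):
    """Mean per-token loss over a sequence."""
    return sum(token_loss(p) for p in token_probs) / len(token_probs)

probs = [0.9, 0.5, 0.01]  # model probabilities of the target tokens
sft = sequence_loss(probs, sft_token_loss)
dft = sequence_loss(probs, dft_token_loss)
```

Under this reading, a token with probability 0.01 contributes about 4.6 to the SFT loss but only about 0.046 after rescaling, so very unlikely tokens no longer dominate the update.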

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

636. UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models

Authors:

Diffusion LLMs have attracted growing interest, with plenty of recent work emphasizing their great potential in various downstream tasks; yet the long‑context behavior of diffusion LLMs remains largely uncharted. We present a case study of post‑training techniques for extending the context window of diffusion LLMs (i.e., LLaDA) without retraining from scratch. We show that a simple modification to the standard Rotary Positional Embeddings (RoPE) extension effectively accommodates the probabilistic modeling inherent in the diffusion process, enabling stable scaling to longer context ranges. We further compare masking strategies used during post‑training and analyze their impact on optimization stability and long‑range recall. Instantiating these insights, we introduce UltraLLaDA, a diffusion LLM with a 128K‑token context window that, in our empirical evaluation on long‑context tasks, significantly outperforms training‑free baselines. Our experimental results highlight the special positional extension as a key lever for scaling diffusion LLMs to extended contexts and offer practical guidance for practitioners seeking 128K‑scale context via efficient post‑training.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

637. Tackling Time-Series Forecasting Generalization via Mitigating Concept Drift

Authors:

Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is important for time-series forecasting models to handle potential distribution shifts over time. In this paper, we first identify two types of distribution shifts in time series: concept drift and temporal shift. We note that while existing studies primarily focus on temporal shift in time-series forecasting, designing proper concept drift methods for time-series forecasting has received comparatively less attention. Because conventional concept drift methods based on invariant learning face certain challenges in time-series forecasting, we propose a soft attention mechanism that finds invariant patterns from both lookback and horizon time series. Additionally, we emphasize the critical importance of mitigating temporal shifts as a preliminary to addressing concept drift. In this context, we introduce ShifTS, a method-agnostic framework designed to tackle temporal shift first and then concept drift within a unified approach. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of the underlying forecasting models across multiple datasets, and in outperforming existing concept drift, temporal shift, and combined baselines.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

638. LUMINA: Detecting Hallucinations in RAG System with Context–Knowledge Signals

Authors:

Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations in large language models (LLMs) by grounding responses in retrieved documents. Yet, RAG-based LLMs still hallucinate even when provided with correct and sufficient context. A growing line of work suggests that this stems from an imbalance between how models use external context and their internal knowledge, and several approaches have attempted to quantify these signals for hallucination detection. However, existing methods require extensive hyperparameter tuning, limiting their generalizability. We propose LUMINA, a novel framework that detects hallucinations in RAG systems through context–knowledge signals: external context utilization is quantified via distributional distance, while internal knowledge utilization is measured by tracking how predicted tokens evolve across transformer layers. We further introduce a framework for statistically validating these measurements. Experiments on common RAG hallucination benchmarks and four open-source LLMs show that LUMINA achieves consistently high AUROC and AUPRC scores, outperforming prior utilization-based methods by up to +13% AUROC on HalluRAG. Moreover, LUMINA remains robust under relaxed assumptions about retrieval quality and model matching, offering both effectiveness and practicality.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

639. Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Authors:

Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI fine-tuning API’s safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on two open-source models, Phi-4 and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all three models, all stegotexts containing malicious content are incorrectly classified as safe.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

640. Sheaves Reloaded: A Direction Awakening

Authors:

Sheaf Neural Networks (SNNs) are a powerful algebraic-topology generalization of Graph Neural Networks (GNNs), and have been shown to significantly improve our ability to model complex relational data. While the GNN literature has shown that incorporating directionality can substantially boost performance in many real-world applications, no known SNN approach has such a capability. To address this limitation, we introduce the Directed Cellular Sheaf, a generalized cellular sheaf designed to explicitly account for edge orientations. Building on it, we define a corresponding sheaf Laplacian, the Directed Sheaf Laplacian $L^{\widetilde{\mathcal{F}}}$, which exploits the sheaf's structure to capture both the graph’s topology and its directions. $L^{\widetilde{\mathcal{F}}}$ serves as the backbone of the Directed Sheaf Neural Network (DSNN), the first SNN model to embed a directional bias into its architecture. Extensive experiments on twelve real-world benchmarks show that DSNN consistently outperforms many baseline methods.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

641. Tree-sliced Sobolev IPM

Authors:

Recent work shows Tree-Sliced Optimal Transport to be an efficient and more expressive alternative to Sliced Wasserstein (SW), improving downstream performance. Tree-sliced metrics compare probability distributions by projecting measures onto tree metric spaces; a central example is the Tree-Sliced Wasserstein (TSW) distance, which applies the $1$-Wasserstein metric after projection. However, computing tree-based $p$-Wasserstein for general $p$ is costly, largely confining practical use to $p=1$. In this work, we revisit Sobolev integral probability metrics (IPM) on trees to obtain a practical generalization of TSW. Building on the insight that a suitably regularized Sobolev IPM admits a closed-form expression, we introduce TS-Sobolev, a tree-sliced metric that aggregates regularized Sobolev IPMs over random tree systems and remains tractable for all $p\ge1$; for $p>1$, TS-Sobolev has the same computational complexity as TSW at $p=1$. Notably, at $p=1$ it recovers TSW exactly. Consequently, TS-Sobolev serves as a drop-in replacement for TSW in practical applications, with an additional flexibility in changing $p$. Furthermore, we extend this framework to define a corresponding metric for probability measures on hyperspheres. Experiments on Euclidean and spherical datasets show that TS-Sobolev and its spherical variant improve downstream performance in gradient flows, self-supervised learning, generative modeling, and text topic modeling over recent SW and TSW variants.
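
For intuition, the $1$-Wasserstein distance on a tree has a well-known closed form: the sum over edges of the edge length times the absolute difference in mass below that edge. The sketch below computes it, together with a $p$-th power variant that reduces to the same value at $p=1$. The variant is an assumed form chosen only to illustrate why such quantities cost the same for every $p$; it is not claimed to be the paper's regularized Sobolev IPM, and the averaging over random tree systems is omitted. All function names are hypothetical.

```python
def subtree_mass_diff(parent, mu, nu):
    """Difference of subtree masses (mu minus nu) for every node.

    parent[i] is the parent index of node i (the root has parent -1);
    nodes are assumed to be ordered so that parent[i] < i.
    """
    diff = [a - b for a, b in zip(mu, nu)]
    # Accumulate children into parents, visiting leaves first.
    for node in range(len(parent) - 1, 0, -1):
        diff[parent[node]] += diff[node]
    return diff

def tree_sobolev_like(parent, weight, mu, nu, p=1.0):
    """(sum_e w_e * |mass difference below e| ** p) ** (1 / p)."""
    diff = subtree_mass_diff(parent, mu, nu)
    total = sum(weight[v] * abs(diff[v]) ** p for v in range(1, len(parent)))
    return total ** (1.0 / p)

# Path-shaped tree 0 - 1 - 2 with unit edge lengths.
parent = [-1, 0, 1]
weight = [0.0, 1.0, 1.0]  # weight[v] is the length of edge (parent[v], v)
mu = [1.0, 0.0, 0.0]      # all mass at the root
nu = [0.0, 0.0, 1.0]      # all mass at the leaf
tw = tree_sobolev_like(parent, weight, mu, nu, p=1.0)
ts2 = tree_sobolev_like(parent, weight, mu, nu, p=2.0)
```

Both calls make a single pass over the tree, so changing `p` changes only the exponent, not the computational cost.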

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

642. HUMOF: Human Motion Forecasting in Interactive Social Scenes

Authors:

Complex dynamic scenes present significant challenges for predicting human behavior due to the abundance of interaction information, such as human-human and human-environment interactions. These factors complicate the analysis and understanding of human behavior, thereby increasing the uncertainty in forecasting human motions. Existing motion prediction methods thus struggle in these complex scenarios. In this paper, we propose an effective method for human motion forecasting in dynamic scenes. To achieve a comprehensive representation of interactions, we design a hierarchical interaction feature representation so that high-level features capture the overall context of the interactions, while low-level features focus on fine-grained details. Besides, we propose a coarse-to-fine interaction reasoning module that leverages both spatial and frequency perspectives to efficiently utilize hierarchical features, thereby enhancing the accuracy of motion predictions. Our method achieves state-of-the-art performance across four public datasets. We will release our code upon publication.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

643. ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains

Authors:

Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

644. Systematic Biosafety Evaluation of DNA Language Models under Jailbreak Attacks

Authors:

DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Language Models have achieved success in designing synthetic functional DNA sequences, even whole genomes of novel bacteriophages, verified with wet lab experiments. Such remarkable generative power also brings severe biosafety concerns about whether DNA language models can design human viruses. With the goal of exposing vulnerabilities and informing the development of robust safeguarding techniques, we perform a systematic biosafety evaluation of DNA language models through the lens of jailbreak attacks. Specifically, we introduce JailbreakDNABench, a benchmark centered on high-priority human viruses, together with an end-to-end jailbreak framework, GeneBreaker. GeneBreaker integrates three key components: (1) an LLM agent equipped with customized bioinformatics tools to design high-homology yet non-pathogenic jailbreak prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer sequence generation toward pathogen-like outputs, and (3) a BLAST- and function-annotation–based evaluation pipeline to identify successful jailbreaks. On JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA language models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

645. ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision

Authors:

We propose a method for inferring an egocentric dense depth map from an RGB image and a sparse point cloud. The crux of our method lies in modeling the 3D scene implicitly within the latent space and learning an inductive bias in an unsupervised manner through principles of Structure-from-Motion. To force the learning of this inductive bias, we propose to optimize for an ill-posed objective: predicting latent features that are not observed in the input view, but exist in the 3D scene. This is facilitated by means of rigid warping of latent features from the input view to a nearby or adjacent (co-visible) view of the same 3D scene. "Empty" regions in the latent space that correspond to regions occluded from the input view are completed by a Contextual eXtrapolation mechanism based on features visible in the input view. Once learned, the inductive bias can be transferred to modulate the features of the input view to improve fidelity. We term our method "Occluded Region Completion as Supervision" or ORCaS. We evaluate ORCaS on VOID1500 and NYUv2 benchmark datasets, where we improve over the best existing method by 8.91% across all metrics. ORCaS also improves generalization from VOID1500 to ScanNet and NYUv2 by 15.7% and robustness to low density inputs by 31.2%. Code will be released.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

646. Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

Authors:

Multi-Agent Systems (MAS) powered by Visual Language Models (VLMs) enable challenging tasks but suffer from a novel failure mode, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by subsequent ones due to over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing in terms of reduced visual attention allocation. This leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. Experimental results demonstrate that our method markedly reduces hallucination snowballing, consistently improving performance across eight benchmarks based on four common MAS structures and ten base models. The implementation source code will be made publicly available.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

647. MarkovScale: Towards Optimal Sequential Scaling at Inference Time

Authors:

Sequential scaling is a prominent inference-time scaling paradigm, yet its performance improvements are typically modest and not well understood, largely due to the prevalence of heuristic, non-principled approaches that obscure clear optimality bounds. To address this, we introduce a principled framework that models sequential scaling as a two-state Markov process, uncovering its fundamental properties and providing closed-form expressions for key aspects, including the conditions under which sequential scaling enhances accuracy, the theoretical accuracy upper bound, and the convergence rate. Leveraging this formulation, we develop MarkovScale, a practical system that applies these optimality criteria to achieve a theoretically grounded balance between accuracy and efficiency. Comprehensive experiments across 3 backbone LLMs and 5 benchmarks show that MarkovScale consistently outperforms state-of-the-art parallel and sequential scaling methods, representing a significant step toward optimal and resource-efficient inference in LLMs. The source code will be open upon acceptance at https://open-upon-acceptance.
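
A generic two-state Markov chain already shows where such closed forms come from. In the sketch below, an answer is either correct or incorrect, and each revision round flips it with fixed probabilities. The parameter names `fix` and `brk` and the time-homogeneous transition model are assumptions for illustration; the paper's own parameterization may differ.

```python
def accuracy_after(n, p0, fix, brk):
    """Probability of being in the 'correct' state after n revision rounds.

    p0:  initial accuracy.
    fix: assumed P(incorrect -> correct) per round.
    brk: assumed P(correct -> incorrect) per round.
    """
    limit = fix / (fix + brk)   # stationary accuracy of the chain
    rate = 1.0 - fix - brk      # second eigenvalue, sets the convergence rate
    return limit + (p0 - limit) * rate ** n

def simulate(n, p0, fix, brk):
    """Same quantity, obtained by iterating the one-step recursion."""
    p = p0
    for _ in range(n):
        p = p * (1.0 - brk) + (1.0 - p) * fix
    return p

p0, fix, brk = 0.6, 0.3, 0.1
closed = accuracy_after(5, p0, fix, brk)
stepped = simulate(5, p0, fix, brk)
upper_bound = fix / (fix + brk)
```

In this toy model, when `0 <= 1 - fix - brk < 1`, further rounds raise accuracy exactly when `p0` is below `fix / (fix + brk)`, which then acts as the accuracy ceiling, and the gap to that ceiling shrinks geometrically.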

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

648. SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

Authors:

We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://scenetransporter.github.io/

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

649. Lossless Vocabulary Reduction for Auto-Regressive Language Models

Authors:

Tokenization, the process of decomposing a given text into a sequence of subwords called tokens, is one of the key components in the development of language models. In particular, auto-regressive language models generate text token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, language models struggle to cooperate with each other at the level of next-token distributions, for example in model ensembling. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

650. Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

Authors:

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, the gap between large-scale pretraining and efficient finetuning on humanoids still exists. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch update and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate in simulation that these SAC-pretrained policies can be finetuned in new environments and out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy while stochastic exploration is instead confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during fine-tuning.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Individual scores: 6, 6, 6

📄 OpenReview 📄 Download PDF

651. Bandit Learning in Matching Markets Robust to Adversarial Corruptions

Authors:

This paper investigates the problem of bandit learning in two-sided decentralized matching markets with adversarial corruptions. In matching markets, players on one side aim to learn their unknown preferences over arms on the other side through iterative online learning, with the goal of identifying the optimal stable match. However, in real-world applications, stochastic rewards observed by players may be corrupted by malicious adversaries, potentially misleading the learning process and causing convergence to a sub-optimal match. We study this problem under two settings: one where the corruption level $C$ (defined as the sum of the largest adversarial alterations to the feedback across rounds) is known, and another where it is unknown. For the known corruption setting, we develop a robust variant of the classical Explore-Then-Gale-Shapley (ETGS) algorithm by incorporating widened confidence intervals. For the unknown corruption case, we propose a Multi-layer ETGS race method that adaptively mitigates adversarial effects without prior corruption knowledge. We provide theoretical guarantees for both algorithms by establishing upper bounds on their optimal stable regret, and further derive the lower bound to demonstrate their optimality.
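
The effect of widening confidence intervals under a known corruption level can be sketched as follows. If the total corruption is at most $C$, an empirical mean over $n$ observations can be shifted by at most $C/n$, so adding $C/n$ to a standard confidence radius keeps the interval valid. The specific radius, the `horizon` parameter, and the function names are assumptions; the paper's exact interval is not given in the abstract.

```python
import math

def confidence_radius(pulls, horizon, corruption=0.0):
    """UCB-style radius, widened by corruption / pulls."""
    stochastic = math.sqrt(2.0 * math.log(horizon) / pulls)
    return stochastic + corruption / pulls

def prefers(mean_a, pulls_a, mean_b, pulls_b, horizon, corruption):
    """True once arm a's lower bound exceeds arm b's upper bound."""
    lcb_a = mean_a - confidence_radius(pulls_a, horizon, corruption)
    ucb_b = mean_b + confidence_radius(pulls_b, horizon, corruption)
    return lcb_a > ucb_b

# Same observed means and pull counts, with and without a corruption budget.
clean = prefers(0.8, 2000, 0.4, 2000, horizon=10_000, corruption=0.0)
corrupted = prefers(0.8, 2000, 0.4, 2000, horizon=10_000, corruption=300.0)
```

With the wider intervals, a player needs more observations before committing to a preference between two arms, which is the price paid for not being misled by corrupted feedback.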

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

652. UniOD: A Universal Model for Outlier Detection across Diverse Domains

Authors:

Outlier detection (OD), distinguishing inliers and outliers in completely unlabeled datasets, plays a vital role in science and engineering. Although there have been many insightful OD methods, most of them require troublesome hyperparameter tuning (a challenge in unsupervised learning) and costly model training for every task or dataset. In this work, we propose UniOD, a universal OD framework that leverages labeled datasets to train a single model capable of detecting outliers of datasets with different feature dimensions and heterogeneous feature spaces from diverse domains. Specifically, UniOD extracts uniform and comparable features across different datasets by constructing and factorizing multi-scale point-wise similarity matrices. It then employs graph neural networks to capture comprehensive within-dataset and between-dataset information simultaneously, and formulates outlier detection tasks as node classification tasks. As a result, once the training is complete, UniOD can identify outliers in datasets from diverse domains without any further model/hyperparameter selection and parameter optimization, which greatly improves convenience and accuracy in real applications. More importantly, we provide theoretical guarantees for the effectiveness of UniOD, consistent with our numerical results. We evaluate UniOD on 30 benchmark OD datasets against 17 baselines, demonstrating its effectiveness and superiority.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

653. Target-Aware Video Diffusion Models

Authors:

We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt. Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of human–object interactions by leveraging the priors of large-scale video generative models. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.

📊 Review Scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Individual scores: 6, 6, 6, 6

📄 OpenReview 📄 Download PDF

654. Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Authors:

Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we re-frame the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre-training, we evaluate LoRA-Pre's effectiveness in fine-tuning scenarios. With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines. Specifically, compared to standard LoRA, LoRA-Pre achieves substantial improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B, validating our approach's effectiveness across both pre-training and fine-tuning paradigms.
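
The link between an EMA and online regression can be checked numerically in the scalar case. The sketch below shows that one EMA update with decay `beta` equals one gradient step of size `1 - beta` on the squared error between the running state and the incoming gradient. This is a discrete-time, scalar simplification with hypothetical function names; the paper's gradient-flow formulation and the low-rank factorization of the state are not reproduced here.

```python
def ema_update(m, g, beta):
    """Standard exponential moving average of incoming gradients."""
    return beta * m + (1.0 - beta) * g

def sgd_step_on_regression(m, g, lr):
    """One gradient step on the loss 0.5 * (m - g) ** 2 with respect to m."""
    grad = m - g
    return m - lr * grad

beta = 0.9
grads = [1.0, -0.5, 2.0, 0.25]
m_ema = m_sgd = 0.0
for g in grads:
    m_ema = ema_update(m_ema, g, beta)
    m_sgd = sgd_step_on_regression(m_sgd, g, lr=1.0 - beta)
```

Once the momentum is viewed as the parameter of a small online learner, that parameter can in principle be factorized like any other weight matrix, which is the opening a low-rank optimizer state relies on.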

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Detailed scores: 6, 6, 6

📄 openreview 📄 Download PDF

655. IA2: Alignment with ICL Activations improves Supervised Fine-Tuning

Authors:

Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data-scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

656. Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective

Authors:

Unsupervised contrastive learning has shown significant performance improvements in recent years, often approaching or even rivaling supervised learning in various tasks. However, its learning mechanism is fundamentally different from supervised learning. Previous works have shown that difficult examples (well-recognized in supervised learning as examples around the decision boundary), which are essential in supervised learning, contribute minimally in unsupervised settings. In this paper, perhaps surprisingly, we find that the direct removal of difficult examples, although it reduces the sample size, can boost the downstream classification performance of contrastive learning. To uncover the reasons behind this, we develop a theoretical framework modeling the similarity between different pairs of samples. Guided by this framework, we conduct a thorough theoretical analysis revealing that the presence of difficult examples negatively affects the generalization of contrastive learning. Furthermore, we demonstrate that the removal of these examples, as well as techniques such as margin tuning and temperature scaling, can enhance its generalization bounds, thereby improving performance. Empirically, we propose a simple and efficient mechanism for selecting difficult examples and validate the effectiveness of the aforementioned methods, which substantiates the reliability of our proposed theoretical framework.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Detailed scores: 6, 6, 6

📄 openreview 📄 Download PDF

657. A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction

Authors:

Link sign prediction on a signed graph is the task of determining whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN (Ma et al., 2021). However, naive modeling of edge-edge relations is computationally intractable even for a graph of moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify the scalability of our method by proving its linear convergence. Our extensive experiments also demonstrate that it achieves up to 437 times faster convergence than baselines, while maintaining prediction performance competitive with state-of-the-art models.
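The Gramian parameterization described in this abstract can be sketched as follows. This is a minimal illustration under the assumption that each edge embedding is scaled to unit norm before taking inner products; the abstract does not state how the embeddings are produced or normalized, so `unit_rows`, `gram_correlation`, and the random embeddings are hypothetical stand-ins.

```python
import math
import random


def unit_rows(embeddings):
    """Scale each edge embedding to unit Euclidean norm."""
    out = []
    for row in embeddings:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row])
    return out


def gram_correlation(embeddings):
    """Correlation matrix R = Z Z^T, where Z holds unit-norm edge embeddings.

    R is positive semidefinite by construction and has ones on the
    diagonal, so it is a valid correlation matrix for a Gaussian copula.
    """
    z = unit_rows(embeddings)
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in z] for zi in z]


def parameter_counts(num_edges, dim):
    """Free parameters: full correlation matrix vs. edge embeddings."""
    full = num_edges * (num_edges - 1) // 2
    low_rank = num_edges * dim
    return full, low_rank


random.seed(0)
# Toy setup (assumed): 6 edges with 4-dimensional embeddings.
E = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
R = gram_correlation(E)
print(parameter_counts(1000, 16))
```

The parameter count illustrates the saving: a full correlation matrix over m edges has m(m-1)/2 free entries, whereas the embedding form needs only m times the embedding dimension.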

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

658. FlashWorld: High-quality 3D Scene Generation within Seconds

Authors:

We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, $10 \sim 100\times$ faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented methods typically suffer from poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports the MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation that matches the distribution of the consistent 3D-oriented mode to that of the high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. We also propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

659. Efficient Benchmarking of Functional Connectivity Modeling via Structure-aware Core-set Selection

Authors:

Benchmarking the hundreds of available functional connectivity (FC) models on large fMRI datasets is critical for reproducible neuroscience, but is often computationally infeasible, with full-scale comparisons requiring months of compute time. This creates a critical bottleneck, hindering data-driven model selection. To break this bottleneck, we address the challenge of FC benchmarking by introducing a pre-analytical step: selecting a small, representative core-set whose sole purpose is to preserve the relative performance ranking of FC models. We formulate this as a ranking recommendation problem and propose Structure-aware Contrastive Learning for Core-set Selection (SCLCS), a self-supervised framework to select these core-sets. SCLCS first uses an adaptive Transformer to learn each sample's unique FC structure. It then introduces a novel Structural Perturbation Score (SPS) to quantify the stability of these learned structures during training, identifying samples that represent foundational connectivity archetypes. Finally, it combines this stability-based ranking with a density-aware sampling strategy to ensure the selected core-set is both robust and diverse. On the large-scale REST-meta-MDD dataset, SCLCS preserves the ground-truth model ranking with just 10% of the data, outperforming state-of-the-art (SOTA) selection methods by up to 23.2% in ranking consistency (nDCG@k). To our knowledge, this is the first work to formalize core-set selection for FC model benchmarking, making previously intractable large-scale model comparisons feasible.
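The ranking-consistency metric named in this abstract, nDCG@k, can be computed as in the sketch below. It uses the common linear-gain form with a log2 position discount. How the paper assigns relevance grades to FC models is not stated in the abstract, so the model names and grades here are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(predicted_order, relevance, k):
    """nDCG@k of a predicted ranking against per-item relevance grades."""
    gains = [relevance[item] for item in predicted_order]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical grades: a higher grade means the model ranks higher on the full dataset.
relevance = {"model_a": 3.0, "model_b": 2.0, "model_c": 1.0, "model_d": 0.0}
full_data_order = ["model_a", "model_b", "model_c", "model_d"]
core_set_order = ["model_b", "model_a", "model_c", "model_d"]

print(ndcg_at_k(full_data_order, relevance, k=3))
print(ndcg_at_k(core_set_order, relevance, k=3))
```

A core-set that reproduces the full-data ranking scores 1; swapping the top two models lowers the score because errors near the top of the ranking are discounted the least.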

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

660. FACM: Flow-Anchored Consistency Models

Authors:

Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: training the network exclusively on a shortcut objective leads to catastrophic forgetting of the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow, ensuring high trajectory fidelity during training. We introduce the Flow-Anchored Consistency Model (FACM), where a Flow Matching (FM) task serves as a dynamic anchor for the primary CM shortcut objective. Key to this Flow-Anchoring approach is a novel expanded time interval strategy that unifies optimization for a single model while decoupling the two tasks to ensure stable, architecture-agnostic training. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.70 with just one step (NFE=1) on ImageNet 256x256. To address the challenge of scalability, we develop a memory-efficient Chain-JVP that resolves key incompatibilities with FSDP. This method allows us to scale FACM training on a 14B parameter model (Wan 2.2), accelerating its Text-to-Image inference from 2x40 to 8 steps. Our code and pretrained models will be available to the public.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

661. From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation

Authors:

Autoregressive (AR) models have emerged as a powerful framework for image generation, yet they remain bound by a fundamental limitation: once a prediction is made, it cannot be revised. Each step marches forward in a strict left-to-right sequence, causing small errors to accumulate and compromise the final image. In this work, we reimagine this process with TensorAR, a decoder-only AR model that shifts from predicting discrete tokens to predicting overlapping tensor windows. This simple change transforms image synthesis into a process of next-tensor prediction, enabling the model to refine earlier outputs while preserving the causal structure that defines autoregression. To guard against information leakage during training, we introduce a discrete tensor noising mechanism inspired by discrete diffusion theory, which injects categorical noise into input tensors. TensorAR is designed to be plug-and-play: unlike masked AR methods, it requires no architectural modifications, and unlike autoregressive diffusion, it preserves the familiar AR training paradigm. We evaluate TensorAR across both class-to-image and text-to-image tasks, showing consistent gains in generation quality and instruction-following ability, while achieving a superior balance between quality and latency. In doing so, TensorAR offers a new path forward for autoregressive generation, one in which predictions are not just produced but continually refined.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

662. Squeeze the Soaked Sponge: Efficient Off-policy RFT for Large Language Model

Authors:

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs), yet most existing Reinforcement Finetuning (RFT) methods are inherently on-policy RL, failing to reuse historical data and thus preventing efficient scaling. In this work, we explore the potential of off-policy RL to leverage historical data for rollout-efficient RFT. Specifically, we propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), which enables on-policy RFT methods to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio, which utilizes data from both current and past policies for efficient training; (2) a KL-Convex policy constraint that combines the KL constraints on the base and precedent models to balance stability and flexibility; (3) Policy reincarnation, which replaces the base model with the mix-policy RFT model midway through training and restarts on-policy training, to achieve a seamless transition from early efficiency to steady convergence. In our experiments, we train a series of ReMix models based on PPO and GRPO from 1.5B and 7B base models. On five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500), ReMix achieves an average Pass@1 accuracy of 52.10% (with 0.079M rollouts) and 64.39% (with 0.011M rollouts) on 1.5B and 7B models, respectively. Compared with 15 recent advanced models, ReMix shows SOTA-level performance with a more than 30x to 450x reduction in training cost in terms of rollout data volume, demonstrating superior training efficiency. Additionally, our multifaceted analysis reveals insightful findings, including the implicit preference of off-policy RFT for shorter responses and the collapse mode of self-reflection under severe off-policyness.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

663. Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation

Authors:

While diffusion models have achieved remarkable performance in image generation, they often struggle with the imbalanced datasets frequently encountered in real-world applications, resulting in significant performance degradation on minority classes. In this paper, we identify model capacity allocation as a key and previously underexplored factor contributing to this issue, providing a perspective that is orthogonal to existing research. Our empirical experiments and theoretical analysis reveal that majority classes monopolize an unnecessarily large portion of the model's capacity, thereby restricting the representation of minority classes. To address this, we propose Capacity Manipulation (CM), which explicitly reserves model capacity for minority classes. Our approach leverages a low-rank decomposition of model parameters and introduces a capacity manipulation loss to allocate appropriate capacity for capturing minority knowledge, thus enhancing minority class representation. Extensive experiments demonstrate that CM consistently and significantly improves the robustness of diffusion models on imbalanced datasets, and when combined with existing methods, further boosts overall performance.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

664. TTT3R: 3D Reconstruction as Test-Time Training

Authors:

Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear complexity in the sequence length. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, enabling a balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code will be made publicly available.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

665. Neural Force Field: Few-shot Learning of Generalized Physical Reasoning

Authors:

Physical reasoning is a remarkable human ability that enables rapid learning and generalization from limited experience. Current AI models, despite extensive training, still struggle to achieve similar generalization, especially in out-of-distribution (OOD) settings. This limitation stems from their inability to abstract core physical principles from observations. A key challenge is developing representations that can efficiently learn and generalize physical dynamics from minimal data. Here we present Neural Force Field (NFF), a framework extending Neural Ordinary Differential Equations (NODEs) to learn complex object interactions through force field representations, which can be efficiently integrated through an Ordinary Differential Equation (ODE) solver to predict object trajectories. Unlike existing approaches that rely on discrete latent spaces, NFF captures fundamental physical concepts such as gravity, support, and collision in continuous explicit force fields. Experiments on three challenging physical reasoning tasks demonstrate that NFF, trained with only a few examples, achieves strong generalization to unseen scenarios. This physics-grounded representation enables efficient forward-backward planning and rapid adaptation through interactive refinement. Our work suggests that incorporating physics-inspired representations into learning systems can help bridge the gap between artificial and human physical reasoning capabilities.
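The idea of integrating a force field with an ODE solver to obtain trajectories can be sketched in one dimension. This is only an illustration under strong assumptions: the force field below is hand-written (gravity plus a spring-like ground contact) rather than learned, the integrator is a simple semi-implicit Euler scheme, and all names and constants are hypothetical rather than taken from NFF.

```python
GRAVITY = 9.81  # m/s^2


def force_field(y, v, mass=1.0, stiffness=1000.0, damping=10.0):
    """Hand-written stand-in for a learned force field.

    Gravity acts everywhere; a stiff, damped contact force pushes the
    object upward only while it penetrates the ground plane y = 0.
    """
    force = -mass * GRAVITY
    if y < 0.0:
        force += max(0.0, -stiffness * y - damping * v)
    return force


def simulate(y0, v0, duration, dt=1e-3, mass=1.0):
    """Integrate dy/dt = v, dv/dt = F(y, v) / m with semi-implicit Euler."""
    steps = int(round(duration / dt))
    y, v = y0, v0
    trajectory = [y]
    for _ in range(steps):
        a = force_field(y, v, mass) / mass
        v += a * dt  # update velocity first
        y += v * dt  # then position, using the new velocity
        trajectory.append(y)
    return trajectory


free_fall = simulate(y0=10.0, v0=0.0, duration=1.0)
bounce = simulate(y0=1.0, v0=0.0, duration=3.0)
print(free_fall[-1], min(bounce))
```

In a learned setting, `force_field` would be replaced by a trained network and the Euler loop by a higher-order solver; the structure of "predict forces, integrate to get trajectories" stays the same.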

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

666. ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Authors:

Unified multimodal models (UMMs) have shown remarkable advances in jointly understanding and generating text and images. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning: textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. As such, existing benchmarks rarely require the use of one modality to guide, verify, or refine outputs in the other. They therefore fail to capture a central aspiration of unified multimodal models, namely to support seamless reasoning across modalities. We address this gap with **ROVER**, a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning. It contains 1,285 tasks grounded in 2,048 images, spanning two complementary settings. **Verbally-augmented reasoning for visual generation** evaluates whether models can use structured verbal prompts and reasoning chains to guide faithful image synthesis. **Visually-augmented reasoning for verbal generation** evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes. Experiments on 17 state-of-the-art UMMs reveal two key findings: (i) cross-modal reasoning capabilities strongly correlate with visual generation performance, particularly for interleaved image–text generation; and (ii) current models remain severely limited in visually-augmented reasoning, showing relative strength in perception and physical modeling but weakness in logical tasks. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation. More information on **Anonymous Page**: https://anony0923.github.io

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

667. FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Authors:

Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce **FS-DFM** (Few-Step Discrete Flow-Matching), a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128× faster sampling and corresponding latency/throughput gains.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

668. Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models

Authors:

Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Detailed scores: 6, 6, 6

📄 openreview 📄 Download PDF

669. Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models

Authors:

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. Yet current methods face an exploration dilemma: standard RL struggles to escape the local optima of pre-trained LLMs’ sharply peaked initial policies, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. We address this with a Risk-Sensitive Reinforcement Learning framework. By adopting a risk-seeking objective that interpolates between mean and maximum rewards, we derive a novel Risk-Sensitive GRPO (RS-GRPO) algorithm that emphasizes hard prompts to drive exploration. Across six mathematical reasoning benchmarks and five LLMs, RS-GRPO consistently improves pass@k performance while enhancing or maintaining pass@1.
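One standard objective that interpolates between the mean and the maximum reward is the entropic (exponential-utility) risk measure, (1/beta) * log E[exp(beta * r)] with beta > 0. The sketch below evaluates it on a toy batch of rewards. Whether this matches the exact objective behind RS-GRPO is an assumption, since the abstract does not give the formula; the function name and the reward values are hypothetical.

```python
import math


def risk_seeking_value(rewards, beta):
    """Entropic risk value (1/beta) * log(mean(exp(beta * r))) for beta >= 0.

    beta -> 0 recovers the mean reward; large beta approaches the maximum.
    The maximum exponent is subtracted before exponentiating for stability.
    """
    if beta == 0.0:
        return sum(rewards) / len(rewards)
    shift = max(beta * r for r in rewards)
    mean_exp = sum(math.exp(beta * r - shift) for r in rewards) / len(rewards)
    return (shift + math.log(mean_exp)) / beta


# Toy batch (assumed): one correct rollout out of four for a hard prompt.
rewards = [0.0, 0.0, 0.0, 1.0]
for beta in (0.0, 1.0, 10.0, 1000.0):
    print(beta, risk_seeking_value(rewards, beta))
```

On a prompt where only one of four rollouts succeeds, the mean objective assigns a low value, while a risk-seeking objective with large beta values the prompt close to its best rollout, which is one way such an objective can put more weight on hard prompts.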

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

670. Learning linear state-space models with sparse system matrices

Authors:

Because they admit tractable analysis and control, linear state-space models (LSSMs) provide a fundamental mathematical tool for time-series data modeling in various disciplines. In particular, many LSSMs have sparse system matrices because interactions among variables are limited or only a few significant relationships exist. However, current learning algorithms for LSSMs lack the ability to learn system matrices with the sparsity constraint due to the similarity transformation. To address this issue, we impose sparsity-promoting priors on system matrices to balance modeling error and model complexity. By taking hidden states of LSSMs as latent variables, we then use the expectation-maximization (EM) algorithm to derive a maximum a posteriori (MAP) estimate of both hidden states and system matrices from noisy observations. Based on the Global Convergence Theorem, we further demonstrate that the proposed learning algorithm yields a sequence converging to a local maximum or saddle point of the joint posterior distribution. Finally, experimental results on simulation and real-world problems illustrate that the proposed algorithm can preserve the inherent topological structure among variables and significantly improve prediction accuracy over classical learning algorithms.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Detailed scores: 6, 6, 6

📄 openreview 📄 Download PDF

671. Alita-G: Self-Evolving Generative Agent for Agent Generation

Authors:

Large language models (LLMs) perform better when scaffolded into agents with memory, tools, and feedback. Beyond this, self-evolving agents have emerged, but current work largely limits adaptation to prompt rewriting or failure retries. Therefore, we present Alita-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These are then abstracted to parameterized primitives and consolidated into an MCP Box. At inference time, Alita-G performs retrieval-augmented MCP selection with the help of each tool’s descriptions and use cases, before executing an agent equipped with the MCP Executor. Across several benchmarks (GAIA, PathVQA, and Humanity's Last Exam), Alita-G attains strong gains while reducing computation costs. On GAIA validation, it achieves 83.03% pass@1 and 89.09% pass@3, establishing a new state-of-the-art result while reducing mean tokens per example by approximately 15% relative to a strong baseline agent. Alita-G thus provides a principled pathway from generalist capability to reusable, domain-specific competence, improving both accuracy and efficiency on complex reasoning tasks.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Detailed scores: 6, 6, 6

📄 openreview 📄 Download PDF

672. InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

Authors:

Human–object–scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human–object interaction (HOI) and human–scene interaction (HSI), HOSI generation requires reasoning over dynamic object–scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Code and datasets will be released upon acceptance.

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

673. Continuous Space-Time Video Super-Resolution with 3D Fourier Fields

Authors:

We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). That representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it is able to simultaneously capture fine spatial detail and smooth temporal dynamics; and (3) it offers the possibility to include an analytical, Gaussian point spread function in the sampling to ensure aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed, Fourier-like sinusoidal basis are predicted with a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art for multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Code will be published upon acceptance.
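The analytic point spread function mentioned in this abstract rests on a standard fact: convolving a sinusoid of angular frequency omega with a Gaussian of standard deviation sigma only scales its amplitude by exp(-0.5 * sigma^2 * omega^2). The sketch below applies that fact to a toy sinusoidal field over (x, y, t). The coefficient layout, the function name, and the example values are hypothetical; in the paper the coefficients are predicted by a neural encoder, which is not modeled here.

```python
import math


def sample_field(coeffs, point, sigma=(0.0, 0.0, 0.0)):
    """Evaluate a sinusoidal field at `point` = (x, y, t).

    `coeffs` is a list of (amplitude, (wx, wy, wt), phase) terms.
    `sigma` gives the per-axis standard deviation of a Gaussian point
    spread function; its effect is applied analytically by attenuating
    each term by exp(-0.5 * sum((sigma_d * w_d)^2)).
    """
    value = 0.0
    for amplitude, omega, phase in coeffs:
        attenuation = math.exp(-0.5 * sum((s * w) ** 2 for s, w in zip(sigma, omega)))
        angle = sum(w * p for w, p in zip(omega, point)) + phase
        value += amplitude * attenuation * math.sin(angle)
    return value


# Toy field (assumed): one high-frequency and one mixed space-time term.
coeffs = [(1.0, (3.0, 0.0, 0.0), 0.2), (0.5, (1.0, 2.0, 4.0), 0.0)]
point = (0.3, -0.7, 1.1)

sharp = sample_field(coeffs, point)
blurred = sample_field(coeffs, point, sigma=(0.5, 0.0, 0.0))
print(sharp, blurred)
```

Because the blur is folded into the coefficients, sampling at a coarser output scale needs no extra filtering pass: high-frequency terms are simply attenuated before evaluation, which is what suppresses aliasing.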

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 4

Detailed scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

674. FastVMT: Eliminating Redundancy in Video Motion Transfer

Authors:

Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: **motion redundancy** arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; **gradient redundancy** occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed unnecessarily for distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43× speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
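Restricting attention to a local neighborhood can be sketched as a boolean mask over a grid of image tokens. This is a generic illustration, assuming a square (Chebyshev-distance) window on a single height-by-width token grid; the abstract does not specify the neighborhood shape or which layers FastVMT masks, so the function names and the radius are hypothetical.

```python
def local_attention_mask(height, width, radius):
    """Boolean mask over tokens of a height x width grid.

    mask[q][k] is True when key token k lies within `radius` grid cells
    (Chebyshev distance) of query token q; other pairs are skipped.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(qr - kr), abs(qc - kc)) <= radius for (kr, kc) in coords]
        for (qr, qc) in coords
    ]


def kept_fraction(mask):
    """Fraction of query-key pairs that still need attention weights."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


mask = local_attention_mask(height=4, width=4, radius=1)
print(kept_fraction(mask))
```

For small, smooth frame-to-frame motion, a token's match in the neighboring frame is expected to lie nearby, so the pairs removed by such a mask are the ones least likely to carry useful motion information.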

📊 Review scores

Average score: 6.00

Lowest score: 6

Highest score: 6

Number of reviewers: 3

Detailed scores: 6, 6, 6

📄 openreview 📄 Download PDF

675. Is Softmax Loss all you need? A Principled Analysis of Softmax Loss and its Variants

Authors:

**The Softmax Loss** is one of the most widely employed surrogate objectives for classification and ranking, owing to its elegant algebraic structure, intuitive probabilistic interpretation, and consistently strong empirical performance. To elucidate its theoretical properties, recent works have introduced the Fenchel–Young framework, situating Softmax loss as a canonical instance within a broad family of convex surrogates. This perspective not only clarifies the origins of its favorable properties, but also unifies it with alternatives such as Sparsemax and $\alpha$-Entmax under a common theoretical foundation. Concurrently, another line of research has addressed on the challenge of scalability: when the number of classes is exceedingly large, computations of the partition function become prohibitively expensive. Numerous approximation strategies have thus been proposed to retain the benefits of the exact objective while improving efficiency. However, their theoretical fidelity remains unclear, and practical adoption often relies on heuristics or exhaustive search. Building on these two perspectives, we present a principled investigation of the **Softmax-family** losses, encompassing both statistical and computational aspects. Within the Fenchel–Young framework, we examine whether different surrogates satisfy consistency with classification and ranking metrics, and analyze their gradient dynamics to reveal distinct convergence behaviors. For approximate Softmax methods, we introduce a systematic bias–variance decomposition that provides convergence guarantees. We further derive a per-epoch complexity analysis across the entire family, highlighting explicit trade-offs between accuracy and efficiency. Finally, extensive experiments on a representative recommendation task corroborate our theoretical findings, demonstrating a strong alignment between consistency, convergence, and empirical performance. 
Together, these results establish a principled foundation and offer practical guidance for loss selections in large-class machine learning applications.
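As a rough illustration of two members of the Softmax family named in the abstract above, the sketch below contrasts softmax (dense output) with Sparsemax (sparse output) on a toy score vector, and computes the Softmax loss as the log-partition term minus the target score. It is a minimal standard-library sketch of the textbook definitions, not the paper's code; the function names and the example scores are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Map scores to a dense probability vector (every entry > 0)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, target):
    """Softmax (cross-entropy) loss: log-partition minus the target score."""
    m = max(scores)
    # The sum runs over every class, which is the costly step for large label sets.
    log_partition = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_partition - scores[target]


def sparsemax(scores):
    """Euclidean projection of the scores onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_sum = 0.0
    k = 0
    for i, s in enumerate(ordered, start=1):
        cumulative += s
        # Keep the largest prefix whose entries stay above the threshold.
        if 1 + i * s > cumulative:
            k, support_sum = i, cumulative
    tau = (support_sum - 1) / k
    return [max(s - tau, 0.0) for s in scores]


scores = [2.0, 1.0, -1.0]  # hypothetical class scores
print(softmax(scores))     # dense: all three classes get some mass
print(sparsemax(scores))   # sparse: low-scoring classes get exactly zero
print(softmax_loss(scores, target=0))
```

The log-partition sum in `softmax_loss` touches every class, which is the scalability bottleneck the abstract refers to when the number of classes is very large.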

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

676. PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models

Authors:

We introduce $\texttt{PairFlow}$, a lightweight preprocessing step for training Discrete Flow Models (DFMs) to achieve few-step sampling without requiring a pretrained teacher. DFMs have recently emerged as a new class of generative models for discrete data, offering strong performance. However, they suffer from slow sampling due to their iterative nature. Existing acceleration methods largely depend on finetuning, which introduces substantial additional training overhead. $\texttt{PairFlow}$ addresses this issue with a lightweight preprocessing step. Inspired by ReFlow and its extension to DFMs, we train DFMs from coupled samples of source and target distributions, without requiring any pretrained teacher. At the core of our approach is a closed-form inversion for DFMs, which allows efficient construction of paired source–target samples. Despite its extremely low cost, taking at most 1.7% of the compute needed for full model training, $\texttt{PairFlow}$ matches or even surpasses the performance of two-stage training involving finetuning. Furthermore, models trained with our framework provide stronger base models for subsequent distillation, yielding further acceleration after finetuning. Experiments on molecular data as well as binary and RGB images demonstrate the broad applicability and effectiveness of our approach.

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

677. Productive LLM Hallucinations: Conditions, Mechanisms, and Benefits

Authors:

Hallucinations in large language models (LLMs) are typically regarded as harmful errors to be suppressed. We revisit this assumption and ask whether, and under what conditions, hallucinations can instead be beneficial. To address this question, we introduce **HIVE** (**H**allucination **I**nference and **V**erification **E**ngine), a task-agnostic framework that systematically evaluates the impact of hallucinated semantics across diverse tasks and models. By unifying generation, discrimination, and downstream evaluation, HIVE enables controlled comparative assessments of how hallucinations alter overall model performance. Extensive experiments on nine datasets and ten models show that hallucinations can yield substantial improvements of up to **+17.2%** in accuracy, especially in open-ended domains such as reasoning, biomedical, and vision-language tasks. Stronger models consistently harness hallucinations, while weaker ones are more volatile. Mechanistic analyses show that hallucinations broaden semantic coverage, stabilize reasoning trajectories, and follow an inverted-U profile where moderate strength maximizes benefits across diverse tasks. These findings reframe hallucination from a defect to a controllable cognitive resource, suggesting opportunities for evaluating and training LLMs not merely to avoid hallucinations, but to exploit them constructively.

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

678. Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning

Authors:

Positive-Unlabeled (PU) learning aims to train a binary classifier (positive vs. negative) when only limited positive data and abundant unlabeled data are available. While widely applicable, state-of-the-art PU learning methods substantially underperform their supervised counterparts on complex datasets, especially without auxiliary negatives or pre-estimated parameters (e.g., a 14.26% gap on the CIFAR-100 dataset). We identify the primary bottleneck as the challenge of learning discriminative representations under unreliable supervision. To tackle this challenge, we propose NcPU, a non-contrastive PU learning framework that requires no auxiliary information. NcPU combines a noisy-pair robust supervised non-contrastive loss (NoiSNCL), which aligns intra-class representations despite unreliable supervision, with a phantom label disambiguation (PLD) scheme that supplies conservative negative supervision via regret-based label updates. Theoretically, NoiSNCL and PLD can iteratively benefit each other from the perspective of the Expectation-Maximization framework. Empirically, extensive experiments demonstrate that: (1) NoiSNCL enables simple PU methods to achieve competitive performance; and (2) NcPU achieves substantial improvements over state-of-the-art PU methods across diverse datasets, including challenging datasets on post-disaster building damage mapping, highlighting its promise for real-world applications. Code: https://github.com/ICLR2026-285/NcPU.git.

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

679. PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Authors:

Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of *choice-only shortcuts* in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1–54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants. We provide the source code and dataset viewer in the appendix, and will release the full source code, dataset, and annotation tool publicly upon acceptance.

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

680. AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining

Authors:

Formulaic alpha factor mining (FAFM) is a central problem in quantitative investment, where interpretable formulas are designed to extract predictive signals from historical financial series. With the emergence of large language models (LLMs), recent studies have begun to explore their roles in FAFM, yet their capabilities across different tasks and configurations remain unclear. In this work, we introduce AlphaBench, the first systematic benchmark for evaluating LLMs in FAFM. AlphaBench covers three core tasks: factor generation, factor evaluation, and factor searching, all of which are common steps in the workflow of quantitative researchers. Beyond task-level evaluation, we further analyze how different LLM settings, including model type, prompting paradigm, and reasoning strategy, influence performance. Our experiments on a range of open-source and closed-source models reveal that LLMs hold strong potential in automating factor mining, while also facing persistent challenges in robustness, search efficiency, and practical usability.

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

681. What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

Authors:

State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
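The "not" + "girl" example in the abstract above can be pictured with a toy merge step. The sketch below binds each negation cue to the token that follows it and gives the pair a single vector that differs from the vector of the bare noun. Everything here is an assumption made for illustration: the cue list, the function name `merge_negation`, the two-dimensional toy embeddings, and the use of an element-wise mean are hypothetical, and the abstract does not say how NegToMe actually combines tokens.

```python
NEGATION_CUES = {"not", "no", "without"}  # hypothetical cue list


def merge_negation(tokens, embeddings):
    """Bind each negation cue to the token that follows it.

    Returns the merged token list and one vector per merged token.
    The merged vector is the element-wise mean of the two inputs,
    which is an assumption made for this sketch.
    """
    merged_tokens, merged_vectors = [], []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATION_CUES and i + 1 < len(tokens):
            cue_vec, next_vec = embeddings[i], embeddings[i + 1]
            merged_tokens.append(tokens[i] + " " + tokens[i + 1])
            merged_vectors.append([(a + b) / 2 for a, b in zip(cue_vec, next_vec)])
            i += 2  # both tokens were consumed by the merge
        else:
            merged_tokens.append(tokens[i])
            merged_vectors.append(list(embeddings[i]))
            i += 1
    return merged_tokens, merged_vectors


tokens = ["person", "not", "girl"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy 2-D vectors
merged_tokens, merged_vectors = merge_negation(tokens, embeddings)
print(merged_tokens)
print(merged_vectors)
```

The point of the toy is only that the merged unit "not girl" ends up with its own representation, so a downstream matcher cannot silently reduce it to "girl".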

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 3

Individual scores: 6, 6, 6

📄 openreview 📄 Download PDF

682. HoloPart: Generative 3D Part Amodal Segmentation

Authors:

3D part amodal segmentation--decomposing a 3D shape into complete, semantically meaningful parts, even when occluded--is a challenging but crucial task for 3D content creation and understanding. Existing 3D part segmentation methods only identify visible surface patches, limiting their utility. Inspired by 2D amodal segmentation, we introduce this novel task to the 3D domain and propose a practical, two-stage approach, addressing the key challenges of inferring occluded 3D geometry, maintaining global shape consistency, and handling diverse shapes with limited training data. First, we leverage existing 3D part segmentation to obtain initial, incomplete part segments. Second, we introduce HoloPart, a novel diffusion-based model, to complete these segments into full 3D parts. HoloPart utilizes a specialized architecture with local attention to capture fine-grained part geometry and global shape context attention to ensure overall shape consistency. We introduce new benchmarks based on the ABO and PartObjaverse-Tiny datasets and demonstrate that HoloPart significantly outperforms state-of-the-art shape completion methods. By incorporating HoloPart with existing segmentation techniques, we achieve promising results on 3D part amodal segmentation, opening new avenues for applications in geometry editing, animation, and material assignment.

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

683. Lifelong Embodied Navigation Learning

Authors:

Embodied navigation agents powered by large language models have shown strong performance on individual tasks but struggle to continually acquire new navigation skills because they suffer from catastrophic forgetting. We formalize this challenge as lifelong embodied navigation learning (LENL), where an agent is required to adapt to a sequence of navigation tasks spanning multiple scenes and diverse user instruction styles, while retaining previously learned knowledge. To tackle this problem, we propose Uni-Walker, a lifelong embodied navigation framework that decouples navigation knowledge into task-shared and task-specific components with Decoder Extension LoRA (DE-LoRA). To learn the shared knowledge, we design a knowledge inheritance strategy and an experts co-activation strategy to facilitate shared knowledge transfer and refinement across multiple navigation tasks. To learn the specific knowledge, we propose an expert subspace orthogonality constraint together with a navigation-specific chain-of-thought reasoning mechanism to capture specific knowledge and enhance instruction-style understanding. Extensive experiments demonstrate the superiority of Uni-Walker for building universal embodied navigation agents with lifelong learning. We also provide the code of this work in the Supplementary Materials.

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

684. ASMIL: Attention-Stabilized Multiple Instance Learning for Whole-Slide Imaging

Authors:

Attention-based multiple instance learning (MIL) has emerged as a powerful framework for whole slide image (WSI) diagnosis, leveraging attention to aggregate instance-level features into bag-level predictions. Despite this success, we find that such methods exhibit a new failure mode: unstable attention dynamics. Across four representative attention-based MIL methods and two public WSI datasets, we observe that attention distributions oscillate across epochs rather than converging to a consistent pattern, degrading performance. This instability adds to two previously reported challenges: overfitting and over-concentrated attention distribution. To simultaneously overcome these three limitations, we introduce attention-stabilized multiple instance learning (ASMIL), a novel unified framework. ASMIL uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting. Extensive experiments demonstrate that ASMIL achieves up to a 6.49% F1 score improvement over state-of-the-art methods. Moreover, integrating the anchor model and normalized sigmoid into existing attention-based MIL methods consistently boosts their performance, with F1 score gains up to 10.73%. All code and data are publicly available at https://anonymous.4open.science/r/ASMIL-5018/.
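To make the over-concentration point in the abstract above concrete, the sketch below compares softmax attention with one plausible reading of a "normalized sigmoid": each instance score is passed through a sigmoid and the results are rescaled to sum to 1. That exact form is an assumption, since the abstract does not define it, and the instance scores and features are hypothetical toy values. This is not the ASMIL implementation.

```python
import math


def softmax(scores):
    """Standard softmax attention weights."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_sigmoid(scores):
    """Assumed form: sigmoid of each score, rescaled so the weights sum to 1.

    Toy-scale only: math.exp(-s) overflows for very negative scores.
    """
    gates = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(gates)
    return [g / total for g in gates]


def attention_pool(weights, features):
    """Aggregate instance features into one bag-level feature vector."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# One patch scores much higher than the other three (hypothetical values).
instance_scores = [5.0, 0.0, 0.0, 0.0]
instance_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

soft = softmax(instance_scores)
norm_sig = normalized_sigmoid(instance_scores)
print(max(soft), max(norm_sig))  # softmax puts far more mass on one instance
print(attention_pool(soft, instance_features))
print(attention_pool(norm_sig, instance_features))
```

Because each sigmoid gate is bounded by 1 before rescaling, a single large score cannot dominate the bag the way it does under the exponential in softmax.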

📊 Review scores

Average: 6.00

Minimum: 6

Maximum: 6

Reviewers: 4

Individual scores: 6, 6, 6, 6

📄 openreview 📄 Download PDF

ICLR 2026 High-Scoring Papers

High-scoring papers (684 papers), sorted by average score