digitado

Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment

digitado ⋅ 15 January 2026

arXiv:2601.08849v1. Announce Type: new. Abstract: Automated answer matching, which leverages LLMs to evaluate free-text responses by comparing them to a reference answer, shows substantial promise as a scalable and aligned alternative to human evaluation. However, its reliability requires robustness against strategic attacks such as guesswork or verbosity that may artificially inflate scores without improving actual correctness. In this work, we systematically investigate whether such tactics deceive answer matching models by prompting examinee models to: (1) generate verbose responses, […]

technocracy

crates.io: The Changes Being Made to Download Handling

digitado ⋅ 7 March 2026

Like the rest of the Rust community, crates.io has been growing rapidly, with download and package counts increasing 2-3x year-on-year. This growth doesn’t come without problems, and we have made some changes to download handling on crates.io to ensure we can keep providing crates for a long time to come. The Problem: This growth has brought with it some challenges. The most significant of these is that all download requests currently go through the crates.io API, occasionally causing […]

The Bayesian Geometry of Transformer Attention

digitado ⋅ 29 January 2026

arXiv:2512.22471v3. Announce Type: replace-cross. Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing \emph{Bayesian wind tunnels}: controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, […]

Zero grip, maximum fun: A practical guide to getting into amateur ice racing

digitado ⋅ 19 February 2026

In Formula One, grip is everything. The world’s best engineers devote their careers to designing cars that maximize downforce and grip to squeeze every bit of performance out of a set of four humble tires. These cars punish their drivers by slinging them at six Gs through corners and offer similar levels of abuse in braking. It’s all wildly impressive, but I’ve long maintained that those drivers are not the ones having the most fun. When it comes […]

“It doesn’t feel safe”—Many international game developers plan to skip GDC in US

digitado ⋅ 9 March 2026

This week, tens of thousands of game developers and producers will once again gather in San Francisco, as they have since 1988, for the weeklong Game Developers Conference. But this year’s show will be missing many international developers who say they no longer feel comfortable traveling to the United States to attend, no matter how relevant the show is to their work and careers. Dozens of those developers who spoke to Ars in recent months say they’re wary […]

Optimism Stabilizes Thompson Sampling for Adaptive Inference

digitado ⋅ 6 February 2026

arXiv:2602.06014v1. Announce Type: cross. Abstract: Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference requiring each arm’s pull […]

$\alpha^3$-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks

digitado ⋅ 8 January 2026

arXiv:2601.03281v1. Announce Type: new. Abstract: Large Language Models (LLMs) are increasingly used as high-level controllers for autonomous Unmanned Aerial Vehicle (UAV) missions. However, existing evaluations rarely assess whether such agents remain safe, protocol-compliant, and effective under realistic next-generation networking constraints. This paper introduces $\alpha^3$-Bench, a benchmark for evaluating LLM-driven UAV autonomy as a multi-turn conversational reasoning and control problem operating under dynamic 6G conditions. Each mission is formulated as a language-mediated control […]

Anthropic-Pentagon AI feud escalates

digitado ⋅ 17 February 2026

The Pentagon may soon label Anthropic a “supply chain risk” in response to the company’s limits on how the military uses its AI. The feud, which only appears to be escalating, highlights a deeper tension now shaping the AI era: who controls how frontier models are deployed in military operations, the labs that build them or the governments that use them? In […]

Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

digitado ⋅ 8 January 2026

We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional […]

CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

digitado ⋅ 9 January 2026

Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which repeats the easy-to-hard curriculum multiple times during training […]
