Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification
arXiv:2603.03336v1 Announce Type: new

Abstract: Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are […]
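To make the point-estimate concern concrete, here is a minimal sketch (not the paper's method) of how ranking uncertainty can be quantified for pairwise preference data: fit Bradley-Terry strengths from win counts, then bootstrap the comparisons to obtain a distribution over ranks rather than a single ordering. The function names, the MM fitting routine, and the smoothing constant are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def bradley_terry(wins, iters=200, eps=0.1):
    """Fit Bradley-Terry strengths via MM updates (illustrative sketch).

    wins[i, j] = number of comparisons in which model i beat model j.
    A small off-diagonal smoothing `eps` keeps every strength positive
    even when a model wins no comparisons in a bootstrap resample.
    """
    m = wins.shape[0]
    wins = wins + eps * (1.0 - np.eye(m))
    n = wins + wins.T                      # total comparisons per pair
    p = np.ones(m)
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = np.where(n > 0, n / (p[:, None] + p[None, :]), 0.0)
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p = p / p.sum()                    # fix the scale
    return p

def rank_distribution(comparisons, m, n_boot=200, seed=0):
    """Bootstrap (winner, loser) pairs; return freq[i, r] = fraction of
    resamples in which model i is ranked at position r (0 = best)."""
    rng = np.random.default_rng(seed)
    comparisons = np.asarray(comparisons)
    freq = np.zeros((m, m))
    for _ in range(n_boot):
        idx = rng.integers(len(comparisons), size=len(comparisons))
        sample = comparisons[idx]
        wins = np.zeros((m, m))
        np.add.at(wins, (sample[:, 0], sample[:, 1]), 1.0)
        ranks = np.argsort(np.argsort(-bradley_terry(wins)))
        freq[np.arange(m), ranks] += 1.0
    return freq / n_boot
```

A leaderboard built from `rank_distribution` reports, for each model, how often it occupies each position across resamples; two models whose rank distributions overlap heavily are statistically indistinguishable, which is exactly the situation where acting on a point-estimate ranking risks misallocation.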