Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems. Here we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. We find that current agents can contribute effectively to well-specified data-analysis workflows when task structure and evaluation criteria are clear, but their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving reliability, autonomy, and scientific reasoning.
SciAgentArena separates the agent running framework from the evaluation framework, configuring a dedicated environment per agent and unifying I/O before computing metrics. This resolves configuration conflicts between agents and enables evaluation beyond accuracy — including stability, reliability, and cost.
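As a rough illustration of this separation, the sketch below keeps the runner and the evaluator on opposite sides of a single record type. It is not SciAgentArena's actual interface: `RunRecord`, `run_agent`, `evaluate`, and every field name are hypothetical, and exact string match stands in for whatever task-specific scoring the benchmark really uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunRecord:
    """Unified I/O record (hypothetical schema) that any agent runner emits."""
    agent: str
    task_id: str
    output: str
    cost_usd: float
    succeeded: bool


def run_agent(agent: str, task_id: str, solve: Callable[[str], str],
              cost_usd: float = 0.0) -> RunRecord:
    """Runner side: call the agent and normalize its result into a RunRecord."""
    try:
        return RunRecord(agent, task_id, solve(task_id), cost_usd, True)
    except Exception:
        # A crashed run is still recorded so reliability can be measured.
        return RunRecord(agent, task_id, "", cost_usd, False)


def evaluate(records: List[RunRecord], expected: Dict[str, str]) -> Dict[str, float]:
    """Evaluator side: sees only RunRecords, never the agent or its environment."""
    if not records:
        raise ValueError("no records to evaluate")
    n = len(records)
    # Exact match is an assumption made for this sketch only.
    correct = sum(r.succeeded and r.output == expected[r.task_id] for r in records)
    return {
        "accuracy": correct / n,
        "reliability": sum(r.succeeded for r in records) / n,
        "mean_cost_usd": sum(r.cost_usd for r in records) / n,
    }
```

Because the evaluator depends only on the record schema, each agent can run in its own environment with its own dependencies, and metrics such as reliability and cost fall out of the same records as accuracy.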
While frontier models saturate math benchmarks (GPT 5.2 reaches 100% on AIME), even the strongest agent in each SciAgentArena domain leaves clear room for improvement.
The benchmark spans the full arc of biomedical research — from identifying problems and collecting data to understanding disease and developing treatments, and from the molecular and cellular scale up to the whole human body.
ADMET prediction, protein–ligand binding affinity, lead optimization, drug-target interaction, and combination-synergy scoring.
Integration, clustering, differentially expressed genes, perturbation prediction, and trajectory inference.
Spatial graph construction, spatially variable genes, and neighborhood enrichment, hosted on a dedicated page.
Clinical-code normalization, rare-event extraction, outcome prediction, and treatment recommendation scored via FHIR.
Mendelian randomization, QC, and polygenic risk score (PRS) computation pipelines over real GWAS summary statistics.
eQTL computation (genetics × omics), drug-target identification, and synthetic-lethality prediction across modalities.
Every domain's tasks map onto the same reasoning, planning, and action requirements of real research.
Solve a long-horizon analysis problem step by step, testing the ability to sustain multi-step workflows.
Improve candidate solutions toward stated objectives — selecting methods or designing new strategies.
Explore a research domain and propose new, scientifically grounded hypotheses and ideas.
Determine whether a proposed task is scientifically and technically feasible before executing it.
Because code execution and evaluation are decoupled, the community can add agents and tasks simultaneously. Open a dedicated benchmark page, submit code or data, and receive automatic, step-wise evaluation — our goal is the "LeetCode" of scientific-agent design.
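One way to picture step-wise evaluation is as an ordered list of named checks run against the artifacts a submission produces. The sketch below is only an illustration under that assumption: `stepwise_score`, the artifact keys, and the equal weighting of steps are hypothetical and are not taken from the benchmark's implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step is a name plus a predicate over the submission's artifacts (hypothetical).
StepCheck = Tuple[str, Callable[[Dict[str, Any]], bool]]


def stepwise_score(artifacts: Dict[str, Any],
                   checks: List[StepCheck]) -> Dict[str, Any]:
    """Run every check in order and report per-step results plus a summary score."""
    results: Dict[str, bool] = {}
    for name, check in checks:
        try:
            results[name] = bool(check(artifacts))
        except Exception:
            # A missing or malformed artifact counts as a failed step.
            results[name] = False
    # Equal weighting per step is an assumption made for this sketch.
    score = sum(results.values()) / len(checks) if checks else 0.0
    return {"steps": results, "score": score}
```

For example, a clustering task could register checks for loading the data, producing more than one cluster, and returning a non-empty gene table. A submission that stops after clustering would then earn partial credit, and the report would show which step failed.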
Current AI agents are useful but uneven scientific collaborators.
No single agent dominates. Rankings are inconsistent across tasks and domains, exposing agent-specific biases and the need for stronger generalization.
Heterogeneous contributions. Agents excel at fixed-pipeline data analysis but struggle to optimize molecules and algorithms or to derive genuinely novel discoveries.
Conservative method selection. Agents converge on popular defaults (Leiden, Harmony, scVI, Wilcoxon, Moran's I, OLS) instead of adapting to the task.
Limited self-exploration & stability. Agents call tools passively, produce different results across identical runs, and share error patterns such as mismatched package versions.
Weak validity checking. Agents tend to execute requests even when the scientific premise is invalid — a form of sycophancy that undermines reliability.
A roadmap forward. Expand knowledge bases, supply richer prompts and tool awareness, add resource-aware decision-making, and keep humans in the loop.
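The stability finding above concerns variation across identical runs. A minimal way to quantify it, assuming each repeated run of one agent on one task yields a single numeric score, is sketched below. The function name and the choice of summary statistics are illustrative and are not the benchmark's own stability metric.

```python
from statistics import mean, pstdev
from typing import Dict, List


def run_stability(scores: List[float]) -> Dict[str, float]:
    """Summarize how much one agent's score moves across repeated identical runs."""
    if len(scores) < 2:
        raise ValueError("need at least two repeated runs to assess stability")
    return {
        "mean": mean(scores),
        "std": pstdev(scores),               # population standard deviation
        "range": max(scores) - min(scores),  # worst-case spread between runs
    }
```

A perfectly repeatable agent gives a standard deviation and range of zero. Reporting the spread alongside the mean keeps an agent with one lucky run from looking as strong as one that scores the same every time.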
A cross-domain taxonomy of recurring agent errors, distilled from running 18 agents over ~200 tasks.
| Failure mode | Where it appears | Proposed fix |
|---|---|---|
| Version / environment-mismatched API or tool | Omics; Drug Discovery | Check package versions and pin signatures before coding. |
| Wrong library, namespace, or method wrapper | Omics; Drug Discovery | Use task-to-library mapping and API introspection. |
| Wrong data structure / output schema assumption | Omics | Inspect runtime objects and validate the final output. |
| Domain workflow over-generalization | Omics; EHR; Drug Discovery | Use domain-specific checklists before execution. |
| Insufficient data / context inspection | Omics; EHR; Drug Discovery; Genetics | Add a mandatory context-inspection stage. |
| Hallucinated valid actions | EHR | Ground every action in retrieved patient context. |
| Incomplete recommendation coverage | EHR | Map all active problems to required treatments. |
| Multi-objective optimization failure | Drug Discovery | Use constraint-first, window-aware optimization. |
| Tool availability ignored | Drug Discovery | Require tool calls with traceable evidence. |
| Executing ill-posed / invalid tasks | Drug Discovery; Omics; Genetics | Add premise checks and structured refusal when invalid. |
| Need for human–agent collaboration | Genetics | Request clarification on paths, covariates, phenotypes, PRS setup. |
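
The first proposed fix in the table, checking package versions and signatures before coding, can be approximated with the Python standard library alone. The sketch below is one possible pre-flight step: the helper names `installed_versions` and `has_parameter` are hypothetical, while `importlib.metadata.version` and `inspect.signature` are standard-library calls.

```python
import inspect
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def has_parameter(func: Callable, name: str) -> bool:
    """Return True if `func` accepts a parameter called `name` in this environment."""
    try:
        return name in inspect.signature(func).parameters
    except (TypeError, ValueError):
        # Some built-in or extension callables expose no introspectable signature.
        return False
```

An agent could call `installed_versions` on the libraries it plans to use and confirm with `has_parameter` that a keyword argument exists before generating code that depends on it, instead of assuming the API it remembers from training.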
18 agents spanning generalist LLMs, specialist scientific agents, and multimodal models — each run at maximal functionality with its most suitable backbone for a fair comparison.
If you find SciAgentArena useful, please cite our work.
@article{liu2026sciagentarena,
title = {Benchmarking AI Agents for Addressing Scientific Challenges Across Scales},
author = {Liu, Tianyu and Wang, Allen Xin and Panescu, Antonia and Chen, Lisa Xinyi
and Long, Wenxin and Wei, Xinyu and Jing, Yueqian and Zeng, Ziyao and others},
year = {2026},
note = {SciAgentArena},
url = {https://arxiv.org/abs/2606.12736}
}
Tasks & datasets: huggingface.co/datasets/iLOVE2D/SciAgentArena