A Living Benchmark for AI Agents in Science

SciAgentArena

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Oliver Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao
Yale University · Broad Institute of MIT and Harvard · Penn State · Stanford · Northeastern · Northwestern · Harvard · Ohio State · Microsoft Research New England · UC Berkeley

Abstract

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems. Here we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. We find that current agents can contribute effectively to well-specified data-analysis workflows when task structure and evaluation criteria are clear, but their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving reliability, autonomy, and scientific reasoning.

~200 stepwise-verified tasks
18 AI agents evaluated (generalist · specialist · MLLM)
5 scientific domains
4 task categories
Open submissions: a living benchmark

Overview

SciAgentArena separates the agent running framework from the evaluation framework, configuring a dedicated environment per agent and unifying I/O before computing metrics. This resolves configuration conflicts between agents and enables evaluation beyond accuracy — including stability, reliability, and cost.

Figure 1. (a) Limitations of current AI-agent benchmarks. (b) Limitations of current science benchmarks. (c) Selected example tasks. (d) The running and evaluation framework, with the share of tasks per category. (e) AIME accuracy of GPT 5.2 (100%) versus SciAgentArena scores of the domain-specific best agent — showing substantial remaining headroom on real scientific tasks.
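
The page does not show the framework's code, so the following is only a minimal sketch of how evaluation can be decoupled from agent execution once run outputs share one schema. The `RunRecord` fields, the `summarize` function, and the toy numbers are all hypothetical and are not taken from SciAgentArena.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass
class RunRecord:
    """One agent run on one task after I/O unification (hypothetical schema)."""
    agent: str
    task: str
    score: Optional[float]  # task metric; None if the run failed to produce output
    cost_usd: float         # cost attributed to this run


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Aggregate repeated runs of one agent on one task."""
    if not records:
        raise ValueError("need at least one run record")
    finished = [r.score for r in records if r.score is not None]
    return {
        # Accuracy-style view: average metric over runs that finished.
        "mean_score": mean(finished) if finished else 0.0,
        # Stability: spread of the metric across nominally identical runs.
        "stability_std": pstdev(finished) if len(finished) > 1 else 0.0,
        # Reliability: fraction of runs that produced a scorable output.
        "completion_rate": len(finished) / len(records),
        # Cost: average spend per run, including failed runs.
        "mean_cost_usd": mean(r.cost_usd for r in records),
    }


# Toy data, invented for illustration only.
runs = [
    RunRecord("agent_a", "toy_task", 0.60, 0.12),
    RunRecord("agent_a", "toy_task", 0.70, 0.10),
    RunRecord("agent_a", "toy_task", None, 0.05),
]
print(summarize(runs))
```

Because `summarize` only reads unified records, it needs no knowledge of how each agent was installed or invoked, which is the separation the overview describes.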

Best-agent performance on representative tasks

While frontier models saturate math benchmarks (GPT 5.2 reaches 100% on AIME), the strongest agent on each SciAgentArena domain leaves clear room to improve.

hERG Prediction (Drug Discovery, AUROC): 68.1
DEG detection (Single-Cell Omics, F1): 29.4
Spatially variable genes (Spatial Omics, Jaccard): 47.6
Direct Diagnosis (EHR Modeling, Accuracy): 70
Polygenic Risk Score (Genetics, Accuracy): 80
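
For readers unfamiliar with the set-overlap metrics named above, here is a minimal sketch of the standard Jaccard and F1 definitions applied to a predicted gene set and a reference gene set. The page does not specify how SciAgentArena computes or scales these scores, so the function names and the toy gene labels are assumptions made for illustration.

```python
def jaccard(predicted: set[str], reference: set[str]) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = predicted | reference
    # Two empty sets are treated as identical by convention here.
    return len(predicted & reference) / len(union) if union else 1.0


def f1(predicted: set[str], reference: set[str]) -> float:
    """F1: harmonic mean of precision and recall on set membership."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy gene sets, invented for illustration only.
predicted = {"GENE_A", "GENE_B", "GENE_C"}
reference = {"GENE_B", "GENE_C", "GENE_D", "GENE_E"}

print(f"Jaccard = {jaccard(predicted, reference):.3f}")  # 2 shared / 5 total = 0.400
print(f"F1      = {f1(predicted, reference):.3f}")       # 4/7, about 0.571
```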

Five scientific domains, from molecules to patients

The benchmark spans the full arc of biomedical research — from identifying problems and collecting data to understanding disease and developing treatments, and from the molecular and cellular scale up to the whole human body.

Molecule

Drug Discovery

ADMET prediction, protein–ligand binding affinity, lead optimization, drug-target interaction, and combination-synergy scoring.

Cell

Single-Cell Omics

Integration, clustering, differentially expressed genes, perturbation prediction, and trajectory inference.

Tissue

Spatial Omics

Spatial graph construction, spatially variable genes, and neighborhood enrichment on a dedicated page.

Patient

EHR Modeling

Clinical-code normalization, rare-event extraction, outcome prediction, and treatment recommendation scored via FHIR.

Genome

Genetics

Mendelian randomization, quality control (QC), and polygenic risk score (PRS) computation pipelines over real GWAS summary statistics.

Cross-domain

Integrative Tasks

eQTL computation (genetics × omics), drug-target identification, and synthetic-lethality prediction across modalities.
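
To make one of the Genetics tasks above concrete, the sketch below computes an additive polygenic risk score as the sum of effect size times effect-allele dosage over variants present in both inputs. This is the textbook additive form, not necessarily the pipeline used in the benchmark; the function name, variant IDs, and numbers are hypothetical, and real pipelines add steps such as QC and allele harmonization that are omitted here.

```python
def polygenic_risk_score(
    effect_sizes: dict[str, float],
    dosages: dict[str, float],
) -> float:
    """Additive PRS for one individual.

    effect_sizes: variant ID -> per-allele effect size from summary statistics.
    dosages:      variant ID -> effect-allele dosage (0 to 2) for the individual.
    Variants missing from either input are skipped.
    """
    shared = sorted(effect_sizes.keys() & dosages.keys())
    return sum(effect_sizes[v] * dosages[v] for v in shared)


# Toy inputs, invented for illustration only.
betas = {"rs_toy_1": 0.20, "rs_toy_2": -0.10, "rs_toy_3": 0.05}
genotype = {"rs_toy_1": 2, "rs_toy_2": 1, "rs_toy_4": 0}

# rs_toy_1 contributes 0.20 * 2 and rs_toy_2 contributes -0.10 * 1;
# rs_toy_3 and rs_toy_4 are ignored because they appear in only one input.
print(f"PRS = {polygenic_risk_score(betas, genotype):.2f}")
```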

Four core capability categories

Every domain's tasks map onto the same reasoning, planning, and action requirements of real research.

Data Analysis

Solve a long-horizon analysis problem step by step, testing the ability to sustain multi-step workflows.

Optimization

Improve candidate solutions toward stated objectives — selecting methods or designing new strategies.

Discovery

Explore a research domain and propose new, scientifically grounded hypotheses and ideas.

Validity

Determine whether a proposed task is scientifically and technically feasible before executing it.
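
A Validity-style check can be pictured as a gate that runs before any analysis code. The sketch below is a hypothetical illustration, assuming a task declares the columns and minimum sample count it needs; the `PremiseReport` type, the `check_premise` function, and the task fields are invented here and are not part of SciAgentArena.

```python
from dataclasses import dataclass, field


@dataclass
class PremiseReport:
    """Structured outcome of a feasibility check (hypothetical schema)."""
    feasible: bool
    reasons: list[str] = field(default_factory=list)


def check_premise(task: dict, available_columns: set[str], n_samples: int) -> PremiseReport:
    """Return a refusal with reasons when the task cannot be carried out as stated."""
    reasons = []
    missing = sorted(set(task["required_columns"]) - available_columns)
    if missing:
        reasons.append(f"missing columns: {', '.join(missing)}")
    if n_samples < task["min_samples"]:
        reasons.append(f"only {n_samples} samples, need at least {task['min_samples']}")
    return PremiseReport(feasible=not reasons, reasons=reasons)


# Toy task, invented for illustration only.
task = {"required_columns": ["condition", "batch"], "min_samples": 3}
report = check_premise(task, available_columns={"condition"}, n_samples=2)

if report.feasible:
    print("proceed with the analysis")
else:
    print("refuse:", "; ".join(report.reasons))
```

Returning reasons rather than a bare boolean lets an agent explain why it declined instead of silently executing an ill-posed request.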

An interactive, extensible platform

Because code execution and evaluation are decoupled, the community can add agents and tasks simultaneously. Open a dedicated benchmark page, submit code or data, and receive automatic, step-wise evaluation — our goal is the "LeetCode" of scientific-agent design.

The SciAgentArena platform: scientific roadmap from patient to molecule, the benchmark portal with seven benchmarks, and the submission/auto-evaluation system.
Figure 2. (a) Scientific challenges span micro-to-macro discovery: patients, genetics, spatial/single-cell omics, and drug discovery. (b) The interactive benchmarking portal with per-domain benchmark pages. (c) Solution submission and automatic evaluation.

Key findings

Current AI agents are useful but uneven scientific collaborators.

1. No single agent dominates. Rankings are inconsistent across tasks and domains, exposing agent-specific biases and the need for stronger generalization.

2. Heterogeneous contributions. Agents excel at fixed-pipeline data analysis but struggle to optimize molecules and algorithms or to derive genuinely novel discoveries.

3. Conservative method selection. Agents converge to popular defaults (Leiden, Harmony, scVI, Wilcoxon, Moran's I, OLS) instead of adapting to the task.

4. Limited self-exploration & stability. Agents call tools passively, vary across identical runs, and share error patterns such as mismatched package versions.

5. Weak validity checking. Agents tend to execute requests even when the scientific premise is invalid, a form of sycophancy that undermines reliability.

6. A roadmap forward. Expand knowledge bases, supply richer prompts and tool awareness, add resource-aware decision-making, and keep humans in the loop.

Common failure modes & proposed fixes

A cross-domain taxonomy of recurring agent errors, distilled from running 18 agents over ~200 tasks.

Failure mode | Where it appears | Proposed fix
Version / environment-mismatched API or tool | Omics; Drug Discovery | Check package versions and pin signatures before coding.
Wrong library, namespace, or method wrapper | Omics; Drug Discovery | Use task-to-library mapping and API introspection.
Wrong data structure / output schema assumption | Omics | Inspect runtime objects and validate the final output.
Domain workflow over-generalization | Omics; EHR; Drug Discovery | Use domain-specific checklists before execution.
Insufficient data / context inspection | Omics; EHR; Drug Discovery; Genetics | Add a mandatory context-inspection stage.
Hallucinated valid actions | EHR | Ground every action in retrieved patient context.
Incomplete recommendation coverage | EHR | Map all active problems to required treatments.
Multi-objective optimization failure | Drug Discovery | Use constraint-first, window-aware optimization.
Tool availability ignored | Drug Discovery | Require tool calls with traceable evidence.
Executing ill-posed / invalid tasks | Drug Discovery; Omics; Genetics | Add premise checks and structured refusal when invalid.
Need for human–agent collaboration | Genetics | Request clarification on paths, covariates, phenotypes, PRS setup.
Adapted from Figure 6 of the paper — the full merged-error taxonomy.
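
The first two fixes in the table (checking package versions and introspecting APIs before coding) can be approximated with the Python standard library. The sketch below is one possible illustration, not the paper's tooling: `installed_version`, `accepts_keywords`, and the stand-in `cluster` function are hypothetical, while `importlib.metadata.version` and `inspect.signature` are standard-library calls.

```python
import inspect
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def accepts_keywords(func: Callable, *names: str) -> bool:
    """Check that every name is an explicit parameter of func.

    Functions that take **kwargs would need separate handling, since an
    explicit-parameter check cannot tell which extra keywords they honor.
    """
    params = inspect.signature(func).parameters
    return all(name in params for name in names)


def cluster(data, resolution=1.0, random_state=0):
    """Stand-in for a library function; hypothetical, for illustration only."""
    return data


# The version check returns None rather than raising when a package is missing.
print(installed_version("scanpy"))

# Confirm the keyword exists before generating code that passes it.
print(accepts_keywords(cluster, "resolution"))    # True
print(accepts_keywords(cluster, "n_iterations"))  # False
```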

Agents evaluated

18 agents spanning generalist LLMs, specialist scientific agents, and multimodal models — each run at maximal functionality with its most suitable backbone for a fair comparison.

GPT 5.2 · Gemini 3 Pro · Claude Sonnet 4.6 · Codex · Claude Code · ToolUniverse · CellForge · STELLA · AutoBA · Biomni · TxAgent · Medea · CACTUS · ChemToolAgent · DrugAgent · LIDDiA · DELTA · MRagent

Citation

If you find SciAgentArena useful, please cite our work.

@article{liu2026sciagentarena,
  title  = {Benchmarking AI Agents for Addressing Scientific Challenges Across Scales},
  author = {Liu, Tianyu and Wang, Allen Xin and Panescu, Antonia and Chen, Lisa Xinyi
            and Long, Wenxin and Wei, Xinyu and Jing, Yueqian and Zeng, Ziyao and others},
  year   = {2026},
  note   = {SciAgentArena},
  url    = {https://arxiv.org/abs/2606.12736}
}

Tasks & datasets: huggingface.co/datasets/iLOVE2D/SciAgentArena