arXiv 2605.23204

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

Guiyao Tie1, Jiawen Shi1, Dingjie Song2, Yixiao Huang1, Ziji Sheng1, Xueyang Zhou1, Yongchao Chen3, Daizong Liu4

Pan Zhou1, Ran Xu5, Lifang He2, Qingsong Wen6, Manling Li7, Cong Lu8, Shuai Li9, Pengtao Xie10

Yixuan Yuan11, Rui Meng14, Lei Xing13, Lichao Sun2, Caiming Xiong15, Philip S. Yu12, Jianfeng Gao16

1Huazhong University of Science and Technology 2Lehigh University 3Tsinghua University 4Wuhan University 5Salesforce Research 6Squirrel AI Learning

7Northwestern University 8Independent 9Shanghai Jiao Tong University 10University of California San Diego 11Chinese University of Hong Kong

12University of Illinois Chicago 13Stanford University 14Google Cloud AI Research 15Recursive Superintelligence 16Microsoft Research

Figure 1. AutoResearch progression across literature grounding, planning, execution, validation, and reporting.

Abstract

Scientific research is moving from isolated AI assistance toward integrated, longer-horizon workflow automation. AutoResearch captures this transition by treating scientific discovery as a coordinated process spanning literature grounding, planning, experimentation, validation, and reporting. Rather than framing the field through disconnected capabilities such as search, coding, or writing alone, the survey argues that scientific autonomy must be understood at the workflow level. What matters is not only whether an AI system can perform a subtask, but how control, execution, and validation are distributed across the entire research loop. To organize this landscape, the paper proposes a five-level autonomy spectrum from L0 (Human Only) to L4 (AI-Autonomous), identifies five recurring technical workflow stages, and shifts evaluation toward scientific quality dimensions including novelty, validity, impact, reliability, and provenance. It also emphasizes that autonomy ceilings are domain-conditioned, varying substantially across computational, empirical, clinical, and social sciences.

Core Contribution

A single survey framework that links autonomy levels, workflow stages, scientific-quality evaluation, and domain-conditioned ceilings.

Methodological View

The paper studies AutoResearch as a workflow system, not as a loose collection of isolated AI capabilities.

Evaluation Shift

Benchmark accuracy is insufficient; the survey centers novelty, validity, impact, reliability, and provenance.

Deployment Implication

Autonomy ceilings differ sharply across scientific fields because the structure of evidence and validation differs.

Survey Lens

  • Autonomy levels L0-L4
  • Five workflow stages
  • Scientific quality dimensions
  • Domain-aware deployment ceilings

Conceptual Overview

Because this is a survey, the homepage foregrounds the paper's framing logic before drilling into subcomponents. The survey framework first establishes the conceptual map, then unfolds into autonomy, technical foundations, evaluation, and domain constraints.

Figure 2. Overview of the survey framework: concept, technical foundations, evaluation, and domains.

Autonomy Spectrum

The five-level autonomy spectrum is the paper's backbone. It distinguishes not only who performs research tasks, but who retains control, who executes, and who ultimately validates scientific claims.

Figure 3. Autonomy spectrum from human-only inquiry to AI-autonomous science.

L0: Human Only

Humans plan, execute, and validate the full inquiry loop.

L1: Human-Led, AI-Assisted

AI supports bounded cognitive work while humans retain end-to-end control.

L2: Human-Verified, AI-Executed

AI executes substantial steps but humans still verify outputs and claims.

L3: AI-Led, Human-Assisted

AI coordinates the workflow and humans mainly intervene for exceptional cases.

L4: AI-Autonomous

End-to-end scientific closure no longer structurally depends on human oversight.

How to read the spectrum

The levels should not be read as a simple capability ladder. They encode changing distributions of scientific authority across control, execution, and validation, which is why the same system may appear strong in one stage but remain limited in full workflow autonomy.
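One way to make this reading concrete is to record, for each level, who holds control, execution, and validation. The sketch below is illustrative only: the `Actor` values, the per-level role assignments, and the "weakest stage" rule in `workflow_level` are assumptions drawn from the short level descriptions on this page, not a table or formula from the survey.

```python
from dataclasses import dataclass
from enum import Enum


class Actor(Enum):
    HUMAN = "human"
    SHARED = "shared"
    AI = "ai"


@dataclass(frozen=True)
class AutonomyLevel:
    code: str
    name: str
    control: Actor      # who decides what happens next
    execution: Actor    # who carries out the steps
    validation: Actor   # who signs off on scientific claims


# Assumed role assignments: one possible reading of the level descriptions.
LEVELS = {
    "L0": AutonomyLevel("L0", "Human Only", Actor.HUMAN, Actor.HUMAN, Actor.HUMAN),
    "L1": AutonomyLevel("L1", "Human-Led, AI-Assisted", Actor.HUMAN, Actor.SHARED, Actor.HUMAN),
    "L2": AutonomyLevel("L2", "Human-Verified, AI-Executed", Actor.HUMAN, Actor.AI, Actor.HUMAN),
    "L3": AutonomyLevel("L3", "AI-Led, Human-Assisted", Actor.AI, Actor.AI, Actor.SHARED),
    "L4": AutonomyLevel("L4", "AI-Autonomous", Actor.AI, Actor.AI, Actor.AI),
}

AXES = ("control", "execution", "validation")


def human_dependencies(level):
    """List the axes on which a level still involves a human."""
    return [axis for axis in AXES if getattr(level, axis) is not Actor.AI]


def workflow_level(stage_levels):
    """Assumed rule: full-workflow autonomy is capped by the weakest stage."""
    return min(stage_levels.values())


if __name__ == "__main__":
    for level in LEVELS.values():
        print(level.code, level.name, human_dependencies(level))
    # A system that is strong in execution but weak in validation.
    print(workflow_level({"grounding": 3, "execution": 3, "validation": 1}))
```

Under these assumptions, a system that reaches level 3 in execution but only level 1 in validation is treated as a level 1 workflow, which is the sense in which per-stage strength does not imply full workflow autonomy.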

Historical Trajectory

The historical view maps representative systems onto the autonomy spectrum. This makes the survey longitudinal: beyond defining levels, it shows how research automation has evolved toward them over time.

Figure 4. Historical trajectory mapping representative systems onto the AutoResearch spectrum.

Technical Foundations

Instead of listing isolated methods, the survey decomposes AutoResearch into a recurring five-stage workflow. The stages below follow the pipeline in order, from grounding to reporting.

Figure 5. The five recurring technical stages of AutoResearch workflows.

Stage I: Literature and Research Grounding

Anchors the workflow in search, evidence, and literature memory.

Stage II: Hypothesis Formation and Planning

Transforms grounded context into operationalizable candidate directions.

Stage III: Experimentation and Tool Use

Binds plans to code, tools, simulators, and executable substrates.

Stage IV: Feedback, Validation, and Review

Introduces rejection pressure through reruns, critique, and review.

Stage V: Reporting and Knowledge Communication

Preserves provenance and claim-evidence alignment in final outputs.
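The five stages can be pictured as functions that pass a shared state along the pipeline. The following is a toy sketch, not an implementation from the survey: `ResearchState`, the stage function names, the stubbed hypotheses and scores, and the 0.5 acceptance threshold are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    question: str
    evidence: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    results: dict = field(default_factory=dict)
    accepted: list = field(default_factory=list)
    report: str = ""


def ground(state):
    # Stage I: anchor the question in evidence (stubbed with a single note).
    state.evidence = [f"literature note on: {state.question}"]
    return state


def plan(state):
    # Stage II: turn grounded context into candidate hypotheses (stubbed).
    state.hypotheses = ["h1", "h2"]
    return state


def experiment(state):
    # Stage III: bind each hypothesis to an executable check (toy scores).
    toy_scores = {"h1": 0.9, "h2": 0.4}
    state.results = {h: toy_scores[h] for h in state.hypotheses}
    return state


def validate(state, threshold=0.5):
    # Stage IV: apply rejection pressure; only results above threshold survive.
    state.accepted = [h for h, score in state.results.items() if score >= threshold]
    return state


def report(state):
    # Stage V: report only accepted claims, each tied to its result and evidence.
    lines = [
        f"{h}: score={state.results[h]} (evidence: {len(state.evidence)} source(s))"
        for h in state.accepted
    ]
    state.report = "\n".join(lines)
    return state


STAGES = [ground, plan, experiment, validate, report]


def run_workflow(question):
    state = ResearchState(question=question)
    for stage in STAGES:
        state = stage(state)
    return state


if __name__ == "__main__":
    print(run_workflow("does X improve Y?").report)
```

The point of the structure is that the reporting stage can only see what the validation stage let through, so a rejected hypothesis never reaches the final output.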

Evaluation

The survey argues that scientific automation is advancing faster than scientific verifiability. Evaluation therefore has to move beyond task-level correctness and ask whether an entire workflow can sustain scientific quality.

Figure 6. Scientific quality dimensions for AutoResearch evaluation.

Novelty

Measures whether a system expands the scientific search space with genuinely new questions, hypotheses, or design choices.

  • Distinguishes recombination from substantive conceptual contribution.
  • Most visible in planning, hypothesis generation, and idea selection.

Validity

Checks that methods, execution traces, and final claims remain scientifically defensible under scrutiny.

  • Requires evidence-claim alignment rather than fluent but unsupported reporting.
  • Becomes the core bottleneck once AI begins executing larger workflow segments.

Impact

Asks whether outputs matter for downstream scientific progress, not merely whether they are technically correct.

  • Useful systems should influence follow-up experiments, benchmarks, or theory.
  • Impact must be interpreted relative to the norms of each target discipline.

Reliability

Evaluates stability across reruns, perturbations, tool changes, and long-horizon workflow execution.

  • One-off success is not enough for credible research automation.
  • Scientific systems must remain robust when environments or seeds shift.
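A minimal sketch of a rerun-based reliability check follows. It assumes a workflow run can be wrapped as a function of a random seed that returns one scalar metric; `noisy_experiment`, `reliability_report`, and the `max_spread` tolerance are hypothetical names and values, not measures defined by the survey.

```python
import random
import statistics


def noisy_experiment(seed):
    """Hypothetical stand-in for one workflow run; returns a scalar metric."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.02, 0.02)


def reliability_report(run, seeds, max_spread=0.05):
    """Rerun `run` once per seed and summarize how much the metric moves."""
    scores = [run(seed) for seed in seeds]
    spread = max(scores) - min(scores)
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # needs at least two seeds
        "spread": spread,
        "stable": spread <= max_spread,     # assumed tolerance, not a standard
    }


if __name__ == "__main__":
    print(reliability_report(noisy_experiment, range(5)))
```

A single successful run would pass a task-level check but says nothing about `spread`; reporting the summary across seeds is what separates a one-off success from a repeatable one.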

Provenance

Requires that claims can be traced back to sources, tools, data, intermediate steps, and validation outcomes.

  • Provenance makes AutoResearch auditable instead of opaque.
  • It is the practical bridge between autonomy, trust, and accountability.
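As an illustration of what traceable claims might look like in practice, the sketch below attaches sources, artifacts, and a validation flag to each claim and audits for gaps. The `Claim` fields and the `audit` rule are assumptions made for this example; the survey does not prescribe a record format.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    sources: list = field(default_factory=list)    # cited papers or datasets
    artifacts: list = field(default_factory=list)  # tool runs, logs, data files
    validated: bool = False                        # outcome of a validation step


def audit(claims):
    """Return the text of every claim missing a source, artifact, or validation."""
    return [
        claim.text
        for claim in claims
        if not (claim.sources and claim.artifacts and claim.validated)
    ]


if __name__ == "__main__":
    claims = [
        Claim("Method A beats baseline", ["paper-1"], ["run-17.log"], True),
        Claim("Effect generalizes to new data"),  # fluent but unsupported
    ]
    print(audit(claims))
```

An empty audit result does not prove a claim is true; it only shows that each claim can be followed back to something a reviewer can inspect.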

Domains

Scientific autonomy does not progress uniformly across disciplines. The paper emphasizes that domain structure controls how far workflow closure can realistically go.

Figure 7. Domain-conditioned autonomy ceilings across scientific disciplines.

Computational and Formal Sciences

These are the most automation-ready settings because workflows are code-native, replayable, and paired with relatively explicit correctness signals.

  • They currently offer the strongest path toward high workflow closure.
  • Execution, evaluation, and iteration can often be automated in the same digital substrate.

Physical Sciences and Engineering

Autonomy is supported by simulators, optimization pipelines, and instrumented environments, but remains coupled to hardware interfaces and physical experimentation.

  • Progress is strong when simulation and measurement loops are tightly integrated.
  • Real-world apparatus still imposes latency, calibration, and intervention costs.

Embodied Intelligence and Robotics

These workflows benefit from closed-loop control and automated evaluation, yet are constrained by real-world embodiment, reset cost, and long-horizon deployment friction.

  • Policy generation may scale faster than trustworthy real-world validation.
  • Data collection, safety, and reproducibility remain major bottlenecks.

Chemistry and Materials

Robotic labs and search-driven experimentation make this a strong candidate for partial workflow autonomy with measurable scientific outputs.

  • Closed-loop synthesis and characterization can support systematic iteration.
  • Throughput improves, but instrumentation and wet-lab orchestration still limit full closure.

Biology and Biomedicine

AI can accelerate analysis, protocol design, and multimodal interpretation, but biological systems remain noisy, heterogeneous, and difficult to fully control.

  • Autonomy is helped by large data resources and structured assays.
  • Experimental uncertainty and biological complexity keep humans in the loop.

Medicine and Clinical Research

Clinical domains face the tightest accountability, safety, and regulatory constraints, which sharply lowers the ceiling for autonomous scientific closure.

  • High-stakes workflows require strong oversight even when models perform well.
  • Validation is slow, risk-sensitive, and ethically constrained.

Economics and Social Sciences

Human behavior, causal ambiguity, and limited manipulability make automated discovery substantially harder than in closed technical environments.

  • Evidence is often indirect, delayed, and context-dependent.
  • These fields are less amenable to fully automated validation loops.

Earth and Environmental Sciences

These domains rely on observation, modeling, and slow feedback processes across open systems, creating persistent barriers to end-to-end automation.

  • Prediction can be strong even when intervention and validation are limited.
  • Long horizons and environmental uncertainty constrain autonomy ceilings.

Repository Resources

The project page is a paper-facing entry point; the repository remains the full survey companion with figures, narrative, and curated references.

BibTeX

@article{tie2026autoresearchai,
  title   = {AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery},
  author  = {Guiyao Tie and Jiawen Shi and Dingjie Song and Yixiao Huang and Ziji Sheng and Xueyang Zhou and Daizong Liu and Pan Zhou and Yongchao Chen and Ran Xu and Lifang He and Qingsong Wen and Manling Li and Cong Lu and Shuai Li and Pengtao Xie and Yixuan Yuan and Rui Meng and Lei Xing and Lichao Sun and Caiming Xiong and Philip S. Yu and Jianfeng Gao},
  journal = {arXiv preprint arXiv:2605.23204},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.23204}
}