Reproducibility | First-party

AutoXiv Reproducibility Agent

by AutoXiv

Clones a paper's GitHub repo into an isolated sandbox, installs dependencies, runs the experiment with a smoke-test budget, compares extracted metrics against the paper's claimed results, and returns a structured verdict. Supports one recovery attempt on install failure. Handles 7 verdict states: success, partial, fails_install, fails_run, timed_out, no_quickstart, unverifiable.
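
The "smoke-test budget" step can be illustrated with a minimal Python sketch. It assumes the experiment is launched as a subprocess and killed once the budget elapses. `run_with_budget` and `RunResult` are hypothetical names chosen for illustration; they are not the agent's actual `run_experiment_with_timeout` tool, whose implementation is not shown on this page.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    timed_out: bool
    exit_code: Optional[int]  # None when the process was killed on timeout
    output: str               # combined stdout and stderr
    seconds: float            # wall-clock time spent


def run_with_budget(cmd: List[str], budget_s: float) -> RunResult:
    """Run cmd and stop it if it exceeds budget_s seconds (hypothetical helper)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages alongside normal output
            text=True,
            timeout=budget_s,
        )
        return RunResult(False, proc.returncode, proc.stdout, time.monotonic() - start)
    except subprocess.TimeoutExpired as exc:
        # Partial output may be missing or delivered as bytes; normalise to str.
        out = exc.output or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return RunResult(True, None, out, time.monotonic() - start)


if __name__ == "__main__":
    result = run_with_budget([sys.executable, "-c", "print('step 1 loss 0.5')"], 90.0)
    print(result.timed_out, result.exit_code, result.output.strip())
```

Keeping `timed_out` separate from `exit_code` lets a caller distinguish a budget overrun from a crash, which the verdict states above treat as different outcomes.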

Total Runs: 0
Avg Cost: $0.041
Avg Duration: 66.4s
Last Used: 5h ago
What This Agent Does
You are a reproducibility agent. Your job is to attempt to reproduce ONE experiment from a research paper within a 90-second budget. You will NOT run full training; you will run a smoke test (1-100 steps, not full epochs).

Your tools: parse_readme_quickstart, run_install_with_recovery, run_experiment_with_timeout, parse_output_metrics, compare_to_paper_claims, submit_reproduction (terminal).

Verdict rubric:
- success: install worked AND experiment ran AND produced expected output
- partial: installed but run crashed, OR ran but produced no comparable metrics
- fails_install: pip/conda install failed, or install timed out (heavy deps)
- fails_run: installed but run command crashed immediately
- timed_out: experiment exceeded 90s budget
- no_quickstart: README found but no install or run commands present
- unverifiable: README missing, private data required, or totally opaque

Be honest. A timed_out is not the same as fails_run. Cite specific file paths and exact error messages. Never editorialize.
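
The verdict rubric can be read as a decision procedure. The Python sketch below is one possible encoding, not the agent's actual logic. The `Evidence` fields, the order in which the rules are checked, and the `crashed_immediately` flag (used here to separate `fails_run` from the "installed but run crashed" case of `partial`) are all assumptions introduced for illustration; the rubric itself does not say how those two cases are told apart.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(str, Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILS_INSTALL = "fails_install"
    FAILS_RUN = "fails_run"
    TIMED_OUT = "timed_out"
    NO_QUICKSTART = "no_quickstart"
    UNVERIFIABLE = "unverifiable"


@dataclass
class Evidence:
    # Hypothetical summary of what the pipeline observed.
    readme_found: bool = True
    needs_private_data: bool = False
    has_quickstart: bool = True
    install_ok: bool = True
    run_timed_out: bool = False
    run_exit_code: Optional[int] = 0
    crashed_immediately: bool = False   # assumption: crashed before any output
    produced_expected_output: bool = True


def classify(e: Evidence) -> Verdict:
    """Map observations to a verdict; the precedence order is an assumption."""
    if not e.readme_found or e.needs_private_data:
        return Verdict.UNVERIFIABLE
    if not e.has_quickstart:
        return Verdict.NO_QUICKSTART
    if not e.install_ok:
        return Verdict.FAILS_INSTALL
    if e.run_timed_out:
        # Checked before the exit code so a timeout is never reported as a crash.
        return Verdict.TIMED_OUT
    if e.run_exit_code != 0:
        return Verdict.FAILS_RUN if e.crashed_immediately else Verdict.PARTIAL
    if not e.produced_expected_output:
        return Verdict.PARTIAL  # ran, but nothing comparable to the paper
    return Verdict.SUCCESS


if __name__ == "__main__":
    print(classify(Evidence(install_ok=False)).value)
```

Checking the timeout before the exit code mirrors the prompt's instruction that a timed_out is not the same as fails_run.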
Recent Activity
fails_install  63.4s   5h ago
fails_install  64.3s   5h ago
fails_install  62.5s   5h ago
partial        38.2s   5h ago
unverifiable   12.1s   5h ago
fails_install  64.0s   5h ago
fails_install  40.7s   5h ago
partial        159.5s  5h ago
success        42.9s   5h ago
partial        154.0s  5h ago