🧪 Autoresearch on an Old Research Idea

5 STAR | Source: ykumar.me | Date: 2026-03-24

Summary

The post applies Andrej Karpathy's Autoresearch concept to a real ML research problem (eCLIP). The author used Claude Code as a research agent that iteratively improved an eval metric by modifying train.py, following instructions in program.md.

Core Approach

Autoresearch is a constrained optimization loop with an LLM agent in the middle:
Agent iteratively improves the eval metric by modifying train.py
Reads instructions from program.md (split into "phases")
Uses scratchpad.md as working memory
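
The loop above can be sketched as a greedy commit-or-revert search. This is a minimal illustration, not the author's code: `propose` stands in for the agent editing train.py, `evaluate` stands in for a training run that reports the metric (lower is better), and the toy objective and config keys (`lr`, `proj_dim`) are hypothetical.

```python
import random
from typing import Callable


def commit_or_revert(
    baseline: dict,
    propose: Callable[[dict], dict],
    evaluate: Callable[[dict], float],
    n_experiments: int,
) -> tuple:
    """Keep a proposed change only if the metric (lower is better) improves."""
    best_cfg = dict(baseline)
    best_score = evaluate(best_cfg)
    log = []  # plays the role of scratchpad.md
    for i in range(n_experiments):
        candidate = propose(best_cfg)   # one change per experiment
        score = evaluate(candidate)     # stand-in for a full training run
        if score < best_score:
            best_cfg, best_score = candidate, score
            log.append(f"exp {i}: commit ({score:.2f})")
        else:
            log.append(f"exp {i}: revert ({score:.2f})")
    return best_cfg, best_score, log


# Toy stand-ins (hypothetical): perturb one value, score by distance to a target.
rng = random.Random(0)


def toy_propose(cfg: dict) -> dict:
    new = dict(cfg)
    key = rng.choice(sorted(new))
    new[key] = new[key] * rng.choice([0.5, 2.0])
    return new


def toy_evaluate(cfg: dict) -> float:
    return abs(cfg["lr"] - 4e-4) * 1e5 + abs(cfg["proj_dim"] - 512) / 8


start = {"lr": 1e-4, "proj_dim": 128.0}
best_cfg, best_score, log = commit_or_revert(start, toy_propose, toy_evaluate, 20)
```

Because a change is committed only when it strictly improves the score, the best score never gets worse, which is also why the loop can stall once single-step changes stop helping.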


Experiment Setup

Eval Metric: Mean Rank of retrieved embeddings
Dataset: Ukiyo-eVG (~11K Japanese woodblock prints with phrase→bounding-box annotations)
Model: CLIP ViT-Small (22M) + DistilBERT (66M) + HeatmapProcessor
Training: 800 steps (~3 min per run on RTX 4090)
Time Budget: ~5 minutes per experiment to encourage rapid iteration
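
For reference, mean rank can be computed from a query-by-candidate similarity matrix as below. This is a sketch under assumptions the source does not spell out: ranks are 1-based, the correct candidate for query i sits at index i, and ties are resolved in the query's favor. The 3x3 matrix is made-up data.

```python
def ranks_of_matches(sim):
    """1-based rank of the correct candidate (index i) for each query i."""
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > target))
    return ranks


def mean_rank(sim):
    ranks = ranks_of_matches(sim)
    return sum(ranks) / len(ranks)


# Made-up similarities: rows are queries, columns are candidates.
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.1, 0.7, 0.6],
]
print(ranks_of_matches(sim))  # [1, 2, 2]
print(mean_rank(sim))
```

Lower is better: a mean rank of 1.0 would mean every query retrieves its match first.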

Sandboxing

Containerized training loop with network access removed
Claude Code permissions restricted to editing two files and running run.sh
No direct Python execution, no pip installs, no network access, no git push
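
The restrictions above amount to an allowlist. The sketch below only illustrates that logic in plain Python; it is not Claude Code's permission configuration. The two editable file names and the `./run.sh` invocation form are assumptions, since the source says only "two files" and "run.sh".

```python
# Assumption: the two editable files are train.py and scratchpad.md.
EDITABLE_FILES = {"train.py", "scratchpad.md"}
# Assumption: the script is invoked as ./run.sh with no arguments.
ALLOWED_COMMANDS = {"./run.sh"}


def is_allowed(action: str, target: str) -> bool:
    """Return True only for actions explicitly on the allowlist."""
    if action == "edit":
        return target in EDITABLE_FILES
    if action == "run":
        return target.strip() in ALLOWED_COMMANDS
    return False  # default deny for anything unrecognized


print(is_allowed("edit", "train.py"))          # True
print(is_allowed("run", "./run.sh"))           # True
print(is_allowed("run", "pip install torch"))  # False
print(is_allowed("run", "git push"))           # False
```

Default deny matters here: anything not named explicitly, including direct Python execution, is rejected without needing its own rule.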

Results

Metric	Baseline	After Autoresearch
Mean Rank	344.68	157.43 (54% reduction)
Experiments	-	42 total, 13 committed, 29 reverted

Final Test Results

Metric	Test Score
Mean Rank	34.30
img→txt R@5	53.0%
txt→img R@5	51.4%
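
R@K in both directions can be sketched as follows, assuming (as is common, though not stated in the source) that R@K is the fraction of queries whose correct match ranks in the top K, and that txt→img retrieval uses the transpose of the img→txt similarity matrix. The matrix is made-up data.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose correct candidate (index i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        rank = 1 + sum(1 for s in row if s > row[i])
        if rank <= k:
            hits += 1
    return hits / len(sim)


def transpose(sim):
    return [list(col) for col in zip(*sim)]


# Made-up similarities: rows are images, columns are texts.
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.1, 0.7, 0.6],
]
img_to_txt_r1 = recall_at_k(sim, 1)             # 1 of 3 images ranks its text first
txt_to_img_r1 = recall_at_k(transpose(sim), 1)  # 2 of 3 texts rank their image first
```

The two directions generally differ, which is consistent with the table reporting separate img→txt and txt→img scores.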

Key Discoveries

🔴 Biggest Win: Temperature Clamp Bug

The agent immediately found a bug in the code: the learnable temperature parameter was clamped at 2. After the agent relaxed the limit, the eval mean rank dropped by 113 points, the single biggest win, worth more than all architecture changes combined.
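
One way to see why such a clamp hurts is sketched below. It assumes the clamp acts as an upper bound on the scale that multiplies cosine similarities before the softmax in a CLIP-style contrastive loss; the source says only "clamped at 2", so the exact semantics in the author's code may differ. The candidate count of 128 is hypothetical.

```python
import math


def best_case_match_prob(scale: float, n_candidates: int) -> float:
    """Softmax probability of the true pair in the most favorable case:
    its cosine similarity is +1 and every other candidate's is -1."""
    pos = math.exp(scale * 1.0)
    neg = (n_candidates - 1) * math.exp(scale * -1.0)
    return pos / (pos + neg)


for scale in (2.0, 10.0, 100.0):
    p = best_case_match_prob(scale, 128)
    # -log(p) is the lowest loss the model can reach at this scale.
    print(f"scale={scale:>5}: best-case p={p:.4f}, loss floor={-math.log(p):.4f}")
```

With the scale capped at 2 and 128 candidates, even perfectly separated embeddings give the true pair a probability of only about 0.30, so the loss cannot approach zero no matter how good the embeddings are. Under this assumption, relaxing the cap removes that floor.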

🟡 Hyperparameter Tuning

Further gains (-30 mean rank) came from hyperparameter tuning: increasing projection dimension and re-tuning learning rate. The agent acted like a hyperparameter optimization algorithm with reasoning.

🟠 Diminishing Returns

Architecture changes to the attention mechanism didn't work
"Moonshot" ideas in Phase 5 didn't stick
Agent was "throwing spaghetti at the wall"

Key Insights

When the search space is clearly defined, the commit-or-revert loop is a surprisingly effective search strategy. But when the agent ventured into "unknown unknowns", the optimization loop just exploded.

Limitations

The "make only one change per experiment" constraint may have been too tight for moonshot ideas
Agent sometimes forgot permissions and made weird bash calls
Agent sometimes got tired of waiting and ended the conversation
Would not give it full autonomy just yet