🧪 Autoresearch on an Old Research Idea
Summary
The author applies Andrej Karpathy's Autoresearch concept to a real ML research problem (eCLIP). Claude Code served as the research agent: it iteratively improved an eval metric by modifying train.py, following instructions in program.md.
Core Approach
- Agent iteratively improves eval metric by modifying train.py
- Reads instructions from program.md (split into "phases")
- Uses scratchpad.md as working memory
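The loop above can be sketched as a greedy commit-or-revert search. This is a minimal illustration under stated assumptions, not the author's harness: `propose_change`, `run_experiment`, and the toy objective are hypothetical stand-ins for the agent editing train.py and for run.sh reporting the metric. Lower is better, as with Mean Rank, and the `log` list only loosely mirrors the role of scratchpad.md.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def commit_or_revert(
    baseline: Config,
    propose_change: Callable[[Config, random.Random], Config],
    run_experiment: Callable[[Config], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[Config, float, List[str]]:
    """Greedy search: keep a change only if the metric (lower is better) improves."""
    rng = random.Random(seed)
    best_cfg = dict(baseline)
    best_score = run_experiment(best_cfg)
    log: List[str] = []  # working memory, analogous to scratchpad.md
    for i in range(n_experiments):
        candidate = propose_change(best_cfg, rng)  # one change per experiment
        score = run_experiment(candidate)
        if score < best_score:
            best_cfg, best_score = candidate, score  # commit
            log.append(f"exp {i}: commit ({score:.2f})")
        else:
            log.append(f"exp {i}: revert ({score:.2f})")  # discard the candidate
    return best_cfg, best_score, log


# Hypothetical toy stand-ins so the sketch runs without a training job.
def toy_propose(cfg: Config, rng: random.Random) -> Config:
    new = dict(cfg)
    key = rng.choice(sorted(new))  # change exactly one setting
    new[key] = new[key] * rng.choice([0.5, 2.0])
    return new


def toy_metric(cfg: Config) -> float:
    return abs(cfg["lr"] - 1e-3) * 1e4 + abs(cfg["proj_dim"] - 512) / 10


baseline_cfg = {"lr": 1e-4, "proj_dim": 128.0}
best_cfg, best_score, log = commit_or_revert(
    baseline_cfg, toy_propose, toy_metric, n_experiments=20
)
```

Because a candidate is kept only when it beats the current best, the score never gets worse. The search can, however, stall once single-step changes stop helping.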
Experiment Setup
- Eval Metric: Mean Rank of retrieved embeddings
- Dataset: Ukiyo-eVG (~11K Japanese woodblock prints with phrase→bounding-box annotations)
- Model: CLIP ViT-Small (22M) + DistilBERT (66M) + HeatmapProcessor
- Training: 800 steps (~3 min per run on RTX 4090)
- Time Budget: ~5 minutes per experiment to encourage rapid iteration
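To make the eval metric concrete, here is a minimal NumPy sketch of Mean Rank and Recall@K for paired retrieval. It assumes query `i` matches gallery item `i`, uses cosine similarity, and breaks ties optimistically. The source does not spell out the exact definition used, so treat these choices and the function name `retrieval_metrics` as assumptions.

```python
import numpy as np


def retrieval_metrics(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int = 5) -> dict:
    """Mean Rank and Recall@k, assuming query i's correct match is gallery item i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T  # cosine similarity, shape (n_queries, n_gallery)
    correct = np.diag(sim)[:, None]
    # Rank of the correct item = 1 + number of gallery items scoring strictly higher.
    ranks = 1 + (sim > correct).sum(axis=1)
    return {
        "mean_rank": float(ranks.mean()),
        f"R@{k}": float((ranks <= k).mean()),
    }


# Perfectly aligned embeddings: every correct item is ranked first.
perfect = retrieval_metrics(np.eye(4), np.eye(4))

# Gallery shifted by one position: exactly one wrong item outranks each correct one.
shifted = retrieval_metrics(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Lower Mean Rank is better (1.0 is the best possible value), while higher R@K is better.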
Sandboxing
- Containerized training loop with network access removed
- Claude Code permissions restricted: the agent may only edit two files and run run.sh
- Everything else is blocked: no direct Python execution, no pip installs, no network access, no git push
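One way to picture these restrictions is as a command allowlist plus a container launched without networking. The sketch below is illustrative only. The source says the loop was containerized but does not name the runtime, so Docker, the image name `autoresearch-train:latest`, the `/workspace` mount, and the exact way run.sh is invoked are all assumptions. The `is_allowed` check is a hypothetical stand-in for Claude Code's own permission settings, not its real configuration format.

```python
import shlex
from pathlib import Path
from typing import List

# Assumed invocations of the single permitted script.
ALLOWED_ARGVS = (["./run.sh"], ["bash", "run.sh"])


def is_allowed(command: str) -> bool:
    """Allow only the exact run.sh invocations; everything else is rejected."""
    return shlex.split(command) in [list(a) for a in ALLOWED_ARGVS]


def build_sandbox_cmd(workdir: str, image: str = "autoresearch-train:latest") -> List[str]:
    """Build a `docker run` argv that executes run.sh with networking disabled."""
    root = Path(workdir).resolve()
    return [
        "docker", "run", "--rm",
        "--network", "none",         # container has no network access
        "--gpus", "all",             # expose the GPU to the training job
        "-v", f"{root}:/workspace",  # mount the project directory
        "-w", "/workspace",
        image,                       # hypothetical image name
        "bash", "run.sh",
    ]


cmd = build_sandbox_cmd(".")
```

Building the argv as a list rather than a shell string avoids quoting problems if it is later passed to `subprocess.run`.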
Results
| Metric | Baseline | After Autoresearch |
|---|---|---|
| Mean Rank | 344.68 | 157.43 (54% reduction) |
| Experiments | - | 42 total, 13 committed, 29 reverted |
Final Test Results
| Metric | Test Score |
|---|---|
| Mean Rank | 34.30 |
| img→txt R@5 | 53.0% |
| txt→img R@5 | 51.4% |
Key Discoveries
🔴 Biggest Win: Temperature Clamp Bug
The agent immediately found a bug in the code: the learnable temperature parameter was clamped at 2. After the agent relaxed the limit, the eval mean rank dropped by 113 points. This was the single biggest win, worth more than all architecture changes combined.
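A small numerical sketch shows why such a clamp can hurt a CLIP-style contrastive objective. It assumes "clamped at 2" means an upper bound of 2 on the multiplicative logit scale (the inverse temperature). The source does not say whether the clamp applied to the scale or to its logarithm, so that reading is an assumption, as are the similarity values and batch size below.

```python
import numpy as np


def contrastive_probs(cos_sim: np.ndarray, log_scale: float, max_scale: float) -> np.ndarray:
    """Row-wise softmax over scaled cosine similarities, with the scale clamped."""
    scale = min(float(np.exp(log_scale)), max_scale)  # the clamp under discussion
    logits = scale * cos_sim
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


# Toy batch: matched pairs have cosine 0.8, mismatched pairs 0.2 (illustrative values).
n = 256
cos_sim = np.full((n, n), 0.2)
np.fill_diagonal(cos_sim, 0.8)

log_scale = float(np.log(50.0))  # the scale the model "wants" to learn (assumed)
p_clamped = contrastive_probs(cos_sim, log_scale, max_scale=2.0)
p_relaxed = contrastive_probs(cos_sim, log_scale, max_scale=100.0)

print(f"correct-pair probability, clamp at 2:   {p_clamped[0, 0]:.4f}")
print(f"correct-pair probability, clamp at 100: {p_relaxed[0, 0]:.4f}")
```

With cosine similarities confined to [-1, 1], a scale of at most 2 keeps the logits within a range of 4. The softmax then stays close to uniform even when matched pairs are clearly more similar, which weakens the training signal. Relaxing the bound lets the learned scale sharpen the distribution.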
🟡 Hyperparameter Tuning
Further gains (about 30 more points of mean rank) came from hyperparameter tuning: increasing the projection dimension and re-tuning the learning rate. Here the agent behaved like a hyperparameter optimization algorithm that can reason about its choices.
🟠 Diminishing Returns
- Architecture changes to the attention mechanism didn't work
- "Moonshot" ideas in Phase 5 didn't stick
- By this point the agent was "throwing spaghetti at the wall"
Key Insights
When the search space is clearly defined, the commit-or-revert loop is a surprisingly effective search strategy. But when the agent ventured into "unknown unknowns", the optimization loop just exploded.
Limitations
- The "make only one change per experiment" constraint may have been too tight for moonshot ideas
- The agent sometimes forgot its permissions and made weird bash calls
- The agent sometimes got tired of waiting and ended the conversation
- The author would not give it full autonomy just yet