Human Baseliner for Open-Ended ML Research Tasks
Overview
We are hiring experienced machine learning engineers and researchers to serve as human baseliners for evaluations of open-ended machine learning research tasks. These evaluations measure how well AI agents perform on realistic AI R&D problems. To interpret agent performance, we also need strong human reference points: skilled practitioners attempting the same tasks under the same time and compute constraints. As a baseliner, you will complete self-contained ML research tasks in a sandboxed environment, working independently with your preferred tools and workflow. Your performance will be used as a benchmark against which frontier-model agents are evaluated.
What You’ll Do
- Attempt open-ended machine learning research tasks under a fixed time and compute budget (this serves as the work trial)
- Work independently in a sandboxed Linux environment with internet access
- Use your preferred tooling, including IDEs and AI coding assistants such as Cursor, Claude Code, and ChatGPT
- Record your full working session via screen recording
- Complete a short pre-task and post-task questionnaire
- Submit your final work product, screen recording, and completed questionnaires. After the work trial, selected candidates are hired for a longer commitment
Commitment
- Minimum 20 hours per week if selected
- More availability is strongly preferred
Requirements
Candidates must meet all of the following:
- 3+ years of machine learning experience (time spent in a PhD program counts toward this requirement; undergraduate and master’s experience does not)
- Attended a top-100 university or worked at FAANG or a comparable company
- Experience with at least one major ML framework such as PyTorch, JAX, or TensorFlow
- Deep, hands-on expertise in at least one of the focus areas below:
- Pretraining under tight data and compute budgets
- PPO, reward shaping, custom `gym` / `gymnasium` environments, and throughput tuning
- Full fine-tuning, LoRA, QLoRA, DPO, RLHF, RLAIF, and distillation
- Large-scale corpus filtering, deduplication, subsampling, and benchmark contamination avoidance
- Architecture design under strict parameter-count or size constraints
- Modifying pretrained architectures, including attention patterns, pooling heads, or training objectives
- Contrastive training for embedding or retrieval models
- Generative vision or video modeling
- Multilingual or low-resource language experience
- Image or video data pipelines at scale
- Experience balancing competing model objectives such as safety and capability
- Prior work as an ML evaluator, red-teamer, or baseliner
Required Domain Expertise
Candidates must have strong practical experience in at least one of the following:
- Pretraining: training transformer language models from scratch
- Reinforcement learning: training agents in custom or existing environments
- Post-training: fine-tuning and aligning LLMs
- Dataset curation: building and cleaning large text corpora for LLM training
- Model architecture: designing and modifying neural network architectures
Logistics (work trial requirements)
- One baseline attempt per contractor per task; a task cannot be retried
- All work is confidential and covered by NDA
- Compute and environment are provided; no personal GPU is required
Compensation
- Pay: $75 – $90/hour
- Type: Hourly contract
- Location: Remote
