Human Baseliner for Open-Ended ML Research Tasks

Posted today Hourly Remote English

Mercor

$75 – $90 per hour

Overview

We are hiring experienced machine learning engineers and researchers to serve as human baseliners for evaluations of open-ended machine learning research tasks. These evaluations measure how well AI agents perform on realistic AI R&D problems. To interpret agent performance, we also need strong human reference points: skilled practitioners attempting the same tasks under the same time and compute constraints. As a baseliner, you will complete self-contained ML research tasks in a sandboxed environment, working independently with your preferred tools and workflow. Your performance will be used as a benchmark against which frontier-model agents are evaluated.

What You’ll Do

Attempt open-ended machine learning research tasks under a fixed time and compute budget (work trial)

Work independently in a sandboxed Linux environment with internet access

Use your preferred tooling, including IDEs and AI coding assistants such as Cursor, Claude Code, and ChatGPT

Record your full working session via screen recording

Complete a short pre-task and post-task questionnaire

Submit your final work product, screen recording, and completed questionnaires. After this, selected candidates will be hired for a longer commitment.

Commitment

Minimum 20 hours per week if selected

More availability is strongly preferred

Requirements

Candidates must meet all of the following:

3+ years of machine learning experience

Time spent in a PhD program counts toward this requirement

Undergraduate and master’s experience does not count

Attended a top-100 university or worked at FAANG or a comparable company

Experience with at least one major ML framework such as PyTorch, JAX, or TensorFlow

Deep, hands-on expertise in at least one of the focus areas below:

Pretraining under tight data and compute budgets

PPO, reward shaping, custom `gym` / `gymnasium` environments, and throughput tuning

Full fine-tuning, LoRA, QLoRA, DPO, RLHF, RLAIF, and distillation

Large-scale corpus filtering, deduplication, subsampling, and benchmark contamination avoidance

Architecture design under strict parameter-count or size constraints

Modifying pretrained architectures, including attention patterns, pooling heads, or training objectives

Contrastive training for embedding or retrieval models

Generative vision or video modeling

Multilingual or low-resource language experience

Image or video data pipelines at scale

Experience balancing competing model objectives such as safety and capability

Prior work as an ML evaluator, red-teamer, or baseliner

Required Domain Expertise

Candidates must have strong practical experience in at least one of the following:

Pretraining: training transformer language models from scratch

Reinforcement learning: training agents in custom or existing environments

Post-training: fine-tuning and aligning LLMs

Dataset curation: building and cleaning large text corpora for LLM training

Model architecture: designing and modifying neural network architectures

Logistics (work trial requirements)

One baseline attempt per contractor per task

All work is confidential and covered by NDA

Compute and environment are provided; no personal GPU is required

Compensation

Pay: $75 – $90/hour
Type: Hourly contract
Location: Remote
