Arena · Data & evaluation

Mapped players: 2 · As of 2026-05-22

Weights & Biases

Experiment tracking and evaluation platform for model and agent teams. Used to compare behavior across model, prompt, and deployment iterations.

Founded 2017 · San Francisco, US · Subsidiary · Part of CoreWeave · ~200 employees
API-first · Observability

About

Weights & Biases builds an experiment tracking and evaluation platform used by model and agent teams. It is US-based and operates as a subsidiary of CoreWeave.

Strategy

W&B is widely adopted at the model training stage and is now expanding into agent evaluation. Its main competition comes from LangSmith, for teams already in the LangChain ecosystem, and from cloud-native eval tooling. W&B's advantage is breadth: it covers training, fine-tuning, and deployment monitoring in one platform, so teams do not need to stitch tools together.

Humanloop

Prompt and evaluation management layer focused on production LLM reliability and testable iteration workflows.

Founded 2021 · London, UK · Private · Backed by Index Ventures, Basis Set · ~30 employees
API-first · Observability

About

Humanloop builds a prompt and evaluation management platform for production LLM workflows. It is a UK-based startup focused on reliability and governance use cases.

Strategy

Humanloop targets production LLM reliability. It is smaller than W&B but more focused on the prompt governance use case. It is most relevant for regulated industries, where prompt changes need audit trails and teams want structured evaluation workflows.