Data & evaluation
Weights & Biases
Experiment tracking and evaluation platform for model and agent teams. Used to compare behavior across model, prompt, and deployment iterations.
Founded 2017 · San Francisco, US · Subsidiary · Part of CoreWeave · ~200 employees · API-first · Observability
About
Weights & Biases builds an experiment tracking and evaluation platform used by model and agent teams. It is US-based and operates as a subsidiary of CoreWeave.
Strategy
Widely adopted at the model training stage and now expanding into agent evaluation. The main competition comes from LangSmith, for teams already in the LangChain ecosystem, and from cloud-native eval tooling. W&B's advantage is breadth: it covers training, fine-tuning, and deployment monitoring in one platform, so teams do not need to stitch tools together.
Humanloop
Prompt and evaluation management layer focused on production LLM reliability and testable iteration workflows.
Founded 2021 · London, UK · Private · Backed by Index Ventures, Basis Set · ~30 employees · API-first · Observability
About
Humanloop builds a prompt and evaluation management platform for production LLM workflows. It is a UK-based startup focused on the reliability and governance use case.
Strategy
A prompt and evaluation management platform aimed at production LLM reliability. Smaller than W&B but more focused on the prompt governance use case. Relevant for regulated industries where prompt changes need audit trails and teams want structured evaluation workflows.