Benchmark Design

Tasks, Families, and Scoring

An overview of what is measured, how scores are combined, and how mechanism families differ.

Version: v1.0 | Last updated: 2026-03-30

Evaluation Protocol

NeuroMIB evaluates interpretability quality across four tasks, listed below. EvalAI hosts the public and private phases; this repository provides the generation, schema validation, and scoring logic that the EvalAI workers run.

Task A: latent variable recovery
Task B: mechanism classification
Task C: support recovery
Task D: intervention prediction

Task Weighting

Benchmark Tasks

Each task contributes a weighted component to the total score. Metric lists below summarize the primary signals used for ranking and diagnostics.
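As a rough illustration of how weighted components could be combined into a total, here is a minimal sketch. It assumes each task score is already normalized to [0, 1] and that the total is a weighted average. The weight values and the names `EXAMPLE_WEIGHTS` and `total_score` are hypothetical and are not taken from the NeuroMIB scorer.

```python
from typing import Mapping

# Hypothetical weights, for illustration only; the actual NeuroMIB
# task weights are not stated in this section.
EXAMPLE_WEIGHTS = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}


def total_score(
    task_scores: Mapping[str, float],
    weights: Mapping[str, float] = EXAMPLE_WEIGHTS,
) -> float:
    """Return the weighted average of per-task scores.

    Assumes every task score lies in [0, 1]. Weights are renormalized
    by their sum, so they do not need to add up to exactly 1.
    """
    missing = set(weights) - set(task_scores)
    if missing:
        raise ValueError(f"missing task scores: {sorted(missing)}")

    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted = sum(weights[task] * task_scores[task] for task in weights)
    return weighted / weight_sum


if __name__ == "__main__":
    # Made-up scores for Tasks A through D.
    print(total_score({"A": 1.0, "B": 0.5, "C": 0.0, "D": 0.5}))
```

Dividing by the weight sum keeps the total in [0, 1] even if the weights are later changed without being renormalized by hand.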

Mechanism Family Explorer
