Benchmark Design
Tasks, Families, and Scoring
Overview of what is measured, how scores are combined, and how mechanism families differ.
Evaluation Protocol
NeuroMIB evaluates interpretability quality across four tasks. EvalAI hosts the public and private phases; this repository provides the generation, schema validation, and scorer logic that EvalAI workers run.
- Task A: latent variable recovery
- Task B: mechanism classification
- Task C: support recovery
- Task D: intervention prediction
Task Weighting
Benchmark Tasks
Each task contributes a weighted component to the total score. Metric lists below summarize the primary signals used for ranking and diagnostics.
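As a rough illustration of how weighted task components might combine into a total score, the sketch below computes a normalized weighted sum over Tasks A through D. The weights shown are placeholders, not NeuroMIB's actual values (this section does not state them), and the function name `total_score`, the `EXAMPLE_WEIGHTS` mapping, and the normalization by the weight sum are assumptions made for the example rather than the repository's scorer.

```python
from typing import Mapping

# Hypothetical placeholder weights, one per task. The real NeuroMIB
# weights are not given in this section; equal weights are assumed here.
EXAMPLE_WEIGHTS = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}


def total_score(
    task_scores: Mapping[str, float],
    weights: Mapping[str, float] = EXAMPLE_WEIGHTS,
) -> float:
    """Combine per-task scores into one number via a weighted average.

    Assumes each task score is already on a common scale (e.g. 0 to 1).
    """
    missing = set(weights) - set(task_scores)
    if missing:
        raise ValueError(f"missing scores for tasks: {sorted(missing)}")

    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    # Each task contributes weight * score; dividing by the weight sum
    # keeps the result on the same scale as the individual task scores.
    weighted = sum(weights[task] * task_scores[task] for task in weights)
    return weighted / weight_sum


if __name__ == "__main__":
    scores = {"A": 1.0, "B": 0.5, "C": 0.0, "D": 0.5}
    print(total_score(scores))
```

Dividing by the weight sum means the weights do not have to sum to one, which makes it easier to experiment with different weightings without rescaling them by hand.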