Verifying Hypotheses
Hypotheses bring statistical discipline to your experiments. Each hypothesis couples a verifier (statistical test) with a ranker (function that scores the verifier’s output so treatments can be ordered).
1. Write a Verifier
```python
from crystallize import verifier
from scipy.stats import ttest_ind

@verifier
def welch_t_test(
    baseline: dict[str, list[float]],
    treatment: dict[str, list[float]],
    *,
    alpha: float = 0.05,
) -> dict[str, float | bool]:
    stat, p_value = ttest_ind(
        treatment["total"], baseline["total"], equal_var=False
    )
    return {"p_value": p_value, "significant": p_value < alpha}
```

- `baseline` and `treatment` contain the aggregated metrics recorded by your pipeline (values are lists across replicates).
- Return any dictionary you like; keys become part of the hypothesis result.
- Parameters supplied to the decorated function (e.g., `alpha`) can be overridden when instantiating the verifier: `welch_t_test(alpha=0.01)`.
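For example, a minimal sketch (the name `strict_welch` is illustrative):

```python
# Override the default alpha=0.05 at instantiation time.
strict_welch = welch_t_test(alpha=0.01)
```

The stricter instance can then be reused anywhere a verifier is expected.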
2. Wrap It in a Hypothesis
```python
from crystallize import hypothesis

@hypothesis(verifier=welch_t_test(), metrics="total")
def order_by_p_value(result: dict[str, float]) -> float:
    return result.get("p_value", 1.0)
```

- `metrics="total"` tells Crystallize to pass only the `total` metric into the verifier.
- The decorated function runs once per treatment and should return a scalar. Smaller values rank higher by default.
- Multiple metrics are supported: pass a list of metric names or nested lists for grouped metrics.
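As a sketch of the list form (the `count` metric is hypothetical and assumes your pipeline records it alongside `total`):

```python
# Sketch only: passes both metrics into the verifier's baseline/treatment dicts.
@hypothesis(verifier=welch_t_test(), metrics=["total", "count"])
def order_by_p_value_multi(result: dict[str, float]) -> float:
    return result.get("p_value", 1.0)
```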
3. Attach to an Experiment
```python
experiment = (
    Experiment.builder("hypothesis_demo")
    .datasource(fetch_numbers())
    .add_step(add_delta())
    .add_step(summarize())
    .treatments([boost_total()])
    .hypotheses([order_by_p_value])
    .replicates(16)
    .build()
)
```
```python
result = experiment.run()
hyp = result.get_hypothesis("order_by_p_value")
print(hyp.results)  # {'boost_total': {'p_value': ..., 'significant': True/False}}
print(hyp.ranking)  # {'best': 'boost_total', 'ordered': ['boost_total']}
```

Key fields on the returned `HypothesisResult`:
- `results`: mapping of treatment → verifier output dictionary.
- `ranking["best"]`: treatment name with the lowest ranker value (or `None` if no treatments were active).
- `errors`: list of verifier failures (exceptions are captured so the run continues).
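A minimal sketch of consuming these fields, assuming verifier outputs shaped like the `welch_t_test` return value above:

```python
for treatment, output in hyp.results.items():
    verdict = "significant" if output["significant"] else "not significant"
    print(f"{treatment}: p={output['p_value']:.4f} ({verdict})")

if hyp.errors:
    print(f"{len(hyp.errors)} verifier failure(s): {hyp.errors}")
```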
4. Display in the CLI
- The run screen’s Summary tab has a dedicated Hypotheses table that shows significance flags per treatment.
- Press `S` after a run to jump to the summary; disabled treatments are greyed out.
- Historical metrics (when `ArtifactPlugin(versioned=True)` is enabled) can be loaded via `Toggle All Treatments` or persisted summaries.
5. Tips
- Hypotheses run after all replicates finish. If you need per-replicate checks, add a pipeline step instead.
- Always record the metrics you reference; missing keys raise `KeyError`. When pipelines return `(data, metrics_dict)`, those metrics are automatically available.
- Multiple hypotheses can share the same verifier with different `alpha` thresholds or metric sets.
- Complex ranking logic (e.g., Pareto comparisons) belongs in the ranker function; return a tuple for lexicographic ordering if needed.
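As a sketch of tuple-based ordering (assuming tuples compare lexicographically and, as noted above, smaller values rank higher):

```python
@hypothesis(verifier=welch_t_test(), metrics="total")
def significant_then_low_p(result: dict[str, float | bool]) -> tuple[int, float]:
    # Significant treatments sort first (0 before 1), then by smaller p-value.
    return (0 if result.get("significant") else 1, result.get("p_value", 1.0))
```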