Integrating Statistical Tests
Crystallize separates data processing from statistical evaluation. Verifiers implement a statistical test and Hypotheses use them to compare baseline and treatment metrics.
1. Create a Verifier
Use the @verifier decorator on a function that accepts baseline and treatment metric samples. Return any dictionary of results. This example wraps SciPy’s Welch t-test:

    from crystallize import verifier
    from scipy.stats import ttest_ind

    @verifier
    def welch_t_test(baseline, treatment, *, alpha: float = 0.05):
        # Welch's test: equal_var=False drops the equal-variance assumption.
        t_stat, p_value = ttest_ind(
            treatment["score"], baseline["score"], equal_var=False
        )
        return {"p_value": p_value, "significant": p_value < alpha}

Instantiate it with parameters if needed: t_test = welch_t_test(alpha=0.01).
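
To make clearer what equal_var=False computes, here is a minimal standard-library sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom. The helper name welch_statistic is hypothetical and not part of Crystallize or SciPy; it mirrors the argument order used above (treatment first), so a positive statistic means the treatment mean is higher.

```python
from math import sqrt
from statistics import mean, variance


def welch_statistic(treatment, baseline):
    """Return (t, df) for Welch's unequal-variance t-test.

    Hypothetical helper for illustration only; in practice
    scipy.stats.ttest_ind(..., equal_var=False) also returns the p-value.
    """
    n_t, n_b = len(treatment), len(baseline)
    # Squared standard error of each sample mean (sample variance / n).
    se2_t = variance(treatment) / n_t
    se2_b = variance(baseline) / n_b
    # Difference of means scaled by the combined standard error.
    t = (mean(treatment) - mean(baseline)) / sqrt(se2_t + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_t + se2_b) ** 2 / (
        se2_t ** 2 / (n_t - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


t_stat, dof = welch_statistic([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
print(t_stat, dof)
```

With equal sample sizes and equal variances, as in this toy input, the degrees of freedom reduce to the pooled value n_t + n_b - 2; they shrink as the two variances diverge.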
2. Define a Hypothesis
Hypotheses specify which metrics feed the verifier and how to rank treatments. Provide a single metric name, a list of names, or a list of metric groups.

    from crystallize import hypothesis

    @hypothesis(verifier=welch_t_test(), metrics="score")
    def rank_by_p(result):
        # Fall back to 1.0 when the verifier result has no p-value.
        return result.get("p_value", 1.0)

- metrics="score" passes one metric list to the verifier.
- Use metrics=["a", "b"] to pass multiple lists.
- Use metrics=[["a"], ["b"]] to run the verifier on each group separately. The verify() result mirrors the grouping (single dict or list of dicts).
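
The three forms of metrics described above can be pictured with a small pure-Python model. This is only a sketch of the documented behaviour, not Crystallize's implementation: apply_verifier, select, and mean_diff are hypothetical names, and a plain KeyError stands in for the library's own missing-metric error.

```python
def select(samples, names):
    """Pick the requested metric lists, failing loudly on unknown names."""
    missing = [name for name in names if name not in samples]
    if missing:
        raise KeyError(f"missing metrics: {missing}")
    return {name: samples[name] for name in names}


def apply_verifier(spec, baseline, treatment, verifier):
    """Model how a metrics spec maps onto verifier calls (assumed behaviour)."""
    if isinstance(spec, str):
        groups, grouped = [[spec]], False          # "score"
    elif all(isinstance(item, str) for item in spec):
        groups, grouped = [list(spec)], False      # ["a", "b"]
    else:
        groups, grouped = [list(g) for g in spec], True  # [["a"], ["b"]]
    results = [
        verifier(select(baseline, g), select(treatment, g)) for g in groups
    ]
    # Grouped specs yield a list of dicts; the other forms yield one dict.
    return results if grouped else results[0]


def mean_diff(baseline, treatment):
    """Toy verifier: difference of means for every metric it receives."""
    return {
        name: sum(treatment[name]) / len(treatment[name])
        - sum(baseline[name]) / len(baseline[name])
        for name in baseline
    }


base = {"a": [1.0, 2.0, 3.0], "b": [10.0, 10.0]}
treat = {"a": [2.0, 3.0, 4.0], "b": [12.0, 14.0]}
print(apply_verifier("a", base, treat, mean_diff))
print(apply_verifier(["a", "b"], base, treat, mean_diff))
print(apply_verifier([["a"], ["b"]], base, treat, mean_diff))
```

The first two calls return a single dictionary, while the grouped form returns one dictionary per group, matching the "single dict or list of dicts" shape described above.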
3. Run the Experiment
Add the hypothesis to your Experiment and run as usual:

    # my_source, my_pipeline, and my_treatment are defined elsewhere.
    exp = Experiment(
        datasource=my_source(),
        pipeline=my_pipeline,
    )
    exp.validate()  # optional
    result = exp.run(
        treatments=[my_treatment()],
        hypotheses=[rank_by_p],
        replicates=10,
    )
    print(result.get_hypothesis("rank_by_p").results)

- MissingMetricError – Ensure all metric keys specified in metrics exist in ctx.metrics.
- Multiple metrics – When using metric groups, the verifier runs separately for each group and returns a list of results.
- Custom statistics – Your verifier can call any library (SciPy, PyTorch, etc.) as long as it returns a dictionary.
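
As an illustration of a custom statistic, the sketch below implements a two-sided permutation test using only the standard library. It follows the same calling convention as the Welch example (baseline and treatment are dictionaries holding a "score" list, and the result is a dictionary), but the function name permutation_test and the n_resamples and seed parameters are hypothetical. The @verifier decorator is left off so the block runs without Crystallize installed; applying it as in step 1 is assumed to work the same way.

```python
import random
from statistics import mean


def permutation_test(baseline, treatment, *, alpha=0.05,
                     n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of mean scores."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    base = list(baseline["score"])
    treat = list(treatment["score"])
    observed = abs(mean(treat) - mean(base))

    pooled = base + treat
    n_treat = len(treat)
    hits = 0
    for _ in range(n_resamples):
        # Reassign labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treat]) - mean(pooled[n_treat:]))
        if diff >= observed:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_resamples + 1)
    return {"p_value": p_value, "significant": p_value < alpha}


outcome = permutation_test(
    {"score": [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.14, 0.16]},
    {"score": [0.90, 0.80, 0.85, 0.88, 0.82, 0.91, 0.86, 0.84]},
)
print(outcome)
```

A permutation test makes no distributional assumption about the scores, at the cost of more computation per comparison than a closed-form test.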
Next Steps
- Review Custom Pipeline Steps to compute the metrics you need.
- See Customizing Experiments for seeding and parallel options.