
Integrating Statistical Tests

Crystallize separates data processing from statistical evaluation: Verifiers implement a statistical test, and Hypotheses use them to compare baseline and treatment metrics.

Apply the @verifier decorator to a function that accepts baseline and treatment metric samples; the function can return any dictionary of results. This example wraps SciPy's Welch t-test:

from crystallize import verifier
from scipy.stats import ttest_ind


@verifier
def welch_t_test(baseline, treatment, *, alpha: float = 0.05):
    t_stat, p_value = ttest_ind(
        treatment["score"], baseline["score"], equal_var=False
    )
    return {"p_value": p_value, "significant": p_value < alpha}

Instantiate it with parameters if needed: t_test = welch_t_test(alpha=0.01).
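Both forms yield verifier instances that can be passed to a hypothesis (a minimal sketch; the names below are illustrative):

default_test = welch_t_test()           # keeps the default alpha of 0.05
strict_test = welch_t_test(alpha=0.01)  # alpha is keyword-only and bound at instantiation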

Hypotheses specify which metrics feed the verifier and how to rank treatments. Provide a single metric name, a list of names, or a list of metric groups.

from crystallize import hypothesis


@hypothesis(verifier=welch_t_test(), metrics="score")
def rank_by_p(result):
    return result.get("p_value", 1.0)

  • metrics="score" passes one metric list to the verifier.
  • Use metrics=["a", "b"] to pass multiple lists.
  • Use metrics=[["a"], ["b"]] to run the verifier on each group separately. The verify() result mirrors the grouping (single dict or list of dicts); see the sketch after this list.
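
As an illustration of the grouped form, here is a minimal sketch. The metric names accuracy and latency and the function names are illustrative; it assumes each group's samples arrive as a mapping from metric name to values (as in welch_t_test above) and that the ranker receives the grouped verify() result, i.e. one dict per group:

from crystallize import hypothesis, verifier
from scipy.stats import ttest_ind


@verifier
def group_t_test(baseline, treatment, *, alpha: float = 0.05):
    # Each group here holds a single metric; compare whichever one it contains.
    name = next(iter(baseline))
    _, p_value = ttest_ind(treatment[name], baseline[name], equal_var=False)
    return {"metric": name, "p_value": p_value, "significant": p_value < alpha}


@hypothesis(verifier=group_t_test(), metrics=[["accuracy"], ["latency"]])
def rank_by_worst_p(results):
    # One result dict per group; rank treatments by their weakest p-value.
    return max(r.get("p_value", 1.0) for r in results)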

Add the hypothesis to your Experiment and run as usual:

exp = Experiment(
    datasource=my_source(),
    pipeline=my_pipeline,
)
exp.validate()  # optional
result = exp.run(treatments=[my_treatment()], hypotheses=[rank_by_p], replicates=10)
print(result.get_hypothesis("rank_by_p").results)
A few common issues:
  • MissingMetricError – ensure every metric key named in metrics exists in ctx.metrics.
  • Multiple metrics – with metric groups, the verifier runs separately for each group and returns a list of results.
  • Custom statistics – your verifier can call any library (SciPy, PyTorch, etc.) as long as it returns a dictionary; see the sketch below.
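
As an example of the last point, the same pattern works with a non-parametric test. This is a minimal sketch wrapping SciPy's Mann-Whitney U test; the name mann_whitney is illustrative, and the metric key is assumed to be score as above:

from crystallize import verifier
from scipy.stats import mannwhitneyu


@verifier
def mann_whitney(baseline, treatment, *, alpha: float = 0.05):
    # Rank-based test; makes no normality assumption about the samples.
    stat, p_value = mannwhitneyu(
        treatment["score"], baseline["score"], alternative="two-sided"
    )
    return {"u_statistic": stat, "p_value": p_value, "significant": p_value < alpha}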