Skip to content

Crystallize | Rigorous, Reproducible Experiments in Python

Turn hypotheses into clear, reproducible, and statistically sound experiments.

Crystallize is a lightweight Python library for structuring scientific and machine learning (ML) experiments. It provides a clear, declarative framework to ensure your work is reproducible, statistically rigorous, and easy to understand. Instead of writing tangled scripts, you define modular components that Crystallize orchestrates into a controlled experiment.

At its heart, Crystallize organizes your research around a few key ideas:

  • Experiment: The central coordinator that runs your variations, collects data, and evaluates outcomes.
  • DataSource: A function or class that fetches or generates your initial data.
  • Pipeline: A sequence of PipelineStep objects that deterministically transform data.
  • Treatment: A variation applied to an experiment, like a new dataset or a different hyperparameter. The baseline (control) and treatments are compared.
  • Hypothesis: A formal assertion that a treatment will cause a specific, measurable effect, verified with a statistical test.
  • Reproducibility: Achieved through content-addressable caching of pipeline steps and immutable contexts that prevent accidental side effects between runs.

Get a feel for Crystallize by defining a simple A/B test. In this example, we test whether adding a value (delta) to our initial data significantly changes the outcome.

main.py
from crystallize import (
data_source,
hypothesis,
pipeline_step,
treatment,
verifier,
)
from crystallize import (
SeedPlugin,
ParallelExecution,
FrozenContext,
Experiment,
Pipeline,
)
from scipy.stats import ttest_ind
import random
# 1. Define how to get data
@data_source
def initial_data(ctx: FrozenContext):
return [0, 0, 0]
# 2. Define the data processing pipeline
@pipeline_step()
def add_delta(data, ctx: FrozenContext):
# The 'delta' value is injected by our treatment
return [x + ctx.get("delta", 0.0) for x in data]
@pipeline_step()
def add_random(data, ctx: FrozenContext):
# Add some random noise to the data
# So p-values don't throw errors
return [x + random.random() for x in data]
@pipeline_step()
def compute_metrics(data, ctx: FrozenContext):
# Record a metric for the hypothesis
ctx.metrics.add("result", sum(data))
return data
# 3. Define the treatment (the change we are testing)
add_ten = treatment(
name="add_ten_treatment",
apply={"delta": 10.0} # This dict is added to the context
)
# 4. Define the hypothesis to verify
@verifier
def welch_t_test(baseline, treatment, alpha: float = 0.05):
t_stat, p_value = ttest_ind(
treatment["result"], baseline["result"], equal_var=False
)
return {"p_value": p_value, "significant": p_value < alpha}
@hypothesis(verifier=welch_t_test(), metrics="result")
def check_for_improvement(res):
# The ranker function determines the "best" treatment.
# Lower p-value is better.
return res.get("p_value", 1.0)
# 5. Build and run the experiment
if __name__ == "__main__":
experiment = Experiment(
datasource=initial_data(),
pipeline=Pipeline([add_delta(), add_random(), compute_metrics()]),
plugins=[SeedPlugin(), ParallelExecution()],
)
experiment.validate() # optional
result = experiment.run(
treatments=[add_ten()],
hypotheses=[check_for_improvement],
replicates=20, # Run the experiment 20 times for statistical power
)
# Print the results for our hypothesis
hyp_result = result.get_hypothesis("check_for_improvement")
print(hyp_result.results)

This code defines a complete experiment, runs it 20 times for both the baseline (delta=0) and the treatment (delta=10), and uses a t-test to check if the difference in the result metric is statistically significant.


Crystallize’s documentation is organized by the Diátaxis framework, designed to help you find what you need quickly.

Migration Guide

Upgrading to v0.25.2? Learn about strict injection and deterministic seeding changes.

Troubleshooting

Fix common runtime errors (async loops in notebooks, missing context params, extras imports).


What makes Crystallize different from other experiment frameworks?

Crystallize is uniquely focused on scientific and statistical rigor. Its design enforces a clean separation between data processing (Pipelines), variations (Treatments), and evaluation (Hypotheses), which helps prevent common pitfalls in experimental design. The immutable context and automatic caching ensure results are trustworthy and reproducible.

How does Crystallize improve reproducibility?

  1. Immutable Contexts: The FrozenContext prevents steps from accidentally modifying parameters used by other steps, eliminating a common source of bugs.
  2. Content-Based Caching: Pipeline steps are cached based on a hash of their code and parameters, plus a hash of their input. If nothing has changed, Crystallize reuses the cached result, guaranteeing identical inputs produce identical outputs.
  3. Declarative Structure: By defining experiments in code (or YAML), the entire setup is version-controllable and shareable.

Is Crystallize ready for production?

Crystallize is currently in a pre-alpha stage, as noted in the README.md. The core API is stabilizing, but breaking changes are still possible. It is best suited for research and development environments where rigorous experimentation is valued.

Ready to dive in? Head over to the Getting Started Tutorial to build your first experiment.