Experimental Design and Model Evaluation
experimental design, randomization, blocking, train test split, cross validation, evaluation metrics
1 Role
This page is the capstone of the first-pass statistics module.
Its job is to answer the practical question behind almost every table, figure, or benchmark claim: can I trust this result, and if so, what exactly does it support?
2 First-Pass Promise
Read this page after Regression and Classification Basics.
If you stop here, you should still understand:
- why design choices such as randomization and controls matter before analysis
- why training, validation, and test data must play different roles
- what cross-validation is trying to estimate
- why model quality depends on the evaluation metric, not just on one score
3 Why It Matters
Weak experimental design can make strong models look convincing for the wrong reasons.
Typical failure modes are:
- no real control group or comparison baseline
- no randomization, so treatment groups differ before the intervention
- leakage from test data into model selection
- tuning decisions made on the same data used for final evaluation
- reporting only one convenient metric that hides the actual error costs
This is exactly where a lot of research mistakes happen. A model can be mathematically sophisticated and still be evaluated badly. A benchmark can look clean while the data split, randomization, or metric choice quietly invalidates the conclusion.
4 Prerequisite Recall
- confidence intervals and tests tell you how to reason about uncertainty once a design and model are in place
- regression and classification have different error notions and different evaluation needs
- overfitting means good training performance does not guarantee good future performance
5 Intuition
There are really two questions here:
- Was the data collection or experiment structured to support a fair comparison?
- Was the model evaluated on data that were genuinely new to the fitting process?
The first question is about experimental design.
The second is about model evaluation.
For design, the big ideas are:
- control: compare against something meaningful
- randomization: reduce systematic allocation bias
- blocking: control important sources of variation
For evaluation, the big ideas are:
- training: fit the model
- validation: choose model settings
- test: estimate final predictive performance
If you mix these roles together, the reported performance becomes too optimistic.
6 Formal Core
Definition 1 (Experimental Design Basics) In a designed experiment, we deliberately vary one or more factors and measure one or more responses.
Core first-pass design principles are:
- control: compare against an appropriate baseline or reference condition
- randomization: assign treatments in a way that reduces systematic bias
- blocking: group similar experimental units so nuisance variation is controlled rather than left to distort the comparison
Definition 2 (Training, Validation, and Test Roles) For predictive modeling:
- training set: used to fit model parameters
- validation set: used to choose models, tuning parameters, or thresholds
- test set: used only once for final assessment
If the test set influences modeling choices, it stops being an honest estimate of future predictive performance.
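The three roles can be sketched as a single shuffled split. This is a minimal illustration, not a prescribed recipe: the 60/20/20 fractions and the fixed seed are assumptions chosen for the example.

```python
import random

def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve off validation and test sets.

    The test slice is set aside before any modeling decision is made,
    so it can serve as a one-time final assessment.
    """
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
```

Fitting happens on `train`, tuning on `val`, and `test` is scored exactly once at the end.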
Proposition 1 (Cross-Validation and Metrics) Cross-validation estimates predictive performance by repeatedly splitting the data into training and evaluation folds.
A good evaluation metric must match the task:
- regression: RMSE, MAE, residual patterns
- classification: accuracy, precision, recall, false positive rate, confusion matrix, threshold-sensitive metrics
No single metric is universally “best.” The right metric depends on what kind of mistake matters in the application.
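To make concrete what cross-validation estimates, here is a hand-rolled k-fold loop scoring a deliberately trivial mean-only predictor with RMSE. The model, the fold count, and the seed are placeholders; the point is the structure, where each observation is scored only by a fit that never saw it.

```python
import math
import random
import statistics

def kfold_rmse(y, k=5, seed=0):
    """Estimate out-of-fold RMSE of a mean-only predictor via k-fold CV.

    Each fold is held out once; the "model" (here, just the training
    mean) is fit on the remaining folds and scored on the held-out fold.
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [y[i] for i in idx if i not in held]
        mean = statistics.fmean(train)      # "fit" on training folds only
        sq_errs = [(y[i] - mean) ** 2 for i in held_out]
        scores.append(math.sqrt(statistics.fmean(sq_errs)))
    return statistics.fmean(scores)         # average RMSE across folds
```

Swapping in a real model means replacing the mean with any fit computed from `train` alone; nothing else changes.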
7 Worked Example
Suppose a team wants to compare two recommendation models, A and B, on click-through rate and downstream purchase behavior.
7.1 Poor Design
They do the following:
- run model A in week 1
- run model B in week 2
- report that B has a higher click-through rate
Why is this weak?
- traffic may differ between weeks
- seasonality or product launches may have changed user behavior
- the comparison is confounded with time
So even if the summary statistic is computed correctly, the design does not isolate the model effect.
7.2 Better Design
A stronger first-pass design would:
- randomize users or sessions to A or B
- keep both models active in the same time window
- track the same primary response for both groups
- predefine the evaluation metric and decision rule
If important user strata differ a lot, the team might also block by region or device class to reduce nuisance variation.
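The randomize-and-block idea can be sketched directly. This is an illustrative sketch, not a production assignment service: the `id` and region fields and the fifty-fifty split within each block are assumptions for the example.

```python
import random
from collections import defaultdict

def blocked_assignment(users, block_key, seed=0):
    """Randomize users to arms A/B separately within each block.

    Blocking by, say, region or device class keeps the two arms
    balanced on that nuisance factor instead of leaving balance
    to chance alone.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for u in users:
        blocks[block_key(u)].append(u)
    assignment = {}
    for _, members in sorted(blocks.items()):
        rng.shuffle(members)            # randomize WITHIN the block
        half = len(members) // 2
        for u in members[:half]:
            assignment[u["id"]] = "A"
        for u in members[half:]:
            assignment[u["id"]] = "B"
    return assignment

users = [{"id": i, "region": "east" if i < 10 else "west"}
         for i in range(20)]
arms = blocked_assignment(users, lambda u: u["region"])
```

Each region now contributes equally to both arms, so a regional effect cannot masquerade as a model effect.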
7.3 Model Evaluation Side
Now suppose the team is fitting a classifier for purchase prediction.
A clean predictive workflow is:
- fit candidate models on training data
- choose thresholds or hyperparameters on validation data
- evaluate final precision/recall or other metrics once on held-out test data
If they repeatedly tune on the test set after seeing the score, the final test performance is no longer a trustworthy estimate of out-of-sample behavior.
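One way to keep the validation and test roles separate is to freeze every tuning decision, such as a classification threshold, using validation data only. The sketch below picks the threshold that maximizes F1 on validation; the candidate grid and the toy scores are assumptions, and the held-out test set is never consulted during the search.

```python
def best_threshold(val_scores, val_labels, candidates):
    """Pick a decision threshold on the VALIDATION data only.

    Once this choice is frozen, the test set is scored exactly once.
    """
    def f1_at(t):
        preds = [s >= t for s in val_scores]
        tp = sum(p and y for p, y in zip(preds, val_labels))
        fp = sum(p and not y for p, y in zip(preds, val_labels))
        fn = sum((not p) and y for p, y in zip(preds, val_labels))
        if tp == 0:
            return 0.0
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)
    return max(candidates, key=f1_at)

# toy validation data: two non-purchasers, two purchasers
t = best_threshold([0.1, 0.4, 0.6, 0.9],
                   [False, False, True, True],
                   candidates=[0.0, 0.5, 1.0])
```

Re-running this search after peeking at the test score is exactly the leakage the page warns about.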
7.4 Metric Choice
Suppose purchases are rare.
Then raw accuracy can be misleading. A model that predicts “no purchase” for almost everyone may look accurate while being useless.
In that case, metrics such as recall, precision, false positive rate, or a confusion-matrix view are often more informative than accuracy alone.
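A tiny hand computation makes the accuracy trap concrete. The 1% purchase rate is an assumed example; the all-negative "model" mimics the useless predictor described above.

```python
def confusion_metrics(y_true, y_pred):
    """Compute accuracy, recall, and precision from boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# assumed 1% purchase rate; the model predicts "no purchase" for everyone
y_true = [True] * 1 + [False] * 99
y_pred = [False] * 100
m = confusion_metrics(y_true, y_pred)
# accuracy is 0.99 even though the model never finds a single purchase
```

Recall is zero here, which is exactly the failure accuracy alone hides.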
So the full lesson is:
- design tells you whether the comparison is fair
- evaluation tells you whether the predictive claim is honest
- metrics tell you whether the reported performance matches the actual task cost
8 Computation Lens
A practical first-pass checklist is:
- identify the experimental unit
- identify the treatment, factor, or model comparison
- specify whether there is randomization or blocking
- separate training, validation, and test roles clearly
- decide which metric matches the decision cost
- check whether any leakage or post-hoc retuning contaminated the final result
- report both the metric and the uncertainty or variation when possible
This is often more useful than adding a fancier model.
9 Application Lens
This page is directly relevant to:
- A/B tests and product experiments
- benchmark comparisons in ML papers
- repeated-seed model reporting
- evaluation on imbalanced classification tasks
- engineering experiments with nuisance variation such as hardware, batch, or site effects
In all of these settings, the design and evaluation protocol are part of the scientific claim, not just surrounding logistics.
10 Stop Here For First Pass
If you can now explain:
- why randomization and controls matter
- why training, validation, and test sets must be separated
- what cross-validation is trying to estimate
- why metric choice depends on task structure and cost
then this page has done its main job.
11 Go Deeper
The most useful next steps after this page are:
- Applications, to connect these evaluation ideas to the broader applied side of the site
- Regression and Classification Basics if you want to revisit task-specific metrics and output types
- Confidence Intervals and Hypothesis Testing if you want the inferential side of uncertainty around effects and comparisons
12 Optional Paper Bridge
- NIST Experimental Design - First pass - official NIST introduction to what DOE is and why planning the experiment matters before data collection. Checked 2026-04-24.
- Penn State STAT 509 Randomization - Second pass - official clinical-trials perspective on why randomization helps and what it does not solve by itself. Checked 2026-04-24.
- Penn State STAT 508 Lesson 3 - Second pass - official lesson covering holdout splits, three-way splits, and cross-validation for predictive performance. Checked 2026-04-24.
- Google ML Crash Course: Accuracy, Precision, Recall - Second pass - concise official source for choosing classification metrics when error costs and class imbalance matter. Checked 2026-04-24.
13 Optional After First Pass
If you want more practice before moving on:
- inspect one ML paper and identify its training, validation, and test protocol
- rewrite a weak benchmark comparison into a better randomized or blocked design
- ask whether the chosen metric really matches the application cost of false positives and false negatives
14 Common Mistakes
- evaluating on the same data used for tuning
- reporting only accuracy on an imbalanced classification task
- confusing randomization with a cure for every bias
- ignoring nuisance variation that should have been blocked or controlled
- treating the test set as one more place to iterate on model design
15 Exercises
- Why is comparing model A in one week and model B in the next week usually weaker than randomizing both in the same period?
- In one sentence, explain the role of a validation set.
- Give one example where precision matters more than recall, and one where recall matters more than precision.
16 Sources and Further Reading
- NIST Experimental Design - First pass - official explanation of factors, responses, and why experiment planning is essential. Checked 2026-04-24.
- Penn State STAT 509 Randomization - First pass - official treatment of randomization and bias control from the clinical-trials viewpoint. Checked 2026-04-24.
- Penn State STAT 508 Lesson 3 - Second pass - official lesson on train/validation/test splits and cross-validation for predictive performance. Checked 2026-04-24.
- Google ML Crash Course: Accuracy, Precision, Recall - Second pass - official practical guide to metric choice under imbalance and asymmetric costs. Checked 2026-04-24.
Sources checked online on 2026-04-24:
- NIST Experimental Design
- Penn State STAT 509 Randomization
- Penn State STAT 508 Lesson 3
- Google ML Crash Course classification metrics