Descriptive Statistics and Data Models
descriptive statistics, sample, population, parameter, statistic, exploratory data analysis
1 Role
This page is the entry point to statistics.
Its job is to teach the habits that must come before inference: identify what the data represent, what kind of variables you have, and which summaries or plots actually fit the problem.
2 First-Pass Promise
Read this page first in the statistics module.
If you stop here, you should still understand:
- the difference between a population, a sample, a parameter, and a statistic
- how to identify observational units and variable types
- how to summarize categorical and quantitative data differently
- why a data summary is only as trustworthy as the data model behind it
3 Why It Matters
A lot of bad statistical reasoning starts before any formula appears.
Typical failure modes are simple:
- the observational unit is unclear
- a statistic is treated like a population truth
- categorical data are summarized as if the labels had numeric meaning
- one extreme value dominates the mean and nobody notices
- a benchmark table hides variation because only one number is reported
In CS, AI, and engineering, this shows up constantly: benchmark summaries, A/B tests, sensor logs, ablation tables, error distributions, latency reports, and user studies all begin with descriptive statistics. If this layer is weak, the later inference is built on sand.
4 Prerequisite Recall
- probability describes uncertainty using models of random outcomes
- a random variable is a numerical quantity attached to an outcome
- expectation and variance describe average behavior and spread in a model
5 Intuition
Before asking “what conclusion should we draw?”, statistics asks a more basic question:
what exactly is being measured, on what units, under what collection process?
That is the data-model mindset.
A data model, at this level, is not a complicated probabilistic object. It is the structured description of:
- what one row or one observation stands for
- which variables were recorded
- which variables are categorical or quantitative
- which variable is explanatory and which is the response, if roles matter
- what population the sample is supposed to represent
Once that is clear, descriptive statistics become much easier. You know what should be counted, averaged, compared, graphed, or left alone.
6 Formal Core
Definition 1 (Core Statistical Roles)
- Population: the larger collection of units you ultimately care about
- Sample: the observed subset of units you actually measured
- Parameter: a numerical feature of the population, such as a population mean or proportion
- Statistic: a numerical feature computed from the sample, such as a sample mean or sample proportion
A central statistical task is to use sample statistics to learn something about population parameters.
Definition 2 (Data Model) For first-pass statistics, a useful working data model records:
- the observational unit
- the variables measured on each unit
- the variable type of each variable
- the role of each variable, when relevant
- the intended population and sampling story
Without this structure, numerical summaries are easy to misread.
Proposition 1 (Summary Rule) Choose descriptive tools to match the variable type:
- for categorical variables, use counts, proportions, and bar-style displays
- for quantitative variables, use summaries of center and spread such as mean, median, standard deviation, quartiles, and boxplots or histograms
- for grouped data, compare summaries within groups rather than only pooling everything together
The right summary is the one that preserves the important structure of the data instead of hiding it.
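The summary rule can be made concrete with a short sketch. The toy `device_type` and `latency_ms` values below are hypothetical, invented only to show that categorical variables get counts and proportions while quantitative variables get center and spread:

```python
from collections import Counter
from statistics import mean, median, stdev

# hypothetical toy data: device_type is categorical, latency_ms is quantitative
device_type = ["gpu", "cpu", "gpu", "gpu", "cpu"]
latency_ms = [95.0, 140.0, 97.0, 101.0, 150.0]

# categorical: counts and proportions, never averages of the labels themselves
counts = Counter(device_type)
proportions = {k: v / len(device_type) for k, v in counts.items()}

# quantitative: summaries of center and spread
center = {"mean": mean(latency_ms), "median": median(latency_ms)}
spread = {"stdev": stdev(latency_ms)}
```

Note that nothing numeric is ever computed from the category labels; that mismatch is one of the failure modes listed earlier.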
7 Worked Example
Suppose an engineering team records inference latency for a prototype model on eight requests:
\[ 92,\;95,\;97,\;101,\;104,\;110,\;112,\;180 \text{ milliseconds.} \]
The observational unit is one request.
The main variable is latency_ms, which is quantitative. If the team also records device_type, then that variable is categorical.
Now compute a few descriptive summaries for latency_ms:
- sample mean: \[ \bar{x} = \frac{92+95+97+101+104+110+112+180}{8} = 111.375 \]
- sample median: the middle pair is \(101\) and \(104\), so the median is \[ \frac{101+104}{2}=102.5 \]
- minimum and maximum: \(92\) and \(180\)
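These summaries can be reproduced with a few lines of standard-library Python, using the eight latency values from the text:

```python
from statistics import mean, median

# latency per request in milliseconds (the eight requests from the example)
latencies = [92, 95, 97, 101, 104, 110, 112, 180]

print(mean(latencies))             # 111.375
print(median(latencies))           # 102.5
print(min(latencies), max(latencies))  # 92 180
```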
What do we learn?
- the mean is pulled upward by the large value \(180\)
- the median stays closer to the bulk of the runs
- reporting only the mean would hide the possibility of a long-tail slowdown
This is exactly why descriptive statistics are not “just bookkeeping.” They determine what a reader sees as typical, variable, or suspicious.
If the same data were split by device_type, then a better summary might report separate medians and spreads for each device rather than one pooled number.
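A grouped summary of that kind might look like the following sketch. The assignment of requests to devices here is hypothetical, purely to illustrate computing a separate median per group instead of one pooled number:

```python
from collections import defaultdict
from statistics import median

# hypothetical per-request records: (device_type, latency_ms)
requests = [("cpu", 92), ("cpu", 95), ("cpu", 97), ("cpu", 101),
            ("gpu", 104), ("gpu", 110), ("gpu", 112), ("gpu", 180)]

# group latencies by device, then summarize within each group
by_device = defaultdict(list)
for device, latency in requests:
    by_device[device].append(latency)

group_medians = {device: median(vals) for device, vals in by_device.items()}
# cpu median: (95+97)/2 = 96; gpu median: (110+112)/2 = 111
```

The pooled median of 102.5 describes neither group well, which is exactly the point of splitting.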
8 Computation Lens
A good first descriptive pass over any dataset is:
- identify the observational unit
- list the variables and classify them as categorical or quantitative
- decide whether any variable plays an explanatory or response role
- compute counts/proportions for categorical variables
- compute center and spread summaries for quantitative variables
- make at least one plot that can reveal skew, outliers, or imbalance
- ask whether pooled summaries are hiding meaningful subgroups
This is often the fastest way to catch data issues before building models.
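The checklist above can be sketched as a small helper. The function name `first_pass` and the dict-of-rows layout are assumptions for illustration, not a library API:

```python
from collections import Counter
from statistics import mean, median, stdev

def first_pass(rows, categorical, quantitative):
    """Descriptive first pass over a list of dict rows (a sketch, not a library API)."""
    report = {}
    # categorical variables: counts of each label
    for var in categorical:
        report[var] = Counter(row[var] for row in rows)
    # quantitative variables: center and spread
    for var in quantitative:
        values = [row[var] for row in rows]
        report[var] = {"mean": mean(values), "median": median(values),
                       "stdev": stdev(values),
                       "min": min(values), "max": max(values)}
    return report

# each row is one observational unit: a single request
rows = [{"device": "cpu", "latency_ms": 92},
        {"device": "cpu", "latency_ms": 95},
        {"device": "gpu", "latency_ms": 110},
        {"device": "gpu", "latency_ms": 180}]

report = first_pass(rows, categorical=["device"], quantitative=["latency_ms"])
```

Deciding which columns go in the `categorical` and `quantitative` lists is itself the data-model step; the code only works once that decision is made.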
9 Application Lens
In ML and systems papers, descriptive statistics appear in places that people often overlook:
- benchmark tables across seeds or datasets
- latency and throughput summaries
- calibration or error distributions
- class-imbalance tables
- ablation studies split by task or architecture
If a paper reports only one mean score with no sense of spread, grouping, or sample size, your first statistical question should be: what structure of the data is being hidden?
That is descriptive statistics doing real research work.
10 Stop Here For First Pass
If you can now explain:
- what population, sample, parameter, and statistic mean
- what a simple first-pass data model looks like
- how summaries differ for categorical versus quantitative variables
- why outliers, grouping, and collection design matter before inference
then this page has done its main job.
11 Go Deeper
The most useful next steps after this page are:
- Estimation and Bias-Variance, to understand how sample summaries target population quantities
- Expectation, Variance, Covariance if you want the probability-side view of average and spread
- Sample Spaces, Events, and Conditioning if you want to revisit how data collection and conditioning interact
12 Optional Paper Bridge
- Penn State STAT 500 Lesson 1: Collecting and Summarizing Data - First pass - strong official lesson on variable types, sampling concerns, graphical summaries, and descriptive measures. Checked 2026-04-24.
- CMU OLI Probability & Statistics - Second pass - useful official second perspective for structured beginner practice. Checked 2026-04-24.
- NIST/SEMATECH e-Handbook: Exploratory Data Analysis - Paper bridge - strong official bridge from basic summaries to real analytic workflow, especially plots, anomalies, and model checking. Checked 2026-04-24.
13 Optional After First Pass
If you want more practice before moving on:
- take one table from a paper and identify its observational unit, variables, and hidden sampling story
- compute both mean and median on a skewed dataset and explain the difference
- ask whether a pooled summary should be split by subgroup, seed, hardware, or class
14 Common Mistakes
- confusing a sample statistic with a population fact
- averaging category labels as if the labels had numeric meaning
- reporting only the mean when the data are skewed or contain outliers
- forgetting to say what one row of the dataset represents
- treating a descriptive summary as if it already implied causality or significance
15 Exercises
- A survey records favorite operating system for 300 students. What are the observational units, variable type, and appropriate summaries?
- A dataset contains response times in milliseconds with one unusually large outlier. Explain why the mean and median may tell different stories.
- In a benchmark table with results pooled across three hardware types, give one reason why the pooled mean might be misleading.
16 Sources and Further Reading
- Penn State STAT 500 Lesson 1: Collecting and Summarizing Data - First pass - official open lesson with clear treatment of variables, data collection, and descriptive summaries. Checked 2026-04-24.
- MIT 18.05 Introduction to Probability and Statistics - Second pass - official MIT course showing how descriptive and inferential viewpoints fit into one statistics sequence. Checked 2026-04-24.
- NIST/SEMATECH e-Handbook: Exploratory Data Analysis - Paper bridge - excellent official reference for the habit of plotting and checking data before formal modeling. Checked 2026-04-24.
Sources checked online on 2026-04-24:
- Penn State STAT 500 Lesson 1
- CMU OLI Probability & Statistics
- NIST exploratory data analysis chapter
- MIT 18.05 course page