Maximum Likelihood and Bayesian Basics
likelihood, maximum likelihood, bayesian inference, posterior, prior, map
1 Role
This page introduces two major ways to build point estimates from a statistical model.
Maximum likelihood starts from the observed data and asks which parameter value makes those data look most plausible. Bayesian inference adds a prior belief about the parameter and updates that belief after seeing the data.
2 First-Pass Promise
Read this page after Estimation and Bias-Variance.
If you stop here, you should still understand:
- what a likelihood function is
- how to form a simple maximum likelihood estimate
- what the terms prior, likelihood, posterior, posterior mean, and posterior mode refer to
- how MLE and Bayesian updating answer slightly different questions
3 Why It Matters
These two viewpoints sit underneath a huge amount of modern statistical and ML practice.
They show up when you:
- fit a Bernoulli success probability
- estimate a Gaussian mean or variance
- train models by maximizing log-likelihood
- regularize a fit in a way that has a Bayesian interpretation
- compare evidence from data against prior domain knowledge
If you learn only the formulas, the topic feels fragmented. If you learn the shared picture, a lot of later material becomes more coherent:
- likelihood: asks how well each parameter value explains the observed data
- Bayes: asks how to revise a prior belief after seeing those data
4 Prerequisite Recall
- a parameter \(\theta\) is an unknown quantity indexing a probabilistic model
- an estimator is a rule based on the sample
- Bayes’ rule updates probabilities after conditioning on observed evidence
- for repeated Bernoulli trials, the binomial model is a natural probability model
5 Intuition
Suppose you flip a possibly biased coin and observe data.
There are at least two natural questions you could ask:
- Which value of the bias \(p\) makes these observed flips look most plausible?
- What should I believe about \(p\) after combining the data with what I already believed?
The first question leads to maximum likelihood.
The second leads to Bayesian inference.
They use the same model ingredients, but the direction of interpretation is different:
- the likelihood treats the data as fixed and compares parameter values
- the posterior treats the data as observed and updates uncertainty about the parameter
That distinction matters. The likelihood function is not a probability distribution over \(\theta\) by itself. The posterior is.
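To make that concrete, here is a minimal numeric sketch (the values of \(x\) and \(n\) are illustrative assumptions, not from this page): integrating the Bernoulli likelihood over \(p\) gives a value far from 1, which is exactly why the likelihood is not a probability density over the parameter.

```python
# A minimal numeric check: for Bernoulli data with x successes in n trials,
# integrate the likelihood L(p; x) over p. The result is generally not 1,
# so the likelihood is not a probability density over the parameter.
# The values of x and n below are illustrative assumptions.
import numpy as np

x, n = 8, 10                                  # illustrative: 8 successes in 10 trials
p_grid = np.linspace(0.0, 1.0, 10_001)
likelihood = p_grid**x * (1.0 - p_grid)**(n - x)

dp = p_grid[1] - p_grid[0]
area = likelihood.sum() * dp                  # Riemann approximation of the integral
print(area)   # about 0.00202 = 1/((n+1) * C(n, x)), clearly not 1
```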
6 Formal Core
Definition 1 (Likelihood) Suppose the data are \(x\) and the model has parameter \(\theta\) with sampling distribution \(p(x \mid \theta)\).
The likelihood function is \[
L(\theta; x) = p(x \mid \theta),
\] viewed as a function of \(\theta\) with the observed data \(x\) held fixed.
The likelihood measures how strongly the observed data support each candidate parameter value.
Definition 2 (Maximum Likelihood Estimate) A maximum likelihood estimate (MLE) is any value \(\hat{\theta}_{\mathrm{MLE}}\) satisfying \[
\hat{\theta}_{\mathrm{MLE}} \in \arg\max_{\theta} L(\theta; x).
\]
Because the logarithm is monotone, the MLE can also be found by maximizing the log-likelihood \[ \ell(\theta; x) = \log L(\theta; x). \]
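A quick grid check of that monotonicity claim (the Bernoulli numbers below are illustrative assumptions): the likelihood and the log-likelihood peak at the same parameter value.

```python
# A minimal sketch: because log is monotone increasing, L and log L share
# the same argmax. Bernoulli example with x = 8 successes in n = 10 trials.
import numpy as np

x, n = 8, 10
p_grid = np.linspace(0.001, 0.999, 999)       # avoid log(0) at the endpoints
L = p_grid**x * (1.0 - p_grid)**(n - x)       # likelihood
ell = x * np.log(p_grid) + (n - x) * np.log(1.0 - p_grid)  # log-likelihood

print(p_grid[np.argmax(L)], p_grid[np.argmax(ell)])  # both near x/n = 0.8
```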
Definition 3 (Bayesian Updating) In the Bayesian view, \(\theta\) itself has a prior distribution \(\pi(\theta)\).
After observing data \(x\), the posterior distribution is \[ \pi(\theta \mid x) \propto p(x \mid \theta)\pi(\theta). \]
So the posterior is proportional to:
\[ \text{likelihood} \times \text{prior}. \]
Common Bayesian point estimates include the posterior mean and the posterior mode, also called the maximum a posteriori (MAP) estimate.
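Here is a minimal discrete sketch of Definition 3, in the spirit of a Bayes table (the three candidate values of \(p\) and the uniform prior over them are assumptions made for illustration, not from this page): posterior is proportional to likelihood times prior, then normalized.

```python
# A minimal "Bayes table" sketch with a discrete prior. The candidate
# biases and prior weights below are illustrative assumptions.
import numpy as np

p_values = np.array([0.4, 0.6, 0.8])      # assumed candidate biases
prior    = np.array([1/3, 1/3, 1/3])      # assumed uniform prior over them

x, n = 8, 10                              # observed: 8 successes in 10 trials
likelihood = p_values**x * (1 - p_values)**(n - x)

unnormalized = likelihood * prior         # posterior is proportional to this
posterior = unnormalized / unnormalized.sum()   # normalize so it sums to 1
print(posterior)                          # most mass moves onto p = 0.8

# Posterior mean, a common Bayesian point estimate:
print((p_values * posterior).sum())
```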
7 Worked Example
Suppose \(X_1,\dots,X_n\) are i.i.d. Bernoulli\((p)\) and you observe \(x\) successes in \(n\) trials.
7.1 Maximum Likelihood
If we summarize the data by the count \(x\), then the likelihood is proportional to \[ L(p; x) \propto p^x(1-p)^{n-x}, \qquad 0 \le p \le 1, \] since the omitted binomial coefficient does not depend on \(p\) and therefore does not change the MLE.
The log-likelihood is \[ \ell(p; x) = x \log p + (n-x)\log(1-p). \]
Differentiate and set equal to zero: \[ \frac{d}{dp}\ell(p; x) = \frac{x}{p} - \frac{n-x}{1-p} = 0. \]
Solving gives \[ \hat{p}_{\mathrm{MLE}} = \frac{x}{n}. \]
So for Bernoulli data, the MLE is just the sample proportion.
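A short numeric sanity check of this closed form (a sketch assuming scipy is available; the data are illustrative): minimize the negative log-likelihood and compare the answer with \(x/n\).

```python
# A minimal numeric check that the optimizer recovers the closed-form
# Bernoulli MLE x/n. Illustrative data: 8 successes in 10 trials.
import numpy as np
from scipy.optimize import minimize_scalar

x, n = 8, 10

def neg_log_lik(p):
    return -(x * np.log(p) + (n - x) * np.log(1.0 - p))

result = minimize_scalar(neg_log_lik, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(result.x, x / n)   # both approximately 0.8
```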
7.2 Bayesian Version
Now suppose the prior on \(p\) is Beta\((a,b)\), with density proportional to \[ p^{a-1}(1-p)^{b-1}. \]
Multiplying prior and likelihood gives \[ \pi(p \mid x) \propto p^x(1-p)^{n-x}\, p^{a-1}(1-p)^{b-1} = p^{a+x-1}(1-p)^{b+n-x-1}. \]
So the posterior is \[ p \mid x \sim \operatorname{Beta}(a+x,\; b+n-x). \]
This is the clean first example of Bayesian updating:
- the prior contributes pseudo-counts \(a-1\) and \(b-1\)
- the data contribute \(x\) successes and \(n-x\) failures
- the posterior combines both
7.3 Concrete Numbers
Suppose you observe \(x=8\) successes in \(n=10\) trials.
- MLE: \[ \hat{p}_{\mathrm{MLE}} = \frac{8}{10} = 0.8 \]
If the prior is Beta\((2,2)\), then the posterior is \[ \operatorname{Beta}(10,4). \]
The posterior mean is \[ \mathbb{E}[p \mid x] = \frac{10}{14} \approx 0.714. \]
So the Bayesian estimate is pulled toward the prior center compared with the MLE.
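A minimal sketch that reproduces these numbers with scipy.stats (the MAP expression in the last line is the standard Beta-mode formula, valid when both posterior shape parameters exceed 1):

```python
# Conjugate update for the concrete numbers above:
# x = 8 successes in n = 10 trials, Beta(2, 2) prior.
from scipy.stats import beta

a, b = 2, 2          # prior Beta(a, b)
x, n = 8, 10         # observed data

post = beta(a + x, b + n - x)     # posterior Beta(10, 4)
print(post.mean())                # 10/14 ≈ 0.714, pulled toward the prior center
print(post.median())              # another possible point summary
# Posterior mode (MAP), in closed form when a + x > 1 and b + n - x > 1:
print((a + x - 1) / (a + b + n - 2))   # 9/12 = 0.75
```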
That is the whole comparison in one picture:
- the MLE listens only to the observed data
- the Bayesian estimate listens to both data and prior
8 Computation Lens
A good workflow for simple likelihood-based estimation, sketched in code after the list, is:
- write down the model \(p(x \mid \theta)\)
- treat the observed data as fixed
- write the likelihood as a function of \(\theta\)
- simplify with the log-likelihood if needed
- optimize
- check whether the answer lies in the allowed parameter space
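Here is a minimal end-to-end sketch of that workflow, under assumed simulated data: i.i.d. Gaussian observations with known variance, where the MLE of the mean is the sample mean.

```python
# A minimal sketch of the likelihood workflow (illustrative, simulated data):
# Gaussian observations with known variance, unknown mean. The MLE of the
# mean is the sample mean, so the optimizer should agree with data.mean().
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 1.0                                        # known standard deviation (assumed)
data = rng.normal(loc=2.5, scale=sigma, size=50)   # simulated sample

def neg_log_lik(mu):
    # Gaussian negative log-likelihood, up to constants not involving mu
    return np.sum((data - mu) ** 2) / (2 * sigma**2)

result = minimize_scalar(neg_log_lik)
print(result.x, data.mean())                       # both approximately the sample mean
```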
For simple Bayesian updating (see the grid sketch after this list):
- choose a prior \(\pi(\theta)\)
- write the likelihood
- multiply prior and likelihood
- identify the posterior shape up to normalization
- extract a posterior summary such as mean, mode, or interval
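And a minimal grid version of the Bayesian workflow (no conjugacy needed; the grid resolution is an arbitrary choice): multiply prior by likelihood pointwise, normalize, and read off summaries. The numbers match the earlier Beta\((2,2)\) example.

```python
# A minimal grid sketch of Bayesian updating: prior times likelihood,
# normalized on a grid. Reproduces the Beta(2, 2) example from above.
import numpy as np

p = np.linspace(0.001, 0.999, 999)
dp = p[1] - p[0]

prior = p**(2 - 1) * (1 - p)**(2 - 1)        # Beta(2, 2) kernel
x, n = 8, 10
likelihood = p**x * (1 - p)**(n - x)

posterior = prior * likelihood               # unnormalized posterior
posterior /= posterior.sum() * dp            # normalize to a density on the grid

post_mean = np.sum(p * posterior) * dp
post_mode = p[np.argmax(posterior)]
print(post_mean, post_mode)                  # about 0.714 and 0.75
```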
9 Application Lens
These ideas connect directly to modern practice:
- many supervised learning models are trained by maximizing log-likelihood or an equivalent loss
- MAP estimation often looks like likelihood plus regularization
- Bayesian updating appears when prior structure matters, data are scarce, or uncertainty reporting is central
- model comparison, calibration, and probabilistic prediction all inherit this likelihood/posterior language
So even in advanced workflows, the underlying first-pass questions stay the same:
- what does the data-fitting term want?
- what does the prior or regularizer want?
- how much are they pulling against each other?
10 Stop Here For First Pass
If you can now explain:
- what a likelihood function is
- why likelihood is not itself a probability distribution over the parameter
- how to derive the Bernoulli MLE
- how prior and likelihood combine into a posterior
then this page has done its main job.
11 Go Deeper
The most useful next steps after this page are:
- Confidence Intervals and Hypothesis Testing, to move from point estimates to uncertainty statements and decisions
- Estimation and Bias-Variance if you want to compare MLE or Bayesian estimates through error criteria
- Joint, Conditional, and Bayes if the conditioning viewpoint still feels shaky
12 Optional Paper Bridge
- Penn State STAT 415 Lesson 1: Point Estimation - first pass; official mathematical-statistics treatment introducing maximum likelihood as a central point-estimation method. Checked 2026-04-24.
- MIT 18.05 Bayesian Updating with Discrete Priors - second pass; official MIT notes showing Bayes tables, posterior updates, and the contrast with MLE. Checked 2026-04-24.
- Penn State STAT 415 Bayesian Methods - second pass; official source for the posterior-as-update viewpoint and Bayesian point estimation. Checked 2026-04-24.
13 Optional After First Pass
If you want more practice before moving on:
- derive the MLE for a Gaussian mean with known variance
- compare a posterior mean with a posterior mode in a simple Beta-Binomial model
- identify a regularized ML objective and ask what prior it corresponds to
14 Common Mistakes
- treating the likelihood as a probability distribution over \(\theta\)
- forgetting that the data are fixed when viewing the likelihood as a function of the parameter
- thinking MLE and Bayes must give the same answer
- using Bayes’ rule without identifying the prior and posterior objects clearly
- forgetting that different Bayesian point summaries can give different estimates
15 Exercises
- For Bernoulli data with \(x\) successes in \(n\) trials, write the likelihood and derive the MLE.
- If the prior is Beta\((1,1)\) and the data are \(x=3\) successes in \(n=5\) trials, what is the posterior distribution?
- In words, explain one reason a Bayesian estimate might differ from the MLE on a small dataset.
16 Sources and Further Reading
- Penn State STAT 415 Lesson 1: Point Estimation - first pass; official source introducing MLE as a principled point-estimation method. Checked 2026-04-24.
- Penn State STAT 415 Bayesian Methods - second pass; official introduction to posterior-based estimation and Bayesian updating. Checked 2026-04-24.
- MIT 18.05 Introduction to Statistics - second pass; official MIT reading with clear examples contrasting MLE and Bayesian updating. Checked 2026-04-24.
- MIT 18.05 Exam 2 review: MLE and Bayesian updating - paper bridge; concise official review sheet showing the computational patterns for MLE, conjugate priors, and MAP-style thinking. Checked 2026-04-24.