Probability Theory
Probability theory is the mathematical study of randomness. Born from gambling problems of the 17th century (Pascal, Fermat) and formalised by Kolmogorov in 1933, it underlies modern statistics, finance, physics, machine learning and information theory.
The probability space
A probability space is a triple (Ω, ℱ, P), where Ω is the sample space (the set of possible outcomes), ℱ is a σ-algebra of events (subsets of Ω), and P : ℱ → [0, 1] is a probability measure satisfying P(Ω) = 1 and countable additivity: P(∪Aᵢ) = ΣP(Aᵢ) for pairwise disjoint Aᵢ.
Axioms of probability
Kolmogorov's three axioms (1933):
- Non-negativity: P(A) ≥ 0 for every event A.
- Normalisation: P(Ω) = 1.
- Countable additivity: for pairwise disjoint A₁, A₂, …, P(∪Aᵢ) = ΣP(Aᵢ).
From these flow:
- P(∅) = 0, P(Aᶜ) = 1 − P(A).
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (inclusion–exclusion for 2 events).
- Monotonicity: A ⊆ B ⇒ P(A) ≤ P(B).
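These consequences can be checked exactly on a toy probability space. A minimal sketch (the fair six-sided die, events A = "even" and B = "at most three" are illustrative choices):

```python
from fractions import Fraction

# Exact probabilities on a fair six-sided die, checking the
# consequences of Kolmogorov's axioms listed above.
omega = set(range(1, 7))
P = lambda event: Fraction(len(event & omega), len(omega))

A = {2, 4, 6}          # "even"
B = {1, 2, 3}          # "at most three"

assert P(set()) == 0                           # P(∅) = 0
assert P(omega - A) == 1 - P(A)                # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)      # inclusion–exclusion
assert P({2}) <= P(A)                          # monotonicity: {2} ⊆ A
print(P(A | B))  # 5/6
```

Using `Fraction` keeps the arithmetic exact, so the identities hold with equality rather than up to floating-point error.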
Counting techniques
For finite equally likely outcomes:
- Permutations of n objects taken r at a time: P(n, r) = n!/(n − r)!.
- Combinations: C(n, r) = n!/(r!(n − r)!).
- Multinomial: n! / (n₁!…n_k!).
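All three counts are available (or easily built) in Python's standard library; a quick sketch, using the letters of "MISSISSIPPI" as a standard multinomial example:

```python
from math import comb, perm, factorial

# Permutations and combinations via the standard library.
assert perm(5, 2) == 20    # P(5, 2) = 5!/3!
assert comb(5, 2) == 10    # C(5, 2) = 5!/(2!·3!)

# Multinomial coefficient: distinct arrangements of "MISSISSIPPI",
# i.e. 11! / (1!·4!·4!·2!) for the counts of M, I, S, P.
n = factorial(11) // (factorial(1) * factorial(4) * factorial(4) * factorial(2))
print(n)  # 34650
```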
Conditional probability and Bayes' theorem
P(A | B) = P(A ∩ B)/P(B), provided P(B) > 0. Independence: A and B are independent iff P(A ∩ B) = P(A)P(B), equivalently P(A | B) = P(A).
Law of total probability: if B₁, B₂, … partition Ω, then P(A) = ΣP(A | Bᵢ)P(Bᵢ).
Bayes' theorem:
P(Bⱼ | A) = P(A | Bⱼ)P(Bⱼ) / ΣP(A | Bᵢ)P(Bᵢ)
Cornerstone of inferential statistics, medical diagnosis, machine learning (Naive Bayes, Bayesian networks).
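The medical-diagnosis use case makes a good worked example. The numbers below (1 % prevalence, 99 % sensitivity, 5 % false-positive rate) are hypothetical, chosen only to show how a rare condition keeps the posterior low even for an accurate test:

```python
# Bayes' theorem with illustrative (hypothetical) numbers.
p_d = 0.01          # P(disease): prevalence
p_pos_d = 0.99      # P(positive | disease): sensitivity
p_pos_nd = 0.05     # P(positive | no disease): false-positive rate

# Denominator via the law of total probability.
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)
posterior = p_pos_d * p_d / p_pos
print(round(posterior, 3))  # ≈ 0.167: a positive result is far from conclusive
```

The false positives among the healthy 99 % swamp the true positives among the sick 1 %, which is exactly what the denominator ΣP(A | Bᵢ)P(Bᵢ) accounts for.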
Random variables and distributions
A random variable X is a measurable function Ω → ℝ. Two types:
- Discrete: takes countably many values; described by a probability mass function (pmf) p(x) = P(X = x).
- Continuous: described by a probability density function (pdf) f(x); P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
The cumulative distribution function (CDF) is F(x) = P(X ≤ x), non-decreasing from 0 to 1.
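A pmf and its CDF can be built directly from the definitions; a sketch for Binomial(4, 0.5) (an arbitrary small example):

```python
from math import comb

# pmf and CDF of Binomial(4, 0.5), built from the definition.
n, p = 4, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
cdf = lambda x: sum(v for k, v in pmf.items() if k <= x)

assert abs(sum(pmf.values()) - 1) < 1e-12       # pmf sums to 1
assert cdf(-1) == 0 and abs(cdf(4) - 1) < 1e-12 # CDF runs from 0 to 1
print(round(cdf(2), 4))  # P(X ≤ 2) = (1 + 4 + 6)/16 = 0.6875
```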
Important distributions
| Distribution | Type | pmf/pdf | Mean | Variance |
|---|---|---|---|---|
| Bernoulli(p) | Discrete | p^x(1−p)^(1−x), x ∈ {0, 1} | p | p(1−p) |
| Binomial(n, p) | Discrete | C(n,k)pᵏ(1−p)ⁿ⁻ᵏ | np | np(1−p) |
| Geometric(p) | Discrete | (1−p)ᵏ⁻¹p | 1/p | (1−p)/p² |
| Poisson(λ) | Discrete | e^(−λ) λᵏ/k! | λ | λ |
| Uniform(a, b) | Continuous | 1/(b−a) | (a+b)/2 | (b−a)²/12 |
| Exponential(λ) | Continuous | λe^(−λx) | 1/λ | 1/λ² |
| Normal(μ, σ²) | Continuous | (1/√(2πσ²))e^(−(x−μ)²/(2σ²)) | μ | σ² |
| Gamma(α, β) | Continuous | β^α x^(α−1) e^(−βx)/Γ(α) | α/β | α/β² |
- The Poisson distribution approximates Binomial(n, p) when n is large and p small (with λ = np).
- The memoryless property characterises the exponential and geometric distributions: P(X > s + t | X > s) = P(X > t).
- The 68–95–99.7 rule: a normal distribution places about 68 % of its probability within ±1σ of the mean, about 95 % within ±2σ, and about 99.7 % within ±3σ.
- Sum of independent normals is normal; sum of independent Poissons is Poisson; sum of i.i.d. exponentials is gamma.
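The Poisson approximation to the binomial can be checked numerically; a sketch with n = 1000, p = 0.003 (arbitrary values in the "large n, small p" regime, so λ = 3):

```python
from math import comb, exp, factorial

# Poisson(λ = np) approximating Binomial(n, p) for large n, small p.
n, p = 1000, 0.003
lam = n * p

binom = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
poisson = lambda k: exp(-lam) * lam**k / factorial(k)

for k in range(6):
    # Pointwise agreement to about three decimal places.
    assert abs(binom(k) - poisson(k)) < 1e-3
print(round(poisson(2), 4))
```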
Expectation, variance and moments
Expectation: E[X] = Σx p(x) (discrete) or ∫x f(x) dx (continuous). Linearity: E[aX + bY] = aE[X] + bE[Y] (no independence needed).
Variance: Var(X) = E[(X − μ)²] = E[X²] − (E[X])². Standard deviation σ = √Var(X).
Covariance: Cov(X, Y) = E[XY] − E[X]E[Y]. Independent ⇒ Cov = 0 (but not conversely).
Correlation coefficient ρ = Cov(X, Y)/(σ_X σ_Y) ∈ [−1, 1].
Moment generating function (MGF) M_X(t) = E[e^(tX)] uniquely determines a distribution when it exists in an interval around 0.
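The defining formulas for E[X] and Var(X) can be evaluated exactly for a small discrete distribution; a sketch for the fair die:

```python
from fractions import Fraction

# Exact mean and variance of a fair die from the definitions above.
vals = range(1, 7)
p = Fraction(1, 6)                      # uniform pmf
E  = sum(p * x for x in vals)           # E[X]  = Σ x p(x)
E2 = sum(p * x * x for x in vals)       # E[X²] = Σ x² p(x)
var = E2 - E**2                         # Var(X) = E[X²] − (E[X])²

print(E, var)  # 7/2 35/12
```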
Conditional expectation
For random variables X, Y, the conditional expectation E[X | Y] is a random variable (a function of Y) satisfying:
- Tower property: E[E[X | Y]] = E[X].
- Pulling-out known factors: E[g(Y)X | Y] = g(Y)E[X | Y].
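The tower property can be verified exactly on a small example: X a fair die, Y the parity of X (so E[X | Y] takes the value 4 on evens and 3 on odds):

```python
from fractions import Fraction

# Tower property: E[E[X | Y]] = E[X] for X a fair die, Y = parity of X.
half = Fraction(1, 2)                       # P(even) = P(odd) = 1/2
E_X_given_even = Fraction(2 + 4 + 6, 3)     # E[X | Y = even] = 4
E_X_given_odd  = Fraction(1 + 3 + 5, 3)     # E[X | Y = odd]  = 3
tower = half * E_X_given_even + half * E_X_given_odd

assert tower == Fraction(7, 2)              # equals E[X] = 3.5
print(tower)  # 7/2
```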
Limit theorems
The two pillars of probability theory:
Law of Large Numbers (LLN)
For i.i.d. random variables X₁, X₂, … with finite mean μ, the sample mean X̄_n = (X₁ + … + X_n)/n converges to μ:
- Weak LLN: X̄_n → μ in probability.
- Strong LLN: X̄_n → μ almost surely.
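A Monte-Carlo sketch of the LLN for fair coin flips (Bernoulli(1/2), μ = 0.5); the seed is fixed only for reproducibility:

```python
import random

# Sample means of fair coin flips approach μ = 0.5 as n grows.
random.seed(0)
for n in (10, 1_000, 100_000):
    mean = sum(random.random() < 0.5 for _ in range(n)) / n
    print(n, round(mean, 3))
```

The deviation from 0.5 shrinks roughly like 1/√n, which foreshadows the √n scaling in the CLT below.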
Central Limit Theorem (CLT)
If X₁, …, X_n are i.i.d. with mean μ and finite variance σ², then
√n (X̄_n − μ)/σ → N(0, 1) in distribution
Hence for large n, X̄_n is approximately N(μ, σ²/n). The CLT explains why the normal distribution is ubiquitous: it is the universal limit of normalised sums.
A practical consequence: if a sample is drawn from any population (not necessarily normal!) with mean μ and variance σ², and n is "large" (rule of thumb n ≥ 30), then the sample mean is approximately normal with mean μ and variance σ²/n. This is what justifies z- and t-confidence intervals for the mean even with non-normal data.
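A simulation sketch: sample means of n = 50 draws from the heavily skewed Exponential(1) population (μ = 1, σ² = 1) should be close to N(1, 1/50), and the known-σ z-interval follows. The sample sizes and seed are arbitrary choices:

```python
import random, statistics

# Means of n = 50 draws from skewed Exponential(1) ≈ N(1, 1/50) by the CLT.
random.seed(1)
means = [statistics.fmean(random.expovariate(1.0) for _ in range(50))
         for _ in range(5_000)]
m, s = statistics.fmean(means), statistics.stdev(means)
print(round(m, 2), round(s, 2))     # ≈ 1.0 and ≈ (1/50)**0.5 ≈ 0.14

# Approximate 95 % z-interval for μ from one sample mean, σ = 1 known:
xbar = means[0]
half_width = 1.96 * (1 / 50) ** 0.5  # 1.96 · σ/√n
print(round(xbar - half_width, 2), round(xbar + half_width, 2))
```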
Joint distributions and independence
For two random variables X, Y: joint pmf p(x, y) or joint pdf f(x, y). Marginals f_X(x) = ∫f(x, y) dy. Independence: f(x, y) = f_X(x) f_Y(y) for all x, y.
Conditional density: f_{X|Y}(x | y) = f(x, y)/f_Y(y), provided f_Y(y) > 0.
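For a discrete joint pmf, marginals and the independence criterion can be checked mechanically; a sketch with two fair independent bits (an arbitrary small example):

```python
from fractions import Fraction as F

# Marginals and an independence check for a discrete joint pmf.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 4),
         (1, 0): F(1, 4), (1, 1): F(1, 4)}   # two fair independent bits

# Marginals: sum the joint pmf over the other coordinate.
pX = {x: sum(p for (a, b), p in joint.items() if a == x) for x in (0, 1)}
pY = {y: sum(p for (a, b), p in joint.items() if b == y) for y in (0, 1)}

# Independence: joint = product of marginals at every point.
independent = all(joint[(x, y)] == pX[x] * pY[y]
                  for x in (0, 1) for y in (0, 1))
print(independent)  # True
```

Changing any one entry of `joint` (while keeping the total 1) breaks the factorisation and the check returns False.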
Beyond classical probability
Modern probability extends Kolmogorov's framework to:
- Stochastic processes (Markov chains, Brownian motion, martingales).
- Stochastic calculus (Itô, used in mathematical finance).
- Information theory (entropy H = −Σp log p, mutual information).
- Concentration inequalities (Markov, Chebyshev, Chernoff, Hoeffding) used in machine learning.
- Quantum probability: events form a non-commutative algebra; classical probability is recovered for commuting observables.
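Of the tools above, concentration inequalities are the easiest to see in action. A sketch checking Chebyshev's bound P(|X − μ| ≥ kσ) ≤ 1/k² by simulation for a fair die (μ = 3.5, σ² = 35/12; the sample size and k values are arbitrary):

```python
import random

# Chebyshev's inequality checked empirically for a fair die.
random.seed(2)
mu, var = 3.5, 35 / 12
sigma = var ** 0.5
xs = [random.randint(1, 6) for _ in range(100_000)]

for k in (1, 1.5, 2):
    freq = sum(abs(x - mu) >= k * sigma for x in xs) / len(xs)
    assert freq <= 1 / k**2 + 0.01   # empirical tail ≤ Chebyshev bound
print("Chebyshev bounds hold")
```

The bound is loose (for k = 2 the die has no outcome that far from the mean, so the true tail is 0) but it requires only a finite variance, which is what makes it so widely applicable.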
Probability theory thus runs from gambling problems to the deepest currents of mathematics and modern science.