Probability Theory
Probability theory is the mathematical study of randomness. Born from gambling problems of the 17th century (Pascal, Fermat) and formalised by Kolmogorov in 1933, it underlies modern statistics, finance, physics, machine learning and information theory.
The probability space
A probability space is a triple (Ω, ℱ, P), where Ω is the sample space (the set of possible outcomes), ℱ is a σ-algebra of events (subsets of Ω), and P : ℱ → [0, 1] is a probability measure satisfying P(Ω) = 1 and countable additivity: P(∪Aᵢ) = ΣP(Aᵢ) for pairwise disjoint Aᵢ.
Axioms of probability
Kolmogorov's three axioms (1933):
- Non-negativity: P(A) ≥ 0 for every event A.
- Normalisation: P(Ω) = 1.
- Countable additivity: for pairwise disjoint A₁, A₂, …, P(∪Aᵢ) = ΣP(Aᵢ).
From these flow:
- P(∅) = 0, P(Aᶜ) = 1 − P(A).
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (inclusion–exclusion for 2 events).
- Monotonicity: A ⊆ B ⇒ P(A) ≤ P(B).
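These consequences can be checked exactly on a toy probability space. A minimal sketch (the fair six-sided die, events A = "even" and B = "at most three" are illustrative choices):

```python
from fractions import Fraction

# Exact probabilities on a fair six-sided die, checking the
# consequences of Kolmogorov's axioms listed above.
omega = set(range(1, 7))
P = lambda event: Fraction(len(event & omega), len(omega))

A = {2, 4, 6}          # "even"
B = {1, 2, 3}          # "at most three"

assert P(set()) == 0                           # P(∅) = 0
assert P(omega - A) == 1 - P(A)                # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)      # inclusion–exclusion
assert P({2}) <= P(A)                          # monotonicity: {2} ⊆ A
print(P(A | B))  # 5/6
```

Using `Fraction` keeps the arithmetic exact, so the identities hold with equality rather than up to floating-point error.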
Counting techniques
For finite equally likely outcomes:
- Permutations of n objects taken r at a time: P(n, r) = n!/(n − r)!.
- Combinations: C(n, r) = n!/(r!(n − r)!).
- Multinomial: n! / (n₁!…n_k!).
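All three counts are available (or easily built) in Python's standard library; a quick sketch, using the letters of "MISSISSIPPI" as a standard multinomial example:

```python
from math import comb, perm, factorial

# Permutations and combinations via the standard library.
assert perm(5, 2) == 20    # P(5, 2) = 5!/3!
assert comb(5, 2) == 10    # C(5, 2) = 5!/(2!·3!)

# Multinomial coefficient: distinct arrangements of "MISSISSIPPI",
# i.e. 11! / (1!·4!·4!·2!) for the counts of M, I, S, P.
n = factorial(11) // (factorial(1) * factorial(4) * factorial(4) * factorial(2))
print(n)  # 34650
```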
Conditional probability and Bayes' theorem
P(A | B) = P(A ∩ B)/P(B), provided P(B) > 0. Independence: A and B are independent iff P(A ∩ B) = P(A)P(B), equivalently P(A | B) = P(A).
Law of total probability: if B₁, B₂, … partition Ω, then P(A) = ΣP(A | Bᵢ)P(Bᵢ).
Bayes' theorem:
P(Bⱼ | A) = P(A | Bⱼ)P(Bⱼ) / ΣP(A | Bᵢ)P(Bᵢ)
Cornerstone of inferential statistics, medical diagnosis, machine learning (Naive Bayes, Bayesian networks).
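The medical-diagnosis use case makes a good worked example. The numbers below (1 % prevalence, 99 % sensitivity, 5 % false-positive rate) are hypothetical, chosen only to show how a rare condition keeps the posterior low even for an accurate test:

```python
# Bayes' theorem with illustrative (hypothetical) numbers.
p_d = 0.01          # P(disease): prevalence
p_pos_d = 0.99      # P(positive | disease): sensitivity
p_pos_nd = 0.05     # P(positive | no disease): false-positive rate

# Denominator via the law of total probability.
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)
posterior = p_pos_d * p_d / p_pos
print(round(posterior, 3))  # ≈ 0.167: a positive result is far from conclusive
```

The false positives among the healthy 99 % swamp the true positives among the sick 1 %, which is exactly what the denominator ΣP(A | Bᵢ)P(Bᵢ) accounts for.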
Random variables and distributions
A random variable X is a measurable function Ω → ℝ. Two types:
- Discrete: takes countably many values; described by a probability mass function (pmf) p(x) = P(X = x).
- Continuous: described by a probability density function (pdf) f(x); P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
The cumulative distribution function (CDF) is F(x) = P(X ≤ x), non-decreasing from 0 to 1.
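A pmf and its CDF can be built directly from the definitions; a sketch for Binomial(4, 0.5) (an arbitrary small example):

```python
from math import comb

# pmf and CDF of Binomial(4, 0.5), built from the definition.
n, p = 4, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
cdf = lambda x: sum(v for k, v in pmf.items() if k <= x)

assert abs(sum(pmf.values()) - 1) < 1e-12       # pmf sums to 1
assert cdf(-1) == 0 and abs(cdf(4) - 1) < 1e-12 # CDF runs from 0 to 1
print(round(cdf(2), 4))  # P(X ≤ 2) = (1 + 4 + 6)/16 = 0.6875
```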
Important distributions
| Distribution | Type | pmf/pdf | Mean | Variance |
|---|---|---|---|---|
| Bernoulli(p) | Discrete | p^x(1−p)^(1−x), x ∈ {0, 1} | p | p(1−p) |
| Binomial(n, p) | Discrete | C(n,k)pᵏ(1−p)ⁿ⁻ᵏ | np | np(1−p) |
| Geometric(p) | Discrete | (1−p)ᵏ⁻¹p | 1/p | (1−p)/p² |
| Poisson(λ) | Discrete | e^(−λ) λᵏ/k! | λ | λ |
| Uniform(a, b) | Continuous | 1/(b−a) | (a+b)/2 | (b−a)²/12 |
| Exponential(λ) | Continuous | λe^(−λx) | 1/λ | 1/λ² |
| Normal(μ, σ²) | Continuous | (1/√(2πσ²))e^(−(x−μ)²/(2σ²)) | μ | σ² |
| Gamma(α, β) | Continuous | β^α x^(α−1) e^(−βx)/Γ(α) | α/β | α/β² |
- The Poisson distribution approximates Binomial(n, p) when n is large and p small (with λ = np).
- The memoryless property characterises the exponential and geometric distributions: P(X > s + t | X > s) = P(X > t).
- The 68–95–99.7 rule: a normal distribution places about 68 % of its probability within ±1σ of the mean, about 95 % within ±2σ, and about 99.7 % within ±3σ.
- Sum of independent normals is normal; sum of independent Poissons is Poisson; sum of i.i.d. exponentials is gamma.
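The Poisson approximation to the binomial can be checked numerically; a sketch with n = 1000, p = 0.003 (arbitrary values in the "large n, small p" regime, so λ = 3):

```python
from math import comb, exp, factorial

# Poisson(λ = np) approximating Binomial(n, p) for large n, small p.
n, p = 1000, 0.003
lam = n * p

binom = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
poisson = lambda k: exp(-lam) * lam**k / factorial(k)

for k in range(6):
    # Pointwise agreement to about three decimal places.
    assert abs(binom(k) - poisson(k)) < 1e-3
print(round(poisson(2), 4))
```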
Expectation, variance and moments
Expectation: E[X] = Σx p(x) (discrete) or ∫x f(x) dx (continuous). Linearity: E[aX + bY] = aE[X] + bE[Y] (no independence needed).
Variance: Var(X) = E[(X − μ)²] = E[X²] − (E[X])². Standard deviation σ = √Var(X).
Covariance: Cov(X, Y) = E[XY] − E[X]E[Y]. Independent ⇒ Cov = 0 (but not conversely).
Correlation coefficient ρ = Cov(X, Y)/(σ_X σ_Y) ∈ [−1, 1].
Moment generating function (MGF) M_X(t) = E[e^(tX)] uniquely determines a distribution when it exists in an interval around 0.
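The defining formulas for E[X] and Var(X) can be evaluated exactly for a small discrete distribution; a sketch for the fair die:

```python
from fractions import Fraction

# Exact mean and variance of a fair die from the definitions above.
vals = range(1, 7)
p = Fraction(1, 6)                      # uniform pmf
E  = sum(p * x for x in vals)           # E[X]  = Σ x p(x)
E2 = sum(p * x * x for x in vals)       # E[X²] = Σ x² p(x)
var = E2 - E**2                         # Var(X) = E[X²] − (E[X])²

print(E, var)  # 7/2 35/12
```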
Conditional expectation
For random variables X, Y, the conditional expectation E[X | Y] is a random variable (a function of Y) satisfying:
- Tower property: E[E[X | Y]] = E[X].
- Pulling-out known factors: E[g(Y)X | Y] = g(Y)E[X | Y].
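The tower property can be verified exactly on a small example: X a fair die, Y the parity of X (so E[X | Y] takes the value 4 on evens and 3 on odds):

```python
from fractions import Fraction

# Tower property: E[E[X | Y]] = E[X] for X a fair die, Y = parity of X.
half = Fraction(1, 2)                       # P(even) = P(odd) = 1/2
E_X_given_even = Fraction(2 + 4 + 6, 3)     # E[X | Y = even] = 4
E_X_given_odd  = Fraction(1 + 3 + 5, 3)     # E[X | Y = odd]  = 3
tower = half * E_X_given_even + half * E_X_given_odd

assert tower == Fraction(7, 2)              # equals E[X] = 3.5
print(tower)  # 7/2
```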
Limit theorems
The two pillars of probability theory:
Law of Large Numbers (LLN)
For i.i.d. random variables X₁, X₂, … with finite mean μ, the sample mean X̄_n = (X₁ + … + X_n)/n converges to μ:
- Weak LLN: X̄_n → μ in probability.
- Strong LLN: X̄_n → μ almost surely.
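A Monte-Carlo sketch of the LLN for fair coin flips (Bernoulli(1/2), μ = 0.5); the seed is fixed only for reproducibility:

```python
import random

# Sample means of fair coin flips approach μ = 0.5 as n grows.
random.seed(0)
for n in (10, 1_000, 100_000):
    mean = sum(random.random() < 0.5 for _ in range(n)) / n
    print(n, round(mean, 3))
```

The deviation from 0.5 shrinks roughly like 1/√n, which foreshadows the √n scaling in the CLT below.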
Central Limit Theorem (CLT)
If X₁, …, X_n are i.i.d. with mean μ and finite variance σ², then
√n (X̄_n − μ)/σ → N(0, 1) in distribution
Hence for large n, X̄_n is approximately N(μ, σ²/n). The CLT explains why the normal distribution is ubiquitous: it is the universal limit of normalised sums.
A practical consequence: if a sample is drawn from any population (not necessarily normal!) with mean μ and variance σ², and n is "large" (rule of thumb n ≥ 30), then the sample mean is approximately normal with mean μ and variance σ²/n. This is what justifies z- and t-confidence intervals for the mean even with non-normal data.
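A simulation sketch: sample means of n = 50 draws from the heavily skewed Exponential(1) population (μ = 1, σ² = 1) should be close to N(1, 1/50), and the known-σ z-interval follows. The sample sizes and seed are arbitrary choices:

```python
import random, statistics

# Means of n = 50 draws from skewed Exponential(1) ≈ N(1, 1/50) by the CLT.
random.seed(1)
means = [statistics.fmean(random.expovariate(1.0) for _ in range(50))
         for _ in range(5_000)]
m, s = statistics.fmean(means), statistics.stdev(means)
print(round(m, 2), round(s, 2))     # ≈ 1.0 and ≈ (1/50)**0.5 ≈ 0.14

# Approximate 95 % z-interval for μ from one sample mean, σ = 1 known:
xbar = means[0]
half_width = 1.96 * (1 / 50) ** 0.5  # 1.96 · σ/√n
print(round(xbar - half_width, 2), round(xbar + half_width, 2))
```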
Joint distributions and independence
For two random variables X, Y: joint pmf p(x, y) or joint pdf f(x, y). Marginals f_X(x) = ∫f(x, y) dy. Independence: f(x, y) = f_X(x) f_Y(y) for all x, y.
Conditional density: f_{X|Y}(x | y) = f(x, y)/f_Y(y), provided f_Y(y) > 0.
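For a discrete joint pmf, marginals and the independence criterion can be checked mechanically; a sketch with two fair independent bits (an arbitrary small example):

```python
from fractions import Fraction as F

# Marginals and an independence check for a discrete joint pmf.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 4),
         (1, 0): F(1, 4), (1, 1): F(1, 4)}   # two fair independent bits

# Marginals: sum the joint pmf over the other coordinate.
pX = {x: sum(p for (a, b), p in joint.items() if a == x) for x in (0, 1)}
pY = {y: sum(p for (a, b), p in joint.items() if b == y) for y in (0, 1)}

# Independence: joint = product of marginals at every point.
independent = all(joint[(x, y)] == pX[x] * pY[y]
                  for x in (0, 1) for y in (0, 1))
print(independent)  # True
```

Changing any one entry of `joint` (while keeping the total 1) breaks the factorisation and the check returns False.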
Beyond classical probability
Modern probability extends Kolmogorov's framework to:
- Stochastic processes (Markov chains, Brownian motion, martingales).
- Stochastic calculus (Itô, used in mathematical finance).
- Information theory (entropy H = −Σp log p, mutual information).
- Concentration inequalities (Markov, Chebyshev, Chernoff, Hoeffding) used in machine learning.
- Quantum probability: events form a non-commutative algebra; classical probability is recovered for commuting observables.
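Of the tools above, concentration inequalities are the easiest to see in action. A sketch checking Chebyshev's bound P(|X − μ| ≥ kσ) ≤ 1/k² by simulation for a fair die (μ = 3.5, σ² = 35/12; the sample size and k values are arbitrary):

```python
import random

# Chebyshev's inequality checked empirically for a fair die.
random.seed(2)
mu, var = 3.5, 35 / 12
sigma = var ** 0.5
xs = [random.randint(1, 6) for _ in range(100_000)]

for k in (1, 1.5, 2):
    freq = sum(abs(x - mu) >= k * sigma for x in xs) / len(xs)
    assert freq <= 1 / k**2 + 0.01   # empirical tail ≤ Chebyshev bound
print("Chebyshev bounds hold")
```

The bound is loose (for k = 2 the die has no outcome that far from the mean, so the true tail is 0) but it requires only a finite variance, which is what makes it so widely applicable.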
Probability theory thus runs from gambling problems to the deepest currents of mathematics and modern science.