Sampling Theory


Sampling theory addresses how to choose a subset of a population so that valid inference can be made about the whole. A well-designed sample of a few thousand can reveal more about a country of 200 million than a poorly designed census. Good sampling marries probability theory with practical considerations of cost, time and accessibility.

Population vs sample

A population is the entire collection of units about which we wish to make inferences. A sample is a subset of the population from which data are actually collected. The numerical descriptors of populations are called parameters (e.g., μ, σ²); of samples, statistics (e.g., X̄, S²).

Why sample?

A complete census is rarely feasible: it may be too costly, too slow, or simply impossible (e.g., when quality testing destroys the item, so not every unit in a batch can be tested). Sampling provides a controlled, statistically justified path to inference, often with surprisingly small sample sizes:

A national poll of ~1000 people gives a 95% CI of width ~6 percentage points on a proportion.
A clinical trial of a few hundred patients can detect a moderate treatment effect.

Probability sampling designs

In probability sampling, every unit has a known non-zero probability of selection — a prerequisite for valid inference.

Simple random sampling (SRS)

Every sample of size n has the same probability of being drawn. Easy to analyse but may be hard to implement, because it requires a complete sampling frame. The sample mean X̄ is unbiased for the population mean μ, with variance σ²/n (with replacement) or (σ²/n)(1 − n/N) (without replacement; the factor 1 − n/N is the finite-population correction, and σ² is here defined with divisor N − 1).
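The two variance formulas above can be sketched in a few lines. This is a minimal illustration, assuming σ² is known; the function name srs_variance_of_mean is hypothetical and not taken from any library.

```python
import math
from typing import Optional


def srs_variance_of_mean(sigma2: float, n: int, N: Optional[int] = None) -> float:
    """Variance of the sample mean under simple random sampling.

    N=None  -> sampling with replacement: sigma2 / n
    N given -> sampling without replacement: (sigma2 / n) * (1 - n / N),
               with sigma2 assumed to use divisor N - 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    variance = sigma2 / n
    if N is not None:
        if n > N:
            raise ValueError("n cannot exceed the population size N")
        variance *= 1 - n / N  # finite-population correction
    return variance


# Standard error with and without the correction (sigma2 = 100, n = 25, N = 100).
se_with = math.sqrt(srs_variance_of_mean(100.0, 25))
se_without = math.sqrt(srs_variance_of_mean(100.0, 25, N=100))
```

The correction matters only when the sampling fraction n/N is appreciable; for a national poll it is effectively 1.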

Stratified sampling

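Partition the population into internally homogeneous strata (e.g., age groups, regions) and sample independently within each. Estimator: X̄_st = Σ(Nᵢ/N) X̄ᵢ. With proportional allocation it is at least as efficient as SRS of the same total size (ignoring terms of order 1/N), and the gain grows as the stratum means differ; a poorly chosen allocation can forfeit that advantage.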

Optimal (Neyman) allocation: sample size in stratum i proportional to Nᵢσᵢ. Allocates more effort to strata with greater variability.
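As a rough sketch, the stratified estimator and the Neyman allocation can be computed as follows. The function names and the example figures are hypothetical, and the allocation is returned unrounded; how to round to whole units is left open.

```python
def stratified_mean(sizes, means):
    """Weighted mean: sum over strata of (N_i / N) * stratum sample mean."""
    N = sum(sizes)
    return sum(Ni / N * xbar for Ni, xbar in zip(sizes, means))


def neyman_allocation(n, sizes, sds):
    """Split a total sample of n across strata in proportion to N_i * sigma_i."""
    weights = [Ni * si for Ni, si in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical population of three strata.
sizes = [500, 300, 200]        # N_i
means = [10.0, 20.0, 30.0]     # stratum sample means
sds = [2.0, 4.0, 10.0]         # stratum standard deviations (assumed known)

estimate = stratified_mean(sizes, means)
allocation = neyman_allocation(420, sizes, sds)
```

In this example the smallest stratum receives the largest share of the sample because its standard deviation is much larger than the others.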

Cluster sampling

The population is divided into clusters (geographic areas, schools, households); a random sample of clusters is drawn and all units within the sampled clusters are observed. Cheaper than SRS for geographically dispersed populations but typically less efficient, because units within a cluster tend to be positively correlated.

Systematic sampling

Choose every k-th unit from a list after a random start. Easy and quasi-random; can be biased if there is a periodic pattern in the list.
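A minimal sketch of systematic selection, together with the periodicity problem mentioned above. The function name is hypothetical, and the list is assumed to be the full sampling frame in its listed order.

```python
import random


def systematic_sample(units, k, rng=None):
    """Take every k-th unit after a random start in positions 0..k-1."""
    if rng is None:
        rng = random.Random()
    start = rng.randrange(k)
    return units[start::k]


frame = list(range(100))
sample = systematic_sample(frame, 10)   # 10 units, spaced 10 apart

# A list whose values repeat with period 7, sampled with k = 7:
# every selected unit carries the same value, so the sample is badly biased.
periodic = [i % 7 for i in range(70)]
biased = systematic_sample(periodic, 7)
```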

Multi-stage sampling

Combines cluster and stratified ideas: e.g., select districts → select villages → select households → select individuals. Used for national surveys (PSLM, HIES, NHIS).

PPS (probability proportional to size) sampling

Larger units (e.g., big firms, populous cities) have a selection probability proportional to their size. Often used in business surveys.
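One simple way to illustrate PPS is to draw units with replacement and then estimate a total with the Hansen-Hurwitz estimator, (1/n) Σ yᵢ/pᵢ. That estimator is not named in the text above; it is used here as a standard companion to with-replacement PPS. The function names and data are hypothetical.

```python
import random


def pps_sample(sizes, n, rng=None):
    """Draw n unit indices with replacement, with probability proportional to size."""
    if rng is None:
        rng = random.Random()
    total = sum(sizes)
    probs = [s / total for s in sizes]  # single-draw selection probabilities
    idx = rng.choices(range(len(sizes)), weights=sizes, k=n)
    return idx, probs


def hansen_hurwitz_total(y, idx, probs):
    """Estimate the population total as the mean of y_i / p_i over the draws."""
    return sum(y[i] / probs[i] for i in idx) / len(idx)


# Hypothetical firms: size measure and a study variable roughly tied to size.
sizes = [10, 20, 30, 40]
y = [2 * s for s in sizes]          # exactly proportional, for illustration
idx, probs = pps_sample(sizes, 3)
total_hat = hansen_hurwitz_total(y, idx, probs)
```

When y is exactly proportional to the size measure, every draw gives the same value of yᵢ/pᵢ, so the estimate equals the true total whichever units are drawn. This is why PPS is attractive when size predicts the study variable well.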

Key Points

Probability sampling: every unit has known P > 0 of selection.
Bias = E[estimator] − true value; sampling design affects both bias and variance.
Stratification almost always reduces variance; clustering usually increases it but reduces cost.
Design effect (Deff): ratio of variance under chosen design to variance under SRS of same size.
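For cluster samples, a widely used approximation is Deff ≈ 1 + (m − 1)ρ, where m is the (equal) cluster size and ρ the intra-cluster correlation. The sketch below assumes that approximation; the function names and figures are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Size of the SRS that would give the same variance as the cluster sample."""
    return n / design_effect(cluster_size, icc)


# 2000 interviews taken in clusters of 20 with rho = 0.05.
deff = design_effect(20, 0.05)
n_eff = effective_sample_size(2000, 20, 0.05)
```

Even a small ρ inflates the variance noticeably when clusters are large, which is the cost that clustering trades against cheaper fieldwork.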

Non-probability sampling

Common in practice but generally not suitable for unbiased inference:

Convenience sampling: easiest-to-reach units (subjects walking into a clinic).
Quota sampling: meet pre-specified counts in each category (gender, age) without random selection.
Purposive / judgement sampling: expert chooses "representative" units.
Snowball sampling: subjects refer further subjects (used for hidden populations).

Use these methods cautiously and supplement with sensitivity analyses.

Sources of error

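Survey error decomposes into sampling and non-sampling error; the last three items below are common forms of non-sampling error:

Sampling error: variability from drawing one sample rather than another. Its standard error shrinks in proportion to 1/√n.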
Non-sampling error: bias from non-response, measurement, processing, frame errors. Often dominates sampling error in real surveys.
Selection bias: systematic exclusion of certain units (e.g., voluntary online polls).
Response bias: subjects misreport (social desirability, recall error).
Coverage bias: sampling frame omits part of target population.

The total survey error framework combines these into one accounting.

Sample size determination

For estimating a mean with margin of error E at confidence level 1 − α:

n = (z_{α/2} σ / E)²

For estimating a proportion p:

n = (z_{α/2})² p(1 − p) / E²

Conservative choice p = 0.5 maximises p(1 − p) and gives an upper bound on required n.

Example: for a 95% CI of half-width 0.03 on a proportion (E = 0.03, i.e. ±3 percentage points), n = 1.96² × 0.25 / 0.03² ≈ 1067.1, which rounds up to 1068. This is why national polls commonly use n ≈ 1000.
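The sample-size formulas of this section can be checked with a short script. This is a sketch assuming simple random sampling and the normal approximation; the function names are hypothetical, and the critical value comes from the standard library's NormalDist.

```python
import math
from statistics import NormalDist


def z_critical(confidence=0.95):
    """Two-sided critical value z_{alpha/2} of the standard normal."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def n_for_mean(sigma, E, confidence=0.95):
    """Smallest n with margin of error at most E for a mean (sigma assumed known)."""
    return math.ceil((z_critical(confidence) * sigma / E) ** 2)


def n_for_proportion(E, p=0.5, confidence=0.95):
    """Smallest n with margin of error at most E for a proportion."""
    z = z_critical(confidence)
    return math.ceil(z ** 2 * p * (1 - p) / E ** 2)


def margin_of_error_proportion(n, p=0.5, confidence=0.95):
    """Half-width of the normal-approximation CI for a proportion."""
    return z_critical(confidence) * math.sqrt(p * (1 - p) / n)


n_poll = n_for_proportion(0.03)              # conservative p = 0.5
moe_1000 = margin_of_error_proportion(1000)  # about 3 percentage points
```

Rounding up with math.ceil keeps the achieved margin at or below the target E.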

Estimators and their variances

Design	Estimator of mean	Variance
SRS (without replacement)	X̄	(σ²/n)(1 − n/N)
Stratified	Σ(Nᵢ/N) X̄ᵢ	Σ(Nᵢ/N)² · (σᵢ²/nᵢ)(1 − nᵢ/Nᵢ)
Cluster (equal clusters)	overall mean	Var depends on intra-cluster correlation ρ
Ratio estimator	X̄_y/X̄_x · μ_x	Approximate by Taylor expansion

Ratio and regression estimation

When an auxiliary variable x with known mean μ_x is available, use:

Ratio estimator: μ̂_y = (ȳ/x̄) μ_x. Efficient when y and x are nearly proportional.
Regression estimator: μ̂_y = ȳ + b(μ_x − x̄). Efficient when y is linear in x.

Both reduce variance by exploiting correlation between y and x.
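Both estimators are easy to sketch. The slope b is not defined in the text above; here it is assumed to be the least-squares slope of y on x computed from the sample. The function names and data are hypothetical.

```python
from statistics import mean


def ratio_estimate(y, x, mu_x):
    """Ratio estimator of the mean of y: (ybar / xbar) * mu_x."""
    return mean(y) / mean(x) * mu_x


def regression_estimate(y, x, mu_x):
    """Regression estimator: ybar + b * (mu_x - xbar), b = least-squares slope."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return ybar + b * (mu_x - xbar)


x = [1, 2, 3, 4]
mu_x = 3.0                       # known population mean of x (assumed)

y_prop = [2, 4, 6, 8]            # y = 2x: proportional
y_line = [3, 5, 7, 9]            # y = 2x + 1: linear with an intercept

r_prop = ratio_estimate(y_prop, x, mu_x)
g_prop = regression_estimate(y_prop, x, mu_x)
r_line = ratio_estimate(y_line, x, mu_x)
g_line = regression_estimate(y_line, x, mu_x)
```

With proportional data both estimators recover 2 × 3 = 6. Once an intercept is present, only the regression estimator recovers 2 × 3 + 1 = 7, which matches the remark that the ratio estimator works best when y and x are nearly proportional.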

Three foundational warnings about polls:

A poll with a 3% margin of error is only as good as its sample design; non-response can swamp this.
Online voluntary polls are not probability samples and may misrepresent populations even with very large n.
A confidence interval describes uncertainty about a single parameter at a single moment — not a forecast of the next election.
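The first two warnings can be made concrete with the identity MSE = variance + bias². The sketch below compares a small probability sample with a very large voluntary poll; the 5-point bias is a purely hypothetical figure chosen for illustration.

```python
import math


def rmse_of_proportion(p, n, bias=0.0):
    """Root mean squared error of an estimated proportion: sqrt(p(1-p)/n + bias^2)."""
    variance = p * (1 - p) / n
    return math.sqrt(variance + bias ** 2)


# Unbiased simple random sample of 1,000 respondents.
srs_rmse = rmse_of_proportion(0.5, 1_000)

# Voluntary online poll of 1,000,000 with an assumed selection bias of 0.05.
online_rmse = rmse_of_proportion(0.5, 1_000_000, bias=0.05)
```

Under these assumptions the small random sample has roughly a third of the error of the million-respondent poll, because increasing n shrinks only the variance term and leaves the bias untouched.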

Real-world examples

Pakistan's HIES (Household Integrated Economic Survey) uses stratified two-stage cluster sampling: districts → enumeration blocks → households.
US Current Population Survey (CPS) samples about 60,000 households monthly using stratified multi-stage design; produces national unemployment estimates.
Indian NFHS combines stratified two-stage cluster designs with state-level oversampling.

Mastery of these designs — and of their weaknesses — separates a competent statistician from one who simply applies formulas.