Sampling Theory
Sampling theory addresses how to choose a subset of a population so that valid inference can be made about the whole. A well-designed sample of a few thousand can reveal more about a country of 200 million than a poorly designed census. Good sampling marries probability theory with practical considerations of cost, time and accessibility.
A population is the entire collection of units about which we wish to make inferences. A sample is a subset of the population from which data are actually collected. The numerical descriptors of populations are called parameters (e.g., μ, σ²); of samples, statistics (e.g., X̄, S²).
Why sample?
A complete census is rarely feasible: it may be too costly, too slow, or simply impossible (e.g., when quality testing destroys the item, so testing every unit in a batch would leave nothing to sell). Sampling provides a controlled, statistically justified path to inference, often with surprisingly small sample sizes:
- A national poll of ~1000 people gives a 95% CI of about ±3 percentage points (total width ~6 points) on a proportion.
- A clinical trial of a few hundred patients can detect a moderate treatment effect.
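The poll figure above can be checked with a few lines of Python. This is a minimal sketch that assumes simple random sampling, the normal approximation, and the worst case p = 0.5; the function name `proportion_margin` is illustrative.

```python
import math
from statistics import NormalDist


def proportion_margin(n, p=0.5, confidence=0.95):
    """Approximate margin of error for a sample proportion under SRS."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)


moe = proportion_margin(1000)
print(f"margin: +/-{100 * moe:.1f} points, CI width: {200 * moe:.1f} points")
```

Because the margin shrinks only with √n, halving it requires roughly four times as many respondents.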
Probability sampling designs
In probability sampling, every unit has a known non-zero probability of selection — a prerequisite for valid inference.
Simple random sampling (SRS)
Every possible sample of size n has the same probability of being drawn. SRS is easy to analyse but can be hard to implement, because it requires a complete sampling frame. The sample mean X̄ is unbiased for the population mean μ, with variance σ²/n when sampling with replacement and (σ²/n)(1 − n/N) when sampling without replacement. The factor (1 − n/N) is the finite-population correction; in this form σ² is the population variance computed with divisor N − 1.
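The effect of the finite-population correction is easy to see numerically. The sketch below is illustrative only: the function name is hypothetical, and it uses one σ² value for both cases, which ignores the small difference between the divisor-N and divisor-(N − 1) definitions of the population variance (negligible for large N).

```python
from typing import Optional


def srs_mean_variance(sigma2: float, n: int, N: Optional[int] = None) -> float:
    """Variance of the sample mean under simple random sampling.

    sigma2: population variance
    n:      sample size
    N:      population size; None means sampling with replacement
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if N is None:
        return sigma2 / n
    if n > N:
        raise ValueError("n cannot exceed N when sampling without replacement")
    # Finite-population correction: shrinks the variance as n approaches N.
    return (sigma2 / n) * (1 - n / N)


with_replacement = srs_mean_variance(sigma2=25.0, n=100)
without_replacement = srs_mean_variance(sigma2=25.0, n=100, N=400)
print(with_replacement, without_replacement)
```

When the sample is a quarter of the population, as here, the correction removes a quarter of the variance; when n = N the variance is zero, as expected for a census.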
Stratified sampling
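Partition the population into strata that are internally homogeneous (e.g., age groups, regions) and draw an independent sample within each. Estimator: X̄_st = Σ(Nᵢ/N) X̄ᵢ, where Nᵢ is the size of stratum i and X̄ᵢ its sample mean. With proportional allocation, stratified sampling is practically never less efficient than SRS of the same size, and the gain grows with the differences between stratum means; a badly chosen allocation can, however, do worse than SRS.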
Optimal (Neyman) allocation: sample size in stratum i proportional to Nᵢσᵢ. Allocates more effort to strata with greater variability.
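The stratified estimator and Neyman allocation can be sketched as follows. The function names and the three-stratum figures are made up for illustration, and the allocation is returned unrounded; in practice it would be rounded and capped at the stratum sizes.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Weighted mean: sum over strata of (N_i / N) * mean_i."""
    N = sum(stratum_sizes)
    return sum(N_i / N * m for N_i, m in zip(stratum_sizes, stratum_means))


def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Split a total sample size n in proportion to N_i * sigma_i."""
    weights = [N_i * s for N_i, s in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical population with three strata.
sizes = [5000, 3000, 2000]
sds = [2.0, 5.0, 10.0]
sample_means = [10.0, 20.0, 40.0]

allocation = neyman_allocation(sizes, sds, n=450)
estimate = stratified_mean(sizes, sample_means)
print(allocation, estimate)
```

In this example the smallest stratum receives the largest share of the sample because it is by far the most variable.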
Cluster sampling
Population is divided into clusters (geographic areas, schools, households); a random sample of clusters is drawn and all units within sampled clusters are observed. Cheaper than SRS for geographically dispersed populations but typically less efficient because units within a cluster are correlated.
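A one-stage cluster sample can be sketched in a few lines. The cluster names and values below are invented, and the mean is estimated as total over count across the sampled clusters (a ratio-type estimator, which is a common choice when cluster sizes differ).

```python
import random


def cluster_sample(clusters, m, rng):
    """Draw m whole clusters by SRS and return every unit inside them."""
    chosen = rng.sample(sorted(clusters), m)
    units = [value for name in chosen for value in clusters[name]]
    return chosen, units


# Hypothetical test scores grouped by school.
clusters = {
    "school_A": [12, 15, 11],
    "school_B": [20, 22],
    "school_C": [9, 10, 8, 11],
    "school_D": [17, 18, 16],
}

rng = random.Random(42)  # fixed seed so the run is repeatable
chosen, units = cluster_sample(clusters, 2, rng)
estimate = sum(units) / len(units)
print(chosen, estimate)
```

Because scores within a school resemble each other, two schools carry less information than the same number of pupils drawn individually, which is the efficiency loss described above.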
Systematic sampling
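Choose every k-th unit from a list after a random start among the first k units (k ≈ N/n). Easy to carry out and behaves much like SRS when the list order is unrelated to the variable of interest; it can be badly biased if the list has a periodic pattern whose period matches k.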
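Both the procedure and the periodicity problem can be illustrated briefly. This is a sketch with invented data: a frame of 100 numbered units, and 70 days of hypothetical sales with a spike on one weekday.

```python
import random


def systematic_sample(units, k, rng):
    """Take every k-th unit after a random start in the first k positions."""
    start = rng.randrange(k)
    return units[start::k]


rng = random.Random(1)  # fixed seed so the run is repeatable

frame = list(range(1, 101))
sample = systematic_sample(frame, 10, rng)

# Periodic list: sales are 100 on one day of each week and 40 otherwise.
daily_sales = [100 if day % 7 == 5 else 40 for day in range(70)]
weekly_pick = systematic_sample(daily_sales, 7, rng)

print(sample)
print(weekly_pick)
```

With k = 7 every selected day falls on the same weekday, so the sample contains only spike days or only ordinary days and cannot reflect the true mix.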
Multi-stage sampling
Extends cluster sampling by sampling again within each selected unit, often with stratification at the first stage: e.g., select districts → select villages → select households → select individuals. Used for national surveys (PSLM, HIES, NHIS).
PPS (probability proportional to size) sampling
Larger units (e.g., big firms, populous cities) have a selection probability proportional to their size. Often used in business surveys.
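One simple variant is PPS sampling with replacement, paired with the Hansen-Hurwitz estimator of a population total. The text does not say which PPS scheme is meant, so this is only one possible sketch; the firm sizes and the y values are invented.

```python
import random


def pps_sample(sizes, n, rng):
    """Draw n unit indices with replacement, with probability proportional to size."""
    total = sum(sizes)
    probs = [s / total for s in sizes]
    indices = rng.choices(range(len(sizes)), weights=sizes, k=n)
    return indices, probs


def hansen_hurwitz_total(y, indices, probs):
    """Estimate the population total as the average of y_i / p_i over the draws."""
    return sum(y[i] / probs[i] for i in indices) / len(indices)


# Hypothetical firms: size = number of employees, y = payroll.
sizes = [10, 50, 200, 740]
payroll = [2.0 * s for s in sizes]  # exactly proportional to size

rng = random.Random(7)  # fixed seed so the run is repeatable
indices, probs = pps_sample(sizes, 3, rng)
estimate = hansen_hurwitz_total(payroll, indices, probs)
print(indices, estimate, sum(payroll))
```

When y is exactly proportional to size, every draw gives the same value of y_i / p_i, so the estimate equals the true total whatever units are drawn. That is the motivation for PPS: the closer y tracks size, the smaller the variance.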
Key points:
- Probability sampling: every unit has known P > 0 of selection.
- Bias = E[estimator] − true value; sampling design affects both bias and variance.
- Stratification almost always reduces variance; clustering usually increases it but reduces cost.
- Design effect (Deff): ratio of variance under chosen design to variance under SRS of same size.
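The design effect translates directly into an effective sample size. In the sketch below, the first two functions follow the definition above; the third uses the common approximation Deff ≈ 1 + (M − 1)ρ for clusters of equal size M with intra-cluster correlation ρ, which is an added assumption rather than something derived in this text.

```python
def design_effect(var_design, var_srs):
    """Ratio of the variance under the chosen design to the SRS variance."""
    return var_design / var_srs


def effective_sample_size(n, deff):
    """Size of the SRS that would give the same precision."""
    return n / deff


def cluster_deff_approx(cluster_size, rho):
    """Approximate design effect for equal-sized clusters (assumed formula)."""
    return 1 + (cluster_size - 1) * rho


deff = cluster_deff_approx(cluster_size=20, rho=0.05)
n_eff = effective_sample_size(2000, deff)
print(deff, round(n_eff))
```

Even a modest intra-cluster correlation of 0.05 nearly doubles the variance here, so 2000 clustered interviews are worth roughly 1000 independent ones.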
Non-probability sampling
Common in practice but generally not suitable for unbiased inference:
- Convenience sampling: easiest-to-reach units (subjects walking into a clinic).
- Quota sampling: meet pre-specified counts in each category (gender, age) without random selection.
- Purposive / judgement sampling: expert chooses "representative" units.
- Snowball sampling: subjects refer further subjects (used for hidden populations).
Use these methods cautiously and supplement with sensitivity analyses.
Sources of error
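Survey error is conventionally split into sampling error and non-sampling error. The list below gives both, followed by three common kinds of non-sampling bias:
- Sampling error: variability from drawing one sample rather than another. The standard error typically shrinks in proportion to 1/√n.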
- Non-sampling error: bias from non-response, measurement, processing, frame errors. Often dominates sampling error in real surveys.
- Selection bias: systematic exclusion of certain units (e.g., voluntary online polls).
- Response bias: subjects misreport (social desirability, recall error).
- Coverage bias: sampling frame omits part of target population.
The total survey error framework combines these into one accounting.
Sample size determination
For estimating a mean with margin of error E at confidence level 1 − α:
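n = (z_{α/2} σ / E)²
where z_{α/2} is the upper α/2 quantile of the standard normal distribution (1.96 for 95% confidence) and σ is the population standard deviation.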
For estimating a proportion p:
n = (z_{α/2})² p(1 − p) / E²
Conservative choice p = 0.5 maximises p(1 − p) and gives an upper bound on required n.
Example: for a 95% CI with half-width 0.03 on a proportion (i.e., ±3 percentage points, so E = 0.03), n ≈ 1.96² × 0.25 / 0.03² ≈ 1067.1, which rounds up to 1068. This is why national polls commonly use n ≈ 1000.
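Both sample-size formulas are simple to code. The sketch below takes the critical value from the standard normal distribution and rounds up to the next whole unit; the function names are illustrative, and the formulas assume simple random sampling from a large population.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided standard normal critical value, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def n_for_mean(sigma, E, confidence=0.95):
    """Sample size to estimate a mean to within +/- E."""
    return math.ceil((z_value(confidence) * sigma / E) ** 2)


def n_for_proportion(E, p=0.5, confidence=0.95):
    """Sample size to estimate a proportion to within +/- E."""
    z = z_value(confidence)
    return math.ceil(z ** 2 * p * (1 - p) / E ** 2)


print(n_for_proportion(0.03))        # worst case p = 0.5
print(n_for_proportion(0.03, p=0.1)) # smaller n if p is known to be near 0.1
print(n_for_mean(sigma=15, E=2))
```

The first call reproduces the worked example; the second shows how much the requirement drops when p is known to be far from 0.5.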
Estimators and their variances
| Design | Estimator of mean | Variance |
|---|---|---|
| SRS (without replacement) | X̄ | (σ²/n)(1 − n/N) |
| Stratified | Σ(Nᵢ/N) X̄ᵢ | Σ(Nᵢ/N)² · (σᵢ²/nᵢ)(1 − nᵢ/Nᵢ) |
| Cluster (equal clusters) | overall mean | ≈ SRS variance × [1 + (M − 1)ρ], with M the cluster size and ρ the intra-cluster correlation |
| Ratio estimator | (ȳ/x̄) · μ_x | Approximated by Taylor expansion (linearisation) |
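The stratified variance formula from the table can be evaluated directly. A minimal sketch with invented stratum sizes, variances and sample sizes; the function name is illustrative.

```python
def stratified_variance(stratum_sizes, stratum_variances, sample_sizes):
    """Sum over strata of (N_i/N)^2 * (sigma_i^2 / n_i) * (1 - n_i / N_i)."""
    N = sum(stratum_sizes)
    total = 0.0
    for N_i, var_i, n_i in zip(stratum_sizes, stratum_variances, sample_sizes):
        weight = N_i / N
        total += weight ** 2 * (var_i / n_i) * (1 - n_i / N_i)
    return total


# Hypothetical strata: sizes, within-stratum variances, sample sizes.
variance = stratified_variance(
    stratum_sizes=[5000, 3000, 2000],
    stratum_variances=[4.0, 25.0, 100.0],
    sample_sizes=[100, 150, 200],
)
print(variance, variance ** 0.5)
```

Only within-stratum variances enter the formula, which is why differences between stratum means cost nothing under stratification.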
Ratio and regression estimation
When an auxiliary variable x with known mean μ_x is available, use:
- Ratio estimator: μ̂_y = (ȳ/x̄) μ_x. Efficient when y and x are nearly proportional.
- Regression estimator: μ̂_y = ȳ + b(μ_x − x̄). Efficient when y is linear in x.
Both reduce variance by exploiting correlation between y and x.
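The two estimators can be compared on a small example. This sketch uses invented data in which y = 1 + 2x exactly, so the relationship is linear but not proportional; the slope b is taken as the least-squares slope, a usual choice that the text does not spell out.

```python
from statistics import mean


def ratio_estimate(y, x, mu_x):
    """Ratio estimator: (ybar / xbar) * mu_x."""
    return mean(y) / mean(x) * mu_x


def regression_estimate(y, x, mu_x):
    """Regression estimator: ybar + b * (mu_x - xbar), b = least-squares slope."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return ybar + b * (mu_x - xbar)


x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]   # y = 1 + 2x
mu_x = 3.0                 # assumed known population mean of x

print(ratio_estimate(y, x, mu_x))
print(regression_estimate(y, x, mu_x))
```

With μ_x = 3 the true mean of y is 7. The regression estimator recovers it, while the ratio estimator is thrown off by the non-zero intercept, which illustrates why the ratio form is best reserved for relationships that pass close to the origin.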
Three foundational warnings about polls:
- A poll with a 3% margin of error is only as good as its sample design; non-response can swamp this.
- Online voluntary polls are not probability samples and may misrepresent populations even with very large n.
- A confidence interval describes uncertainty about a single parameter at a single moment — not a forecast of the next election.
Real-world examples
- Pakistan's HIES (Household Integrated Economic Survey) uses stratified two-stage cluster sampling: districts → enumeration blocks → households.
- US Current Population Survey (CPS) samples about 60,000 households monthly using stratified multi-stage design; produces national unemployment estimates.
- Indian NFHS combines stratified two-stage cluster designs with state-level oversampling.
Mastery of these designs — and of their weaknesses — separates a competent statistician from one who simply applies formulas.