Database Systems
A database is an organised collection of related data stored to support efficient retrieval and update. A Database Management System (DBMS) is the software that manages databases — controlling access, ensuring integrity, supporting concurrency and recovery.
A software system that enables users to define, create, maintain, query and control access to a database, providing data abstraction, persistence, concurrency and recovery.
Evolution of database systems
| Generation | Era | Example |
|---|---|---|
| File-based | 1960s | Flat files, COBOL records |
| Hierarchical | 1960s-70s | IBM IMS |
| Network | 1970s | CODASYL DBTG |
| Relational | 1970s onwards | Oracle, MySQL, PostgreSQL, SQL Server |
| Object-oriented | 1990s | ObjectStore, db4o |
| NoSQL | 2000s onwards | MongoDB, Cassandra, Redis |
| NewSQL | 2010s onwards | CockroachDB, Google Spanner |
Relational model
Proposed by E. F. Codd in 1970. Built on relations (tables) of tuples (rows) with attributes (columns).
Key terminology
- Relation — table.
- Tuple — row.
- Attribute — column.
- Domain — set of permissible values.
- Degree — number of attributes.
- Cardinality — number of tuples.
- Schema — structure description; instance — actual data.
Keys
- Super key — set of attributes that uniquely identifies tuples.
- Candidate key — minimal super key.
- Primary key — chosen candidate key; NOT NULL, UNIQUE.
- Alternate key — other candidate keys.
- Foreign key — attribute(s) referencing a primary key in another table.
Integrity constraints
- Domain — attribute values from its domain.
- Entity — primary key cannot be NULL.
- Referential — foreign-key values must match an existing primary-key value.
- Key — uniqueness.
Structured Query Language (SQL)
The ISO standard relational language. Three sublanguages:
- DDL (Data Definition) —
CREATE,ALTER,DROP,TRUNCATE. - DML (Data Manipulation) —
SELECT,INSERT,UPDATE,DELETE. - DCL (Data Control) —
GRANT,REVOKE.
Many sources add TCL (Transaction Control) — COMMIT, ROLLBACK, SAVEPOINT.
Basic query
SELECT name, salary
FROM employees
WHERE department = 'Finance'
ORDER BY salary DESC;
Joins
- INNER JOIN — only matching rows.
- LEFT/RIGHT OUTER JOIN — all rows from one side + matches.
- FULL OUTER JOIN — all rows from both, NULL where no match.
- CROSS JOIN — Cartesian product.
Aggregations
COUNT, SUM, AVG, MIN, MAX, with GROUP BY and HAVING.
WHEREfilters rows BEFORE grouping;HAVINGfilters AFTER grouping.NULLrepresents the absence of a value; comparisons with NULL useIS NULL.DISTINCTremoves duplicate rows.- Subqueries: scalar, row, table; correlated subqueries reference outer query.
Normalisation
The process of decomposing tables to remove anomalies (insertion, update, deletion) and redundancy.
| Normal form | Rule |
|---|---|
| 1NF | Atomic attribute values |
| 2NF | 1NF + no partial dependency on composite key |
| 3NF | 2NF + no transitive dependency |
| BCNF | Every determinant is a candidate key |
| 4NF | No non-trivial multivalued dependencies |
| 5NF | No join dependencies |
Pakistani CSS papers commonly test up to BCNF; higher forms are rare.
Functional dependency (FD)
X → Y means: whenever two tuples agree on X, they must agree on Y.
Armstrong's axioms
- Reflexivity: if Y ⊆ X, then X → Y.
- Augmentation: X → Y implies XZ → YZ.
- Transitivity: X → Y and Y → Z imply X → Z.
Transactions and ACID
A transaction is a logical unit of work. The DBMS guarantees ACID properties:
- Atomicity — all-or-nothing.
- Consistency — moves DB from one valid state to another.
- Isolation — concurrent transactions appear as if serial.
- Durability — once committed, persists despite failures.
Concurrency control
- Locking — two-phase locking (2PL).
- Timestamp ordering.
- Optimistic concurrency control.
- MVCC (Multi-Version Concurrency Control) — used by PostgreSQL, Oracle.
Isolation levels (SQL standard)
| Level | Dirty read | Non-repeatable read | Phantom read |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible |
| Serializable | Prevented | Prevented | Prevented |
Recovery
- Write-Ahead Logging (WAL) — log changes before applying.
- Checkpointing.
- Undo / Redo logs.
Indexing
Indexes accelerate queries by maintaining a sorted structure on key columns.
- Primary index — on primary key.
- Secondary index — on non-key column.
- Clustered index — table physically ordered (one per table).
- Non-clustered index — separate structure with pointers.
- B+ tree — dominant disk-based index (most RDBMS).
- Hash index — O(1) lookup, no range queries.
Trade-off: indexes speed reads but slow writes (inserts, updates, deletes).
Query optimisation
A DBMS query optimiser chooses an execution plan from many alternatives:
- Logical optimisation — rewriting relational algebra (push selections, join reordering).
- Physical optimisation — choosing access paths (index scan, sequential scan) and join algorithms (nested loop, hash, merge).
NoSQL databases
Designed for web-scale workloads, often relaxing ACID for performance and partition tolerance.
Types
- Document — JSON-like docs (MongoDB, Couchbase).
- Key-value — Redis, DynamoDB.
- Column-family — Cassandra, HBase.
- Graph — Neo4j, ArangoDB.
CAP theorem (Brewer)
A distributed system can simultaneously guarantee at most two of:
- Consistency, Availability, Partition tolerance.
Most NoSQL stores favour AP or CP over CA. Modern NewSQL systems (CockroachDB, Spanner) push back, offering strong consistency at scale.
Data warehouse vs. OLTP
- OLTP — many short transactions, normalised schema.
- OLAP / Data warehouse — analytics, star/snowflake schema, columnar storage (Snowflake, BigQuery, Redshift).
ETL (Extract, Transform, Load) moves data from operational systems to warehouses for reporting.
In CSS short questions, a favourite catch is the difference between DELETE (DML, can be rolled back, row-by-row) and TRUNCATE (DDL, faster, cannot be rolled back, resets identity columns). Memorise this distinction.
Pakistani context
- NADRA maintains one of the world's largest biometric identity databases (over 240 million identities).
- FBR's IRIS system runs on Oracle for tax returns.
- The State Bank of Pakistan uses Oracle and SQL Server extensively.
- The Election Commission of Pakistan maintains the electoral roll database.
- NTC (National Telecom Corporation) offers data-centre services to the public sector.
Officers who can read schemas, write SQL, and reason about ACID guarantees are well-placed to evaluate vendor proposals and oversee digital-government initiatives.