
Computer Architecture

1. Von Neumann Architecture

Definition

The Von Neumann architecture, proposed by John von Neumann in 1945, is characterised by a single unified memory space that stores both data and instructions, a single set of buses connecting memory to the CPU, and sequential execution of instructions.

Components

  1. Central Processing Unit (CPU) — executes instructions
  2. Memory (RAM) — stores both programs and data
  3. Input/Output devices — interact with the external world
  4. System bus — connects all components

Key Property: Stored Program Concept

Both instructions and data reside in the same memory. The CPU fetches instructions from memory, decodes them, and executes them. This is the stored program concept — the machine can modify its own instructions (though modern systems typically prevent this for security).


2. Harvard Architecture

Definition

The Harvard architecture uses separate memory spaces for instructions and data, each with its own bus.

Comparison: Von Neumann vs Harvard

| Property | Von Neumann | Harvard |
|---|---|---|
| Memory | Single unified memory | Separate instruction and data memory |
| Buses | Single bus (bottleneck) | Separate buses (parallel access) |
| Speed | Slower (bus contention) | Faster (simultaneous fetch) |
| Complexity | Simpler hardware | More complex hardware |
| Self-modifying code | Possible (in theory) | Not possible |
| Modern usage | General-purpose computers | DSPs, microcontrollers, cache systems |
Note: Modern CPUs use a modified Harvard architecture at the cache level: the L1 cache is split into an instruction cache and a data cache (Harvard), while main memory is unified (Von Neumann).


3. CPU Components

Arithmetic Logic Unit (ALU)

The ALU performs:

  • Arithmetic operations: addition, subtraction, multiplication, division
  • Logical operations: AND, OR, NOT, XOR
  • Comparison operations: equal, less than, greater than
  • Bitwise operations: shift, rotate

The ALU is a combinational circuit — it has no internal state. It takes inputs from registers and produces outputs that are written back to registers.
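
To make "combinational" concrete, here is a minimal Python sketch of an ALU as a pure function: outputs depend only on the current inputs, and nothing is remembered between calls. The operation set, 8-bit word size, and flag conventions are illustrative assumptions, not a real ISA.

```python
# A combinational ALU as a pure function: no internal state, outputs depend
# only on the inputs. Operation names, 8-bit width, and flag conventions are
# illustrative assumptions.
def alu(op: str, a: int, b: int, bits: int = 8):
    mask = (1 << bits) - 1
    if op == "ADD":
        total = a + b
        result, carry = total & mask, total > mask
    elif op == "SUB":
        result, carry = (a - b) & mask, a >= b   # carry = no borrow
    elif op == "AND":
        result, carry = a & b, False
    elif op == "OR":
        result, carry = a | b, False
    elif op == "XOR":
        result, carry = a ^ b, False
    elif op == "SHL":                            # logical shift left by one
        result, carry = (a << 1) & mask, bool(a >> (bits - 1))
    else:
        raise ValueError(f"unknown operation {op!r}")
    flags = {"Z": result == 0,                   # zero flag
             "N": bool(result >> (bits - 1)),    # negative (sign bit)
             "C": carry}                         # carry/borrow flag
    return result, flags

print(alu("ADD", 200, 100))   # (44, {'Z': False, 'N': False, 'C': True})
```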

Control Unit (CU)

The CU orchestrates the fetch-decode-execute cycle by generating control signals that:

  • Enable/disable registers (PC, MAR, MDR, etc.)
  • Control the ALU operation
  • Manage data flow on the buses
  • Determine the next instruction address

Registers

| Register | Full Name | Purpose |
|---|---|---|
| PC | Program Counter | Holds the address of the next instruction to fetch |
| MAR | Memory Address Register | Holds the address to be accessed in memory |
| MDR | Memory Data Register | Holds data read from, or to be written to, memory |
| ACC | Accumulator | Stores results of ALU operations |
| CIR | Current Instruction Register | Holds the instruction currently being decoded |
| IR | Instruction Register | Synonym for CIR (board-dependent naming) |
Note: register naming is board-specific.

  • AQA uses: PC, MAR, MDR, ACC, CIR
  • CIE uses: PC, MAR, MDR, ACC, IR, B (B register as temporary)
  • OCR uses: PC, MAR, MDR, ACC, CIR, and may reference index registers

General Purpose Registers (GPRs): Additional registers for temporary storage during computation. The number varies by architecture (e.g., ARM has 16, x86-64 has 16).

Special Purpose Registers:

  • Status/Flags Register: Stores condition codes (Zero, Carry, Negative, Overflow) set by the ALU
  • Stack Pointer (SP): Points to the top of the call stack
  • Link Register (LR): Stores the return address for function calls (ARM-specific)

4. The Fetch-Decode-Execute Cycle

Definition

The CPU continuously cycles through three phases:

  1. Fetch: Load the next instruction from memory
  2. Decode: Determine what the instruction does
  3. Execute: Perform the operation

Detailed Register Transfers

Fetch

  1. MAR ← PC: Copy the program counter to the memory address register
  2. MDR ← Memory[MAR]: Read from memory at the address in MAR, store in MDR
  3. CIR ← MDR: Copy the instruction from MDR to the current instruction register
  4. PC ← PC + 1: Increment the program counter to point to the next instruction

Decode

  1. The CU examines the opcode portion of CIR to determine:
    • What operation to perform
    • What operands are needed
    • What addressing mode to use

The CU then generates the appropriate control signals.

Execute

  1. The CU sends control signals to execute the instruction. For example:
    • Add: ACC ← ACC + operand
    • Load: ACC ← Memory[address]
    • Store: Memory[address] ← ACC
    • Branch (conditional): If condition met, PC ← address

After execution, the cycle repeats from step 1.

Timing

Each step involves register transfers and bus operations. A typical fetch takes 3–4 clock cycles (one per bus operation), decode takes 1–2 cycles, and execute takes 1–5 cycles depending on the instruction.

Example: Trace the fetch-decode-execute cycle for ADD #5 (add immediate 5 to ACC)

Assume: PC = 0x1000, instruction at 0x1000 is ADD #5 (opcode: 0010, operand: 00000101)

  1. Fetch:

    • MAR ← PC: MAR = 0x1000
    • MDR ← Memory[0x1000]: MDR = 0010 00000101
    • CIR ← MDR: CIR = 0010 00000101
    • PC ← PC + 1: PC = 0x1001
  2. Decode:

    • CU reads opcode 0010 → identifies ADD immediate
    • CU reads operand 00000101 → value 5
  3. Execute:

    • ACC ← ACC + 5
    • Set flags in status register (zero, negative, carry, overflow)
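
The trace above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the example's made-up encoding (4-bit opcode, 8-bit operand, word-addressable memory), a starting ACC of 10, and only opcode 0010 (ADD immediate) implemented.

```python
# Minimal fetch-decode-execute sketch for the worked example above.
# Assumptions: 12-bit instructions (4-bit opcode + 8-bit operand),
# word-addressable memory, ACC starting at 10, and only opcode 0010
# (ADD immediate) implemented.
memory = {0x1000: 0b0010_00000101}   # ADD #5 stored at address 0x1000
PC, ACC = 0x1000, 10
Z = N = False

# Fetch
MAR = PC                   # MAR <- PC
MDR = memory[MAR]          # MDR <- Memory[MAR]
CIR = MDR                  # CIR <- MDR
PC += 1                    # PC <- PC + 1

# Decode: split CIR into opcode (top 4 bits) and operand (bottom 8 bits)
opcode, operand = CIR >> 8, CIR & 0xFF

# Execute
if opcode == 0b0010:       # ADD immediate
    ACC = (ACC + operand) & 0xFF
    Z, N = ACC == 0, bool(ACC & 0x80)   # update status flags

print(f"PC={PC:#x} ACC={ACC} Z={Z} N={N}")   # PC=0x1001 ACC=15
```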

5. The Bus System

Definition

A bus is a set of parallel wires that carries data, addresses, or control signals between components.

Types

| Bus | Width (typical) | Purpose | Direction |
|---|---|---|---|
| Data bus | 8, 16, 32, or 64 bits | Carries data between CPU, memory, and I/O | Bidirectional |
| Address bus | 16, 32, or 64 bits | Carries memory addresses | Unidirectional (CPU → memory) |
| Control bus | Variable | Carries control signals (read, write, clock, interrupt) | Mostly CPU → devices |

Key relationships:

  • Address bus width $n$ determines addressable memory: up to $2^n$ locations
  • Data bus width determines how many bits are transferred per bus operation
  • A 32-bit address bus with byte-addressable memory can address $2^{32}$ bytes = 4 GiB
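
These relationships are straightforward to check numerically; a quick Python sketch (byte-addressable memory assumed):

```python
# Addressable memory from address bus width, assuming byte-addressable memory.
def addressable_bytes(address_bits: int) -> int:
    return 2 ** address_bits        # 2^n addresses, one byte per address

print(f"{addressable_bytes(32):,} bytes")       # 4,294,967,296 bytes
print(addressable_bytes(32) // 2**30, "GiB")    # 4 GiB
```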

6. Cache Memory

Motivation

CPU speeds have increased much faster than memory speeds. Cache memory bridges this gap by storing frequently accessed data in fast memory close to the CPU.

Locality of Reference

Temporal locality: If a memory location is accessed, it is likely to be accessed again soon (e.g., loop variables, instruction fetch in a loop).

Spatial locality: If a memory location is accessed, nearby locations are likely to be accessed soon (e.g., sequential array access, instruction stream).

Cache Levels

| Level | Location | Size | Speed | Purpose |
|---|---|---|---|---|
| L1 | Inside CPU core | 32–64 KiB | ~1 ns | Hottest data; split into instruction and data caches |
| L2 | Inside CPU | 256 KiB–1 MiB | ~3–10 ns | Backup for L1 |
| L3 | Shared among cores | 4–64 MiB | ~10–30 ns | Shared cache; reduces L1/L2 misses |

Cache Mapping Techniques

Direct Mapping

Each memory block maps to exactly one cache line.

$\text{Cache line} = \text{Block address} \bmod (\text{Number of cache lines})$

  • Advantage: Simple, fast lookup
  • Disadvantage: Conflict misses — two frequently used blocks mapping to the same line evict each other

Fully Associative Mapping

A memory block can be placed in any cache line.

  • Advantage: Minimises conflict misses
  • Disadvantage: Slow lookup (must search all lines), complex hardware

Set-Associative Mapping

The cache is divided into $s$ sets, each containing $k$ ways (lines). A block maps to exactly one set, but can be placed in any line within that set.

$\text{Set} = \text{Block address} \bmod s$

  • $k = 1$: Direct mapping
  • $s = 1$ (i.e., $k$ equals the total number of lines): Fully associative
  • Common configurations: 4-way, 8-way, 16-way set-associative (a mapping sketch follows below)
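
A minimal sketch of the address split shared by all three schemes, where direct mapping falls out as ways = 1 and fully associative as sets = 1 (the block address and configurations are illustrative):

```python
# Split a block address into (tag, set index) for a set-associative cache.
# With s sets, the set index is the block address mod s and the remaining
# high-order bits form the tag. ways = 1 gives direct mapping; sets = 1
# gives a fully associative cache (the tag is the whole block address).
def map_block(block_address: int, sets: int):
    return block_address // sets, block_address % sets   # (tag, set index)

# A 64-line cache organised three ways: direct, 4-way, fully associative.
for sets, ways in [(64, 1), (16, 4), (1, 64)]:
    tag, index = map_block(0x1A2B, sets)
    print(f"{sets:>2} sets x {ways:>2} ways: set index={index:>2}, tag={tag:#x}")
```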

Cache Replacement Policies

  • LRU (Least Recently Used): Evict the line that was accessed least recently
  • FIFO (First In, First Out): Evict the oldest line
  • Random: Evict a random line
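
LRU for a single set can be sketched with an ordered dictionary; the 4-way capacity and access sequence below are illustrative assumptions:

```python
from collections import OrderedDict

# LRU replacement for one 4-way cache set: on a hit, move the line to the
# most-recently-used end; on a miss with a full set, evict the least recent.
class LRUSet:
    def __init__(self, ways: int = 4):
        self.ways = ways
        self.lines = OrderedDict()          # tag -> block data, oldest first

    def access(self, tag):
        if tag in self.lines:               # hit: refresh recency
            self.lines.move_to_end(tag)
            return "hit"
        if len(self.lines) == self.ways:    # miss in a full set: evict LRU
            self.lines.popitem(last=False)
        self.lines[tag] = f"block {tag}"
        return "miss"

s = LRUSet()
for tag in [1, 2, 3, 4, 1, 5, 2]:
    print(tag, s.access(tag))
# 1,2,3,4 miss (cold); 1 hits; 5 evicts 2 (least recent); 2 then misses
```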

7. Virtual Memory and Paging

Definition

Virtual memory gives each process the illusion of having its own private, contiguous address space, while physical memory may be fragmented and shared among processes.

Paging

Memory is divided into fixed-size pages (typically 4 KiB). Virtual addresses are translated to physical addresses using a page table.

Virtual address structure (32-bit, 4 KiB pages):

| Page number (20 bits) | Offset (12 bits) |

Page table entry (PTE):

| Frame number (20 bits) | Flags (present, dirty, accessed, permissions) |

Translation: Physical address = Frame number (from PTE) concatenated with Offset.

Translation Lookaside Buffer (TLB)

The TLB is a small, fast cache of recently used page table entries. It avoids the overhead of a full page table lookup for every memory access.

  • TLB hit: Translation found in TLB — fast (1–2 cycles)
  • TLB miss: Must consult page table — slow (10–100 cycles)
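
A minimal sketch of the translation path, including the TLB fast path, under the section's 32-bit/4 KiB assumptions (the page table contents are made up, and the TLB here is just a Python dict with no capacity limit):

```python
# Virtual -> physical translation with a TLB: 32-bit addresses, 4 KiB pages.
# page_table maps virtual page number -> physical frame number (illustrative).
PAGE_BITS = 12                       # 4 KiB pages -> 12-bit offset

page_table = {0x00400: 0x12345, 0x00401: 0x0BEEF}
tlb = {}                             # small cache of recent translations

def translate(vaddr: int) -> int:
    vpn = vaddr >> PAGE_BITS         # virtual page number (top 20 bits)
    offset = vaddr & 0xFFF           # offset within the page (low 12 bits)
    if vpn in tlb:                   # TLB hit: fast path
        frame = tlb[vpn]
    elif vpn in page_table:          # TLB miss: walk the page table, cache it
        frame = page_table[vpn]
        tlb[vpn] = frame
    else:                            # not resident: the OS handles a page fault
        raise RuntimeError("page fault")
    return (frame << PAGE_BITS) | offset   # frame number concatenated with offset

print(hex(translate(0x00400ABC)))    # TLB miss -> 0x12345abc
print(hex(translate(0x00400DEF)))    # TLB hit  -> 0x12345def
```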

Page Fault

When the CPU accesses a virtual page that is not in physical memory:

  1. A page fault exception is raised
  2. The OS locates the page on disk (swap space)
  3. The OS loads the page into a free frame
  4. The page table is updated
  5. The instruction is restarted

8. Pipelining

Definition

Pipelining overlaps the execution stages of multiple instructions, similar to an assembly line.

```
Clock:    1  2  3  4  5  6
Instr 1:  F  D  E
Instr 2:     F  D  E
Instr 3:        F  D  E
Instr 4:           F  D  E
```

Speedup Analysis

Without pipelining, $n$ instructions take $n \times k$ cycles (where $k$ is the number of stages).

With pipelining, the first instruction takes $k$ cycles, and each subsequent instruction completes one cycle after the previous one. Total: $k + (n - 1)$ cycles.

$\text{Speedup} = \frac{nk}{k + n - 1} \to k$ as $n \to \infty$

A $k$-stage pipeline achieves at most $k\times$ speedup in the ideal case.
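
A quick numerical check of this formula for a 5-stage pipeline (the instruction counts are arbitrary):

```python
# Ideal pipeline speedup: n*k cycles unpipelined vs k + (n - 1) pipelined.
def speedup(n: int, k: int) -> float:
    return (n * k) / (k + n - 1)

for n in (10, 100, 10_000):
    print(f"n={n:>6}: speedup = {speedup(n, k=5):.2f}x")
# approaches the k = 5x ceiling as n grows (100 instructions: 500/104 = 4.81x)
```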

Pipeline Hazards

Data hazard: An instruction depends on the result of a previous instruction that has not yet completed.

Solutions: Forwarding (bypass), stalling, out-of-order execution.

Control hazard: A branch instruction changes the PC, but the next instruction(s) may already be in the pipeline.

Solutions: Branch prediction (static or dynamic), delayed branch, branch target buffer.

Structural hazard: Two instructions require the same hardware resource simultaneously.

Solutions: Duplicate resources, stalling.


9. RISC vs CISC

Definitions

RISC (Reduced Instruction Set Computer): a small set of simple instructions, most of which execute in one clock cycle. Complexity is shifted into software (the compiler).

CISC (Complex Instruction Set Computer): a large set of complex instructions that may take multiple cycles each. Complexity resides in the hardware (instruction decoding and microcode).

Comparison

| Property | RISC | CISC |
|---|---|---|
| Instruction count | Small (50–100) | Large (200–500+) |
| Instruction format | Fixed length | Variable length |
| Execution time | 1 cycle (mostly) | 1–20+ cycles |
| Addressing modes | Few (2–5) | Many (10+) |
| Registers | Many (32+) | Few (8–16) |
| Microcode | No (hardwired control) | Yes (microprogrammed control) |
| Pipeline | Easy to pipeline | Harder to pipeline |
| Examples | ARM, MIPS, RISC-V | x86/x86-64 |
| Code density | Lower (more instructions) | Higher (fewer instructions) |
| Power consumption | Lower | Higher |
Note: ARM (used in smartphones and the Raspberry Pi) is RISC. Intel/AMD processors are CISC (but use RISC-like internal micro-operations).


Problem Set

Problem 1. A CPU has a 24-bit address bus and a 16-bit data bus. What is the maximum addressable memory?

Hint

Each address identifies one location, and each location holds one data bus width.

Answer

$2^{24} = 16,777,216$ locations

Each location holds 16 bits = 2 bytes.

Total addressable memory: $16,777,216 \times 2 = 33,554,432$ bytes = 32 MiB.

Problem 2. Describe what happens during the fetch phase of the fetch-decode-execute cycle. Include all register transfers.

Hint

Four steps: MAR ← PC, read from memory, CIR ← MDR, PC ← PC + 1.

Answer
  1. The contents of the Program Counter (PC) are copied to the Memory Address Register (MAR) via the address bus.
  2. The data stored at the memory address held in the MAR is copied to the Memory Data Register (MDR) via the data bus.
  3. The contents of the MDR are copied to the Current Instruction Register (CIR).
  4. The Program Counter is incremented by 1 (or by the instruction length, for variable-length ISAs) to point to the next instruction.

Problem 3. Explain how temporal and spatial locality contribute to cache effectiveness.

Hint

Give concrete examples of code patterns that exhibit each type of locality.

Answer

Temporal locality: In a loop that processes an array, the loop counter variable is accessed repeatedly. After the first access loads it into cache, subsequent accesses hit the cache. Similarly, the instruction bytes of the loop body are fetched from cache after the first iteration.

Spatial locality: When accessing array[i], the cache loads a block (cache line) containing array[i] and several adjacent elements. Subsequent accesses to array[i+1], array[i+2], etc., are cache hits because they are in the same cache line.

Problem 4. A system uses 32-bit virtual addresses with 4 KiB pages. How many entries are in the page table? What is the size of each entry if physical addresses are 36 bits?

Hint

Number of virtual pages = $2^{32}/4096$. Each PTE stores a frame number and flags.

Answer

Number of pages: $2^{32}/2^{12} = 2^{20} = 1,048,576$ entries

Physical frame number bits: $36 - 12 = 24$ bits

PTE size: 24 bits (frame number) + flags (typically 8–12 bits) ≈ 4 bytes per entry.

Total page table size: $1,048,576 \times 4$ bytes = 4 MiB per process.

Problem 5. Explain why a 5-stage pipeline (fetch, decode, execute, memory, writeback) processing 100 instructions takes 104 cycles, not 500.

Hint

After the pipeline fills, one instruction completes per cycle.

Answer

The first instruction takes 5 cycles to pass through all stages (pipeline fill). After that, one instruction completes per cycle. The last instruction finishes at cycle $5 + 99 = 104$.

Total: $5 + (100 - 1) = 104$ cycles, compared to $100 \times 5 = 500$ without pipelining.

Speedup: $500/104 \approx 4.81\times$ (approaching the theoretical maximum of $5\times$).

Problem 6. Give an example of a data hazard in a pipeline and explain how forwarding can resolve it.

Hint

Consider two consecutive instructions where the second uses the result of the first.

Answer
```
ADD R1, R2, R3   // R1 ← R2 + R3
SUB R4, R1, R5   // R4 ← R1 - R5 (depends on R1)
```

The ADD instruction produces R1 in the "writeback" stage, but the SUB instruction needs R1 in its "decode" stage, which occurs before writeback. This is a data hazard.

Forwarding solution: The result of ADD is already available at the end of its execute stage. Instead of waiting for writeback, the result is forwarded directly from the execute stage output of ADD to the ALU input for SUB, eliminating the stall.

Problem 7. Compare Von Neumann and Harvard architectures. Why is the modified Harvard architecture used in modern CPUs?

Hint

Consider bus contention and the practical need for unified main memory.

Answer

Von Neumann uses a single memory and bus for both instructions and data, causing contention. Harvard uses separate memories and buses, allowing simultaneous instruction fetch and data access.

Modern CPUs use modified Harvard: L1 cache is split into instruction cache (I-cache) and data cache (D-cache), providing Harvard benefits at the fastest level. Beyond L1, memory is unified (Von Neumann) because:

  1. Main memory must be flexible — programs need to load data and instructions dynamically
  2. Unified memory simplifies the memory management unit (MMU) design
  3. The cost of duplicate main memory buses is not justified given cache hit rates

Problem 8. A direct-mapped cache has 64 lines, each holding 16 bytes. Main memory has 65,536 blocks. How many bits are needed for the tag, line number, and offset fields?

Hint

Offset = log₂(block size). Line = log₂(cache lines). Tag = remaining bits from block address.

Answer

Offset: $\log_2(16) = 4$ bits

Line number: $\log_2(64) = 6$ bits

Total block address bits: $\log_2(65,536) = 16$ bits

Tag: $16 - 6 = 10$ bits

Each cache line stores: 16 bytes (data) + 10 bits (tag) + 1 bit (valid) + 1 bit (dirty) ≈ 18 bytes total.

Problem 9. Explain the difference between a page fault and a TLB miss. Which is more expensive?

Hint

One involves disk I/O; the other involves a slower but still RAM-speed lookup.

Answer

TLB miss: The virtual-to-physical translation is not in the TLB cache. The CPU must consult the page table in main memory (a few extra memory accesses). Cost: ~10–100 cycles.

Page fault: The required page is not in physical memory at all. The OS must read it from disk (swap space) into a free frame, update the page table, and restart the instruction. Cost: ~100,000–10,000,000 cycles (disk access is ~10ms, while a CPU cycle is ~0.3ns).

A page fault is orders of magnitude more expensive than a TLB miss.

Problem 10. A RISC processor has 32 registers, each 32 bits wide. How many bits are needed to encode a register operand? What is the maximum number of 3-operand instructions possible if the opcode field is 8 bits?

Hint

Register field size = log₂(32). Total instruction size = opcode + 3 register fields.

Answer

Register operand: $\log_2(32) = 5$ bits

Instruction format: 8 (opcode) + 5 + 5 + 5 = 23 bits

With an 8-bit opcode there are $2^8 = 256$ possible opcodes, so at most 256 distinct 3-operand instructions.
