Operating Systems  

Unified Memory Architecture

Prerequisites to understand this

  • CPU (Central Processing Unit): Executes general program instructions and controls system operations.

  • GPU (Graphics Processing Unit): Performs massively parallel computations used in graphics and AI workloads.

  • RAM (Random Access Memory): Temporary high-speed memory used by processors to store working data.

  • Memory Bandwidth: Speed at which data can move between processor and memory.

  • Memory Copy / Data Transfer: Moving data between different memory spaces such as CPU RAM and GPU VRAM.

  • System-on-Chip (SoC): Chip integrating CPU, GPU, memory controller, and accelerators in one package.

  • Memory Controller: Hardware unit that manages read/write requests to memory.

Introduction

Unified Memory Architecture (UMA) is a computer architecture where multiple processors such as CPU, GPU, and AI accelerators share the same physical memory pool instead of using separate memory systems. In traditional architectures, CPUs use system RAM while GPUs use dedicated VRAM. Data must be copied between these memory spaces before computation can occur. UMA removes this separation by allowing all processors to access a single unified address space and shared memory. This architecture simplifies memory management, reduces data movement overhead, and improves collaboration between heterogeneous computing units. UMA is commonly used in integrated graphics systems, mobile processors, gaming consoles, and modern SoC designs.
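
As a concrete illustration, CUDA's Unified Memory API exposes this single-pointer model to programmers. The sketch below is a minimal example; on true UMA hardware (integrated GPUs and many SoCs) the CPU and GPU genuinely touch the same physical RAM, while on discrete GPUs the driver approximates sharing by migrating pages behind the scenes.

    #include <cstdio>
    #include <cuda_runtime.h>

    // GPU kernel: increments each element in place.
    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1024;
        int *data = nullptr;

        // One allocation, visible to CPU and GPU through the same pointer.
        cudaMallocManaged(&data, n * sizeof(int));

        for (int i = 0; i < n; ++i) data[i] = i;       // CPU writes directly.

        increment<<<(n + 255) / 256, 256>>>(data, n);  // GPU works on the same memory.
        cudaDeviceSynchronize();                       // Wait for the GPU to finish.

        printf("data[42] = %d\n", data[42]);           // CPU reads the result: 43.
        cudaFree(data);
        return 0;
    }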

What problems can we solve with this?

In traditional heterogeneous computing systems, CPUs and GPUs maintain separate memory pools. When an application needs to process data using the GPU, the data must first be copied from CPU memory into GPU memory. This constant data transfer introduces latency, increases memory duplication, and consumes additional power. For workloads like AI inference, graphics rendering, and multimedia processing, frequent memory transfers become a performance bottleneck. Unified Memory Architecture addresses these challenges by allowing processors to operate on the same memory space. This eliminates redundant copies and simplifies application development. As a result, systems become more efficient, power-friendly, and easier to program.
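
For contrast, here is a sketch of the traditional discrete-memory workflow described above, written in CUDA. Every round trip through the GPU is bracketed by an explicit host-to-device and device-to-host copy; these cudaMemcpy calls are exactly the transfers that UMA eliminates.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1024;
        int host[n];                         // CPU copy of the data in system RAM.
        for (int i = 0; i < n; ++i) host[i] = i;

        int *dev = nullptr;
        cudaMalloc(&dev, n * sizeof(int));   // Separate allocation in GPU VRAM.

        // Explicit transfer 1: CPU RAM -> GPU VRAM before the kernel can run.
        cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

        increment<<<(n + 255) / 256, 256>>>(dev, n);

        // Explicit transfer 2: GPU VRAM -> CPU RAM before the CPU can see results.
        cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);

        printf("host[42] = %d\n", host[42]);  // 43
        cudaFree(dev);
        return 0;
    }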

Key problems solved

  • Data transfer overhead: Eliminates frequent copying between CPU RAM and GPU VRAM.

  • Memory duplication: Prevents storing multiple copies of the same dataset.

  • Programming complexity: Developers no longer need to manage explicit memory transfers.

  • Latency issues: Direct memory access reduces delays between processors.

  • Energy consumption: Less data movement reduces power usage.

  • Resource utilization: Memory can be dynamically allocated where needed.

How to implement / use this?

Implementing UMA requires hardware support and system-level coordination. The CPU, GPU, and other accelerators are connected to a shared memory controller that manages access to a unified RAM pool. Each processor operates within a shared virtual address space, meaning the same memory location can be accessed by different processors. Operating systems and runtime frameworks coordinate memory allocation and scheduling to avoid conflicts. Modern SoC designs place processors and memory controllers close together to minimize latency. Software frameworks can then allocate memory once and allow all compute units to access it directly. This approach simplifies heterogeneous computing workloads.
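
A sketch of that allocate-once pattern, again using CUDA managed memory as a stand-in for a UMA runtime. The cudaMemPrefetchAsync calls are optional placement hints that let the runtime stage pages near whichever processor will touch them next; on true UMA hardware they are unnecessary, since there is only one place the data can live.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *buf, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        int device = 0;
        cudaGetDevice(&device);

        float *buf = nullptr;
        cudaMallocManaged(&buf, n * sizeof(float)); // Allocated once, shared by all compute units.

        for (int i = 0; i < n; ++i) buf[i] = 1.0f;  // CPU initializes in place.

        // Optional hint: stage pages near the GPU before the kernel runs.
        cudaMemPrefetchAsync(buf, n * sizeof(float), device, 0);
        scale<<<(n + 255) / 256, 256>>>(buf, n, 2.0f);

        // Optional hint: bring pages back toward the CPU for the next phase.
        cudaMemPrefetchAsync(buf, n * sizeof(float), cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();                    // Coordination point before the CPU reads.

        printf("buf[0] = %f\n", buf[0]);            // 2.0
        cudaFree(buf);
        return 0;
    }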

Implementation steps

  • Shared physical memory: System uses one RAM pool accessible by all processors.

  • Unified address space: CPU and GPU see the same memory addresses.

  • Memory controller arbitration: Controller schedules read/write requests to avoid conflicts.

  • Cache coherence mechanisms: Ensure that all processors see up-to-date data values.

  • Operating system support: OS manages memory allocation and permissions.

  • Runtime frameworks: Programming environments expose unified memory APIs (a capability-check sketch follows this list).
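
Whether a given system provides these pieces can be checked at run time. The sketch below uses the CUDA runtime's device-property query; the three flags printed correspond roughly to managed-memory support, full CPU/GPU coherence, and direct GPU access to ordinary pageable host memory, the last being a hallmark of true UMA systems.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // Query device 0.

        printf("Device: %s\n", prop.name);
        // Can the device allocate managed (unified) memory at all?
        printf("managedMemory:           %d\n", prop.managedMemory);
        // Can CPU and GPU access managed memory at the same time (coherently)?
        printf("concurrentManagedAccess: %d\n", prop.concurrentManagedAccess);
        // Can the GPU directly access ordinary pageable host memory?
        printf("pageableMemoryAccess:    %d\n", prop.pageableMemoryAccess);
        return 0;
    }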

Sequence Diagram

This sequence diagram illustrates how processors interact with shared memory in a Unified Memory Architecture system. The application first requests computation from the CPU. The CPU retrieves input data from shared RAM through the memory controller. When the workload requires parallel processing, the CPU offloads tasks to the GPU. Instead of copying data to a separate memory space, the GPU directly accesses the same shared RAM through the memory controller. Both processors operate on the same dataset without duplication. Once computation is complete, the GPU returns results to the CPU, which then sends the final output back to the application. The key difference from traditional architectures is that data remains in a single memory location throughout the process; the sketch after the flow steps below traces the same sequence in code.

[Sequence diagram: Application → CPU → Memory Controller → Shared RAM, with the GPU accessing the same RAM directly after task offload]

Sequence flow steps

  • Application request: Application triggers computation request.

  • CPU reads data: CPU retrieves input data from shared memory.

  • Shared memory access: Memory controller fetches data from RAM.

  • Task offloading: CPU delegates parallel work to GPU.

  • Direct GPU access: GPU reads and writes data from the same RAM.

  • Result return: GPU sends processed results back to CPU.
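
The same flow can be traced in code. In this sketch (CUDA managed memory again standing in for a UMA runtime), the numbered comments map onto the steps above; note that the result-return step needs no copy-back, only a synchronization point.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void square(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i];     // (5) GPU reads/writes the same RAM.
    }

    int compute(int *data, int n) {                 // (1) Application requests computation.
        for (int i = 0; i < n; ++i) data[i] = i;    // (2)(3) CPU reads/writes shared memory.
        square<<<(n + 255) / 256, 256>>>(data, n);  // (4) CPU offloads parallel work to GPU.
        cudaDeviceSynchronize();                    // (6) Wait for results; no copy-back needed.
        return data[10];                            // CPU reads the GPU's output in place.
    }

    int main() {
        const int n = 256;
        int *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(int));  // Single shared allocation.
        printf("result = %d\n", compute(data, n));  // 100
        cudaFree(data);
        return 0;
    }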

Component Diagram

The component diagram shows the structural organization of systems that implement Unified Memory Architecture. The application interacts with the CPU to initiate computations. The CPU communicates with the memory controller to retrieve or store data in shared RAM. When parallel processing is required, the CPU offloads tasks to the GPU. Both the CPU and GPU send memory requests to the same memory controller, which arbitrates access to shared RAM. Because all components rely on the same memory pool, there is no need for explicit memory transfers between processors. This architecture simplifies system design and enables efficient collaboration between heterogeneous compute units. The memory controller plays a critical role in ensuring fair and efficient access to shared memory resources; the pointer-inspection sketch after the interaction steps below shows this shared address space from the software side.

[Component diagram: Application, CPU, GPU, Memory Controller, and Shared RAM, with both processors routing memory requests through the same controller]

Component interaction steps

  • Application → CPU: Application requests computation.

  • CPU → Memory Controller: CPU requests memory access.

  • GPU → Memory Controller: GPU accesses shared data for parallel tasks.

  • Memory Controller → RAM: Controller manages read/write operations.

  • CPU → GPU: CPU delegates compute-intensive tasks.

  • GPU → CPU: GPU sends processed results back.
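
Tying the diagrams together: because the processors share one address space, a managed allocation resolves to the same pointer value on the host and device sides. The sketch below uses cudaPointerGetAttributes to make that visible.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int *data = nullptr;
        cudaMallocManaged(&data, 4096);

        cudaPointerAttributes attr;
        cudaPointerGetAttributes(&attr, data);

        // Under a unified address space both views resolve to the same address,
        // so no translation or copying is needed when work moves between processors.
        printf("managed:        %s\n",
               attr.type == cudaMemoryTypeManaged ? "yes" : "no");
        printf("host pointer:   %p\n", attr.hostPointer);
        printf("device pointer: %p\n", attr.devicePointer);

        cudaFree(data);
        return 0;
    }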

Advantages

  1. Reduced data movement: Eliminates repeated copying between CPU and GPU memory.

  2. Simpler programming model: Developers work with a single memory space.

  3. Lower latency: Direct memory access speeds up processing.

  4. Better power efficiency: Reduced memory transfers save energy.

  5. Improved resource utilization: Memory dynamically allocated based on workload.

  6. Compact system design: Ideal for SoC and integrated systems.

  7. Enhanced heterogeneous computing: Enables CPU, GPU, and AI accelerators to collaborate efficiently.

Summary

Unified Memory Architecture is a modern computing design where multiple processors share a single memory pool rather than using separate memory systems. By eliminating redundant data transfers between CPU and GPU memory, UMA improves performance, reduces power consumption, and simplifies application development. It is widely used in integrated processors, mobile chips, and modern system-on-chip platforms where efficient collaboration between heterogeneous compute units is essential. With the growing importance of AI, graphics, and high-performance computing workloads, UMA plays a key role in enabling efficient data sharing and streamlined system architecture.