Imaging and ML software from a low-level programming perspective

Maxim Karpushin

Wolf Hauser

Imaging Software in Production: Context

ML in Production: Market Dynamics

Outline

  1. Introduction
  2. Programmable hardware overview
    • Central Processing Units (CPU)
      • x86 vs ARM
      • Memory model
      • Caching, SIMD, multithreading
    • Graphics Processing Units (GPU)
      • Graphics pipeline
      • SIMT
      • CUDA samples
    • Digital Signal Processors (DSP)
      • Quick overview
  3. Preparing ML models for production
  4. Conclusion

Programmable Hardware Overview

Vanilla server/desktop system

Mobile SoC (System on Chip)

Example: Qualcomm® Snapdragon™ 8+ Gen 1 (2022)

Programmable Hardware Overview

Central Processing Units

Inside a Modern CPU

  • Inside each CPU core
    • Instruction fetcher
    • Instruction decoder
    • Instruction queue (for out-of-order execution)
    • Execution units (ALU, FPU, memory R/W, branch, etc.)
    • Branch prediction
    • Register file (super fast memory, holds operands and results of arithmetic operations)
  • Several instances of all the above: multicore architecture
  • Cache (fast memory, hidden from the programmer)
Intel Skylake (2015)

Common Architectures

Family                          x86(_64)                                    ARM
Instruction set classification  Complex Instruction Set Computer (CISC)     Reduced Instruction Set Computer (RISC)
Usage                           Mainly desktop/data center;                 Mainly mobile;
                                weak mobile market penetration.             but: Apple Silicon, Nvidia Grace.
Key brand names                 Intel, AMD                                  ARM, Apple, Samsung, Qualcomm, ...
Word size                       32 or 64 bits                               32 or 64 bits
Vector extensions (SIMD)        SSE (128 b), AVX (256 b), AVX-512           NEON (128 b), SVE (variable)
Simultaneous multithreading     Yes: commonly 2 logical cores               No until recently
(aka "Hyper-Threading")         per physical core                           (Cortex-A65 is the first ARM core with SMT)

Other architectures exist. To keep an eye on: RISC-V.

Execution Speed

  • Clock cycle: atomic value of elapsed time in a digital circuit
    (0.2 ns at 5 GHz to 1 ns at 1 GHz).
  • Decoder width: number of instructions decoded simultaneously
    (1 for very simple CPU, 4 for Intel/AMD, up to 8 in Apple Silicon).
  • Latency: cycles it takes before the result of an instruction is available for use in other instructions
    (1 for OR/ADD, 3 or 4 for MUL, many for DIV (12 on Intel Skylake)).
  • Throughput: Average instructions per cycle
    (0.05 to 6 for a single instruction type, increases for good instruction mix).
  • Instruction latencies and throughputs are context-dependent but measurable.
  • Multiply-accumulate (MAC) is typically a single three-operand instruction and is faster than
    a multiplication followed by an addition (see the sketch below).
  • Complex floating-point math instructions (e.g. sqrt) may have content-dependent latency.
Unofficial but widely recognized instruction-timing references exist for x86 and ARM.
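A minimal sketch of the MAC idea in portable C++: std::fma maps to a fused multiply-add instruction on hardware that has one (whether the compiler actually emits it depends on the target and compiler flags).

#include <cmath>
#include <cstddef>

// Dot product accumulated with fused multiply-add: one MAC per element,
// instead of a separate multiplication and addition.
float dot(const float* x, const float* y, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    acc = std::fma(x[i], y[i], acc);   // acc = x[i] * y[i] + acc
  return acc;
}

Note the loop-carried dependency on acc: the loop advances at one MAC per instruction latency, not per throughput; splitting the sum over several independent accumulators exposes more instruction-level parallelism.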

Single Instruction Multiple Data

  • Signal and image processing: apply the same instructions to many data samples.
  • Problem: decoding the same instructions over and over again is a waste of resources.
  • Solution: apply each instruction to several data samples.
  • Vertical vector instruction
  • Horizontal vector instruction
  • Operate on vectors stored in a dedicated register file
    • SSE, NEON: 128 bits = 2 × double or int64 / 4 × float or int32 / 8 × int16 / 16 × int8
    • AVX: 256 bits = 4 × double or int64 / 8 × float or int32 / 16 × int16 / 32 × int8
    • AVX-512: 512 bits = 8 × double or int64 / 16 × float or int32 / 32 × int16 / 64 × int8
    • SVE: Register size is implementation dependent

Programming for SIMD

  • Control flow is the same for all samples (no if/else)!
    → Compute both, then choose.
  • Cannot use recursion.
    → Use a different algorithm.
  • Sample number must be a multiple of vector size.
    → Round up in memory allocation, process a few useless samples.
  • Vector elements must/should be contiguous in memory.
    → Prefer structure of arrays over array of structures.
  • Compilers are bad at vectorizing (and programmers are bad at writing vectorizable code).
    → Use intrinsic functions to force both programmers and compilers.
  • Along which dimension should we vectorize an image?
    • Channels: with 4-element vectors and RGB images, introduce a useless 4th channel (e.g., opacity).
      → Store images interleaved as RGBA|RGBA|…
    • Rows: more flexible (works for any number of channels and any vector size).
      → For RGB images, store each channel in its own plane (planar layout).
    • Columns: bad idea (pixels of a column are not stored contiguously).
See the intrinsics references for x86 and ARM; a minimal example is sketched below.
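A minimal vectorization sketch with x86 SSE intrinsics (the NEON equivalents in <arm_neon.h> follow the same pattern); it assumes n is a multiple of 4, as recommended above.

#include <immintrin.h>

// Vertical vector add: y[i] += x[i], 4 float samples per instruction.
void add_inplace(const float* x, float* y, int n) {
  for (int i = 0; i < n; i += 4) {                 // n rounded up to a multiple of 4 at allocation time
    __m128 vx = _mm_loadu_ps(x + i);               // load 4 floats from x
    __m128 vy = _mm_loadu_ps(y + i);               // load 4 floats from y
    _mm_storeu_ps(y + i, _mm_add_ps(vx, vy));      // add element-wise, store 4 results
  }
}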

Multitasking, Multiprocessing, Multithreading

  • CPUs typically work on more than one task at a time:
    Display slides, check mail in background, install updates…
  • Preemptive multitasking: the OS switches tasks every few milliseconds (at 1 GHz, 1 ms = 1 million cycles).
  • Multiprocessing: several CPUs (or one CPU with several cores) execute several tasks in parallel.
  • Heterogeneous topology: different cores (using the same instruction set) inside one CPU:
    Performance and efficiency cores.
  • Multithreading: inside a single process, split execution in several threads which execute in parallel.
    Example mail client: one thread for the UI, one for talking to the server, one for search indexing.
    Because slow tasks run in the background, the UI remains responsive.
  • Simultaneous multithreading (SMT) aka Hyperthreading: one physical core pretends to be two cores,
    fetches instructions from two different threads and mixes the instructions in its instruction queue.
  • Multithreading allows each thread to work on a completely different task.
    But we can also use it to split one big task (image processing) among several cores.

Programming for Multithreading

  • Multiple threads must not write to the same memory address at the same time (bug).
  • Multiple threads should avoid writing to neighboring memory addresses at the same time (slow).
  • Execution speed can vary between threads.
    Interrupted by the OS or running on different types of cores.
  • Synchronizing data between threads is hard (requires mutexes and atomic variables).
    Better let different threads work on unrelated problems.
  • Recipe: divide your image into tiles and process each tile in a separate thread (see the sketch after this list).
  • Neighborhood filters require overlap between tiles to produce the correct output.
  • How many tiles? What is the best tile size and shape?
    • To keep all cores busy, you need many more tiles than cores
      (5-10 times more, to avoid waiting for the last tile).
    • Creating a thread is not free: only create threads that have a significant amount of work to do
      (one thread per pixel is a very bad idea).
    • Square tiles are best to minimize overlap (example: 256 × 256).
    • Horizontal bands are best to have consecutive memory access (example: 4000 × 16).
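A minimal sketch of the recipe above using std::thread and horizontal bands. process_band is a hypothetical per-band function, the bands do not overlap here (a neighborhood filter would need the overlap discussed above), and a production implementation would rather reuse a thread pool than spawn threads on every call.

#include <algorithm>
#include <thread>
#include <vector>

// Split a height x width single-channel image into horizontal bands and
// process each band in its own thread.
void process_image(float* img, int width, int height, int numBands,
                   void (*process_band)(float* img, int width, int firstRow, int lastRow)) {
  std::vector<std::thread> workers;
  const int rowsPerBand = (height + numBands - 1) / numBands;
  for (int b = 0; b < numBands; ++b) {
    const int firstRow = b * rowsPerBand;
    const int lastRow = std::min(height, firstRow + rowsPerBand);
    if (firstRow >= lastRow) break;                                // no rows left for this band
    workers.emplace_back(process_band, img, width, firstRow, lastRow);
  }
  for (std::thread& t : workers) t.join();                         // wait for the last band
}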

Memory Hierarchy

  • Size: number of bytes that can be stored.
  • Bandwidth: number of bytes that can be read per cycle.
  • Latency: number of cycles it takes to read or write one value.
  • Memory cannot be big and fast at the same time.
  • Gap between huge (but slow) main memory and superfast (but tiny) register file is too big.
  • Cache: hidden (from the programmer) intermediate memories with different size/speed trade-off.
  • Based on the assumption of locality:
    • Temporal locality: program frequently reads and writes the same address.
    • Spatial locality: program frequently reads and writes consecutive memory addresses.
  • Cache miss: occurs when the requested memory address is not found in a given cache level.
  • Stalled thread: thread waiting for data from memory.
  • SMT: cache is shared between two threads.

Memory Access Speed

Intel Core i7-9xxx
  • L1 cache: 64 Kbytes per physical core, 4 cycles
    • 32 Kbytes instruction cache + 32 Kbytes data cache
  • L2 cache: 256 Kbytes per physical core, 11 cycles
  • L3 cache: 8 Mbytes shared across cores, 39 cycles
  • RAM: 107 cycles.
A synthetic example

Programming for Cache

  • Spatial locality: easy.
    • Store pixels consecutively in memory.
    • Data is copied to cache by chunks of 64 bytes (cache line)
      → Wait for first pixel, get some more for free.
    • CPU detects consecutive access patterns and applies prefetching.
      → Wait for first pixel, get all others for free.
  • Temporal locality: hard.
    • Typical image processing operation (e.g., convolution) reads from one or more input tiles, applies some computation, and writes to an output tile. Typical algorithms consist of many operations.
    • Tile of 256 × 256 pixels, RGB, float (4 bytes per sample): 768 kB → does not fit in L2 cache.
    • Data must be fetched from L3 cache or even RAM every time.
      Shared between all cores → bottleneck, reduces the benefit of multithreading.
    • Use smaller tiles? Better for cache but less efficient (overlap).
    • Solution: row-based image processing.
      • Execute the algorithm line by line, applying each operation as early as possible.
      • Very efficient, but the code is hard to write (a minimal sketch follows the quote below).
It takes so long to get there [to main memory], your program will not run at any reasonable speed if you go to main memory. [Scott Meyers]
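A minimal sketch of the row-based idea, assuming two hypothetical per-pixel operations (a gain and an offset): instead of running each operation over the whole image, which evicts the data from cache between passes, both are applied to a row while it is still in L1/L2.

#include <cstddef>

// Row-based processing of a height x width single-channel image.
void process_rows(float* img, int width, int height, float gain, float offset) {
  for (int y = 0; y < height; ++y) {
    float* row = img + static_cast<std::size_t>(y) * width;
    for (int x = 0; x < width; ++x) row[x] *= gain;    // operation 1 on one row
    for (int x = 0; x < width; ++x) row[x] += offset;  // operation 2: the row is still cached
  }
}

Neighborhood filters complicate this scheme: each operation then needs a small ring buffer of the last few rows instead of a single row.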

Programmable Hardware Overview

Graphics Processing Units

Simplified Graphics Pipeline

  • Vertex shader transforms 3D positions of vertices into 2D on-screen coordinates.
    • Utilizes camera position, orientation, and intrinsic parameters (e.g., field of view).
    • May also compute lighting parameters (e.g., reflected light intensity and color) based on the light source positions and per-vertex normals.
  • Rasterizer samples fragments out of primitives (e.g., triangles) in the 2D coordinate space.
    • Interpolates issued vertex attributes, e.g., texture coordinates and lighting parameters.
  • Fragment shader uses the per-fragment interpolated attributes to determine pixel colors.
    • Usually, it samples a texture at given texture coordinates to obtain the color value.
    • May use lighting parameters to modulate the resulting color.

Programmable Shading Example: Lighting

GPU Compute Evolution: Back Then

*AlexNet was trained on two GTX 580 GPUs [Krizhevsky et al.]

GPU Compute Evolution: These Days

  • Quadro RTX 6000 (Turing architecture, 2018):
    • 16.3 TFLOPs single precision FP
    • 32.6 TFLOPs half precision FP
    • 130.5 TFLOPs half precision FP with TensorCores (new)
  • Tesla A100 (Ampere architecture, 2020):
    • 19.5 TFLOPs single precision FP
    • 156 TFLOPs single precision FP with TensorCores (new)
    • 312 TFLOPs half precision FP with TensorCores
  • Tesla H100 (Hopper architecture, 2022):
    • 48 TFLOPs single precision FP
    • 400 TFLOPs single precision FP with TensorCores
    • 800 TFLOPs half precision FP with TensorCores
    • 1600 TFLOPs 8-bit FP with TensorCores (new)

Programming and Memory Model

  • GPUs achieve high throughput through extensive parallelization, necessitating specific software design approaches.
  • Let us take a closer look at memory first.

A GPU From Inside (Nvidia Ampere)


Things to point out: PCIe interface, HBM2 (global memory), SMs with shared L2 cache, NVLink and MIG

Closer Look at Ampere Streaming Multiprocessor


Things to point out: instruction caches, L1 data cache, register file, compute units, warp schedulers.

Common GPU Programming Frameworks

  • CUDA (Compute Unified Device Architecture)
    • Parallel computing SDK for Nvidia GPUs
    • Utilizes a C/C++-like language and a compilation toolchain
    • Shipped with precompiled compute libraries (cuBLAS, cuDNN, cuFFT, etc.)
      • Closed-source, except CUTLASS
    • Primary compute backend for modern ML: PyTorch and TensorFlow heavily rely on cuDNN/cuBLAS
  • ROCm (Radeon Open Compute)
    • Parallel computing SDK for AMD GPUs
    • Offers multiple programming models: OpenCL, HIP and OpenMP
    • Open-source
    • Officially supports a subset of the AMD GPU product line

Common GPU Programming Frameworks

  • OpenCL (Apple → Khronos Group)
    • Parallel computing software stack used to develop programs using a C/C++-like language
    • Usable across a wide hardware spectrum, including GPUs, CPUs, FPGAs, and various accelerators.
  • GLSL (OpenGL Shading Language, Khronos Group)
    • C-like programming language supported by a vast range of graphics hardware.
    • Initially designed for computer graphics, it now enables general-purpose parallel programming (compute shaders).
      • Compute shaders are not universally available across all platforms.
    • Requires a host framework (OpenGL, Vulkan, WebGL) for operation.
    • Typically compiled at runtime by the GPU driver into GPU-specific machine code.

SIMT: Single Instruction Multiple Threads

  • SIMT is a parallel program execution model GPUs are built upon.
  • It provides a hardware abstraction convenient for both programmers and hardware manufacturers to decouple the program source code and device capabilities in a scalable way.
  • The programs (shaders, kernels) are written as if they handled a single data entry throughout their lifecycle.
    • Typical granularity: 1 thread per pixel.
  • The hardware and its driver take care of threading.
    • To specify the amount of work, threads are grouped into blocks (workgroups), which form a grid.
    • Threads in the same block can efficiently exchange information: they have access to the same low-latency shared memory space.
      • In practice, the hardware runs them closely in time and memory.
    • The grid configuration is set when launching the program.
  • At runtime, the same instruction is issued in multiple threads, potentially running in parallel.

CUDA Kernel Example


#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

__global__ void saxpy(int n, float a, float *x, float *y) {                   // <-- The kernel function executed on GPU
  int i = blockIdx.x * blockDim.x + threadIdx.x;                              // Find out current thread position
  if (i < n) y[i] = a * x[i] + y[i];                                          // Compute the result
}

int main(void) {
  const int N = 1 << 20;
  std::vector<float> x(N), y(N);                                              // Declare storage for test data (CPU memory)

  for (int i = 0; i < N; i++) {                                               // Fill test data buffers with some numbers (in CPU memory)
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  float *d_x, *d_y;                                                           // Declare pointers in GPU memory
  cudaMalloc(&d_x, N * sizeof(float));                                        // Allocate GPU buffer to store x
  cudaMalloc(&d_y, N * sizeof(float));                                        // Allocate GPU buffer to store y

  cudaMemcpy(d_x, x.data(), N * sizeof(float), cudaMemcpyHostToDevice);       // Copy x from CPU memory to GPU
  cudaMemcpy(d_y, y.data(), N * sizeof(float), cudaMemcpyHostToDevice);       // Copy y from CPU memory to GPU

  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);                             // Launch the kernel, one thread per element, 256 threads per block

  cudaMemcpy(y.data(), d_y, N * sizeof(float), cudaMemcpyDeviceToHost);       // Copy the result back from GPU

  float maxError = 0.0f;                                                      // Check the result
  for (int i = 0; i < N; i++)
    maxError = std::max(maxError, std::abs(y[i] - 4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);                                                              // Free the GPU buffer
  cudaFree(d_y);                                                              // Free the GPU buffer
}
    

Occupancy

  • Performance is conditioned by the number of threads effectively running in parallel. Limiting factors include the following (a runtime occupancy query is sketched after this list):
    • Register pressure: amount of register memory required per thread vs the hardware capacity
      • Modern CUDA: 255 regular registers per thread max
      • Register spilling: if the register file size is insufficient to store all registers, data is stored in global memory.
      • Not all SIMT implementations allow register spilling, and when it occurs it often causes a dramatic slowdown.
    • Amount of shared memory per thread block vs the hardware capacity
      • CUDA: 48 to 163 Kbytes depending on the CUDA Compute Capability
      • OpenGL ES 3.1 Compute Shader: 16 Kbytes as the minimum required by the standard
    • SM capabilities (e.g., max number of warps per SM)
      • Warp (CUDA) is a group of 32 consecutive threads in a block.
      • CUDA: 1024 threads per thread block max
      • OpenGL ES 3.1 Compute Shader: 128 threads per workgroup as the minimum required by the standard
    • Total number of SMs
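A minimal sketch of checking these limits at runtime with the CUDA occupancy API, reusing the saxpy kernel from the earlier example (the exact limits depend on the Compute Capability of the device).

#include <cstdio>

__global__ void saxpy(int n, float a, float* x, float* y) {        // kernel from the earlier example
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

void report_occupancy(int blockSize) {
  int blocksPerSM = 0;                                              // blocks of this size resident per SM
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy, blockSize, 0 /* dynamic shared memory */);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  float activeWarps = blocksPerSM * blockSize / float(prop.warpSize);
  float maxWarps = prop.maxThreadsPerMultiProcessor / float(prop.warpSize);
  printf("Block size %d: %d blocks/SM, occupancy %.0f%%\n",
         blockSize, blocksPerSM, 100.0f * activeWarps / maxWarps);
}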

Divergent threads


  // ...
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (mat[i] > 0) {
    // Branch A
  }
  else {
    // Branch B
  }
  // ...
    
  • If the condition mat[i] > 0 (or its opposite) holds for the entire warp, there is mostly no impact on performance.
  • If the condition mat[i] > 0 holds for some threads but not for others within the same warp, both branches A and B will be executed under an "active threads mask", resulting in a performance penalty.
→ Common recommendation: align data-dependent execution paths with warp boundaries (illustrated in the sketch below).
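A sketch of that recommendation under a strong assumption: the host has pre-partitioned the data so that the sign of mat[i] is uniform within each 32-thread warp and passes one precomputed flag per warp (warpIsPositive is hypothetical). The condition is then identical for the whole warp, so no active-threads mask is needed; whether the reordering pays off depends on the cost of the branches.

__global__ void process(const float* mat, const bool* warpIsPositive, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (warpIsPositive[i / warpSize]) {     // warpSize is a built-in device constant (32 on current Nvidia GPUs)
    out[i] = mat[i];                      // Branch A: taken by every thread of the warp
  } else {
    out[i] = -mat[i];                     // Branch B: taken by every thread of the warp
  }
}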

Coalesced vs Non-coalesced Access

  • Consider a CUDA kernel going through a big square matrix of size N × N.
Version 1:

__global__ void matrix_traversal(float *mat, int N) {
  int col = blockIdx.x * 32 + threadIdx.x;
  int row = blockIdx.y * 32 + threadIdx.y;
  float element = mat[col * N + row];
  // ...
}
        
Version 2:

__global__ void matrix_traversal(float *mat, int N) {
  int col = blockIdx.x * 32 + threadIdx.x;
  int row = blockIdx.y * 32 + threadIdx.y;
  float element = mat[row * N + col];
  // ...
}
        

Kernel launch code:


      const dim3 threads(32, 32);       // processing the matrix by tiles of 32*32 elements
      const dim3 blocks(N/32, N/32);    // assuming N is a multiple of 32
      matrix_traversal<<< blocks, threads >>>(matrixDataDevicePtr, N);
    

Question: which implementation is more efficient and why?

Coalesced vs Non-coalesced Access

Version 1:

__global__ void matrix_traversal(float *mat, int N) {
  int col = blockIdx.x * 32 + threadIdx.x;
  int row = blockIdx.y * 32 + threadIdx.y;
  float element = mat[col * N + row];       //<---- LOAD
  // ...
}
        
Version 2:

__global__ void matrix_traversal(float *mat, int N) {
  int col = blockIdx.x * 32 + threadIdx.x;
  int row = blockIdx.y * 32 + threadIdx.y;
  float element = mat[row * N + col];       //<---- LOAD
  // ...
}
        
  • Nvidia GPU cache line size: 128 bytes (a hardware constant)
  • Global memory access is serialized.

  • Version 1 (mat[col * N + row]):
    • Relative byte addresses per thread within a warp: [0, 4N, 8N, ..., 124N]
    • For large N, every address falls into a different cache line
    • 32 cache lines to load per warp → slow
  • Version 2 (mat[row * N + col]):
    • Relative byte addresses per thread within a warp: [0, 4, 8, ..., 124]
    • All addresses fall into the same cache line
    • 1 cache line to load per warp → fast

Non-coalesced Access: What to Do?

  • It is not always possible to change the way the memory is accessed.
  • A common transposition trick using shared memory:
    • Reading tiles in a coalesced manner and storing in a fast shared memory buffer
    • The transposition is done when reading from it.
    • Every thread receives the same value as before.

Version 1 upgrade:

__global__ void matrix_traversal(float *mat, int N) {
  __shared__ float buffer[32 * 32];
  int startCol = blockIdx.x * 32;
  int startRow = blockIdx.y * 32;

  int i = (startCol + threadIdx.y) * N + (startRow + threadIdx.x);
  buffer[threadIdx.y * 32 + threadIdx.x] = mat[i];    // now coalesced
  __syncthreads();                                    // waiting for all threads in the same block to pass through
  float element = buffer[threadIdx.x * 32 + threadIdx.y];
  // ...
}
    

... and there is a lot more.

Programmable Hardware Overview

Digital Signal Processors

DSP: Overview (Hexagon V67 Example)

  • Implements a specific SIMD instruction set
    • VLIW (Very Long Instruction Word): independent instructions are grouped into packets at compile time and executed in parallel.
  • May run multiple hardware threads
  • Handles integer, fixed- and floating-point compute, as well as special data types (e.g. complex numbers)
  • Instruction set includes "special-purpose" application-specific instructions
    • Video coding
    • Software-defined radio
    • Checksum computation
    • ...
  • Building programs requires a specific SDK

Benchmarks?

  • Public vanilla FLOPs benchmarks are hard to find.
  • From an ML perspective: AI Benchmark or the MobileNet SSD SNPE sample
    • The DSP offers considerably more ML throughput than the GPU and CPU on the same SoC.
    • It is essentially a fixed-point machine though: it requires quantized models.
    • Usable via the Qualcomm Snapdragon Neural Processing Engine (SNPE) SDK and Android NN (e.g., through TensorFlow Lite)

Preparing ML models for production

Production constraints

  • Another common situation: the target hardware is fixed-point.
  • It requires quantized NN weights and applies quantization to activations with a given number of bits per entry,
    • up to a linear transformation,
    • on a per-layer or per-channel basis.
  • Example: Hexagon DSP, Snapdragon 8 Gen 1 (2021), the best you can get from SNPE:
    • 16 bits per activation value
    • 8 bits per weight value
      • Insufficient for raw image restoration algorithms
    • 32 bits per bias value
  • Fixed-point model is a strict constraint in this example.
  • In general, fixed-point models are faster (at inference time), have reduced inference memory footprint and take less storage.

Reminder: Floating Point vs Fixed Point

  • Floating point:
    • Dedicated compute units
    • Effectively unbounded range (+Inf and -Inf are valid values)
    • Higher accuracy around zero
    • Division by zero is allowed
    • NaN
  • Fixed point:
    • Integer compute, bit shift operations (see the sketch after this list)
    • Finite range
    • Uniform accuracy
    • Division by zero leads to an exception
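A minimal sketch of what "integer compute with bit shifts" means, using the common Q15 format (16-bit values with 15 fractional bits, representing [-1, 1)):

#include <cstdint>

// Fixed-point (Q15) multiplication: integer multiply, then shift the extra
// 15 fractional bits away. Real code rounds instead of truncating and
// saturates on overflow.
int16_t q15_mul(int16_t a, int16_t b) {
  int32_t wide = int32_t(a) * int32_t(b);   // result has 30 fractional bits
  return int16_t(wide >> 15);               // back to 15 fractional bits
}

// Example: 0.5 * 0.25 -> q15_mul(16384, 8192) == 4096, i.e. 4096 / 32768 = 0.125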

Tensor quantization

Quantization of a tensor: computing its approximate integer-valued representation in a given range.

Example: FP32 tensor → INT8 tensor, using the tensor's min and max values (see the sketch below)

  • Linear operations (namely, convolution) can be computed using integer arithmetic. The result is then rescaled according to the min and max values of the input and the kernel.
  • Non-linear activation functions require specific implementations (usually straightforward).
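A minimal sketch of the FP32 → INT8 example above (asymmetric/affine quantization from the tensor's min and max; real frameworks differ in details such as per-channel parameters or forcing the real value 0 to map to an exact integer level):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
  std::vector<int8_t> values;
  float scale;       // real-valued step between two consecutive integer levels
  int zeroPoint;     // integer level that represents the real value 0
};

// Map [minVal, maxVal] linearly onto [-128, 127].
QuantizedTensor quantize(const std::vector<float>& x, float minVal, float maxVal) {
  QuantizedTensor q;
  q.scale = (maxVal - minVal) / 255.0f;
  q.zeroPoint = int(std::round(-128.0f - minVal / q.scale));
  q.values.reserve(x.size());
  for (float v : x) {
    int level = int(std::round(v / q.scale)) + q.zeroPoint;
    q.values.push_back(int8_t(std::clamp(level, -128, 127)));
  }
  return q;
}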

Model quantization

  • Quantization: producing fixed-point model parameters while keeping inference accuracy.
  • Post-training quantization: adjusting parameters of an already trained model.
    • Parameters: quantized directly, as any tensor.
    • Activations: min and max are calibrated by running inference on a representative dataset and collecting statistics (see the sketch after this list).
  • No need to retrain the model
  • Data dependency: requires a representative dataset to calibrate the quantization parameters
  • Limited: may lead to poor accuracy
  • Advanced approaches exist, e.g., cross-layer equalization for models using ReLU [Nagel et al., 2018]
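A minimal sketch of the calibration step, assuming we can observe each activation tensor during inference on the representative dataset (real calibrators often clip the range using percentiles or entropy criteria rather than the raw min/max):

#include <algorithm>
#include <limits>
#include <vector>

// Tracks the observed range of one activation tensor across the calibration set;
// the resulting [minVal, maxVal] defines its quantization parameters as on the
// previous slides.
struct RangeObserver {
  float minVal = std::numeric_limits<float>::max();
  float maxVal = std::numeric_limits<float>::lowest();

  void observe(const std::vector<float>& activation) {
    for (float v : activation) {
      minVal = std::min(minVal, v);
      maxVal = std::max(maxVal, v);
    }
  }
};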

Model quantization

  • Quantization: producing fixed-point model parameters while keeping inference accuracy.
  • Quantization-aware training (QAT): introducing quantization during the training.
    • Parameters: using quantized representation during the forward pass. The optimizer updates the original (floating-point) version after the backward pass.
    • Activations:
      • Forward pass: applying quantization as an additional activation function.
      • Backward pass: the rounding is treated as the identity (straight-through estimator); gradients are only zeroed for entries falling outside the valid range.
        • Quantization acting as an activation function has zero gradient almost everywhere, so it cannot be backpropagated through directly (see the sketch below).
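A minimal sketch of the fake-quantization function used in the QAT forward pass (framework-agnostic, assuming a fixed [minVal, maxVal] range and 256 levels): the value is quantized and immediately dequantized, so the forward pass sees the rounding error while everything stays in floating point. In the backward pass the rounding is treated as the identity, and the gradient is zeroed only where the input was clamped.

#include <algorithm>
#include <cmath>

float fake_quantize(float x, float minVal, float maxVal, int levels = 256) {
  const float scale = (maxVal - minVal) / float(levels - 1);
  const float clamped = std::clamp(x, minVal, maxVal);          // out-of-range entries get zero gradient
  const float level = std::round((clamped - minVal) / scale);   // integer level, as at inference time
  return minVal + level * scale;                                // back to floating point
}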