Imaging and ML software from a low-level programming perspective

Maxim Karpushin

Wolf Hauser

Imaging Software in Production: Context

ML in Production: Market Dynamics

Outline

  1. Introduction
  2. Programmable hardware overview
    • Central Processing Units (CPU)
      • x86 vs ARM
      • Memory model
      • Caching, SIMD, multithreading
    • Graphics Processing Units (GPU)
      • Graphics pipeline
      • SIMT
      • CUDA samples
    • Digital Signal Processors (DSP)
      • Quick overview
  3. Preparing ML models for production
  4. Conclusion

Programmable Hardware Overview

Vanilla server/desktop system

Mobile SoC (System on Chip)

Example: Qualcomm® Snapdragon™ 8+ Gen 1 (2022)

Programmable Hardware Overview

Central Processing Units

Inside a Modern CPU

  • Inside each CPU core
    • Instruction fetcher
    • Instruction decoder
    • Instruction queue (for out-of-order execution)
    • Execution units (ALU, FPU, memory R/W, branch, etc.)
    • Branch prediction
    • Register file (super fast memory, holds operands and results of arithmetic operations)
  • Several instances of all the above: multicore architecture
  • Cache (fast memory, hidden from the programmer)
Intel Skylake (2015)

Common Architectures

Family                          x86(_64)                                    ARM
Instruction set classification  Complex Instruction Set Computer (CISC)     Reduced Instruction Set Computer (RISC)
Usage                           Mainly desktop/data center;                 Mainly mobile;
                                weak mobile market penetration.             but: Apple Silicon, Nvidia Grace.
Key brand names                 Intel, AMD                                  ARM, Apple, Samsung, Qualcomm, ...
Word size                       32 or 64 bits                               32 or 64 bits
Vector extensions (SIMD)        SSE (128 b), AVX (256 b), AVX-512           NEON (128 b), SVE (variable)
Simultaneous multithreading     Yes: commonly 2 logical cores               No until recently
(aka "Hyper-Threading")         per physical core                           (Cortex-A65 is the first ARM core with SMT)

Other architectures exist. To keep an eye on: RISC-V.

Execution Speed

  • Clock cycle: atomic value of elapsed time in a digital circuit
    (0.2 ns at 5 GHz to 1 ns at 1 GHz).
  • Decoder width: number of instructions decoded simultaneously
    (1 for very simple CPU, 4 for Intel/AMD, up to 8 in Apple Silicon).
  • Latency: cycles it takes before the result of an instruction is available for use in other instructions
    (1 for OR/ADD, 3 or 4 for MUL, many for DIV (12 on Intel Skylake)).
  • Throughput: Average instructions per cycle
    (0.05 to 6 for a single instruction type, increases for good instruction mix).
  • Instruction latencies and throughputs are context-dependent but measurable.
  • Multiply-accumulate (MAC) is typically a single three-operand instruction and is faster than
    a multiplication followed by an addition (see the sketch below).
  • Complex floating-point math instructions (e.g. sqrt) may have content-dependent latency.
Unofficial but widely recognized instruction-timing references exist for x86 and ARM.
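A minimal sketch of the MAC idea in portable C++: std::fma maps to a fused multiply-add instruction on hardware that has one (whether the compiler actually emits it depends on the target and compiler flags).

#include <cmath>
#include <cstddef>

// Dot product accumulated with fused multiply-add: one MAC per element,
// instead of a separate multiplication and addition.
float dot(const float* x, const float* y, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    acc = std::fma(x[i], y[i], acc);   // acc = x[i] * y[i] + acc
  return acc;
}

Note the loop-carried dependency on acc: the loop advances at one MAC per instruction latency, not per throughput; splitting the sum over several independent accumulators exposes more instruction-level parallelism.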

Single Instruction Multiple Data

  • Signal and image processing: apply the same instructions to many data samples.
  • Problem: decoding the same instructions over and over again is a waste of resources.
  • Solution: apply each instruction to several data samples.
  • Vertical vector instruction
  • Horizontal vector instruction
  • Operate on vectors stored in a dedicated register file
    • SSE, NEON: 128 bits = 2 × double or int64 / 4 × float or int32 / 8 × int16 / 16 × int8
    • AVX: 256 bits = 4 × double or int64 / 8 × float or int32 / 16 × int16 / 32 × int8
    • AVX-512: 512 bits = 8 × double or int64 / 16 × float or int32 / 32 × int16 / 64 × int8
    • SVE: Register size is implementation dependent

Programming for SIMD

  • Control flow is the same for all samples (no if/else)!
    → Compute both, then choose.
  • Cannot use recursion.
    → Use a different algorithm.
  • Sample number must be a multiple of vector size.
    → Round up in memory allocation, process a few useless samples.
  • Vector elements must/should be contiguous in memory.
    → Prefer structure of arrays over array of structures.
  • Compilers are bad at vectorizing (and programmers are bad at writing vectorizable code).
    → Use intrinsic functions to force both programmers and compilers.
  • Along which dimension should we vectorize an image?
    • Channels: with 4-element vectors and RGB images, introduce a useless 4th channel (e.g., opacity).
      → Store images interleaved as RGBA|RGBA|…
    • Rows: more flexible (works for any number of channels and any vector size).
      → For RGB images, store each channel in its own plane (planar layout).
    • Columns: bad idea (pixels of a column are not stored contiguously).
See the intrinsics references for x86 and ARM; a minimal example is sketched below.
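A minimal vectorization sketch with x86 SSE intrinsics (the NEON equivalents in <arm_neon.h> follow the same pattern); it assumes n is a multiple of 4, as recommended above.

#include <immintrin.h>

// Vertical vector add: y[i] += x[i], 4 float samples per instruction.
void add_inplace(const float* x, float* y, int n) {
  for (int i = 0; i < n; i += 4) {                 // n rounded up to a multiple of 4 at allocation time
    __m128 vx = _mm_loadu_ps(x + i);               // load 4 floats from x
    __m128 vy = _mm_loadu_ps(y + i);               // load 4 floats from y
    _mm_storeu_ps(y + i, _mm_add_ps(vx, vy));      // add element-wise, store 4 results
  }
}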

Multitasking, Multiprocessing, Multithreading

  • CPUs typically work on more than one task at a time:
    Display slides, check mail in background, install updates…
  • Preemptive multitasking: the OS switches tasks every few milliseconds (at 1 GHz, 1 ms = 1 million cycles).
  • Multiprocessing: several CPUs (or one CPU with several cores) execute several tasks in parallel.
  • Heterogeneous topology: different cores (using the same instruction set) inside one CPU:
    Performance and efficiency cores.
  • Multithreading: inside a single process, split execution in several threads which execute in parallel.
    Example mail client: one thread for the UI, one for talking to the server, one for search indexing.
    Because slow tasks run in the background, the UI remains responsive.
  • Simultaneous multithreading (SMT) aka Hyperthreading: one physical core pretends to be two cores,
    fetches instructions from two different threads and mixes the instructions in its instruction queue.
  • Multithreading allows each thread to work on a completely different task.
    But we can also use it to split one big task (image processing) among several cores.

Programming for Multithreading

  • Multiple threads must not write to the same memory address at the same time (bug).
  • Multiple threads should avoid writing to neighboring memory addresses at the same time (slow).
  • Execution speed can vary between threads.
    Interrupted by the OS or running on different types of cores.
  • Synchronizing data between threads is hard (requires mutexes and atomic variables).
    Better let different threads work on unrelated problems.
  • Recipe: divide your image into tiles and process each tile in a separate thread (see the sketch after this list).
  • Neighborhood filters require overlap between tiles to produce the correct output.
  • How many tiles? What is the best tile size and shape?
    • To keep all cores busy, you need many more tiles than cores
      (5-10 times more, to avoid waiting for the last tile).
    • Creating a thread is not free: only create threads that have a significant amount of work to do
      (one thread per pixel is a very bad idea).
    • Square tiles are best to minimize overlap (example: 256 × 256).
    • Horizontal bands are best to have consecutive memory access (example: 4000 × 16).
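A minimal sketch of the recipe above using std::thread and horizontal bands. process_band is a hypothetical per-band function, the bands do not overlap here (a neighborhood filter would need the overlap discussed above), and a production implementation would rather reuse a thread pool than spawn threads on every call.

#include <algorithm>
#include <thread>
#include <vector>

// Split a height x width single-channel image into horizontal bands and
// process each band in its own thread.
void process_image(float* img, int width, int height, int numBands,
                   void (*process_band)(float* img, int width, int firstRow, int lastRow)) {
  std::vector<std::thread> workers;
  const int rowsPerBand = (height + numBands - 1) / numBands;
  for (int b = 0; b < numBands; ++b) {
    const int firstRow = b * rowsPerBand;
    const int lastRow = std::min(height, firstRow + rowsPerBand);
    if (firstRow >= lastRow) break;                                // no rows left for this band
    workers.emplace_back(process_band, img, width, firstRow, lastRow);
  }
  for (std::thread& t : workers) t.join();                         // wait for the last band
}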

Memory Hierarchy

  • Size: number of bytes that can be stored.
  • Bandwidth: number of bytes that can be read per cycle.
  • Latency: number of cycles it takes to read or write one value.
  • Memory cannot be big and fast at the same time.
  • Gap between huge (but slow) main memory and superfast (but tiny) register file is too big.
  • Cache: hidden (from the programmer) intermediate memories with different size/speed trade-off.
  • Based on the assumption of locality:
    • Temporal locality: program frequently reads and writes the same address.
    • Spatial locality: program frequently reads and writes consecutive memory addresses.
  • Cache miss: occurs when the requested memory address is not found in a given cache level.
  • Stalled thread: thread waiting for data from memory.
  • SMT: cache is shared between two threads.

Memory Access Speed

Intel Core i7-9xxx
  • L1 cache: 64 Kbytes per physical core, 4 cycles
    • 32 Kbytes instruction cache + 32 Kbytes data cache
  • L2 cache: 256 Kbytes per physical core, 11 cycles
  • L3 cache: 8 Mbytes shared across cores, 39 cycles
  • RAM: 107 cycles.
A synthetic example

Programming for Cache

  • Spatial locality: easy.
    • Store pixels consecutively in memory.
    • Data is copied to cache by chunks of 64 bytes (cache line)
      → Wait for first pixel, get some more for free.
    • CPU detects consecutive access patterns and applies prefetching.
      → Wait for first pixel, get all others for free.
  • Temporal locality: hard.
    • Typical image processing operation (e.g., convolution) reads from one or more input tiles, applies some computation, and writes to an output tile. Typical algorithms consist of many operations.
    • Tile of 256 × 256 pixels, RGB, float (4 bytes per sample): 768 kB → does not fit in L2 cache.
    • Data must be fetched from L3 cache or even RAM every time.
      Shared between all cores → bottleneck, reduces the benefit of multithreading.
    • Use smaller tiles? Better for cache but less efficient (overlap).
    • Solution: row-based image processing.
      • Execute the algorithm line by line, applying each operation as early as possible.
      • Very efficient, but the code is hard to write (a minimal sketch follows the quote below).
It takes so long to get there [to main memory], your program will not run at any reasonable speed if you go to main memory. [Scott Meyers]
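A minimal sketch of the row-based idea, assuming two hypothetical per-pixel operations (a gain and an offset): instead of running each operation over the whole image, which evicts the data from cache between passes, both are applied to a row while it is still in L1/L2.

#include <cstddef>

// Row-based processing of a height x width single-channel image.
void process_rows(float* img, int width, int height, float gain, float offset) {
  for (int y = 0; y < height; ++y) {
    float* row = img + static_cast<std::size_t>(y) * width;
    for (int x = 0; x < width; ++x) row[x] *= gain;    // operation 1 on one row
    for (int x = 0; x < width; ++x) row[x] += offset;  // operation 2: the row is still cached
  }
}

Neighborhood filters complicate this scheme: each operation then needs a small ring buffer of the last few rows instead of a single row.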

Programmable Hardware Overview

Graphics Processing Units

Simplified Graphics Pipeline

  • Vertex shader transforms 3D positions of vertices into 2D on-screen coordinates.
    • Utilizes camera position, orientation, and intrinsic parameters (e.g., field of view).
    • May also compute lighting parameters (e.g., reflected light intensity and color) based on the light source positions and per-vertex normals.
  • Rasterizer samples fragments out of primitives (e.g., triangles) in the 2D coordinate space.
    • Interpolates issued vertex attributes, e.g., texture coordinates and lighting parameters.
  • Fragment shader uses the per-fragment interpolated attributes to determine pixel colors.
    • Usually, it samples a texture at given texture coordinates to obtain the color value.
    • May use lighting parameters to modulate the resulting color.

Programmable Shading Example: Lighting

GPU Compute Evolution: Back Then

*AlexNet was trained on two GTX 580 GPUs [Krizhevsky et al.]

GPU Compute Evolution: These Days

  • Quadro RTX 6000 (Turing architecture, 2018):
    • 16.3 TFLOPs single precision FP
    • 32.6 TFLOPs half precision FP
    • 130.5 TFLOPs half precision FP with TensorCores (new)
  • Tesla A100 (Ampere architecture, 2020):
    • 19.5 TFLOPs single precision FP
    • 156 TFLOPs single precision FP with TensorCores (new)
    • 312 TFLOPs half precision FP with TensorCores
  • Tesla H100 (Hopper architecture, 2022):
    • 48 TFLOPs single precision FP
    • 400 TFLOPs single precision FP with TensorCores
    • 800 TFLOPs half precision FP with TensorCores
    • 1600 TFLOPs 8-bit FP with TensorCores (new)

Programming and Memory Model

  • GPUs achieve high throughput through extensive parallelization, necessitating specific software design approaches.
  • Let us take a closer look at memory first.

A GPU From Inside (Nvidia Ampere)


Things to point out: PCIe interface, HBM2 (global memory), SMs with shared L2 cache, NVLink and MIG

Closer Look at Ampere Streaming Multiprocessor


Things to point out: instruction caches, L1 data cache, register file, compute units, warp schedulers.

Common GPU Programming Frameworks

  • CUDA (Compute Unified Device Architecture)
    • Parallel computing SDK for Nvidia GPUs
    • Utilizes a C/C++-like language and a compilation toolchain
    • Shipped with precompiled compute libraries (cuBLAS, cuDNN, cuFFT, etc.)
      • Closed-source, except CUTLASS
    • Primary compute backend for modern ML: PyTorch and TensorFlow heavily rely on cuDNN/cuBLAS
  • ROCm (Radeon Open Compute)
    • Parallel computing SDK for AMD GPUs
    • Offers multiple programming models: OpenCL, HIP and OpenMP
    • Open-source
    • Officially supports a subset of the AMD GPU product line

Common GPU Programming Frameworks

  • OpenCL (Apple → Khronos Group)
    • Parallel computing software stack used to develop programs using a C/C++-like language
    • Usable across a wide hardware spectrum, including GPUs, CPUs, FPGAs, and various accelerators.
  • GLSL (OpenGL Shading Language, Khronos Group)
    • C-like programming language supported by a vast range of graphics hardware.
    • Initially designed for computer graphics, it now enables general-purpose parallel programming (compute shaders).
      • Compute shaders are not universally available across all platforms.
    • Requires a host framework (OpenGL, Vulkan, WebGL) for operation.
    • Typically compiled at runtime by the GPU driver into GPU-specific machine code.

SIMT: Single Instruction Multiple Threads

  • SIMT is a parallel program execution model GPUs are built upon.
  • It provides a hardware abstraction convenient for both programmers and hardware manufacturers to decouple the program source code and device capabilities in a scalable way.
  • The programs (shaders, kernels) are written as if they handled a single data entry throughout their lifecycle.
    • Typical granularity: 1 thread per pixel.
  • The hardware and its driver take care of threading.
    • To specify the amount of work, threads are grouped into blocks (workgroups), which form a grid.
    • Threads in the same block can efficiently exchange information: they have access to the same low-latency shared memory space.
      • In practice, the hardware runs them closely in time and memory.
    • The grid configuration is set when launching the program.
  • At runtime, the same instruction is issued in multiple threads, potentially running in parallel.

CUDA Kernel Example


#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

__global__ void saxpy(int n, float a, float *x, float *y) {                   // <-- The kernel function executed on GPU
  int i = blockIdx.x * blockDim.x + threadIdx.x;                              // Find out current thread position
  if (i < n) y[i] = a * x[i] + y[i];                                          // Compute the result
}

int main(void) {
  const int N = 1 << 20;
  std::vector<float> x(N), y(N);                                              // Declare storage for test data (CPU memory)

  for (int i = 0; i < N; i++) {                                               // Fill test data buffers with some numbers (in CPU memory)
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  float *d_x, *d_y;                                                           // Declare pointers in GPU memory
  cudaMalloc(&d_x, N * sizeof(float));                                        // Allocate GPU buffer to store x
  cudaMalloc(&d_y, N * sizeof(float));                                        // Allocate GPU buffer to store y

  cudaMemcpy(d_x, x.data(), N * sizeof(float), cudaMemcpyHostToDevice);       // Copy x from CPU memory to GPU
  cudaMemcpy(d_y, y.data(), N * sizeof(float), cudaMemcpyHostToDevice);       // Copy y from CPU memory to GPU

  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);                             // Launch the kernel, one thread per element, 256 threads per block

  cudaMemcpy(y.data(), d_y, N * sizeof(float), cudaMemcpyDeviceToHost);       // Copy the result back from GPU

  float maxError = 0.0f;                                                      // Check the result
  for (int i = 0; i < N; i++)
    maxError = std::max(maxError, std::abs(y[i] - 4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);                                                              // Free the GPU buffer
  cudaFree(d_y);                                                              // Free the GPU buffer
}
    

Occupancy

  • Performance is conditioned by the number of threads effectively running in parallel. Limiting factors include the following (a runtime occupancy query is sketched after this list):
    • Register pressure: amount of register memory required per thread vs the hardware capacity
      • Modern CUDA: 255 regular registers per thread max
      • Register spilling: if the register file size is insufficient to store all registers, data is stored in global memory.
      • Not all SIMT implementations allow register spilling, and when it occurs it often causes a dramatic slowdown.
    • Amount of shared memory per thread block vs the hardware capacity
      • CUDA: 48 to 163 Kbytes depending on the CUDA Compute Capability
      • OpenGL ES 3.1 Compute Shader: 16 Kbytes as the minimum required by the standard
    • SM capabilities (e.g., max number of warps per SM)
      • Warp (CUDA) is a group of 32 consecutive threads in a block.
      • CUDA: 1024 threads per thread block max
      • OpenGL ES 3.1 Compute Shader: 128 threads per workgroup as the minimum required by the standard
    • Total number of SMs
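A minimal sketch of checking these limits at runtime with the CUDA occupancy API, reusing the saxpy kernel from the earlier example (the exact limits depend on the Compute Capability of the device).

#include <cstdio>

__global__ void saxpy(int n, float a, float* x, float* y) {        // kernel from the earlier example
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

void report_occupancy(int blockSize) {
  int blocksPerSM = 0;                                              // blocks of this size resident per SM
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy, blockSize, 0 /* dynamic shared memory */);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  float activeWarps = blocksPerSM * blockSize / float(prop.warpSize);
  float maxWarps = prop.maxThreadsPerMultiProcessor / float(prop.warpSize);
  printf("Block size %d: %d blocks/SM, occupancy %.0f%%\n",
         blockSize, blocksPerSM, 100.0f * activeWarps / maxWarps);
}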

Divergent threads


  // ...
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (mat[i] > 0) {
    // Branch A
  }
  else {
    // Branch B
  }
  // ...
    
  • If the condition mat[i] > 0 (or its opposite) holds for the entire warp, there is mostly no impact on performance.
  • If the condition mat[i] > 0 holds for some threads but not for others within the same warp, both branches A and B will be executed under an "active threads mask", resulting in a performance penalty.
→ Common recommendation: align data-dependent execution paths with warp boundaries (illustrated in the sketch below).
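A sketch of that recommendation under a strong assumption: the host has pre-partitioned the data so that the sign of mat[i] is uniform within each 32-thread warp and passes one precomputed flag per warp (warpIsPositive is hypothetical). The condition is then identical for the whole warp, so no active-threads mask is needed; whether the reordering pays off depends on the cost of the branches.

__global__ void process(const float* mat, const bool* warpIsPositive, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (warpIsPositive[i / warpSize]) {     // warpSize is a built-in device constant (32 on current Nvidia GPUs)
    out[i] = mat[i];                      // Branch A: taken by every thread of the warp
  } else {
    out[i] = -mat[i];                     // Branch B: taken by every thread of the warp
  }
}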

Coalesced vs Non-coalesced Access

  • Consider a CUDA kernel going through a big square matrix of size N × N.
Version 1:

__global__ void matrix_traversal(float *mat, int N) {
  int col = blockIdx.x * 32 + threadIdx.x;
  int row = blockIdx.y * 32 + threadIdx.y;
  float element = mat[col * N + row];
  // ...
}
        
Version 2:

__global__ void matrix_traversal(float *mat, int N) {
  int col = blockIdx.x * 32 + threadIdx.x;
  int row = blockIdx.y * 32 + threadIdx.y;
  float element = mat[row * N + col];
  // ...
}
        

Kernel launch code:


      const dim3 threads(32, 32);       // processing the matrix by tiles of 32*32 elements
      const dim3 blocks(N/32, N/32);    // assuming N is a multiple of 32
      matrix_traversal<<< blocks, threads >>>(matrixDataDevicePtr, N);
    

Question: which implementation is more efficient and why?

Coalesced vs Non-coalesced Access

Version 1:

__global__ void matrix_traversal(float *mat, int N) {
  int col = blockIdx.x * 32 + threadIdx.x;
  int row = blockIdx.y * 32 + threadIdx.y;
  float element = mat[col * N + row];       //<---- LOAD
  // ...
}
        
Version 2:

__global__ void matrix_traversal(float *mat, int N) {
  int col = blockIdx.x * 32 + threadIdx.x;
  int row = blockIdx.y * 32 + threadIdx.y;
  float element = mat[row * N + col];       //<---- LOAD
  // ...
}
        
  • Nvidia GPU cache line size: 128 bytes (a hardware constant)
  • Global memory access is serialized.

  • Version 1 (mat[col * N + row]):
    • Relative byte addresses per thread within a warp: [0, 4N, 8N, ..., 124N]
    • For large N, every address falls into a different cache line
    • 32 cache lines to load per warp → slow
  • Version 2 (mat[row * N + col]):
    • Relative byte addresses per thread within a warp: [0, 4, 8, ..., 124]
    • All addresses fall into the same cache line
    • 1 cache line to load per warp → fast

Non-coalesced Access: What to Do?

  • It is not always possible to change the way the memory is accessed.
  • A common transposition trick using shared memory:
    • Reading tiles in a coalesced manner and storing in a fast shared memory buffer
    • The transposition is done when reading from it.
    • Every thread receives the same value as before.

Version 1 upgrade:

__global__ void matrix_traversal(float *mat, int N) {
  __shared__ float buffer[32 * 32];
  int startCol = blockIdx.x * 32;
  int startRow = blockIdx.y * 32;

  int i = (startCol + threadIdx.y) * N + (startRow + threadIdx.x);
  buffer[threadIdx.y * 32 + threadIdx.x] = mat[i];    // now coalesced
  __syncthreads();                                    // waiting for all threads in the same block to pass through
  float element = buffer[threadIdx.x * 32 + threadIdx.y];
  // ...
}
    

... and there is a lot more.

Programmable Hardware Overview

Digital Signal Processors

DSP: Overview (Hexagon V67 Example)

  • Implements a specific SIMD instruction set
    • VLIW (Very Long Instruction Word): independent instructions are grouped into packets at compile time and executed in parallel.
  • May run multiple hardware threads
  • Handles integer, fixed- and floating-point compute, as well as special data types (e.g. complex numbers)
  • Instruction set includes "special-purpose" application-specific instructions
    • Video coding
    • Software-defined radio
    • Checksum computation
    • ...
  • Building programs requires a specific SDK

Benchmarks?

  • Public vanilla FLOPs benchmarks are hard to find.
  • From an ML perspective: AI Benchmark or the MobileNet SSD SNPE sample
    • The DSP offers considerably more ML throughput than the GPU and CPU on the same SoC.
    • It is essentially a fixed-point machine though: it requires quantized models.
    • Usable via the Qualcomm Snapdragon Neural Processing Engine (SNPE) SDK and Android NN (e.g., through TensorFlow Lite)

Preparing ML models for production

Production constraints

  • Another common situation: the target hardware is fixed-point.
  • It requires quantized NN weights and applies quantization to activations with a given number of bits per entry,
    • up to a linear transformation,
    • on a per-layer or per-channel basis.
  • Example: Hexagon DSP, Snapdragon 8 Gen 1 (2021), the best you can get from SNPE:
    • 16 bits per activation value
    • 8 bits per weight value
      • Insufficient for raw image restoration algorithms
    • 32 bits per bias value
  • Fixed-point model is a strict constraint in this example.
  • In general, fixed-point models are faster (at inference time), have reduced inference memory footprint and take less storage.

Reminder: Floating Point vs Fixed Point

  • Floating point:
    • Dedicated compute units
    • Effectively unbounded range (+Inf and -Inf are valid values)
    • Higher accuracy around zero
    • Division by zero is allowed
    • NaN
  • Fixed point:
    • Integer compute, bit shift operations (see the sketch after this list)
    • Finite range
    • Uniform accuracy
    • Division by zero leads to an exception
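A minimal sketch of what "integer compute with bit shifts" means, using the common Q15 format (16-bit values with 15 fractional bits, representing [-1, 1)):

#include <cstdint>

// Fixed-point (Q15) multiplication: integer multiply, then shift the extra
// 15 fractional bits away. Real code rounds instead of truncating and
// saturates on overflow.
int16_t q15_mul(int16_t a, int16_t b) {
  int32_t wide = int32_t(a) * int32_t(b);   // result has 30 fractional bits
  return int16_t(wide >> 15);               // back to 15 fractional bits
}

// Example: 0.5 * 0.25 -> q15_mul(16384, 8192) == 4096, i.e. 4096 / 32768 = 0.125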

Tensor quantization

Quantization of a tensor: computing its approximate integer-valued representation in a given range.

Example: FP32 tensor → INT8 tensor, using the tensor's min and max values (see the sketch below)

  • Linear operations (namely, convolution) can be computed using integer arithmetic. The result is then rescaled according to the min and max values of the input and the kernel.
  • Non-linear activation functions require specific implementations (usually straightforward).
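A minimal sketch of the FP32 → INT8 example above (asymmetric/affine quantization from the tensor's min and max; real frameworks differ in details such as per-channel parameters or forcing the real value 0 to map to an exact integer level):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
  std::vector<int8_t> values;
  float scale;       // real-valued step between two consecutive integer levels
  int zeroPoint;     // integer level that represents the real value 0
};

// Map [minVal, maxVal] linearly onto [-128, 127].
QuantizedTensor quantize(const std::vector<float>& x, float minVal, float maxVal) {
  QuantizedTensor q;
  q.scale = (maxVal - minVal) / 255.0f;
  q.zeroPoint = int(std::round(-128.0f - minVal / q.scale));
  q.values.reserve(x.size());
  for (float v : x) {
    int level = int(std::round(v / q.scale)) + q.zeroPoint;
    q.values.push_back(int8_t(std::clamp(level, -128, 127)));
  }
  return q;
}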

Model quantization

  • Quantization: producing fixed-point model parameters while keeping inference accuracy.
  • Post-training quantization: adjusting parameters of an already trained model.
    • Parameters: quantized directly, as any tensor.
    • Activations: min and max are calibrated by running inference on a representative dataset and collecting statistics (see the sketch after this list).
  • No need to retrain the model
  • Data dependency: requires a representative dataset to calibrate the quantization parameters
  • Limited: may lead to poor accuracy
  • Advanced approaches exist, e.g., cross-layer equalization for models using ReLU [Nagel et al., 2018]
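A minimal sketch of the calibration step, assuming we can observe each activation tensor during inference on the representative dataset (real calibrators often clip the range using percentiles or entropy criteria rather than the raw min/max):

#include <algorithm>
#include <limits>
#include <vector>

// Tracks the observed range of one activation tensor across the calibration set;
// the resulting [minVal, maxVal] defines its quantization parameters as on the
// previous slides.
struct RangeObserver {
  float minVal = std::numeric_limits<float>::max();
  float maxVal = std::numeric_limits<float>::lowest();

  void observe(const std::vector<float>& activation) {
    for (float v : activation) {
      minVal = std::min(minVal, v);
      maxVal = std::max(maxVal, v);
    }
  }
};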

Model quantization

  • Quantization: producing fixed-point model parameters while keeping inference accuracy.
  • Quantization-aware training (QAT): introducing quantization during the training.
    • Parameters: using quantized representation during the forward pass. The optimizer updates the original (floating-point) version after the backward pass.
    • Activations:
      • Forward pass: applying quantization as an additional activation function.
      • Backward pass: the rounding is treated as the identity (straight-through estimator); gradients are only zeroed for entries falling outside the valid range.
        • Quantization acting as an activation function has zero gradient almost everywhere, so it cannot be backpropagated through directly (see the sketch below).
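A minimal sketch of the fake-quantization function used in the QAT forward pass (framework-agnostic, assuming a fixed [minVal, maxVal] range and 256 levels): the value is quantized and immediately dequantized, so the forward pass sees the rounding error while everything stays in floating point. In the backward pass the rounding is treated as the identity, and the gradient is zeroed only where the input was clamped.

#include <algorithm>
#include <cmath>

float fake_quantize(float x, float minVal, float maxVal, int levels = 256) {
  const float scale = (maxVal - minVal) / float(levels - 1);
  const float clamped = std::clamp(x, minVal, maxVal);          // out-of-range entries get zero gradient
  const float level = std::round((clamped - minVal) / scale);   // integer level, as at inference time
  return minVal + level * scale;                                // back to floating point
}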