| Family | x86(_64) | ARM |
|---|---|---|
| Instruction set classification | Complex Instruction Set Computer (CISC) | Reduced Instruction Set Computer (RISC) |
| Usage | Mainly desktop/data center; weak mobile market penetration | Mainly mobile. But: Apple Silicon, Nvidia Grace |
| Key brand names | Intel, AMD | ARM, Apple, Samsung, Qualcomm, ... |
| Word size | 32 or 64 bits | 32 or 64 bits |
| Vector extensions (SIMD) | SSE (128 b), AVX (256 b), AVX-512 | NEON (128 b), SVE (variable) |
| Simultaneous multithreading aka "Hyper-Threading" | Yes: commonly 2 logical per 1 physical | No, until recently: Cortex-A65 |
Other architectures exist. To keep an eye on: RISC-V.
*AlexNet was trained on two GTX 580 GPUs [Krizhevsky et al.]
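The SIMD vector extensions listed in the table expose the same idea, one instruction applied to several packed values, through different intrinsics. Below is a minimal sketch, assuming a C++ compiler with the corresponding extension enabled, of a 4-wide a·x + y step in both dialects; the function name axpy4 is illustrative only.

#if defined(__SSE__)          // x86: SSE intrinsics
#include <xmmintrin.h>
void axpy4(float a, const float *x, float *y) {
    __m128 va = _mm_set1_ps(a);                            // broadcast a into all 4 lanes
    __m128 vx = _mm_loadu_ps(x);                           // load 4 floats from x
    __m128 vy = _mm_loadu_ps(y);                           // load 4 floats from y
    _mm_storeu_ps(y, _mm_add_ps(_mm_mul_ps(va, vx), vy));  // y = a*x + y
}
#elif defined(__ARM_NEON)     // ARM: NEON intrinsics
#include <arm_neon.h>
void axpy4(float a, const float *x, float *y) {
    float32x4_t vx = vld1q_f32(x);                         // load 4 floats from x
    float32x4_t vy = vld1q_f32(y);                         // load 4 floats from y
    vst1q_f32(y, vmlaq_n_f32(vy, vx, a));                  // y = y + x*a, per lane
}
#endif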
#include <cstdio>
#include <cmath>
#include <algorithm>
#include <vector>

__global__ void saxpy(int n, float a, float *x, float *y) { // <-- The kernel function executed on GPU
    int i = blockIdx.x * blockDim.x + threadIdx.x;          // Find out current thread position
    if (i < n) y[i] = a * x[i] + y[i];                      // Compute the result
}

int main(void) {
    const int N = 1 << 20;
    std::vector<float> x(N), y(N);       // Declare storage for test data (CPU memory)
    for (int i = 0; i < N; i++) {        // Fill test data buffers with some numbers (in CPU memory)
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    float *d_x, *d_y;                    // Declare pointers in GPU memory
    cudaMalloc(&d_x, N * sizeof(float)); // Allocate GPU buffer to store x
    cudaMalloc(&d_y, N * sizeof(float)); // Allocate GPU buffer to store y
    cudaMemcpy(d_x, x.data(), N * sizeof(float), cudaMemcpyHostToDevice); // Copy x from CPU memory to GPU
    cudaMemcpy(d_y, y.data(), N * sizeof(float), cudaMemcpyHostToDevice); // Copy y from CPU memory to GPU
    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y); // Launch the kernel, 1 thread per element, 256 threads per block
    cudaMemcpy(y.data(), d_y, N * sizeof(float), cudaMemcpyDeviceToHost); // Copy the result back from GPU
    float maxError = 0.0f;               // Check the result
    for (int i = 0; i < N; i++)
        maxError = std::max(maxError, std::abs(y[i] - 4.0f));
    printf("Max error: %f\n", maxError);
    cudaFree(d_x);                       // Free the GPU buffer
    cudaFree(d_y);                       // Free the GPU buffer
}
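Assuming the listing is saved as saxpy.cu (the file name is arbitrary) and a CUDA toolkit is installed, it can be built and run with nvcc:

nvcc -o saxpy saxpy.cu
./saxpy

With x filled with 1.0 and y with 2.0, every result element should equal 2·1 + 2 = 4, so the expected output is "Max error: 0.000000". Production code would also check the status codes returned by the cudaMalloc/cudaMemcpy calls, omitted here for brevity.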
// ...
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (mat[i] > 0) {
    // Branch A
} else {
    // Branch B
}
// ...
If mat[i] > 0 (or its opposite) holds true for the entire warp, there is mostly no impact on performance. If mat[i] > 0 holds for some threads but not for others within the same warp, both branches A and B will be executed using an "active threads mask", resulting in a performance penalty.
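Where the computation allows it, divergence can be sidestepped by expressing both outcomes as a single expression that every thread evaluates; the compiler can then usually emit a predicated select instead of serializing two branches. A minimal sketch (a hypothetical kernel, not part of the example above):

__global__ void relu_inplace(float *mat, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = mat[i];
        mat[i] = (v > 0.0f) ? v : 0.0f; // single expression per thread: typically compiled to a select, no warp divergence
    }
}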
Example: two ways for a kernel to traverse a square N × N matrix.
__global__ void matrix_traversal(float *mat, int N) {
    int col = blockIdx.x * 32 + threadIdx.x;
    int row = blockIdx.y * 32 + threadIdx.y;
    float element = mat[col * N + row];
    // ...
}

__global__ void matrix_traversal(float *mat, int N) {
    int col = blockIdx.x * 32 + threadIdx.x;
    int row = blockIdx.y * 32 + threadIdx.y;
    float element = mat[row * N + col];
    // ...
}
const dim3 threads(32, 32); // processing the matrix by tiles of 32*32 elements
const dim3 blocks(N/32, N/32); // assuming N is a multiple of 32
matrix_traversal<<< blocks, threads >>>(matrixDataDevicePtr, N);
Question: which implementation is more efficient and why?
__global__ void matrix_traversal(float *mat, int N) {
    int col = blockIdx.x * 32 + threadIdx.x;
    int row = blockIdx.y * 32 + threadIdx.y;
    float element = mat[col * N + row]; //<---- LOAD
    // ...
}

__global__ void matrix_traversal(float *mat, int N) {
    int col = blockIdx.x * 32 + threadIdx.x;
    int row = blockIdx.y * 32 + threadIdx.y;
    float element = mat[row * N + col]; //<---- LOAD
    // ...
}
In the first kernel (mat[col * N + row]), the 32 threads of a warp share threadIdx.y and differ only in threadIdx.x, so consecutive threads load from byte offsets [0, 4N, 8N, ..., 124N] relative to the first one: for large N, every address falls into a different cache line, and the warp needs 32 separate memory transactions. In the second kernel (mat[row * N + col]), consecutive threads load from byte offsets [0, 4, 8, ..., 124]: the whole warp fits into a single 128-byte cache line and the load is coalesced into one transaction, so the second implementation is the more efficient one. If each thread really needs the element mat[col * N + row], the coalesced access pattern can still be used by staging a 32 × 32 tile through shared memory:
__global__ void matrix_traversal(float *mat, int N) {
    __shared__ float buffer[32 * 32];                        // one 32x32 tile, shared by all threads of the block
    int startCol = blockIdx.x * 32;
    int startRow = blockIdx.y * 32;
    int i = (startCol + threadIdx.y) * N + (startRow + threadIdx.x);
    buffer[threadIdx.y * 32 + threadIdx.x] = mat[i];         // now coalesced: threadIdx.x walks consecutive addresses
    __syncthreads();                                         // waiting for all threads in the same block to pass through
    float element = buffer[threadIdx.x * 32 + threadIdx.y];  // same element as mat[col * N + row], read from fast shared memory
    // ...
}
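To check the difference empirically, the variants can be timed with CUDA events. The helper below is a sketch written for this text (not part of the original example); it assumes the matrix_traversal kernel and the blocks/threads launch configuration shown above.

float time_kernel_ms(float *mat_d, int N, dim3 blocks, dim3 threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    matrix_traversal<<<blocks, threads>>>(mat_d, N); // kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                      // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);          // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}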
Quantization of a tensor: computing its approximate integer-valued representation in a given range. Example: FP32 tensor → INT8 tensor, using the tensor's min and max values to define that range.
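A minimal sketch of one common scheme, affine (asymmetric) quantization to the INT8 range [-128, 127]; the struct and function names and the exact rounding/clamping choices are illustrative assumptions, not a prescribed implementation. It assumes max_v > min_v.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<int8_t> data;
    float scale;        // real-valued step between two consecutive integer levels
    int32_t zero_point; // integer level that represents the real value 0.0
};

QuantizedTensor quantize_int8(const std::vector<float> &x, float min_v, float max_v) {
    QuantizedTensor q;
    q.scale = (max_v - min_v) / 255.0f;                      // 256 integer levels cover [min_v, max_v]
    q.zero_point = static_cast<int32_t>(std::lround(-128.0f - min_v / q.scale));
    q.data.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        long v = std::lround(x[i] / q.scale) + q.zero_point; // map to the integer grid
        q.data[i] = static_cast<int8_t>(std::clamp(v, -128L, 127L)); // saturate to INT8
    }
    return q;
}

// Approximate reconstruction of element i: x[i] ≈ scale * (q - zero_point)
float dequantize(const QuantizedTensor &q, size_t i) {
    return q.scale * (static_cast<int32_t>(q.data[i]) - q.zero_point);
}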