Beatmup
NNets module overview

Beatmup provides a way to run inference of user-defined neural networks on GPU using OpenGL.

The neural network (a NNets::Model instance) can be built in one of two ways:

  • layer-by-layer in the application code, by instantiating and connecting the operations (NNets::AbstractOperation subclasses), or
  • by restoring a serialized representation of an existing model (see NNets::DeserializedModel).

The model data (e.g., convolution filter values) is stored in a ChunkCollection as plain single precision floating point arrays, indexed by the operation names. The model instance and the input/output containers are supplied to an NNets::InferenceTask, which can be run in a thread pool of a Context, just like any other AbstractTask.
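
For illustration, here is a minimal sketch of this workflow in C++. The class names come from this overview; the header layout and the method names used below (ChunkFile, InternalBitmap, connect(), getFirstOperation(), performTask()) are assumptions to be checked against the actual class documentation.

    // Sketch of the inference workflow. Header and method names below are
    // assumptions; refer to the Beatmup class documentation for the exact API.
    #include "beatmup.h"   // placeholder: the actual Beatmup headers to include

    int main() {
        Beatmup::Context context;                           // owns the thread pool and the GPU

        // Model data: plain float32 chunks indexed by operation names
        // (ChunkFile assumed here to be a file-backed ChunkCollection).
        Beatmup::ChunkFile modelData("model.chunks");

        // The model: built layer-by-layer from operations or restored from a
        // serialized representation (see above); construction is omitted here.
        Beatmup::NNets::Model model(context);

        // Wire the input image, the model and its data into an inference task.
        Beatmup::InternalBitmap input(context, "input.bmp");
        Beatmup::NNets::InferenceTask inference(model, modelData);
        inference.connect(input, model.getFirstOperation());

        // The task runs in the Context thread pool like any other AbstractTask.
        context.performTask(inference);
        return 0;
    }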

Under the hood, the network is converted into a set of OpenGL ES 2.0-compliant GLSL shaders. The data is stored in textures in GPU memory. Beatmup takes care of building and executing shader programs.

With this, Beatmup enables hardware-accelerated inference on any decent GPU while keeping the CPU available for other tasks. It also makes it easy to deploy the same model on a wide range of hardware, including inexpensive single-board computers, Android GPUs, and integrated or discrete desktop GPUs from any vendor.

However, the NNets module is still quite young and comes with a set of limitations.

  • The set of implemented features is limited. So far it is oriented exclusively towards image classification and feature extraction. See the NNets::AbstractOperation subclasses for the list of implemented neural network operations.
  • Not every model can be transformed into a Beatmup-compliant one. Most likely, a model needs to be designed and trained from scratch to be deployed with Beatmup. See the NNets::Conv2D, NNets::Pooling2D and NNets::Dense operation descriptions for their corresponding constraints.
  • OpenGL may introduce a significant overhead. The inference throughput achievable with Beatmup on powerful desktop GPUs is likely much lower than what can be achieved with the vendor-specific proprietary technologies widely used for training and inference.
  • There are constraints related to the OpenGL ES 2.0 backend:
    • The activations of almost all operations are stored as 8-bit integers (a standalone numeric illustration is given after this list). This may require the training to be quantization-aware; otherwise, as the depth increases, the accumulated quantization error may degrade the model accuracy. The weights of the network, however, are usually not quantized:
      • Conv2D filters and biases are stored in a floating point format. They may be quantized only if a given GPU does not support single precision floating point computations.
      • Dense layer matrices and bias vectors are stored in a floating point format if the GPU is OpenGL ES 3.1-compliant; otherwise, a 16-bit fixed point representation is used.
    • The 8-bit sampled activations cover the [0, 1] range, which strongly limits the set of activation functions that can be used in the model.
    • OpenGL may be inefficient when sampling many feature channels at a time, or may impose a hardware- or driver-defined hard limit on the number of samples per output value (the latter is the case for Raspberry Pi). This constrains the width of the network. To overcome this, group convolutions and channel shuffling are suggested (a sketch of the shuffle permutation is given after this list). The latter shuffles channels between layers literally for free, which helps to increase connectivity across the width of the network, in particular when group convolutions are used.
  • The batch size is fundamentally and unconditionally equal to 1, i.e., the inference is run for one given input image at a time.
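
To make the activation quantization constraint concrete, the following standalone snippet (plain C++, not Beatmup API code) emulates how an activation is clamped to [0, 1], stored as one of 256 levels, and read back, printing the resulting rounding error:

    #include <cstdint>
    #include <cstdio>
    #include <algorithm>
    #include <cmath>

    // Emulates storing an activation in an 8-bit texture channel: the value is
    // clamped to [0, 1] and rounded to one of 256 evenly spaced levels.
    static float quantize(float activation) {
        const float clamped = std::min(std::max(activation, 0.0f), 1.0f);
        const std::uint8_t stored = static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
        return stored / 255.0f;
    }

    int main() {
        // Values outside [0, 1] are clipped entirely; values inside are off by up
        // to 1/510, and this error is injected again after every layer.
        const float samples[] = { -0.3f, 0.1234f, 0.5f, 0.8765f, 1.7f };
        for (float x : samples)
            std::printf("%+.4f -> %.4f (error %.4f)\n", x, quantize(x), std::fabs(x - quantize(x)));
        return 0;
    }

The worst-case rounding error inside [0, 1] is 1/510 ≈ 0.002 per activation, re-applied after every layer, which is why deeper models benefit from quantization-aware training.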
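
The channel shuffling mentioned above can be viewed as a fixed permutation of channel indices, which is why it is free at inference time: the shaders simply sample the channels in a different order. The snippet below uses the standard group-shuffle pattern (as popularized by ShuffleNet); Beatmup's exact indexing convention may differ.

    #include <cstdio>
    #include <vector>

    // Standard channel shuffle for group convolutions: channels are viewed as a
    // (numGroups x channelsPerGroup) matrix and transposed, so that the next
    // layer's groups see channels coming from all of the previous layer's groups.
    static std::vector<int> channelShuffle(int numChannels, int numGroups) {
        const int channelsPerGroup = numChannels / numGroups;
        std::vector<int> permutation(numChannels);
        for (int i = 0; i < numChannels; ++i)
            permutation[i] = (i % numGroups) * channelsPerGroup + i / numGroups;
        return permutation;
    }

    int main() {
        // 8 channels produced by 2-group convolutions: the shuffle interleaves
        // the two groups.
        for (int channel : channelShuffle(8, 2))
            std::printf("%d ", channel);
        std::printf("\n");   // prints: 0 4 1 5 2 6 3 7
        return 0;
    }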