
On-Device Learning Note


Harvard Edge CS249R - Chapter 12 On-device Learning

Relevant Notions

On-Device Learning, which is closely related to TinyML and is sometimes also called Federated Learning, targets embedded and edge ML deployments.

Limitations & Directions

On-device learning is limited by i) compute resources, and ii) dataset size, accuracy, and generalization.

Comparatively, cloud learning can be more complex, more efficient, and larger in scale.

On-device learning faces challenges with small dataset sizes, the accuracy & efficiency tradeoff, and generalization. Also consider the speed requirements of some tasks and the non-IID (not independent & identically distributed) data that on-device learning has to handle.

On-device Adaptation

For some details, see the Model Optimizations section below.

Resource consumption comes from the following sources, each with a corresponding solution: i) the ML model itself → reduce model complexity; ii) the optimization process → modify the optimization process; iii) storing & processing the datasets used → new storage-efficient data representations.

Reducing model complexity

Modifying optimization process

Quantization-aware scaling

Quantization is the process of mapping a continuous range of values to a discrete set of values, e.g., reducing the precision of weights and activations from 32-bit floating point to a lower-precision format such as 8-bit integers.


QAS tries to minimize the quantization error by adjusting the scale factors during quantization, based on the distribution of weights and activations, without hyper-parameter tuning.

This involves 2 steps: i) quantization-aware training, ii) quantization & scaling.
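
As a rough illustration of the mapping above (a minimal NumPy sketch of symmetric int8 quantization, not the exact QAS procedure), with the scale factor derived from the observed weight distribution:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: map float32 values to [-127, 127]."""
    # Derive the scale factor from the weight distribution (here: the max magnitude).
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(256).astype(np.float32)   # pretend these are layer weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("mean quantization error:", np.mean(np.abs(w - w_hat)))
```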

Sparse updates

Reduce the memory consumption of full backward computation by pruning gradients during backward propagation, ie skipping gradient computation for less important layers and sub-tensors.

Contribution analysis is needed to optimize the sparse update scheme; it measures the accuracy improvement contributed by updating individual biases and weights.
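
A minimal sketch of the idea, assuming PyTorch; which block counts as "less important" is chosen arbitrarily here, standing in for a real contribution analysis:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),   # assume contribution analysis marked this block as less important
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Sparse update: freeze the "less important" first block so no gradients
# (and no optimizer state) are stored for it during backprop.
for p in model[0].parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-2)

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()        # gradients are only computed for the unfrozen layers
optimizer.step()
```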

Layer-wise training

Sequential layer-by-layer training, instead of end-to-end training, avoids having to store intermediate feature maps.

Trading computation for memory

Release some of the memory storing intermediate results and recompute them as needed (often called gradient or activation checkpointing).
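
A minimal sketch assuming a recent PyTorch version and its checkpoint utility: activations inside the checkpointed block are not kept for the backward pass and are recomputed when needed.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
head = nn.Linear(512, 10)

x = torch.randn(16, 512, requires_grad=True)

# Activations inside `block` are freed after the forward pass and
# recomputed on the fly during backward -- trading computation for memory.
h = checkpoint(block, x, use_reentrant=False)
loss = head(h).sum()
loss.backward()
```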

New data representations for data compression

Transfer Learning

A model developed for one task can be reused as the starting point for a model on a second task. In the context of on-device AI, this means fine-tuning a pre-trained model on the device using a smaller dataset. It is sometimes called knowledge transfer.

Eg, researchers (Esteva et al., 2017) developed a transfer learning model to detect cancer in skin images: it was pre-trained on 1.28 million images to classify a broad range of objects and then specialized for cancer detection by training on a dermatologist-curated dataset of skin images.

Pre-deployment specialization

Workflow: start with a pre-trained model → fine-tuning → validation → deployment.
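
A minimal sketch of this workflow's fine-tuning step, assuming PyTorch; the backbone below is a stand-in for a real pre-trained feature extractor, and the data is synthetic:

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained backbone (a real one would be loaded from a checkpoint).
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))

# Freeze the backbone; only the new task-specific head is trained.
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(64, 2)                       # new task, e.g. binary classification
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Tiny synthetic dataset standing in for the small on-device target dataset.
x = torch.randn(32, 128)
y = torch.randint(0, 2, (32,))

for _ in range(5):                            # fine-tuning loop
    loss = nn.functional.cross_entropy(head(backbone(x)), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```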

Post-deployment specialization (on-the-fly personalization)

Workflow:

data collection → model fine-tuning → validation → deployment.

Core concepts

Taxonomy

Inductive transfer learning; transductive; unsupervised.


Constraints & considerations

Federated Model Learning

A solution that trains models partially on edge devices and communicates only model updates to the cloud.

FederatedAveraging

Google proposed an algorithm, FederatedAveraging, in 2016.

FederatedAveraging performs local SGD on several different edge devices and averages the resulting updates. Figure 12.5: Google’s Proposed FederatedAverage Algorithm. Credit: McMahan et al. (2017).
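
A minimal NumPy sketch of the server-side averaging step (not the full FedAvg protocol with client sampling and local epochs): client parameters are averaged, weighted by local dataset sizes.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Server step of FedAvg: average client parameters weighted by dataset size."""
    total = sum(client_sizes)
    avg = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n_k in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            avg[i] += (n_k / total) * w
    return avg

# Three clients, each holding a (weight matrix, bias) pair after local SGD.
clients = [[np.random.randn(4, 2), np.random.randn(2)] for _ in range(3)]
sizes = [100, 50, 250]                      # local dataset sizes
global_model = federated_average(clients, sizes)
```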

Key vectors for further optimizing FL

G-Board (Google's keyboard), as an example application of federated learning.

MedPerf, an open-source benchmarking platform for federated learning.

Security Concerns

Components in an AI Framework

Reference: Chapter 4 AI Workflow, and Chapter 6 AI Frameworks

Fig 4.1 (Reference: Chapter 4 AI Workflow, 4.1 Overview).

Computational Graph

There is the static graph, ie defined first and executed afterwards, and the dynamic graph, ie built on-the-fly during execution.
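
A toy, framework-agnostic Python sketch of the distinction (not a real framework API): the dynamic style computes values eagerly as operations run, while the static style records a graph first and executes it later.

```python
# Dynamic ("define-by-run"): values are computed eagerly, line by line.
def dynamic(x):
    y = x * 2
    z = y + 3
    return z

# Static ("define-then-run"): first build a graph of operations, then execute it.
graph = [("mul", 2), ("add", 3)]            # recorded operations

def run_graph(graph, x):
    for op, const in graph:                 # a compiler could optimize this whole graph ahead of time
        x = x * const if op == "mul" else x + const
    return x

assert dynamic(5) == run_graph(graph, 5) == 13
```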

Figure in 6.4.2.

Ahead-of-time compilation of a static graph results in faster execution compared with a dynamic graph.

Optimization advantages of static computational graphs:

ML inference frameworks for embedded systems & microcontrollers

- TensorFlow Lite Micro: an interpreter.
- TinyEngine: compiler-based.
- CMSIS-NN: a software library.

For a comparison of the above three frameworks, see here.

Future

ML systems Breakdown:

AI applications → Computational graphs → Tensor programs → Libraries and runtimes → Hardware primitives.

High-performance compilers & libraries

Horizontal perspective.

TensorFlow XLA, CUTLASS (Nvidia), TensorRT (Nvidia).

ML for ML frameworks

Hyperparameter optimization, eg Bayesian optimization, random search, grid search.

Neural architecture search (NAS).

AutoML.

Look into AI training

Optimizations of optimization algorithms

Specifically, optimizations of the commonly used SGD (stochastic gradient descent).
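
One common refinement is adding momentum to vanilla SGD; a minimal NumPy sketch on a toy quadratic objective:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: the velocity accumulates past gradients."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Minimize f(w) = ||w||^2 (gradient is 2w) from a random starting point.
w = np.random.randn(3)
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_momentum_step(w, grad=2 * w, velocity=v)
print(w)   # close to the optimum at 0
```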

Benchmarking of optimizations: evaluations would consider training loss curves, generalization error, sensitivity to hyperparameters, and computational efficiency (AlgoPerf, 2023).

Hyperparameter tuning

Before reaching SOTA, for faster convergence and controllable gradient divergence, tuning methods include batch normalization (addressing internal covariate shift), adaptive learning rates, etc.

Some typical key hyperparameters across ML algorithms 🐟

Some prominent hyperparameter search algorithms 🌟

To know the tuning works from some system implications 🍥

Some auto tuners 🧶

For big ML, Google’s Vertex AI Cloud (Vertex AutoML), a commercial auto-tuning platform.

For tiny ML, Edge Impulse’s Efficient On-device Neural Network Tuner (EON Tuner), an automated hyperparameter optimization tool designed for microcontroller ML models.

Runtime complexity of matrix multiplication

Neural network training comprises linear transformation operations (ie matrix multiplications) and non-linear transformations (ie activations); the former occupy most of the computing time.

Consider a layer with an M-dimensional input and an N-dimensional output, consisting of K neurons. The matrix multiplication (performed via multiply-accumulate operations) computes the outputs from inputs * neuron weights, ie a (1×M) matrix times an (M×K) matrix, supposing batch = 1 → O(M*K). The element-wise activation function, by contrast, is O(N).
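
A small sketch that makes the counts concrete, assuming a dense layer with an M-dimensional input and K neurons (so the output dimension N equals K here):

```python
import numpy as np

M, K = 128, 64                     # input dimension, number of neurons (output dimension N = K)
x = np.random.randn(M)             # batch size 1
W = np.random.randn(M, K)          # one weight column per neuron

macs = 0
out = np.zeros(K)
for k in range(K):                 # naive matrix-vector multiply
    for m in range(M):
        out[k] += x[m] * W[m, k]   # one multiply-accumulate operation
        macs += 1

relu_ops = K                       # element-wise activation touches each output once
print(macs, "MACs (O(M*K)) vs", relu_ops, "activation ops (O(N))")
```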

To reduce the time spent on these linear algebra operations, there are optimizations for general sparse/dense matrix-matrix and matrix-vector operations: 🍤

Computation vs. Memory

Computational profiles differ between training and inference. A workload can be memory-bound (measured through operations/byte and memory bandwidth) vs compute-bound (measured through FLOPS, ie operations/second, arithmetic throughput).

Batch size matters in the above computational profiles.

Hardware optimization is motivated by the above bottlenecks, currently more on the compute side than the memory side. Model architecture poses different computational or memory bottlenecks as well, eg Transformers and MLPs vs CNNs.

Training parallelization


Data parallelism

This can be used for both small and big models.

Typical workflow:

Divide the dataset → replicate the model → parallel computation → gradient aggregation → parallel parameter update → synchronization.
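
A minimal NumPy sketch of the gradient aggregation and parameter update steps above; the per-replica gradients are synthetic stand-ins for gradients computed on each data shard:

```python
import numpy as np

def aggregate_gradients(replica_grads):
    """All-reduce style aggregation: average the gradients from all replicas."""
    return sum(replica_grads) / len(replica_grads)

w = np.random.randn(10)                                  # replicated model parameters
replica_grads = [np.random.randn(10) for _ in range(4)]  # one gradient per data shard

g = aggregate_gradients(replica_grads)
w -= 0.01 * g              # the identical update is applied on every replica, keeping them in sync
```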

Model parallelism

This could be used for big models.

Typical taxonomy:

Efficient Models

Make models with fewer parameters achieve strong performance.

A Perspective of architecture

A Perspective of compression


A Perspective of hardware

Specifically, inference hardware.

A Perspective of numerics


Considerations

Metrics

Accuracy & Efficiency tradeoff.

And metrics:

Some benchmark datasets mentioned for providing standardized performance metrics:

Model Optimizations

From software to hardware: efficient model representation → numerics representation → hardware implementation.

Efficient model representation

Pruning, model compression, edge-friendly model architectures.

Pruning

Remove redundant / non-essential components of the model, eg connections between neurons, individual neurons, layers, activations.

structured pruning (in training or after training)
- Targets model-specific substructures.
- i) Structures to target for pruning, eg neurons, channels or filters (in CNNs), layers.
- ii) Establish criteria for pruning, using scores as quantitative metrics, eg:
  - weight magnitude: based on absolute values of the weights (see the sketch after these pruning lists);
  - gradient: weights with low gradient magnitudes have little effect on the loss;
  - activation: consistently low activation values suggest less relevance;
  - Taylor expansion: approximates the change in the loss function from removing a given weight.
- iii) Select a pruning strategy:
  - iterative pruning, eg remove channels gradually and assess the resulting accuracy degradation at each step;
  - one-shot pruning, compress the model rapidly given a target model size, sparsity level, available compute, and acceptable accuracy loss.

unstructured pruning (after training)
- Targets individual weights.
- More flexible than structured pruning, but challenging because of sparse representations.

Comparison between structured pruning and unstructured pruning: https://harvard-edge.github.io/cs249r_book/contents/optimizations/optimizations.html#tbl-pruning_methods
iterative pruning
bayesian pruning
random pruning
sensitivity & movement pruning
lottery ticket hypothesis: iterate until accuracy stops degrading significantly:
i) randomly initialize the weights;
ii) train the network until convergence;
iii) prune a percentage of the lowest-magnitude weights;
iv) reset the remaining weights to their initial values.
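
A minimal NumPy sketch of unstructured pruning with the weight-magnitude criterion mentioned above: zero out a chosen fraction of the smallest-magnitude weights.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest absolute value."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

W = np.random.randn(256, 256)
W_pruned = magnitude_prune(W, sparsity=0.8)
print("nonzero fraction:", np.count_nonzero(W_pruned) / W.size)   # roughly 0.2
```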

Model compression

Knowledge distillation, low-rank matrix factorization, tensor decomposition.

Knowledge distillation
- Train a smaller student model to mimic a larger pre-trained teacher model.
- Basic concepts:
  - Use soft targets derived from the teacher's probabilistic predictions, ie a temperature-scaled softmax over the teacher's outputs → softened probability distributions over classes.
  - The loss function combines a distillation loss (between teacher & student) and a classification loss (between student & true labels). KL (Kullback-Leibler) divergence is commonly used for the distillation loss.
  - Temperature scaling in the softmax function is a parameter that controls the granularity of the information distilled from the teacher. A higher temperature produces softer, more informative distributions.

LRMF (low-rank matrix factorization)
- A mathematical technique in linear algebra & matrix factorization: approximate a given matrix by decomposing it into a product of two or more lower-rank (lower-dimensional) matrices.
- Cross-realm eg: in recommendation systems, the user-item interaction matrix is factorized to capture latent factors related to user preferences & item attributes.
- Tradeoff: fewer parameters to store (m * k + k * n) vs more computation at runtime, O(mkn), where the rank k is a challenging hyper-parameter.
- Information loss during factorization also needs to be noted.

Tensor decomposition
- Higher-dimensional analogue of matrix factorization.
- Similar challenges: information loss and nuanced hyper-parameter tuning.
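
A minimal sketch of the distillation loss described above, assuming PyTorch: temperature-scaled soft targets from the teacher, KL divergence for the distillation term, cross-entropy for the hard-label term (the weighting `alpha` is an assumed hyper-parameter).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine soft-target (KL) and hard-label (cross-entropy) losses."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)          # softened teacher distribution
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 10, requires_grad=True)   # stand-ins for real model outputs
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```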

Edge-aware model design

Design edge ML models through device-aware NAS (neural architecture search).

Depthwise separable convolutions (commonly used in image processing):
- i) depthwise convolution: each input channel is convolved independently with its own set of learnable filters;
- ii) pointwise convolution: combines the outputs of the depthwise convolution channels through a 1x1 convolution, creating inter-channel interactions.
Examples: SqueezeNet (2016), MobileNet (2017), EfficientNet (2023).
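
A minimal sketch assuming PyTorch: a depthwise convolution (groups equal to the number of input channels) followed by a 1x1 pointwise convolution, with a parameter-count comparison against a standard convolution.

```python
import torch
import torch.nn as nn

in_ch, out_ch, k = 32, 64, 3

depthwise_separable = nn.Sequential(
    # i) depthwise: each input channel gets its own k x k filter (groups=in_ch)
    nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch),
    # ii) pointwise: 1x1 convolution mixes information across channels
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
)
standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print("separable params:", count(depthwise_separable), "vs standard:", count(standard))

x = torch.randn(1, in_ch, 56, 56)
assert depthwise_separable(x).shape == standard(x).shape   # same output shape, far fewer parameters
```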

Efficient numerics representation

Lower-precision numerics bring a tradeoff between numerical accuracy & computational efficiency. Hardware-software co-design facilitates compatibility with edge devices.

/On-device ML/ /Note/ /Edge ML/ /Tiny ML/