PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices: Opportunities

:::info
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Minghao Yan, University of Wisconsin-Madison;

(2) Hongyi Wang, Carnegie Mellon University;

(3) Shivaram Venkataraman, University of Wisconsin-Madison.

:::

Table of Links

Abstract & Introduction
Motivation
Opportunities
Architecture Overview
Problem Formulation: Two-Phase Tuning
Modeling Workload Interference
Experiments
Conclusion & References
A. Hardware Details
B. Experimental Results
C. Arithmetic Intensity
D. Predictor Analysis

3 OPPORTUNITIES

In this section, we perform empirical experiments to uncover new opportunities for optimizing energy use in NN inference. As discussed in Section 2, prior work did not study how memory frequency, minimum GPU frequency, and CPU frequency affect energy consumption. This gap is partly due to hardware constraints: specialized power rails must be built into the device during manufacturing to enable accurate, per-component measurement of energy consumption. We leverage two Jetson developer kits, TX2 and Orin, which offer native support for component-wise energy measurement and frequency tuning, to study how these frequencies impact inference latency and energy consumption in modern deep learning workloads. We find that the default frequencies are much higher than optimal, and that throttling all of these frequency knobs reduces energy consumption with minimal impact on inference latency.
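As a concrete illustration of how per-component energy can be measured on these boards, the sketch below polls an on-board INA3221 power rail through sysfs while an inference call runs and integrates the samples into a per-query energy estimate. The sysfs path, rail choice, and sampling interval are assumptions made for illustration; the actual nodes differ across Jetson models and JetPack versions.

```python
import threading
import time

# The sysfs node below is an illustrative assumption: the on-board INA3221
# power monitors live under different paths (and expose different rails)
# depending on the Jetson model and JetPack version. Readings are in mW.
POWER_NODE = "/sys/bus/i2c/drivers/ina3221x/0-0041/iio:device0/in_power0_input"

def measure_energy(run_inference, power_node=POWER_NODE, interval_s=0.005):
    """Sample one power rail while run_inference() executes and return
    (latency_s, energy_mJ) by integrating power over time."""
    samples, done = [], threading.Event()

    def poll():
        while not done.is_set():
            with open(power_node) as f:
                samples.append((time.time(), float(f.read())))  # (timestamp, mW)
            time.sleep(interval_s)

    sampler = threading.Thread(target=poll)
    sampler.start()
    start = time.time()
    run_inference()
    latency = time.time() - start
    done.set()
    sampler.join()

    # Trapezoidal integration: mW integrated over seconds yields mJ.
    energy = sum((t2 - t1) * (p1 + p2) / 2
                 for (t1, p1), (t2, p2) in zip(samples, samples[1:]))
    return latency, energy
```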


Figure 3 illustrates the energy optimization landscape when varying GPU and memory frequencies, without imposing any latency SLO constraint. The plot reveals that, without any other constraints, the energy optimization landscape generally exhibits a bowl shape. However, this shape varies depending on the model, device, and other hyperparameters such as batch size (see Appendix B for more results). Next, we dive into how each hardware component affects inference energy consumption.
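A minimal sketch of how such a landscape can be traced is shown below, assuming the frequency caps are exposed as writable sysfs knobs; the two paths are placeholders that vary by Jetson model and typically require root. It grid-searches GPU and memory frequency pairs and records per-query latency and energy at each point.

```python
import itertools

# Placeholder knobs: actual GPU devfreq and EMC-cap paths differ per Jetson
# model and JetPack release.
GPU_MAX_FREQ = "/sys/devices/gpu.0/devfreq/17000000.gp10b/max_freq"
EMC_FREQ_CAP = "/sys/kernel/nvpmodel_emc_cap/emc_iso_cap"

def write_knob(path, hz):
    with open(path, "w") as f:
        f.write(str(hz))

def sweep(gpu_freqs, mem_freqs, measure):
    """Grid-search GPU x memory frequency pairs. `measure()` runs one
    inference batch and returns (latency_s, energy_mJ), e.g. the
    measurement sketch shown earlier."""
    results = {}
    for g, m in itertools.product(gpu_freqs, mem_freqs):
        write_knob(GPU_MAX_FREQ, g)
        write_knob(EMC_FREQ_CAP, m)
        results[(g, m)] = measure()
    return results
```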


Setup: Experiments in this section are performed with 16-bit floating-point precision, as it has been shown to have minimal impact on model accuracy in practice. We use Bert and EfficientNet models and vary the EfficientNet model size among B0, B4, and B7 (Table 4).
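For reference, a minimal sketch of running an EfficientNet model in 16-bit floating point follows; it uses PyTorch and torchvision purely for illustration, since the section does not prescribe a particular inference stack, and assumes a CUDA-capable device.

```python
import torch
from torchvision.models import efficientnet_b0

# Build an FP16 copy of EfficientNet-B0 and run a single forward pass.
# The batch size and 224x224 input resolution are illustrative defaults.
model = efficientnet_b0(weights=None).eval().cuda().half()
x = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)

with torch.no_grad():
    output = model(x)  # inference entirely in 16-bit floating point
```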


Memory frequency experiment: For each model, we fix the GPU frequency at the optimal value determined by a grid search over all possible frequency configurations. We then examine the tradeoff between inference latency and energy consumption as we progressively throttle the memory frequency. The range of available memory frequencies can be found in Table 1.
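Given such measurements, selecting the throttled memory frequency reduces to a filter-and-minimize step. The sketch below assumes a results dictionary mapping each memory frequency to a (latency, energy) pair, as produced by a sweep like the one above with the GPU frequency held fixed.

```python
def best_mem_freq(results, latency_slo_s):
    """Among throttled memory frequencies (GPU frequency held at its
    grid-searched optimum), return the one that minimizes per-query energy
    while still meeting the latency SLO. `results` maps mem_freq ->
    (latency_s, energy_mJ); returns None if no frequency is feasible."""
    feasible = {m: (lat, e) for m, (lat, e) in results.items()
                if lat <= latency_slo_s}
    if not feasible:
        return None
    return min(feasible, key=lambda m: feasible[m][1])
```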


Results: Table 2 reveals that memory frequency plays a vital role in reducing energy consumption. The savings provided by memory frequency tuning are similar and consistent across models on both hardware platforms, ranging from approximately 12% to 25%. This indicates that the default memory frequency is higher than optimal for modern deep learning workloads. For heavy workloads such as Bert, memory tuning can account for the majority of the energy reduction, which can be partially attributed to the memory-bound nature of Transformer-based models (Ivanov et al., 2021). Our results demonstrate that systems aiming to optimize energy use in neural network inference need to take memory frequency into account.


CPU Frequency Experiment: CPUs are used only for data preprocessing. We therefore first measure the time spent in the preprocessing stage of the inference pipeline. Next, we measure the energy saved by throttling the CPU frequency and assess the inference latency slowdown this causes. The preprocessing we perform is standard in almost all image-processing and object-detection pipelines: we read the raw image file, convert it to RGB, resize it, and rearrange it to the desired input resolution and data layout.
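A sketch of this preprocessing path, together with capping the CPU frequency through the standard cpufreq sysfs interface, is shown below; the 224x224 resolution, NCHW layout, and core count are illustrative assumptions rather than values prescribed by the paper.

```python
import numpy as np
from PIL import Image

def preprocess(path, size=224):
    """Read a raw image file, convert it to RGB, resize it, and rearrange it
    into NCHW layout, mirroring the standard pipeline described above."""
    img = Image.open(path).convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0   # HWC in [0, 1]
    return np.expand_dims(arr.transpose(2, 0, 1), 0)  # 1 x C x H x W

def cap_cpu_freq(khz, cpus=range(4)):
    """Throttle CPU cores through the cpufreq sysfs interface (requires root);
    the number of cores and available frequency steps vary across Jetson models."""
    for c in cpus:
        with open(f"/sys/devices/system/cpu/cpu{c}/cpufreq/scaling_max_freq", "w") as f:
            f.write(str(khz))
```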


Results: The preprocessing time across different EfficientNet models remains constant, since the operations performed are identical. As a result, the relative impact of CPU tuning on overall energy consumption depends on the ratio between preprocessing time and inference time: as the model size grows and inference takes longer, the influence of CPU tuning on overall energy consumption shrinks. We observe that on both Jetson TX2 and Orin, CPU tuning can decrease preprocessing energy consumption by approximately 30%. Depending on the model, quantization level, and batch size, this results in up to a 6% reduction in overall energy consumption.
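To make the arithmetic concrete (an illustrative back-of-the-envelope estimate, not a figure reported in the paper): if preprocessing accounts for a fraction f of total per-query energy, a 30% reduction in preprocessing energy lowers total energy by roughly 0.3 × f, so the 6% upper bound corresponds to configurations where preprocessing draws about 20% of the total, and the savings shrink as larger models push f down.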


Minimum GPU frequency experiment: We keep the default hardware configuration and adjust only the minimum GPU frequency on Jetson Orin. Increasing the minimum GPU frequency forces the GPU DVFS mechanism to operate within a narrower range. We scale the model from EfficientNet B0 to EfficientNet B7 to illustrate the effect of the minimum GPU frequency on inference latency.
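A minimal sketch of raising the DVFS floor is shown below, assuming the Orin GPU exposes a devfreq min_freq node; the exact path and the available frequency steps depend on the SoC and JetPack release, so both are placeholders here.

```python
# Placeholder devfreq node; check the path and the available_frequencies
# list on your device before writing to it (root required).
GPU_MIN_FREQ = "/sys/devices/platform/17000000.ga10b/devfreq/17000000.ga10b/min_freq"

def raise_min_gpu_freq(hz, node=GPU_MIN_FREQ):
    """Raise the GPU DVFS floor so the frequency governor operates in a
    narrower band between the new minimum and the configured maximum."""
    with open(node, "w") as f:
        f.write(str(hz))

# Example: sweep the floor upward from the default minimum (frequency values
# here are illustrative) and record per-query latency/energy at each setting
# with the measurement sketch shown earlier.
for floor_hz in (306_000_000, 612_000_000, 930_000_000):
    raise_min_gpu_freq(floor_hz)
```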


Results: Table 3 indicates that tuning the minimum GPU frequency can significantly reduce energy consumption when the workload cannot fully utilize the computational power of the hardware. Notably, both energy consumption and inference latency drop when the GPU is forced to operate at a higher frequency. This differs from the tradeoff observed in the other experiments, where we trade inference latency for lower energy consumption. Tuning the minimum GPU frequency can nearly halve the energy consumption of small models; as computational power becomes saturated with increasing model size, the returns diminish.


Figure 4 shows the per-query energy cost as we vary the minimum and maximum GPU frequencies. Increasing the minimum GPU frequency above the default leads to both lower energy cost and lower inference latency.
