PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices: Motivation

:::info
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Minghao Yan, University of Wisconsin-Madison;

(2) Hongyi Wang, Carnegie Mellon University;

(3) Shivaram Venkataraman, University of Wisconsin-Madison.

:::

Table of Links

Abstract & Introduction
Motivation
Opportunities
Architecture Overview
Problem Formulation: Two-Phase Tuning
Modeling Workload Interference
Experiments
Conclusion & References
A. Hardware Details
B. Experimental Results
C. Arithmetic Intensity
D. Predictor Analysis

2 MOTIVATION

Many deep neural networks have been deployed on edge devices to perform tasks such as image classification, object detection, and dialogue systems. Scenarios including smart home assistants (He et al., 2020), inventory and supply chain monitoring (jet), and autopilot (Gog et al., 2022) often rely on battery-powered devices equipped with GPUs to perform these tasks. In these scenarios, pre-trained models are installed on the devices where the inference workload is deployed.


Prior works have focused on optimizing the energy consumption of GPUs (Wang et al., 2020b; 2021; Tang et al., 2019; Strubell et al., 2019; Mei et al., 2017) in cloud scenarios (Qiao et al., 2021; Wan et al., 2020; Hodak et al., 2019) and training settings (Wang et al., 2020a; Peng et al., 2019; Kang et al., 2022). On-device inference workloads exhibit different characteristics and warrant separate attention. In this section, we outline previous efforts in optimizing on-device neural network inference and discuss our approach to holistically optimizing energy consumption.

2.1 On-device Neural Network Deployment

Prior work in optimizing on-device neural network inference focuses on quantization (Kim et al., 2021; Banner et al., 2018; Courbariaux et al., 2015; 2014; Gholami et al., 2021), designing hardware-friendly network architectures (Xu et al., 2019; Lee et al., 2019; Sanh et al., 2019; Touvron et al., 2021; Howard et al., 2019), and leveraging hardware components specific to mobile settings, such as DSPs (Lane & Georgiev, 2015). Our work explores an orthogonal dimension and aims to answer a different question: Given a neural network to deploy on a specific device, how can we tune the device to reduce energy consumption?


In our work, we focus on edge devices that contain CPUs, memory, and GPUs. These devices are generally more powerful than the DSPs often found on mobile devices. One such example is the Nvidia Jetson series, which is capable of handling a wide array of applications, ranging from AI to robotics and embedded IoT solutions (jet). These devices also support dynamic voltage and frequency scaling (DVFS), which allows power consumption and thermal behavior to be managed during complex computational tasks. The Jetson series features a unified memory shared by the CPU and GPU. Throughout this paper, we refer to the operating frequencies of the CPU, GPU, and shared memory as the CPU frequency, GPU frequency, and memory frequency.
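As a concrete illustration of the DVFS knobs described above, the sketch below reads (and optionally caps) the CPU, GPU, and memory frequencies through Linux sysfs. The specific sysfs paths for the GPU and memory (EMC) clocks are assumptions that vary across Jetson models and JetPack versions; they indicate the kind of interface involved and are not PolyThrottle's actual implementation.

```python
# Minimal sketch of inspecting (and optionally capping) the CPU, GPU, and
# memory frequency knobs on a Jetson-class device through Linux sysfs.
# The GPU and memory (EMC) paths below are assumptions: they differ across
# Jetson models and JetPack versions. Writing these nodes requires root.

from pathlib import Path

# Hypothetical nodes; verify on your board (e.g., `find /sys -name max_freq`).
CPU_MAX_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq")
GPU_MAX_FREQ = Path("/sys/class/devfreq/17000000.gpu/max_freq")   # assumed path
EMC_RATE = Path("/sys/kernel/debug/bpmp/debug/clk/emc/rate")      # assumed path


def read_freq(node: Path) -> int:
    """Return the current value of a frequency node (units depend on the node)."""
    return int(node.read_text().strip())


def cap_freq(node: Path, value: int) -> None:
    """Write a frequency cap to a node; requires root privileges."""
    node.write_text(str(value))


if __name__ == "__main__":
    for name, node in [("CPU", CPU_MAX_FREQ), ("GPU", GPU_MAX_FREQ), ("Memory (EMC)", EMC_RATE)]:
        if node.exists():
            print(f"{name}: {node} -> {read_freq(node)}")
        else:
            print(f"{name}: {node} not present on this device")
```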


Case Study on Inventory Management: To understand the system requirements in edge NN inference, we next describe a case study of how NNs are deployed in an inventory management company. From our conversations, Company A works with Customer B to deploy neural networks on edge devices to optimize inventory management. To comply with regulations and protect privacy, data from each inventory site are required to be stored locally. The vast difference in the layout of the inventories makes it impossible to pre-train the model on data from every warehouse. Therefore, these devices come with a pre-trained model based on data from a small sample of inventories, which may have significantly different layouts and external environments compared to the actual deployment venue. Consequently, daily fine-tuning is required to enhance performance at the deployed sites, as the environment continually evolves. Similar arguments apply to smart home devices, where a model is pre-trained on selected properties, but the deployed households may be much more diverse. To address privacy concerns, on-device fine-tuning of neural networks is preferred, as it keeps sensitive data local. Therefore, edge devices often need to run both inference and periodic fine-tuning. Combining multiple workloads on edge devices can lead to SLO violations due to interference and increased energy use.

2.2 Holistic Energy Consumption Optimization

Some recent works have explored reducing energy consumption by optimizing batch size and the maximum GPU frequency (You et al., 2022; Nabavinejad et al., 2021; Komoda et al., 2013; Gu et al., 2023) and by developing power models for modern GPUs (Kandiah et al., 2021; Hong & Kim, 2010; Arafa et al., 2020; Lowe-Power et al., 2020). In this work, we argue that other hardware components also cause energy inefficiency and require separate optimization. We perform a grid search over GPU, memory, and CPU frequencies and various batch sizes to examine the Pareto frontier of inference latency and energy consumption. Figure 2 shows the trade-off between per-query energy consumption and inference latency (normalized to the optimal latency) on the Jetson TX2 and Jetson Orin. Each point in the figure represents the optimal configuration found through grid search under a given inference latency budget and batch size. As Figure 2 shows, the Pareto frontier is not globally smooth and is difficult to capture with a simple model, which warrants more sophisticated optimization techniques to quickly converge to a hardware configuration that lies on the Pareto frontier (Censor, 1977).
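To make the notion of the Pareto frontier concrete, the sketch below shows one way to extract latency/energy Pareto-optimal configurations from a set of grid-search measurements. The Measurement fields and helper names are illustrative assumptions, not PolyThrottle's implementation.

```python
# A minimal sketch of extracting the latency/energy Pareto frontier from
# grid-search measurements, as in Figure 2. The Measurement fields and the
# helper below are illustrative, not PolyThrottle's actual code.

from typing import NamedTuple


class Measurement(NamedTuple):
    config: dict        # e.g., {"gpu_freq": ..., "mem_freq": ..., "cpu_freq": ..., "batch_size": ...}
    latency_ms: float   # per-query inference latency
    energy_mj: float    # per-query energy consumption


def pareto_frontier(points: list[Measurement]) -> list[Measurement]:
    """Return the measurements not dominated on (latency, energy), both minimized."""
    frontier: list[Measurement] = []
    best_energy = float("inf")
    # Sweep points in order of increasing latency; keep those that strictly
    # improve on the best energy seen so far.
    for p in sorted(points, key=lambda m: (m.latency_ms, m.energy_mj)):
        if p.energy_mj < best_energy:
            frontier.append(p)
            best_energy = p.energy_mj
    return frontier
```

Under a given latency budget, one would then select the frontier point with the lowest energy whose latency satisfies the budget.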


Zeus (You et al., 2022) attempts to reduce the energy consumption of neural network training by changing the GPU power limit and tuning the training batch size. PolyThrottle also includes these two factors. Zeus focuses on training workloads in data center settings, where batch size tuning helps reach an accuracy threshold in an energy-efficient way. We include batch size in PolyThrottle because it provides a trade-off between inference latency and throughput. Our empirical evaluation reveals new avenues for optimization that complicate the search space, as we describe next.
