:::info
Authors:
(1) Suzanna Sia, Johns Hopkins University;
(2) David Mueller;
(3) Kevin Duh.
:::
Table of Links
Abstract and 1. Background
2. Data and Settings
3. Where does In-context MT happen?
4. Characterising Redundancy in Layers
5. Inference Efficiency
6. Further Analysis
7. Conclusion, Acknowledgments, and References
A. Appendix
5. Inference Efficiency
Speeding up transformer inference is of great interest to the community (Fournier et al., 2023). We highlight the potential for speeding up inference as a direct consequence of identifying where task recognition occurs in the model and of the redundancy in self-attention processing. Our results indicate that we can achieve significant speedups at inference time by removing the processing of context tokens altogether after a certain layer in the model, with little to no impact on downstream performance.
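As a rough illustration of the idea (a minimal sketch, not the authors' implementation), the snippet below drops the context positions from the hidden states once a chosen layer r is reached, so that all subsequent layers only process the test-input tokens. It uses nn.TransformerEncoderLayer as a generic stand-in for a decoder block and omits causal masking and key-value caching; the layer count, dimensions, and split point are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

def forward_with_context_truncation(layers, hidden, n_ctx, r):
    """Run a layer stack, discarding context-token positions after layer r.

    layers : iterable of modules mapping (batch, seq, dim) -> (batch, seq, dim)
    hidden : (batch, seq, dim) states laid out as [context tokens | test tokens]
    n_ctx  : number of context (prompt-example) positions at the front
    r      : layer index after which context positions are no longer processed
    """
    for i, layer in enumerate(layers):
        if i == r:
            # Task recognition is assumed to have happened by layer r,
            # so the context positions are dropped from all further computation.
            hidden = hidden[:, n_ctx:, :]
        hidden = layer(hidden)
    return hidden

# Toy usage: 32 layers, 60 context tokens followed by 20 test tokens.
layers = [nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
          for _ in range(32)]
hidden = torch.randn(1, 80, 32)
out = forward_with_context_truncation(layers, hidden, n_ctx=60, r=14)
print(out.shape)  # torch.Size([1, 20, 32])
```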
Suppose the prompt consists of k examples and we stop processing these context tokens after layer r. Then, for a model with nℓ layers, the amount of processing saved, in terms of both speed and memory, is approximately (nℓ − r)/nℓ × k/(k + 1).
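Expressed as code, the estimate is simply the product of two ratios: the fraction of layers that no longer see the context, and the fraction of the sequence taken up by the k examples. The helper below is a hypothetical restatement of this formula, not something from the paper's codebase.

```python
def approx_savings(n_layers: int, r: int, k: int) -> float:
    """Approximate fraction of compute/memory saved when context tokens
    are dropped after layer r, given k in-context examples in the prompt."""
    return (n_layers - r) / n_layers * k / (k + 1)
```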
Using the example of LLAMA7B (32 layers), we see from Figure 2 that the model is very close to its ceiling score once the examples have been processed up to layer 14 (r = 14). If we no longer need to process the examples after layer 14, then under a prompt size of 5 the savings are approximately 45%.
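Plugging these numbers into the estimate above gives a quick check. Note the figure depends on how the prompt size of 5 is counted: taking k = 5 in-context examples gives about 47%, while reading the prompt size as including the test source (k = 4) gives exactly 45%.

```python
n_layers, r, k = 32, 14, 5   # LLAMA7B: 32 layers, stop processing context at layer 14
saving = (n_layers - r) / n_layers * k / (k + 1)
print(f"{saving:.1%}")       # 46.9% with k = 5; k = 4 gives exactly 45.0%
```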
For instruction-tuned models, which are typically the ones deployed in production, the savings can be non-trivial even if no examples are provided, since very long instructions are typically supplied to the model in an attempt to control its behavior (prompt engineering).
:::info
This paper is available on arxiv under CC BY 4.0 DEED license.
:::