Why Sparse Spectral Training Might Replace LoRA in AI Model Optimization

Table of Links

Abstract and 1. Introduction

  2. Related Work

  3. Low Rank Adaptation

    3.1 LoRA and 3.2 Limitation of LoRA

    3.3 ReLoRA*

  4. Sparse Spectral Training

    4.1 Preliminaries and 4.2 Gradient Update of U, Vᵀ with Σ

    4.3 Why SVD Initialization is Important

    4.4 SST Balances Exploitation and Exploration

    4.5 Memory-Efficient Implementation for SST and 4.6 Sparsity of SST

  5. Experiments

    5.1 Machine Translation

    5.2 Natural Language Generation

    5.3 Hyperbolic Graph Neural Networks

  6. Conclusion and Discussion

  7. Broader Impacts and References

Supplementary Information

A. Algorithm of Sparse Spectral Training

B. Proof of Gradient of Sparse Spectral Layer

C. Proof of Decomposition of Gradient of Weight

D. Proof of Advantage of Enhanced Gradient over Default Gradient

E. Proof of Zero Distortion with SVD Initialization

F. Experiment Details

G. Singular Value Pruning

H. Evaluating SST and GaLore: Complementary Approaches to Memory Efficiency

I. Ablation Study

4 Sparse Spectral Training

To address the limitations discussed previously, this section introduces Sparse Spectral Training (SST) and its detailed implementation.

4.1 Preliminaries

Sparse Spectral Training (SST) leverages sparse updates within the spectral domain of neural network weights. By updating singular vectors selectively based on their associated singular values, SST prioritizes the most significant spectral components. This transforms each linear layer as follows:
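A minimal sketch of the transformed layer, reconstructed from the description above (the paper's own equation and dimension annotations are not reproduced here):

```latex
h = W x = U \,\mathrm{diag}(\Sigma)\, V^{\top} x
```

Here U and V collect the left and right singular vectors, Σ is the vector of singular values (of dimension m, as used in Section 4.2), and at each iteration only a subset of the columns of U and V is updated, chosen according to the magnitudes of their associated singular values.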


4.2 Gradient Update of U, Vᵀ with Σ

Update Σ. The diagonal matrix Σ, stored as a vector of dimension m, is updated at every step because its memory overhead is low. This ensures that all singular values are consistently adjusted to refine the model’s performance. The update is as follows:
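A reconstruction consistent with the surrounding description (the paper's own numbered equation is omitted here):

```latex
\Sigma \leftarrow \max\!\left(\Sigma - \eta \,\nabla L_{\Sigma},\; 0\right)
```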



where η is the learning rate and ∇LΣ is the gradient backpropagated to Σ. Taking the elementwise maximum with zero ensures that the singular values remain nonnegative.


Figure 1: Illustration of the Sparse Spectral Training (SST). For each iteration, all singular values and selected singular vectors are updated based on their significance, determined by a multinomial sampling using singular values as probabilities.








Theorem 4.2 indicates that, in each update step, the enhanced gradient approximates full-rank training more closely than the default gradient does (proof in Appendix D). Specifically, the default gradient’s dependence on Σi could result in smaller updates when the current Σi is low, potentially stalling training. By decoupling the update mechanisms for direction (U·i and V·i) and magnitude (Σi), the enhanced gradient method mitigates this issue:
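One way to realize this decoupling, sketched from the description rather than copied from the paper's equation (the small constant ε for numerical stability is an assumption of this sketch), is to divide the Σi scaling back out of the direction updates:

```latex
U_{\cdot i} \leftarrow U_{\cdot i} - \eta\, \frac{\nabla_{U_{\cdot i}} L}{\Sigma_i + \epsilon},
\qquad
V_{\cdot i} \leftarrow V_{\cdot i} - \eta\, \frac{\nabla_{V_{\cdot i}} L}{\Sigma_i + \epsilon}
```

Because the default gradient of a singular vector carries a factor of Σi, removing that factor keeps the direction update from shrinking when Σi is small, while the magnitude continues to be learned through the Σ update of Section 4.2.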



Each time we perform this Re-SVD, we consider it a new round. Each time we select vectors for updating, as described in Eq. 5, we call it a new iteration. The full method is detailed in Algorithm 2.
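As a rough illustration of this round/iteration structure, here is a minimal PyTorch-style sketch (not the paper's Algorithm 2; the function names, the fixed number of sampled vectors r, and the placeholder random gradients are assumptions made for brevity):

```python
import torch

def sample_indices(sigma: torch.Tensor, r: int) -> torch.Tensor:
    """Pick r singular-vector indices with probability proportional to their
    singular values (the multinomial sampling described in Figure 1)."""
    probs = sigma / sigma.sum()
    return torch.multinomial(probs, num_samples=r, replacement=False)

@torch.no_grad()
def sst_round(W: torch.Tensor, num_iterations: int, r: int, lr: float = 1e-3):
    """One SST round: a fresh SVD of W (Re-SVD), followed by several iterations
    that each update all of sigma but only the sampled columns of U and rows of Vh."""
    U, sigma, Vh = torch.linalg.svd(W, full_matrices=False)  # Re-SVD: start of a new round
    for _ in range(num_iterations):                          # each pass is a new iteration
        idx = sample_indices(sigma, r)
        # Placeholder gradients; in practice these come from backpropagation
        # through the factored layer.
        grad_sigma = torch.randn_like(sigma)
        grad_U = torch.randn_like(U)
        grad_Vh = torch.randn_like(Vh)
        # Sigma is cheap to store, so every singular value is updated each
        # iteration, clamped at zero to stay nonnegative (Section 4.2).
        sigma = torch.clamp(sigma - lr * grad_sigma, min=0.0)
        # Only the sampled singular vectors are updated (the sparse spectral update).
        U[:, idx] -= lr * grad_U[:, idx]
        Vh[idx, :] -= lr * grad_Vh[idx, :]
    return U @ torch.diag(sigma) @ Vh  # merged weight, fed into the next Re-SVD
```

In an actual training run the gradients come from backpropagation through the factored layer rather than from random placeholders.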

4.3 Why SVD Initialization is Important

This section outlines the advantages of using SVD initialization and periodic Re-SVD over the zero initialization employed in the LoRA and ReLoRA methods.


Saddle Point Issues with Zero Initialization. With zero initialization, the gradient updates for the matrices A and B can stagnate at saddle points. The gradients of A and B in Eq. 1 are:
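Reconstructing these gradients from the standard LoRA parameterization W = W₀ + BA of Eq. 1 (a sketch in the notation of Section 3.1, not a verbatim copy of the omitted equation):

```latex
\nabla_{B} L = \left(\nabla_{W} L\right) A^{\top},
\qquad
\nabla_{A} L = B^{\top} \left(\nabla_{W} L\right)
```

With B initialized to zero, ∇A L vanishes identically, so A receives no gradient signal at the start of training (and if both factors were zero, both gradients would vanish); the factorized parameters therefore begin at or near a saddle point, which is the stagnation referred to above.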



4.4 SST Balances Exploitation and Exploration

From another perspective, SST combines the strategies of exploitation and exploration in the spectral domain. In contrast, LoRA focuses primarily on exploitation, repeatedly adjusting the top-r singular values as detailed in Section 3.2 while neglecting the remaining spectral vectors. ReLoRA*, on the other hand, emphasizes exploration: by periodically reinitializing the matrices B and A after each merge, it constantly seeks new directions for learning but ignores previously established dominant directions.



4.5 Memory-Efficient Implementation for SST


4.6 Sparsity of SST





Table 1: BLEU scores on IWSLT’14 for Euclidean and hyperbolic Transformers. Values in bold indicate the highest performance among low-rank methods. Values marked with an “*” exceed the performance of their full-rank counterparts. The symbol "-" in the table indicates cases where training resulted in NaN losses. Notably, SST consistently outperforms other low-rank methods. Furthermore, the hyperbolic Transformer trained by SST shows improved performance over the full-rank hyperbolic Transformer, particularly as the dimension size increases.

:::info
Authors:

(1) Jialin Zhao, Center for Complex Network Intelligence (CCNI), Tsinghua Laboratory of Brain and Intelligence (THBI) and Department of Computer Science;

(2) Yingtao Zhang, Center for Complex Network Intelligence (CCNI), Tsinghua Laboratory of Brain and Intelligence (THBI) and Department of Computer Science;

(3) Xinghang Li, Department of Computer Science;

(4) Huaping Liu, Department of Computer Science;

(5) Carlo Vittorio Cannistraci, Center for Complex Network Intelligence (CCNI), Tsinghua Laboratory of Brain and Intelligence (THBI), Department of Computer Science, and Department of Biomedical Engineering Tsinghua University, Beijing, China.

:::


:::info
This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::
