Transformer Performance: Hopfield Theory & Cross-Entropy Loss Data
Table of Links

Abstract and 1 Introduction
2 Related Work
3 Model and 3.1 Associative memories
3.2 Transformer blocks
4 A New Energy Function
4.1 The layered structure
5 Cross-Entropy Loss
6 Empirical Results and 6.1 Empirical evaluation of the radius
6.2 Training GPT-2
6.3 Training Vanilla Transformers
7 Conclusion and Acknowledgments
Appendix A. Deferred …