HellaSwag: 36% of this popular large language model benchmark contains errors
December 6, 2022