LLMs aren’t bad at reasoning—they’re bad at exploring. Here’s how uniqueness-aware RL fixes exploration collapse by rewarding rare solutions.