I Found 221 Bugs in vLLM. They All Had the Same Root Cause

I audited vLLM’s C++ and CUDA code and found 221 places where PyTorch’s 64-bit tensor metadata (dimension sizes and element counts) is silently truncated to a 32-bit int before being used to size GPU buffer allocations. In the GGUF model-loading code paths, an attacker controls the tensor dimensions through the file header, which turns the truncation into a deterministic GPU buffer overflow triggered simply by loading a crafted model. The same bug class already accounts for 10 CVEs in llama.cpp and Ollama. I reported it to vLLM and the report was closed. I have also submitted a CWE proposal to MITRE to get this vulnerability class formally recognized. The full report, with proof of concept and reproduction steps, is on GitHub.
