OpenAI cut inference costs in half — sort of
The Information reports OpenAI found an optimization technique that halved inference requirements for existing models, letting it serve all signed-out ChatGPT users on just 100 GPUs. The technique wasn't disclosed; speculation ranges from quantization to query routing — and none of those improve larger models without quality trade-offs.