Just noticed GPT-4o hit 1,000 tokens per second in a demo I saw

I was watching some side-by-side comparisons last night, and in one demo GPT-4o was doing 1,000 tokens per second on a single GPU. That's roughly a whole page of text generated in under a second. For context, most models of that size still hover around 50 to 100 tokens per second. It made me wonder if we're about to see real-time voice conversations that actually keep up with us. Has anyone else noticed this speed bump in the latest demos?
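
Rough math behind the "page in under a second" comparison, as a quick sketch. The conversion factors are my own assumptions, not numbers from the demo: about 0.75 English words per token (a common rule of thumb) and about 500 words per page.

```python
# Back-of-the-envelope: how long does it take to generate one page of text?
# Assumptions (not from the demo): ~0.75 words per token, ~500 words per page.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500


def seconds_per_page(tokens_per_second: float) -> float:
    """Return the seconds needed to emit one page at the given token rate."""
    tokens_per_page = WORDS_PER_PAGE / WORDS_PER_TOKEN
    return tokens_per_page / tokens_per_second


for tps in (50, 100, 1000):
    print(f"{tps:>5} tok/s -> {seconds_per_page(tps):.1f} s per page")
```

Under those assumptions, 1,000 tokens per second works out to about 0.7 seconds per page, versus roughly 7 to 13 seconds at 50 to 100 tokens per second.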

2 comments

cameron_owens49 14d ago

Is it actually doing 1,000 tokens per second on a single GPU, or is that number from some kind of optimized pipeline they're running in the demo? Those staged demos usually have the model already warmed up and the hardware specifically tuned for that one task. Real-world usage tends to be a lot slower once you add multiple users, long context windows, and all the other overhead. Not saying it's fake, but I'd like to see what it does under normal load before getting too excited. How much of that speed is just smoke and mirrors for the cameras?

emerychen 14d ago

Watched a similar demo last year where a company claimed their model ran at 500 tokens per second, but when I actually tried it with a full conversation history it choked and dropped to maybe 80. That's the thing with these staged demos, @cameron_owens49: they always test in perfect conditions that nobody actually has in real life. I'd wait for third-party benchmarks before buying into any speed numbers from a fancy presentation.
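
If you want to check a speed claim yourself, here is a minimal sketch of how the measurement could be set up. It times any iterable of streamed tokens and separates time to first token from the steady decode rate, since a long conversation history mostly hurts the first of those. `measure_stream` and `fake_stream` are hypothetical names for illustration; `fake_stream` just sleeps to stand in for a real model, so swap in whatever streaming client you actually use.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Time a token stream: time to first token, decode rate, end-to-end rate."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = time.perf_counter()  # moment the first token arrived
        count += 1
    end = time.perf_counter()
    if count == 0:
        return {"tokens": 0, "ttft_s": None, "decode_tps": None, "end_to_end_tps": None}
    decode_time = end - first
    total_time = end - start
    return {
        "tokens": count,
        "ttft_s": first - start,
        # Rate after the first token; this is the figure demos tend to quote.
        "decode_tps": (count - 1) / decode_time if decode_time > 0 else None,
        # Rate including the wait before the first token; closer to what a user feels.
        "end_to_end_tps": count / total_time if total_time > 0 else None,
    }


def fake_stream(n: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: a prefill delay, then n tokens."""
    time.sleep(prefill_s)  # simulates processing the conversation history
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token_s)


if __name__ == "__main__":
    print(measure_stream(fake_stream(n=20, prefill_s=0.2, per_token_s=0.001)))
```

Running the same prompt with a short and a long history, and comparing the end-to-end rate rather than just the decode rate, is one way to see how much of a headline number survives realistic use.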