Agentic · OpenAI
Speeding up agentic workflows with WebSockets in the Responses API
Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.
By Brian Yu and Ashwin Nathan, Members of the Technical Staff.
Key facts
- In the Responses API, previous flagship models like GPT‑5 and GPT‑5.2 ran at roughly 65 tokens per second (TPS)
- For the launch of GPT‑5.3‑Codex‑Spark, a fast coding model, their goal was an order of magnitude faster: over 1,000 TPS, enabled by specialized Cerebras hardware optimized for LLM inference
- With GPT‑5.3‑Codex‑Spark they hit that 1,000 TPS target and saw bursts up to 4,000 TPS, showing that the Responses API could keep up with much faster inference in real production traffic
- In this post, they explain how they made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 TPS (a sketch of such a loop follows this list)
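To make "agent loops using the API" concrete, here is a minimal sketch of the request pattern involved: each turn of the loop is a separate Responses API call that returns either a final answer or tool calls to run locally. The model identifier, the single `run_shell` tool, and the local executor are illustrative assumptions, not details from the post, and the exact tool-call schema may differ across SDK versions.

```python
# Baseline agent loop: every turn is its own HTTPS request to the Responses API,
# so connection setup and per-request service overhead are paid on each turn.
# Model name, tool definition, and the local executor are illustrative only.
import json
import subprocess

from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "name": "run_shell",
        "description": "Run a shell command in the project workspace.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }
]


def run_tool_call(arguments: str) -> str:
    # Stand-in for the Codex harness: run the requested command, return its output.
    args = json.loads(arguments)
    result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


def run_agent(task: str, max_turns: int = 20) -> str:
    previous_id = None
    turn_input = task
    for _ in range(max_turns):
        response = client.responses.create(
            model="gpt-5.3-codex-spark",       # placeholder model identifier
            input=turn_input,
            tools=TOOLS,
            previous_response_id=previous_id,  # chain turns without resending history
        )
        previous_id = response.id
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            return response.output_text        # no more tool calls: the task is done
        # Execute the requested tools locally and send the results back as the next turn.
        turn_input = [
            {
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": run_tool_call(call.arguments),
            }
            for call in calls
        ]
    return "stopped: max turns reached"
```

Every iteration of that loop pays full request overhead, which is the cost the post sets out to reduce.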
Summary
When you ask Codex to fix a bug, it scans your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Each of those steps is another round trip through the API, and the time can add up to minutes of waiting on complex tasks. In the Responses API, previous flagship models like GPT‑5 and GPT‑5.2 ran at roughly 65 tokens per second (TPS); at that speed, LLM inference on GPUs was the slowest part of the agentic loop, so API service overhead was easy to hide. In this post, they explain how they made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 tokens per second.
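The mechanism named in the title, moving the agent loop onto a persistent WebSocket connection, is not documented in this excerpt, so the endpoint URL, payload shape, and response fields below are placeholders rather than the real Responses API protocol, and authentication is omitted. The sketch only illustrates the structural point: connection setup is paid once, and every subsequent turn of the loop reuses the open socket instead of starting a new HTTPS request.

```python
# Hypothetical sketch: endpoint, payload shape, and response fields are placeholders.
# The point is that DNS/TCP/TLS setup happens once, and every agent turn after that
# reuses the same open connection instead of issuing a fresh HTTPS request.
import asyncio
import json

import websockets  # pip install websockets


async def run_turns_over_one_connection(turns: list[str]) -> list[dict]:
    uri = "wss://api.openai.com/v1/responses"  # placeholder URL
    results = []
    async with websockets.connect(uri) as ws:  # connection setup paid once
        for turn in turns:                     # each agent turn reuses the socket
            await ws.send(json.dumps({"model": "gpt-5.3-codex-spark", "input": turn}))
            results.append(json.loads(await ws.recv()))
    return results


# Example: the turns of a bug-fix loop (search, read, edit, test) share one connection.
# asyncio.run(run_turns_over_one_connection([
#     "find files related to the failing test",
#     "read the relevant file and propose a fix",
#     "apply the edit",
#     "run the test suite",
# ]))
```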