Agentic · OpenAI
Speeding up agentic workflows with WebSockets in the Responses API
Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.
By Brian Yu and Ashwin Nathan, Members of the Technical Staff.
Key facts
- In the Responses API, previous flagship models like GPT‑5 and GPT‑5.2 ran at roughly 65 tokens per second (TPS)
- For the launch of GPT‑5.3‑Codex‑Spark, a fast coding model, their goal was an order of magnitude faster: over 1,000 TPS, enabled by specialized Cerebras hardware optimized for LLM inference
- With GPT‑5.3‑Codex‑Spark they hit that 1,000 TPS target and saw bursts up to 4,000 TPS, showing that the Responses API could keep up with much faster inference in real production traffic
- In this post, they explain how they made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 TPS (a sketch of such a loop follows this list)
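To make "agent loops using the API" concrete, here is a minimal sketch of the request pattern involved: each turn of the loop is a separate Responses API call that returns either a final answer or tool calls to run locally. The model identifier, the single `run_shell` tool, and the local executor are illustrative assumptions, not details from the post, and the exact tool-call schema may differ across SDK versions.

```python
# Baseline agent loop: every turn is its own HTTPS request to the Responses API,
# so connection setup and per-request service overhead are paid on each turn.
# Model name, tool definition, and the local executor are illustrative only.
import json
import subprocess

from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "name": "run_shell",
        "description": "Run a shell command in the project workspace.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }
]


def run_tool_call(arguments: str) -> str:
    # Stand-in for the Codex harness: run the requested command, return its output.
    args = json.loads(arguments)
    result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


def run_agent(task: str, max_turns: int = 20) -> str:
    previous_id = None
    turn_input = task
    for _ in range(max_turns):
        response = client.responses.create(
            model="gpt-5.3-codex-spark",       # placeholder model identifier
            input=turn_input,
            tools=TOOLS,
            previous_response_id=previous_id,  # chain turns without resending history
        )
        previous_id = response.id
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            return response.output_text        # no more tool calls: the task is done
        # Execute the requested tools locally and send the results back as the next turn.
        turn_input = [
            {
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": run_tool_call(call.arguments),
            }
            for call in calls
        ]
    return "stopped: max turns reached"
```

Every iteration of that loop pays full request overhead, which is the cost the post sets out to reduce.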
Summary
When you ask Codex to fix a bug, it scans your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Each of those steps is another round trip through the API, and the time can add up to minutes of waiting on complex tasks. In the Responses API, previous flagship models like GPT‑5 and GPT‑5.2 ran at roughly 65 tokens per second (TPS); at that speed, LLM inference on GPUs was the slowest part of the agentic loop, so API service overhead was easy to hide. In this post, they explain how they made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 tokens per second.
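The mechanism named in the title, moving the agent loop onto a persistent WebSocket connection, is not documented in this excerpt, so the endpoint URL, payload shape, and response fields below are placeholders rather than the real Responses API protocol, and authentication is omitted. The sketch only illustrates the structural point: connection setup is paid once, and every subsequent turn of the loop reuses the open socket instead of starting a new HTTPS request.

```python
# Hypothetical sketch: endpoint, payload shape, and response fields are placeholders.
# The point is that DNS/TCP/TLS setup happens once, and every agent turn after that
# reuses the same open connection instead of issuing a fresh HTTPS request.
import asyncio
import json

import websockets  # pip install websockets


async def run_turns_over_one_connection(turns: list[str]) -> list[dict]:
    uri = "wss://api.openai.com/v1/responses"  # placeholder URL
    results = []
    async with websockets.connect(uri) as ws:  # connection setup paid once
        for turn in turns:                     # each agent turn reuses the socket
            await ws.send(json.dumps({"model": "gpt-5.3-codex-spark", "input": turn}))
            results.append(json.loads(await ws.recv()))
    return results


# Example: the turns of a bug-fix loop (search, read, edit, test) share one connection.
# asyncio.run(run_turns_over_one_connection([
#     "find files related to the failing test",
#     "read the relevant file and propose a fix",
#     "apply the edit",
#     "run the test suite",
# ]))
```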