Hey, when generating responses with structured output through the non-streaming API, a request sometimes takes 3 s and sometimes 10-20 s. I am firing that request repeatedly, back to back, while testing the app. Is this by design, or is there any place I can learn more about what contributes to such variation?
Unpredictable performance when using structured output
The model uses speculative decoding under the hood, so some responses can indeed be slower or faster than others. In general, regarding performance, I'd start by profiling the app with Instruments.app, as mentioned here. If you see that something is unusually slow, please share the details, and I'll be interested in taking a closer look from there.
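
To give a rough intuition for why speculative decoding can make response times vary, here is a toy simulation in Python. It is only a sketch under stated assumptions, not how the on-device model is actually implemented: the function name `target_passes`, the draft length of 4, and the fixed per-token acceptance probability are all hypothetical. The idea is that a small draft model proposes a few tokens, the main model verifies them in a single pass, and the number of main-model passes (a stand-in for latency) depends on how many drafted tokens happen to be accepted.

```python
import random


def target_passes(num_tokens: int, draft_len: int, accept_prob: float,
                  rng: random.Random) -> int:
    """Count main-model passes needed to emit num_tokens in a toy speculative decoder.

    Assumption: every drafted token is accepted independently with the same
    probability accept_prob. Real acceptance depends on the prompt and output.
    """
    produced = 0
    passes = 0
    while produced < num_tokens:
        accepted = 0
        # Drafted tokens are accepted left to right until the first rejection.
        while accepted < draft_len and rng.random() < accept_prob:
            accepted += 1
        # One verification pass yields the accepted prefix plus one token
        # produced by the main model itself (a correction or a bonus token).
        produced += accepted + 1
        passes += 1
    return passes


if __name__ == "__main__":
    rng = random.Random(0)
    # Same output length, different acceptance rates: the pass count changes.
    for prob in (0.0, 0.5, 0.9, 1.0):
        print(prob, target_passes(120, 4, prob, rng))
```

With no drafted token ever accepted, each pass yields a single token, so 120 tokens need 120 passes. With every drafted token accepted, each pass yields 5 tokens, so only 24 passes are needed. Real requests fall somewhere in between, which is one plausible source of the spread between runs.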
Best,
——
Ziqiao Chen
Worldwide Developer Relations.