Foundation Models performance reality check - anyone else finding it slow?

Testing the Foundation Models framework with a health-focused recipe generation app. The on-device approach is appealing, but performance is rough: it's taking 20+ seconds just to get a recipe name and description. The same content from the Claude API takes about 4 seconds.

I know it's beta and on-device has different tradeoffs, but this is approaching unusable territory for a real-time user experience. Streaming helps psychologically but doesn't mask the underlying latency. The privacy and cost benefits are compelling, but not if users abandon the feature before it completes.
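
Roughly the shape of what I'm doing (simplified sketch; RecipeIdea and its fields are stand-ins for my actual @Generable type):

```swift
import FoundationModels

@Generable
struct RecipeIdea {
    @Guide(description: "A short, appetizing recipe name")
    var name: String

    @Guide(description: "One or two sentences describing the dish")
    var description: String
}

let session = LanguageModelSession(
    instructions: "You suggest healthy recipe ideas."
)

// Streaming delivers partial snapshots as fields fill in, so the UI
// can show the name before the description finishes. Run inside an
// async context.
let stream = session.streamResponse(
    to: "Suggest a high-protein dinner recipe.",
    generating: RecipeIdea.self
)
for try await partial in stream {
    print(partial) // partially generated RecipeIdea
}
```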

Anyone else seeing similar performance? Is this expected for beta, or are there optimization techniques I'm missing?

Answered by DTS Engineer in 847849022

You can probably start by profiling your app with Instruments.app, as discussed in the WWDC25 code-along session (starting at 24:32). How to set up the Foundation Models instrument is detailed here.

The Foundation Models instrument reports the number of tokens the model generates. From there, you can calculate tokens per second. The number can vary a lot, but if it is consistently much worse than 20~30 tokens/s, I'd suggest that you file a feedback report and share your report ID here.
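
If you want a quick wall-clock number to pair with the token count from Instruments, a sketch like this measures the end-to-end latency of one request (run inside an async context):

```swift
import FoundationModels

let session = LanguageModelSession()
let clock = ContinuousClock()

let start = clock.now
let response = try await session.respond(to: "Suggest a dinner recipe.")
let elapsed = clock.now - start

// Tokens per second = token count from the Foundation Models
// instrument divided by the elapsed seconds.
print("Generated in \(elapsed): \(response.content)")
```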

The WWDC25 session also discusses how to use prewarm and includeSchemaInInstructions to improve performance where appropriate. You can check whether those apply to your app.
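
For reference, a minimal sketch of how those two options are typically applied (RecipeIdea stands in for whatever @Generable type your app generates):

```swift
// Call prewarm() as soon as generation becomes likely, e.g. when the
// user lands on the screen that triggers it, so model loading overlaps
// with user interaction instead of delaying the first request.
let session = LanguageModelSession(
    instructions: "You suggest healthy recipe ideas."
)
session.prewarm()

// Once the session has already seen the schema, skipping it in later
// requests saves input tokens and shortens time to first token.
let response = try await session.respond(
    to: "Suggest another recipe.",
    generating: RecipeIdea.self,
    includeSchemaInInstructions: false
)
```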

Best,
——
Ziqiao Chen
 Worldwide Developer Relations.

I've found that the prompts I would use for a current API-based SOTA LLM really don't work that well with this little guy. Probably to be expected, but after a lot of prompt tweaking I've gotten some acceptable results.

Also, this is the worst they are ever gonna be. :-)

Issue resolved through systematic schema optimization.

The delay occurred during the schema compilation phase, not during the actual streaming token generation. I did generate a .trace file as suggested, but I resolved the issue through direct code optimization before needing to analyze the profiling data.
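
To give a rough idea of the kind of change involved (illustrative sketch with made-up types, not my actual code), trimming the @Generable type down to just the fields a screen needs shrinks the schema the framework has to compile:

```swift
// Before: a nested schema. Every nested @Generable type adds to what
// the session compiles before generation starts.
@Generable
struct NutritionInfo {
    var calories: Int
    var proteinGrams: Int
}

@Generable
struct FullRecipe {
    @Guide(description: "Recipe name")
    var name: String
    var nutrition: NutritionInfo
    var steps: [String]
}

// After: only what the recipe card actually displays.
@Generable
struct RecipeSummary {
    @Guide(description: "Recipe name")
    var name: String

    @Guide(description: "One-sentence description")
    var description: String
}
```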

Performance is now within an acceptable range for a beta framework.

Could you describe the steps you took to optimize the schema?

I've found that asking ChatGPT to take a prompt I'd use for it and optimize it for a three-billion-parameter on-device LLM works well. It does a much better job at prompt engineering than I do.

Basic on-device generation/response took me about 1 second. Super quick. I haven't been able to get the FoundationModelsTripPlanner streaming generation sample app to work, though; it throws an error.
