InferenceError referencing context length in FoundationModels framework

I'm experimenting with downloading an audio file of spoken content, using the Speech framework to transcribe it, and then using the FoundationModels framework to clean up the formatting by adding paragraph breaks and such. I have this code to do that cleanup:

private func cleanupText(_ text: String) async throws -> String? {
    print("Cleaning up text of length \(text.count)...")
    let session = LanguageModelSession(instructions: "The content you read is a transcription of a speech. Separate it into paragraphs by adding newlines. Do not modify the content - only add newlines.")
    
    let response = try await session.respond(to: .init(text), generating: String.self)
    return response.content
}

The content is about 29,000 characters long, and I get this error:

InferenceError::inferenceFailed::Failed to run inference: Context length of 4096 was exceeded during singleExtend..

Is 4096 a reference to a max input length? Or is this a bug?

This is running on an M1 iPad Air, with iPadOS 26 Seed 1.

Answered by DTS Engineer in 844865022

Is 4096 a reference to a max input length? Or is this a bug?

That is not a bug. The framework has a context size limit, which is 4096 as of today, meaning that the token count of the whole session must not exceed 4096.

As @ziopiero mentioned, a token is not equal to a character or a word in the input. How tokens are counted depends on the tokenizer; I'd assume 3–4 characters per token in English, and one character per token in Japanese or Chinese.

When you hit the context size limit, as the error message indicates, consider shortening the prompt in a creative way, or starting a new session if the current session has accumulated multiple prompts and responses. In your case, splitting your input into smaller chunks may help, for example along the lines of the sketch below.
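For example, here is a rough sketch of that kind of chunking. The ~3.5 characters per token and the per-chunk budget are only estimates on my side, not values the framework exposes, so leave yourself plenty of headroom:

import FoundationModels

// Rough heuristic only: assume ~3.5 characters per token for English text.
// The real tokenizer may count differently.
func estimatedTokenCount(for text: String) -> Int {
    Int((Double(text.count) / 3.5).rounded(.up))
}

// Split the transcript into chunks that should fit well within the
// 4096-token limit (instructions, prompt, and response all count).
func chunks(of text: String, maxTokensPerChunk: Int = 1_500) -> [String] {
    var result: [String] = []
    var current = ""
    for sentence in text.split(separator: ".") {
        let piece = String(sentence) + "."
        if !current.isEmpty, estimatedTokenCount(for: current + piece) > maxTokensPerChunk {
            result.append(current)
            current = ""
        }
        current += piece
    }
    if !current.isEmpty { result.append(current) }
    return result
}

// Use a fresh session per chunk so each request stays under the limit.
func cleanUpInChunks(_ transcript: String) async throws -> String {
    var cleaned: [String] = []
    for chunk in chunks(of: transcript) {
        let session = LanguageModelSession(instructions: "Separate the transcription into paragraphs by adding newlines. Do not modify the content - only add newlines.")
        let response = try await session.respond(to: chunk)
        cleaned.append(response.content)
    }
    return cleaned.joined(separator: "\n")
}

Calling cleanUpInChunks(_:) in place of the single cleanupText(_:) call should keep every request within the limit, at the cost of paragraph breaks only being decided within each chunk.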

Best,
——
Ziqiao Chen
 Worldwide Developer Relations.

I have the same problem. Note that 4096 is a maximum token count, not a text length, but I don't know how the token count is calculated.


While the context size limit itself is not a bug, it seems that the FoundationModels framework does have a bug where it reports this error even when the context size is smaller than 4096. I'm trying to take information from a file the user selected and get a list from it. I'm using your suggested estimate of 3.5 characters per token, but I still get the error

Unhandled error streaming response: InferenceError::inferenceFailed::Failed to run inference: Context length of 4096 was exceeded during singleExtend..

when running this code:

do {
    let languageModelSession = LanguageModelSession(
        model: .default,
        instructions: "Can you give me a concise list of barcodes from this CSV import?"
    )

    //"Tell me something simple."

    // Flatten the CSV so the whole file goes into a single prompt.
    let purifiedContent = content.replacingOccurrences(of: "\n", with: ",")

    let prompt: String = "Here is the data -> \(purifiedContent)"

    // Estimate tokens at ~3.5 characters per token, as suggested above.
    let characterCount = prompt.count
    let estimatedTokens = Double(characterCount) / 3.5
    let tokenCount = Int(round(estimatedTokens))
    print("Estimated tokens: \(tokenCount)")

    print(prompt)
    print(prompt.count)

    let response = try await languageModelSession.respond(to: prompt)
    print(response.content)
} catch {
    print(error)
}
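For now I'm experimenting with batching the rows myself so that each prompt stays well under the limit; the batch size and the 3.5 characters-per-token figure below are just guesses on my part:

// Split the CSV rows into batches that should each stay well below the
// 4096-token limit, then run one session per batch and merge the lists.
func barcodeBatches(from content: String, maxTokensPerBatch: Int = 1_000) -> [String] {
    let maxCharacters = Int(Double(maxTokensPerBatch) * 3.5)
    var batches: [String] = []
    var currentRows: [String] = []
    var currentLength = 0
    for row in content.split(whereSeparator: \.isNewline) {
        if !currentRows.isEmpty, currentLength + row.count > maxCharacters {
            batches.append(currentRows.joined(separator: ","))
            currentRows = []
            currentLength = 0
        }
        currentRows.append(String(row))
        currentLength += row.count + 1
    }
    if !currentRows.isEmpty { batches.append(currentRows.joined(separator: ",")) }
    return batches
}

Each batch then goes to its own LanguageModelSession instead of one prompt containing the whole file.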

Would you mind sharing the value of tokenCount when the error happens, and possibly your prompt as well? The "3–4 characters per token" figure is an estimate, and you might want to overestimate a bit. If the token count is far below the limit and the error is still triggered, I'd be very interested in reproducing the error on my side...

Best,
——
Ziqiao Chen
 Worldwide Developer Relations.

You can use the Foundation Models instrument, in the Instruments app, to analyze your use of the framework and see details about your session: it features a column reporting the number of input and output tokens, which you can check to see whether they're consistent with your expectations.

This WWDC25 video can be useful too.
