--

It does NOT increase context length.

I believe it's just a matter of saving network transfer between the Ollama server and the client. After all, the final API call to the actual LLM uses the whole text, and the decoding, I believe, happens before Ollama sends the data to the underlying llama.cpp server.

--
