Jan 12, 2024
It does NOT increase context length.
I believe it's just a matter of saving network transfer between the Ollama server and the client. After all, the final API call to the actual LLM uses the whole text, and I believe the decode happens before Ollama sends the data to the underlying llama.cpp server.