For multi-shot examples, do you mean providing multiple examples for the model to "think"/follow in the user message (which then get captured as assistant-reply context in later rounds)?
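If so, here's a minimal sketch of what I mean, assuming OpenAI-style message dicts (the role/content keys are the real Chat Completions format; the translation examples themselves are made up):

```python
# Few-shot ("multi-shot") examples encoded as fabricated prior turns.
# The model sees them as if the conversation already happened, so it
# tends to continue in the same pattern for the real query.
few_shot_messages = [
    {"role": "system", "content": "You translate English to French."},
    # Example 1 (fabricated turn pair, not a real past exchange)
    {"role": "user", "content": "cheese"},
    {"role": "assistant", "content": "fromage"},
    # Example 2
    {"role": "user", "content": "bread"},
    {"role": "assistant", "content": "pain"},
    # The actual query comes last
    {"role": "user", "content": "apple"},
]
```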
In chat mode, my understanding of current LLMs, e.g. ChatGPT or local LLMs like Llama 2, is that they don't have memory; each "round" of reply takes the whole message history in one execution.
They work like this:
----- First Round -----
[
  {role: system, content: you are a helpful chatbot},
  {role: user, content: hi},
  {role: assistant, content: hello stranger}
]
----- Second Round -----
[
  {role: system, content: you are a helpful chatbot},
  {role: user, content: hi},
  {role: assistant, content: hello stranger},
  {role: user, content: what is the weather today},
  {role: assistant, content: let me see, today is sunny}
]
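In code, that stateless accumulation looks roughly like this; a sketch assuming the OpenAI Python client, but any /chat-style API works the same way:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
history = [{"role": "system", "content": "you are a helpful chatbot"}]

def chat(user_text: str) -> str:
    """Append the user turn, resend the WHOLE history, append the reply."""
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=history,  # the full transcript goes out on every round
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

chat("hi")                         # round 1: sends 2 messages
chat("what is the weather today")  # round 2: sends 4 messages
```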
So each time the model is fed the complete history, and I'd guess the /chat endpoint processes this array of message objects into plain text with special delimiters, then feeds it into the same underlying process the /complete endpoint uses.
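For example, Llama 2's chat format flattens the array into one string using [INST] delimiters. A sketch of that flattening (the tag strings are Llama 2's documented format; the function itself is my own illustration):

```python
def llama2_prompt(messages: list[dict]) -> str:
    """Flatten chat messages into Llama 2's single-string prompt format."""
    system = ""
    if messages and messages[0]["role"] == "system":
        system = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]
    out = ""
    # Remaining messages alternate user/assistant; pair them up per turn.
    for i in range(0, len(messages), 2):
        user = messages[i]["content"]
        prefix = system if i == 0 else ""  # system block only in first turn
        if i + 1 < len(messages):
            # Completed turn: user question plus the assistant's reply.
            out += f"<s>[INST] {prefix}{user} [/INST] {messages[i + 1]['content']} </s>"
        else:
            # Trailing user turn: the model generates its completion from here.
            out += f"<s>[INST] {prefix}{user} [/INST]"
    return out
```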
So that's also why the context would be used up (as message is accumulating), and the frontend chat interface would try to truncate some old message history and thus we experience forgetting
I'm guessing that's how it works for now, but I don't have the source code of the ChatGPT frontend; I'm only inferring from what they describe for the "Assistant" functionality (which helps truncate old history).
On the final question of how to control LLM output, I don't have a great answer, but following prompt-engineering practices and iterating with trial-and-verify might be the way to go, at least for current models.
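One concrete version of "trial and verify" is to ask for a machine-checkable format and retry on failure. A hypothetical, self-contained sketch (`generate` is any callable that takes a prompt string and returns the model's text; plug in whatever API you use):

```python
import json

def ask_for_json(generate, prompt: str, retries: int = 3) -> dict:
    """Iterative trial-and-verify: re-prompt until the output parses as JSON."""
    instruction = prompt + "\nReply with valid JSON only, no prose."
    for _ in range(retries):
        reply = generate(instruction)
        try:
            return json.loads(reply)  # verify: does the output parse?
        except json.JSONDecodeError:
            # Feed the failure back so the next attempt can correct it.
            instruction = (
                prompt
                + "\nYour previous reply was not valid JSON."
                + "\nReply with valid JSON only, no prose."
            )
    raise ValueError(f"No valid JSON after {retries} attempts")
```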