Overview
Faster, cheaper LLM responses
A simple way to improve performance when using LLMs is to cache responses, using the user query as the key. The disadvantage is that only queries matching a previous query exactly result in a cache hit.
A more useful form of caching is semantic caching: caching based on the embedding of the query, and returning a cached response when a new query is semantically similar enough to a previous one. This allows for a higher cache hit rate, meaning faster responses for your users and reduced OpenAI bills.
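Under the hood, a semantic cache embeds each incoming query and checks whether a previously answered query is close enough in embedding space; if the similarity clears a threshold, the stored response is returned instead of calling the model again. The sketch below only illustrates the idea: the embedding model, the 0.9 threshold, and the in-memory store are assumptions for illustration, not how Unkey's gateway is implemented. With Unkey, the gateway does all of this for you.

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Illustrative in-memory store; a production cache would use a vector database.
const cache: { embedding: number[]; response: string }[] = [];
const SIMILARITY_THRESHOLD = 0.9; // assumed value, tuned per use case

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function cachedCompletion(query: string): Promise<string> {
  // Embed the incoming query.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const embedding = data[0].embedding;

  // Return the stored response if a semantically similar query was seen before.
  for (const entry of cache) {
    if (cosineSimilarity(embedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response;
    }
  }

  // Cache miss: call the model and store the result keyed by the embedding.
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: query }],
  });
  const response = completion.choices[0].message.content ?? "";
  cache.push({ embedding, response });
  return response;
}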
To enable semantic caching with Unkey:
- Set up a new gateway in the dashboard
- Replace the baseURL of your OpenAI constructor with your new gateway URL
Subsequent responses will be cached. You can monitor the cache via our dashboard.
Unkey’s semantic cache supports streaming, making it useful for web-based chat applications where you want to display results in real time.
As with all our work, semantic caching is open source on GitHub.
Get started
Set up a new semantic cache gateway
Set the baseURL of your OpenAI constructor to use the gateway
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://<gateway>.llm.unkey.io",
});
With the baseURL pointing at your gateway, all requests are forwarded through Unkey before reaching OpenAI.
Test it out
Make a request to your new gateway. You will see the request appear on the logs page. After the first request, semantically similar requests will be served from the cache.
const chatCompletion = await openai.chat.completions.create({
  messages: [
    {
      role: "user",
      content: "What's the capital of France?",
    },
  ],
  model: "gpt-3.5-turbo",
  stream: true,
});

for await (const chunk of chatCompletion) {
  // The final chunk has no content, so fall back to an empty string.
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
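If you want to sanity-check the cache outside the dashboard, one option is to time two identical non-streaming requests: the second should come back noticeably faster once it is served from the cache. A minimal sketch, assuming the same gateway setup as above:

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://<gateway>.llm.unkey.io", // your Unkey gateway URL
});

async function timedRequest(label: string): Promise<void> {
  const start = Date.now();
  await openai.chat.completions.create({
    messages: [{ role: "user", content: "What's the capital of France?" }],
    model: "gpt-3.5-turbo",
  });
  console.log(`${label}: ${Date.now() - start}ms`);
}

// The first call is a cache miss and goes to OpenAI; the second identical
// call should be answered from the semantic cache and return much faster.
await timedRequest("first request (miss)");
await timedRequest("second request (hit)");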
Monitor your savings
New requests will appear in the logs tab. Visit the analytics tab to monitor cache hits and misses and see your time and cost savings over time.