The differences between prompt context, RAG, and fine-tuning

The differences between prompt context, RAG, and fine-tuning, and why we chose prompting.

Robin Marillia


Senior Full Stack Engineer

Implementation approaches for AI knowledge systems

When integrating internal knowledge into AI applications, three main approaches stand out:

  1. Prompt Context – Load all relevant information into the context window and leverage prompt caching.
  2. Retrieval-Augmented Generation (RAG) – Use text embeddings to fetch only the most relevant information for each query.
  3. Fine-Tuning – Train a foundation model to better align with specific needs.

Each approach has its own strengths and trade-offs:

  • Prompt Context is the simplest to implement, requires no additional infrastructure, and benefits from increasing context window sizes (now reaching hundreds of thousands of tokens). However, it can get expensive with large inputs and may suffer from context overflow.
  • RAG reduces token usage by retrieving only relevant snippets, making it efficient for large knowledge bases. However, it requires maintaining an embedding database and tuning retrieval mechanisms (see the sketch after this list).
  • Fine-Tuning offers the best customization, improving response quality and efficiency. However, it demands significant resources, time, and ongoing model updates.
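
To make the RAG option above more concrete, here is a minimal sketch of embedding-based retrieval using the OpenAI SDK. The knowledge chunks, the `text-embedding-3-small` model choice, and the top-k value are illustrative assumptions, not a description of any production setup.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Illustrative knowledge chunks; in practice these would come from your docs.
const chunks = [
  "Fleet leases laptops and phones to companies on a monthly basis.",
  "Devices can be returned or renewed at the end of the lease period.",
  "Support requests are handled through the dashboard.",
];

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function retrieve(query: string, topK = 2): Promise<string[]> {
  // Embed the query and the chunks in one batch call.
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: [query, ...chunks],
  });
  const [queryEmb, ...chunkEmbs] = res.data.map((d) => d.embedding);

  // Rank chunks by similarity to the query and keep the top K.
  return chunkEmbs
    .map((emb, i) => ({ chunk: chunks[i], score: cosine(queryEmb, emb) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((r) => r.chunk);
}

// Example: const relevant = await retrieve("How do I return a device?");
```

Only the retrieved chunks are then injected into the prompt, which keeps token usage low at the cost of maintaining the embedding pipeline.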

Why we chose prompt context

For our current needs, prompt context was the most practical choice (a minimal sketch follows the list below):

  • It allows for a fast development cycle without additional infrastructure.
  • Large context windows (100k+ tokens) are enough for our small knowledge base.
  • Prompt caching helps reduce latency and cost.
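
Concretely, the approach boils down to loading the whole knowledge base into the system prompt on every request. The file path, model name, and instructions below are illustrative assumptions rather than our exact setup.

```typescript
import { readFile } from "node:fs/promises";
import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical path: the entire knowledge base, concatenated into one document.
const knowledgeBase = await readFile("./knowledge-base.md", "utf8");

async function answer(question: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "You are a support assistant. Answer using only the knowledge base below.\n\n" +
          knowledgeBase,
      },
      { role: "user", content: question },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```

There is no embedding database and no retrieval step; the trade-off is that every request carries the full knowledge base, which is exactly why caching matters.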

Prompt caching: structuring the prompt for efficiency

One thing we quickly learned: without proper prompt caching, the implications are significant:

  • Cost Impact: Each request would require sending the entire context window (potentially tens or hundreds of thousands of tokens) to the AI provider, dramatically increasing API costs. For high-volume applications, this can quickly become prohibitively expensive (see the rough arithmetic after this list).
  • Latency Issues: Processing large prompts from scratch with each request adds considerable processing time—often seconds per request—creating a poor user experience.
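
As a rough illustration, the calculation below uses an assumed price per million input tokens and an assumed 50% discount on cached tokens; these numbers are for back-of-the-envelope arithmetic only, not a quote of any provider's actual rates.

```typescript
// Back-of-the-envelope cost estimate for a context-heavy prompt.
const promptTokens = 100_000;   // static knowledge base + instructions
const pricePerMTokens = 2.5;    // assumed $ per 1M input tokens
const cachedDiscount = 0.5;     // assumed 50% discount, assuming the prompt is almost entirely a cacheable static prefix
const requestsPerDay = 10_000;

const costWithoutCache =
  (promptTokens / 1_000_000) * pricePerMTokens * requestsPerDay;
const costWithCache = costWithoutCache * (1 - cachedDiscount);

console.log(`Without caching: $${costWithoutCache.toFixed(0)} / day`); // $2500 / day
console.log(`With caching:    $${costWithCache.toFixed(0)} / day`);    // $1250 / day
```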

Almost every AI provider offers a way to cache prompts, but we had to learn how to structure the prompt to take advantage of it. OpenAI's documentation puts it best:

“Cache hits are only possible for exact prefix matches within a prompt. To realize caching benefits, place static content like instructions and examples at the beginning of your prompt, and put variable content, such as user-specific information, at the end. This also applies to images and tools, which must be identical between requests.”

This meant we had to put everything dynamic, like user queries or session-specific data, at the bottom of the prompt, while keeping static instructions and examples at the top. Otherwise, even small changes would prevent the cache from working, leading to unnecessary re-computation and extra costs.
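
In practice that ordering looks something like the sketch below: everything static sits in a fixed prefix, and only the final user message changes between requests. The wording, placeholders, and model name are illustrative.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Static prefix: identical on every request, so the provider can cache it.
const staticMessages = [
  {
    role: "system" as const,
    content:
      "You are Fleet's support assistant.\n\n" +
      "## Instructions\n..." +
      "\n\n## Knowledge base\n..." +
      "\n\n## Examples\n...",
  },
];

async function ask(userQuery: string, sessionData: string) {
  return client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...staticMessages,
      // Dynamic content goes last so it never breaks the cached prefix.
      { role: "user", content: `${sessionData}\n\n${userQuery}` },
    ],
  });
}
```

A single changed token in the middle of the system prompt is enough to miss the cached prefix, so anything user- or session-specific has to stay at the very end.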

For context-heavy applications like ours, implementing effective prompt caching wasn't just an optimization—it was essential for making the solution financially and technically viable.

Massive context windows: a game-changer

Another reason we stuck with prompt context is that context windows keep getting larger. OpenAI’s latest models support over 128k tokens, which is already more than our entire knowledge base. Google’s Gemini models take it even further, reaching up to 2 million tokens—basically enough to fit multiple books in a single prompt.

With these improvements, the main downside of prompt context—running out of space—is becoming less of an issue.
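
A rough way to sanity-check this is the familiar "about four characters per token" rule of thumb for English text; it is an approximation, not an exact tokenizer count, and the file path and window size below are placeholders.

```typescript
import { readFile } from "node:fs/promises";

// Very rough heuristic: ~4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

const knowledgeBase = await readFile("./knowledge-base.md", "utf8");
const approxTokens = Math.ceil(knowledgeBase.length / CHARS_PER_TOKEN);

const contextWindow = 128_000; // e.g. a 128k-token model
const share = ((approxTokens / contextWindow) * 100).toFixed(1);
console.log(`~${approxTokens} tokens, about ${share}% of the context window`);
```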

The future: a hybrid approach

While prompt context works well for us right now, we know it won’t scale forever. As our knowledge base expands, we expect to combine RAG for efficiency and fine-tuning for more specialized responses. But for now, keeping it simple has been the right call.
