RAG, Memory & Context

Advanced Retrieval-Augmented Generation

Last updated 2026-08-01

What's new

2026-08-01

Reinforcement learning (RL, a way to teach AI by rewarding good actions and penalizing bad ones) works best when rewards are clear, but real-world tasks are often messy and lack them.
Primordial AI (a company that builds tools for AI) has developed tools to help with RL in messy, real-world tasks, including a training platform called Lab (a place to test and improve AI models).
Environments (the space where AI learns, made up of tasks, tools, and rewards) can be used for more than just RL, like creating data for other AI training methods.
Many real-world tasks, like writing reports or handling customer service, lack clear rewards, making it challenging to teach AI effectively.

2026-07-28

Tokens (small text units AI models process) matter for response length, cost, and speed, so budget them wisely in your prompts.
Context windows (the AI's temporary memory) are finite, so manage them to avoid quality loss as conversations grow or large files are added.
Prompts (your messages to the AI) should clearly state goals, context, and desired outcomes, with standing rules for persistent guidance.
Always verify AI responses, as they can sound confident but be incorrect, including fabricated citations, made-up statistics, or plausible false facts.
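
As a minimal sketch of budgeting tokens and managing a finite context window, the snippet below keeps the standing rules and then as many of the newest messages as fit. The function names and the 4-characters-per-token estimate are assumptions for illustration; a real tokenizer will count differently.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def fit_to_budget(standing_rules: str, messages: list[str], budget: int) -> list[str]:
    """Keep the standing rules, then as many of the newest messages as fit."""
    remaining = budget - estimate_tokens(standing_rules)
    kept: list[str] = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break  # stop at the first message that no longer fits
        kept.append(message)
        remaining -= cost
    # Restore chronological order and put the standing rules first.
    return [standing_rules] + list(reversed(kept))


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
context = fit_to_budget("r" * 20, history, budget=25)  # rules cost about 5 tokens
```

With this budget, the oldest message is dropped while the standing rules and the two newest messages survive.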

2026-07-25

Video AI systems are gaining a "memory layer" (the ability to preserve the meaning and context of video content over time), because videos are not just collections of images: they carry spatial and temporal relationships.
Current AI systems struggle with video because they treat it like text, losing important visual and temporal information, and lack the ability to connect events across different videos or time periods.
Troll Lab, a startup, is building a new AI system that can understand video like humans do, by preserving and connecting meaningful moments, entities, and metadata across videos, and making it accessible through an API (a tool that allows different software applications to communicate with each other).
This new system aims to enable a new class of video intelligence applications, moving beyond simple search and retrieval to answering complex questions that require understanding of entities, timelines, and evidence across an entire video collection.

2026-07-22

A major AI hardware company built a practical knowledge base (a searchable collection of information) using AI, pulling data from Slack, wikis, code repos (like GitHub), and custom databases.
They used a technique called retrieval augmented generation (RAG), which helps AI models answer specific questions by pulling relevant information from a company's own data.
This system allows anyone in the company to ask questions and get accurate answers, making it easier to find information and make decisions.
The creator of the video plans to show how to build a similar system, highlighting its usefulness for both small teams and large organizations.

2026-07-19

AI tools like GitHub Copilot and OpenClaw (software that writes code for you) are getting better at coding, thanks to their built-in (intrinsic) knowledge and reasoning abilities.
Microsoft's Foundry (a platform for building and managing AI tools) offers thousands of models (pre-trained AI tools) to help you build AI-powered agents (AI tools that can perform tasks for you).
Microsoft IQ (a set of tools for connecting AI tools to your organization's data) helps AI tools access and use your organization's data, like documents, emails, and analytics (data analysis tools like Power BI).
AI tools are evolving to use more sophisticated systems (context engineering) to retrieve and use data, moving from simple data sets to company-wide data and from basic search to complex retrieval systems.

2026-07-13

**Semantic tool selection** (picking the right tools for the job) helps AI agents (computer programs that use AI to do tasks) use fewer tokens (units of text AI processes), reducing costs and hallucinations (making up incorrect information).
**Graph RAG** (structured data search) replaces text searches with precise queries, helping AI agents find verifiable answers for tasks like counting or multi-step reasoning.
**Multi-agent validation** (using a second AI agent) checks responses before they reach users, improving accuracy and catching errors.
**Neuro-symbolic guardrails** (rules written in Python) ensure AI agents follow specific rules, preventing incorrect or unwanted actions.

2026-07-10

Mixedbread, a new AI tool (software that uses artificial intelligence), is teaching AI agents (AI programs that can perform tasks) to use better search methods, closing what they call the "knowledge gap" (the difference between AI's reasoning abilities and its ability to find information).
They've shown that AI's performance drops significantly when it can't access the right information, but using Mixedbread's search tool can recover most of that performance.
Mixedbread's AI agent uses four main search tools: overview search (a wide semantic search), main semantic search (a detailed search), filter chunks (sorting and finding chunks based on metadata), and grep (a keyword match search tool).
The agent can perform up to four search rounds, with parallel searches in each round, to explore different aspects of a query and pick the best search tool for each.

2026-06-28

The "taste" skill (open-source GitHub project) helps improve AI-generated front-end design, making websites look better with features like image-to-code and redesign tools.
"Impeccable" (open-source front-end design skill) is now built into GitHub Copilot (a tool that helps write code), offering 23 commands to refine and critique designs, with a live browser editor for visual adjustments.
"Awesome design.md" (based on Google Stitch's design.md principle) uses existing websites as templates, breaking down their design elements to help you create your own unique site with a similar look and feel.
"Ponytail" (fast-growing AI repo) aims to make Claude Code (AI tool for coding) more efficient, reducing the amount of code it writes while maintaining the same output, making it faster and cheaper to use.

2026-06-16

A new tool called Headroom (a program that sits between your AI and the internet to reduce the data it processes) can make AI agents up to 10 times cheaper by compressing logs and other data before it reaches the AI model, sometimes reducing data by 96%.
Headroom is open-source (free to use and modify) and easy to install, working with tools like Claude Code, Cursor, and Ada (AI coding assistants) to compress data like logs, code, and tool outputs.
It uses different strategies for different content types, like keeping error messages in logs while removing repetitive, unimportant lines, and it leaves a trail (a special code) to retrieve the original data if needed.
While Headroom can save a lot on large, noisy data, it may not save anything or could even cost more on small, recent data, and it's designed to protect the most recent messages to preserve the AI's reasoning context.
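
To make the log-compression idea concrete, here is a minimal sketch of one such strategy: keep error lines, drop repeated routine lines, and leave the newest lines untouched. It is not Headroom's actual implementation; the function name, the error pattern, and the `protect_last` parameter are assumptions, and the retrieval trail for the original data is left out.

```python
import re

# Assumption: lines containing these words are the ones worth keeping.
ERROR_PATTERN = re.compile(r"error|exception|traceback|fatal", re.IGNORECASE)


def compress_log(lines: list[str], protect_last: int = 2) -> list[str]:
    """Drop repeated non-error lines, but keep errors and the newest lines."""
    cutoff = max(0, len(lines) - protect_last)
    seen: set[str] = set()
    kept: list[str] = []
    for index, line in enumerate(lines):
        is_recent = index >= cutoff
        is_error = bool(ERROR_PATTERN.search(line))
        # Normalize digits so "retry 1" and "retry 2" share one signature.
        signature = re.sub(r"\d+", "N", line)
        if is_recent or is_error or signature not in seen:
            kept.append(line)
        seen.add(signature)
    return kept


log = [
    "GET /health 200",
    "GET /health 200",
    "GET /health 200",
    "ERROR: database timeout",
    "GET /health 200",
    "retry 1",
    "retry 2",
]
compressed = compress_log(log)
```

On this sample the seven lines shrink to four: one health check, the error, and the two protected recent lines. On a short log with no repetition nothing would be removed, which matches the caveat about small, recent data.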

2026-06-13

WorkOS (a company that helps other companies manage user logins and enterprise features) created an internal tool called Studio to help non-technical employees answer their own questions about the business using data from their databases (Snowflake), project management tools (Linear), and note-taking apps (Notion).
Studio uses a combination of an AI language model (Opus) and LangGraph (a framework for building agents that can work with different tools) to understand questions, find the relevant data, and provide answers.
Employees can ask Studio questions in a special dashboard or even through Slack (a popular workplace messaging app), making it easy to get information without needing to know how to write complex database queries (SQL).
Studio can also create reusable tools called widgets that display data in a useful way, like a table or chart, which employees can share and use regularly, such as during team meetings.

2026-06-10

AI agents (computer programs that can do tasks for you) have evolved from simple prompts to complex multi-agent systems, but more agents can lead to clashes and issues.
Current AI models struggle with large amounts of information: they attend mostly to the start and end of the context and tend to overlook the middle, so accuracy follows a U-shaped curve.
To improve results, strategic context optimization (picking the most important information) is better than dumping everything into the AI.
Solutions like context engines (smart filters), summarization, knowledge graphs (maps of connections), and iterative retrieval (like library cards) can help, each with its own pros and cons.

2026-06-03

A memory system (notes about who you are and your work) stops AI tools like Claude or ChatGPT (AI assistants you chat with) from forgetting mid-conversation.
Store three memory levels: who you are (role, tone, preferences), what you're working on, what happened before—so Claude remembers you seamlessly.
Set it up in Claude once and it works everywhere—ChatGPT, other AI tools—so you don't repeat your background to each tool separately.
Start a fresh conversation anytime; the AI knows your context (background info) already, without needing you to paste old messages.

Key points

What it is

Advanced Retrieval-Augmented Generation (RAG, a method that lets AI chat with your own data) is an improved way for AI to use your data when answering questions.
It expands your question, runs multiple searches at once, and merges results to give you the best answer.
Advanced RAG reduces the work needed to build a search system or database, letting you focus on building useful AI agents.

How to use it

Prepare your data by splitting documents into small, targeted pieces.
Generate numerical representations of each piece and store them in a vector database (a store for those representations).
When a user asks a question, the system retrieves the most relevant pieces and passes them to a large language model (LLM, a type of AI) to generate an answer.

Watch out for

Don't treat all data the same; use exact searches for code and RAG for natural language documents.
Avoid building a pipeline where knowledge disappears after each conversation; write better documents first and keep the information.
Don't use a one-size-fits-all approach; match the retrieval method to the data type.

Tools named

Gemini and OpenAI (AI services that handle storage and searching for you), LlamaIndex (AI platform that uses hybrid retrieval), ripgrep (command-line search tool)

Lesson 1: What is Advanced Retrieval-Augmented Generation and why it matters

Advanced Retrieval-Augmented Generation (RAG) is an improved method for getting an AI to use your own data when answering questions. Basic RAG works like this: an AI agent (an AI that performs tasks) only knows what’s in its training data. If it lacks information, it must go retrieve it, then augment its answer with that new data, and finally generate a response. But building a basic RAG pipeline from scratch is a pain — you need to parse PDFs, chunk text, generate embeddings (numeric representations of text), set up a vector database (a storage system for those numeric representations), and write retrieval logic. That’s a lot of infrastructure just to ask a question about a document.

Advanced RAG makes this easier and more powerful. For example, when you search, it does not run your query as typed; it expands your question first. A local AI model generates three types of subqueries, then fires six searches in parallel (three vector searches, three keyword searches). The results merge through reciprocal rank fusion (a method that scores each result by summing 1/(k + rank) across every ranked list it appears in), and a local reranker sets the final order. This all happens in under a second on your machine, with no cloud or API keys.
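
The merging step can be sketched in a few lines. This is a minimal illustration of reciprocal rank fusion, not any particular tool's code: the document ids are made up, and `k = 60` is a commonly used constant that is assumed here. Query expansion, the parallel searches, and the reranker are left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from one vector search and one keyword search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that appear near the top of both lists (here `doc_b` and `doc_a`) outrank documents that appear in only one, which is why fusion helps when vector and keyword search disagree.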

Why this matters for AI development: it drastically reduces the work you need to do. Instead of building your own search system or database, you can use tools like Gemini or OpenAI that handle the storage and searching for you. You only need to upload a document, and the AI takes care of everything else. For real-world use, an AI assistant can know your name, business, priorities, and team, and can check in with others, create things, research, or plan your day. Advanced RAG makes this possible without writing complex retrieval code yourself, letting you focus on building useful AI agents instead of infrastructure.

Lesson 2: How to use Advanced Retrieval-Augmented Generation: step-by-step

Advanced Retrieval-Augmented Generation (RAG — a method that lets AI chat with your own data) is essential for handling large collections of natural language documents like wikis or medical records. While building a RAG pipeline from scratch used to be painful—requiring parsing PDFs, chunking text (splitting documents into small pieces), generating embeddings (numerical representations of text), setting up a vector database (a store for those embeddings), and writing retrieval logic—modern tools make it much easier.

Here is a step-by-step example. First, prepare your data by chunking your documents into small, targeted slices. Second, generate embeddings for each chunk and store them in a vector database. Third, when a user asks a question, the system retrieves only the most relevant chunks, which keeps the lookup fast, cheap, and precise. For instance, a keyword search across 10 million documents for "Star Wars spaceships" would miss mentions of "X-wing" or "Millennium Falcon," but RAG's semantic matching catches those variations. Fourth, pass the retrieved chunks to a large language model (LLM) to generate an answer.

The key insight: for code, tools like ripgrep (a command-line search tool) are better because code is precisely structured, but for messy documents, RAG is irreplaceable. Smart teams now build agentic retrieval, where the system chooses between exact searches and conceptual searches automatically. This hybrid approach, used by platforms like LlamaIndex, makes RAG smarter without adding overhead. The result: you avoid the pain of building infrastructure from scratch while getting answers that hold up at scale.
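
The four steps above can be sketched end to end. This is a toy under stated assumptions: `embed` is a word-count stand-in for a real embedding model, so it only matches shared words and would not link "spaceship" to "X-wing"; the "vector database" is a plain list; and the final prompt is built but not sent to any LLM. All function names are hypothetical.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 7) -> list[str]:
    """Step 1: split a document into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Step 2 stand-in: a word-count vector instead of a learned embedding."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, store: list[tuple[str, Counter]], top_k: int = 1) -> list[str]:
    """Step 3: return the top_k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 4: assemble the text that would be passed to an LLM."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


document = (
    "The Millennium Falcon is a fast freighter. "
    "Tatooine is a planet with two suns."
)
store = [(piece, embed(piece)) for piece in chunk(document)]  # the toy vector store
question = "Which planet has two suns?"
top_chunks = retrieve(question, store)
prompt = build_prompt(question, top_chunks)
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality of the matches, not the shape of the pipeline.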

Lesson 3: Best practices and pitfalls

Building a RAG (retrieval-augmented generation) pipeline from scratch is a pain nobody wants. You need to parse PDFs, chunk text, generate embeddings (numerical representations of meaning), set up a vector database, and write retrieval logic. That’s a lot of infrastructure just to ask a question about a document.

The biggest pitfall is treating all data the same. For coding agents, embedding-based RAG adds little. Code has exact spelling, built-in organization via file structure, and tools like ripgrep (an exact pattern searcher) that find identifiers in milliseconds, so embeddings and chunking add latency for little gain. For natural language documents, however, RAG is essential. Keyword search misses the X-wing if you search for “Star Wars spaceship” because it requires literal matches.

The smartest teams avoid committing to one retrieval strategy. Instead, they use agentic retrieval (letting the AI choose the right tool per query). If the question is an exact identifier lookup, route to ripgrep. If it’s a conceptual search across thousands of documents, route to the vector database.
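
A minimal sketch of that routing decision follows. In agentic retrieval the model itself usually picks the tool; the regular-expression heuristic here is a stand-in assumption, and `choose_retriever` and `ripgrep_command` are hypothetical names. The ripgrep command is only built, not executed.

```python
import re

# Assumption: snake_case, camelCase, dots, or parentheses inside a single
# token suggest the query is a code identifier rather than a question.
CODE_HINT = re.compile(r"[_.()]|[a-z][A-Z]")


def choose_retriever(query: str) -> str:
    """Return "exact" for identifier lookups and "semantic" otherwise."""
    token = query.strip().rstrip("?.!")
    if token and " " not in token and CODE_HINT.search(token):
        return "exact"
    return "semantic"


def ripgrep_command(identifier: str, path: str = ".") -> list[str]:
    """Build (but do not run) a literal-string ripgrep search."""
    # --fixed-strings matches the pattern literally; --line-number shows line numbers.
    return ["rg", "--fixed-strings", "--line-number", identifier, path]


queries = ["parse_config", "getUserById", "How do refunds work?"]
routes = {query: choose_retriever(query) for query in queries}
```

Queries routed to "exact" would go to the ripgrep command, while "semantic" queries would go to the vector database lookup.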

Another mistake is building a pipeline where knowledge disappears after each conversation. Most RAG implementations retrieve a chunk, generate an answer, and then the conversation ends. Knowledge is forgotten. The best practice is to write better documents first, treating raw sources as source code and the LLM as a compiler, producing a wiki that persists. Run health checks to find inconsistent data and impute missing information with web searches.

The pipeline nobody wants is the one-size-fits-all approach. The strategy that works is hybrid: match the retrieval method to the data type, and never let retrieved knowledge vanish without being written down.
