RAG, Memory & Context

Parallel Search Session Optimization

Last updated 2026-07-31

What's new

2026-07-31

AI tools (like OpenClaw, a personal assistant app) can sometimes appear to work fine while actually failing to remember important information, a problem called "silent success."
The "harness" (the system managing the AI) is crucial for reliable AI performance, not just the AI model (the "engine") itself, as it handles tasks like state management and ordering.
AI systems should have clear ownership and replay paths for every fact they use, ensuring that information is stored and can be retrieved correctly for future use.
With more event sources and action surfaces, AI failures can be easier to trigger and harder to explain, making a robust harness even more important.

2026-07-25

Notion, a collaboration tool (like a digital workspace), is integrating AI to work alongside humans, calling these AI helpers "agents" (like digital coworkers).
They've seen AI evolve from simple tasks (like drafting emails) to handling complex workflows, but most companies struggle to implement AI effectively due to data silos (information trapped in separate systems).
Upgrading AI models can be costly, with new versions sometimes using more resources without a corresponding increase in revenue, creating tough choices for companies.
Notion emphasizes the need for a "durable system of record" (a reliable central database) to make AI work well, but many companies face high costs and challenges in achieving this.

2026-07-22

A major AI hardware company built a practical knowledge base (a searchable collection of information) using AI, pulling data from Slack, wikis, code repos (like GitHub), and custom databases.
They used a technique called retrieval augmented generation (RAG), which helps AI models answer specific questions by pulling relevant information from a company's own data.
This system allows anyone in the company to ask questions and get accurate answers, making it easier to find information and make decisions.
The creator of the video plans to show how to build a similar system, highlighting its usefulness for both small teams and large organizations.

2026-07-19

AI systems can be designed as workflows (predefined steps like a checklist) or agents (flexible, open-ended tasks like a trusted senior employee), with each having different costs, speeds, and safety needs.
Retrieval (looking up current data), tools (approved actions), and memory (storing context) enhance AI models, making them more accurate and useful for specific tasks.
Five workflow patterns—chaining (sequential steps), routing (categorizing and dispatching), parallelization (concurrent tasks), orchestration (dynamic subtasks), and evaluator-optimizer (generate and critique)—help match AI autonomy to the task at hand.
Start with a single AI call, then add workflows or agents only when necessary, as they increase complexity, cost, and latency.

2026-07-13

Hermes (a free, open-source AI assistant) gained popularity for its strong memory, which lets it remember past conversations, unlike Claude (a paid AI assistant), which struggles with this.
Hermes' success is due to its simple setup and better memory, with users praising its ability to recall past context, making it feel like a helpful colleague.
However, Hermes has some issues, like occasionally overwriting good information and requiring separate setup and maintenance, which can be complicated.
The speaker decided to create a similar memory system within Claude, keeping its benefits and fixing its issues, all without needing extra servers or subscriptions.

2026-07-10

Mixedbread, a new AI tool (software that uses artificial intelligence), is teaching AI agents (AI programs that can perform tasks) to use better search methods, closing what it calls the "knowledge gap" (the difference between AI's reasoning abilities and its ability to find information).
It has shown that an agent's performance drops significantly when it can't access the right information, but that using Mixedbread's search tool recovers most of that performance.
Mixedbread's AI agent uses four main search tools: overview search (a wide semantic search), main semantic search (a detailed search), filter chunks (sorting and finding chunks based on metadata), and grep (a keyword match search tool).
The agent can perform up to four search rounds, with parallel searches in each round, to explore different aspects of a query and pick the best search tool for each.
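The round structure described above can be sketched roughly as follows. This is only an illustration: the four tool functions are hypothetical stand-ins that return fake chunk IDs, the search plan is supplied by the caller rather than chosen by a model, and none of the names come from the actual product.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ROUNDS = 4  # the summary above mentions up to four search rounds

# Hypothetical stand-ins for the four tools; each returns a list of chunk IDs.
def overview_search(q): return [f"overview:{q}"]
def semantic_search(q): return [f"semantic:{q}"]
def filter_chunks(q): return [f"filter:{q}"]
def grep(q): return [f"grep:{q}"]

TOOLS = {
    "overview": overview_search,
    "semantic": semantic_search,
    "filter": filter_chunks,
    "grep": grep,
}

def run_rounds(plan):
    """plan: list of rounds; each round is a list of (tool_name, query) pairs."""
    found = []
    for round_calls in plan[:MAX_ROUNDS]:
        # Searches inside one round run in parallel; rounds run one after another.
        with ThreadPoolExecutor(max_workers=max(1, len(round_calls))) as pool:
            futures = [pool.submit(TOOLS[name], q) for name, q in round_calls]
            for future in futures:
                for chunk in future.result():
                    if chunk not in found:  # keep first occurrence only
                        found.append(chunk)
    return found
```

In a real agent, the plan for each round would depend on what earlier rounds returned; here it is fixed up front to keep the sketch short.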

2026-06-28

Paul Iusztin and Louis-François Bouchard teach how to build a personal AI research operating system (OS) to manage and retrieve notes from tools like Obsidian, Readwise, Notion, and Google Drive.
They emphasize creating a system that integrates with personal values and thoughts, using AI to pull relevant notes for current projects instead of manually searching.
The AI research OS is designed to be adaptable, allowing users to customize it for their specific needs, such as referencing past work to avoid repetition.
For complex problems, they recommend using a retrieval-augmented generation (RAG) pipeline with vector databases, but note that this requires infrastructure and is not as user-friendly for daily use.

2026-06-16

Loop craft (designing automated processes for AI agents) is a new skill, where you set up systems that repeatedly prompt AI agents, check their work, and improve it over time.
AI agents can work in teams, breaking tasks into smaller parts and solving them in parallel, but this requires careful planning to be effective and not waste resources.
Before using loops, ensure the task repeats often, results can be automatically verified, your budget can handle the extra cost, and the agent has proper tools to do the job.
Start small with one simple loop, including a way to automatically check and accept or reject the agent's work, to avoid unnecessary costs and ensure productivity.
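A single loop with an automatic accept-or-reject check might look like the sketch below. The `generate` and `verify` callables are hypothetical placeholders for an agent call and an automated check (for example, a test suite), and the attempt cap stands in for a budget limit.

```python
def run_loop(generate, verify, max_attempts=3):
    """Call generate() until verify() accepts the result or attempts run out.

    generate(feedback) -> candidate
    verify(candidate)  -> (accepted: bool, feedback: str)
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        accepted, feedback = verify(candidate)
        if accepted:
            return {"accepted": True, "result": candidate, "attempts": attempt}
    # Budget exhausted: reject rather than pass along unverified work.
    return {"accepted": False, "result": None, "attempts": max_attempts}
```

Capping attempts keeps the cost of a bad task bounded, which matches the advice to start small.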

2026-06-13

A new system combines ideas from various memory architectures (like Hermes, GBrain, Memarch) to create a better memory tool for AI agents (AI tools that can perform tasks), making it easier to store, inject, and recall information.
The system improves upon Claw code (a type of AI agent), enhancing its memory by adding automatic storage (saving information without the agent deciding what's important) and better recall (finding information by meaning, not just exact words).
Mem search (a memory tool) was chosen for its balance between capturing all information and keeping storage lean, using a fast model like Haiku to summarize conversations daily.

2026-06-10

Together AI is working on a project to extend the context length of AI models to 5 million tokens, which is like giving the model a much larger memory to work with.
They're using a technique called DeepSpeed Ulysses (a way to split up the work of understanding long sequences across multiple GPUs, the powerful processors used for AI tasks) to help manage the huge amount of data.
They also use something called activation checkpointing (a method to save space by recomputing data as needed instead of storing it all at once) to further reduce memory usage.
This work is important for applications like AI agents (programs that can perform tasks based on instructions) and video generation, where understanding long sequences of data is crucial.

2026-06-04

Dynamic workflows (a way for AI to split big tasks into smaller, focused subtasks) solve problems like "agent laziness" (AI skipping tasks) and "goal drift" (losing focus over time).
"Classify and act" works like a receptionist: it first sorts a task (e.g., email is a bug vs. refund request) and then sends it to the right specialist AI to handle it.
"Fan out and synthesize" breaks a big job — like research or due diligence — into separate mini-tasks done by different AI agents at once, then combines their answers into one final result.
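The "classify and act" pattern can be illustrated with a small router. In this sketch a keyword check stands in for the classifying model, and the handlers are hypothetical placeholders for specialist agents.

```python
def classify(email_text):
    """Toy keyword classifier standing in for a model call (assumption)."""
    text = email_text.lower()
    if "refund" in text:
        return "refund"
    if "error" in text or "crash" in text:
        return "bug"
    return "other"

# Hypothetical specialists; a real system would dispatch to separate agents.
HANDLERS = {
    "bug": lambda text: "filed bug report",
    "refund": lambda text: "opened refund request",
    "other": lambda text: "sent to human triage",
}

def classify_and_act(email_text):
    label = classify(email_text)
    return label, HANDLERS[label](email_text)
```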

2026-06-03

Human attention is the real bottleneck in coding, not AI ability—models can handle fifty tasks but humans can only supervise a few daily.
Agent teams (multiple AI workers) coordinate through patterns like delegation (one assigns work to another) and verification (one checks another's work).
Factory's Missions system combines these patterns with orchestrators (planners), workers (coders), and validators (testers) that pass work cleanly from one to the next.
Success criteria written before coding—not after—catch actual bugs; tests written after implementation just confirm whatever was built, missing real problems.

Key points

What it is

Parallel Search Session Optimization is a technique where an AI system runs multiple searches or tasks at the same time, instead of one after another.
It expands your question into subqueries and runs six searches in parallel: three vector searches (finding meaning similarity) and three keyword searches.
Results are merged and ranked using reciprocal rank fusion (a method to combine different search rankings) and a local reranker (a tool that scores the final order).
This technique speeds up research, reduces mistakes, and helps gather more information before the AI's context (working memory) is compressed.

How to use it

Install QMD (a plugin for Claude Code, an AI coding assistant) with a single command to enable parallel search tools.
Send out multiple sub-agents (AI programs for isolated tasks) to research different problems simultaneously, such as analyzing architecture patterns or researching an API.
Limit parallel sessions to three or four maximum to avoid overwriting or losing track of progress.
Delegate file-heavy investigations to sub-agents and have a main session reconcile all results afterward.

Watch out for

Running too many parallel sessions can cause agents to overwrite each other's outputs, leading to lost progress or errors.
Always manage persistent memory carefully across agents to prevent losing progress or introducing errors.
Without careful management, you might assume the AI's output is correct when it might not be, requiring human double-checking.

Tools named

QMD (a plugin for Claude Code to enable parallel search tools), Claude Code (an AI coding assistant)

Lesson 1: What is Parallel Search Session Optimization and why it matters

Parallel Search Session Optimization is a technique where an AI system launches multiple independent searches or agents (AI programs that can act on their own) at the same time, rather than running them one after another. Instead of making one search query and waiting for a result, the system expands your question first — a local AI model generates three types of subqueries, then fires six searches in parallel: three vector searches (searches that find meaning similarity) and three keyword searches, all running simultaneously. The results merge through reciprocal rank fusion (a method to combine different search rankings) and a local reranker scores the final order, all in under a second.
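Reciprocal rank fusion itself is simple enough to sketch in a few lines. The version below is a minimal illustration, not the tool's actual code: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is a commonly used default and is an assumption here, since the source does not say which value is used.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that rank well in both lists ("a" and "b" here) rise above documents that appear in only one, which is why the method suits combining vector and keyword results.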

This matters for AI development because it speeds up research and reduces mistakes. When working on a coding project, a developer can send out five parallel sub-agents to research different problems at once: one analyzing architecture patterns, another researching an API, another checking the codebase structure, a fourth reviewing patterns, and a fifth evaluating token optimization. All five run simultaneously and return results. Without parallel search, each task would run sequentially, wasting time and context (the working memory the AI holds about your project). Every new AI coding session starts with a blank slate, and context compression kicks in once about 60% of the context window is used, causing earlier decisions to vanish. Parallel sessions help you gather more information before that compression erases earlier work. However, running too many sessions risks agents overwriting each other, so limit yourself to three or four parallel sessions, delegate file-heavy investigations to sub-agents, and keep close tabs on what each one is doing.
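One way to enforce a cap on parallel work is a semaphore. The sketch below is an assumption about how such a cap could be wired up, not a description of any particular tool: `run_capped` is a hypothetical helper, and each task is a zero-argument coroutine function standing in for a session or agent.

```python
import asyncio

MAX_PARALLEL = 3  # the text suggests three or four

async def run_capped(tasks, limit=MAX_PARALLEL):
    """Run coroutine functions concurrently, never more than `limit` at once.

    Returns (results in input order, peak number running at the same time).
    """
    sem = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def guarded(task):
        nonlocal running, peak
        async with sem:  # waits here while `limit` tasks are already running
            running += 1
            peak = max(peak, running)
            try:
                return await task()
            finally:
                running -= 1

    results = await asyncio.gather(*(guarded(t) for t in tasks))
    return results, peak
```

Extra tasks simply queue until a slot frees up, so the cap limits concurrency without dropping work.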

Lesson 2: How to use Parallel Search Session Optimization: step-by-step

Parallel search session optimization means running multiple searches at the same time and merging the best results. When you issue a query, the system expands your question first. A local AI model generates three types of subqueries: a HyDE query (a hypothetical document that would answer your question), dense retrieval sentences for vector search, and BM25 keywords for lexical search. Then it fires six searches in parallel — three vector searches and three BM25 searches run simultaneously. All results merge through reciprocal rank fusion (a formula that blends ranked lists from different searches). A local re-ranker then scores the final order. The entire process completes in under a second, all on your machine.
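The expand-then-fan-out step can be sketched as below. Everything here is illustrative: `expand_query`, `vector_search`, and `bm25_search` are hypothetical placeholders, and the sketch assumes each of the three subqueries is sent to both backends, which is one way to arrive at six searches. The source does not spell out the exact mapping.

```python
from concurrent.futures import ThreadPoolExecutor

def expand_query(question):
    """Hypothetical stand-in for the local model that writes subqueries."""
    return {
        "hyde": f"A document that answers: {question}",
        "dense": f"Sentences about {question}",
        "bm25": " ".join(w for w in question.lower().split() if len(w) > 3),
    }

def vector_search(text):
    # Placeholder: a real version would query an embedding index.
    return [f"vec-hit for {text!r}"]

def bm25_search(text):
    # Placeholder: a real version would query a lexical index.
    return [f"bm25-hit for {text!r}"]

def parallel_search(question):
    """Run 3 subqueries x 2 backends concurrently; return six ranked lists."""
    subqueries = expand_query(question)
    jobs = [(fn, q) for q in subqueries.values()
            for fn in (vector_search, bm25_search)]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(fn, q) for fn, q in jobs]
        return [f.result() for f in futures]
```

The six ranked lists returned here are what a fusion and reranking stage would then consume.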

To use this with Claude Code, install QMD with a single command, given in the source as "Claude plugin install QMD at QMD" (most likely the spoken form of: claude plugin install qmd@qmd). This gives Claude four new tools: Query for hybrid search, Get for document retrieval, Multi-get for batch lookups, and Status for index health. Every new session automatically searches your past work for relevant context, so you never re-explain your project. For teams, use HTTP transport with a shared long-lived server.

You can also run parallel research sessions manually. Send out five parallel sub-agents: one analyzes architecture patterns, one researches an API, one checks your codebase structure, one reviews paid AI patterns, and one evaluates token optimization. They all run simultaneously. A main session then reconciles all results. Make sure the agents do not overwrite each other's work, which means managing persistent memory across them. This technique can reportedly cut token costs by 60 to 90% on long sessions.
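The fan-out and reconcile step might be sketched like this. The `sub_agent` coroutine is a hypothetical placeholder that returns canned notes where a real sub-agent would call a model in its own context, and the reconciliation here is deliberately simple: each report is filed under its own topic.

```python
import asyncio

async def sub_agent(topic):
    """Hypothetical sub-agent; a real one would run a model with its own context."""
    await asyncio.sleep(0)  # stands in for model and tool latency
    return {"topic": topic, "findings": f"notes on {topic}"}

async def research(topics):
    """Run one sub-agent per topic at the same time, then reconcile."""
    reports = await asyncio.gather(*(sub_agent(t) for t in topics))
    merged = {}
    for report in reports:
        # Keying by topic means no report can silently replace another.
        if report["topic"] in merged:
            raise ValueError(f"duplicate topic: {report['topic']}")
        merged[report["topic"]] = report["findings"]
    return merged

if __name__ == "__main__":
    topics = ["architecture patterns", "API research", "codebase structure"]
    print(asyncio.run(research(topics)))
```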

Lesson 3: Best practices and pitfalls

Parallel search sessions (running multiple AI queries at the same time) can speed up your work, but they come with common pitfalls. Sessions running in parallel can overwrite each other's outputs unless you manage persistent memory carefully across agents, so you risk losing progress or introducing errors that a human then has to catch. Best practice is to limit parallel sessions to three or four. Beyond that, it becomes easy to lose track of what each session is doing and to assume the AI's output is correct when it might not be.
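One way to guard against overwrites is to namespace every write by agent and refuse a second write to the same slot. The class below is a toy, in-memory illustration of that idea under those assumptions; `SharedMemory` is a hypothetical name, and a real setup would persist to disk or a database.

```python
import threading

class SharedMemory:
    """Toy stand-in for persistent memory; every key is namespaced by agent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def write(self, agent_id, key, value):
        with self._lock:
            slot = (agent_id, key)
            if slot in self._data:
                # Fail loudly instead of silently losing earlier progress.
                raise KeyError(f"{agent_id} already wrote {key!r}")
            self._data[slot] = value

    def read_all(self, key):
        """Return every agent's value for `key`, for the main session to review."""
        with self._lock:
            return {agent: v for (agent, k), v in self._data.items() if k == key}
```

Because two agents can never share a slot, the main session sees every agent's output side by side and decides how to reconcile them.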

To make parallel searches work well, delegate file-heavy investigations to sub-agents (specialized AI workers for isolated tasks). For example, send out five sub-agents simultaneously: one to analyze architecture patterns, one to research an API, one to check your codebase structure, one to review pricing models, and one to evaluate token optimization. These agents cannot talk to each other during their individual research unless you set up an agent team, so have a main session reconcile all the results afterward.

Another effective technique is to prepare reusable skill documents that agents can reference by ID. This avoids wasting tokens (the units AI models charge for) on repeatedly searching for the same fixed information. Also, start every complex task in plan mode to outline the work before executing anything in parallel. This reduces the chance of sessions duplicating effort or conflicting with each other.
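Looking up skill documents by ID can be paired with a cache so the same fixed text is fetched only once. The sketch below reads "stored as IDs" as "looked up by ID", which is an interpretation; the `SKILL_DOCS` contents and the `load_skill` helper are hypothetical.

```python
from functools import lru_cache

# Hypothetical skill store; real documents would live on disk or in an index.
SKILL_DOCS = {
    "deploy-checklist": "1. run tests\n2. tag release",
    "code-style": "Prefer small functions with clear names.",
}

STATS = {"fetches": 0}  # counts real lookups, to show the cache working

@lru_cache(maxsize=None)
def load_skill(skill_id):
    """Fetch a skill document by ID; repeat calls are served from the cache."""
    STATS["fetches"] += 1
    return SKILL_DOCS[skill_id]
```

Calling `load_skill("deploy-checklist")` twice performs only one real fetch, which is the saving the paragraph above describes.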