Before jumping into today’s blog post, I want to share a quick update:
12 days ago, I launched a new GitHub repo called “Agents Towards Production”, a toolbox for building production-level AI agents.
The response has been incredible. The project has already passed 7,000 stars and received over 150,000 visits from people around the world.
I hope it proves useful to many of you, and I’ll keep updating it regularly.
Nir
Oh, I almost forgot the memory part. How ironic. Just kidding, let's begin:
What separates a forgetful chatbot from a truly smart AI agent? Imagine chatting with a virtual travel assistant for months, only to have it repeatedly ask for your preferences as if meeting you for the first time each conversation. Frustrating, right? The key difference lies in memory – not just having memory, but using it smartly. Just as humans selectively remember important details and let trivial ones fade, AI agents need clever strategies to remember what matters and forget what doesn't.
Why Memory Matters for AI Agents
Early AI systems were mostly stateless – they processed each query independently, with no memory of past interactions. This is like talking to someone with severe memory loss: every conversation starts from scratch. A basic thermostat, for example, doesn't "remember" yesterday's temperature; it just reacts to the current reading. But add memory to an AI system, and it transforms into something smarter. A smart thermostat can learn your schedule by remembering past data – turning down heat when you're typically away, saving energy based on patterns.
In conversational AI, memory is even more important. Consider a customer support chatbot that can recall your previous support tickets: it avoids making you repeat information and can tailor its answers using what it "knows" about your past issues. Similarly, ChatGPT keeps a chat history within a session so it doesn't forget context from one message to the next. Without this, each user message would be treated alone, leading to broken, frustrating conversations.
However, memory in AI agents is not as simple as it is for humans. Large language models have limited context windows – they can only consider a fixed amount of text at a time. If you simply add the entire conversation history every time, you quickly hit these limits. The model might start ignoring older content or lose coherence if the context is too long. Moreover, storing everything slows down processing and increases costs. The challenge is clear: how to keep important information without overloading the system.
Short-Term vs Long-Term: Learning from Human Memory
AI researchers often look to human memory for inspiration. We have short-term memory for recent stuff – like remembering a phone number just long enough to dial it – and long-term memory for knowledge we keep over days, months, or years. AI agents organize memory in similar ways:
Short-Term Memory (Working Memory): This is the AI's immediate context within a single session. In a chatbot, it might be the last few user and AI messages stored in the prompt. For instance, if you tell a travel assistant, "Book a trip to Paris in December," it will keep track of "destination: Paris" and "timing: December" during that chat. This short-term memory is like sticky notes on the agent's desk – handy for the current conversation, but they get thrown away once the session ends.
Long-Term Memory: This is information the agent keeps across sessions and over time. It's often stored in external databases so the AI can look it up when needed. If you chat with your AI travel agent again months later and say, "Plan something like last time," it should recall that "last time" you went to Paris in December. That's long-term memory at work – like the agent's personal diary that keeps experiences over time.
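To make the split concrete, here's a minimal Python sketch (all names are illustrative, and the "database" is just a JSON file standing in for real storage) of an agent that keeps short-term memory as an in-session message list and long-term memory in a persistent store:

```python
import json
from pathlib import Path

class AgentMemory:
    """Toy illustration: in-session list vs. persistent store."""

    def __init__(self, store_path="long_term_memory.json"):
        self.short_term = []                 # cleared when the session ends
        self.store_path = Path(store_path)   # survives across sessions
        self.long_term = (
            json.loads(self.store_path.read_text())
            if self.store_path.exists() else {}
        )

    def add_message(self, role, text):
        self.short_term.append({"role": role, "text": text})

    def remember(self, key, value):
        """Promote a fact to long-term memory and persist it."""
        self.long_term[key] = value
        self.store_path.write_text(json.dumps(self.long_term))

    def end_session(self):
        self.short_term.clear()  # the sticky notes get thrown away

memory = AgentMemory()
memory.add_message("user", "Book a trip to Paris in December")
memory.remember("last_trip", {"destination": "Paris", "month": "December"})
memory.end_session()  # short-term gone; "last_trip" is still on disk
```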
The bottom line: giving AI the right kind of memory makes it behave less like a forgetful goldfish and more like a thoughtful companion. But simply having memory isn't enough; it must be used efficiently.
From Simple to Smart: Strategies for Memory Management
Let's explore how AI agents manage their memory, from simple approaches to advanced techniques.
1. Sequential (Keep-It-All) – The Simple Approach
The most basic method is what early chatbots did: keep adding every new message to the conversation history, and feed the whole thing to the model each time. This sequential memory chain keeps the full conversation record. It's like carrying the entire transcript of a conversation as context.
The benefit is simplicity – nothing fancy, just raw memory of everything said. In short conversations, this works fine and ensures no detail is lost. However, as conversations grow, this approach runs into trouble. The context can quickly overflow the model's limit, or become so large that processing it is slow and expensive. It's as if you had to recall every word of every conversation from the past month every time someone asked you a question – your brain would overload.
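A keep-it-all memory is only a few lines of code, which is exactly its appeal. A minimal sketch:

```python
class SequentialMemory:
    """Keep-it-all: the prompt grows with every exchange."""

    def __init__(self):
        self.history = []

    def add(self, role, text):
        self.history.append(f"{role}: {text}")

    def build_prompt(self, new_user_message):
        self.add("user", new_user_message)
        return "\n".join(self.history)  # the whole transcript, every time

memory = SequentialMemory()
print(memory.build_prompt("Book a trip to Paris"))
# After hundreds of turns, this prompt overflows the model's context window.
```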
2. Sliding Window – Focus on Recent Messages
A better approach is the sliding window memory. Instead of keeping the entire history, the agent keeps only the most recent N messages as context. As new messages come in, the oldest ones get dropped – the window slides forward. This mirrors how humans naturally focus on the latest part of a conversation; we tend to recall what was just said and might lose track of details from an hour ago.
Benefits: The sliding window ensures the context stays within a manageable size. It keeps the conversation relevant and recent, which is often enough since recent dialogue usually guides the next response. Performance stays consistent no matter how long the overall conversation gets.
Drawbacks: The obvious downside is that the agent might "forget" important information from earlier in the conversation. If a crucial detail was mentioned 50 messages ago and falls out of the window, it won't remember it.
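In code, a sliding window is little more than a fixed-size queue. A minimal sketch using Python's collections.deque:

```python
from collections import deque

class SlidingWindowMemory:
    """Keep only the most recent N messages; older ones fall out."""

    def __init__(self, window_size=10):
        self.window = deque(maxlen=window_size)

    def add(self, role, text):
        self.window.append(f"{role}: {text}")  # oldest message auto-dropped

    def build_prompt(self):
        return "\n".join(self.window)

memory = SlidingWindowMemory(window_size=3)
for i in range(5):
    memory.add("user", f"message {i}")
print(memory.build_prompt())  # only messages 2, 3, and 4 remain
```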
3. Summarization – Distill the Important Parts
What if, instead of dropping old information entirely, the AI could remember it in a condensed form? This is where summary-based memory comes in. The idea is to regularly take the conversation so far, create a brief summary of the important points, and use that summary as a stand-in for the full history.
Think of it like taking notes during a long meeting. Rather than recalling every sentence said, you write down key decisions and facts. Later, those notes help your memory without you needing to relive the entire meeting. AI agents do something similar: after every 10 messages or when the context gets too large, they create a summary of earlier dialogue, then discard the detailed logs.
Benefits: Summarization allows the AI to keep relevant information over very long conversations without exceeding context limits. The agent can maintain awareness of past topics, decisions, or user preferences that occurred far back in the conversation.
Drawbacks: The quality of this approach depends on the quality of the summaries. Important details can be lost – a summary might miss a seemingly minor detail that later turns out to be crucial. Also, creating summaries adds extra computation and potential delays.
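Here's a minimal sketch of the idea. The llm_summarize callable is a placeholder for whatever summarization call you actually use – typically an LLM prompt along the lines of "condense these notes":

```python
class SummarizingMemory:
    """Summarize older messages once the buffer exceeds a threshold."""

    def __init__(self, llm_summarize, max_messages=10):
        self.llm_summarize = llm_summarize  # callable: list[str] -> str
        self.max_messages = max_messages
        self.summary = ""   # condensed long-range memory
        self.recent = []    # verbatim recent messages

    def add(self, role, text):
        self.recent.append(f"{role}: {text}")
        if len(self.recent) > self.max_messages:
            keep = self.max_messages // 2
            old, self.recent = self.recent[:-keep], self.recent[-keep:]
            # Fold the older half (plus the previous summary) into a new summary
            self.summary = self.llm_summarize([self.summary] + old)

    def build_prompt(self):
        return (f"Summary of earlier conversation:\n{self.summary}\n\n"
                + "\n".join(self.recent))

# Toy summarizer for demonstration; swap in a real LLM call.
memory = SummarizingMemory(lambda parts: " ".join(p for p in parts if p)[:500])
```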
4. Retrieval-Based Memory – Smart Recall
Now we arrive at the advanced approach: retrieval-based memory. This strategy gives the AI agent something like an external brain or a personal search engine. Instead of pushing a fixed window or a summary into the model, the conversation history is stored in an external database, and when needed, the agent retrieves the most relevant pieces to include in context.
Here's how it works: imagine every conversation turn the agent encounters is a book in a library. When a new query comes, the agent doesn't read the entire library (too slow!) – it quickly checks the catalog to find which books might be relevant, pulls out just those, and reads them to answer the query.
Benefits: Retrieval-based memory allows an agent to remember large amounts of information over long periods. The agent can surface details from much earlier in a conversation or from long-term knowledge even if the current context window is small. The memory adapts to the current question – only relevant info is brought in.
Drawbacks: The complexity of setup and maintenance is higher. You need systems to store information, algorithms for fast search, and careful tuning to ensure relevant information is retrieved. If the retrieval isn't accurate, the agent might be led astray.
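A minimal sketch of the retrieval loop, assuming you bring your own embed function (for example, a sentence-embedding model or an embeddings API – the storage here is just in-memory lists, where a real system would use a vector database):

```python
import numpy as np

class RetrievalMemory:
    """Store every turn with an embedding; retrieve the top-k relevant turns."""

    def __init__(self, embed):
        self.embed = embed          # callable: str -> np.ndarray
        self.texts, self.vectors = [], []

    def add(self, text):
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def retrieve(self, query, k=3):
        q = self.embed(query)
        # Cosine similarity between the query and every stored memory
        sims = [
            float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = np.argsort(sims)[::-1][:k]   # indices of the k best matches
        return [self.texts[i] for i in top]
```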
Next-Generation Memory Architectures
Beyond these core strategies, cutting-edge AI systems are implementing even more sophisticated approaches that push the boundaries of what's possible.
Memory-Augmented Transformers
To understand this, imagine a regular AI model as a student taking a test with only a small piece of paper for scratch work. No matter how long or complex the test questions get, the student can only work with what fits on that one small paper. If the test has 100 questions, by the time they reach question 50, they've run out of space and have to erase their notes from the earlier questions.
Memory-augmented transformers solve this by giving the AI a stack of sticky notes it can use alongside its main scratch paper. Here's how it works:
The main paper (regular context window) handles the immediate conversation, just like before
The sticky notes (memory tokens) store important information from earlier in the conversation
When the main paper fills up, instead of erasing everything, the AI writes key points on sticky notes
Later, when needed, the AI can look back at these sticky notes to remember what happened before
For example, let's say you're having a long planning session with an AI about organizing a conference. Early in the conversation, you mention your budget is $50,000 and you prefer morning sessions. As the conversation grows longer, this information would normally get pushed out of the AI's immediate memory. But with memory tokens, the AI writes "Budget: $50,000, prefers morning sessions" on a sticky note.
Hours later, when you ask "What venues fit our requirements?", the AI can check its sticky notes, see your budget and timing preferences, and give you relevant suggestions even though that information was mentioned way earlier in the conversation.
The clever part is that the AI learns which information deserves a sticky note and which can be safely forgotten. It's like having a smart assistant who knows the difference between important decisions and casual small talk.
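True memory-augmented transformers learn dedicated memory slots inside the model architecture itself; the sketch below only imitates that behavior from the outside, with an extract_facts placeholder standing in for the learned "what deserves a sticky note" decision:

```python
class MemoryTokenBuffer:
    """Toy approximation: when the main context fills up, distill key facts
    from evicted messages into compact 'sticky notes' that stay available."""

    def __init__(self, extract_facts, context_limit=20):
        self.extract_facts = extract_facts  # callable: list[str] -> list[str]
        self.context_limit = context_limit
        self.sticky_notes = []   # stand-ins for learned memory tokens
        self.context = []

    def add(self, message):
        self.context.append(message)
        if len(self.context) > self.context_limit:
            half = self.context_limit // 2
            evicted, self.context = self.context[:half], self.context[half:]
            self.sticky_notes.extend(self.extract_facts(evicted))

    def build_prompt(self):
        notes = "\n".join(f"[note] {n}" for n in self.sticky_notes)
        return notes + "\n" + "\n".join(self.context)

# Toy fact extractor: keep anything mentioning money. A real system would
# use a learned or LLM-based importance judgment.
buffer = MemoryTokenBuffer(lambda msgs: [m for m in msgs if "$" in m])
```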
Hierarchical Memory Systems
Just as your brain has different types of memory operating at different time scales, advanced AI agents now implement multi-layered memory hierarchies. Picture it like a company's filing system:
Working memory is like the papers on your desk – immediately accessible but limited in space
Short-term memory is like your filing cabinet – larger capacity, quick access for recent projects
Long-term memory is like the company archives – vast storage that requires more effort to access but preserves everything important
These systems automatically manage what information lives at each level, promoting important details from working memory to long-term storage while letting trivial information fade away. It's like having a smart assistant who knows exactly which emails to keep in your inbox, which to file away, and which to delete.
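A minimal sketch of such a hierarchy, with an is_important callable standing in for whatever promotion rule a real system would use:

```python
class HierarchicalMemory:
    """Three tiers: working (tiny), short-term (bounded), long-term (archive)."""

    def __init__(self, working_size=5, short_term_size=50, is_important=None):
        self.working, self.short_term, self.long_term = [], [], []
        self.working_size = working_size
        self.short_term_size = short_term_size
        self.is_important = is_important or (lambda m: False)

    def add(self, message):
        self.working.append(message)
        if len(self.working) > self.working_size:
            demoted = self.working.pop(0)        # overflow moves down a tier
            self.short_term.append(demoted)
        if len(self.short_term) > self.short_term_size:
            old = self.short_term.pop(0)
            if self.is_important(old):           # only key facts are archived
                self.long_term.append(old)       # everything else fades away

memory = HierarchicalMemory(is_important=lambda m: "budget" in m.lower())
```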
Compression and Consolidation
Modern memory systems implement sophisticated compression techniques inspired by how our brains consolidate memories during sleep. Instead of storing raw conversation text, these systems compress experiences into dense representations that capture the essential meaning while using far less storage space.
But how does this actually work under the hood? Let's break down the algorithmic process:
Step 1: Convert to Embeddings First, the system converts text into numerical vectors (embeddings) that capture semantic meaning. Instead of storing "I love Italian food, especially pasta with marinara sauce," it stores a dense vector like [0.23, -0.45, 0.67, ...] that mathematically represents this preference.
Step 2: Identify Patterns and Clusters The system groups similar memories together. All your food preferences might cluster in one area, travel preferences in another. It's like organizing your memories into themed folders, where each folder can be represented by a single "prototype" vector that captures the essence of that category.
Step 3: Hierarchical Abstraction Instead of remembering every individual conversation turn, the system builds layers of abstraction:
Level 1: "User mentioned liking pasta on Tuesday"
Level 2: "User prefers Italian cuisine"
Level 3: "User has strong food preferences"
Each level up uses fewer numbers to store more general patterns.
Step 4: Importance Scoring The system assigns relevance scores to memories using attention mechanisms. Recent interactions get higher scores, frequently referenced topics get higher scores, and emotionally significant moments (detected through language patterns) get preserved with higher fidelity.
Step 5: Lossy Compression Like how JPEG compresses images by removing details the eye won't miss, the system removes conversational details that don't affect future interactions. It might compress "It was a really, really, really good restaurant" down to just "positive restaurant experience" while preserving the restaurant name and your rating.
Step 6: Reconstruction When Needed When the AI needs to recall something, it doesn't retrieve the exact original text. Instead, it reconstructs a response based on the compressed representation, kind of like how you might retell a story in your own words rather than reciting it verbatim.
Think of it like the difference between storing a full movie file versus storing just the plot summary, key quotes, and emotional moments. The compressed version takes up much less space but still captures what matters for future reference. These systems can achieve dramatic space savings while maintaining the ability to recall important details when needed.
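To make Step 2 concrete, here's a minimal sketch of consolidating many stored embeddings into a handful of "prototype" vectors via k-means clustering (scikit-learn assumed; the random vectors are stand-ins for real embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

def consolidate(vectors, n_prototypes=3):
    """Compress many memory embeddings into a few prototype vectors
    (one per cluster), discarding the individual entries."""
    X = np.array(vectors)
    kmeans = KMeans(n_clusters=n_prototypes, n_init=10).fit(X)
    return kmeans.cluster_centers_  # each center summarizes one theme

# e.g. 1,000 stored memory vectors collapse into 3 theme prototypes
memories = np.random.rand(1000, 384)   # stand-in for real embeddings
prototypes = consolidate(memories)
print(prototypes.shape)                # (3, 384)
```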
Operating System Memory Management
Some of the most innovative approaches borrow concepts from computer operating systems. Just as your computer manages memory by moving data between RAM and hard disk storage, these AI systems implement "virtual memory" for conversations.
The agent maintains a small "active" memory (like RAM) for immediate processing, while storing the bulk of its memories in external storage (like a hard drive). When it needs to recall something from long ago, it can "page in" that memory, temporarily bringing it into active context. This allows virtually unlimited conversation length while maintaining fast response times.
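A minimal sketch of the paging idea, with two dictionaries standing in for RAM and disk and simple FIFO eviction (a real system would use smarter eviction policies like least-recently-used):

```python
class PagedMemory:
    """OS-style paging: a small active set, with the bulk in external storage."""

    def __init__(self, active_size=8):
        self.active_size = active_size
        self.active = {}    # "RAM": what the model sees right now
        self.storage = {}   # "disk": everything else

    def write(self, key, value):
        self.active[key] = value
        self._evict_if_needed()

    def read(self, key):
        if key not in self.active and key in self.storage:
            self.active[key] = self.storage.pop(key)  # page in on demand
            self._evict_if_needed()
        return self.active.get(key)

    def _evict_if_needed(self):
        while len(self.active) > self.active_size:
            oldest = next(iter(self.active))          # FIFO eviction
            self.storage[oldest] = self.active.pop(oldest)

mem = PagedMemory(active_size=2)
mem.write("a", 1); mem.write("b", 2); mem.write("c", 3)  # "a" paged out
print(mem.read("a"))  # paged back in: 1
```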
Graph-Based Memory Networks
Instead of storing memories as simple text, advanced systems organize information as interconnected knowledge graphs. Think of this like how your brain connects related memories – remembering your college friend might trigger memories of the dorm, the cafeteria food, that one professor, and so on.
These graph-based systems capture not just what was said, but the relationships between different pieces of information. When you mention "that restaurant we discussed," the system can traverse its knowledge graph to find not just the restaurant name, but also your dietary preferences, the occasion you were planning for, and your budget constraints – all connected in a web of related memories.
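A minimal sketch of such a graph, stored as a plain adjacency map (the facts and relation labels are made up for illustration; production systems typically use a dedicated graph database):

```python
from collections import defaultdict

class MemoryGraph:
    """Memories as nodes, relationships as labeled edges."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, neighbor), ...]

    def connect(self, a, relation, b):
        self.edges[a].append((relation, b))

    def neighbors(self, node):
        return self.edges[node]

graph = MemoryGraph()
graph.connect("restaurant: Chez Marie", "matches", "preference: vegetarian")
graph.connect("restaurant: Chez Marie", "planned_for", "occasion: anniversary")
graph.connect("restaurant: Chez Marie", "within", "budget: $100")

# "that restaurant we discussed" -> traverse the connected facts
for relation, fact in graph.neighbors("restaurant: Chez Marie"):
    print(relation, "->", fact)
```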
Advanced Techniques for Memory Optimization
Beyond the core strategies above, there are additional tricks AI developers use:
Token Compression: Pack more information in fewer tokens by using more efficient phrasing. An agent could rephrase a long paragraph into a short statement that conveys the same facts.
Smart Filtering: Not all pieces of memory are equally important. AI agents can assign scores to potential memory contents to decide what to keep. If the conversation is about travel plans, the agent might score context related to "flights" and "hotels" higher than an offhand joke made earlier (a scoring sketch follows this list).
Dynamic Memory Allocation: Some advanced systems adjust how they use memory based on context complexity. If the user asks a very complex question, the agent might allocate more of its context budget to pulling in relevant background info.
Strategic Forgetting: Counterintuitively, forgetting can be a feature. AI agents are now being trained in strategic forgetting, meaning they learn rules for what to keep and what to remove from memory. Once a task is completed, the agent might "forget" the false starts and errors that happened along the way.
Temporal Awareness: Advanced memory systems understand time and can weight recent information more heavily than older information, or conversely, identify patterns that emerge over long time periods.
Memory Consolidation: Like how human brains replay and strengthen important memories during rest, AI systems can implement background processes that identify and reinforce key information while discarding noise.
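Several of these tricks boil down to scoring memories and keeping only the top ones. Here's a minimal sketch that combines smart filtering with temporal decay; the scoring formula and the 24-hour half-life are illustrative assumptions, not a standard:

```python
import math
import time

def score_memory(memory, topic_keywords, now=None, half_life_hours=24.0):
    """Combine topical relevance with exponential recency decay."""
    now = now or time.time()
    relevance = sum(kw in memory["text"].lower() for kw in topic_keywords)
    age_hours = (now - memory["timestamp"]) / 3600
    recency = math.exp(-math.log(2) * age_hours / half_life_hours)
    return relevance + recency

def keep_top(memories, topic_keywords, budget=5):
    """Smart filtering + strategic forgetting: retain only the best-scoring."""
    ranked = sorted(memories,
                    key=lambda m: score_memory(m, topic_keywords),
                    reverse=True)
    return ranked[:budget]

memories = [{"text": "Prefers morning flights", "timestamp": time.time() - 7200}]
print(keep_top(memories, ["flights", "hotels"]))
```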
Real-World Examples
Personal Travel Assistant: In January, you chatted about a dream vacation to Paris, including preferred hotels and that you love traveling by train. By June, when you say, "Hey, plan something like last time but for July," the AI instantly recalls that "last time" refers to Paris and that you prefer trains over flights. It remembers your hotel preference and travel mode, but it doesn't need to remember that you joked about buying woolen socks.
Customer Support Chatbot: You contact a support AI about a product issue. Two weeks later, the problem returns and you come back. A well-designed support agent will retrieve the context of your previous interaction – it knows what you tried already. It greets you with, "I see last time we updated your device drivers; let's explore further solutions."
AI Coding Assistant: Consider an AI pair-programmer that you've been using for months. It has seen your project evolve. Early on, it suggested an outdated library and you firmly told it never to use that again. An optimized assistant will have strategically remembered that feedback and avoids repeating the suggestion in future sessions.
Choosing the Right Strategy
The choice of memory strategy depends on your specific use case, technical constraints, and user needs. Here's a comprehensive guide to help you decide:
For Simple Applications
Sequential (Keep-It-All): Use when you have very short interactions (under 10 exchanges) and need perfect recall. Good for simple Q&A bots or brief customer service interactions.
Sliding Window: Perfect for medium-length conversations where recent context matters most. Ideal for real-time chat applications, gaming NPCs, or task-focused assistants where older information naturally becomes irrelevant.
For Long-Form Interactions
Summarization: Choose when you need to maintain context over long conversations but can accept some information loss. Great for therapy bots, educational tutors, or creative writing assistants where the overall flow matters more than exact details.
Retrieval-Based Memory: Essential for applications requiring accurate recall of specific facts over time. Perfect for personal assistants, customer support systems, or any agent that needs to reference past interactions precisely.
For Advanced Applications
Memory-Augmented Transformers: Use when you need to handle extremely long conversations or documents while maintaining detailed recall. Ideal for research assistants, complex problem-solving systems, or agents working with large knowledge bases.
Hierarchical Memory Systems: Choose for sophisticated applications that need different types of memory (working, short-term, long-term). Perfect for personal AI companions, enterprise assistants, or any system requiring human-like memory organization.
Graph-Based Memory Networks: Essential when relationships between information matter as much as the information itself. Ideal for recommendation systems, knowledge management tools, or agents that need to understand complex interconnections.
For Resource-Constrained Environments
Token Compression: Use when you have strict computational limits but still need decent memory. Good for mobile applications, edge computing, or high-volume services where cost matters.
Smart Filtering + Strategic Forgetting: Perfect for applications that generate lots of noise but need to preserve important signals. Ideal for social media monitoring, news analysis, or any high-volume data processing.
For Specialized Needs
Operating System Memory Management: Choose for applications with highly variable conversation lengths and unpredictable memory needs. Perfect for customer service platforms, multi-user systems, or applications with diverse use cases.
Compression and Consolidation: Use for applications that need long-term learning while managing storage costs. Ideal for personal assistants that improve over months/years, or systems that need to learn user patterns over time.
Dynamic Memory Allocation: Essential for applications with varying complexity levels. Perfect for educational systems that adapt to student needs, or assistants that handle both simple queries and complex research tasks.
Temporal Awareness: Use when timing and sequence matter significantly. Ideal for scheduling assistants, project management tools, or any system where understanding "when" is as important as "what."
Hybrid Approaches (Recommended for Production)
Most successful systems combine multiple strategies:
Sliding Window + Retrieval: Immediate context plus searchable history (see the sketch after this list)
Summarization + Graph Networks: Compressed narrative plus relationship mapping
Hierarchical + Compression: Multi-level memory with efficient storage
Token Compression + Smart Filtering: Resource efficiency with quality preservation
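As one example, the first combination can be built by composing the SlidingWindowMemory and RetrievalMemory sketches from earlier in this post – recent turns stay verbatim in the window while every turn is also indexed for later recall:

```python
class HybridMemory:
    """Sliding window for immediate context + retrieval over full history."""

    def __init__(self, window, retrieval, k=3):
        self.window = window        # a SlidingWindowMemory instance
        self.retrieval = retrieval  # a RetrievalMemory instance
        self.k = k

    def add(self, role, text):
        self.window.add(role, text)
        self.retrieval.add(f"{role}: {text}")  # everything is also indexed

    def build_prompt(self, query):
        recalled = "\n".join(self.retrieval.retrieve(query, self.k))
        return (f"Relevant history:\n{recalled}\n\n"
                f"Recent messages:\n{self.window.build_prompt()}")
```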
Decision Framework
Ask yourself these questions:
How long are typical interactions? (Short → sliding window, Long → summarization/retrieval)
How important is exact recall? (Critical → retrieval/memory-augmented, Flexible → summarization)
What are your resource constraints? (Limited → compression/filtering, Flexible → hierarchical/graph)
Do relationships between information matter? (Yes → graph networks, No → simpler approaches)
Will the system learn over time? (Yes → consolidation/temporal awareness, No → session-based approaches)
How many users will you serve? (Few → sophisticated approaches, Many → efficient compression)
The key is starting simple and evolving your memory strategy as your application grows in complexity and user base.