Bot-BIN: Persistent Semantic Memory for AI Chatbots
A synchronization framework for converting ephemeral conversation context into durable, searchable memory using the AIF-BIN v2 binary format
terronex.dev
Version 1.0 | February 2026
Abstract
Modern AI chatbots and agents suffer from context window limitations that prevent long-term memory persistence. Bot-BIN addresses this constraint by providing a synchronization layer that converts markdown-based memory files into the AIF-BIN v2 binary format with embedded 384-dimensional vector embeddings. This enables semantic search across conversation history, decisions, and accumulated knowledge. Performance benchmarks demonstrate search latency of 0.39 ms per 1,000 chunks and a throughput of 2.1 million cosine-similarity operations per second. Bot-BIN operates entirely offline with zero cloud dependencies, maintaining user privacy while enabling AI systems to develop persistent, queryable memory.
1. Introduction
The current generation of large language models operates within fixed context windows, typically ranging from 4,096 to 200,000 tokens. While these windows have grown substantially, they remain fundamentally ephemeral. Each conversation session begins with a blank slate, and accumulated knowledge from prior interactions is lost unless explicitly re-injected into the context.
This limitation creates a significant gap between how AI systems operate and how humans expect intelligent assistants to behave. A human assistant remembers past conversations, accumulates knowledge about preferences and decisions, and builds contextual understanding over time. Current AI chatbots cannot replicate this behavior without external memory systems.
Bot-BIN provides a practical solution to this problem. It operates as a synchronization layer between human-readable markdown files and machine-optimized binary memory files. The system watches for changes to memory files, automatically generates vector embeddings, and maintains a searchable archive of all accumulated knowledge.
2. The Memory Problem in AI Systems
2.1 Context Window Constraints
Large language models process input as a sequence of tokens within a fixed-size window. Once this window is filled, older tokens are truncated or the system must employ compression strategies. This creates several practical limitations:
- Session Isolation: Each conversation starts fresh with no memory of prior sessions
- Context Overflow: Long conversations eventually exceed window limits, losing early context
- Expensive Retrieval: Injecting large amounts of historical context consumes token budget
- No Selective Recall: Without semantic search, relevant past information cannot be efficiently retrieved
2.2 Existing Solutions and Their Limitations
Several approaches have been developed to address AI memory limitations:
| Approach | Mechanism | Limitations |
|---|---|---|
| Vector Databases | Cloud-hosted embedding storage | Requires network, vendor lock-in, privacy concerns |
| RAG Pipelines | Retrieve-then-generate | Complex infrastructure, high latency |
| Conversation Logs | Append-only text files | No semantic search, keyword matching only |
| Knowledge Graphs | Entity-relationship storage | Requires structured data extraction, brittle |
Bot-BIN takes a different approach: local-first binary files that contain both the original text and pre-computed embeddings. This eliminates cloud dependencies while enabling fast semantic search.
3. System Architecture
3.1 Component Overview
3.2 Data Flow
The Bot-BIN system operates through a three-phase pipeline: detecting changed memory files, chunking and embedding their content, and serializing the results to AIF-BIN v2.
3.3 State Management
Bot-BIN maintains sync state in a JSON file that tracks file hashes and modification times. This enables incremental sync, where only changed files are re-processed. The state file structure:
{
"files": {
"memory/2026-02-01.md": {
"hash": "a1b2c3d4e5f6...",
"synced_at": "2026-02-01T23:45:00Z",
"chunks": 12,
"output": "memory/aifbin/2026-02-01.aif-bin"
}
},
"last_sync": "2026-02-01T23:45:00Z"
}
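Given this layout, a sync pass can decide which files are stale by re-hashing each tracked file and comparing against the recorded hash. The sketch below is illustrative, not part of botbin.py: the helper name and the `state.json` filename are assumptions, and only the `files` map shown above is consulted.

```python
import hashlib
import json
from pathlib import Path

def files_needing_sync(state_path="state.json"):
    """Return tracked files whose current MD5 differs from the recorded hash."""
    state = json.loads(Path(state_path).read_text())
    stale = []
    for rel_path, entry in state.get("files", {}).items():
        p = Path(rel_path)
        if not p.exists():
            continue  # deleted files are handled elsewhere in this sketch
        current = hashlib.md5(p.read_bytes()).hexdigest()
        if current != entry.get("hash"):
            stale.append(rel_path)
    return stale
```

Because unchanged files hash to their recorded value, a heartbeat-driven sync that finds an empty list can return immediately without touching the embedding model.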
4. AIF-BIN v2 Binary Format
4.1 Format Specification
AIF-BIN v2 uses MessagePack encoding for efficient binary serialization. The format provides a self-contained unit of semantic memory that includes source text, pre-computed embeddings, and rich metadata.
4.1.1 Header Structure
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | Magic | "AIFB" (0x41 0x49 0x46 0x42) |
| 4 | 1 | Version | Format version (2) |
| 5 | 1 | Flags | Compression, encryption flags |
| 6 | 2 | Reserved | Future use |
| 8 | 4 | Metadata Len | Length of metadata block |
| 12 | 4 | Chunks Len | Number of chunks |
| 16 | N | Metadata | MessagePack-encoded metadata |
| 16+N | M | Chunks | MessagePack-encoded chunk array |
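The fixed 16-byte prefix can be read with a single `struct` call. The sketch below assumes little-endian byte order, which the table above does not specify, and the function name is hypothetical:

```python
import struct

# magic (4s), version (B), flags (B), reserved (H), metadata len (I), chunk count (I)
HEADER_FORMAT = "<4sBBHII"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 16 bytes

def parse_header(data: bytes) -> dict:
    """Validate and unpack the fixed-size AIF-BIN v2 header prefix."""
    magic, version, flags, _reserved, meta_len, chunk_count = struct.unpack_from(
        HEADER_FORMAT, data
    )
    if magic != b"AIFB":
        raise ValueError("not an AIF-BIN file")
    if version != 2:
        raise ValueError(f"unsupported format version {version}")
    return {"flags": flags, "metadata_len": meta_len, "chunks": chunk_count}
```

The metadata block then starts at offset 16 and runs for `metadata_len` bytes, with the chunk array immediately after it.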
4.1.2 Chunk Structure
Each chunk contains:
| Field | Type | Description |
|---|---|---|
| id | string | UUID v4 chunk identifier |
| text | string | Original text content (up to 500 tokens) |
| embedding | float32[] | 384-dimensional vector from MiniLM |
| metadata | object | Source position, timestamps, tags |
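In Python, the chunk record above maps naturally onto a small dataclass. This is a sketch of the in-memory shape only, not the botbin.py implementation; field defaults are assumptions:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class Chunk:
    """One unit of semantic memory inside an AIF-BIN v2 file."""
    text: str                    # original text content, up to 500 tokens
    embedding: list[float]       # 384-dimensional vector from all-MiniLM-L6-v2
    id: str = field(default_factory=lambda: str(uuid.uuid4()))  # UUID v4
    metadata: dict = field(default_factory=dict)  # source position, timestamps, tags
```

Serializing such a record is then a matter of converting it to a plain dict and handing it to the MessagePack encoder.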
4.2 Embedding Model
Bot-BIN uses the all-MiniLM-L6-v2 sentence transformer model by default. This model provides a strong balance of speed, quality, and resource efficiency: it is a 6-layer transformer of roughly 22 million parameters that produces normalized 384-dimensional embeddings and runs comfortably on CPU.
5. Synchronization Engine
5.1 Change Detection Algorithm
The sync engine uses MD5 hashing to detect file changes. Hashing is preferred over file modification times because it tracks content rather than metadata: a file that is touched or re-saved without changes is skipped, and a file restored to its previously synced content (new mtime, identical content) is not re-processed.
import hashlib

def needs_sync(file_path, state):
    # Re-sync only when the file's content hash differs from the recorded one
    with open(file_path, "rb") as f:
        current_hash = hashlib.md5(f.read()).hexdigest()
    previous_hash = state.get(file_path, {}).get('hash')
    return current_hash != previous_hash
5.2 Chunking Strategy
Text is split into chunks using a paragraph-aware algorithm that respects natural boundaries while maintaining target chunk sizes. The chunking parameters:
- Target size: 500 tokens (approximately 375 words)
- Overlap: 50 tokens between adjacent chunks for context continuity
- Boundaries: Prefer splits at paragraph breaks, sentence ends, or list items
- Minimum: Chunks smaller than 50 tokens are merged with adjacent chunks
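The parameters above can be sketched as a greedy paragraph-aware splitter. This simplified version approximates tokens by whitespace-separated words, merges an undersized trailing chunk, and omits the 50-token overlap and splitting of oversized single paragraphs for brevity; it is not the botbin.py chunker itself:

```python
def chunk_text(text: str, target: int = 500, minimum: int = 50) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly `target` words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk when adding this paragraph would overshoot
        if size + n > target and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        if size < minimum and chunks:
            # Merge an undersized trailing chunk into its predecessor
            chunks[-1] += "\n\n" + "\n\n".join(current)
        else:
            chunks.append("\n\n".join(current))
    return chunks
```

Splitting at paragraph boundaries keeps each embedding focused on one coherent idea, which generally improves retrieval precision over fixed-width windows.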
5.3 Tracked Files
Bot-BIN automatically tracks these files in the workspace:
- `MEMORY.md` in workspace root (if present)
- All `*.md` files in the `memory/` directory
Additional paths can be configured via environment variables or command-line arguments.
6. Semantic Search Implementation
6.1 Query Processing
Semantic search operates by embedding the query text using the same MiniLM model, then computing cosine similarity against all indexed chunk embeddings:
def search(query, collection, limit=10):
# Embed the query
query_embedding = model.encode(query)
# Load all indexed chunks
results = []
for file in collection.files:
chunks = load_aifbin(file)
for chunk in chunks:
score = cosine_similarity(query_embedding, chunk.embedding)
results.append((chunk, score))
# Return top-k by similarity
results.sort(key=lambda x: x[1], reverse=True)
return results[:limit]
6.2 Cosine Similarity
Cosine similarity measures the angle between two vectors, producing a score between -1 and 1. For normalized embeddings (which MiniLM produces), this simplifies to a dot product:
similarity(a, b) = sum(a[i] * b[i] for i in range(384))
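Both forms can be written in a few lines of plain Python. The full cosine formula divides the dot product by the vector norms; for unit-length embeddings the denominator is 1, so `dot_product` alone suffices. Function names here are illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """General cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dot_product(a: list[float], b: list[float]) -> float:
    """For pre-normalized embeddings, similarity reduces to this."""
    return sum(x * y for x, y in zip(a, b))
```

In practice the production loop uses the dot-product form, since skipping two square roots and a division per comparison matters at 2.1 million operations per second.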
6.3 Hybrid Retrieval
Bot-BIN supports optional hybrid retrieval that combines vector similarity with BM25 keyword matching. This improves recall for queries containing specific named entities or technical terms:
hybrid_score = (alpha * vector_score) + ((1 - alpha) * keyword_score)
# Default alpha = 0.7 (favor semantic similarity)
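A minimal sketch of the blend follows. The term-overlap scorer below is a crude stand-in for real BM25, included only so the example is self-contained; both function names are assumptions:

```python
def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms present in the text; a toy stand-in for BM25."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_score(vector_score: float, kw_score: float, alpha: float = 0.7) -> float:
    """Blend semantic and keyword scores; alpha = 0.7 favors semantic similarity."""
    return alpha * vector_score + (1 - alpha) * kw_score
```

Raising `alpha` toward 1.0 recovers pure vector search; lowering it helps queries dominated by exact identifiers, such as function names or error codes.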
7. Performance Benchmarks
7.1 Test Environment
| Component | Specification |
|---|---|
| Platform | Linux (WSL2) |
| Runtime | Node.js v22.22.0 |
| Embedding Dimensions | 384 (all-MiniLM-L6-v2) |
| Benchmark Date | February 5, 2026 |
7.2 Results Summary
7.3 Detailed Benchmarks
| Operation | Iterations | Avg Latency | Throughput |
|---|---|---|---|
| Cosine Similarity (384 dims) | 10,000 | 0.47 microseconds | 2,111,469 ops/sec |
| Search 1,000 chunks | 1,000 | 0.39 ms | 2,570 ops/sec |
| Search 10,000 chunks | 100 | 4.81 ms | 208 ops/sec |
| Search 100,000 chunks | 10 | 48.04 ms | 20.8 ops/sec |
| Chunk 10K words | 1,000 | 0.46 ms | 2,161 ops/sec |
| MessagePack encode (100 chunks) | 100 | 0.50 ms | 1,988 ops/sec |
| MessagePack decode (100 chunks) | 100 | 0.48 ms | 2,104 ops/sec |
Performance Insight
Search latency scales linearly with collection size. For typical chatbot memory stores (1,000-10,000 chunks), search completes in under 5ms, enabling real-time retrieval during conversation without perceptible delay.
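The linear-scaling claim is straightforward to reproduce with a small timing harness of the kind behind tables like the one above. This helper is a generic sketch, not part of botbin.py:

```python
import time

def benchmark(fn, iterations: int) -> dict:
    """Time `fn` over `iterations` runs; report average latency and throughput."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return {
        "avg_ms": elapsed / iterations * 1e3,
        "ops_per_sec": iterations / elapsed,
    }
```

Running it against a search over collections of 1,000, 10,000, and 100,000 synthetic chunks should show latency growing roughly tenfold at each step, matching the table.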
8. Integration Patterns
8.1 Heartbeat-Based Sync
For AI systems with periodic heartbeat or polling mechanisms, Bot-BIN sync can be triggered automatically:
# In HEARTBEAT.md or similar automation config
1. Run `python3 botbin.py sync` on each heartbeat
2. Silent output if no files changed
3. Memory files always up-to-date for search
8.2 MCP Integration
Bot-BIN integrates with the Model Context Protocol (MCP) for AI agent tooling. The MCP server exposes memory search as a callable tool:
{
"mcpServers": {
"botbin-memory": {
"command": "python3",
"args": ["botbin.py", "mcp-server"]
}
}
}
8.3 Direct CLI Usage
# Sync all changed memory files
python3 botbin.py sync
# Search memories by meaning
python3 botbin.py search "what API architecture did we choose"
# Show sync status
python3 botbin.py status
# Extract original text from binary
python3 botbin.py extract memory/aifbin/file.aif-bin
9. Security Considerations
9.1 Local-First Architecture
Bot-BIN operates entirely offline. No data is transmitted to external servers. This provides strong privacy guarantees for sensitive conversation data:
- All processing happens on the local machine
- Embedding model runs locally (downloaded once, cached)
- No API keys or cloud accounts required
- Memory files remain under user control
9.2 File Security
AIF-BIN v2 supports optional encryption for at-rest security. When enabled, chunk content and embeddings are encrypted with AES-256-GCM before serialization. The encryption key is derived from a user-provided passphrase using Argon2id.
9.3 Access Control
Memory files should be protected with appropriate filesystem permissions. Recommended configuration:
# Restrict memory directory to owner only
chmod 700 memory/
chmod 600 memory/*.md memory/aifbin/*.aif-bin
10. Future Work
10.1 Planned Enhancements
- Memory Agent: Autonomous background process that consolidates, prunes, and optimizes memory stores
- Multi-modal Support: Embeddings for images and audio referenced in memory files
- Incremental Re-indexing: Update embeddings when better models become available
- Federated Memory: Sync memory across multiple devices while maintaining privacy
10.2 Integration Roadmap
- VS Code extension for memory inspection and search
- Obsidian plugin for note-to-memory workflow
- Claude Desktop MCP server for native integration
- Web-based memory inspector (Recall Studio)
References
- AIF-BIN Format Specification v2.0, Terronex, 2026
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers & Gurevych, 2019
- MessagePack Specification, msgpack.org
- Model Context Protocol, Anthropic, 2024