Bot-BIN: Persistent Semantic Memory for AI Chatbots
A synchronization framework for converting ephemeral conversation context into durable, searchable memory using the AIF-BIN v2 binary format
terronex.dev
Version 1.0 | February 2026
Abstract
Modern AI chatbots and agents suffer from context window limitations that prevent long-term memory persistence. Bot-BIN addresses this constraint by providing a synchronization layer that converts markdown-based memory files into the AIF-BIN v2 binary format with embedded 384-dimensional vector embeddings. This enables semantic search across conversation history, decisions, and accumulated knowledge. Performance benchmarks demonstrate search latency of 0.39 ms per 1,000 chunks and a throughput of 2.1 million cosine-similarity operations per second. Bot-BIN operates entirely offline with zero cloud dependencies, maintaining user privacy while enabling AI systems to develop persistent, queryable memory.
1. Introduction
The current generation of large language models operates within fixed context windows, typically ranging from 4,096 to 200,000 tokens. While these windows have grown substantially, they remain fundamentally ephemeral. Each conversation session begins with a blank slate, and accumulated knowledge from prior interactions is lost unless explicitly re-injected into the context.
This limitation creates a significant gap between how AI systems operate and how humans expect intelligent assistants to behave. A human assistant remembers past conversations, accumulates knowledge about preferences and decisions, and builds contextual understanding over time. Current AI chatbots cannot replicate this behavior without external memory systems.
Bot-BIN provides a practical solution to this problem. It operates as a synchronization layer between human-readable markdown files and machine-optimized binary memory files. The system watches for changes to memory files, automatically generates vector embeddings, and maintains a searchable archive of all accumulated knowledge.
2. The Memory Problem in AI Systems
2.1 Context Window Constraints
Large language models process input as a sequence of tokens within a fixed-size window. Once this window is filled, older tokens are truncated or the system must employ compression strategies. This creates several practical limitations:
- Session Isolation: Each conversation starts fresh with no memory of prior sessions
- Context Overflow: Long conversations eventually exceed window limits, losing early context
- Expensive Retrieval: Injecting large amounts of historical context consumes token budget
- No Selective Recall: Without semantic search, relevant past information cannot be efficiently retrieved
2.2 Existing Solutions and Their Limitations
Several approaches have been developed to address AI memory limitations:
| Approach | Mechanism | Limitations |
|---|---|---|
| Vector Databases | Cloud-hosted embedding storage | Requires network, vendor lock-in, privacy concerns |
| RAG Pipelines | Retrieve-then-generate | Complex infrastructure, high latency |
| Conversation Logs | Append-only text files | No semantic search, keyword matching only |
| Knowledge Graphs | Entity-relationship storage | Requires structured data extraction, brittle |
Bot-BIN takes a different approach: local-first binary files that contain both the original text and pre-computed embeddings. This eliminates cloud dependencies while enabling fast semantic search.
3. System Architecture
3.1 Component Overview
3.2 Data Flow
The Bot-BIN system operates through a three-phase pipeline: detecting changed memory files, chunking and embedding their content, and serializing the results to AIF-BIN v2.
3.3 State Management
Bot-BIN maintains sync state in a JSON file that tracks file hashes and modification times. This enables incremental sync, where only changed files are re-processed. The state file structure:
{
"files": {
"memory/2026-02-01.md": {
"hash": "a1b2c3d4e5f6...",
"synced_at": "2026-02-01T23:45:00Z",
"chunks": 12,
"output": "memory/aifbin/2026-02-01.aif-bin"
}
},
"last_sync": "2026-02-01T23:45:00Z"
}
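Given this layout, a sync pass can decide which files are stale by re-hashing each tracked file and comparing against the recorded hash. The sketch below is illustrative, not part of botbin.py: the helper name and the `state.json` filename are assumptions, and only the `files` map shown above is consulted.

```python
import hashlib
import json
from pathlib import Path

def files_needing_sync(state_path="state.json"):
    """Return tracked files whose current MD5 differs from the recorded hash."""
    state = json.loads(Path(state_path).read_text())
    stale = []
    for rel_path, entry in state.get("files", {}).items():
        p = Path(rel_path)
        if not p.exists():
            continue  # deleted files are handled elsewhere in this sketch
        current = hashlib.md5(p.read_bytes()).hexdigest()
        if current != entry.get("hash"):
            stale.append(rel_path)
    return stale
```

Because unchanged files hash to their recorded value, a heartbeat-driven sync that finds an empty list can return immediately without touching the embedding model.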
4. AIF-BIN v2 Binary Format
4.1 Format Specification
AIF-BIN v2 uses MessagePack encoding for efficient binary serialization. The format provides a self-contained unit of semantic memory that includes source text, pre-computed embeddings, and rich metadata.
4.1.1 Header Structure
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | Magic | "AIFB" (0x41 0x49 0x46 0x42) |
| 4 | 1 | Version | Format version (2) |
| 5 | 1 | Flags | Compression, encryption flags |
| 6 | 2 | Reserved | Future use |
| 8 | 4 | Metadata Len | Length of metadata block |
| 12 | 4 | Chunks Len | Number of chunks |
| 16 | N | Metadata | MessagePack-encoded metadata |
| 16+N | M | Chunks | MessagePack-encoded chunk array |
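The fixed 16-byte prefix can be read with a single `struct` call. The sketch below assumes little-endian byte order, which the table above does not specify, and the function name is hypothetical:

```python
import struct

# magic (4s), version (B), flags (B), reserved (H), metadata len (I), chunk count (I)
HEADER_FORMAT = "<4sBBHII"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 16 bytes

def parse_header(data: bytes) -> dict:
    """Validate and unpack the fixed-size AIF-BIN v2 header prefix."""
    magic, version, flags, _reserved, meta_len, chunk_count = struct.unpack_from(
        HEADER_FORMAT, data
    )
    if magic != b"AIFB":
        raise ValueError("not an AIF-BIN file")
    if version != 2:
        raise ValueError(f"unsupported format version {version}")
    return {"flags": flags, "metadata_len": meta_len, "chunks": chunk_count}
```

The metadata block then starts at offset 16 and runs for `metadata_len` bytes, with the chunk array immediately after it.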
4.1.2 Chunk Structure
Each chunk contains:
| Field | Type | Description |
|---|---|---|
| id | string | UUID v4 chunk identifier |
| text | string | Original text content (up to 500 tokens) |
| embedding | float32[] | 384-dimensional vector from MiniLM |
| metadata | object | Source position, timestamps, tags |
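In Python, the chunk record above maps naturally onto a small dataclass. This is a sketch of the in-memory shape only, not the botbin.py implementation; field defaults are assumptions:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class Chunk:
    """One unit of semantic memory inside an AIF-BIN v2 file."""
    text: str                    # original text content, up to 500 tokens
    embedding: list[float]       # 384-dimensional vector from all-MiniLM-L6-v2
    id: str = field(default_factory=lambda: str(uuid.uuid4()))  # UUID v4
    metadata: dict = field(default_factory=dict)  # source position, timestamps, tags
```

Serializing such a record is then a matter of converting it to a plain dict and handing it to the MessagePack encoder.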
4.2 Embedding Model
Bot-BIN uses the all-MiniLM-L6-v2 sentence transformer model by default. This model provides a strong balance of speed, quality, and resource efficiency: it is a 6-layer transformer of roughly 22 million parameters that produces normalized 384-dimensional embeddings and runs comfortably on CPU.
5. Synchronization Engine
5.1 Change Detection Algorithm
The sync engine uses MD5 hashing to detect file changes. Hashing is preferred over file modification times because it tracks content rather than metadata: a file that is touched or re-saved without changes is skipped, and a file restored to its previously synced content (new mtime, identical content) is not re-processed.
import hashlib

def needs_sync(file_path, state):
    # Re-sync only when the file's content hash differs from the recorded one
    with open(file_path, "rb") as f:
        current_hash = hashlib.md5(f.read()).hexdigest()
    previous_hash = state.get(file_path, {}).get('hash')
    return current_hash != previous_hash
5.2 Chunking Strategy
Text is split into chunks using a paragraph-aware algorithm that respects natural boundaries while maintaining target chunk sizes. The chunking parameters:
- Target size: 500 tokens (approximately 375 words)
- Overlap: 50 tokens between adjacent chunks for context continuity
- Boundaries: Prefer splits at paragraph breaks, sentence ends, or list items
- Minimum: Chunks smaller than 50 tokens are merged with adjacent chunks
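The parameters above can be sketched as a greedy paragraph-aware splitter. This simplified version approximates tokens by whitespace-separated words, merges an undersized trailing chunk, and omits the 50-token overlap and splitting of oversized single paragraphs for brevity; it is not the botbin.py chunker itself:

```python
def chunk_text(text: str, target: int = 500, minimum: int = 50) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly `target` words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk when adding this paragraph would overshoot
        if size + n > target and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        if size < minimum and chunks:
            # Merge an undersized trailing chunk into its predecessor
            chunks[-1] += "\n\n" + "\n\n".join(current)
        else:
            chunks.append("\n\n".join(current))
    return chunks
```

Splitting at paragraph boundaries keeps each embedding focused on one coherent idea, which generally improves retrieval precision over fixed-width windows.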
5.3 Tracked Files
Bot-BIN automatically tracks these files in the workspace:
- `MEMORY.md` in workspace root (if present)
- All `*.md` files in the `memory/` directory
Additional paths can be configured via environment variables or command-line arguments.
6. Semantic Search Implementation
6.1 Query Processing
Semantic search operates by embedding the query text using the same MiniLM model, then computing cosine similarity against all indexed chunk embeddings:
def search(query, collection, limit=10):
# Embed the query
query_embedding = model.encode(query)
# Load all indexed chunks
results = []
for file in collection.files:
chunks = load_aifbin(file)
for chunk in chunks:
score = cosine_similarity(query_embedding, chunk.embedding)
results.append((chunk, score))
# Return top-k by similarity
results.sort(key=lambda x: x[1], reverse=True)
return results[:limit]
6.2 Cosine Similarity
Cosine similarity measures the angle between two vectors, producing a score between -1 and 1. For normalized embeddings (which MiniLM produces), this simplifies to a dot product:
similarity(a, b) = sum(a[i] * b[i] for i in range(384))
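Both forms can be written in a few lines of plain Python. The full cosine formula divides the dot product by the vector norms; for unit-length embeddings the denominator is 1, so `dot_product` alone suffices. Function names here are illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """General cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dot_product(a: list[float], b: list[float]) -> float:
    """For pre-normalized embeddings, similarity reduces to this."""
    return sum(x * y for x, y in zip(a, b))
```

In practice the production loop uses the dot-product form, since skipping two square roots and a division per comparison matters at 2.1 million operations per second.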
6.3 Hybrid Retrieval
Bot-BIN supports optional hybrid retrieval that combines vector similarity with BM25 keyword matching. This improves recall for queries containing specific named entities or technical terms:
hybrid_score = (alpha * vector_score) + ((1 - alpha) * keyword_score)
# Default alpha = 0.7 (favor semantic similarity)
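A minimal sketch of the blend follows. The term-overlap scorer below is a crude stand-in for real BM25, included only so the example is self-contained; both function names are assumptions:

```python
def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms present in the text; a toy stand-in for BM25."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_score(vector_score: float, kw_score: float, alpha: float = 0.7) -> float:
    """Blend semantic and keyword scores; alpha = 0.7 favors semantic similarity."""
    return alpha * vector_score + (1 - alpha) * kw_score
```

Raising `alpha` toward 1.0 recovers pure vector search; lowering it helps queries dominated by exact identifiers, such as function names or error codes.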
7. Performance Benchmarks
7.1 Test Environment
| Component | Specification |
|---|---|
| Platform | Linux (WSL2) |
| Runtime | Node.js v22.22.0 |
| Embedding Dimensions | 384 (all-MiniLM-L6-v2) |
| Benchmark Date | February 5, 2026 |
7.2 Results Summary
7.3 Detailed Benchmarks
| Operation | Iterations | Avg Latency | Throughput |
|---|---|---|---|
| Cosine Similarity (384 dims) | 10,000 | 0.47 microseconds | 2,111,469 ops/sec |
| Search 1,000 chunks | 1,000 | 0.39 ms | 2,570 ops/sec |
| Search 10,000 chunks | 100 | 4.81 ms | 208 ops/sec |
| Search 100,000 chunks | 10 | 48.04 ms | 20.8 ops/sec |
| Chunk 10K words | 1,000 | 0.46 ms | 2,161 ops/sec |
| MessagePack encode (100 chunks) | 100 | 0.50 ms | 1,988 ops/sec |
| MessagePack decode (100 chunks) | 100 | 0.48 ms | 2,104 ops/sec |
Performance Insight
Search latency scales linearly with collection size. For typical chatbot memory stores (1,000-10,000 chunks), search completes in under 5ms, enabling real-time retrieval during conversation without perceptible delay.
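The linear-scaling claim is straightforward to reproduce with a small timing harness of the kind behind tables like the one above. This helper is a generic sketch, not part of botbin.py:

```python
import time

def benchmark(fn, iterations: int) -> dict:
    """Time `fn` over `iterations` runs; report average latency and throughput."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return {
        "avg_ms": elapsed / iterations * 1e3,
        "ops_per_sec": iterations / elapsed,
    }
```

Running it against a search over collections of 1,000, 10,000, and 100,000 synthetic chunks should show latency growing roughly tenfold at each step, matching the table.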
8. Integration Patterns
8.1 Heartbeat-Based Sync
For AI systems with periodic heartbeat or polling mechanisms, Bot-BIN sync can be triggered automatically:
# In HEARTBEAT.md or similar automation config
1. Run `python3 botbin.py sync` on each heartbeat
2. Silent output if no files changed
3. Memory files always up-to-date for search
8.2 MCP Integration
Bot-BIN integrates with the Model Context Protocol (MCP) for AI agent tooling. The MCP server exposes memory search as a callable tool:
{
"mcpServers": {
"botbin-memory": {
"command": "python3",
"args": ["botbin.py", "mcp-server"]
}
}
}
8.3 Direct CLI Usage
# Sync all changed memory files
python3 botbin.py sync
# Search memories by meaning
python3 botbin.py search "what API architecture did we choose"
# Show sync status
python3 botbin.py status
# Extract original text from binary
python3 botbin.py extract memory/aifbin/file.aif-bin
9. Security Considerations
9.1 Local-First Architecture
Bot-BIN operates entirely offline. No data is transmitted to external servers. This provides strong privacy guarantees for sensitive conversation data:
- All processing happens on the local machine
- Embedding model runs locally (downloaded once, cached)
- No API keys or cloud accounts required
- Memory files remain under user control
9.2 File Security
AIF-BIN v2 supports optional encryption for at-rest security. When enabled, chunk content and embeddings are encrypted with AES-256-GCM before serialization. The encryption key is derived from a user-provided passphrase using Argon2id.
9.3 Access Control
Memory files should be protected with appropriate filesystem permissions. Recommended configuration:
# Restrict memory directory to owner only
chmod 700 memory/
chmod 600 memory/*.md memory/aifbin/*.aif-bin
10. Future Work
10.1 Planned Enhancements
- Memory Agent: Autonomous background process that consolidates, prunes, and optimizes memory stores
- Multi-modal Support: Embeddings for images and audio referenced in memory files
- Incremental Re-indexing: Update embeddings when better models become available
- Federated Memory: Sync memory across multiple devices while maintaining privacy
10.2 Integration Roadmap
- VS Code extension for memory inspection and search
- Obsidian plugin for note-to-memory workflow
- Claude Desktop MCP server for native integration
- Web-based memory inspector (Recall Studio)
References
- AIF-BIN Format Specification v2.0, Terronex, 2026
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers & Gurevych, 2019
- MessagePack Specification, msgpack.org
- Model Context Protocol, Anthropic, 2024