Why Binary?
JSON is ubiquitous, human-readable, and easy to debug. So why did we choose a binary format for AIF-BIN?
Size matters. Embeddings are arrays of 384 to 768 floating-point numbers. In JSON, a single 384-dimension embedding takes ~3 KB. In binary (packed float32), it's 1.5 KB. For a file with thousands of chunks that difference runs to megabytes, and across a collection of files, gigabytes.
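The size gap is easy to reproduce. This sketch uses illustrative values; `struct` packs IEEE-754 float32, matching the "packed float32" encoding described above:

```python
import json
import struct

# Hypothetical 384-dimension embedding (values are illustrative).
embedding = [0.0123] * 384

as_json = json.dumps(embedding).encode("utf-8")
as_binary = struct.pack(f"<{len(embedding)}f", *embedding)  # packed float32

print(len(as_json))    # roughly 3 KB, depending on digits per float
print(len(as_binary))  # exactly 384 * 4 = 1536 bytes
```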
Parsing speed. JSON parsing requires string operations, escape handling, and type inference. Binary formats like MessagePack parse 5-10x faster because the structure is explicit.
Instant validation. A binary header lets you validate file integrity in microseconds without reading the entire file. JSON requires parsing everything first.
File Structure Overview
An AIF-BIN v2 file consists of five sections:
1. Header
2. Metadata
3. Original Raw Content
4. Chunks
5. Footer
Each section has a specific purpose and can be accessed independently using the offset pointers in the header.
The Header (64 bytes)
The header provides instant file validation and section navigation:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | magic | AIFBIN\x00\x01 — identifies the format |
| 8 | 4 | version | Format version (currently 2) |
| 12 | 4 | padding | Reserved for future use |
| 16 | 8 | metadataOffset | Byte offset to metadata section |
| 24 | 8 | originalRawOffset | Byte offset to original document (or 0xFF... if absent) |
| 32 | 8 | contentChunksOffset | Byte offset to chunks section |
| 40 | 8 | versionsOffset | Byte offset to versions (or 0xFF... if absent) |
| 48 | 8 | footerOffset | Byte offset to footer |
| 56 | 8 | totalSize | Total file size for validation |
Reading the first 64 bytes tells you everything about the file structure. No parsing required.
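A header reader can be sketched in a few lines. Field order follows the table above; little-endian byte order is an assumption here, since the table doesn't state it:

```python
import struct

HEADER_SIZE = 64
MAGIC = b"AIFBIN\x00\x01"
ABSENT = 0xFFFFFFFFFFFFFFFF  # sentinel for optional sections

def parse_header(buf: bytes) -> dict:
    """Parse the fixed 64-byte AIF-BIN v2 header (little-endian assumed)."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("file too small to be AIF-BIN")
    if buf[:8] != MAGIC:
        raise ValueError("bad magic")
    version, _padding = struct.unpack_from("<II", buf, 8)
    offsets = struct.unpack_from("<6Q", buf, 16)
    names = ("metadata", "original_raw", "content_chunks",
             "versions", "footer", "total_size")
    return {"version": version, **dict(zip(names, offsets))}
```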
Metadata Section
The metadata section stores file-level information as MessagePack:
```json
{
  "source_file": "/path/to/original.md",
  "migration_date": "2026-02-04T12:00:00Z",
  "embedding_model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "chunk_count": 42,
  "custom_fields": { ... }
}
```
Structure: [u64 length][MessagePack data]
This section is optional but recommended. It enables tools to display file information without parsing chunks.
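Reading the `[u64 length][MessagePack data]` framing is straightforward. In this sketch, JSON stands in for MessagePack so it stays standard-library only; the framing logic is identical:

```python
import json
import struct

def read_length_prefixed(buf: bytes, offset: int) -> bytes:
    """Read a [u64 length][payload] section at `offset` (little-endian assumed)."""
    (length,) = struct.unpack_from("<Q", buf, offset)
    start = offset + 8
    return buf[start:start + length]

# In a real file the payload is MessagePack; JSON stands in here.
payload = json.dumps({"embedding_dim": 384}).encode("utf-8")
section = struct.pack("<Q", len(payload)) + payload
assert read_length_prefixed(section, 0) == payload
```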
Content Chunks Section
The heart of AIF-BIN: text chunks with their embeddings.
```
[u32 chunk_count]

For each chunk:
  [u32 type]          // 1=TEXT, 2=TABLE, 3=IMAGE, 4=AUDIO, 5=VIDEO, 6=CODE
  [u64 data_length]   // Length of raw data
  [u64 meta_length]   // Length of chunk metadata
  [MessagePack meta]  // Chunk-specific metadata (including embedding)
  [bytes data]        // Raw chunk content
```
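A reader for one chunk record, following that layout, might look like this sketch (little-endian assumed; the MessagePack metadata is left as raw bytes):

```python
import struct

CHUNK_TYPES = {1: "TEXT", 2: "TABLE", 3: "IMAGE", 4: "AUDIO", 5: "VIDEO", 6: "CODE"}

def read_chunk(buf: bytes, offset: int):
    """Parse one chunk record; returns (type, meta_bytes, data_bytes, next_offset)."""
    (ctype,) = struct.unpack_from("<I", buf, offset)
    data_len, meta_len = struct.unpack_from("<QQ", buf, offset + 4)
    meta_start = offset + 4 + 16          # skip type + the two length fields
    meta = buf[meta_start:meta_start + meta_len]
    data_start = meta_start + meta_len
    data = buf[data_start:data_start + data_len]
    return CHUNK_TYPES[ctype], meta, data, data_start + data_len
```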
The chunk metadata contains the embedding vector:
```json
{
  "id": "chunk-uuid",
  "embedding": [0.023, -0.156, 0.089, ...],  // 384 floats
  "start_char": 0,
  "end_char": 512,
  "heading": "Introduction"
}
```
Why store embeddings in metadata? It allows flexibility. Different chunks can use different embedding models. The embedding dimension is discovered at read time, not fixed in the format.
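In code, read-time discovery is just inspecting the decoded metadata. A hypothetical helper:

```python
def embedding_dim(chunk_meta: dict) -> int:
    """Per-chunk dimension discovery: the length of the stored vector,
    not a value fixed by the file header."""
    return len(chunk_meta["embedding"])

# Chunks embedded by different models can coexist in one file.
assert embedding_dim({"embedding": [0.0] * 384}) == 384
assert embedding_dim({"embedding": [0.0] * 768}) == 768
```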
Footer Section
The footer provides integrity verification and fast chunk lookup:
```
[u32 index_count]

For each index entry:
  [u32 chunk_id]
  [u64 offset]

[u64 checksum]  // CRC64 of all preceding data
```
The checksum catches corruption from disk errors or incomplete writes. The index enables O(1) access to any chunk without scanning the file.
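A footer reader can build the constant-time lookup table and verify integrity in one pass. This sketch assumes a single trailing checksum and substitutes `zlib.crc32` for CRC64, which Python's standard library doesn't provide:

```python
import struct
import zlib

def parse_footer(buf: bytes, footer_offset: int) -> dict:
    """Parse the footer into {chunk_id: offset} and verify the checksum.
    zlib.crc32 stands in for CRC64 in this sketch."""
    (count,) = struct.unpack_from("<I", buf, footer_offset)
    pos = footer_offset + 4
    index = {}
    for _ in range(count):
        chunk_id, offset = struct.unpack_from("<IQ", buf, pos)
        index[chunk_id] = offset
        pos += 12
    (checksum,) = struct.unpack_from("<Q", buf, pos)
    if zlib.crc32(buf[:pos]) != checksum:  # CRC64 in the real format
        raise ValueError("checksum mismatch: file corrupt or truncated")
    return index
```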
Why MessagePack?
We chose MessagePack over Protocol Buffers or FlatBuffers because:
- Schema-less — No .proto files or code generation required
- Self-describing — Types are encoded with data
- Compact — 50-75% smaller than JSON
- Wide support — Libraries in every language
For structured sections (header, footer), we use fixed binary layouts. For flexible sections (metadata, chunks), MessagePack handles the complexity.
Chunking Strategy
AIF-BIN doesn't prescribe a specific chunking algorithm, but we recommend:
- 800 tokens per chunk — Fits most embedding model context windows
- 100 token overlap — Preserves context at boundaries
- Semantic boundaries — Split on paragraphs/sections when possible
- Preserve structure — Keep headings with their content
The format stores chunk boundaries, so retrieval can return exact source locations.
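The recommendations above can be sketched with a plain token list (whitespace tokens standing in for model tokens) and the 800/100 numbers:

```python
def chunk_tokens(tokens: list, size: int = 800, overlap: int = 100) -> list:
    """Split tokens into fixed-size chunks with overlap at the boundaries."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already covers the tail
    return chunks
```

Each chunk shares its first `overlap` tokens with the tail of the previous chunk, so context at the boundary survives into both embeddings.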
Embedding Model Compatibility
AIF-BIN files are model-agnostic. The metadata records which model produced the embeddings:
- `all-MiniLM-L6-v2` — 384 dimensions, fastest
- `all-mpnet-base-v2` — 768 dimensions, best quality
- `text-embedding-ada-002` — OpenAI's 1536 dimensions
- `bge-base-en-v1.5` — 768 dimensions, good balance
For search to work correctly, the query must be embedded with the same model. AIF-BIN Recall handles this automatically.
Size Comparison
For a 10,000-word document with 50 chunks:
| Format | Size | Parse Time |
|---|---|---|
| JSON (text + embeddings) | ~2.5 MB | ~150ms |
| AIF-BIN v2 | ~1.0 MB | ~15ms |
| Savings | 60% | 90% |
The savings compound with larger files and collections.
Implementation Resources
Reference implementations are available in eight languages, along with supporting resources:
- Core SDKs — Python, TypeScript, Rust, Go, C#, Java, Swift, Kotlin
- Technical Specification — Full format documentation
- AIF-BIN Recall — TypeScript implementation with search
The format is MIT licensed. Build what you need.