Why Binary?
JSON is ubiquitous, human-readable, and easy to debug. So why did we choose a binary format for AIF-BIN?
Size matters. Embeddings are arrays of 384 to 768 floating-point numbers. In JSON, a single 384-dimension embedding takes ~3 KB. In binary (packed float32), it's 1.5 KB. For a file with thousands of chunks that difference runs to megabytes, and across a collection of files, gigabytes.
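The size gap is easy to reproduce. This sketch uses illustrative values; `struct` packs IEEE-754 float32, matching the "packed float32" encoding described above:

```python
import json
import struct

# Hypothetical 384-dimension embedding (values are illustrative).
embedding = [0.0123] * 384

as_json = json.dumps(embedding).encode("utf-8")
as_binary = struct.pack(f"<{len(embedding)}f", *embedding)  # packed float32

print(len(as_json))    # roughly 3 KB, depending on digits per float
print(len(as_binary))  # exactly 384 * 4 = 1536 bytes
```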
Parsing speed. JSON parsing requires string operations, escape handling, and type inference. Binary formats like MessagePack parse 5-10x faster because the structure is explicit.
Instant validation. A binary header lets you validate file integrity in microseconds without reading the entire file. JSON requires parsing everything first.
File Structure Overview
An AIF-BIN v2 file consists of five sections:
1. Header
2. Metadata
3. Original Raw Content
4. Chunks
5. Footer
Each section has a specific purpose and can be accessed independently using the offset pointers in the header.
The Header (64 bytes)
The header provides instant file validation and section navigation:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | magic | AIFBIN\x00\x01 — identifies the format |
| 8 | 4 | version | Format version (currently 2) |
| 12 | 4 | padding | Reserved for future use |
| 16 | 8 | metadataOffset | Byte offset to metadata section |
| 24 | 8 | originalRawOffset | Byte offset to original document (or 0xFF... if absent) |
| 32 | 8 | contentChunksOffset | Byte offset to chunks section |
| 40 | 8 | versionsOffset | Byte offset to versions (or 0xFF... if absent) |
| 48 | 8 | footerOffset | Byte offset to footer |
| 56 | 8 | totalSize | Total file size for validation |
Reading the first 64 bytes tells you everything about the file structure. No parsing required.
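A header reader can be sketched in a few lines. Field order follows the table above; little-endian byte order is an assumption here, since the table doesn't state it:

```python
import struct

HEADER_SIZE = 64
MAGIC = b"AIFBIN\x00\x01"
ABSENT = 0xFFFFFFFFFFFFFFFF  # sentinel for optional sections

def parse_header(buf: bytes) -> dict:
    """Parse the fixed 64-byte AIF-BIN v2 header (little-endian assumed)."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("file too small to be AIF-BIN")
    if buf[:8] != MAGIC:
        raise ValueError("bad magic")
    version, _padding = struct.unpack_from("<II", buf, 8)
    offsets = struct.unpack_from("<6Q", buf, 16)
    names = ("metadata", "original_raw", "content_chunks",
             "versions", "footer", "total_size")
    return {"version": version, **dict(zip(names, offsets))}
```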
Metadata Section
The metadata section stores file-level information as MessagePack:
```json
{
  "source_file": "/path/to/original.md",
  "migration_date": "2026-02-04T12:00:00Z",
  "embedding_model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "chunk_count": 42,
  "custom_fields": { ... }
}
```
Structure: [u64 length][MessagePack data]
This section is optional but recommended. It enables tools to display file information without parsing chunks.
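Reading the `[u64 length][MessagePack data]` framing is straightforward. In this sketch, JSON stands in for MessagePack so it stays standard-library only; the framing logic is identical:

```python
import json
import struct

def read_length_prefixed(buf: bytes, offset: int) -> bytes:
    """Read a [u64 length][payload] section at `offset` (little-endian assumed)."""
    (length,) = struct.unpack_from("<Q", buf, offset)
    start = offset + 8
    return buf[start:start + length]

# In a real file the payload is MessagePack; JSON stands in here.
payload = json.dumps({"embedding_dim": 384}).encode("utf-8")
section = struct.pack("<Q", len(payload)) + payload
assert read_length_prefixed(section, 0) == payload
```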
Content Chunks Section
The heart of AIF-BIN: text chunks with their embeddings.
```
[u32 chunk_count]

For each chunk:
  [u32 type]          // 1=TEXT, 2=TABLE, 3=IMAGE, 4=AUDIO, 5=VIDEO, 6=CODE
  [u64 data_length]   // Length of raw data
  [u64 meta_length]   // Length of chunk metadata
  [MessagePack meta]  // Chunk-specific metadata (including embedding)
  [bytes data]        // Raw chunk content
```
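A reader for one chunk record, following that layout, might look like this sketch (little-endian assumed; the MessagePack metadata is left as raw bytes):

```python
import struct

CHUNK_TYPES = {1: "TEXT", 2: "TABLE", 3: "IMAGE", 4: "AUDIO", 5: "VIDEO", 6: "CODE"}

def read_chunk(buf: bytes, offset: int):
    """Parse one chunk record; returns (type, meta_bytes, data_bytes, next_offset)."""
    (ctype,) = struct.unpack_from("<I", buf, offset)
    data_len, meta_len = struct.unpack_from("<QQ", buf, offset + 4)
    meta_start = offset + 4 + 16          # skip type + the two length fields
    meta = buf[meta_start:meta_start + meta_len]
    data_start = meta_start + meta_len
    data = buf[data_start:data_start + data_len]
    return CHUNK_TYPES[ctype], meta, data, data_start + data_len
```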
The chunk metadata contains the embedding vector:
```json
{
  "id": "chunk-uuid",
  "embedding": [0.023, -0.156, 0.089, ...],  // 384 floats
  "start_char": 0,
  "end_char": 512,
  "heading": "Introduction"
}
```
Why store embeddings in metadata? It allows flexibility. Different chunks can use different embedding models. The embedding dimension is discovered at read time, not fixed in the format.
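In code, read-time discovery is just inspecting the decoded metadata. A hypothetical helper:

```python
def embedding_dim(chunk_meta: dict) -> int:
    """Per-chunk dimension discovery: the length of the stored vector,
    not a value fixed by the file header."""
    return len(chunk_meta["embedding"])

# Chunks embedded by different models can coexist in one file.
assert embedding_dim({"embedding": [0.0] * 384}) == 384
assert embedding_dim({"embedding": [0.0] * 768}) == 768
```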
Footer Section
The footer provides integrity verification and fast chunk lookup:
```
[u32 index_count]

For each index entry:
  [u32 chunk_id]
  [u64 offset]

[u64 checksum]  // CRC64 of all preceding data
```
The checksum catches corruption from disk errors or incomplete writes. The index enables O(1) access to any chunk without scanning the file.
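A footer reader can build the constant-time lookup table and verify integrity in one pass. This sketch assumes a single trailing checksum and substitutes `zlib.crc32` for CRC64, which Python's standard library doesn't provide:

```python
import struct
import zlib

def parse_footer(buf: bytes, footer_offset: int) -> dict:
    """Parse the footer into {chunk_id: offset} and verify the checksum.
    zlib.crc32 stands in for CRC64 in this sketch."""
    (count,) = struct.unpack_from("<I", buf, footer_offset)
    pos = footer_offset + 4
    index = {}
    for _ in range(count):
        chunk_id, offset = struct.unpack_from("<IQ", buf, pos)
        index[chunk_id] = offset
        pos += 12
    (checksum,) = struct.unpack_from("<Q", buf, pos)
    if zlib.crc32(buf[:pos]) != checksum:  # CRC64 in the real format
        raise ValueError("checksum mismatch: file corrupt or truncated")
    return index
```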
Why MessagePack?
We chose MessagePack over Protocol Buffers or FlatBuffers because:
- Schema-less — No .proto files or code generation required
- Self-describing — Types are encoded with data
- Compact — 50-75% smaller than JSON
- Wide support — Libraries in every language
For structured sections (header, footer), we use fixed binary layouts. For flexible sections (metadata, chunks), MessagePack handles the complexity.
Chunking Strategy
AIF-BIN doesn't prescribe a specific chunking algorithm, but we recommend:
- 800 tokens per chunk — Fits most embedding model context windows
- 100 token overlap — Preserves context at boundaries
- Semantic boundaries — Split on paragraphs/sections when possible
- Preserve structure — Keep headings with their content
The format stores chunk boundaries, so retrieval can return exact source locations.
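The recommendations above can be sketched with a plain token list (whitespace tokens standing in for model tokens) and the 800/100 numbers:

```python
def chunk_tokens(tokens: list, size: int = 800, overlap: int = 100) -> list:
    """Split tokens into fixed-size chunks with overlap at the boundaries."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already covers the tail
    return chunks
```

Each chunk shares its first `overlap` tokens with the tail of the previous chunk, so context at the boundary survives into both embeddings.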
Embedding Model Compatibility
AIF-BIN files are model-agnostic. The metadata records which model produced the embeddings:
- `all-MiniLM-L6-v2` — 384 dimensions, fastest
- `all-mpnet-base-v2` — 768 dimensions, best quality
- `text-embedding-ada-002` — OpenAI's 1536 dimensions
- `bge-base-en-v1.5` — 768 dimensions, good balance
For search to work correctly, the query must be embedded with the same model. AIF-BIN Recall handles this automatically.
Size Comparison
For a 10,000-word document with 50 chunks:
| Format | Size | Parse Time |
|---|---|---|
| JSON (text + embeddings) | ~2.5 MB | ~150ms |
| AIF-BIN v2 | ~1.0 MB | ~15ms |
| Savings | 60% | 90% |
The savings compound with larger files and collections.
Implementation Resources
Reference implementations are available in eight languages, along with supporting resources:
- Core SDKs — Python, TypeScript, Rust, Go, C#, Java, Swift, Kotlin
- Technical Specification — Full format documentation
- AIF-BIN Recall — TypeScript implementation with search
The format is MIT licensed. Build what you need.