Trigram-Indexed Code Search

TriSeek

A code search engine that indexes every 3-byte sequence in your codebase, then answers queries in milliseconds using posting list intersection and parallel verification.

20.1x
Peak measured speedup
<5ms
Index open time
38/39
Medium+ repo wins
01 — The Core Idea

What is a Trigram?

A trigram is a sliding window of 3 consecutive bytes through text. The string hello produces: hel, ell, llo. By building an index of which files contain which trigrams, we can dramatically narrow the search space for any query.
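The sliding window is small enough to sketch in full. Here is a minimal standalone version (an illustrative sketch, not TriSeek's actual extraction code):

```rust
use std::collections::HashSet;

/// Slide a 3-byte window over `content` and pack each window into a u32.
fn trigrams(content: &[u8]) -> HashSet<u32> {
    content
        .windows(3)
        .map(|w| w[0] as u32 | (w[1] as u32) << 8 | (w[2] as u32) << 16)
        .collect()
}
```

Running this on b"hello" yields exactly the three trigrams hel, ell, llo.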

[Sliding window visualization: each 3-byte window of the input becomes one extracted trigram]
Why 3 bytes?

With 3 bytes we get 16.7 million possible trigrams (256³). This is selective enough to narrow search candidates dramatically, yet small enough that every reasonable search query of 3+ characters produces at least one trigram to look up. A bigram (2-byte) index would be too broad; a 4-gram too sparse.

Trigram as u32

Each trigram is encoded as a u32 integer: byte[0] | (byte[1] << 8) | (byte[2] << 16). This compact representation makes hash lookups and comparisons blazing fast — just integer operations, no string allocation.

trigram encoding
// "hel" → trigram u32
let t: u32 = 0x68       // 'h'
    | (0x65 << 8)       // 'e'
    | (0x6C << 16);     // 'l'
// t = 0x6C6568 = 7103848
02 — The Key Difference

ripgrep vs TriSeek

ripgrep is the gold standard for grep. It's incredibly fast at scanning files — but it still scans every single file, every time. TriSeek takes a fundamentally different approach: build an index once, then answer queries by looking up which files could match before reading any of them.

ripgrep brute force
1
Walk every file in the repository, respecting .gitignore
2
Read each file from disk into memory (or mmap it)
3
Run the regex / literal matcher against every byte of every file using SIMD-accelerated search
4
Output matches as they're found
Files touched for query "HttpResponse" in Kubernetes (28K files)

Every dot = a file read from disk. All 28,132 files scanned.

~500 ms per query
TriSeek indexed lookup
1
Extract trigrams from query: "Htt", "ttp", "tpR", "pRe", "Res", "esp", "spo", "pon", "ons", "nse"
2
Look up posting lists — near-instant lookups in the mmap'd index. No repository files are read.
3
Intersect posting lists using sorted merge. Only files containing ALL trigrams survive.
4
Verify only candidates — read just the handful of surviving files, in parallel across all CPU cores.
Files touched for query "HttpResponse" in Kubernetes (28K files)

Only 5 files read from disk. 28,127 files never touched.

~64 ms per query
💡

The Fundamental Tradeoff

ripgrep pays a linear cost proportional to repo size on every query — it must touch every file. TriSeek pays an upfront indexing cost once, then each query cost is proportional to the number of matching files, not the total repo size. The larger the codebase, the bigger the advantage.

ripgrep query cost

O(total files × avg file size)

TriSeek query cost

O(posting list size + candidate files)

When TriSeek wins

  • Medium/large repos (1K+ files)
  • Selective queries with literal strings
  • Session workloads (many queries, index stays hot)
  • Agent workflows (Codex, Claude) doing 20+ searches

When ripgrep wins

  • Small repos (page cache makes brute force free)
  • Pure regex with no extractable literals (\d{3}-\d{4})
  • One-off queries (no index yet built)
  • Very high match-count patterns (most files match)
03 — Building the Index

How Indexing Works

TriSeek walks your repository, extracts trigrams from every file, and builds an inverted index — a map from each trigram to the set of files containing it.

1

Walk Repository

Using the ignore crate's parallel walker, all files are discovered respecting .gitignore rules. Threads are capped at min(cores, 8) for optimal I/O throughput.

walker.rs
fn walk_repository_parallel(root: &Path) {
    WalkBuilder::new(root)
        .threads(num_cpus::get().min(8))
        .build_parallel()
        .run(|| {
            Box::new(|entry| {
                /* process each file entry */
                WalkState::Continue
            })
        });
}
2

Extract Trigrams

For each file, a sliding window of 3 bytes produces all trigrams. These are collected into a HashSet<u32> per file (deduplication — we only need to know if a trigram appears, not how many times).

trigram extraction
for window in content.windows(3) {
    let tri = window[0] as u32
           | (window[1] as u32) << 8
           | (window[2] as u32) << 16;
    trigrams.insert(tri);
}
3

Build Posting Lists

The inverted index maps each trigram to a sorted list of document IDs. If trigram "htt" appears in files 4, 17, and 203, its posting list is [4, 17, 203].

htt → server.rs, client.rs, proxy.rs
ttp → server.rs, client.rs, config.rs, proxy.rs
tps → server.rs, client.rs, proxy.rs
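The mapping above can be sketched end to end. This is a simplified, sequential in-memory version (TriSeek's real builder also indexes path trigrams and runs in parallel):

```rust
use std::collections::{HashMap, HashSet};

/// Build an inverted index: trigram -> sorted list of doc IDs containing it.
fn build_postings(docs: &[&[u8]]) -> HashMap<u32, Vec<u32>> {
    let mut index: HashMap<u32, Vec<u32>> = HashMap::new();
    for (doc_id, content) in docs.iter().enumerate() {
        let mut seen = HashSet::new(); // dedupe per file: presence, not counts
        for w in content.windows(3) {
            let tri = w[0] as u32 | (w[1] as u32) << 8 | (w[2] as u32) << 16;
            if seen.insert(tri) {
                index.entry(tri).or_default().push(doc_id as u32);
            }
        }
    }
    // Doc IDs are assigned in increasing order here, so every posting list
    // comes out already sorted -- what the sorted-merge intersection relies on.
    index
}
```

A parallel builder would need to restore that sort order after merging per-thread results; this sequential sketch gets it for free.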
4

Write Binary Index

The in-memory index is persisted as fast.idx — a custom flat binary format designed for instant mmap loading. No deserialization needed; posting lists are read directly from mapped memory.
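The core trick of a flat format is fixed-width fields at known offsets. A hypothetical sketch of writing one posting-list section (the function name and the (offset, count) bookkeeping are illustrative, not the actual fast.idx writer):

```rust
/// Append a posting list to the index buffer as raw little-endian u32s.
/// Returns (byte offset, entry count) to record in the trigram table,
/// which is all a reader needs to locate the list later.
fn write_postings(out: &mut Vec<u8>, postings: &[u32]) -> (u64, u32) {
    let offset = out.len() as u64;
    for &doc_id in postings {
        out.extend_from_slice(&doc_id.to_le_bytes()); // fixed width: 4 B each
    }
    (offset, postings.len() as u32)
}
```

Because every entry is exactly 4 bytes, a reader can later jump straight to the recorded offset and reinterpret count × 4 bytes as a &[u32] without parsing anything in between.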

04 — Query Execution

How Search Works

When you search for https, TriSeek extracts its trigrams, intersects the posting lists to find candidate files, then verifies matches in parallel.

Step-by-Step: Searching for "https"

1. Extract query trigrams: htt, ttp, tps

2. Look up posting lists:
htt → [2, 7, 15, 23, 41]
ttp → [2, 5, 7, 15, 23, 30, 41]
tps → [2, 15, 23, 41, 50]
3. Sorted intersection — O(n+m) merge

Walk all lists simultaneously with pointers. Only advance the pointer on the smallest value. When all pointers agree, that's a candidate.

Candidates (files containing ALL trigrams): File 2, File 15, File 23, File 41

From 28,000 files in the repo → only 4 candidates to verify. That's a 7,000x reduction.

4. Parallel verification with Rayon

Read each candidate file and run the actual regex/literal match. This happens across all CPU cores simultaneously.

engine.rs — sorted intersection
fn sorted_intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    use std::cmp::Ordering::*;
    let (mut i, mut j) = (0, 0);
    let mut result = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Equal   => { result.push(a[i]); i += 1; j += 1; }
            Less    => i += 1,
            Greater => j += 1,
        }
    }
    result
}
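A real query has more than two posting lists. One way to extend the pairwise merge to the whole trigram set is a fold; this sketch starts from the shortest list (an assumed optimization, so intermediate results shrink fastest) and reuses the same merge as above:

```rust
use std::cmp::Ordering::*;

fn sorted_intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut result = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Equal   => { result.push(a[i]); i += 1; j += 1; }
            Less    => i += 1,
            Greater => j += 1,
        }
    }
    result
}

/// Intersect all posting lists for a query's trigrams.
fn intersect_all(mut lists: Vec<Vec<u32>>) -> Vec<u32> {
    lists.sort_by_key(|l| l.len()); // rarest trigram first
    let mut iter = lists.into_iter();
    let first = match iter.next() {
        Some(l) => l,
        None => return Vec::new(),
    };
    iter.fold(first, |acc, next| sorted_intersect(&acc, &next))
}
```

On the htt/ttp/tps lists from the "https" example above, this returns the four candidates [2, 15, 23, 41].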
05 — The Fast Index

Binary Format Layout

The fast.idx file is a flat binary format designed for mmap. No parsing, no deserialization — the OS maps the file into memory and we read posting lists as raw u32 slices directly from the mapped region.

fast.idx memory layout:
  Header: 96 B (magic, version, section offsets)
  Content Trigram Table: 12 B × N
  Content Postings: 4 B × total entries
  Path Trigram Table: 12 B × M
  Path Postings: 4 B × total entries
  Doc Table: 46 B × docs (document metadata)
  String Pool: variable (strings referenced by the doc table)
Before (bincode): ~600 ms index load time (deserialize + allocate)
After (mmap): <5 ms index load time (zero-copy mmap)

Why mmap matters

Traditional deserialization reads the entire file into memory, allocates new data structures, and copies data. With mmap, the operating system maps the file directly into virtual memory. Reading a posting list is just casting a pointer — the data stays on disk until accessed, and the OS manages caching.

fastindex.rs — zero-copy posting list read
// Posting list is a raw slice of u32 from mapped memory
let ptr = mmap[offset..offset + count * 4].as_ptr() as *const u32;
let postings = unsafe {
    slice::from_raw_parts(ptr, count)
};
// No allocation. No copy. Just a pointer cast.
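The same idea in runnable form, with an in-memory buffer standing in for the mapped file. Two caveats the production code must handle are made explicit here: the postings section must be 4-byte aligned (a flat format can guarantee this with padding), and the pointer cast assumes a little-endian host matching the on-disk byte order:

```rust
/// Reinterpret `count` u32 postings starting `offset` bytes into the
/// mapped region. No allocation, no copy.
fn read_postings(mapped: &[u8], offset: usize, count: usize) -> &[u32] {
    let bytes = &mapped[offset..offset + count * 4];
    // Casting to *const u32 is only sound if the data is 4-byte aligned.
    assert_eq!(bytes.as_ptr() as usize % std::mem::align_of::<u32>(), 0);
    unsafe { std::slice::from_raw_parts(bytes.as_ptr() as *const u32, count) }
}
```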
06 — Parallel Verification

Rayon Parallelism

After intersection narrows candidates to a handful of files, TriSeek verifies each one in parallel across all CPU cores using Rayon's work-stealing thread pool.

Parallel file verification across 8 threads
engine.rs — parallel verification
use rayon::prelude::*;

candidates.par_iter().for_each(|(doc_id, path, rel)| {
    if done.load(Ordering::Relaxed) { return; }

    // Read file (mmap for large, Vec for small)
    let bytes = read_candidate(path);

    // Run actual pattern match
    let hits = match_in_bytes(&bytes, &pattern);

    if !hits.is_empty() {
        let n = hits.len();
        results.lock().unwrap().extend(hits);
        let found = total_found.fetch_add(n, Ordering::Relaxed) + n;
        if found >= max_results {
            done.store(true, Ordering::Relaxed);
        }
    }
});
🎯

Early Termination

An AtomicBool flag signals all threads to stop once we have enough results. This prevents unnecessary I/O when only the first N matches are needed — common in editor integrations.

📋

Work Stealing

Rayon's work-stealing scheduler automatically balances load. If one thread finishes its files quickly (small files), it steals work from threads with larger files. No manual balancing needed.

07 — Results

Benchmark Results

On the fresh full rerun, TriSeek auto mode wins 38 of 65 single-query benchmarks overall, or 38 of 39 after excluding the two small repos where cold shell scans remain cheaper, and 8 of 10 session benchmarks across all repo sizes.

Note

The charts below show representative results for medium and large repositories only (Kubernetes, Linux, Rust). Small repos (serde, ripgrep) are excluded from the charts because the fresh full rerun still shows shell tools ahead on cold single-query latency at that scale. Session benchmarks include all 5 repos.

Kubernetes — 28,132 files, 257 MB

Linux Kernel — 92,916 files, 1.5 GB

Rust Compiler — 58,472 files, 198 MB

Session Benchmarks — 20 sequential queries (agent workload)

Repository            TriSeek p50   rg p50      Speedup
serde (small)         34 ms         82 ms       2.4x
kubernetes (medium)   292 ms        3,575 ms    12.2x
linux (large)         613 ms        10,342 ms   16.9x
rust (large)          375 ms        5,954 ms    15.9x
ripgrep (small)       39 ms         92 ms       2.3x
38/39

medium+ repo benchmarks won

Overall full-rerun result: 38/65 wins; 26 of the 27 losses came from the two small repos.

20.1x

peak measured speedup

Linux kernel, literal_selective single-query workload

References & Prior Art

Standing on Shoulders

TriSeek's trigram indexing approach builds on decades of research in string matching and code search:

1

Ukkonen, "Approximate string-matching with q-grams and maximal matches" (1992)

The theoretical foundation for q-gram (n-gram) based string matching. Established that decomposing strings into overlapping q-length substrings enables efficient approximate and exact matching through set intersection. TriSeek's trigram decomposition (q=3) directly descends from this framework.

Theoretical Computer Science, Volume 92, Issue 1, pp. 191–211

sciencedirect.com/science/article/pii/0304397592901434

2

Manber & Wu, "GLIMPSE: A Tool to Search Through Entire File Systems" (1994)

The first practical system to use an inverted index over character n-grams for searching entire file systems. GLIMPSE demonstrated the core tradeoff TriSeek exploits: a small index enables dramatic pruning of the search space, making queries sub-linear in corpus size.

USENIX Winter Technical Conference, 1994

usenix.org/.../glimpse-tool-search-through-entire-file-systems

3

Navarro & Baeza-Yates, "A Practical q-Gram Index for Text Retrieval Allowing Errors" (1998)

Extended q-gram indexing to support approximate matching and established cost models for choosing optimal q values. Their analysis of posting list intersection costs informed TriSeek's sorted-merge intersection strategy and the choice of q=3 as the selectivity sweet spot.

CLEI Electronic Journal, Volume 1, Number 2, 1998

clei.org/cleiej/.../article/view/379

4

Russ Cox, "Regular Expression Matching with a Trigram Index" (2012)

The engineering blueprint for applying trigram indexing to code search at scale. Described how Google Code Search extracted trigram sets from regex queries, intersected posting lists to find candidate files, and verified matches with full regex evaluation. This is the direct algorithmic ancestor of TriSeek's search pipeline.

swtch.com/~rsc/regexp/regexp4.html

5

GitHub, "The technology behind GitHub's new code search" (2023)

GitHub's engineering blog describing their Blackbird code search system, which uses n-gram indexing with content-defined chunking, delta-encoded posting lists, and a Rust-based query engine. Demonstrated that trigram-style indexing remains the state of the art for code search even at the scale of 200M+ repositories.

github.blog/2023-02-06-the-technology-behind-githubs-new-code-search