A code search engine that indexes every 3-byte sequence in your codebase, then answers queries in milliseconds using posting list intersection and parallel verification.
A trigram is a sliding window of 3 consecutive bytes through text. The string hello produces: hel, ell, llo. By building an index of which files contain which trigrams, we can dramatically narrow the search space for any query.
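As a minimal sketch, the sliding window described above can be written directly over raw bytes (the `trigrams` helper name is illustrative, not TriSeek's actual API):

```rust
// Slide a 3-byte window over the input and collect each trigram.
fn trigrams(bytes: &[u8]) -> Vec<[u8; 3]> {
    bytes.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}
```

Inputs shorter than 3 bytes simply yield no trigrams, which is why queries need at least 3 characters.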
With 3 bytes we get about 16.8 million possible trigrams (256³ = 16,777,216). This is selective enough to narrow search candidates dramatically, yet small enough that every reasonable search query of 3+ characters produces at least one trigram to look up. A bigram (2-byte) index would be too broad to prune effectively; a 4-gram index would be too sparse.
Each trigram is encoded as a u32 integer: byte[0] | (byte[1] << 8) | (byte[2] << 16). This compact representation makes hash lookups and comparisons blazing fast — just integer operations, no string allocation.
// "hel" → trigram u32 let t = 0x68 // 'h' | (0x65 << 8) // 'e' | (0x6C << 16); // 'l' // t = 7104872
ripgrep is the gold standard for grep. It's incredibly fast at scanning files — but it still scans every single file, every time. TriSeek takes a fundamentally different approach: build an index once, then answer queries by looking up which files could match before reading any of them.
ripgrep: every dot = a file read from disk. All 28,132 files scanned.

TriSeek: only 5 files read from disk. 28,127 files never touched.
ripgrep pays a linear cost proportional to repo size on every query — it must touch every file. TriSeek pays an upfront indexing cost once, then each query cost is proportional to the number of matching files, not the total repo size. The larger the codebase, the bigger the advantage.
ripgrep query cost: O(total files × avg file size)

TriSeek query cost: O(posting list size + candidate files)
TriSeek walks your repository, extracts trigrams from every file, and builds an inverted index — a map from each trigram to the set of files containing it.
Using the ignore crate's parallel walker, files are discovered in parallel while respecting .gitignore rules. Threads are capped at min(cores, 8) to avoid oversubscribing disk I/O.
```rust
use ignore::{WalkBuilder, WalkState};
use std::path::Path;

fn walk_repository_parallel(root: &Path) {
    WalkBuilder::new(root)
        .threads(num_cpus::get().min(8))
        .build_parallel()
        .run(|| {
            Box::new(|_entry| {
                /* process each file */
                WalkState::Continue
            })
        });
}
```
For each file, a sliding window of 3 bytes produces all trigrams. These are collected into a HashSet<u32> per file (deduplication — we only need to know if a trigram appears, not how many times).
```rust
for window in content.windows(3) {
    let tri = window[0] as u32
        | (window[1] as u32) << 8
        | (window[2] as u32) << 16;
    trigrams.insert(tri);
}
```
The inverted index maps each trigram to a sorted list of document IDs. If trigram "htt" appears in files 4, 17, and 203, its posting list is [4, 17, 203].
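A sketch of how such an index could be assembled from the per-file trigram sets (a hypothetical helper, not TriSeek's actual code — the real index is built for on-disk layout):

```rust
use std::collections::{BTreeMap, HashSet};

// Build posting lists: trigram → sorted list of doc IDs containing it.
fn build_index(files: &[HashSet<u32>]) -> BTreeMap<u32, Vec<u32>> {
    let mut index: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    for (doc_id, trigrams) in files.iter().enumerate() {
        for &tri in trigrams {
            index.entry(tri).or_default().push(doc_id as u32);
        }
    }
    // Doc IDs must be sorted for the merge-based intersection at query time.
    for postings in index.values_mut() {
        postings.sort_unstable();
    }
    index
}
```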
The in-memory index is persisted as fast.idx — a custom flat binary format designed for instant mmap loading. No deserialization needed; posting lists are read directly from mapped memory.
When you search for https, TriSeek extracts its trigrams, intersects the posting lists to find candidate files, then verifies matches in parallel.
Walk all lists simultaneously with pointers. Only advance the pointer on the smallest value. When all pointers agree, that's a candidate.
From 28,000 files in the repo → only 4 candidates to verify. That's a 7,000x reduction.
Read each candidate file and run the actual regex/literal match. This happens across all CPU cores simultaneously.
```rust
use std::cmp::Ordering::{Equal, Greater, Less};

fn sorted_intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut result = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Equal => {
                result.push(a[i]);
                i += 1;
                j += 1;
            }
            Less => i += 1,
            Greater => j += 1,
        }
    }
    result
}
```
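Putting the steps together, here is a hedged sketch of the full candidate lookup: extract the query's trigrams, fetch each posting list, and intersect starting from the shortest list so intermediate results shrink fastest. The `candidates` helper and the `HashMap` index shape are illustrative assumptions; for brevity it filters with binary search rather than repeated pairwise merges.

```rust
use std::collections::HashMap;

// Find candidate doc IDs for a literal query.
// Posting lists are assumed sorted ascending.
fn candidates(index: &HashMap<u32, Vec<u32>>, query: &[u8]) -> Vec<u32> {
    let mut lists: Vec<&Vec<u32>> = Vec::new();
    for w in query.windows(3) {
        let tri = w[0] as u32 | (w[1] as u32) << 8 | (w[2] as u32) << 16;
        match index.get(&tri) {
            Some(list) => lists.push(list),
            // A trigram absent from the index rules out every file.
            None => return Vec::new(),
        }
    }
    // Start from the shortest posting list.
    lists.sort_by_key(|l| l.len());
    let mut result = match lists.first() {
        Some(first) => (*first).clone(),
        None => return Vec::new(), // query shorter than 3 bytes
    };
    for list in &lists[1..] {
        result.retain(|d| list.binary_search(d).is_ok());
    }
    result
}
```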
The fast.idx file is a flat binary format designed for mmap. No parsing, no deserialization — the OS maps the file into memory and we read posting lists as raw u32 slices directly from the mapped region.
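As a safe, copying illustration of what reading a posting list out of such a region looks like, assuming little-endian u32 doc IDs at a known offset (a plain byte slice stands in for the mmap here; the real zero-copy path casts a pointer instead, as shown further below):

```rust
// Decode `count` little-endian u32 doc IDs starting at `offset`.
fn read_postings(region: &[u8], offset: usize, count: usize) -> Vec<u32> {
    region[offset..offset + count * 4]
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```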
Index load time (traditional format): deserialize + allocate

Index load time (fast.idx): zero-copy mmap
Traditional deserialization reads the entire file into memory, allocates new data structures, and copies data. With mmap, the operating system maps the file directly into virtual memory. Reading a posting list is just casting a pointer — the data stays on disk until accessed, and the OS manages caching.
```rust
// Posting list is a raw slice of u32 read straight from mapped memory.
// Safety: assumes `offset` is 4-byte aligned and `count * 4` bytes are
// in bounds — both guaranteed by the index writer.
let ptr = mmap[offset..offset + count * 4].as_ptr() as *const u32;
let postings = unsafe { slice::from_raw_parts(ptr, count) };
// No allocation. No copy. Just a pointer cast.
```
After intersection narrows candidates to a handful of files, TriSeek verifies each one in parallel across all CPU cores using Rayon's work-stealing thread pool.
```rust
use rayon::prelude::*;

candidates.par_iter().for_each(|(doc_id, path, rel)| {
    if done.load(Ordering::Relaxed) {
        return;
    }
    // Read file (mmap for large, Vec for small)
    let bytes = read_candidate(&path);
    // Run the actual pattern match
    let hits = match_in_bytes(&bytes, &pattern);
    if !hits.is_empty() {
        results.lock().extend(hits);
        let count = total_found.fetch_add(hits.len(), Ordering::Relaxed);
        if count >= max_results {
            done.store(true, Ordering::Relaxed);
        }
    }
});
```
An AtomicBool flag signals all threads to stop once we have enough results. This prevents unnecessary I/O when only the first N matches are needed — common in editor integrations.
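The flag pattern can be sketched with plain std threads (TriSeek uses Rayon's `par_iter`; the atomics logic is the same, and `count_until` is an illustrative stand-in for verification):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

// Process items across threads, stopping early once `max_results` are found.
fn count_until(items: &[u32], max_results: usize) -> usize {
    let done = AtomicBool::new(false);
    let found = AtomicUsize::new(0);
    thread::scope(|s| {
        let (done, found) = (&done, &found);
        for chunk in items.chunks(8) {
            s.spawn(move || {
                for _item in chunk {
                    // Other threads may have already found enough results.
                    if done.load(Ordering::Relaxed) {
                        return;
                    }
                    let n = found.fetch_add(1, Ordering::Relaxed) + 1;
                    if n >= max_results {
                        done.store(true, Ordering::Relaxed);
                    }
                }
            });
        }
    });
    found.load(Ordering::Relaxed)
}
```

Because the flag is only a hint, a few extra items may be processed after the threshold is hit — acceptable here, since verification is idempotent.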
Rayon's work-stealing scheduler automatically balances load. If one thread finishes its files quickly (small files), it steals work from threads with larger files. No manual balancing needed.
On the fresh full rerun, TriSeek auto mode wins 38 of 65 single-query benchmarks overall, or 38 of 39 after excluding the two small repos where cold shell scans remain cheaper, and 8 of 10 session benchmarks across all repo sizes.
Note
The charts below show representative results for medium and large repositories only (Kubernetes, Linux, Rust). Small repos (serde, ripgrep) are excluded from the charts because the fresh full rerun still shows shell tools ahead on cold single-query latency at that scale. Session benchmarks include all 5 repos.
| Repository | TriSeek p50 | rg p50 | Speedup |
|---|---|---|---|
| serde (small) | 34 ms | 82 ms | 2.4x |
| kubernetes (medium) | 292 ms | 3,575 ms | 12.2x |
| linux (large) | 613 ms | 10,342 ms | 16.9x |
| rust (large) | 375 ms | 5,954 ms | 15.9x |
| ripgrep (small) | 39 ms | 92 ms | 2.3x |
medium+ repo benchmarks won
Overall full-rerun result: 38/65 wins. All 26 losses came from the two small repos.
peak measured speedup
Linux kernel, literal_selective single-query workload
TriSeek's trigram indexing approach builds on decades of research in string matching and code search:
Ukkonen, "Approximate string-matching with q-grams and maximal matches" (1992)
The theoretical foundation for q-gram (n-gram) based string matching. Established that decomposing strings into overlapping q-length substrings enables efficient approximate and exact matching through set intersection. TriSeek's trigram decomposition (q=3) directly descends from this framework.
Theoretical Computer Science, Volume 92, Issue 1, pp. 191–211
Manber & Wu, "GLIMPSE: A Tool to Search Through Entire File Systems" (1994)
The first practical system to use an inverted index over character n-grams for searching entire file systems. GLIMPSE demonstrated the core tradeoff TriSeek exploits: a small index enables dramatic pruning of the search space, making queries sub-linear in corpus size.
USENIX Winter Technical Conference, 1994
usenix.org/.../glimpse-tool-search-through-entire-file-systems
Navarro & Baeza-Yates, "A Practical q-Gram Index for Text Retrieval Allowing Errors" (1998)
Extended q-gram indexing to support approximate matching and established cost models for choosing optimal q values. Their analysis of posting list intersection costs informed TriSeek's sorted-merge intersection strategy and the choice of q=3 as the selectivity sweet spot.
CLEI Electronic Journal, Volume 1, Number 2, 1998
Russ Cox, "Regular Expression Matching with a Trigram Index" (2012)
The engineering blueprint for applying trigram indexing to code search at scale. Described how Google Code Search extracted trigram sets from regex queries, intersected posting lists to find candidate files, and verified matches with full regex evaluation. This is the direct algorithmic ancestor of TriSeek's search pipeline.
GitHub, "The technology behind GitHub's new code search" (2023)
GitHub's engineering blog describing their Blackbird code search system, which uses n-gram indexing with content-defined chunking, delta-encoded posting lists, and a Rust-based query engine. Demonstrated that trigram-style indexing remains the state of the art for code search even at the scale of 200M+ repositories.
github.blog/2023-02-06-the-technology-behind-githubs-new-code-search