A deep dive into every crate, module, struct, and data flow in TriSeek — how the pieces fit together, what each component does, and how a search query travels through the system.
TriSeek is a Cargo workspace of four crates, each with a clear responsibility boundary. Dependencies flow downward: the CLI depends on everything, Index depends on Core, and Core depends on nothing.
```
TriSeek/
  Cargo.toml                   # workspace root
  crates/
    search-core/src/
      lib.rs                   # re-exports all modules
      query.rs                 # QueryRequest, SearchKind, CaseMode
      result.rs                # SearchResponse, SearchHit, SearchLineMatch
      planner.rs               # plan_query(), route_query(), extract_regex_literals()
      trigram.rs               # encode_trigram(), trigrams_from_bytes()
      repo.rs                  # RepoStats, RepoCategory, IndexMetadata
      metrics.rs               # ProcessMetrics, SearchMetrics, SessionMetrics
    search-index/src/
      lib.rs                   # re-exports
      engine.rs                # SearchEngine, IndexBackend, parallel verification
      fastindex.rs             # FastIndex mmap format, write_fast_index()
      build.rs                 # build_index(), update_index(), BuildAccumulator
      walker.rs                # walk_repository_parallel(), binary detection
      model.rs                 # PersistedIndex, RuntimeIndex, DeltaSnapshot
      storage.rs               # file I/O: base.bin, delta.bin, fast.idx
    search-cli/src/
      main.rs                  # CLI entry: build, search, session, update, stats
    search-bench/src/
      main.rs                  # benchmark harness with cold/warm trials

~/.triseek/indexes/<root-key>/ # generated index directory
  base.bin                     # bincode-serialized full index
  fast.idx                     # mmap binary format (primary)
  delta.bin                    # optional incremental changes
  metadata.json                # index metadata + repo stats
```
| Crate | Purpose | Used In |
|---|---|---|
| rayon 1.10 | Work-stealing parallelism | engine.rs — parallel file verification |
| memmap2 0.9 | Memory-mapped file I/O | fastindex.rs — zero-copy index, engine.rs — large file reads |
| ignore 0.4 | Git-aware parallel file walking | walker.rs — respects .gitignore |
| regex / regex-syntax | Pattern matching + AST parsing | engine.rs — verification, planner.rs — literal extraction |
| bincode 2.0 | Binary serialization | storage.rs — base.bin / delta.bin format |
| xxhash-rust | Fast 64-bit hashing | walker.rs — file fingerprinting for delta detection |
| clap 4.5 | CLI argument parsing | triseek main.rs |
| globset 0.4 | Glob pattern matching | engine.rs — --glob path filters |
The core crate has zero I/O. It defines query types, result types, the trigram encoding scheme, repo classification, and — most importantly — the query planner that decides how to execute each search.
Every search starts here. The CLI parses user input into this struct, which flows through the entire pipeline.
The planner runs in two phases: plan (what shape is this query?) then route (which engine should execute it?).
```rust
pub fn plan_query(request: &QueryRequest) -> QueryPlan {
    match request.kind {
        Path => strategy: PathIndex,
        Literal | Auto => {
            if pattern.len() < 3 => ShortLiteral + DirectScan
            else                 => Literal + Indexed
        }
        Regex => {
            seeds = extract_regex_literals(pattern)
            if longest_seed >= 3 => RegexAnchored + Indexed
            else                 => RegexWeak + DirectScan
        }
    }
}
```
3 bytes packed into a u32. All text is lowercased before encoding for case-insensitive index lookups.
```rust
pub type Trigram = u32;

pub fn encode_trigram(bytes: &[u8]) -> Option<Trigram> {
    if bytes.len() < 3 {
        return None;
    }
    Some((bytes[0] as u32) << 16 | (bytes[1] as u32) << 8 | bytes[2] as u32)
}

pub fn trigrams_from_bytes(bytes: &[u8]) -> Vec<Trigram> {
    let normalized = normalize_for_index(bytes); // lowercase
    normalized
        .windows(3)
        .filter_map(encode_trigram)
        .collect::<BTreeSet<_>>() // dedup + sort
        .into_iter()
        .collect()
}
```
The full audit trail: what was requested, how it was planned, which engine ran it, and every hit found.
| Category | Files | Disk Size | Example |
|---|---|---|---|
| Small | < 5K | < 200 MB | serde, ripgrep |
| Medium | 5K–50K | 200 MB–2 GB | kubernetes |
| Large | 50K–500K | 2–20 GB | linux, rust |
| VeryLarge | > 500K | > 20 GB | chromium |
The index exists in three layers: PersistedIndex (full snapshot), optional DeltaSnapshot (incremental changes), and RuntimeIndex (merged in-memory view). The FastIndex provides the same data via mmap.
Each file in the repo gets a DocumentRecord with a unique doc_id. The fingerprint enables delta detection on subsequent builds.
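A plausible shape for the record, sketched from the fields the text mentions — the struct and field names here are illustrative, not lifted from TriSeek's source:

```rust
// Illustrative sketch only — names inferred from the text, not the source.
#[derive(PartialEq)]
struct Fingerprint {
    size: u64,       // file size in bytes
    mtime_secs: i64, // last-modified time
    xxh3: u64,       // xxhash-rust content hash
}

struct DocumentRecord {
    doc_id: u32,           // sequential, assigned at build time
    relative_path: String, // path relative to the repo root
    fingerprint: Fingerprint,
}

// Delta detection: a file counts as changed when any component differs.
fn changed(old: &Fingerprint, new: &Fingerprint) -> bool {
    old != new
}
```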
Full snapshot serialized with bincode. Contains everything needed to answer queries.
Incremental changes since last full build. Merged into base at load time via RuntimeIndex::from_snapshots().
If delta_ratio > 25% of base, triggers a full rebuild instead.
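The threshold check amounts to the following (the function name is hypothetical; the 25% figure comes from the text above):

```rust
// Trigger a full rebuild when delta documents exceed 25% of the base.
// Integer form avoids floating point: delta/base > 1/4  <=>  4*delta > base.
fn should_rebuild(delta_docs: usize, base_docs: usize) -> bool {
    base_docs == 0 || delta_docs * 4 > base_docs
}
```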
The engine abstracts over two index implementations. It prefers Fast (mmap) but falls back to Legacy when a delta layer exists.
```rust
pub fn open(index_dir: &Path) -> Result<Self> {
    let metadata = read_index_metadata(index_dir)?;
    let has_delta = delta_exists(index_dir);

    if fast_index_exists(index_dir) && !has_delta {
        // Preferred: zero-copy mmap
        let fast = FastIndex::open(fast_index_path(index_dir))?;
        return Ok(SearchEngine { backend: Fast(fast), .. });
    }

    // Fallback: deserialize + merge
    let base = load_base(index_dir)?;
    let delta = load_delta(index_dir)?;
    let runtime = RuntimeIndex::from_snapshots(base, delta);
    Ok(SearchEngine { backend: Legacy(runtime), .. })
}
```
The build pipeline walks the repository in parallel, extracts trigrams from each file, accumulates posting lists, and writes both bincode (legacy) and mmap (fast) index formats.
Uses the ignore crate's parallel walker which natively respects .gitignore, .ignore, and hidden file rules.
For each ScannedFile, the accumulator assigns a doc_id, extracts content + path trigrams, and builds inverted posting lists.
```rust
fn push(&mut self, file: ScannedFile) {
    let doc_id = self.next_doc_id; // sequential u32
    self.next_doc_id += 1;

    // Content trigrams → posting lists
    for tri in trigrams_from_bytes(&file.contents) {
        self.content_postings.entry(tri).or_default().push(doc_id);
    }

    // Path trigrams → separate posting lists
    for tri in trigrams_from_bytes(file.relative_path.as_bytes()) {
        self.path_postings.entry(tri).or_default().push(doc_id);
    }

    // Filename + extension maps for exact lookups
    self.filename_map.entry(file.file_name.to_lowercase())...;
    self.extension_map.entry(ext.to_lowercase())...;
}
```
Instead of rebuilding from scratch, update_index() compares fingerprints (size + mtime + xxh3 hash) to detect changes.
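The diff step can be sketched as a comparison of two path-to-fingerprint maps. This is a simplified stand-in for `update_index()`'s actual logic: fingerprints are reduced to a single `u64` for brevity, where the real scheme combines size, mtime, and the xxh3 hash.

```rust
use std::collections::HashMap;

// Classify paths into (added, modified, removed) by comparing the
// fingerprint maps from the previous and current repository walks.
fn diff_fingerprints(
    old: &HashMap<String, u64>,
    new: &HashMap<String, u64>,
) -> (Vec<String>, Vec<String>, Vec<String>) {
    let mut added = Vec::new();
    let mut modified = Vec::new();
    let mut removed = Vec::new();
    for (path, fp) in new {
        match old.get(path) {
            None => added.push(path.clone()),
            Some(old_fp) if old_fp != fp => modified.push(path.clone()),
            Some(_) => {} // unchanged — skip
        }
    }
    for path in old.keys() {
        if !new.contains_key(path) {
            removed.push(path.clone());
        }
    }
    (added, modified, removed)
}
```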
Delta exists as delta.bin alongside base.bin. At load time, RuntimeIndex::from_snapshots() merges them. Note: when a delta exists, the Fast (mmap) backend is not used — it falls back to Legacy to apply the merge.
What happens from the moment you type a query to when results appear. Every search travels through the same pipeline: parse → plan → route → filter → candidates → verify → results.
CLI parses input
triseek search "HttpResponse" → QueryRequest { kind: Auto, pattern: "HttpResponse", engine: Auto }
plan_query() classifies the query
12 chars → shape: Literal, selectivity: High, strategy: Indexed, seeds: ["httpresponse"]
route_query() selects engine
Kubernetes = Medium repo, High selectivity → route: Indexed. Reason: "medium_repo, high selectivity, index available"
SearchEngine::open() loads fast.idx via mmap
No delta exists → Fast backend. Mmap maps file into address space in <5ms. No deserialization.
fast_path_filtered_docs() applies path filters
No path filters in this query → all 28,132 doc_ids pass through.
fast_index_candidates() intersects posting lists
Extract trigrams from "httpresponse" (lowercased): "htt", "ttp", "tpr", "pre", "res", "esp", "spo", "pon", "ons", "nse". Look up each in content_table → 10 posting lists. Sorted intersection yields ~5 candidate doc_ids.
collect_content_hits_parallel() verifies candidates
5 candidates distributed across rayon thread pool. Each file: read (mmap if >32KB), compile regex from escaped literal, scan lines. AtomicBool for early termination if max_results reached.
Return SearchResponse
Hits with file paths + line matches, summary (files_with_matches, total_line_matches), metrics (wall_millis: ~50ms, candidate_docs: 5, bytes_scanned: 45KB).
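The candidate step above can be reproduced standalone. This simplified extractor (ASCII lowercasing stands in for TriSeek's `normalize_for_index`) confirms the ten distinct trigrams claimed in the walkthrough:

```rust
use std::collections::BTreeSet;

// Simplified trigram extraction mirroring the walkthrough: lowercase,
// slide a 3-byte window, pack into a u32, dedup + sort via BTreeSet.
fn trigrams(text: &str) -> Vec<u32> {
    text.to_ascii_lowercase()
        .as_bytes()
        .windows(3)
        .map(|w| (w[0] as u32) << 16 | (w[1] as u32) << 8 | w[2] as u32)
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}
// "HttpResponse" → htt, ttp, tpr, pre, res, esp, spo, pon, ons, nse (10 total)
```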
route_query() in planner.rs decides which execution engine handles each query. The decision considers repo size, query shape, selectivity, whether an index exists, and session context.
After routing, adjust_route_for_filters() in main.rs may override: if heavy path filters are set and the route was Ripgrep, it may switch to Indexed if the index can narrow candidates faster than rg can walk+filter.
Open index → path filter → trigram candidates → parallel verify. Cost: O(posting lists + candidates).
Best for: medium+ repos, selective queries, session workloads.
Spawn rg --json subprocess. Parse JSON output for hits. Cost: O(all files).
Best for: small repos, weak regex, one-off queries, no index built yet.
Walk repo with scanner, apply path/content filters inline. Cost: O(all files).
Best for: short literals (<3 chars), path-only queries without index.
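The decision table above can be condensed into a skeleton like the following. This is illustrative only — the enum and function names are not TriSeek's, and the real `route_query()` also weighs selectivity and session context:

```rust
// Illustrative routing skeleton; not the actual planner.rs decision table.
#[derive(Debug, PartialEq)]
enum Shape { ShortLiteral, Literal, RegexAnchored, RegexWeak, Path }
#[derive(Debug, PartialEq)]
enum Engine { Indexed, Ripgrep, DirectScan }

fn route(shape: Shape, has_index: bool, small_repo: bool) -> Engine {
    match shape {
        // <3 chars: no trigrams to look up, scan inline
        Shape::ShortLiteral => Engine::DirectScan,
        // No index yet, or a repo small enough that rg wins anyway
        _ if !has_index || small_repo => Engine::Ripgrep,
        // Weak regex (no 3+ char literal seed) can't use trigram candidates
        Shape::RegexWeak => Engine::Ripgrep,
        // Selective query + available index: trigram path
        _ => Engine::Indexed,
    }
}
```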
The ~1000-line engine module is the heart of TriSeek. It orchestrates path filtering, candidate selection, parallel verification, and line-level match extraction.
Before any content search, path filters narrow the doc set. Each filter type uses the most efficient lookup available.
| Filter | Fast Backend | Legacy Backend |
|---|---|---|
| exact_paths | path_to_doc HashMap O(1) | path_lookup HashMap O(1) |
| exact_names | filename_map HashMap O(1) | filename_map HashMap O(1) |
| extensions | extension_map HashMap O(1) | extension_map HashMap O(1) |
| path_substrings | Trigram search on path postings (3+ chars) or linear scan (<3) | Same, via path_postings HashMap |
| path_prefixes | Linear scan all_docs(), starts_with check | Same |
| globs | GlobSet match on all_docs() | Same |
Multiple filters are intersected: a doc must pass ALL filters. Result is a sorted Vec<u32> of surviving doc_ids.
```rust
match plan.shape {
    Literal | Auto if pattern.len() < 3 => {
        // Too short for trigrams — return all filtered docs
        return filtered_docs;
    }
    Literal | Auto => {
        // Extract trigrams from the literal, intersect posting lists
        let candidates = fast_candidates_for_seed(fast, pattern);
        sorted_intersect(&candidates, &filtered_docs)
    }
    RegexAnchored => {
        // Use extracted literal seeds from the regex
        if regex_has_unescaped_alternation(pattern) {
            // OR pattern: union of each seed's candidates
            seeds.iter().fold(Vec::new(), |acc, seed| {
                sorted_union(&acc, &fast_candidates_for_seed(fast, seed))
            })
        } else {
            // AND pattern: intersection of each seed's candidates
            seeds.iter().fold(None, |acc, seed| {
                intersect_sorted(acc, fast_candidates_for_seed(fast, seed))
            })
        }
    }
}
```
The final step: read each candidate file and run the actual pattern match. This is the only phase that touches the filesystem.
The workers share two pieces of state: a `done` flag for early termination and `total_found` to track the running match count. Candidate files are read either via mmap (large files) or into an owned buffer; both implement `Deref<Target = [u8]>`, so the matcher doesn't care which variant it gets.
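A minimal sketch of this early-terminating parallel scan, under stated assumptions: `std::thread::scope` stands in for rayon's pool, candidates are plain byte buffers rather than files, and the matcher is a naive substring count instead of a compiled regex.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

// Count occurrences of `needle` across candidate buffers in parallel,
// setting a shared `done` flag once `max_results` matches are found so
// remaining workers can bail out early.
fn verify(candidates: &[Vec<u8>], needle: &[u8], max_results: usize) -> usize {
    let done = AtomicBool::new(false);
    let found = AtomicUsize::new(0);
    thread::scope(|s| {
        let done = &done;
        let found = &found;
        // Split work into up to 4 chunks, one scoped thread each.
        for chunk in candidates.chunks(((candidates.len() + 3) / 4).max(1)) {
            s.spawn(move || {
                for contents in chunk {
                    if done.load(Ordering::Relaxed) {
                        return; // another thread already hit the cap
                    }
                    let hits = contents
                        .windows(needle.len())
                        .filter(|w| *w == needle)
                        .count();
                    if hits > 0
                        && found.fetch_add(hits, Ordering::Relaxed) + hits >= max_results
                    {
                        done.store(true, Ordering::Relaxed);
                    }
                }
            });
        }
    });
    found.load(Ordering::Relaxed)
}
```

With a generous `max_results`, the count is deterministic; once the cap is reachable, the early-out makes the exact total depend on thread timing, which is the intended trade-off.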
All posting lists and candidate sets are sorted Vec<u32>. Two-pointer merge avoids HashSet allocation overhead.
```rust
// sorted_intersect — keep only doc_ids present in both lists
while i < a.len() && j < b.len() {
    match a[i].cmp(&b[j]) {
        Equal => { out.push(a[i]); i += 1; j += 1; }
        Less => i += 1,
        Greater => j += 1,
    }
}
```
```rust
// sorted_union — merge doc_ids from either list
while i < a.len() && j < b.len() {
    match a[i].cmp(&b[j]) {
        Equal => { out.push(a[i]); i += 1; j += 1; }
        Less => { out.push(a[i]); i += 1; }
        Greater => { out.push(b[j]); j += 1; }
    }
}
// drain remaining from both sides
```
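Completed into self-contained form (same two-pointer technique, with the trailing drain written out), the merges look like this:

```rust
use std::cmp::Ordering;

// Intersection of two sorted, deduplicated doc_id lists.
fn sorted_intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
        }
    }
    out
}

// Union of two sorted, deduplicated doc_id lists.
fn sorted_union(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
            Ordering::Less => { out.push(a[i]); i += 1; }
            Ordering::Greater => { out.push(b[j]); j += 1; }
        }
    }
    out.extend_from_slice(&a[i..]); // drain remaining from both sides
    out.extend_from_slice(&b[j..]);
    out
}
```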
A custom flat binary format designed for mmap. No parsing step — posting lists are read as raw &[u32] slices directly from mapped memory. Every section is at a fixed or header-declared offset.
At open(), all entries are loaded into a HashMap<u32, (u32, u32)> for O(1) lookup.
Strings stored in pool; entries hold (offset, length) pairs.
```rust
pub fn content_postings(&self, trigram: u32) -> Option<Vec<u32>> {
    // O(1) HashMap lookup
    let (offset, count) = self.content_table.get(&trigram)?;

    // Calculate byte position in mmap'd region
    let byte_offset = self.content_postings_offset + (*offset as usize) * 4;

    // Zero-copy: cast raw bytes to &[u32]
    let ptr = self.mmap[byte_offset..].as_ptr() as *const u32;
    let slice = unsafe { slice::from_raw_parts(ptr, *count as usize) };
    Some(slice.to_vec())
}
// Total cost: 1 HashMap lookup + 1 pointer cast + 1 Vec copy
```
The CLI binary and benchmark harness are thin wrappers that parse arguments, dispatch to the engine, and format output.
| Command | Handler | What It Does |
|---|---|---|
| build | handle_build() | Full index build: walk → accumulate → persist base.bin + fast.idx |
| update | handle_update() | Incremental update: fingerprint diff → delta or rebuild |
| search | handle_search() | Single query: plan → route → execute → JSON output |
| session | handle_session() | Multi-query from JSON file: per-query routing + aggregated metrics |
| stats | handle_stats() | Display index metadata: repo stats, build time, doc count |
| measure | handle_measure() | Scan repo without building index: for repo classification |
Reads a manifest YAML listing repositories and query types. For each repo+query, runs cold trials (fresh process) and warm trials (reused index), measuring wall time via process metrics. Compares against ripgrep baseline.