Rust Workspace · 4 Crates · ~3,500 LOC

Code Architecture

A deep dive into every crate, module, struct, and data flow in TriSeek — how the pieces fit together, what each component does, and how a search query travels through the system.

01 — Workspace

Project Structure

TriSeek is a Cargo workspace with 4 crates, each with a clear responsibility boundary. Dependencies flow downward: CLI depends on everything, Index depends on Core, Core depends on nothing.

Entry Points
triseek
CLI binary. Parses args, routes commands (build, search, session, update, stats, measure), formats output as JSON.
search-bench
Benchmark harness. Loads manifest YAML, runs cold/warm trials, measures p50/p90, compares against ripgrep baseline.
Index & Engine
engine.rs
SearchEngine struct. Dual-backend (Fast/Legacy). Parallel verification via rayon. Regex matching + line extraction.
fastindex.rs
Mmap-based binary index. Zero-copy posting list reads. 96-byte header, trigram tables, doc table, string pool.
build.rs
Index construction. Full builds + delta updates. BuildAccumulator collects trigrams per file, then persists.
walker.rs
Parallel file walker using ignore crate. Respects .gitignore. Binary detection. Up to 8 threads.
model.rs
Data models: PersistedIndex, RuntimeIndex, DeltaSnapshot, DocumentRecord. Merge logic for base+delta.
storage.rs
File I/O layer. Reads/writes base.bin, delta.bin, metadata.json, fast.idx. Path conventions.
Core
planner.rs
Query planning + adaptive routing. Classifies query shape, selectivity, repo size. Chooses Indexed/Scan/Ripgrep.
trigram.rs
Trigram encoding (3 bytes → u32), extraction from text/bytes. Normalization for case-insensitive matching.
query.rs
QueryRequest struct with pattern, kind, filters (path, extension, glob), case mode, max_results.
result.rs
SearchResponse, SearchHit (Content/Path), SearchLineMatch, SearchSummary. Full query audit trail.
repo.rs
RepoStats, RepoCategory (Small/Medium/Large/VeryLarge), IndexMetadata, FileFingerprint, BuildStats.
metrics.rs
ProcessMetrics (wall/cpu/rss), SearchMetrics (candidates/verified/bytes), SessionMetrics (amortized costs).

Directory Layout

file tree
TriSeek/
  Cargo.toml                      # workspace root
  crates/
    search-core/src/
      lib.rs                      # re-exports all modules
      query.rs                    # QueryRequest, SearchKind, CaseMode
      result.rs                   # SearchResponse, SearchHit, SearchLineMatch
      planner.rs                  # plan_query(), route_query(), extract_regex_literals()
      trigram.rs                  # encode_trigram(), trigrams_from_bytes()
      repo.rs                     # RepoStats, RepoCategory, IndexMetadata
      metrics.rs                  # ProcessMetrics, SearchMetrics, SessionMetrics
    search-index/src/
      lib.rs                      # re-exports
      engine.rs                   # SearchEngine, IndexBackend, parallel verification
      fastindex.rs                # FastIndex mmap format, write_fast_index()
      build.rs                    # build_index(), update_index(), BuildAccumulator
      walker.rs                   # walk_repository_parallel(), binary detection
      model.rs                    # PersistedIndex, RuntimeIndex, DeltaSnapshot
      storage.rs                  # file I/O: base.bin, delta.bin, fast.idx
    search-cli/src/
      main.rs                     # CLI entry: build, search, session, update, stats
    search-bench/src/
      main.rs                     # benchmark harness with cold/warm trials
  ~/.triseek/indexes/<root-key>/   # generated index directory
    base.bin                      # bincode-serialized full index
    fast.idx                      # mmap binary format (primary)
    delta.bin                     # optional incremental changes
    metadata.json                 # index metadata + repo stats

Key Dependencies

Crate | Purpose | Used In
rayon 1.10 | Work-stealing parallelism | engine.rs — parallel file verification
memmap2 0.9 | Memory-mapped file I/O | fastindex.rs — zero-copy index, engine.rs — large file reads
ignore 0.4 | Git-aware parallel file walking | walker.rs — respects .gitignore
regex / regex-syntax | Pattern matching + AST parsing | engine.rs — verification, planner.rs — literal extraction
bincode 2.0 | Binary serialization | storage.rs — base.bin / delta.bin format
xxhash-rust | Fast 64-bit hashing | walker.rs — file fingerprinting for delta detection
clap 4.5 | CLI argument parsing | triseek main.rs
globset 0.4 | Glob pattern matching | engine.rs — --glob path filters
02 — Crate Deep Dive

search-core: The Brain

The core crate has zero I/O. It defines query types, result types, the trigram encoding scheme, repo classification, and — most importantly — the query planner that decides how to execute each search.

QueryRequest query.rs

Every search starts here. The CLI parses user input into this struct, which flows through the entire pipeline.

struct QueryRequest
kind: SearchKind — Literal | Regex | Path | Auto
engine: SearchEngineKind — Indexed | DirectScan | Ripgrep | Auto
pattern: String — The search pattern
case_mode: CaseMode — Sensitive | Insensitive
path_substrings: Vec<String> — Filter: path must contain these
path_prefixes: Vec<String> — Filter: path starts with these
exact_paths: Vec<String> — Filter: exact path match
exact_names: Vec<String> — Filter: exact filename
extensions: Vec<String> — Filter: file extensions (.rs, .go)
globs: Vec<String> — Filter: glob patterns
max_results: Option<usize> — Early termination limit
Query Planner planner.rs

The planner runs in two phases: plan (what shape is this query?) then route (which engine should execute it?).

QueryShape enum

Literal — Plain string, 3+ chars
ShortLiteral — < 3 chars (no useful trigrams)
RegexAnchored — Regex with extractable literals
RegexWeak — Pure regex, no literal seeds
Path — Path-only search

QuerySelectivity enum

High — > 5 chars → very few candidates
Medium — 3-5 chars → moderate candidates
Low — < 3 chars → many candidates
Unknown — Regex without clear literal length
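As a concrete illustration of these buckets, a classifier keyed only on pattern length might look like the sketch below; the enum name matches the document, but the function name and exact boundary handling are assumptions about planner.rs.

```rust
// Hypothetical sketch of the selectivity buckets described above;
// the real planner may weigh more signals than pattern length.
#[derive(Debug, PartialEq)]
enum QuerySelectivity {
    High,   // > 5 chars: very few candidates
    Medium, // 3-5 chars: moderate candidates
    Low,    // < 3 chars: many candidates
}

fn classify_selectivity(pattern: &str) -> QuerySelectivity {
    match pattern.len() {
        0..=2 => QuerySelectivity::Low,
        3..=5 => QuerySelectivity::Medium,
        _ => QuerySelectivity::High,
    }
}

fn main() {
    assert_eq!(classify_selectivity("ab"), QuerySelectivity::Low);
    assert_eq!(classify_selectivity("http"), QuerySelectivity::Medium);
    assert_eq!(classify_selectivity("HttpResponse"), QuerySelectivity::High);
    println!("ok");
}
```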
planner.rs — plan_query()
pub fn plan_query(request: &QueryRequest) -> QueryPlan {
    match request.kind {
        Path         => strategy: PathIndex,
        Literal|Auto => {
            if pattern.len() < 3 => ShortLiteral + DirectScan
            else                  => Literal + Indexed
        }
        Regex        => {
            seeds = extract_regex_literals(pattern)
            if longest_seed >= 3 => RegexAnchored + Indexed
            else                   => RegexWeak + DirectScan
        }
    }
}
Trigram Encoding trigram.rs

3 bytes packed into a u32. All text is lowercased before encoding for case-insensitive index lookups.

trigram.rs
pub type Trigram = u32;

pub fn encode_trigram(bytes: &[u8]) -> Option<Trigram> {
    if bytes.len() < 3 {
        return None; // fewer than 3 bytes: no trigram
    }
    Some((bytes[0] as u32) << 16
       | (bytes[1] as u32) << 8
       |  bytes[2] as u32)
}

pub fn trigrams_from_bytes(bytes: &[u8]) -> Vec<Trigram> {
    let normalized = normalize_for_index(bytes); // lowercase
    normalized.windows(3)
        .filter_map(encode_trigram)
        .collect::<BTreeSet<_>>() // dedup + sort
        .into_iter().collect()
}
SearchResponse result.rs

The full audit trail: what was requested, how it was planned, which engine ran it, and every hit found.

struct SearchResponse
request: QueryRequest — Original query
effective_kind: SearchKind — Resolved search kind
engine: SearchEngineKind — Which engine ran
routing: AdaptiveRoutingDecision — Why this engine was chosen
plan: QueryPlan — Shape, selectivity, seeds
hits: Vec<SearchHit> — Content { path, lines } | Path { path }
summary: SearchSummary — files_with_matches, total_line_matches
metrics: SearchMetrics — Timing, candidates, bytes scanned
Repo Classification repo.rs

Category | Files | Disk Size | Example
Small | < 5K | < 200 MB | serde, ripgrep
Medium | 5K–50K | 200 MB–2 GB | kubernetes
Large | 50K–500K | 2–20 GB | linux, rust
VeryLarge | > 500K | > 20 GB | chromium
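For illustration only, a classifier over the file-count column of this table might be sketched as follows; the real repo.rs presumably also weighs disk size, and the function name is hypothetical.

```rust
// Hypothetical helper matching the file-count thresholds in the table above.
#[derive(Debug, PartialEq)]
enum RepoCategory { Small, Medium, Large, VeryLarge }

fn categorize(file_count: u64) -> RepoCategory {
    match file_count {
        0..=4_999 => RepoCategory::Small,
        5_000..=49_999 => RepoCategory::Medium,
        50_000..=499_999 => RepoCategory::Large,
        _ => RepoCategory::VeryLarge,
    }
}

fn main() {
    assert_eq!(categorize(1_000), RepoCategory::Small);
    assert_eq!(categorize(28_132), RepoCategory::Medium); // kubernetes-scale
    assert_eq!(categorize(700_000), RepoCategory::VeryLarge);
    println!("ok");
}
```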
03 — Index Data Model

How Data is Stored

The index exists in three layers: PersistedIndex (full snapshot), optional DeltaSnapshot (incremental changes), and RuntimeIndex (merged in-memory view). The FastIndex provides the same data via mmap.

DocumentRecord model.rs

Each file in the repo gets a DocumentRecord with a unique doc_id. The fingerprint enables delta detection on subsequent builds.

struct DocumentRecord
doc_id: u32 — Unique identifier within index
relative_path: String — "src/engine.rs" (always / separators)
file_name: String — "engine.rs"
extension: Option<String> — "rs" (lowercase)
fingerprint: FileFingerprint — { size, modified_unix_secs, hash: xxh3_64 }
PersistedIndex base.bin

Full snapshot serialized with bincode. Contains everything needed to answer queries.

struct PersistedIndex
docs: Vec<DocumentRecord>
content_postings: Vec<PostingListEntry>
path_postings: Vec<PostingListEntry>
filename_map: Vec<NamePostingEntry>
extension_map: Vec<NamePostingEntry>
DeltaSnapshot delta.bin

Incremental changes since last full build. Merged into base at load time via RuntimeIndex::from_snapshots().

struct DeltaSnapshot
removed_paths: Vec<String>
docs: Vec<DocumentRecord>
content_postings: Vec<PostingListEntry>
+ same maps

If the delta grows beyond 25% of the base, the updater triggers a full rebuild instead.

IndexBackend engine.rs

The engine abstracts over two index implementations. It prefers Fast (mmap) but falls back to Legacy when a delta layer exists.

enum IndexBackend
Fast(FastIndex) — mmap'd binary format. Used when fast.idx exists and no delta layer. <5ms open time.
Legacy(RuntimeIndex) — Deserialized from bincode. Merges base + delta. HashMaps in memory. ~600ms open time.
engine.rs — open()
pub fn open(index_dir: &Path) -> Result<Self> {
    let metadata = read_index_metadata(index_dir)?;
    let has_delta = delta_exists(index_dir);

    if fast_index_exists(index_dir) && !has_delta {
        // Preferred: zero-copy mmap
        let fast = FastIndex::open(fast_index_path(index_dir))?;
        return Ok(SearchEngine { backend: Fast(fast), .. });
    }

    // Fallback: deserialize + merge
    let base = load_base(index_dir)?;
    let delta = load_delta(index_dir)?;
    let runtime = RuntimeIndex::from_snapshots(base, delta);
    Ok(SearchEngine { backend: Legacy(runtime), .. })
}
04 — Index Build Pipeline

Building the Index

The build pipeline walks the repository in parallel, extracts trigrams from each file, accumulates posting lists, and writes both bincode (legacy) and mmap (fast) index formats.

walk_parallel()
walker.rs
8 threads, .gitignore
BuildAccumulator
build.rs
trigrams per file
PersistedIndex
model.rs
posting lists + docs
persist()
storage.rs
base.bin + metadata.json
write_fast_index()
fastindex.rs
fast.idx (mmap)
Parallel Walker walker.rs

Uses the ignore crate's parallel walker which natively respects .gitignore, .ignore, and hidden file rules.

For each file:

  1. Check file size ≤ max (default: no limit)
  2. Read full contents into memory
  3. Binary detection: null bytes in first 4KB, or >20% control chars
  4. Compute xxh3_64 hash for fingerprinting
  5. Emit ScannedFile to shared Mutex<Vec>
struct ScannedFile
relative_path: String
contents: Vec<u8>
content_hash: u64 (xxh3)
extension: Option<String>
file_size: u64
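Step 3's heuristic can be sketched as a small predicate; the thresholds come from the text above, but the exact set of bytes counted as control characters is an assumption.

```rust
// Sketch of the binary-detection heuristic: NUL in the first 4KB,
// or more than 20% control characters (newline/CR/tab excluded).
fn looks_binary(bytes: &[u8]) -> bool {
    let head = &bytes[..bytes.len().min(4096)];
    // Rule 1: any NUL byte in the inspected prefix
    if head.contains(&0) {
        return true;
    }
    // Rule 2: > 20% control characters
    let ctrl = head
        .iter()
        .filter(|&&b| b < 0x20 && !matches!(b, b'\n' | b'\r' | b'\t'))
        .count();
    ctrl * 5 > head.len()
}

fn main() {
    assert!(!looks_binary(b"fn main() {}\n"));
    assert!(looks_binary(b"\x7fELF\x02\x01\x01\x00")); // ELF magic has a NUL
    println!("ok");
}
```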
BuildAccumulator build.rs

For each ScannedFile, the accumulator assigns a doc_id, extracts content + path trigrams, and builds inverted posting lists.

build.rs — push()
fn push(&mut self, file: ScannedFile) {
    let doc_id = self.next_doc_id;  // sequential u32
    self.next_doc_id += 1;

    // Content trigrams → posting lists
    for tri in trigrams_from_bytes(&file.contents) {
        self.content_postings
            .entry(tri).or_default().push(doc_id);
    }

    // Path trigrams → separate posting lists
    for tri in trigrams_from_bytes(file.relative_path.as_bytes()) {
        self.path_postings
            .entry(tri).or_default().push(doc_id);
    }

    // Filename + extension maps for exact lookups
    self.filename_map.entry(file.file_name.to_lowercase())...;
    self.extension_map.entry(ext.to_lowercase())...;
}
Delta Updates build.rs — update_index()

Instead of rebuilding from scratch, update_index() compares fingerprints (size + mtime + xxh3 hash) to detect changes.

How many files changed since last build?
> 25% of files → full rebuild (faster than delta merge)
≤ 25% → create DeltaSnapshot with only changed/new/removed files

Delta exists as delta.bin alongside base.bin. At load time, RuntimeIndex::from_snapshots() merges them. Note: when a delta exists, the Fast (mmap) backend is not used — it falls back to Legacy to apply the merge.
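A minimal sketch of the 25% threshold check, with hypothetical names (the real decision sits in update_index()):

```rust
// Hypothetical sketch of the delta-vs-rebuild decision described above.
#[derive(Debug, PartialEq)]
enum UpdateAction { FullRebuild, DeltaSnapshot }

fn choose_update(changed_files: usize, total_files: usize) -> UpdateAction {
    // > 25% churn: a full rebuild beats merging a large delta
    if changed_files * 4 > total_files {
        UpdateAction::FullRebuild
    } else {
        UpdateAction::DeltaSnapshot
    }
}

fn main() {
    assert_eq!(choose_update(30, 100), UpdateAction::FullRebuild);
    assert_eq!(choose_update(10, 100), UpdateAction::DeltaSnapshot);
    println!("ok");
}
```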

05 — Search Execution

End-to-End Search Flow

What happens from the moment you type a query to when results appear. Every search travels through the same pipeline: parse → plan → route → filter → candidates → verify → results.

CLI Parse
main.rs
args → QueryRequest
plan_query()
planner.rs
shape + selectivity
route_query()
planner.rs
Indexed / rg / Scan
Path Filters
engine.rs
ext, name, glob, prefix
Trigram Candidates
engine.rs
posting list intersect
Parallel Verify
engine.rs + rayon
regex on candidates
SearchResponse
result.rs
hits + metrics
Example: Searching for "HttpResponse" in Kubernetes

1. CLI parses input

triseek search "HttpResponse" → QueryRequest { kind: Auto, pattern: "HttpResponse", engine: Auto }

2. plan_query() classifies the query

12 chars → shape: Literal, selectivity: High, strategy: Indexed, seeds: ["httpresponse"]

3. route_query() selects engine

Kubernetes = Medium repo, High selectivity → route: Indexed. Reason: "medium_repo, high selectivity, index available"

4. SearchEngine::open() loads fast.idx via mmap

No delta exists → Fast backend. Mmap maps file into address space in <5ms. No deserialization.

5. fast_path_filtered_docs() applies path filters

No path filters in this query → all 28,132 doc_ids pass through.

6. fast_index_candidates() intersects posting lists

Extract trigrams from "httpresponse" (lowercased): "htt", "ttp", "tpr", "pre", "res", "esp", "spo", "pon", "ons", "nse". Look up each in content_table → 10 posting lists. Sorted intersection yields ~5 candidate doc_ids.

7. collect_content_hits_parallel() verifies candidates

5 candidates distributed across rayon thread pool. Each file: read (mmap if >32KB), compile regex from escaped literal, scan lines. AtomicBool for early termination if max_results reached.

8. Return SearchResponse

Hits with file paths + line matches, summary (files_with_matches, total_line_matches), metrics (wall_millis: ~50ms, candidate_docs: 5, bytes_scanned: 45KB).

06 — Adaptive Routing

The Decision Tree

route_query() in planner.rs decides which execution engine handles each query. The decision considers repo size, query shape, selectivity, whether an index exists, and session context.

Routing Logic

Did the user explicitly choose an engine? (--engine index/scan/rg)
  Yes → use that engine directly
Is an index available?
  No + Path query → DirectScan (walk files, match paths)
  No + Content query → Ripgrep (fastest without index)
Index exists. What's the repo category?
  Small + not repeated_session → Ripgrep (page cache makes brute force free)
  Medium + not repeated_session + Low selectivity → Ripgrep
  Medium/Large/VeryLarge + High/Medium selectivity → Indexed
  Any size + repeated_session hint → Indexed (amortize index load)

After routing, adjust_route_for_filters() in main.rs may override: if heavy path filters are set and the route was Ripgrep, it may switch to Indexed if the index can narrow candidates faster than rg can walk+filter.
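Condensing the tree above into code might look like the sketch below; every type and parameter name here is an assumption about planner.rs, and branches the tree leaves ambiguous default to Indexed.

```rust
// Hypothetical condensation of the routing decision tree.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Engine { Indexed, DirectScan, Ripgrep }
#[derive(PartialEq, Clone, Copy)]
enum Category { Small, Medium, Large, VeryLarge }
#[derive(PartialEq, Clone, Copy)]
enum Selectivity { High, Medium, Low, Unknown }

fn route(
    forced: Option<Engine>,
    index_available: bool,
    path_query: bool,
    category: Category,
    selectivity: Selectivity,
    repeated_session: bool,
) -> Engine {
    if let Some(engine) = forced {
        return engine; // explicit --engine wins
    }
    if !index_available {
        return if path_query { Engine::DirectScan } else { Engine::Ripgrep };
    }
    if repeated_session {
        return Engine::Indexed; // amortize index load across queries
    }
    match (category, selectivity) {
        (Category::Small, _) => Engine::Ripgrep,
        (Category::Medium, Selectivity::Low) => Engine::Ripgrep,
        _ => Engine::Indexed, // default for the remaining branches (assumed)
    }
}

fn main() {
    // Medium repo, high selectivity, one-off query -> Indexed
    assert_eq!(route(None, true, false, Category::Medium, Selectivity::High, false), Engine::Indexed);
    // Small repo with an index but no session -> Ripgrep
    assert_eq!(route(None, true, false, Category::Small, Selectivity::High, false), Engine::Ripgrep);
    println!("ok");
}
```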

Indexed

Open index → path filter → trigram candidates → parallel verify. Cost: O(posting lists + candidates).

Best for: medium+ repos, selective queries, session workloads.

Ripgrep

Spawn rg --json subprocess. Parse JSON output for hits. Cost: O(all files).

Best for: small repos, weak regex, one-off queries, no index built yet.

DirectScan

Walk repo with scanner, apply path/content filters inline. Cost: O(all files).

Best for: short literals (<3 chars), path-only queries without index.

07 — Search Engine Internals

Inside engine.rs

The ~1000-line engine module is the heart of TriSeek. It orchestrates path filtering, candidate selection, parallel verification, and line-level match extraction.

Path Filtering Pipeline

Before any content search, path filters narrow the doc set. Each filter type uses the most efficient lookup available.

Filter | Fast Backend | Legacy Backend
exact_paths | path_to_doc HashMap O(1) | path_lookup HashMap O(1)
exact_names | filename_map HashMap O(1) | filename_map HashMap O(1)
extensions | extension_map HashMap O(1) | extension_map HashMap O(1)
path_substrings | Trigram search on path postings (3+ chars) or linear scan (<3) | Same, via path_postings HashMap
path_prefixes | Linear scan all_docs(), starts_with check | Same
globs | GlobSet match on all_docs() | Same

Multiple filters are intersected: a doc must pass ALL filters. Result is a sorted Vec<u32> of surviving doc_ids.

Trigram Candidate Selection

engine.rs — fast_index_candidates()
match plan.shape {
    Literal | Auto if pattern.len() < 3 => {
        // Too short for trigrams — return all filtered docs
        return filtered_docs;
    }
    Literal | Auto => {
        // Extract trigrams from the literal, intersect posting lists
        let candidates = fast_candidates_for_seed(fast, pattern);
        sorted_intersect(&candidates, &filtered_docs)
    }
    RegexAnchored => {
        // Use extracted literal seeds from the regex
        if regex_has_unescaped_alternation(pattern) {
            // OR pattern: union of each seed's candidates
            seeds.iter().fold(Vec::new(), |acc, seed|
                sorted_union(&acc, &fast_candidates_for_seed(fast, seed))
            )
        } else {
            // AND pattern: intersection of each seed's candidates
            seeds.iter().fold(None, |acc, seed|
                sorted_intersect(acc, fast_candidates_for_seed(fast, seed))
            )
        }
    }
}
Parallel Verification rayon par_iter()

The final step: read each candidate file and run the actual pattern match. This is the only phase that touches the filesystem.

How it works

  1. Resolve all candidates upfront as (doc_id, PathBuf, rel_path)
  2. Build regex matcher from pattern + case mode
  3. Rayon distributes files across work-stealing thread pool
  4. Each thread: read file (mmap >32KB, Vec <32KB)
  5. match_lines() scans for regex matches, extracts line number + column
  6. AtomicBool done flag for early termination
  7. AtomicUsize total_found tracks match count
  8. Mutex protects shared results Vec

CandidateBytes

enum CandidateBytes
Owned(Vec<u8>) — < 32KB files
Mapped(Mmap) — ≥ 32KB files, zero-copy

Both implement Deref<Target=[u8]> so the matcher doesn't care which variant it gets.
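The dispatch pattern can be shown dependency-free; in the sketch below, FakeMmap stands in for memmap2::Mmap so the example compiles on its own, and count_lines is a hypothetical stand-in for the matcher.

```rust
use std::ops::Deref;

// FakeMmap is a stand-in for a memory-mapped region (memmap2::Mmap
// in the real engine); the Deref dispatch is the point of the sketch.
struct FakeMmap(Vec<u8>);

enum CandidateBytes {
    Owned(Vec<u8>),
    Mapped(FakeMmap),
}

impl Deref for CandidateBytes {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        match self {
            CandidateBytes::Owned(v) => v,
            CandidateBytes::Mapped(m) => &m.0,
        }
    }
}

// The matcher only ever sees &[u8], regardless of how the file was read.
fn count_lines(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == b'\n').count()
}

fn main() {
    let small = CandidateBytes::Owned(b"a\nb\n".to_vec());
    let large = CandidateBytes::Mapped(FakeMmap(b"x\ny\nz\n".to_vec()));
    assert_eq!(count_lines(&small), 2); // deref coercion picks the variant
    assert_eq!(count_lines(&large), 3);
    println!("ok");
}
```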

Set Operations on Sorted Arrays

All posting lists and candidate sets are sorted Vec<u32>. Two-pointer merge avoids HashSet allocation overhead.

sorted_intersect() — O(n+m)
fn sorted_intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    use std::cmp::Ordering::*;
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Equal   => { out.push(a[i]); i += 1; j += 1; }
            Less    => i += 1,
            Greater => j += 1,
        }
    }
    out
}
sorted_union() — O(n+m)
fn sorted_union(a: &[u32], b: &[u32]) -> Vec<u32> {
    use std::cmp::Ordering::*;
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Equal   => { out.push(a[i]); i += 1; j += 1; }
            Less    => { out.push(a[i]); i += 1; }
            Greater => { out.push(b[j]); j += 1; }
        }
    }
    // drain remaining from both sides
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
08 — Fast Index Format

Inside fast.idx

A custom flat binary format designed for mmap. No parsing step — posting lists are read as raw &[u32] slices directly from mapped memory. Every section is at a fixed or header-declared offset.

Binary Layout

Header (96 B) → Content Trigram Table → Content Postings → Path Trigram Table → Path Postings → Doc Table → String Pool
Header (96 bytes)
magic: [u8; 8] — "TRISEEK\0"
version: u32 — Currently 2
num_docs: u32 — Total documents indexed
num_content_trigrams: u32 — Distinct content trigrams
num_path_trigrams: u32 — Distinct path trigrams
content_table_offset: u64 — Byte offset to content trigram table
content_postings_offset: u64 — Byte offset to content posting arrays
path_table_offset: u64 — Byte offset to path trigram table
path_postings_offset: u64 — Byte offset to path posting arrays
docs_offset: u64 — Byte offset to doc table
strings_offset: u64 — Byte offset to string pool
strings_size: u64 — Total string pool size
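Reading the first header fields from a mapped byte slice might be sketched like this; little-endian encoding and padding out to the 96 bytes are assumptions, and parse_header is a hypothetical name.

```rust
// Sketch of decoding the fixed header: magic check, then the first
// two u32 fields in the order listed above (endianness assumed LE).
fn parse_header(bytes: &[u8]) -> Option<(u32, u32)> {
    if bytes.len() < 96 || &bytes[..8] != b"TRISEEK\0" {
        return None; // wrong magic or truncated file
    }
    let version = u32::from_le_bytes(bytes[8..12].try_into().ok()?);
    let num_docs = u32::from_le_bytes(bytes[12..16].try_into().ok()?);
    Some((version, num_docs))
}

fn main() {
    let mut header = vec![0u8; 96];
    header[..8].copy_from_slice(b"TRISEEK\0");
    header[8..12].copy_from_slice(&2u32.to_le_bytes());
    header[12..16].copy_from_slice(&5u32.to_le_bytes());
    assert_eq!(parse_header(&header), Some((2, 5)));
    assert_eq!(parse_header(b"NOPE"), None);
    println!("ok");
}
```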
Trigram Table Entry (12 bytes each)

trigram: u32 — The trigram key
offset: u32 — Index into posting array
count: u32 — Number of doc_ids

At open(), all entries are loaded into a HashMap<u32, (u32, u32)> for O(1) lookup.

Doc Table Entry (46 bytes each)

doc_id: u32
path_offset: u32 + u16 len
name_offset: u32 + u16 len
ext_offset: u32 + u8 len
fingerprint: u64 + i64 + u64

Strings stored in pool; entries hold (offset, length) pairs.

Read Path: Trigram Lookup → Posting List

fastindex.rs — content_postings()
pub fn content_postings(&self, trigram: u32) -> Option<Vec<u32>> {
    // O(1) HashMap lookup
    let (offset, count) = self.content_table.get(&trigram)?;

    // Calculate byte position in mmap'd region
    let byte_offset = self.content_postings_offset
                    + (*offset as usize) * 4;

    // Zero-copy: cast raw bytes to &[u32]
    let ptr = self.mmap[byte_offset..].as_ptr() as *const u32;
    let slice = unsafe {
        slice::from_raw_parts(ptr, *count as usize)
    };
    Some(slice.to_vec())
}
// Total cost: 1 HashMap lookup + 1 pointer cast + 1 Vec copy
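The unsafe cast above relies on the postings region being 4-byte aligned within the mapping. A safe variant (a sketch, not the crate's actual code) trades the cast for per-element decoding; little-endian storage is an assumption.

```rust
// Safe sketch: decode `count` u32 values starting at byte_offset
// without an unsafe pointer cast, via chunks_exact + from_le_bytes.
fn read_postings(mmap: &[u8], byte_offset: usize, count: usize) -> Vec<u32> {
    mmap[byte_offset..byte_offset + count * 4]
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

fn main() {
    let mut buf = vec![0u8; 4]; // 4 bytes of unrelated prefix
    for doc_id in [3u32, 7, 42] {
        buf.extend_from_slice(&doc_id.to_le_bytes());
    }
    assert_eq!(read_postings(&buf, 4, 3), vec![3, 7, 42]);
    println!("ok");
}
```

The cost is one pass over the posting list instead of a pointer cast, which is usually negligible next to the Vec copy the original code already makes.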
09 — CLI & Bench

Entry Points

The CLI binary and benchmark harness are thin wrappers that parse arguments, dispatch to the engine, and format output.

CLI Commands crates/search-cli/src/main.rs

Command | Handler | What It Does
build | handle_build() | Full index build: walk → accumulate → persist base.bin + fast.idx
update | handle_update() | Incremental update: fingerprint diff → delta or rebuild
search | handle_search() | Single query: plan → route → execute → JSON output
session | handle_session() | Multi-query from JSON file: per-query routing + aggregated metrics
stats | handle_stats() | Display index metadata: repo stats, build time, doc count
measure | handle_measure() | Scan repo without building index: for repo classification
Benchmark Harness search-bench/main.rs

Reads a manifest YAML listing repositories and query types. For each repo+query, runs cold trials (fresh process) and warm trials (reused index), measuring wall time via process metrics. Compares against ripgrep baseline.

Benchmark types

  • literal_selective — rare string
  • literal_moderate — medium frequency
  • literal_high — common string
  • regex_anchor — regex with literal seeds
  • regex_weak — pure regex, no literals
  • multi_or — alternation pattern
  • path_* — filename/path queries
  • session_20/100 — multi-query workload

Output

  • report.json — full results
  • report.csv — p50/p90 per query
  • summary.md — human-readable report
  • Correctness validation: match counts must agree with rg baseline