Rust Workspace · 4 Crates · ~3,500 LOC

Code Architecture

A deep dive into every crate, module, struct, and data flow in TriSeek — how the pieces fit together, what each component does, and how a search query travels through the system.

01 — Workspace

Project Structure

TriSeek is a Cargo workspace with 4 crates, each with a clear responsibility boundary. Dependencies flow downward: CLI depends on everything, Index depends on Core, Core depends on nothing.

Entry Points
triseek
CLI binary. Parses args, routes commands (build, search, session, update, stats, measure), formats output as JSON.
search-bench
Benchmark harness. Loads manifest YAML, runs cold/warm trials, measures p50/p90, compares against ripgrep baseline.
Index & Engine
engine.rs
SearchEngine struct. Dual-backend (Fast/Legacy). Parallel verification via rayon. Regex matching + line extraction.
fastindex.rs
Mmap-based binary index. Zero-copy posting list reads. 96-byte header, trigram tables, doc table, string pool.
build.rs
Index construction. Full builds + delta updates. BuildAccumulator collects trigrams per file, then persists.
walker.rs
Parallel file walker using ignore crate. Respects .gitignore. Binary detection. Up to 8 threads.
model.rs
Data models: PersistedIndex, RuntimeIndex, DeltaSnapshot, DocumentRecord. Merge logic for base+delta.
storage.rs
File I/O layer. Reads/writes base.bin, delta.bin, metadata.json, fast.idx. Path conventions.
Core
planner.rs
Query planning + adaptive routing. Classifies query shape, selectivity, repo size. Chooses Indexed/Scan/Ripgrep.
trigram.rs
Trigram encoding (3 bytes → u32), extraction from text/bytes. Normalization for case-insensitive matching.
query.rs
QueryRequest struct with pattern, kind, filters (path, extension, glob), case mode, max_results.
result.rs
SearchResponse, SearchHit (Content/Path), SearchLineMatch, SearchSummary. Full query audit trail.
repo.rs
RepoStats, RepoCategory (Small/Medium/Large/VeryLarge), IndexMetadata, FileFingerprint, BuildStats.
metrics.rs
ProcessMetrics (wall/cpu/rss), SearchMetrics (candidates/verified/bytes), SessionMetrics (amortized costs).

Directory Layout

file tree
TriSeek/
  Cargo.toml                      # workspace root
  crates/
    search-core/src/
      lib.rs                      # re-exports all modules
      query.rs                    # QueryRequest, SearchKind, CaseMode
      result.rs                   # SearchResponse, SearchHit, SearchLineMatch
      planner.rs                  # plan_query(), route_query(), extract_regex_literals()
      trigram.rs                  # encode_trigram(), trigrams_from_bytes()
      repo.rs                     # RepoStats, RepoCategory, IndexMetadata
      metrics.rs                  # ProcessMetrics, SearchMetrics, SessionMetrics
    search-index/src/
      lib.rs                      # re-exports
      engine.rs                   # SearchEngine, IndexBackend, parallel verification
      fastindex.rs                # FastIndex mmap format, write_fast_index()
      build.rs                    # build_index(), update_index(), BuildAccumulator
      walker.rs                   # walk_repository_parallel(), binary detection
      model.rs                    # PersistedIndex, RuntimeIndex, DeltaSnapshot
      storage.rs                  # file I/O: base.bin, delta.bin, fast.idx
    search-cli/src/
      main.rs                     # CLI entry: build, search, session, update, stats
    search-bench/src/
      main.rs                     # benchmark harness with cold/warm trials
  ~/.triseek/indexes/<root-key>/   # generated index directory
    base.bin                      # bincode-serialized full index
    fast.idx                      # mmap binary format (primary)
    delta.bin                     # optional incremental changes
    metadata.json                 # index metadata + repo stats

Key Dependencies

Crate | Purpose | Used In
rayon 1.10 | Work-stealing parallelism | engine.rs — parallel file verification
memmap2 0.9 | Memory-mapped file I/O | fastindex.rs — zero-copy index, engine.rs — large file reads
ignore 0.4 | Git-aware parallel file walking | walker.rs — respects .gitignore
regex / regex-syntax | Pattern matching + AST parsing | engine.rs — verification, planner.rs — literal extraction
bincode 2.0 | Binary serialization | storage.rs — base.bin / delta.bin format
xxhash-rust | Fast 64-bit hashing | walker.rs — file fingerprinting for delta detection
clap 4.5 | CLI argument parsing | triseek main.rs
globset 0.4 | Glob pattern matching | engine.rs — --glob path filters
02 — Crate Deep Dive

search-core: The Brain

The core crate has zero I/O. It defines query types, result types, the trigram encoding scheme, repo classification, and — most importantly — the query planner that decides how to execute each search.

QueryRequest query.rs

Every search starts here. The CLI parses user input into this struct, which flows through the entire pipeline.

struct QueryRequest
kind: SearchKind — Literal | Regex | Path | Auto
engine: SearchEngineKind — Indexed | DirectScan | Ripgrep | Auto
pattern: String — The search pattern
case_mode: CaseMode — Sensitive | Insensitive
path_substrings: Vec<String> — Filter: path must contain these
path_prefixes: Vec<String> — Filter: path starts with these
exact_paths: Vec<String> — Filter: exact path match
exact_names: Vec<String> — Filter: exact filename
extensions: Vec<String> — Filter: file extensions (.rs, .go)
globs: Vec<String> — Filter: glob patterns
max_results: Option<usize> — Early termination limit
Query Planner planner.rs

The planner runs in two phases: plan (what shape is this query?) then route (which engine should execute it?).

QueryShape enum

Literal — Plain string, 3+ chars
ShortLiteral — < 3 chars (no useful trigrams)
RegexAnchored — Regex with extractable literals
RegexWeak — Pure regex, no literal seeds
Path — Path-only search

QuerySelectivity enum

High — > 5 chars → very few candidates
Medium — 3-5 chars → moderate candidates
Low — < 3 chars → many candidates
Unknown — Regex without clear literal length
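As a concrete illustration of these buckets, a classifier keyed only on pattern length might look like the sketch below; the enum name matches the document, but the function name and exact boundary handling are assumptions about planner.rs.

```rust
// Hypothetical sketch of the selectivity buckets described above;
// the real planner may weigh more signals than pattern length.
#[derive(Debug, PartialEq)]
enum QuerySelectivity {
    High,   // > 5 chars: very few candidates
    Medium, // 3-5 chars: moderate candidates
    Low,    // < 3 chars: many candidates
}

fn classify_selectivity(pattern: &str) -> QuerySelectivity {
    match pattern.len() {
        0..=2 => QuerySelectivity::Low,
        3..=5 => QuerySelectivity::Medium,
        _ => QuerySelectivity::High,
    }
}

fn main() {
    assert_eq!(classify_selectivity("ab"), QuerySelectivity::Low);
    assert_eq!(classify_selectivity("http"), QuerySelectivity::Medium);
    assert_eq!(classify_selectivity("HttpResponse"), QuerySelectivity::High);
    println!("ok");
}
```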
planner.rs — plan_query()
pub fn plan_query(request: &QueryRequest) -> QueryPlan {
    match request.kind {
        Path         => strategy: PathIndex,
        Literal|Auto => {
            if pattern.len() < 3 => ShortLiteral + DirectScan
            else                  => Literal + Indexed
        }
        Regex        => {
            seeds = extract_regex_literals(pattern)
            if longest_seed >= 3 => RegexAnchored + Indexed
            else                   => RegexWeak + DirectScan
        }
    }
}
Trigram Encoding trigram.rs

3 bytes packed into a u32. All text is lowercased before encoding for case-insensitive index lookups.

trigram.rs
pub type Trigram = u32;

pub fn encode_trigram(bytes: &[u8]) -> Option<Trigram> {
    if bytes.len() < 3 {
        return None; // fewer than 3 bytes: no trigram
    }
    Some((bytes[0] as u32) << 16
       | (bytes[1] as u32) << 8
       |  bytes[2] as u32)
}

pub fn trigrams_from_bytes(bytes: &[u8]) -> Vec<Trigram> {
    let normalized = normalize_for_index(bytes); // lowercase
    normalized.windows(3)
        .filter_map(encode_trigram)
        .collect::<BTreeSet<_>>() // dedup + sort
        .into_iter().collect()
}
SearchResponse result.rs

The full audit trail: what was requested, how it was planned, which engine ran it, and every hit found.

struct SearchResponse
request: QueryRequest — Original query
effective_kind: SearchKind — Resolved search kind
engine: SearchEngineKind — Which engine ran
routing: AdaptiveRoutingDecision — Why this engine was chosen
plan: QueryPlan — Shape, selectivity, seeds
hits: Vec<SearchHit> — Content { path, lines } | Path { path }
summary: SearchSummary — files_with_matches, total_line_matches
metrics: SearchMetrics — Timing, candidates, bytes scanned
Repo Classification repo.rs

Category | Files | Disk Size | Example
Small | < 5K | < 200 MB | serde, ripgrep
Medium | 5K–50K | 200 MB–2 GB | kubernetes
Large | 50K–500K | 2–20 GB | linux, rust
VeryLarge | > 500K | > 20 GB | chromium
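For illustration only, a classifier over the file-count column of this table might be sketched as follows; the real repo.rs presumably also weighs disk size, and the function name is hypothetical.

```rust
// Hypothetical helper matching the file-count thresholds in the table above.
#[derive(Debug, PartialEq)]
enum RepoCategory { Small, Medium, Large, VeryLarge }

fn categorize(file_count: u64) -> RepoCategory {
    match file_count {
        0..=4_999 => RepoCategory::Small,
        5_000..=49_999 => RepoCategory::Medium,
        50_000..=499_999 => RepoCategory::Large,
        _ => RepoCategory::VeryLarge,
    }
}

fn main() {
    assert_eq!(categorize(1_000), RepoCategory::Small);
    assert_eq!(categorize(28_132), RepoCategory::Medium); // kubernetes-scale
    assert_eq!(categorize(700_000), RepoCategory::VeryLarge);
    println!("ok");
}
```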
03 — Index Data Model

How Data is Stored

The index exists in three layers: PersistedIndex (full snapshot), optional DeltaSnapshot (incremental changes), and RuntimeIndex (merged in-memory view). The FastIndex provides the same data via mmap.

DocumentRecord model.rs

Each file in the repo gets a DocumentRecord with a unique doc_id. The fingerprint enables delta detection on subsequent builds.

struct DocumentRecord
doc_id: u32 — Unique identifier within index
relative_path: String — "src/engine.rs" (always / separators)
file_name: String — "engine.rs"
extension: Option<String> — "rs" (lowercase)
fingerprint: FileFingerprint — { size, modified_unix_secs, hash: xxh3_64 }
PersistedIndex base.bin

Full snapshot serialized with bincode. Contains everything needed to answer queries.

struct PersistedIndex
docs: Vec<DocumentRecord>
content_postings: Vec<PostingListEntry>
path_postings: Vec<PostingListEntry>
filename_map: Vec<NamePostingEntry>
extension_map: Vec<NamePostingEntry>
DeltaSnapshot delta.bin

Incremental changes since last full build. Merged into base at load time via RuntimeIndex::from_snapshots().

struct DeltaSnapshot
removed_paths: Vec<String>
docs: Vec<DocumentRecord>
content_postings: Vec<PostingListEntry>
+ same maps

If the delta grows beyond 25% of the base, the updater triggers a full rebuild instead.

IndexBackend engine.rs

The engine abstracts over two index implementations. It prefers Fast (mmap) but falls back to Legacy when a delta layer exists.

enum IndexBackend
Fast(FastIndex) — mmap'd binary format. Used when fast.idx exists and no delta layer. <5ms open time.
Legacy(RuntimeIndex) — Deserialized from bincode. Merges base + delta. HashMaps in memory. ~600ms open time.
engine.rs — open()
pub fn open(index_dir: &Path) -> Result<Self> {
    let metadata = read_index_metadata(index_dir)?;
    let has_delta = delta_exists(index_dir);

    if fast_index_exists(index_dir) && !has_delta {
        // Preferred: zero-copy mmap
        let fast = FastIndex::open(fast_index_path(index_dir))?;
        return Ok(SearchEngine { backend: Fast(fast), .. });
    }

    // Fallback: deserialize + merge
    let base = load_base(index_dir)?;
    let delta = load_delta(index_dir)?;
    let runtime = RuntimeIndex::from_snapshots(base, delta);
    Ok(SearchEngine { backend: Legacy(runtime), .. })
}
04 — Index Build Pipeline

Building the Index

The build pipeline walks the repository in parallel, extracts trigrams from each file, accumulates posting lists, and writes both bincode (legacy) and mmap (fast) index formats.

walk_parallel()
walker.rs
8 threads, .gitignore
BuildAccumulator
build.rs
trigrams per file
PersistedIndex
model.rs
posting lists + docs
persist()
storage.rs
base.bin + metadata.json
write_fast_index()
fastindex.rs
fast.idx (mmap)
Parallel Walker walker.rs

Uses the ignore crate's parallel walker which natively respects .gitignore, .ignore, and hidden file rules.

For each file:

  1. Check file size ≤ max (default: no limit)
  2. Read full contents into memory
  3. Binary detection: null bytes in first 4KB, or >20% control chars
  4. Compute xxh3_64 hash for fingerprinting
  5. Emit ScannedFile to shared Mutex<Vec>
struct ScannedFile
relative_path: String
contents: Vec<u8>
content_hash: u64 (xxh3)
extension: Option<String>
file_size: u64
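Step 3's heuristic can be sketched as a small predicate; the thresholds come from the text above, but the exact set of bytes counted as control characters is an assumption.

```rust
// Sketch of the binary-detection heuristic: NUL in the first 4KB,
// or more than 20% control characters (newline/CR/tab excluded).
fn looks_binary(bytes: &[u8]) -> bool {
    let head = &bytes[..bytes.len().min(4096)];
    // Rule 1: any NUL byte in the inspected prefix
    if head.contains(&0) {
        return true;
    }
    // Rule 2: > 20% control characters
    let ctrl = head
        .iter()
        .filter(|&&b| b < 0x20 && !matches!(b, b'\n' | b'\r' | b'\t'))
        .count();
    ctrl * 5 > head.len()
}

fn main() {
    assert!(!looks_binary(b"fn main() {}\n"));
    assert!(looks_binary(b"\x7fELF\x02\x01\x01\x00")); // ELF magic has a NUL
    println!("ok");
}
```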
BuildAccumulator build.rs

For each ScannedFile, the accumulator assigns a doc_id, extracts content + path trigrams, and builds inverted posting lists.

build.rs — push()
fn push(&mut self, file: ScannedFile) {
    let doc_id = self.next_doc_id;  // sequential u32
    self.next_doc_id += 1;

    // Content trigrams → posting lists
    for tri in trigrams_from_bytes(&file.contents) {
        self.content_postings
            .entry(tri).or_default().push(doc_id);
    }

    // Path trigrams → separate posting lists
    for tri in trigrams_from_bytes(file.relative_path.as_bytes()) {
        self.path_postings
            .entry(tri).or_default().push(doc_id);
    }

    // Filename + extension maps for exact lookups
    self.filename_map.entry(file.file_name.to_lowercase())...;
    self.extension_map.entry(ext.to_lowercase())...;
}
Delta Updates build.rs — update_index()

Instead of rebuilding from scratch, update_index() compares fingerprints (size + mtime + xxh3 hash) to detect changes.

How many files changed since last build?
> 25% of files → full rebuild (faster than delta merge)
≤ 25% → create DeltaSnapshot with only changed/new/removed files

Delta exists as delta.bin alongside base.bin. At load time, RuntimeIndex::from_snapshots() merges them. Note: when a delta exists, the Fast (mmap) backend is not used — it falls back to Legacy to apply the merge.
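A minimal sketch of the 25% threshold check, with hypothetical names (the real decision sits in update_index()):

```rust
// Hypothetical sketch of the delta-vs-rebuild decision described above.
#[derive(Debug, PartialEq)]
enum UpdateAction { FullRebuild, DeltaSnapshot }

fn choose_update(changed_files: usize, total_files: usize) -> UpdateAction {
    // > 25% churn: a full rebuild beats merging a large delta
    if changed_files * 4 > total_files {
        UpdateAction::FullRebuild
    } else {
        UpdateAction::DeltaSnapshot
    }
}

fn main() {
    assert_eq!(choose_update(30, 100), UpdateAction::FullRebuild);
    assert_eq!(choose_update(10, 100), UpdateAction::DeltaSnapshot);
    println!("ok");
}
```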

05 — Search Execution

End-to-End Search Flow

What happens from the moment you type a query to when results appear. Every search travels through the same pipeline: parse → plan → route → filter → candidates → verify → results.

CLI Parse
main.rs
args → QueryRequest
plan_query()
planner.rs
shape + selectivity
route_query()
planner.rs
Indexed / rg / Scan
Path Filters
engine.rs
ext, name, glob, prefix
Trigram Candidates
engine.rs
posting list intersect
Parallel Verify
engine.rs + rayon
regex on candidates
SearchResponse
result.rs
hits + metrics
Example: Searching for "HttpResponse" in Kubernetes

1. CLI parses input

triseek search "HttpResponse" → QueryRequest { kind: Auto, pattern: "HttpResponse", engine: Auto }

2. plan_query() classifies the query

12 chars → shape: Literal, selectivity: High, strategy: Indexed, seeds: ["httpresponse"]

3. route_query() selects engine

Kubernetes = Medium repo, High selectivity → route: Indexed. Reason: "medium_repo, high selectivity, index available"

4. SearchEngine::open() loads fast.idx via mmap

No delta exists → Fast backend. Mmap maps file into address space in <5ms. No deserialization.

5. fast_path_filtered_docs() applies path filters

No path filters in this query → all 28,132 doc_ids pass through.

6. fast_index_candidates() intersects posting lists

Extract trigrams from "httpresponse" (lowercased): "htt", "ttp", "tpr", "pre", "res", "esp", "spo", "pon", "ons", "nse". Look up each in content_table → 10 posting lists. Sorted intersection yields ~5 candidate doc_ids.

7. collect_content_hits_parallel() verifies candidates

5 candidates distributed across rayon thread pool. Each file: read (mmap if >32KB), compile regex from escaped literal, scan lines. AtomicBool for early termination if max_results reached.

8. Return SearchResponse

Hits with file paths + line matches, summary (files_with_matches, total_line_matches), metrics (wall_millis: ~50ms, candidate_docs: 5, bytes_scanned: 45KB).

06 — Adaptive Routing

The Decision Tree

route_query() in planner.rs decides which execution engine handles each query. The decision considers repo size, query shape, selectivity, whether an index exists, and session context.

Routing Logic

Did the user explicitly choose an engine? (--engine index/scan/rg)
  Yes → use that engine directly
Is an index available?
  No + Path query → DirectScan (walk files, match paths)
  No + Content query → Ripgrep (fastest without index)
Index exists. What's the repo category?
  Small + not repeated_session → Ripgrep (page cache makes brute force free)
  Medium + not repeated_session + Low selectivity → Ripgrep
  Medium/Large/VeryLarge + High/Medium selectivity → Indexed
  Any size + repeated_session hint → Indexed (amortize index load)

After routing, adjust_route_for_filters() in main.rs may override: if heavy path filters are set and the route was Ripgrep, it may switch to Indexed if the index can narrow candidates faster than rg can walk+filter.
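Condensing the tree above into code might look like the sketch below; every type and parameter name here is an assumption about planner.rs, and branches the tree leaves ambiguous default to Indexed.

```rust
// Hypothetical condensation of the routing decision tree.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Engine { Indexed, DirectScan, Ripgrep }
#[derive(PartialEq, Clone, Copy)]
enum Category { Small, Medium, Large, VeryLarge }
#[derive(PartialEq, Clone, Copy)]
enum Selectivity { High, Medium, Low, Unknown }

fn route(
    forced: Option<Engine>,
    index_available: bool,
    path_query: bool,
    category: Category,
    selectivity: Selectivity,
    repeated_session: bool,
) -> Engine {
    if let Some(engine) = forced {
        return engine; // explicit --engine wins
    }
    if !index_available {
        return if path_query { Engine::DirectScan } else { Engine::Ripgrep };
    }
    if repeated_session {
        return Engine::Indexed; // amortize index load across queries
    }
    match (category, selectivity) {
        (Category::Small, _) => Engine::Ripgrep,
        (Category::Medium, Selectivity::Low) => Engine::Ripgrep,
        _ => Engine::Indexed, // default for the remaining branches (assumed)
    }
}

fn main() {
    // Medium repo, high selectivity, one-off query -> Indexed
    assert_eq!(route(None, true, false, Category::Medium, Selectivity::High, false), Engine::Indexed);
    // Small repo with an index but no session -> Ripgrep
    assert_eq!(route(None, true, false, Category::Small, Selectivity::High, false), Engine::Ripgrep);
    println!("ok");
}
```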

Indexed

Open index → path filter → trigram candidates → parallel verify. Cost: O(posting lists + candidates).

Best for: medium+ repos, selective queries, session workloads.

Ripgrep

Spawn rg --json subprocess. Parse JSON output for hits. Cost: O(all files).

Best for: small repos, weak regex, one-off queries, no index built yet.

DirectScan

Walk repo with scanner, apply path/content filters inline. Cost: O(all files).

Best for: short literals (<3 chars), path-only queries without index.

07 — Search Engine Internals

Inside engine.rs

The ~1000-line engine module is the heart of TriSeek. It orchestrates path filtering, candidate selection, parallel verification, and line-level match extraction.

Path Filtering Pipeline

Before any content search, path filters narrow the doc set. Each filter type uses the most efficient lookup available.

Filter | Fast Backend | Legacy Backend
exact_paths | path_to_doc HashMap O(1) | path_lookup HashMap O(1)
exact_names | filename_map HashMap O(1) | filename_map HashMap O(1)
extensions | extension_map HashMap O(1) | extension_map HashMap O(1)
path_substrings | Trigram search on path postings (3+ chars) or linear scan (<3) | Same, via path_postings HashMap
path_prefixes | Linear scan all_docs(), starts_with check | Same
globs | GlobSet match on all_docs() | Same

Multiple filters are intersected: a doc must pass ALL filters. Result is a sorted Vec<u32> of surviving doc_ids.

Trigram Candidate Selection

engine.rs — fast_index_candidates()
match plan.shape {
    Literal | Auto if pattern.len() < 3 => {
        // Too short for trigrams — return all filtered docs
        return filtered_docs;
    }
    Literal | Auto => {
        // Extract trigrams from the literal, intersect posting lists
        let candidates = fast_candidates_for_seed(fast, pattern);
        sorted_intersect(&candidates, &filtered_docs)
    }
    RegexAnchored => {
        // Use extracted literal seeds from the regex
        if regex_has_unescaped_alternation(pattern) {
            // OR pattern: union of each seed's candidates
            seeds.iter().fold(Vec::new(), |acc, seed|
                sorted_union(&acc, &fast_candidates_for_seed(fast, seed))
            )
        } else {
            // AND pattern: intersection of each seed's candidates
            seeds.iter().fold(None, |acc, seed|
                sorted_intersect(acc, fast_candidates_for_seed(fast, seed))
            )
        }
    }
}
Parallel Verification rayon par_iter()

The final step: read each candidate file and run the actual pattern match. This is the only phase that touches the filesystem.

How it works

  1. Resolve all candidates upfront as (doc_id, PathBuf, rel_path)
  2. Build regex matcher from pattern + case mode
  3. Rayon distributes files across work-stealing thread pool
  4. Each thread: read file (mmap >32KB, Vec <32KB)
  5. match_lines() scans for regex matches, extracts line number + column
  6. AtomicBool done flag for early termination
  7. AtomicUsize total_found tracks match count
  8. Mutex protects shared results Vec

CandidateBytes

enum CandidateBytes
Owned(Vec<u8>) — < 32KB files
Mapped(Mmap) — ≥ 32KB files, zero-copy

Both implement Deref<Target=[u8]> so the matcher doesn't care which variant it gets.
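The dispatch pattern can be shown dependency-free; in the sketch below, FakeMmap stands in for memmap2::Mmap so the example compiles on its own, and count_lines is a hypothetical stand-in for the matcher.

```rust
use std::ops::Deref;

// FakeMmap is a stand-in for a memory-mapped region (memmap2::Mmap
// in the real engine); the Deref dispatch is the point of the sketch.
struct FakeMmap(Vec<u8>);

enum CandidateBytes {
    Owned(Vec<u8>),
    Mapped(FakeMmap),
}

impl Deref for CandidateBytes {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        match self {
            CandidateBytes::Owned(v) => v,
            CandidateBytes::Mapped(m) => &m.0,
        }
    }
}

// The matcher only ever sees &[u8], regardless of how the file was read.
fn count_lines(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == b'\n').count()
}

fn main() {
    let small = CandidateBytes::Owned(b"a\nb\n".to_vec());
    let large = CandidateBytes::Mapped(FakeMmap(b"x\ny\nz\n".to_vec()));
    assert_eq!(count_lines(&small), 2); // deref coercion picks the variant
    assert_eq!(count_lines(&large), 3);
    println!("ok");
}
```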

Set Operations on Sorted Arrays

All posting lists and candidate sets are sorted Vec<u32>. Two-pointer merge avoids HashSet allocation overhead.

sorted_intersect() — O(n+m)
fn sorted_intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    use std::cmp::Ordering::*;
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Equal   => { out.push(a[i]); i += 1; j += 1; }
            Less    => i += 1,
            Greater => j += 1,
        }
    }
    out
}
sorted_union() — O(n+m)
fn sorted_union(a: &[u32], b: &[u32]) -> Vec<u32> {
    use std::cmp::Ordering::*;
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Equal   => { out.push(a[i]); i += 1; j += 1; }
            Less    => { out.push(a[i]); i += 1; }
            Greater => { out.push(b[j]); j += 1; }
        }
    }
    // drain remaining from both sides
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
08 — Fast Index Format

Inside fast.idx

A custom flat binary format designed for mmap. No parsing step — posting lists are read as raw &[u32] slices directly from mapped memory. Every section is at a fixed or header-declared offset.

Binary Layout

Header (96 B) → Content Trigram Table → Content Postings → Path Trigram Table → Path Postings → Doc Table → String Pool
Header (96 bytes)
magic: [u8; 8] — "TRISEEK\0"
version: u32 — Currently 2
num_docs: u32 — Total documents indexed
num_content_trigrams: u32 — Distinct content trigrams
num_path_trigrams: u32 — Distinct path trigrams
content_table_offset: u64 — Byte offset to content trigram table
content_postings_offset: u64 — Byte offset to content posting arrays
path_table_offset: u64 — Byte offset to path trigram table
path_postings_offset: u64 — Byte offset to path posting arrays
docs_offset: u64 — Byte offset to doc table
strings_offset: u64 — Byte offset to string pool
strings_size: u64 — Total string pool size
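Reading the first header fields from a mapped byte slice might be sketched like this; little-endian encoding and padding out to the 96 bytes are assumptions, and parse_header is a hypothetical name.

```rust
// Sketch of decoding the fixed header: magic check, then the first
// two u32 fields in the order listed above (endianness assumed LE).
fn parse_header(bytes: &[u8]) -> Option<(u32, u32)> {
    if bytes.len() < 96 || &bytes[..8] != b"TRISEEK\0" {
        return None; // wrong magic or truncated file
    }
    let version = u32::from_le_bytes(bytes[8..12].try_into().ok()?);
    let num_docs = u32::from_le_bytes(bytes[12..16].try_into().ok()?);
    Some((version, num_docs))
}

fn main() {
    let mut header = vec![0u8; 96];
    header[..8].copy_from_slice(b"TRISEEK\0");
    header[8..12].copy_from_slice(&2u32.to_le_bytes());
    header[12..16].copy_from_slice(&5u32.to_le_bytes());
    assert_eq!(parse_header(&header), Some((2, 5)));
    assert_eq!(parse_header(b"NOPE"), None);
    println!("ok");
}
```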
Trigram Table Entry (12 bytes each)

trigram: u32 — The trigram key
offset: u32 — Index into posting array
count: u32 — Number of doc_ids

At open(), all entries are loaded into a HashMap<u32, (u32, u32)> for O(1) lookup.

Doc Table Entry (46 bytes each)

doc_id: u32
path_offset: u32 + u16 len
name_offset: u32 + u16 len
ext_offset: u32 + u8 len
fingerprint: u64 + i64 + u64

Strings stored in pool; entries hold (offset, length) pairs.

Read Path: Trigram Lookup → Posting List

fastindex.rs — content_postings()
pub fn content_postings(&self, trigram: u32) -> Option<Vec<u32>> {
    // O(1) HashMap lookup
    let (offset, count) = self.content_table.get(&trigram)?;

    // Calculate byte position in mmap'd region
    let byte_offset = self.content_postings_offset
                    + (*offset as usize) * 4;

    // Zero-copy: cast raw bytes to &[u32]
    let ptr = self.mmap[byte_offset..].as_ptr() as *const u32;
    let slice = unsafe {
        slice::from_raw_parts(ptr, *count as usize)
    };
    Some(slice.to_vec())
}
// Total cost: 1 HashMap lookup + 1 pointer cast + 1 Vec copy
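The unsafe cast above relies on the postings region being 4-byte aligned within the mapping. A safe variant (a sketch, not the crate's actual code) trades the cast for per-element decoding; little-endian storage is an assumption.

```rust
// Safe sketch: decode `count` u32 values starting at byte_offset
// without an unsafe pointer cast, via chunks_exact + from_le_bytes.
fn read_postings(mmap: &[u8], byte_offset: usize, count: usize) -> Vec<u32> {
    mmap[byte_offset..byte_offset + count * 4]
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

fn main() {
    let mut buf = vec![0u8; 4]; // 4 bytes of unrelated prefix
    for doc_id in [3u32, 7, 42] {
        buf.extend_from_slice(&doc_id.to_le_bytes());
    }
    assert_eq!(read_postings(&buf, 4, 3), vec![3, 7, 42]);
    println!("ok");
}
```

The cost is one pass over the posting list instead of a pointer cast, which is usually negligible next to the Vec copy the original code already makes.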
09 — CLI & Bench

Entry Points

The CLI binary and benchmark harness are thin wrappers that parse arguments, dispatch to the engine, and format output.

CLI Commands crates/search-cli/src/main.rs

Command | Handler | What It Does
build | handle_build() | Full index build: walk → accumulate → persist base.bin + fast.idx
update | handle_update() | Incremental update: fingerprint diff → delta or rebuild
search | handle_search() | Single query: plan → route → execute → JSON output
session | handle_session() | Multi-query from JSON file: per-query routing + aggregated metrics
stats | handle_stats() | Display index metadata: repo stats, build time, doc count
measure | handle_measure() | Scan repo without building index: for repo classification
Benchmark Harness search-bench/main.rs

Reads a manifest YAML listing repositories and query types. For each repo+query, runs cold trials (fresh process) and warm trials (reused index), measuring wall time via process metrics. Compares against ripgrep baseline.

Benchmark types

  • literal_selective — rare string
  • literal_moderate — medium frequency
  • literal_high — common string
  • regex_anchor — regex with literal seeds
  • regex_weak — pure regex, no literals
  • multi_or — alternation pattern
  • path_* — filename/path queries
  • session_20/100 — multi-query workload

Output

  • report.json — full results
  • report.csv — p50/p90 per query
  • summary.md — human-readable report
  • Correctness validation: match counts must agree with rg baseline