
RAGeATM: Evidence-Bound Local RAG Assistant Prototype

A small explainable Retrieval-Augmented Generation prototype that demonstrates grounded answer behavior, retrieval thresholds, and refusal when local evidence is insufficient.

Context: Prototype / Academic Project

Current state: Local RAG prototype

Role: Sole builder for ingestion, retrieval, thresholding, generation modes, and benchmark notes

Problem

AI assistants can hallucinate or answer unsupported questions when they respond without checking whether the available evidence actually supports the answer.

Solution / What I Built

I built a local RAG prototype that ingests text files, chunks them, indexes the chunks with TF-IDF, and retrieves evidence by cosine similarity. A minimum relevance threshold gates generation: the assistant answers only when the retrieved context clears that threshold, and refuses instead of guessing when the local evidence is insufficient.
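The answer-or-refuse loop can be sketched in dependency-free Python. Only the 0.12 threshold comes from the prototype's documented default; the tiny corpus, stopword list, and function names below are illustrative assumptions, not the repository's actual code.

```python
import math
import re

MIN_RELEVANCE = 0.12  # documented default minimum relevance threshold

# Tiny illustrative corpus standing in for the 15 real chunks.
CHUNKS = [
    "RAGeATM indexes local text files into searchable TF-IDF chunks",
    "retrieval ranks chunks by cosine similarity over the TF-IDF matrix",
    "the assistant refuses when no retrieved chunk clears the threshold",
]

STOPWORDS = {"a", "an", "and", "by", "in", "into", "is", "no",
             "of", "over", "the", "to", "what", "when"}

def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_vector(tokens, idf):
    # Sum the IDF weight per occurrence; tokens outside the vocabulary are dropped.
    vec = {}
    for tok in tokens:
        if tok in idf:
            vec[tok] = vec.get(tok, 0.0) + idf[tok]
    return vec

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Build the index: per-term document frequency -> smoothed IDF -> one vector per chunk.
docs = [tokenize(c) for c in CHUNKS]
idf = {t: math.log(len(docs) / (1 + sum(t in d for d in docs))) + 1
       for d in docs for t in d}
index = [tfidf_vector(d, idf) for d in docs]

def answer_or_refuse(query, top_k=2):
    qvec = tfidf_vector(tokenize(query), idf)
    ranked = sorted(((cosine(qvec, v), c) for v, c in zip(index, CHUNKS)),
                    reverse=True)
    hits = [c for score, c in ranked[:top_k] if score >= MIN_RELEVANCE]
    return hits or None  # None means refuse: evidence is below the threshold
```

An in-domain query returns its supporting chunks, while "What is the capital of France?" shares no indexed vocabulary, scores zero everywhere, and is refused.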

Results

Indexed 7 local source documents into 15 searchable chunks and produced 7/7 useful retrieval/refusal decisions on a small sanity benchmark using TF-IDF + cosine similarity with a default 0.12 threshold.

Quantified Outcomes

These numbers describe project artifacts and sanity checks. They are not client ROI, deployment adoption, actuarial accuracy, or broad model-accuracy claims.

  • 7 local source documents
  • 15 searchable chunks
  • 15 × 772 TF-IDF matrix shape
  • TF-IDF + cosine similarity retrieval method
  • 0.12 default minimum relevance threshold
  • 7/7 useful retrieval/refusal decisions on the small benchmark

Architecture

The pipeline is shown as explicit stages so the system boundary is inspectable.

  1. data/raw text files

     Seven local source documents provide the bounded knowledge corpus.

  2. ingestion

     Text files are loaded into the prototype for local processing.

  3. chunking

     Documents are split into 15 searchable chunks.

  4. TF-IDF index

     The chunk corpus becomes a 15 × 772 TF-IDF matrix.

  5. cosine similarity retrieval

     Queries retrieve top-k chunks by lexical similarity.

  6. threshold check

     The assistant answers only when retrieved context clears the minimum relevance threshold.

  7. answer or refuse

     In-domain questions are answered from local evidence; unsupported questions are refused.
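The chunking stage can be sketched as a fixed-size overlapping word window. The prototype's actual chunk sizes and boundaries are not documented here, so the `size` and `overlap` defaults below are illustrative, not the values that produced the 15 real chunks.

```python
def chunk_words(text, size=120, overlap=20):
    """Split text into word windows of at most `size` words, overlapping by `overlap`."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of slightly more index entries.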

Technical Stack

Python · TF-IDF · Cosine Similarity · Local Retrieval · Threshold Refusal · Optional OpenAI Mode

Applied Relevance

Where the pattern matters

  • Internal documentation assistants
  • Course assistants
  • Policy Q&A
  • Small-business knowledge assistants
  • Grounded AI patterns

Proof Surfaces

Available artifacts are labeled directly. Missing visuals stay as placeholders until real screenshots are added.

Demo Behavior

Available now

The prototype demonstrates both grounded answers and explicit refusal when the local corpus does not support a question.

  • In-domain questions retrieve local context and answer from that evidence.
  • An out-of-domain question such as "What is the capital of France?" refuses because the local corpus does not support the answer.

Architecture / Pipeline

Available now

The flow is intentionally small and inspectable: local files become chunks, chunks become TF-IDF features, retrieval is thresholded, and generation depends on retrieved evidence.

  • data/raw text files -> ingestion -> chunking -> TF-IDF index.
  • Cosine similarity returns top-k local evidence.
  • Minimum relevance threshold decides whether to answer or refuse.

Grounding Controls

Available now

The main applied lesson is the refusal boundary, not broad RAG accuracy.

  • Top-k retrieval with minimum relevance threshold.
  • Offline retrieval-conditioned generation by default.
  • Optional OpenAI mode only when configured.
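The "only when configured" gate can be sketched as a simple environment check that defaults to the offline path. The `OPENAI_API_KEY` variable name and the mode labels are assumptions for illustration; the repository may gate the mode differently.

```python
import os

def pick_generation_mode(env=None):
    """Return 'openai' only when a key is configured; otherwise stay offline."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"   # hosted generation, still conditioned on retrieved chunks
    return "offline"      # default: answer extractively from local evidence
```

Defaulting to offline keeps the refusal boundary testable without network access or credentials.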

Artifacts & Evidence

Available now

The evidence is intentionally modest and quantified as a sanity benchmark.

  • 7 local source documents, 15 searchable chunks, and 15 × 772 TF-IDF feature matrix.
  • 7-question sanity benchmark with 7/7 useful retrieval/refusal decisions.
  • Public GitHub repository linked for code review.

Limitations

What this does not claim

  • Small educational corpus.
  • Lexical TF-IDF retrieval, not neural embeddings.
  • No Chroma/vector database.
  • No persistent memory.
  • No agents/tools.
  • No voice/UI/deployment.
  • The 7/7 result is a sanity benchmark, not broad accuracy.

Next Improvements

Reasonable next steps

  • Add a lightweight UI for demonstrating answer/refusal behavior.
  • Compare TF-IDF retrieval against an embedding-based retriever on the same corpus.
  • Expand the benchmark beyond seven sanity-check questions.
  • Add richer citation display and evaluation logging before claiming broader quality.
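Evaluation logging from the last bullet can start as one JSON line per benchmark query, recording the top retrieval score and the answer/refuse decision. The file layout and field names below are illustrative, not an existing artifact of the repository.

```python
import json

def log_decision(path, query, top_score, answered):
    """Append one JSON line per benchmark query: top score plus answer/refuse."""
    record = {
        "query": query,
        "top_score": round(top_score, 4),
        "decision": "answer" if answered else "refuse",
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

A JSONL log makes the 7/7 sanity result reproducible: rerunning the benchmark and diffing the decisions shows whether a retrieval change moved the refusal boundary.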

Future Work: Retrieval Capability Ladder

A staged view of how RAGeATM could grow from simple lexical retrieval into a more measurable, semantic, context-aware, and eventually multimodal research-assistant harness.

RAGeATM is currently best understood as a small but useful RAG prototype: enough to demonstrate retrieval, grounding, and evaluation discipline, but not yet a production research platform. The next work is not simply to make it bigger. The stronger path is to make retrieval more measurable, reproducible, semantic, and context-aware while avoiding overclaims about what current AI systems truly understand.

TF-IDF and BM25 retrieve based primarily on lexical overlap, while embedding-based and LLM-assisted retrieval can better capture semantic similarity, paraphrase, and conceptual relevance. This makes them more capable of retrieving documents related to the user’s underlying intent, although they should not be described as fully understanding the ‘question beneath the question’ in a human sense.

LLMs can approximate deeper intent by modeling semantic context, conversational history, and inferred goals, but this remains probabilistic pattern-based reasoning rather than true human understanding.

Retrieval capability ladder

| Level | System type | What it compares | Meaning captured | "Question under the question" | Personal context | Real-world grounding | Best use case | Fatal weakness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Exact keyword search | Literal word/string overlap | 5% | 0% | 0% | 0% | Finding exact names, IDs, phrases, codes | Misses anything phrased differently |
| 2 | TF-IDF | Weighted term overlap | 10–20% | 0–5% | 0% | 0% | Simple document retrieval where vocabulary matches | No real semantics; treats text as bag-of-words |
| 3 | BM25 | Improved keyword relevance with saturation/length normalization | 20–35% | 5% | 0% | 0% | Strong classic search baseline | Still mostly lexical; synonyms and paraphrases are weak |
| 4 | Static embeddings | Word/document vectors learned from language patterns | 35–50% | 10–20% | 0–5% | 0% | Finding semantically related text | Limited context sensitivity |
| 5 | Modern embedding models | Query/document meaning vectors | 55–75% | 25–45% | 5–15% | 0–5% | RAG retrieval, semantic search, paraphrase matching | Can retrieve conceptually similar but wrong context |
| 6 | Hybrid search | BM25 + embeddings | 65–85% | 30–50% | 5–15% | 0–5% | Serious RAG systems | More complex; requires tuning and evaluation |
| 7 | Reranked retrieval | Initial retrieval + LLM/cross-encoder relevance judgment | 75–90% | 40–60% | 10–20% | 0–5% | High-quality RAG retrieval | Slower/costlier; still depends on retrieved candidates |
| 8 | LLM reading retrieved context | Retrieved docs + generated reasoning | 80–95% (answer synthesis) | 50–70% | 15–35% | 0–10% | Answering from documents with explanation | Can hallucinate, overgeneralize, or sound more certain than it is |
| 9 | LLM with memory/user profile | Query + history + user goals + documents | 80–95% | 65–80% | 50–75% | 5–15% | Personalized assistants, tutoring, coaching, project guidance | Risk of assuming too much about the user |
| 10 | Agentic AI with tools | Text + memory + documents + actions + external systems | 85–95% | 70–85% | 60–80% | 20–45% | Research assistants, workflow automation, coding agents | Tool errors, bad planning, weak verification |
| 11 | Multimodal grounded AI | Text + vision + audio + environment + actions | 85–98% | 75–90% | 70–85% | 50–75% | Real-world assistance, robotics, field analysis | Still not human lived experience |
| 12 | Human-level social/contextual understanding | Language + memory + embodiment + relationships + lived experience | 95–100% | 90–100% | 90–100% | 90–100% | Real relational discernment | Current AI does not truly have this |

These percentages are heuristic gauges, not universal benchmark results. They are meant to communicate increasing capability scope, not claim exact measured performance.

Clean interpretation

| Method | What it really knows |
| --- | --- |
| TF-IDF | "These documents share important words with the query." |
| BM25 | "These documents share important words in a more search-optimized way." |
| Embeddings | "These documents are conceptually close to the query." |
| Hybrid retrieval | "These documents match both the words and the meaning." |
| Reranking | "Of the retrieved documents, these are probably most relevant to the user's actual question." |
| LLM + memory | "Given this user's history, goals, and wording, this may be what they are really asking." |
| Grounded AI | "Given the person's behavior, environment, constraints, and history, this is probably the deeper issue." |

The practical future work for RAGeATM is to climb this ladder carefully: first by improving reproducibility and evaluation, then by comparing lexical, embedding, hybrid, and reranked retrieval, then by testing whether memory, user goals, and multimodal inputs actually improve retrieval quality without creating unjustified confidence.
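One concrete step on the hybrid-search rung is reciprocal rank fusion (RRF), a standard way to merge a lexical ranking (e.g. BM25) with an embedding ranking without score calibration. The sketch below is generic; `k=60` is the conventional RRF constant, and the document ids are placeholders.

```python
def rrf(rankings, k=60):
    """Fuse multiple ranked lists of doc ids into one ranking via reciprocal rank fusion."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1/(k + rank); documents ranked highly
            # by several retrievers accumulate the largest fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that TF-IDF cosines and embedding cosines live on incomparable scales, which keeps the comparison experiment honest.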

Public Repository

The public code link is provided for review of the prototype and technical approach. This does not represent paid deployment, production adoption, or client ROI unless stated elsewhere on the page.

Related Case Studies

More portfolio context.

Prototype / Academic Project · Applied dashboard prototype

WeatherForge

A Minnesota severe-weather analytics dashboard that turns large NOAA weather datasets into county-level risk views, cleaned analytics layers, and decision-support reporting surfaces.

Python · Shiny · Plotly · Parquet
Read case study
R&D · Active Build

DGM

Workflow orchestration layer in active development for managing state, decision flow, and human review inside StormIQ.

Python · FastAPI · Workflow Graphs · Queue-backed Jobs
Read case study
Back to all case studies