
RAGeATM: Evidence-Bound Local RAG Assistant Prototype

A small explainable Retrieval-Augmented Generation prototype that demonstrates grounded answer behavior, retrieval thresholds, and refusal when local evidence is insufficient.

Context: Prototype / Academic Project

Current state: Local RAG prototype

Role: Sole builder for ingestion, retrieval, thresholding, generation modes, and benchmark notes

Problem

AI assistants can hallucinate or answer unsupported questions when they respond without checking whether the available evidence actually supports the answer.

Solution / What I Built

I built a local RAG prototype that ingests text files, chunks them, indexes the chunks with TF-IDF, and retrieves evidence by cosine similarity. A minimum relevance threshold gates generation: the assistant answers only when the retrieved context clears that threshold, and refuses instead of guessing when the local evidence is insufficient.
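The answer-or-refuse loop can be sketched in dependency-free Python. Only the 0.12 threshold comes from the prototype's documented default; the tiny corpus, stopword list, and function names below are illustrative assumptions, not the repository's actual code.

```python
import math
import re

MIN_RELEVANCE = 0.12  # documented default minimum relevance threshold

# Tiny illustrative corpus standing in for the 15 real chunks.
CHUNKS = [
    "RAGeATM indexes local text files into searchable TF-IDF chunks",
    "retrieval ranks chunks by cosine similarity over the TF-IDF matrix",
    "the assistant refuses when no retrieved chunk clears the threshold",
]

STOPWORDS = {"a", "an", "and", "by", "in", "into", "is", "no",
             "of", "over", "the", "to", "what", "when"}

def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_vector(tokens, idf):
    # Sum the IDF weight per occurrence; tokens outside the vocabulary are dropped.
    vec = {}
    for tok in tokens:
        if tok in idf:
            vec[tok] = vec.get(tok, 0.0) + idf[tok]
    return vec

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Build the index: per-term document frequency -> smoothed IDF -> one vector per chunk.
docs = [tokenize(c) for c in CHUNKS]
idf = {t: math.log(len(docs) / (1 + sum(t in d for d in docs))) + 1
       for d in docs for t in d}
index = [tfidf_vector(d, idf) for d in docs]

def answer_or_refuse(query, top_k=2):
    qvec = tfidf_vector(tokenize(query), idf)
    ranked = sorted(((cosine(qvec, v), c) for v, c in zip(index, CHUNKS)),
                    reverse=True)
    hits = [c for score, c in ranked[:top_k] if score >= MIN_RELEVANCE]
    return hits or None  # None means refuse: evidence is below the threshold
```

An in-domain query returns its supporting chunks, while "What is the capital of France?" shares no indexed vocabulary, scores zero everywhere, and is refused.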

Results

Indexed 7 local source documents into 15 searchable chunks and produced 7/7 useful retrieval/refusal decisions on a small sanity benchmark using TF-IDF + cosine similarity with a default 0.12 threshold.

Quantified Outcomes

These numbers describe project artifacts and sanity checks. They are not client ROI, deployment adoption, actuarial accuracy, or broad model-accuracy claims.

  • 7 local source documents
  • 15 searchable chunks
  • 15 × 772 TF-IDF matrix shape
  • TF-IDF + cosine similarity retrieval method
  • 0.12 default minimum relevance threshold
  • 7/7 useful retrieval/refusal decisions on the small benchmark

Architecture

The pipeline is shown as explicit stages so the system boundary is inspectable.

  1. data/raw text files

     Seven local source documents provide the bounded knowledge corpus.

  2. ingestion

     Text files are loaded into the prototype for local processing.

  3. chunking

     Documents are split into 15 searchable chunks.

  4. TF-IDF index

     The chunk corpus becomes a 15 × 772 TF-IDF matrix.

  5. cosine similarity retrieval

     Queries retrieve top-k chunks by lexical similarity.

  6. threshold check

     The assistant answers only when retrieved context clears the minimum relevance threshold.

  7. answer or refuse

     In-domain questions are answered from local evidence; unsupported questions are refused.
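The chunking stage can be sketched as a fixed-size overlapping word window. The prototype's actual chunk sizes and boundaries are not documented here, so the `size` and `overlap` defaults below are illustrative, not the values that produced the 15 real chunks.

```python
def chunk_words(text, size=120, overlap=20):
    """Split text into word windows of at most `size` words, overlapping by `overlap`."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of slightly more index entries.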

Technical Stack

Python · TF-IDF · Cosine Similarity · Local Retrieval · Threshold Refusal · Optional OpenAI Mode

Applied Relevance

Where the pattern matters

  • Internal documentation assistants
  • Course assistants
  • Policy Q&A
  • Small-business knowledge assistants
  • Grounded AI patterns

Proof Surfaces

Available artifacts are labeled directly. Missing visuals stay as placeholders until real screenshots are added.

Demo Behavior

Available now

The prototype demonstrates both grounded answers and explicit refusal when the local corpus does not support a question.

  • In-domain questions retrieve local context and answer from that evidence.
  • An out-of-domain question such as "What is the capital of France?" refuses because the local corpus does not support the answer.

Architecture / Pipeline

Available now

The flow is intentionally small and inspectable: local files become chunks, chunks become TF-IDF features, retrieval is thresholded, and generation depends on retrieved evidence.

  • data/raw text files -> ingestion -> chunking -> TF-IDF index.
  • Cosine similarity returns top-k local evidence.
  • Minimum relevance threshold decides whether to answer or refuse.

Grounding Controls

Available now

The main applied lesson is the refusal boundary, not broad RAG accuracy.

  • Top-k retrieval with minimum relevance threshold.
  • Offline retrieval-conditioned generation by default.
  • Optional OpenAI mode only when configured.
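The "only when configured" gate can be sketched as a simple environment check that defaults to the offline path. The `OPENAI_API_KEY` variable name and the mode labels are assumptions for illustration; the repository may gate the mode differently.

```python
import os

def pick_generation_mode(env=None):
    """Return 'openai' only when a key is configured; otherwise stay offline."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"   # hosted generation, still conditioned on retrieved chunks
    return "offline"      # default: answer extractively from local evidence
```

Defaulting to offline keeps the refusal boundary testable without network access or credentials.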

Artifacts & Evidence

Available now

The evidence is intentionally modest and quantified as a sanity benchmark.

  • 7 local source documents, 15 searchable chunks, and 15 × 772 TF-IDF feature matrix.
  • 7-question sanity benchmark with 7/7 useful retrieval/refusal decisions.
  • Public GitHub repository linked for code review.

Limitations

What this does not claim

  • Small educational corpus.
  • Lexical TF-IDF retrieval, not neural embeddings.
  • No Chroma/vector database.
  • No persistent memory.
  • No agents/tools.
  • No voice/UI/deployment.
  • The 7/7 result is a sanity benchmark, not broad accuracy.

Next Improvements

Reasonable next steps

  • Add a lightweight UI for demonstrating answer/refusal behavior.
  • Compare TF-IDF retrieval against an embedding-based retriever on the same corpus.
  • Expand the benchmark beyond seven sanity-check questions.
  • Add richer citation display and evaluation logging before claiming broader quality.
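Evaluation logging from the last bullet can start as one JSON line per benchmark query, recording the top retrieval score and the answer/refuse decision. The file layout and field names below are illustrative, not an existing artifact of the repository.

```python
import json

def log_decision(path, query, top_score, answered):
    """Append one JSON line per benchmark query: top score plus answer/refuse."""
    record = {
        "query": query,
        "top_score": round(top_score, 4),
        "decision": "answer" if answered else "refuse",
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

A JSONL log makes the 7/7 sanity result reproducible: rerunning the benchmark and diffing the decisions shows whether a retrieval change moved the refusal boundary.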

Future Work: Retrieval Capability Ladder

A staged view of how RAGeATM could grow from simple lexical retrieval into a more measurable, semantic, context-aware, and eventually multimodal research-assistant harness.

RAGeATM is currently best understood as a small but useful RAG prototype: enough to demonstrate retrieval, grounding, and evaluation discipline, but not yet a production research platform. The next work is not simply to make it bigger. The stronger path is to make retrieval more measurable, reproducible, semantic, and context-aware while avoiding overclaims about what current AI systems truly understand.

TF-IDF and BM25 retrieve based primarily on lexical overlap, while embedding-based and LLM-assisted retrieval can better capture semantic similarity, paraphrase, and conceptual relevance. This makes them more capable of retrieving documents related to the user’s underlying intent, although they should not be described as fully understanding the ‘question beneath the question’ in a human sense.

LLMs can approximate deeper intent by modeling semantic context, conversational history, and inferred goals, but this remains probabilistic pattern-based reasoning rather than true human understanding.

Retrieval capability ladder

| Level | System type | What it compares | Meaning captured | "Question under the question" | Personal context | Real-world grounding | Best use case | Fatal weakness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Exact keyword search | Literal word/string overlap | 5% | 0% | 0% | 0% | Finding exact names, IDs, phrases, codes | Misses anything phrased differently |
| 2 | TF-IDF | Weighted term overlap | 10–20% | 0–5% | 0% | 0% | Simple document retrieval where vocabulary matches | No real semantics; treats text as bag-of-words |
| 3 | BM25 | Improved keyword relevance with saturation/length normalization | 20–35% | 5% | 0% | 0% | Strong classic search baseline | Still mostly lexical; synonyms and paraphrases are weak |
| 4 | Static embeddings | Word/document vectors learned from language patterns | 35–50% | 10–20% | 0–5% | 0% | Finding semantically related text | Limited context sensitivity |
| 5 | Modern embedding models | Query/document meaning vectors | 55–75% | 25–45% | 5–15% | 0–5% | RAG retrieval, semantic search, paraphrase matching | Can retrieve conceptually similar but wrong context |
| 6 | Hybrid search | BM25 + embeddings | 65–85% | 30–50% | 5–15% | 0–5% | Serious RAG systems | More complex; requires tuning and evaluation |
| 7 | Reranked retrieval | Initial retrieval + LLM/cross-encoder relevance judgment | 75–90% | 40–60% | 10–20% | 0–5% | High-quality RAG retrieval | Slower/costlier; still depends on retrieved candidates |
| 8 | LLM reading retrieved context | Retrieved docs + generated reasoning | 80–95% (answer synthesis) | 50–70% | 15–35% | 0–10% | Answering from documents with explanation | Can hallucinate, overgeneralize, or sound more certain than it is |
| 9 | LLM with memory/user profile | Query + history + user goals + documents | 80–95% | 65–80% | 50–75% | 5–15% | Personalized assistants, tutoring, coaching, project guidance | Risk of assuming too much about the user |
| 10 | Agentic AI with tools | Text + memory + documents + actions + external systems | 85–95% | 70–85% | 60–80% | 20–45% | Research assistants, workflow automation, coding agents | Tool errors, bad planning, weak verification |
| 11 | Multimodal grounded AI | Text + vision + audio + environment + actions | 85–98% | 75–90% | 70–85% | 50–75% | Real-world assistance, robotics, field analysis | Still not human lived experience |
| 12 | Human-level social/contextual understanding | Language + memory + embodiment + relationships + lived experience | 95–100% | 90–100% | 90–100% | 90–100% | Real relational discernment | Current AI does not truly have this |

These percentages are heuristic gauges, not universal benchmark results. They are meant to communicate increasing capability scope, not claim exact measured performance.

Clean interpretation

| Method | What it really knows |
| --- | --- |
| TF-IDF | "These documents share important words with the query." |
| BM25 | "These documents share important words in a more search-optimized way." |
| Embeddings | "These documents are conceptually close to the query." |
| Hybrid retrieval | "These documents match both the words and the meaning." |
| Reranking | "Of the retrieved documents, these are probably most relevant to the user's actual question." |
| LLM + memory | "Given this user's history, goals, and wording, this may be what they are really asking." |
| Grounded AI | "Given the person's behavior, environment, constraints, and history, this is probably the deeper issue." |

The practical future work for RAGeATM is to climb this ladder carefully: first by improving reproducibility and evaluation, then by comparing lexical, embedding, hybrid, and reranked retrieval, then by testing whether memory, user goals, and multimodal inputs actually improve retrieval quality without creating unjustified confidence.
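One concrete step on the hybrid-search rung is reciprocal rank fusion (RRF), a standard way to merge a lexical ranking (e.g. BM25) with an embedding ranking without score calibration. The sketch below is generic; `k=60` is the conventional RRF constant, and the document ids are placeholders.

```python
def rrf(rankings, k=60):
    """Fuse multiple ranked lists of doc ids into one ranking via reciprocal rank fusion."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1/(k + rank); documents ranked highly
            # by several retrievers accumulate the largest fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that TF-IDF cosines and embedding cosines live on incomparable scales, which keeps the comparison experiment honest.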

Public Repository

The public code link is provided for review of the prototype and technical approach. This does not represent paid deployment, production adoption, or client ROI unless stated elsewhere on the page.

Related Case Studies

More portfolio context.

Prototype / Academic Project · Applied dashboard prototype

WeatherForge

A Minnesota severe-weather analytics dashboard that turns large NOAA weather datasets into county-level risk views, cleaned analytics layers, and decision-support reporting surfaces.

Python · Shiny · Plotly · Parquet
Read case study
R&D · Active Build

DGM

Workflow orchestration layer in active development for managing state, decision flow, and human review inside StormIQ.

Python · FastAPI · Workflow Graphs · Queue-backed Jobs
Read case study
Back to all case studies