
python cli + agent skill + github action

ProofRAG

A reproducible evaluation loop for RAG systems: generate golden sets, run predictions, score retrieval and generation, then gate regressions in CI.

built with
  • RAG Evaluation
  • LLM-as-Judge
  • Python CLI
  • Agent Skills
  • GitHub Actions
  • DeepEval
  • Ragas
ProofRAG started from a practical RAG problem: teams can change a chunker, retriever, reranker, prompt, model, or context-packing strategy, but without a stable evaluation set they are mostly comparing impressions. Running metrics is only part of the difficulty; the harder part is producing a useful golden set from the actual corpus and keeping the evaluation loop repeatable.
The tool packages that loop as both a Python CLI and an agent skill. It reads a corpus, generates and validates corpus-grounded test cases, calls the user's RAG system through HTTP or a Python callable, judges answers with a pinned LLM-as-judge, computes retrieval metrics, and emits a self-contained HTML scorecard.
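To make the "Python callable" path concrete, here is a minimal sketch of how a callable could be driven over a golden set to produce prediction records. It is an illustration under stated assumptions, not ProofRAG's actual interface: the function name `run_predictions`, the callable signature, and the record fields `id`, `question`, `answer`, and `retrieved_ids` are all hypothetical.

```python
import json
from typing import Callable, Iterable


def run_predictions(cases: Iterable[dict], rag_fn: Callable[[str], dict]) -> list[str]:
    """Call rag_fn once per golden-set question and return one JSONL line per case.

    Assumption: rag_fn returns a dict with "answer" and "retrieved_ids" keys.
    These field names are hypothetical, chosen only for this sketch.
    """
    lines = []
    for case in cases:
        result = rag_fn(case["question"])
        record = {
            "id": case["id"],
            "answer": result.get("answer", ""),
            "retrieved_ids": list(result.get("retrieved_ids", [])),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


def toy_rag(question: str) -> dict:
    """Stand-in for a real RAG system: echoes the question and cites one chunk."""
    return {"answer": f"echo: {question}", "retrieved_ids": ["doc1#0"]}
```

Writing the returned lines to a file would give a predictions artifact in the same spirit as the `predictions.jsonl` produced by the CLI's run step.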
Project Surface
Package: proofrag
Runtime: Python 3.11+
Core deps: none
Interfaces: CLI, Skill, Action
License: MIT
Evaluation Axes
Retrieval
  • Recall@k
  • Precision@k
  • NDCG@k
  • MRR
Generation
  • Groundedness
  • Correctness
  • Completeness
  • Citation quality
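The four retrieval metrics listed above have standard textbook definitions, sketched below for a single query with binary relevance. This is a reference sketch, not ProofRAG's own code: the function names are illustrative, chunk ids are assumed to be unique strings, and the retrieved list is assumed to be ordered best-first.

```python
import math
from typing import Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots occupied by relevant chunk ids."""
    if k <= 0:
        return 0.0
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, cid in enumerate(retrieved[:k], start=1)
        if cid in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall and precision ignore ordering within the top k, while MRR and NDCG reward placing relevant chunks earlier, which is why the four are usually reported together.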

The Loop

The workflow is intentionally explicit. Each step produces an artifact that can be reviewed, committed, compared, or uploaded by CI.
Artifacts
Golden set: jsonl
Validation: json
Predictions: jsonl
Results: json
Report: html
Command Sequence
proofrag generate --corpus ./docs --out goldenset.jsonl --n 20
proofrag validate --goldenset goldenset.jsonl --corpus ./docs
proofrag run --goldenset goldenset.jsonl --endpoint http://localhost:8000/ask --out predictions.jsonl
proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json
proofrag report --results results.json --out scorecard.html
Golden Set Design
  • Questions are generated from the user's own corpus rather than a generic benchmark.
  • Difficulty tiers include single-document, multi-document, and unanswerable questions.
  • Each gold context preserves source path, chunk id, chunk index, character count, and extension metadata.
  • Validation catches duplicate ids/questions, missing contexts, source coverage problems, and corpus fingerprint drift.
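The validation checks above can be sketched in a few lines. This is a rough illustration under assumptions, not the logic behind `proofrag validate`: the record fields `id`, `question`, `contexts`, and `unanswerable` are hypothetical, the fingerprint scheme (SHA-256 over sorted relative paths and file bytes) is one plausible choice, and the source-coverage check is omitted.

```python
import hashlib
from collections import Counter
from pathlib import Path


def corpus_fingerprint(corpus_dir: str) -> str:
    """Hash every file's relative path and bytes in a stable (sorted) order."""
    digest = hashlib.sha256()
    root = Path(corpus_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def validate_cases(cases: list[dict], expected_fingerprint: str, actual_fingerprint: str) -> list[str]:
    """Return human-readable problems; an empty list means the set passed."""
    problems = []

    # Duplicate ids, and duplicate questions after trivial normalization.
    id_counts = Counter(case.get("id") for case in cases)
    question_counts = Counter(case.get("question", "").strip().lower() for case in cases)
    problems += [f"duplicate id: {i}" for i, n in id_counts.items() if n > 1]
    problems += [f"duplicate question: {q}" for q, n in question_counts.items() if n > 1]

    # Answerable cases need at least one gold context (hypothetical field names).
    for case in cases:
        if not case.get("unanswerable") and not case.get("contexts"):
            problems.append(f"missing contexts: {case.get('id')}")

    # The golden set is only trustworthy for the corpus it was generated from.
    if expected_fingerprint != actual_fingerprint:
        problems.append("corpus fingerprint drift")
    return problems
```

Storing the fingerprint alongside the golden set at generation time is what makes drift detectable later: if any document is added, edited, or removed, the recomputed hash no longer matches.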

Scorecards

Reports are static HTML files with no external assets. They are designed to be uploaded as CI artifacts, shared with teammates, or opened locally after an agent run.
Evaluation Report
[Image: ProofRAG HTML scorecard showing RAG evaluation metrics]
Blind A/B Report
[Image: ProofRAG blind A/B comparison report]

CI Gates

The GitHub Action turns the same CLI into a merge gate. One mode enforces an absolute score floor, while another compares the candidate run against a committed baseline and fails only when a metric regresses beyond a tolerance.
Gate Modes
Absolute: --fail-under
Regression: proofrag diff
Artifacts: html + json
Summary: markdown
Action Example
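- uses: unshDee/proofrag@v0
  with:
    goldenset: eval/goldenset.jsonl   # committed golden set
    predictions: predictions.jsonl    # predictions from the candidate run
    baseline: eval/baseline.json      # committed baseline for the regression comparison
    fail-under: "0.7"                 # absolute score floor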
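The difference between the two gate modes is easiest to see in code. The sketch below is a simplified illustration, not the Action's implementation: it assumes results are a flat mapping of metric name to a score in [0, 1] where higher is better, that the floor applies to each metric individually, and that the tolerance is an absolute difference. The function names and the default tolerance of 0.02 are hypothetical.

```python
def absolute_gate(scores: dict[str, float], floor: float) -> list[str]:
    """Absolute mode: fail every metric that sits below a fixed floor."""
    return [
        f"{name}: {value:.3f} is below floor {floor:.3f}"
        for name, value in sorted(scores.items())
        if value < floor
    ]


def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Regression mode: fail only metrics that dropped by more than the tolerance."""
    failures = []
    for name, base in sorted(baseline.items()):
        if name not in candidate:
            failures.append(f"{name}: missing from candidate run")
            continue
        drop = base - candidate[name]
        if drop > tolerance:
            failures.append(
                f"{name}: dropped {drop:.3f} from baseline {base:.3f} (tolerance {tolerance:.3f})"
            )
    return failures
```

A CI step would exit non-zero when either list is non-empty. The regression mode tolerates small run-to-run noise from the LLM judge while still catching real drops, which an absolute floor alone would miss if the baseline sits well above the floor.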

Agent-Native Surface

ProofRAG also ships as an Agent Skill. The skill does not hide the machinery; it gives an agent a durable procedure for finding a RAG entrypoint, generating or validating test data, running the CLI, and returning the scorecard path.
Providers
Anthropic: extra
OpenAI: extra
Ollama: base url
PDF: optional
Backend Swaps
The default scoring path uses ProofRAG's own pinned LLM judge, but the workflow also supports DeepEval and Ragas as generation-scoring backends. Retrieval metrics, reports, diffing, comparison, and CI gates stay the same when the backend changes.

Related RAG Utilities

I also maintain rag-utils, but I treat it differently from ProofRAG. It is intentionally a collection of scripts I reach for while building RAG systems, not a single product.
rag-utils script collection
  • DOCX to chunks with media metadata
  • PDF to Markdown conversion
  • Offline retrieval evaluation
  • Chunk quality scoring
  • Overlap merging and semantic dedup
  • Context window packing
  • Query expansion and HyDE
  • SQLite embedding cache
  • Span-based RAG tracing
  • Hybrid retrieval with reciprocal rank fusion