FactCheckLIAR is a retrieval-augmented fact-checking app built on the LIAR dataset. Given a user claim, it searches for related statements using a hybrid sparse+dense retriever, predicts the claim's truthfulness with a fine-tuned BERT classifier, and turns the result into either a concise verdict or a detailed evidence view. I recently extended the project with persistent indexes, optional Ollama-based response generation, a template fallback, and a more practical CLI/Streamlit workflow.
Dataset
- 12.8k manually labelled statements
- 6 veracity classes
- Speaker, party, context, and historical truthfulness metadata
Claims rarely appear in exactly the same wording as the dataset examples, so the retriever combines lexical and semantic similarity. BM25 handles keyword overlap, while FAISS searches dense vectors produced with all-MiniLM-L6-v2. The two scores are normalized and fused with a weighted sum before selecting the most relevant supporting claims.
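The fusion step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the min-max normalization, the `alpha` weight, and the function names `min_max` and `fuse` are assumptions chosen for the example, and the scores are made up.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    sparse: Dict[int, float],
    dense: Dict[int, float],
    alpha: float = 0.5,  # hypothetical weight on the sparse (BM25) score
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Weighted sum of normalized sparse and dense scores, best first."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    # A document missing from one retriever's results contributes 0 there.
    doc_ids = set(sparse_n) | set(dense_n)
    fused = {
        d: alpha * sparse_n.get(d, 0.0) + (1.0 - alpha) * dense_n.get(d, 0.0)
        for d in doc_ids
    }
    # Sort by descending score, breaking ties by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Illustrative scores keyed by statement id (not real LIAR results).
bm25_scores = {0: 12.0, 1: 3.0, 2: 0.0}
dense_scores = {0: 0.25, 1: 0.75, 3: 0.5}
ranked = fuse(bm25_scores, dense_scores, alpha=0.5, top_k=3)
```

With these made-up inputs, statement 1 ranks first (0.625), followed by statement 0 (0.5) and statement 3 (0.25). Normalizing before the sum matters because raw BM25 scores are unbounded while similarity scores sit in a narrow range, so an unnormalized sum would let one retriever dominate.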
Sparse index
BM25 tokenizes the LIAR statements and ranks them by term relevance, which keeps exact political terms, names, and phrases influential in the result set.
Dense index
FAISS stores normalized sentence-transformer embeddings so the app can find semantically similar claims even when the wording changes.
The classifier is unshDee/liar_qa, a bert-base-uncased sequence classifier fine-tuned for the six LIAR labels. The app loads the model from a local classifier_model/ directory when available; otherwise, it downloads the saved model from Hugging Face. The CLI can still train from scratch with --train_classifier.
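The local-first loading behavior could look something like the sketch below. The helper names `resolve_model_source` and `load_classifier` are hypothetical, and checking for `config.json` as the sign of a usable local copy is an assumption, as is the tokenizer being stored alongside the model. Only the directory name and the model id come from the description above.

```python
from pathlib import Path

LOCAL_DIR = "classifier_model"  # local directory named in the text
HUB_ID = "unshDee/liar_qa"      # Hugging Face model id named in the text


def resolve_model_source(local_dir: str = LOCAL_DIR, hub_id: str = HUB_ID) -> str:
    """Return the local directory if it holds a saved model, else the hub id."""
    # Assumption: a saved model directory contains a config.json file,
    # which is what save_pretrained() writes.
    if (Path(local_dir) / "config.json").is_file():
        return local_dir
    return hub_id


def load_classifier(local_dir: str = LOCAL_DIR, hub_id: str = HUB_ID):
    """Load the tokenizer and classifier from disk or from the Hugging Face Hub."""
    # Imported lazily so the path logic works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    source = resolve_model_source(local_dir, hub_id)
    # Assumption: the tokenizer files are stored with the model weights.
    tokenizer = AutoTokenizer.from_pretrained(source)
    model = AutoModelForSequenceClassification.from_pretrained(source)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```

Because `from_pretrained` accepts either a directory path or a hub id, the fallback reduces to choosing which string to pass, and a first download can be saved to `classifier_model/` so later runs stay offline.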