Overview

This project builds a question-answering chatbot using Retrieval-Augmented Generation (RAG) on BBC News article summaries. The chatbot answers questions about news topics by finding relevant article passages and generating natural language responses from them. Because it needs a Python backend with a local LLM and a vector database, the chatbot cannot run live on this page: GitHub Pages only hosts static files and can’t execute code. The notebook code and its outputs are reproduced below instead.

Data

The dataset consists of BBC News article summaries from Kaggle, organized into five category labels: business, entertainment, politics, sport, and tech. This structure mirrors the movie review dataset from class, which used text files with positive and negative labels, but expands to multiple topic categories.
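
The folder layout described above can be checked before anything is loaded. The following is a minimal sketch, assuming the summaries sit in one subfolder per category with one .txt file per article. The helper name count_articles_per_label is hypothetical and not part of the notebook.

```python
from pathlib import Path
from typing import Dict

# Category names taken from the dataset description above
LABELS = ["business", "entertainment", "politics", "sport", "tech"]

def count_articles_per_label(root: Path) -> Dict[str, int]:
    """Count the .txt files under each category subfolder of root.

    Missing subfolders are reported as 0 rather than raising an error.
    """
    counts = {}
    for label in LABELS:
        folder = Path(root) / label
        counts[label] = len(list(folder.glob("*.txt"))) if folder.is_dir() else 0
    return counts
```

Calling count_articles_per_label on the notebook's data folder would show how many summaries are available in each category before sampling.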

Setup

The project runs in a dedicated Python virtual environment, which isolates its dependencies from other projects, much as rooms in a house keep different activities separate. The local large language model is served by Ollama, which the notebook calls over HTTP at localhost:11434 to generate responses with the phi3:mini model.
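
Before running the notebook it can help to confirm that Ollama is up and that the model has been pulled. The sketch below uses only the standard library and queries Ollama's /api/tags endpoint, which lists local models. The function names are hypothetical, and the response shape assumed here (a "models" list whose entries carry a "name" field) should be checked against the installed Ollama version.

```python
import json
import urllib.error
import urllib.request
from typing import List

def parse_model_names(payload: dict) -> List[str]:
    """Extract model names from an /api/tags style response body."""
    return [m.get("name", "") for m in payload.get("models", [])]

def list_ollama_models(base_url: str = "http://localhost:11434") -> List[str]:
    """Return locally available model names, or [] if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return []
```

If "phi3:mini" is missing from the list returned by list_ollama_models(), running `ollama pull phi3:mini` should make it available.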

How RAG Works

The RAG pipeline follows three main steps: indexing, retrieval, and generation. First, each article summary is split into chunks of up to roughly 600 characters, and the Sentence Transformer package (model all-MiniLM-L6-v2) converts each chunk into a numerical vector that captures its semantic meaning. These vectors are stored in Qdrant, a vector database that here runs embedded inside the Python process. Second, when a user asks a question, the same Sentence Transformer converts the question into a vector, and Qdrant returns the stored chunks whose vectors have the highest cosine similarity to it (the top 5 by default) as answer candidates. Finally, the original question and the candidate passages are sent to the LLM, which generates an answer based on the retrieved context.
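
The retrieve-then-prompt flow can be illustrated without any of the heavy dependencies. This is a toy sketch only: toy_embed uses plain word counts as a stand-in for the Sentence Transformer embeddings, a Python list stands in for Qdrant, and the LLM call is left out. All function names and sample passages are made up for illustration.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(question, passages, top_k=2):
    """Score every passage against the question and keep the best top_k."""
    qv = toy_embed(question)
    scored = [(cosine(qv, toy_embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def make_prompt(question, hits):
    """Assemble the question and numbered passages into one prompt string."""
    context = "\n".join(f"[{i}] {p}" for i, (_, p) in enumerate(hits, 1))
    return f"Question: {question}\n\nContext:\n{context}\n\nAnswer:"

# Illustrative passages, not taken from the dataset
passages = [
    "Broadband providers are rolling out faster internet services.",
    "The rugby side won its final match of the season.",
    "The chancellor announced new tax plans before the election.",
]
question = "Which internet services are getting faster?"
hits = retrieve(question, passages, top_k=2)
prompt = make_prompt(question, hits)
```

Word counts only match exact words, which is the limitation that learned embeddings are meant to overcome: a real embedding model can also match passages that share meaning but not vocabulary with the question.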

Results

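The chatbot returns an answer for each question tested, but answer quality depends on what retrieval finds. The question about trends in tech produced a grounded answer with source citations. For politics and sport, the model noted that the retrieved context did not directly discuss trends, and the sport answer fell back on general knowledge outside the retrieved passages. An off-topic question about review pacing was correctly answered with "I don't know".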

pip install ipywidgets

# Core
import os, glob, random, re, html
from pathlib import Path
from typing import List, Dict
# Progress / arrays
from tqdm import tqdm
import numpy as np
# NLP
import nltk
nltk.download("punkt", quiet=True)
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import sent_tokenize
# Embeddings
from sentence_transformers import SentenceTransformer
# Qdrant (embedded)
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct

# Local LLM via Ollama
import requests
# ----------- Config ----------
DATA_ROOT = Path(r"C:\Users\spink\OneDrive\Desktop\Database Managment\archive (1)\BBC News Summary\Summaries") 
# DATA_ROOT holds one subfolder per category (business, entertainment, politics, sport, tech)
SAMPLE_N = 1000 # max articles sampled per category label
COLLECTION = "bbc_news_rag_demo_nb"
EMB_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
OLLAMA_MODEL = "phi3:mini"
QDRANT_PATH = "qdrant_data_news_nb"
CHUNK_MAX_CHARS = 600
TOP_K = 5
SEED = 7

def clean_text(s: str) -> str:
    s = html.unescape(s)
    s = re.sub(r"<br\s*/?>", " ", s, flags=re.I)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def chunk_text(text: str, max_chars=600) -> List[str]:
    sents = sent_tokenize(text)
    chunks, cur = [], ""
    
    for s in sents:
        if len(cur) + len(s) + 1 <= max_chars:
            cur = f"{cur} {s}".strip()
        else:
            if cur:
                chunks.append(cur)
            cur = s
    
    if cur:
        chunks.append(cur)
    
    return chunks
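
The greedy packing used by chunk_text can be traced on a tiny input. The sketch below repeats the same loop on sentences that are already split, so it runs without NLTK. The name pack_sentences and the sample sentences are hypothetical.

```python
from typing import List

def pack_sentences(sents: List[str], max_chars: int = 600) -> List[str]:
    """Greedily join sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk,
    so max_chars is a target rather than a hard limit.
    """
    chunks, cur = [], ""
    for s in sents:
        if len(cur) + len(s) + 1 <= max_chars:
            cur = f"{cur} {s}".strip()
        else:
            if cur:
                chunks.append(cur)
            cur = s
    if cur:
        chunks.append(cur)
    return chunks

sents = ["aaaa.", "bbbb.", "cccc."]
fits_two = pack_sentences(sents, max_chars=11)  # ['aaaa. bbbb.', 'cccc.']
oversize = pack_sentences(sents, max_chars=3)   # ['aaaa.', 'bbbb.', 'cccc.']
```

The second call shows that chunks can exceed max_chars whenever a single sentence is longer than the limit, which matters if the sentence splitter fails to find boundaries in a long passage.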

# Override the config defaults for a quick demo run (DATA_ROOT is unchanged from the config above)
SAMPLE_N = 50  # articles per category
SEED = 42


def load_sample_reviews(n_per_label=SAMPLE_N) -> List[Dict]:
    rows = []

    for label in ["tech", "sport", "politics", "entertainment", "business"]:
        files = glob.glob(str(DATA_ROOT / label / "*.txt"))

        random.seed(SEED)
        random.shuffle(files)

        selected = files[:n_per_label]

        for f in selected:
            txt = Path(f).read_text(encoding="utf-8", errors="ignore")
            rows.append({
                "text": clean_text(txt),
                "label": label,
                "path": str(f)
            })

    random.shuffle(rows)
    print(f"Loaded {len(rows)} articles ({n_per_label} per category).")
    return rows

rows = load_sample_reviews()
rows[:2]   # peek

Loaded 250 articles (50 per category).

[{'text': 'The head of Christian Brothers\' school St Fintian\'s, Richard Fogarty, said the video implied that the 24-year-old pop star had attended his school and was abused there.McFadden makes claims that he was beaten at his own school in the song\'s lyrics, saying it had "cell blocks".They have said the reference to the school was unintentional and coincidental.The new video of former Westlife singer Brian McFadden has been pulled after a Dublin school complained about being associated with his song Irish Son.Corporal punishment was outlawed in Irish schools in 1982 when McFadden was two years old.St Fintian\'s High School says it is clearly identified in the video, while McFadden never went there.',
  'label': 'entertainment',
  'path': 'C:\\Users\\spink\\OneDrive\\Desktop\\Database Managment\\archive (1)\\BBC News Summary\\Summaries\\entertainment\\123.txt'},
 {'text': 'British citizens are being included in the changes after the law lords said the current powers were discriminatory because they could only be used on foreign suspects.He said intercept evidence was only a small part of the case against the men and some of it could not be used because it could put sources\' lives at risk.Under the proposed changes - prompted by the House of Lords ruling - the home secretary could order British citizens or foreign suspects who could not be deported, to face house arrest or other measures such as restrictions on their movements or limits on their use of telephones and the internet.He said the standard of proof for the new powers would have to be "very high indeed" and he asked whether ministers had looked at measures which fitted with human rights laws.The Law Society dubbed Mr Clarke\'s new proposals an "abuse of power".It comes after law lords ruled that the detention of 12 foreign terror suspects without trial breached human rights.UK citizens suspected of involvement in terrorism could face house arrest as part of a series of new measures outlined by the home secretary.There have been calls for the rules for wire-tap and intercept evidence to be allowed to be used in courts but Mr Clarke refused to back that change.Mr Clarke said prosecutions were the government\'s first preference and promised the powers would only be used in "serious" cases, with independent scrutiny from judges.He suggested changing the law to let security-cleared judges view evidence gathered by phone-tapping could allow more terror cases to come to court.Mr Clarke also said intelligence reports showed some British nationals were now playing a more significant role in terror threats.',
  'label': 'politics',
  'path': 'C:\\Users\\spink\\OneDrive\\Desktop\\Database Managment\\archive (1)\\BBC News Summary\\Summaries\\politics\\384.txt'}]

chunks, meta = [], []
random.seed(SEED)

for r in tqdm(rows, desc="Chunking"):
    text = r.get("text")
    if not text:
        continue  # skip empty rows

    chunks_in_review = chunk_text(text, CHUNK_MAX_CHARS)
    if not chunks_in_review:
        continue  # skip if function returns None or empty

    for j, ch in enumerate(chunks_in_review):
        chunks.append(ch)
        meta.append({
            "label": r["label"],
            "source": r["path"],
            "chunk_id": j
        })

print(f"Total number of chunks created: {len(chunks)}")

# Example: show a sample review and its chunks
sample_review = random.choice(rows)
sample_chunks = chunk_text(sample_review["text"], CHUNK_MAX_CHARS)

print("\nExample review path:", sample_review["path"])
print(f"Original review length: {len(sample_review['text'])} characters")
print(f"Number of chunks created: {len(sample_chunks)}\n")

# Show all chunks with numbering
for i, chunk in enumerate(sample_chunks, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print()

Chunking: 100%|█████████████████████████████████████████████████████████████████████████████████| 250/250 [00:00<00:00, 2936.55it/s]

Total number of chunks created: 426

Example review path: C:\Users\spink\OneDrive\Desktop\Database Managment\archive (1)\BBC News Summary\Summaries\sport\362.txt
Original review length: 471 characters
Number of chunks created: 1

--- Chunk 1 ---
The Wales Students rugby side has become a casualty of the Welsh Rugby Union's reorganisation at youth level.The secretary of the Welsh Students Rugby Football Union, Reverend Eldon Phillips, said: "It is a shame that fixtures cannot be maintained this year.The Welsh Students Rugby Football Union feels that it is unable to properly prepare for or stage the matches.But that move has seen the WRU decide to end its funding of representative sides such as Wales Students.

embedder = SentenceTransformer(EMB_MODEL)
EMB_DIM = embedder.get_sentence_embedding_dimension()
EMB_DIM

client = QdrantClient(path=QDRANT_PATH)  # embedded, no Docker needed

existing = [c.name for c in client.get_collections().collections]

if COLLECTION not in existing:
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(
            size=EMB_DIM,
            distance=Distance.COSINE
        ),
    )

# quick check
client.get_collections().collections

[CollectionDescription(name='bbc_news_rag_demo_nb')]

def embed_texts(texts: List[str]) -> np.ndarray:
    vecs = embedder.encode(
        texts,
        batch_size=64,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    return vecs.astype(np.float32)

# Only ingest if empty
info = client.get_collection(COLLECTION)
print("Points in collection before:", info.points_count)

if info.points_count == 0:
    batch = 800
    idx = 0

    for start in tqdm(range(0, len(chunks), batch), desc="Embedding + Upserting"):
        end = min(start + batch, len(chunks))

        vecs = embed_texts(chunks[start:end])

        points = [
            PointStruct(
                id=idx + i,
                vector=vecs[i].tolist(),
                payload={"text": chunks[start + i], **meta[start + i]},
            )
            for i in range(end - start)
        ]

        client.upsert(COLLECTION, points=points)
        idx += end - start

info = client.get_collection(COLLECTION)
print("Points in collection after:", info.points_count)

Points in collection before: 0

Embedding + Upserting: 100%|██████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.33s/it]

Points in collection after: 426
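
Because embed_texts passes normalize_embeddings=True, every stored vector has unit length, and for unit vectors the dot product equals the cosine similarity. The short worked example below checks that identity on two small hand-picked vectors. The helper names are hypothetical and only the standard library is used.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
cos_ab = cosine(a, b)                                    # 24 / 25 = 0.96
dot_normalized = dot(l2_normalize(a), l2_normalize(b))   # same value
```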

def search(query: str, top_k=TOP_K):
    qv = embed_texts([query])[0].tolist()
    res = client.query_points(
        collection_name=COLLECTION,
        query=qv,
        limit=top_k,
        with_payload=True
    )
    return res.points  # list of ScoredPoint


# Example search (an off-topic query carried over from the movie review demo, so low scores are expected)
hits = search("What do reviewers say about pacing?")

[
    (
        h.score,
        h.payload["label"],
        h.payload["text"][:120].replace("\n", " ") + "..."
    )
    for h in hits
]

[(0.293178018544397,
  'sport',
  '""I have to be positive, I still have a few weeks," she said. "But I think there\'ll be less pressure than last time even...'),
 (0.2612251127514186,
  'entertainment',
  'Preview performances of the £3m musical Billy Elliot have been delayed to give the child actors a less arduous rehearsal...'),
 (0.25846216648855,
  'politics',
  '"Mr Howard argued the only test for his policies was whether they were best for Britain.Mr Howard says he will produce a...'),
 (0.2449469229490941,
  'sport',
  '"It\'s a good way to end the year," she said....'),
 (0.23725151362066826,
  'sport',
  '"Campbell said: "It means a lot to me to go through, it\'s everything....')]
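
All five scores above are below 0.3, which suggests that none of the retrieved chunks is a strong match for the query. One possible refinement, not used in this notebook, is to drop weak hits before building the prompt. The sketch below assumes a cutoff of 0.3, which is an arbitrary illustrative value, and uses a small Hit class as a stand-in for Qdrant's ScoredPoint. The demo scores are made up.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hit:
    """Minimal stand-in for a Qdrant ScoredPoint: a score plus a payload dict."""
    score: float
    payload: dict = field(default_factory=dict)

def filter_hits(hits: List[Hit], min_score: float = 0.3) -> List[Hit]:
    """Keep only hits whose similarity score is at least min_score."""
    return [h for h in hits if h.score >= min_score]

# Made-up scores for illustration only
demo_hits = [
    Hit(0.62, {"label": "tech"}),
    Hit(0.29, {"label": "sport"}),
    Hit(0.24, {"label": "sport"}),
]
kept = filter_hits(demo_hits)
```

Since filter_hits only reads the score attribute, it should also accept the list returned by search. If nothing survives the cutoff, the caller could skip the LLM call and report that no relevant passage was found.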

def build_prompt(question: str, hits, max_chars_per_chunk=380):
    ctx_blocks = []
    
    for i, h in enumerate(hits, 1):
        txt = h.payload["text"][:max_chars_per_chunk]  # truncate
        src = h.payload.get("source", "unknown")
        ctx_blocks.append(f"[{i}] {txt}\n(Source: {src})")

    ctx = "\n\n".join(ctx_blocks)

    return f"""Answer the question using ONLY the context. Cite sources as [1], [2], etc.
If the answer is not in the context, say you don't know.

Question: {question}

Context:
{ctx}

Answer:"""


def call_llm(prompt: str) -> str:
    try:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": OLLAMA_MODEL,
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.2},
            },
            timeout=120,
        )

        # If Ollama returned an error JSON, surface it
        if r.status_code >= 400:
            try:
                return f"[LLM ERROR {r.status_code}] {r.json()}"
            except Exception:
                r.raise_for_status()

        return r.json().get("response", "").strip()

    except requests.exceptions.ConnectionError:
        return (
            "[LLM ERROR] Could not connect to Ollama at http://localhost:11434.\n"
            "Ensure Ollama is installed/running and the model is pulled:\n"
            "    ollama pull phi3:mini"
        )

    except Exception as e:
        return f"[LLM ERROR] {e}"

question = "What do reviewers complain about regarding pacing?"

hits = search(question, top_k=5)
prompt = build_prompt(question, hits)
answer = call_llm(prompt)

print("=== Answer ===\n", answer)

print("\n=== Top matches ===")
for i, h in enumerate(hits, 1):
    snip = h.payload["text"][:200].replace("\n", " ")
    print(f"[{i}] ({h.payload['label']}) score={h.score:.3f} :: {snip} ...")

=== Answer ===
 The provided context does not include specific information about what reviewers complain about regarding pacing in any given work or event. Therefore, I don't know the answer to this question based on these sources.

=== Top matches ===
[1] (politics) score=0.246 :: ""Trust, plain-speaking and straight talking is something which matters so much to me as a politician and as a man that I have decided, of my own volition, to request an independent review of the alle ...
[2] (sport) score=0.242 :: ""I have to be positive, I still have a few weeks," she said. "But I think there'll be less pressure than last time even if I am champion." ...
[3] (entertainment) score=0.240 :: Preview performances of the £3m musical Billy Elliot have been delayed to give the child actors a less arduous rehearsal schedule.Director Stephen Daldry made the decision to re-schedule the previews  ...
[4] (politics) score=0.234 :: "Mr Howard argued the only test for his policies was whether they were best for Britain.Mr Howard says he will produce a Timetable for Action so people can hold him to account but on issues like taxat ...
[5] (politics) score=0.212 :: The councils' umbrella organisation Cosla, which provided BBC Scotland with the indicative figures for next year, warned that councils would face a continuous struggle to maintain services.The finance ...

for q in [
    "What are some trends in tech?",
    "What are some trends in politics?",
    "What are some trends in sport?"
]:
    print("\n==============================")
    print("Q:", q)

    hh = search(q, top_k=5)
    answer = call_llm(build_prompt(q, hh))

    print(answer)


==============================
Q: What are some trends in tech?

Hybrid devices and portable digital music players are some trends in tech, as they combine multimedia functions or offer on-the-go entertainment options [1][3]. Additionally, the growth of broadband services like voice and TV over the internet presents new challenges for network infrastructure to support these demands [5].

==============================
Q: What are some trends in politics?

The context provided does not directly discuss trends in politics, but it offers insights into the political strategies and campaigns of different parties during a specific election period (likely related to Scottish independence or similar referendums). The Liberal Democrats are positioning themselves as pragmatic on tax policy with potential for significant impact if they gain power. They aim to differentiate from Labour by not being seen solely as the party of the left and emphasize policies such as greater protection against problem debts, suggesting a focus on social welfare issues [1]. The Conservatives are criticized in Northern regions for their campaign tactics rather than policy trends themselves. There is no mention of blog readership or writing influencing political outcomes directly within the provided contexts [2][3][4].

Therefore, based on this limited information from different sources: 
- The Liberal Democrats are focusing on tax policies and social welfare issues to differentiate themselves politically. (Source: Summary of BBC News Politics article)
- There is a trend where the Conservatives' campaign tactics in certain regions may not be as effective, potentially influencing voter behavior [5]. 
- The role of blogs and internet readership seems to have some impact on political awareness but does not directly influence voting outcomes within this context. (Source: BBC News Tech article)

==============================
Q: What are some trends in sport?

The context provided does not explicitly mention trends in sports, but it discusses various aspects of the gaming industry and its impact on athletes' careers. However, based on general knowledge outside this specific document, here are some current trends in sport that have been observed globally:

1. Increased use of technology for performance analysis (e.g., wearable devices) [2] - This is not directly mentioned in the context but can be inferred as a broader industry trend affecting sports, including rugby and athletics like Paula Radcliffe's career. 

2. Growth of eSports: Competitive gaming has become increasingly popular worldwide [3] - While this is not directly mentioned in the context provided, it can be considered a significant trend within sports as gamers and athletes converge on digital platforms for competition. The ESPN deal referenced could potentially include coverage or development of eSports content given its association with gaming culture (though specifics are not detailed).

3. Increased focus on mental health: Athletes' well-being is gaining more attention, including the impact of doping allegations and suspicions [4] - This trend aligns closely to Paula Radcliffe’s comments about athletes being treated as criminals when accused of drug use.

Please note that these points are not directly sourced from the provided context but rather general knowledge on current sporting trends, which may or may not be reflected in this specific document's contents.