ujwal55 committed
Commit 1b330b0 · verified · 1 Parent(s): ddd25f0

Upload 5 files
Methodology and Privacy Protection.md ADDED
@@ -0,0 +1,65 @@
+ # Synopsis Scorer: Methodology and Privacy Protection Strategy
+
+ ## Scoring Methodology
+
+ The synopsis quality evaluation system employs a multi-dimensional approach to assess how effectively a synopsis captures and communicates the content of its source article:
+
+ ### Components (Total: 100 points)
+
+ #### 1. **Content Coverage – 50 points**
+ - **What it checks:** How well the synopsis captures the key ideas from the article.
+ - **How it's done:**
+   - Uses the `all-MiniLM-L6-v2` SentenceTransformer model to convert both texts into vector form.
+   - Calculates the cosine similarity between the article and synopsis embeddings (range: 0 to 1).
+ - **Scoring:** `similarity × 50` (e.g., 0.9 similarity = 45 points).
+ - **Why it matters:** A strong synopsis reflects the main points from the original content.
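The coverage component is just cosine similarity rescaled to the 0–50 range. A minimal sketch with plain NumPy, using toy vectors in place of the real `all-MiniLM-L6-v2` embeddings:

```python
import numpy as np

def coverage_score(article_vec, synopsis_vec):
    # Cosine similarity of the two embeddings, rescaled from [0, 1] to [0, 50].
    sim = np.dot(article_vec, synopsis_vec) / (
        np.linalg.norm(article_vec) * np.linalg.norm(synopsis_vec)
    )
    return float(sim) * 50

# Toy vectors: identical embeddings score the full 50 points,
# orthogonal ones score 0.
print(coverage_score(np.array([3.0, 4.0]), np.array([3.0, 4.0])))  # 50.0
print(coverage_score(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```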
+
+ #### 2. **Clarity – 25 points**
+ - **What it checks:** How clearly and precisely the synopsis is written.
+ - **How it's done:**
+   - Calculates lexical diversity: `(unique words / total words) × 25`.
+ - **Why it matters:** More diverse vocabulary indicates better language use and avoids repetition.
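The formula takes only a few lines; this sketch mirrors the whitespace tokenization used in `utils.py` (tokens are case-sensitive and keep punctuation attached):

```python
def clarity_score(synopsis: str) -> float:
    # Lexical diversity: unique words / total words, rescaled to [0, 25].
    words = synopsis.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words) * 25

# "the" repeats, so diversity is 5/6 and the score is about 20.83.
print(round(clarity_score("the cat sat on the mat"), 2))  # 20.83
```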
+
+ #### 3. **Coherence – 25 points**
+ - **What it checks:** How logically the synopsis is structured and whether it flows smoothly.
+ - **How it's done:**
+   - Awards 5 points per sentence, up to a maximum of 5 sentences (25 points).
+ - **Why it matters:** Clear, well-structured writing is easier to understand and follow.
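As a sketch, counting period-terminated sentences and capping the award at five sentences:

```python
def coherence_score(synopsis: str) -> int:
    # Splitting on "." yields one more piece than there are
    # period-terminated sentences; cap at 5 sentences x 5 points.
    sentences = len(synopsis.split(".")) - 1
    return min(25, 5 * sentences)

print(coherence_score("First point. Second point. Third point."))  # 15
```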
+
+ ### Advanced Feedback
+
+ In addition to the quantitative scoring, the system leverages the Gemma 3 4B LLM to provide qualitative feedback on synopsis quality. The model is guided through careful prompt engineering to focus on relevance, coverage, clarity, and coherence without storing or reproducing the original text content.
+
+ ## Privacy Protection Strategy
+
+ The system implements a comprehensive data privacy approach to protect sensitive information:
+
+ ### Multi-Layer Anonymization
+
+ 1. **Named Entity Recognition**
+    - Uses spaCy's NER capabilities to identify and replace sensitive entities:
+      - PERSON: Individual names
+      - DATE: Temporal identifiers
+      - LOCATION/GPE: Geographic references
+      - ORG: Organization names
+
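The replacement step works on character offsets such as spaCy's `ent.start_char`/`ent.end_char`. A sketch with hand-written spans standing in for NER output (applying spans from the end of the string keeps the remaining offsets valid):

```python
def replace_spans(text, spans):
    # spans: (start_char, end_char, label) triples, e.g. collected from NER.
    # Sort in reverse order of start offset so each splice leaves the
    # offsets of not-yet-applied spans untouched.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "Alice met Bob in Paris."
spans = [(0, 5, "PERSON"), (10, 13, "PERSON"), (17, 22, "LOCATION")]
print(replace_spans(text, spans))  # [PERSON] met [PERSON] in [LOCATION].
```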
+ 2. **Regex Pattern Matching**
+    - Supplements NER with custom regular expressions to catch:
+      - Email addresses
+      - URLs and web links
+      - Phone numbers
+      - Identification numbers/codes
+
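For the regex layer on its own, a plain `re.sub` pass suffices (the patterns below follow those in `utils.py`; the app instead records match offsets so regex and NER replacements can be applied in one pass):

```python
import re

# Patterns mirror the regex layer in utils.py.
PATTERNS = [
    (r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "EMAIL"),  # email addresses
    (r"https?://\S+|www\.\S+", "URL"),         # URLs and web links
    (r"\b\d{10}\b", "PHONE"),                  # 10-digit phone numbers
    (r"\b[A-Z]{2,}\d{6,}\b", "ID"),            # generic IDs, e.g. AA123456
]

def regex_anonymize(text: str) -> str:
    for pattern, label in PATTERNS:
        text = re.sub(pattern, f"[{label}]", text)
    return text

print(regex_anonymize("Mail jane@example.com or call 5551234567."))
# Mail [EMAIL] or call [PHONE].
```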
+ 3. **Privacy-Preserving LLM Integration**
+    - Applies anonymization before sending text to the LLM
+    - Uses a quantized model running locally to avoid data transmission to external APIs
+    - Implements character limits to prevent overloading and potential information leakage
+
+ ### System Design Considerations
+
+ - **Local Processing**: All text processing occurs on the user's machine
+ - **Access Control**: Token-based authentication restricts unauthorized access
+ - **Data Minimization**: Preview displays only limited text portions
+ - **Secure LLM Integration**: Carefully constructed prompts instruct the LLM to analyze without storing or reproducing sensitive content
+
+ This privacy-first approach ensures that the system can provide valuable evaluation feedback while maintaining the confidentiality of sensitive information in both source articles and synopses.
README.md CHANGED
@@ -1,19 +1,86 @@
- ---
- title: Synopsis Scorer
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
-   - streamlit
- pinned: false
- short_description: '🧠 Privacy-Preserving Synopsis Scorer Upload an article and '
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community forums](https://discuss.streamlit.io).
+ # Synopsis Scorer with Privacy Protection
+
+ This application evaluates the quality of text synopses against their source content while maintaining privacy through robust text anonymization techniques.
+
+ ## Features
+
+ - **Synopsis Quality Assessment**: Scores synopses based on content coverage, clarity, and coherence
+ - **Privacy Protection**: Anonymizes sensitive information in both source articles and synopses
+ - **LLM-Powered Feedback**: Provides qualitative feedback using the Gemma 3 4B LLM
+ - **User-Friendly Interface**: Built with Streamlit for easy interaction
+
+ ## Setup Instructions
+
+ ### Prerequisites
+
+ - Python 3.8+
+ - At least 4GB RAM (recommended for LLM inference)
+ - 4GB disk space
+
+ ### Installation
+
+ 1. Clone this repository:
+    ```
+    git clone https://github.com/yourusername/synopsis-scorer.git
+    cd synopsis-scorer
+    ```
+
+ 2. Create a virtual environment:
+    ```
+    python -m venv venv
+    source venv/bin/activate  # On Windows: venv\Scripts\activate
+    ```
+
+ 3. Install dependencies:
+    ```
+    pip install -r requirements.txt
+    ```
+
+ 4. Download the spaCy model:
+    ```
+    python -m spacy download en_core_web_sm
+    ```
+
+ 5. Download the Gemma model:
+    The application will automatically download the quantized Gemma model on first run, or you can download it manually:
+    ```
+    mkdir -p models
+    ```
+    Then download the model from https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/gemma-3-4b-it-q4_0.gguf and place it in the `models` folder.
+
+ ### Running the Application
+
+ 1. Create a `.streamlit/secrets.toml` file with your access token and Hugging Face token (note the `>>` on the second line, so the first entry is not overwritten):
+    ```
+    echo 'access_token = "your_secret_token"' > .streamlit/secrets.toml
+    echo 'hf_token = "your_huggingface_token"' >> .streamlit/secrets.toml
+    ```
+
+ 2. Start the application:
+    ```
+    streamlit run app.py
+    ```
+
+ 3. Open your browser and go to `http://localhost:8501`
+
+ ## Usage
+
+ 1. Enter the access token to unlock the application
+ 2. Upload an article file (PDF or TXT format)
+ 3. Upload a synopsis file (TXT format)
+ 4. Click "Evaluate" to process and score the synopsis
+ 5. Review the scoring metrics and LLM feedback
+
+ ## Project Structure
+
+ ```
+ synopsis-scorer/
+ ├── app.py              # Main Streamlit application
+ ├── utils.py            # Utilities for text processing and scoring
+ ├── requirements.txt    # Python dependencies
+ ├── README.md           # This documentation
+ ├── .streamlit/
+ │   └── secrets.toml    # Configuration secrets
+ └── models/             # Downloaded LLM models
+ ```
requirements.txt CHANGED
@@ -1,3 +1,8 @@
- altair
- pandas
- streamlit
+ streamlit==1.31.0
+ PyMuPDF==1.23.8
+ numpy==1.26.2
+ scikit-learn==1.3.2
+ sentence-transformers==2.2.2
+ spacy==3.7.2
+ llama-cpp-python==0.2.38
+ en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
streamlit_app.py ADDED
@@ -0,0 +1,76 @@
+ import streamlit as st
+ from utils import extract_text, anonymize_text, score_synopsis
+ from llama_cpp import Llama
+ import os
+
+ st.set_page_config(page_title="Synopsis Scorer", layout="wide")
+
+ # --- Access Control ---
+ TOKEN = st.secrets.get("access_token")
+ user_token = st.text_input("Enter Access Token to Continue", type="password")
+ if user_token != TOKEN:
+     st.warning("Please enter a valid access token.")
+     st.stop()
+
+ # --- Hugging Face Token Configuration ---
+ hf_token = st.secrets.get("hf_token") if "hf_token" in st.secrets else os.environ.get("HF_TOKEN")
+ if not hf_token and not os.path.exists("models/gemma-3-4b-it-q4_0.gguf"):
+     st.warning("Hugging Face token not found. Please add it to your secrets or environment variables.")
+     hf_token = st.text_input("Enter your Hugging Face token:", type="password")
+
+ # --- File Upload ---
+ st.title("📘 Synopsis Scorer with Privacy Protection")
+ article_file = st.file_uploader("Upload the Article (.pdf/.txt)", type=["pdf", "txt"])
+ synopsis_file = st.file_uploader("Upload the Synopsis (.txt)", type=["txt"])
+
+ if article_file and synopsis_file:
+     with st.spinner("Reading files..."):
+         article = extract_text(article_file)
+         synopsis = extract_text(synopsis_file)
+
+     st.subheader("Preview")
+     st.text_area("Article", article[:1000] + "...", height=200)
+     st.text_area("Synopsis", synopsis, height=150)
+
+     if st.button("Evaluate"):
+         with st.spinner("Scoring..."):
+             scores = score_synopsis(article, synopsis)
+
+         # Anonymization
+         article_anon = anonymize_text(article)
+         synopsis_anon = anonymize_text(synopsis)
+
+         # Context budget: 128,000 tokens * ~3.5 chars/token ≈ 448,000 chars;
+         # 448,000 - 98,000 reserved for the synopsis and prompt = 350,000.
+         article_limit = 350000
+
+         # LLM feedback
+         try:
+             llm = Llama.from_pretrained(
+                 repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
+                 filename="gemma-3-4b-it-q4_0.gguf"
+             )
+             prompt = (
+                 "You are an expert writing evaluator. The user has uploaded two text documents: "
+                 "1) a short synopsis, and 2) a longer article (source content). "
+                 "Without copying or storing the full content, analyze the synopsis and evaluate its quality in comparison to the article. "
+                 "Assess it on the basis of relevance, coverage, clarity, and coherence.\n\n"
+                 "Return:\n- A score out of 100\n- 2 to 3 lines of qualitative feedback\n\n"
+                 f"Here is the source article:\n{article_anon[:article_limit]}\n\nHere is the synopsis:\n{synopsis_anon}"
+             )
+             result = llm.create_chat_completion(
+                 messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}]
+             )
+             feedback = result["choices"][0]["message"]["content"]
+         except Exception as e:
+             feedback = "LLM feedback not available: " + str(e)
+
+         st.success("Evaluation Complete ✅")
+
+         st.metric("Total Score", f"{scores['total']} / 100")
+         st.progress(scores["total"] / 100)
+
+         st.subheader("Score Breakdown")
+         st.write(f"📘 Content Coverage: {scores['content_coverage']} / 50")
+         st.write(f"🧠 Clarity: {scores['clarity']} / 25")
+         st.write(f"🔗 Coherence: {scores['coherence']} / 25")
+
+         st.subheader("LLM Feedback")
+         st.write(feedback)
utils.py ADDED
@@ -0,0 +1,65 @@
+ import re
+ import fitz  # PyMuPDF
+ import numpy as np
+ from sklearn.metrics.pairwise import cosine_similarity
+ from sentence_transformers import SentenceTransformer
+ import spacy
+
+
+ # Load the English NLP model and the SentenceTransformer model
+ nlp = spacy.load("en_core_web_sm")
+ model = SentenceTransformer('all-MiniLM-L6-v2')
+
+ def extract_text(file):
+     if file.name.endswith(".pdf"):
+         doc = fitz.open(stream=file.read(), filetype="pdf")
+         return "\n".join([page.get_text() for page in doc])
+     else:
+         return file.read().decode("utf-8")
+
+ def anonymize_text(text):
+     doc = nlp(text)
+     # Collect spaCy-detected entities
+     replacements = []
+     for ent in doc.ents:
+         if ent.label_ == "PERSON":
+             replacements.append((ent.start_char, ent.end_char, "PERSON"))
+         elif ent.label_ == "DATE":
+             replacements.append((ent.start_char, ent.end_char, "DATE"))
+         elif ent.label_ in ["GPE", "LOC"]:
+             replacements.append((ent.start_char, ent.end_char, "LOCATION"))
+         elif ent.label_ == "ORG":
+             replacements.append((ent.start_char, ent.end_char, "ORG"))
+
+     # Add regex-based matches for things spaCy misses
+     regex_patterns = [
+         (r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "EMAIL"),  # Email addresses
+         (r"https?://\S+|www\.\S+", "URL"),         # URLs
+         (r"\b\d{10}\b", "PHONE"),                  # 10-digit phone numbers
+         (r"\b[A-Z]{2,}\d{6,}\b", "ID"),            # Generic IDs (e.g., AA123456)
+     ]
+     for pattern, label in regex_patterns:
+         for match in re.finditer(pattern, text):
+             replacements.append((match.start(), match.end(), label))
+
+     # Replace from the end of the string so earlier offsets stay valid
+     replacements.sort(reverse=True)
+     for start, end, replacement in replacements:
+         text = text[:start] + f"[{replacement}]" + text[end:]  # Brackets for clarity
+
+     return text
+
+ def score_synopsis(article, synopsis):
+     embeddings = model.encode([article, synopsis])
+     similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
+
+     content_coverage = similarity * 50
+     clarity = (len(set(synopsis.split())) / max(len(synopsis.split()), 1)) * 25
+     coherence = min(25, 5 * (len(synopsis.split(".")) - 1))
+
+     total = content_coverage + clarity + coherence
+     return {
+         "total": round(total, 2),
+         "content_coverage": round(content_coverage, 2),
+         "clarity": round(clarity, 2),
+         "coherence": round(coherence, 2)
+     }