Upload 5 files
- Methodology and Privacy Protection.md +65 -0
- README.md +86 -19
- requirements.txt +8 -3
- streamlit_app.py +76 -0
- utils.py +65 -0
Methodology and Privacy Protection.md
ADDED
@@ -0,0 +1,65 @@
# Synopsis Scorer: Methodology and Privacy Protection Strategy

## Scoring Methodology

The synopsis quality evaluation system employs a multi-dimensional approach to assess how effectively a synopsis captures and communicates the content of its source article:

### Components (Total: 100 points)

#### 1. **Content Coverage – 50 points**
- **What it checks:** How well the synopsis captures the key ideas from the article.
- **How it's done:**
  - Uses the `all-MiniLM-L6-v2` SentenceTransformer model to convert both texts into vector form.
  - Calculates the cosine similarity between the article and synopsis embeddings (range: 0 to 1).
- **Scoring:** `similarity × 50` (e.g., 0.9 similarity = 45 points).
- **Why it matters:** A strong synopsis reflects the main points of the original content.
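The coverage arithmetic can be sketched in plain Python. Toy 3-dimensional vectors stand in for the 384-dimensional MiniLM embeddings, and `coverage_score` is an illustrative name, not part of the codebase:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def coverage_score(article_emb, synopsis_emb):
    """Map similarity onto the 50-point content-coverage scale."""
    return cosine_sim(article_emb, synopsis_emb) * 50

# Toy vectors: nearly parallel embeddings earn most of the 50 points.
score = coverage_score([1.0, 0.0, 1.0], [0.9, 0.1, 1.0])
print(round(score, 2))  # ≈ 49.79
```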
#### 2. **Clarity – 25 points**
- **What it checks:** How clearly and precisely the synopsis is written.
- **How it's done:**
  - Calculates lexical diversity: `(unique words / total words) × 25`.
- **Why it matters:** More diverse vocabulary indicates better language use and avoids repetition.

#### 3. **Coherence – 25 points**
- **What it checks:** How logically the synopsis is structured and whether it flows smoothly.
- **How it's done:**
  - Awards 5 points per sentence, up to a maximum of 5 sentences (25 points).
- **Why it matters:** Clear, well-structured writing is easier to understand and follow.
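Both heuristics reduce to one-liners; this sketch mirrors the arithmetic in `score_synopsis` (`utils.py`), with illustrative function names:

```python
def clarity_score(synopsis: str) -> float:
    """Lexical diversity (unique words / total words) mapped onto the 25-point scale."""
    words = synopsis.split()
    return (len(set(words)) / max(len(words), 1)) * 25

def coherence_score(synopsis: str) -> float:
    """5 points per period-delimited sentence, capped at 25."""
    return min(25, 5 * (len(synopsis.split(".")) - 1))

print(round(clarity_score("the cat sat on the mat"), 2))  # 20.83: "the" repeats
print(coherence_score("One. Two. Three."))                # 15: three sentences
```

Note that the coherence heuristic counts periods, so abbreviations and decimal numbers inflate the sentence count slightly; the 25-point cap limits the effect.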
### Advanced Feedback

In addition to the quantitative scoring, the system leverages the Gemma 3 4B LLM to provide qualitative feedback on synopsis quality. The model is guided through careful prompt engineering to focus on relevance, coverage, clarity, and coherence without storing or reproducing the original text content.
## Privacy Protection Strategy

The system implements a comprehensive data privacy approach to protect sensitive information:

### Multi-Layer Anonymization

1. **Named Entity Recognition**
   - Uses spaCy's NER capabilities to identify and replace sensitive entities:
     - PERSON: Individual names
     - DATE: Temporal identifiers
     - LOCATION/GPE: Geographic references
     - ORG: Organization names

2. **Regex Pattern Matching**
   - Supplements NER with custom regular expressions to catch:
     - Email addresses
     - URLs and web links
     - Phone numbers
     - Identification numbers/codes

3. **Privacy-Preserving LLM Integration**
   - Applies anonymization before sending text to the LLM
   - Uses a quantized model running locally to avoid data transmission to external APIs
   - Implements character limits to prevent overloading and potential information leakage
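The regex layer can be exercised in isolation. The patterns below are the same ones used in `anonymize_text` (`utils.py`); `regex_anonymize` is an illustrative stand-alone helper that substitutes labels directly rather than collecting offsets:

```python
import re

# Same patterns as the regex layer in utils.py.
PATTERNS = [
    (r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "EMAIL"),   # email addresses
    (r"https?://\S+|www\.\S+", "URL"),          # URLs and web links
    (r"\b\d{10}\b", "PHONE"),                   # 10-digit phone numbers
    (r"\b[A-Z]{2,}\d{6,}\b", "ID"),             # generic IDs (e.g., AA123456)
]

def regex_anonymize(text: str) -> str:
    """Replace every regex match with its bracketed label."""
    for pattern, label in PATTERNS:
        text = re.sub(pattern, f"[{label}]", text)
    return text

print(regex_anonymize("Contact alice@example.com or call 9876543210."))
# Contact [EMAIL] or call [PHONE].
```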
### System Design Considerations

- **Local Processing**: All text processing occurs on the user's machine
- **Access Control**: Token-based authentication restricts unauthorized access
- **Data Minimization**: Preview displays only limited text portions
- **Secure LLM Integration**: Carefully constructed prompts instruct the LLM to analyze without storing or reproducing sensitive content

This privacy-first approach ensures that the system can provide valuable evaluation feedback while maintaining the confidentiality of sensitive information in both source articles and synopses.
README.md
CHANGED
@@ -1,19 +1,86 @@
# Synopsis Scorer with Privacy Protection

This application evaluates the quality of text synopses against their source content while maintaining privacy through robust text anonymization techniques.

## Features

- **Synopsis Quality Assessment**: Scores synopses based on content coverage, clarity, and coherence
- **Privacy Protection**: Anonymizes sensitive information in both source articles and synopses
- **LLM-Powered Feedback**: Provides qualitative feedback using the Gemma 3 4B LLM
- **User-Friendly Interface**: Built with Streamlit for easy interaction

## Setup Instructions

### Prerequisites

- Python 3.8+
- At least 4 GB RAM (recommended for LLM inference)
- 4 GB disk space

### Installation

1. Clone this repository:
```
git clone https://github.com/yourusername/synopsis-scorer.git
cd synopsis-scorer
```

2. Create a virtual environment:
```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```
pip install -r requirements.txt
```

4. Download the spaCy model:
```
python -m spacy download en_core_web_sm
```

5. Download the Gemma model:

The application will automatically download the quantized Gemma model on first run, or you can download it manually:
```
mkdir -p models
```
Download the model from https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/gemma-3-4b-it-q4_0.gguf and place it in the `models` folder.

### Running the Application

1. Create a `.streamlit/secrets.toml` file with your tokens:
```
mkdir -p .streamlit
echo 'access_token = "your_secret_token"' > .streamlit/secrets.toml
echo 'hf_token = "your_huggingface_token"' >> .streamlit/secrets.toml
```
Note the `>>` on the second line: using `>` twice would overwrite the file and drop the access token.

2. Start the application:
```
streamlit run streamlit_app.py
```

3. Open your browser and go to `http://localhost:8501`

## Usage

1. Enter the access token to unlock the application
2. Upload an article file (PDF or TXT format)
3. Upload a synopsis file (TXT format)
4. Click "Evaluate" to process and score the synopsis
5. Review the scoring metrics and LLM feedback

## Project Structure

```
synopsis-scorer/
├── streamlit_app.py   # Main Streamlit application
├── utils.py           # Utilities for text processing and scoring
├── requirements.txt   # Python dependencies
├── README.md          # This documentation
├── .streamlit/
│   └── secrets.toml   # Configuration secrets
└── models/            # Downloaded LLM models
```
requirements.txt
CHANGED
@@ -1,3 +1,8 @@
streamlit==1.31.0
PyMuPDF==1.23.8
numpy==1.26.2
scikit-learn==1.3.2
sentence-transformers==2.2.2
spacy==3.7.2
llama-cpp-python==0.2.38
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
streamlit_app.py
ADDED
@@ -0,0 +1,76 @@
import streamlit as st
from utils import extract_text, anonymize_text, score_synopsis
from llama_cpp import Llama
import os

st.set_page_config(page_title="Synopsis Scorer", layout="wide")

# --- Access Control ---
TOKEN = st.secrets.get("access_token")
user_token = st.text_input("Enter Access Token to Continue", type="password")
if user_token != TOKEN:
    st.warning("Please enter a valid access token.")
    st.stop()

# --- Hugging Face Token Configuration ---
hf_token = st.secrets.get("hf_token") if "hf_token" in st.secrets else os.environ.get("HF_TOKEN")
if not hf_token and not os.path.exists("models/gemma-3-4b-it-q4_0.gguf"):
    st.warning("Hugging Face token not found. Please add it to your secrets or environment variables.")
    hf_token = st.text_input("Enter your Hugging Face token:", type="password")
if hf_token:
    # Expose the token so huggingface_hub can authenticate the model download.
    os.environ.setdefault("HF_TOKEN", hf_token)

# --- File Upload ---
st.title("Synopsis Scorer with Privacy Protection")
article_file = st.file_uploader("Upload the Article (.pdf/.txt)", type=["pdf", "txt"])
synopsis_file = st.file_uploader("Upload the Synopsis (.txt)", type=["txt"])

if article_file and synopsis_file:
    with st.spinner("Reading files..."):
        article = extract_text(article_file)
        synopsis = extract_text(synopsis_file)

    st.subheader("Preview")
    st.text_area("Article", article[:1000] + "...", height=200)
    st.text_area("Synopsis", synopsis, height=150)

    if st.button("Evaluate"):
        with st.spinner("Scoring..."):
            scores = score_synopsis(article, synopsis)

            # Anonymization
            article_anon = anonymize_text(article)
            synopsis_anon = anonymize_text(synopsis)

            # 128,000-token context * ~3.5 chars/token ≈ 448,000 chars;
            # reserve 98,000 chars for the synopsis, leaving 350,000 for the article.
            article_limit = 350000

            # LLM feedback
            try:
                llm = Llama.from_pretrained(
                    repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
                    filename="gemma-3-4b-it-q4_0.gguf"
                )
                prompt = (
                    "You are an expert writing evaluator. The user has uploaded two text documents: "
                    "1) a short synopsis, and 2) a longer article (source content). "
                    "Without copying or storing the full content, analyze the synopsis and evaluate its quality in comparison to the article. "
                    "Assess it on the basis of relevance, coverage, clarity, and coherence.\n\n"
                    "Return:\n- A score out of 100\n- 2 to 3 lines of qualitative feedback\n\n"
                    f"Here is the source article:\n{article_anon[:article_limit]}\n\nHere is the synopsis:\n{synopsis_anon}"
                )
                result = llm.create_chat_completion(messages=[{"role": "user", "content": [{"type": "text", "text": prompt}]}])
                feedback = result["choices"][0]["message"]["content"]
            except Exception as e:
                feedback = "LLM feedback not available: " + str(e)

        st.success("Evaluation Complete ✅")

        st.metric("Total Score", f"{scores['total']} / 100")
        st.progress(scores["total"] / 100)

        st.subheader("Score Breakdown")
        st.write(f"Content Coverage: {scores['content_coverage']} / 50")
        st.write(f"Clarity: {scores['clarity']} / 25")
        st.write(f"Coherence: {scores['coherence']} / 25")

        st.subheader("LLM Feedback")
        st.write(feedback)
utils.py
ADDED
@@ -0,0 +1,65 @@
import re
import fitz  # PyMuPDF
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import spacy


# Load the English NLP model and the SentenceTransformer model
nlp = spacy.load("en_core_web_sm")
model = SentenceTransformer('all-MiniLM-L6-v2')

def extract_text(file):
    if file.name.endswith(".pdf"):
        doc = fitz.open(stream=file.read(), filetype="pdf")
        return "\n".join(page.get_text() for page in doc)
    else:
        return file.read().decode("utf-8")

def anonymize_text(text):
    doc = nlp(text)
    # Collect spaCy-detected entities
    replacements = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            replacements.append((ent.start_char, ent.end_char, "PERSON"))
        elif ent.label_ == "DATE":
            replacements.append((ent.start_char, ent.end_char, "DATE"))
        elif ent.label_ in ["GPE", "LOC"]:
            replacements.append((ent.start_char, ent.end_char, "LOCATION"))
        elif ent.label_ == "ORG":
            replacements.append((ent.start_char, ent.end_char, "ORG"))

    # Add regex-based matches for things spaCy misses
    regex_patterns = [
        (r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "EMAIL"),  # email addresses
        (r"https?://\S+|www\.\S+", "URL"),         # URLs
        (r"\b\d{10}\b", "PHONE"),                  # 10-digit phone numbers
        (r"\b[A-Z]{2,}\d{6,}\b", "ID"),            # generic IDs (e.g., AA123456)
    ]
    for pattern, label in regex_patterns:
        for match in re.finditer(pattern, text):
            replacements.append((match.start(), match.end(), label))

    # Replace from the end of the string backwards so earlier offsets stay valid
    replacements.sort(reverse=True)
    for start, end, replacement in replacements:
        text = text[:start] + f"[{replacement}]" + text[end:]  # brackets mark placeholders

    return text

def score_synopsis(article, synopsis):
    embeddings = model.encode([article, synopsis])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]

    content_coverage = similarity * 50
    clarity = (len(set(synopsis.split())) / max(len(synopsis.split()), 1)) * 25
    coherence = min(25, 5 * (len(synopsis.split(".")) - 1))

    total = content_coverage + clarity + coherence
    return {
        "total": round(total, 2),
        "content_coverage": round(content_coverage, 2),
        "clarity": round(clarity, 2),
        "coherence": round(coherence, 2)
    }