Add more articles/summaries and custom renderer (still needs to be cleaned up and tested further)
Files changed:
- .idea/HFSummSpace.iml +2 -2
- README.md +4 -0
- __pycache__/custom_renderer.cpython-37.pyc +0 -0
- app.py +233 -42
- arial.ttf +0 -0
- custom_renderer.py +206 -0
- requirements.txt +3 -0
- sample-articles/article13.txt +28 -0
- sample-articles/article16.txt +44 -0
- sample-summaries/article13.txt +1 -0
- sample-summaries/article16.txt +1 -0
.idea/HFSummSpace.iml
CHANGED
@@ -8,7 +8,7 @@
     <orderEntry type="sourceFolder" forTests="false" />
   </component>
   <component name="PyDocumentationSettings">
-    <option name="format" value="…" />
-    <option name="myDocStringFormat" value="…" />
+    <option name="format" value="EPYTEXT" />
+    <option name="myDocStringFormat" value="Epytext" />
   </component>
 </module>
README.md
CHANGED
@@ -10,3 +10,7 @@ pinned: false
 ---

 Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+
+sudo lsof -i:5000
+kill -9 67007(=PID)
+
__pycache__/custom_renderer.cpython-37.pyc
ADDED
Binary file (6.5 kB)
app.py
CHANGED
@@ -3,9 +3,18 @@ from typing import AnyStr

 import streamlit as st
 from bs4 import BeautifulSoup
+import numpy as np
+import base64
+
+from spacy_streamlit.util import get_svg
+
+from custom_renderer import render_sentence_custom
+from flair.data import Sentence
+from flair.models import SequenceTagger

 import spacy
 from spacy import displacy
+from spacy_streamlit import visualize_parser

 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 from transformers import pipeline
@@ -50,6 +59,7 @@ st.set_page_config(
     }
 )

+
 # Model setup
 @st.cache(allow_output_mutation=True,
           suppress_st_warning=True,
@@ -72,7 +82,7 @@ def format_explainer_html(html_string):
     inside_token_prefix = '##'
     soup = BeautifulSoup(html_string, 'html.parser')
     p = soup.new_tag('p',
-                     …)
+                     attrs={'style': 'color: black; background-color: white;'})
     # Select token elements and remove model specific tokens
     current_word = None
     for token in soup.find_all('td')[-1].find_all('mark')[1:-1]:
@@ -101,6 +111,7 @@ def format_explainer_html(html_string):

     return p

+
 def list_all_article_names() -> list:
     filenames = []
     for file in os.listdir('./sample-articles/'):
@@ -108,16 +119,19 @@ def list_all_article_names() -> list:
         filenames.append(file.replace('.txt', ''))
     return filenames

+
 def fetch_article_contents(filename: str) -> AnyStr:
     with open(f'./sample-articles/{filename.lower()}.txt', 'r') as f:
         data = f.read()
     return data

+
 def fetch_summary_contents(filename: str) -> AnyStr:
     with open(f'./sample-summaries/{filename.lower()}.txt', 'r') as f:
         data = f.read()
     return data

+
 def classify_comment(comment, selected_model):
     """Classify the given comment and augment with additional information."""
     toxicity_pipeline, cls_explainer = load_pipeline(selected_model)
@@ -180,53 +194,230 @@ if 'results' not in st.session_state:
 # submitted = rightmost_col.form_submit_button("Classify",
 #                                              help="Classify comment")

-… (old lines 183–193: eleven removed lines; their content is not recoverable from the rendered diff)
+
+# TODO: should probably set a minimum length of article or something
+selected_article = st.selectbox('Select an article or provide your own:',
+                                list_all_article_names())  # index=0, format_func=special_internal_function, key=None, help=None, on_change=None, args=None, kwargs=None, *, disabled=False)
+st.session_state.article_text = fetch_article_contents(selected_article)
+article_text = st.text_area(
+    label='Full article text',
+    value=st.session_state.article_text,
+    height=250
+)
+
+
+# _, rightmost_col = st.columns([5, 1])
+# get_summary = rightmost_col.button("Generate summary",
+#                                    help="Generate summary for the given article text")


 def display_summary(article_name: str):
-    st.subheader("…")
-    st.markdown("######")
+    st.subheader("Generated summary")
+    # st.markdown("######")
     summary_content = fetch_summary_contents(article_name)
+    soup = BeautifulSoup(summary_content, features="html.parser")
+    HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
+    st.session_state.summary_output = HTML_WRAPPER.format(soup)
+    st.write(st.session_state.summary_output, unsafe_allow_html=True)
+
+
+# TODO: this functionality can be cached (e.g. by storing html file output) if wanted (or just store list of entities idk)
+def get_and_compare_entities_spacy(article_name: str):
+    nlp = spacy.load('en_core_web_lg')
+
+    article_content = fetch_article_contents(article_name)
+    doc = nlp(article_content)
+    # entities_article = doc.ents
+    entities_article = []
+    for entity in doc.ents:
+        entities_article.append(str(entity))
+
+    summary_content = fetch_summary_contents(article_name)
+    doc = nlp(summary_content)
+    # entities_summary = doc.ents
+    entities_summary = []
+    for entity in doc.ents:
+        entities_summary.append(str(entity))
+
+    matched_entities = []
+    unmatched_entities = []
+    for entity in entities_summary:
+        # TODO: currently substring matching but probably should do embedding method or idk?
+        if any(entity.lower() in substring_entity.lower() for substring_entity in entities_article):
+            matched_entities.append(entity)
+        else:
+            unmatched_entities.append(entity)
+    # print(entities_article)
+    # print(entities_summary)
+    return matched_entities, unmatched_entities
+
+
+def get_and_compare_entities_flair(article_name: str):
     nlp = spacy.load('en_core_web_sm')
+    tagger = SequenceTagger.load("flair/ner-english-ontonotes-fast")
+
+    article_content = fetch_article_contents(article_name)
+    doc = nlp(article_content)
+    entities_article = []
+    sentences = list(doc.sents)
+    for sentence in sentences:
+        sentence_entities = Sentence(str(sentence))
+        tagger.predict(sentence_entities)
+        for entity in sentence_entities.get_spans('ner'):
+            entities_article.append(entity.text)
+
+    summary_content = fetch_summary_contents(article_name)
     doc = nlp(summary_content)
-    … (old lines 202–203: two removed lines; not recoverable from the rendered diff)
+    entities_summary = []
+    sentences = list(doc.sents)
+    for sentence in sentences:
+        sentence_entities = Sentence(str(sentence))
+        tagger.predict(sentence_entities)
+        for entity in sentence_entities.get_spans('ner'):
+            entities_summary.append(entity.text)
+
+    matched_entities = []
+    unmatched_entities = []
+    for entity in entities_summary:
+        # TODO: currently substring matching but probably should do embedding method or idk?
+        if any(entity.lower() in substring_entity.lower() for substring_entity in entities_article):
+            matched_entities.append(entity)
+        else:
+            unmatched_entities.append(entity)
+    # print(entities_article)
+    # print(entities_summary)
+    return matched_entities, unmatched_entities
+
+
+def highlight_entities(article_name: str):
+    st.subheader("Match entities with article")
+    # st.markdown("####")
+    summary_content = fetch_summary_contents(article_name)
+
+    markdown_start_red = "<mark class=\"entity\" style=\"background: rgb(238, 135, 135);\">"
+    markdown_start_green = "<mark class=\"entity\" style=\"background: rgb(121, 236, 121);\">"
+    markdown_end = "</mark>"
+
+    matched_entities, unmatched_entities = get_and_compare_entities_spacy(article_name)
+    for entity in matched_entities:
+        summary_content = summary_content.replace(entity, markdown_start_green + entity + markdown_end)
+
+    for entity in unmatched_entities:
+        summary_content = summary_content.replace(entity, markdown_start_red + entity + markdown_end)
+    soup = BeautifulSoup(summary_content, features="html.parser")
+
     HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
-    st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
-    st.markdown(summary_content)
-
-# Listener
-if get_summary:
-    if article_text:
-        with st.spinner('Generating summary...'):
-            #classify_comment(article_text, selected_model)
-            display_summary(selected_article)
-    else:
-        st.error('**Error**: No comment to classify. Please provide a comment.')

+    st.write(HTML_WRAPPER.format(soup), unsafe_allow_html=True)
+
+
+def render_dependency_parsing(text: str):
+    nlp = spacy.load('en_core_web_sm')
+    #doc = nlp(text)
+    # st.write(displacy.render(doc, style='dep'))
+    #sentence_spans = list(doc.sents)
+    # dep_svg = displacy.serve(sentence_spans, style="dep")
+    # dep_svg = displacy.render(doc, style="dep", jupyter = False,
+    #                           options = {"compact" : False,})
+    # st.image(dep_svg, width = 50,use_column_width=True)
+
+    #visualize_parser(doc)
+    #docs = [doc]
+    #split_sents = True
+    #docs = [span.as_doc() for span in doc.sents] if split_sents else [doc]
+    #for sent in docs:
+    html = render_sentence_custom(text)
+    # Double newlines seem to mess with the rendering
+    html = html.replace("\n\n", "\n")
+    st.write(get_svg(html), unsafe_allow_html=True)
+    #st.image(html, width=50, use_column_width=True)
+
+
+def check_dependency(text):
+    tagger = SequenceTagger.load("flair/ner-english-ontonotes-fast")
+    nlp = spacy.load('en_core_web_lg')
+    doc = nlp(text)
+    tok_l = doc.to_json()['tokens']
+    # all_deps = []
+    all_deps = ""
+    sentences = list(doc.sents)
+    for sentence in sentences:
+        all_entities = []
+        # # ENTITIES WITH SPACY:
+        for entity in sentence.ents:
+            all_entities.append(str(entity))
+        # # ENTITIES WITH FLAIR:
+        sentence_entities = Sentence(str(sentence))
+        tagger.predict(sentence_entities)
+        for entity in sentence_entities.get_spans('ner'):
+            all_entities.append(entity.text)
+        # ENTITIES WITH XLM ROBERTA
+        # entities_xlm = [entity["word"] for entity in ner_model(str(sentence))]
+        # for entity in entities_xlm:
+        #     all_entities.append(str(entity))
+        start_id = sentence.start
+        end_id = sentence.end
+        for t in tok_l:
+            if t["id"] < start_id or t["id"] > end_id:
+                continue
+            head = tok_l[t['head']]
+            if t['dep'] == 'amod':
+                object_here = text[t['start']:t['end']]
+                object_target = text[head['start']:head['end']]
+                # ONE NEEDS TO BE ENTITY
+                if (object_here in all_entities):
+                    # all_deps.append(f"'{text[t['start']:t['end']]}' is {t['dep']} of '{text[head['start']:head['end']]}'")
+                    all_deps = all_deps.join(str(sentence))
+                elif (object_target in all_entities):
+                    # all_deps.append(f"'{text[t['start']:t['end']]}' is {t['dep']} of '{text[head['start']:head['end']]}'")
+                    all_deps = all_deps.join(str(sentence))
+                else:
+                    continue
+    return all_deps
+
+
+with st.form("article-input"):
+    left_column, _ = st.columns([1, 1])
+    get_summary = left_column.form_submit_button("Generate summary",
+                                                 help="Generate summary for the given article text")
+    # Listener
+    if get_summary:
+        if article_text:
+            with st.spinner('Generating summary...'):
+                # classify_comment(article_text, selected_model)
+
+                display_summary(selected_article)
+        else:
+            st.error('**Error**: No comment to classify. Please provide a comment.')
+
+# Entity part
+with st.form("Entity-part"):
+    left_column, _ = st.columns([1, 1])
+    draw_entities = left_column.form_submit_button("Draw Entities",
+                                                   help="Draw Entities")
+    if draw_entities:
+        with st.spinner("Drawing entities..."):
+            highlight_entities(selected_article)
+
+with st.form("Dependency-usage"):
+    left_column, _ = st.columns([1, 1])
+    parsing = left_column.form_submit_button("Dependency parsing",
+                                             help="Dependency parsing")
+    if parsing:
+        with st.spinner("Doing dependency parsing..."):
+            render_dependency_parsing(check_dependency(fetch_summary_contents(selected_article)))
 # Results
-if 'results' in st.session_state and st.session_state.results:
-… (old lines 219–232: removed; identical to the commented-out block below, minus the leading '# ')
+# if 'results' in st.session_state and st.session_state.results:
+#     first = True
+#     for result in st.session_state.results[::-1]:
+#         if not first:
+#             st.markdown("---")
+#         st.markdown(f"Text:\n> {result['text']}")
+#         col_1, col_2, col_3 = st.columns([1,2,2])
+#         col_1.metric(label='', value=f"{result['emoji']}")
+#         col_2.metric(label='Label', value=f"{result['label']}")
+#         col_3.metric(label='Score', value=f"{result['score']:.3f}")
+#         st.markdown(f"Token Attribution:\n{result['tokens_with_background']}",
+#                     unsafe_allow_html=True)
+#         st.caption(f"Model: {result['model_name']}")
+#         first = False
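
Review note on the two get_and_compare_entities_* helpers above: both TODOs flag the substring matching as provisional. Since en_core_web_lg (which ships word vectors) is now a dependency, a vector-similarity comparison is one plausible route the TODO hints at. A minimal sketch under that assumption; match_entities_by_similarity is a hypothetical helper and the 0.75 threshold is an untuned guess, not the committed code:

import spacy

nlp = spacy.load('en_core_web_lg')

def match_entities_by_similarity(entities_summary, entities_article, threshold=0.75):
    # A summary entity counts as matched if some article entity is close in
    # vector space; entities without vectors (vector_norm == 0) are skipped
    # to avoid spurious zero similarities.
    article_docs = [nlp(e) for e in entities_article]
    matched, unmatched = [], []
    for entity in entities_summary:
        entity_doc = nlp(entity)
        if entity_doc.vector_norm and any(
                a.vector_norm and entity_doc.similarity(a) >= threshold
                for a in article_docs):
            matched.append(entity)
        else:
            unmatched.append(entity)
    return matched, unmatched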
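
Another cleanup candidate, in check_dependency: all_deps = all_deps.join(str(sentence)) calls str.join, which inserts all_deps between the characters of the sentence rather than appending to it, so repeated hits overwrite each other with garbled text. Judging by the commented-out all_deps.append(...) lines, the intent is to accumulate every sentence containing an entity-linked amod relation, which plain concatenation expresses directly. A hedged sketch of that branch (same indentation level as the original if/elif, which did the same thing in both arms anyway):

# Inside the t['dep'] == 'amod' branch of check_dependency:
if object_here in all_entities or object_target in all_entities:
    # Append the sentence instead of using str.join, which would
    # interleave the accumulator between the sentence's characters.
    all_deps += str(sentence) + " "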
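
The TODO above get_and_compare_entities_spacy also mentions caching. Beyond storing outputs, note that spacy.load('en_core_web_lg') and SequenceTagger.load(...) currently run on every button click, which dominates latency. The app already uses st.cache for model setup, so the same decorator (the streamlit 1.2.0 API pinned in requirements.txt) could hold both the models and the per-article results. A sketch, not the committed code:

@st.cache(allow_output_mutation=True)
def load_entity_models():
    # Loaded once per session instead of on every "Draw Entities" click.
    return spacy.load('en_core_web_lg'), SequenceTagger.load("flair/ner-english-ontonotes-fast")

@st.cache(allow_output_mutation=True)
def cached_entity_comparison(article_name: str):
    # Re-computed only when article_name changes.
    return get_and_compare_entities_spacy(article_name)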
arial.ttf
ADDED
Binary file (312 kB)
custom_renderer.py
ADDED
@@ -0,0 +1,206 @@
+from typing import Dict, Any
+
+import spacy
+from PIL import ImageFont
+
+from spacy.tokens import Doc
+
+def get_pil_text_size(text, font_size, font_name):
+    font = ImageFont.truetype(font_name, font_size)
+    size = font.getsize(text)
+    return size
+
+
+def render_arrow(
+    label: str, start: int, end: int, direction: str, i: int
+) -> str:
+    """Render individual arrow.
+
+    label (str): Dependency label.
+    start (int): Index of start word.
+    end (int): Index of end word.
+    direction (str): Arrow direction, 'left' or 'right'.
+    i (int): Unique ID, typically arrow index.
+    RETURNS (str): Rendered SVG markup.
+    """
+    TPL_DEP_ARCS = """
+    <g class="displacy-arrow">
+        <path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="red"/>
+        <text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
+            <textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" side="{label_side}" fill="red" text-anchor="middle">{label}</textPath>
+        </text>
+        <path class="displacy-arrowhead" d="{head}" fill="red"/>
+    </g>
+    """
+    arc = get_arc(start + 20, 50, 5, end + 20)
+    arrowhead = get_arrowhead(direction, start + 20, 50, end + 20)
+    label_side = "right" if direction == "rtl" else "left"
+    return TPL_DEP_ARCS.format(
+        id=0,
+        i=0,
+        stroke=2,
+        head=arrowhead,
+        label=label,
+        label_side=label_side,
+        arc=arc,
+    )
+
+
+def get_arc(x_start: int, y: int, y_curve: int, x_end: int) -> str:
+    """Render individual arc.
+
+    x_start (int): X-coordinate of arrow start point.
+    y (int): Y-coordinate of arrow start and end point.
+    y_curve (int): Y-coordinate of Cubic Bézier y_curve point.
+    x_end (int): X-coordinate of arrow end point.
+    RETURNS (str): Definition of the arc path ('d' attribute).
+    """
+    template = "M{x},{y} C{x},{c} {e},{c} {e},{y}"
+    return template.format(x=x_start, y=y, c=y_curve, e=x_end)
+
+
+def get_arrowhead(direction: str, x: int, y: int, end: int) -> str:
+    """Render individual arrow head.
+
+    direction (str): Arrow direction, 'left' or 'right'.
+    x (int): X-coordinate of arrow start point.
+    y (int): Y-coordinate of arrow start and end point.
+    end (int): X-coordinate of arrow end point.
+    RETURNS (str): Definition of the arrow head path ('d' attribute).
+    """
+    arrow_width = 6
+    if direction == "left":
+        p1, p2, p3 = (x, x - arrow_width + 2, x + arrow_width - 2)
+    else:
+        p1, p2, p3 = (end, end + arrow_width - 2, end - arrow_width + 2)
+    return f"M{p1},{y + 2} L{p2},{y - arrow_width} {p3},{y - arrow_width}"
+
+
+# parsed = [{'words': [{'text': 'The', 'tag': 'DET', 'lemma': None}, {'text': 'OnePlus', 'tag': 'PROPN', 'lemma': None}, {'text': '10', 'tag': 'NUM', 'lemma': None}, {'text': 'Pro', 'tag': 'PROPN', 'lemma': None}, {'text': 'is', 'tag': 'AUX', 'lemma': None}, {'text': 'the', 'tag': 'DET', 'lemma': None}, {'text': 'company', 'tag': 'NOUN', 'lemma': None}, {'text': "'s", 'tag': 'PART', 'lemma': None}, {'text': 'first', 'tag': 'ADJ', 'lemma': None}, {'text': 'flagship', 'tag': 'NOUN', 'lemma': None}, {'text': 'phone.', 'tag': 'NOUN', 'lemma': None}], 'arcs': [{'start': 0, 'end': 3, 'label': 'det', 'dir': 'left'}, {'start': 1, 'end': 3, 'label': 'nmod', 'dir': 'left'}, {'start': 1, 'end': 2, 'label': 'nummod', 'dir': 'right'}, {'start': 3, 'end': 4, 'label': 'nsubj', 'dir': 'left'}, {'start': 5, 'end': 6, 'label': 'det', 'dir': 'left'}, {'start': 6, 'end': 10, 'label': 'poss', 'dir': 'left'}, {'start': 6, 'end': 7, 'label': 'case', 'dir': 'right'}, {'start': 8, 'end': 10, 'label': 'amod', 'dir': 'left'}, {'start': 9, 'end': 10, 'label': 'compound', 'dir': 'left'}, {'start': 4, 'end': 10, 'label': 'attr', 'dir': 'right'}], 'settings': {'lang': 'en', 'direction': 'ltr'}}]
+def render_sentence_custom(parsed: str):
+    TPL_DEP_WORDS = """
+    <text class="displacy-token" fill="currentColor" text-anchor="start" y="{y}">
+        <tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
+        <tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
+    </text>
+    """
+
+    TPL_DEP_SVG = """
+    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="{lang}" id="{id}" class="displacy" width="{width}" height="{height}" direction="{dir}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}; direction: {dir}">{content}</svg>
+    """
+    arcs_svg = []
+    couples = []
+    nlp = spacy.load('en_core_web_sm')
+    doc = nlp(parsed)
+    arcs = {}
+    words = {}
+    parsed = [parse_deps(doc)]
+    for i, p in enumerate(parsed):
+        arcs = p["arcs"]
+        words = p["words"]
+        for i, a in enumerate(arcs):
+            if a["label"] == "amod":
+                couples = (a["start"], a["end"])
+
+    print(couples)
+    x_value_counter = 10
+    index_counter = 0
+    svg_words = []
+    coords_test = []
+    for i, word in enumerate(words):
+        word = word["text"]
+        word = word + " "
+        pixel_x_length = get_pil_text_size(word, 16, 'arial.ttf')[0]
+        svg_words.append(TPL_DEP_WORDS.format(text=word, tag="", x=x_value_counter, y=70))
+        print(index_counter)
+        if index_counter >= couples[0] and index_counter <= couples[1]:
+            coords_test.append(x_value_counter)
+            x_value_counter += 50
+        index_counter += 1
+        x_value_counter += pixel_x_length + 4
+    print(coords_test)
+    for i, a in enumerate(arcs):
+        if a["label"] == "amod":
+            arcs_svg.append(render_arrow(a["label"], coords_test[0], coords_test[-1], a["dir"], i))
+
+    content = "".join(svg_words) + "".join(arcs_svg)
+
+    full_svg = TPL_DEP_SVG.format(
+        id=0,
+        width=1975,
+        height=574.5,
+        color="#00000",
+        bg="#ffffff",
+        font="Arial",
+        content=content,
+        dir="ltr",
+        lang="en",
+    )
+
+    return full_svg
+
+def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
+    """Generate dependency parse in {'words': [], 'arcs': []} format.
+
+    doc (Doc): Document to parse.
+    RETURNS (dict): Generated dependency parse keyed by words and arcs.
+    """
+    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data"]))
+    if not doc.has_annotation("DEP"):
+        print("WARNING")
+    if options.get("collapse_phrases", False):
+        with doc.retokenize() as retokenizer:
+            for np in list(doc.noun_chunks):
+                attrs = {
+                    "tag": np.root.tag_,
+                    "lemma": np.root.lemma_,
+                    "ent_type": np.root.ent_type_,
+                }
+                retokenizer.merge(np, attrs=attrs)
+    if options.get("collapse_punct", True):
+        spans = []
+        for word in doc[:-1]:
+            if word.is_punct or not word.nbor(1).is_punct:
+                continue
+            start = word.i
+            end = word.i + 1
+            while end < len(doc) and doc[end].is_punct:
+                end += 1
+            span = doc[start:end]
+            spans.append((span, word.tag_, word.lemma_, word.ent_type_))
+        with doc.retokenize() as retokenizer:
+            for span, tag, lemma, ent_type in spans:
+                attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
+                retokenizer.merge(span, attrs=attrs)
+    fine_grained = options.get("fine_grained")
+    add_lemma = options.get("add_lemma")
+    words = [
+        {
+            "text": w.text,
+            "tag": w.tag_ if fine_grained else w.pos_,
+            "lemma": w.lemma_ if add_lemma else None,
+        }
+        for w in doc
+    ]
+    arcs = []
+    for word in doc:
+        if word.i < word.head.i:
+            arcs.append(
+                {"start": word.i, "end": word.head.i, "label": word.dep_, "dir": "left"}
+            )
+        elif word.i > word.head.i:
+            arcs.append(
+                {
+                    "start": word.head.i,
+                    "end": word.i,
+                    "label": word.dep_,
+                    "dir": "right",
+                }
+            )
+    return {"words": words, "arcs": arcs, "settings": get_doc_settings(orig_doc)}
+
+def get_doc_settings(doc: Doc) -> Dict[str, Any]:
+    return {
+        "lang": doc.lang_,
+        "direction": doc.vocab.writing_system.get("direction", "ltr"),
+    }
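
A forward-compatibility note on get_pil_text_size: ImageFont.getsize works on the Pillow versions contemporary with this commit, but it was deprecated later and removed in Pillow 10. Since requirements.txt does not pin Pillow directly, a getbbox-based variant is a safe substitute; a sketch (get_pil_text_size_bbox is a hypothetical replacement name):

from PIL import ImageFont

def get_pil_text_size_bbox(text, font_size, font_name):
    # getbbox returns (left, top, right, bottom) in pixels.
    font = ImageFont.truetype(font_name, font_size)
    left, top, right, bottom = font.getbbox(text)
    return right - left, bottom - top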
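
To make the SVG geometry concrete: get_arc emits a cubic Bézier 'd' path whose control points are pulled up to y_curve, and get_arrowhead appends a small triangle at one end of that arc. Sample values (arbitrary, for illustration only):

print(get_arc(30, 50, 5, 170))
# M30,50 C30,5 170,5 170,50    (curve from x=30 to x=170 along baseline y=50)
print(get_arrowhead("left", 30, 50, 170))
# M30,52 L26,44 34,44          (triangle pointing at the arc's left end)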
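
A cleanup note on render_sentence_custom: couples keeps only the last amod arc found, and stays an empty list when there is none, so couples[0] raises IndexError for input without an amod relation. If several adjective-noun pairs should render, collecting all of them is the natural fix; a hedged sketch of the replacement loop:

# Collect every amod arc instead of keeping just the last one; the
# coordinate pass would then loop over amod_couples.
amod_couples = [(a["start"], a["end"]) for a in arcs if a["label"] == "amod"]
if not amod_couples:
    return ""  # nothing to draw; avoids the IndexError on couples[0]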
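
For quick manual testing outside Streamlit, the renderer can be exercised end-to-end as below. The sentence is the same one used in the parsed example comment above (so it is known to contain an amod arc), and arial.ttf must sit in the working directory, which is presumably why the font file is committed alongside:

from custom_renderer import render_sentence_custom

svg = render_sentence_custom("The OnePlus 10 Pro is the company's first flagship phone.")
with open("parse.svg", "w") as f:
    f.write(svg)  # open in a browser to inspect the words and amod arcs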
requirements.txt
CHANGED
@@ -3,4 +3,7 @@ streamlit==1.2.0
 transformers==4.15.0
 transformers-interpret==0.5.2
 spacy==3.0.0
+spacy_streamlit==1.0.3
+flair
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
+en_core_web_lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0.tar.gz
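
A reproducibility note on the new entries: flair is the only unpinned requirement, so Space rebuilds can silently pick up breaking releases. Pinning it to the version actually tested would help; 0.10 was the current flair release around this commit, but treat the exact number as an assumption:

flair==0.10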
sample-articles/article13.txt
ADDED
@@ -0,0 +1,28 @@
+We're already seeing the effects of the Oppo merger on OnePlus' next flagship.
+Ron Amadeo - Jan 5, 2022 1:00 am UTC
+
+Enlarge / The OnePlus 10 Pro.
+
+OnePlus
+
+Official product news about the upcoming OnePlus 10 Pro has begun to trickle out. For now, we have an incomplete overview with some pictures and specs, while things like a price, release date, and the finer details will have to wait for later.
+
+First up: specs. OnePlus 10 Pro officially has the brand-new Qualcomm Snapdragon 8 Gen 1 SoC. This is Qualcomm's new flagship SoC for 2022, and it features a single ARM Cortex X2 core, three medium Cortex A710 CPUs, and four small Cortex A510 CPUs, all built on a 4 nm process. OnePlus isn't saying how much RAM and storage the 10 Pro has, but the 9 Pro came with 8GB or 12GB of RAM and 128GB or 256GB of storage. The company confirmed the display is 120 Hz but didn't give a size, though rumors say it's 6.7-inch, the same as the OnePlus 9 Pro. That fits the now-official dimensions, which are 163 × 73.9 × 8.55 mm.
+
+The battery is officially 5000 mAh, an upgrade over the 9 Pro's 4500 mAh battery. Considering the similar dimensions between the two phones, this is a welcome upgrade in battery density. OnePlus is also up to a whopping 80 W "SuperVOOC" quick charging now—an improvement over last year's 65 W "Warp Charge." OnePlus doesn't give any indication of what kind of charge time we can expect, but 65 W could charge the 9 Pro's 4500 mAh battery from 0-100 in a half-hour. Charging speed is still outpacing battery growth, so the 10 Pro should charge in under a half-hour. Just like last year, wireless charging is 50 W.
+
+Enlarge / Another look at that wacky camera block.
+
+OnePlus
+
+OnePlus has pitched itself as a scrappy startup in the past, but it's actually owned by the Chinese company BBK Electronics, one of the world's largest smartphone manufacturers. Just like General Motors, BBK has multiple brands (OnePlus, Oppo, Vivo, Realme, and iQOO) targeting different markets, and they share plenty of parts and engineering. While OnePlus and Oppo have always shared some engineering resources, last year it was announced OnePlus would actually be folded into Oppo.
+
+The Oppoization of OnePlus is going to be a major narrative for the OnePlus 10 Pro. We can already see a bit of it with the change from "Warp Charging" (OnePlus branding) to "SuperVOOC" (Oppo branding). But what really matters is the software, which will see OnePlus adopt Oppo's Color OS Android skin with a few custom tweaks rather than the separate codebases the two companies were running. We got a glimpse of this design direction via the OnePlus 9's Android 12 update, and the reviews were not kind. But we'll see what the first new phone software brings.
+
+As for the design, the camera block is really the only area where Android OEMs allow themselves to differentiate from the norm. This year, OnePlus is going with this square-ish design that wraps around the side of the phone. It looks a lot like the Galaxy S21 Ultra camera block, except that it's wrapped around the entire corner. Inside the camera block are three cameras and an LED flash. Right now, OnePlus is only disclosing megapixel counts, and those are 48MP, 50MP, and 8MP.
+
+Enlarge / This is not an official picture, but OnLeaks' clearly accurate leak from November is still our only look at the front of the phone.
+
+We don't actually have a picture of the front yet, so above is OnLeak's unofficial render from a few months ago. This has the camera hole on the left side instead of the middle. Other than that, it looks like every other Android phone on the market.
+
+It might be because of Oppo's influence, but OnePlus' launch is all sorts of weird this year. The phone is launching in China first on January 11. We don't have a price yet, but OnePlus' flagship prices have gone up every year so far, and the 9 Pro was $969. There's also no word on a US release date yet.
sample-articles/article16.txt
ADDED
@@ -0,0 +1,44 @@
+SINGAPORE, Jan 5 (Reuters) - Chinese gaming and social media company Tencent Holdings Ltd (0700.HK) has raised $3 billion by selling 14.5 million shares at $208 each in Sea, which owns e-commerce firm Shopee, according to a term sheet seen by Reuters on Wednesday.
+
+Tencent said late on Tuesday it had entered into a deal to reduce its stake in the Singapore-based gaming and e-commerce group to 18.7% from 21.3%. The company plans to retain the substantial majority of its stake in Sea for the long term.
+
+The sale comes after Tencent said last month it would divest $16.4 billion of its stake in JD.com (9618.HK), weakening its ties to China's second-biggest e-commerce firm, amid pressure from Beijing's broad regulatory crackdown on technology firms. read more
+
+Register now for FREE unlimited access to Reuters.com
+Sea's shares fell 11.4% on Tuesday in New York to $197.8 following the divestment news. Ahead of the announcement, Sea said Tencent had also agreed to cut its voting stake in the company to less than 10%.
+
+"We believe with a lower voting right control, it could reduce any potential conflict if Tencent's gaming teams plan to publish more games directly in global markets and help reduce any potential geopolitical friction if/when Sea plans to expand more strategically into new markets in more countries," Citi's analysts said in a report on Wednesday.
+
+Sea said Tencent and its affiliates had given an irrevocable notice to convert all their Class B ordinary shares.
+
+Upon conversion, all outstanding class B shares of Sea will be beneficially owned by Forrest Li, the founder, chairman and CEO of Sea, Southeast Asia's most valued company, which has a market capitalisation of $110 billion.
+
+Tencent and Sea declined to comment on the pricing of the share sale.
+
+Guotai Junan International analyst Vincent Liu said he did not see Tencent's move to trim its Sea stake as surprising, given its recent JD.com divestment. Tencent owns a huge, diversified investment portfolio so buying or selling shares in its investees could be considered a "regular action", he said.
+
+"On the other hand, we think that this reflects some of Tencent's adjustments in business strategy, especially under the circumstance of tightening regulations on anti-trust," he added.
+
+Sea's shares have shed 47% from a record high of $372 struck in October but have still risen five-fold in the past three years.
+
+The company started out as a gaming firm in 2009 and then diversified into e-commerce and food delivery, benefiting from roaring demand for its services from consumers, especially during pandemic-related restrictions.
+
+Sea is now expanding its e-commerce operations globally. read more
+
+"The divestment provides Tencent with resources to fund other investments and social initiatives," Tencent said in a statement.
+
+It sold the stock at the lower end of the $208-$212 per share range when the transaction was launched on Tuesday. The price set was a 6.8% discount to Sea's closing price of $223.3 on Monday.
+
+Tencent's shares fell 3.5% on Wednesday in a broader market, weighed down by tech stocks.
+
+Tencent will be subject to a lockup period that restricts further sale of Sea shares by Tencent during the next six months.
+
+Separately, Sea is proposing to increase the voting power of each Class B ordinary share to 15 votes from three.
+
+"The board believes that, as Sea has scaled significantly to become a leading global consumer internet company, it is in the best interests of the company in pursuing its long-term growth strategies to further clarify its capital structure through the contemplated changes," it said.
+
+Sea said the changes are subject to approval by its shareholders.
+
+It said that once the changes are made, the outstanding Class B ordinary shares beneficially owned by Li are expected to represent about 57% of the voting power, up from about 52%.
+
+Separately, Li holds about 54% of the total voting power related to the size and composition of Sea's board of directors.
sample-summaries/article13.txt
ADDED
@@ -0,0 +1 @@
+The OnePlus 10 Pro is the company's first flagship phone. It's the result of a merger between OnePlus and Oppo, which will be called "SuperVOOC" The phone is launching in China first on January 11. There's also no word on a US release date yet. The 10 Pro will have a 6.7-inch display and three cameras on the back. We don't have a price yet, but OnePlus' flagship prices have gone up every year so far, and the 9 Pro was $969.The phone will go on sale January 11 in China and January 18 in the U.S.
sample-summaries/article16.txt
ADDED
@@ -0,0 +1 @@
+Tencent Holdings Ltd has raised $3 billion by selling 14.5 million shares in Sea. Sea owns e-commerce firm Shopee, according to a term sheet seen by Reuters on Wednesday. Tencent said late on Tuesday it had entered into a deal to reduce its stake in the Singapore-based group to 18.7% from 21.3%. The sale comes after Tencent said last month it would divest $16.4 billion of its stakes in JD.com and Six9, weakening its ties to China's second-biggest e- commerce firm. SEA's shares fell 11.4% on Tuesday in New York to $197.8 following the divestment news.