Space: JournalistsonHF/ai-scraper

fdaudens HF staff committed on May 22

Commit d7c693f • 1 Parent(s): b7d27f0

first commit

Files changed (4)

README.md +26 -1
app.py +70 -0
packages.txt +8 -0
requirements.txt +5 -0

README.md CHANGED

@@ -8,5 +8,30 @@ sdk_version: 4.31.4
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Scrape and Summarize Web Content with AI
+## Overview
+This project scrapes and summarizes web content using AI models hosted on Hugging Face. It builds on ScrapeGraphAI and wraps it in a Gradio interface: users enter a prompt and a source URL and get summarized web content back without writing any code.
+## Features
+- **No-code interface**: Easily scrape and summarize web content.
+- **Advanced AI models**: Utilizes Hugging Face models for language processing and embeddings.
+- **Gradio integration**: User-friendly interface to interact with the models.
+- **Customizable** (coming soon): Change models and configurations as needed.
+## Configuration
+- **Models**:
+    - The project uses `mistralai/Mistral-7B-Instruct-v0.2` as the language model and `sentence-transformers/all-MiniLM-l6-v2` for embeddings.
+## Contributing
+Contributions are welcome! Please submit pull requests or open issues to suggest improvements.
+## Acknowledgements
+- [ScrapeGraphAI](https://github.com/VinciGit00/Scrapegraph-ai)
+- [Hugging Face](https://huggingface.co/)
+- [Gradio](https://gradio.app/)
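
The README's Configuration section names the two models, and the Features list marks model customization as coming soon. One way that customization could start is by mapping a short model name to a full Hub repo ID. The sketch below is a minimal illustration only: the `ALLOWED_MODELS` table and the `resolve_repo_id` helper are hypothetical and not part of this commit, and only the Mistral entry is taken from app.py.

```python
# Hypothetical helper, not part of this commit: map the short model name
# shown in the UI to a full Hugging Face Hub repo ID.
ALLOWED_MODELS = {
    # Only this entry comes from app.py; other entries would be added by hand.
    "Mistral-7B-Instruct-v0.2": "mistralai/Mistral-7B-Instruct-v0.2",
}


def resolve_repo_id(model_name: str) -> str:
    """Return the repo ID for a short model name, or raise ValueError."""
    key = model_name.strip()
    try:
        return ALLOWED_MODELS[key]
    except KeyError:
        known = ", ".join(sorted(ALLOWED_MODELS))
        raise ValueError(f"Unknown model {key!r}; expected one of: {known}") from None
```

Using an explicit allowlist rather than free-form input keeps the set of callable models under the Space owner's control; whether that is the right trade-off for this project is an open design choice.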

app.py ADDED

@@ -0,0 +1,70 @@
+import os
+from dotenv import load_dotenv
+from scrapegraphai.graphs import SmartScraperGraph
+from scrapegraphai.utils import prettify_exec_info
+from langchain_community.llms import HuggingFaceEndpoint
+from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
+import gradio as gr
+import subprocess
+# Ensure Playwright installs required browsers and dependencies
+subprocess.run(["playwright", "install"])
+#subprocess.run(["playwright", "install-deps"])
+# Load environment variables
+load_dotenv()
+HUGGINGFACEHUB_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')
+# Initialize the model instances
+repo_id = "mistralai/Mistral-7B-Instruct-v0.2"
+llm_model_instance = HuggingFaceEndpoint(
+    repo_id=repo_id, max_length=128, temperature=0.5, token=HUGGINGFACEHUB_API_TOKEN
+)
+embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
+    api_key=HUGGINGFACEHUB_API_TOKEN, model_name="sentence-transformers/all-MiniLM-l6-v2"
+)
+graph_config = {
+    "llm": {"model_instance": llm_model_instance},
+    "embeddings": {"model_instance": embedder_model_instance}
+}
+def scrape_and_summarize(prompt, source):
+    smart_scraper_graph = SmartScraperGraph(
+        prompt=prompt,
+        source=source,
+        config=graph_config
+    )
+    result = smart_scraper_graph.run()
+    exec_info = smart_scraper_graph.get_execution_info()
+    return result, prettify_exec_info(exec_info)
+# Gradio interface
+with gr.Blocks() as demo:
+    gr.Markdown("# Scrape websites, no-code version")
+    gr.Markdown("""Easily scrape and summarize web content using advanced AI models on the Hugging Face Hub without writing any code. Input your desired prompt and source URL to get started.
+                This is a no-code version of the excellent lib [ScrapeGraphAI](https://github.com/VinciGit00/Scrapegraph-ai).
+                It's a basic demo and a work in progress. Please contribute to it to make it more useful!""")
+    with gr.Row():
+        with gr.Column():
+            model_dropdown = gr.Textbox(label="Model", value="Mistral-7B-Instruct-v0.2")  # display only; not passed to scrape_and_summarize
+            prompt_input = gr.Textbox(label="Prompt", value="List me all the press releases with their headlines and urls.")
+            source_input = gr.Textbox(label="Source URL", value="https://www.whitehouse.gov/")
+            scrape_button = gr.Button("Scrape and Summarize")
+        with gr.Column():
+            result_output = gr.Textbox(label="Result")
+            exec_info_output = gr.Textbox(label="Execution Info")
+    scrape_button.click(
+        scrape_and_summarize,
+        inputs=[prompt_input, source_input],
+        outputs=[result_output, exec_info_output]
+    )
+# Launch the Gradio app
+if __name__ == "__main__":
+    demo.launch()
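
app.py reads `HUGGINGFACEHUB_API_TOKEN` with `os.getenv`, which returns `None` when the variable is unset, and passes that value to both model constructors. A missing token may therefore show up as a less obvious error further down. The sketch below shows a fail-fast check that could run right after `load_dotenv()`; `require_env` is a hypothetical helper, not part of this commit.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to .env or to the Space's secrets."
        )
    return value
```

In app.py this would replace the plain lookup, for example `HUGGINGFACEHUB_API_TOKEN = require_env("HUGGINGFACEHUB_API_TOKEN")`.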

packages.txt ADDED

@@ -0,0 +1,8 @@
+libnss3
+libnspr4
+libatk1.0-0
+libatk-bridge2.0-0
+libcups2
+libatspi2.0-0
+libxcomposite1
+libxdamage1

requirements.txt ADDED

@@ -0,0 +1,5 @@
+gradio==4.31.3
+langchain_community==0.0.38
+python-dotenv==1.0.1
+scrapegraphai==1.2.3
+playwright==1.43.0
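
The README hunk header in this commit shows `sdk_version: 4.31.4`, while requirements.txt pins `gradio==4.31.3`. The sketch below is a minimal check that flags such a mismatch. The `parse_pins` and `gradio_pin_matches_sdk` helpers are hypothetical, and whether the mismatch matters in practice depends on how Spaces resolves the two values, which this commit does not show.

```python
def parse_pins(requirements_text: str) -> dict[str, str]:
    """Parse `name==version` lines into a dict; skip blanks, comments, and other specifiers."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def gradio_pin_matches_sdk(requirements_text: str, sdk_version: str) -> bool:
    """True when the gradio pin is absent or equal to the Space's sdk_version."""
    pinned = parse_pins(requirements_text).get("gradio")
    return pinned is None or pinned == sdk_version


# Contents of requirements.txt as added in this commit.
REQUIREMENTS = """\
gradio==4.31.3
langchain_community==0.0.38
python-dotenv==1.0.1
scrapegraphai==1.2.3
playwright==1.43.0
"""

print(gradio_pin_matches_sdk(REQUIREMENTS, "4.31.4"))  # prints False
```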