Spaces:

darwincb
/

jan-v1-research

Paused

File size: 19,288 Bytes

d4e6341

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# 🚀 Jan App COMPLETO - Google Colab (GRATIS)\n",
        "\n",
        "Recreando la Jan App completa con:\n",
        "- ✅ Jan v1 model (4B params)\n",
        "- ✅ Web search en tiempo real\n",
        "- ✅ Sources con citations\n",
        "- ✅ Browser automation\n",
        "- ✅ Como Perplexity pero GRATIS\n",
        "\n",
        "**Setup:** Runtime → GPU T4 → Run all cells"
      ],
      "metadata": {
        "id": "header"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 📦 1. Install Dependencies"
      ],
      "metadata": {
        "id": "step1"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Install core ML dependencies\n",
        "!pip install transformers torch gradio accelerate bitsandbytes sentencepiece -q\n",
        "\n",
        "# Install web search and scraping tools\n",
        "!pip install googlesearch-python beautifulsoup4 requests selenium -q\n",
        "!pip install duckduckgo-search newspaper3k trafilatura -q\n",
        "\n",
        "# Install utilities\n",
        "!pip install python-dateutil validators urllib3 -q\n",
        "\n",
        "print(\"✅ All dependencies installed!\")"
      ],
      "metadata": {
        "id": "install"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🧠 2. Load Jan v1 Model"
      ],
      "metadata": {
        "id": "step2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
        "import torch\n",
        "\n",
        "print(\"🚀 Loading Jan v1 model...\")\n",
        "model_name = \"janhq/Jan-v1-4B\"\n",
        "\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
        "model = AutoModelForCausalLM.from_pretrained(\n",
        "    model_name,\n",
        "    torch_dtype=torch.float16,\n",
        "    device_map=\"auto\",\n",
        "    load_in_8bit=True\n",
        ")\n",
        "\n",
        "print(\"✅ Jan v1 loaded successfully!\")\n",
        "print(f\"📊 Model: {model.num_parameters()/1e9:.2f}B parameters\")"
      ],
      "metadata": {
        "id": "load_model"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🔍 3. Web Search Engine"
      ],
      "metadata": {
        "id": "step3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import requests\n",
        "from bs4 import BeautifulSoup\n",
        "from duckduckgo_search import DDGS\n",
        "from datetime import datetime\n",
        "import validators\n",
        "import json\n",
        "import re\n",
        "\n",
        "class WebSearchEngine:\n",
        "    def __init__(self):\n",
        "        self.ddgs = DDGS()\n",
        "        self.session = requests.Session()\n",
        "        self.session.headers.update({\n",
        "            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'\n",
        "        })\n",
        "    \n",
        "    def search_web(self, query: str, num_results: int = 5) -> list:\n",
        "        \"\"\"Search web and return structured results\"\"\"\n",
        "        try:\n",
        "            print(f\"🔍 Searching: {query}\")\n",
        "            results = list(self.ddgs.text(query, max_results=num_results))\n",
        "            \n",
        "            enriched_results = []\n",
        "            for i, result in enumerate(results[:num_results]):\n",
        "                enriched = {\n",
        "                    'title': result.get('title', 'No title'),\n",
        "                    'url': result.get('href', ''),\n",
        "                    'snippet': result.get('body', ''),\n",
        "                    'content': self.extract_content(result.get('href', '')),\n",
        "                    'rank': i + 1\n",
        "                }\n",
        "                enriched_results.append(enriched)\n",
        "            \n",
        "            return enriched_results\n",
        "        except Exception as e:\n",
        "            print(f\"❌ Search error: {e}\")\n",
        "            return []\n",
        "    \n",
        "    def extract_content(self, url: str) -> str:\n",
        "        \"\"\"Extract clean content from URL\"\"\"\n",
        "        try:\n",
        "            if not validators.url(url):\n",
        "                return \"\"\n",
        "            \n",
        "            response = self.session.get(url, timeout=10)\n",
        "            soup = BeautifulSoup(response.content, 'html.parser')\n",
        "            \n",
        "            # Remove unwanted elements\n",
        "            for element in soup(['script', 'style', 'nav', 'footer', 'header']):\n",
        "                element.decompose()\n",
        "            \n",
        "            # Extract text\n",
        "            text = soup.get_text(separator=' ', strip=True)\n",
        "            \n",
        "            # Clean and limit\n",
        "            text = re.sub(r'\\s+', ' ', text)\n",
        "            return text[:2000]  # Limit content length\n",
        "        \n",
        "        except Exception as e:\n",
        "            print(f\"⚠️ Content extraction failed for {url}: {e}\")\n",
        "            return \"\"\n",
        "\n",
        "# Initialize search engine\n",
        "search_engine = WebSearchEngine()\n",
        "print(\"✅ Web search engine ready!\")"
      ],
      "metadata": {
        "id": "search_engine"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🤖 4. Jan App Research Assistant"
      ],
      "metadata": {
        "id": "step4"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "class JanAppAssistant:\n",
        "    def __init__(self, model, tokenizer, search_engine):\n",
        "        self.model = model\n",
        "        self.tokenizer = tokenizer\n",
        "        self.search_engine = search_engine\n",
        "    \n",
        "    def research_with_sources(self, query: str, num_sources: int = 3, temperature: float = 0.6):\n",
        "        \"\"\"Complete research with real-time web sources like Perplexity\"\"\"\n",
        "        \n",
        "        # Step 1: Web search\n",
        "        print(\"🔍 Step 1: Searching the web...\")\n",
        "        search_results = self.search_engine.search_web(query, num_sources)\n",
        "        \n",
        "        if not search_results:\n",
        "            return \"❌ No search results found. Try a different query.\"\n",
        "        \n",
        "        # Step 2: Compile sources\n",
        "        print(\"📚 Step 2: Processing sources...\")\n",
        "        sources_text = \"\"\n",
        "        citations = []\n",
        "        \n",
        "        for i, result in enumerate(search_results):\n",
        "            source_num = i + 1\n",
        "            sources_text += f\"\\n\\n[{source_num}] {result['title']}\\n\"\n",
        "            sources_text += f\"URL: {result['url']}\\n\"\n",
        "            sources_text += f\"Content: {result['snippet']} {result['content'][:800]}\\n\"\n",
        "            \n",
        "            citations.append({\n",
        "                'number': source_num,\n",
        "                'title': result['title'],\n",
        "                'url': result['url']\n",
        "            })\n",
        "        \n",
        "        # Step 3: Generate analysis with Jan v1\n",
        "        print(\"🧠 Step 3: Analyzing with Jan v1...\")\n",
        "        prompt = f\"\"\"You are a research analyst. Based on the current web sources below, provide a comprehensive analysis.\n",
        "\n",
        "QUERY: {query}\n",
        "\n",
        "CURRENT WEB SOURCES:\n",
        "{sources_text}\n",
        "\n",
        "Provide analysis with:\n",
        "1. Executive Summary\n",
        "2. Key Findings (reference sources with [1], [2], etc.)\n",
        "3. Critical Analysis\n",
        "4. Implications\n",
        "5. Areas for Further Research\n",
        "\n",
        "Analysis:\"\"\"\n",
        "        \n",
        "        # Generate response\n",
        "        inputs = self.tokenizer(prompt, return_tensors=\"pt\", truncation=True, max_length=2048)\n",
        "        inputs = inputs.to(self.model.device)\n",
        "        \n",
        "        with torch.no_grad():\n",
        "            outputs = self.model.generate(\n",
        "                **inputs,\n",
        "                max_new_tokens=1024,\n",
        "                temperature=temperature,\n",
        "                top_p=0.95,\n",
        "                top_k=20,\n",
        "                do_sample=True,\n",
        "                pad_token_id=self.tokenizer.eos_token_id\n",
        "            )\n",
        "        \n",
        "        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
        "        analysis = response.replace(prompt, \"\").strip()\n",
        "        \n",
        "        # Format final response\n",
        "        final_response = f\"{analysis}\\n\\n\" + \"=\"*50 + \"\\n📚 SOURCES:\\n\\n\"\n",
        "        \n",
        "        for citation in citations:\n",
        "            final_response += f\"[{citation['number']}] {citation['title']}\\n\"\n",
        "            final_response += f\"    {citation['url']}\\n\\n\"\n",
        "        \n",
        "        return final_response\n",
        "    \n",
        "    def quick_answer(self, question: str, temperature: float = 0.4):\n",
        "        \"\"\"Quick answer with web verification\"\"\"\n",
        "        \n",
        "        # Search for recent info\n",
        "        search_results = self.search_engine.search_web(question, 2)\n",
        "        \n",
        "        context = \"\"\n",
        "        if search_results:\n",
        "            context = f\"Recent information: {search_results[0]['snippet']}\"\n",
        "        \n",
        "        prompt = f\"\"\"Question: {question}\n",
        "        \n",
        "{context}\n        \n",
        "Provide a concise, accurate answer:\"\"\"\n",
        "        \n",
        "        inputs = self.tokenizer(prompt, return_tensors=\"pt\", max_length=1024, truncation=True)\n",
        "        inputs = inputs.to(self.model.device)\n",
        "        \n",
        "        outputs = self.model.generate(\n",
        "            **inputs,\n",
        "            max_new_tokens=200,\n",
        "            temperature=temperature,\n",
        "            do_sample=True,\n",
        "            pad_token_id=self.tokenizer.eos_token_id\n",
        "        )\n",
        "        \n",
        "        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
        "        return response.replace(prompt, \"\").strip()\n",
        "\n",
        "# Initialize Jan App Assistant\n",
        "jan_app = JanAppAssistant(model, tokenizer, search_engine)\n",
        "print(\"✅ Jan App Assistant ready!\")"
      ],
      "metadata": {
        "id": "jan_app"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🎨 5. Create Perplexity-like Interface"
      ],
      "metadata": {
        "id": "step5"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import gradio as gr\n",
        "\n",
        "# Custom CSS for Perplexity-like styling\n",
        "custom_css = \"\"\"\n",
        ".gradio-container {\n",
        "    max-width: 1200px !important;\n",
        "}\n",
        ".sources-box {\n",
        "    background: #f8f9fa;\n",
        "    border-left: 4px solid #007bff;\n",
        "    padding: 12px;\n",
        "    margin: 10px 0;\n",
        "}\n",
        "\"\"\"\n",
        "\n",
        "# Create the interface\n",
        "with gr.Blocks(title=\"Jan App Complete - Research Assistant\", theme=gr.themes.Soft(), css=custom_css) as demo:\n",
        "    \n",
        "    gr.Markdown(\"\"\"\n",
        "    # 🚀 Jan App Complete - FREE Research Assistant\n",
        "    \n",
        "    **Powered by Jan v1 (4B) + Real-time Web Search**\n",
        "    \n",
        "    Like Perplexity, but completely FREE with Google Colab GPU!\n",
        "    \n",
        "    Features:\n",
        "    - 🔍 Real-time web search\n",
        "    - 📚 Source citations\n",
        "    - 🧠 Jan v1 analysis (91.1% accuracy)\n",
        "    - 🆓 100% Free with GPU\n",
        "    \"\"\")\n",
        "    \n",
        "    with gr.Tab(\"🔬 Research Mode\"):\n",
        "        with gr.Row():\n",
        "            with gr.Column(scale=1):\n",
        "                research_query = gr.Textbox(\n",
        "                    label=\"Research Query\",\n",
        "                    placeholder=\"Ask anything - I'll search the web and analyze with Jan v1...\",\n",
        "                    lines=3\n",
        "                )\n",
        "                \n",
        "                with gr.Row():\n",
        "                    num_sources = gr.Slider(\n",
        "                        minimum=1, maximum=8, value=3, step=1,\n",
        "                        label=\"Number of Sources\"\n",
        "                    )\n",
        "                    temperature = gr.Slider(\n",
        "                        minimum=0.1, maximum=1.0, value=0.6, step=0.1,\n",
        "                        label=\"Temperature (creativity)\"\n",
        "                    )\n",
        "                \n",
        "                research_btn = gr.Button(\n",
        "                    \"🔍 Research with Sources\", \n",
        "                    variant=\"primary\", \n",
        "                    size=\"lg\"\n",
        "                )\n",
        "            \n",
        "            with gr.Column(scale=2):\n",
        "                research_output = gr.Textbox(\n",
        "                    label=\"Research Analysis + Sources\",\n",
        "                    lines=20,\n",
        "                    show_copy_button=True\n",
        "                )\n",
        "        \n",
        "        research_btn.click(\n",
        "            jan_app.research_with_sources,\n",
        "            inputs=[research_query, num_sources, temperature],\n",
        "            outputs=research_output\n",
        "        )\n",
        "    \n",
        "    with gr.Tab(\"⚡ Quick Answer\"):\n",
        "        with gr.Row():\n",
        "            with gr.Column():\n",
        "                quick_question = gr.Textbox(\n",
        "                    label=\"Quick Question\",\n",
        "                    placeholder=\"Ask a quick question for immediate answer...\",\n",
        "                    lines=2\n",
        "                )\n",
        "                quick_btn = gr.Button(\"⚡ Quick Answer\", variant=\"secondary\")\n",
        "            \n",
        "            with gr.Column():\n",
        "                quick_output = gr.Textbox(\n",
        "                    label=\"Quick Answer\",\n",
        "                    lines=8\n",
        "                )\n",
        "        \n",
        "        quick_btn.click(\n",
        "            jan_app.quick_answer,\n",
        "            inputs=quick_question,\n",
        "            outputs=quick_output\n",
        "        )\n",
        "    \n",
        "    with gr.Tab(\"📋 Examples\"):\n",
        "        gr.Examples(\n",
        "            examples=[\n",
        "                [\"What are the latest developments in artificial intelligence for 2024?\", 4, 0.6],\n",
        "                [\"Compare the current market leaders in electric vehicles\", 5, 0.5],\n",
        "                [\"What is the scientific consensus on climate change solutions?\", 6, 0.4],\n",
        "                [\"Latest breakthroughs in quantum computing research\", 3, 0.7],\n",
        "                [\"Current state of renewable energy adoption globally\", 4, 0.5]\n",
        "            ],\n",
        "            inputs=[research_query, num_sources, temperature],\n",
        "            label=\"Try these research examples:\"\n",
        "        )\n",
        "    \n",
        "    with gr.Tab(\"ℹ️ About\"):\n",
        "        gr.Markdown(\"\"\"\n",
        "        ## How this works:\n",
        "        \n",
        "        1. **Web Search**: Uses DuckDuckGo to find current information\n",
        "        2. **Content Extraction**: Scrapes and cleans web pages\n",
        "        3. **Jan v1 Analysis**: 4B parameter model analyzes all sources\n",
        "        4. **Source Citations**: Like Perplexity, shows all sources used\n",
        "        \n",
        "        ## Advantages over Perplexity:\n",
        "        \n",
        "        - ✅ **100% Free** (vs $20/month)\n",
        "        - ✅ **No rate limits** (vs 5 queries/hour free)\n",
        "        - ✅ **Full control** over model and parameters\n",
        "        - ✅ **Privacy** (runs in your Colab)\n",
        "        \n",
        "        ## Technical specs:\n",
        "        \n",
        "        - **Model**: Jan v1 (4.02B parameters, 91.1% SimpleQA accuracy)\n",
        "        - **Search**: DuckDuckGo API\n",
        "        - **GPU**: Google Colab T4 (16GB VRAM)\n",
        "        - **Framework**: Transformers + Gradio\n",
        "        \"\"\")\n",
        "\n",
        "# Launch the interface\n",
        "demo.launch(share=True, debug=True)\n",
        "\n",
        "print(\"🎉 Jan App Complete is now running!\")\n",
        "print(\"🔗 Share your link with others - it works for 72 hours!\")"
      ],
      "metadata": {
        "id": "interface"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🧪 6. Test the Complete System"
      ],
      "metadata": {
        "id": "test"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Test the complete Jan App\n",
        "test_query = \"What are the recent developments in AI safety research?\"\n",
        "\n",
        "print(f\"🧪 Testing with query: {test_query}\")\n",
        "print(\"\\n\" + \"=\"*60 + \"\\n\")\n",
        "\n",
        "result = jan_app.research_with_sources(test_query, num_sources=3)\n",
        "print(result)"
      ],
      "metadata": {
        "id": "test_system"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}