{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "source": [ "# 🚀 Jan App Complete - Google Colab (FREE)\n", "\n", "Recreating the complete Jan App with:\n", "- ✅ Jan v1 model (4B params)\n", "- ✅ Real-time web search\n", "- ✅ Sources with citations\n", "- ✅ Browser automation\n", "- ✅ Like Perplexity, but FREE\n", "\n", "**Setup:** Runtime → GPU T4 → Run all cells" ], "metadata": { "id": "header" } }, { "cell_type": "markdown", "source": [ "## 📦 1. Install Dependencies" ], "metadata": { "id": "step1" } }, { "cell_type": "code", "source": [ "# Install core ML dependencies\n", "!pip install transformers torch gradio accelerate bitsandbytes sentencepiece -q\n", "\n", "# Install web search and scraping tools\n", "!pip install googlesearch-python beautifulsoup4 requests selenium -q\n", "!pip install duckduckgo-search newspaper3k trafilatura -q\n", "\n", "# Install utilities\n", "!pip install python-dateutil validators urllib3 -q\n", "\n", "print(\"✅ All dependencies installed!\")" ], "metadata": { "id": "install" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 🧠 2.
Load Jan v1 Model" ], "metadata": { "id": "step2" } }, { "cell_type": "code", "source": [ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n", "import torch\n", "\n", "print(\"🚀 Loading Jan v1 model...\")\n", "model_name = \"janhq/Jan-v1-4B\"\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "# Load in 8-bit via bitsandbytes to reduce GPU memory use\n", "model = AutoModelForCausalLM.from_pretrained(\n", "    model_name,\n", "    torch_dtype=torch.float16,\n", "    device_map=\"auto\",\n", "    quantization_config=BitsAndBytesConfig(load_in_8bit=True)\n", ")\n", "\n", "print(\"✅ Jan v1 loaded successfully!\")\n", "print(f\"📊 Model: {model.num_parameters()/1e9:.2f}B parameters\")" ], "metadata": { "id": "load_model" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 🔍 3. Web Search Engine" ], "metadata": { "id": "step3" } }, { "cell_type": "code", "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "from duckduckgo_search import DDGS\n", "from datetime import datetime\n", "import validators\n", "import json\n", "import re\n", "\n", "class WebSearchEngine:\n", "    def __init__(self):\n", "        self.ddgs = DDGS()\n", "        self.session = requests.Session()\n", "        self.session.headers.update({\n", "            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'\n", "        })\n", "\n", "    def search_web(self, query: str, num_results: int = 5) -> list:\n", "        \"\"\"Search web and return structured results\"\"\"\n", "        try:\n", "            print(f\"🔍 Searching: {query}\")\n", "            results = list(self.ddgs.text(query, max_results=num_results))\n", "\n", "            enriched_results = []\n", "            for i, result in enumerate(results[:num_results]):\n", "                enriched = {\n", "                    'title': result.get('title', 'No title'),\n", "                    'url': result.get('href', ''),\n", "                    'snippet': result.get('body', ''),\n", "                    'content': self.extract_content(result.get('href', '')),\n", "                    'rank': i + 1\n", "                }\n", "                enriched_results.append(enriched)\n", "\n", "            return enriched_results\n", "        except Exception as e:\n", "            print(f\"❌ Search error:
{e}\")\n", "            return []\n", "\n", "    def extract_content(self, url: str) -> str:\n", "        \"\"\"Extract clean content from URL\"\"\"\n", "        try:\n", "            if not validators.url(url):\n", "                return \"\"\n", "\n", "            response = self.session.get(url, timeout=10)\n", "            soup = BeautifulSoup(response.content, 'html.parser')\n", "\n", "            # Remove unwanted elements\n", "            for element in soup(['script', 'style', 'nav', 'footer', 'header']):\n", "                element.decompose()\n", "\n", "            # Extract text\n", "            text = soup.get_text(separator=' ', strip=True)\n", "\n", "            # Clean and limit\n", "            text = re.sub(r'\\s+', ' ', text)\n", "            return text[:2000]  # Limit content length\n", "\n", "        except Exception as e:\n", "            print(f\"⚠️ Content extraction failed for {url}: {e}\")\n", "            return \"\"\n", "\n", "# Initialize search engine\n", "search_engine = WebSearchEngine()\n", "print(\"✅ Web search engine ready!\")" ], "metadata": { "id": "search_engine" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 🤖 4. Jan App Research Assistant" ], "metadata": { "id": "step4" } }, { "cell_type": "code", "source": [ "class JanAppAssistant:\n", "    def __init__(self, model, tokenizer, search_engine):\n", "        self.model = model\n", "        self.tokenizer = tokenizer\n", "        self.search_engine = search_engine\n", "\n", "    def research_with_sources(self, query: str, num_sources: int = 3, temperature: float = 0.6):\n", "        \"\"\"Complete research with real-time web sources like Perplexity\"\"\"\n", "\n", "        # Step 1: Web search\n", "        print(\"🔍 Step 1: Searching the web...\")\n", "        search_results = self.search_engine.search_web(query, num_sources)\n", "\n", "        if not search_results:\n", "            return \"❌ No search results found.
Try a different query.\"\n", "\n", "        # Step 2: Compile sources\n", "        print(\"📚 Step 2: Processing sources...\")\n", "        sources_text = \"\"\n", "        citations = []\n", "\n", "        for i, result in enumerate(search_results):\n", "            source_num = i + 1\n", "            sources_text += f\"\\n\\n[{source_num}] {result['title']}\\n\"\n", "            sources_text += f\"URL: {result['url']}\\n\"\n", "            sources_text += f\"Content: {result['snippet']} {result['content'][:800]}\\n\"\n", "\n", "            citations.append({\n", "                'number': source_num,\n", "                'title': result['title'],\n", "                'url': result['url']\n", "            })\n", "\n", "        # Step 3: Generate analysis with Jan v1\n", "        print(\"🧠 Step 3: Analyzing with Jan v1...\")\n", "        prompt = f\"\"\"You are a research analyst. Based on the current web sources below, provide a comprehensive analysis.\n", "\n", "QUERY: {query}\n", "\n", "CURRENT WEB SOURCES:\n", "{sources_text}\n", "\n", "Provide analysis with:\n", "1. Executive Summary\n", "2. Key Findings (reference sources with [1], [2], etc.)\n", "3. Critical Analysis\n", "4. Implications\n", "5.
Areas for Further Research\n", "\n", "Analysis:\"\"\"\n", "\n", "        # Generate response\n", "        inputs = self.tokenizer(prompt, return_tensors=\"pt\", truncation=True, max_length=2048)\n", "        inputs = inputs.to(self.model.device)\n", "\n", "        with torch.no_grad():\n", "            outputs = self.model.generate(\n", "                **inputs,\n", "                max_new_tokens=1024,\n", "                temperature=temperature,\n", "                top_p=0.95,\n", "                top_k=20,\n", "                do_sample=True,\n", "                pad_token_id=self.tokenizer.eos_token_id\n", "            )\n", "\n", "        # Decode only the newly generated tokens; stripping the prompt with\n", "        # str.replace fails whenever the prompt was truncated by the tokenizer\n", "        new_tokens = outputs[0][inputs[\"input_ids\"].shape[1]:]\n", "        analysis = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()\n", "\n", "        # Format final response\n", "        final_response = f\"{analysis}\\n\\n\" + \"=\"*50 + \"\\n📚 SOURCES:\\n\\n\"\n", "\n", "        for citation in citations:\n", "            final_response += f\"[{citation['number']}] {citation['title']}\\n\"\n", "            final_response += f\"    {citation['url']}\\n\\n\"\n", "\n", "        return final_response\n", "\n", "    def quick_answer(self, question: str, temperature: float = 0.4):\n", "        \"\"\"Quick answer with web verification\"\"\"\n", "\n", "        # Search for recent info\n", "        search_results = self.search_engine.search_web(question, 2)\n", "\n", "        context = \"\"\n", "        if search_results:\n", "            context = f\"Recent information: {search_results[0]['snippet']}\"\n", "\n", "        prompt = f\"\"\"Question: {question}\n", "\n", "{context}\n", "\n", "Provide a concise, accurate answer:\"\"\"\n", "\n", "        inputs = self.tokenizer(prompt, return_tensors=\"pt\", max_length=1024, truncation=True)\n", "        inputs = inputs.to(self.model.device)\n", "\n", "        # Inference only, so skip gradient tracking\n", "        with torch.no_grad():\n", "            outputs = self.model.generate(\n", "                **inputs,\n", "                max_new_tokens=200,\n", "                temperature=temperature,\n", "                do_sample=True,\n", "                pad_token_id=self.tokenizer.eos_token_id\n", "            )\n", "\n", "        # Decode only the newly generated tokens (see research_with_sources)\n", "        new_tokens = outputs[0][inputs[\"input_ids\"].shape[1]:]\n", "        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()\n", "\n", "# Initialize Jan App Assistant\n", "jan_app = JanAppAssistant(model, tokenizer,
search_engine)\n", "print(\"✅ Jan App Assistant ready!\")" ], "metadata": { "id": "jan_app" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 🎨 5. Create Perplexity-like Interface" ], "metadata": { "id": "step5" } }, { "cell_type": "code", "source": [ "import gradio as gr\n", "\n", "# Custom CSS for Perplexity-like styling\n", "custom_css = \"\"\"\n", ".gradio-container {\n", "    max-width: 1200px !important;\n", "}\n", ".sources-box {\n", "    background: #f8f9fa;\n", "    border-left: 4px solid #007bff;\n", "    padding: 12px;\n", "    margin: 10px 0;\n", "}\n", "\"\"\"\n", "\n", "# Create the interface\n", "with gr.Blocks(title=\"Jan App Complete - Research Assistant\", theme=gr.themes.Soft(), css=custom_css) as demo:\n", "\n", "    gr.Markdown(\"\"\"\n", "    # 🚀 Jan App Complete - FREE Research Assistant\n", "\n", "    **Powered by Jan v1 (4B) + Real-time Web Search**\n", "\n", "    Like Perplexity, but completely FREE with Google Colab GPU!\n", "\n", "    Features:\n", "    - 🔍 Real-time web search\n", "    - 📚 Source citations\n", "    - 🧠 Jan v1 analysis (91.1% SimpleQA accuracy)\n", "    - 🆓 100% Free with GPU\n", "    \"\"\")\n", "\n", "    with gr.Tab(\"🔬 Research Mode\"):\n", "        with gr.Row():\n", "            with gr.Column(scale=1):\n", "                research_query = gr.Textbox(\n", "                    label=\"Research Query\",\n", "                    placeholder=\"Ask anything - I'll search the web and analyze with Jan v1...\",\n", "                    lines=3\n", "                )\n", "\n", "                with gr.Row():\n", "                    num_sources = gr.Slider(\n", "                        minimum=1, maximum=8, value=3, step=1,\n", "                        label=\"Number of Sources\"\n", "                    )\n", "                    temperature = gr.Slider(\n", "                        minimum=0.1, maximum=1.0, value=0.6, step=0.1,\n", "                        label=\"Temperature (creativity)\"\n", "                    )\n", "\n", "                research_btn = gr.Button(\n", "                    \"🔍 Research with Sources\",\n", "                    variant=\"primary\",\n", "                    size=\"lg\"\n", "                )\n", "\n", "            with gr.Column(scale=2):\n", "                research_output = gr.Textbox(\n", "                    label=\"Research Analysis + Sources\",\n", "                    lines=20,\n", "
                    show_copy_button=True\n", "                )\n", "\n", "        research_btn.click(\n", "            jan_app.research_with_sources,\n", "            inputs=[research_query, num_sources, temperature],\n", "            outputs=research_output\n", "        )\n", "\n", "    with gr.Tab(\"⚡ Quick Answer\"):\n", "        with gr.Row():\n", "            with gr.Column():\n", "                quick_question = gr.Textbox(\n", "                    label=\"Quick Question\",\n", "                    placeholder=\"Ask a quick question for an immediate answer...\",\n", "                    lines=2\n", "                )\n", "                quick_btn = gr.Button(\"⚡ Quick Answer\", variant=\"secondary\")\n", "\n", "            with gr.Column():\n", "                quick_output = gr.Textbox(\n", "                    label=\"Quick Answer\",\n", "                    lines=8\n", "                )\n", "\n", "        quick_btn.click(\n", "            jan_app.quick_answer,\n", "            inputs=quick_question,\n", "            outputs=quick_output\n", "        )\n", "\n", "    with gr.Tab(\"📋 Examples\"):\n", "        gr.Examples(\n", "            examples=[\n", "                [\"What are the latest developments in artificial intelligence for 2024?\", 4, 0.6],\n", "                [\"Compare the current market leaders in electric vehicles\", 5, 0.5],\n", "                [\"What is the scientific consensus on climate change solutions?\", 6, 0.4],\n", "                [\"Latest breakthroughs in quantum computing research\", 3, 0.7],\n", "                [\"Current state of renewable energy adoption globally\", 4, 0.5]\n", "            ],\n", "            inputs=[research_query, num_sources, temperature],\n", "            label=\"Try these research examples:\"\n", "        )\n", "\n", "    with gr.Tab(\"ℹ️ About\"):\n", "        gr.Markdown(\"\"\"\n", "        ## How this works:\n", "\n", "        1. **Web Search**: Uses DuckDuckGo to find current information\n", "        2. **Content Extraction**: Scrapes and cleans web pages\n", "        3. **Jan v1 Analysis**: 4B parameter model analyzes all sources\n", "        4.
**Source Citations**: Like Perplexity, shows all sources used\n", "\n", "        ## Advantages over Perplexity:\n", "\n", "        - ✅ **100% Free** (vs $20/month)\n", "        - ✅ **No rate limits** (vs 5 queries/hour free)\n", "        - ✅ **Full control** over model and parameters\n", "        - ✅ **Privacy** (runs in your Colab)\n", "\n", "        ## Technical specs:\n", "\n", "        - **Model**: Jan v1 (4.02B parameters, 91.1% SimpleQA accuracy)\n", "        - **Search**: DuckDuckGo (via the `duckduckgo-search` library)\n", "        - **GPU**: Google Colab T4 (16GB VRAM)\n", "        - **Framework**: Transformers + Gradio\n", "        \"\"\")\n", "\n", "# Launch the interface\n", "# Note: debug=True keeps this cell running in Colab; stop it before running later cells\n", "demo.launch(share=True, debug=True)\n", "\n", "print(\"🎉 Jan App Complete is now running!\")\n", "print(\"🔗 Share your link with others - it works for 72 hours!\")" ], "metadata": { "id": "interface" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 🧪 6. Test the Complete System" ], "metadata": { "id": "test" } }, { "cell_type": "code", "source": [ "# Test the complete Jan App\n", "test_query = \"What are the recent developments in AI safety research?\"\n", "\n", "print(f\"🧪 Testing with query: {test_query}\")\n", "print(\"\\n\" + \"=\"*60 + \"\\n\")\n", "\n", "result = jan_app.research_with_sources(test_query, num_sources=3)\n", "print(result)" ], "metadata": { "id": "test_system" }, "execution_count": null, "outputs": [] } ] }