---
title: Vietnamese Legal Chatbot
emoji: ⚖️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
---

# **Vietnamese Legal Chatbot** 🏛️⚖️

A Retrieval-Augmented Generation (RAG) system designed to answer legal questions in Vietnamese, with responses grounded in Vietnamese legal documents.

[![Demo](https://img.shields.io/badge/🚀-Live%20Demo-blue)](https://huggingface.co/spaces/fisherman611/vietnamese-legal-chatbot)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)](https://python.org)

## **Features**

- **Advanced RAG Architecture** - Combines vector search, BM25, and cross-encoder reranking for document retrieval
- **Hybrid Search** - Uses both semantic similarity (vector search) and keyword matching (BM25) to find relevant documents
- **Question Refinement** - Automatically refines the user's question before retrieval to improve query understanding
- **Cross-Encoder Reranking** - Reranks retrieved documents with a cross-encoder to improve retrieval accuracy
- **Fallback Mechanisms** - Integrates with Google Search to provide answers when the legal documents are insufficient
- **Vietnamese-Optimized** - Specifically designed for Vietnamese language processing and legal terminology

## **Dataset**

The system uses the [[Zalo-AI-2021] Legal Text Retrieval](https://www.kaggle.com/datasets/hariwh0/zaloai2021-legal-text-retrieval/data) dataset.
Please download it and restructure it to match the following layout:

```bash
├── data/
│   ├── train/
│   │   ├── train_question_answer.json
│   │   └── train_qna.csv
│   ├── test/
│   │   ├── public_test_question.json
│   │   └── public_test_sample_submission.json
│   ├── corpus/
│   │   ├── legal_corpus_legend.csv
│   │   ├── legal_corpus_splitted.csv
│   │   ├── legal_corpus_original.csv
│   │   ├── legal_corpus_merged_u369.csv
│   │   ├── legal_corpus_merged_u256.csv
│   │   ├── legal_corpus_hashmap.csv
│   │   └── legal_corpus.json
│   └── utils/
│       └── stopwords.txt
```

## **Architecture**

The system follows a RAG architecture with three primary layers (retrieval, reranking, and generation):

```mermaid
flowchart LR
    %% Input Layer
    Query["🔍 User Query"] ==> QR["📝 Question Refiner"]
    QR ==> TP["⚙️ Text Processor"]

    %% Data Sources
    DOCS[("📚 Legal Documents<br/>Knowledge Base")] ==> TP

    %% Retrieval Layer
    subgraph retrieval["🔎 Retrieval Layer"]
        direction LR
        VS["🎯 Vector Store<br/>(Qdrant)<br/>Semantic Search"]
        BM25["📊 BM25 Retriever<br/>Keyword Search"]
        Hybrid["⚡ Hybrid Search<br/>Score Combination"]
        VS ==> Hybrid
        BM25 ==> Hybrid
    end

    %% Reranking Layer
    subgraph reranking["🏆 Reranking Layer"]
        direction LR
        RR["🧠 Cross-Encoder<br/>Reranker<br/>Deep Relevance"]
        SF["🔢 Score Fusion<br/>Final Ranking"]
        RR ==> SF
    end

    %% Generation Layer
    subgraph generation["✨ Generation Layer"]
        direction LR
        CT["📋 Context Builder<br/>Prompt Assembly"]
        LLM["🤖 LLM<br/>(Gemini)<br/>Response Generation"]
        CT ==> LLM
    end

    %% Main flow connections
    TP ==> VS
    TP ==> BM25
    Hybrid ==> RR
    SF ==> CT
    LLM ==> Response["📤 Final Response"]

    %% Fallback System
    Hybrid -.->|"⚠️ Insufficient Information"| FB["🔄 Fallback Handler"]
    FB ==> GS["🌐 Google Search API<br/>External Knowledge"]
    GS ==> CT

    %% External Data Stores
    VDB[("💾 Vector Database<br/>Embeddings Storage")] <==> VS
    BM25DB[("📇 BM25 Index<br/>Inverted Index")] <==> BM25

    %% Enhanced Styling
    classDef inputNode fill:#2d3748,stroke:#4299e1,stroke-width:3px,color:#ffffff,font-weight:bold
    classDef processNode fill:#1a365d,stroke:#63b3ed,stroke-width:2px,color:#ffffff
    classDef retrievalNode fill:#065f46,stroke:#10b981,stroke-width:2px,color:#ffffff
    classDef rerankNode fill:#7c2d12,stroke:#f97316,stroke-width:2px,color:#ffffff
    classDef generationNode fill:#581c87,stroke:#a855f7,stroke-width:2px,color:#ffffff
    classDef fallbackNode fill:#be123c,stroke:#f43f5e,stroke-width:2px,color:#ffffff
    classDef dataNode fill:#365314,stroke:#84cc16,stroke-width:2px,color:#ffffff
    classDef outputNode fill:#1e293b,stroke:#06b6d4,stroke-width:3px,color:#ffffff,font-weight:bold

    %% Apply styles
    class Query,QR inputNode
    class TP processNode
    class VS,BM25,Hybrid retrievalNode
    class RR,SF rerankNode
    class CT,LLM generationNode
    class FB,GS fallbackNode
    class DOCS,VDB,BM25DB dataNode
    class Response outputNode

    %% Subgraph styling
    classDef subgraphStyle fill:#f8fafc,stroke:#334155,stroke-width:2px
    class retrieval,reranking,generation subgraphStyle
```

### Retrieval Layer

- **Vector Store (Qdrant)** - Semantic search using dense vector embeddings
- **BM25 Retriever** - Statistical keyword-based search
- **Hybrid Search** - Combines and deduplicates results from both retrieval methods

### Reranking Layer

- **Cross-Encoder Reranker** - Scores each query-document pair for relevance
- **Score Fusion** - Combines the original retrieval scores with the reranker scores to produce the final ranking

### Generation Layer

- **Context Builder** - Formats retrieved documents into a prompt context
- **LLM (Gemini)** - Generates natural language responses based on the retrieved context

## **Results**

| Method | MRR | Coverage | R@1 | R@10 | R@20 | MAP@20 |
|--------|-----|----------|-----|------|------|--------|
| **Hybrid + Reranking** | **0.6082** | **88.99%** | **48.2%** | **82.7%** | **88.4%** | **62.4%** |
| Hybrid (BM25 + Vector) | 0.5801 | 88.20% | 43.1% | 83.3% | 87.5% | 59.2% |
| BM25 Only | 0.5545 | 78.94% | 43.0% | 76.8% | 78.3% | 56.5% |
| Vector Only | 0.4691 | 68.09% | 36.4% | 66.6% | 67.3% | 47.1% |

*Evaluation conducted on `train_qna.csv`.*

## **Installation**

1. Clone the repository:

   ```bash
   git clone https://github.com/fisherman611/vietnamese-legal-chatbot.git
   cd vietnamese-legal-chatbot
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure environment variables:

   ```bash
   # Create .env file with your API keys
   GOOGLE_API_KEY=your_google_api_key
   QDRANT_URL=your_qdrant_url  # Optional for cloud deployment
   QDRANT_API_KEY=your_qdrant_api_key  # Optional for cloud deployment
   ```

4. Run the setup script:

   ```bash
   python setup_system.py
   ```

5. Launch the application:

   ```bash
   python app.py
   ```

## **References**

[1] T. N. Ba, V. D. The, T. P. Quang, and T. T. Van. Vietnamese legal information retrieval in question-answering system, 2024. URL https://arxiv.org/abs/2409.13699.

[2] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. URL https://arxiv.org/abs/2005.11401.

[3] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang. Retrieval-augmented generation for large language models: A survey, 2024. URL https://arxiv.org/abs/2312.10997.

[4] J. Rayo, R. de la Rosa, and M. Garrido. A hybrid approach to information retrieval and answer generation for regulatory texts, 2025. URL https://arxiv.org/abs/2502.16767.

[5] [BM25 retriever](https://python.langchain.com/docs/integrations/retrievers/bm25/)

[6] [Qdrant Vector Database](https://qdrant.tech/documentation/)

## **License**

This project is licensed under the [MIT License](LICENSE).
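
## **Example: Hybrid Score Combination**

The Hybrid Search step in the Retrieval Layer combines and deduplicates results from the vector store and the BM25 retriever. This README does not show how the repository does that, so the following is only a minimal sketch under stated assumptions: scores are min-max normalized per retriever and merged with a weighted sum. The function names (`min_max_normalize`, `hybrid_combine`), the `alpha` weight, and the document IDs are hypothetical and are not taken from the project's code.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps every entry to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_combine(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    alpha: float = 0.5,  # assumed weight on the vector-search score
) -> list[tuple[str, float]]:
    """Merge two retrievers' results into one deduplicated, ranked list.

    Assumption: a weighted sum of min-max normalized scores. A document
    found by only one retriever contributes 0.0 for the other.
    """
    vec = min_max_normalize(vector_scores)
    bm25 = min_max_normalize(bm25_scores)
    combined = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * bm25.get(doc_id, 0.0)
        for doc_id in set(vec) | set(bm25)  # the union deduplicates document IDs
    }
    # Highest combined score first; ties are broken by document ID for stability.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical scores: cosine similarities and raw BM25 scores.
    vector_hits = {"doc_a": 0.9, "doc_b": 0.5, "doc_c": 0.1}
    bm25_hits = {"doc_b": 12.0, "doc_d": 4.0}
    for doc_id, score in hybrid_combine(vector_hits, bm25_hits):
        print(doc_id, round(score, 3))
```

Normalizing before combining matters because BM25 scores are unbounded while similarity scores usually fall in a fixed range, so a raw sum would let one retriever dominate. Other fusion schemes, such as reciprocal rank fusion, would be equally plausible here.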