---
title: Vietnamese Legal Chatbot
emoji: ⚖️
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
---
# **Vietnamese Legal Chatbot** 🏛️⚖️
A Retrieval-Augmented Generation (RAG) system that answers legal questions in Vietnamese, grounding its responses in Vietnamese legal documents.
[Hugging Face Space](https://huggingface.co/spaces/fisherman611/vietnamese-legal-chatbot) · [License](LICENSE) · [Python](https://python.org)
## **Features**
- **Advanced RAG Architecture** - Combines vector search, BM25, and cross-encoder reranking for optimal document retrieval
- **Hybrid Search** - Uses both semantic similarity (vector search) and keyword matching (BM25) to find relevant documents
- **Question Refinement** - Improves query understanding through automatic question refinement
- **Cross-Encoder Reranking** - Rescores each retrieved document against the query to improve ranking accuracy
- **Fallback Mechanisms** - Falls back to Google Search when the retrieved legal documents are insufficient
- **Vietnamese-Optimized** - Specifically designed for Vietnamese language processing and legal terminology
## **Dataset**
This project uses the [[Zalo-AI-2021] Legal Text Retrieval](https://www.kaggle.com/datasets/hariwh0/zaloai2021-legal-text-retrieval/data) dataset. Download it and restructure it to match the following layout:
```bash
├── data/
│   ├── train/
│   │   ├── train_question_answer.json
│   │   └── train_qna.csv
│   ├── test/
│   │   ├── public_test_question.json
│   │   └── public_test_sample_submission.json
│   ├── corpus/
│   │   ├── legal_corpus_legend.csv
│   │   ├── legal_corpus_splitted.csv
│   │   ├── legal_corpus_original.csv
│   │   ├── legal_corpus_merged_u369.csv
│   │   ├── legal_corpus_merged_u256.csv
│   │   ├── legal_corpus_hashmap.csv
│   │   └── legal_corpus.json
│   └── utils/
│       └── stopwords.txt
```
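Before running the setup step, it can help to confirm that the restructured dataset matches the layout above. The following is a minimal sketch, not part of the project: the function name `missing_data_files` is hypothetical, and the file list is copied from the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "train/train_question_answer.json",
    "train/train_qna.csv",
    "test/public_test_question.json",
    "test/public_test_sample_submission.json",
    "corpus/legal_corpus_legend.csv",
    "corpus/legal_corpus_splitted.csv",
    "corpus/legal_corpus_original.csv",
    "corpus/legal_corpus_merged_u369.csv",
    "corpus/legal_corpus_merged_u256.csv",
    "corpus/legal_corpus_hashmap.csv",
    "corpus/legal_corpus.json",
    "utils/stopwords.txt",
]


def missing_data_files(data_root="data"):
    """Return the expected files that are absent under data_root (hypothetical helper)."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("Dataset layout looks complete.")
```

Run it from the repository root; an empty result means every file in the tree is in place.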
## **Architecture**
The system follows a modern RAG architecture with three primary layers:
```mermaid
flowchart LR
%% Input Layer
Query["🔍 User Query"] ==> QR["📝 Question Refiner"]
QR ==> TP["⚙️ Text Processor"]
%% Data Sources
DOCS[("📚 Legal Documents<br/>Knowledge Base")] ==> TP
%% Retrieval Layer
subgraph retrieval["🔎 Retrieval Layer"]
direction LR
VS["🎯 Vector Store<br/>(Qdrant)<br/>Semantic Search"]
BM25["📊 BM25 Retriever<br/>Keyword Search"]
Hybrid["⚡ Hybrid Search<br/>Score Combination"]
VS ==> Hybrid
BM25 ==> Hybrid
end
%% Reranking Layer
subgraph reranking["🏆 Reranking Layer"]
direction LR
RR["🧠 Cross-Encoder<br/>Reranker<br/>Deep Relevance"]
SF["🔢 Score Fusion<br/>Final Ranking"]
RR ==> SF
end
%% Generation Layer
subgraph generation["✨ Generation Layer"]
direction LR
CT["📋 Context Builder<br/>Prompt Assembly"]
LLM["🤖 LLM<br/>(Gemini)<br/>Response Generation"]
CT ==> LLM
end
%% Main flow connections
TP ==> VS
TP ==> BM25
Hybrid ==> RR
SF ==> CT
LLM ==> Response["📤 Final Response"]
%% Fallback System
Hybrid -.->|"⚠️ Insufficient Information"| FB["🔄 Fallback Handler"]
FB ==> GS["🌐 Google Search API<br/>External Knowledge"]
GS ==> CT
%% External Data Stores
VDB[("💾 Vector Database<br/>Embeddings Storage")] <==> VS
BM25DB[("📇 BM25 Index<br/>Inverted Index")] <==> BM25
%% Enhanced Styling
classDef inputNode fill:#2d3748,stroke:#4299e1,stroke-width:3px,color:#ffffff,font-weight:bold
classDef processNode fill:#1a365d,stroke:#63b3ed,stroke-width:2px,color:#ffffff
classDef retrievalNode fill:#065f46,stroke:#10b981,stroke-width:2px,color:#ffffff
classDef rerankNode fill:#7c2d12,stroke:#f97316,stroke-width:2px,color:#ffffff
classDef generationNode fill:#581c87,stroke:#a855f7,stroke-width:2px,color:#ffffff
classDef fallbackNode fill:#be123c,stroke:#f43f5e,stroke-width:2px,color:#ffffff
classDef dataNode fill:#365314,stroke:#84cc16,stroke-width:2px,color:#ffffff
classDef outputNode fill:#1e293b,stroke:#06b6d4,stroke-width:3px,color:#ffffff,font-weight:bold
%% Apply styles
class Query,QR inputNode
class TP processNode
class VS,BM25,Hybrid retrievalNode
class RR,SF rerankNode
class CT,LLM generationNode
class FB,GS fallbackNode
class DOCS,VDB,BM25DB dataNode
class Response outputNode
%% Subgraph styling
classDef subgraphStyle fill:#f8fafc,stroke:#334155,stroke-width:2px
class retrieval,reranking,generation subgraphStyle
```
### Retrieval Layer
- **Vector Store (Qdrant)** - Semantic search using dense vector embeddings
- **BM25 Retriever** - Statistical keyword-based search
- **Hybrid Search** - Combines and deduplicates results from both retrieval methods
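The combination and deduplication step can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes each retriever returns a `{doc_id: score}` mapping, uses min-max normalization so that cosine-style and BM25 scores are comparable, and blends them with a hypothetical `vector_weight` parameter.

```python
def min_max_normalize(scores):
    """Scale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_merge(vector_scores, bm25_scores, vector_weight=0.5, top_k=10):
    """Merge two result sets by document id and rank by a weighted score sum.

    vector_weight is a hypothetical tuning parameter, not a value from the project.
    """
    vec = min_max_normalize(vector_scores)
    bm25 = min_max_normalize(bm25_scores)
    combined = {}
    # Taking the union of ids deduplicates documents returned by both retrievers.
    for doc_id in set(vec) | set(bm25):
        combined[doc_id] = (
            vector_weight * vec.get(doc_id, 0.0)
            + (1.0 - vector_weight) * bm25.get(doc_id, 0.0)
        )
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Illustrative scores only; the ids and values are made up.
vector_hits = {"doc_a": 0.82, "doc_b": 0.75, "doc_c": 0.40}
bm25_hits = {"doc_b": 12.1, "doc_d": 9.3, "doc_a": 3.0}
results = hybrid_merge(vector_hits, bm25_hits)
```

In this example `doc_b` ranks first because it scores well in both retrievers, while documents found by only one retriever contribute a zero for the other.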
### Reranking Layer
- **Cross-Encoder Reranker** - Precisely scores document-query pairs for relevance
- **Score Fusion** - Combines the original retrieval scores with the reranker scores to produce the final ranking
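One common way to fuse the two scores is a weighted interpolation. The sketch below assumes the retrieval score is already in [0, 1] and that the cross-encoder emits an unbounded logit, which is squashed with a sigmoid. The dictionary keys and the `rerank_weight` value are hypothetical, and the project's actual fusion rule may differ.

```python
import math


def sigmoid(x):
    """Map an unbounded reranker logit to (0, 1) without overflowing."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fuse_scores(candidates, rerank_weight=0.7):
    """Blend retrieval and reranker scores; return candidates sorted best first.

    Each candidate is a dict with hypothetical keys "id", "retrieval_score"
    (assumed to be in [0, 1]) and "rerank_logit" (raw cross-encoder output).
    """
    fused = []
    for cand in candidates:
        final = (
            rerank_weight * sigmoid(cand["rerank_logit"])
            + (1.0 - rerank_weight) * cand["retrieval_score"]
        )
        fused.append({**cand, "final_score": final})
    return sorted(fused, key=lambda c: c["final_score"], reverse=True)


# Illustrative values only.
candidates = [
    {"id": "doc_a", "retrieval_score": 0.9, "rerank_logit": -1.0},
    {"id": "doc_b", "retrieval_score": 0.6, "rerank_logit": 2.5},
]
ranked = fuse_scores(candidates)
```

Here `doc_b` overtakes `doc_a` despite its lower retrieval score, because the reranker judges it far more relevant and carries the larger weight.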
### Generation Layer
- **Context Builder** - Formats retrieved documents into a prompt context
- **LLM (Gemini)** - Generates natural language responses based on the retrieved context
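Prompt assembly in the context builder might look like the sketch below. It is a minimal illustration: the `title` and `text` keys, the character budget, and the English instruction wording are all assumptions, and the real prompt is likely written in Vietnamese.

```python
def build_prompt(question, documents, max_chars=6000):
    """Assemble a prompt from ranked documents, stopping at a character budget.

    documents: list of dicts with hypothetical keys "title" and "text", best first.
    """
    parts = []
    used = 0
    for index, doc in enumerate(documents, start=1):
        block = f"[{index}] {doc['title']}\n{doc['text']}"
        if used + len(block) > max_chars:
            break  # Keep only the highest-ranked documents that fit the budget.
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer the legal question using only the numbered sources below. "
        "Cite sources by their number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder documents; real entries would come from the reranking layer.
docs = [
    {"title": "Document A", "text": "Text of the first retrieved article."},
    {"title": "Document B", "text": "Text of the second retrieved article."},
]
prompt = build_prompt("Example question?", docs)
```

Numbering the sources lets the model cite them, and the budget keeps the prompt within the model's context window by dropping the lowest-ranked documents first.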
## **Results**
| Method | MRR | Coverage | R@1 | R@10 | R@20 | MAP@20 |
|--------|-----|----------|-----|------|------|--------|
| **Hybrid + Reranking** | **0.6082** | **88.99%** | **48.2%** | **82.7%** | **88.4%** | **62.4%** |
| Hybrid (BM25 + Vector) | 0.5801 | 88.20% | 43.1% | 83.3% | 87.5% | 59.2% |
| BM25 Only | 0.5545 | 78.94% | 43.0% | 76.8% | 78.3% | 56.5% |
| Vector Only | 0.4691 | 68.09% | 36.4% | 66.6% | 67.3% | 47.1% |
*Evaluation conducted on `train_qna.csv`*
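For readers unfamiliar with the metrics in the table, the sketch below shows one standard way to compute MRR and Recall@k from ranked results. It is illustrative only: the project's evaluation script may define the metrics differently, and the document ids are made up.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(runs, k=10):
    """Average MRR and Recall@k over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return {"mrr": 0.0, f"recall@{k}": 0.0}
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
    recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
    return {"mrr": mrr, f"recall@{k}": recall}


runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant hit at rank 2
    (["d2", "d9", "d4"], {"d2", "d4"}),  # first relevant hit at rank 1
]
metrics = evaluate(runs, k=2)
```

For these two queries the reciprocal ranks are 0.5 and 1.0, and the top-2 recalls are 1.0 and 0.5, so both averages come to 0.75.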
## **Installation**
1. Clone the repository:
```bash
git clone https://github.com/fisherman611/vietnamese-legal-chatbot.git
cd vietnamese-legal-chatbot
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Configure environment variables:
```bash
# Create .env file with your API keys
GOOGLE_API_KEY=your_google_api_key
QDRANT_URL=your_qdrant_url # Optional for cloud deployment
QDRANT_API_KEY=your_qdrant_api_key # Optional for cloud deployment
```
4. Run the setup script:
```bash
python setup_system.py
```
5. Launch the application:
```bash
python app.py
```
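If the application fails at startup, a quick check of the variables from step 3 can narrow down the cause. The sketch below is a hypothetical helper, not part of the repository. It reads the process environment, so it assumes the `.env` values have already been loaded or exported.

```python
import os

REQUIRED_VARS = ["GOOGLE_API_KEY"]
# Marked optional in step 3; only relevant for cloud deployment.
OPTIONAL_VARS = ["QDRANT_URL", "QDRANT_API_KEY"]


def check_environment(env=None):
    """Return (missing_required, missing_optional) lists of variable names."""
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED_VARS if not env.get(name)]
    missing_optional = [name for name in OPTIONAL_VARS if not env.get(name)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_environment()
    if required:
        print("Missing required variables:", ", ".join(required))
    if optional:
        print("Optional variables not set:", ", ".join(optional))
```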
## **References**
[1] T. N. Ba, V. D. The, T. P. Quang, and T. T. Van. Vietnamese legal information retrieval in question-answering system, 2024. URL https://arxiv.org/abs/2409.13699.
[2] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. URL https://arxiv.org/abs/2005.11401.
[3] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang. Retrieval-augmented generation for large language models: A survey, 2024. URL https://arxiv.org/abs/2312.10997.
[4] J. Rayo, R. de la Rosa, and M. Garrido. A hybrid approach to information retrieval and answer generation for regulatory texts, 2025. URL https://arxiv.org/abs/2502.16767.
[5] [BM25 retriever](https://python.langchain.com/docs/integrations/retrievers/bm25/)
[6] [Qdrant Vector Database](https://qdrant.tech/documentation/)
## **License**
This project is licensed under the [MIT License](LICENSE).