---
title: Vietnamese Legal Chatbot
emoji: ⚖️
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
---

# **Vietnamese Legal Chatbot** 🏛️⚖️

A Retrieval-Augmented Generation (RAG) system designed to answer legal questions in Vietnamese, providing accurate and contextually relevant responses based on Vietnamese legal documents.

[![Demo](https://img.shields.io/badge/🚀-Live%20Demo-blue)](https://huggingface.co/spaces/fisherman611/vietnamese-legal-chatbot)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)](https://python.org)

## **Features**

- **Advanced RAG Architecture** - Combines vector search, BM25, and cross-encoder reranking for optimal document retrieval
- **Hybrid Search** - Uses both semantic similarity (vector search) and keyword matching (BM25) to find relevant documents
- **Question Refinement** - Improves query understanding through automatic question refinement
- **Cross-Encoder Reranking** - Employs cross-encoder reranking to improve the accuracy of retrieved documents
- **Fallback Mechanisms** - Integrates with Google Search to provide answers when legal documents are insufficient
- **Vietnamese-Optimized** - Specifically designed for Vietnamese language processing and legal terminology
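
For readers unfamiliar with the keyword side of hybrid search, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is illustrative only: the function name `bm25_scores`, the parameters `k1` and `b`, and the whitespace-level tokens are assumptions, not the project's actual retriever, which may rely on a library and on proper Vietnamese word segmentation.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_tokens: List[str], corpus_tokens: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization penalizes long documents.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


corpus = [
    ["hợp", "đồng", "lao", "động"],
    ["thuế", "thu", "nhập"],
    ["hợp", "đồng", "thuê", "nhà"],
]
print(bm25_scores(["lao", "động"], corpus))
```

Only the first document contains the query terms, so it is the only one with a non-zero score.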

## **Dataset**
The dataset used is from the [[Zalo-AI-2021] Legal Text Retrieval](https://www.kaggle.com/datasets/hariwh0/zaloai2021-legal-text-retrieval/data) dataset. Please download and restructure it to match the following format:
```bash
├── data/
│   ├── train/
│   │   ├── train_question_answer.json
│   │   └── train_qna.csv
│   ├── test/
│   │   ├── public_test_question.json
│   │   └── public_test_sample_submission.json
│   ├── corpus/
│   │   ├── legal_corpus_legend.csv
│   │   ├── legal_corpus_splitted.csv
│   │   ├── legal_corpus_original.csv
│   │   ├── legal_corpus_merged_u369.csv
│   │   ├── legal_corpus_merged_u256.csv
│   │   ├── legal_corpus_hashmap.csv
│   │   └── legal_corpus.json
│   └── utils/
│       └── stopwords.txt
```

## **Architecture**

The system follows a modern RAG architecture with three primary layers:

```mermaid
flowchart LR
    %% Input Layer
    Query["🔍 User Query"] ==> QR["📝 Question Refiner"]
    QR ==> TP["⚙️ Text Processor"]
    
    %% Data Sources
    DOCS[("📚 Legal Documents<br/>Knowledge Base")] ==> TP
    
    %% Retrieval Layer
    subgraph retrieval["🔎 Retrieval Layer"]
        direction LR
        VS["🎯 Vector Store<br/>(Qdrant)<br/>Semantic Search"] 
        BM25["📊 BM25 Retriever<br/>Keyword Search"]
        Hybrid["⚡ Hybrid Search<br/>Score Combination"]
        VS ==> Hybrid
        BM25 ==> Hybrid
    end
    
    %% Reranking Layer
    subgraph reranking["🏆 Reranking Layer"]
        direction LR
        RR["🧠 Cross-Encoder<br/>Reranker<br/>Deep Relevance"]
        SF["🔢 Score Fusion<br/>Final Ranking"]
        RR ==> SF
    end
    
    %% Generation Layer
    subgraph generation["✨ Generation Layer"]
        direction LR
        CT["📋 Context Builder<br/>Prompt Assembly"]
        LLM["🤖 LLM<br/>(Gemini)<br/>Response Generation"]
        CT ==> LLM
    end
    
    %% Main flow connections
    TP ==> VS
    TP ==> BM25
    Hybrid ==> RR
    SF ==> CT
    LLM ==> Response["📤 Final Response"]
    
    %% Fallback System
    Hybrid -.->|"⚠️ Insufficient Information"| FB["🔄 Fallback Handler"]
    FB ==> GS["🌐 Google Search API<br/>External Knowledge"]
    GS ==> CT
    
    %% External Data Stores
    VDB[("💾 Vector Database<br/>Embeddings Storage")] <==> VS
    BM25DB[("📇 BM25 Index<br/>Inverted Index")] <==> BM25
    
    %% Enhanced Styling
    classDef inputNode fill:#2d3748,stroke:#4299e1,stroke-width:3px,color:#ffffff,font-weight:bold
    classDef processNode fill:#1a365d,stroke:#63b3ed,stroke-width:2px,color:#ffffff
    classDef retrievalNode fill:#065f46,stroke:#10b981,stroke-width:2px,color:#ffffff
    classDef rerankNode fill:#7c2d12,stroke:#f97316,stroke-width:2px,color:#ffffff
    classDef generationNode fill:#581c87,stroke:#a855f7,stroke-width:2px,color:#ffffff
    classDef fallbackNode fill:#be123c,stroke:#f43f5e,stroke-width:2px,color:#ffffff
    classDef dataNode fill:#365314,stroke:#84cc16,stroke-width:2px,color:#ffffff
    classDef outputNode fill:#1e293b,stroke:#06b6d4,stroke-width:3px,color:#ffffff,font-weight:bold
    
    %% Apply styles
    class Query,QR inputNode
    class TP processNode
    class VS,BM25,Hybrid retrievalNode
    class RR,SF rerankNode
    class CT,LLM generationNode
    class FB,GS fallbackNode
    class DOCS,VDB,BM25DB dataNode
    class Response outputNode
    
    %% Subgraph styling
    classDef subgraphStyle fill:#f8fafc,stroke:#334155,stroke-width:2px
    class retrieval,reranking,generation subgraphStyle
```
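
The dashed fallback edge in the diagram can be sketched as a simple confidence check. This README does not state how "insufficient information" is detected, so the threshold rule below, along with the names `needs_fallback`, `select_sources`, `min_score`, and `min_docs`, is purely hypothetical.

```python
from typing import Callable, List, Tuple

ScoredDoc = Tuple[str, float]  # (doc_id, relevance score assumed to lie in [0, 1])


def needs_fallback(results: List[ScoredDoc], min_score: float = 0.3,
                   min_docs: int = 1) -> bool:
    """Return True when too few retrieved documents clear the threshold."""
    confident = [doc_id for doc_id, score in results if score >= min_score]
    return len(confident) < min_docs


def select_sources(results: List[ScoredDoc],
                   web_search: Callable[[], List[str]]) -> Tuple[str, List[str]]:
    """Pick the legal corpus when retrieval looks confident, else call the fallback."""
    if needs_fallback(results):
        # `web_search` stands in for whatever external search client is used.
        return "web", web_search()
    return "corpus", [doc_id for doc_id, _ in results]


print(select_sources([("doc-1", 0.12)], web_search=lambda: ["external snippet"]))
```

Passing the search function as a callable keeps the decision logic testable without network access.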

### Retrieval Layer
- **Vector Store (Qdrant)** - Semantic search using dense vector embeddings
- **BM25 Retriever** - Statistical keyword-based search
- **Hybrid Search** - Combines and deduplicates results from both retrieval methods
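
One common way to combine and deduplicate the two result sets is to normalize each score list and take a weighted sum over the union of document IDs. The sketch below assumes that approach; the weight `alpha` and the function names are illustrative, and the project's actual combination rule may differ.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_search(vector_scores: Dict[str, float], bm25_scores: Dict[str, float],
                  alpha: float = 0.5, top_k: int = 10) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a document missing from one list gets 0 there."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(bm25_scores)
    doc_ids = set(vec) | set(kw)  # the union removes duplicates across retrievers
    combined = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1.0 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Sort by score descending, then by ID so ties are deterministic.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


results = hybrid_search(
    {"a": 0.9, "b": 0.7, "c": 0.5},
    {"b": 12.0, "c": 4.0, "d": 8.0},
)
print([doc_id for doc_id, _ in results])
```

Document `b` ranks first because both retrievers return it with a high score.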

### Reranking Layer
- **Cross-Encoder Reranker** - Scores each query-document pair jointly for relevance
- **Score Fusion** - Combines the original retrieval scores with the reranker scores to produce the final ranking
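
As an illustration of score fusion, the sketch below squashes raw cross-encoder logits with a sigmoid and blends them with the retrieval score. The blending weight `rerank_weight` and the function names are assumptions; the actual fusion formula used by this project is not specified here.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fuse_scores(candidates: List[Tuple[str, float]], rerank_logits: Dict[str, float],
                rerank_weight: float = 0.7) -> List[Tuple[str, float]]:
    """Blend retrieval scores (assumed in [0, 1]) with cross-encoder logits."""
    fused = []
    for doc_id, retrieval_score in candidates:
        rerank_score = sigmoid(rerank_logits[doc_id])  # map the logit to (0, 1)
        final = rerank_weight * rerank_score + (1.0 - rerank_weight) * retrieval_score
        fused.append((doc_id, final))
    # Highest fused score first; ties broken by ID.
    return sorted(fused, key=lambda item: (-item[1], item[0]))


ranking = fuse_scores([("a", 0.9), ("b", 0.6)], {"a": -2.0, "b": 3.0})
print([doc_id for doc_id, _ in ranking])
```

Here the reranker overrides the retrieval order: `b` moves ahead of `a` because its cross-encoder logit is much higher.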

### Generation Layer
- **Context Builder** - Formats retrieved documents into a prompt context
- **LLM (Gemini)** - Generates natural language responses based on the retrieved context
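
Prompt assembly can be sketched as numbering the retrieved documents and trimming them to a character budget before appending the question. The document fields `title` and `text`, the `max_chars` budget, and the English prompt wording are all hypothetical; the deployed prompt is likely written in Vietnamese.

```python
from typing import Dict, List


def build_context(docs: List[Dict[str, str]], max_chars: int = 4000) -> str:
    """Join documents as numbered blocks, stopping before the budget is exceeded."""
    parts = []
    used = 0
    for index, doc in enumerate(docs, start=1):
        block = f"[{index}] {doc['title']}\n{doc['text']}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


def build_prompt(question: str, docs: List[Dict[str, str]]) -> str:
    """Assemble the final prompt that would be sent to the LLM."""
    context = build_context(docs)
    return (
        "Answer the legal question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


docs = [
    {"title": "Article 1", "text": "Scope of regulation."},
    {"title": "Article 2", "text": "Subjects of application."},
]
print(build_prompt("Who does this law apply to?", docs))
```

Numbering the blocks lets the model cite a specific article, which makes answers easier to verify against the corpus.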

## **Results**

| Method | MRR | Coverage | R@1 | R@10 | R@20 | MAP@20 |
|--------|-----|----------|-----|------|------|--------|
| **Hybrid + Reranking** | **0.6082** | **88.99%** | **48.2%** | **82.7%** | **88.4%** | **62.4%** |
| Hybrid (BM25 + Vector) | 0.5801 | 88.20% | 43.1% | 83.3% | 87.5% | 59.2% |
| BM25 Only | 0.5545 | 78.94% | 43.0% | 76.8% | 78.3% | 56.5% |
| Vector Only | 0.4691 | 68.09% | 36.4% | 66.6% | 67.3% | 47.1% |

*Evaluation conducted on `train_qna.csv`*
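
For reference, MRR and recall at a cutoff can be computed as in the sketch below. It uses the standard textbook definitions; the exact definitions behind the table (for example, whether R@k counts any hit or the fraction of relevant documents found) are not stated here, so treat this as an assumption.

```python
from typing import Iterable, List, Set, Tuple


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def evaluate(runs: Iterable[Tuple[List[str], Set[str]]], k: int) -> Tuple[float, float]:
    """Average both metrics over (ranking, relevant set) pairs, one per question."""
    runs = list(runs)
    if not runs:
        return 0.0, 0.0
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
    recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
    return mrr, recall


runs = [
    (["a", "b", "c"], {"b"}),
    (["x", "y"], {"x", "z"}),
]
print(evaluate(runs, k=1))
```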

## **Installation**

1. Clone the repository:
```bash
git clone https://github.com/fisherman611/vietnamese-legal-chatbot.git
cd vietnamese-legal-chatbot
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Configure environment variables:
```bash
# Create .env file with your API keys
GOOGLE_API_KEY=your_google_api_key
QDRANT_URL=your_qdrant_url  # Optional for cloud deployment
QDRANT_API_KEY=your_qdrant_api_key  # Optional for cloud deployment
```

4. Run the setup script:
```bash
python setup_system.py
```

5. Launch the application:
```bash
python app.py
```
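
The environment variables from step 3 could be validated at startup along the lines of the sketch below. It reads from the process environment only, so a `.env` file would need to be loaded first by whatever mechanism the project uses. The `Settings` class and `load_settings` function are hypothetical names, not part of this repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    qdrant_url: Optional[str] = None      # optional, for cloud deployment
    qdrant_api_key: Optional[str] = None  # optional, for cloud deployment


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail fast when the required key is missing; treat Qdrant values as optional."""
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        raise ValueError("GOOGLE_API_KEY is not set")
    return Settings(
        google_api_key=api_key,
        qdrant_url=env.get("QDRANT_URL") or None,
        qdrant_api_key=env.get("QDRANT_API_KEY") or None,
    )


settings = load_settings({"GOOGLE_API_KEY": "your_google_api_key"})
print(settings.qdrant_url is None)
```

Accepting any mapping instead of reading `os.environ` directly makes the check easy to exercise in tests.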

## **References**
[1] T. N. Ba, V. D. The, T. P. Quang, and T. T. Van. Vietnamese legal information retrieval in question-answering system, 2024. URL https://arxiv.org/abs/2409.13699.

[2] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. URL https://arxiv.org/abs/2005.11401.

[3] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang. Retrieval-augmented generation for large language models: A survey, 2024. URL https://arxiv.org/abs/2312.10997.

[4] J. Rayo, R. de la Rosa, and M. Garrido. A hybrid approach to information retrieval and answer generation for regulatory texts, 2025. URL https://arxiv.org/abs/2502.16767.

[5] [BM25 retriever](https://python.langchain.com/docs/integrations/retrievers/bm25/)

[6] [Qdrant Vector Database](https://qdrant.tech/documentation/)

## **License**
This project is licensed under the [MIT License](LICENSE).