---
title: Vietnamese Legal Chatbot
emoji: ⚖️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
---
# **Vietnamese Legal Chatbot** 🏛️⚖️
A Retrieval-Augmented Generation (RAG) system that answers legal questions in Vietnamese, grounding each response in passages retrieved from Vietnamese legal documents.
[Hugging Face Space](https://huggingface.co/spaces/fisherman611/vietnamese-legal-chatbot) · [License](LICENSE) · [Python](https://python.org)
## **Features**
- **Advanced RAG Architecture** - Combines vector search, BM25, and cross-encoder reranking for document retrieval
- **Hybrid Search** - Uses both semantic similarity (vector search) and keyword matching (BM25) to find relevant documents
- **Question Refinement** - Automatically refines the user's question before retrieval to improve query understanding
- **Cross-Encoder Reranking** - Rescores retrieved documents with a cross-encoder to improve ranking accuracy
- **Fallback Mechanisms** - Falls back to Google Search when the legal documents do not contain enough information
- **Vietnamese-Optimized** - Designed specifically for Vietnamese language processing and legal terminology
## **Dataset**
This project uses the [[Zalo-AI-2021] Legal Text Retrieval](https://www.kaggle.com/datasets/hariwh0/zaloai2021-legal-text-retrieval/data) dataset. Download it and arrange the files in the following layout:
```bash
├── data/
│   ├── train/
│   │   ├── train_question_answer.json
│   │   └── train_qna.csv
│   ├── test/
│   │   ├── public_test_question.json
│   │   └── public_test_sample_submission.json
│   ├── corpus/
│   │   ├── legal_corpus_legend.csv
│   │   ├── legal_corpus_splitted.csv
│   │   ├── legal_corpus_original.csv
│   │   ├── legal_corpus_merged_u369.csv
│   │   ├── legal_corpus_merged_u256.csv
│   │   ├── legal_corpus_hashmap.csv
│   │   └── legal_corpus.json
│   └── utils/
│       └── stopwords.txt
```
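Before running the setup script, it can help to confirm that the files are where the tree above says they should be. The following is a minimal sketch, not part of the repository: the function name `missing_files` and the default `data` directory are assumptions made for illustration, and the file list is copied from the layout above.

```python
from pathlib import Path

# Relative paths copied from the directory tree in this README.
EXPECTED_FILES = [
    "train/train_question_answer.json",
    "train/train_qna.csv",
    "test/public_test_question.json",
    "test/public_test_sample_submission.json",
    "corpus/legal_corpus_legend.csv",
    "corpus/legal_corpus_splitted.csv",
    "corpus/legal_corpus_original.csv",
    "corpus/legal_corpus_merged_u369.csv",
    "corpus/legal_corpus_merged_u256.csv",
    "corpus/legal_corpus_hashmap.csv",
    "corpus/legal_corpus.json",
    "utils/stopwords.txt",
]


def missing_files(data_dir):
    """Return the expected files that are absent under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # "data" is an assumed location relative to the repository root.
    missing = missing_files("data")
    if missing:
        print("Missing dataset files:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("Dataset layout looks complete.")
```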
## **Architecture**
The system follows a modern RAG architecture with three primary layers:
```mermaid
flowchart LR
%% Input Layer
Query["User Query"] ==> QR["Question Refiner"]
QR ==> TP["Text Processor"]
%% Data Sources
DOCS[("Legal Documents<br/>Knowledge Base")] ==> TP
%% Retrieval Layer
subgraph retrieval["Retrieval Layer"]
direction LR
VS["Vector Store<br/>(Qdrant)<br/>Semantic Search"]
BM25["BM25 Retriever<br/>Keyword Search"]
Hybrid["Hybrid Search<br/>Score Combination"]
VS ==> Hybrid
BM25 ==> Hybrid
end
%% Reranking Layer
subgraph reranking["Reranking Layer"]
direction LR
RR["Cross-Encoder<br/>Reranker<br/>Deep Relevance"]
SF["Score Fusion<br/>Final Ranking"]
RR ==> SF
end
%% Generation Layer
subgraph generation["Generation Layer"]
direction LR
CT["Context Builder<br/>Prompt Assembly"]
LLM["LLM<br/>(Gemini)<br/>Response Generation"]
CT ==> LLM
end
%% Main flow connections
TP ==> VS
TP ==> BM25
Hybrid ==> RR
SF ==> CT
LLM ==> Response["Final Response"]
%% Fallback System
Hybrid -.->|"Insufficient Information"| FB["Fallback Handler"]
FB ==> GS["Google Search API<br/>External Knowledge"]
GS ==> CT
%% External Data Stores
VDB[("Vector Database<br/>Embeddings Storage")] <==> VS
BM25DB[("BM25 Index<br/>Inverted Index")] <==> BM25
%% Enhanced Styling
classDef inputNode fill:#2d3748,stroke:#4299e1,stroke-width:3px,color:#ffffff,font-weight:bold
classDef processNode fill:#1a365d,stroke:#63b3ed,stroke-width:2px,color:#ffffff
classDef retrievalNode fill:#065f46,stroke:#10b981,stroke-width:2px,color:#ffffff
classDef rerankNode fill:#7c2d12,stroke:#f97316,stroke-width:2px,color:#ffffff
classDef generationNode fill:#581c87,stroke:#a855f7,stroke-width:2px,color:#ffffff
classDef fallbackNode fill:#be123c,stroke:#f43f5e,stroke-width:2px,color:#ffffff
classDef dataNode fill:#365314,stroke:#84cc16,stroke-width:2px,color:#ffffff
classDef outputNode fill:#1e293b,stroke:#06b6d4,stroke-width:3px,color:#ffffff,font-weight:bold
%% Apply styles
class Query,QR inputNode
class TP processNode
class VS,BM25,Hybrid retrievalNode
class RR,SF rerankNode
class CT,LLM generationNode
class FB,GS fallbackNode
class DOCS,VDB,BM25DB dataNode
class Response outputNode
%% Subgraph styling
classDef subgraphStyle fill:#f8fafc,stroke:#334155,stroke-width:2px
class retrieval,reranking,generation subgraphStyle
```
### Retrieval Layer
- **Vector Store (Qdrant)** - Semantic search using dense vector embeddings
- **BM25 Retriever** - Statistical keyword-based search
- **Hybrid Search** - Combines and deduplicates results from both retrieval methods
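
The hybrid step can be sketched roughly as follows. This README does not state how the two score lists are combined, so the min-max normalization, the weighted sum, and the names `min_max_normalize`, `hybrid_merge`, and `vector_weight` are all assumptions made for illustration. Keying both result sets by document ID is what removes duplicates.

```python
def min_max_normalize(scores):
    """Scale a {doc_id: score} mapping to [0, 1]; constant inputs map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_merge(vector_scores, bm25_scores, vector_weight=0.5, top_k=10):
    """Merge two {doc_id: score} mappings into one ranked list of (doc_id, score).

    Assumed scheme: normalize each retriever's scores separately, then take a
    weighted sum. A document found by only one retriever scores 0 for the other.
    """
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(bm25_scores)
    combined = {}
    for doc_id in set(vec) | set(kw):  # the union deduplicates shared documents
        combined[doc_id] = (
            vector_weight * vec.get(doc_id, 0.0)
            + (1.0 - vector_weight) * kw.get(doc_id, 0.0)
        )
    # Sort by descending score; break ties by document ID for stable output.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    vector_hits = {"a": 1.0, "b": 0.5, "c": 0.0}
    bm25_hits = {"b": 12.0, "d": 4.0}
    for doc_id, score in hybrid_merge(vector_hits, bm25_hits):
        print(doc_id, round(score, 3))
```

Normalizing first matters because BM25 scores are unbounded while similarity scores from a vector store usually fall in a fixed range, so a raw sum would let one retriever dominate.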
### Reranking Layer
- **Cross-Encoder Reranker** - Scores each query-document pair jointly for relevance
- **Score Fusion** - Combines the original retrieval scores with the reranker scores to produce the final ranking
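
One way score fusion could work is sketched below. The formula is an assumption, not taken from the repository: the reranker's raw output is squashed with a sigmoid and blended with the retrieval score using a hypothetical `rerank_weight`. The cross-encoder itself is left out, so `rerank_logits` stands in for whatever scores the reranking model returns.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping any real number to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fuse_scores(candidates, rerank_logits, rerank_weight=0.7):
    """Blend retrieval scores with reranker scores and re-sort.

    candidates: list of (doc_id, retrieval_score), with scores assumed in [0, 1].
    rerank_logits: raw reranker outputs, in the same order as candidates.
    """
    if len(candidates) != len(rerank_logits):
        raise ValueError("candidates and rerank_logits must have the same length")
    fused = []
    for (doc_id, retrieval_score), logit in zip(candidates, rerank_logits):
        score = rerank_weight * sigmoid(logit) + (1.0 - rerank_weight) * retrieval_score
        fused.append((doc_id, score))
    fused.sort(key=lambda item: item[1], reverse=True)
    return fused


if __name__ == "__main__":
    hits = [("a", 0.9), ("b", 0.6)]
    # The reranker prefers "b" even though "a" scored higher at retrieval time.
    print(fuse_scores(hits, [-2.0, 3.0]))
```

Keeping part of the retrieval score acts as a tie-breaker when the reranker is unsure about several candidates.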
### Generation Layer
- **Context Builder** - Formats retrieved documents into a prompt context
- **LLM (Gemini)** - Generates natural language responses based on the retrieved context
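
A context builder along these lines could assemble the prompt. This is a sketch under stated assumptions: the document schema (`title` and `text` keys), the character budget `max_chars`, and the English instruction text are all hypothetical, and the actual prompt used by the application may differ. The call to the LLM is omitted.

```python
def build_prompt(question, documents, max_chars=4000):
    """Format retrieved documents and the question into a single prompt string.

    documents: list of dicts with 'title' and 'text' keys (assumed schema),
    ordered from most to least relevant. Documents are added until the
    character budget would be exceeded.
    """
    blocks = []
    used = 0
    for index, doc in enumerate(documents, start=1):
        block = f"[{index}] {doc['title']}\n{doc['text']}"
        if used + len(block) > max_chars:
            break  # stop before overflowing the context budget
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        "Answer the legal question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    docs = [
        {"title": "Article 1", "text": "Example provision text."},
        {"title": "Article 2", "text": "Another example provision."},
    ]
    print(build_prompt("What does Article 1 say?", docs))
```

Numbering the passages lets the generated answer point back to a specific source, which makes the response easier to verify.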
## **Results**
| Method | MRR | Coverage | R@1 | R@10 | R@20 | MAP@20 |
|--------|-----|----------|-----|------|------|--------|
| **Hybrid + Reranking** | **0.6082** | **88.99%** | **48.2%** | **82.7%** | **88.4%** | **62.4%** |
| Hybrid (BM25 + Vector) | 0.5801 | 88.20% | 43.1% | 83.3% | 87.5% | 59.2% |
| BM25 Only | 0.5545 | 78.94% | 43.0% | 76.8% | 78.3% | 56.5% |
| Vector Only | 0.4691 | 68.09% | 36.4% | 66.6% | 67.3% | 47.1% |
*Evaluation conducted on `train_qna.csv`*
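
For readers who want to reproduce numbers of this kind, the sketch below shows one common way to compute MRR and recall at k. This README does not spell out its exact metric definitions, so these are assumptions: MRR uses the rank of the first relevant document, and R@k is the fraction of a question's relevant documents found in the top k, averaged over questions. The function names are illustrative.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(runs, k_values=(1, 10, 20)):
    """Average the metrics over runs, a list of (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("runs must contain at least one question")
    n = len(runs)
    report = {"MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / n}
    for k in k_values:
        report[f"R@{k}"] = sum(recall_at_k(r, rel, k) for r, rel in runs) / n
    return report


if __name__ == "__main__":
    toy_runs = [
        (["a", "b", "c"], {"b"}),      # first relevant hit at rank 2
        (["x", "y"], {"x", "z"}),      # one of two relevant documents retrieved
    ]
    print(evaluate(toy_runs))
```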
## **Installation**
1. Clone the repository:
```bash
git clone https://github.com/fisherman611/vietnamese-legal-chatbot.git
cd vietnamese-legal-chatbot
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Configure environment variables:
```bash
# Create .env file with your API keys
GOOGLE_API_KEY=your_google_api_key
QDRANT_URL=your_qdrant_url # Optional for cloud deployment
QDRANT_API_KEY=your_qdrant_api_key # Optional for cloud deployment
```
4. Run the setup script:
```bash
python setup_system.py
```
5. Launch the application:
```bash
python app.py
```
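
If the application fails at startup, a quick check of the variables from step 3 can narrow down the cause. The sketch below is illustrative and not part of the repository. It only inspects the process environment, so it assumes the values from `.env` have already been exported or loaded; treating `GOOGLE_API_KEY` as required and the two Qdrant variables as optional follows the comments in step 3.

```python
import os

# Split taken from the comments in the .env example above.
REQUIRED_VARS = ("GOOGLE_API_KEY",)
OPTIONAL_VARS = ("QDRANT_URL", "QDRANT_API_KEY")


def check_environment(env=None):
    """Return (missing_required, unset_optional) variable names.

    env defaults to os.environ; passing a dict makes the check easy to test.
    Empty strings count as unset.
    """
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED_VARS if not env.get(name)]
    unset_optional = [name for name in OPTIONAL_VARS if not env.get(name)]
    return missing_required, unset_optional


if __name__ == "__main__":
    missing, optional = check_environment()
    if missing:
        print("Missing required variables:", ", ".join(missing))
    if optional:
        print("Optional variables not set:", ", ".join(optional))
    if not missing:
        print("Required configuration is present.")
```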
## **References**
[1] T. N. Ba, V. D. The, T. P. Quang, and T. T. Van. Vietnamese legal information retrieval in question-answering system, 2024. URL https://arxiv.org/abs/2409.13699.
[2] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. URL https://arxiv.org/abs/2005.11401.
[3] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang. Retrieval-augmented generation for large language models: A survey, 2024. URL https://arxiv.org/abs/2312.10997.
[4] J. Rayo, R. de la Rosa, and M. Garrido. A hybrid approach to information retrieval and answer generation for regulatory texts, 2025. URL https://arxiv.org/abs/2502.16767.
[5] [BM25 retriever](https://python.langchain.com/docs/integrations/retrievers/bm25/)
[6] [Qdrant Vector Database](https://qdrant.tech/documentation/)
## **License**
This project is licensed under the [MIT License](LICENSE).