nuatmochoi committed on
Commit
4efa655
·
verified ·
1 Parent(s): 9e753c7

Update README.md

Files changed (1)
  1. README.md +174 -3
README.md CHANGED
@@ -1,3 +1,174 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - ko
+ license: apache-2.0
+ tags:
+ - text2sql
+ - spider
+ - korean
+ - llama
+ - text-generation
+ - table-question-answering
+ datasets:
+ - spider
+ - huggingface-KREW/spider-ko
+ base_model: unsloth/Meta-Llama-3.1-8B-Instruct
+ model-index:
+ - name: Llama-3.1-8B-Spider-SQL-Ko
+   results:
+   - task:
+       type: text2sql
+       name: Text to SQL
+     dataset:
+       name: Spider (Korean)
+       type: text2sql
+     metrics:
+     - type: exact_match
+       value: 42.65
+     - type: execution_accuracy
+       value: 65.47
+ ---
+
+ # Llama-3.1-8B-Spider-SQL-Ko
+
+ A Text-to-SQL model that converts Korean questions into SQL queries. 🤖
+ It was fine-tuned on the train split of [spider-ko](https://huggingface.co/datasets/huggingface-KREW/spider-ko), a Korean translation of the [Spider](https://yale-lily.github.io/spider) dataset.
+
+ ## 📊 Key Performance
+
+ Evaluation results on the Spider Korean validation set (1,034 examples):
+ - **Exact match**: 42.65% (441/1034)
+ - **Execution accuracy**: 65.47% (677/1034)
+
+ > 💡 Execution accuracy is higher than exact match because a query whose SQL syntax differs from the reference often still returns the same result.
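The gap between the two metrics can be illustrated with a small sketch. It assumes a toy in-memory SQLite table and two hand-written queries (neither comes from the evaluation set), and it uses a plain normalized string comparison as a stand-in for exact match; the official Spider evaluation compares parsed SQL components rather than raw strings.

```python
import sqlite3

# Toy in-memory database; the table and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO singer (name, country, age) VALUES (?, ?, ?)",
    [("A", "Korea", 31), ("B", "France", 45), ("C", "Korea", 27)],
)

# Two queries that are written differently but ask for the same thing.
gold_sql = "SELECT count(*) FROM singer WHERE country = 'Korea'"
pred_sql = "SELECT count(singer_id) FROM singer WHERE country IN ('Korea')"


def normalize(sql: str) -> str:
    """Lowercase and collapse whitespace (a simplified stand-in for exact match)."""
    return " ".join(sql.lower().split())


def execution_match(db: sqlite3.Connection, gold: str, pred: str) -> bool:
    """Run both queries and compare their result rows, ignoring row order."""
    try:
        gold_rows = sorted(db.execute(gold).fetchall())
        pred_rows = sorted(db.execute(pred).fetchall())
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    return gold_rows == pred_rows


exact = normalize(gold_sql) == normalize(pred_sql)
executed = execution_match(conn, gold_sql, pred_sql)
print(exact, executed)
```

Here the string comparison fails while the executed results agree, which is the kind of case that counts toward execution accuracy but not exact match.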
+
+ ## 🚀 Quick Start
+
+ ```python
+ from unsloth import FastLanguageModel
+
+ # Load the model
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko",
+     max_seq_length=2048,
+     dtype=None,
+     load_in_4bit=True,
+ )
+
+ # Korean question → SQL
+ # Question: "How many singers are there?"
+ question = "가수는 몇 명이 있나요?"
+ # Schema labels: "테이블" = table, "컬럼" = columns
+ schema = """테이블: singer
+ 컬럼: singer_id, name, country, age"""
+
+ # Prompt labels: "데이터베이스 스키마" = database schema, "질문" = question
+ prompt = f"""데이터베이스 스키마:
+ {schema}
+
+ 질문: {question}
+ SQL:"""
+
+ # Result: SELECT count(*) FROM singer
+ ```
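The schema string above is written by hand. As a minimal sketch, assuming the target database is SQLite, the same text could be generated from the database itself. `describe_schema` is a hypothetical helper, not part of the model or of Unsloth, and separating several tables with a blank line is an assumption about the prompt format rather than something the README states.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> str:
    """Hypothetical helper: render each user table in the Quick Start schema format."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    blocks = []
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 holds the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        # "테이블" = table, "컬럼" = columns (labels copied from the Quick Start prompt).
        blocks.append(f"테이블: {table}\n컬럼: {', '.join(columns)}")
    # Assumption: multiple tables are separated by a blank line.
    return "\n\n".join(blocks)


# Toy database mirroring the Quick Start example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER, name TEXT, country TEXT, age INTEGER)")
schema = describe_schema(conn)
print(schema)
```

The returned string can be passed wherever the examples expect `schema` or `schema_info`.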
+
+ ## 📝 Model Overview
+
+ - **Base model**: Llama 3.1 8B Instruct (4-bit quantized)
+ - **Training data**: [spider-ko](https://huggingface.co/datasets/huggingface-KREW/spider-ko) (1 epoch)
+ - **Supported DBs**: 166 databases across diverse domains ([Spider dataset](https://yale-lily.github.io/spider))
+ - **Training method**: LoRA (r=16, alpha=32)
+
+ ## 💬 Usage Examples
+
+ ### Basic Usage
+
+ ```python
+ def generate_sql(question, schema_info):
+     """Convert a Korean question into a SQL query."""
+     # Korean prompt: "Refer to the following database schema and generate a SQL
+     # query for the question.", followed by the sections
+     # "Database schema", "Question", and "SQL query".
+     prompt = f"""다음 데이터베이스 스키마를 참고하여 질문에 대한 SQL 쿼리를 생성하세요.
+
+ ### 데이터베이스 스키마:
+ {schema_info}
+
+ ### 질문: {question}
+
+ ### SQL 쿼리:"""
+
+     messages = [{"role": "user", "content": prompt}]
+     inputs = tokenizer.apply_chat_template(
+         messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+     ).to(model.device)  # keep the input IDs on the same device as the model
+
+     outputs = model.generate(inputs, max_new_tokens=150, temperature=0.1)
+     # Decode only the newly generated tokens so the echoed prompt is dropped
+     response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
+
+     return response.split("### SQL 쿼리:")[-1].strip()
+ ```
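Generated text is not guaranteed to be a single bare SQL statement. The sketch below shows one way the string returned by `generate_sql` might be tidied before use. `clean_sql` is a hypothetical helper, and the cases it handles (a Markdown code fence around the query, text after the first semicolon) are assumptions about possible outputs, not behavior documented for this model.

```python
import re

# Three backticks, built programmatically so this example stays fence-safe.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:sql)?\s*(.*?)" + re.escape(FENCE),
    flags=re.DOTALL | re.IGNORECASE,
)


def clean_sql(raw: str) -> str:
    """Hypothetical helper: reduce raw model output to one SQL statement."""
    text = raw.strip()
    # If the output is wrapped in a Markdown code fence, keep only its body.
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1).strip()
    # Keep only the first statement. This naive split would also cut a
    # semicolon that appears inside a string literal.
    text = text.split(";", 1)[0]
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(text.split())


wrapped = FENCE + "sql\nSELECT count(*)\nFROM singer;\n" + FENCE
print(clean_sql(wrapped))
```

A call such as `clean_sql(generate_sql(question, schema_info))` would then yield a one-line statement that is easier to log or execute.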
+
+ ### Example Queries
+
+ ```python
+ # Example 1: aggregate function
+ # "How many department heads are older than 56?"
+ question = "부서장들 중 56세보다 나이가 많은 사람이 몇 명입니까?"
+ # Result: SELECT count(*) FROM head WHERE age > 56
+
+ # Example 2: join
+ # "What is the status of the city that has hosted the most competitions?"
+ question = "가장 많은 대회를 개최한 도시의 상태는 무엇인가요?"
+ # Result: SELECT T1.Status FROM city AS T1 JOIN farm_competition AS T2 ON T1.City_ID = T2.Host_city_ID GROUP BY T2.Host_city_ID ORDER BY COUNT(*) DESC LIMIT 1
+
+ # Example 3: subquery
+ # "What are the names of the people who are not entrepreneurs?"
+ question = "기업가가 아닌 사람들의 이름은 무엇입니까?"
+ # Result: SELECT Name FROM people WHERE People_ID NOT IN (SELECT People_ID FROM entrepreneur)
+ ```
+
+ ## ⚠️ Usage Notes
+
+ ### Limitations
+ - ✅ Uses English table and column names (Korean question → English SQL)
+ - ✅ Optimized for the domains in the Spider dataset
+ - ❌ NoSQL and graph databases are not supported
+ - ❌ Accuracy drops on very complex nested queries
+
+ ## 🔧 Technical Specifications
+
+ ### Training Environment
+ - **GPU**: NVIDIA Tesla T4 (16GB)
+ - **Training time**: about 4 hours
+ - **Memory usage**: up to 7.6GB VRAM
+
+ ### Hyperparameters
+ ```python
+ training_args = {
+     "per_device_train_batch_size": 2,
+     "gradient_accumulation_steps": 4,
+     "learning_rate": 5e-4,
+     "num_train_epochs": 1,
+     "optimizer": "adamw_8bit",
+     "lr_scheduler_type": "cosine",
+     "warmup_ratio": 0.05
+ }
+
+ lora_config = {
+     "r": 16,
+     "lora_alpha": 32,
+     "lora_dropout": 0,
+     "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
+                        "gate_proj", "up_proj", "down_proj"]
+ }
+ ```
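A few quantities follow directly from these settings. The sketch below derives the effective batch size, the LoRA scaling factor, and a rough count of trainable adapter parameters. The layer shapes (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers) are assumptions taken from the public Llama 3.1 8B architecture, not from this README, and the effective batch size assumes a single GPU, matching the one T4 listed above.

```python
# Values copied from the hyperparameter block above.
training_args = {"per_device_train_batch_size": 2, "gradient_accumulation_steps": 4}
lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

# Sequences contributing to each optimizer step (single-GPU assumption).
effective_batch = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
)

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_config["lora_alpha"] / lora_config["r"]

# Assumed Llama 3.1 8B shapes as (in_features, out_features) per projection.
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32
shapes = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# Each adapted matrix adds A (r x in) and B (out x r), i.e. r * (in + out) weights.
r = lora_config["r"]
trainable = LAYERS * sum(
    r * (shapes[name][0] + shapes[name][1]) for name in lora_config["target_modules"]
)

print(effective_batch, scaling, f"{trainable:,}")
```

Under these assumptions the adapters add roughly 42 million trainable weights, a small fraction of the 8B base model.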
+
+ ## 📚 References
+
+ ### Citation
+ ```bibtex
+ @misc{llama31_spider_sql_ko_2025,
+   title={Llama-3.1-8B-Spider-SQL-Ko: Korean Text-to-SQL Model},
+   author={Sohyun Sim and Youngjun Cho and Seongwoo Choi},
+   year={2025},
+   publisher={Hugging Face KREW},
+   url={https://huggingface.co/huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko}
+ }
+ ```
+
+ ### Related Papers
+ - [Spider: A Large-Scale Human-Labeled Dataset](https://arxiv.org/abs/1809.08887) (Yu et al., 2018)
+
+ ## 🤝 Contributors
+
+ [@sim-so](https://huggingface.co/sim-so), [@choincnp](https://huggingface.co/choincnp), [@nuatmochoi](https://huggingface.co/nuatmochoi)