Update README.md
README.md
CHANGED
@@ -1,6 +1,7 @@
|
|
1 |
---
|
2 |
language:
|
3 |
- en
|
|
|
4 |
tags:
|
5 |
- language
|
6 |
- kannada
|
@@ -8,6 +9,7 @@ license: mit
|
|
8 |
base_model:
|
9 |
- text2font/ByteLevelBPETokenizer_default
|
10 |
---
|
|
|
11 |
# Kannada ByteLevel BPE Tokenizer
|
12 |
|
13 |
This repository contains a **Byte-Level BPE (Byte Pair Encoding) tokenizer** for the **Kannada** language, designed using the Hugging Face `tokenizers` library. The tokenizer is optimized for handling Kannada text, including **pure Kannada, mixed-language text, numbers, punctuation, and special cases like URLs and emojis**.
|
@@ -34,223 +36,54 @@ The tokenizer has been tested on multiple text categories:
|
|
34 |
|
35 |
## Tokenizer Test Results
|
36 |
|
72 |
-
Encoded tokens: ['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']
|
73 |
-
Token IDs: [0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]
|
74 |
-
Decoded text: ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
|
75 |
-
Analysis:
|
76 |
-
- Number of tokens: 36
|
77 |
-
- Average token length: 1.14 characters
|
78 |
-
- Reconstruction: Perfect
|
79 |
-
|
80 |
-
----------------------------------------
|
81 |
-
|
82 |
-
Test Case 2: Technical terms
|
83 |
-
Original text: ಈ mobile phone ನಲ್ಲಿ 4G network ಇದೆ
|
84 |
-
Encoded tokens: ['<s>', 'à²Ī', 'Ġ', 'm', 'o', 'b', 'i', 'l', 'e', 'Ġ', 'p', 'h', 'o', 'n', 'e', 'Ġನಲ', 'à³į', 'ಲ', 'ಿ', 'Ġ', '4', 'G', 'Ġ', 'n', 'e', 't', 'w', 'o', 'r', 'k', 'Ġà²ĩದ', 'à³Ĩ', '</s>']
|
85 |
-
Token IDs: [0, 725, 225, 81, 83, 70, 77, 80, 73, 225, 84, 76, 83, 82, 73, 684, 264, 274, 267, 225, 24, 43, 225, 82, 73, 88, 91, 83, 86, 79, 493, 268, 1]
|
86 |
-
Decoded text: ಈ mobile phone ನಲ್ಲಿ 4G network ಇದೆ
|
87 |
-
Analysis:
|
88 |
-
- Number of tokens: 33
|
89 |
-
- Average token length: 1.06 characters
|
90 |
-
- Reconstruction: Perfect
|
91 |
-
|
92 |
-
----------------------------------------
|
93 |
-
|
94 |
-
Test Case 3: Social media
|
95 |
-
Original text: ಅವರ Instagram profile ತುಂಬಾ popular ಆಗಿದೆ
|
96 |
-
Encoded tokens: ['<s>', 'à²ħವರ', 'Ġ', 'I', 'n', 's', 't', 'a', 'g', 'r', 'a', 'm', 'Ġ', 'p', 'r', 'o', 'f', 'i', 'l', 'e', 'Ġತ', 'à³ģà²Ĥ', 'ಬ', 'ಾ', 'Ġ', 'p', 'o', 'p', 'u', 'l', 'a', 'r', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']
|
97 |
-
Token IDs: [0, 744, 225, 45, 82, 87, 88, 69, 75, 86, 69, 81, 225, 84, 86, 83, 74, 77, 80, 73, 299, 336, 293, 270, 225, 84, 83, 84, 89, 80, 69, 86, 408, 267, 269, 268, 1]
|
98 |
-
Decoded text: ಅವರ Instagram profile ತುಂಬಾ popular ಆಗಿದೆ
|
99 |
-
Analysis:
|
100 |
-
- Number of tokens: 37
|
101 |
-
- Average token length: 1.11 characters
|
102 |
-
- Reconstruction: Perfect
|
103 |
-
|
104 |
-
----------------------------------------
|
105 |
-
|
106 |
-
|
107 |
-
Category: Numbers and Dates
|
108 |
-
|
109 |
-
|
110 |
-
Test Case 1: Pure numbers
|
111 |
-
Original text: ೧೨೩೪೫ 12345
|
112 |
-
Encoded tokens: ['<s>', 'à³§', 'à³', '¨', 'à³', '©', 'à³', 'ª', 'à³', '«', 'Ġ', '1', '2', '3', '4', '5', '</s>']
|
113 |
-
Token IDs: [0, 2891, 262, 106, 262, 107, 262, 108, 262, 109, 225, 21, 22, 23, 24, 25, 1]
|
114 |
-
Decoded text: ೧೨೩೪೫ 12345
|
115 |
-
Analysis:
|
116 |
-
- Number of tokens: 17
|
117 |
-
- Average token length: 0.65 characters
|
118 |
-
- Reconstruction: Perfect
|
119 |
-
|
120 |
-
----------------------------------------
|
121 |
-
|
122 |
-
Test Case 2: Mixed numbers
|
123 |
-
Original text: ನನ್ನ ಫೋನ್ ನಂಬರ್ +91 9876543210
|
124 |
-
Encoded tokens: ['<s>', 'ನನ', 'à³į', 'ನ', 'Ġಫ', 'à³', 'ĭ', 'ನ', 'à³į', 'Ġನ', 'à²Ĥ', 'ಬರ', 'à³į', 'Ġ', '+', '9', '1', 'Ġ', '9', '8', '7', '6', '5', '4', '3', '2', '1', '0', '</s>']
|
125 |
-
Token IDs: [0, 306, 264, 266, 1055, 262, 238, 266, 264, 298, 275, 505, 264, 225, 15, 29, 21, 225, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 1]
|
126 |
-
Decoded text: ನನ್ನ ಫೋನ್ ನಂಬರ್ +91 9876543210
|
127 |
-
Analysis:
|
128 |
-
- Number of tokens: 29
|
129 |
-
- Average token length: 1.03 characters
|
130 |
-
- Reconstruction: Perfect
|
131 |
-
|
132 |
-
----------------------------------------
|
133 |
-
|
134 |
-
Test Case 3: Date formats
|
135 |
-
Original text: ೨೦೨೪ ಫೆಬ್ರವರಿ ೧೪ / 14-02-2024
|
136 |
-
Encoded tokens: ['<s>', 'à³', '¨', 'à³', '¦', 'à³', '¨', 'à³', 'ª', 'Ġಫ', 'à³Ĩ', 'ಬ', 'à³į', 'ರ', 'ವರ', 'ಿ', 'Ġ', 'à³§', 'à³', 'ª', 'Ġ', '/', 'Ġ', '1', '4', '-', '0', '2', '-', '2', '0', '2', '4', '</s>']
|
137 |
-
Token IDs: [0, 262, 106, 262, 104, 262, 106, 262, 108, 1055, 268, 293, 264, 272, 328, 267, 225, 2891, 262, 108, 225, 19, 225, 21, 24, 17, 20, 22, 17, 22, 20, 22, 24, 1]
|
138 |
-
Decoded text: ೨೦೨೪ ಫೆಬ್ರವರಿ ೧೪ / 14-02-2024
|
139 |
-
Analysis:
|
140 |
-
- Number of tokens: 34
|
141 |
-
- Average token length: 0.85 characters
|
142 |
-
- Reconstruction: Perfect
|
143 |
-
|
144 |
-
----------------------------------------
|
145 |
-
|
146 |
-
Test Case 4: Currency
|
147 |
-
Original text: ₹೧೦೦೦.೫೦ ರೂಪಾಯಿ / Rs. 1000.50
|
148 |
-
Encoded tokens: ['<s>', 'â', 'Ĥ', '¹', 'à³§', 'à³', '¦', 'à³', '¦', 'à³', '¦', '.', 'à³', '«', 'à³', '¦', 'Ġರ', 'à³Ĥ', 'ಪ', 'ಾ', 'ಯ', 'ಿ', 'Ġ', '/', 'Ġ', 'R', 's', '.', 'Ġ', '1', '0', '0', '0', '.', '5', '0', '</s>']
|
149 |
-
Token IDs: [0, 163, 229, 122, 2891, 262, 104, 262, 104, 262, 104, 18, 262, 109, 262, 104, 371, 289, 291, 270, 277, 267, 225, 19, 225, 54, 87, 18, 225, 21, 20, 20, 20, 18, 25, 20, 1]
|
150 |
-
Decoded text: ₹೧೦೦೦.೫೦ ರೂಪಾಯಿ / Rs. 1000.50
|
151 |
-
Analysis:
|
152 |
-
- Number of tokens: 37
|
153 |
-
- Average token length: 0.78 characters
|
154 |
-
- Reconstruction: Perfect
|
155 |
-
|
156 |
-
----------------------------------------
|
157 |
-
|
158 |
-
|
159 |
-
Category: Punctuation
|
160 |
-
|
161 |
-
|
162 |
-
Test Case 1: Basic punctuation
|
163 |
-
Original text: ನೀವು ಹೇಗಿದ್ದೀರಿ? ನಾನು ಚೆನ್ನಾಗಿದ್ದೇನೆ!
|
164 |
-
Encoded tokens: ['<s>', 'ನ', 'à³Ģ', 'ವ', 'à³ģ', 'Ġಹ', 'à³', 'ĩ', 'à²Ĺ', 'ಿ', 'ದ', 'à³į', 'ದ', 'à³Ģ', 'ರ', 'ಿ?', 'Ġನ', 'ಾ', 'ನ', 'à³ģ', 'Ġà²ļ', 'à³Ĩ', 'ನ', 'à³į', 'ನ', 'ಾ', 'à²Ĺ', 'ಿ', 'ದ', 'à³į', 'ದ', 'à³', 'ĩ', 'ನ', 'à³Ĩ!', '</s>']
|
165 |
-
Token IDs: [0, 266, 276, 279, 265, 284, 262, 234, 273, 267, 269, 264, 269, 276, 272, 813, 298, 270, 266, 265, 339, 268, 266, 264, 266, 270, 273, 267, 269, 264, 269, 262, 234, 266, 532, 1]
|
166 |
-
Decoded text: ನೀವು ಹೇಗಿದ್ದೀರಿ? ನಾನು ಚೆನ್ನಾಗಿದ್ದೇನೆ!
|
167 |
-
Analysis:
|
168 |
-
- Number of tokens: 36
|
169 |
-
- Average token length: 1.03 characters
|
170 |
-
- Reconstruction: Perfect
|
171 |
-
|
172 |
-
----------------------------------------
|
173 |
-
|
174 |
-
Test Case 2: Special characters
|
175 |
-
Original text: ಇಂದು @bangalore #Karnataka (ಕರ್ನಾಟಕ) ರಾಜ್ಯ
|
176 |
-
Encoded tokens: ['<s>', 'à²ĩ', 'à²Ĥ', 'ದ', 'à³ģ', 'Ġ', '@', 'b', 'an', 'g', 'a', 'l', 'o', 'r', 'e', 'Ġ', '#', 'K', 'a', 'r', 'n', 'a', 't', 'a', 'k', 'a', 'Ġ(', 'à²ķರ', 'à³į', 'ನ', 'ಾ', 'à²Łà²ķ', ')', 'Ġರ', 'ಾ', 'à²ľ', 'à³į', 'ಯ', '</s>']
|
177 |
-
Token IDs: [0, 447, 275, 269, 265, 225, 36, 70, 380, 75, 69, 80, 83, 86, 73, 225, 7, 47, 69, 86, 82, 69, 88, 69, 79, 69, 443, 368, 264, 266, 270, 482, 13, 371, 270, 330, 264, 277, 1]
|
178 |
-
Decoded text: ಇಂದು @bangalore #Karnataka (ಕರ್ನಾಟಕ) ರಾಜ್ಯ
|
179 |
-
Analysis:
|
180 |
-
- Number of tokens: 39
|
181 |
-
- Average token length: 1.08 characters
|
182 |
-
- Reconstruction: Perfect
|
183 |
-
|
184 |
-
----------------------------------------
|
185 |
-
|
186 |
-
Test Case 3: Mixed punctuation
|
187 |
-
Original text: "ಕನ್ನಡ" - 'ನಾಡಿನ' & ಸಂಸ್ಕೃತಿ...
|
188 |
-
Encoded tokens: ['<s>', '"', 'à²ķನ', 'à³į', 'ನಡ', '"', 'Ġ-', 'Ġ', "'", 'ನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', "'", 'Ġ', '&', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ.', '.', '.', '</s>']
|
189 |
-
Token IDs: [0, 6, 754, 264, 407, 6, 438, 225, 11, 266, 270, 280, 267, 266, 11, 225, 10, 300, 275, 281, 264, 278, 412, 271, 517, 18, 18, 1]
|
190 |
-
Decoded text: "ಕನ್ನಡ" - 'ನಾಡಿನ' & ಸಂಸ್ಕೃತಿ...
|
191 |
-
Analysis:
|
192 |
-
- Number of tokens: 28
|
193 |
-
- Average token length: 1.11 characters
|
194 |
-
- Reconstruction: Perfect
|
195 |
-
|
196 |
-
----------------------------------------
|
197 |
-
|
198 |
-
|
199 |
-
Category: Special Cases
|
200 |
-
|
201 |
-
|
202 |
-
Test Case 1: URLs
|
203 |
-
Original text: ನಮ್ಮ ವೆಬ್ಸೈಟ್: https://kannada.example.com
|
204 |
-
Encoded tokens: ['<s>', 'ನಮ', 'à³į', 'ಮ', 'Ġವ', 'à³Ĩ', 'ಬ', 'à³į', 'âĢ', 'Į', 'ಸ', 'à³Ī', 'à²Ł', 'à³į:', 'Ġhttps', '://', 'kannad', 'a', '.', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '.', 'com', '</s>']
|
205 |
-
Token IDs: [0, 765, 264, 285, 332, 268, 293, 264, 297, 239, 281, 335, 286, 3786, 473, 452, 471, 69, 18, 73, 92, 69, 81, 84, 80, 73, 18, 469, 1]
|
206 |
-
Decoded text: ನಮ್ಮ ವೆಬ್ಸೈಟ್: https://kannada.example.com
|
207 |
-
Analysis:
|
208 |
-
- Number of tokens: 29
|
209 |
-
- Average token length: 1.48 characters
|
210 |
-
- Reconstruction: Perfect
|
211 |
-
|
212 |
-
----------------------------------------
|
213 |
-
|
214 |
-
Test Case 2: Hashtags
|
215 |
-
Original text: #ಕನ್ನಡ_ರಾಜ್ಯೋತ್ಸವ #Kannada
|
216 |
-
Encoded tokens: ['<s>', '#', 'à²ķನ', 'à³į', 'ನಡ', '_', 'ರ', 'ಾ', 'à²ľ', 'à³į', 'ಯ', 'à³', 'ĭ', 'ತ', 'à³į', 'ಸವ', 'Ġ', '#', 'K', 'an', 'nad', 'a', '</s>']
|
217 |
-
Token IDs: [0, 7, 754, 264, 407, 67, 272, 270, 330, 264, 277, 262, 238, 271, 264, 595, 225, 7, 47, 380, 461, 69, 1]
|
218 |
-
Decoded text: #ಕನ್ನಡ_ರಾಜ್ಯೋತ್ಸವ #Kannada
|
219 |
-
Analysis:
|
220 |
-
- Number of tokens: 23
|
221 |
-
- Average token length: 1.13 characters
|
222 |
-
- Reconstruction: Perfect
|
223 |
-
|
224 |
-
----------------------------------------
|
225 |
-
|
226 |
-
Test Case 3: Emojis
|
227 |
-
Original text: ಕನ್ನಡ ನಾಡು 🚩 ಕನ್ನಡ ಭಾಷೆ ❤️
|
228 |
-
Encoded tokens: ['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'à³ģ', 'Ġ', 'ð', 'Ł', 'ļ', '©', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', 'Ġ', 'â', 'Ŀ', '¤', 'ï', '¸', 'ı', '</s>']
|
229 |
-
Token IDs: [0, 754, 264, 407, 298, 270, 280, 265, 225, 177, 258, 253, 107, 738, 264, 407, 386, 270, 323, 268, 225, 163, 256, 102, 176, 121, 242, 1]
|
230 |
-
Decoded text: ಕನ್ನಡ ನಾಡು 🚩 ಕನ್ನಡ ಭಾಷೆ ❤️
|
231 |
-
Analysis:
|
232 |
-
- Number of tokens: 28
|
233 |
-
- Average token length: 0.93 characters
|
234 |
-
- Reconstruction: Perfect
|
235 |
-
|
236 |
-
----------------------------------------
|
237 |
-
|
238 |
-
Test Case 4: File paths
|
239 |
-
Original text: C:\Users\ಬಳಕೆದಾರ\Documents\ಕನ್ನಡ.txt
|
240 |
-
Encoded tokens: ['<s>', 'C', ':', '\\', 'U', 's', 'e', 'r', 's', '\\', 'ಬಳà²ķ', 'à³Ĩ', 'ದ', 'ಾ', 'ರ', '\\', 'D', 'o', 'c', 'u', 'm', 'e', 'n', 't', 's', '\\', 'à²ķನ', 'à³į', 'ನಡ', '.', 't', 'x', 't', '</s>']
|
241 |
-
Token IDs: [0, 39, 30, 64, 57, 87, 73, 86, 87, 64, 4023, 268, 269, 270, 272, 64, 40, 83, 71, 89, 81, 73, 82, 88, 87, 64, 754, 264, 407, 18, 88, 92, 88, 1]
|
242 |
-
Decoded text: C:\Users\ಬಳಕೆದಾರ\Documents\ಕನ್ನಡ.txt
|
243 |
-
Analysis:
|
244 |
-
- Number of tokens: 34
|
245 |
-
- Average token length: 1.06 characters
|
246 |
-
- Reconstruction: Perfect
|
247 |
-
|
248 |
-
----------------------------------------
|
249 |
-
|
250 |
-
|
251 |
|
252 |
## Repository Structure
|
253 |
-
The repository consists of tokenizer files, configuration files, and documentation
|
254 |
|
255 |
## License
|
256 |
This tokenizer is released under the **MIT License**.
|
@@ -258,9 +91,4 @@ This tokenizer is released under the **MIT License**.
|
|
258 |
## Citation
|
259 |
If you use this tokenizer, please cite this repository.
|
260 |
|
261 |
-
|
262 |
-
license: mit
|
263 |
-
language:
|
264 |
-
- en
|
265 |
-
- kn
|
266 |
-
---
|
|
|
1 |
---
|
2 |
language:
|
3 |
- en
|
4 |
+
- kn
|
5 |
tags:
|
6 |
- language
|
7 |
- kannada
|
|
|
9 |
base_model:
|
10 |
- text2font/ByteLevelBPETokenizer_default
|
11 |
---
|
12 |
+
|
13 |
# Kannada ByteLevel BPE Tokenizer
|
14 |
|
15 |
This repository contains a **Byte-Level BPE (Byte Pair Encoding) tokenizer** for the **Kannada** language, designed using the Hugging Face `tokenizers` library. The tokenizer is optimized for handling Kannada text, including **pure Kannada, mixed-language text, numbers, punctuation, and special cases like URLs and emojis**.
|
|
|
36 |
|
37 |
## Tokenizer Test Results
|
38 |
|
39 |
+
### Category: Pure Kannada
|
40 |
+
|
41 |
+
#### Test Case 1: Basic sentence
|
42 |
+
**Original text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
|
43 |
+
**Encoded tokens:** ['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']
|
44 |
+
**Token IDs:** [0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]
|
45 |
+
**Decoded text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
|
46 |
+
**Analysis:**
|
47 |
+
- Number of tokens: 14
|
48 |
+
- Average token length: 1.29 characters
|
49 |
+
- Reconstruction: Perfect
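
The encoded tokens look garbled because a byte-level BPE vocabulary stores each UTF-8 byte as a printable stand-in character rather than as the Kannada character itself. The sketch below assumes the tokenizer uses the GPT-2 byte-to-character table (the mapping followed by the ByteLevel pre-tokenizer in `tokenizers`); `decode_token` is a hypothetical helper name, not part of the library.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (control characters, space, ...) are shifted past U+00FF.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


# Inverse table: stand-in character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a byte-level token string back into the text it encodes."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


# U+0120 is the stand-in for a space (shown as the leading marker on many tokens).
print(repr(decode_token("\u0120")))
# The three stand-ins E0 B3 8D form the Kannada virama U+0CCD.
print(decode_token("\u00e0\u00b3\u012f") == "\u0ccd")
```

The helper only accepts tokens made entirely of stand-in characters. Tokens that are printed with readable Kannada letters in the listings fall outside the table and would raise `KeyError`.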
|
50 |
+
|
51 |
+
#### Test Case 2: Complex sentence
|
52 |
+
**Original text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
|
53 |
+
**Encoded tokens:** ['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']
|
54 |
+
**Token IDs:** [0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]
|
55 |
+
**Decoded text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
|
56 |
+
**Analysis:**
|
57 |
+
- Number of tokens: 26
|
58 |
+
- Average token length: 1.27 characters
|
59 |
+
- Reconstruction: Perfect
|
60 |
+
|
61 |
+
### Category: Mixed Language
|
62 |
+
|
63 |
+
#### Test Case 1: Kannada with English
|
64 |
+
**Original text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
|
65 |
+
**Encoded tokens:** ['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']
|
66 |
+
**Token IDs:** [0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]
|
67 |
+
**Decoded text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
|
68 |
+
**Analysis:**
|
69 |
+
- Number of tokens: 36
|
70 |
+
- Average token length: 1.14 characters
|
71 |
+
- Reconstruction: Perfect
|
72 |
+
|
73 |
+
(Additional test cases can be added following the same format)
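
The **Analysis** figures are consistent with dividing the number of Unicode code points in the original text by the token count, special tokens included (18 / 14 ≈ 1.29 for the first test case). A minimal sketch of that calculation, assuming this is how the numbers were produced; `analyze` and the `"Mismatch"` label are hypothetical and not taken from the repository.

```python
def analyze(original: str, tokens: list, decoded: str) -> dict:
    """Summarize one test case in the same terms as the Analysis sections."""
    return {
        "num_tokens": len(tokens),
        # Code points per token; <s> and </s> are counted as tokens.
        "avg_token_length": round(len(original) / len(tokens), 2),
        "reconstruction": "Perfect" if decoded == original else "Mismatch",
    }


text = "ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ"
# Placeholder list standing in for the 14 tokens reported for this sentence.
tokens = ["<s>"] + ["tok"] * 12 + ["</s>"]
print(analyze(text, tokens, decoded=text))
```

Because the start and end markers count toward the total, very short inputs report a lower average than the tokenizer's merges alone would suggest.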
|
74 |
|
75 |
## Repository Structure
|
76 |
+
The repository consists of tokenizer files, configuration files, and documentation:
|
77 |
+
```
|
78 |
+
├── kannada_tokenizer/
|
79 |
+
│ ├── vocab.json
|
80 |
+
│ ├── merges.txt
|
81 |
+
│ ├── tokenizer.json
|
82 |
+
│ ├── tokenizer_config.json
|
83 |
+
│ ├── special_tokens_map.json
|
84 |
+
│ ├── config.json
|
85 |
+
├── README.md
|
86 |
+
```
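
A minimal sketch of checking for and loading these files, assuming the `tokenizers` package is installed and the `kannada_tokenizer/` directory is available locally; `missing_files` and `load_tokenizer` are hypothetical helper names. `Tokenizer.from_file` reads the single-file `tokenizer.json`.

```python
from pathlib import Path

# File names taken from the tree above.
EXPECTED_FILES = [
    "vocab.json",
    "merges.txt",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "config.json",
]


def missing_files(root="kannada_tokenizer"):
    """Return the expected files that are not present under `root`."""
    return [name for name in EXPECTED_FILES if not (Path(root) / name).is_file()]


def load_tokenizer(root="kannada_tokenizer"):
    """Load the tokenizer from tokenizer.json (requires the `tokenizers` package)."""
    # Imported lazily so the file check above works without the package.
    from tokenizers import Tokenizer
    return Tokenizer.from_file(str(Path(root) / "tokenizer.json"))


if __name__ == "__main__":
    gaps = missing_files()
    if gaps:
        print("Missing files:", ", ".join(gaps))
    else:
        encoding = load_tokenizer().encode("ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ")
        print(encoding.tokens)
        print(encoding.ids)
```

Whether `<s>` and `</s>` appear in the output depends on the post-processor stored in `tokenizer.json`.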
|
87 |
|
88 |
## License
|
89 |
This tokenizer is released under the **MIT License**.
|
|
|
91 |
## Citation
|
92 |
If you use this tokenizer, please cite this repository.
|
93 |
|
94 |
+
For improvements and contributions, feel free to submit a **Pull Request** or open an **Issue**. 🚀
|