ruthuvikas1998 committed on
Commit ec3dcf7 · verified · 1 Parent(s): 57909d0

Update README.md

Files changed (1)
  1. README.md +49 -221
README.md CHANGED
@@ -1,6 +1,7 @@
1
  ---
2
  language:
3
  - en
 
4
  tags:
5
  - language
6
  - kannada
@@ -8,6 +9,7 @@ license: mit
8
  base_model:
9
  - text2font/ByteLevelBPETokenizer_default
10
  ---
 
11
  # Kannada ByteLevel BPE Tokenizer
12
 
13
  This repository contains a **Byte-Level BPE (Byte Pair Encoding) tokenizer** for the **Kannada** language, designed using the Hugging Face `tokenizers` library. The tokenizer is optimized for handling Kannada text, including **pure Kannada, mixed-language text, numbers, punctuation, and special cases like URLs and emojis**.
@@ -34,223 +36,54 @@ The tokenizer has been tested on multiple text categories:
34
 
35
  ## Tokenizer Test Results
36
 
37
-
38
-
39
- Category: Pure Kannada
40
-
41
-
42
- Test Case 1: Basic sentence
43
- Original text: ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
44
- Encoded tokens: ['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']
45
- Token IDs: [0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]
46
- Decoded text: ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
47
- Analysis:
48
- - Number of tokens: 14
49
- - Average token length: 1.29 characters
50
- - Reconstruction: Perfect
51
-
52
- ----------------------------------------
53
-
54
- Test Case 2: Complex sentence
55
- Original text: ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
56
- Encoded tokens: ['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']
57
- Token IDs: [0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]
58
- Decoded text: ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
59
- Analysis:
60
- - Number of tokens: 26
61
- - Average token length: 1.27 characters
62
- - Reconstruction: Perfect
63
-
64
- ----------------------------------------
65
-
66
-
67
- Category: Mixed Language
68
-
69
-
70
- Test Case 1: Kannada with English
71
- Original text: ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
72
- Encoded tokens: ['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']
73
- Token IDs: [0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]
74
- Decoded text: ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
75
- Analysis:
76
- - Number of tokens: 36
77
- - Average token length: 1.14 characters
78
- - Reconstruction: Perfect
79
-
80
- ----------------------------------------
81
-
82
- Test Case 2: Technical terms
83
- Original text: ಈ mobile phone ನಲ್ಲಿ 4G network ಇದೆ
84
- Encoded tokens: ['<s>', 'à²Ī', 'Ġ', 'm', 'o', 'b', 'i', 'l', 'e', 'Ġ', 'p', 'h', 'o', 'n', 'e', 'Ġನಲ', 'à³į', 'ಲ', 'ಿ', 'Ġ', '4', 'G', 'Ġ', 'n', 'e', 't', 'w', 'o', 'r', 'k', 'Ġà²ĩದ', 'à³Ĩ', '</s>']
85
- Token IDs: [0, 725, 225, 81, 83, 70, 77, 80, 73, 225, 84, 76, 83, 82, 73, 684, 264, 274, 267, 225, 24, 43, 225, 82, 73, 88, 91, 83, 86, 79, 493, 268, 1]
86
- Decoded text: ಈ mobile phone ನಲ್ಲಿ 4G network ಇದೆ
87
- Analysis:
88
- - Number of tokens: 33
89
- - Average token length: 1.06 characters
90
- - Reconstruction: Perfect
91
-
92
- ----------------------------------------
93
-
94
- Test Case 3: Social media
95
- Original text: ಅವರ Instagram profile ತುಂಬಾ popular ಆಗಿದೆ
96
- Encoded tokens: ['<s>', 'à²ħವರ', 'Ġ', 'I', 'n', 's', 't', 'a', 'g', 'r', 'a', 'm', 'Ġ', 'p', 'r', 'o', 'f', 'i', 'l', 'e', 'Ġತ', 'à³ģà²Ĥ', 'ಬ', 'ಾ', 'Ġ', 'p', 'o', 'p', 'u', 'l', 'a', 'r', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']
97
- Token IDs: [0, 744, 225, 45, 82, 87, 88, 69, 75, 86, 69, 81, 225, 84, 86, 83, 74, 77, 80, 73, 299, 336, 293, 270, 225, 84, 83, 84, 89, 80, 69, 86, 408, 267, 269, 268, 1]
98
- Decoded text: ಅವರ Instagram profile ತುಂಬಾ popular ಆಗಿದೆ
99
- Analysis:
100
- - Number of tokens: 37
101
- - Average token length: 1.11 characters
102
- - Reconstruction: Perfect
103
-
104
- ----------------------------------------
105
-
106
-
107
- Category: Numbers and Dates
108
-
109
-
110
- Test Case 1: Pure numbers
111
- Original text: ೧೨೩೪೫ 12345
112
- Encoded tokens: ['<s>', 'à³§', 'à³', '¨', 'à³', '©', 'à³', 'ª', 'à³', '«', 'Ġ', '1', '2', '3', '4', '5', '</s>']
113
- Token IDs: [0, 2891, 262, 106, 262, 107, 262, 108, 262, 109, 225, 21, 22, 23, 24, 25, 1]
114
- Decoded text: ೧೨೩೪೫ 12345
115
- Analysis:
116
- - Number of tokens: 17
117
- - Average token length: 0.65 characters
118
- - Reconstruction: Perfect
119
-
120
- ----------------------------------------
121
-
122
- Test Case 2: Mixed numbers
123
- Original text: ನನ್ನ ಫೋನ್ ನಂಬರ್ +91 9876543210
124
- Encoded tokens: ['<s>', 'ನನ', 'à³į', 'ನ', 'Ġಫ', 'à³', 'ĭ', 'ನ', 'à³į', 'Ġನ', 'à²Ĥ', 'ಬರ', 'à³į', 'Ġ', '+', '9', '1', 'Ġ', '9', '8', '7', '6', '5', '4', '3', '2', '1', '0', '</s>']
125
- Token IDs: [0, 306, 264, 266, 1055, 262, 238, 266, 264, 298, 275, 505, 264, 225, 15, 29, 21, 225, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 1]
126
- Decoded text: ನನ್ನ ಫೋನ್ ನಂಬರ್ +91 9876543210
127
- Analysis:
128
- - Number of tokens: 29
129
- - Average token length: 1.03 characters
130
- - Reconstruction: Perfect
131
-
132
- ----------------------------------------
133
-
134
- Test Case 3: Date formats
135
- Original text: ೨೦೨೪ ಫೆಬ್ರವರಿ ೧೪ / 14-02-2024
136
- Encoded tokens: ['<s>', 'à³', '¨', 'à³', '¦', 'à³', '¨', 'à³', 'ª', 'Ġಫ', 'à³Ĩ', 'ಬ', 'à³į', 'ರ', 'ವರ', 'ಿ', 'Ġ', 'à³§', 'à³', 'ª', 'Ġ', '/', 'Ġ', '1', '4', '-', '0', '2', '-', '2', '0', '2', '4', '</s>']
137
- Token IDs: [0, 262, 106, 262, 104, 262, 106, 262, 108, 1055, 268, 293, 264, 272, 328, 267, 225, 2891, 262, 108, 225, 19, 225, 21, 24, 17, 20, 22, 17, 22, 20, 22, 24, 1]
138
- Decoded text: ೨೦೨೪ ಫೆಬ್ರವರಿ ೧೪ / 14-02-2024
139
- Analysis:
140
- - Number of tokens: 34
141
- - Average token length: 0.85 characters
142
- - Reconstruction: Perfect
143
-
144
- ----------------------------------------
145
-
146
- Test Case 4: Currency
147
- Original text: ₹೧೦೦೦.೫೦ ರೂಪಾಯಿ / Rs. 1000.50
148
- Encoded tokens: ['<s>', 'â', 'Ĥ', '¹', 'à³§', 'à³', '¦', 'à³', '¦', 'à³', '¦', '.', 'à³', '«', 'à³', '¦', 'Ġರ', 'à³Ĥ', 'ಪ', 'ಾ', 'ಯ', 'ಿ', 'Ġ', '/', 'Ġ', 'R', 's', '.', 'Ġ', '1', '0', '0', '0', '.', '5', '0', '</s>']
149
- Token IDs: [0, 163, 229, 122, 2891, 262, 104, 262, 104, 262, 104, 18, 262, 109, 262, 104, 371, 289, 291, 270, 277, 267, 225, 19, 225, 54, 87, 18, 225, 21, 20, 20, 20, 18, 25, 20, 1]
150
- Decoded text: ₹೧೦೦೦.೫೦ ರೂಪಾಯಿ / Rs. 1000.50
151
- Analysis:
152
- - Number of tokens: 37
153
- - Average token length: 0.78 characters
154
- - Reconstruction: Perfect
155
-
156
- ----------------------------------------
157
-
158
-
159
- Category: Punctuation
160
-
161
-
162
- Test Case 1: Basic punctuation
163
- Original text: ನೀವು ಹೇಗಿದ್ದೀರಿ? ನಾನು ಚೆನ್ನಾಗಿದ್ದೇನೆ!
164
- Encoded tokens: ['<s>', 'ನ', 'à³Ģ', 'ವ', 'à³ģ', 'Ġಹ', 'à³', 'ĩ', 'à²Ĺ', 'ಿ', 'ದ', 'à³į', 'ದ', 'à³Ģ', 'ರ', 'ಿ?', 'Ġನ', 'ಾ', 'ನ', 'à³ģ', 'Ġà²ļ', 'à³Ĩ', 'ನ', 'à³į', 'ನ', 'ಾ', 'à²Ĺ', 'ಿ', 'ದ', 'à³į', 'ದ', 'à³', 'ĩ', 'ನ', 'à³Ĩ!', '</s>']
165
- Token IDs: [0, 266, 276, 279, 265, 284, 262, 234, 273, 267, 269, 264, 269, 276, 272, 813, 298, 270, 266, 265, 339, 268, 266, 264, 266, 270, 273, 267, 269, 264, 269, 262, 234, 266, 532, 1]
166
- Decoded text: ನೀವು ಹೇಗಿದ್ದೀರಿ? ನಾನು ಚೆನ್ನಾಗಿದ್ದೇನೆ!
167
- Analysis:
168
- - Number of tokens: 36
169
- - Average token length: 1.03 characters
170
- - Reconstruction: Perfect
171
-
172
- ----------------------------------------
173
-
174
- Test Case 2: Special characters
175
- Original text: ಇಂದು @bangalore #Karnataka (ಕರ್ನಾಟಕ) ರಾಜ್ಯ
176
- Encoded tokens: ['<s>', 'à²ĩ', 'à²Ĥ', 'ದ', 'à³ģ', 'Ġ', '@', 'b', 'an', 'g', 'a', 'l', 'o', 'r', 'e', 'Ġ', '#', 'K', 'a', 'r', 'n', 'a', 't', 'a', 'k', 'a', 'Ġ(', 'à²ķರ', 'à³į', 'ನ', 'ಾ', 'à²Łà²ķ', ')', 'Ġರ', 'ಾ', 'à²ľ', 'à³į', 'ಯ', '</s>']
177
- Token IDs: [0, 447, 275, 269, 265, 225, 36, 70, 380, 75, 69, 80, 83, 86, 73, 225, 7, 47, 69, 86, 82, 69, 88, 69, 79, 69, 443, 368, 264, 266, 270, 482, 13, 371, 270, 330, 264, 277, 1]
178
- Decoded text: ಇಂದು @bangalore #Karnataka (ಕರ್ನಾಟಕ) ರಾಜ್ಯ
179
- Analysis:
180
- - Number of tokens: 39
181
- - Average token length: 1.08 characters
182
- - Reconstruction: Perfect
183
-
184
- ----------------------------------------
185
-
186
- Test Case 3: Mixed punctuation
187
- Original text: "ಕನ್ನಡ" - 'ನಾಡಿನ' & ಸಂಸ್ಕೃತಿ...
188
- Encoded tokens: ['<s>', '"', 'à²ķನ', 'à³į', 'ನಡ', '"', 'Ġ-', 'Ġ', "'", 'ನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', "'", 'Ġ', '&', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ.', '.', '.', '</s>']
189
- Token IDs: [0, 6, 754, 264, 407, 6, 438, 225, 11, 266, 270, 280, 267, 266, 11, 225, 10, 300, 275, 281, 264, 278, 412, 271, 517, 18, 18, 1]
190
- Decoded text: "ಕನ್ನಡ" - 'ನಾಡಿನ' & ಸಂಸ್ಕೃತಿ...
191
- Analysis:
192
- - Number of tokens: 28
193
- - Average token length: 1.11 characters
194
- - Reconstruction: Perfect
195
-
196
- ----------------------------------------
197
-
198
-
199
- Category: Special Cases
200
-
201
-
202
- Test Case 1: URLs
203
- Original text: ನಮ್ಮ ವೆಬ್‌ಸೈಟ್: https://kannada.example.com
204
- Encoded tokens: ['<s>', 'ನಮ', 'à³į', 'ಮ', 'Ġವ', 'à³Ĩ', 'ಬ', 'à³į', 'âĢ', 'Į', 'ಸ', 'à³Ī', 'à²Ł', 'à³į:', 'Ġhttps', '://', 'kannad', 'a', '.', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '.', 'com', '</s>']
205
- Token IDs: [0, 765, 264, 285, 332, 268, 293, 264, 297, 239, 281, 335, 286, 3786, 473, 452, 471, 69, 18, 73, 92, 69, 81, 84, 80, 73, 18, 469, 1]
206
- Decoded text: ನಮ್ಮ ವೆಬ್‌ಸೈಟ್: https://kannada.example.com
207
- Analysis:
208
- - Number of tokens: 29
209
- - Average token length: 1.48 characters
210
- - Reconstruction: Perfect
211
-
212
- ----------------------------------------
213
-
214
- Test Case 2: Hashtags
215
- Original text: #ಕನ್ನಡ_ರಾಜ್ಯೋತ್ಸವ #Kannada
216
- Encoded tokens: ['<s>', '#', 'à²ķನ', 'à³į', 'ನಡ', '_', 'ರ', 'ಾ', 'à²ľ', 'à³į', 'ಯ', 'à³', 'ĭ', 'ತ', 'à³į', 'ಸವ', 'Ġ', '#', 'K', 'an', 'nad', 'a', '</s>']
217
- Token IDs: [0, 7, 754, 264, 407, 67, 272, 270, 330, 264, 277, 262, 238, 271, 264, 595, 225, 7, 47, 380, 461, 69, 1]
218
- Decoded text: #ಕನ್ನಡ_ರಾಜ್ಯೋತ್ಸವ #Kannada
219
- Analysis:
220
- - Number of tokens: 23
221
- - Average token length: 1.13 characters
222
- - Reconstruction: Perfect
223
-
224
- ----------------------------------------
225
-
226
- Test Case 3: Emojis
227
- Original text: ಕನ್ನಡ ನಾಡು 🚩 ಕನ್ನಡ ಭಾಷೆ ❤️
228
- Encoded tokens: ['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'à³ģ', 'Ġ', 'ð', 'Ł', 'ļ', '©', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', 'Ġ', 'â', 'Ŀ', '¤', 'ï', '¸', 'ı', '</s>']
229
- Token IDs: [0, 754, 264, 407, 298, 270, 280, 265, 225, 177, 258, 253, 107, 738, 264, 407, 386, 270, 323, 268, 225, 163, 256, 102, 176, 121, 242, 1]
230
- Decoded text: ಕನ್ನಡ ನಾಡು 🚩 ಕನ್ನಡ ಭಾಷೆ ❤️
231
- Analysis:
232
- - Number of tokens: 28
233
- - Average token length: 0.93 characters
234
- - Reconstruction: Perfect
235
-
236
- ----------------------------------------
237
-
238
- Test Case 4: File paths
239
- Original text: C:\Users\ಬಳಕೆದಾರ\Documents\ಕನ್ನಡ.txt
240
- Encoded tokens: ['<s>', 'C', ':', '\\', 'U', 's', 'e', 'r', 's', '\\', 'ಬಳà²ķ', 'à³Ĩ', 'ದ', 'ಾ', 'ರ', '\\', 'D', 'o', 'c', 'u', 'm', 'e', 'n', 't', 's', '\\', 'à²ķನ', 'à³į', 'ನಡ', '.', 't', 'x', 't', '</s>']
241
- Token IDs: [0, 39, 30, 64, 57, 87, 73, 86, 87, 64, 4023, 268, 269, 270, 272, 64, 40, 83, 71, 89, 81, 73, 82, 88, 87, 64, 754, 264, 407, 18, 88, 92, 88, 1]
242
- Decoded text: C:\Users\ಬಳಕೆದಾರ\Documents\ಕನ್ನಡ.txt
243
- Analysis:
244
- - Number of tokens: 34
245
- - Average token length: 1.06 characters
246
- - Reconstruction: Perfect
247
-
248
- ----------------------------------------
249
-
250
-
251
 
252
  ## Repository Structure
253
- The repository consists of tokenizer files, configuration files, and documentation.
254
 
255
  ## License
256
  This tokenizer is released under the **MIT License**.
@@ -258,9 +91,4 @@ This tokenizer is released under the **MIT License**.
258
  ## Citation
259
  If you use this tokenizer, please cite this repository.
260
 
261
- ---
262
- license: mit
263
- language:
264
- - en
265
- - kn
266
- ---
 
1
  ---
2
  language:
3
  - en
4
+ - kn
5
  tags:
6
  - language
7
  - kannada
 
9
  base_model:
10
  - text2font/ByteLevelBPETokenizer_default
11
  ---
12
+
13
  # Kannada ByteLevel BPE Tokenizer
14
 
15
  This repository contains a **Byte-Level BPE (Byte Pair Encoding) tokenizer** for the **Kannada** language, designed using the Hugging Face `tokenizers` library. The tokenizer is optimized for handling Kannada text, including **pure Kannada, mixed-language text, numbers, punctuation, and special cases like URLs and emojis**.
 
36
 
37
  ## Tokenizer Test Results
38
 
39
+ ### Category: Pure Kannada
40
+
41
+ #### Test Case 1: Basic sentence
42
+ **Original text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
43
+ **Encoded tokens:** ['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']
44
+ **Token IDs:** [0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]
45
+ **Decoded text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ
46
+ **Analysis:**
47
+ - Number of tokens: 14
48
+ - Average token length: 1.29 characters
49
+ - Reconstruction: Perfect
50
+
51
+ #### Test Case 2: Complex sentence
52
+ **Original text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
53
+ **Encoded tokens:** ['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']
54
+ **Token IDs:** [0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]
55
+ **Decoded text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ
56
+ **Analysis:**
57
+ - Number of tokens: 26
58
+ - Average token length: 1.27 characters
59
+ - Reconstruction: Perfect
60
+
61
+ ### Category: Mixed Language
62
+
63
+ #### Test Case 1: Kannada with English
64
+ **Original text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
65
+ **Encoded tokens:** ['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']
66
+ **Token IDs:** [0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]
67
+ **Decoded text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ
68
+ **Analysis:**
69
+ - Number of tokens: 36
70
+ - Average token length: 1.14 characters
71
+ - Reconstruction: Perfect
72
+
73
+ (Additional test cases can be added following the same format)
74
 
75
  ## Repository Structure
76
+ The repository consists of tokenizer files, configuration files, and documentation:
77
+ ```
78
+ ├── kannada_tokenizer/
79
+ │ ├── vocab.json
80
+ │ ├── merges.txt
81
+ │ ├── tokenizer.json
82
+ │ ├── tokenizer_config.json
83
+ │ ├── special_tokens_map.json
84
+ │ ├── config.json
85
+ ├── README.md
86
+ ```
87
 
88
  ## License
89
  This tokenizer is released under the **MIT License**.
 
91
  ## Citation
92
  If you use this tokenizer, please cite this repository.
93
 
94
+ For improvements and contributions, feel free to submit a **Pull Request** or open an **Issue**. 🚀
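
Note on reading the test results in this diff: tokens such as `'à²ķ'` and the `Ġ` prefix are not corruption. They look like the printable byte-to-character mapping that GPT-2 style byte-level BPE tokenizers use. The sketch below is a minimal, standalone illustration of that mapping, not code from this repository; the function names are hypothetical. It also includes a helper for the "Average token length" figure, on the assumption (consistent with the reported numbers, but not confirmed by the README) that it is the number of code points in the original text divided by the token count, special tokens included.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every other byte (controls, space, 0x7F-0xA0, 0xAD) is shifted past U+0100.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text: str) -> str:
    """Render the UTF-8 bytes of `text` as byte-level BPE characters."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(rendered: str) -> str:
    """Invert to_byte_level: map characters back to bytes and decode as UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in rendered).decode("utf-8")


def average_token_length(text: str, num_tokens: int) -> float:
    """Assumed metric: code points in the text per token, rounded to 2 places."""
    return round(len(text) / num_tokens, 2)


if __name__ == "__main__":
    print(to_byte_level("ಕ"))   # à²ķ  (UTF-8 bytes E0 B2 95)
    print(to_byte_level(" "))   # Ġ    (byte 0x20)
    print(from_byte_level(to_byte_level("ಕನ್ನಡ")) == "ಕನ್ನಡ")
    print(average_token_length("೧೨೩೪೫ 12345", 17))
```

Because the mapping is a bijection over all 256 byte values, any UTF-8 input can be round-tripped, which fits the "Reconstruction: Perfect" result reported for every test case.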