|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# Model Card for gsar78/tokenizer_BPE_en_el
|
|
|
An English & Greek BPE tokenizer trained from scratch.
|
|
|
### Direct Use |
|
|
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/tokenizer_BPE_en_el")

# Tokenize input text
input_text = "This is a game"
inputs = tokenizer(input_text, return_tensors="pt")

# Print the tokenized input IDs
print("Token IDs:", inputs["input_ids"].tolist())

# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```
|
|
|
```text
# Output:
Token IDs: [[2967, 317, 220, 1325]]
Tokens: ['This', 'Ġis', 'Ġa', 'Ġgame']
Tokenized String: This Ġis Ġa Ġgame
```
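
The `Ġ` character is how byte-level BPE marks a token that begins with a space; it only appears in the token display and disappears on decoding. A minimal round-trip sketch, continuing the example above:

```python
# Continuing the example above: decode the IDs back to text;
# the Ġ markers vanish on decoding
decoded = tokenizer.decode(inputs["input_ids"][0])
print("Decoded:", decoded)  # -> This is a game
```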
|
|
|
|
|
### Recommendations |
|
|
|
When tokenizing Greek text, the printed tokens may appear as gibberish; this is a display artifact of the byte-level BPE encoding and does not affect downstream model pretraining.
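
For illustration, a minimal sketch tokenizing a Greek sentence (the example sentence is arbitrary): the intermediate tokens may display as byte-level artifacts, but decoding recovers the original text.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/tokenizer_BPE_en_el")

# An arbitrary Greek sentence ("This is a game")
greek_text = "Αυτό είναι ένα παιχνίδι"
ids = tokenizer(greek_text)["input_ids"]

# The tokens may look like mojibake in the byte-level display,
# but decoding restores the original Greek text
print("Tokens:", tokenizer.convert_ids_to_tokens(ids))
print("Decoded:", tokenizer.decode(ids))
```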
|
|
|
(An improved version of this tokenizer, without the gibberish Greek tokens, can be found here: gsar78/Greek_Tokenizer)
|
|
|
This tokenizer can serve as a good starting point for pretraining a GPT-based model, or any other model that uses BPE, as sketched below.
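
As an illustrative sketch only (the model hyperparameters below are placeholders, not recommendations), a small GPT-2-style model can be initialized with its vocabulary size taken from this tokenizer:

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/tokenizer_BPE_en_el")

# Derive the vocabulary size from the tokenizer instead of hardcoding it;
# the sizes below are illustrative placeholders for a small model
config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=512,
    n_embd=256,
    n_layer=4,
    n_head=4,
)
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters():,}")
```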
|
|
|
|