# Bert2DModel
Bert2DModel is a new take on the classic BERT architecture, built specifically for morphologically rich languages such as Turkish, where words carry a lot of internal structure.

Think of it this way: regular BERT sees a sentence as a flat sequence of tokens. But in many languages, the words themselves have internal structure (prefixes, suffixes, and so on). Bert2D captures this with a "2D embedding" scheme: it encodes not only a word's position in the sentence (the first dimension) but also the position of each subword piece inside that word (the second dimension). This gives the model a much finer-grained view of grammar and meaning, especially for languages where words can take many inflected forms. This first release is trained for Turkish.
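To make the two dimensions concrete, here is a small, purely illustrative sketch of how a sentence could map onto word-level and subword-level positions. The word splits and variable names below are assumptions chosen for explanation, not the actual Bert2D tokenizer output.

```python
# Illustrative only: how two Turkish words might break into subword pieces
# and receive (word position, subword position) pairs. The splits and names
# here are assumptions for explanation, not the real Bert2D tokenizer API.
tokenized_words = [
    ("evlerinizden", ["ev", "##ler", "##iniz", "##den"]),  # "from your houses"
    ("geliyorum", ["gel", "##iyor", "##um"]),              # "I am coming"
]

word_position_ids = []      # dimension 1: which word in the sentence
subword_position_ids = []   # dimension 2: which piece inside that word
for word_idx, (_, pieces) in enumerate(tokenized_words):
    for piece_idx in range(len(pieces)):
        word_position_ids.append(word_idx)
        subword_position_ids.append(piece_idx)

print(word_position_ids)     # [0, 0, 0, 0, 1, 1, 1]
print(subword_position_ids)  # [0, 1, 2, 3, 0, 1, 2]
```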
You can find all the original [Bert2DModel] checkpoints under the yigitbekir collection.
Click on the [Bert2DModel] models in the right sidebar for more examples of how to apply [Bert2DModel] to different text and token classification tasks.
The examples below demonstrate how to use the `fill-mask` pipeline with Bert2DModel, or how to load the model directly with the [AutoModel] class.
```python
from transformers import pipeline

# 1. Define the model repository ID
repo_id = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# 2. Create the pipeline for the "fill-mask" task
# trust_remote_code=True is required because Bert2D ships its own modeling code.
fill_masker = pipeline(
    "fill-mask",
    model=repo_id,
    use_fast=True,
    trust_remote_code=True,
)

# 3. Prepare the input and get predictions
masked_sentence = "Adamın mesleği [MASK] midir acaba?"
predictions = fill_masker(masked_sentence)

# 4. Print the results in a user-friendly format
print(f"Predictions for: '{masked_sentence}'")
for prediction in predictions:
    print(f"  Sequence: {prediction['sequence']}")
    print(f"  Token: {prediction['token_str']}")
    print(f"  Score: {prediction['score']:.4f}")
    print("-" * 20)

# Expected output (top predictions):
# Predictions for: 'Adamın mesleği [MASK] midir acaba?'
#   Sequence: Adamın mesleği mühendis midir acaba?
#   Token: mühendis
#   Score: 0.2393
# --------------------
#   Sequence: Adamın mesleği doktor midir acaba?
#   Token: doktor
#   Score: 0.1698
# --------------------
```
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2", trust_remote_code=True)
model = AutoModel.from_pretrained("yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2", trust_remote_code=True)

# Example text
text = "Türkiye'nin başkenti Ankara'dır."
inputs = tokenizer(text, return_tensors="pt")

# Get model outputs
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
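If you want a single vector per sentence from these outputs, one common, model-agnostic approach is mask-aware mean pooling over the token embeddings. The snippet below is a sketch that reuses the `inputs` and `last_hidden_states` variables from the example above; it is not a Bert2D-specific API.

```python
import torch

# Mask-aware mean pooling over token embeddings (generic technique, not a
# Bert2D-specific API). Assumes `inputs` and `last_hidden_states` from above.
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (last_hidden_states * mask).sum(dim=1)         # (batch, hidden_size)
counts = mask.sum(dim=1).clamp(min=1e-9)                # (batch, 1)
sentence_embedding = summed / counts

print(sentence_embedding.shape)  # e.g. torch.Size([1, 768]), depending on the model's hidden size
```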
## Notes
- **Configuration is key.** Bert2D introduces new configuration parameters that are not present in a standard BERT model. You must use the `Bert2DConfig` and be mindful of these settings when training or fine-tuning; failing to do so will lead to unexpected behavior. The two key new parameters are `max_word_position_embeddings` and `max_intermediate_subword_position_embeddings`.

  ```python
  from transformers import AutoConfig

  # Load the custom config from a pretrained model
  config = AutoConfig.from_pretrained("yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2", trust_remote_code=True)

  # Access the new parameters
  print(f"Max Word Positions: {config.max_word_position_embeddings}")
  # Expected output: Max Word Positions: 512

  print(f"Intermediate Subword Position: {config.max_intermediate_subword_position_embeddings}")
  # Expected output: Intermediate Subword Position: 2
  ```
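Building on the note above, here is a hedged sketch of making the configuration explicit when loading the model for fine-tuning. It simply reuses the pretrained `Bert2DConfig` values; changing the two Bert2D-specific parameters would generally require retraining the corresponding position embeddings.

```python
from transformers import AutoConfig, AutoModel

checkpoint = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# Reuse the pretrained config so the Bert2D-specific settings stay consistent
# with the checkpoint (512 word positions, 2 intermediate subword positions).
config = AutoConfig.from_pretrained(checkpoint, trust_remote_code=True)

# Pass the config explicitly when loading the model for fine-tuning
model = AutoModel.from_pretrained(checkpoint, config=config, trust_remote_code=True)
print(type(model).__name__)  # the custom Bert2D model class registered by the remote code
```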