---
library_name: transformers
base_model: google-bert/bert-base-chinese
tags:
- generated_from_trainer
datasets:
- peoples_daily_ner
metrics:
- f1
model-index:
- name: models_for_ner
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: peoples_daily_ner
      type: peoples_daily_ner
      config: peoples_daily_ner
      split: validation
      args: peoples_daily_ner
    metrics:
    - type: f1
      value: 0.9508438253415484
      name: F1
---

# models_for_ner

This model is a fine-tuned version of [google-bert/bert-base-chinese](https://huggingface.co/google-bert/bert-base-chinese) on the peoples_daily_ner dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0219
- F1: 0.9508

## Model description

### Usage (pipeline)

```python
from transformers import pipeline

ner_pipe = pipeline('token-classification',
                    model='roberthsu2003/models_for_ner',
                    aggregation_strategy='simple')
inputs = '徐國堂在台北上班'
res = ner_pipe(inputs)
print(res)

# Group the extracted spans by entity type
res_result = {}
for r in res:
    entity_name = r['entity_group']
    start = r['start']
    end = r['end']
    if entity_name not in res_result:
        res_result[entity_name] = []
    res_result[entity_name].append(inputs[start:end])
print(res_result)
# ==output==
# {'PER': ['徐國堂'], 'LOC': ['台北']}
```

### Usage (model and tokenizer)

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import numpy as np

# Load the fine-tuned model and tokenizer
model = AutoModelForTokenClassification.from_pretrained('roberthsu2003/models_for_ner')
tokenizer = AutoTokenizer.from_pretrained('roberthsu2003/models_for_ner')

# The label mapping, read from the model config:
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label_list = list(model.config.id2label.values())

def predict_ner(text):
    """Predicts NER tags for a given text using the loaded model."""
    # Encode the text
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)

    # Get model predictions
    outputs = model(**inputs)
    predictions = np.argmax(outputs.logits.detach().numpy(), axis=-1)

    # Get the word IDs from the encoded inputs.
    # Note: word_ids() is a method on the encoding result, not on the tokenizer itself.
    word_ids = inputs.word_ids(batch_index=0)

    pred_tags = []
    for word_id, pred in zip(word_ids, predictions[0]):
        if word_id is None:
            continue  # Skip special tokens ([CLS], [SEP])
        pred_tags.append(label_list[pred])
    return pred_tags

# To get the entities, group consecutive non-O tags:
def get_entities(tags):
    """Groups consecutive NER tags to extract entities."""
    entities = []
    start_index = -1
    current_entity_type = None

    for i, tag in enumerate(tags):
        if tag != 'O':
            if start_index == -1:
                start_index = i
                current_entity_type = tag[2:]  # Extract entity type (e.g., PER, LOC, ORG)
        else:  # tag == 'O'
            if start_index != -1:
                entities.append((start_index, i, current_entity_type))
                start_index = -1
                current_entity_type = None

    if start_index != -1:
        entities.append((start_index, len(tags), current_entity_type))
    return entities

# Example usage:
text = "徐國堂在台北上班"
ner_tags = predict_ner(text)
print(f"Text: {text}")
# ==output==
# Text: 徐國堂在台北上班
print(f"NER Tags: {ner_tags}")
# ==output==
# NER Tags: ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']

entities = get_entities(ner_tags)
word_tokens = tokenizer.tokenize(text)  # Tokenize to get the individual tokens
print("Entities:")
for start, end, entity_type in entities:
    entity_text = "".join(word_tokens[start:end])
    print(f"- {entity_text}: {entity_type}")
# ==output==
# Entities:
# - 徐國堂: PER
# - 台北: LOC
```
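The F1 reported at the top of this card is a span-level (entity-level) score. As a quick illustration of how such a score can be checked, the sketch below (not part of the original card) uses the `seqeval` package, assumed to be installed separately, to compare gold and predicted BIO tag sequences like the ones returned by `predict_ner`.

```python
# Minimal sketch (assumption, not the card's own code): span-level F1 with seqeval.
# Requires: pip install seqeval
from seqeval.metrics import f1_score

# One list of BIO tags per sentence; here, gold tags and the tags predicted
# for "徐國堂在台北上班" by predict_ner above.
y_true = [['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']]
y_pred = [['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']]

print(f1_score(y_true, y_pred))  # 1.0 when every entity span matches exactly
```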
## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 128
- seed: 42
- optimizer: AdamW (torch implementation) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 3

A hedged sketch reconstructing this configuration with the `Trainer` API is shown after the framework versions below.

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0274        | 1.0   | 327  | 0.0204          | 0.9510 |
| 0.0127        | 2.0   | 654  | 0.0174          | 0.9592 |
| 0.0063        | 3.0   | 981  | 0.0186          | 0.9602 |

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0
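### Training setup (reconstruction sketch)

The original training script is not included in this card. The following is a minimal sketch that reconstructs the hyperparameters listed above using the standard Hugging Face token-classification recipe; the dataset loading, the `tokenize_and_align` preprocessing, the `output_dir`, and the per-epoch evaluation strategy are assumptions, not the author's code, and the F1 metric computation (e.g., with seqeval, as sketched earlier) is omitted for brevity.

```python
# Minimal sketch (not the original training script) reconstructing the
# hyperparameters listed above. Dataset loading and label alignment follow
# the standard token-classification recipe and are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

checkpoint = 'google-bert/bert-base-chinese'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# peoples_daily_ner provides `tokens` and `ner_tags` columns with BIO labels.
raw_ds = load_dataset('peoples_daily_ner')
label_list = raw_ds['train'].features['ner_tags'].feature.names

def tokenize_and_align(examples):
    """Tokenize pre-split tokens and align BIO labels to sub-word pieces."""
    tokenized = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples['ner_tags']):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for wid in word_ids:
            if wid is None:
                labels.append(-100)       # special tokens are ignored by the loss
            elif wid != previous:
                labels.append(tags[wid])  # label the first sub-token of each word
            else:
                labels.append(-100)       # mask the remaining sub-tokens
            previous = wid
        all_labels.append(labels)
    tokenized['labels'] = all_labels
    return tokenized

tokenized_ds = raw_ds.map(tokenize_and_align, batched=True,
                          remove_columns=raw_ds['train'].column_names)

model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)})

args = TrainingArguments(
    output_dir='models_for_ner',   # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    num_train_epochs=3,
    lr_scheduler_type='linear',
    seed=42,
    eval_strategy='epoch',         # assumption: evaluate once per epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['validation'],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    processing_class=tokenizer,
)
trainer.train()
```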