AutoTrain documentation

Token Classification

You are viewing v0.8.19 version. A newer version v0.8.24 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Token Classification

Token classification is the task of classifying each token in a sequence. This can be used for Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and more. Get your data ready in proper format and then with just a few clicks, your state-of-the-art model will be ready to be used in production.

Data Format

The data should be in the following CSV format:

tokens,tags
"['I', 'love', 'Paris']", "['O', 'O', 'B-LOC']"
"['I', 'live', 'in', 'New', 'York']", "['O', 'O', 'O', 'B-LOC', 'I-LOC']"
.
.
.

or you can also use JSONL format:

{"tokens": ["I", "love", "Paris"], "tags": ["O", "O", "B-LOC"]}
{"tokens": ["I", "live", "in", "New", "York"], "tags": ["O", "O", "O", "B-LOC", "I-LOC"]}
.
.
.

As you can see, we have two columns in the CSV file. One column is the tokens and the other is the tags. Both the columns are stringified lists! The tokens column contains the tokens of the sentence and the tags column contains the tags for each token.

If your CSV is huge, you can divide it into multiple CSV files and upload them separately. Please make sure that the column names are the same in all CSV files.

One way to divide the CSV file using pandas is as follows:

import pandas as pd

# Set the chunk size
chunk_size = 1000
i = 1

# Open the CSV file and read it in chunks
for chunk in pd.read_csv('example.csv', chunksize=chunk_size):
    # Save each chunk to a new file
    chunk.to_csv(f'chunk_{i}.csv', index=False)
    i += 1

Columns

Your CSV/JSONL dataset must have two columns: tokens and tags.

< > Update on GitHub