
Tokenizers convert raw text into numerical tokens that models can process — a critical step in every NLP pipeline.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hello, how are you?")
print(tokens['input_ids']) # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
print(tokenizer.decode(tokens['input_ids'])) # [CLS] hello, how are you? [SEP]Reference:
TaskLoco™ — The Sticky Note GOAT