Bert classification

Sentence Classification With Huggingface BERT and W&B

Set Up A Hyperparameter Sweep

There’s only one step left before we train our model.

We’ll create a configuration file that’ll list all the values a hyper-parameter can take.

Then we’ll initialize our wandb sweep agent to log, compare and visualize the performance of each combination.

The metric we’re looking to maximize is the validation accuracy, which we’ll log in the training loop.

In the BERT paper, the authors described the best set of hyper-parameters for performing transfer learning, and we’re using that same set of values for our hyper-parameters.
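
A hedged sketch of what such a sweep configuration could look like. The grid values mirror the fine-tuning ranges recommended in the BERT paper; the metric name val_accuracy, the project name, and the train function (described in the next section) are assumptions rather than the report's exact code.

import wandb

sweep_config = {
    'method': 'grid',
    'metric': {'name': 'val_accuracy', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {'values': [5e-5, 3e-5, 2e-5]},
        'batch_size': {'values': [16, 32]},
        'epochs': {'values': [2, 3, 4]},
    },
}

sweep_id = wandb.sweep(sweep_config, project='bert-finetuning')
wandb.agent(sweep_id, function=train)  # train is the training function defined below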

The Training Function

Visualizations

Conclusion

Now you have a state-of-the-art BERT model trained on the best set of hyper-parameter values for sentence classification, along with various statistical visualizations. We can see the best hyperparameter values from running the sweeps. The highest validation accuracy achieved in this batch of sweeps is around 84%.

I encourage you to try running a sweep with more hyperparameter combinations to see if you can improve the performance of the model.

Try BERT fine-tuning in a Colab ->

Source: https://wandb.ai/cayush/bert-finetuning/reports/Sentence-Classification-With-Huggingface-BERT-and-W-B--Vmlldzo4MDMwNA

Text Classification With BERT

A code-first, reader-friendly kickstart to fine-tuning BERT for text classification with tf.data and TF Hub. Made by Akshay Uppal using Weights & Biases.

Akshay Uppal

Sections:

  1. What is BERT?

  2. Setting up BERT for Text Classification

  3. The Quora Dataset We'll Use for Our Task

  4. Exploring the Dataset

  5. Getting BERT

  6. Using tf.data

  7. Creating, Training, and Tracking Our BERT Model

  8. Saving & Versioning the Model

  9. BERT Text Classification & Code

What is BERT?

Bidirectional Encoder Representations from Transformers, better known as BERT, was introduced in a revolutionary paper by Google that raised the state-of-the-art performance on various NLP tasks and was the stepping stone for many other revolutionary architectures.

It's not an exaggeration to say that BERT set a new direction for the entire domain. It shows clear benefits of using pre-trained models (trained on huge datasets) and transfer learning independent of the downstream tasks.

In this report, we're going to look at using BERT for text classification and provide a ton of code and examples to get you up and running. If you'd like to check out the primary source yourself, here's a link to the annotated paper.

BERT Classification Model

Setting Up BERT for Text Classification

First, we'll install TensorFlow and TensorFlow Model Garden:

We'll also clone the Github Repo for TensorFlow models. A few things of note:

  • --depth 1: during cloning, Git will only fetch the latest copy of the relevant files, which can save a lot of space and time.

  • -b lets us clone a specific branch only.

Please match the branch with your TensorFlow 2.x version; a sketch of both steps follows.
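
A rough sketch of the setup described above. The exact package versions and the release tag are not given in the report, so treat them as placeholders.

# Install TensorFlow and the Model Garden package, then clone the models repo.
!pip install -q tensorflow
!pip install -q tf-models-official   # TensorFlow Model Garden
!git clone --depth 1 -b v2.5.0 https://github.com/tensorflow/models.git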

It's raining imports in here, friends.

A quick sanity-check of different versions and dependencies installed:
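
A plausible set of imports and version checks for this tutorial; the report's exact list may differ, and the tokenization import assumes the Model Garden clone (or tf-models-official) is on the Python path.

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from official.nlp.bert import tokenization   # WordPiece tokenizer from the Model Garden
from sklearn.model_selection import train_test_split
import wandb

print("TF version:", tf.__version__)
print("Hub version:", hub.__version__)
print("GPU available:", tf.config.list_physical_devices('GPU'))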

Let's Get the Dataset

The dataset we'll use today is provided via the Quora Insincere Questions Classification competition on Kaggle.

Please feel free to download the training set from Kaggle or use the link below to download the train.csv from that competition:

https://archive.org/download/quora_dataset_train.csv/quora_dataset_train.csv.zip

Decompress and Read the Data into a pandas DataFrame:
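
A minimal sketch of this step; the archive and file names are assumptions based on the download link above.

import zipfile

with zipfile.ZipFile('quora_dataset_train.csv.zip', 'r') as z:
    z.extractall('.')

df = pd.read_csv('train.csv')
print(df.shape)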

Next, run the following:

Alright. Now, let's quickly visualize that data in a W&B Table:
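
A short sketch of logging a sample as a W&B Table; the project and table names are placeholders.

run = wandb.init(project='bert-text-classification', job_type='eda')
wandb.log({'training_samples': wandb.Table(dataframe=df.sample(100))})
run.finish()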

Let's Explore

The Label Distribution

It's a good idea to understand the data you're working with before you really dig into modeling. Here, we're going to walk through our label distribution, check how long our data points are, make certain that our test and train sets are well distributed, and handle a few other preliminary tasks. First, though, let's look at the label distribution by running:
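
As a sketch: the Quora Insincere Questions data has a binary target column named target (0 = sincere, 1 = insincere).

print(df['target'].value_counts())
print(df['target'].value_counts(normalize=True))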

Label Distribution

Word Length and Character Length

Now, let's run a few lines of code to understand the text data we're working with here.
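
A small sketch of the kind of length statistics being discussed, assuming the question text lives in the question_text column.

df['word_count'] = df['question_text'].str.split().str.len()
df['char_count'] = df['question_text'].str.len()
print(df[['word_count', 'char_count']].describe())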

Preparing Training and Testing Data for Our BERT Text Classification Tasks

A few notes on our approach here:

  • We'll use small portions of the data, as the overall dataset would take ages to train. You can of course feel free to include more data by changing train_size.

  • Since the dataset is very imbalanced, we will keep the same distribution in both the train and test sets by stratifying the split on the labels (see the sketch after this list). In this section, we'll be analyzing our data to make sure we did a good job at this.
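
A sketch of the stratified sampling described above; the exact train_size fractions here are assumptions, not the report's values.

train_df, remaining = train_test_split(
    df, train_size=0.1, stratify=df['target'], random_state=42)
valid_df, _ = train_test_split(
    remaining, train_size=0.01, stratify=remaining['target'], random_state=42)
print(train_df.shape, valid_df.shape)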

(130612, 3) (11755, 3)

Getting the Word and Character Length for the Sampled Sets

In other words, it looks like the train and validation set are similar in terms of class imbalance and the various lengths in the question texts.

Analyzing the Distribution of Question Text Length in Words

Analyzing the Distribution of Question Text Length in Characters

As we dig into our train and validation sets, one other thing we want to check is whether the question text length is mostly similar between the two. Having roughly similar distributions here is generally a smart idea to prevent biasing or overfitting our model.

And it is. Even the distribution of question length in words and characters is very similar. It looks like a good train/test split so far.

Taming the Data

Next, we want the dataset to be created and preprocessed on the CPU:
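
A minimal sketch of building the raw tf.data datasets on the CPU (column names follow the Quora data as above).

with tf.device('/cpu:0'):
    train_data = tf.data.Dataset.from_tensor_slices(
        (train_df['question_text'].values, train_df['target'].values))
    valid_data = tf.data.Dataset.from_tensor_slices(
        (valid_df['question_text'].values, valid_df['target'].values))

print(len(train_df), len(valid_df))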

130612 11755

Okay. Let's BERT.

Let's BERT: Get the Pre-trained BERT Model from TensorFlow Hub

Source

We'll be using the uncased BERT present in the tfhub.

In order to prepare the text to be given to the BERT layer, we need to first tokenize our words. The tokenizer here is present as a model asset and will do uncasing for us as well.

Setting all parameters in the form of a dictionary so any changes, if needed, can be made here:

Get the BERT layer and tokenizer:
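
A sketch of both the parameter dictionary mentioned above and the TF Hub loading step. The hub URL points at the commonly used uncased L-12/H-768/A-12 encoder (versions /1 and /2 return pooled and sequence outputs from a list-style call); confirm the exact handle and hyper-parameter values against the original report.

config = {
    'bert_model_url': 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2',
    'max_seq_length': 128,
    'batch_size': 32,
    'epochs': 3,
    'learning_rate': 2e-5,
}

bert_layer = hub.KerasLayer(config['bert_model_url'], trainable=True)

# Build the WordPiece tokenizer from the model's own vocab asset.
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)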

Checking out some of the training samples and their tokenized ids
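
A quick sketch that reproduces the kind of check shown below.

tokens = tokenizer.tokenize('hello world, it is a wonderful day for learning')
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))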

['hello', 'world', '##,', 'it', 'is', 'a', 'wonderful', 'day', 'for', 'learning'] [7592, 2088, 29623, 2009, 2003, 1037, 6919, 2154, 2005, 4083]

Let's Get That Data Ready: Tokenize and Preprocess Text for BERT

Each line of the dataset is composed of the question text and its label. Data preprocessing consists of transforming the text into BERT input features:

  • Input Word Ids: Output of our tokenizer, converting each sentence into a set of token ids.
  • Input Masks: Since we are padding all the sequences to 128 (the max sequence length), it is important that we create a mask so those paddings do not interfere with the actual text tokens. Therefore we need to generate an input mask that blocks the padding. The mask has 1 for real tokens and 0 for padding tokens, so only real tokens are attended to.

  • Segment Ids: For our task of text classification, since there is only one sequence, the segment_ids/input_type_ids is essentially just a vector of 0s (see the preprocessing sketch after this list).
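
A plain-Python sketch of that preprocessing. The function and variable names are assumptions; the original report uses the Model Garden's classifier_data_lib helpers to the same effect.

def to_feature(text, label, max_seq_length=config['max_seq_length']):
    tokens = tokenizer.tokenize(text)
    # Reserve room for the [CLS] and [SEP] special tokens.
    tokens = ['[CLS]'] + tokens[:max_seq_length - 2] + ['[SEP]']
    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_word_ids)
    # Pad everything out to max_seq_length.
    padding = [0] * (max_seq_length - len(input_word_ids))
    input_word_ids += padding
    input_mask += padding
    segment_ids = [0] * max_seq_length   # single-sequence task, so all zeros
    return input_word_ids, input_mask, segment_ids, label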

BERT was trained on two tasks:

  1. fill in randomly masked words from a sentence (masked language modeling).

  2. given two sentences, predict whether the second one actually follows the first (next sentence prediction).

  • You want to use Dataset.map to apply this function to each element of the dataset. Dataset.map runs in graph mode and Graph tensors do not have a value.
  • In graph mode, you can only use TensorFlow Ops and functions.

So you can't .map this function directly: you need to wrap it in a tf.py_function. The tf.py_function will pass regular tensors (with a value and a .numpy() method to access it) to the wrapped Python function.

Wrapping the Python Function into a TensorFlow op for Eager Execution
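
A sketch of that wrapper around the to_feature function above; the dictionary keys are assumptions and simply have to match the model's input layer names (they do in the model sketch later on).

def to_feature_map(text, label):
    input_ids, input_mask, segment_ids, label_id = tf.py_function(
        lambda t, l: to_feature(t.numpy().decode('utf-8'), l.numpy()),
        inp=[text, label],
        Tout=[tf.int32, tf.int32, tf.int32, tf.int32])

    max_seq_length = config['max_seq_length']
    input_ids.set_shape([max_seq_length])
    input_mask.set_shape([max_seq_length])
    segment_ids.set_shape([max_seq_length])
    label_id.set_shape([])

    x = {'input_word_ids': input_ids,
         'input_mask': input_mask,
         'input_type_ids': segment_ids}
    return x, label_id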

The final data point passed to the model is a dictionary as x along with the labels (the dictionary keys must match the names of the model's input layers).

Let the Data Flow: Creating the Final Input Pipeline Using tf.data

Apply the Transformation to our Train and Test Datasets

The resulting tf.data.Datasets return (features, labels) pairs, as expected by keras.Model.fit
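
A sketch of that pipeline, again staged on the CPU; shuffle buffer and prefetch settings are assumptions.

AUTOTUNE = tf.data.experimental.AUTOTUNE

with tf.device('/cpu:0'):
    train_ds = (train_data
                .map(to_feature_map, num_parallel_calls=AUTOTUNE)
                .shuffle(1000)
                .batch(config['batch_size'], drop_remainder=True)
                .prefetch(AUTOTUNE))
    valid_ds = (valid_data
                .map(to_feature_map, num_parallel_calls=AUTOTUNE)
                .batch(config['batch_size'], drop_remainder=True)
                .prefetch(AUTOTUNE))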

Creating, Training & Tracking Our BERT Classification Model.

Let's model our way to glory!!!

Create The Model

There are two outputs from the BERT Layer:

  • A pooled_output of shape [batch_size, 768] with representations for the entire input sequences.

  • A sequence_output of shape [batch_size, max_seq_length, 768] with representations for each input token (in context).

For the classification task, we are only concerned with the pooled_output:
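
A sketch of the classification head on top of the hub layer. The input layer names must match the dictionary keys produced by to_feature_map above; the dropout rate is an assumption, and the two-output call signature matches the /1 and /2 hub models pinned in config.

def create_model():
    max_seq_length = config['max_seq_length']
    input_word_ids = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name='input_word_ids')
    input_mask = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name='input_mask')
    input_type_ids = tf.keras.layers.Input(
        shape=(max_seq_length,), dtype=tf.int32, name='input_type_ids')

    pooled_output, sequence_output = bert_layer(
        [input_word_ids, input_mask, input_type_ids])

    x = tf.keras.layers.Dropout(0.2)(pooled_output)
    output = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(x)

    return tf.keras.Model(
        inputs=[input_word_ids, input_mask, input_type_ids], outputs=output)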

Training Your Model

Model Summary

Model Architecture Summary

One drawback of TF Hub is that we import the entire module as a single Keras layer, as a result of which we don't see the individual parameters and layers in the model summary.

The official TF Hub page states that "All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice." Therefore we will go ahead and train the entire model without freezing anything.

Experiment Tracking

Since you are here, I am sure you have a good idea about Weights and Biases but if not, then read along :)

In order to start the experiment tracking, we will be creating 'runs' on W&B.

wandb.init(): It initializes the run with basic project information parameters:

  • project: The project name; this creates a new project tab where all the experiments for this project will be tracked.

  • config: A dictionary of all parameters and hyper-parameters we wish to track

  • group: optional, but would help us to group by different parameters later on

  • job_type: describes the job type and helps in grouping different experiments later, e.g. "train", "evaluate", etc. (a minimal init sketch follows this list).
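
A minimal init sketch; the project and group names are placeholders.

run = wandb.init(project='bert-text-classification',
                 config=config,
                 group='BERT_EN_UNCASED',
                 job_type='train')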

Now, in order to log all the different metrics, we will use a simple callback provided by W&B.

WandbCallback(): https://docs.wandb.ai/guides/integrations/keras

Yes, it is as simple as adding a callback :D
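
A sketch of compiling and training with the callback attached; the optimizer, loss, and metric choices are reasonable assumptions for this binary task rather than the report's exact settings.

from wandb.keras import WandbCallback

model = create_model()
model.compile(optimizer=tf.keras.optimizers.Adam(config['learning_rate']),
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_ds,
                    validation_data=valid_ds,
                    epochs=config['epochs'],
                    callbacks=[WandbCallback()])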

Some Training Metrics and Graphs

Let's Evaluate

Let us do an evaluation on the validation set and log the scores using W&B.

wandb.log(): Log a dictionary of scalars (metrics like accuracy and loss) and any other type of wandb object. Here we will pass the evaluation dictionary as it is and log it.
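
A two-line sketch of that step.

scores = model.evaluate(valid_ds, return_dict=True)
wandb.log(scores)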

Saving the Models and Model Versioning

Finally, we're going to look at saving reproducible models with W&B. Namely, with Artifacts.

W&B Artifacts

For saving the models and making it easier to track different experiments, we will be using wandb.artifacts. W&B Artifacts are a way to save your datasets and models.

Within a run, there are three steps for creating and saving a model Artifact.

  • Create an empty Artifact with wandb.Artifact().

  • Add your model file to the Artifact with artifact.add_file() (or artifact.add_dir() for a SavedModel directory).

  • Call wandb.log_artifact() to save the Artifact (a short sketch follows this list).
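
A sketch of those three steps; the file and artifact names are placeholders.

model.save('bert_classifier')                              # TF SavedModel directory

artifact = wandb.Artifact('bert-text-classifier', type='model')
artifact.add_dir('bert_classifier')
run.log_artifact(artifact)
run.finish()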

Quick Sneak Peek into the W&B Dashboard

Things to note:

  • Grouping of experiments and runs.

  • Visualizations of all training logs and metrics.

  • Visualizations of system metrics, which can be useful when training on cloud instances or physical GPU machines.

  • Hyperparameter tracking in the tabular form.

  • Artifacts: Model versioning and storage.

BERT Text Classification Summary & Code

I hope this hands-on tutorial was useful for you, and if you have read this far, I hope you have some good takeaways from here.

The full code of this post can be found here

Source: https://wandb.ai/akshayuppal12/Finetune-BERT-Text-Classification/reports/Text-Classification-With-BERT--Vmlldzo4OTk4MzY

Jay Alammar

Translations: Chinese, Russian

Progress has been rapidly accelerating in machine learning models that process language over the last couple of years. This progress has left the research lab and started powering some of the leading digital products. A great example of this is the recent announcement of how the BERT model is now a major force behind Google Search. Google believes this step (or progress in natural language understanding as applied in search) represents “the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search”.

This post is a simple tutorial for how to use a variant of BERT to classify sentences. This is an example that is basic enough as a first intro, yet advanced enough to showcase some of the key concepts involved.

Alongside this post, I’ve prepared a notebook. You can view the notebook here or run it on Colab.

Dataset: SST2

The dataset we will use in this example is SST2, which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):

sentence | label
a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films | 1
apparently reassembled from the cutting room floor of any given daytime soap | 0
they presume their audience won't sit still for a sociology lesson | 0
this is a visually stunning rumination on love , memory , history and the war between art and commerce | 1
jonathan parker 's bartleby should have been the be all end all of the modern office anomie films | 1

Models: Sentence Sentiment Classification

Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

Under the hood, the model is actually made up of two models.

  • DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
  • The next model, a basic logistic regression model from scikit-learn, will take in the result of DistilBERT’s processing and classify the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.

If you’ve read my previous post, Illustrated BERT, this vector is the result of the first position (which receives the [CLS] token as input).

Model Training

While we’ll be using two models, we will only train the logistic regression model. For DistilBERT, we’ll use a model that’s already pre-trained and has a grasp on the English language. This model, however, is neither trained nor fine-tuned to do sentence classification. We get some sentence classification capability, however, from the general objectives BERT is trained on. This is especially the case with BERT’s output for the first position (associated with the [CLS] token). I believe that’s due to BERT’s second training objective, next sentence classification. That objective seemingly trains the model to encapsulate a sentence-wide sense into the output at the first position. The transformers library provides us with an implementation of DistilBERT as well as pretrained versions of the model.

Tutorial Overview

So here’s the game plan with this tutorial. We will first use the trained distilBERT to generate sentence embeddings for 2,000 sentences.

We will not touch distilBERT after this step. It’s all Scikit Learn from here. We do the usual train/test split on this dataset:


Train/test split for the output of distilBert (model #1) creates the dataset we'll train and evaluate logistic regression on (model #2). Note that in reality, sklearn's train/test split shuffles the examples before making the split, it doesn't just take the first 75% of examples as they appear in the dataset.

Then we train the logistic regression model on the training set:

How a single prediction is calculated

Before we dig into the code and explain how to train the model, let’s look at how a trained model calculates its prediction.

Let’s try to classify the sentence “a visually stunning rumination on love”. The first step is to use the BERT tokenizer to split the sentence into tokens. Then, we add the special tokens needed for sentence classification (these are [CLS] at the first position, and [SEP] at the end of the sentence).

The third step the tokenizer does is to replace each token with its id from the embedding table which is a component we get with the trained model. Read The Illustrated Word2vec for a background on word embeddings.

Note that the tokenizer does all these steps in a single line of code:
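
A hedged, self-contained sketch of that single line using the transformers library (the post's notebook builds the tokenizer the same way later on).

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
token_ids = tokenizer.encode("a visually stunning rumination on love", add_special_tokens=True)
print(token_ids)   # starts with the [CLS] id (101) and ends with the [SEP] id (102)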

Our input sentence is now the proper shape to be passed to DistilBERT.

If you’ve read Illustrated BERT, this step can also be visualized in this manner:

Flowing Through DistilBERT

Passing the input vector through DistilBERT works just like BERT. The output is a vector for each input token, and each vector is made up of 768 numbers (floats).

Because this is a sentence classification task, we ignore all except the first vector (the one associated with the [CLS] token). That one vector is what we pass as the input to the logistic regression model.

From here, it’s the logistic regression model’s job to classify this vector based on what it learned from its training phase. We can think of a prediction calculation as looking like this:

The training is what we’ll discuss in the next section, along with the code of the entire process.

The Code

In this section we’ll highlight the code to train this sentence classification model. A notebook containing all this code is available on colab and github.

Let’s start by importing the tools of the trade
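
A plausible set of imports for this walkthrough; the notebook's exact list may differ.

import numpy as np
import pandas as pd
import torch
import transformers as ppb   # the HuggingFace transformers library
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split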

The dataset is available as a file on github, so we just import it directly into a pandas dataframe
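
A sketch of that step. The SST2 file is tab-separated with no header row; the exact GitHub URL is in the notebook, and a local copy of the file works the same way.

df = pd.read_csv('train.tsv', delimiter='\t', header=None)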

We can use df.head() to look at the first five rows of the dataframe to see how the data looks.

Which outputs:

Importing pre-trained DistilBERT model and tokenizer
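
A sketch of loading the pre-trained weights and the matching tokenizer (the class-tuple pattern here mirrors the notebook, but treat it as an assumption).

model_class, tokenizer_class, pretrained_weights = (
    ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)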

We can now tokenize the dataset. Note that we’re going to do things a little differently here from the example above. The example above tokenized and processed only one sentence. Here, we’ll tokenize and process all sentences together as a batch (the notebook processes a smaller group of examples just for resource considerations, let’s say 2000 examples).

Tokenization
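
A one-line sketch of batched tokenization; df[0] holding the sentence text matches the header-less dataframe read above, so adjust the column name to your own data.

tokenized = df[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))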

This turns every sentence into a list of ids.

The dataset is currently a list (or pandas Series/DataFrame) of lists. Before DistilBERT can process this as input, we’ll need to make all the vectors the same size by padding shorter sentences with the token id 0. You can refer to the notebook for the padding step, it’s basic python string and array manipulation.
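
A small padding sketch in that spirit; the attention mask built here is used in the next step so DistilBERT ignores the padding.

max_len = max(len(ids) for ids in tokenized.values)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
print(padded.shape)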

After the padding, we have a matrix/tensor that is ready to be passed to BERT:

Processing with DistilBERT

We now create an input tensor out of the padded token matrix, and send that to DistilBERT
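
A sketch of that forward pass, with gradients disabled since we are only extracting features.

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)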

After running this step, the variable holding the result (last_hidden_states in the sketch above) holds the outputs of DistilBERT. It is a tuple with the shape (number of examples, max number of tokens in the sequence, number of hidden units in the DistilBERT model). In our case, this will be 2000 (since we only limited ourselves to 2000 examples), 66 (which is the number of tokens in the longest sequence from the 2000 examples), 768 (the number of hidden units in the DistilBERT model).

Unpacking the BERT output tensor

Let’s unpack this 3-d output tensor. We can first start by examining its dimensions:

Recapping a sentence’s journey

Each row is associated with a sentence from our dataset. To recap the processing path of the first sentence, we can think of it as looking like this:

Slicing the important part

For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything else.

This is how we slice that 3d tensor to get the 2d tensor we’re interested in:
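
In short: keep only the position-0 ([CLS]) vector for every sentence.

features = last_hidden_states[0][:, 0, :].numpy()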

And now features (in the sketch above) is a 2d numpy array containing the sentence embeddings of all the sentences in our dataset.


The tensor we sliced from BERT's output

Dataset for Logistic Regression

Now that we have the output of BERT, we have assembled the dataset we need to train our logistic regression model. The 768 columns are the features, and the labels we just get from our initial dataset.


The labeled dataset we use to train the Logistic Regression. The features are the output vectors of BERT for the [CLS] token (position #0) that we sliced in the previous figure. Each row corresponds to a sentence in our dataset, each column corresponds to the output of a hidden unit from the feed-forward neural network at the top transformer block of the Bert/DistilBERT model.

After doing the traditional train/test split of machine learning, we can declare our Logistic Regression model and train it against the dataset.
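
A sketch of the split, assuming the labels sit in column 1 of the header-less dataframe read earlier.

labels = df[1]
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels)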

Which splits the dataset into training/testing sets:

Next, we train the Logistic Regression model on the training set.
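
Declaring and fitting the classifier is two lines:

lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)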

Now that the model is trained, we can score it against the test set:
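
Scoring uses scikit-learn's built-in accuracy:

lr_clf.score(test_features, test_labels)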

Which shows the model achieves around 81% accuracy.

Score Benchmarks

For reference, the highest accuracy score for this dataset is currently 96.8. DistilBERT can be trained to improve its score on this task through a process called fine-tuning, which updates BERT’s weights so it achieves better performance on sentence classification (which we can call the downstream task). The fine-tuned DistilBERT turns out to achieve an accuracy score of 90.7. The full-size BERT model achieves 94.9.

The Notebook

Dive right into the notebook or run it on colab.

And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at fine-tuning. You can also go back and switch from distilBERT to BERT and see how that works.

Thanks to Clément Delangue, Victor Sanh, and the Huggingface team for providing feedback to earlier versions of this tutorial.

Written on November 26, 2019

Source: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

Sentiment Classification Using BERT

BERT, which stands for Bidirectional Encoder Representations from Transformers, was proposed by researchers at Google AI Language in 2018. Although its main aim was to improve the understanding of the meaning of queries related to Google Search, BERT became one of the most important and complete architectures for various natural language tasks, having generated state-of-the-art results on sentence pair classification, question answering, and more. For more details on the architecture, please look at this article.

Architecture:

One of the most important features of BERT is its adaptability to perform different NLP tasks with state-of-the-art accuracy (similar to the transfer learning we use in computer vision). For that, the paper also proposed architectures for different tasks. In this post, we will be using the BERT architecture for single sentence classification tasks, specifically the architecture used for the CoLA (Corpus of Linguistic Acceptability) binary classification task. In the previous post about BERT, we discussed its architecture in detail, but let’s recap some of its important details:

BERT single sentence classification task

BERT was proposed in two versions:

  • BERT (BASE): 12 layers of encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
  • BERT (LARGE): 24 layers of encoder stack with 24 bidirectional self-attention heads and 1024 hidden units.

For the TensorFlow implementation, Google has provided two variants of both BERT BASE and BERT LARGE: Uncased and Cased. In the uncased version, letters are lowercased before WordPiece tokenization.



Implementation:

  • First, we need to clone the BERT GitHub repo to make the setup easier.

Code: 

python3
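
A sketch of the clone step, run in a Colab cell.

!git clone https://github.com/google-research/bert.git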

Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.20 KiB | 584.00 KiB/s, done.
Resolving deltas: 100% (185/185), done.
  • Now, we need to download the BERT BASE model using the following link and unzip it into the working directory (or the desired location).

Code: 
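
A sketch of the download and unzip step; the Google Cloud Storage URL below is the standard release location for this checkpoint, but verify it against the original article.

!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip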

Archive: uncased_L-12_H-768_A-12.zip
   creating: uncased_L-12_H-768_A-12/
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001
  inflating: uncased_L-12_H-768_A-12/vocab.txt
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index
  inflating: uncased_L-12_H-768_A-12/bert_config.json
  • We will be using the TensorFlow 1.x version. In Google Colab there is a magic command called %tensorflow_version that can switch between versions.

Code: 
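
The Colab magic to switch to TensorFlow 1.x:

%tensorflow_version 1.x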

TensorFlow 1.x selected.
  • Now, we will import the modules necessary for running this project; we will be using NumPy, scikit-learn, and Keras from TensorFlow's built-in modules. These are already preinstalled in Colab; make sure to install them in your own environment.

Code: 

python3
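
A plausible set of imports for this walkthrough (the article's exact list may differ).

import os
import re
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score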

  • Now we will load the IMDB sentiment dataset and do some preprocessing before training. For loading the IMDB dataset from TensorFlow Hub, we will follow this tutorial.

Code: 

python3
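
A sketch of the loading step in the spirit of the TF Hub text-classification tutorial referenced above (directory layout aclImdb/{train,test}/{pos,neg}; the helper names here are assumptions).

def load_directory_data(directory):
    data = {'sentence': [], 'sentiment': []}
    for file_path in os.listdir(directory):
        with open(os.path.join(directory, file_path), 'r') as f:
            data['sentence'].append(f.read())
            # File names look like 123_7.txt, where 7 is the review's rating.
            data['sentiment'].append(re.match(r'\d+_(\d+)\.txt', file_path).group(1))
    return pd.DataFrame.from_dict(data)

def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, 'pos'))
    neg_df = load_directory_data(os.path.join(directory, 'neg'))
    pos_df['polarity'] = 1
    neg_df['polarity'] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

dataset = tf.keras.utils.get_file(
    fname='aclImdb.tar.gz',
    origin='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
    extract=True)

base_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
train_df = load_dataset(os.path.join(base_dir, 'train'))
test_df = load_dataset(os.path.join(base_dir, 'test'))
print((train_df.shape, test_df.shape))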

 

 

 

 

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 8s 0us/step
((25000, 3), (25000, 3))
  • This dataset contains 50k reviews, 25k each for training and test; we will sample 5k reviews from each of the test and train sets. Also, both the test and train datasets contain 3 columns, listed below.

Code: 

python3
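
A sketch of the sampling and column check (sample size per the bullet above; the random seed is an assumption).

train_sample = train_df.sample(5000, random_state=42)
test_sample = test_df.sample(5000, random_state=42)
train_df.columns, test_df.columns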

(Index(['sentence', 'sentiment', 'polarity'], dtype='object'), Index(['sentence', 'sentiment', 'polarity'], dtype='object'))
  • Now, we need to convert the data into the specific format required by the BERT model for training and prediction; for that, we will use a pandas dataframe. Below are the columns required in the BERT training and test format:
    • GUID: An id for the row. Required for both train and test data.
    • Class label: A value of 0 or 1 depending on positive or negative sentiment.
    • alpha: This is a dummy column for text classification but is expected by BERT during training.
    • text: The review text of the data point which needs to be classified. Obviously required for both training and test.

Code: 



python3
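
A sketch of building the four-column training frame and the two-column test frame that run_classifier.py expects (variable names are placeholders).

bert_train = pd.DataFrame({
    'guid': range(len(train_sample)),
    'label': train_sample['polarity'].values,
    'alpha': ['a'] * len(train_sample),
    'text': train_sample['sentence'].values,
})

bert_test = pd.DataFrame({
    'guid': range(len(test_sample)),
    'text': test_sample['sentence'].values,
})

print(bert_train.head(), '-----', bert_test.head(), sep='\n')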

 

guid label alpha text
14930 0 1 a William Hurt may not be an American matinee id...
1445 1 1 a Rock solid giallo from a master filmmaker of t...
16943 2 1 a This movie surprised me. Some things were "cli...
6391 3 1 a This film may seem dated today, but remember t...
4526 4 0 a The Twilight Zone has achieved a certain mytho...
-----
guid text
20010 0 One of Alfred Hitchcock's three greatest films...
16132 1 Hitchcock once gave an interview where he said...
24947 2 I had nothing to do before going out one night...
5471 3 tell you what that was excellent. Dylan Moran ...
21075 4 I watched this show until my puberty but still...
  • Now, we split the data into three parts (train, dev, and test) and save them as tsv files into a folder (here "IMDB Dataset"). This is because the run_classifier file requires the dataset in tsv format.

Code: 

python3
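
A sketch of that step; train.tsv and dev.tsv are written without headers while test.tsv keeps one, since run_classifier.py skips the first line of the test file.

bert_train_split, bert_dev = train_test_split(bert_train, test_size=0.1)

os.makedirs('IMDB Dataset', exist_ok=True)
bert_train_split.to_csv('IMDB Dataset/train.tsv', sep='\t', index=False, header=False)
bert_dev.to_csv('IMDB Dataset/dev.tsv', sep='\t', index=False, header=False)
bert_test.to_csv('IMDB Dataset/test.tsv', sep='\t', index=False, header=True)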

  • In this step, we train the model using the following command; for executing bash commands in Colab, we put a ! sign in front of the command. The run_classifier file trains the model with the help of the given command. Due to time and resource constraints, we will run it only for 3 epochs.

Code: 

python3
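
A sketch of the training command. The flags below are the standard run_classifier.py options from the BERT repo; the batch size and paths are assumptions consistent with the log output shown below.

!python bert/run_classifier.py \
  --task_name=cola \
  --do_train=true \
  --do_eval=true \
  --data_dir="IMDB Dataset" \
  --vocab_file=uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=bert_output/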

# Last few lines
INFO:tensorflow:***** Eval results *****
I0713 06:06:28.966619 139722620139392 run_classifier.py:923] ***** Eval results *****
INFO:tensorflow: eval_accuracy = 0.796
I0713 06:06:28.966814 139722620139392 run_classifier.py:925] eval_accuracy = 0.796
INFO:tensorflow: eval_loss = 0.95403963
I0713 06:06:28.967138 139722620139392 run_classifier.py:925] eval_loss = 0.95403963
INFO:tensorflow: global_step = 1687
I0713 06:06:28.967317 139722620139392 run_classifier.py:925] global_step = 1687
INFO:tensorflow: loss = 0.95741796
I0713 06:06:28.967507 139722620139392 run_classifier.py:925] loss = 0.95741796
  • Now we will use test data to evaluate our model with the following bash script. This script saves the predictions into a tsv file.

Code: 

python3
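
A sketch of the prediction command; it restores the fine-tuned checkpoint and writes test_results.tsv into the output directory (flags and paths are assumptions consistent with the log below).

!python bert/run_classifier.py \
  --task_name=cola \
  --do_predict=true \
  --data_dir="IMDB Dataset" \
  --vocab_file=uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=bert_output/model.ckpt-1687 \
  --max_seq_length=128 \
  --output_dir=bert_output/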

INFO:tensorflow:Restoring parameters from /content/bert_output/model.ckpt-1687
I0713 06:08:22.372014 140390020667264 saver.py:1284] Restoring parameters from /content/bert_output/model.ckpt-1687
INFO:tensorflow:Running local_init_op.
I0713 06:08:23.801442 140390020667264 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0713 06:08:23.859703 140390020667264 session_manager.py:502] Done running local_init_op.
2020-07-13 06:08:24.453814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
INFO:tensorflow:prediction_loop marked as finished
I0713 06:10:02.280455 140390020667264 error_handling.py:101] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0713 06:10:02.280870 140390020667264 error_handling.py:101] prediction_loop marked as finished
  • The code below takes the maximum prediction for each row of the test data and stores it in a list.

Code: 

python3
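
A sketch: each row of test_results.tsv holds the class probabilities, so the predicted label is the column with the highest value.

results = pd.read_csv('bert_output/test_results.tsv', sep='\t', header=None)
predictions = results.idxmax(axis=1).tolist()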

 

  • The code below calculates accuracy and F1-score.

Code: 

python3
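
A sketch of the metric computation, comparing the predictions against the held-out polarity labels sampled earlier.

y_true = test_sample['polarity'].values
print('Accuracy', accuracy_score(y_true, predictions))
print('F1-Score', f1_score(y_true, predictions))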

Accuracy 0.8548
F1-Score 0.8496894409937888
  • We have achieved about 85% accuracy and F1-score on the IMDB reviews dataset while training BERT (BASE) for just 3 epochs, which is quite a good result. Training for more epochs will certainly improve the accuracy.





Source: https://www.geeksforgeeks.org/sentiment-classification-using-bert/


BERT¶

Overview¶

The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It’s a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

The abstract from the paper is the following:

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Tips:

  • BERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

  • BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.

This model was contributed by thomwolf. The original code can be found here.

BertConfig¶

class (vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)[source]¶

This is the configuration class to store the configuration of a or a . It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT bert-base-uncased architecture.

Configuration objects inherit from and can be used to control the model outputs. Read the documentation from for more information.

Parameters
  • vocab_size (, optional, defaults to 30522) – Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the passed when calling or .

  • hidden_size (, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.

  • num_hidden_layers (, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • intermediate_size (, optional, defaults to 3072) – Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

  • hidden_act ( or , optional, defaults to ) – The non-linear activation function (function or string) in the encoder and pooler. If string, , , and are supported.

  • hidden_dropout_prob (, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • attention_probs_dropout_prob (, optional, defaults to 0.1) – The dropout ratio for the attention probabilities.

  • max_position_embeddings (, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • type_vocab_size (, optional, defaults to 2) – The vocabulary size of the passed when calling or .

  • initializer_range (, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • layer_norm_eps (, optional, defaults to 1e-12) – The epsilon used by the layer normalization layers.

  • position_embedding_type (, optional, defaults to ) – Type of position embedding. Choose one of , , . For positional embeddings use . For more information on , please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on , please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

  • use_cache (, optional, defaults to ) – Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if .

  • classifier_dropout (, optional) – The dropout ratio for the classification head.

Examples:

>>> from transformers import BertModel, BertConfig

>>> # Initializing a BERT bert-base-uncased style configuration
>>> configuration = BertConfig()

>>> # Initializing a model from the bert-base-uncased style configuration
>>> model = BertModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

BertTokenizer¶

class (vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]¶

Construct a BERT tokenizer. Based on WordPiece.

This tokenizer inherits from which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Parameters
  • vocab_file () – File containing the vocabulary.

  • do_lower_case (, optional, defaults to ) – Whether or not to lowercase the input when tokenizing.

  • do_basic_tokenize (, optional, defaults to ) – Whether or not to do basic tokenization before WordPiece.

  • never_split (, optional) – Collection of tokens which will never be split during tokenization. Only has an effect when

  • unk_token (, optional, defaults to ) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • sep_token (, optional, defaults to ) – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

  • pad_token (, optional, defaults to ) – The token used for padding, for example when batching sequences of different lengths.

  • cls_token (, optional, defaults to ) – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

  • mask_token (, optional, defaults to ) – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

  • tokenize_chinese_chars (, optional, defaults to ) –

    Whether or not to tokenize Chinese characters.

    This should likely be deactivated for Japanese (see this issue).

  • strip_accents – (, optional): Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for (as in the original BERT).

(token_ids_0:List[int], token_ids_1:Optional[List[int]]=None) → List[int][source]¶

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

  • single sequence:

  • pair of sequences:

Parameters
  • token_ids_0 () – List of IDs to which the special tokens will be added.

  • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

Returns

List of input IDs with the appropriate special tokens.

Return type
(token_ids_0:List[int], token_ids_1:Optional[List[int]]=None) → List[int][source]¶

Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If is , this method only returns the first portion of the mask (0s).

Parameters
  • token_ids_0 () – List of IDs.

  • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

Returns

List of token type IDs according to the given sequence(s).

Return type
(token_ids_0:List[int], token_ids_1:Optional[List[int]]=None, already_has_special_tokens:bool=False) → List[int][source]¶

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer method.

Parameters
  • token_ids_0 () – List of IDs.

  • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

  • already_has_special_tokens (, optional, defaults to ) – Whether or not the token list is already formatted with special tokens for the model.

Returns

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type
(save_directory:str, filename_prefix:Optional[str]=None) → Tuple[str][source]¶

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won’t save the configuration and special token mappings of the tokenizer. Use to save the whole state of the tokenizer.

Parameters
  • save_directory () – The directory in which to save the vocabulary.

  • filename_prefix (, optional) – An optional prefix to add to the named of the saved files.

Returns

Paths to the files saved.

Return type

BertTokenizerFast¶

class (vocab_file=None, tokenizer_file=None, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]¶

Construct a “fast” BERT tokenizer (backed by HuggingFace’s tokenizers library). Based on WordPiece.

This tokenizer inherits from which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Parameters
  • vocab_file () – File containing the vocabulary.

  • do_lower_case (, optional, defaults to ) – Whether or not to lowercase the input when tokenizing.

  • unk_token (, optional, defaults to ) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • sep_token (, optional, defaults to ) – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

  • pad_token (, optional, defaults to ) – The token used for padding, for example when batching sequences of different lengths.

  • cls_token (, optional, defaults to ) – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

  • mask_token (, optional, defaults to ) – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

  • clean_text (, optional, defaults to ) – Whether or not to clean the text before tokenization by removing any control characters and replacing all whitespaces by the classic one.

  • tokenize_chinese_chars (, optional, defaults to ) – Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this issue).

  • strip_accents – (, optional): Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for (as in the original BERT).

  • wordpieces_prefix – (, optional, defaults to ): The prefix for subwords.

(token_ids_0, token_ids_1=None)[source]¶

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

  • single sequence:

  • pair of sequences:

Parameters
  • token_ids_0 () – List of IDs to which the special tokens will be added.

  • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

Returns

List of input IDs with the appropriate special tokens.

Return type
(token_ids_0:List[int], token_ids_1:Optional[List[int]]=None) → List[int][source]¶

Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If is , this method only returns the first portion of the mask (0s).

Parameters
  • token_ids_0 () – List of IDs.

  • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

Returns

List of token type IDs according to the given sequence(s).

Return type
(save_directory:str, filename_prefix:Optional[str]=None) → Tuple[str][source]¶

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won’t save the configuration and special token mappings of the tokenizer. Use to save the whole state of the tokenizer.

Parameters
  • save_directory () – The directory in which to save the vocabulary.

  • filename_prefix (, optional) – An optional prefix to add to the named of the saved files.

Returns

Paths to the files saved.

Return type

alias of

Bert specific outputs¶

class (loss:Optional[torch.FloatTensor]=None, prediction_logits:torch.FloatTensor=None, seq_relationship_logits:torch.FloatTensor=None, hidden_states:Optional[Tuple[torch.FloatTensor]]=None, attentions:Optional[Tuple[torch.FloatTensor]]=None)[source]¶

Output type of .

Parameters
  • loss (optional, returned when is provided, of shape ) – Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.

  • prediction_logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • seq_relationship_logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

  • hidden_states (, optional, returned when is passed or when ) –

    Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) –

    Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class (loss:Optional[tensorflow.python.framework.ops.Tensor]=None, prediction_logits:tensorflow.python.framework.ops.Tensor=None, seq_relationship_logits:tensorflow.python.framework.ops.Tensor=None, hidden_states:Optional[Union[Tuple[tensorflow.python.framework.ops.Tensor], tensorflow.python.framework.ops.Tensor]]=None, attentions:Optional[Union[Tuple[tensorflow.python.framework.ops.Tensor], tensorflow.python.framework.ops.Tensor]]=None)[source]¶

Output type of .

Parameters
  • prediction_logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • seq_relationship_logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

  • hidden_states (, optional, returned when is passed or when ) –

    Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) –

    Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

class (prediction_logits:jax._src.numpy.lax_numpy.ndarray=None, seq_relationship_logits:jax._src.numpy.lax_numpy.ndarray=None, hidden_states:Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]]=None, attentions:Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]]=None)[source]¶

Output type of .

Parameters
  • prediction_logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • seq_relationship_logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

  • hidden_states (, optional, returned when is passed or when ) –

    Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) –

    Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

(**updates

Returns a new object replacing the specified fields with new values.

BertModel¶

class (config, add_pooling_layer=True)[source]¶

The bare Bert Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of cross-attention is added between the self-attention layers, following the architecture described in Attention is all you need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

To behave as a decoder, the model needs to be initialized with the argument of the configuration set to . To be used in a Seq2Seq model, the model needs to be initialized with both argument and set to ; an is then expected as an input to the forward pass.

(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

The forward method, overrides the special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • attention_mask ( of shape , optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • token_type_ids ( of shape , optional) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

    • 0 corresponds to a sentence A token,

    • 1 corresponds to a sentence B token.

    What are token type IDs?

  • position_ids ( of shape , optional) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional) –

    Mask to nullify selected heads of the self-attention modules. Mask values selected in :

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

  • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

  • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

  • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

  • encoder_hidden_states ( of shape , optional) – Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

  • encoder_attention_mask ( of shape , optional) –

    Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • past_key_values ( of length with each tuple having 4 tensors of shape ) –

    Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

    If are used, the user can optionally input only the last (those that don’t have their past key value states given to this model) of shape instead of all of shape .

  • use_cache (, optional) – If set to , key value states are returned and can be used to speed up decoding (see ).

Returns

A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

  • last_hidden_state ( of shape ) – Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output ( of shape ) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (, optional, returned when and is passed or when ) – Tuple of (one for each layer) of shape .

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • past_key_values (, optional, returned when is passed or when ) – Tuple of of length , with each tuple having 2 tensors of shape ) and optionally if 2 additional tensors of shape .

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if in the cross-attention blocks) that can be used (see input) to speed up sequential decoding.

Return type

or

Example:

>>> from transformers import BertTokenizer, BertModel
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertModel.from_pretrained('bert-base-uncased')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

BertForPreTraining¶

class (config)[source]¶

Bert Model with two heads on top as done during the pretraining: a masked language modeling head and a next sentence prediction (classification) head.

This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, next_sentence_label=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

The forward method, overrides the special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • attention_mask ( of shape , optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • token_type_ids ( of shape , optional) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

    • 0 corresponds to a sentence A token,

    • 1 corresponds to a sentence B token.

    What are token type IDs?

  • position_ids ( of shape , optional) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional) –

    Mask to nullify selected heads of the self-attention modules. Mask values selected in :

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

  • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

  • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

  • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

  • labels ( of shape , optional) – Labels for computing the masked language modeling loss. Indices should be in (see docstring) Tokens with indices set to are ignored (masked), the loss is only computed for the tokens with labels in

  • next_sentence_label ( of shape , optional) –

    Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see docstring) Indices should be in :

    • 0 indicates sequence B is a continuation of sequence A,

    • 1 indicates sequence B is a random sequence.

  • kwargs (, optional, defaults to {}) – Used to hide legacy arguments that have been deprecated.

Returns

A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

  • loss (optional, returned when is provided, of shape ) – Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.

  • prediction_logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • seq_relationship_logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

  • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Example:

>>> from transformers import BertTokenizer, BertForPreTraining
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertForPreTraining.from_pretrained('bert-base-uncased')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> prediction_logits = outputs.prediction_logits
>>> seq_relationship_logits = outputs.seq_relationship_logits
Return type

or
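If the combined pretraining loss described in the return fields is wanted, both label arguments can be supplied. The following is a hedged sketch, not a real pretraining pipeline: the label tensors are illustrative placeholders built from the example inputs.

>>> # labels: MLM targets (same shape as input_ids); next_sentence_label: 0 means B follows A
>>> labels = inputs["input_ids"].clone()
>>> next_sentence_label = torch.LongTensor([0])
>>> outputs = model(**inputs, labels=labels, next_sentence_label=next_sentence_label)
>>> loss = outputs.loss  # sum of the masked LM loss and the next-sentence prediction loss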

BertLMHeadModel¶

class transformers.BertLMHeadModel(config)[source]¶

Bert Model with a language modeling head on top for CLM fine-tuning.

This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, labels=None, past_key_values=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

The forward method overrides the __call__ special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • attention_mask ( of shape , optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • token_type_ids ( of shape , optional) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

    • 0 corresponds to a sentence A token,

    • 1 corresponds to a sentence B token.

    What are token type IDs?

  • position_ids ( of shape , optional) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional) –

    Mask to nullify selected heads of the self-attention modules. Mask values selected in :

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

  • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

  • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

  • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

  • encoder_hidden_states ( of shape , optional) – Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

  • encoder_attention_mask ( of shape , optional) –

    Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • labels ( of shape , optional) – Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in (see docstring) Tokens with indices set to are ignored (masked), the loss is only computed for the tokens with labels in

  • past_key_values ( of length with each tuple having 4 tensors of shape ) –

    Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

    If are used, the user can optionally input only the last (those that don’t have their past key value states given to this model) of shape instead of all of shape .

  • use_cache (, optional) – If set to , key value states are returned and can be used to speed up decoding (see ).

Returns

A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

  • loss ( of shape , optional, returned when is provided) – Language modeling loss (for next-token prediction).

  • logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

    Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • past_key_values (, optional, returned when is passed or when ) – Tuple of tuples of length , with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if model is used in encoder-decoder setting. Only relevant if .

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see input) to speed up sequential decoding.

Example:

>>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> config = BertConfig.from_pretrained("bert-base-cased")
>>> config.is_decoder = True
>>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> prediction_logits = outputs.logits
Return type

or
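The past_key_values / use_cache mechanism described in the parameters above can be used to avoid recomputing earlier positions during step-by-step decoding. A rough sketch, assuming the decoder-configured model and inputs from the example; greedy argmax token selection is only for illustration.

>>> # first pass: run the prompt and cache the key/value states
>>> outputs = model(**inputs, use_cache=True)
>>> past_key_values = outputs.past_key_values
>>> next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
>>> # later passes: feed only the newest token together with the cached states
>>> outputs = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
>>> past_key_values = outputs.past_key_values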

BertForMaskedLM¶

class transformers.BertForMaskedLM(config)[source]¶

Bert Model with a language modeling head on top.

This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

The forward method overrides the __call__ special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • attention_mask ( of shape , optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • token_type_ids ( of shape , optional) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

    • 0 corresponds to a sentence A token,

    • 1 corresponds to a sentence B token.

    What are token type IDs?

  • position_ids ( of shape , optional) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional) –

    Mask to nullify selected heads of the self-attention modules. Mask values selected in :

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

  • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

  • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

  • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

  • labels ( of shape , optional) – Labels for computing the masked language modeling loss. Indices should be in (see docstring) Tokens with indices set to are ignored (masked), the loss is only computed for the tokens with labels in

Returns

A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

  • loss ( of shape , optional, returned when is provided) – Masked language modeling (MLM) loss.

  • logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Return type

or

Example:

>>> from transformers import BertTokenizer, BertForMaskedLM
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertForMaskedLM.from_pretrained('bert-base-uncased')
>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
>>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits
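Continuing the example, one plausible way to read out the model's guess for the masked position is to take the argmax of the logits at the [MASK] index. This is a hedged sketch; the variable names are illustrative and reuse the objects defined above.

>>> mask_index = (inputs["input_ids"] == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
>>> predicted_id = logits[0, mask_index].argmax(dim=-1)
>>> tokenizer.decode(predicted_id)  # expected to be something like 'paris'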

BertForNextSentencePrediction¶

class transformers.BertForNextSentencePrediction(config)[source]¶

Bert Model with a next sentence prediction (classification) head on top.

This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None, **kwargs)[source]¶

The forward method overrides the __call__ special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • attention_mask ( of shape , optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • token_type_ids ( of shape , optional) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

    • 0 corresponds to a sentence A token,

    • 1 corresponds to a sentence B token.

    What are token type IDs?

  • position_ids ( of shape , optional) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional) –

    Mask to nullify selected heads of the self-attention modules. Mask values selected in :

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

  • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

  • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

  • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

  • labels ( of shape , optional) –

    Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see docstring). Indices should be in :

    • 0 indicates sequence B is a continuation of sequence A,

    • 1 indicates sequence B is a random sequence.

Returns

A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

  • loss ( of shape , optional, returned when is provided) – Next sequence prediction (classification) loss.

  • logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

  • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Example:

>>> from transformers import BertTokenizer, BertForNextSentencePrediction
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> next_sentence = "The sky is blue due to the shorter wavelength of blue light."
>>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')
>>> outputs = model(**encoding, labels=torch.LongTensor([1]))
>>> logits = outputs.logits
>>> assert logits[0, 0] < logits[0, 1]  # next sentence was random
Return type

or
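Since the returned logits are scores before SoftMax, a standard softmax can be applied to read them as probabilities over the two NSP classes. A small illustrative addition to the example above:

>>> probs = torch.softmax(logits, dim=-1)
>>> probs[0, 0]  # probability that sequence B is a continuation of sequence A
>>> probs[0, 1]  # probability that sequence B is a random sequence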

BertForSequenceClassification¶

class transformers.BertForSequenceClassification(config)[source]¶

Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.

This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Parameters

config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

The forward method overrides the __call__ special method.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Parameters
  • input_ids ( of shape ) –

    Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using . See and for details.

    What are input IDs?

  • attention_mask ( of shape , optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in :

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • token_type_ids ( of shape , optional) –

    Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

    • 0 corresponds to a sentence A token,

    • 1 corresponds to a sentence B token.

    What are token type IDs?

  • position_ids ( of shape , optional) –

    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

    What are position IDs?

  • head_mask ( of shape or , optional) –

    Mask to nullify selected heads of the self-attention modules. Mask values selected in :

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

  • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

  • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.
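For this head, a minimal single-label classification sketch in the same style as the examples above, assuming bert-base-uncased and an illustrative binary label:

>>> from transformers import BertTokenizer, BertForSequenceClassification
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([1]).unsqueeze(0)  # batch size 1; the label value is illustrative
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits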

Sours: https://huggingface.co/transformers/model_doc/bert.html
