{"id":511,"date":"2022-11-19T22:53:16","date_gmt":"2022-11-19T19:53:16","guid":{"rendered":"https:\/\/acua.qcri.org\/blog\/?p=511"},"modified":"2022-11-20T19:45:54","modified_gmt":"2022-11-20T16:45:54","slug":"demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems","status":"publish","type":"post","link":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/","title":{"rendered":"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems"},"content":{"rendered":"<p>Newcomers to the field of Artificial Intelligence (AI) often see the term \u2018language model\u2019 tossed around in discussions of Natural Language Processing (NLP) tasks without any proper clarification of its importance or its usage in solving real-world problems.<\/p>\n<p>This tutorial blog post aims to demystify language models by defining what a language model is, describing the common types of language models, and providing a concrete example of using a popular language model to solve a common Machine Learning (ML) classification problem.<\/p>\n<p><strong>What is a language model?<\/strong><br \/>\nWhen talking about textual data, a language model is a mathematical model that assigns a probability to a sequence of words. One of the simplest and most common language models is the Bag of Words (BoW) model. Bag of Words represents text numerically: each word is mapped to a number, either by assigning a unique number to each word or by using a technique called \u201chashing\u201d. 
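<\/p>
<p>To make the counting idea concrete, here is a minimal sketch (ours, not from the original post) that builds a BoW representation in plain Python: each word gets a unique number, and each document becomes a vector of term counts.<\/p>

```python
from collections import Counter

def build_vocab(docs):
    # Assign a unique number to each word across all documents
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bow_vector(doc, vocab):
    # Term frequency: how often each vocabulary word appears in this document
    counts = Counter(doc.lower().split())
    return [counts.get(word, 0) for word in vocab]

docs = ['the dog chased the cat', 'the cat slept']
vocab = build_vocab(docs)
print(vocab)                       # {'the': 0, 'dog': 1, 'chased': 2, 'cat': 3, 'slept': 4}
print(bow_vector(docs[0], vocab))  # [2, 1, 1, 1, 0]
```

<p>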
For instance, the number of times a certain word appears in a document is called term frequency, which is part of the BoW language model [1].<\/p>\n<p><strong>Types of language models<\/strong><br \/>\nLanguage models can be rule-based, statistical, or based on neural networks:<\/p>\n<ol>\n<li>Rule-based language models rely on a set of rules that define how words can be combined; these rules can be hand-crafted or learned from data. Such models are less common than the other types: they are generally less accurate, but they can be built with less data.<\/li>\n<li>Statistical language models are based on statistical methods and data; they are more common than rule-based models. They are trained on large amounts of text and can estimate the probability of any word sequence; the BoW model that we discussed earlier is a simple example. Statistical language models are more accurate than rule-based models but require more data to train.<\/li>\n<li>Neural network language models use continuous representations, or embeddings, of words to make predictions with neural networks. These are the new players in the NLP town and have surpassed statistical language models in effectiveness. Like statistical models, they require large amounts of data to train [2].<\/li>\n<\/ol>\n<p><strong>What is the purpose of language models?<\/strong><br \/>\nLanguage models are used in NLP and ML tasks such as speech recognition, question answering, document or text classification, and machine translation.<\/p>\n<p>To illustrate a real-world example of using language models, let\u2019s look at Gmail\u2019s spam filtering text classification problem.<br \/>\nThe spam filtering problem is that of designing a system that can automatically detect spam emails and flag them for the user, as in the figure below. 
This is a difficult problem because spam emails are often very similar to regular emails, and it can be hard to design a system that can accurately distinguish between them. There are many different approaches to this problem, but one common approach is to use a supervised machine learning algorithm with a language model to learn from a dataset of labeled emails (i.e., emails that have been manually labeled as spam or not spam). The algorithm can then be used to classify new emails as spam or not spam [3].<\/p>\n<figure id=\"attachment_513\" aria-describedby=\"caption-attachment-513\" style=\"width: 600px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-513\" src=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/document_classification.webp\" alt=\"\" width=\"600\" height=\"300\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/document_classification.webp 1277w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/document_classification-300x150.webp 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/document_classification-1024x512.webp 1024w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/document_classification-768x384.webp 768w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><figcaption id=\"caption-attachment-513\" class=\"wp-caption-text\">Figure 1 Spam filtering text classification problem (source: <a href=\"https:\/\/developers.google.com\/machine-learning\/guides\/text-classification\">developers.google.com<\/a>)<\/figcaption><\/figure>\n<p>Machine learning researchers are constantly looking for ways to improve the performance of language models. One area of research is to develop new algorithms that can better learn from data, as we explained in the spam filtering problem. 
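<\/p>
<p>The labeled-email pipeline described above can be sketched with scikit-learn. This is a toy stand-in (ours, with made-up emails) pairing a BoW representation with a Naive Bayes classifier; it is not the actual system used by Gmail.<\/p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set of manually labeled emails
emails = [
    'win a free prize now', 'claim your free money',      # spam
    'meeting moved to noon', 'please review the report',  # not spam
]
labels = ['spam', 'spam', 'not_spam', 'not_spam']

# Learn Bag of Words features from the labeled emails
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train a simple probabilistic classifier on those features
model = MultinomialNB().fit(X, labels)

# Classify a new, unseen email
print(model.predict(vectorizer.transform(['free prize money']))[0])  # spam
```

<p>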
Another area of research is to develop new ways to represent data so that it can be more easily learned by language models.<\/p>\n<p><strong>BERT as a neural network language model<\/strong><br \/>\nA popular neural network-based language model is BERT, which was introduced by Google in 2018 and pre-trained on the Wikipedia corpus, which contains about 2.5 billion words. BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context, much as humans use surrounding terms in a sentence to infer the meaning of unknown words.<br \/>\nAs a language model, BERT was pre-trained using text from Wikipedia and can be fine-tuned on labeled datasets to solve many types of classification problems, such as question answering.<br \/>\nMore formally, Bidirectional Encoder Representations from Transformers (BERT) is a machine learning framework for NLP that uses deep learning to better understand the meaning of ambiguous language [4]. 
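<\/p>
<p>BERT itself is too large to demonstrate inline, but the idea of predicting a word from surrounding context can be illustrated with a toy count-based stand-in (ours, for intuition only; BERT uses bidirectional transformer attention, not bigram counts):<\/p>

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    # Count which word follows each word in the training text
    following = defaultdict(Counter)
    words = corpus.lower().split()
    for left, right in zip(words, words[1:]):
        following[left][right] += 1
    return following

def predict_next(following, context_word):
    # Pick the word seen most often after the context word
    return following[context_word].most_common(1)[0][0]

corpus = 'the dog barked loudly . the dog barked again . the cat meowed .'
model = train_bigrams(corpus)
print(predict_next(model, 'dog'))  # barked
print(predict_next(model, 'cat'))  # meowed
```

<p>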
BERT is a type of transformer neural network that is pre-trained on text from Wikipedia and can be fine-tuned with any type of dataset to suit the user\u2019s needs, such as performing sentiment analysis or question answering, as seen in the figure below.<\/p>\n<figure id=\"attachment_520\" aria-describedby=\"caption-attachment-520\" style=\"width: 599px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-520\" src=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/BERT-model-example.png\" alt=\"\" width=\"599\" height=\"365\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/BERT-model-example.png 2166w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/BERT-model-example-300x183.png 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/BERT-model-example-1024x624.png 1024w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/BERT-model-example-768x468.png 768w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/BERT-model-example-1536x936.png 1536w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/BERT-model-example-2048x1248.png 2048w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/BERT-model-example-1568x956.png 1568w\" sizes=\"(max-width: 599px) 100vw, 599px\" \/><figcaption id=\"caption-attachment-520\" class=\"wp-caption-text\">Figure 2 Fine-tuning a BERT model to solve a specific problem (source: <a href=\"https:\/\/www.deepset.ai\/blog\/what-is-a-language-model\">deepset.ai<\/a>)<\/figcaption><\/figure>\n<p>Now that we have briefly described BERT, we will illustrate its use on the problem of detecting toxicity in comments on Reddit. The dataset that we will use was part of a study that investigates the toxic behavior of users on Reddit [5]. 
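<\/p>
<p>One practical detail before fine-tuning: toxicity datasets are typically dominated by non-toxic comments, so the tutorial below supplies class weights during training. As a small self-contained preview (with made-up labels, not the actual dataset), scikit-learn can compute balanced weights that up-weight the rare classes:<\/p>

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels mimicking a heavily imbalanced toxicity dataset
y = np.array(['non_toxic'] * 8 + ['slightly_toxic', 'highly_toxic'])

class_labels = np.unique(y)  # sorted label names
weights = compute_class_weight(class_weight='balanced', classes=class_labels, y=y)
class_weight = dict(zip(class_labels, weights))

# Balanced weighting uses n_samples / (n_classes * class_frequency),
# so each rare class gets 10 / (3 * 1) while the majority gets 10 / (3 * 8)
print(class_weight)
```

<p>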
The dataset (available <a href=\"https:\/\/www.dropbox.com\/s\/rb3e5czxtcr9xhy\/trainingDataset-toxicComments.csv\">here<\/a>) consists of comments from the subreddit r\/AskReddit, where each comment can be non-toxic, slightly-toxic, or highly-toxic. This makes the dataset a great candidate for testing our understanding of language models by using and fine-tuning a BERT neural network language model.<\/p>\n<p>First, we will install a lightweight library called <a href=\"https:\/\/github.com\/amaiya\/ktrain\">ktrain<\/a> to simplify the training of neural network models like BERT; then we will import the required libraries as follows:<\/p>\n<pre style=\"color: #000000; background: #ffffff;\">!pip install <span style=\"color: #44aadd;\">-<\/span>q ktrain <span style=\"color: #696969;\"># -q is a quiet option for less noise in the console output<\/span>\r\n<span style=\"color: #800000; font-weight: bold;\">import<\/span> numpy <span style=\"color: #800000; font-weight: bold;\">as<\/span> np <span style=\"color: #696969;\"># linear algebra<\/span>\r\n<span style=\"color: #800000; font-weight: bold;\">import<\/span> pandas <span style=\"color: #800000; font-weight: bold;\">as<\/span> pd <span style=\"color: #696969;\"># data processing, CSV file I\/O (e.g. 
pd.read_csv)<\/span>\r\nseed_value<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">13<\/span>\r\n<span style=\"color: #800000; font-weight: bold;\">import<\/span> re\r\n<span style=\"color: #800000; font-weight: bold;\">import<\/span> os\r\n<span style=\"color: #800000; font-weight: bold;\">import<\/span> math\r\n<span style=\"color: #800000; font-weight: bold;\">from<\/span> sklearn<span style=\"color: #808030;\">.<\/span>utils<span style=\"color: #808030;\">.<\/span>class_weight <span style=\"color: #800000; font-weight: bold;\">import<\/span> compute_class_weight\r\n<span style=\"color: #800000; font-weight: bold;\">from<\/span> sklearn<span style=\"color: #808030;\">.<\/span>preprocessing <span style=\"color: #800000; font-weight: bold;\">import<\/span> MultiLabelBinarizer\r\n<span style=\"color: #800000; font-weight: bold;\">import<\/span> tensorflow <span style=\"color: #800000; font-weight: bold;\">as<\/span> tf\r\n<span style=\"color: #800000; font-weight: bold;\">from<\/span> tensorflow<span style=\"color: #808030;\">.<\/span>keras <span style=\"color: #800000; font-weight: bold;\">import<\/span> activations\r\n<span style=\"color: #800000; font-weight: bold;\">from<\/span> sklearn<span style=\"color: #808030;\">.<\/span>metrics <span style=\"color: #800000; font-weight: bold;\">import<\/span> accuracy_score<span style=\"color: #808030;\">,<\/span> recall_score<span style=\"color: #808030;\">,<\/span> precision_score<span style=\"color: #808030;\">,<\/span> f1_score<span style=\"color: #808030;\">,<\/span> roc_auc_score\r\n<span style=\"color: #800000; font-weight: bold;\">from<\/span> sklearn<span style=\"color: #808030;\">.<\/span>metrics <span style=\"color: #800000; font-weight: bold;\">import<\/span> classification_report\r\n<span style=\"color: #800000; font-weight: bold;\">from<\/span> sklearn<span style=\"color: #808030;\">.<\/span>metrics <span style=\"color: #800000; font-weight: bold;\">import<\/span> 
confusion_matrix\r\n<span style=\"color: #800000; font-weight: bold;\">import<\/span> ktrain\r\n<span style=\"color: #800000; font-weight: bold;\">from<\/span> ktrain <span style=\"color: #800000; font-weight: bold;\">import<\/span> text\r\n<\/pre>\n<p>Then, clean the dataset using any appropriate method and partition the dataset into training, testing, and validation splits. Here, we cleaned the dataset based on special characteristics typically found in data retrieved from Reddit. Furthermore, we computed the class weight to supply it to the neural network model at the training phase. The toxicity detection dataset is severely imbalanced, with 81.57% non-toxic, 11.81% slightly-toxic, and 6.62% highly-toxic comments. Therefore, using the class weights for training is a good idea to ensure that the model pays more attention to minority classes.<\/p>\n<pre style=\"color: #000000; background: #ffffff;\"><span style=\"color: #696969;\">#Define class names<\/span>\r\nclasses <span style=\"color: #808030;\">=<\/span> <span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">\"highly_toxic\"<\/span><span style=\"color: #808030;\">,<\/span><span style=\"color: #0000e6;\">\"slightly_toxic\"<\/span><span style=\"color: #808030;\">,<\/span><span style=\"color: #0000e6;\">\"non_toxic\"<\/span><span style=\"color: #808030;\">]<\/span>\r\n\r\n<span style=\"color: #696969;\">#Function to partition data equally across classes <\/span>\r\n<span style=\"color: #800000; font-weight: bold;\">def<\/span> get_dataset_partitions_pd<span style=\"color: #808030;\">(<\/span>df<span style=\"color: #808030;\">,<\/span> train_split<span style=\"color: #808030;\">=<\/span><span style=\"color: #008000;\">0.8<\/span><span style=\"color: #808030;\">,<\/span> val_split<span style=\"color: #808030;\">=<\/span><span style=\"color: #008000;\">0.1<\/span><span style=\"color: #808030;\">,<\/span> test_split<span style=\"color: #808030;\">=<\/span><span style=\"color: 
#008000;\">0.1<\/span><span style=\"color: #808030;\">,<\/span> target_variable<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">None<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">assert<\/span> <span style=\"color: #808030;\">(<\/span>train_split <span style=\"color: #44aadd;\">+<\/span> test_split <span style=\"color: #44aadd;\">+<\/span> val_split<span style=\"color: #808030;\">)<\/span> <span style=\"color: #44aadd;\">==<\/span> <span style=\"color: #008c00;\">1<\/span>\r\n    \r\n    <span style=\"color: #696969;\"># Only allows for equal validation and test splits<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">assert<\/span> val_split <span style=\"color: #44aadd;\">==<\/span> test_split \r\n\r\n    <span style=\"color: #696969;\"># Shuffle<\/span>\r\n    df_sample <span style=\"color: #808030;\">=<\/span> df<span style=\"color: #808030;\">.<\/span>sample<span style=\"color: #808030;\">(<\/span>frac<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">,<\/span> random_state<span style=\"color: #808030;\">=<\/span>seed_value<span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #696969;\"># Specify seed to always have the same split distribution between runs<\/span>\r\n    <span style=\"color: #696969;\"># If target variable is provided, generate stratified sets<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> target_variable <span style=\"color: #800000; font-weight: bold;\">is<\/span> <span style=\"color: #800000; font-weight: bold;\">not<\/span> <span style=\"color: #074726;\">None<\/span><span style=\"color: #808030;\">:<\/span>\r\n      grouped_df <span style=\"color: #808030;\">=<\/span> df_sample<span style=\"color: #808030;\">.<\/span>groupby<span style=\"color: #808030;\">(<\/span>target_variable<span 
style=\"color: #808030;\">)<\/span>\r\n      arr_list <span style=\"color: #808030;\">=<\/span> <span style=\"color: #808030;\">[<\/span>np<span style=\"color: #808030;\">.<\/span>split<span style=\"color: #808030;\">(<\/span>g<span style=\"color: #808030;\">,<\/span> <span style=\"color: #808030;\">[<\/span><span style=\"color: #400000;\">int<\/span><span style=\"color: #808030;\">(<\/span>train_split <span style=\"color: #44aadd;\">*<\/span> <span style=\"color: #400000;\">len<\/span><span style=\"color: #808030;\">(<\/span>g<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #400000;\">int<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #008c00;\">1<\/span> <span style=\"color: #44aadd;\">-<\/span> val_split<span style=\"color: #808030;\">)<\/span> <span style=\"color: #44aadd;\">*<\/span> <span style=\"color: #400000;\">len<\/span><span style=\"color: #808030;\">(<\/span>g<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">)<\/span> <span style=\"color: #800000; font-weight: bold;\">for<\/span> i<span style=\"color: #808030;\">,<\/span> g <span style=\"color: #800000; font-weight: bold;\">in<\/span> grouped_df<span style=\"color: #808030;\">]<\/span>\r\n\r\n      train_ds <span style=\"color: #808030;\">=<\/span> pd<span style=\"color: #808030;\">.<\/span>concat<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">[<\/span>t<span style=\"color: #808030;\">[<\/span><span style=\"color: #008c00;\">0<\/span><span style=\"color: #808030;\">]<\/span> <span style=\"color: #800000; font-weight: bold;\">for<\/span> t <span style=\"color: #800000; font-weight: bold;\">in<\/span> arr_list<span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">)<\/span>\r\n      val_ds <span 
style=\"color: #808030;\">=<\/span> pd<span style=\"color: #808030;\">.<\/span>concat<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">[<\/span>t<span style=\"color: #808030;\">[<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">]<\/span> <span style=\"color: #800000; font-weight: bold;\">for<\/span> t <span style=\"color: #800000; font-weight: bold;\">in<\/span> arr_list<span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">)<\/span>\r\n      test_ds <span style=\"color: #808030;\">=<\/span> pd<span style=\"color: #808030;\">.<\/span>concat<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">[<\/span>v<span style=\"color: #808030;\">[<\/span><span style=\"color: #008c00;\">2<\/span><span style=\"color: #808030;\">]<\/span> <span style=\"color: #800000; font-weight: bold;\">for<\/span> v <span style=\"color: #800000; font-weight: bold;\">in<\/span> arr_list<span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #800000; font-weight: bold;\">else<\/span><span style=\"color: #808030;\">:<\/span>\r\n      indices_or_sections <span style=\"color: #808030;\">=<\/span> <span style=\"color: #808030;\">[<\/span><span style=\"color: #400000;\">int<\/span><span style=\"color: #808030;\">(<\/span>train_split <span style=\"color: #44aadd;\">*<\/span> <span style=\"color: #400000;\">len<\/span><span style=\"color: #808030;\">(<\/span>df<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #400000;\">int<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #008c00;\">1<\/span> <span style=\"color: #44aadd;\">-<\/span> val_split<span style=\"color: #808030;\">)<\/span> <span style=\"color: #44aadd;\">*<\/span> <span style=\"color: #400000;\">len<\/span><span style=\"color: 
#808030;\">(<\/span>df<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">]<\/span>\r\n      train_ds<span style=\"color: #808030;\">,<\/span> val_ds<span style=\"color: #808030;\">,<\/span> test_ds <span style=\"color: #808030;\">=<\/span> np<span style=\"color: #808030;\">.<\/span>split<span style=\"color: #808030;\">(<\/span>df_sample<span style=\"color: #808030;\">,<\/span> indices_or_sections<span style=\"color: #808030;\">)<\/span>\r\n    \r\n    <span style=\"color: #800000; font-weight: bold;\">return<\/span> train_ds<span style=\"color: #808030;\">,<\/span> val_ds<span style=\"color: #808030;\">,<\/span> test_ds\r\n\r\n<span style=\"color: #696969;\">#Function to get class weights based on distribution of data in rows<\/span>\r\n<span style=\"color: #800000; font-weight: bold;\">def<\/span> generate_class_weights<span style=\"color: #808030;\">(<\/span>class_series<span style=\"color: #808030;\">,<\/span> multi_class<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> one_hot_encoded<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">False<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n  <span style=\"color: #800000; font-weight: bold;\">if<\/span> multi_class<span style=\"color: #808030;\">:<\/span>\r\n    <span style=\"color: #696969;\"># If class is one hot encoded, transform to categorical labels to use compute_class_weight   <\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> one_hot_encoded<span style=\"color: #808030;\">:<\/span>\r\n      class_series <span style=\"color: #808030;\">=<\/span> np<span style=\"color: #808030;\">.<\/span>argmax<span style=\"color: #808030;\">(<\/span>class_series<span style=\"color: #808030;\">,<\/span> axis<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">1<\/span><span 
style=\"color: #808030;\">)<\/span>\r\n  \r\n    <span style=\"color: #696969;\"># Compute class weights with sklearn method<\/span>\r\n    class_labels <span style=\"color: #808030;\">=<\/span> np<span style=\"color: #808030;\">.<\/span>unique<span style=\"color: #808030;\">(<\/span>class_series<span style=\"color: #808030;\">)<\/span>\r\n    class_weights <span style=\"color: #808030;\">=<\/span> compute_class_weight<span style=\"color: #808030;\">(<\/span>class_weight<span style=\"color: #808030;\">=<\/span><span style=\"color: #0000e6;\">'balanced'<\/span><span style=\"color: #808030;\">,<\/span> classes<span style=\"color: #808030;\">=<\/span>class_labels<span style=\"color: #808030;\">,<\/span> y<span style=\"color: #808030;\">=<\/span>class_series<span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">return<\/span> <span style=\"color: #400000;\">dict<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #400000;\">zip<\/span><span style=\"color: #808030;\">(<\/span>class_labels<span style=\"color: #808030;\">,<\/span> class_weights<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span>\r\n  <span style=\"color: #800000; font-weight: bold;\">else<\/span><span style=\"color: #808030;\">:<\/span>\r\n    <span style=\"color: #696969;\"># It is necessary that the multi-label values are one-hot encoded<\/span>\r\n    mlb <span style=\"color: #808030;\">=<\/span> <span style=\"color: #074726;\">None<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> <span style=\"color: #800000; font-weight: bold;\">not<\/span> one_hot_encoded<span style=\"color: #808030;\">:<\/span>\r\n      mlb <span style=\"color: #808030;\">=<\/span> MultiLabelBinarizer<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span>\r\n      class_series <span style=\"color: #808030;\">=<\/span> mlb<span style=\"color: #808030;\">.<\/span>fit_transform<span 
style=\"color: #808030;\">(<\/span>class_series<span style=\"color: #808030;\">)<\/span>\r\n\r\n    n_samples <span style=\"color: #808030;\">=<\/span> <span style=\"color: #400000;\">len<\/span><span style=\"color: #808030;\">(<\/span>class_series<span style=\"color: #808030;\">)<\/span>\r\n    n_classes <span style=\"color: #808030;\">=<\/span> <span style=\"color: #400000;\">len<\/span><span style=\"color: #808030;\">(<\/span>class_series<span style=\"color: #808030;\">[<\/span><span style=\"color: #008c00;\">0<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># Count each class frequency<\/span>\r\n    class_count <span style=\"color: #808030;\">=<\/span> <span style=\"color: #808030;\">[<\/span><span style=\"color: #008c00;\">0<\/span><span style=\"color: #808030;\">]<\/span> <span style=\"color: #44aadd;\">*<\/span> n_classes\r\n    <span style=\"color: #800000; font-weight: bold;\">for<\/span> classes <span style=\"color: #800000; font-weight: bold;\">in<\/span> class_series<span style=\"color: #808030;\">:<\/span>\r\n        <span style=\"color: #800000; font-weight: bold;\">for<\/span> index <span style=\"color: #800000; font-weight: bold;\">in<\/span> <span style=\"color: #400000;\">range<\/span><span style=\"color: #808030;\">(<\/span>n_classes<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n            <span style=\"color: #800000; font-weight: bold;\">if<\/span> classes<span style=\"color: #808030;\">[<\/span>index<span style=\"color: #808030;\">]<\/span> <span style=\"color: #44aadd;\">!=<\/span> <span style=\"color: #008c00;\">0<\/span><span style=\"color: #808030;\">:<\/span>\r\n                class_count<span style=\"color: #808030;\">[<\/span>index<span style=\"color: #808030;\">]<\/span> <span style=\"color: #44aadd;\">+<\/span><span style=\"color: #808030;\">=<\/span> <span style=\"color: #008c00;\">1<\/span>\r\n    \r\n    
<span style=\"color: #696969;\"># Compute class weights using balanced method<\/span>\r\n    class_weights <span style=\"color: #808030;\">=<\/span> <span style=\"color: #808030;\">[<\/span>n_samples <span style=\"color: #44aadd;\">\/<\/span> <span style=\"color: #808030;\">(<\/span>n_classes <span style=\"color: #44aadd;\">*<\/span> freq<span style=\"color: #808030;\">)<\/span> <span style=\"color: #800000; font-weight: bold;\">if<\/span> freq <span style=\"color: #44aadd;\">&gt;<\/span> <span style=\"color: #008c00;\">0<\/span> <span style=\"color: #800000; font-weight: bold;\">else<\/span> <span style=\"color: #008c00;\">1<\/span> <span style=\"color: #800000; font-weight: bold;\">for<\/span> freq <span style=\"color: #800000; font-weight: bold;\">in<\/span> class_count<span style=\"color: #808030;\">]<\/span>\r\n    class_labels <span style=\"color: #808030;\">=<\/span> <span style=\"color: #400000;\">range<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #400000;\">len<\/span><span style=\"color: #808030;\">(<\/span>class_weights<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span> <span style=\"color: #800000; font-weight: bold;\">if<\/span> mlb <span style=\"color: #800000; font-weight: bold;\">is<\/span> <span style=\"color: #074726;\">None<\/span> <span style=\"color: #800000; font-weight: bold;\">else<\/span> mlb<span style=\"color: #808030;\">.<\/span>classes_\r\n    <span style=\"color: #800000; font-weight: bold;\">return<\/span> <span style=\"color: #400000;\">dict<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #400000;\">zip<\/span><span style=\"color: #808030;\">(<\/span>class_labels<span style=\"color: #808030;\">,<\/span> class_weights<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span>\r\n\r\n<span style=\"color: #696969;\">#Function to clean Reddit data<\/span>\r\n<span style=\"color: #800000; font-weight: bold;\">def<\/span> clean<span 
style=\"color: #808030;\">(<\/span>text<span style=\"color: #808030;\">,<\/span> newline<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> quote<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> bullet_point<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> \r\n          link<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> strikethrough<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> spoiler<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span>\r\n          code<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> superscript<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> table<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> heading<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n    <span style=\"color: #696969;\">\"\"\"<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0Cleans text (string).<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0Removes common Reddit special characters\/symbols:<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0* \\n (newlines)<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0* &amp;gt; (&gt; quotes)<\/span>\r\n<span style=\"color: 
#696969;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0* * or &amp;amp;#x200B; (bullet points)<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0* []() (links)<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0* etc (see below)<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0Specific removals can be turned off, but everything is on by default.<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0Standard punctuation etc is deliberately not removed, can be done in a<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0second round manually, or may be preserved in any case.<\/span>\r\n<span style=\"color: #696969;\">\u00a0\u00a0\u00a0\u00a0\"\"\"<\/span>\r\n    <span style=\"color: #696969;\"># Newlines (replaced with space to preserve cases like word1\\nword2)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> newline<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">r'\\n+'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">' '<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n        <span style=\"color: #696969;\"># Remove resulting ' '<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> text<span style=\"color: #808030;\">.<\/span>strip<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">r'\\s\\s+'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">' '<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: 
#808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># &gt; Quotes<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> quote<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">r'\\\"?\\\\?&amp;?gt;?'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># Bullet points\/asterisk (bold\/italic)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> bullet_point<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">r'\\*'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'&amp;amp;#x200B;'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># []() Link (Also removes the hyperlink)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> link<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">r'\\[.*?\\]\\(.*?\\)'<\/span><span style=\"color: #808030;\">,<\/span> <span 
style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># Strikethrough<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> strikethrough<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'~'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># Spoiler, which is used with &lt; less-than (Preserves the text)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> spoiler<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'&amp;lt;'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">r'!(.*?)!'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">r'\\1'<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># Code, inline and block<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> code<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span 
style=\"color: #0000e6;\">'`'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># Superscript (Preserves the text)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> superscript<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">r'\\^\\((.*?)\\)'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">r'\\1'<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># Table<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> table<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">r'\\|'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">' '<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">':-'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n\r\n    <span style=\"color: #696969;\"># Heading<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">if<\/span> heading<span style=\"color: #808030;\">:<\/span>\r\n        text <span style=\"color: #808030;\">=<\/span> re<span style=\"color: #808030;\">.<\/span>sub<span style=\"color: 
#808030;\">(<\/span><span style=\"color: #0000e6;\">'#'<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">''<\/span><span style=\"color: #808030;\">,<\/span> text<span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">return<\/span> text\r\n              \r\n    \r\n<span style=\"color: #800000; font-weight: bold;\">def<\/span> get_data<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n    dataset <span style=\"color: #808030;\">=<\/span> pd<span style=\"color: #808030;\">.<\/span>read_csv<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'trainingDataset-toxicComments.csv'<\/span><span style=\"color: #808030;\">)<\/span>\r\n    dataset <span style=\"color: #808030;\">=<\/span> dataset<span style=\"color: #808030;\">.<\/span>sample<span style=\"color: #808030;\">(<\/span>frac<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">,<\/span>random_state<span style=\"color: #808030;\">=<\/span>seed_value<span style=\"color: #808030;\">)<\/span>\r\n    dataset<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'comment_text'<\/span><span style=\"color: #808030;\">]<\/span> <span style=\"color: #808030;\">=<\/span> dataset<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'comment_text'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">.<\/span>apply<span style=\"color: #808030;\">(<\/span><span style=\"color: #800000; font-weight: bold;\">lambda<\/span> x<span style=\"color: #808030;\">:<\/span> clean<span style=\"color: #808030;\">(<\/span>x<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span>\r\n\r\n    myLabels <span style=\"color: #808030;\">=<\/span> dataset<span style=\"color: #808030;\">[<\/span>classes<span style=\"color: 
#808030;\">]<\/span>\r\n    \r\n    dataset<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'category'<\/span><span style=\"color: #808030;\">]<\/span> <span style=\"color: #808030;\">=<\/span> myLabels<span style=\"color: #808030;\">.<\/span>idxmax<span style=\"color: #808030;\">(<\/span>axis<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">)<\/span>\r\n    dataset<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'label'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">=<\/span> dataset<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'category'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">.<\/span><span style=\"color: #400000;\">map<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #800080;\">{<\/span><span style=\"color: #0000e6;\">'highly_toxic'<\/span><span style=\"color: #808030;\">:<\/span><span style=\"color: #008c00;\">0<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #0000e6;\">'slightly_toxic'<\/span><span style=\"color: #808030;\">:<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">,<\/span><span style=\"color: #0000e6;\">'non_toxic'<\/span><span style=\"color: #808030;\">:<\/span><span style=\"color: #008c00;\">2<\/span><span style=\"color: #800080;\">}<\/span><span style=\"color: #808030;\">)<\/span>\r\n    y <span style=\"color: #808030;\">=<\/span> dataset<span style=\"color: #808030;\">[<\/span><span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'label'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">]<\/span>\r\n    htoxic<span style=\"color: #808030;\">,<\/span> stoxic<span style=\"color: #808030;\">,<\/span> neither <span style=\"color: #808030;\">=<\/span> np<span style=\"color: #808030;\">.<\/span>bincount<span style=\"color: 
#808030;\">(<\/span>dataset<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'label'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">)<\/span>\r\n    total <span style=\"color: #808030;\">=<\/span> neither <span style=\"color: #44aadd;\">+<\/span> stoxic  <span style=\"color: #44aadd;\">+<\/span> htoxic\r\n    <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'Examples:<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">    Total: {}<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">    highly toxic: {} ({:.2f}% of total)<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">'<\/span><span style=\"color: #808030;\">.<\/span>format<span style=\"color: #808030;\">(<\/span>\r\n        total<span style=\"color: #808030;\">,<\/span> htoxic<span style=\"color: #808030;\">,<\/span> <span style=\"color: #008c00;\">100<\/span> <span style=\"color: #44aadd;\">*<\/span> htoxic <span style=\"color: #44aadd;\">\/<\/span> total<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'Examples:<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">    Total: {}<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">    slightly toxic: {} ({:.2f}% of total)<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">'<\/span><span style=\"color: #808030;\">.<\/span>format<span style=\"color: #808030;\">(<\/span>\r\n        total<span style=\"color: #808030;\">,<\/span> stoxic<span style=\"color: #808030;\">,<\/span> <span style=\"color: #008c00;\">100<\/span> <span style=\"color: #44aadd;\">*<\/span> stoxic 
<span style=\"color: #44aadd;\">\/<\/span> total<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'Examples:<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">    Total: {}<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">    Neither: {} ({:.2f}% of total)<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">'<\/span><span style=\"color: #808030;\">.<\/span>format<span style=\"color: #808030;\">(<\/span>\r\n        total<span style=\"color: #808030;\">,<\/span> neither<span style=\"color: #808030;\">,<\/span> <span style=\"color: #008c00;\">100<\/span> <span style=\"color: #44aadd;\">*<\/span> neither <span style=\"color: #44aadd;\">\/<\/span> total<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span>   \r\n    weights <span style=\"color: #808030;\">=<\/span> generate_class_weights<span style=\"color: #808030;\">(<\/span>myLabels<span style=\"color: #808030;\">.<\/span>values<span style=\"color: #808030;\">,<\/span> multi_class<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> one_hot_encoded<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">\"Class weights:\"<\/span><span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: 
#808030;\">(<\/span>weights<span style=\"color: #808030;\">)<\/span>\r\n    \r\n    train_ds<span style=\"color: #808030;\">,<\/span> val_ds<span style=\"color: #808030;\">,<\/span> test_ds <span style=\"color: #808030;\">=<\/span> get_dataset_partitions_pd<span style=\"color: #808030;\">(<\/span>dataset<span style=\"color: #808030;\">,<\/span> target_variable<span style=\"color: #808030;\">=<\/span><span style=\"color: #0000e6;\">\"label\"<\/span><span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span>f<span style=\"color: #0000e6;\">'Distribution in training set: <\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">{train_ds[\"label\"].value_counts().sort_index() \/ len(train_ds)}<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">'<\/span><span style=\"color: #44aadd;\">+<\/span>\r\n      f<span style=\"color: #0000e6;\">'Distribution in validation set: <\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">{val_ds[\"label\"].value_counts().sort_index() \/ len(val_ds)}<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">'<\/span><span style=\"color: #44aadd;\">+<\/span>\r\n      f<span style=\"color: #0000e6;\">'Distribution in testing set: <\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">{test_ds[\"label\"].value_counts().sort_index() \/ len(test_ds)}'<\/span><span style=\"color: #808030;\">)<\/span>\r\n\r\n    train_sentences <span style=\"color: #808030;\">=<\/span> train_ds<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'comment_text'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">.<\/span>fillna<span style=\"color: #808030;\">(<\/span><span style=\"color: 
#0000e6;\">\"fillna\"<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">.<\/span><span style=\"color: #400000;\">str<\/span><span style=\"color: #808030;\">.<\/span>lower<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span>\r\n    y_train <span style=\"color: #808030;\">=<\/span> train_ds<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'label'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">.<\/span>values<span style=\"color: #808030;\">.<\/span>astype<span style=\"color: #808030;\">(<\/span>np<span style=\"color: #808030;\">.<\/span>int32<span style=\"color: #808030;\">)<\/span>\r\n    val_sentences <span style=\"color: #808030;\">=<\/span> val_ds<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'comment_text'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">.<\/span>fillna<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">\"fillna\"<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">.<\/span><span style=\"color: #400000;\">str<\/span><span style=\"color: #808030;\">.<\/span>lower<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span>\r\n    y_val <span style=\"color: #808030;\">=<\/span> val_ds<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'label'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">.<\/span>values<span style=\"color: #808030;\">.<\/span>astype<span style=\"color: #808030;\">(<\/span>np<span style=\"color: #808030;\">.<\/span>int32<span style=\"color: #808030;\">)<\/span>\r\n    test_sentences <span style=\"color: #808030;\">=<\/span> test_ds<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'comment_text'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">.<\/span>fillna<span style=\"color: 
#808030;\">(<\/span><span style=\"color: #0000e6;\">\"fillna\"<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">.<\/span><span style=\"color: #400000;\">str<\/span><span style=\"color: #808030;\">.<\/span>lower<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span>\r\n    y_tests <span style=\"color: #808030;\">=<\/span> test_ds<span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'label'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">.<\/span>values<span style=\"color: #808030;\">.<\/span>astype<span style=\"color: #808030;\">(<\/span>np<span style=\"color: #808030;\">.<\/span>int32<span style=\"color: #808030;\">)<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">return<\/span> train_sentences<span style=\"color: #808030;\">,<\/span> y_train<span style=\"color: #808030;\">,<\/span> val_sentences<span style=\"color: #808030;\">,<\/span> y_val<span style=\"color: #808030;\">,<\/span> test_sentences<span style=\"color: #808030;\">,<\/span> y_tests<span style=\"color: #808030;\">,<\/span>weights\r\n<\/pre>\n<p>Calling the above function to process the dataset, clean it and partition it can be done as follows:<\/p>\n<pre style=\"color: #000000; background: #ffffff;\"><span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'Result of processing the whole train set:'<\/span><span style=\"color: #808030;\">)<\/span>\r\ntrain_sentences<span style=\"color: #808030;\">,<\/span> y_train<span style=\"color: #808030;\">,<\/span> val_sentences<span style=\"color: #808030;\">,<\/span> y_val<span style=\"color: #808030;\">,<\/span> test_sentences<span style=\"color: #808030;\">,<\/span> y_tests<span style=\"color: #808030;\">,<\/span>weights <span style=\"color: #808030;\">=<\/span> get_data<span style=\"color: #808030;\">(<\/span><span style=\"color: 
)<\/span>\r\n<\/pre>">
#808030;\">)<\/span>\r\n<\/pre>\n<p>The following step involves specifying the name of the pre-trained BERT language model that will be fine-tuned as a transformer model to detect toxic comments. In this tutorial, we used the basic uncased BERT model, but other variants, such as BERT Tiny, Medium, and Large, are available in cased and uncased versions. For more information, please refer to Google Research\u2019s explanation of the various flavors of BERT, which can be found <a href=\"https:\/\/github.com\/google-research\/bert\">here<\/a>.<\/p>\n<pre style=\"color: #000000; background: #ffffff;\">MODEL_NAME <span style=\"color: #808030;\">=<\/span> <span style=\"color: #0000e6;\">'bert-base-uncased'<\/span>\r\nclasses <span style=\"color: #808030;\">=<\/span> <span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'highly_toxic'<\/span><span style=\"color: #808030;\">,<\/span><span style=\"color: #0000e6;\">'slightly_toxic'<\/span><span style=\"color: #808030;\">,<\/span><span style=\"color: #0000e6;\">'non_toxic'<\/span><span style=\"color: #808030;\">]<\/span>\r\n\r\nt <span style=\"color: #808030;\">=<\/span> text<span style=\"color: #808030;\">.<\/span>Transformer<span style=\"color: #808030;\">(<\/span>MODEL_NAME<span style=\"color: #808030;\">,<\/span> maxlen<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">256<\/span><span style=\"color: #808030;\">,<\/span> class_names<span style=\"color: #808030;\">=<\/span>classes<span style=\"color: #808030;\">)<\/span>\r\ntrn <span style=\"color: #808030;\">=<\/span> t<span style=\"color: #808030;\">.<\/span>preprocess_train<span style=\"color: #808030;\">(<\/span>train_sentences<span style=\"color: #808030;\">.<\/span>to_numpy<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> y_train<span style=\"color: #808030;\">)<\/span>\r\nval <span style=\"color: #808030;\">=<\/span> t<span style=\"color: #808030;\">.<\/span>preprocess_test<span style=\"color: #808030;\">(<\/span>val_sentences<span style=\"color: #808030;\">.<\/span>to_numpy<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> y_val<span style=\"color: #808030;\">)<\/span>\r\nmodel <span style=\"color: #808030;\">=<\/span> t<span style=\"color: #808030;\">.<\/span>get_classifier<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span>\r\nmodel<span style=\"color: #808030;\">.<\/span><span style=\"color: #400000;\">compile<\/span><span style=\"color: #808030;\">(<\/span>loss<span style=\"color: #808030;\">=<\/span>focal_loss<span style=\"color: #808030;\">(<\/span>alpha<span style=\"color: #808030;\">=<\/span><span style=\"color: #008000;\">0.25<\/span><span style=\"color: #808030;\">,<\/span> from_logits<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: 
#808030;\">,<\/span>\r\n              optimizer<span style=\"color: #808030;\">=<\/span><span style=\"color: #0000e6;\">'adam'<\/span><span style=\"color: #808030;\">,<\/span>\r\n              metrics<span style=\"color: #808030;\">=<\/span><span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'accuracy'<\/span><span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">)<\/span>\r\n\r\n<span style=\"color: #696969;\">#Run these two lines to find the best learning rate<\/span>\r\nlearner <span style=\"color: #808030;\">=<\/span> ktrain<span style=\"color: #808030;\">.<\/span>get_learner<span style=\"color: #808030;\">(<\/span>model<span style=\"color: #808030;\">,<\/span> train_data<span style=\"color: #808030;\">=<\/span>trn<span style=\"color: #808030;\">,<\/span> val_data<span style=\"color: #808030;\">=<\/span>val<span style=\"color: #808030;\">,<\/span> batch_size<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">16<\/span><span style=\"color: #808030;\">)<\/span>\r\nlearner<span style=\"color: #808030;\">.<\/span>lr_find<span style=\"color: #808030;\">(<\/span>show_plot<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">,<\/span> max_epochs<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">2<\/span><span style=\"color: #808030;\">)<\/span>\r\n<\/pre>\n<p>After defining the model, we used ktrain to preprocess and tokenize the dataset as required by BERT:<\/p>\n<pre style=\"color: #000000; background: #ffffff;\"><span style=\"color: #808030;\">(<\/span>x_train<span style=\"color: #808030;\">,<\/span>  y_train<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> <span style=\"color: #808030;\">(<\/span>x_test<span style=\"color: #808030;\">,<\/span> y_test<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> preproc <span style=\"color: #808030;\">=<\/span> text<span 
style=\"color: #808030;\">.<\/span>texts_from_array<span style=\"color: #808030;\">(<\/span>x_train<span style=\"color: #808030;\">=<\/span>train_sentences<span style=\"color: #808030;\">.<\/span>to_numpy<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> y_train<span style=\"color: #808030;\">=<\/span>y_train<span style=\"color: #808030;\">,<\/span>\r\n                                          x_test<span style=\"color: #808030;\">=<\/span>val_sentences<span style=\"color: #808030;\">.<\/span>to_numpy<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> y_test<span style=\"color: #808030;\">=<\/span>y_val<span style=\"color: #808030;\">,<\/span>\r\n                                          class_names<span style=\"color: #808030;\">=<\/span>classes<span style=\"color: #808030;\">,<\/span>\r\n                                          preprocess_mode<span style=\"color: #808030;\">=<\/span><span style=\"color: #0000e6;\">'bert'<\/span><span style=\"color: #808030;\">,<\/span>\r\n                                          maxlen<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">256<\/span><span style=\"color: #808030;\">)<\/span>\r\n<\/pre>\n<p>Then, we defined the evaluation metric that will be displayed by the model. For simplicity, we chose accuracy. Followed by compiling the model with the desired metric, ensuring that the correct loss function is used. Since our dataset represents a multi-class classification problem, we used categorical cross entropy as a loss function. If the problem was binary (e.g., toxic and non-toxic comments), we can use binary cross entropy as a loss function. For the optimizer, we used the default Adam optimizer, which is typically used in many neural network solutions. 
Next, we defined a ktrain learner to fine-tune and train the BERT language model.<\/p>\n<pre style=\"color: #000000; background: #ffffff;\">METRICS <span style=\"color: #808030;\">=<\/span> <span style=\"color: #808030;\">[<\/span><span style=\"color: #0000e6;\">'accuracy'<\/span><span style=\"color: #808030;\">]<\/span> <span style=\"color: #696969;\">#Here we will just use accuracy<\/span>\r\nmodel <span style=\"color: #808030;\">=<\/span> text<span style=\"color: #808030;\">.<\/span>text_classifier<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">'bert'<\/span><span style=\"color: #808030;\">,<\/span> train_data<span style=\"color: #808030;\">=<\/span><span style=\"color: #808030;\">(<\/span>x_train<span style=\"color: #808030;\">,<\/span> y_train<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> preproc<span style=\"color: #808030;\">=<\/span>preproc<span style=\"color: #808030;\">)<\/span>\r\nmodel<span style=\"color: #808030;\">.<\/span><span style=\"color: #400000;\">compile<\/span><span style=\"color: #808030;\">(<\/span>loss<span style=\"color: #808030;\">=<\/span><span style=\"color: #0000e6;\">'categorical_crossentropy'<\/span><span style=\"color: #808030;\">,<\/span>\r\n              optimizer<span style=\"color: #808030;\">=<\/span><span style=\"color: #0000e6;\">'adam'<\/span><span style=\"color: #808030;\">,<\/span>\r\n              metrics<span style=\"color: #808030;\">=<\/span>METRICS<span style=\"color: #808030;\">)<\/span>\r\nlearner <span style=\"color: #808030;\">=<\/span> ktrain<span style=\"color: #808030;\">.<\/span>get_learner<span style=\"color: #808030;\">(<\/span>model<span style=\"color: #808030;\">,<\/span> train_data<span style=\"color: #808030;\">=<\/span><span style=\"color: #808030;\">(<\/span>x_train<span style=\"color: #808030;\">,<\/span> y_train<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> val_data <span style=\"color: 
=<\/span> <span style=\"color:">
#808030;\">=<\/span> <span style=\"color: #808030;\">(<\/span>x_test<span style=\"color: #808030;\">,<\/span> y_test<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span>batch_size<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">16<\/span><span style=\"color: #808030;\">)<\/span>\r\n<\/pre>\n<p>Before proceeding with training the model, we defined a custom callback function to compute the Receiver Operating Characteristic - Area Under the Curve (ROC-AUC) score as a metric at the end of each epoch. ROC-AUC is a typical evaluation metric for toxicity classification problems; thus, we used it to save the best trained model based on the highest ROC-AUC score.<\/p>\n<pre style=\"color: #000000; background: #ffffff;\"><span style=\"color: #800000; font-weight: bold;\">from<\/span> sklearn<span style=\"color: #808030;\">.<\/span>metrics <span style=\"color: #800000; font-weight: bold;\">import<\/span> roc_auc_score\r\n<span style=\"color: #800000; font-weight: bold;\">from<\/span> tensorflow<span style=\"color: #808030;\">.<\/span>keras<span style=\"color: #808030;\">.<\/span>callbacks <span style=\"color: #800000; font-weight: bold;\">import<\/span> Callback\r\n<span style=\"color: #800000; font-weight: bold;\">class<\/span> RocAucEvaluation<span style=\"color: #808030;\">(<\/span>Callback<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n    <span style=\"color: #800000; font-weight: bold;\">def<\/span> <span style=\"color: #074726;\">__init__<\/span><span style=\"color: #808030;\">(<\/span>self<span style=\"color: #808030;\">,<\/span> validation_data<span style=\"color: #808030;\">=<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> interval<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n        <span style=\"color: #400000;\">super<\/span><span style=\"color: #808030;\">(<\/span>RocAucEvaluation<span style=\"color: #808030;\">,<\/span> self<span style=\"color: 
#808030;\">)<\/span><span style=\"color: #808030;\">.<\/span><span style=\"color: #074726;\">__init__<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span>\r\n\r\n        self<span style=\"color: #808030;\">.<\/span>interval <span style=\"color: #808030;\">=<\/span> interval\r\n        self<span style=\"color: #808030;\">.<\/span>X_val<span style=\"color: #808030;\">,<\/span> self<span style=\"color: #808030;\">.<\/span>y_val <span style=\"color: #808030;\">=<\/span> validation_data\r\n        self<span style=\"color: #808030;\">.<\/span>max_score <span style=\"color: #808030;\">=<\/span> <span style=\"color: #008c00;\">0<\/span>\r\n        self<span style=\"color: #808030;\">.<\/span>not_better_count <span style=\"color: #808030;\">=<\/span> <span style=\"color: #008c00;\">0<\/span>\r\n\r\n    <span style=\"color: #800000; font-weight: bold;\">def<\/span> on_epoch_end<span style=\"color: #808030;\">(<\/span>self<span style=\"color: #808030;\">,<\/span> epoch<span style=\"color: #808030;\">,<\/span> logs<span style=\"color: #808030;\">=<\/span><span style=\"color: #800080;\">{<\/span><span style=\"color: #800080;\">}<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n        <span style=\"color: #800000; font-weight: bold;\">if<\/span> epoch <span style=\"color: #44aadd;\">%<\/span> self<span style=\"color: #808030;\">.<\/span>interval <span style=\"color: #44aadd;\">==<\/span> <span style=\"color: #008c00;\">0<\/span><span style=\"color: #808030;\">:<\/span>\r\n            y_pred <span style=\"color: #808030;\">=<\/span> self<span style=\"color: #808030;\">.<\/span>model<span style=\"color: #808030;\">.<\/span>predict<span style=\"color: #808030;\">(<\/span>self<span style=\"color: #808030;\">.<\/span>X_val<span style=\"color: #808030;\">,<\/span> verbose<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">)<\/span>\r\n         
   score <span style=\"color: #808030;\">=<\/span> roc_auc_score<span style=\"color: #808030;\">(<\/span>self<span style=\"color: #808030;\">.<\/span>y_val<span style=\"color: #808030;\">,<\/span> y_pred<span style=\"color: #808030;\">)<\/span>\r\n            <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">\"<\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\"> ROC-AUC - epoch: %d - score: %.6f <\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">\"<\/span> <span style=\"color: #44aadd;\">%<\/span> <span style=\"color: #808030;\">(<\/span>epoch<span style=\"color: #44aadd;\">+<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">,<\/span> score<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span>\r\n            <span style=\"color: #800000; font-weight: bold;\">if<\/span> <span style=\"color: #808030;\">(<\/span>score <span style=\"color: #44aadd;\">&gt;<\/span> self<span style=\"color: #808030;\">.<\/span>max_score<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">:<\/span>\r\n                <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">\"*** New High Score (previous: %.6f) <\/span><span style=\"color: #0f69ff;\">\\n<\/span><span style=\"color: #0000e6;\">\"<\/span> <span style=\"color: #44aadd;\">%<\/span> self<span style=\"color: #808030;\">.<\/span>max_score<span style=\"color: #808030;\">)<\/span>\r\n                self<span style=\"color: #808030;\">.<\/span>model<span style=\"color: #808030;\">.<\/span>save_weights<span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">\"best_weights.h5\"<\/span><span style=\"color: #808030;\">)<\/span>\r\n                self<span style=\"color: #808030;\">.<\/span>max_score <span style=\"color: #808030;\">=<\/span> score\r\n                self<span style=\"color: #808030;\">.<\/span>not_better_count <span style=\"color: #808030;\">=<\/span> <span style=\"color: #008c00;\">0<\/span>\r\n            <span style=\"color: #800000; font-weight: bold;\">else<\/span><span style=\"color: #808030;\">:<\/span>\r\n                self<span style=\"color: #808030;\">.<\/span>not_better_count <span style=\"color: #44aadd;\">+<\/span><span style=\"color: #808030;\">=<\/span> <span style=\"color: #008c00;\">1<\/span>\r\n                <span style=\"color: #800000; font-weight: bold;\">if<\/span> self<span style=\"color: #808030;\">.<\/span>not_better_count <span style=\"color: #44aadd;\">&gt;<\/span> <span style=\"color: #008c00;\">3<\/span><span style=\"color: #808030;\">:<\/span>\r\n                    <span style=\"color: #800000; font-weight: bold;\">print<\/span><span style=\"color: #808030;\">(<\/span><span style=\"color: #0000e6;\">\"Epoch %05d: early stopping, high score = %.6f\"<\/span> <span style=\"color: #44aadd;\">%<\/span> <span style=\"color: #808030;\">(<\/span>epoch<span style=\"color: #808030;\">,<\/span> self<span style=\"color: #808030;\">.<\/span>max_score<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">)<\/span>\r\n                    self<span style=\"color: #808030;\">.<\/span>model<span style=\"color: #808030;\">.<\/span>stop_training <span style=\"color: #808030;\">=<\/span> <span style=\"color: #074726;\">True<\/span>\r\nRocAuc <span style=\"color: #808030;\">=<\/span> RocAucEvaluation<span style=\"color: #808030;\">(<\/span>validation_data<span style=\"color: #808030;\">=<\/span><span style=\"color: #808030;\">(<\/span>x_test<span style=\"color: #808030;\">,<\/span> y_test<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span> interval<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">1<\/span><span style=\"color: #808030;\">)<\/span>\r\n<\/pre>\n<p>An important component of fine-tuning any neural network model is 
the learning rate. With ktrain, we can write one line of code to estimate a good learning rate from our training set, as follows:<\/p>\n<pre style=\"color: #000000; background: #ffffff;\">learner<span style=\"color: #808030;\">.<\/span>lr_find<span style=\"color: #808030;\">(<\/span>show_plot<span style=\"color: #808030;\">=<\/span><span style=\"color: #074726;\">True<\/span><span style=\"color: #808030;\">)<\/span>\r\n<\/pre>\n<p>This step is optional, but it helps us identify a learning rate at which the loss decreases steadily, which in turn tends to yield a higher ROC-AUC score. After running this step, ktrain produced the following learning rate plot. Notice that the loss starts to drop around 1e-4, so it is a good choice of learning rate for fine-tuning.<\/p>\n<figure id=\"attachment_528\" aria-describedby=\"caption-attachment-528\" style=\"width: 386px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-528\" src=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/learning-rate.png\" alt=\"\" width=\"386\" height=\"266\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/learning-rate.png 386w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/learning-rate-300x207.png 300w\" sizes=\"(max-width: 386px) 100vw, 386px\" \/><figcaption id=\"caption-attachment-528\" class=\"wp-caption-text\">Figure 3: Learning rate visualization plot generated by ktrain<\/figcaption><\/figure>\n<p>To train the model, we can simply write the following line of code:<\/p>\n<pre style=\"color: #000000; background: #ffffff;\">learner<span style=\"color: #808030;\">.<\/span>fit_onecycle<span style=\"color: #808030;\">(<\/span>lr<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">1e-4<\/span><span 
style=\"color: #808030;\">,<\/span>epochs<span style=\"color: #808030;\">=<\/span><span style=\"color: #008c00;\">4<\/span><span style=\"color: #808030;\">,<\/span>callbacks<span style=\"color: #808030;\">=<\/span><span style=\"color: #808030;\">[<\/span>RocAuc<span style=\"color: #808030;\">]<\/span><span style=\"color: #808030;\">,<\/span>class_weight<span style=\"color: #808030;\">=<\/span>weights<span style=\"color: #808030;\">)<\/span>\r\n<\/pre>\n<p>This line takes the pre-trained BERT learner and fine-tunes it using the one-cycle learning rate policy with the following settings: a learning rate of 1e-4, the ROC-AUC callback defined earlier, 4 training epochs, and the precomputed class weights.<\/p>\n<p>Once the model is trained, we can evaluate its performance on the testing portion of the dataset with a couple of lines of ktrain code, as follows:<\/p>\n<pre style=\"color: #000000; background: #ffffff;\">testing <span style=\"color: #808030;\">=<\/span> preproc<span style=\"color: #808030;\">.<\/span>preprocess_test<span style=\"color: #808030;\">(<\/span>test_sentences<span style=\"color: #808030;\">.<\/span>to_numpy<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span>y<span style=\"color: #808030;\">=<\/span>y_tests<span style=\"color: #808030;\">)<\/span>\r\nlearner<span style=\"color: #808030;\">.<\/span>validate<span style=\"color: #808030;\">(<\/span>val_data<span style=\"color: #808030;\">=<\/span><span style=\"color: #808030;\">(<\/span>testing<span style=\"color: #808030;\">)<\/span><span style=\"color: #808030;\">,<\/span>class_names<span style=\"color: #808030;\">=<\/span>t<span style=\"color: #808030;\">.<\/span>get_classes<span style=\"color: #808030;\">(<\/span><span style=\"color: #808030;\">)<\/span><span style=\"color: 
#808030;\">)<\/span>\r\n<\/pre>\n<p>The learner.validate function call prints a classification report and a confusion matrix that summarize the performance, as shown here:<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-531\" src=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.29.09-PM-1024x484.png\" alt=\"\" width=\"580\" height=\"274\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.29.09-PM-1024x484.png 1024w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.29.09-PM-300x142.png 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.29.09-PM-768x363.png 768w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.29.09-PM.png 1040w\" sizes=\"(max-width: 580px) 100vw, 580px\" \/><br \/>\nThe results above show that our fine-tuned BERT model achieved an accuracy of 89%, which is not bad at all for detecting toxicity in Reddit comments.<\/p>\n<p>To showcase how the trained model can be used, we used ktrain to prepare a predictor object that takes any text and predicts its toxicity based on the fine-tuned BERT model.<\/p>\n<pre style=\"color: #000000; background: #ffffff;\">predictor <span style=\"color: #808030;\">=<\/span> ktrain<span style=\"color: #808030;\">.<\/span>get_predictor<span style=\"color: #808030;\">(<\/span>learner<span style=\"color: #808030;\">.<\/span>model<span style=\"color: #808030;\">,<\/span>preproc<span style=\"color: #808030;\">)<\/span>\r\nexamples <span style=\"color: #808030;\">=<\/span> <span style=\"color: #808030;\">[<\/span>\r\n    <span style=\"color: #0000e6;\">\"If you don't stop immediately, I will kill you.\"<\/span><span style=\"color: #808030;\">,<\/span>\r\n    <span style=\"color: #0000e6;\">\"Okay-- Take care sweetie.\"<\/span><span style=\"color: #808030;\">,<\/span>\r\n  
  <span style=\"color: #0000e6;\">\"You fucking asshole! WTF is wrong with you?!\"<\/span>\r\n<span style=\"color: #808030;\">]<\/span>\r\npredictor<span style=\"color: #808030;\">.<\/span>predict<span style=\"color: #808030;\">(<\/span>examples<span style=\"color: #808030;\">)<\/span>\r\n<\/pre>\n<p>In the code snippet above, we prepared three sentences that fall into the three categories of toxicity that our model can predict. If we apply common sense, we can guess which class each sentence belongs to. To verify that the model can predict the toxicity of each sentence, we called the predictor.predict function, which produced the following output:<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"wp-image-534 aligncenter\" src=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.35.29-PM-1024x81.png\" alt=\"\" width=\"611\" height=\"48\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.35.29-PM-1024x81.png 1024w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.35.29-PM-300x24.png 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.35.29-PM-768x61.png 768w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/Screen-Shot-2022-11-19-at-9.35.29-PM.png 1040w\" sizes=\"(max-width: 611px) 100vw, 611px\" \/><\/p>\n<p>Notice that our model was able to predict the class of each sentence correctly. Now, we have a fine-tuned BERT-based language model that can be used to detect toxicity from Reddit comments.<\/p>\n<p>In conclusion, BERT provides a great starting point for fine-tuning language models. However, there are many ways to improve upon the model. 
For example, we can experiment with different hyperparameters, or we can train the model on more data to improve the accuracy or ROC-AUC scores.<\/p>\n<p>Disclaimer: A portion of this blogpost was written with the assistance of OpenAI\u2019s GPT-3 Davinci engine.<\/p>\n<p>The entire script is available on <a href=\"https:\/\/colab.research.google.com\/drive\/1k5f-WodETqGBTA4uiwBTtpwSAt1vvnvM?usp=sharing\">Google Colab<\/a><\/p>\n<p>Happy coding!<\/p>\n<p><strong>References<\/strong><br \/>\n[1]\t\u201cWhat Is a Language Model?\u201d https:\/\/www.deepset.ai\/blog\/what-is-a-language-model (accessed Nov. 19, 2022).<\/p>\n<p>[2]\tD. Ash, \u201cLanguage Models in AI,\u201d unpack, Feb. 01, 2021. https:\/\/medium.com\/unpackai\/language-models-in-ai-70a318f43041 (accessed Nov. 19, 2022).<\/p>\n<p>[3]\tK. Ganesan, \u201cAI Document Classification: 5 Real-World Examples,\u201d Opinosis Analytics, Aug. 21, 2020. https:\/\/www.opinosis-analytics.com\/blog\/document-classification\/ (accessed Nov. 19, 2022).<\/p>\n<p>[4]\t\u201cWhat is BERT (Language Model) and How Does It Work?,\u201d SearchEnterpriseAI. https:\/\/www.techtarget.com\/searchenterpriseai\/definition\/BERT-language-model (accessed Nov. 19, 2022).<\/p>\n<p>[5]\tH. Almerekhi, H. Kwak, and B. J. Jansen, \u201cInvestigating toxicity changes of cross-community redditors from 2 billion posts and comments,\u201d PeerJ Comput. Sci., vol. 8, p. e1059, Aug. 2022, doi: 10.7717\/peerj-cs.1059.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Newcomers to the field of Artificial Intelligence (AI) often see the term \u2018language model\u2019 tossed around when discussing Natural Language Processing (NLP) tasks without any proper clarification of its importance and usage in solving real-world problems. 
So, this tutorial blogpost aims at demystifying language models by defining what a language model is, describing the common&hellip; <a class=\"more-link\" href=\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/\">Continue reading <span class=\"screen-reader-text\">Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems<\/span><\/a><\/p>\n","protected":false},"author":10,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26,21],"tags":[69,70],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v19.13 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems - Team Acua<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems - Team Acua\" \/>\n<meta property=\"og:description\" content=\"Newcomers to the field of Artificial Intelligence (AI) often see the term \u2018language model\u2019 tossed around when discussing Natural Language Processing (NLP) tasks without any proper clarification of its importance and usage in solving real-world problems. 
So, this tutorial blogpost aims at demystifying language models by defining what a language model is, describing the common&hellip; Continue reading Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems\" \/>\n<meta property=\"og:url\" content=\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/\" \/>\n<meta property=\"og:site_name\" content=\"Team Acua\" \/>\n<meta property=\"article:published_time\" content=\"2022-11-19T19:53:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-11-20T16:45:54+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/document_classification.webp\" \/>\n<meta name=\"author\" content=\"Hind Almerekhi\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Hind Almerekhi\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"15 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/\"},\"author\":{\"name\":\"Hind Almerekhi\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/180d6b2a32636ac94c2fecb1839931cc\"},\"headline\":\"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems\",\"datePublished\":\"2022-11-19T19:53:16+00:00\",\"dateModified\":\"2022-11-20T16:45:54+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/\"},\"wordCount\":1854,\"publisher\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/#organization\"},\"keywords\":[\"BERT\",\"Language models\"],\"articleSection\":[\"Algorithms\",\"Customer segmentation\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/\",\"url\":\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/\",\"name\":\"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems - Team 
Acua\",\"isPartOf\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/#website\"},\"datePublished\":\"2022-11-19T19:53:16+00:00\",\"dateModified\":\"2022-11-20T16:45:54+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/acua.qcri.org\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#website\",\"url\":\"https:\/\/acua.qcri.org\/blog\/\",\"name\":\"Team Acua\",\"description\":\"Audience, Customer, and User Analytics\",\"publisher\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/acua.qcri.org\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#organization\",\"name\":\"Team 
Acua\",\"url\":\"https:\/\/acua.qcri.org\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/cropped-cropped-logo.png\",\"contentUrl\":\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/cropped-cropped-logo.png\",\"width\":1466,\"height\":770,\"caption\":\"Team Acua\"},\"image\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/180d6b2a32636ac94c2fecb1839931cc\",\"name\":\"Hind Almerekhi\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/a23aaa08517337c496fdf67313626d2e?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/a23aaa08517337c496fdf67313626d2e?s=96&d=mm&r=g\",\"caption\":\"Hind Almerekhi\"},\"url\":\"https:\/\/acua.qcri.org\/blog\/author\/hialmerekhihbku-edu-qa\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems - Team Acua","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/","og_locale":"en_US","og_type":"article","og_title":"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems - Team Acua","og_description":"Newcomers to the field of Artificial Intelligence (AI) often see the term \u2018language model\u2019 tossed around when discussing Natural Language Processing (NLP) tasks without any proper clarification of its importance and usage in solving real-world problems. So, this tutorial blogpost aims at demystifying language models by defining what a language model is, describing the common&hellip; Continue reading Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems","og_url":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/","og_site_name":"Team Acua","article_published_time":"2022-11-19T19:53:16+00:00","article_modified_time":"2022-11-20T16:45:54+00:00","og_image":[{"url":"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/11\/document_classification.webp"}],"author":"Hind Almerekhi","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Hind Almerekhi","Est. 
reading time":"15 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/#article","isPartOf":{"@id":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/"},"author":{"name":"Hind Almerekhi","@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/180d6b2a32636ac94c2fecb1839931cc"},"headline":"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems","datePublished":"2022-11-19T19:53:16+00:00","dateModified":"2022-11-20T16:45:54+00:00","mainEntityOfPage":{"@id":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/"},"wordCount":1854,"publisher":{"@id":"https:\/\/acua.qcri.org\/blog\/#organization"},"keywords":["BERT","Language models"],"articleSection":["Algorithms","Customer segmentation"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/","url":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/","name":"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems - Team 
Acua","isPartOf":{"@id":"https:\/\/acua.qcri.org\/blog\/#website"},"datePublished":"2022-11-19T19:53:16+00:00","dateModified":"2022-11-20T16:45:54+00:00","breadcrumb":{"@id":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/acua.qcri.org\/blog\/demystifying-language-models-the-case-of-berts-usage-in-solving-classification-problems\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/acua.qcri.org\/blog\/"},{"@type":"ListItem","position":2,"name":"Demystifying Language Models: The Case of BERT\u2019s Usage in Solving Classification Problems"}]},{"@type":"WebSite","@id":"https:\/\/acua.qcri.org\/blog\/#website","url":"https:\/\/acua.qcri.org\/blog\/","name":"Team Acua","description":"Audience, Customer, and User Analytics","publisher":{"@id":"https:\/\/acua.qcri.org\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/acua.qcri.org\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/acua.qcri.org\/blog\/#organization","name":"Team Acua","url":"https:\/\/acua.qcri.org\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/cropped-cropped-logo.png","contentUrl":"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/cropped-cropped-logo.png","width":1466,"height":770,"caption":"Team 
Acua"},"image":{"@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/180d6b2a32636ac94c2fecb1839931cc","name":"Hind Almerekhi","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/a23aaa08517337c496fdf67313626d2e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a23aaa08517337c496fdf67313626d2e?s=96&d=mm&r=g","caption":"Hind Almerekhi"},"url":"https:\/\/acua.qcri.org\/blog\/author\/hialmerekhihbku-edu-qa\/"}]}},"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/posts\/511"}],"collection":[{"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/users\/10"}],"replies":[{"embeddable":true,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/comments?post=511"}],"version-history":[{"count":26,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/posts\/511\/revisions"}],"predecessor-version":[{"id":543,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/posts\/511\/revisions\/543"}],"wp:attachment":[{"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/media?parent=511"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/categories?post=511"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/tags?post=511"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}