{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "gpuClass": "standard"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Implementation of text classification with BERT\n",
        "\n",
        "\n",
        "Still Working on it.\n",
        "\n",
        "This notebook is based in this TensorFlow tutorial: [Classify text with BERT](https://www.tensorflow.org/tutorials/text/classify_text_wibert)\n",
        "\n",
        "BERT [(article link)](https://arxiv.org/abs/1810.04805) and other Transformer encoder architectures have been wildly successful on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models.\n",
        "\n",
        "![](http://www.d2l.ai/_images/nlp-map-pretrain.svg):\n",
        "\n",
        "Source: http://www.d2l.ai/chapter_natural-language-processing-pretraining/index.html\n",
        "\n",
        "BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks.\n",
        "\n",
        "In this notebook, I am going to use a pretreined BERT to compute vector-space representations of a hate speech dataset to feed two different downsteam Archtectures (CNN and MLP).\n",
        "\n",
        "Sentiment Analysis\n",
        "\n",
        "This notebook trains a sentiment analysis model to classify the [Hate Speech and Offensive Language Dataset]( https://www.kaggle.com/mrmorj/hate-speech-and-offensive-language-dataset) tweets in three classes:\n",
        " \n",
        "* 0 - hate speech \n",
        "* 1 - offensive language \n",
        "* 2 - neither as positive or negative"
      ],
      "metadata": {
        "id": "Jh_WkIs1iJDs"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ltc_HOzjX87s"
      },
      "outputs": [],
      "source": [
        "!pip install -q tensorflow-text > /dev/null\n",
        "!pip install -q tf-models-official > /dev/null"
      ]
    },
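    {
      "cell_type": "markdown",
      "source": [
        "The model loaded below (`classifier_model.h5`) was trained and saved outside this notebook, so its exact architecture is not shown here. As an illustration only, the next cell sketches how such a BERT classifier could be assembled in the style of the referenced tutorial: a TF Hub preprocessing layer, a small BERT encoder, dropout, and a softmax head with one unit per class. The Hub handles are assumptions, not necessarily the ones behind the saved model."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow_hub as hub\n",
        "import tensorflow_text as text  # registers the preprocessing ops\n",
        "\n",
        "# TF Hub handles in the style of the referenced tutorial (assumed; the\n",
        "# encoder behind classifier_model.h5 may differ).\n",
        "tfhub_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'\n",
        "tfhub_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1'\n",
        "\n",
        "def build_classifier_model(num_classes=3):\n",
        "    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')\n",
        "    encoder_inputs = hub.KerasLayer(tfhub_preprocess, name='preprocessing')(text_input)\n",
        "    encoder = hub.KerasLayer(tfhub_encoder, trainable=True, name='BERT_encoder')\n",
        "    # 'pooled_output' is the encoder's summary vector for the whole input text.\n",
        "    net = encoder(encoder_inputs)['pooled_output']\n",
        "    net = tf.keras.layers.Dropout(0.1)(net)\n",
        "    # One unit per class; softmax yields probability rows like the one shown later.\n",
        "    net = tf.keras.layers.Dense(num_classes, activation='softmax', name='classifier')(net)\n",
        "    return tf.keras.Model(text_input, net)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },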
    {
      "cell_type": "code",
      "source": [
        "from official.nlp.optimization import AdamWeightDecay, WarmUp\n",
        "import tensorflow as tf\n",
        "import tensorflow_hub as hub\n",
        "import tensorflow_text as text\n",
        "import numpy as np\n",
        "np.set_printoptions(suppress=True)\n",
        "\n",
        "\n",
        "# Carga el modelo\n",
        "with tf.keras.utils.custom_object_scope({'AdamWeightDecay': AdamWeightDecay, 'WarmUp': WarmUp}):\n",
        "    classifier_model = tf.keras.models.load_model('classifier_model.h5', \n",
        "                                                  custom_objects={'KerasLayer': hub.KerasLayer})"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "EQlIdYjKYAnn",
        "outputId": "efb1f87f-8b45-4201-ac14-2abdc74b8cfd"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.\n",
            "WARNING:tensorflow:From /usr/local/lib/python3.9/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n",
            "Instructions for updating:\n",
            "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089\n",
            "WARNING:tensorflow:Error in loading the saved optimizer state. As a result, your model is starting with a freshly initialized optimizer.\n"
          ]
        }
      ]
    },
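    {
      "cell_type": "markdown",
      "source": [
        "The `Error in loading the saved optimizer state` warning above means training would resume with a freshly initialized optimizer. If the model is only used for prediction, one option (an assumption about intent, not something this notebook originally does) is to load it with `compile=False`, which skips the training configuration and the custom `AdamWeightDecay`/`WarmUp` objects entirely:"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow_hub as hub\n",
        "\n",
        "# Inference-only load: compile=False skips deserializing the optimizer state,\n",
        "# so the warning above does not apply and no custom optimizer objects are needed.\n",
        "inference_model = tf.keras.models.load_model(\n",
        "    'classifier_model.h5',\n",
        "    custom_objects={'KerasLayer': hub.KerasLayer},\n",
        "    compile=False)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },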
    {
      "cell_type": "code",
      "source": [
        "classifier_model.predict([\"LEETSSS GOOO Get those ... outta here!!!!!!\"])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6Ma3P-7iYEbA",
        "outputId": "2e6fbc37-6ab8-4035-c706-f8d6d3d8c7ba"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "1/1 [==============================] - 1s 1s/step\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[0.99998355, 0.00001638, 0.00000017]], dtype=float32)"
            ]
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    }
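    ,
    {
      "cell_type": "markdown",
      "source": [
        "`predict` returns one row of softmax probabilities per input text, in the class order listed at the top of the notebook (0 = hate speech, 1 = offensive language, 2 = neither). The small helper below is hypothetical and only illustrates how to turn that array into a readable label:"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "# Class order follows the Hate Speech and Offensive Language Dataset labels.\n",
        "class_names = ['hate speech', 'offensive language', 'neither']\n",
        "\n",
        "probs = classifier_model.predict(['LEETSSS GOOO Get those ... outta here!!!!!!'])\n",
        "for row in probs:\n",
        "    print(class_names[int(np.argmax(row))], row)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    }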
  ]
}