Learn Computing from the Experts | The Rheinwerk Computing Blog

What Is Text Vectorization in AI?

Written by Rheinwerk Computing | Aug 7, 2024 1:00:00 PM

In artificial intelligence, words must first be transformed into numerical values before an embedding layer transforms them into vector representations.

 

Prior to that, the words are converted to lowercase, and the punctuation is removed. Keras, which ships with TensorFlow, provides an easy-to-use layer for all of this: TextVectorization.
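
To make the preprocessing step concrete, here is a minimal pure-Python sketch of lowercasing, punctuation removal, and splitting into words. The helper names standardize and tokenize are hypothetical and only serve as an illustration; this is not the Keras implementation, and it only handles ASCII punctuation.

```python
import string


def standardize(text: str) -> str:
    """Lowercase the text and delete all ASCII punctuation characters."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))


def tokenize(text: str) -> list[str]:
    """Split the standardized text into words at whitespace."""
    return standardize(text).split()


print(tokenize("You are the best teacher!"))
print(tokenize("I will name my first-born after you."))
```

Note that removing the hyphen turns first-born into the single word firstborn, so the second sentence yields seven words.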

 

The K8_textVectorization program classifies teaching assessments.

 

import pandas as pd
import tensorflow as tf
from keras.models import Model
import numpy as np

# Evaluation texts
train_data = [
    'You are the best teacher!',
    'I will name my first-born after you.',
    'Your lessons are boring',
    'terrible',
    'excellent!',
    'I was just sleeping.',
    'The best lessons',
    'I have learned nothing.',
    'lame',
    'You should stop teaching.'
]

# 0 for bad, 1 for good
train_col = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 0])

 

The train_data variable contains the texts, and train_col contains the corresponding ratings. The individual words are now converted into integers.

 

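# Use at most 50 vocabulary entries; pad or truncate every text to 10 integers
transform = tf.keras.layers.TextVectorization(max_tokens=50,
    output_sequence_length=10)

# Build the vocabulary from the training texts
transform.adapt(train_data)

# Replace each word with its integer index
tain_data_transformed = transform(train_data)
print(tain_data_transformed)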

 

With max_tokens=50, we limit the vocabulary to 50 entries. Two of them are reserved, index 0 for padding and index 1 for unknown words, so if the texts contain more distinct words, only the most common ones are kept. With output_sequence_length=10, every text is represented by exactly 10 integers: shorter texts are padded with zeros, and longer ones are truncated. Using transform.adapt(train_data), we create the table for the assignment, while tain_data_transformed = transform(train_data) finally transforms the texts into vectors. Pay attention to the differences: these vectors of integers represent the texts, with each word represented by one integer. The embedding layer, on the other hand, converts each individual integer into an n-dimensional vector of floats. The output that follows shows the content of the tain_data_transformed variable.

 

tf.Tensor(
[[ 2  7  4  6 13  0  0  0  0  0]
 [ 3  9 18 19 24 27  2  0  0  0]
 [ 8  5  7 26  0  0  0  0  0  0]
 [11  0  0  0  0  0  0  0  0  0]
 [25  0  0  0  0  0  0  0  0  0]
 [ 3 10 22 15  0  0  0  0  0  0]
 [ 4  6  5  0  0  0  0  0  0  0]
 [ 3 23 20 17  0  0  0  0  0  0]
 [21  0  0  0  0  0  0  0  0  0]
 [ 2 16 14 12  0  0  0  0  0  0]], shape=(10, 10), dtype=int64)
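
The mapping behind such a table can be sketched without TensorFlow. The following is a simplified illustration, not the Keras implementation: the functions build_vocabulary and encode are hypothetical names, the small corpus is made up from three of the training texts, and the order among equally frequent words is an assumption that need not match TensorFlow.

```python
from collections import Counter

PAD, UNK = "", "[UNK]"  # index 0 = padding, index 1 = unknown word


def build_vocabulary(token_lists, max_tokens):
    """Collect the most frequent words, leaving room for the two reserved entries."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Descending frequency; ties keep first-seen order here (an assumption)
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + words


def encode(tokens, vocabulary, length):
    """Replace words with their indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in tokens][:length]
    return ids + [0] * (length - len(ids))


corpus = [
    ["you", "are", "the", "best", "teacher"],
    ["the", "best", "lessons"],
    ["terrible"],
]
vocab = build_vocabulary(corpus, max_tokens=50)
print(vocab)
print(encode(["the", "best", "teacher"], vocab, length=10))
print(encode(["still", "asleep"], vocab, length=10))
```

Words that never occurred in the corpus, such as still and asleep, are all mapped to index 1, so the model cannot tell them apart.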

 

The integers are assigned by word frequency: 0 stands for padding, 1 for unknown words, and the most frequent words receive the smallest numbers from 2 onward. The order among equally frequent words may differ on your side, so the output may look slightly different. Compare the individual vectors with the texts. The first element in the first vector is the number 2. The print(transform.get_vocabulary()[2]) call outputs you, which is the first word in the first sentence. You should also call the method with other numbers in the vectors. You’ll see that the words have been replaced by numbers in the appropriate places. Now let’s build the model.

 

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(50, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation=tf.nn.softmax)
])

 

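The embedding layer is called with two parameters: the first (50) is the size of the vocabulary, that is, the number of distinct integers that can occur, and the second (16) specifies that each integer is to be transformed into a 16-dimensional vector. The GlobalAveragePooling1D layer then averages the word vectors of a text into a single vector, and the dense layer with softmax outputs one probability for each of the two classes.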
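
Conceptually, an embedding layer is a lookup table with one row per integer. The following NumPy sketch illustrates the shapes involved; the random embedding_table is only a stand-in for the weights that the real layer learns during training, and the two sequences are taken from the vectorized output shown earlier.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab_size, embedding_dim = 50, 16
# Stand-in for the trainable weights: one row of 16 floats per integer index
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Two padded integer sequences with 10 entries each
token_ids = np.array([
    [2, 7, 4, 6, 13, 0, 0, 0, 0, 0],
    [11, 0, 0, 0, 0, 0, 0, 0, 0, 0],
])

# Embedding: look up the row that belongs to each integer
embedded = embedding_table[token_ids]
# Global average pooling: mean over the 10 positions of each text
pooled = embedded.mean(axis=1)

print(embedded.shape)  # (texts, positions, embedding_dim)
print(pooled.shape)    # (texts, embedding_dim)
```

In this sketch, the padding zeros are included in the average. The same applies to the model above, because the embedding layer is created without mask_zero=True.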

 

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

model.fit(tain_data_transformed, train_col, epochs=200)

 

After 200 epochs, we get an accuracy of 90% on the training data. With this small dataset and this large number of epochs, overfitting certainly plays a major role, but that is beside the point here, as we only want to understand the principle.

 

reviews = [
    "terrible",
    "the best lessons",
    "please stop",
    "still asleep"
]

# Vectorize the new texts with the vocabulary built from the training data
txt = transform(reviews)
pred = model.predict(txt)
print(pred)

 

Now the trained model is supposed to classify new ratings. I’ve deliberately used wording similar to the training texts. For anything else, the training data is too small and the number of epochs too high. The output is shown in the following listing.

 

[[0.6918644  0.30813566]
 [0.48320082 0.5167992 ]
 [0.678091   0.32190904]
 [0.64080495 0.35919502]]

 

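Each row contains the probabilities for class 0 (bad) and class 1 (good), and the larger value determines the class. The second rating is therefore assigned to the good class (1); all other ratings are assigned to the bad class (0). This very simple example actually works.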
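
The class assignment can also be read off programmatically. Here is a small sketch that uses the probabilities from the listing above, copied into a NumPy array so that it runs without the trained model; the labels array is a helper introduced only for this illustration.

```python
import numpy as np

reviews = ["terrible", "the best lessons", "please stop", "still asleep"]

# Probabilities copied from the listing: column 0 = bad, column 1 = good
pred = np.array([
    [0.6918644, 0.30813566],
    [0.48320082, 0.5167992],
    [0.678091, 0.32190904],
    [0.64080495, 0.35919502],
])

labels = np.array(["bad", "good"])
# The index of the largest probability in each row is the predicted class
predicted_class = pred.argmax(axis=1)

for review, cls in zip(reviews, predicted_class):
    print(f"{review!r}: {labels[cls]}")
```

With a freshly trained model, the probabilities will differ from run to run, so the call pred.argmax(axis=1) is the part worth keeping, not the numbers.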

 

Editor’s note: This post has been adapted from a section of the book Developing AI Applications: An Introduction by Metin Karatas.