Goal


I am going to develop and assess a fake news classifier using TensorFlow, a library commonly used by researchers to create machine learning models. We are going to build a model that predicts whether a news article is fake based on the article's content. Let's get started.

Import Training Data

The data we are going to import for this blog post is from the article:

  • Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques.” In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).

The data we are using has been separated into training data and test data. In order to build a machine learning model, we are going to import the training data first. We may access the training data from

https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true

after small modifications to the original data. We will import the test data once we have built the models and are ready to evaluate our best model in the later sections.

I have downloaded the csv file from the above link. Let's read the csv file with pandas to create the training fake_news dataframe.

# import the pandas package, which we use to work with dataframes
import pandas as pd
# import fake_news_train.csv into a DataFrame called fake_news
fake_news = pd.read_csv("fake_news_train.csv")

Let's see the top five rows of the fake_news dataframe.

fake_news.head()
| | Unnamed: 0 | title | text | fake |
|---|---|---|---|---|
| 0 | 17366 | Merkel: Strong result for Austria's FPO 'big c... | German Chancellor Angela Merkel said on Monday... | 0 |
| 1 | 5634 | Trump says Pence will lead voter fraud panel | WEST PALM BEACH, Fla.President Donald Trump sa... | 0 |
| 2 | 17487 | JUST IN: SUSPECTED LEAKER and “Close Confidant... | On December 5, 2017, Circa s Sara Carter warne... | 1 |
| 3 | 12217 | Thyssenkrupp has offered help to Argentina ove... | Germany s Thyssenkrupp, has offered assistance... | 0 |
| 4 | 5535 | Trump say appeals court decision on travel ban... | President Donald Trump on Thursday called the ... | 0 |

Based on the dataframe above, we can see that each row represents one article, and there are four columns: Unnamed: 0, title, text, and fake:

  • title: the title of the article
  • text: the full article text
  • fake: 0 indicates that the article contains only true content, and 1 indicates that the article contains fake news

The Unnamed: 0 column will not be used in this blog post.
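Before cleaning the data, it can be helpful to check how many articles we have and how balanced the fake labels are. The cell below is only a quick exploratory check, so I do not reproduce its output here.

# quick exploratory check of the size and label balance of the training data
print(fake_news.shape)
print(fake_news["fake"].value_counts())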

Make a Dataset

Notice that the title and text columns of the fake_news dataframe contain words such as "a", "the", and "and". These words are usually called stopwords. Since stopwords carry no significant meaning, we want to remove them to clean our data before making a new dataset.

Hence, the purpose of this part is to write a function called make_dataset that will:

  1. remove the uninformative stopwords from the title and text columns of the fake_news dataframe
  2. return a tf.data.Dataset with a tuple of (title, text) as inputs and the fake column as the output

By calling the make_dataset function on our training fake_news dataframe, we will directly get a dataset that satisfies the above requirements.

Before writing the make_dataset function, we need a way to know all the stopwords in advance so that we can recognize them in the title and text columns. Thanks to the sklearn module, we can obtain a list of English stopwords from the text submodule of sklearn.feature_extraction. Let's import it and produce the stopwords.

# import the text submodule of sklearn to get its list of English stopwords
from sklearn.feature_extraction import text
stopwords = text.ENGLISH_STOP_WORDS
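As a quick illustrative check (this cell is not required for the pipeline), we can confirm that a common function word belongs to this stopword set while a content word from our data does not.

# illustrative check of the stopword set
print(len(stopwords))           # number of stopwords in sklearn's English list
print("the" in stopwords)       # a function word: expected True
print("trump" in stopwords)     # a content word from our data: expected False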

Next, in order to make a TensorFlow dataset inside the make_dataset function, we need to import the tensorflow package. Besides that, we need to import the nltk package, whose tokenizers divide a string into substrings; this will be helpful when removing stopwords later.

import tensorflow as tf
import nltk
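To see how the tokenizer and the stopword set work together, here is a small illustrative example on a made-up sentence; this is exactly the operation that make_dataset will apply row by row below.

# illustrative example: tokenize a sentence and drop its stopwords
tokenizer = nltk.RegexpTokenizer(r"\w+")
sample = "the president said that the panel will meet on monday"
print(' '.join([word for word in tokenizer.tokenize(sample) if word not in stopwords]))
# expected to keep only the informative words, roughly "president said panel meet monday"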

Since the title and text columns of the fake_news dataframe contain words in both lower and upper case, we need to lowercase all the data first: the stopword set contains only lowercase words, so lowercasing lets us check membership correctly. Now, let's define the make_dataset function.

def make_dataset(df):
    """
    Remove the stopwords inside the title and text columns of the input dataframe 
    and make a TensorFlow dataset based on the updated title and text columns and 
    the fake column.
    
    Parameter
    ----------
    df: an input dataframe containing the title, text, and fake columns
    
    Return 
    ----------
    ds: a TensorFlow dataset with input (title, text) and output the fake column
    """
    
    # step 1: remove stopwords in the title and text columns
    # use the apply method to lowercase each entry
    # and then drop the tokens that are stopwords
    # this goes through each row of the column
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    df["title"] = df["title"].apply(lambda x: x.lower())
    df["title"] = df["title"].apply(lambda x: ' '.join([word for word in tokenizer.tokenize(x) if word not in stopwords]))
    df["text"] = df["text"].apply(lambda x: x.lower())
    df["text"] = df["text"].apply(lambda x: ' '.join([word for word in tokenizer.tokenize(x) if word not in stopwords]))
    
    # step 2: make a TensorFlow dataset
    ds = tf.data.Dataset.from_tensor_slices(
        (
            {
                "title" : df[["title"]], 
                "text"  : df[["text"]]
            }, 
            {
                "fake"  : df[["fake"]]
            }
        )
    )
    
    # batch the dataset so that the model trains on chunks of 100 rows at a time
    ds = ds.batch(100)
    
    return ds

Data Preparation

Let's call the make_dataset function by using the fake_news dataframe as an input to return a TensorFlow dataset as our primary dataset.

data = make_dataset(fake_news)
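To confirm that the dataset has the expected (inputs, output) structure, we can peek at a single batch. This is only a sanity check, so I do not reproduce its output here.

# peek at one batch to confirm the structure of the dataset
for inputs, output in data.take(1):
    print(inputs["title"].shape, inputs["text"].shape, output["fake"].shape)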

The next step is to split data into a training set and a validation set. We set the training data to be 80% of data and the validation data to be the remaining 20%.

# shuffle the data; fixing the shuffle order after the first pass keeps the
# training and validation splits below from overlapping across epochs
data = data.shuffle(buffer_size = len(data), reshuffle_each_iteration = False)
# training data size 80%
train_size = int(0.8*len(data))
# validation data size 20%
val_size   = int(0.2*len(data))
# get the training data by training data size
train = data.take(train_size)
# get the validation data by validation data size
# use skip(train_size) to avoid same data
val = data.skip(train_size).take(val_size)
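As a quick sanity check, we can confirm how many 100-article batches ended up in each split; the exact counts depend on the number of rows in the training csv, so I do not list them here.

# number of batches in the full, training, and validation datasets
print(len(data), len(train), len(val))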

Create Models

In this section, we are going to create three TensorFlow models with different inputs:

  1. only the article title
  2. only the article text
  3. both the article title and the article text

The goal of creating these models is to answer the following question:

When detecting fake news, is it most effective to focus on only the title of the article, the full text of the article, or both?

We are going to use the Keras Functional API to construct all three models. Compared to the Keras Sequential API, the Functional API is more flexible: it supports shared layers and multiple inputs, both of which we need in this blog post. Before creating each model, we need to import the relevant packages.

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow import keras

Before building each model, let's first define some variables and prepare the inputs that will be shared by all three models.

Vectorization

In order for the computer to work with the content of the dataset, we need to convert each string into a vector of numbers. This process of representing text as a vector is called vectorization. Besides that, we also need to lowercase the strings and remove punctuation. Based on these rules, we are going to define a standardization function and pass it as an argument to the vectorization layer.

# import relevant packages in advance for vectorization later
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
import string
import re
size_vocabulary = 2000

def standardization(input_data):
    """
    Lowercase the words and remove punctuations in the input Tensorflow dataset.
    """

    # lowercase the strings again to make sure
    lowercase = tf.strings.lower(input_data)
    # remove punctuations
    no_punctuation = tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation),'')
    return no_punctuation 

# although make_dataset already lowercases the text and strips punctuation via its tokenizer,
# we still pass standardization so the layer cleans its input once more
# we set size_vocabulary to be 2000
vectorize_layer = TextVectorization(
    standardize = standardization,
    max_tokens = size_vocabulary,
    output_mode ='int',
    output_sequence_length = 500) 

vectorize_layer.adapt(train.map(lambda x, y: x["title"]))

Similarly, we are going to vectorize the text by using the standardization function defined above. Note that this cell creates a new TextVectorization layer and assigns it to the same vectorize_layer name, so the layer adapted here on the text is the one that the model pipelines below will use.

# although make_dataset already lowercases the text and strips punctuation via its tokenizer,
# we still pass standardization so the layer cleans its input once more
# we set size_vocabulary to be 2000
vectorize_layer = TextVectorization(
    standardize = standardization,
    max_tokens = size_vocabulary,
    output_mode ='int',
    output_sequence_length = 500) 

vectorize_layer.adapt(train.map(lambda x, y: x["text"]))
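Before moving on, we can sanity-check what the vectorization layer produces by applying it to a short example string. This cell is purely illustrative: the sentence is made up, and the integer ids it produces depend on the adapted vocabulary, so I only look at the output shape.

# illustrative check: the layer maps each string to a length-500 vector of token ids
example = tf.constant(["president said the panel will meet on monday"])
print(vectorize_layer(example).shape)   # expected shape: (1, 500)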

Define Embedding

We are going to define shared_embedding as a single Embedding layer that is reused in all three models. Sharing one embedding layer means the models learn a common representation of words, which is also what we will visualize in the embedding visualization section at the end of this post.

shared_embedding = layers.Embedding(size_vocabulary, 10, name = "embedding")
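To get a sense of what this layer does, recall that it maps each integer token id produced by the vectorization layer to a 10-dimensional vector. The short check below is illustrative only; the three ids are arbitrary.

# each of the 3 arbitrary token ids is mapped to a 10-dimensional vector
print(shared_embedding(tf.constant([[1, 2, 3]])).shape)   # expected shape: (1, 3, 10)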

Define Inputs

When constructing the models, we need to specify a keras.Input for each input. Across the three models, only two distinct inputs appear, so we may define title_input and text_input in advance for convenience.

# define title_input
title_input = keras.Input(
    shape = (1,), 
    name = "title",
    dtype = "string"
)
# define text_input
text_input = keras.Input(
    shape = (1,), 
    name = "text",
    dtype = "string"
)

First Model: use only title

Write the Pipeline

We are going to construct the first model from the following layers. The Embedding layer is the shared_embedding defined above.

# create the following layers one by one
title_features = vectorize_layer(title_input)
title_features = shared_embedding(title_features)
title_features = layers.Dropout(0.2)(title_features)
title_features = layers.GlobalAveragePooling1D()(title_features)
title_features = layers.Dropout(0.2)(title_features)
title_features = layers.Dense(32, activation = 'relu')(title_features)
# output layer
output_title = layers.Dense(2, name = "fake")(title_features)

Create the First Model

Now, let's create the first model by specifying the input, title_input, and output in the following cell.

model_title = keras.Model(
    inputs = [title_input],
    outputs = output_title
)

We may also see the layers of our first model by running the cell below.

model_title.summary()
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
title (InputLayer)           [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 500)               0         
_________________________________________________________________
embedding (Embedding)        (None, 500, 10)           20000     
_________________________________________________________________
dropout (Dropout)            (None, 500, 10)           0         
_________________________________________________________________
global_average_pooling1d (Gl (None, 10)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense (Dense)                (None, 32)                352       
_________________________________________________________________
fake (Dense)                 (None, 2)                 66        
=================================================================
Total params: 20,418
Trainable params: 20,418
Non-trainable params: 0
_________________________________________________________________

Train the First Model

We are ready to train our first model. Training a model requires compiling it and then fitting it to the training data. So, first, let's compile it.

model_title.compile(optimizer = "adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits = True),
              metrics = ["accuracy"]
)

Then, we fit the model to the training data, train, defined above, and pass val as the validation data.

history_title = model_title.fit(train, 
                                epochs = 50, 
                                validation_data = val, 
                                verbose = False)

Visualize Accuracy

Finally, we are going to plot the training accuracy and the validation accuracy at each epoch. Before plotting, we need to import matplotlib.

# import matplotlib for plotting
from matplotlib import pyplot as plt
# get the final validation accuracy
val_acc = round(history_title.history["val_accuracy"][-1], 5)
# plot both the training accuracy and validation accuracy
plt.plot(history_title.history["accuracy"], label = "training")
plt.plot(history_title.history["val_accuracy"], label = "validation")
# add title, xlabel, and ylabel
plt.gca().set(xlabel = "epoch", ylabel = "accuracy", title = "First Model by Title with Validation Accuracy: " + str(val_acc))
plt.legend()

![png]({{ site.baseurl }}/images/Blog3_revise_files/Blog3_revise_54_1.png)

From the above graph, we can see that the validation accuracy of our first model is about 0.93822, which is decent but leaves room for improvement. Since we used Dropout when writing the pipeline, overfitting does not appear to be an issue. Let's continue to create the second and the third models and see whether we get a higher accuracy.

Second Model: use only text

Write the Pipeline

Similarly, we are going to construct the second model using the same layer structure as the first model. We will again use the shared_embedding defined above.

# create the following layers one by one
text_features = vectorize_layer(text_input)
text_features = shared_embedding(text_features)
text_features = layers.Dropout(0.2)(text_features)
text_features = layers.GlobalAveragePooling1D()(text_features)
text_features = layers.Dropout(0.2)(text_features)
text_features = layers.Dense(32, activation = 'relu')(text_features)
# output layer
output_text = layers.Dense(2, name = "fake")(text_features)

Create the Second Model

Let's create the model by specifying the input, text_input, and output in the following cell.

model_text = keras.Model(
    inputs = [text_input],
    outputs = output_text
)

Let's also see the summary of the layers in our second model.

model_text.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
text (InputLayer)            [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 500)               0         
_________________________________________________________________
embedding (Embedding)        (None, 500, 10)           20000     
_________________________________________________________________
dropout_2 (Dropout)          (None, 500, 10)           0         
_________________________________________________________________
global_average_pooling1d_1 ( (None, 10)                0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                352       
_________________________________________________________________
fake (Dense)                 (None, 2)                 66        
=================================================================
Total params: 20,418
Trainable params: 20,418
Non-trainable params: 0
_________________________________________________________________

Notice that the layers of the second model mirror those of the first model; this is because we used the same pipeline structure to construct both.

Train the Second Model

Before fitting the training data into the model, we need to compile it.

# compile first
model_text.compile(optimizer = "adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits = True),
              metrics = ["accuracy"]
)
history_text = model_text.fit(train, 
                              epochs = 50, 
                              validation_data = val,
                              verbose = False)

Now, our second model has been trained, and we are ready to visualize its accuracy in the next step.

Visualize Accuracy

# get the final validation accuracy
val_acc = round(history_text.history["val_accuracy"][-1], 5)
# plot both the training accuracy and validation accuracy
plt.plot(history_text.history["accuracy"], label = "training")
plt.plot(history_text.history["val_accuracy"], label = "validation")
# add title, xlabel, and ylabel
plt.gca().set(xlabel = "epoch", ylabel = "accuracy", title = "Second Model by Text with Validation Accuracy: " + str(val_acc))
plt.legend()

![png]({{ site.baseurl }}/images/Blog3_revise_files/Blog3_revise_72_1.png)

Based on the above graph, our second model reaches a higher validation accuracy, about 0.99348, than the first model. Since we again used Dropout in the pipeline, overfitting does not appear to be an issue, so using only the article text already gives a strong model.

Third Model: use both title and text

In this part, we are going to construct our third model, which uses both the title and the text, by concatenating the two feature pipelines. Since we already have the inputs and both pipelines, let's continue to build the third model.

Concatenation

We are going to concatenate the title_features pipeline with the text_features pipeline.

main = layers.concatenate([title_features, text_features], axis = 1)

Create the Third Model

Let's pass the concatenated set of computed features through one more Dense layer and then the output layer, shown below.

main = layers.Dense(32, activation = "relu")(main)
output = layers.Dense(2, name = "fake")(main)

Let's specify the inputs, title_input and text_input, and the output of our third model in the following cell.

model = keras.Model(
    inputs = [title_input, text_input],
    outputs = output
)
model.summary()
Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
title (InputLayer)              [(None, 1)]          0                                            
__________________________________________________________________________________________________
text (InputLayer)               [(None, 1)]          0                                            
__________________________________________________________________________________________________
text_vectorization_1 (TextVecto (None, 500)          0           title[0][0]                      
                                                                 text[0][0]                       
__________________________________________________________________________________________________
embedding (Embedding)           (None, 500, 10)      20000       text_vectorization_1[0][0]       
                                                                 text_vectorization_1[1][0]       
__________________________________________________________________________________________________
dropout (Dropout)               (None, 500, 10)      0           embedding[0][0]                  
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 500, 10)      0           embedding[1][0]                  
__________________________________________________________________________________________________
global_average_pooling1d (Globa (None, 10)           0           dropout[0][0]                    
__________________________________________________________________________________________________
global_average_pooling1d_1 (Glo (None, 10)           0           dropout_2[0][0]                  
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 10)           0           global_average_pooling1d[0][0]   
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 10)           0           global_average_pooling1d_1[0][0] 
__________________________________________________________________________________________________
dense (Dense)                   (None, 32)           352         dropout_1[0][0]                  
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 32)           352         dropout_3[0][0]                  
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 64)           0           dense[0][0]                      
                                                                 dense_1[0][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 32)           2080        concatenate[0][0]                
__________________________________________________________________________________________________
fake (Dense)                    (None, 2)            66          dense_2[0][0]                    
==================================================================================================
Total params: 22,850
Trainable params: 22,850
Non-trainable params: 0
__________________________________________________________________________________________________

The summary of the third model contains layers that also appeared in the first and second models. This reflects how the third model is built: by concatenating the title_features pipeline with the text_features pipeline.

Train the Third Model

Again, we need to compile before fitting the training data into the model.

model.compile(optimizer = "adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits = True),
              metrics = ["accuracy"]
)
history = model.fit(train, 
                    epochs = 50, 
                    validation_data = val, 
                    verbose = False)

Visualize Accuracy

Now, we are ready to visualize both the training accuracy and the validation accuracy at each epoch to see how well our third model performs.

# get the final validation accuracy
val_acc = round(history.history["val_accuracy"][-1], 5)
# plot both the training accuracy and validation accuracy
plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
# add title, xlabel, and ylabel
plt.gca().set(xlabel = "epoch", ylabel = "accuracy", title = "Third Model by Title and Text with Validation Accuracy: " + str(val_acc))
plt.legend()

![png]({{ site.baseurl }}/images/Blog3_revise_files/Blog3_revise_92_1.png)

Based on the graph above, our third model reaches a validation accuracy of about 0.99978, which is very high. Since we used Dropout in both the title_features and text_features pipelines, the model does not show overfitting problems.

Model Comparisons

In summary, the first model has a validation accuracy of about 0.93822, the second about 0.99348, and the third about 0.99978, as collected in the cell below. Since none of the models shows signs of overfitting, the third model, with the highest validation accuracy, is the best. Hence, I think algorithms should use both the title and the text of an article when seeking to detect fake news.
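To put the three models side by side, we can gather the final validation accuracies into a small summary dataframe. This is a convenience cell only; the numbers simply restate the results reported above.

# collect the final validation accuracy of each model in one dataframe
comparison = pd.DataFrame({
    "model"        : ["title only", "text only", "title and text"],
    "val_accuracy" : [history_title.history["val_accuracy"][-1],
                      history_text.history["val_accuracy"][-1],
                      history.history["val_accuracy"][-1]]
})
comparison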

Model Evaluation

In this section, we are going to use the test data to evaluate our best model, the third model, to answer the following question:

If we used our best model as a fake news detector, how often would we be right?

Similar to what we did when importing the training data at the beginning of this blog post, we are going to import the test data from

https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true

Let's read the csv file, which I have downloaded from the link above, into a test_data dataframe with pandas.

# import fake_news_test.csv into a DataFrame called test_data
test_data = pd.read_csv("fake_news_test.csv")

Then, let's apply the make_dataset function to test_data.

test_ds = make_dataset(test_data)

Now, let's evaluate our best model on test_ds.

model.evaluate(test_ds)
225/225 [==============================] - 1s 5ms/step - loss: 0.1200 - accuracy: 0.9816
[0.11995154619216919, 0.9816027283668518]

Based on the above output, our best model, used as a fake news detector, predicts with about 98.2% accuracy on the test data, which is high. This confirms that our third model generalizes well to unseen articles.
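Since the model outputs two logits per article, we can also look at individual predictions by taking the argmax over the two classes. The cell below is an illustrative sketch: it recomputes the test accuracy by hand, and the result should roughly match the evaluate output above.

# the model outputs two logits per article; argmax gives the predicted label
import numpy as np
pred_labels = np.argmax(model.predict(test_ds), axis = 1)
# collect the true labels from the test dataset for comparison
true_labels = np.concatenate([output["fake"].numpy().ravel() for _, output in test_ds])
# fraction of test articles classified correctly
print((pred_labels == true_labels).mean())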

Embedding Visualization

In the last section, we are going to visualize the embedding learned by our model. By plotting each word in the vocabulary as a point, we can see which words the model treats as similar and which words tend to be associated with real or fake news articles. In order to make this plot, we take the weights of the embedding layer in our best model and then get the vocabulary.

weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()
# import PCA to reduce the dimension down to a visualizable number
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
weights = pca.fit_transform(weights)

We are going to create a dataframe called embedding_df that holds the vocabulary together with the two PCA components of the weights computed above. This dataframe will be used for plotting in the last step.

embedding_df = pd.DataFrame({
    'word' : vocab, 
    'x0'   : weights[:,0],
    'x1'   : weights[:,1]
})
embedding_df
| | word | x0 | x1 |
|---|---|---|---|
| 0 |  | -0.232555 | 0.176622 |
| 1 | [UNK] | -0.187082 | 0.124137 |
| 2 | s | -0.206853 | 0.833015 |
| 3 | trump | -0.219647 | 0.032457 |
| 4 | said | -3.730632 | 0.383721 |
| ... | ... | ... | ... |
| 1995 | suffered | -0.618630 | 1.402475 |
| 1996 | basically | 3.909034 | 0.414360 |
| 1997 | regular | -1.900339 | -1.296854 |
| 1998 | projects | 1.243296 | -0.185288 |
| 1999 | firms | -2.884571 | 0.157406 |

2000 rows × 3 columns

Now, we are ready to create an interactive plotly scatterplot to visualize the embedding of our best model, where each point represents a word.

# import packages for embedding visualization
import numpy as np
import plotly.express as px 
# plot an interactive plotly scatterplot
fig = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 size = list(np.ones(len(embedding_df))),
                 size_max = 2,
                 hover_name = "word")
# show the figure
fig.show()
# import write_html to save the interactive plotly figure as an html file
from plotly.io import write_html
# save the figure to an html file
write_html(fig, "Embedding_Visualization.html")

From the graph above, we can see that the points cluster around the center and spread out a bit more horizontally than vertically. By checking the hover name of each point, we may find that some words, such as some of those close to the center, have more "controversial" and "aggressive" meanings than others. This is reasonable, since words that are used more frequently in fake news articles should end up close together in the embedding, given the goal of our model.
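If we want to see where a particular word lands in this embedding plane, we can also look it up directly in embedding_df. The word "trump" is used below only because we already know from the table above that it is in the vocabulary.

# look up the 2-dimensional coordinates of a single word in the embedding
embedding_df[embedding_df["word"] == "trump"]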

That's the end of this blog post. Thank you for reading!