This post is about text mining. I will be using the 2017 and 2018 Budget Statements of Ghana for this tutorial. The post does not explain the results of the analysis in depth; rather, it shows how one can analyze the documents all around us. Links to the documents are provided in the next two paragraphs, so you can read them and compare them with the results obtained in this post.

The Minister of Finance, Mr. Ken Ofori-Atta, read the 2017 budget on 2nd March 2017 and the 2018 budget on 15th November 2017. He gave the budgets two Twi names: Asempa Budget (good news budget) for 2017 and Adwuma Budget (job budget) for 2018.

Many thanks to Citi Fm for making these documents readily available: 2017 and 2018.

Import libraries

To start, we need to load the necessary packages. None of the packages used in this tutorial ship with R or RStudio, so they need to be installed first. For example, install.packages("readtext") installs the readtext package and its dependencies.
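A minimal sketch of installing only what is missing, assuming the five packages named here are the full set needed. The helper missing_pkgs() is a hypothetical name used for illustration and is not part of any package.

```r
# Packages used in this tutorial (stringr is installed along with tidyverse)
needed <- c("readtext", "tidytext", "tidyverse", "wordcloud", "gridExtra")

# Hypothetical helper: return the packages in `pkgs` that are not yet installed
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_pkgs(needed)

# Install only in an interactive session, and only if something is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Note that the package names are quoted strings; passing a bare name to install.packages() fails unless an object of that name happens to exist.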

#LOAD IN THE NEEDED LIBRARIES
library(readtext)               # reading pdf files
library(tidytext)               # generating tokens from pdf files 
library(tidyverse)              # data manipulation / visualization
library(stringr)                # regular expression
library(wordcloud)              # generating wordclouds
library(gridExtra)              # data visualization

Get the data

Next, I downloaded the files and stored their names in variables. Remember to include the file extension (.pdf); an error is thrown when it is left out.

year2017 <- "2017-BUDGET-STATEMENT-AND-ECONOMIC-POLICY.pdf"
year2018 <- "2018-Budget-Statement.pdf"
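Before reading the files, it may help to confirm that the names carry the .pdf extension and that the files are where R expects them. This is a small sketch in base R, assuming the two PDFs sit in the current working directory; the variable names are illustrative.

```r
budget_files <- c(
  "2017-BUDGET-STATEMENT-AND-ECONOMIC-POLICY.pdf",
  "2018-Budget-Statement.pdf"
)

# TRUE where the name ends in .pdf (case-insensitive)
has_pdf_ext <- grepl("\\.pdf$", budget_files, ignore.case = TRUE)

# TRUE where the file is present in the current working directory
is_present <- file.exists(budget_files)

data.frame(file = budget_files, has_pdf_ext, is_present)
```

If is_present is FALSE for a file, check getwd() or pass the full path to the file instead.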

Create functions and themes

Before we start exploring what was in the budget readings, let us set up all the themes and functions required for cleaning and shaping the data:

  • theme.plot: will be the default theme to be used by all plots. Any additional style for a plot will be included when building the individual plot.
  • fxn.wdCount: a function that takes the name of the pdf and returns a dataframe of the count of words (numbers excluded) in descending order.
  • fxn.barChart: a function that takes the dataframe of words returned from fxn.wdCount and the year of the budget. It draws a bar chart of the 20 most frequent words.
  • fxn.wdCloud: a function that takes a dataframe of words and draws a wordcloud of the 100 most frequent words.
  • fxn.sentiYr: a function that summarizes, by category, the sentiment carried by a document.
############THEME############
theme.plot <-
    theme(text=element_text(family="Kalinga",color="#eeeeee"))+
    theme(plot.background = element_rect(fill="#2b7c85"))+
    theme(plot.title = element_text(size=14, hjust=0.5, face="bold"))+
    theme(panel.background = element_rect(fill="#2b7c85"))+
    theme(panel.grid.major = element_line(color="#fafafa"))+
    theme(panel.grid.minor = element_line(color=alpha('#eeeeee',0.2)))+
    theme(legend.background = element_rect(fill="#2b7c85"))+
    theme(legend.title = element_text(face="bold"))+
    theme(legend.key = element_rect(fill=NA))+
    theme(axis.title = element_text(color="#eeeeee", face="bold"))+
    theme(axis.text = element_text(color="#eeeeee"))+
    theme(axis.ticks = element_blank())


############FUNCTIONS#########

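fxn.wdCount <- function(nameOfFile){
            df.year <- nameOfFile %>%
                        readtext() %>%                                # read the pdf into a dataframe with a text column
                        unnest_tokens(word, text) %>%                 # one lowercase word per row
                        filter(str_detect(word, "^[A-Za-z]+$")) %>%   # keep purely alphabetic tokens (drops numbers)
                        anti_join(stop_words, by="word") %>%          # remove common English stop words
                        count(word, sort=TRUE)                        # word frequencies in descending order
            return(df.year)
}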


fxn.barChart <- function(df.year, budgetYear){
            ggplot(data=df.year[1:20,], aes(x=reorder(word,n), y=n)) +
                geom_bar(stat="identity", fill="grey") +
                geom_text(aes(label=n), hjust=1.1, color="#595959", fontface="bold")+ 
                labs(x="", y="", title=paste("NUMBER OF OCCURRENCES IN", budgetYear, "BUDGET")) +
                theme.plot +
                #theme(panel.grid.minor = element_blank()) +
                coord_flip()
}


fxn.wdCloud <- function(df.year){
                    df.year %>%
                        head(n=100) %>%
                        with(wordcloud(word, n, colors=brewer.pal(6, "Dark2")))
}


fxn.sentiYr <- function(df.year){
    df.sentiYr <- df.year %>%
        inner_join(get_sentiments("nrc")) %>%
        count(sentiment, sort=TRUE)
    return(df.sentiYr)
}
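The word-counting steps inside fxn.wdCount() can be tried without a PDF. The sketch below runs the same pipeline on a made-up two-sentence text; the tibble mimics the doc_id and text columns that readtext() returns, and the sentence itself is invented for illustration.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Made-up stand-in for the dataframe returned by readtext()
toy_doc <- tibble(
  doc_id = "toy.txt",
  text   = "The Ministry of Finance raised revenue by 12 percent in 2017. The ministry expects growth in 2018."
)

toy_counts <- toy_doc %>%
  unnest_tokens(word, text) %>%                   # one lowercase word per row
  filter(str_detect(word, "^[A-Za-z]+$")) %>%     # drops "12", "2017" and "2018"
  anti_join(stop_words, by = "word") %>%          # drops "the", "of", "by", "in", ...
  count(word, sort = TRUE)                        # most frequent word first

toy_counts
```

Because "Ministry" and "ministry" are both lowercased during tokenizing, they are counted as the same word.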

Naming conventions

I have a set of naming conventions for functions, themes, dataframes, etc. so I can easily recognize them whenever they are needed. For simplicity: functions are named fxn.nameOfFunction, themes theme.nameOfTheme and dataframes df.nameOfDataframe.

Tokenizing the budgets

It is time to analyze the budget statements. First, each document is passed to fxn.wdCount() to be split into a dataframe of words. Remember that the names of the documents were stored in 2 variables; these will be the arguments to the function. The output from the function is stored in the variables df.2017 and df.2018. The dimensions (number of rows and columns) and the first few rows of df.2017 and df.2018 are displayed.

df.2017 <- fxn.wdCount(year2017)
dim(df.2017)
## [1] 5281    2
head(df.2017)
## # A tibble: 6 x 2
##   word            n
##   <chr>       <int>
## 1 ministry      677
## 2 percent       449
## 3 ghana         369
## 4 development   336
## 5 national      318
## 6 government    311
df.2018 <- fxn.wdCount(year2018)
dim(df.2018)
## [1] 5286    2
head(df.2018)
## # A tibble: 6 x 2
##   word            n
##   <chr>       <int>
## 1 ghana         461
## 2 speaker       434
## 3 percent       416
## 4 ministry      307
## 5 programme     252
## 6 development   249

Draw a bar chart of the occurrences

There are over 5,000 rows of words in each dataframe, so only the 20 most frequent words are drawn in a bar chart. The function fxn.barChart() comes in handy here. I prefer functions because, instead of repeating lines of code for each plot, everything is bundled into a single function. To use the function, pass the required arguments; the plot is then drawn, or stored in a variable if needed.

The dataframes of words, df.2017 and df.2018, and the year of the budget reading are passed to the function. The year of reading is used to give the plot a title.

fxn.barChart(df.2017, 2017)

fxn.barChart(df.2018, 2018)
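Storing plots in variables makes it possible to place them side by side with grid.arrange() from gridExtra, which is loaded at the top but not otherwise used. The sketch below is self-contained: simple_bar() is a hypothetical, stripped-down stand-in for fxn.barChart(), and the data are the top three rows of each year copied from the output shown earlier.

```r
library(ggplot2)
library(gridExtra)

# Top three rows per year, copied from the head() output shown earlier
top2017 <- data.frame(word = c("ministry", "percent", "ghana"), n = c(677, 449, 369))
top2018 <- data.frame(word = c("ghana", "speaker", "percent"), n = c(461, 434, 416))

# Hypothetical stand-in for fxn.barChart(), kept short for illustration
simple_bar <- function(df, budgetYear) {
  ggplot(df, aes(x = reorder(word, n), y = n)) +
    geom_col(fill = "grey") +
    labs(x = "", y = "", title = paste(budgetYear, "BUDGET")) +
    coord_flip()
}

# Store each plot in a variable instead of drawing it immediately
p.2017 <- simple_bar(top2017, 2017)
p.2018 <- simple_bar(top2018, 2018)

# Draw both stored plots side by side
both <- grid.arrange(p.2017, p.2018, ncol = 2)
```

With the real data, the same call would presumably work with the plots returned by fxn.barChart(df.2017, 2017) and fxn.barChart(df.2018, 2018).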

Draw a wordcloud

Next, a wordcloud of the 100 most frequent words is drawn. The dataframes of words (df.2017, df.2018) are passed as arguments to the function fxn.wdCloud().

fxn.wdCloud(df.2017)

fxn.wdCloud(df.2018)

A quick glance shows that the words ghana, government and national featured prominently in both readings. The function unnest_tokens() returns words in lowercase by default, hence Ghana is displayed as ghana.
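The lowercasing can be seen, and switched off, with the to_lower argument of unnest_tokens(). A small sketch on an invented string:

```r
library(dplyr)
library(tidytext)

toy <- tibble(text = "Ghana and GHANA and ghana")

# Default behaviour: every spelling collapses to "ghana"
lowered <- toy %>%
  unnest_tokens(word, text) %>%
  count(word)

# to_lower = FALSE keeps the original case, so the spellings stay separate
as_written <- toy %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  count(word)

lowered
as_written
```

Keeping the default is usually what a word count needs, since it stops capitalized and lowercase spellings from being counted as different words.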

Sentiment analysis of the budgets

I am interested in knowing the sentiments underlying the budget readings. The sentiment of an entire reading is obtained by adding up the sentiments associated with its individual words.

The sentiments dataset

The tidytext package ships a dataset, sentiments, that helps to determine the sentiment carried by a word. The dataset is made up of 4 columns: word, sentiment, lexicon and score.

## # A tibble: 27,314 x 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 27,304 more rows

There are over 27,000 rows in the sentiments dataframe, covering 4 sentiment lexicons. We can retrieve a single lexicon with get_sentiments(), passing one of "nrc", "afinn", "bing" or "loughran". In this tutorial, we will be using the nrc lexicon, which assigns each word to one or more of 10 categories: trust, fear, negative, sadness, anger, surprise, positive, disgust, joy and anticipation.

Inside fxn.sentiYr(), inner_join() from the dplyr package (bundled into the tidyverse package) keeps only the words in our dataframe that also exist in the nrc lexicon, paired with their categories. count() then tallies the matched words in each category. Note that this tally counts each distinct word once per category; it is not weighted by how often the word occurs in the document.
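The join-and-count step can be sketched on toy data. The lexicon below is a hand-made stand-in for get_sentiments("nrc"), built from four word and sentiment pairs in the sentiments preview above; the word counts are invented. The last line shows a frequency-weighted variant, which is an alternative and not what fxn.sentiYr() does.

```r
library(dplyr)

# Word counts in the same shape as the output of fxn.wdCount() (invented values)
toy_counts <- tibble(
  word = c("abandon", "abacus", "budget"),
  n    = c(3L, 1L, 10L)
)

# Toy lexicon: four word/sentiment pairs from the sentiments preview
toy_lexicon <- tibble(
  word      = c("abacus", "abandon", "abandon", "abandon"),
  sentiment = c("trust", "fear", "negative", "sadness")
)

# Keep only the words that appear in the lexicon; "budget" drops out
matched <- inner_join(toy_counts, toy_lexicon, by = "word")

# Distinct matching words per category (the unweighted tally)
by_word <- count(matched, sentiment, name = "words")

# Alternative: weight each word by how often it occurs
by_occurrence <- count(matched, sentiment, wt = n, name = "occurrences")
```

Which tally is more appropriate depends on the question: the unweighted one reflects the vocabulary used, the weighted one reflects how much of the text carries each sentiment.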

Calculate sentiments for document

We pass df.2017 and df.2018 to the function fxn.sentiYr() and display the results. The count column is named nn because the input dataframe already has a column called n.

senti.2017 <- fxn.sentiYr(df.2017)
senti.2017
## # A tibble: 10 x 2
##    sentiment       nn
##    <chr>        <int>
##  1 positive       498
##  2 trust          294
##  3 negative       276
##  4 anticipation   196
##  5 fear           130
##  6 joy            124
##  7 anger          105
##  8 sadness         98
##  9 disgust         63
## 10 surprise        57
senti.2018 <- fxn.sentiYr(df.2018)
senti.2018
## # A tibble: 10 x 2
##    sentiment       nn
##    <chr>        <int>
##  1 positive       515
##  2 trust          293
##  3 negative       238
##  4 anticipation   195
##  5 fear           133
##  6 joy            129
##  7 anger           93
##  8 sadness         76
##  9 disgust         67
## 10 surprise        57

Plotting the sentiments of the document

It will be interesting to see how the sentiments vary per year. Since the values being compared are numeric counts for each sentiment category, a scatterplot is appropriate.

Dataframe of scores

First, a dataframe of sentiment scores for 2017 and 2018 is created using bind_rows(). A new column, year, records the budget year each category score comes from.

# combine the years into a dataframe
df.ttSenti <- bind_rows("2017" = senti.2017, "2018" = senti.2018, .id="year")
df.ttSenti
## # A tibble: 20 x 3
##    year  sentiment       nn
##    <chr> <chr>        <int>
##  1 2017  positive       498
##  2 2017  trust          294
##  3 2017  negative       276
##  4 2017  anticipation   196
##  5 2017  fear           130
##  6 2017  joy            124
##  7 2017  anger          105
##  8 2017  sadness         98
##  9 2017  disgust         63
## 10 2017  surprise        57
## 11 2018  positive       515
## 12 2018  trust          293
## 13 2018  negative       238
## 14 2018  anticipation   195
## 15 2018  fear           133
## 16 2018  joy            129
## 17 2018  anger           93
## 18 2018  sadness         76
## 19 2018  disgust         67
## 20 2018  surprise        57
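As a complement to the plot, the year-on-year change per category can be computed by spreading the two years into columns. This is a sketch, not part of the original analysis: the values are copied from the df.ttSenti output above, and the names toy.ttSenti, df.change, y2017 and y2018 are illustrative.

```r
library(dplyr)
library(tidyr)

sentiment_levels <- c("positive", "trust", "negative", "anticipation", "fear",
                      "joy", "anger", "sadness", "disgust", "surprise")

# Values copied from the df.ttSenti output
toy.ttSenti <- tibble(
  year      = rep(c("2017", "2018"), each = 10),
  sentiment = rep(sentiment_levels, times = 2),
  nn        = c(498, 294, 276, 196, 130, 124, 105, 98, 63, 57,
                515, 293, 238, 195, 133, 129, 93, 76, 67, 57)
)

df.change <- toy.ttSenti %>%
  pivot_wider(names_from = year, values_from = nn, names_prefix = "y") %>%  # columns y2017, y2018
  mutate(change = y2018 - y2017) %>%                                        # 2018 minus 2017
  arrange(change)                                                           # largest drop first

df.change
```

Sorting by change puts the categories that shrank most between the two budgets at the top of the table.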

Scatterplot of scores

We will use a scatterplot to illustrate how the sentiment scores vary across the categories of emotion. The points are colored by the year from which each score was obtained.

df.ttSenti %>%
    ggplot(aes(sentiment, nn, color=year)) +
    geom_point(size=5) +
    ggtitle("SENTIMENTS AS CONVEYED IN THE BUDGET STATEMENTS") +
    labs(x="", y="Sum of emotions") +
    theme.plot +
    theme(axis.text.x = element_text(angle=50))+
    scale_color_manual(values=c("#f8766d","#3c3b1d"))+
    theme(axis.text.x=element_text(size=10,face="bold"))