Happy New Year,I wish you all the best in 2019. Last year, I started a blog series on collecting Ghana’s election results from 1992 to 2016. In episode Episode 1 and Episode 2, I showed the process of getting the presidential and parliamentary elections for 1992.The datasets are available here: 1992 presidential elections and 1992 parliamentary elections.

In this episode I will collect data on the 1996 elections. This data is available on the Electoral Commission of Ghana’s website. Stay with me till the end to know what will be the plot in Part 2 of this Episode.

The 1992 election had 3 presidential aspirants:

  • J.J Rawlings who represented NDC (National Democratic Party).In 1992, he led the Alliance Coalition (was made up of the NDC, National Covention Party and Every Ghanaian Living Everywhere) into victory. As at the time of the elections he was the sitting president.
  • J.A Kufuor replaced A.Adu Boahen who was the New Patriotic Party’s candidate in the 1992 elections.
  • E.N Mahama just like the NPP candidate also led the Peoples’s National Congress in 1996. He replaced Dr. Hilla Limann who was the party’s candidate in 1992.

At the end, the NDC candidate and sitting president: J.J Rawlings won the elections.

Importing Libraries and Data Extraction

The data is an Excel file in the .xls extension. I prefer to use Hadley Wickham’s readxl package as it allows me to extract data with ease. The function read_excel takes a single mandatory argument: name of the file. When passed with a single argument, this function reads all data from the first sheet only into a variable. We can tell the function to read any other sheet by specifying either the sheet’s name or its index read_excel("nameOfFile.xls", sheet="nameOfSheet") OR read_excel("nameOfFile.xls", sheet=3 The packages in the tidyverse library is for data cleaning and extraction.

library(readxl)
library(tidyverse)

columns = c("Constituency_Num","Region","J.J Rawlings","NDC%Vote",
"J.A Kufuor","NPP%Vote","E.N Mahama","PNC%Vote","Valid_Votes", "Rejected_Ballots","Tt_Votes_Cast","Tt_Reg_Voters","Turnout%")

sheet1 <- read_excel("1996-presidential-election-results.xls", 
col_names = columns, skip=2)

#Dimensions of the data
dim(sheet1)
## [1] 274  13
# The first 10 rows of the data
head(sheet1, 10)
## # A tibble: 10 x 13
##    Constituency_Num Region `J.J Rawlings` `NDC%Vote` `J.A Kufuor`
##    <chr>            <chr>  <chr>          <chr>      <chr>       
##  1 S/NO.            WESTE~ J.J RAWLINGS   <NA>       J. A KUFUOR 
##  2 <NA>             CONST~ VALID VOTES    %          VALID VOTES 
##  3 1                JOMORO 23232          62.327627~ 13344       
##  4 2                EVALU~ 7856           45.149425~ 8904        
##  5 3                ELLEM~ 11856          47.972808~ 11450       
##  6 4                AHANT~ 13215          42.266359~ 16994       
##  7 5                AFFIA~ 18345          32.459834~ 36002       
##  8 6                TAKOR~ 11268          29.741072~ 26221       
##  9 7                SEKON~ 15721          35.485982~ 27792       
## 10 8                SHAMA  13711          53.477124~ 11364       
## # ... with 8 more variables: `NPP%Vote` <chr>, `E.N Mahama` <chr>,
## #   `PNC%Vote` <chr>, Valid_Votes <chr>, Rejected_Ballots <chr>,
## #   Tt_Votes_Cast <chr>, Tt_Reg_Voters <chr>, `Turnout%` <chr>

Why skip 2 rows of data? The first row contains NA values and the second contained the column header. It was easier skipping these rows skip=2 and manually assigning column header when reading the file using col_names=columns.

Extracting Regional Information: Votes & Region Names

Number of Votes Per Region

A look at the data shows that it is in a wide format that is each row shows: 1. Name of the region. 2. Number of votes a candidate obtained 3. The number of votes as a percentage of total votes cast. 4. Number of valid votes. 5. Number of rejected votes. 6. Total number of votes cast. 7. Number of registered voters in the region. 8. Percentage of turnout ([# total votes / # registered voters] * 100)

As it was done in Episode 1, for the final data I want each row to contain the number of votes and the percentage of votes per candidate for a region. Since there are 3 candidates and 10 regions, I must have a total of 30 rows in my final dataset.

# Find rows with regional total
vote_data <- filter(sheet1, Region == "REGIONAL TOTAL" |
Region == "REGIONAL TOTALS")
vote_data
## # A tibble: 10 x 13
##    Constituency_Num Region `J.J Rawlings` `NDC%Vote` `J.A Kufuor`
##    <chr>            <chr>  <chr>          <chr>      <chr>       
##  1 <NA>             REGIO~ 405992         57.296241~ 289730      
##  2 <NA>             REGIO~ 330841         55.221987~ 259555      
##  3 <NA>             REGIO~ 658626         53.993128~ 528484      
##  4 <NA>             REGIO~ 690421         94.545710~ 34538       
##  5 <NA>             REGIO~ 459092         53.761622~ 384597      
##  6 <NA>             REGIO~ 412474         32.787711~ 827804      
##  7 <NA>             REGIO~ 395381         61.732656~ 230457      
##  8 <NA>             REGIO~ 370030         61.131367~ 199801      
##  9 <NA>             REGIO~ 230791         68.990039~ 58041       
## 10 <NA>             REGIO~ 145812         74.608185~ 21871       
## # ... with 8 more variables: `NPP%Vote` <chr>, `E.N Mahama` <chr>,
## #   `PNC%Vote` <chr>, Valid_Votes <chr>, Rejected_Ballots <chr>,
## #   Tt_Votes_Cast <chr>, Tt_Reg_Voters <chr>, `Turnout%` <chr>

The total number of votes per region and the percentage of votes is stored in the final row for each block of constituency. I noticed they had a pattern: that row had either REGIONAL TOTAL or REGIONAL TOTALS in the Region column. Combining the pipe (|) operator and the filter function from dplyr package, it returns what my heart desires: 10 rows and 13 columns of data.

Names of Regions

The dataframe above has the Region column filled with either REGIONAL TOTAL or REGIONAL TOTALS as it was the search criteria used above.Using regular expressions, I return all rows in column Region having a value ending with REGION. If you are confused read the above statement again and have a look at the code.

region_names <- sheet1[grepl("REGION$", sheet1$Region), "Region"]
region_names
## # A tibble: 10 x 1
##    Region             
##    <chr>              
##  1 WESTERN REGION     
##  2 CENTRAL REGION     
##  3 GT ACCRA  REGION   
##  4 VOLTA  REGION      
##  5 EASTERN  REGION    
##  6 ASHANTI  REGION    
##  7 BRONG AHAFO  REGION
##  8 NORTHERN  REGION   
##  9 UPPER EAST  REGION 
## 10 UPPER WEST REGION

Combine dataframes

Next, assign the contents in region_names to the column Region in vote_data.Now we have a dataset with regional information only.

vote_data["Region"] = region_names

# replace all NA values with 0
vote_data[is.na(vote_data)] <- 0

# See the first 3 rows of vote_data
head(vote_data, 3)
## # A tibble: 3 x 13
##   Constituency_Num Region `J.J Rawlings` `NDC%Vote` `J.A Kufuor` `NPP%Vote`
##   <chr>            <chr>  <chr>          <chr>      <chr>        <chr>     
## 1 0                WESTE~ 405992         57.296241~ 289730       40.888589~
## 2 0                CENTR~ 330841         55.221987~ 259555       43.323357~
## 3 0                GT AC~ 658626         53.993128~ 528484       43.324291~
## # ... with 7 more variables: `E.N Mahama` <chr>, `PNC%Vote` <chr>,
## #   Valid_Votes <chr>, Rejected_Ballots <chr>, Tt_Votes_Cast <chr>,
## #   Tt_Reg_Voters <chr>, `Turnout%` <chr>

More data cleaning

The data in the first column is made up of NA values only so I will take off. Using the dim(), I will confirm the changes by checking the number of rows before and after the change.

dim(vote_data)
## [1] 10 13
vote_data = vote_data[, -1]
dim(vote_data)
## [1] 10 12

In the “GATHERING” of columns

If you follow my blog, at this stage you will have an idea of what I am about to do based on the data we have now. Let me give you a hint, go to Episode 2 and the section Cleaning the Dataframe. There I gave a detailed explanation of how to convert a dataframe from a wide format to a long format.

dframe_Names <- function(dframe, rowNum){
namesOnly <- dframe[, !grepl("%", colnames(dframe))]
namesOnly <- gather(namesOnly, Candidate, NumVote, -1)
namesOnly <- namesOnly[-c(rowNum:nrow(namesOnly)), ]

return(namesOnly)
}

A summarized explanation of the function is

  1. It takes a dataframe and a number
  2. Searches for columns that do not include % sign, this returns the columns: Region, J.J Rawlings, J.A Kufuor, E.N Mahama, Valid_Votes, Rejected_Ballots, Tt_Votes_Cast and Tt_Reg_Voters
  3. With the exception of the first column (Region), create a new dataframe with 3 columns.The number of rows in this dataframe will be 7 * 10 = 70. 7 represents the number of returned columns in except the Region. 10 on the other hand represents the number of values in Region column. The first column will be a duplication of the region names (70 times). Second column will be the the names of the columns in 2 repeated 10 times Third column will be the values under each column. Gather Details
  4. We use the second argument to take from 1:number of rows out of the gathered dataframe.

  5. Finally, we cap the number of rows to be returned using the second argument.
  6. Returns the final dataframe to the variable that calls the function.

Calling the dframe_Names function

names_1996 <- dframe_Names(vote_data, 31)
dim(names_1996)
## [1] 30  3
head(names_1996)
## # A tibble: 6 x 3
##   Region           Candidate    NumVote
##   <chr>            <chr>        <chr>  
## 1 WESTERN REGION   J.J Rawlings 405992 
## 2 CENTRAL REGION   J.J Rawlings 330841 
## 3 GT ACCRA  REGION J.J Rawlings 658626 
## 4 VOLTA  REGION    J.J Rawlings 690421 
## 5 EASTERN  REGION  J.J Rawlings 459092 
## 6 ASHANTI  REGION  J.J Rawlings 412474

Getting the Percent Values

Next I also want the percentage values each candidate received per region. Just like dframe_Names, this function is also explained in detail in Episode 1 section dframe_Percent

dframe_Percent <- function(dframe, rowNum){
    percentOnly <- dframe[,grepl("%",colnames(dframe))]
    # add the Regions
    percentOnly <- percentOnly %>%
        mutate(Region = c("WESTERN","CENTRAL","GT ACCRA","VOLTA",
                          "EASTERN","ASHANTI","B/AHAFO","NORTHERN",
                          "UPPER EAST","UPPER WEST"),
               `Turnout%` = NULL)
    
    # convert from wide to long format
    # separate to get party name
    percentLong <- percentOnly %>%
        gather(Details, PercentVote) %>%
        separate(Details, c("PartyName","Vote"),"%")
    
    percentLong <- percentLong[-c(rowNum:nrow(percentLong)),]
    return(percentLong)
}

Summarizing the workings of the functions

  1. Also takes a dataframe and number of rows
  2. Keeps only rows with %: NDC%Vote, NPP%Vote, PNC%Vote, Turnout%
  3. Renames the Region column
  4. Separate the Details column so the party name can be a column on its own.
  5. Deletes the Turnout% column.
  6. Cap the number of rows to be returned
  7. Pass the final data frame to the variable that calls the function.

Calling the dframe_Percent function

percent_1996 <- dframe_Percent(vote_data, 31)
dim(percent_1996)
## [1] 30  3
head(percent_1996)
## # A tibble: 6 x 3
##   PartyName Vote  PercentVote       
##   <chr>     <chr> <chr>             
## 1 NDC       Vote  57.296241518295645
## 2 NDC       Vote  55.221987244433834
## 3 NDC       Vote  53.993128567599001
## 4 NDC       Vote  94.545710995260535
## 5 NDC       Vote  53.761622596435345
## 6 NDC       Vote  32.787711424515145

Combine long dataframes

The returned dataframes from the functions are combined using cbind I show how the data looks for some rows.

full_1996 <- cbind(names_1996, percent_1996)
full_1996[c(1,12,23),]
##              Region    Candidate NumVote PartyName Vote        PercentVote
## 1    WESTERN REGION J.J Rawlings  405992       NDC Vote 57.296241518295645
## 12   CENTRAL REGION   J.A Kufuor  259555       NPP Vote  43.32335744127549
## 23 GT ACCRA  REGION   E.N Mahama   32723       PNC Vote   2.68258032042091

Final cleanup on data

The fifth column Vote is made up of all Vote so it is deleted. Rearrange the column so the numerical columns are shown after the string columns.

full_1996 <- full_1996[, -5]

# rearrange columns
full_1996 <- full_1996[, c(1,2,4,3,5)]

# convert the NumVote to integer
full_1996[, 4] <- as.integer(full_1996[, 4])

# convert the PercentVote to numeric with 2 decimal places
full_1996[, 5] <- round(as.numeric(full_1996[,5]))

# convert the rest of the columns: the first 3 columns to factors
full_1996 <- full_1996 %>% mutate_if(is.character, as.factor)

# confirm the data types have been changed
str(full_1996)
## 'data.frame':    30 obs. of  5 variables:
##  $ Region     : Factor w/ 10 levels "ASHANTI  REGION",..: 10 3 5 9 4 1 2 6 7 8 ...
##  $ Candidate  : Factor w/ 3 levels "E.N Mahama","J.A Kufuor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ PartyName  : Factor w/ 3 levels "NDC","NPP","PNC": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NumVote    : int  405992 330841 658626 690421 459092 412474 395381 370030 230791 145812 ...
##  $ PercentVote: num  57 55 54 95 54 33 62 61 69 75 ...

Sanity Check

I want to verify the data collected. I would group the votes by regions and find the sum of their percentages.With the exception of Northern Region which did not have any value for data in the Rejected_Ballots column, the sum of percentages of votes should be less than 100%.

full_1996 %>%
    group_by(Region) %>%
    summarise(Reg_Percent = sum(PercentVote))
## # A tibble: 10 x 2
##    Region              Reg_Percent
##    <fct>                     <dbl>
##  1 ASHANTI  REGION              78
##  2 BRONG AHAFO  REGION         100
##  3 CENTRAL REGION               99
##  4 EASTERN  REGION             100
##  5 GT ACCRA  REGION            100
##  6 NORTHERN  REGION            100
##  7 UPPER EAST  REGION          100
##  8 UPPER WEST REGION           100
##  9 VOLTA  REGION               101
## 10 WESTERN REGION              100

Shockingly, majority of them arebetween 99.99 and 100.01 except Ashanti Region. But what about the percentage of rejected votes recorded per region? Let us check the values for Ashanti Region.

full_1996[grepl("^ASHANTI", full_1996$Region),]
##             Region    Candidate PartyName NumVote PercentVote
## 6  ASHANTI  REGION J.J Rawlings       NDC  412474          33
## 16 ASHANTI  REGION   J.A Kufuor       NPP  827804          44
## 26 ASHANTI  REGION   E.N Mahama       PNC   17736           1

From the output above, NDC obtained half the number of votes the NPP obtained. Hence the value for PercentVote for both Regions is ????? In Part 2 of this episode, I will want to dig deeper into this.

The Data

The data is written to a csv format and it is available here

write.csv(full_1996, "presidential_results_1996.csv")

Thank you and see you in Part 2.