Episode 3:The 1996 Presidential Elections-Part 1
Happy New Year,I wish you all the best in 2019. Last year, I started a blog series on collecting Ghana’s election results from 1992 to 2016. In episode Episode 1 and Episode 2, I showed the process of getting the presidential and parliamentary elections for 1992.The datasets are available here: 1992 presidential elections and 1992 parliamentary elections.
In this episode I will collect data on the 1996 elections. This data is available on the Electoral Commission of Ghana’s website. Stay with me till the end to know what will be the plot in Part 2 of this Episode.
The 1992 election had 3 presidential aspirants:
- J.J Rawlings who represented NDC (National Democratic Party).In 1992, he led the Alliance Coalition (was made up of the NDC, National Covention Party and Every Ghanaian Living Everywhere) into victory. As at the time of the elections he was the sitting president.
- J.A Kufuor replaced A.Adu Boahen who was the New Patriotic Party’s candidate in the 1992 elections.
- E.N Mahama just like the NPP candidate also led the Peoples’s National Congress in 1996. He replaced Dr. Hilla Limann who was the party’s candidate in 1992.
At the end, the NDC candidate and sitting president: J.J Rawlings won the elections.
Importing Libraries and Data Extraction
The data is an Excel file in the .xls extension. I prefer to use Hadley Wickham’s readxl package as it allows me to extract data with ease. The function read_excel takes a single mandatory argument: name of the file. When passed with a single argument, this function reads all data from the first sheet only into a variable. We can tell the function to read any other sheet by specifying either the sheet’s name or its index read_excel("nameOfFile.xls", sheet="nameOfSheet") OR
read_excel("nameOfFile.xls", sheet=3
The packages in the tidyverse library is for data cleaning and extraction.
library(readxl)
library(tidyverse)
columns = c("Constituency_Num","Region","J.J Rawlings","NDC%Vote",
"J.A Kufuor","NPP%Vote","E.N Mahama","PNC%Vote","Valid_Votes", "Rejected_Ballots","Tt_Votes_Cast","Tt_Reg_Voters","Turnout%")
sheet1 <- read_excel("1996-presidential-election-results.xls",
col_names = columns, skip=2)
#Dimensions of the data
dim(sheet1)
## [1] 274 13
# The first 10 rows of the data
head(sheet1, 10)
## # A tibble: 10 x 13
## Constituency_Num Region `J.J Rawlings` `NDC%Vote` `J.A Kufuor`
## <chr> <chr> <chr> <chr> <chr>
## 1 S/NO. WESTE~ J.J RAWLINGS <NA> J. A KUFUOR
## 2 <NA> CONST~ VALID VOTES % VALID VOTES
## 3 1 JOMORO 23232 62.327627~ 13344
## 4 2 EVALU~ 7856 45.149425~ 8904
## 5 3 ELLEM~ 11856 47.972808~ 11450
## 6 4 AHANT~ 13215 42.266359~ 16994
## 7 5 AFFIA~ 18345 32.459834~ 36002
## 8 6 TAKOR~ 11268 29.741072~ 26221
## 9 7 SEKON~ 15721 35.485982~ 27792
## 10 8 SHAMA 13711 53.477124~ 11364
## # ... with 8 more variables: `NPP%Vote` <chr>, `E.N Mahama` <chr>,
## # `PNC%Vote` <chr>, Valid_Votes <chr>, Rejected_Ballots <chr>,
## # Tt_Votes_Cast <chr>, Tt_Reg_Voters <chr>, `Turnout%` <chr>
Why skip 2 rows of data? The first row contains NA values and the second contained the column header. It was easier skipping these rows skip=2
and manually assigning column header when reading the file using col_names=columns
.
Extracting Regional Information: Votes & Region Names
Number of Votes Per Region
A look at the data shows that it is in a wide format that is each row shows: 1. Name of the region. 2. Number of votes a candidate obtained 3. The number of votes as a percentage of total votes cast. 4. Number of valid votes. 5. Number of rejected votes. 6. Total number of votes cast. 7. Number of registered voters in the region. 8. Percentage of turnout ([# total votes / # registered voters] * 100)
As it was done in Episode 1, for the final data I want each row to contain the number of votes and the percentage of votes per candidate for a region. Since there are 3 candidates and 10 regions, I must have a total of 30 rows in my final dataset.
# Find rows with regional total
vote_data <- filter(sheet1, Region == "REGIONAL TOTAL" |
Region == "REGIONAL TOTALS")
vote_data
## # A tibble: 10 x 13
## Constituency_Num Region `J.J Rawlings` `NDC%Vote` `J.A Kufuor`
## <chr> <chr> <chr> <chr> <chr>
## 1 <NA> REGIO~ 405992 57.296241~ 289730
## 2 <NA> REGIO~ 330841 55.221987~ 259555
## 3 <NA> REGIO~ 658626 53.993128~ 528484
## 4 <NA> REGIO~ 690421 94.545710~ 34538
## 5 <NA> REGIO~ 459092 53.761622~ 384597
## 6 <NA> REGIO~ 412474 32.787711~ 827804
## 7 <NA> REGIO~ 395381 61.732656~ 230457
## 8 <NA> REGIO~ 370030 61.131367~ 199801
## 9 <NA> REGIO~ 230791 68.990039~ 58041
## 10 <NA> REGIO~ 145812 74.608185~ 21871
## # ... with 8 more variables: `NPP%Vote` <chr>, `E.N Mahama` <chr>,
## # `PNC%Vote` <chr>, Valid_Votes <chr>, Rejected_Ballots <chr>,
## # Tt_Votes_Cast <chr>, Tt_Reg_Voters <chr>, `Turnout%` <chr>
The total number of votes per region and the percentage of votes is stored in the final row for each block of constituency. I noticed they had a pattern: that row had either REGIONAL TOTAL or REGIONAL TOTALS in the Region column. Combining the pipe (|) operator and the filter function from dplyr package, it returns what my heart desires: 10 rows and 13 columns of data.
Names of Regions
The dataframe above has the Region column filled with either REGIONAL TOTAL or REGIONAL TOTALS as it was the search criteria used above.Using regular expressions, I return all rows in column Region having a value ending with REGION. If you are confused read the above statement again and have a look at the code.
region_names <- sheet1[grepl("REGION$", sheet1$Region), "Region"]
region_names
## # A tibble: 10 x 1
## Region
## <chr>
## 1 WESTERN REGION
## 2 CENTRAL REGION
## 3 GT ACCRA REGION
## 4 VOLTA REGION
## 5 EASTERN REGION
## 6 ASHANTI REGION
## 7 BRONG AHAFO REGION
## 8 NORTHERN REGION
## 9 UPPER EAST REGION
## 10 UPPER WEST REGION
Combine dataframes
Next, assign the contents in region_names to the column Region in vote_data.Now we have a dataset with regional information only.
vote_data["Region"] = region_names
# replace all NA values with 0
vote_data[is.na(vote_data)] <- 0
# See the first 3 rows of vote_data
head(vote_data, 3)
## # A tibble: 3 x 13
## Constituency_Num Region `J.J Rawlings` `NDC%Vote` `J.A Kufuor` `NPP%Vote`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 0 WESTE~ 405992 57.296241~ 289730 40.888589~
## 2 0 CENTR~ 330841 55.221987~ 259555 43.323357~
## 3 0 GT AC~ 658626 53.993128~ 528484 43.324291~
## # ... with 7 more variables: `E.N Mahama` <chr>, `PNC%Vote` <chr>,
## # Valid_Votes <chr>, Rejected_Ballots <chr>, Tt_Votes_Cast <chr>,
## # Tt_Reg_Voters <chr>, `Turnout%` <chr>
More data cleaning
The data in the first column is made up of NA values only so I will take off. Using the dim(), I will confirm the changes by checking the number of rows before and after the change.
dim(vote_data)
## [1] 10 13
vote_data = vote_data[, -1]
dim(vote_data)
## [1] 10 12
In the “GATHERING” of columns
If you follow my blog, at this stage you will have an idea of what I am about to do based on the data we have now. Let me give you a hint, go to Episode 2 and the section Cleaning the Dataframe. There I gave a detailed explanation of how to convert a dataframe from a wide format to a long format.
dframe_Names <- function(dframe, rowNum){
namesOnly <- dframe[, !grepl("%", colnames(dframe))]
namesOnly <- gather(namesOnly, Candidate, NumVote, -1)
namesOnly <- namesOnly[-c(rowNum:nrow(namesOnly)), ]
return(namesOnly)
}
A summarized explanation of the function is
- It takes a dataframe and a number
- Searches for columns that do not include % sign, this returns the columns: Region, J.J Rawlings, J.A Kufuor, E.N Mahama, Valid_Votes, Rejected_Ballots, Tt_Votes_Cast and Tt_Reg_Voters
- With the exception of the first column (Region), create a new dataframe with 3 columns.The number of rows in this dataframe will be
7 * 10 = 70
. 7 represents the number of returned columns in except the Region. 10 on the other hand represents the number of values in Region column. The first column will be a duplication of the region names (70 times). Second column will be the the names of the columns in 2 repeated 10 times Third column will be the values under each column. We use the second argument to take from 1:number of rows out of the gathered dataframe.
- Finally, we cap the number of rows to be returned using the second argument.
Returns the final dataframe to the variable that calls the function.
Calling the dframe_Names function
names_1996 <- dframe_Names(vote_data, 31)
dim(names_1996)
## [1] 30 3
head(names_1996)
## # A tibble: 6 x 3
## Region Candidate NumVote
## <chr> <chr> <chr>
## 1 WESTERN REGION J.J Rawlings 405992
## 2 CENTRAL REGION J.J Rawlings 330841
## 3 GT ACCRA REGION J.J Rawlings 658626
## 4 VOLTA REGION J.J Rawlings 690421
## 5 EASTERN REGION J.J Rawlings 459092
## 6 ASHANTI REGION J.J Rawlings 412474
Getting the Percent Values
Next I also want the percentage values each candidate received per region. Just like dframe_Names, this function is also explained in detail in Episode 1 section dframe_Percent
dframe_Percent <- function(dframe, rowNum){
percentOnly <- dframe[,grepl("%",colnames(dframe))]
# add the Regions
percentOnly <- percentOnly %>%
mutate(Region = c("WESTERN","CENTRAL","GT ACCRA","VOLTA",
"EASTERN","ASHANTI","B/AHAFO","NORTHERN",
"UPPER EAST","UPPER WEST"),
`Turnout%` = NULL)
# convert from wide to long format
# separate to get party name
percentLong <- percentOnly %>%
gather(Details, PercentVote) %>%
separate(Details, c("PartyName","Vote"),"%")
percentLong <- percentLong[-c(rowNum:nrow(percentLong)),]
return(percentLong)
}
Summarizing the workings of the functions
- Also takes a dataframe and number of rows
- Keeps only rows with %: NDC%Vote, NPP%Vote, PNC%Vote, Turnout%
- Renames the Region column
- Separate the Details column so the party name can be a column on its own.
- Deletes the Turnout% column.
- Cap the number of rows to be returned
- Pass the final data frame to the variable that calls the function.
Calling the dframe_Percent function
percent_1996 <- dframe_Percent(vote_data, 31)
dim(percent_1996)
## [1] 30 3
head(percent_1996)
## # A tibble: 6 x 3
## PartyName Vote PercentVote
## <chr> <chr> <chr>
## 1 NDC Vote 57.296241518295645
## 2 NDC Vote 55.221987244433834
## 3 NDC Vote 53.993128567599001
## 4 NDC Vote 94.545710995260535
## 5 NDC Vote 53.761622596435345
## 6 NDC Vote 32.787711424515145
Combine long dataframes
The returned dataframes from the functions are combined using cbind I show how the data looks for some rows.
full_1996 <- cbind(names_1996, percent_1996)
full_1996[c(1,12,23),]
## Region Candidate NumVote PartyName Vote PercentVote
## 1 WESTERN REGION J.J Rawlings 405992 NDC Vote 57.296241518295645
## 12 CENTRAL REGION J.A Kufuor 259555 NPP Vote 43.32335744127549
## 23 GT ACCRA REGION E.N Mahama 32723 PNC Vote 2.68258032042091
Final cleanup on data
The fifth column Vote is made up of all Vote so it is deleted. Rearrange the column so the numerical columns are shown after the string columns.
full_1996 <- full_1996[, -5]
# rearrange columns
full_1996 <- full_1996[, c(1,2,4,3,5)]
# convert the NumVote to integer
full_1996[, 4] <- as.integer(full_1996[, 4])
# convert the PercentVote to numeric with 2 decimal places
full_1996[, 5] <- round(as.numeric(full_1996[,5]))
# convert the rest of the columns: the first 3 columns to factors
full_1996 <- full_1996 %>% mutate_if(is.character, as.factor)
# confirm the data types have been changed
str(full_1996)
## 'data.frame': 30 obs. of 5 variables:
## $ Region : Factor w/ 10 levels "ASHANTI REGION",..: 10 3 5 9 4 1 2 6 7 8 ...
## $ Candidate : Factor w/ 3 levels "E.N Mahama","J.A Kufuor",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ PartyName : Factor w/ 3 levels "NDC","NPP","PNC": 1 1 1 1 1 1 1 1 1 1 ...
## $ NumVote : int 405992 330841 658626 690421 459092 412474 395381 370030 230791 145812 ...
## $ PercentVote: num 57 55 54 95 54 33 62 61 69 75 ...
Sanity Check
I want to verify the data collected. I would group the votes by regions and find the sum of their percentages.With the exception of Northern Region which did not have any value for data in the Rejected_Ballots column, the sum of percentages of votes should be less than 100%.
full_1996 %>%
group_by(Region) %>%
summarise(Reg_Percent = sum(PercentVote))
## # A tibble: 10 x 2
## Region Reg_Percent
## <fct> <dbl>
## 1 ASHANTI REGION 78
## 2 BRONG AHAFO REGION 100
## 3 CENTRAL REGION 99
## 4 EASTERN REGION 100
## 5 GT ACCRA REGION 100
## 6 NORTHERN REGION 100
## 7 UPPER EAST REGION 100
## 8 UPPER WEST REGION 100
## 9 VOLTA REGION 101
## 10 WESTERN REGION 100
Shockingly, majority of them arebetween 99.99 and 100.01 except Ashanti Region. But what about the percentage of rejected votes recorded per region? Let us check the values for Ashanti Region.
full_1996[grepl("^ASHANTI", full_1996$Region),]
## Region Candidate PartyName NumVote PercentVote
## 6 ASHANTI REGION J.J Rawlings NDC 412474 33
## 16 ASHANTI REGION J.A Kufuor NPP 827804 44
## 26 ASHANTI REGION E.N Mahama PNC 17736 1
From the output above, NDC obtained half the number of votes the NPP obtained. Hence the value for PercentVote for both Regions is ????? In Part 2 of this episode, I will want to dig deeper into this.
The Data
The data is written to a csv format and it is available here
write.csv(full_1996, "presidential_results_1996.csv")
Thank you and see you in Part 2.