A Researcher’s Love for Excel and SPSS

Working for the research department at the municipality of Rotterdam (OBI), I discovered their love for Excel spreadsheets and SPSS documents. Hundreds, no, thousands of them. On top of that, a single file can hold another hundred sheets. Sometimes I needed to join sheets, or bind columns and rows, which is tedious in Excel itself. So I started scouring packages and functions to find what I needed. In this post, you’ll find some tips and tricks that helped me speed things up.

Excel

Let’s first get an overview of useful available packages:

readxl

readxl, the tidyverse Excel reading package, has some incredibly useful features and loads all the data immediately into the succinct tibble format. You can list sheet names with excel_sheets(), and you can specify ranges to choose how many rows or columns to load with read_excel(path, range = cell_rows(1:10)).
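For example, a minimal sketch (the workbook path and the sheet name "Sheet1" are placeholders):

library(readxl)

# List all sheet names in the workbook
excel_sheets("~/Your/File/Location/dataset.xls")

# Read only the first ten rows of a given sheet
read_excel("~/Your/File/Location/dataset.xls",
           sheet = "Sheet1",
           range = cell_rows(1:10))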

I’ve found this feature extremely useful for loading all the sheets of a file into my global environment with the sheet names intact. See the function below.

All Excel sheets into the Global Environment with original names

# Read all Excel sheets at once

read_excel_allsheets <- function(filename) {
    sheets <- readxl::excel_sheets(filename)
    x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
    names(x) <- sheets  # use paste0() here if you want to add a prefix or suffix to the names
    x
}

# Create list of tibbles
MyData <- read_excel_allsheets("~/Your/File/Location/dataset.xls")

# Put each tibble from the list into the global environment
list2env(MyData, envir = .GlobalEnv)

data.table

data.table's fread() and fwrite() functions are fast. Really fast. This is incredibly useful if you have large files. Recently I came across a file that was simply too large for my laptop’s RAM to handle. There are many ways to increase your computing power, like Hadoop or, more recently, Spark; sparklyr by RStudio is a good and useful alternative.
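If you do have access to Spark, a minimal sparklyr sketch looks something like this (assuming a local Spark installation; the file path is a placeholder):

library(sparklyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Let Spark read the large CSV, keeping it out of R's memory
LargeData <- spark_read_csv(sc, name = "large_data",
                            path = "Path/To/LargeDataFile.csv")

spark_disconnect(sc)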

However, I did not have that luxury, since I was working on a remote-access server, so I turned my attention towards data.table. I did not need the entire file; I only needed a subset of it, filtered by the contents of another file. This was my approach:

library(data.table)
library(dplyr)

#-------------------------------------------
# Loop to filter the large dataset in parts
#-------------------------------------------

# Read the (small) filter file once
FilterData <- fread("Path/To/FilterData.csv")

# Create a sequence to split the dataset, in this case into 10 folds.
# Adjust the `by` argument if you want to split it more ways.
Div <- seq(from = 0.1, to = 1, by = 0.1)

# Since we are reading the .csv file in parts, the first row of a part
# is no longer the header. Specify your own column names
# (most of the time this is better anyway).
ColNames <- c("colname1", "colname2", "etc.")

# Replace this placeholder with the total number of data rows
# in your file, excluding the header
No_Rows <- 1e6

for (i in seq_along(Div)) {
  # Rows already covered by earlier parts, and rows in this part
  Done <- round(No_Rows * (Div[i] - 0.1), 0)
  x <- round(No_Rows * Div[i], 0) - Done
  
  # Skip the header row plus everything already read
  y <- Done + 1
  
  df <- fread("Path/To/LargeDataFile.csv",
              nrows = x, skip = y, header = FALSE, col.names = ColNames)
  
  # Filter the part and write it to disk
  df %>%
    inner_join(y = FilterData, by = "FilterColumnName") %>%
    fwrite(file = paste0("Path/To/Output/Location/", "Part_", i, ".csv"))
  
  # Remove the part to make space for the next one
  rm(df)
  gc()
}
#--------------------------------------------------------------
# Binding the dataset back together and writing the final file
#--------------------------------------------------------------
FileNames <- list.files(path = "Path/To/Output/Location", pattern = "\\.csv$", full.names = TRUE)

## Read every part in the folder into a list of data tables
Parts <- lapply(FileNames, fread)
## Bind the parts back together
Complete <- rbindlist(Parts)
## Write the output
fwrite(Complete, file = "Path/To/Output/FinalDataset.csv")

SPSS

haven

haven is another package by Hadley Wickham & co., and it is also part of the tidyverse. Just like readxl, this package converts the loaded files into the tibble format. Unlike many other SPSS reading packages, haven can also write files back into SPSS format with write_sav(). This might come in handy if your researchers have only ever heard of R as a letter in the alphabet. Besides .sav, haven also reads .sas7bdat (SAS) and .dta (Stata) files.
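A minimal sketch of that round trip (the paths are placeholders):

library(haven)

# Read the SPSS file into a tibble
df <- read_sav("Path/To/File.sav")

# ... manipulate the data in R ...

# Write the result back to SPSS format for the researchers
write_sav(df, "Path/To/Output.sav")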

Although haven is great, I ran into a similar issue when I came across a file of 20 GB with approximately 170 million rows. This led me to the following package:

memisc

memisc is a package created for the management of survey data and the presentation of analysis results. It can retrieve information about an SPSS file without having to load the complete file, which can be very useful. For instance, spss.system.file() creates an importer object without reading the data itself, as in this minimal snippet (the path is a placeholder):
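library(memisc)

# Create an importer object; the data itself is not read yet
SPSSFile <- spss.system.file("Path/To/File.sav")

Calling print(), or just typing the object name, gives us this: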

## 
## SPSS system file 'C:/Users/Stefan/Documents/Projects/Personal-Website/static/data/Excel_SPSS/testdata.sav' 
##  with 17 variables and 5 observations

The description() function, in turn, gives us this:

Variable                   Label
numeric                    numeric variable
numeric_long_label         numeric variable with long label: this variable hat five observations (one is missing). All values between 1 and 2 are also declared as missing. We use two decimal places and the measurement level is “Scale”.
factor_numeric             numeric factor with missing range
factor_n_long_value_label  numeric factor with long value labels
factor_n_coded_miss        numeric factor with coded missing values
factor_n_duplicated        numeric factor with duplicated factor labels
factor_n_undeclared        numeric factor with undeclared values
factor_n_undeclared2       numeric factor with undeclared values
string                     string variable
string_500                 long string variable
strin0                     long string variable
string_miss                string factor with missings
factor_s_coded_miss        string factor with coded missing
factor_s_duplicated        string factor with duplicated factor labels
factor_s_undeclared        string factor with undeclared values
factor_s_undeclared2       string factor with undeclared values
date                       date format tt.mm.yyyy

If we know which columns we need from the SPSS file, we can specify them with the subset() function, e.g.:

df <- as.data.table(subset(SPSSFile,
                           select = c("colname1",
                                      "colname2",
                                      "colname7",
                                      "etc")))

memisc has a lot more functionality, so be sure to check out the documentation.

Roundup

So, before you immediately start reading and writing data with the same package every time, think about the structure and size of your data, and about what it is that you actually need. Then choose the most suitable package, and start reading away!