A Researcher’s Love for Excel and SPSS

Working for the research department at the municipality of Rotterdam (OBI), I discovered their love for Excel spreadsheets and SPSS documents. Hundreds, no, thousands of them. On top of that, a single file can hold another hundred sheets. Sometimes I needed to join sheets, or bind columns and rows, which is tedious in Excel itself. So I started scouring packages and functions to find what I needed. In this post, you’ll find some tips and tricks that helped me speed things up.

Excel

Let’s first get an overview of useful available packages:

readxl

readxl, the tidyverse Excel reading package, has some incredibly useful features and loads all the data immediately into the succinct tibble format. You can list sheet names with excel_sheets(), and you can specify ranges to choose how many rows or columns to load with read_excel(path, range = cell_rows(1:10)).
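For example, a minimal sketch (the workbook path and the sheet name "Sheet1" are placeholders):

library(readxl)

# List all sheet names in the workbook
excel_sheets("~/Your/File/Location/dataset.xls")

# Read only the first ten rows of a given sheet
read_excel("~/Your/File/Location/dataset.xls",
           sheet = "Sheet1",
           range = cell_rows(1:10))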

I’ve found this feature extremely useful for loading all the sheets of a file into my global environment with the sheet names intact. See the function below.

All Excel sheets into the Global Environment with original names

# Read all Excel sheets at once

read_excel_allsheets <- function(filename) {
    sheets <- readxl::excel_sheets(filename)
    x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
    names(x) <- sheets  # use paste0() here if you want to add a prefix or suffix to the names
    x
}

# Create list of tibbles
MyData <- read_excel_allsheets("~/Your/File/Location/dataset.xls")

# Put each tibble from the list into the global environment
list2env(MyData, envir = .GlobalEnv)

data.table

data.table's fread() and fwrite() functions are fast. Really fast. This is incredibly useful if you have large files. Recently I came across a file that was simply too large for my laptop’s RAM to handle. There are many ways to increase your computing power, like Hadoop or, more recently, Spark; sparklyr by RStudio is a good and useful alternative.
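If you do have access to Spark, a minimal sparklyr sketch looks something like this (assuming a local Spark installation; the file path is a placeholder):

library(sparklyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Let Spark read the large CSV, keeping it out of R's memory
LargeData <- spark_read_csv(sc, name = "large_data",
                            path = "Path/To/LargeDataFile.csv")

spark_disconnect(sc)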

However, I did not have that luxury, since I was working on a remote-access server, so I turned my attention towards data.table. I did not need the entire file; I only needed a subset of it, filtered by the contents of another file. This was my approach:

library(data.table)
library(dplyr)

#-------------------------------------------
# Loop to filter the large dataset in parts
#-------------------------------------------

# Read the (small) filter file once
FilterData <- fread("Path/To/FilterData.csv")

# Create a sequence to split the dataset, in this case into 10 folds.
# Adjust the `by` argument if you want to split it more ways.
Div <- seq(from = 0.1, to = 1, by = 0.1)

# Since we are reading the .csv file in parts, the first row of a part
# is no longer the header. Specify your own column names
# (most of the time this is better anyway).
ColNames <- c("colname1", "colname2", "etc.")

# Replace this placeholder with the total number of data rows
# in your file, excluding the header
No_Rows <- 1e6

for (i in seq_along(Div)) {
  # Rows already covered by earlier parts, and rows in this part
  Done <- round(No_Rows * (Div[i] - 0.1), 0)
  x <- round(No_Rows * Div[i], 0) - Done
  
  # Skip the header row plus everything already read
  y <- Done + 1
  
  df <- fread("Path/To/LargeDataFile.csv",
              nrows = x, skip = y, header = FALSE, col.names = ColNames)
  
  # Filter the part and write it to disk
  df %>%
    inner_join(y = FilterData, by = "FilterColumnName") %>%
    fwrite(file = paste0("Path/To/Output/Location/", "Part_", i, ".csv"))
  
  # Remove the part to make space for the next one
  rm(df)
  gc()
}
#--------------------------------------------------------------
# Binding the dataset back together and writing the final file
#--------------------------------------------------------------
FileNames <- list.files(path = "Path/To/Output/Location", pattern = "\\.csv$", full.names = TRUE)

## Read every part in the folder into a list of data tables
Parts <- lapply(FileNames, fread)
## Bind the parts back together
Complete <- rbindlist(Parts)
## Write the output
fwrite(Complete, file = "Path/To/Output/FinalDataset.csv")

SPSS

haven

haven is another package by Hadley Wickham & co., and it is also part of the tidyverse. Just like readxl, this package converts the loaded files into the tibble format. Unlike many other SPSS reading packages, haven can also write files back into SPSS format with write_sav(). This might come in handy if your researchers have only ever heard of R as a letter in the alphabet. Besides .sav, haven also reads .sas7bdat (SAS) and .dta (Stata) files.
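A minimal sketch of that round trip (the paths are placeholders):

library(haven)

# Read the SPSS file into a tibble
df <- read_sav("Path/To/File.sav")

# ... manipulate the data in R ...

# Write the result back to SPSS format for the researchers
write_sav(df, "Path/To/Output.sav")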

Although haven is great, I ran into a similar issue when I came across a file of 20 GB with approximately 170 million rows. This led me to the following package:

memisc

memisc is a package created for the management of survey data and the presentation of analysis results. It can retrieve information about an SPSS file without having to load the complete file, which can be very useful. For instance, spss.system.file() creates an importer object without reading the data itself, as in this minimal snippet (the path is a placeholder):
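library(memisc)

# Create an importer object; the data itself is not read yet
SPSSFile <- spss.system.file("Path/To/File.sav")

Calling print(), or just typing the object name, gives us this: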

## 
## SPSS system file 'C:/Users/Stefan/Documents/Projects/Personal-Website/static/data/Excel_SPSS/testdata.sav' 
##  with 17 variables and 5 observations

The description() function, in turn, gives us this:

Variable                   Label
numeric                    numeric variable
numeric_long_label         numeric variable with long label: this variable hat five observations (one is missing). All values between 1 and 2 are also declared as missing. We use two decimal places and the measurement level is “Scale”.
factor_numeric             numeric factor with missing range
factor_n_long_value_label  numeric factor with long value labels
factor_n_coded_miss        numeric factor with coded missing values
factor_n_duplicated        numeric factor with duplicated factor labels
factor_n_undeclared        numeric factor with undeclared values
factor_n_undeclared2       numeric factor with undeclared values
string                     string variable
string_500                 long string variable
strin0                     long string variable
string_miss                string factor with missings
factor_s_coded_miss        string factor with coded missing
factor_s_duplicated        string factor with duplicated factor labels
factor_s_undeclared        string factor with undeclared values
factor_s_undeclared2       string factor with undeclared values
date                       date format tt.mm.yyyy

If we know which columns we need from the SPSS file, we can specify them with the subset() function, e.g.:

df <- as.data.table(subset(SPSSFile,
                           select = c("colname1",
                                      "colname2",
                                      "colname7",
                                      "etc")))

memisc has a lot more functionality, so be sure to check out the documentation.

Roundup

So, before you immediately start reading and writing data with the same package every time, think about the structure and size of your data, and about what it is that you actually need. Then choose the most suitable package, and start reading away!