Chapter 3 Basic data handling

3.1 R

3.1.1 Data frames

The basic object type for representing rectangular “data matrices” in R is the class "data.frame". The rows correspond to observational units and the columns to variables. Numeric vectors of class "numeric" or "integer" are typically used for quantitative variables and "factor" vectors for qualitative variables (either nominal or ordinal).

There are many more data classes that can be used for columns/variables in a data frame, such as text ("character"), date ("Date"), date/time ("POSIXct"), time series ("ts", "zoo", …), or spatial ("sf") information. In this course, however, we focus on quantitative and qualitative vectors.

To enter data “on the fly” in R, the function data.frame() can be used, potentially in combination with the function factor() for qualitative/categorical variables. As an example, we generate a small artificial data set:

mydata <- data.frame(
  expenditure = c(724.3, 639.4, 1000, 75.87),
  age = c(28, 54, 39, 17),
  gender = factor(c(1, 1, 3, 2), levels = 1:3, labels = c("male", "female", "diverse"))
)
mydata
##   expenditure age  gender
## 1      724.30  28    male
## 2      639.40  54    male
## 3     1000.00  39 diverse
## 4       75.87  17  female

Entering data into R using data.frame() is clearly only helpful for small data sets or data that is already available in some other form within R. Larger data sets are typically read from data sources such as text files, spreadsheets, or binary files from R or another statistics software such as Stata or SPSS.

3.1.2 Tibbles

A variation of data frames with somewhat stricter checking and modified formatting is called “tibbles”. They were popularized by the so-called tidyverse family of packages and are employed by RStudio’s built-in infrastructure for reading data from files via the menu.

The corresponding class is "tbl_df", provided by the tibble package; it inherits from "tbl" and "data.frame". However, unless the “tidyverse” is used for the data wrangling, we recommend converting tibbles to plain data frames using as.data.frame(). The reverse coercion via as_tibble() (with underscore rather than period) is also available.
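
For illustration, the conversion could look as follows; this is a minimal sketch, assuming the tibble package is installed, applied to the mydata data frame from above:

library("tibble")
mytbl <- as_tibble(mydata)    # plain data frame -> tibble
class(mytbl)
## [1] "tbl_df"     "tbl"        "data.frame"
mydf <- as.data.frame(mytbl)  # tibble -> plain data frame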

3.1.3 Text files

For exchanging simple rectangular data, text-based formats are a popular choice. These typically have one text line per row of data, using some separator symbol between the different columns: e.g., comma, semicolon, tabulator, etc. The basic workhorse function for reading this kind of data in R is read.table(), which offers a wide range of arguments for fine control of the reading process. See ?read.table or help("read.table") for a detailed overview. The most important arguments are:

  • sep: Separator symbol between columns, e.g., tabulator "\t", comma ",", semicolon ";".
  • dec: Decimal separator, typically either a decimal point "." or decimal comma ",".
  • header: Are variable names provided in the first line of the text file? (TRUE or FALSE)
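
For example, a tab-separated file with variable names in the first line and decimal points could be read as follows (a sketch using the hypothetical file name mydata.txt):

dat <- read.table("mydata.txt", sep = "\t", dec = ".", header = TRUE)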

Probably the most-used “flavor” of text-based data is CSV (comma-separated values) which can also be easily imported/exported to Excel and many other spreadsheet or statistics programs. R offers a convenience function

read.csv(...)

for this type of data which internally essentially calls

read.table(..., sep = ",", dec = ".", header = TRUE)

Note that this type of CSV is only used by Excel on English-language systems. In contrast, on German systems Excel uses a decimal comma (rather than a decimal point). Therefore, the comma cannot serve as both column separator and decimal separator. Consequently, semicolons are used as column separators (while the format is still referred to as comma-separated). R offers another convenience function for this:

read.csv2(...)

which corresponds to

read.table(..., sep = ";", dec = ",", header = TRUE)

All these functions return a "data.frame" by internally setting up vectors for each column and then calling data.frame() (see above). By default, both numeric and text columns (i.e., columns with some character entries) are simply preserved as such. However, text columns can optionally be converted to factors automatically (by setting stringsAsFactors = TRUE), which is often useful if the data file was already preprocessed before importing it into R. Note that in R versions prior to 4.0.0 the default behavior was stringsAsFactors = TRUE but in current R versions it is FALSE. If a column represents a qualitative variable but uses a numeric coding, then factor() should be used subsequently to declare the factor variable with suitable labels.

For illustration we import the Bookbinders Book Club data from the file BBBClub.csv.

choice gender amount freq last first child youth cook diy art
yes    male   113    8    1    8     0     1     0    0   0
yes    male   418    6    11   66    0     2     3    2   3
...

Therefore, read.csv() is used for reading the data, turning the two categorical variables choice and gender into factors automatically:

BBBClub <- read.csv("BBBClub.csv", stringsAsFactors = TRUE)
summary(BBBClub)
##  choice       gender        amount         freq           last          first     
##  no :900   female:456   Min.   : 15   Min.   : 2.0   Min.   : 1.0   Min.   : 2.0  
##  yes:400   male  :844   1st Qu.:127   1st Qu.: 6.0   1st Qu.: 1.0   1st Qu.:12.0  
##                         Median :204   Median :12.0   Median : 2.0   Median :18.0  
##                         Mean   :201   Mean   :12.1   Mean   : 3.2   Mean   :22.3  
##                         3rd Qu.:273   3rd Qu.:16.0   3rd Qu.: 4.0   3rd Qu.:30.0  
##                         Max.   :474   Max.   :36.0   Max.   :12.0   Max.   :86.0  
##      child           youth            cook            diy             art       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :0.000   Median :0.000   Median :0.000   Median :0.000   Median :0.000  
##  Mean   :0.732   Mean   :0.325   Mean   :0.742   Mean   :0.395   Mean   :0.453  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :8.000   Max.   :4.000   Max.   :6.000   Max.   :4.000   Max.   :5.000

In addition to this comma-separated version of the data file, there is also a semicolon-separated version in BBBClub2.csv which can be read by read.csv2() instead of read.csv().

As a final remark: For exporting such text-based files from R the corresponding commands write.table(), write.csv(), write.csv2(), etc. are available.
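
For example, the small data frame from Section 3.1.1 could be exported via (a sketch, writing mydata.csv to the working directory):

write.csv(mydata, "mydata.csv", row.names = FALSE)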

3.1.4 Binary R data

A particularly convenient method to exchange data for usage with R (as opposed to spreadsheets or other statistics systems) is R’s own binary format. This has the advantage that R’s own objects/classes can be stored, i.e., not only "data.frame" objects but also plain vectors, fitted model objects (like "lm" or "glm"), even functions, etc. There are two variants of R’s binary data:

  • .rda (or .RData) files that can contain an arbitrary number of objects.
    Import/export via load() and save().
  • .rds files that contain a single R object.
    Import/export via readRDS() and saveRDS().

The latter can be used similarly to the read.*() functions from the previous section, i.e., the result needs to be assigned to a new object. For illustration, we export the mydata data frame and re-import it as newdata, making sure that the objects are identical:

saveRDS(mydata, "mydata.rds")
newdata <- readRDS("mydata.rds")
identical(mydata, newdata)
## [1] TRUE

In contrast, the .rda format can contain many objects and hence load()-ing them already creates them in the user’s workspace with their original object names. Thus, no assignment is needed here. For illustration, we load both the BBBClub and GSA data from the corresponding .rda files.

load("BBBClub.rda")
load("GSA.rda")
ls()
## [1] "BBBClub" "GSA"     "X"       "m"       "mydata"  "newdata" "x"       "y"

Then, it is also possible to save() all of these data sets in a single file, remove() them subsequently from the workspace, and load() all of them again in one go:

save(mydata, BBBClub, GSA, file = "alldata.rda")
remove(mydata, BBBClub, GSA)
ls()
## [1] "X"       "m"       "newdata" "x"       "y"
load("alldata.rda")
ls()
## [1] "BBBClub" "GSA"     "X"       "m"       "mydata"  "newdata" "x"       "y"

3.1.5 Excel spreadsheets

Another convenient option to exchange data between R and Excel is Excel’s own spreadsheet format, i.e., .xls or .xlsx files. This is not supported in base R but through a number of different packages. The reason for this is that Excel’s spreadsheet format is somewhat complicated: It has changed across Excel versions and can become quite complex because spreadsheets may contain reactive cells, tables, graphics, etc. Here, we focus on simple data matrices that could also be exported as .csv files from Excel and then imported into R.

To work with Excel spreadsheets directly, the following two packages are convenient. Both of them are written in C++, i.e., they need to be compiled when installed. However, suitable pre-compiled binaries are readily available on CRAN for Windows and OS X:

  • openxlsx: read.xlsx() and write.xlsx() for reading/writing .xlsx files. Old .xls files are not supported.
  • readxl: read_xls() and read_xlsx() as well as a wrapper function read_excel() that tries to guess the format. Returns a "tibble" and not a "data.frame". Can also be used through the RStudio graphical menus. No write_*() function is available (but write_xlsx() is provided in the writexl package).

Additionally, a number of other packages are available that have some nice properties but also some drawbacks:

  • gdata: read.xls() for reading both .xls and .xlsx files. Internally, the spreadsheets are transformed to .csv first and then read.table() is used, so that the same arguments are available for reading the data matrix. Requires Perl, see https://www.perl.org/get.html.
  • xlsx: read.xlsx() for reading both .xls and .xlsx files along with corresponding write.xlsx(). Provides many arguments that read.table() also has. Requires Java and is somewhat slow.
  • XLConnect: Functionality for querying and manipulating spreadsheets (including reading/writing). Requires Java.

Overall, there is not one “best” implementation that clearly outperforms all others. Users have to choose what works best for their application scenarios.
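
For illustration, reading a simple data matrix from a hypothetical file data.xlsx with either of the two packages recommended above might look like this (a sketch, assuming the packages are installed):

dat1 <- openxlsx::read.xlsx("data.xlsx")  # returns a "data.frame"
dat2 <- readxl::read_excel("data.xlsx")   # returns a tibble
dat2 <- as.data.frame(dat2)               # coerce to a plain "data.frame"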

3.1.6 Stata and SPSS files

For exchanging data with various other statistics systems, the base R system ships with the package foreign. Thus, this package is available in all R installations and does not need to be installed from CRAN.

The Stata .dta format (versions 5 to 12) can be read and written with read.dta() and write.dta(), respectively. However, newer Stata formats starting from Stata 13 are not supported by foreign directly; for these, the CRAN package readstata13 is available.

The SPSS .sav format can be read with read.spss(). By default, this returns a "list" object but setting read.spss(..., to.data.frame = TRUE) creates a "data.frame".
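
For illustration, importing the hypothetical files mydata.dta and mydata.sav might look like this (a sketch):

library("foreign")
dat_stata <- read.dta("mydata.dta")                        # Stata (up to version 12)
dat_spss  <- read.spss("mydata.sav", to.data.frame = TRUE) # SPSS, as a "data.frame"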

3.1.7 Transforming variables in data frames

After having created a data frame it is often necessary to transform one or more of the variables. Typical operations include:

  • Convert a numeric variable into a factor.
  • Scale a variable to a different unit, e.g., in 1000s, from Euros to Dollars, or normalize to mean 0 and standard deviation 1.
  • Add a new variable that is a transformation of one or more variables, e.g., computing income from hours worked and wage per hour.

One basic option for this is to access a given variable, say d$gender or d[["gender"]], carry out the transformation, and re-assign it to the same variable (using either $ or [[ notation). For example, we first create an all-numeric version of the data frame from above as d and subsequently transform gender from numeric to factor:

d <- data.frame(
  expenditure = c(724.3, 639.4, 1000, 75.87),
  age = c(28, 54, 39, 17),
  gender = c(1, 1, 3, 2)
)
d
##   expenditure age gender
## 1      724.30  28      1
## 2      639.40  54      1
## 3     1000.00  39      3
## 4       75.87  17      2
d$gender <- factor(d$gender,
  levels = 1:3, labels = c("male", "female", "diverse"))
d
##   expenditure age  gender
## 1      724.30  28    male
## 2      639.40  54    male
## 3     1000.00  39 diverse
## 4       75.87  17  female

Because d$gender is an existing column in the data frame, it is replaced by the assignment. In the same way, we could also have created a new variable with a different name, say d$sex.

For transforming a single variable this is usually simple enough. However, repeating d$ quickly becomes tedious when many variables have to be transformed or created. A convenient alternative solution is then to use the transform() function within which all variables can be directly accessed (due to so-called non-standard evaluation). For illustration, we transform the expenditure column from Austrian Schilling to Euros, and we add a new column age2 for age-squared:

d <- transform(d,
  expenditure  = expenditure/13.7603,
  age2 = age^2
)
d
##   expenditure age  gender age2
## 1    52.63693  28    male  784
## 2    46.46701  54    male 2916
## 3    72.67283  39 diverse 1521
## 4     5.51369  17  female  289

Of course, it would also have been possible to create a new data frame, say newdata, instead of overwriting d.
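
Similarly, the normalization to mean 0 and standard deviation 1 mentioned at the beginning of this section could also be carried out within transform(). The following sketch adds a standardized age variable to a copy d2 (a hypothetical name) so that d itself remains unchanged:

d2 <- transform(d, age_std = (age - mean(age)) / sd(age))

Alternatively, base R’s scale() carries out the same standardization but returns a one-column matrix rather than a plain vector.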

3.1.8 Selecting subsets

Often it is useful to first select the relevant subset of a bigger data frame for a particular analysis. This can affect both:

  • Rows: Keep only some observational units in the data, e.g., only men or only observations after a certain time point, etc.
  • Columns: Keep only some variables that are necessary for the analysis while omitting other covariates.

A simple way to do this is to index with the [ operator, using a row index and/or a column index in d[i, j], where each of i and j can be:

  • Integer: The rows/columns pertaining to positive integers are kept; those pertaining to negative integers are omitted.
  • Logical: The rows/columns pertaining to TRUE are kept, FALSE are omitted.
  • Character: If row and/or column names are available, these can also be used for indexing.
  • Missing: If the row and/or column index is missing, all rows and/or columns are kept, respectively.

For illustration, we could select the variables expenditure and age from d by:

d[, 1:2]
##   expenditure age
## 1    52.63693  28
## 2    46.46701  54
## 3    72.67283  39
## 4     5.51369  17

Equivalently, we could have also used:

d[, c(1, 2)]
d[, -c(3, 4)]
d[, c(TRUE, TRUE, FALSE, FALSE)]
d[, c("expenditure", "age")]

One caveat is that selecting a single column drops the data frame attributes so that only a plain vector is returned. This can be avoided by explicitly setting the argument drop = FALSE:

d[, 1]
## [1] 52.63693 46.46701 72.67283  5.51369
d[, 1, drop = FALSE]
##   expenditure
## 1    52.63693
## 2    46.46701
## 3    72.67283
## 4     5.51369

This is one of the important differences between "data.frame" and "tibble" objects. For the former the default is drop = TRUE while the latter has the default drop = FALSE.
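
This can be verified directly by coercing d to a tibble first (a sketch, assuming the tibble package is installed):

library("tibble")
as_tibble(d)[, 1]  # remains a one-column tibble instead of being dropped to a vector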

Similar to the indexing of columns above, we can also index rows. For illustration, we could select the first and the third observation (with expenditure > 50):

d[c(1, 3), ]
##   expenditure age  gender age2
## 1     52.6369  28    male  784
## 3     72.6728  39 diverse 1521
d[d$expenditure > 50, ]
##   expenditure age  gender age2
## 1     52.6369  28    male  784
## 3     72.6728  39 diverse 1521

Note that the latter creates a logical variable first and then uses that for indexing:

d$expenditure > 50
## [1]  TRUE FALSE  TRUE FALSE

As repeating d$ can again be tedious, there is a dedicated convenience function subset() that uses a strategy similar to transform() (based on so-called non-standard evaluation where variables can be used directly within the function).

So we could easily select only the men with expenditure > 50, keeping only the expenditure and age variables by:

subset(d, subset = gender == "male" & expenditure > 50,
  select = c(expenditure, age))
##   expenditure age
## 1     52.6369  28

3.1.9 Handling missing values

Sometimes variables are missing for some of the observational units, e.g., due to an error in the data collection or because the realization was not observable. In practice, data sets typically include missing values and these have to be handled in some way prior to subsequent analysis.

One common strategy is to omit those observations/rows where at least one of the variables in the analysis is missing. However, this strategy might waste a lot of useful data if there are many missing values in different variables. Also, if data is not missing at random, the fact that an observation is missing might be of interest in itself. Hence, omitting missing values is typically only recommended if there are very few missing values.

Alternatively, missing values are often imputed with different strategies. These might depend on the type of data (time series vs. cross-section) and the type of data analysis applied (especially when fitting parametric models). Such imputation strategies are beyond the scope of this course but we refer to the CRAN task view on MissingData for more details.

In R, missing values are represented by the dedicated symbol NA for “not available”. Base R comes with a number of na.*() functions that implement relatively simple strategies for handling missing values. Further na.*() functions are provided by various packages, especially for time series data. The function na.omit() implements the simple strategy of omitting all observations where at least one variable is missing. However, care is needed if the data frame still contains variables that are not actually needed for the subsequent analysis, because na.omit() will also omit rows where only these are missing. Thus, make sure that you have selected the right subset() first.

As a simple example, consider a data frame with four variables where only the first two are really of interest. It makes a big difference whether the NAs are omitted first and then the variables selected, or vice versa. The latter is typically the better idea.

d <- data.frame(
  x1 = c(1.1, 2.2, NA, 4.4),
  x2 = sin(1:4),
  z1 = cos(1:4),
  z2 = c(4.4, NA, NA, NA)
)
d
##    x1        x2        z1  z2
## 1 1.1  0.841471  0.540302 4.4
## 2 2.2  0.909297 -0.416147  NA
## 3  NA  0.141120 -0.989992  NA
## 4 4.4 -0.756802 -0.653644  NA
na.omit(d)[, c("x1", "x2")]
##    x1       x2
## 1 1.1 0.841471
na.omit(d[, c("x1", "x2")])
##    x1        x2
## 1 1.1  0.841471
## 2 2.2  0.909297
## 4 4.4 -0.756802
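
An equivalent base R strategy uses complete.cases(), which returns a logical vector flagging the rows without any missing values and can then be combined with logical subsetting as in Section 3.1.8:

complete.cases(d[, c("x1", "x2")])
## [1]  TRUE  TRUE FALSE  TRUE
d[complete.cases(d[, c("x1", "x2")]), c("x1", "x2")]  # same result as above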

3.1.10 Aggregating data frames

Often it is of interest to first aggregate or average groups of observations prior to subsequent analysis. Here, this is illustrated by aggregating the interest in different summer activities within countries of origin for all the tourists in the Guest Survey Austria data. Download the GSA data set as GSA.rda.

First, the data for all the individual tourists is loaded:

load("GSA.rda")
dim(GSA)
## [1] 14571    52

Then we proceed, as above, by first selecting the variables of interest (here, the country and the eight summer activities SA*.*) before omitting observations/rows with incomplete information due to missing values.

gsa <- GSA[, c(5, 10:12, 14, 25:27, 29)]
gsa <- na.omit(gsa)
dim(gsa)
## [1] 12449     9
head(gsa, 3)
##       country SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop SA18.concert
## 1     Germany          no        yes        no        no        no           no
## 2 Switzerland          no         no        no        no        no           no
## 3     Germany          no        yes        no       yes       yes           no
##   SA19.sight SA21.museum
## 1        yes          no
## 2         no          no
## 3         no         yes

To aggregate this interest in the different summer activities, from the individual tourist level to the country level, we want to compute the country-wise “success rates”. These are simply the conditional proportions of each summer activity given country. One way to compute these is by setting up a contingency table with table() or xtabs() and then computing conditional proportions with prop.table():

prop.table(xtabs(~ country + SA01.tennis, data = gsa), 1)
##                   SA01.tennis
## country                    no        yes
##   Austria (Vienna) 0.96203905 0.03796095
##   Austria (other)  0.96010879 0.03989121
##   Belgium          0.98656716 0.01343284
##   Denmark          0.98080614 0.01919386
##   France           0.98104265 0.01895735
##   Germany          0.98196101 0.01803899
##   Hungary          0.97528090 0.02471910
##   Italy            0.97155050 0.02844950
##   Netherlands      0.97823721 0.02176279
##   Spain            0.99595142 0.00404858
##   Sweden           0.98780488 0.01219512
##   Switzerland      0.95652174 0.04347826
##   UK               0.98891626 0.01108374
##   USA              0.99402390 0.00597610
##   other            0.98714653 0.01285347

These proportions sum to 1 (or 100%) in each row and the second column provides the “success rates” for the summer activity “tennis” in each of the countries.

Thus, one way to proceed would be to use the analogous command for all eight summer activities separately and then cbind() the results. However, this is somewhat tedious, especially if the number of variables is large.

An alternative command that does this in one go is aggregate(). It can aggregate all columns in a data frame along one or more grouping variables based on a user-supplied function. The interface is somewhat similar to tapply().

To apply either aggregate() or tapply() we therefore need a function that computes the unconditional success proportions because the conditioning is then done by aggregate() or tapply(). We can also use prop.table() for this and simply remove the conditioning:

prop.table(table(gsa$SA01.tennis))
## 
##        no       yes 
## 0.9776689 0.0223311

Then, we only need the second element of that proportion table and wrap the command into a custom function sucrate():

sucrate <- function(x) prop.table(table(x))[2]
sucrate(gsa$SA01.tennis)
##       yes 
## 0.0223311

To apply this function to a single summer activity, separately for each country, tapply() could be used:

tapply(gsa$SA01.tennis, gsa$country, sucrate)
## Austria (Vienna)  Austria (other)          Belgium          Denmark           France 
##       0.03796095       0.03989121       0.01343284       0.01919386       0.01895735 
##          Germany          Hungary            Italy      Netherlands            Spain 
##       0.01803899       0.02471910       0.02844950       0.02176279       0.00404858 
##           Sweden      Switzerland               UK              USA            other 
##       0.01219512       0.04347826       0.01108374       0.00597610       0.01285347

But to use the same approach for all variables in gsa, aggregate() is more convenient:

gsa <- aggregate(gsa, list(gsa$country), sucrate)
head(gsa, 3)
##            Group.1 country SA01.tennis SA02.cycle  SA03.ride SA05.swim SA17.shop
## 1 Austria (Vienna)       0   0.0379610   0.187636 0.01301518  0.414317 0.0856833
## 2  Austria (other)       1   0.0398912   0.199456 0.00543971  0.389846 0.1133273
## 3          Belgium       0   0.0134328   0.100000 0.00597015  0.317910 0.1835821
##   SA18.concert SA19.sight SA21.museum
## 1    0.0162690   0.163774   0.0574837
## 2    0.0253853   0.211242   0.0679964
## 3    0.0447761   0.304478   0.0865672

Note that gsa$country has to be wrapped into a list() because multiple aggregation indexes could be used. Note also that the first two columns of gsa are not really needed anymore. The first column would be more useful as row names for the data and the second is not really meaningful anyway. Therefore, we can clean up the data by

rownames(gsa) <- gsa[,1]
gsa <- gsa[, -(1:2)]
dim(gsa)
## [1] 15  8

This is the 15 x 8 data matrix that is ready for subsequent analysis with multivariate methods. Below the first couple of rows and columns are shown:

gsa[1:3, 1:5]
##                  SA01.tennis SA02.cycle  SA03.ride SA05.swim SA17.shop
## Austria (Vienna)   0.0379610   0.187636 0.01301518  0.414317 0.0856833
## Austria (other)    0.0398912   0.199456 0.00543971  0.389846 0.1133273
## Belgium            0.0134328   0.100000 0.00597015  0.317910 0.1835821

We also save() this aggregated data set so that the resulting .rda can be loaded and we do not need to re-run the aggregation for future analyses.

save(gsa, file = "gsa-country.rda")

3.2 Python

Once the necessary packages are installed (see Chapter 2), they need to be imported in order to work with them. Here, pandas and numpy are imported.

import pandas as pd 
import numpy as np

Throughout the book, the pandas package is used to work with data frames. This is very similar in spirit to the data frames employed in the R code above and provides many features that would not be available in base Python.

3.2.1 Transforming variables in data frames

To set up a first data frame on the fly, we first define a dictionary d of three variables, each of which is a list, and then transform it to the data frame mydata.

d = {
  'expenditure': [724.3, 639.4, 1000, 75.87],
  'age': [28, 54, 39, 17],
  'gender': [1, 1, 3, 2]
}
mydata = pd.DataFrame(data = d)
mydata
##    expenditure  age  gender
## 0       724.30   28       1
## 1       639.40   54       1
## 2      1000.00   39       3
## 3        75.87   17       2

The numerically coded gender variable is cut() into a categorical variable with appropriate labels.

mydata.gender = pd.cut(mydata.gender, bins = [0, 1, 2, 3],
  labels = ["male", "female", "diverse"])
mydata
##    expenditure  age   gender
## 0       724.30   28     male
## 1       639.40   54     male
## 2      1000.00   39  diverse
## 3        75.87   17   female

The expenditure is converted from Austrian Schilling to Euros and a squared age variable is added.

mydata['expenditure'] = mydata.expenditure.transform(lambda x: x/13.7603)
mydata['age2'] = mydata.age.transform(np.square)
mydata
##    expenditure  age   gender  age2
## 0    52.636934   28     male   784
## 1    46.467010   54     male  2916
## 2    72.672834   39  diverse  1521
## 3     5.513688   17   female   289
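
The same transformations could also have been written with plain vectorized arithmetic, which is often more idiomatic for such simple operations than transform(). A sketch, applied to a fresh copy tmp (a hypothetical name) so that mydata is not converted twice:

tmp = pd.DataFrame(data = d)                       # fresh copy from the dictionary above
tmp['expenditure'] = tmp['expenditure'] / 13.7603  # vectorized division
tmp['age2'] = tmp['age'] ** 2                      # vectorized squaring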

3.2.2 Selecting subsets

Subsets of variables can be extracted using either their numeric indexes, a logical index, or their character names.

mydata.iloc[:, [0, 1]]
##    expenditure  age
## 0    52.636934   28
## 1    46.467010   54
## 2    72.672834   39
## 3     5.513688   17
mydata.iloc[:, [True, True, False, False]]
##    expenditure  age
## 0    52.636934   28
## 1    46.467010   54
## 2    72.672834   39
## 3     5.513688   17
mydata[["expenditure", "age"]]
##    expenditure  age
## 0    52.636934   28
## 1    46.467010   54
## 2    72.672834   39
## 3     5.513688   17

Note the difference between extracting a single column using its index and computing a subset of a data frame using a list of indexes.

mydata.iloc[:, 0]
## 0    52.636934
## 1    46.467010
## 2    72.672834
## 3     5.513688
## Name: expenditure, dtype: float64
mydata.iloc[:, [0]]
##    expenditure
## 0    52.636934
## 1    46.467010
## 2    72.672834
## 3     5.513688

Similarly, rows can be selected using a list of row indexes or by using logical conditions.

mydata.iloc[[0, 2], :]
##    expenditure  age   gender  age2
## 0    52.636934   28     male   784
## 2    72.672834   39  diverse  1521
mydata[mydata.expenditure > 50]
##    expenditure  age   gender  age2
## 0    52.636934   28     male   784
## 2    72.672834   39  diverse  1521

In the latter line of code the logical variable was created on the fly via:

mydata.expenditure > 50
## 0     True
## 1    False
## 2     True
## 3    False
## Name: expenditure, dtype: bool

Multiple logical conditions can also be combined.

mydata[(mydata.expenditure > 50) & (mydata.gender == "male")]
##    expenditure  age gender  age2
## 0    52.636934   28   male   784

Finally, it is also possible to select a subset of rows (here, based on logical conditions) and a subset of columns.

mydata[(mydata.expenditure > 50) & (mydata.gender == "male")].iloc[:, [0,1]]
##    expenditure  age
## 0    52.636934   28

3.2.3 Handling missing values

For illustrating basic handling of missing values, the following simple data frame is set up.

d = {
  'x1': [.1, 2.2, np.nan, 4.4],
  'x2': np.sin(np.arange(1, 5)),
  'z1': np.cos(np.arange(1, 5)),
  'z2': [4.4, np.nan, np.nan, np.nan]
}
mydata = pd.DataFrame(data = d)
mydata
##     x1        x2        z1   z2
## 0  0.1  0.841471  0.540302  4.4
## 1  2.2  0.909297 -0.416147  NaN
## 2  NaN  0.141120 -0.989992  NaN
## 3  4.4 -0.756802 -0.653644  NaN

Using dropna() all rows are omitted that contain a missing value in any of the four variables. Note that inplace = False is used here, so that the operation is illustrated but mydata is not immediately modified “in place”.

mydata.dropna(inplace = False)
##     x1        x2        z1   z2
## 0  0.1  0.841471  0.540302  4.4

To omit only those rows which have missing values in specific variables, the subset argument can be used.

mydata.dropna(subset = ['x1', 'x2'], inplace = False)
##     x1        x2        z1   z2
## 0  0.1  0.841471  0.540302  4.4
## 1  2.2  0.909297 -0.416147  NaN
## 3  4.4 -0.756802 -0.653644  NaN

3.2.4 Aggregating data frames

Download the GSA data set as GSA.csv. This comma-separated values (CSV) file can be read into Python, using the first line for the variable names and not setting an index variable.

GSA = pd.read_csv("GSA.csv", index_col = False, header = 0) 
GSA.shape
## (14571, 52)

The following code selects the country variable along with the summer activities of interest before dropping all rows with missing values in these variables.

cols = [4] + list(range(9, 12)) + [13] + list(range(24, 27)) + [28]
gsa = GSA.iloc[:, cols].copy()
gsa.dropna(inplace = True)
gsa.shape
## (12449, 9)

The first three rows of gsa look like this:

gsa.head(3)
##        country SA01.tennis SA02.cycle  ... SA18.concert SA19.sight SA21.museum
## 0      Germany          no        yes  ...           no        yes          no
## 1  Switzerland          no         no  ...           no         no          no
## 2      Germany          no        yes  ...           no         no         yes
## 
## [3 rows x 9 columns]

The goal is now to compute a data frame of “success rates” for each of the eight summer activities for each country. For tennis, this can be extracted from the cross tabulation of country and SA01.tennis, normalized with respect to the index variable country.

# Get frequency table
pd.crosstab(index = gsa["country"], columns = gsa["SA01.tennis"],
  normalize = 'index') 
## SA01.tennis             no       yes
## country                             
## Austria (Vienna)  0.962039  0.037961
## Austria (other)   0.960109  0.039891
## Belgium           0.986567  0.013433
## Denmark           0.980806  0.019194
## France            0.981043  0.018957
## Germany           0.981961  0.018039
## Hungary           0.975281  0.024719
## Italy             0.971550  0.028450
## Netherlands       0.978237  0.021763
## Spain             0.995951  0.004049
## Sweden            0.987805  0.012195
## Switzerland       0.956522  0.043478
## UK                0.988916  0.011084
## USA               0.994024  0.005976
## other             0.987147  0.012853

To apply this to all columns, the function sucrate() is defined.

def sucrate(column, index, return_key):
    return pd.crosstab(index = index, columns = column, normalize = 'index')[return_key]
sucrate(column = gsa["SA01.tennis"], index = gsa["country"], return_key = "yes")
## country
## Austria (Vienna)    0.037961
## Austria (other)     0.039891
## Belgium             0.013433
## Denmark             0.019194
## France              0.018957
## Germany             0.018039
## Hungary             0.024719
## Italy               0.028450
## Netherlands         0.021763
## Spain               0.004049
## Sweden              0.012195
## Switzerland         0.043478
## UK                  0.011084
## USA                 0.005976
## other               0.012853
## Name: yes, dtype: float64

Finally, the entire gsa data is aggregated by sucrate().

gsa_agg = gsa.iloc[:, 1:].agg(func = lambda x:
  sucrate(x, index = gsa["country"], return_key = "yes"), axis = "index")
gsa_agg.head(3)
##                   SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## country                                    ...                         
## Austria (Vienna)     0.037961    0.187636  ...    0.163774     0.057484
## Austria (other)      0.039891    0.199456  ...    0.211242     0.067996
## Belgium              0.013433    0.100000  ...    0.304478     0.086567
## 
## [3 rows x 8 columns]
gsa_agg.shape
## (15, 8)

Before saving the aggregated data in a CSV file, the index column is given the capitalized name "Country".

gsa_agg.index.name = "Country"
gsa_agg.to_csv("gsa-country.csv")

3.2.5 Working with data frames

The CSV file with the GSA data aggregated by country can also be downloaded as gsa-country.csv.

To load the data, read_csv() from pandas is used, specifying that the first row contains the variable names (header = 0) and that the first column contains an index (index_col = 0).

gsa = pd.read_csv("gsa-country.csv", index_col = 0, header = 0)
gsa.head(2)
##                   SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Country                                    ...                         
## Austria (Vienna)     0.037961    0.187636  ...    0.163774     0.057484
## Austria (other)      0.039891    0.199456  ...    0.211242     0.067996
## 
## [2 rows x 8 columns]

Alternatively, the index column could have also been specified using its name (index_col = "Country") rather than its position.

There are also cases when it is useful to set index_col = False so that the entire CSV file is stored in the data frame columns and a numerical index is added.

gsa = pd.read_csv("gsa-country.csv", index_col = False, header = 0)
gsa.head(2)
##             Country  SA01.tennis  ...  SA19.sight  SA21.museum
## 0  Austria (Vienna)     0.037961  ...    0.163774     0.057484
## 1   Austria (other)     0.039891  ...    0.211242     0.067996
## 
## [2 rows x 9 columns]

However, index_col = False only retains the order of column values if all columns (including the index) have a name. If the CSV file has an unnamed index variable in the first column, then it is better to read with index_col = 0 and subsequently reset and rename the index. To illustrate this case, the data is read again but the index name is deleted afterwards.

gsa = pd.read_csv("gsa-country.csv", index_col = 0, header = 0)
gsa.index.name = None
gsa.head(2)
##                   SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Austria (Vienna)     0.037961    0.187636  ...    0.163774     0.057484
## Austria (other)      0.039891    0.199456  ...    0.211242     0.067996
## 
## [2 rows x 8 columns]

By resetting the index, the corresponding column in the data becomes a regular variable and a numerical index is added.

gsa.reset_index(inplace = True)
gsa.head(2)
##               index  SA01.tennis  ...  SA19.sight  SA21.museum
## 0  Austria (Vienna)     0.037961  ...    0.163774     0.057484
## 1   Austria (other)     0.039891  ...    0.211242     0.067996
## 
## [2 rows x 9 columns]

Afterwards, the variable name "index" is changed to the more intelligible "Country".

gsa.rename(columns={"index": "Country"}, inplace = True)
gsa.head(2)
##             Country  SA01.tennis  ...  SA19.sight  SA21.museum
## 0  Austria (Vienna)     0.037961  ...    0.163774     0.057484
## 1   Austria (other)     0.039891  ...    0.211242     0.067996
## 
## [2 rows x 9 columns]

Note that "Country" could be made the index column again with .set_index().

gsa.set_index(["Country"], inplace=True)
gsa.head(2)
##                   SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Country                                    ...                         
## Austria (Vienna)     0.037961    0.187636  ...    0.163774     0.057484
## Austria (other)      0.039891    0.199456  ...    0.211242     0.067996
## 
## [2 rows x 8 columns]

Furthermore, it is possible to set multiple indexes at once, e.g., with gsa.set_index(["Country", "SA01.tennis"], inplace = True). Below, however, we simply illustrate deleting the index name and setting it again afterwards.

gsa.index.name = None
gsa.head(2)
##                   SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Austria (Vienna)     0.037961    0.187636  ...    0.163774     0.057484
## Austria (other)      0.039891    0.199456  ...    0.211242     0.067996
## 
## [2 rows x 8 columns]
gsa.index.name = "Country"
gsa.head(2)
##                   SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Country                                    ...                         
## Austria (Vienna)     0.037961    0.187636  ...    0.163774     0.057484
## Austria (other)      0.039891    0.199456  ...    0.211242     0.067996
## 
## [2 rows x 8 columns]

With drop() a column which is no longer needed can be omitted from the data.

gsa.drop(["SA21.museum"], axis=1, inplace = True)
gsa.head(2)
##                   SA01.tennis  SA02.cycle  ...  SA18.concert  SA19.sight
## Country                                    ...                          
## Austria (Vienna)     0.037961    0.187636  ...      0.016269    0.163774
## Austria (other)      0.039891    0.199456  ...      0.025385    0.211242
## 
## [2 rows x 7 columns]

With rename() variables can be renamed.

gsa.rename(columns={"SA18.concert": "SA22.opera"}, inplace = True)
gsa.head(2)
##                   SA01.tennis  SA02.cycle  ...  SA22.opera  SA19.sight
## Country                                    ...                        
## Austria (Vienna)     0.037961    0.187636  ...    0.016269    0.163774
## Austria (other)      0.039891    0.199456  ...    0.025385    0.211242
## 
## [2 rows x 7 columns]

As a general cautionary remark, note that applying methods with inplace = True immediately and permanently modifies the data “in place”. Thus, reverting such a change is not always possible. In such a case the data needs to be reloaded (e.g., from the source CSV file) and all transformations and modifications need to be re-applied step by step. In a Jupyter notebook this is typically done by re-running the notebook from the start (run-all) after correcting the corresponding code chunks.
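
A safer pattern is therefore to rely on the default inplace = False and assign the result to a new name so that the original data frame remains intact, a minimal sketch (gsa2 is a hypothetical name):

gsa2 = gsa.rename(columns = {"SA22.opera": "SA18.concert"})  # gsa itself is unchanged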