Chapter 3 Basic data handling
3.1 R
3.1.1 Data frames
The basic object type for representing rectangular “data matrices” in R is the class "data.frame". The rows correspond to observational units and the columns to variables. Numeric vectors of class "numeric" or "integer" are typically used for quantitative variables and "factor" vectors for qualitative variables (either nominal or ordinal).
There are many more data classes that can be used for columns/variables in a data frame, such as text ("character"), date ("Date"), time/date ("POSIXct"), time series ("ts", "zoo", …), or spatial ("sf") information. For this course, though, we focus on quantitative and qualitative vectors.
To enter data “on the fly” in R, the function data.frame() can be used, potentially in combination with the function factor() for qualitative/categorical variables. As an example, we generate a small artificial data set:
mydata <- data.frame(
expenditure = c(724.3, 639.4, 1000, 75.87),
age = c(28, 54, 39, 17),
gender = factor(c(1, 1, 3, 2), levels = 1:3, labels = c("male", "female", "diverse"))
)
mydata
## expenditure age gender
## 1 724.30 28 male
## 2 639.40 54 male
## 3 1000.00 39 diverse
## 4 75.87 17 female
Entering data into R using data.frame() is clearly only helpful for small data sets or data that is already available in some other form within R. Larger data sets are typically read from data sources such as text files, spreadsheets, or binary files from R or other statistics software such as Stata or SPSS.
3.1.2 Tibbles
A variation of data frames with somewhat stricter checking and modified formatting is the “tibble”. Tibbles are popularized by the so-called tidyverse family of packages and are employed by RStudio’s built-in infrastructure for reading data from files via the menu.
The corresponding class is "tbl_df"
provided by the tibble package. The
class inherits from "tbl"
and "data.frame"
. However, unless the “tidyverse”
is used for the data wrangling we recommend to convert tibbles to plain data
frames using as.data.frame()
. The reverse coercion via as_tibble()
(with
underscore rather than period) is also available.
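For example (a minimal sketch, assuming the tibble package is installed):
library("tibble")
mytbl <- as_tibble(mydata)    # plain data frame -> tibble
mydf <- as.data.frame(mytbl)  # tibble -> plain data frame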
3.1.3 Text files
For exchanging simple rectangular data, text-based formats are a popular choice. These typically have one text line per row of data, using some separator symbol between the different columns: e.g., comma, semicolon, tabulator, etc. The basic workhorse function for reading this kind of data in R is read.table(), which offers a wide range of arguments for fine control of the reading process. See ?read.table or help("read.table") for a detailed overview. The most important arguments are:
- sep: Separator symbol between columns, e.g., tabulator "\t", comma ",", semicolon ";".
- dec: Decimal separator, typically either a decimal point "." or a decimal comma ",".
- header: Are variable names provided in the first line of the text file? (TRUE or FALSE)
Probably the most-used “flavor” of text-based data is CSV (comma-separated values), which can also be easily imported/exported to Excel and many other spreadsheet or statistics programs. R offers a convenience function read.csv() for this type of data, which internally essentially calls read.table(file, header = TRUE, sep = ",", dec = ".").
Note that this type of CSV is only used by Excel on English systems. In contrast, on German systems Excel uses a decimal comma (rather than point). Therefore, the comma cannot be used as both column separator and decimal separator. Consequently, semicolons are used as column separators (while still referring to the format as comma-separated). R offers another convenience function read.csv2() for this, which corresponds to read.table(file, header = TRUE, sep = ";", dec = ",").
All these functions return a "data.frame"
by internally setting up vectors for
each column and then calling data.frame()
(see above). By default both numeric
and text columns (with some character entries) are simply preserved as such. However,
optionally text columns can also be converted to factors automatically (by setting
stringsAsFactors = TRUE
) which is often useful if the data file was already preprocessed
before importing it into R. Note that in R versions prior to 4.0.0 the default
behavior was stringsAsFactors = TRUE
but in current R versions it is FALSE
.
If a column represents a qualitative variable but uses a numeric coding, then
factor()
should be used subsequently to declare the factor variable with
suitable labels.
For illustration we import the Bookbinders Book Club data from the file BBBClub.csv, whose first lines look like this:
choice gender amount freq last first child youth cook diy art
yes male 113 8 1 8 0 1 0 0 0
yes male 418 6 11 66 0 2 3 2 3
...
Therefore, read.csv() is used for reading the data, turning the two categorical variables choice and gender into factors automatically:
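BBBClub <- read.csv("BBBClub.csv", stringsAsFactors = TRUE)
summary(BBBClub)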
## choice gender amount freq last first
## no :900 female:456 Min. : 15 Min. : 2.0 Min. : 1.0 Min. : 2.0
## yes:400 male :844 1st Qu.:127 1st Qu.: 6.0 1st Qu.: 1.0 1st Qu.:12.0
## Median :204 Median :12.0 Median : 2.0 Median :18.0
## Mean :201 Mean :12.1 Mean : 3.2 Mean :22.3
## 3rd Qu.:273 3rd Qu.:16.0 3rd Qu.: 4.0 3rd Qu.:30.0
## Max. :474 Max. :36.0 Max. :12.0 Max. :86.0
## child youth cook diy art
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000 Median :0.000 Median :0.000
## Mean :0.732 Mean :0.325 Mean :0.742 Mean :0.395 Mean :0.453
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :8.000 Max. :4.000 Max. :6.000 Max. :4.000 Max. :5.000
In addition to this comma-separated version of the data file, there is also
a semicolon-separated version in BBBClub2.csv
which can be read by
read.csv2()
instead of read.csv()
.
As a final remark: for exporting such text-based files from R, the corresponding commands write.table(), write.csv(), write.csv2(), etc. are available.
3.1.4 Binary R data
A particularly convenient method to exchange data for usage with R (as opposed to spreadsheets or other statistics systems) is R’s own binary format. This has the advantage that R’s own objects/classes can be stored, i.e., not only "data.frame" objects but also plain vectors, fitted model objects (like "lm" or "glm"), even functions, etc. There are two variants of R’s binary data:
- .rda (or .RData) files that can contain an arbitrary number of objects. Import/export via load() and save().
- .rds files that contain a single R object. Import/export via readRDS() and saveRDS().
The latter can be used similarly to the read.*() functions from the previous section, i.e., the result needs to be assigned to a new object. For illustration, we export the mydata data frame and re-import it as newdata, making sure that the objects are identical:
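saveRDS(mydata, file = "mydata.rds")  # file name is an assumption
newdata <- readRDS("mydata.rds")
identical(mydata, newdata)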
## [1] TRUE
In contrast, the .rda format can contain many objects and hence load()-ing them already creates them in the user’s workspace with their original object names. Thus, no assignment is needed here. For illustration, we load both the BBBClub and GSA data from the corresponding .rda files.
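load("BBBClub.rda")   # file names assumed from the data set names
load("GSA.rda")
ls()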
## [1] "BBBClub" "GSA" "X" "m" "mydata" "newdata" "x" "y"
Then, it is also possible to save() all of these data sets in a single file, remove() them subsequently from the workspace, and load() all of them again in one go:
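save(BBBClub, GSA, mydata, file = "alldata.rda")  # file name is an assumption
remove(BBBClub, GSA, mydata)
ls()                                              # first listing below
load("alldata.rda")
ls()                                              # second listing below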
## [1] "X" "m" "newdata" "x" "y"
## [1] "BBBClub" "GSA" "X" "m" "mydata" "newdata" "x" "y"
3.1.5 Excel spreadsheets
Another convenient option to exchange data between R and Excel is Excel’s own spreadsheet format, i.e., .xls or .xlsx files. This is not supported in base R but through a number of different packages. The reason for this is that Excel’s spreadsheet format is somewhat complicated: it has been changing over Excel versions and it can become quite complex because it can contain reactive cells, tables, graphics, etc. Here, we focus on simple data matrices that could also be exported as .csv files from Excel and then imported into R.
To work with Excel spreadsheets directly, the following two packages are convenient. Both of them are written in C++, i.e., they need to be compiled when installed. However, suitable pre-compiled binaries are readily available on CRAN for Windows and OS X:
- openxlsx: read.xlsx() and write.xlsx() for reading/writing .xlsx files. Old .xls files are not supported.
- readxl: read_xls() and read_xlsx() as well as a wrapper function read_excel() that tries to guess the format. Returns a "tibble" and not a "data.frame". Can also be used through the RStudio graphical menus. No write_*() function is available (but write_xlsx() is provided in the writexl package); see the sketch below.
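A minimal sketch of reading a spreadsheet with both packages (the file name BBBClub.xlsx is an assumption):
library("openxlsx")
dat1 <- read.xlsx("BBBClub.xlsx")                   # returns a "data.frame"
library("readxl")
dat2 <- as.data.frame(read_excel("BBBClub.xlsx"))   # tibble, converted here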
Additionally, a number of other packages are available that have some nice properties but also some drawbacks:
- gdata: read.xls() for reading both .xls and .xlsx files. Internally, the spreadsheets are transformed to .csv first and then read.table() is used, so that the same arguments are available for reading the data matrix. Requires Perl, see https://www.perl.org/get.html.
- xlsx: read.xlsx() for reading both .xls and .xlsx files along with a corresponding write.xlsx(). Provides many arguments that read.table() also has. Requires Java and is somewhat slow.
- XLConnect: Functionality for querying and manipulating spreadsheets (including reading/writing). Requires Java.
Overall, there is not one “best” implementation that clearly outperforms all others. Users have to choose what works best for their application scenarios.
3.1.6 Stata and SPSS files
For exchanging data with various other statistics systems, the base R system ships with the package foreign. Thus, this package is available in all R installations and does not need to be installed from CRAN.
The Stata .dta format (versions 5-12) can be read and written with read.dta() and write.dta(), respectively. However, newer Stata versions starting from Stata 13 are not supported by foreign directly but by the CRAN package readstata13.
The SPSS .sav format can be read with read.spss(). By default, this returns a "list" object but setting read.spss(..., to.data.frame = TRUE) creates a "data.frame".
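For example (a minimal sketch, with mydata.sav as an assumed file name):
library("foreign")
dat <- read.spss("mydata.sav", to.data.frame = TRUE)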
3.1.7 Transforming variables in data frames
After having created a data frame it is often necessary to transform one or more of the variables. Typical operations include:
- Convert a numeric variable into a factor.
- Scale a variable to a different unit, e.g., in 1000s, from Euros to Dollars, or normalize to mean 0 and standard deviation 1.
- Add a new variable that is a transformation of one or more variables, e.g., computing income from hours worked and wage per hour.
One basic option for this is to access a given variable, say d$gender or d[["gender"]], carry out the transformation, and re-assign the result to the same variable (using either $ or [[ notation). For example, we could first set up an all-numeric version of the d data frame from above and subsequently transform gender from numeric to factor:
d <- data.frame(
expenditure = c(724.3, 639.4, 1000, 75.87),
age = c(28, 54, 39, 17),
gender = c(1, 1, 3, 2)
)
d
## expenditure age gender
## 1 724.30 28 1
## 2 639.40 54 1
## 3 1000.00 39 3
## 4 75.87 17 2
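d$gender <- factor(d$gender, levels = 1:3, labels = c("male", "female", "diverse"))
d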
## expenditure age gender
## 1 724.30 28 male
## 2 639.40 54 male
## 3 1000.00 39 diverse
## 4 75.87 17 female
Because d$gender is an existing column in the data frame, it is replaced by the assignment. In the same way we could also have created a new variable with a different name, say d$sex.
For transforming a single variable this is usually simple enough. However, repeating d$ quickly becomes tedious when many variables have to be transformed or created. A convenient alternative is then the transform() function, within which all variables can be accessed directly (due to so-called non-standard evaluation). For illustration, we transform the expenditure column from Austrian Schilling to Euros, and we add a new column age2 for age-squared:
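d <- transform(d,
  expenditure = expenditure / 13.7603,  # fixed ATS/EUR conversion rate
  age2 = age^2
)
d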
## expenditure age gender age2
## 1 52.63693 28 male 784
## 2 46.46701 54 male 2916
## 3 72.67283 39 diverse 1521
## 4 5.51369 17 female 289
Of course, it would also have been possible to create another newdata instead of overwriting d.
3.1.8 Selecting subsets
Often it is useful to first select the relevant subset of a bigger data frame for a particular analysis. This can affect both:
- Rows: Keep only some observational units in the data, e.g., only men or only observations after a certain time point, etc.
- Columns: Keep only some variables that are necessary for the analysis while omitting other covariates.
A simple way to do this is to index with the [ operator, using a row index and/or a column index in d[i, j], where each of i and j can be:
- Integer: The rows/columns pertaining to positive integers are kept, those pertaining to negative integers are omitted.
- Logical: The rows/columns pertaining to TRUE are kept, those pertaining to FALSE are omitted.
- Character: If row and/or column names are available, these can also be used for indexing.
- Missing: If the row and/or column index is missing, all rows and/or columns are kept, respectively.
For illustration, we could select the variables expenditure and age from d by:
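d[, c("expenditure", "age")]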
## expenditure age
## 1 52.63693 28
## 2 46.46701 54
## 3 72.67283 39
## 4 5.51369 17
Equivalently, we could also have used:
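d[, 1:2]   # same result as above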
One caveat when selecting columns is that, when a single column is selected, the data frame attributes are dropped and only a plain vector is returned. This can be avoided by explicitly setting the argument drop = FALSE:
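d[, "expenditure"]                # plain vector (drop = TRUE is the default)
d[, "expenditure", drop = FALSE]  # one-column data frame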
## [1] 52.63693 46.46701 72.67283 5.51369
## expenditure
## 1 52.63693
## 2 46.46701
## 3 72.67283
## 4 5.51369
This is one of the important differences between "data.frame" and "tibble" objects: for the former the default is drop = TRUE while the latter has the default drop = FALSE.
Similar to the indexing of columns above, we can also index rows. For illustration, we could select the first and the third observation (with expenditure > 50):
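d[c(1, 3), ]             # by integer index
d[d$expenditure > 50, ]  # by logical index, same result below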
## expenditure age gender age2
## 1 52.6369 28 male 784
## 3 72.6728 39 diverse 1521
## expenditure age gender age2
## 1 52.6369 28 male 784
## 3 72.6728 39 diverse 1521
Note that the latter creates a logical variable first and then uses that for indexing:
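d$expenditure > 50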
## [1] TRUE FALSE TRUE FALSE
As repeating d$ can again be tedious, there is a dedicated convenience function subset() that uses a similar strategy as transform() (based on so-called non-standard evaluation, where variables can be used directly within the function).
So we could easily select only the men with expenditure > 50, keeping only the expenditure and age variables by:
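subset(d, gender == "male" & expenditure > 50, select = c(expenditure, age))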
## expenditure age
## 1 52.6369 28
3.1.9 Handling missing values
Sometimes some of the variables are missing for some of the observational units, e.g., due to an error in the data collection or because the realization was not observable. In practice, data sets typically include missing values and these have to be handled in some way prior to subsequent analysis.
One common strategy is to omit those observations/rows where at least one of the variables in the analysis is missing. However, this strategy might waste a lot of useful data if there are many missing values in different variables. Also, if data is not missing at random, the fact that an observation is missing might be of interest in itself. Hence, omitting missing values is typically only recommended if there are very few missing values.
Alternatively, missing values are often imputed with different strategies. These might depend on the type of data (time series vs. cross-section) and the type of data analysis applied (especially when fitting parametric models). Such imputation strategies are beyond the scope of this course but we refer to the CRAN task view on MissingData for more details.
In R, missing values are represented by the dedicated symbol NA for “not available”. Base R comes with a number of na.*() functions that implement relatively simple strategies for handling missing values. Further na.*() functions are provided by various packages, especially for time series data. The function na.omit() implements the simple strategy of omitting all observations where at least one variable is missing. However, care is needed if there are still variables in the data frame that are not really needed for the subsequent analysis, because na.omit() will also omit rows where only these are missing. Thus, make sure that you have selected the right subset() first.
As a simple example, consider a data frame with four variables where only the first two are really of interest. It makes a big difference whether the NAs are omitted first and the variables selected afterwards - or vice versa. The latter is typically the better idea.
d <- data.frame(
x1 = c(1.1, 2.2, NA, 4.4),
x2 = sin(1:4),
z1 = cos(1:4),
z2 = c(4.4, NA, NA, NA)
)
d
## x1 x2 z1 z2
## 1 1.1 0.841471 0.540302 4.4
## 2 2.2 0.909297 -0.416147 NA
## 3 NA 0.141120 -0.989992 NA
## 4 4.4 -0.756802 -0.653644 NA
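na.omit(d)[, c("x1", "x2")]  # omit NAs first, then select: only one row remains
na.omit(d[, c("x1", "x2")])  # select first, then omit NAs: three rows remain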
## x1 x2
## 1 1.1 0.841471
## x1 x2
## 1 1.1 0.841471
## 2 2.2 0.909297
## 4 4.4 -0.756802
3.1.10 Aggregating data frames
Often it is of interest to first aggregate or average groups of observations prior to subsequent analysis. Here, this is illustrated by aggregating the interest in different summer activities within countries of origin for all the tourists in the Guest Survey Austria data. Download the GSA data set as GSA.rda.
First, the data for all the individual tourists is loaded:
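load("GSA.rda")
dim(GSA)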
## [1] 14571 52
Then we proceed, as above, by first selecting the variables of interest (here, the country and the eight summer activities SA*.*) before omitting observations/rows with incomplete information due to missing values.
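vars <- c("country", "SA01.tennis", "SA02.cycle", "SA03.ride", "SA05.swim",
  "SA17.shop", "SA18.concert", "SA19.sight", "SA21.museum")
gsa <- na.omit(GSA[, vars])
dim(gsa)
head(gsa, 3)   # shown after the dimensions below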
## [1] 12449 9
## country SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop SA18.concert
## 1 Germany no yes no no no no
## 2 Switzerland no no no no no no
## 3 Germany no yes no yes yes no
## SA19.sight SA21.museum
## 1 yes no
## 2 no no
## 3 no yes
To aggregate this interest in the different summer activities - from the individual tourist level to the country level - we want to compute the country-wise “success rates”. These are simply the conditional proportions of each summer activity given country. One way to compute these is to set up a contingency table with table() or xtabs() and then compute conditional proportions with prop.table():
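prop.table(xtabs(~ country + SA01.tennis, data = gsa), margin = 1)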
## SA01.tennis
## country no yes
## Austria (Vienna) 0.96203905 0.03796095
## Austria (other) 0.96010879 0.03989121
## Belgium 0.98656716 0.01343284
## Denmark 0.98080614 0.01919386
## France 0.98104265 0.01895735
## Germany 0.98196101 0.01803899
## Hungary 0.97528090 0.02471910
## Italy 0.97155050 0.02844950
## Netherlands 0.97823721 0.02176279
## Spain 0.99595142 0.00404858
## Sweden 0.98780488 0.01219512
## Switzerland 0.95652174 0.04347826
## UK 0.98891626 0.01108374
## USA 0.99402390 0.00597610
## other 0.98714653 0.01285347
These proportions sum to 1 (or 100%) in each row and the second column provides the “success rates” for the summer activity “tennis” in each of the countries.
Thus, one way to proceed would be to use the analogous command for all eight summer activities separately and then cbind() the results. However, this is somewhat tedious, especially if the number of variables is large.
An alternative command that does this in one go is aggregate(). It can aggregate all columns in a data frame along one or more grouping variables based on a user-supplied function. The interface is somewhat similar to tapply().
To apply either aggregate() or tapply() we therefore need a function that computes the unconditional success proportions because the conditioning is then done by aggregate() or tapply(). We can also use prop.table() for this and simply remove the conditioning:
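prop.table(table(gsa$SA01.tennis))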
##
## no yes
## 0.9776689 0.0223311
Then, we only need the second element of that proportion table and wrap the command into a custom function sucrate():
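sucrate <- function(x) prop.table(table(x))[2]
sucrate(gsa$SA01.tennis)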
## yes
## 0.0223311
To apply this function separately for each country for a single summer activity, tapply() could be used:
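tapply(gsa$SA01.tennis, gsa$country, sucrate)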
## Austria (Vienna) Austria (other) Belgium Denmark France
## 0.03796095 0.03989121 0.01343284 0.01919386 0.01895735
## Germany Hungary Italy Netherlands Spain
## 0.01803899 0.02471910 0.02844950 0.02176279 0.00404858
## Sweden Switzerland UK USA other
## 0.01219512 0.04347826 0.01108374 0.00597610 0.01285347
But to use the same approach for all variables in gsa, aggregate() is more convenient:
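gsa_agg <- aggregate(gsa, list(gsa$country), sucrate)
head(gsa_agg, 3)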
## Group.1 country SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop
## 1 Austria (Vienna) 0 0.0379610 0.187636 0.01301518 0.414317 0.0856833
## 2 Austria (other) 1 0.0398912 0.199456 0.00543971 0.389846 0.1133273
## 3 Belgium 0 0.0134328 0.100000 0.00597015 0.317910 0.1835821
## SA18.concert SA19.sight SA21.museum
## 1 0.0162690 0.163774 0.0574837
## 2 0.0253853 0.211242 0.0679964
## 3 0.0447761 0.304478 0.0865672
Note that gsa$country has to be wrapped into a list() because multiple aggregation indexes could be used. Note also that the first two columns of the aggregated data are not really needed anymore: the first column would be more useful as row names for the data and the second is not really meaningful anyway. Therefore, we can clean up the data by:
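rownames(gsa_agg) <- gsa_agg$Group.1
gsa_agg <- gsa_agg[, -(1:2)]
dim(gsa_agg)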
## [1] 15 8
This is the 15 x 8 data matrix that is ready for subsequent analysis with multivariate methods. Below the first couple of rows and columns are shown:
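gsa_agg[1:3, 1:5]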
## SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop
## Austria (Vienna) 0.0379610 0.187636 0.01301518 0.414317 0.0856833
## Austria (other) 0.0398912 0.199456 0.00543971 0.389846 0.1133273
## Belgium 0.0134328 0.100000 0.00597015 0.317910 0.1835821
We also save() this aggregated data set so that the resulting .rda can be loaded and we do not need to re-run the aggregation for future analyses.
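save(gsa_agg, file = "gsa-country.rda")  # file name is an assumption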
3.2 Python
When the necessary packages are installed (see Chapter 2), they need to be imported in order to work with them. Here, pandas and numpy are imported.
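import pandas as pd
import numpy as np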
Throughout the book, the pandas
package is used to work with data frames. This is very
similar in spirit to the data frames employed in the R code above and provides many features
that would not be available in base Python.
3.2.1 Transforming variables in data frames
To set up a first data frame on the fly, we first define a dictionary d
of three
variables, each of which is a list, and then transform it to the data frame mydata
.
d = {
'expenditure': [724.3, 639.4, 1000, 75.87],
'age': [28, 54, 39, 17],
'gender': [1, 1, 3, 2]
}
mydata = pd.DataFrame(data = d)
mydata
## expenditure age gender
## 0 724.30 28 1
## 1 639.40 54 1
## 2 1000.00 39 3
## 3 75.87 17 2
The numerically coded gender
variable is cut()
into a categorical factor
with appropriate levels.
mydata.gender = pd.cut(mydata.gender, bins = [0, 1, 2, 3],
labels = ["male", "female", "diverse"])
mydata
## expenditure age gender
## 0 724.30 28 male
## 1 639.40 54 male
## 2 1000.00 39 diverse
## 3 75.87 17 female
The expenditure
is converted from Austrian Schilling to Euros and a squared
age
variable is added.
mydata['expenditure'] = mydata.expenditure.transform(lambda x: x/13.7603)
mydata['age2'] = mydata.age.transform(np.square)
mydata
## expenditure age gender age2
## 0 52.636934 28 male 784
## 1 46.467010 54 male 2916
## 2 72.672834 39 diverse 1521
## 3 5.513688 17 female 289
3.2.2 Selecting subsets
Subsets of variables can be extracted by using their numeric index, a logical index, or their character names.
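mydata.iloc[:, [0, 1]]                     # numeric index
mydata.loc[:, [True, True, False, False]]  # logical index
mydata[["expenditure", "age"]]             # character names; all three give the same result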
## expenditure age
## 0 52.636934 28
## 1 46.467010 54
## 2 72.672834 39
## 3 5.513688 17
## expenditure age
## 0 52.636934 28
## 1 46.467010 54
## 2 72.672834 39
## 3 5.513688 17
## expenditure age
## 0 52.636934 28
## 1 46.467010 54
## 2 72.672834 39
## 3 5.513688 17
Note the difference between extracting a single column using its index and computing a subset of a data frame using a list of indexes.
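mydata["expenditure"]    # a single column: returns a Series
mydata[["expenditure"]]  # a list of columns: returns a DataFrame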
## 0 52.636934
## 1 46.467010
## 2 72.672834
## 3 5.513688
## Name: expenditure, dtype: float64
## expenditure
## 0 52.636934
## 1 46.467010
## 2 72.672834
## 3 5.513688
Similarly, rows can be selected using a list of row indexes or by using logical conditions.
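mydata.iloc[[0, 2]]              # by a list of row indexes
mydata[mydata.expenditure > 50]  # by a logical condition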
## expenditure age gender age2
## 0 52.636934 28 male 784
## 2 72.672834 39 diverse 1521
## expenditure age gender age2
## 0 52.636934 28 male 784
## 2 72.672834 39 diverse 1521
In the latter line of code the logical variable was created on the fly via:
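mydata.expenditure > 50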
## 0 True
## 1 False
## 2 True
## 3 False
## Name: expenditure, dtype: bool
Multiple logical conditions can also be combined.
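mydata[(mydata.expenditure > 50) & (mydata.gender == "male")]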
## expenditure age gender age2
## 0 52.636934 28 male 784
Finally, it is also possible to select a subset of rows (here, based on logical conditions) and a subset of columns.
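mydata.loc[(mydata.expenditure > 50) & (mydata.gender == "male"),
    ["expenditure", "age"]]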
## expenditure age
## 0 52.636934 28
3.2.3 Handling missing values
For illustrating basic handling of missing values, the following simple data frame is set up.
d = {
'x1': [.1, 2.2, np.nan, 4.4],
'x2': np.sin(np.arange(1, 5)),
'z1': np.cos(np.arange(1, 5)),
'z2': [4.4, np.nan, np.nan, np.nan]
}
mydata = pd.DataFrame(data = d)
mydata
## x1 x2 z1 z2
## 0 0.1 0.841471 0.540302 4.4
## 1 2.2 0.909297 -0.416147 NaN
## 2 NaN 0.141120 -0.989992 NaN
## 3 4.4 -0.756802 -0.653644 NaN
Using dropna(), all rows that contain a missing value in any of the four variables are omitted. Note that inplace = False is used here, so that the operation is illustrated but mydata is not modified immediately “in place”.
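mydata.dropna(inplace = False)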
## x1 x2 z1 z2
## 0 0.1 0.841471 0.540302 4.4
To omit only those rows which have missing values in specific variables, the subset argument can be used.
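mydata.dropna(subset = ["x1", "x2"], inplace = False)  # keep rows complete in x1 and x2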
## x1 x2 z1 z2
## 0 0.1 0.841471 0.540302 4.4
## 1 2.2 0.909297 -0.416147 NaN
## 3 4.4 -0.756802 -0.653644 NaN
3.2.4 Aggregating data frames
Download the GSA data set as GSA.csv. This comma-separated values (CSV) file can be read into Python, using the first line for the variable names and not setting an index variable.
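GSA = pd.read_csv("GSA.csv", header = 0, index_col = False)
GSA.shape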
## (14571, 52)
The following code selects the country
variable along with the
summer activities of interest before dropping all missing values
in these variables.
cols = [4] + list(range(9, 12)) + [13] + list(range(24, 27)) + [28]
gsa = GSA.iloc[:, cols].copy()
gsa.dropna(inplace = True)
gsa.shape
## (12449, 9)
The first three rows of gsa
look like this:
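gsa.head(3)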
## country SA01.tennis SA02.cycle ... SA18.concert SA19.sight SA21.museum
## 0 Germany no yes ... no yes no
## 1 Switzerland no no ... no no no
## 2 Germany no yes ... no no yes
##
## [3 rows x 9 columns]
The goal is now to compute a data frame of “success rates” for each of the eight summer activities for each country. For tennis, this can be extracted from the cross tabulation of country and SA01.tennis, normalized with respect to the index variable country.
# cross tabulation of country and tennis, with proportions
# normalized within each country (i.e., by row)
pd.crosstab(index = gsa["country"], columns = gsa["SA01.tennis"],
    normalize = 'index')
## SA01.tennis no yes
## country
## Austria (Vienna) 0.962039 0.037961
## Austria (other) 0.960109 0.039891
## Belgium 0.986567 0.013433
## Denmark 0.980806 0.019194
## France 0.981043 0.018957
## Germany 0.981961 0.018039
## Hungary 0.975281 0.024719
## Italy 0.971550 0.028450
## Netherlands 0.978237 0.021763
## Spain 0.995951 0.004049
## Sweden 0.987805 0.012195
## Switzerland 0.956522 0.043478
## UK 0.988916 0.011084
## USA 0.994024 0.005976
## other 0.987147 0.012853
To apply this to all columns, the function sucrate()
is defined.
def sucrate(column, index, return_key):
    return pd.crosstab(index = index, columns = column, normalize = 'index')[return_key]
sucrate(column = gsa["SA01.tennis"], index = gsa["country"], return_key = "yes")
## country
## Austria (Vienna) 0.037961
## Austria (other) 0.039891
## Belgium 0.013433
## Denmark 0.019194
## France 0.018957
## Germany 0.018039
## Hungary 0.024719
## Italy 0.028450
## Netherlands 0.021763
## Spain 0.004049
## Sweden 0.012195
## Switzerland 0.043478
## UK 0.011084
## USA 0.005976
## other 0.012853
## Name: yes, dtype: float64
Finally, the entire gsa data is aggregated by sucrate().
gsa_agg = gsa.iloc[:, 1:].agg(
    func = lambda x: sucrate(x, index = gsa["country"], return_key = "yes"),
    axis = "index")
gsa_agg.head(3)
## SA01.tennis SA02.cycle ... SA19.sight SA21.museum
## country ...
## Austria (Vienna) 0.037961 0.187636 ... 0.163774 0.057484
## Austria (other) 0.039891 0.199456 ... 0.211242 0.067996
## Belgium 0.013433 0.100000 ... 0.304478 0.086567
##
## [3 rows x 8 columns]
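gsa_agg.shape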
## (15, 8)
Before saving the aggregated data in a CSV file, the index column name is added in upper case.
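gsa_agg.index.name = "Country"
gsa_agg.to_csv("gsa-country.csv")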
3.2.5 Working with data frames
The aggregated CSV file of the GSA data aggregated by country can also be downloaded as gsa-country.csv.
To load the data, read_csv() from pandas is used, setting index_col = 0 and header = 0, which specifies that the first row contains variable names (header = 0) and that the first column contains an index (index_col = 0).
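gsa = pd.read_csv("gsa-country.csv", index_col = 0, header = 0)
gsa.head(2)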
## SA01.tennis SA02.cycle ... SA19.sight SA21.museum
## Country ...
## Austria (Vienna) 0.037961 0.187636 ... 0.163774 0.057484
## Austria (other) 0.039891 0.199456 ... 0.211242 0.067996
##
## [2 rows x 8 columns]
Alternatively, the index column could also have been specified using its name (index_col = "Country") rather than its position.
There are also cases when it is useful to set index_col = False
so that the entire CSV file
is stored in the data frame columns and a numerical index is added.
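pd.read_csv("gsa-country.csv", index_col = False).head(2)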
## Country SA01.tennis ... SA19.sight SA21.museum
## 0 Austria (Vienna) 0.037961 ... 0.163774 0.057484
## 1 Austria (other) 0.039891 ... 0.211242 0.067996
##
## [2 rows x 9 columns]
However, index_col = False
only retains the order of column values if all columns (including
the index) have a name. If the CSV file has an unnamed index variable in the first column,
then it is better to read with index_col = 0
and subsequently reset and rename the index.
To illustrate this case, the data is read again but the index name is deleted afterwards.
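gsa = pd.read_csv("gsa-country.csv", index_col = 0)
gsa.index.name = None  # delete the index name
gsa.head(2)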
## SA01.tennis SA02.cycle ... SA19.sight SA21.museum
## Austria (Vienna) 0.037961 0.187636 ... 0.163774 0.057484
## Austria (other) 0.039891 0.199456 ... 0.211242 0.067996
##
## [2 rows x 8 columns]
By resetting the index, the corresponding column in the data becomes a regular variable and a numerical index is added.
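gsa.reset_index(inplace = True)
gsa.head(2)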
## index SA01.tennis ... SA19.sight SA21.museum
## 0 Austria (Vienna) 0.037961 ... 0.163774 0.057484
## 1 Austria (other) 0.039891 ... 0.211242 0.067996
##
## [2 rows x 9 columns]
Afterwards the "index"
variable name is changed to "Country"
as a more intelligible name.
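gsa.rename(columns = {"index": "Country"}, inplace = True)
gsa.head(2)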
## Country SA01.tennis ... SA19.sight SA21.museum
## 0 Austria (Vienna) 0.037961 ... 0.163774 0.057484
## 1 Austria (other) 0.039891 ... 0.211242 0.067996
##
## [2 rows x 9 columns]
Note that "Country"
could be made the index column again with .set_index()
.
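gsa.set_index("Country", inplace = True)
gsa.head(2)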
## SA01.tennis SA02.cycle ... SA19.sight SA21.museum
## Country ...
## Austria (Vienna) 0.037961 0.187636 ... 0.163774 0.057484
## Austria (other) 0.039891 0.199456 ... 0.211242 0.067996
##
## [2 rows x 8 columns]
Furthermore, it is possible to set multiple indices, e.g., with gsa.set_index(["country", "SA01.tennis"], inplace = True). This is exemplified below by first deleting the index name and setting it again afterwards.
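A sketch of one possibility (the exact calls used here are an assumption):
gsa.index.name = None       # delete the index name (first output below)
gsa.head(2)
gsa.index.name = "Country"  # set it again (second output below)
gsa.head(2)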
## SA01.tennis SA02.cycle ... SA19.sight SA21.museum
## Austria (Vienna) 0.037961 0.187636 ... 0.163774 0.057484
## Austria (other) 0.039891 0.199456 ... 0.211242 0.067996
##
## [2 rows x 8 columns]
## SA01.tennis SA02.cycle ... SA19.sight SA21.museum
## Country ...
## Austria (Vienna) 0.037961 0.187636 ... 0.163774 0.057484
## Austria (other) 0.039891 0.199456 ... 0.211242 0.067996
##
## [2 rows x 8 columns]
With drop()
a column which is no longer needed can be omitted from the data.
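gsa.drop(columns = ["SA21.museum"], inplace = True)
gsa.head(2)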
## SA01.tennis SA02.cycle ... SA18.concert SA19.sight
## Country ...
## Austria (Vienna) 0.037961 0.187636 ... 0.016269 0.163774
## Austria (other) 0.039891 0.199456 ... 0.025385 0.211242
##
## [2 rows x 7 columns]
With rename()
variables can be renamed.
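gsa.rename(columns = {"SA18.concert": "SA22.opera"}, inplace = True)
gsa.head(2)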
## SA01.tennis SA02.cycle ... SA22.opera SA19.sight
## Country ...
## Austria (Vienna) 0.037961 0.187636 ... 0.016269 0.163774
## Austria (other) 0.039891 0.199456 ... 0.025385 0.211242
##
## [2 rows x 7 columns]
As a general cautionary remark, note that applying methods with inplace = True
immediately and
permanently modifies the data “in place”. Thus, reverting such a change is not always possible.
In such a case the data needs to be reloaded (e.g., from the source CSV file) and all transformations
and modifications need to be re-applied step by step. In a Jupyter notebook this is typically done
by re-running the notebook from the start (run-all
) after correcting the corresponding code chunks.