Chapter 3 Basic data handling
3.1 R
3.1.1 Data frames
The basic object type for representing rectangular “data matrices” in R is the class "data.frame". The rows correspond to observational units and the columns to variables. Numeric vectors of class "numeric" or "integer" are typically used for quantitative variables and "factor" vectors for qualitative variables (either nominal or ordinal).
There are many more data classes that can be used for columns/variables in a data frame such as text ("character"), date ("Date"), time/date ("POSIXct"), time series ("ts", "zoo", …), or spatial ("sf") information. For this course we focus on quantitative and qualitative vectors, though.
To enter data “on the fly” in R, the function data.frame() can be used, potentially in combination with the function factor() for qualitative/categorical variables. As an example, we generate a small artificial data set:
mydata <- data.frame(
expenditure = c(724.3, 639.4, 1000, 75.87),
age = c(28, 54, 39, 17),
gender = factor(c(1, 1, 3, 2), levels = 1:3, labels = c("male", "female", "diverse"))
)
mydata
## expenditure age gender
## 1 724.30 28 male
## 2 639.40 54 male
## 3 1000.00 39 diverse
## 4 75.87 17 female
A particular caveat when using data.frame() in R versions prior to 4.0.0 was that columns with text strings (i.e., class "character") were by “factory-fresh” default automatically converted into qualitative "factor" variables. For nicely preprocessed data this can be useful. However, when reading less structured data that still needs streamlining it is typically more useful to first keep it as plain text and only create custom factors after suitable preprocessing. Hence, since R 4.0.0 character columns are preserved as such by default; the automatic conversion can still be requested explicitly via data.frame(..., stringsAsFactors = TRUE).
In any case, entering data into R using data.frame() is clearly only helpful for small data sets or data that is already available in some other form within R. Larger data sets are typically read from data sources such as text files, spreadsheets, or binary files from R or other statistics software such as Stata or SPSS.
3.1.2 Tibbles
A variation of data frames with somewhat stricter checking and modified formatting is called a “tibble”. Tibbles are popularized by the so-called tidyverse family of packages and are employed by RStudio’s built-in infrastructure for reading data from files via the menu. The corresponding class is "tbl_df", provided by the tibble package, which inherits from "tbl" and "data.frame". However, unless the “tidyverse” is used for the data wrangling, we recommend converting tibbles to plain data frames using as.data.frame(). The reverse coercion via as_tibble() (with underscore rather than period) is also available.
3.1.3 Text files
For exchanging simple rectangular data text-based formats are a popular choice.
These typically have one text line per row of data, using some separator symbol
for the different columns: e.g., comma, semicolon, tabulator, etc. The basic
workhorse function to read this kind of data in R is read.table()
which offers
a wide range of arguments for fine-control of the reading process. See
?read.table
or help("read.table")
for a detailed overview. The most
important arguments are:
- sep: Separator symbol between columns, e.g., tabulator "\t", comma ",", semicolon ";".
- dec: Decimal separator, typically either a decimal point "." or a decimal comma ",".
- header: Are variable names provided in the first line of the text file? (TRUE or FALSE)
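For example, a tab-separated file with variable names in the first line and decimal commas could be read via (the file name is illustrative):
mydata <- read.table("mydata.txt", header = TRUE, sep = "\t", dec = ",")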
Probably the most-used “flavor” of text-based data is CSV (comma-separated values) which can also be easily imported/exported to Excel and many other spreadsheet or statistics programs. R offers the convenience function read.csv() for this type of data, which internally essentially calls:
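read.table(file, header = TRUE, sep = ",", dec = ".")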
Note that this type of CSV is only used by Excel on English systems. In contrast, on German systems Excel uses a decimal comma (rather than a decimal point). Therefore, the comma cannot be used as both the column separator and the decimal separator. Consequently, semicolons are used as column separators (while still referring to the format as comma-separated). R offers another convenience function for this, read.csv2(), which corresponds to:
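read.table(file, header = TRUE, sep = ";", dec = ",")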
All these functions return a "data.frame" by internally setting up vectors for each column and then calling data.frame() (see above). By default, both numeric and text columns are simply preserved as such. However, text columns can optionally also be converted to factors automatically (by setting stringsAsFactors = TRUE) which is often useful if the data file was already preprocessed before importing it into R. Note that in R versions prior to 4.0.0 the default behavior was stringsAsFactors = TRUE but in current R versions it is FALSE.
If a column represents a qualitative variable but uses a numeric coding, then
factor()
should be used subsequently to declare the factor variable with
suitable labels.
For illustration we import the Bookbinders Book Club data from the file BBBClub.csv. The first lines of the file look like this:
choice gender amount freq last first child youth cook diy art
yes male 113 8 1 8 0 1 0 0 0
yes male 418 6 11 66 0 2 3 2 3
...
Therefore, read.csv()
is used for reading the data, turning the two
categorical variables choice
and gender
into factors automatically:
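In current R versions this requires setting stringsAsFactors = TRUE explicitly (assuming the file is in the working directory):
BBBClub <- read.csv("BBBClub.csv", stringsAsFactors = TRUE)
summary(BBBClub)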
## choice gender amount freq last first
## no :900 female:456 Min. : 15 Min. : 2.0 Min. : 1.0 Min. : 2.0
## yes:400 male :844 1st Qu.:127 1st Qu.: 6.0 1st Qu.: 1.0 1st Qu.:12.0
## Median :204 Median :12.0 Median : 2.0 Median :18.0
## Mean :201 Mean :12.1 Mean : 3.2 Mean :22.3
## 3rd Qu.:273 3rd Qu.:16.0 3rd Qu.: 4.0 3rd Qu.:30.0
## Max. :474 Max. :36.0 Max. :12.0 Max. :86.0
## child youth cook diy art
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000 Median :0.000 Median :0.000
## Mean :0.732 Mean :0.325 Mean :0.742 Mean :0.395 Mean :0.453
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :8.000 Max. :4.000 Max. :6.000 Max. :4.000 Max. :5.000
In addition to this comma-separated version of the data file, there is also
a semicolon-separated version in BBBClub2.csv
which can be read by
read.csv2()
instead of read.csv()
.
As a final remark: for exporting such text-based files from R, the corresponding commands write.table(), write.csv(), write.csv2(), etc. are available.
3.1.4 Binary R data
A particularly convenient method to exchange data for usage with R (as opposed to spreadsheets or other statistics systems) is R’s own binary format. This has the advantage that R’s own objects/classes can be stored, i.e., not only "data.frame" objects but also plain vectors, fitted model objects (like "lm" or "glm"), even functions, etc. There are two variants of R’s binary data:
- .rda (or .RData) files that can contain an arbitrary number of objects. Import/export via load() and save().
- .rds files that contain a single R object. Import/export via readRDS() and saveRDS().
The latter can be used similarly to the read.*() functions from the previous section, i.e., the result needs to be assigned to a new object. For illustration, we export the mydata data frame and re-import it as newdata, making sure that the objects are identical (the file name is illustrative):
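saveRDS(mydata, "mydata.rds")
newdata <- readRDS("mydata.rds")
identical(mydata, newdata)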
## [1] TRUE
In contrast, the .rda format can contain many objects and hence load()-ing them already creates them in the user’s workspace with their original object names. Thus, no assignment is needed here. For illustration, we load both the BBBClub and GSA data from the corresponding .rda files (assumed to be in the working directory):
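load("BBBClub.rda")
load("GSA.rda")
ls()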
## [1] "BBBClub" "GSA" "X" "m" "mydata" "newdata" "x" "y"
Then, it is also possible to save()
all of these data sets in a single file,
remove()
them subsequently from the workspace, and load()
all of them again
in one go:
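save(BBBClub, GSA, mydata, file = "mydata.rda")  # file name illustrative
remove(BBBClub, GSA, mydata)
ls()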
## [1] "X" "m" "newdata" "x" "y"
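load("mydata.rda")
ls()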
## [1] "BBBClub" "GSA" "X" "m" "mydata" "newdata" "x" "y"
3.1.5 Excel spreadsheets
Another convenient option to exchange data between R and Excel is Excel’s own spreadsheet format, i.e., .xls or .xlsx files. This is not supported in base R but through a number of different packages. The reason for this is that Excel’s spreadsheet format is somewhat complicated: it has been changing over Excel versions and it can become quite complex because it can contain reactive cells, tables, graphics, etc. Here, we focus on simple data matrices that could also be exported as .csv files from Excel and then imported into R.
To work with Excel spreadsheets directly, the following two packages are convenient. Both of them are written in C++, i.e., they need to be compiled when installed. However, suitable pre-compiled binaries are readily available on CRAN for Windows and OS X:
- openxlsx: read.xlsx() and write.xlsx() for reading/writing .xlsx files. Old .xls files are not supported.
- readxl: read_xls() and read_xlsx() as well as a wrapper function read_excel() that tries to guess the format. Returns a "tibble" and not a "data.frame". Can also be used through the RStudio graphical menus. No write_*() function is available (but write_xlsx() is provided in the writexl package).
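For example, a simple data matrix could be imported via openxlsx like this (the file name is hypothetical):
library("openxlsx")
BBBClub <- read.xlsx("BBBClub.xlsx")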
Additionally, a number of other packages are available that have some nice properties but also some drawbacks:
- gdata: read.xls() for reading both .xls and .xlsx files. Internally, the spreadsheets are transformed to .csv first and then read.table() is used, so that the same arguments are available for reading the data matrix. Requires Perl, see https://www.perl.org/get.html.
- xlsx: read.xlsx() for reading both .xls and .xlsx files along with corresponding write.xlsx(). Provides many arguments that read.table() also has. Requires Java and is somewhat slow.
- XLConnect: Functionality for querying and manipulating spreadsheets (including reading/writing). Requires Java.
Overall, there is not one “best” implementation that clearly outperforms all others. Users have to choose what works best for their application scenarios.
3.1.6 Stata and SPSS files
For exchanging data with various other statistics systems, the base R system ships with the package foreign. Thus, this package is available in all R installations and does not need to be installed from CRAN.
The Stata .dta format (versions 5-12) can be read and written with read.dta() and write.dta(), respectively. However, newer Stata formats starting from Stata 13 are not supported by foreign directly but are available in the CRAN package readstata13.
The SPSS .sav format can be read with read.spss(). By default, this returns a "list" object but setting read.spss(..., to.data.frame = TRUE) creates a "data.frame".
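For example (the file name is hypothetical):
library("foreign")
d <- read.spss("mydata.sav", to.data.frame = TRUE)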
3.1.7 Transforming variables in data frames
After having created a data frame it is often necessary to transform one or more of the variables. Typical operations include:
- Convert a numeric variable into a factor.
- Scale a variable to a different unit, e.g., in 1000s, from Euros to Dollars, or normalize to mean 0 and standard deviation 1.
- Add a new variable that is a transformation of one or more variables, e.g., computing income from hours worked and wage per hour.
One basic option for this is to access a given variable, say d$gender or d[["gender"]], carry out the transformation, and re-assign the result to the same variable (using either $ or [[ notation). For example, we could first create an all-numeric version of the data frame from above, called d, and subsequently transform gender from numeric to factor:
d <- data.frame(
expenditure = c(724.3, 639.4, 1000, 75.87),
age = c(28, 54, 39, 17),
gender = c(1, 1, 3, 2)
)
d
## expenditure age gender
## 1 724.30 28 1
## 2 639.40 54 1
## 3 1000.00 39 3
## 4 75.87 17 2
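d$gender <- factor(d$gender, levels = 1:3, labels = c("male", "female", "diverse"))
d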
## expenditure age gender
## 1 724.30 28 male
## 2 639.40 54 male
## 3 1000.00 39 diverse
## 4 75.87 17 female
Because d$gender is an existing column in the data frame, it is replaced by the assignment. But in the same way we could also have created a new variable with a different name, say d$sex.
For transforming a single variable this is usually simple enough. However,
repeating d$
quickly becomes tedious when many variables have to be
transformed or created. A convenient alternative solution is then to use the
transform()
function within which all variables can be directly accessed (due
to so-called non-standard evaluation). For illustration, we transform the
expenditure
column from Austrian Schilling to Euros, and we add a new column
age2
for age-squared:
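d <- transform(d,
  expenditure = expenditure / 13.7603,  # ATS to EUR
  age2 = age^2
)
d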
## expenditure age gender age2
## 1 52.63693 28 male 784
## 2 46.46701 54 male 2916
## 3 72.67283 39 diverse 1521
## 4 5.51369 17 female 289
Of course, it would also have been possible to create another new data frame, say newdata, instead of overwriting d.
3.1.8 Selecting subsets
Often it is useful to first select the relevant subset of a bigger data frame for a particular analysis. This can affect both:
- Rows: Keep only some observational units in the data, e.g., only men or only observations after a certain time point, etc.
- Columns: Keep only some variables that are necessary for the analysis while omitting other covariates.
A simple way to do this is to index with the [
operator, using either a row
index and/or a column index d[i, j]
where each of i
and j
can be:
- Integer: The rows/columns pertaining to positive integers are kept, negative integers are omitted.
- Logical: The rows/columns pertaining to TRUE are kept, FALSE are omitted.
- Character: If row and/or column names are available, these can also be used for indexing.
- Missing: If the row and/or column index is missing, all rows and/or columns are kept, respectively.
For illustration, we could select the variables expenditure
and age
from
d
by:
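d[, c("expenditure", "age")]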
## expenditure age
## 1 52.63693 28
## 2 46.46701 54
## 3 72.67283 39
## 4 5.51369 17
Equivalently, we could have also used:
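d[, 1:2]  # e.g., using integer indexes (same result as above)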
One caveat when selecting columns is that when selecting a single column the data frame attributes are dropped and only a single vector is returned. This can be avoided by explicitly setting the argument drop = FALSE:
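d[, "expenditure"]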
## [1] 52.63693 46.46701 72.67283 5.51369
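d[, "expenditure", drop = FALSE]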
## expenditure
## 1 52.63693
## 2 46.46701
## 3 72.67283
## 4 5.51369
This is one of the important differences between "data.frame"
and "tibble"
objects. For the former the default is drop = TRUE
while the latter has the
default drop = FALSE
.
Similar to the indexing of columns above, we can also index rows. For illustration, we could select the first and the third observation (with expenditure > 50):
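d[c(1, 3), ]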
## expenditure age gender age2
## 1 52.6369 28 male 784
## 3 72.6728 39 diverse 1521
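d[d$expenditure > 50, ]  # equivalently, via a logical condition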
## expenditure age gender age2
## 1 52.6369 28 male 784
## 3 72.6728 39 diverse 1521
Note that the latter creates a logical variable first and then uses that for indexing:
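d$expenditure > 50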
## [1] TRUE FALSE TRUE FALSE
As repeating d$
can again be tedious, there is a dedicated convenience
function subset()
that uses a similar strategy as transform()
(based on
so-called non-standard evaluation where variables can be used directly within
the function).
So we could easily select only the men with expenditure > 50, keeping only the expenditure and age variables by:
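subset(d, expenditure > 50 & gender == "male", select = c(expenditure, age))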
## expenditure age
## 1 52.6369 28
3.1.9 Handling missing values
Sometimes some of the variables are missing for some of the observational units, e.g., due to an error in the data collection or because the realization was not observable. In practice, data sets typically include missing values and these have to be handled in some way prior to subsequent analysis.
One common strategy is to omit those observations/rows where at least one of the variables in the analysis is missing. However, this strategy might waste a lot of useful data if there are many missing values in different variables. Also, if data is not missing at random, the fact that an observation is missing might be of interest in itself. Hence, omitting missing values is typically only recommended if there are very few missing values.
Alternatively, missing values are often imputed with different strategies. These might depend on the type of data (time series vs. cross-section) and the type of data analysis applied (especially when fitting parametric models). Such imputation strategies are beyond the scope of this course but we refer to the CRAN task view on MissingData for more details.
In R, missing values are represented by the dedicated symbol NA for “not available”. Base R comes with a number of na.*() functions that implement relatively simple strategies for handling missing values. Further na.*() functions are provided by various packages, especially for time series data. The function na.omit() implements the simple strategy of omitting all observations where at least one variable is missing. However, care is needed if there are still variables in the data frame that are not really needed for the subsequent analysis because na.omit() will also omit rows where only these are missing. Thus, make sure that you have selected the right subset() first.
As a simple example, consider a data frame with four variables where only the first two are really of interest. It makes a big difference whether the NAs are omitted first and then the variables selected, or vice versa. The latter is typically a better idea.
d <- data.frame(
x1 = c(1.1, 2.2, NA, 4.4),
x2 = sin(1:4),
z1 = cos(1:4),
z2 = c(4.4, NA, NA, NA)
)
d
## x1 x2 z1 z2
## 1 1.1 0.841471 0.540302 4.4
## 2 2.2 0.909297 -0.416147 NA
## 3 NA 0.141120 -0.989992 NA
## 4 4.4 -0.756802 -0.653644 NA
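na.omit(d)[, 1:2]  # omit NAs first, then select the variables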
## x1 x2
## 1 1.1 0.841471
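na.omit(d[, 1:2])  # select the variables first, then omit NAs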
## x1 x2
## 1 1.1 0.841471
## 2 2.2 0.909297
## 4 4.4 -0.756802
3.1.10 Aggregating data frames
Often it is of interest to first aggregate or average groups of observations prior to subsequent analysis. Here, this is illustrated by aggregating the interest in different summer activities within countries of origin for all the tourists in the Guest Survey Austria data. Download the GSA data set as GSA.rda.
First, the data for all the individual tourists is loaded:
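load("GSA.rda")
dim(GSA)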
## [1] 14571 52
Then we proceed, as above, by first selecting the variables of interest (here, the country and the eight summer activities SA*.*) before omitting observations/rows with incomplete information due to missing values:
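# Helper vector with the relevant variable names.
vars <- c("country", "SA01.tennis", "SA02.cycle", "SA03.ride", "SA05.swim",
  "SA17.shop", "SA18.concert", "SA19.sight", "SA21.museum")
gsa <- na.omit(GSA[, vars])
dim(gsa)
head(gsa, 3)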
## [1] 12449 9
## country SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop SA18.concert
## 1 Germany no yes no no no no
## 2 Switzerland no no no no no no
## 3 Germany no yes no yes yes no
## SA19.sight SA21.museum
## 1 yes no
## 2 no no
## 3 no yes
To aggregate this interest in the different summer activities, from the individual tourist level to the country level, we want to compute the country-wise “success rates”. These are simply the conditional proportions of each summer activity given country. One way to compute these is by setting up a contingency table with table() or xtabs() and then computing conditional proportions with prop.table():
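prop.table(xtabs(~ country + SA01.tennis, data = gsa), 1)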
## SA01.tennis
## country no yes
## Austria (Vienna) 0.96203905 0.03796095
## Austria (other) 0.96010879 0.03989121
## Belgium 0.98656716 0.01343284
## Denmark 0.98080614 0.01919386
## France 0.98104265 0.01895735
## Germany 0.98196101 0.01803899
## Hungary 0.97528090 0.02471910
## Italy 0.97155050 0.02844950
## Netherlands 0.97823721 0.02176279
## Spain 0.99595142 0.00404858
## Sweden 0.98780488 0.01219512
## Switzerland 0.95652174 0.04347826
## UK 0.98891626 0.01108374
## USA 0.99402390 0.00597610
## other 0.98714653 0.01285347
These proportions sum to 1 (or 100%) in each row and the second column provides the “success rates” for the summer activity “tennis” in each of the countries.
Thus, one way to proceed would be to use the analogous command for all eight summer activities separately and then cbind() the results. However, this is somewhat tedious, especially if the number of variables is large.
An alternative command that does this in one go is aggregate(). It can aggregate all columns in a data frame along one or more grouping variables based on a user-supplied function. The interface is somewhat similar to tapply(). To apply either aggregate() or tapply() we therefore need a function that computes the unconditional success proportions because the conditioning is then done by aggregate() or tapply(). We can also use prop.table() for this and simply remove the conditioning:
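prop.table(table(gsa$SA01.tennis))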
##
## no yes
## 0.9776689 0.0223311
Then, we only need the second element of that proportion table and wrap the command into a custom function sucrate():
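sucrate <- function(x) prop.table(table(x))[2]
sucrate(gsa$SA01.tennis)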
## yes
## 0.0223311
To apply this function separately for each country for a single summer activity, tapply() could be used:
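tapply(gsa$SA01.tennis, gsa$country, sucrate)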
## Austria (Vienna) Austria (other) Belgium Denmark France
## 0.03796095 0.03989121 0.01343284 0.01919386 0.01895735
## Germany Hungary Italy Netherlands Spain
## 0.01803899 0.02471910 0.02844950 0.02176279 0.00404858
## Sweden Switzerland UK USA other
## 0.01219512 0.04347826 0.01108374 0.00597610 0.01285347
But to use the same approach for all variables in gsa, aggregate() is more convenient:
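gsa <- aggregate(gsa, list(gsa$country), sucrate)
head(gsa, 3)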
## Group.1 country SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop
## 1 Austria (Vienna) 0 0.0379610 0.187636 0.01301518 0.414317 0.0856833
## 2 Austria (other) 1 0.0398912 0.199456 0.00543971 0.389846 0.1133273
## 3 Belgium 0 0.0134328 0.100000 0.00597015 0.317910 0.1835821
## SA18.concert SA19.sight SA21.museum
## 1 0.0162690 0.163774 0.0574837
## 2 0.0253853 0.211242 0.0679964
## 3 0.0447761 0.304478 0.0865672
Note that gsa$country has to be wrapped into a list() because multiple aggregation indexes could be used. Note also that the first two columns of the result are not really needed anymore: the first column would be more useful as row names for the data and the second is not really meaningful anyway. Therefore, we can clean up the data by:
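rownames(gsa) <- gsa[, 1]
gsa <- gsa[, -(1:2)]
dim(gsa)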
## [1] 15 8
This is the 15 x 8 data matrix that is ready for subsequent analysis with multivariate methods. Below the first couple of rows and columns are shown:
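gsa[1:3, 1:5]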
## SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop
## Austria (Vienna) 0.0379610 0.187636 0.01301518 0.414317 0.0856833
## Austria (other) 0.0398912 0.199456 0.00543971 0.389846 0.1133273
## Belgium 0.0134328 0.100000 0.00597015 0.317910 0.1835821
We also save()
this aggregated data set so that the resulting .rda
can be
loaded and we do not need to re-run the aggregation for future analyses.
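save(gsa, file = "gsa-aggregated.rda")  # file name illustrative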
3.2 Python
# Make sure that the required libraries are installed.
# Import the necessary libraries and classes.
import pandas as pd
import numpy as np
3.2.1 Transforming variables in data frames
d = {'expenditure': [724.3, 639.4, 1000, 75.87],
     'age': [28, 54, 39, 17],
     'gender': [1, 1, 3, 2]}
mydata = pd.DataFrame(data=d)
mydata.head()
## expenditure age gender
## 0 724.30 28 1
## 1 639.40 54 1
## 2 1000.00 39 3
## 3 75.87 17 2
mydata.gender = pd.cut(mydata.gender, bins=[0, 1, 2, 3], labels=["male", "female", "diverse"])
mydata.head()
## expenditure age gender
## 0 724.30 28 male
## 1 639.40 54 male
## 2 1000.00 39 diverse
## 3 75.87 17 female
mydata['expenditure'] = mydata.expenditure.transform(lambda x: x/13.7603)
mydata['age2'] = mydata.age.transform(np.square)
mydata.head()
## expenditure age gender age2
## 0 52.636934 28 male 784
## 1 46.467010 54 male 2916
## 2 72.672834 39 diverse 1521
## 3 5.513688 17 female 289
3.2.2 Selecting subsets
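Columns can be selected by name, e.g.:
mydata[["expenditure", "age"]]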
## expenditure age
## 0 52.636934 28
## 1 46.467010 54
## 2 72.672834 39
## 3 5.513688 17
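Equivalently, with .loc:
mydata.loc[:, ["expenditure", "age"]]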
## expenditure age
## 0 52.636934 28
## 1 46.467010 54
## 2 72.672834 39
## 3 5.513688 17
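Or by position with .iloc:
mydata.iloc[:, 0:2]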
## expenditure age
## 0 52.636934 28
## 1 46.467010 54
## 2 72.672834 39
## 3 5.513688 17
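Selecting a single column with single brackets returns a Series:
mydata["expenditure"]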
## 0 52.636934
## 1 46.467010
## 2 72.672834
## 3 5.513688
## Name: expenditure, dtype: float64
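Double brackets keep a DataFrame instead:
mydata[["expenditure"]]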
## expenditure
## 0 52.636934
## 1 46.467010
## 2 72.672834
## 3 5.513688
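Rows can be selected by position:
mydata.iloc[[0, 2], :]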
## expenditure age gender age2
## 0 52.636934 28 male 784
## 2 72.672834 39 diverse 1521
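Or, equivalently, by a logical condition:
mydata[mydata.expenditure > 50]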
## expenditure age gender age2
## 0 52.636934 28 male 784
## 2 72.672834 39 diverse 1521
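The condition itself is a boolean Series:
mydata.expenditure > 50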
## 0 True
## 1 False
## 2 True
## 3 False
## Name: expenditure, dtype: bool
mydata[(mydata.expenditure > 50) & (mydata.gender == "male")]  # Note: the parentheses around each condition are required!
## expenditure age gender age2
## 0 52.636934 28 male 784
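Rows and columns can also be selected jointly with .loc:
mydata.loc[(mydata.expenditure > 50) & (mydata.gender == "male"), ["expenditure", "age"]]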
## expenditure age
## 0 52.636934 28
3.2.3 Handling missing values
d = {'x1': [.1, 2.2, np.nan, 4.4],
     'x2': np.sin(np.arange(1, 5)),
     'z1': np.cos(np.arange(1, 5)),
     'z2': [4.4, np.nan, np.nan, np.nan]}
mydata = pd.DataFrame(data=d)
mydata.head()
## x1 x2 z1 z2
## 0 0.1 0.841471 0.540302 4.4
## 1 2.2 0.909297 -0.416147 NaN
## 2 NaN 0.141120 -0.989992 NaN
## 3 4.4 -0.756802 -0.653644 NaN
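Rows with any missing value can be dropped via:
mydata.dropna()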
## x1 x2 z1 z2
## 0 0.1 0.841471 0.540302 4.4
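Dropping only rows with missing values in the variables of interest, e.g.:
mydata.dropna(subset=["x1", "x2"])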
## x1 x2 z1 z2
## 0 0.1 0.841471 0.540302 4.4
## 1 2.2 0.909297 -0.416147 NaN
## 3 4.4 -0.756802 -0.653644 NaN
3.2.4 Aggregating data frames
Download the GSA data set as GSA.csv.
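GSA = pd.read_csv("GSA.csv")
GSA.shape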
## (14571, 52)
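# Select the country column and the eight summer activity columns by position.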
cols = [4] + list(range(9,12)) + [13] + list(range(24,27)) + [28]
gsa = GSA.iloc[:, cols].copy()
gsa.dropna(inplace=True)
gsa.shape
## (12449, 9)
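gsa.head(3)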
## country SA01.tennis SA02.cycle ... SA18.concert SA19.sight SA21.museum
## 0 Germany no yes ... no yes no
## 1 Switzerland no no ... no no no
## 2 Germany no yes ... no no yes
##
## [3 rows x 9 columns]
# Get frequency table
pd.crosstab(index=gsa["country"], columns=gsa["SA01.tennis"], normalize='index')
## SA01.tennis no yes
## country
## Austria (Vienna) 0.962039 0.037961
## Austria (other) 0.960109 0.039891
## Belgium 0.986567 0.013433
## Denmark 0.980806 0.019194
## France 0.981043 0.018957
## Germany 0.981961 0.018039
## Hungary 0.975281 0.024719
## Italy 0.971550 0.028450
## Netherlands 0.978237 0.021763
## Spain 0.995951 0.004049
## Sweden 0.987805 0.012195
## Switzerland 0.956522 0.043478
## UK 0.988916 0.011084
## USA 0.994024 0.005976
## other 0.987147 0.012853
def sucrate(column, index, return_key):
return pd.crosstab(index=index, columns=column, normalize='index')[return_key]
sucrate(column=gsa["SA01.tennis"], index=gsa["country"], return_key="yes")
## country
## Austria (Vienna) 0.037961
## Austria (other) 0.039891
## Belgium 0.013433
## Denmark 0.019194
## France 0.018957
## Germany 0.018039
## Hungary 0.024719
## Italy 0.028450
## Netherlands 0.021763
## Spain 0.004049
## Sweden 0.012195
## Switzerland 0.043478
## UK 0.011084
## USA 0.005976
## other 0.012853
## Name: yes, dtype: float64
gsa_agg = gsa.iloc[:,1:].agg(func=lambda x: sucrate(x, index=gsa["country"], return_key="yes"), axis="index")
gsa_agg.head(3)
## SA01.tennis SA02.cycle ... SA19.sight SA21.museum
## country ...
## Austria (Vienna) 0.037961 0.187636 ... 0.163774 0.057484
## Austria (other) 0.039891 0.199456 ... 0.211242 0.067996
## Belgium 0.013433 0.100000 ... 0.304478 0.086567
##
## [3 rows x 8 columns]
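gsa_agg.shape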
## (15, 8)