Chapter 1 Introduction
1.1 Data, data analysis, data analytics
Data analysis: This course covers various methods for data analysis and statistical modeling or statistical learning. The goal of the analysis is to gain a compact insight into the patterns and dependencies in the data and possibly use this as a basis for forecasts.
Data: The data are the basis of all modeling/learning. Fundamental concepts are observational units and variables. The values or realizations of the variables are measured (collected) for each unit, resulting in the data or observations.
Data Analytics: Often used as an alternative term for data analysis in support of decision making, especially in a business intelligence context. It emphasizes the combination of statistics with computer programming as well as a focus on continuous reporting and prediction.
Data matrix: Statistical data usually have a rectangular shape, called a data matrix or data frame, where rows pertain to units and columns to variables.
## expenditure age gender occupation
## 1 604.800 58 male Arbeitslos
## 2 1346.000 50 male Angestellt
## 3 968.000 54 male Leit.Angest.
## 4 513.700 79 male Pensionist
## 5 488.571 59 male Angestellt
## 6 332.410 67 male Pensionist
## 7 1775.000 62 male Selbstaendig
Scale: The choice of suitable statistical methods depends crucially on the measurement scale of the variables to be analyzed. Variables are usually classified as either qualitative (or categorical) or quantitative (numeric, metric, continuous).
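As a minimal sketch of these two ideas, the excerpt printed above can be stored as a small data matrix in Python, with one tuple per unit and one position per variable. The helper names `column` and `scale` are hypothetical, and classifying a variable by the Python type of its values is a simplifying assumption: a numerically coded category would be misclassified.

```python
# Rows are observational units, columns are variables.
# Values are copied from the printed excerpt of the data.
columns = ["expenditure", "age", "gender", "occupation"]
rows = [
    (604.800, 58, "male", "Arbeitslos"),
    (1346.000, 50, "male", "Angestellt"),
    (968.000, 54, "male", "Leit.Angest."),
    (513.700, 79, "male", "Pensionist"),
    (488.571, 59, "male", "Angestellt"),
    (332.410, 67, "male", "Pensionist"),
    (1775.000, 62, "male", "Selbstaendig"),
]


def column(name):
    """Return all values of one variable, i.e. one column of the data matrix."""
    j = columns.index(name)
    return [row[j] for row in rows]


def scale(name):
    """Classify a variable by the Python type of its values (a simplification)."""
    values = column(name)
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative"
    return "qualitative"


scales = {name: scale(name) for name in columns}
print(scales)
```

Here `expenditure` and `age` come out as quantitative, while `gender` and `occupation` come out as qualitative.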
1.2 Guest Survey Austria Data
Background: The Guest Survey Austria covers 14,571 tourists who visited Austria in the summer seasons of 1994 and 1997. In addition to socio-demographic data such as age, gender, and occupation, the survey included questions about preferred summer activities (hiking, sightseeing, swimming, etc.) and motives for selecting the destination (relaxing, sports, culture, etc.).
Source: Guest Survey Austria, http://www.tourmis.info/.
Variable | Description |
---|---|
age | Age in years. |
occupation | Job or occupation. |
income | Monthly household income in ATS. |
gender | Gender. |
country | Country of origin. |
expenditure | Average expenditures per day in ATS. |
province | Province of destination. |
accomodation | Type of accommodation. |
year | Year of visit (1994, 1997). |
SAnn.xxx | Engaged in \(\underline{S}\)ummer \(\underline{A}\)ctivity? |
Mnn.xxx | Was the \(\underline{M}\)otive a reason for visiting Austria or not? |
1.3 Myopic Loss Aversion Data
Background: Myopic loss aversion is a phenomenon in behavioral economics, where individuals do not behave economically rationally when making short-term decisions under uncertainty. Example: In lotteries with a positive expected gain, investments are lower than the maximum possible (loss aversion). This effect is stronger for short-term investments (myopia).
Experiment: Sutter & Glätzle-Rützler (UIBK). Pupils (Schwaz/Innsbruck) could invest \(X\) points (0–100) in a lottery in each of 9 rounds. Payouts in points: \(100 + 2.5 \cdot X\) with probability 1/3 and \(100 - X\) with probability 2/3 (expectation: \(100 + 1/6 \cdot X\)). The investments could either be modified in each round (treatment “short”) or only in rounds 1, 4, and 7 (treatment “long”). Decisions were made either alone or in teams of two. Monetary payouts: EUR 0.5 per 100 points for lower grades (Unterstufe, 6–8) or EUR 1.0 per 100 points for upper grades (Oberstufe, 10–12).
Source: Glätzle-Rützler D, Sutter M, Zeileis A (2015). “No Myopic Loss Aversion in Adolescents? – An Experimental Note”, Journal of Economic Behavior & Organization, 111, 169–176.
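The expectation stated for the lottery can be checked with a few lines of Python. This is only a sketch: the function names `payouts` and `expected_payout` are hypothetical, and exact fractions are used so that the comparison with \(100 + 1/6 \cdot X\) does not suffer from rounding.

```python
from fractions import Fraction

P_WIN = Fraction(1, 3)   # probability of the favourable outcome
P_LOSE = Fraction(2, 3)  # probability of the unfavourable outcome


def payouts(x):
    """Return the two possible payouts (win, lose) in points for an investment x."""
    if not 0 <= x <= 100:
        raise ValueError("investment must be between 0 and 100 points")
    return 100 + Fraction(5, 2) * x, 100 - x


def expected_payout(x):
    """Probability-weighted average of the two possible payouts."""
    win, lose = payouts(x)
    return P_WIN * win + P_LOSE * lose


for x in (0, 50, 100):
    print(x, float(expected_payout(x)))
```

Because the expected payout increases with \(X\), a risk-neutral player would invest all 100 points in every round. Lower investments are what the experiment interprets as loss aversion.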
Variable | Description |
---|---|
invest | Average invested points across all 9 rounds. |
gender | Gender of (team of) player(s). |
male | Was (at least one of) the player(s) male (in the team)? |
age | Age in years (averaged for teams). |
treatment | Type of treatments (long vs. short). |
grade | School grades (6–8, 11–14 years vs. 10–12, 15–18 years). |
arrangement | Single player vs. team of two? |
1.4 Bookbinder’s Book Club Data
Background: The Bookbinder’s Book Club data are a case study about a (fictional) American book club that sent a flyer about the book “The Art History of Florence” to 20,000 customers. Of these, 1,806 went on to buy the book. The BBB Club has collected various variables for predicting a customer’s purchase decision. Here, a subsample of 1,300 units (customers) and 11 variables is considered.
Source: Lilien & Rangaswamy (2004).
Files: BBBClub.csv, BBBClub.rda.
Variable | Description |
---|---|
choice | Did the customer buy the book “The Art History of Florence”? |
gender | Gender. |
amount | Total amount spent at BBB Club. |
freq | Total frequency of purchases at BBB Club. |
last | Months since last purchase. |
first | Months since first purchase. |
child | Number of children’s books purchased. |
youth | Number of youth books purchased. |
cook | Number of cooking books purchased. |
diy | Number of do-it-yourself books purchased. |
art | Number of art books purchased. |
1.5 What Drives Taxi Drivers?
Background: Fraud in markets for credence goods is a problem that economists study intensively, one example being taxi fares. A field experiment investigated whether taxi drivers charge a higher price (overcharge) or take a detour (overtreatment) if their passengers are less well informed or appear to have a higher income.
Three test persons (triple) undertook test rides in the urban area of Athens. They presented themselves as local Athenians, non-local Greeks, or foreigners (origin). They either wore a suit and carried a briefcase (signalling higher income) or travelled in street clothes with a backpack (signalling lower income). For more details you may want to watch the YouTube video \(\href{https://www.youtube.com/watch?v=FjR7FI7c178}{\textit{What drives taxi drivers?}}\).
Source: Balafoutas L, Beck A, Kerschbamer R, Sutter M (2013). “What Drives Taxi Drivers? A Field Experiment on Fraud in a Market for Credence Goods.” The Review of Economic Studies (https://doi.org/10.1093/restud/rds049).
Variable | Description |
---|---|
overcharge | Was the price too high? |
overtreatment | Overtreatment index. Was there a detour? The shortest distance among the 3 test persons was chosen as the reference, i.e., an overtreatment of 1.03 means a 3% longer distance than the reference distance. |
triple | Three test persons per destination. |
origin | Passenger is a local Athenian (resident), Greek, but not local (nonresident), or foreigner (foreign). |
income | Perceived income of the passenger (high/low). |
dgender | Gender of the driver. |
dage | Estimated age of the driver. |
rushhour | Did the journey take place during rush hour? |
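The definition of the overtreatment index can be illustrated with a short Python sketch. The distances below are hypothetical values chosen for illustration (assumed to be in kilometres) and are not taken from the experiment; the function name `overtreatment_index` is likewise an assumption.

```python
def overtreatment_index(distances_km):
    """Divide each distance in a triple by the shortest distance in that triple."""
    reference = min(distances_km)
    return [d / reference for d in distances_km]


# Hypothetical distances (in km) of the three test persons for one destination.
triple = [10.0, 10.3, 12.0]
index = overtreatment_index(triple)
print([round(v, 2) for v in index])
```

The test person with the shortest ride always gets an index of 1, and a value of 1.03 corresponds to a ride that is 3% longer than the reference.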
1.6 Supervised vs. unsupervised learning
Supervised learning: The goal is to establish a good regression model or prediction model that captures how (at least) one dependent variable depends on one or more independent variables (also called explanatory variables or regressors). A typical example is linear regression.
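As a minimal illustration of supervised learning, the following Python sketch fits a linear regression with a single regressor by least squares and uses it for a prediction. The data are hypothetical toy values lying exactly on a line, and the function name `fit_simple_regression` is an assumption made for this example.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data lying exactly on the line y = 2 + 3 * x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0, 5.0, 8.0, 11.0, 14.0]
intercept, slope = fit_simple_regression(x, y)

# Prediction of the dependent variable for a new value of the regressor.
prediction = intercept + slope * 5.0
print(intercept, slope, prediction)
```

The dependent variable `y` plays a distinguished role here: the model is judged by how well it explains or predicts `y` from `x`.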
Unsupervised learning: The goal is to find and understand patterns among a number of variables, reflecting which units are rather similar or different. Usually all variables considered enter the analysis equally, i.e., there is no distinction of dependent vs. independent variables. A typical example is cluster analysis, where the units are separated into different groups (or clusters).
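A very small clustering sketch in Python may make this concrete. It uses a basic k-means loop with two clusters on one-dimensional data, which is just one of many possible clustering methods and is chosen here only for brevity. The values and the function name `two_means` are hypothetical.

```python
def two_means(values, iterations=20):
    """Split one-dimensional values into two clusters with a basic k-means loop."""
    centers = [min(values), max(values)]  # simple deterministic starting centers
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each unit joins the cluster with the nearest center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Update step: each center moves to the mean of its current members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers


# Hypothetical toy data with two visibly separated groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels, centers = two_means(values)
print(labels, centers)
```

No variable is singled out as dependent: the grouping is derived only from how similar the units are to each other.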
1.7 Model-based vs. exploratory analysis
Model-based analysis: Typically the dependencies between the variables are captured by a parametric model. Thus, the model structure is pre-specified except for a few unknown parameters that have to be estimated from the observed data. A typical example is the mean equation with unknown coefficients in linear regression.
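For a linear regression with dependent variable \(y_i\) and regressors \(x_{i1}, \dots, x_{ik}\) for unit \(i\), the mean equation could be written as follows. The symbols are notation assumed here for illustration: the linear structure is pre-specified, and only the coefficients \(\beta_0, \dots, \beta_k\) are unknown and have to be estimated from the data.

\[
\mathsf{E}(y_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}
\]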
Exploratory analysis: An exploratory analysis does not assume a particular model. Instead, graphics/visualizations as well as numeric statistics are employed to get a compact overview of the data. Exploratory analyses typically complement model-based analyses (but not necessarily vice versa). Typical examples include statistical graphics such as histograms or scatter plots.
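A small exploratory analysis can be sketched in Python using only the seven units from the data excerpt printed in Section 1.1. The choice of summaries and the 10-year bin width for the text histogram are assumptions made for this illustration.

```python
from collections import Counter
from statistics import mean, median

# Values copied from the printed excerpt of the Guest Survey Austria data.
age = [58, 50, 54, 79, 59, 67, 62]
occupation = [
    "Arbeitslos", "Angestellt", "Leit.Angest.", "Pensionist",
    "Angestellt", "Pensionist", "Selbstaendig",
]

# Numeric summary of a quantitative variable.
summary = {
    "min": min(age),
    "median": median(age),
    "mean": mean(age),
    "max": max(age),
}
print(summary)

# Frequency table of a qualitative variable.
counts = Counter(occupation)
print(counts.most_common())

# Text histogram of age in 10-year bins (50-59, 60-69, ...).
bins = Counter((a // 10) * 10 for a in age)
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'*' * bins[start]}")
```

None of these steps assumes a model. They only condense the observed values into a few numbers and a simple picture.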