Chapter 1 Introduction

1.1 Data, data analysis, data analytics

Data analysis: This course covers various methods for data analysis and statistical modeling (or statistical learning). The goal of the analysis is to gain compact insight into the patterns and dependencies in the data and, possibly, to use this insight as a basis for forecasts.

Data: The data are the basis of all modeling/learning. Fundamental concepts are observational units and variables. The values or realizations of the variables are measured (collected) for each unit, resulting in the data or observations.

Data analytics: Often used as an alternative term for data analysis for decision making, especially in a business intelligence context. The term emphasizes the combination of statistics with computer programming as well as a focus on continuous reporting and prediction.

Data matrix: Statistical data usually have a rectangular shape, called a data matrix or data frame, where rows pertain to units and columns to variables.

##   expenditure age gender   occupation
## 1     604.800  58   male   Arbeitslos
## 2    1346.000  50   male   Angestellt
## 3     968.000  54   male Leit.Angest.
## 4     513.700  79   male   Pensionist
## 5     488.571  59   male   Angestellt
## 6     332.410  67   male   Pensionist
## 7    1775.000  62   male Selbstaendig
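
The same layout can be sketched in a few lines of Python using only the standard library. The names `variables` and `units` are chosen here for illustration and the values are copied from the printed excerpt; this is a minimal sketch of the rows-as-units, columns-as-variables idea, not the way the course files are read.

```python
# Column names (variables) and rows (units), copied from the printed excerpt.
variables = ["expenditure", "age", "gender", "occupation"]
units = [
    (604.800, 58, "male", "Arbeitslos"),
    (1346.000, 50, "male", "Angestellt"),
    (968.000, 54, "male", "Leit.Angest."),
    (513.700, 79, "male", "Pensionist"),
    (488.571, 59, "male", "Angestellt"),
    (332.410, 67, "male", "Pensionist"),
    (1775.000, 62, "male", "Selbstaendig"),
]

# A row holds the values of all variables for one unit.
first_unit = dict(zip(variables, units[0]))

# A column holds the values of one variable across all units.
age = [row[variables.index("age")] for row in units]

print(len(units), "units x", len(variables), "variables")
print(first_unit)
print(age)
```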

Scale: The choice of suitable statistical methods depends crucially on the measurement scale of the variables to be analyzed. A common distinction is between qualitative (or categorical) and quantitative (numeric, metric, continuous) variables.
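
As a rough illustration, the following Python sketch assigns a scale to each variable of the excerpt above based on how its values are stored. The function `measurement_scale` is a hypothetical helper and only a crude heuristic: numeric codes can also stand for categories, so the scale ultimately has to come from knowledge about the data.

```python
from numbers import Number

# First three rows of the printed excerpt, arranged by variable.
columns = {
    "expenditure": [604.800, 1346.000, 968.000],
    "age": [58, 50, 54],
    "gender": ["male", "male", "male"],
    "occupation": ["Arbeitslos", "Angestellt", "Leit.Angest."],
}


def measurement_scale(values):
    """Crude heuristic: all-numeric columns are quantitative, all others qualitative."""
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "quantitative"
    return "qualitative"


scales = {name: measurement_scale(values) for name, values in columns.items()}
print(scales)
```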

1.2 Guest Survey Austria Data

Background: The Guest Survey Austria encompasses 14,571 tourists who visited Austria in the summer seasons of 1994 and 1997. In addition to socio-demographic data such as age, gender, and occupation, the survey included questions about preferred summer activities (hiking, sightseeing, swimming, etc.) and motives for selecting the destination (relaxing, sports, culture, etc.).

Source: Guest Survey Austria, http://www.tourmis.info/.

Files: GSA.csv, GSA.rda.

Table 1.1: Variables in the Guest Survey Austria Dataset
Variable	Description
`age`	Age in years.
`occupation`	Job or occupation.
`income`	Monthly household income in ATS.
`gender`	Gender.
`country`	Country of origin.
`expenditure`	Average expenditures per day in ATS.
`province`	Province of destination.
`accomodation`	Type of accommodation.
`year`	Year of visit (1994, 1997).
`SAnn.xxx`	Engaged in Summer Activity (`SA`)?
`Mnn.xxx`	Was the Motive (`M`) a reason for visiting Austria or not?

1.3 Myopic Loss Aversion Data

Background: Myopic loss aversion is a phenomenon in behavioral economics in which individuals do not behave in an economically rational way when making short-term decisions under uncertainty. Example: In lotteries with a positive expected return, investments are lower than the maximum possible (loss aversion). This effect is stronger for short-term investments (myopia).

Experiment: Sutter & Glätzle-Rützler (UIBK). Pupils (Schwaz/Innsbruck) could invest \(X\) points (0–100) in a lottery in each of 9 rounds. Payout in points: \(100 + 2.5 \cdot X\) with probability 1/3 and \(100 - X\) with probability 2/3 (expectation: \(100 + 1/6 \cdot X\)). The investments could either be modified in every round (treatment “short”) or only in rounds 1, 4, and 7 (treatment “long”). Decisions were made either alone or in teams of two. Payout in money: EUR 0.5 per 100 points for lower grades (Unterstufe, 6–8) or EUR 1.0 per 100 points for upper grades (Oberstufe, 10–12).
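
The stated expectation can be checked with exact fractions. The following Python sketch (the function name `expected_points` is chosen here for illustration) weights the two payouts by their probabilities and compares the result with \(100 + 1/6 \cdot X\).

```python
from fractions import Fraction


def expected_points(x):
    """Expected points per round when investing x of the available 100 points."""
    win = 100 + Fraction(5, 2) * x   # payout with probability 1/3
    lose = 100 - x                   # payout with probability 2/3
    return Fraction(1, 3) * win + Fraction(2, 3) * lose


for x in (0, 60, 100):
    # Agrees with the closed form from the text: 100 + x/6.
    assert expected_points(x) == 100 + Fraction(x, 6)
    print(x, float(expected_points(x)))
```

Because the expected payout increases with \(X\), a risk-neutral player would invest all 100 points in every round; investing less is what the text above refers to as loss aversion.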

Source: Glätzle-Rützler D, Sutter M, Zeileis A (2015). “No Myopic Loss Aversion in Adolescents? – An Experimental Note”, Journal of Economic Behavior & Organization, 111, 169–176.

Files: MLA.csv, MLA.rda.

Table 1.2: Variables in the Myopic Loss Aversion Dataset
Variable	Description
`invest`	Average invested points across all 9 rounds.
`gender`	Gender of (team of) player(s).
`male`	Was (at least one of) the player(s) male (in the team)?
`age`	Age in years (averaged for teams).
`treatment`	Type of treatment (long vs. short).
`grade`	School grades (6–8, 11–14 years vs. 10–12, 15–18 years).
`arrangement`	Single player vs. team of two.

1.4 Bookbinder’s Book Club Data

Background: The Bookbinder’s Book Club data are a case study about a (fictional) American book club that sent a flyer about the book “The Art History of Florence” to 20,000 customers. Of these, 1,806 went on to buy the book. The BBB Club has collected various variables for predicting a customer’s purchase decision. Here, a subsample of 1,300 units (customers) and 11 variables is considered.

Source: Lilien & Rangaswamy (2004).

Files: BBBClub.csv, BBBClub.rda.

Table 1.3: Variables in the Bookbinder’s Book Club Dataset
Variable	Description
`choice`	Did the customer buy the book “The Art History of Florence”?
`gender`	Gender.
`amount`	Total amount spent at BBB Club.
`freq`	Total frequency of purchases at BBB Club.
`last`	Months since last purchase.
`first`	Months since first purchase.
`child`	Number of children’s books purchased.
`youth`	Number of youth books purchased.
`cook`	Number of cooking books purchased.
`diy`	Number of do-it-yourself books purchased.
`art`	Number of art books purchased.

1.5 What Drives Taxi Drivers?

Background: Fraud in markets for credence goods is a problem that economists study intensively, taxi rides being one example. A field experiment investigated whether taxi drivers charge a higher price (overcharging) or take a detour (overtreatment) if their passengers appear to be less well informed or appear to have a higher income.

Three test subjects (triple) undertook test rides in the urban area of Athens. They presented themselves as either local Athenians, Greeks from outside Athens, or foreigners (origin). They either wore a suit and carried a briefcase (signal of higher income) or travelled in street clothes with a backpack (signal of lower income). For more details, see the YouTube video “What drives taxi drivers?” (https://www.youtube.com/watch?v=FjR7FI7c178).

Source: Balafoutas L, Beck A, Kerschbamer R, Sutter M (2013). “What Drives Taxi Drivers? A Field Experiment on Fraud in a Market for Credence Goods.” The Review of Economic Studies (https://doi.org/10.1093/restud/rds049).

Files: Taxi.csv, Taxi.rda.

Table 1.4: Variables in the Taxi Drivers Dataset
Variable	Description
`overcharge`	Was the price too high?
`overtreatment`	Overtreatment index. Was there a detour? The shortest distance among the 3 test persons was chosen as the reference, i.e., an overtreatment of 1.03 means a 3% longer distance than the reference distance.
`triple`	Three test persons per destination.
`origin`	Passenger is a local Athenian (resident), a Greek who is not local (nonresident), or a foreigner (foreign).
`income`	Perceived income of the passenger (high/low).
`dgender`	Gender of the driver.
`dage`	Estimated age of the driver.
`rushhour`	Did the journey take place during rush hour?
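
The construction of the overtreatment index described in the table can be sketched as follows in Python. The function name `overtreatment_index` and the three distances are hypothetical and serve only to illustrate the definition; they are not taken from the dataset.

```python
def overtreatment_index(distances_km):
    """Distance of each ride relative to the shortest ride within the triple."""
    reference = min(distances_km)
    return [d / reference for d in distances_km]


# Hypothetical distances (in km) driven for the three test persons of one triple.
triple_distances = [10.0, 10.3, 12.0]
index = [round(v, 2) for v in overtreatment_index(triple_distances)]
print(index)
```

The ride with the shortest distance gets an index of 1, and a value of 1.03 corresponds to a 3% longer distance than the reference.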

1.6 Supervised vs. unsupervised learning

Supervised learning: The goal is to establish a good regression model or prediction model that captures how (at least) one dependent variable depends on one or more independent variables (also called explanatory variables or regressors). A typical example is linear regression.
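
A minimal Python sketch of a linear regression with a single regressor is given below. It uses the seven units from the data matrix excerpt in Section 1.1, with `expenditure` as the dependent and `age` as the independent variable; the helper `simple_ols` is written here for illustration and implements the usual least-squares formulas. With only seven units the fit illustrates the mechanics and says nothing reliable about the full survey.

```python
# Seven units from the data matrix excerpt in Section 1.1.
age = [58, 50, 54, 79, 59, 67, 62]
expenditure = [604.800, 1346.000, 968.000, 513.700, 488.571, 332.410, 1775.000]


def simple_ols(x, y):
    """Least-squares intercept and slope for the model y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = simple_ols(age, expenditure)

# Fitted values and residuals (observed minus fitted) for each unit.
fitted = [intercept + slope * a for a in age]
residuals = [y - f for y, f in zip(expenditure, fitted)]

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```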

Unsupervised learning: The goal is to find and understand patterns among a number of variables, reflecting which units are rather similar or different. Usually all variables considered enter the analysis equally, i.e., there is no distinction of dependent vs. independent variables. A typical example is cluster analysis, where the units are separated into different groups (or clusters).
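
To make the idea of clustering concrete, the following Python sketch splits the seven `expenditure` values from the data matrix excerpt in Section 1.1 into two groups. It uses a simple two-group variant of k-means in one dimension, started from the smallest and largest value; both the function `two_means_1d` and the choice of this particular algorithm are assumptions made here for illustration.

```python
def two_means_1d(values, max_iter=100):
    """Split values into two clusters by alternating assignment and mean updates."""
    low_center, high_center = min(values), max(values)
    if low_center == high_center:
        raise ValueError("need at least two distinct values")
    for _ in range(max_iter):
        # Assign every value to the nearer of the two cluster centers.
        low = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        # Update each center to the mean of its cluster.
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (low_center, high_center):
            break
        low_center, high_center = new_low, new_high
    return low, high


expenditure = [604.800, 1346.000, 968.000, 513.700, 488.571, 332.410, 1775.000]
low, high = two_means_1d(expenditure)
print("lower cluster: ", sorted(low))
print("higher cluster:", sorted(high))
```

Note that no dependent variable is involved: the grouping is based only on how similar the units are to each other.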

1.7 Model-based vs. exploratory analysis

Model-based analysis: Typically, the dependencies between the variables are captured by a parametric model. Thus, the model structure is pre-specified except for a few unknown parameters that have to be estimated from the observed data. A typical example is the mean equation with unknown coefficients in linear regression.
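
For concreteness, the mean equation of a linear regression with \(k\) regressors can be written as follows. The notation is introduced here for illustration: \(y_i\) is the dependent variable for unit \(i\), \(x_{i1}, \dots, x_{ik}\) are the regressors, and \(\beta_0, \dots, \beta_k\) are the unknown coefficients to be estimated.

\[
\mathsf{E}(y_i \mid x_{i1}, \dots, x_{ik}) = \beta_0 + \beta_1 \cdot x_{i1} + \dots + \beta_k \cdot x_{ik}
\]

The linear form of the right-hand side is the pre-specified model structure; only the \(k + 1\) coefficients are estimated from the data.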

Exploratory analysis: An exploratory analysis does not assume a particular model. Instead, graphics (visualizations) and numeric summary statistics are employed to get a compact overview of the data. Exploratory analyses typically complement model-based analyses (but not necessarily vice versa). Typical examples include statistical graphics such as histograms or scatter plots.
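
A minimal exploratory look at the seven units from the data matrix excerpt in Section 1.1 can be sketched in Python with the standard library alone: a few summary statistics for the quantitative variable `expenditure` and a frequency table with a text bar chart for the qualitative variable `occupation`. The names `summary` and `counts` are chosen here for illustration.

```python
from collections import Counter
from statistics import mean, median

# Seven units from the data matrix excerpt in Section 1.1.
expenditure = [604.800, 1346.000, 968.000, 513.700, 488.571, 332.410, 1775.000]
occupation = ["Arbeitslos", "Angestellt", "Leit.Angest.", "Pensionist",
              "Angestellt", "Pensionist", "Selbstaendig"]

# Numeric summary of a quantitative variable.
summary = {
    "min": min(expenditure),
    "median": median(expenditure),
    "mean": mean(expenditure),
    "max": max(expenditure),
}
print(summary)

# Frequency table of a qualitative variable, drawn as a simple text bar chart.
counts = Counter(occupation)
for level, n in counts.most_common():
    print(f"{level:<13}{'#' * n}")
```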