Chapter 2 Software installation and first steps

2.1 R

In addition to installing R itself from the Comprehensive R Archive Network (CRAN, https://CRAN.R-project.org/), it is recommended to install the accompanying “R tools” encompassing various command line and compiler tools. These are not strictly necessary for the first steps but typically come in handy when installing packages at some point. The links for the different platforms are:

Windows: The current R installation executable is available at
https://CRAN.R-project.org/bin/windows/base/
and the corresponding R tools at
https://CRAN.R-project.org/bin/windows/Rtools/
Both are .exe files that just require a couple of clicks for successful installation.
macOS: The R binary package file is available at
https://CRAN.R-project.org/bin/macosx/
and it is recommended to also install the clang and gfortran compilers from
https://CRAN.R-project.org/bin/macosx/tools/
All of these are standard .pkg files for macOS.
Linux: While CRAN also hosts binary packages for various Linux distributions, it is typically easiest to install these from the respective package repositories with apt or yum etc. In addition to the R base core package, the installation of the corresponding developer/header files is recommended. For example, on Ubuntu or Debian this typically corresponds to installing:
```
apt-get install r-base-core r-base-dev
```

On top of the R base system there are several user interfaces and development environments that make using R easier and more convenient, especially for beginners. For example, those familiar with Emacs will be interested in using ESS. The most popular integrated development environment for R, though, is RStudio. This is also used in the lecture and tutorial sessions for the course. Binaries can easily be installed from
https://rstudio.com/products/rstudio/download/#download

2.1.1 R as…

R can be used in a number of different ways:

…as a calculator: For simple arithmetic.

1 + 1

## [1] 2

log(16, base = 2)

## [1] 4

…for vectorized arithmetic:

x <- c(0, 2.5, 5)
y <- 3 * x + 2
y

## [1]  2.0  9.5 17.0

exp(x)

## [1]   1.0000  12.1825 148.4132

…as a matrix-based language: For example, for matrix products, solving systems of equations, etc.

X <- cbind(1, x, x^2)
X %*% c(2, 3, 0)

##      [,1]
## [1,]  2.0
## [2,]  9.5
## [3,] 17.0

solve(X, y)

##   x   
## 2 3 0

…as a statistics system: For example, for simulations, for fitting regression models, and for statistical inference.

y <- y + rnorm(3)
m <- lm(y ~ x)
summary(m)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##       1       2       3 
##  0.0551 -0.1101  0.0551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.7835     0.1231    14.5   0.0439
## x             2.8286     0.0381    74.2   0.0086
## 
## Residual standard error: 0.135 on 1 degrees of freedom
## Multiple R-squared:     1,   Adjusted R-squared:     1 
## F-statistic: 5.5e+03 on 1 and 1 DF,  p-value: 0.00859

Note that due to the random normal (rnorm) error term, results will vary a bit each time the code above is run. Calling set.seed() with a fixed integer beforehand makes the random draws reproducible.

Moreover, R is a fully-fledged programming language and comes with a powerful visualization system.

2.1.2 First steps

A more detailed introduction providing useful background information and a guided tour of the most important first steps is available in the sibling DiSC course Introduction to Programming with R in Chapters 1-3.

2.2 Python

For the examples it is necessary to install Python 3. A very useful beginner’s guide, covering the steps from download to basic programming, is available at https://wiki.python.org/moin/BeginnersGuide/Download.

Additionally, we recommend using the Python package installer pip for package management and Jupyter notebooks for interactive programming. These are not strictly necessary for the examples, though, and other development environments (like PyCharm, Visual Studio Code, or simply the command line) can be used as well.

Note: Some operating systems (e.g., macOS or certain flavors of Linux) aim to facilitate using several Python versions in parallel. Hence, these systems provide the dedicated commands python3 and pip3, while python and pip refer to the version 2 commands. On such systems, whenever the following paragraphs refer to python or pip, these commands have to be replaced by python3 and pip3, respectively. The version of a given command can be queried using the --version option, e.g.,

python --version

## Python 3.12.4

Similarly, on Windows there is the command py which essentially calls python but additionally tries to avoid certain path problems etc. If available, you can replace the python command with py in the subsequent paragraphs.
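The version can also be checked from inside Python itself, which is convenient in a script or notebook. The following is a minimal sketch using only the standard library; the helper name is_python3 is hypothetical and not part of any package.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True if the given version tuple belongs to Python 3."""
    return version_info[0] == 3

# Version number of the interpreter that is running this code
print(sys.version.split()[0])
print(is_python3())
```

If the second line printed is False, the interpreter behind the command is still Python 2 and the python3 command should be used instead.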

2.2.1 Virtual environment

Python 3 has the built-in capability to set up a virtual environment that ensures a certain setup for a project. Therefore, when working through this book it is helpful to set up a dedicated virtual environment for it, e.g., calling it dataanalytics. On the terminal/shell or command prompt one can use the following command in a dedicated (typically new) directory:

python -m venv dataanalytics

This sets up a new virtual environment and only has to be done once initially.
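For illustration, the same step can be carried out from Python with the standard library module venv, which is what the command above calls. This is only a sketch: the helper name create_environment is hypothetical, the environment is created in a temporary directory that is deleted again, and pip is skipped to keep the example fast (the command line version sets up pip by default).

```python
import tempfile
import venv
from pathlib import Path

def create_environment(parent_dir, name="dataanalytics"):
    """Create a virtual environment called `name` inside `parent_dir`."""
    env_dir = Path(parent_dir) / name
    # with_pip=False skips the pip bootstrap to keep this sketch fast
    venv.create(env_dir, with_pip=False)
    return env_dir

# Demonstrate in a throwaway directory so that nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_environment(tmp)
    # Every virtual environment contains a pyvenv.cfg configuration file
    created = (env_dir / "pyvenv.cfg").exists()
    print(created)
```

For actual work with this book, the command line version is the simpler choice because it also makes pip available inside the environment.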

Subsequently, the following command has to be used at the beginning of each session to activate this virtual environment (on macOS or Linux; on Windows the activation script is dataanalytics\Scripts\activate instead):

source dataanalytics/bin/activate
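To verify that the activation worked, one can ask the running Python interpreter whether it belongs to a virtual environment. The following sketch relies on the fact that sys.prefix and sys.base_prefix differ inside an environment created by venv; the helper name in_virtual_environment is hypothetical.

```python
import sys

def in_virtual_environment(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True if `prefix` differs from `base_prefix`, as it does inside a venv."""
    return prefix != base_prefix

print(in_virtual_environment())
# Directory of the environment (or of the base installation if none is active)
print(sys.prefix)
```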

The following introductory tutorials provide further help for working with commands and directories on the terminal/shell (typically on macOS or Linux) or the Windows command prompt, respectively:

2.2.2 Packages

The basic Python distribution contains relatively little functionality that is dedicated to data analysis. However, there is a wide range of contributed packages, typically available from the Python Package Index (PyPI, https://pypi.org/). For this book we mostly rely on a small set of packages only:

numpy for basic numeric infrastructure,
pandas for data handling,
statsmodels for statistical modeling,
matplotlib and seaborn for visualization.

Additionally, a couple of packages are employed for more specific tasks:

scipy for further numeric infrastructure,
scikit-learn for additional statistical models,
pca for principal component analysis,
plotnine for refined visualizations,
r-functions for calling R functions from Python.

To install these packages the Python package installer pip (or pip3, see above) can be used. To check whether pip is already available on the terminal/shell:

which pip

## /usr/bin/pip

If it is not available, it can be installed via:

python -m ensurepip --upgrade
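The which command is not available on the Windows command prompt, so a platform-independent check from within Python can be helpful. The sketch below uses only the standard library; the helper name pip_status is hypothetical.

```python
import importlib.util
import shutil

def pip_status():
    """Report where pip is found on the PATH and whether it can be imported."""
    return {
        # Path of the `pip` command, or None if it is not on the PATH
        "executable": shutil.which("pip"),
        # True if the pip module can be found by the running interpreter
        "module": importlib.util.find_spec("pip") is not None,
    }

print(pip_status())
```

If the module is found but no executable is on the PATH, pip can still be called as python -m pip.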

Once pip is ready, it can be used to list, install, or upgrade packages.

pip list
pip install pandas
pip install --upgrade pandas

However, when using a virtual environment (like dataanalytics, see above), make sure to activate it prior to package installation. Otherwise packages will be installed globally in the base environment.
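To see which of the main packages are already installed in the currently active environment, the standard library module importlib.metadata can be queried. This is a minimal sketch: the helper name installed_versions is hypothetical and the list only contains the package names mentioned earlier in this section.

```python
from importlib.metadata import PackageNotFoundError, version

# Main packages as listed earlier in this section
PACKAGES = ["numpy", "pandas", "statsmodels", "matplotlib", "seaborn"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Running this once before and once after activating the virtual environment shows whether a package was installed globally or inside the environment.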

To install all of the Python packages used in this book in one go, it is convenient to use the following requirements.txt file.

pip install -r requirements.txt

2.2.3 Jupyter

The Jupyter Notebook (https://jupyter.org/) is an open-source web-based environment for interactive programming. It allows creating and sharing documents that contain live code, equations, visualizations, and narrative text. It can be installed (after activating the virtual environment) via:

pip install jupyter notebook

Alternatively, it can be downloaded and installed from https://jupyter.org/install.html.

To run a Jupyter session from the terminal/shell, go to the working directory where the virtual environment is located, activate the virtual environment, and then start the notebook in a dedicated directory (say my_notebook_dir):

source dataanalytics/bin/activate
jupyter notebook --notebook-dir=my_notebook_dir/
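Once the notebook is running, a first cell might mirror the calculator and vectorized arithmetic examples from the R section above. The following sketch deliberately uses only the standard library so that it works before any contributed package is installed; with numpy available, the element-wise computation would typically be written with arrays instead of a list comprehension.

```python
import math

# Simple arithmetic, as in the R calculator example
print(1 + 1)
print(math.log2(16))

# Element-wise arithmetic with a list comprehension
x = [0.0, 2.5, 5.0]
y = [3 * xi + 2 for xi in x]
print(y)
```

This prints 2, 4.0, and [2.0, 9.5, 17.0], matching the results obtained in R.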