Chapter 2 Software installation and first steps
2.1 R
In addition to installing R itself from the Comprehensive R Archive Network (CRAN, https://CRAN.R-project.org/), it is recommended to install the accompanying “R tools” encompassing various command line and compiler tools. These are not strictly necessary for the first steps but typically come in handy when installing packages at some point. The links for the different platforms are:
Windows: The current R installation executable is available at
https://CRAN.R-project.org/bin/windows/base/
and the corresponding R tools at
https://CRAN.R-project.org/bin/windows/Rtools/
Both are .exe files that just require a couple of clicks for successful installation.
macOS: The R binary package file is available at
https://CRAN.R-project.org/bin/macosx/
and it is recommended to also install the clang and gfortran compilers from
https://CRAN.R-project.org/bin/macosx/tools/
All of these are standard .pkg files for macOS.
Linux: While CRAN also hosts binary packages for various Linux distributions, it is typically easiest to install these from the respective package repositories with apt or yum etc. In addition to the R base core package, the installation of the corresponding developer/header files is recommended. For example, on Ubuntu or Debian this typically corresponds to installing:
apt-get install r-base-core r-base-dev
On top of the R base system there are several user interfaces and development
environments that make using R easier and more convenient, especially for
beginners. For example, those familiar with Emacs will be interested in using
ESS. The most popular integrated development
environment for R, though, is RStudio. This is also
used in the lecture and tutorial sessions for the course. Binaries can easily be
installed from
https://rstudio.com/products/rstudio/download/#download
2.1.1 R as…
R can be used in a number of different ways:
…as a calculator: For simple arithmetic.
## [1] 2
## [1] 4
…for vectorized arithmetic:
## [1] 2.0 9.5 17.0
## [1] 1.0000 12.1825 148.4132
…as a matrix-based language: For example, for matrix products, solving systems of equations, etc.
##      [,1]
## [1,]  2.0
## [2,]  9.5
## [3,] 17.0
## x
## 2 3 0
…as a statistics system: For example, for simulations, for fitting regression models, and for statistical inference.
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##       1       2       3
##  0.0551 -0.1101  0.0551
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.7835     0.1231    14.5   0.0439
## x             2.8286     0.0381    74.2   0.0086
##
## Residual standard error: 0.135 on 1 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.5e+03 on 1 and 1 DF, p-value: 0.00859
Note that due to the random normal (rnorm) error term, results may vary
a bit when running the code above.
Moreover, R is a fully-fledged programming language and comes with a powerful visualization system.
2.1.2 First steps
A more detailed introduction providing useful background information and a guided tour of the most important first steps is available in the sibling DiSC course Introduction to Programming with R in Chapters 1-3.
2.2 Python
For the examples it is necessary to install Python 3. A very useful beginner’s guide, covering the steps from download to basic programming, is available at https://wiki.python.org/moin/BeginnersGuide/Download.
Additionally, we recommend using the Python package installer pip for package
management and Jupyter notebooks for interactive programming. These are not
strictly necessary for the examples, though, and other development environments
(like PyCharm or Visual Studio Code or simply the command line) could be used as well.
Note: Some operating systems (e.g., macOS or certain flavors of Linux) aim to
facilitate using several Python versions in parallel. Hence, these systems provide
dedicated commands python3 and pip3, respectively, while python and pip refer to
the version 2 commands. In the following paragraphs, when we refer to python or pip,
the commands therefore have to be replaced with python3 and pip3, respectively.
The version of a given command can be queried using the --version option, e.g.,
python --version
## Python 3.12.4
Similarly, on Windows there is the command py which essentially calls python but
additionally tries to avoid certain path problems etc. If available, you can replace
the python command with py in the subsequent paragraphs.
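The version can also be checked from within Python itself. The following is a minimal sketch using only the standard library; the explicit check for major version 3 reflects the requirement stated at the start of this section.

```python
import platform
import sys

# Version string, comparable to what the --version option reports.
print("Python", platform.python_version())

# The examples in this chapter assume Python 3.
if sys.version_info.major < 3:
    raise SystemExit("Python 3 is required for the examples")
```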
2.2.1 Virtual environment
Python 3 has the built-in capability to set up a virtual environment that ensures a certain
setup for a project. Therefore, when working through this book it is helpful to set up
a dedicated virtual environment for it, e.g., calling it dataanalytics. On the terminal/shell
or command prompt one can use the following command in a dedicated (typically new) directory:
python -m venv dataanalytics
This sets up a new virtual environment and only has to be done once initially.
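The creation step can also be scripted with the standard-library venv module. The following is a minimal sketch, not the book's own code: the helper name create_env is hypothetical, and a temporary directory stands in for the project directory.

```python
import tempfile
import venv
from pathlib import Path


def create_env(project_dir, name="dataanalytics", with_pip=False):
    """Create a virtual environment `name` inside `project_dir` (hypothetical helper)."""
    env_dir = Path(project_dir) / name
    # venv.create is the programmatic counterpart of `python -m venv <dir>`;
    # with_pip=True would additionally bootstrap pip into the environment.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = create_env(tmp)
        # Every environment carries a pyvenv.cfg file describing its setup.
        print("pyvenv.cfg present:", (env_dir / "pyvenv.cfg").exists())
```

For actual use, the temporary directory would be replaced by the project directory, and with_pip=True is usually the more convenient choice.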
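Subsequently, the following command has to be used at the beginning of each session to activate this virtual environment (the first form applies to macOS or Linux, the second to the Windows command prompt):
source dataanalytics/bin/activate
dataanalytics\Scripts\activate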
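Whether the activation took effect can be checked from within Python. The following is a minimal sketch, relying on the standard-library convention that sys.prefix differs from sys.base_prefix inside a virtual environment; the helper name is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points to the environment directory,
    # while sys.base_prefix still points to the base installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_environment())
print("interpreter prefix:", sys.prefix)
```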
The following introductory tutorials provide further help for working with commands and directories on the terminal/shell (typically on macOS or Linux) or the Windows command prompt, respectively:
2.2.2 Packages
The basic Python distribution contains relatively little functionality that is dedicated to data analysis. However, there is a wide range of contributed packages, typically available from the Python Package Index (PyPI, https://pypi.org/). For this book we mostly rely on a small set of packages only:
numpy for basic numeric infrastructure,
pandas for data handling,
statsmodels for statistical modeling,
matplotlib and seaborn for visualization.
Additionally, a couple of packages are employed for more specific tasks:
scipy for further numeric infrastructure,
scikit-learn for additional statistical models,
pca for principal component analysis,
plotnine for refined visualizations,
r-functions for calling R functions from Python.
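Which of these packages are already present in the current environment can be checked from Python. The following is a minimal sketch using the standard-library importlib.metadata module (Python 3.8 or later); it looks up the distribution names as listed above, and installed_versions is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names as listed in this section (PyPI names, not import names).
PACKAGES = [
    "numpy", "pandas", "statsmodels", "matplotlib", "seaborn",
    "scipy", "scikit-learn", "pca", "plotnine", "r-functions",
]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:15s} {version or 'not installed'}")
```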
To install these packages, the Python package installer pip (or pip3, see above)
can be used. To check whether pip is already available on the terminal/shell:
## /usr/bin/pip
If it is not available, it can be installed via:
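One possible route, assuming the interpreter ships the standard-library ensurepip module (some Linux distributions package it separately), is python -m ensurepip --upgrade. The following minimal sketch checks for pip from within Python and only prints that suggestion if pip is missing; pip_status is a hypothetical helper name.

```python
import importlib.util
import shutil
import sys


def pip_status():
    """Report where a pip executable is found and whether the pip module is importable."""
    return {
        # Counterpart of looking up the command on the PATH (None if absent).
        "executable": shutil.which("pip") or shutil.which("pip3"),
        # True if `python -m pip` would work for the running interpreter.
        "module": importlib.util.find_spec("pip") is not None,
    }


status = pip_status()
print(status)
if not status["module"]:
    # Assumption: ensurepip is available for this interpreter.
    print(f"pip is missing; try: {sys.executable} -m ensurepip --upgrade")
```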
Once pip is ready, it can be used to list, install, or upgrade packages.
However, when using a virtual environment (like dataanalytics, see above),
make sure to activate it prior to package installation. Otherwise packages
will be installed globally in the base environment.
To install all of the Python packages used in this book in one go, it is convenient to use the following requirements.txt file.
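As a minimal sketch, such a file can simply list the package names from this section, one per line. The version below leaves all versions unpinned (an assumption; the book's own file may pin them), and write_requirements is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

# Package names as listed in this section; versions are left unpinned (assumption).
PACKAGES = [
    "numpy", "pandas", "statsmodels", "matplotlib", "seaborn",
    "scipy", "scikit-learn", "pca", "plotnine", "r-functions",
]


def write_requirements(directory, packages=PACKAGES):
    """Write one package name per line to <directory>/requirements.txt."""
    path = Path(directory) / "requirements.txt"
    path.write_text("\n".join(packages) + "\n", encoding="utf-8")
    return path


# Demonstration in a temporary directory, so no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    print(write_requirements(tmp).read_text(encoding="utf-8"))
```

The resulting file can then be passed to pip via pip install -r requirements.txt, after activating the virtual environment.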
2.2.3 Jupyter
The Jupyter Notebook (https://jupyter.org/) is an open-source web-based environment for interactive programming. It allows creating and sharing documents that contain live code, equations, visualizations, and narrative text. It can be installed (after activating the virtual environment) via:
Alternatively, it can be downloaded and installed from https://jupyter.org/install.html.
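An installation via pip can also be triggered from Python, which ties the call to the running interpreter and thus to the active virtual environment. The following is a minimal sketch; the helper names are hypothetical, and the distribution name notebook is an assumption about which PyPI package to install.

```python
import subprocess
import sys


def pip_install_command(*packages):
    """Build a pip call tied to the running interpreter (and thus the active environment)."""
    return [sys.executable, "-m", "pip", "install", *packages]


def install(*packages):
    """Run the installation; raises CalledProcessError if pip fails."""
    subprocess.run(pip_install_command(*packages), check=True)


# Assumption: the PyPI distribution "notebook" provides the Jupyter Notebook.
print(" ".join(pip_install_command("notebook")))
# install("notebook")  # uncomment to actually install (needs network access)
```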
To run a Jupyter session from the terminal/shell, go to the working directory
where the virtual environment is located, activate the virtual environment,
and then start the notebook in a dedicated directory (say my_notebook_dir):
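The start-up step can be sketched in Python as follows, assuming that the jupyter command is on the PATH of the activated environment and that the server should use my_notebook_dir as its working directory. Both helper names are hypothetical, and the actual launch is left commented out because it blocks until the server is stopped.

```python
import subprocess
from pathlib import Path


def launch_plan(notebook_dir="my_notebook_dir"):
    """Return the command and working directory for a notebook session."""
    # `jupyter notebook` serves the directory it is started in.
    return ["jupyter", "notebook"], Path(notebook_dir)


def start_notebook(notebook_dir="my_notebook_dir"):
    """Create the directory if needed and start the server (blocks until stopped)."""
    command, directory = launch_plan(notebook_dir)
    directory.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, cwd=directory, check=True)


command, directory = launch_plan()
print(f"cd {directory} && {' '.join(command)}")
# start_notebook()  # uncomment to launch; requires an installed Jupyter Notebook
```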