Further Concepts & Libraries - Exercises
Goals
- Gain an awareness of Python libraries, their uses and potential.
- Be able to install (where appropriate) and import libraries.
- Practise using and writing code using these libraries.
Libraries
In Python, libraries are packages of pre-written code that usually contain functions and objects that can be imported into your own code. They are usually very efficient implementations of algorithms, models and procedures. In general, if there is a library that does something you would like to implement yourself, it is better to use the library, as it is probably well written, verified and tested, and efficient.
In this course you have already used some libraries: Ciw for simulating queueing systems, networkx for networks, numpy for reading in files (but it is used for much more than this!), random for generating random numbers, and math for mathematical operations and constants.
There are two kinds of Python library:
- Those in the standard Python library (e.g. math). These come pre-installed with every distribution of Python, so there is no need to install them. Many are written by the same people who write Python itself.
- Those not in the standard Python library (e.g. Ciw). In order to use these you will have to install them separately using pip.
Having said this, in this course we are using an Anaconda distribution of Python, which comes pre-installed with a number of popular libraries that are not in the standard Python library.
This tutorial is a demonstration of the uses of a number of libraries you may find useful during your studies. We will look at:
- numpy for matrix and numerical operations,
- matplotlib for creating plots,
- pandas for data analysis,
- scipy for statistical testing and other scientific procedures,
- scikit-learn for machine learning.
Numpy
The power of this library is its efficiency in carrying out numeric computations, especially linear algebraic manipulation.
For example consider two matrices:
\[\mathbf{A} = \begin{pmatrix}3&4&-2\\2&1&-1\\7&-10&-5\end{pmatrix}\]and
\[\mathbf{B} = \begin{pmatrix}1&5&4\end{pmatrix}\]in numpy:
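As a minimal sketch, the two matrices above can be entered as two-dimensional numpy arrays. The choice of np.array (rather than any other numpy type) is an assumption made here for illustration.

```python
import numpy as np

# A is a 3x3 matrix; B is a 1x3 matrix (a single row).
A = np.array([[3, 4, -2], [2, 1, -1], [7, -10, -5]])
B = np.array([[1, 5, 4]])

print(A.shape)  # (3, 3)
print(B.shape)  # (1, 3)
```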
We can access the various elements of these matrices using indexing (like we have seen before with lists):
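A short sketch of list-style indexing, assuming the matrix is stored as a numpy array: index once to get a row, then again to get an entry of that row.

```python
import numpy as np

A = np.array([[3, 4, -2], [2, 1, -1], [7, -10, -5]])

first_row = A[0]    # the first row, as a one-dimensional array
element = A[2][1]   # the third row, then the second entry of that row
print(element)      # -10
```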
But a more efficient way to index is to use numpy indexing, like so:
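With numpy indexing the row and column go inside a single pair of square brackets, and slices can pick out whole rows, columns or blocks. The following is an illustrative sketch using the matrix A from above.

```python
import numpy as np

A = np.array([[3, 4, -2], [2, 1, -1], [7, -10, -5]])

element = A[2, 1]        # third row, second column
first_column = A[:, 0]   # every row, first column
block = A[0:2, 1:3]      # first two rows, last two columns
print(element)           # -10
```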
We can perform scalar multiplication:
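For example, multiplying an array by a number multiplies every entry. The scalar 4 is an arbitrary choice for this sketch.

```python
import numpy as np

A = np.array([[3, 4, -2], [2, 1, -1], [7, -10, -5]])

scaled = 4 * A   # every entry of A multiplied by 4
print(scaled)
```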
Raise to powers:
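One way to do this, assuming A is a numpy array, is np.linalg.matrix_power. Note that for arrays the ** operator raises each entry to the power separately, which is not the same as the matrix power.

```python
import numpy as np

A = np.array([[3, 4, -2], [2, 1, -1], [7, -10, -5]])

A_squared = np.linalg.matrix_power(A, 2)   # the matrix product of A with itself
entrywise_squares = A ** 2                 # each entry squared, NOT the matrix power
print(A_squared)
```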
Perform matrix addition:
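A small sketch of addition, here adding A to itself so that both matrices have the same shape (that choice of operands is an assumption for illustration):

```python
import numpy as np

A = np.array([[3, 4, -2], [2, 1, -1], [7, -10, -5]])

total = A + A   # entries are added position by position
print(total)
```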
And multiplication:
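For numpy arrays, matrix multiplication uses the @ operator. Since B is 1x3 and A is 3x3, the product BA is defined (and is 1x3), whereas AB is not. This sketch assumes both matrices are stored as arrays.

```python
import numpy as np

A = np.array([[3, 4, -2], [2, 1, -1], [7, -10, -5]])
B = np.array([[1, 5, 4]])

product = B @ A   # (1x3) times (3x3) gives a 1x3 matrix
print(product)
```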
And further linear algebraic manipulation, such as determinants and inverses:
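As a sketch, the np.linalg sublibrary provides det and inv. The results are floating point numbers, so they may differ from the exact values by a tiny rounding error.

```python
import numpy as np

A = np.array([[3, 4, -2], [2, 1, -1], [7, -10, -5]])

det_A = np.linalg.det(A)   # the determinant, which is 21 up to rounding error
inv_A = np.linalg.inv(A)   # the inverse matrix

# Multiplying A by its inverse should give (approximately) the identity.
print(np.allclose(A @ inv_A, np.eye(3)))  # True
```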
Numpy is also useful in other contexts, for example to create arrays of equally spaced numbers:
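For instance, np.linspace takes a start, an end, and the number of points wanted. The end points and count below are arbitrary choices for this sketch.

```python
import numpy as np

# Five equally spaced numbers from 0 to 1, including both end points.
values = np.linspace(0, 1, 5)
print(values)
```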
Matplotlib
This is the most popular Python library for producing plots. It is flexible enough to create nearly any plot you will require, in most styles, and it also has a simpler interface, pyplot, for quick and easy plotting.
We’ll demonstrate through examples.
Before we begin, in order for plots to display in Jupyter, we need the following line of code, %matplotlib inline (this is only needed in Jupyter).
Next we’ll import pyplot, and create a line plot:
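A minimal sketch of a line plot. The x and y values are made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up data: the first five whole numbers and their squares.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
ax.plot(x, y)   # joins the points with a line
```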
Using the same data, we can make a scatterplot (and let’s customise it a little):
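A sketch of a customised scatterplot, again with made-up data. The colour, marker, labels and title are all arbitrary choices.

```python
import matplotlib.pyplot as plt

# Made-up data: the first five whole numbers and their squares.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
ax.scatter(x, y, color="red", marker="x", s=60)   # red crosses, slightly enlarged
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A customised scatterplot")
```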
A vast number of other types of plot can be produced which can’t all be listed here. Below are examples of creating histograms and boxplots. First some random data is created (using the random library from the standard Python library):
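One possible version of this, as a sketch: the Normal distribution, its parameters, the sample size and the number of bins are all assumptions chosen for illustration.

```python
import random
import matplotlib.pyplot as plt

# 500 random numbers from a Normal distribution with mean 10 and
# standard deviation 2 (arbitrary choices). The seed makes it repeatable.
random.seed(0)
data = [random.gauss(10, 2) for _ in range(500)]

# A histogram of the data with 20 bins.
fig_hist, ax_hist = plt.subplots()
counts, bin_edges, _ = ax_hist.hist(data, bins=20)

# A boxplot of the same data.
fig_box, ax_box = plt.subplots()
box = ax_box.boxplot(data)
```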
Finally matplotlib allows plots to be combined, customised and saved in a number of formats (experiment with .png, .svg and .pdf):
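As an illustrative sketch, a line plot and a scatterplot can share the same axes, and savefig chooses the format from the file extension. The data and the file name combined_plot.png are made up here.

```python
from pathlib import Path
import matplotlib.pyplot as plt

# Made-up data.
x = [1, 2, 3, 4, 5]
squares = [value ** 2 for value in x]
cubes = [value ** 3 for value in x]

fig, ax = plt.subplots()
ax.plot(x, squares, label="squares")                  # a line plot...
ax.scatter(x, cubes, color="green", label="cubes")    # ...and a scatterplot together
ax.legend()

# Change the extension to .svg or .pdf to save in those formats instead.
output_path = Path("combined_plot.png")
fig.savefig(output_path)
```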
Pandas
Pandas is Python’s most popular library for data analysis and data manipulation. It is very useful for storing data in objects called ‘data frames’, which arrange data into useful and meaningful rows and columns. These data frames can be manipulated very efficiently for reshaping data, and performing data analyses on them.
To show an example, let’s read in a csv file (it can be downloaded from here if you’re following along), data of passengers on the Titanic:
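A sketch of reading csv data with pandas. So that it runs without the downloaded file, it reads from a small made-up stand-in (these six rows are NOT the real Titanic data). The column names Name, Age and Sex are assumptions; only PClass and Survived are named in this tutorial.

```python
import io
import pandas as pd

# Made-up stand-in for the contents of the csv file; NOT the real data.
csv_text = """Name,PClass,Age,Sex,Survived
Passenger A,1st,29,female,1
Passenger B,1st,40,male,1
Passenger C,2nd,35,male,0
Passenger D,2nd,22,female,1
Passenger E,3rd,19,male,0
Passenger F,3rd,31,female,0
"""

# With the downloaded file, pass its path to pd.read_csv instead.
data = pd.read_csv(io.StringIO(csv_text))
```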
And look at the first few rows:
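The head method does this. The sketch below uses a small made-up data frame (not the real Titanic data, and with assumed column names) so that it runs on its own.

```python
import pandas as pd

# Made-up stand-in data; NOT the real Titanic data.
data = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E", "Passenger F"],
    "PClass": ["1st", "1st", "2nd", "2nd", "3rd", "3rd"],
    "Age": [29, 40, 35, 22, 19, 31],
    "Sex": ["female", "male", "male", "female", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

first_rows = data.head(3)   # head() with no argument gives the first five rows
print(first_rows)
```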
We can see that this data has 5 columns: the passengers’ names, their cabin class, their age, their sex, and whether they survived or not.
We’ll use this to demonstrate some of pandas’ data manipulation methods.
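A few commonly used methods, sketched on a small made-up data frame (not the real Titanic data; the column names other than PClass and Survived are assumptions):

```python
import pandas as pd

# Made-up stand-in data; NOT the real Titanic data.
data = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E", "Passenger F"],
    "PClass": ["1st", "1st", "2nd", "2nd", "3rd", "3rd"],
    "Age": [29, 40, 35, 22, 19, 31],
    "Sex": ["female", "male", "male", "female", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

mean_age = data["Age"].mean()                        # summarise one column
females = data[data["Sex"] == "female"]              # keep only some rows
oldest_first = data.sort_values("Age", ascending=False)   # sort the rows
class_counts = data["PClass"].value_counts()         # count each distinct value

# Methods can be combined: the mean age of the female passengers.
mean_female_age = data[data["Sex"] == "female"]["Age"].mean()
```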
As you have seen above some of these methods can be combined, and pandas is very efficient at doing this. Sometimes simply reshaping data can give valuable insights, for example:
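A sketch of such a reshaping, combining groupby and mean. The data frame is a small made-up stand-in (not the real Titanic data), so the proportions it gives are purely illustrative.

```python
import pandas as pd

# Made-up stand-in data; NOT the real Titanic data.
data = pd.DataFrame({
    "PClass": ["1st", "1st", "2nd", "2nd", "3rd", "3rd"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Group the rows by cabin class, then take the mean of 'Survived' in each group.
survival_by_class = data.groupby("PClass")["Survived"].mean()
print(survival_by_class)
```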
This gives the mean value of the ‘Survived’ column for each cabin class, that is for each value in the ‘PClass’ column. Knowing that ‘Survived’ is binary, its mean value is the proportion of passengers who survived.
Therefore by combining the groupby and mean methods on the relevant columns, we can instantly see that there is a relationship between proportion of survivors and cabin class, with the lower cabin classes having a lower proportion of survivors.
Scipy
The scipy library is very versatile and has a number of functions and methods for conducting scientific procedures and algorithms. It is vast and has a number of specialised sublibraries. We’ll look at two of those here:
- scipy.stats has a number of statistical functions for performing statistical tests and using probability distributions.
- scipy.optimize has a number of algorithms for optimizing functions and curve-fitting.
Let’s use the Titanic data from above and consider it as a sample (it isn’t, it’s a population, but let’s consider it as a sample for the sake of demonstrating hypothesis testing). Let’s see if the average age of female passengers was equal to the average age of male passengers, using an independent samples t-test at the 1% level:
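A sketch of how such a test can be run with scipy.stats.ttest_ind. The two lists of ages are made up so that the code runs on its own; they are NOT the Titanic ages, so the p-value printed here will not match the one quoted below.

```python
from scipy import stats

# Made-up ages for illustration; NOT the real Titanic data.
female_ages = [29, 22, 31, 45, 38, 27, 33]
male_ages = [40, 35, 19, 52, 47, 30, 41]

result = stats.ttest_ind(female_ages, male_ages)
print(result.pvalue)

# Reject the null hypothesis only if the p-value is below the 1% level.
reject_null = bool(result.pvalue < 0.01)
```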
A t-test was performed:
- \(H_0\): The mean age of female passengers is equal to the mean age of male passengers.
- \(H_1\): The mean age of female passengers is not equal to the mean age of male passengers.
We obtained a p-value of \(0.12385\dots\), and so at the 1% level the null hypothesis cannot be rejected, and there is not enough evidence to say that the mean ages of the genders differ.
Scipy also allows non-parametric tests if the observed data is not Normally distributed:
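One such test is available as scipy.stats.mannwhitneyu. As before, the ages in this sketch are made up (NOT the Titanic ages), and the two-sided alternative is stated explicitly.

```python
from scipy import stats

# Made-up ages for illustration; NOT the real Titanic data.
female_ages = [29, 22, 31, 45, 38, 27, 33]
male_ages = [40, 35, 19, 52, 47, 30, 41]

result = stats.mannwhitneyu(female_ages, male_ages, alternative="two-sided")
print(result.pvalue)

# Reject the null hypothesis only if the p-value is below the 1% level.
reject_null = bool(result.pvalue < 0.01)
```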
A Mann-Whitney U test was performed:
- \(H_0\): The median age of female passengers is equal to the median age of male passengers.
- \(H_1\): The median age of female passengers is not equal to the median age of male passengers.
We obtained a p-value of \(0.0348\dots\), and so at the 1% level the null hypothesis cannot be rejected, and there is not enough evidence to say that the median ages of the genders differ.
Now using scipy.optimize, let’s see how we can minimise some arbitrary function:
First define this function as a Python function:
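The function used in this tutorial is not reproduced here. As a hypothetical stand-in, the sketch below uses \(f(x, y) = (x + 22)^2 + (y - 14)^2\), chosen only because its minimum sits at the values quoted below, and minimises it with scipy.optimize.minimize from an arbitrary starting point.

```python
from scipy import optimize

def f(variables):
    """A hypothetical stand-in function, NOT the tutorial's original."""
    x, y = variables
    return (x + 22) ** 2 + (y - 14) ** 2

# Search for the minimum, starting from the (arbitrary) point (0, 0).
result = optimize.minimize(f, x0=[0, 0])
x_opt, y_opt = result.x
print(round(x_opt, 2), round(y_opt, 2))
```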
and this gives us the optimal values of \(x = -22\) and \(y = 14\).
Scikit-learn
The final library we will look at is scikit-learn. This is Python’s machine learning library. It implements a wide range of machine learning algorithms, but here we’ll just demonstrate a clustering algorithm.
First import a data set to demonstrate on (which can be downloaded here):
This has observations of plants with their height and weight recorded. A plot will show more information:
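A sketch of such a plot. Since the data file is not included here, the code generates synthetic heights and weights around four made-up centres; these are NOT the tutorial's data, and the axis labels are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data: 50 points around each of four made-up centres.
rng = np.random.default_rng(0)
centres = [(10, 5), (10, 20), (30, 5), (30, 20)]
points = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centres])
heights, weights = points[:, 0], points[:, 1]

fig, ax = plt.subplots()
ax.scatter(heights, weights)
ax.set_xlabel("Height")
ax.set_ylabel("Weight")
```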
We can see there are four natural groupings. We’ll use k-means clustering to categorise these:
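A minimal sketch of k-means clustering with scikit-learn's KMeans. It runs on synthetic data with four well-separated groups (NOT the tutorial's plant data); n_init and random_state are set only to make the result repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: 50 points around each of four made-up centres.
rng = np.random.default_rng(0)
centres = [(10, 5), (10, 20), (30, 5), (30, 20)]
points = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centres])

# Ask for four clusters, matching the four groupings seen in the plot.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(points)   # one cluster label (0 to 3) per point

print(model.cluster_centers_)        # the centre found for each cluster
```

The labels can then be passed as the c argument of a matplotlib scatterplot to colour each point by its cluster.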