Jóan Petur Petersen

- Worked on an industrial Ph. D. with DTU and GreenSteam.
- Used python troughout the project
- Data collection
- Data processing
- Modelling
- Data presentation
- Deployment

Python can help you in all these steps from the beginning to the end, and it can be done to a large part in and efficient and fast way.

- Native Python memory and speed
- Numpy introduction (matrix manipulation)
- Ipython and Matplotlib (visualization)

We'll start of looking at how native python works memory and speed wise. As you might already know most Python interpreters aren't very fast, so this will affect the way you would implement algorithms and manage data processing for scientific computing.

Then we'll look at a library for presenting matrices and matrix operations. This is probably one of the most commonly used libraries in scientific python.

Finally we'll look at Matploblib, which is a nice 2d plotting library for Python.

Python is a high-level general-purpose interpreted programming language [http://en.wikipedia.org/wiki/Python_(programming_language)].

This talk assumes you know some python, and that you are familiar with Matlab, or similar programs.

So we will look at the subtle differences and pitfalls one might encounter when starting out with NumPy.

Install memory profiler: `pip install -U memory_profiler`

Let's make a 1000x1000 demo matrix:

```
1 @profile
2 def make_matrix(N=1000):
3 matrix_2d = []
4 for i in xrange(N):
5 repeat_me = range(N)
6 matrix_2d.append(repeat_me)
7 return matrix_2d
8
9 if __name__ == '__main__':
10 make_matrix()
```

If you want to run this without memory profiling, you can create a dummy decorator:

def profile(fn): def wrapped(): return fn() return wrapped

Run the program using a memory profiler:

```
>>> python -m memory_profiler profile_make_matrix.py
Filename: profile_make_matrix.py
Line # Mem usage Increment Line Contents
================================================
2 @profile
3 6.03 MB 0.00 MB def make_matrix(N=1000):
4 6.04 MB 0.00 MB matrix_2d = []
5 18.86 MB 12.83 MB for i in xrange(N):
6 18.88 MB 0.01 MB repeat_me = range(N)
7 18.88 MB 0.00 MB matrix_2d.append(repeat_me)
8 18.88 MB 0.00 MB return matrix_2d
```

Use `sys.getsizeof(object)`

to find the actual memory used by an object:

```
>>> import sys
>>> a = [1]
>>> sys.getsizeof(a)
40
>>> sys.getsizeof(a[0])
12
>>> sys.getsizeof(range(1))
40
>>> sys.getsizeof(range(2))
44
>>> sys.getsizeof(range(3))
48
>>> sys.getsizeof(range(1000))
4036
```

`sys.getsizeof()`

returns the size of memory footprint in bytes of objects.
In some python versions container objects it will return the size of only the container object.
The list contains references to the objects, and it can be hard to figure out the
memory consumption, because the interpreter tries to reuse references to objects.

Non-native Python code, such as C code, might also allocate memory without Python's knowledge.

A very brute mean function:

```
1 def calc_mean(list_matrix):
2 """Returns the mean of a matrix given as a list of lists.
3 """
4 total = 0
5 cnt = 0
6 for lst in list_matrix:
7 for number in lst:
8 total += number
9 cnt += 1
10 return (1.*total) / cnt
```

See more at http://scikit-learn.org/dev/developers/performance.html

```
In [43]: %prun calc_mean(m)
3 function calls in 0.159 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.159 0.159 0.159 0.159 calc_mean.py:2(calc_mean)
1 0.000 0.000 0.159 0.159 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
```

`%prun`

only times function calls.

line_profiler can report per line.

```
%lprun -f calc_mean calc_mean(m)
Timer unit: 1e-06 s
File: calc_mean.py
Function: calc_mean at line 2
Total time: 4.27212 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 def calc_mean(list_matrix):
3 """Returns the mean of a matrix given as a list of lists.
4 """
5 1 11 11.0 0.0 total = 0
6 1 1 1.0 0.0 cnt = 0
7 1001 1497 1.5 0.0 for lst in list_matrix:
8 1001000 1352026 1.4 31.6 for number in lst:
9 1000000 1412441 1.4 33.1 total += number
10 1000000 1506138 1.5 35.3 cnt += 1
11 1 4 4.0 0.0 return (1.*total) / cnt
```

It is interesting to note how the innermost loop takes the most of the time. Each line in the inner loop accounts for almost 1/3 of the total execution time each.

Scikit has a nice page on optimization: http://scikit-learn.org/dev/developers/performance.html

line_profiler doc: http://packages.python.org/line_profiler/

Why should we use NumPy?

- NumPy arrays are more compact than python lists
- Large library of implemeted functions
- Computationally faster than using native python lists

- Matlab users: http://www.scipy.org/NumPy_for_Matlab_Users
- Tutorial: http://www.scipy.org/Tentative_NumPy_Tutorial

We create the same 1000x1000 matrix with Numpy:

```
import numpy as np
def make_matrix(N=1000):
row = np.array(range(1000), dtype=np.int32)
return np.tile(row, (1000,1))
```

```
>> python -m memory_profiler profile_make_matrix_numpy.py
Filename: profile_make_matrix_numpy.py
Line # Mem usage Increment Line Contents
================================================
3 @profile
4 11.36 MB 0.00 MB def make_matrix(N=1000):
5 11.37 MB 0.00 MB row = np.array(range(1000), dtype=np.int32)
6 15.21 MB 3.85 MB return np.tile(row, (1000,1))
```

And speed:

```
In [5]: %prun np.mean(m)
4 function calls in 0.003 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.003 0.003 0.003 0.003 {method 'mean' of 'numpy.ndarray' objects}
1 0.000 0.000 0.003 0.003 fromnumeric.py:2299(mean)
1 0.000 0.000 0.003 0.003 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
```

Note here that we specify the type for the hole array. Numpy arrays are homogeneous, that is all entries have the same type. This saves all the spaced used for keeping each element's type as in the list of lists simple type of matrix.

An array in NumPy can be created this way:

```
1 import numpy as np
2 X = np.array([[1,2], [3,4]])
```

So there are kind of many brackets here. In Matlab it would just have been:

```
>> X = [1,2; 3,4]
```

The NumPy matrix class helps a bit:

```
>>> import numpy as np
>>> X = np.matrix('1 2; 3 4')
```

If we have `from numpy import *`

then we might write it as:

```
>>> from numpy import *
>>> X = matrix('1,2; 3,4')
```

The NumPy matrix is always 2D, and retains its 2D nature through operations.

But normally I don't use the matrix class. One of the reasons being that I just started with the array class, and this is also what most of the Numpy methods will return.

It seems like Numpy is a kind of second grade citizen in Python. The notation isn't really nice for it. But then again; Python is a general purpose language.

There might be some nasty surprises using NumPy arrays:

```
>>> import numpy as np
>>> X = np.array([[1,2],[3,4]])
>>> Y = X
>>> print X
[[1 2]
[3 4]]
>>> print Y
[[1 2]
[3 4]]
>>> X[0,0] = 0
>>> print X
[[0 2]
[3 4]]
>>> print Y
[[0 2]
[3 4]]
```

Simple trick to avoid this: `Y = 1*X`

(makes a copy)

```
>>> import numpy as np
>>> X = np.array([[1,2],[3,4]])
>>> print X
[[1 2]
[3 4]]
>>> print X[0]
[1 2]
>>> print X[0,:]
[1 2]
>>> print X[:,0]
[1 3]
```

There might be some surprises using NumPy arrays:

```
>>> import numpy as np
>>> X = np.array([[1,2],[3,4]])
>>> X.shape
(2, 2)
>>> X1 = X[:,0]
>>> X1.shape
(2,)
```

So it has become a one dimensional array.

```
>>> Y = np.matrix('1,2; 3,4')
>>> Y[:,0]
matrix([[1],
[3]])
>>> Y1 = Y[:,0]
>>> Y1.shape
(2, 1)
```

```
>>> X = np.array(range(9)).reshape(3,3)
>>> print X
[[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> print X[1:, 1:]
[[4 5]
[7 8]]
>>> print X[:1, :1]
[[0]]
```

A slices is a reference to part of the original array.

```
>>> a = eye(3)
>>> print a
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
>>> b = a[:,1]
>>> print b
[ 0. 1. 0.]
>>> b[1]=2
>>> print a
[[ 1. 0. 0.]
[ 0. 2. 0.]
[ 0. 0. 1.]]
```

```
>>> X = np.array(range(25)).reshape(5,5)
>>> print X
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
>>> print X[:,::2]
[[ 0 2 4]
[ 5 7 9]
[10 12 14]
[15 17 19]
[20 22 24]]
>>> print X[:,1::2]
[[ 1 3]
[ 6 8]
[11 13]
[16 18]
[21 23]]
```

More info here: http://scipy-lectures.github.com/advanced/advanced_numpy/index.html A video here: https://www.youtube.com/watch?v=7vcjjN9eNvs

```
>>> X = np.array(range(25)).reshape(5,5)
>>> print X
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
>>> B = X % 3 == 0
>>> print B
[[ True False False True False]
[False True False False True]
[False False True False False]
[ True False False True False]
[False True False False True]]
>>> X[B] = 0
>>> print X
[[ 0 1 2 0 4]
[ 5 0 7 8 0]
[10 11 0 13 14]
[ 0 16 17 0 19]
[20 0 22 23 0]]
```

IPython is an enhanced Python shell.

- Code completion and highlighting
- Profiling code
- Command history

Enthought has a nice installer for IPython.

`%hist`

: Show command history`%edit`

: Open editor and execute code after closing`%prun`

: Profile a method call-
`%paste`

: Paste and run from clipboard -
*variable*`?`

: Show type and docstring *variable*`??`

: Show the code

```
>>> import numpy as np
>>> np.mean?
Base Class: <type 'function'>
String Form:<function mean at 0x12f6570>
Namespace: Interactive
File: /Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/numpy/core/fromnumeric.py
Definition: np.mean(a, axis=None, dtype=None, out=None)
Docstring:
Compute the arithmetic mean along the specified axis.
Returns the average of the array elements. The average is taken over
the flattened array by default, otherwise over the specified axis.
`float64` intermediate and return values are used for integer inputs.
Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
axis : int, optional
Axis along which the means are computed. The default is to compute
the mean of the flattened array.
dtype : data-type, optional
Type to use in computing the mean. For integer inputs, the default
is `float64`; for floating point inputs, it is the same as the
...
```

Matplotlib's gallery is a good place to get inspiration:

http://matplotlib.org/gallery.html

Special support for Matplotlib.pylab plots using `-pylab`

(`--pylab`

since IPython version 0.12).

Multi-threaded handling of the figures in the background.

Some libaries that can make life easier:

- NumPy (matrix reprensentation, linear algebra, C/C++ code integration)
- SciPy (optimization, more linear algebra, special functions, FFT, ODE, ...)
- Scikit-learn (machine learning, supervised/unsupervised, model selection, ...)

Visualization:

- Matplotlib (nice 2d plots, some support for 3d plots, maps)
- Mayavi (3d plotting)

With python you have

- access to a large library of well tested methods for linear algebra, optimization, machine learning, and so on.

Wrap:

- Fortran, C, C++, ...
- R

And then access to all the other python libraries: os, sys, wxPython, Django, flask, ...

I'll just drop these:

Generally:

- Only optimize code that needs to be.
- Has someone already optimized this problem?
- Is there a better way of doing it? Another algorithm?

Try to avoid python loops, especially nested loops, and write the code in such a way that it uses numpy, scipy, and similar functions instead.

Cython?

- http://www.scipy.org/Tentative_NumPy_Tutorial

Table of Contents | t |
---|---|

Exposé | ESC |

Full screen slides | e |

Presenter View | p |

Source Files | s |

Slide Numbers | n |

Toggle screen blanking | b |

Show/hide slide context | c |

Notes | 2 |

Help | h |