Introduction¶
Background Questionnaire¶
Who has used Theano before?
What did you do with it?
Who has used Python? NumPy? SciPy? matplotlib?
Who has used IPython?
Who has used it as a distributed computing engine?
Who has done C/C++ programming?
Who has organized computation around a particular physical memory layout?
Who has used a multidimensional array of >2 dimensions?
Who has written a Python module in C before?
Who has written a program to generate Python modules in C?
Who has used a templating engine?
Who has programmed a GPU before?
Using OpenGL / shaders ?
Using CUDA (runtime? / driver?)
Using PyCUDA ?
Using OpenCL / PyOpenCL ?
Using cudamat / gnumpy ?
Other?
Who has used Cython?
Python in one slide¶
General-purpose high-level OO interpreted language
Emphasizes code readability
Comprehensive standard library
Dynamic type and memory management
Built-in types: int, float, str, list, dict, tuple, object
Slow execution
Popular in web-dev and scientific communities
#######################
# PYTHON SYNTAX EXAMPLE
#######################
a = 1                      # no type declaration required!
b = (1, 2, 3)              # tuple of three int literals
c = [1, 2, 3]              # list of three int literals
d = {'a': 5, b: None}      # dictionary of two elements
                           # N.B. string literal key, None value
print(d['a'])              # square brackets index
# -> 5
print(d[(1, 2, 3)])        # new tuple == b, retrieves None
# -> None
print(d[6])
# raises KeyError exception
x, y, z = 10, 100, 100     # multiple assignment from tuple
x, y, z = b                # unpacking a sequence
b_squared = [b_i ** 2 for b_i in b]  # list comprehension

def foo(b, c=3):           # function w/ default param c
    return a + b + c       # note scoping, indentation
foo(5)                     # calling a function
# -> 1 + 5 + 3 == 9        # N.B. a comes from enclosing scope
foo(b=6, c=2)              # calling with named args
# -> 1 + 6 + 2 == 9
print(b[1:3])              # slicing syntax
# -> (2, 3)

class Foo(object):         # defining a class
    def __init__(self):
        self.a = 5
    def hello(self):
        return self.a

f = Foo()                  # creating a class instance
print(f.hello())           # calling methods of objects
# -> 5

class Bar(Foo):            # defining a subclass
    def __init__(self, a):
        self.a = a

print(Bar(99).hello())     # creating an instance of Bar
# -> 99
NumPy in one slide¶
Python floats are full-fledged objects on the heap
Not suitable for high-performance computing!
NumPy provides an N-dimensional numeric array for Python
Perfect for high-performance computing.
NumPy provides
elementwise computations
linear algebra, Fourier transforms
pseudorandom numbers from many distributions
SciPy provides lots more, including
more linear algebra
solvers and optimization algorithms
matlab-compatible I/O
I/O and signal processing for images and audio
##############################
# Properties of NumPy arrays
# that you really need to know
##############################
import numpy as np # import can rename
a = np.random.rand(3, 4, 5) # random generators
a32 = a.astype('float32') # arrays are strongly typed
a.ndim # int: 3
a.shape # tuple: (3, 4, 5)
a.size # int: 60
a.dtype # np.dtype object: 'float64'
a32.dtype # np.dtype object: 'float32'
Arrays can be combined with numeric operators and standard mathematical functions, which apply elementwise. NumPy has great documentation.
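For instance, a quick sketch of elementwise arithmetic and broadcasting (the array names here are illustrative):

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)   # 2x3 array: [[0, 1, 2], [3, 4, 5]]
b = np.ones((2, 3))

c = a + b                          # elementwise addition
d = a * a                          # elementwise product (not matrix multiply)
e = np.tanh(a)                     # math functions also apply elementwise
row = np.array([10.0, 20.0, 30.0])
f = a + row                        # broadcasting: row added to each row of a
```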
Training a one-hidden-layer classification neural network on MNIST in pure NumPy might look like this:
#########################
# NumPy for Training a
# Neural Network on MNIST
#########################
x = np.load('data_x.npy')
y = np.load('data_y.npy')
w = np.random.normal(
        loc=0,
        scale=.1,
        size=(784, 500))
b = np.zeros((500,))
v = np.zeros((500, 10))
c = np.zeros((10,))
lr = 0.01       # learning rate
batchsize = 100
for i in range(1000):
    x_i = x[i * batchsize: (i + 1) * batchsize]
    y_i = y[i * batchsize: (i + 1) * batchsize]
    hidin = np.dot(x_i, w) + b
    hidout = np.tanh(hidin)
    outin = np.dot(hidout, v) + c
    outout = (np.tanh(outin) + 1) / 2.0
    g_outout = outout - y_i
    err = 0.5 * np.sum(g_outout ** 2)
    g_outin = g_outout * outout * (1.0 - outout)
    g_hidout = np.dot(g_outin, v.T)
    g_hidin = g_hidout * (1 - hidout ** 2)
    b -= lr * np.sum(g_hidin, axis=0)
    c -= lr * np.sum(g_outin, axis=0)
    w -= lr * np.dot(x_i.T, g_hidin)
    v -= lr * np.dot(hidout.T, g_outin)
What’s missing?¶
Non-lazy evaluation (required by Python) hurts performance
NumPy is bound to the CPU
NumPy lacks symbolic or automatic differentiation
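Without automatic differentiation, gradients like the ones above must be derived by hand and are usually verified numerically. A minimal sketch of such a finite-difference check, on an illustrative linear least-squares loss (the names `loss` and `grad_loss` are made up for this example):

```python
import numpy as np

def loss(w, x, y):
    # squared error of a linear model
    return 0.5 * np.sum((np.dot(x, w) - y) ** 2)

def grad_loss(w, x, y):
    # hand-derived gradient: x^T (x w - y)
    return np.dot(x.T, np.dot(x, w) - y)

rng = np.random.RandomState(0)
x = rng.rand(5, 3)
y = rng.rand(5)
w = rng.rand(3)

# central finite-difference approximation of each partial derivative
eps = 1e-6
g_approx = np.zeros_like(w)
for j in range(len(w)):
    w_plus = w.copy();  w_plus[j] += eps
    w_minus = w.copy(); w_minus[j] -= eps
    g_approx[j] = (loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps)

print(np.allclose(grad_loss(w, x, y), g_approx, atol=1e-4))  # -> True
```

Theano's `tensor.grad` makes both the hand derivation and this check unnecessary.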
Now let’s have a look at the same algorithm in Theano, which runs about 15 times faster if you have a GPU (I’m skipping some dtype details, which we’ll come back to).
#########################
# Theano for Training a
# Neural Network on MNIST
#########################
import numpy as np
import theano
import theano.tensor as tensor
x = np.load('data_x.npy')
y = np.load('data_y.npy')

# symbol declarations
sx = tensor.matrix()
sy = tensor.matrix()
w = theano.shared(np.random.normal(loc=0, scale=.1,
                                   size=(784, 500)))
b = theano.shared(np.zeros(500))
v = theano.shared(np.zeros((500, 10)))
c = theano.shared(np.zeros(10))

# symbolic expression-building
hid = tensor.tanh(tensor.dot(sx, w) + b)
out = tensor.tanh(tensor.dot(hid, v) + c)
err = 0.5 * tensor.sum((out - sy) ** 2)
gw, gb, gv, gc = tensor.grad(err, [w, b, v, c])

# compile a fast training function
lr = 0.01  # learning rate
train = theano.function([sx, sy], err,
        updates={
            w: w - lr * gw,
            b: b - lr * gb,
            v: v - lr * gv,
            c: c - lr * gc})

# now do the computations
batchsize = 100
for i in range(1000):
    x_i = x[i * batchsize: (i + 1) * batchsize]
    y_i = y[i * batchsize: (i + 1) * batchsize]
    err_i = train(x_i, y_i)
Theano in one slide¶
High-level domain-specific language tailored to numeric computation
Compiles most common expressions to C for CPU and GPU.
Limited expressivity means lots of opportunities for expression-level optimizations
No function call -> global optimization
Strongly typed -> compiles to machine instructions
Array oriented -> parallelizable across cores
Support for looping and branching in expressions
Expression substitution optimizations automatically draw on many backend technologies for best performance.
FFTW, MKL, ATLAS, SciPy, Cython, CUDA
Slower fallbacks always available
Automatic differentiation
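The idea behind symbolic differentiation can be sketched in a few lines of pure Python. This toy expression type is only an illustration of the principle, not Theano's actual implementation:

```python
# Toy expression graph built from tuples:
#   ('x',)  ('const', c)  ('add', a, b)  ('mul', a, b)

def diff(e):
    """Symbolic derivative of expression e with respect to 'x'."""
    if e == ('x',):
        return ('const', 1.0)
    if e[0] == 'const':
        return ('const', 0.0)
    if e[0] == 'add':
        return ('add', diff(e[1]), diff(e[2]))
    if e[0] == 'mul':  # product rule
        return ('add', ('mul', diff(e[1]), e[2]),
                       ('mul', e[1], diff(e[2])))

def evaluate(e, x):
    """Evaluate expression e at a concrete value of x."""
    if e == ('x',):
        return x
    if e[0] == 'const':
        return e[1]
    if e[0] == 'add':
        return evaluate(e[1], x) + evaluate(e[2], x)
    if e[0] == 'mul':
        return evaluate(e[1], x) * evaluate(e[2], x)

# d/dx (x*x + 3x) = 2x + 3; at x = 2 this is 7
expr = ('add', ('mul', ('x',), ('x',)),
               ('mul', ('const', 3.0), ('x',)))
print(evaluate(diff(expr), 2.0))  # -> 7.0
```

Theano applies the same chain-rule recursion to whole array-expression graphs, then optimizes and compiles the resulting derivative graph.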
Project status¶
Mature: Theano has been developed and used since January 2008 (3.5 yrs old)
Has driven over 40 research papers in the last few years
Good user documentation
Active mailing list with participants from outside our lab
Core technology for a funded Silicon-Valley startup
Many contributors (some from outside our lab)
Used to teach IFT6266 for two years
Used for research at Google and Yahoo.
Unofficial RPMs for Mandriva
Downloads (January 2011 - June 8 2011):
PyPI: 780
MLOSS: 483
Assembla (bleeding edge repository): unknown
Why scripting for GPUs?¶
They complement each other:
GPUs are everything that scripting/high level languages are not
Highly parallel
Very architecture-sensitive
Built for maximum FP/memory throughput
So hard to program that meta-programming is easier.
CPU: largely restricted to control
Optimized for sequential code and low latency (rather than high throughput)
Tasks occur at modest rates (~1000/sec)
Scripting is fast enough for this
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
How Fast are GPUs?¶
Theory
Intel Core i7 980 XE (107 Gf/s float64), 6 cores
NVIDIA C2050 (515 Gf/s float64, 1 Tf/s float32), 480 cores
NVIDIA GTX580 (1.5 Tf/s float32), 512 cores
GPUs are faster, cheaper, more power-efficient
Practice (our experience)
Depends on algorithm and implementation!
Reported speed improvements over CPU in the literature vary widely (0.01x to 1000x)
Matrix-matrix multiply speedup: usually about 10-20x.
Convolution speedup: usually about 15x.
Elemwise speedup: slower or up to 100x (depending on operation and layout)
Sum: can be faster or slower depending on layout.
Benchmarking is delicate work…
How to control quality of implementation?
How much time was spent optimizing CPU vs GPU code?
Theano can show up to 100x GPU speedup partly because its CPU version uses only one core
Theano can be linked with multi-core capable BLAS (GEMM and GEMV)
If you see speedup > 100x, the benchmark is probably not fair.
Software for Directly Programming a GPU¶
Theano is a meta-programmer, so it doesn’t really count.
CUDA: C extension by NVIDIA
Vendor-specific
Numeric libraries (BLAS, RNG, FFT) maturing.
OpenCL: multi-vendor version of CUDA
More general, standardized
Fewer libraries, less adoption.
PyCUDA: python bindings to CUDA driver interface
Python interface to CUDA
Memory management of GPU objects
Compilation of code for the low-level driver
Makes it easy to do GPU meta-programming from within Python
PyOpenCL: PyCUDA for OpenCL