Performance

Theano uses several tricks to obtain good performance:
  • common sub-expression elimination

  • custom-generated C code for many operations

  • pre-allocation of temporary storage

  • loop fusion (which gcc normally can’t do)
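
To make the first item concrete, here is a minimal sketch of common sub-expression elimination on a toy expression graph. It is an illustration only: the names Graph, var and apply are hypothetical and are not Theano's API, and Theano's actual graph optimizer may work quite differently. The idea shown is simply that structurally identical expressions are stored, and therefore computed, once.

```python
# Toy expression graph with common sub-expression elimination (CSE).
# Graph, var and apply are hypothetical names, not part of Theano.

class Graph:
    def __init__(self):
        self._nodes = {}  # structural key -> unique node id
        self.ops = []     # node id -> structural key (op name, arguments)

    def _intern(self, key):
        # Reuse the existing node if an identical expression was built before.
        if key not in self._nodes:
            self._nodes[key] = len(self.ops)
            self.ops.append(key)
        return self._nodes[key]

    def var(self, name):
        return self._intern(("var", name))

    def apply(self, op, *args):
        # args are node ids, so equal sub-expressions produce equal keys.
        return self._intern((op, args))


g = Graph()
x, y = g.var("x"), g.var("y")

# (x + y) * (x + y): the sum is written twice but stored only once.
left = g.apply("add", x, y)
right = g.apply("add", x, y)
product = g.apply("mul", left, right)
```

In this sketch left and right end up as the same node, so an evaluator walking g.ops would compute x + y a single time.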

In my neural-net experiments for course projects, Theano ran roughly 10x faster than plain NumPy. [More specific speed tests would be nice.]

With a little work, Theano could also implement more sophisticated optimizations:

  • automatic ordering of matrix multiplications

  • profile-based memory layout decisions (e.g. row-major vs. column-major)

  • gcc intrinsics that use MMX and SSE2 SIMD parallelism for faster element-wise arithmetic

  • conditional expressions
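
As an illustration of what automatic ordering of matrix multiplications could look like, here is a sketch of the textbook matrix-chain dynamic program. It is not something Theano is stated to implement; the function name matrix_chain_order and its dims argument are assumptions made for this example. Given only the shapes, it picks the parenthesization with the fewest scalar multiplications.

```python
def matrix_chain_order(dims):
    """Choose the cheapest way to parenthesize a chain of matrix products.

    Matrix i has shape dims[i] x dims[i + 1] (hypothetical convention for
    this sketch). Returns (scalar multiplication count, parenthesization).
    """
    n = len(dims) - 1                      # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]     # cost[i][j]: best cost for A_i..A_j
    split = [[0] * n for _ in range(n)]    # split[i][j]: where to cut A_i..A_j

    for length in range(2, n + 1):         # grow the sub-chain length
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):          # try every cut point
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k

    def render(i, j):
        if i == j:
            return "A%d" % i
        k = split[i][j]
        return "(%s %s)" % (render(i, k), render(k + 1, j))

    return cost[0][n - 1], render(0, n - 1)


# Shapes: A0 is 10x100, A1 is 100x5, A2 is 5x50.
best_cost, best_order = matrix_chain_order([10, 100, 5, 50])
# best_cost == 7500, best_order == "((A0 A1) A2)"
```

For these shapes the other grouping, A0 (A1 A2), would need 75000 scalar multiplications, ten times as many, which is why choosing the order from the shapes can pay off.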