Wednesday, November 3, 2010

Cython, Sage, and the Need for Speed

Cython seriously rocks, at least for much of what I need. It's still the killer feature of Python/Sage, IMHO. And meetings like EuroScipy last summer really confirmed that, where almost every other talk used Cython.


Greg Ewing wrote "Pyrex" in 2002--2004..., which I guess he named
after some cooking ware. It is amazing, but to understand this you
must take a quick tour of Extending and embedding and the Python/C API reference. Pyrex let you write basically Python-ish code that gets magically turned into C extension code.

In 2007 I forked it (discuss why briefly) and named Cython after this punk rock guy.

At that time, Robert Bradshaw and Stefen Behnel spent a lot of time improving Cython, implementing tons of _optimizations_ and new features.

Cython is now very popular in the "Scientific computing using Python" world. It is also heavily used in Sage.

Are You Serious?

If you want to use a computer for math research, and you are serious (not some lazy person who fiddles then gives up), you will likely run into situations where you need code to run fast. Writing such code only in Python (or any other interpreter) is often impossible.

If you want to write fast code on a computer, and don't want to mess with assembler, the only option right now is C, or something with equivalent speed... Cython! By "fast" I mean 100-1000 times what you'll get out of Python on certain tasks. I also mean code that is evil, scary, and dangerous... if you aren't careful with preconditions.

Compiled versus Interpreted Code

Here's how interpreter code usually runs.
    1. Check a bunch of conditions then do one single thing.
    2. Check a bunch of conditions then do one single thing.
    10^6. Check a bunch of conditions then do one single thing.
Here's how compiled (C, Cython, etc.) can can be written:

    1. Check some conditions (optional, but a good idea);
    2. Do very unsafe stuff with no checks at all (but they 
       in theory should be safe given 1).
    10^6. Do very unsafe stuff with no checks at all (but they 
       in theory should be safe given 1).
The problem is that all the checks in step 1 (in either case) can easily take over 100 times as long as "do very unsafe stuff".

sage: def sum_sage(n):
...       s = 1
...       for i in range(n):
...           s += i
...       return s
sage: timeit('sum_sage(100000r)')
5 loops, best of 3: 84.6 ms per loop
sage: %python
sage: def sum_python(n):
...       s = 1
...       for i in range(n):
...           s += i
...       return s    
sage: 84.6/5.88
sage: timeit('sum_python(100000r)')
125 loops, best of 3: 5.88 ms per loop
sage: %cython
sage: def sum_cython(int n):
...       cdef int i, s = 1
...       for i in range(n):
...           s += i
...       return s
sage: timeit('sum_cython(100000r)')
625 loops, best of 3: 61.6 ┬Ás per loop
sage: 5.88 / 0.061

Let me explain what's going in the above. How, e.g., in the first one (sum_sage), the program is doing a sort of monologue: "I have to add a Python int to a Sage int. I don't have any code to do that directly (that would get too complicated, and they are so big and complicated and different objects, and they might change, oh my). So I'll convert the Python int to Sage int, because that's the only conversion I know. OK, I do that via (it used to be base 10 string parsing!) some code Gonzalo Tornaria wrote that is scary complicated... and once that is done, I got my new MPIR-based Sage integer, which I think add to s. The addition takes some memory that points to the two MPIR integers, and since Python numbers are supposed to be immutable, I make yet another MPIR number (wrapped in a Python object), which is the result of asking MPIR to add them. MPIR numbers are also very complicated objects, involving stuff like limbs, and C structs, which hardly anybody fully understands. Despite these integers happening to be small, there is still quite some overhead in the addition, but it happens (taking a small fraction of the total runtime). Then we move on to the next step in the loop!"

With sum_python, the loop is similar, but MPIR isn't involved, and there are no conversions. This buys a 14-fold speedup. But it is still not super fast, since many new Python objects get created, the code is for "potentially huge integers", hence a potentially complicated data structure has to be checked for, etc.

With sum_cython, the integers are only C ints, which are a 32 or 64-bit location in memory. Doing "s += i" just modifies in place that position in memory. There's no conversions or type checks done at all at run time. It's really fast... 1386 times faster than the first version!!!

Key point: If you truly understand what is going on, you'll see that this isn't Sage being fundamentally broken. Instead, you'll hopefully be able to look at a block of Sage code and have a clue about how to figure out what it is really doing in order to see whether writing a new implementation of the same algorithm using Cython (which will likely mean directly working with C level data structures) is likely to give you a big speedup. If you look at the innermost statement in a loop, and there's a big monologue about what is really going on, then you might get a 1000-fold speedup by using Cython.

In mathematics, general theorems -- once we have them -- are almost always much better than proofs of special cases. In math, proving a special case can often seem more awkward and unnatural than proving the general case (e.g., how would you proof that ever integer of the form a^2 + 7*a + 5 factors uniquely as a product of primes!?). With general theorems in math, the statements are often simple and clear so applying them is easier than applying theorems that are only about some very special case, which has often more elaborate hypothesis. In mathematics, usually a general theorem is simply all around much better than a theorem about some very special cases (especially if both are available).

In contrast, when writing computer programs, algorithms to solve very general cases of problems often have significant drawbacks in terms of speed (and sometimes complexity) over algorithms for special cases. Since you are mathematicians, you should constantly guard against your instincts from math research which can point you in exactly the wrong direction for writing very fast code. Often implementations of very general algorithms _are_ easier to understand, and are much less buggy than a bunch of implementations of special cases. However, there are also usually very severe performance penalties in implementing only the general case. Watch out.

A huge part of understanding the point of Cython for writing fast math code is that you must accept that you're going to write a lot of "ugly" (from a mathematicians perspective) code that only deals with special cases. But it's beautiful from the perspective of somebody who absolutely needs fast code for a specific research application; your fast code can lead to whole new frontiers of research.

Continuing the example from above:
sage: sum_python(10^8)
sage: sum_cython(10^8)

Yep, we just implemented only a special case in Cython!


Useful things to do if you want Cython enlightenment (all at once, no order):
  • Definitely read/skim Extending and try all examples. This is critical if you want to understand Cython with any depth at all. Don't think: I don't need to know any of this, because I have Cython. Yes, after you play with this a bit you may never explicitly use it. But you do need to know it.
  • Oh, definitely learn the C language if you haven't already. This is
    the book. There are courses. It's crazy not to learn C, anyways, since it (along with C++) is hugely popular, and a massive amount of code is written in C/C++. (See, e.g., where C and C++ together are the most popular, by a long margin.)

  • Obviously, read until it is burned into your brain.

  • Look at the code that Cython generates (use "cython -a" for a nice HTML view).