Monday, November 15, 2010

Brief History and Motivation Behind the Sage Coercion Model

In Sage (like in Magma), most objects are either elements or parents. Think of a "parent" as a set. This Parent/Element idea is a powerful algebraic approach to implementing mathematical objects on a computer, which does not exist in Mathematica, Maple, PARI, Maxima, and many other math software platforms.

I learned about this approach from using Magma:

%magma
R<x> := PolynomialRing(Integers());
print Parent(x);
///
Univariate Polynomial Ring in x over Integer Ring

(In this blog post I'll put %magma above code that gets input to Magma; all other code gets input to Sage. The input and output is separated by ///.)

In Sage:

R.<x> = ZZ[]
parent(x)
///
Univariate Polynomial Ring in x over Integer Ring

x.parent()
///
Univariate Polynomial Ring in x over Integer Ring

isinstance(ZZ, Parent)
///
True

isinstance(2, Parent)
///
False

Automatic Coercions:

"The primary goal of coercion is to be able to transparently do arithmetic, comparisons, etc. between elements of distinct parents."

When I used to try to get people to use Magma, perhaps the number one complaint I heard about Magma was that doing arithmetic with objects having distinct parents was difficult and frustrating.

For the first year, in Sage, there was a very simple coercion system:

If you try to compute a + b or a * b, first somehow put b into the parent of a, then do the arithmetic.

That seriously sucked. E.g.,

Mod(2,7) + 6

was completely different than

6 + Mod(2,7)!

The first was Mod(1,7), and the second was the integer 8. This makes understanding code difficult and unpredictable.

So I rewrote coercion to be a bit better (this was a really painful rewrite that I mostly did myself over several hard months of work):

If you try to compute a + b (or a*b), check for a "canonical coercions" from the parent of a into b, or failing that, from the parent of b into a. If there aren't any raise an error. If there is one, use it. There won't be both unless there is some canonical isomorphism.
There are some axioms about what a canonical coercion is. At least it is homomorphism.

Then we decided that there is a canonical homomorphism Z --> Z/7Z, but there is not one Z/7Z --> Z since there is no ring homomorphism in this direction, hence the above makes sense in either order.

One implication of this new model was that parent objects have to be immutable, i.e., you can't fundamentally change them after you make them. This is why in Sage you must specify the name of the generator of a polynomial ring at creation time, and can't change it. In Magma, it is typical to specify the name only later if you want.

Objects must be immutable because the canonical maps between them depend on the objects themselves, and we don't want them to just change left and right at runtime.

%magma
R := PolynomialRing(RationalField(), 2);
f := R.1^3 + 3*R.2^3 - 4/5;
print f;
///
$.1^3 + 3*$.2^3 - 4/5
[ $.1, $.2 ]

%magma
AssignNames(~R, ["x", "y"]);
print f;
[R.1, R.2]
///
x^3 + 3*y^3 - 4/5
[x, y]

%magma
AssignNames(~R, ["z", "w"]);
print f;
///
z^3 + 3*w^3 - 4/5

Now in Sage:

R = PolynomialRing(QQ)
///
TypeError: You must specify the names of the variables.

R.<x,y> = PolynomialRing(QQ)
f = x^3 + 3*y^3 - 4/5; f
///
x^3 + 3*y^3 - 4/5

Note: In Sage, you can can use a with block to temporarily change the names if you really need to for some reason. This is allowed since at the end of the with block the names are guaranteed to be changed back.

with localvars(R, ['z','w']):
    print f
print "back?", f    
///
z^3 + 3*w^3 - 4/5
back? x^3 + 3*y^3 - 4/5

But this new model had a major problem too, e.g., if x in Z[x] then "x + 1/2" would FAILS! This is because 1/2 does not coerce into Z[x] (the parent of x), and x does not coerce into Q (the parent of 1/2).

Maybe the implementors of Magma have the answers? Evidently not.

%magma
R<x> := PolynomialRing(Integers());
x + 1/2;
///
Runtime error in '+': Bad argument types
Argument types given: RngUPolElt[RngInt], FldRatElt

Robert Bradshaw did though, and now it is in Sage:

R.<x> = ZZ[]
x + 1/2
///
x + 1/2

His new design is (for the most part) what Sage actually uses now.

He launched an effort in 2008 (see the Dev Days 1 Wiki) to implement a rewrite of the coercion model to his new design. This ended up swallowing up half the development effort at the workshop, and was a massive amount of work, since every parent structure and element had to have some modifications made to it.

This meant people changing a lot of code all over Sage that they didn't necessarily understand, and crossing their fingers that the doctest test suite would catch their mistakes. This was SCARY. After much work, none of this went into Sage. It was just way too risky. This failure temporarily (!) burned out some developers.

Robert Bradshaw, on the other hand, persisted and came up with a new approach that involved migrating Sage code gradually. I.e., he made it so that the old coercion model was still fully supported simultaneously with the new one, then he migrated a couple of parent structures, and got the code into Sage. I'm sure not everything is migrated, even today. There are two points to what he did:

He extended the rules so x + 1/2 works, i.e., the result of a+b need not live in the parent of a or the parent of b.
He made implementing coercion much more top down: simply implement various methods in a class that derives from Parent. This meant that instead of coercion being rules and conventions that people have to understand and implement in their own code all over Sage, they just implement a small amount of code and the rules (and benefits) are all enforced automatically.

The Coercion Model

The coercion model is explained here: http://sagemath.org/doc/reference/coercion.html

Monday, November 8, 2010

Getting Started With Cython

Quote about Cython:

Andrew Tipton says "I'm honestly never going back to writing C again. Cython gives me all the expressiveness of Python combined with all the performance and close-to-the-metal-godlike-powers of C. I've been using it to implement high-performance graph traversal and routing algorithms and to interface with C/C++ libraries, and it's been an absolute amazing productivity boost." Yep.

Cython has two major use cases

Extending the CPython interpreter with fast compiled modules,
Interfacing Python code with external C/C++ libraries.

Cython supports type declarations

For changing code from having dynamic Python semantics into having static-and-fast (but less generic) C semantics.
Directly manipulating C data types defined in external libraries.

Tutorial: Building Your First Cython Code by Hand

It happens in two stages:

A .pyx file is compiled by Cython to a .c or .cpp file.
The .c or .cpp file is compild by a C compiler (such as GCC) to a .so file.

Let's try it now:
First, create a file sum.pyx that contains the following code (see this directory for original code files):

def sum_cython(long n):
    cdef long i, s = 0
    for i in range(n):
        s += i
    return s

Then use Cython to compile it:

Since we're using Sage, you can do

bash$ sage -cython sum.pyx
bash$ ls
sum.c  sum.pyx

Notice the new file sum.c. Compile it with gcc as follows on OS X:

bash$ sage -sh
bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -bundle -undefined dynamic_lookup sum.c -o sum.so

On Linux, do:

bash$ sage -sh
bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -shared -fPIC sum.c -o sum.so

Finally, try it out.

You must run Sage from the same directory that contains the file sum.so. When you type import sum below, the Python interpreter sees the file sum.so, opens it, and it contains functions and data that define a compiled "Python C-extension module", so Python can load it (like it would like a module like sum.py).

bash$ sage
-------------------------------------------------
| Sage Version 4.6, Release Date: 2010-10-30     
| Type notebook() for the GUI, and license() for
-------------------------------------------------
sage: import sum
sage: sum.sum_cython(101)
5050
sage: timeit('sum.sum_cython(101)')
625 loops, best of 3: 627 ns per loop
sage: timeit('sum.sum_cython(101)', number=10^6)    # better quality timing
1000000 loops, best of 3: 539 ns per loop

Finally, take a look at the (more than 1000 line) autogenerated C file sum.c:

bash$ wc -l sum.c
    1178 sum.c
bash$ less sum.c

...

Notice code like this, which illustrates that Cython generates code that supports both Python2 and Python3:

#if PY_MAJOR_VERSION < 3
  #define __Pyx_BUILTIN_MODULE_NAME "__builtin__"
#else
  #define __Pyx_BUILTIN_MODULE_NAME "builtins"
#endif

#if PY_MAJOR_VERSION >= 3
  #define Py_TPFLAGS_CHECKTYPES 0
  #define Py_TPFLAGS_HAVE_INDEX 0
#endif

The official Python docs say: "If you are writing a new extension module, you might consider Cython. It translates a Python-like language to C. The extension modules it creates are compatible with Python 3.x and 2.x."

If you scroll down further you'll get past the boilerplate and see the actual code:

...
  /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":2
 * def sum_cython(long n):
 *     cdef long i, s = 0             # <<<<<<<<<<<<<<
 *     for i in range(n):
 *         s += i
 */
  __pyx_v_s = 0;

  /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":3
 * def sum_cython(long n):
 *     cdef long i, s = 0
 *     for i in range(n):             # <<<<<<<<<<<<<<
 *         s += i
 *     return s
 */
  __pyx_t_1 = __pyx_v_n;
  for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
    __pyx_v_i = __pyx_t_2;

    /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":4
 *     cdef long i, s = 0
 *     for i in range(n):
 *         s += i             # <<<<<<<<<<<<<<
 *     return s
 */
    __pyx_v_s += __pyx_v_i;
  }
...

There is a big comment that shows the original Cython code with context and a little arrow pointing at the current line (these comment blocks with context were I think the first thing I personally added to Pyrex... before, it just gave that first line with the .pyx filename and line number, but nothing else). Below that big comment, there is the actual C code that Cython generates. For example, the Cython code s += i is turned into the C code __pyx_v_s += __pyx_v_i;.

The Same Extension From Scratch, for Comparison

If you read Extending and Embedding Python you'll see how you could write a C extension module from scratch that does the same thing as sum.so above. Let's see what this is like, for comparison. Given how simple sum.pyx is, this isn't so hard. When creating more complicated Cython code---e.g., new extension classes, more complicated type conversions, and memory management---writing C code directly quickly becomes unwieldy.
First, create a file sum2.c as follows:

#include <Python.h>

static PyObject * 
sum2_sum_c(PyObject *self, PyObject *n_arg)
{
    long i, s=0, n = PyInt_AsLong(n_arg);
    
    for (i=0; i<n; i++)  {
 s += i;
    }
    PyObject* t = PyInt_FromLong(s);
    return t;
}

static PyMethodDef Sum2Methods[] = {
    {"sum_c", sum2_sum_c, METH_O, "Sum the numbers up to n."},
    {NULL, NULL, 0, NULL} /* Sentinel */
};

PyMODINIT_FUNC
initsum2(void)
{
    PyObject *m;
    m = Py_InitModule("sum2", Sum2Methods);
}

Now compile and run it as before:

bash$ sage -sh

bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -bundle -undefined dynamic_lookup sum2.c -o sum2.so 
bash$ sage
...
sage: import sum2
sage: sum2.sum_c(101)
5050
sage: import sum
sage: sum.sum_cython(101)
5050
sage: timeit('sum.sum_cython(1000000r)')
125 loops, best of 3: 2.54 ms per loop
sage: timeit('sum2.sum_c(1000000r)')
125 loops, best of 3: 2.03 ms per loop

Note that this is a little faster than the corresponding Cython code. This is because the Cython code is more careful, checking various error conditions, etc.

Note that the C code is 5 times as long as the Cython code.

Building Extensions using Setuptools Instead

In nontrivial projects, the Cython step of transforming your code from .pyx to .c is typically done by explicitly calling cython somehow (this will change in the newest version of Cython), but the step of running the C compiler is usually done using either distutils or setuptools. To use the tools, one creates a file "setup.py" which defines the extensions in your project, and Python itself then runs a C compiler for you, with the proper options, includes paths, etc.

Let's create a new setuptools project that includes the sum and sum2 extensions that we defined above. First, create the following file and call it setup.py. This should be in the same directory as sum.c and sum2.c.

from setuptools import setup, Extension

ext_modules = [
    Extension("sum", ["sum.c"]),
    Extension("sum2", ["sum2.c"])
    ]

setup(
    name = 'sum',
    version = '0.1',
    ext_modules = ext_modules)

Then type

bash$ rm *.so  # make sure something happens
bash$ sage setup.py develop
...
bash$ ls *.so
sum.so sum2.so

Notice that running

setup.py develop

resulted in Python generating the right gcc commmand lines for your platform. You don't have to do anything differently on Linux, OS X, etc.

If you change sum2.c, and want to rebuild it, just type sage setup.py develop again to rebuild sum2.so If you change sum.pyx, you have to manually run Cython:

sage -cython sum.pyx

then again do sage setup.py develop to rebuild sum.so. Try this now. In sum.pyx, change

for i in range(n):

for i in range(1,n+1):

then rebuild:

bash$ sage -cython sum.pyx
...
bash$ sage setup.py develop
...
bash$ sage
...
sage: import sum
sage: sum.sum_cython(100)
5050

There are ways to make setup.py automatically notice when sum.pyx changes, and run Cython. A nice implementation of this will be in the next Cython release. See the setup.py and build_system.py files of Purple sage for an example of how to write a little build system write now (before the new version of Cython).

An Automated Way to Experiment

Given any single Cython file such as sum.pyx, in Sage you can do

sage: load sum.pyx
Compiling sum.pyx...
sage: sum_cython(100)
5050

Behind the scenes, Sage created a setup.py file, ran Cython, made a new module, compiled it, and imported everything it defines into the global namespace. If you look in the spyx subdirectory of the directory listed below, before you exit Sage (!), then you'll see all this.

sage: SAGE_TMP
'/Users/wstein/.sage//temp/deep.local/14837/'

You can also do

sage: attach sum.pyx

Then every time sum.pyx changes, Sage will notice this and reload it. This can be useful for development of small chunks of Cython code.

You can also use the Sage notebook, and put %cython as the first line of a notebook cell. The rest of the cell will be compiled exactly as if it were written to a .pyx file and loaded as above. In fact, that is almost exactly what happens behind the scenes.

Next Time

Now that we understand at a reasonably deep level what Cython really is and does, it is time to learn about the various constructs of the language:

How to create extension classes using Cython.
How to call external C/C++ library code.

We will rewrite our sum.pyx file first to use a class. Then we'll rewrite it again to make use of the MPIR (or GMP) C library for arithmetic, and again to make use of the C++ NTL library.

Wednesday, November 3, 2010

Cython, Sage, and the Need for Speed

Cython seriously rocks, at least for much of what I need. It's still the killer feature of Python/Sage, IMHO. And meetings like EuroScipy last summer really confirmed that, where almost every other talk used Cython.

History

Greg Ewing wrote "Pyrex" in 2002--2004..., which I guess he named
after some cooking ware. It is amazing, but to understand this you
must take a quick tour of Extending and embedding and the Python/C API reference. Pyrex let you write basically Python-ish code that gets magically turned into C extension code.

In 2007 I forked it (discuss why briefly) and named Cython after this punk rock guy.

At that time, Robert Bradshaw and Stefen Behnel spent a lot of time improving Cython, implementing tons of _optimizations_ and new features.

Cython is now very popular in the "Scientific computing using Python" world. It is also heavily used in Sage.

Are You Serious?

If you want to use a computer for math research, and you are serious (not some lazy person who fiddles then gives up), you will likely run into situations where you need code to run fast. Writing such code only in Python (or any other interpreter) is often impossible.

If you want to write fast code on a computer, and don't want to mess with assembler, the only option right now is C, or something with equivalent speed... Cython! By "fast" I mean 100-1000 times what you'll get out of Python on certain tasks. I also mean code that is evil, scary, and dangerous... if you aren't careful with preconditions.

Compiled versus Interpreted Code

Here's how interpreter code usually runs.

    1. Check a bunch of conditions then do one single thing.
    2. Check a bunch of conditions then do one single thing.
    ...
    10^6. Check a bunch of conditions then do one single thing.

Here's how compiled (C, Cython, etc.) can can be written:

    1. Check some conditions (optional, but a good idea);
    2. Do very unsafe stuff with no checks at all (but they 
       in theory should be safe given 1).
    ...
    10^6. Do very unsafe stuff with no checks at all (but they 
       in theory should be safe given 1).

The problem is that all the checks in step 1 (in either case) can easily take over 100 times as long as "do very unsafe stuff".

TYPICAL EXAMPLE:

sage: def sum_sage(n):
...       s = 1
...       for i in range(n):
...           s += i
...       return s
sage: timeit('sum_sage(100000r)')
5 loops, best of 3: 84.6 ms per loop
sage: %python
sage: def sum_python(n):
...       s = 1
...       for i in range(n):
...           s += i
...       return s    
sage: 84.6/5.88
14.3877551020408
sage: timeit('sum_python(100000r)')
125 loops, best of 3: 5.88 ms per loop
sage: %cython
sage: def sum_cython(int n):
...       cdef int i, s = 1
...       for i in range(n):
...           s += i
...       return s
sage: timeit('sum_cython(100000r)')
625 loops, best of 3: 61.6 µs per loop
sage: 5.88 / 0.061
96.3934426229508

Let me explain what's going in the above. How, e.g., in the first one (sum_sage), the program is doing a sort of monologue: "I have to add a Python int to a Sage int. I don't have any code to do that directly (that would get too complicated, and they are so big and complicated and different objects, and they might change, oh my). So I'll convert the Python int to Sage int, because that's the only conversion I know. OK, I do that via (it used to be base 10 string parsing!) some code Gonzalo Tornaria wrote that is scary complicated... and once that is done, I got my new MPIR-based Sage integer, which I think add to s. The addition takes some memory that points to the two MPIR integers, and since Python numbers are supposed to be immutable, I make yet another MPIR number (wrapped in a Python object), which is the result of asking MPIR to add them. MPIR numbers are also very complicated objects, involving stuff like limbs, and C structs, which hardly anybody fully understands. Despite these integers happening to be small, there is still quite some overhead in the addition, but it happens (taking a small fraction of the total runtime). Then we move on to the next step in the loop!"

With sum_python, the loop is similar, but MPIR isn't involved, and there are no conversions. This buys a 14-fold speedup. But it is still not super fast, since many new Python objects get created, the code is for "potentially huge integers", hence a potentially complicated data structure has to be checked for, etc.

With sum_cython, the integers are only C ints, which are a 32 or 64-bit location in memory. Doing "s += i" just modifies in place that position in memory. There's no conversions or type checks done at all at run time. It's really fast... 1386 times faster than the first version!!!

Key point: If you truly understand what is going on, you'll see that this isn't Sage being fundamentally broken. Instead, you'll hopefully be able to look at a block of Sage code and have a clue about how to figure out what it is really doing in order to see whether writing a new implementation of the same algorithm using Cython (which will likely mean directly working with C level data structures) is likely to give you a big speedup. If you look at the innermost statement in a loop, and there's a big monologue about what is really going on, then you might get a 1000-fold speedup by using Cython.

In mathematics, general theorems -- once we have them -- are almost always much better than proofs of special cases. In math, proving a special case can often seem more awkward and unnatural than proving the general case (e.g., how would you proof that ever integer of the form a^2 + 7*a + 5 factors uniquely as a product of primes!?). With general theorems in math, the statements are often simple and clear so applying them is easier than applying theorems that are only about some very special case, which has often more elaborate hypothesis. In mathematics, usually a general theorem is simply all around much better than a theorem about some very special cases (especially if both are available).

In contrast, when writing computer programs, algorithms to solve very general cases of problems often have significant drawbacks in terms of speed (and sometimes complexity) over algorithms for special cases. Since you are mathematicians, you should constantly guard against your instincts from math research which can point you in exactly the wrong direction for writing very fast code. Often implementations of very general algorithms _are_ easier to understand, and are much less buggy than a bunch of implementations of special cases. However, there are also usually very severe performance penalties in implementing only the general case. Watch out.

A huge part of understanding the point of Cython for writing fast math code is that you must accept that you're going to write a lot of "ugly" (from a mathematicians perspective) code that only deals with special cases. But it's beautiful from the perspective of somebody who absolutely needs fast code for a specific research application; your fast code can lead to whole new frontiers of research.

Continuing the example from above:

sage: sum_python(10^8)
4999999950000001
sage: sum_cython(10^8)
887459713

Yep, we just implemented only a special case in Cython!

---------------

Useful things to do if you want Cython enlightenment (all at once, no order):

Definitely read/skim Extending and try all examples. This is critical if you want to understand Cython with any depth at all. Don't think: I don't need to know any of this, because I have Cython. Yes, after you play with this a bit you may never explicitly use it. But you do need to know it.
Oh, definitely learn the C language if you haven't already. This is
the book. There are courses. It's crazy not to learn C, anyways, since it (along with C++) is hugely popular, and a massive amount of code is written in C/C++. (See, e.g., http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html where C and C++ together are the most popular, by a long margin.)
Obviously, read http://docs.cython.org/ until it is burned into your brain.

Look at the code that Cython generates (use "cython -a" for a nice HTML view).

Sunday, October 31, 2010

How to Referee Sage Trac Tickets

The current workflow and tools for peer review in Sage are not perfect, and it's clear they can be improved in many ways. However, the current system is also stable and works well, and though it will surely evolve over time, it is well worth learning. This post is about the current Sage peer review workflow. I will restrict my discussion to tickets that involve the Sage library, not standard or optional packages or other scripts not in the Sage library.

Why Referee a Ticket?

There are a couple of reasons that you might referee a ticket:

You want to contribute to Sage development, but don't know where to start. There's a list right here of about 200 tickets (sorted by component) that need review, and probably some require only background you know something about.
You want a specific chunk of code to get into Sage that somebody else wrote. If you care about this code, you are probably qualified to review it.
Somebody asked you to review a certain ticket, perhaps in exchange for them reviewing one of your tickets.

Refereeing trac tickets is different than refereeing papers for inclusion in a journal in several ways. There is no central editor that assigns tickets to reviewers -- anybody (you, right now!) can just jump in and start refereeing tickets. Also, everything about refereeing is completely open and archived (hopefully) forever.

What To Expect

The code attached to a trac ticket can do all kinds of things, including:

Fix a little (or huge) bug.
Add documentation to a function.
Introduce a new feature to Sage (and possibly new bugs!).
Change the name of an existing function.
Raise (or lower) the "doctest coverage" of Sage.

How to Referee a Ticket

Refereeing a ticket involves several steps. The steps below are mainly aimed at tickets that add new capabilities to Sage, but I've included some remarks about other types of tickets in another section below.

Use Mercurial queues to apply all the patches on the trac ticket to your own (very recent, usually alpha) version of Sage, then type "sage -br" to rebuild the Sage library. You download
each foo.patch, then do "hg qimport foo.patch" followed by "hg qpush" (I personally use this little script). Note: this can be automated; type "sage -merge" to learn how. If you can't apply the patch, then the ticket may "need work", i.e., the author has to rebase the patches; note this and change the Action at the bottom of the ticket page to "needs work".
Test out the code to make sure the doctests work, using "sage -t affected_files" and "make ptest" in the SAGE_ROOT directory. Doing "make ptest" is a good idea, since often a change in one part of Sage breaks something far, far away that was written by somebody else long, long ago. (Again, see "sage -merge" for automating this.) If any tests fail, the ticket probably "needs work"; note this and change the action to "needs work". I often do this testing on sage.math.washington.edu, since it has 24 processors.
Make sure that every function introduced in the code, even ones that start with underscores (!), have examples that fully illustrate how they work. Make sure that all input parameters to these functions are tested in the examples, and ensure that the function is actually being used by the examples. If the patch is large you can do "sage --coverageall" both before and after applying the patches, and make sure the coverage score doesn't go down. Also, make sure the docstrings are laid out according to our standard template, e.g.,
```
"""
Short description... (one sentence usually)

Long description... (optional but sometimes good)

INPUT:
    - param1 -- description of
      the first param
    - param2 -- description of
      the second param
OUTPUT:
    - description of output

AUTHORS: 
    - Sage Developer1
    - Sage Developer2

EXAMPLES::

    sage: 2+2
    4

TESTS::

    sage: further examples nobody should look at.
"""
```
It is essential that there be two colons after "EXAMPLES" and a blank line!! Also, do not omit the INPUT/OUTPUT blocks, since I often get complaints from end users that these are missing.
If you've got this far without marking the ticket "needs work", then every function in the new code should work and have solid docstrings, including INPUT/OUTPUT blocks and examples that illustrate how the function works. Much of what you did in 1-3 is pretty routine (and yes, there are people working on trying to automate some of this). In this step, you finally get to have more fun and be creative: think seriously about whether the code is "probably correct" and sufficiently well documented to read. If not, do not give the code a positive review. The Sage library is already huge, and we don't want new code that is just going to cause a lot of pain to maintain later (there are other places for such code, such as Purple Sage). Basically, at this point, pretend you are a playtester or hacker and try to break the functionality introduced by the trac ticket. Some techniques include:
- Throw random input at the functions and see what happens.
- If you know any theorems or conjectures that leads to consistency checks on the functions, try them out. For example, if a function factors something as a product, try factoring products, then multiplying the factors back together. If the function computes the Ramanujan tau function, try it on large inputs and test the standard congruences.
- If the functions are implemented in other software (e.g., Magma or Mathematica) write a little program to systematically compare the output of Sage and the other system for some range of values. Also, compare timings; if code is 100 times slower in Sage than Magma, there better be a good reason for this.
- Very often when reading code, your "devious mind" will think of ways to potentially break it. Make sure that you have a sage prompt up, so you can try out your ideas for breaking the code. If anything succeeds at breaking the code, put that in your report and set the trac Action to "Needs work". Also, if you think of good doctests to add, note these.

Special Cases

As mentioned above, most of the steps above assume that the code attached to the trac ticket is implementing new features.

If the code fixes a bug, then if at all possible, there should be a new doctest that illustrates that the bug is fixed. This doctest should mention the relevant trac ticket.
If a trac ticket mainly adds new documentation, pay special attention to the English grammar, writing, etc. In my experience, almost nobody (definitely not me!) ever writes a few paragraphs of English without making several mistakes, typos, etc.
If the code changes the name of an existing function, then this must happen in accordance with our deprecation policy. You have to leave the old function name in place and add a deprecation warning. This can be removed after ONE YEAR. The proper way to do this is to put the following at the top of the function:
```
from sage.misc.misc import deprecation
deprecation("function_name is deprecated...")
```
Also, you may want to put in the Sage version number to give people a clue about when to remove the function (see the second argument to the deprecation function).

The core Sage library is meant to be mature, stable and highly polished, and function names, class names, etc., are not allowed to just be changed left and right. There are other places to post code that uses Sage, but which is not yet stable.

Posting Patches as the Referee

Often when going through the above steps, it is easiest to just make changes directly to the source code. E.g., if you see documentation misspelled, or think of examples to add. Just make a new patch that has all these changes and post a "referee patch". If you do this, then you should notify the original patch author(s), so they can referee your referee patch. This can get a little confusing, but usually sorts itself out.

Summary

I hope you can see from the above that refereeing tickets can be really fun. Think of it as a game to break whatever is posted on a trac ticket. It's much better if you find issues now when the author is still interested in the code, rather than some poor user finding bugs a few years later when the author of the code has moved on to other things and forgotten everything about the code. The core Sage library can only someday become high quality with your help as referees. And getting substantial code into Sage itself can only have any hope of attaining a similar status to publishing in a peer reviewed journal if this process of code refereeing itself gets refined, documented, and results in good quality mathematical software.

Remember, the goal is that only high quality code goes into Sage that satisfies the requirements outlined above; this is much more important than worrying about following "bureaucratic protocols".

Tuesday, October 26, 2010

Sage as a Python Library?

When I started Sage I viewed it as a distribution of a bunch of math software, and Python as just the interpreter language I happen to use at the time. I didn't even know if using Python as the language would last. However, it's also possible to think of Sage as a Python library.

Anyway, it has occurred to me (a few times, and again recently) that it would be possible to make much of the Sage distribution, without Python of course, into a Python library. What I mean is the following. You would have a big Python library called "sagemath", say, and inside of that would be a huge HG repository. In that repository, one would check in the source code for many of the standard Sage spkg's... e.g., GAP, Pari, etc. When you type

python setup.py install

then GAP, Pari, etc., would all get built, controlled by some Python scripts, then installed as package_data in the sagemath directory of /site-packages/.

From a technical perspective, I don't see any reason why this couldn't be made to work. HG can handle this much data, and "python setup.py install" can do anything. It does lead to a very different way of looking at Sage though, and it could help untangle things in interesting ways.

(1) Have a Python library called "sagecore", which is just the most important standard spkg's (e.g., Singular, PARI, etc.), perhaps eventually built *only* as shared object libraries (no standalone interpreters).

(2) Have a Python library which is the current Sage library (we already have this), and which can be installed assuming sagecore is installed.

(3) Have other Python libraries (like psage: http://code.google.com/p/purplesage/source/browse/), which depend on (2). Maybe a lot of the "sage-combinat" code could also be moved to such a library, so they can escape the "combinat patch queue" madness. Maybe many other research groups in algebraic topology, differential geometry, special functions, etc., will start developing such libraries... on their own, and share them with the community (but without having to deal directly with the sage project until they want to).

To emphasize (3), when people want to write a lot of mathematics code in some area, e.g., differential geometry, they would just make a new library that depends on Sage (the library in (2)). We do the work needed to make it easy for people to write code outside of the Sage library, which depends on Sage. Especially writing Cython code like this can be difficult and confusing, and we don't explain it all in any Sage documentation. It actually took me quite a while to figure out how to do it today (with psage).

The core Sage library (2) above would continue to have a higher and higher level of code review, tough referee process etc. However, the development models for (3) would be up to the authors of those libraries.

The above is already how the ecosystem with Python (http://pypi.python.org/pypi), Perl (http://www.cpan.org/), R, etc., work. Fortunately, Python has reasonably good support already for this.

I think without a shift in this direction, Sage is going to be very frustrating for people writing research oriented code.

Fortunately, it's possible to do everything I'm describing above without disturbing the mainline Sage project itself, at least for now.

Monday, October 25, 2010

Revision Control and Sage

The main target audience of this blog post is mathematics researchers who are relatively new to revision control in the context of Sage.

Background

In the 1990s when I first started using revision control, it was a major, major pain in the arse. Making a tarball snapshot of the project directory with a date-stamp was better! In 2005 when I started Sage, I didn't know any better, and didn't both with revision control for the first year. In 2006, Gonzalo Tornaria told me that things had improved, and got me to use Darcs. That lasted just over a year, when I switched Sage to use Mercurial. The switch was due to major headaches installing Darcs (a Haskell program), and Darcs hanging when our repo got too big and complicated.

We are fortunate that today many difficult problems and design decisions for revision control have been addressed through lots of work. Revision control is amazingly good now compared to what it was 15 years ago!

Some milestones: RCS, CVS, Subversion, Bitkeeper & Monotone, Darcs, Git/Mercurial/Bazaar

A reference: see the Mercurial guide.

Getting started with Mercurial

In sage we use "hg" = (Mercurial). Why?

- Because I chose it in early 2007, when Darcs started sucking too much
- Distributed is the way to go these days. (Modern)
- Bazaar far too unstable in 2007, to put it mildly. Bazaar is by the people who made Ubuntu, and is much more stable today.
- GIT badly documented, too hard to setup, etc... back in 2007; again, git is much better today.

---> so Mercurial was the only reasonable choice then. It was just "good enough" at the time that we could survive. Hence Mercurial.

Mercurial comes with Sage, and you can use it like so:
  - "sage -hg"
  - sage: install_scripts command
  - sage: hg_sage.[tab]

- Or Install: easy to install standalone on OS X, Linux, and Windows.

- You can do "$ sudo easy_install mercurial" to install systemwide the latest version into your default system wide Python.   It will probably appear in /usr/local/bin/hg

Documentation for hg:
- http://mercurial.selenic.com/
- http://hgbook.red-bean.com/
- http://mercurial.selenic.com/learn/
- http://hginit.com/

Project Hosting

If you create your own HG projects, you can easily host them on Google Code (and also http://bitbucket.org/).

I'm doing this a lot lately with smaller projects, including:
- Purple Sage
- Sage Notebook

- Other systems. Google code only does HG and Subversion.
But there are also similar hosting services for git and bazaar:
- github -- free hosting for git projects
- Bazaar's Launchpad

- Sourceforge: The first major free hosting site I heard of is sourceforge.net, which offers hosting for all sorts of repos, including mercurial. It is more general, but not as "clean and focused" in abilities as the above, e.g., their mailing lists are often plagued by spam, and I seem to recall a lot of banner ads.

These hosting services are a very popular and fairly new thing (but not too new); all but sourceforge became popular well after I started Sage. They provide much, much more than just what the revision control systems do; they are more than just a place to host your code repository. They all provide:

- code hosting (of course)
- bug tracking
- code review
- mailing lists, wiki's, etc.

Sage isn't hosted on one of these probably because they are so new, and Sage is older. This may change. Also, Sage is really, really big (Google code has a 2GB limit).

Main point: if you want to start a small project with a small number of other people -- example, writing a research paper and some corresponding code -- you can in seconds easily get the full suite of hosting, bug tracking, code review, etc., totally for free, backed up, high quality, etc. Obviously, you are trusting your data to either Google, Github, or Canonical, but you can easily backup most of it (e.g., the repos, mailing lists).

They want you to use them: "431,000 people hosting over 1,330,000 git repositories" (from github)

Story: In 2006, when I came to UW, Randy Leveque (a professor in applied math) found out about me running the Sage project, and wanted to know how to get Mercurial, Wiki's, trac, etc., setup, so that he could organize his software development project(s). I explained some basics, and I think he and his students setup various project infrastructure, and ran into assorted troubles. Today, I would just say "http://code.google.com" (or github, etc.) and be done with it! And Randy would be up and running with more tools than before in a matter of minutes.

Command line mercurial

The main point of the rest of this lecture will be about how to use Mercurial. It's very hard to beat the official Mercurial Quickstart, which we'll just go over together in class:

See the Mercurial Quickstart.

- Make a directory in your computer be under revision control
- Setup your .hgrc file, if you haven't already.
- Try cloning a remote repository, e.g., purple sage
- Try committing a change that spans multiple files, so you can see that all the changes are together. Then view the commit via "hg serve -p 8200".

WARNING: on OS X with the sage-included hg, this "hg serve" crashed for me with "Trace/BPT trap" when I visited http://localhost:8200. I opened trac ticket 10171.

So I did "sudo easy_install mercurial" to get a new version systemwide, and now this works for me: "/usr/local/bin/hg serve -p 8200"

A future blog post may cover more advanced HG topics, including branching, cloning, queues, and bundles.

Monday, October 4, 2010

Standalone C/C++ libraries that are included in Sage

Every copy of Sage includes many free open source C/C++ libraries, which can also be used standalone without Sage. They got into Sage because enough people really cared about using them, and felt it was worth the effort to convince other people that these should be included in Sage. Thus every single library also provides some key functionality to Sage. Thus understanding the main points about these libraries should be of interest to anybody who knows C/C++ who wants to do mathematical computation. Moreover, most of these libraries were written in C/C++ because the authors had already mastered the underlying algorithms, and really wanted an implementation that was extremely fast, often potentially orders of magnitude faster than what they might implement using only an interpreter, such as Python, Maple or Magma.

It's also possible using Cython (or even ctypes) to directly call any of the functionality of the libraries below from Python (hence Sage). Thus it is vital that you know about these libraries if you want to write fast code using the Sage framework.

When you want to look at the source code for any of these libraries, you can download the Sage source code from here or here, unpack it, look in the directory sage-x.y.z/spkg/standard/, and for any "spkg" (=tar, bzip2) file there, unpack it, e.g.,

wstein@sage:$ tar jxvf mpir-1.2.2.p1.spkg
wstein@sage:$ cd mpir-1.2.2.p1/src/
wstein@sage:$ ls
acinclude.m4     doc               missing        
...

Note that you can't unpack the spkg files included in a binary of Sage, since they are actually empty placeholders.

You can also find direct links to all of the standard spkg by going to this page, along with short descriptions of each.

Arithmetic

mpir (written in C and assembler; has a C++ interface): this is a library for doing fast arithmetic with arbitrary precision integers and rationals. It's very fast and cross-platform, and it's used by many other libraries. (NOTE: MPIR was started as an angry fork" of GMP=Gnu Multiprecision Library. GMP is used by Magma, Maple, and Mathematica; I've been told Mathematica doesn't use MPIR because of the overlap of Sage and MPIR developers, and their general suspicion toward the Sage project.)
mpfr (C library with C++ interface): this is a library for doing arithmetic with fixed precision floating point numbers, i.e., decimals with a given fixed choice of precision. "Floating-point numbers are a lot like sandpiles: Every time you move one you lose a little sand and pick up a little dirt." [Kernighan and Plauger]. MPFR is used by Magma (for details, google for "mpfr magma"), and several other libraries and systems out there. The number theory system and library PARI does *NOT* use MPFR; since PARI also does arbitrary precision floating point arithmetic, you see that PARI includes alternative implementations of some of the same algorithms (note: PARI also has its own alternative to MPIR). MPFR is unusual in that the main author -- Paul Zimmerman -- is unusually careful for somebody who works with floating point numbers, and my understanding is that all the algorithms in MPFR are proven to give the claimed precision. Basically, MPFR generalizes IEEE standards (that define what floating point arithmetic means) to arbitrary precision. MPFR also supports numerous "rounding modes". The mpmath Python library is in some cases faster than MPFR, but it has very different standards of rigor.
mpfi (C library): mpfi is built on top of MPFR, and implements arbitrary precision floating point "interval arithmetic". Interval arithmetic is an idea where you do arithmetic on intervals [a,b],
and apply functions to them. Given two intervals, their sum, product, quotient, etc., is another interval. The actual "feel" of using this is that you're working with floating point numbers, and as you lose precision due to rounding errors, the computer keeps track of the loss every step of the way... and in some cases you can quickly discover you have no bits left at all! MPFI is nicely integrated into Sage (by Carl Witty), and is a pretty sophisticated and robust implementation of interval arithmetic, on which some other things in Sage are built, such as the field QQbar of all algebraic numbers.
flint (C library): "Flint" is an abbreviation for "Fast Library for Number Theory". However, I've listed it here under basic arithmetic, since so far in Sage at least we only use FLINT for arithmetic with polynomials over Z[x] and (Z/nZ)[x]. (Sage should use it for Q[x], but still doesn't; see the horendously painful trac ticket, number 4000!) The main initial challenge and reason for existence of flint was to multiply polynomials in Z[x] more quickly than Magma (hence any other software on the planet), and FLINT succeeded. Now FLINT also does many other relevant polynomial algorithms, e.g., GCD, addition, etc. Most new development is going into "FLINT2", which is a total rewrite of FLINT, with an incompatible C interface. Bill Hart's vision is that eventually FLINT2 have capabilities sort of like PARI. Sage users have found at least one serious bug involving f,g in Z[x] where f*g was computed incorrectly.
znpoly (C libary): (by David Harvey) one version of znpoly is included in FLINT, and another is a standalone package in Sage. The point of this library is to do fast arithmetic in (Z/nZ)[x]. ("Monsky-Washnitzer!" we once found a very subtle bug in znpoly-in-FLINT that only appeared when multiplying polynomials of huge degree.)
givaro (C++ library): in Sage we use it entirely for fast arithmetic over small-cardinality finite fields, up to cardinality "2 to the power 16". Givaro makes a log table and does all multiplications as additions of small ints. Givaro actually does a lot more, e.g., arithmetic with algebraic numbers, and a bunch of matrix algorithms, including Block Wiedemann, which we don't even try to make available in Sage.

NOTE: The C library API conventions of mpir (gmp), mpfr, mpfi, flint and zn_poly are all pretty similar. Once you learn one, the others are comfortable to use. Givaro is totally different than the others because it is a C++ library, and hence has operator overloading available. The upshot: in Givaro, you type "c=a+b" to add two numbers, but in the other libaries (when used from C) you type, maybe "mpz_add(c,a,b)".

Number Theory

pari (C library; interpreter): a hugely powerful C library (and interactive interpreter) for number theory. This library is very, very good and fast for doing computations of many functions relevant to number theory, of "class groups of number fields", and for certain computations with elliptic curves. It also has functions for univariate polynomial factorization and arithmetic over number fields and finite fields. PARI has a barbaric stack-based memory scheme, which can be quite annoying. Key Point: PARI is easy to build on my iPhone, which says something. That said, we have Neil Sloane, 2007: ``I would like to thank everyone who responded to my question about installing PARI on an iMAC. The consensus was that it would be simplest to install Sage, which includes PARI and many other things. I tried this and it worked! Thanks! Neil (It is such a shock when things actually work!!)'' PARI is very tightly integrated into Sage.
ntl (C++ library): a C++ library aimed at providing foundations for implementing number theory algorithms. It provides C++ classes for polynomials, vectors, and matrices over ZZ, Z/pZ, GF(q), (Z/pZ)[x], and RR, and algorithms for each. Some algorithms, e.g., for polynomial factorization in ZZ[x], are very sophisticated. Others, e.g., Hermite form of matrices over ZZ, are naive. In fact, most of the matrix algebra in NTL is naive, at least compared to what is in IML, Linbox, Sage and Magma. NTL is very stable and considered "done" by its author.
eclib (C++, standalone program): This is John Cremona's prized collection of C++ code that he wrote for computing elliptic curves, both finding them (using "modular symbols") and computing tricky things about them (via the "mwrank" program). It's a large amount of C++ code that Cremona developed over nearly 2 decades. There is still much functionality in there that could be made more readily available in Sage via Cython, but for which nobody has put in the effort to do so (I'm sure Cremona would be very helpful in providing guidance).
lcalc (C++ library, standalone program): This is Mike Rubinstein's C++ library for computing with "motivic L-functions", which are generalizations of the Riemann zeta function, of great importance in number theory. It's particularly good at computing tons of zeros of such functions in their critical strip, hence getting information about generalizations of the Riemann Hypothesis. It's the best general purpose L-functions program for computing zeros.
sympow (standalone C program): Though technically not a library it could be adapted into one. Sympow is a C program written by Mark Watkins for quickly computing special values of symmetric power L-functions (so related to what lcalc does). It has no dependencies (instead of PARI), because Mark didn't want to have to license sympow under the GPL.
ratpoints (C library): Michael Stoll's highly optimized C program for searching for certain rational points on hyperelliptic curves (i.e. of the form y^2 = f(x)). This library goes back to the early 1990s and work of Elkies. I believe this is in Sage for one reason only: Robert Miller wanted to implement 2-descent, and he needed this. (Unfortunately, Miller's implementation is only done in the case when there is a rational 2-torsion point.) This library could be made more useful in Sage, via making it so functionality in sage.libs.ratpoints is easily available for hyperelliptic curves...
genus2reduction (standalone C program that uses PARI): Not technically a library. This a C program that depends on PARI, and quickly computes "stuff" about genus 2 curves over the rational numbers... which as far as I know is fairly unique in what it does. A. Brumer used it recently for a large scale computation, and pointed out that it has stability issues.
flintqs (C++ program): This is a standalone C++ program that implements the quadratic sieve algorithm for integer factorization (which is good at factoring p*q with p and q of similar size and p*q up to maybe 100 digits). I think that this algorithm is also implemented in PARI, though in some cases flintqs is faster. A silly fact is that a better version of this program is evidently in FLINT itself, but for some reason Sage doesn't use it.
ecm (C library and C program; uses MPFR, MPIR): This implements the elliptic curve integer factorization algorithm, which Hendrik Lenstra invented, and is good at factoring p*n with p a prime with up to 30 (?) or so digits. There is another highly optimized implementation of the same algorithm as part of PARI, but ECM is in some (all?) cases better. It also has a huge number of tunable parameters. The ECM program is also often run standalone by people, sometimes on clusters, to attempt to factor integers. NOTE: Sage has a "factor" command, which just calls PARI. One could imagine it calling ECM and flintqs, etc., but nobody has made it do so... though many have tried.

Linear Algebra

IML (C library): "Integer matrix library". This library solves A*X = B, with A and B matrices over the integers. It does this by solving "A*X = B (mod p^n)" for several prime powers p^n, which involves computing the reduced row echelon form of A mod p (for various p) quickly... which is in turn done via matrix multiplication over matrices with double precision entries. It is the fastest software in the world for solving AX=B in many cases, especially as the entries of A and B get huge.
Linbox (C++ library): asymptotically fast black box linear algebra algorithms. It was just a "research sandbox", but "we" (mainly Clement Pernet) got Linbox whipped into shape enough to be of use in Sage for certain things. It is used at least for: matrix multiplication and computation of characteristic polynomials of matrices over the integers and finite fields, and asymptotically fast echelon form of matrices over finite field. I don't think it is used for much else yet. Implementing fast versions of the above in general is a gargantuan task, and we had done much of it (but no good charpoly!) natively in Sage before using Linbox, so now multiple implementations are available -- Linbox's are usually faster. Until Linbox/Sage, there were no fast open source programs for asymptoically fast exact linear algebra -- Magma was the only game in town (neither Maple or Mathematica are good at this either).
m4ri (C library): "Mary" -- scary fast arithmetic with matrices over the finite field GF(2) with 2 elements. This library uses various tricks (involving so-called "Gray codes") and asymptotically fast divide-and-conquer algorithms to compute reduced row echelon form and do matrix multiplication with matrices over GF(2). It is memory efficient and is the world's fastest at many benchmarks. There is related work (on "m4ri(e)") to do arithmetic over GF(2^n) and also some work by Boothby and Bradshaw on GF(p^n) for small p and n.
libfplll (C++ library): floating point LLL. The whole point of fpLLL is to implement very fast algorithms for computing LLL reduced basis for lattices, possibly with rounding error if you are not careful. LLL gets used as a key algorithm in many other algorithms (e.g., polynomial factorization). Magma also uses some variant of libfplll, and an older rewritten version of fpLLL is in PARI. If you look, e.g., at Mathematica, it does have some sort of limited LLL implementation, but it is nowhere near as sophisticated as what is available overall in Sage... NTL and PARI also have their own interfaces to LLL, which can have various pros and cons over using fpLLL directly. (In Sage, you can make a matrix over the integers and call the LLL or LLL_gram methods on it.) Magma also uses fpLLL.
polybori (C++ library): commutative algebra with boolean rings, which are characteristic 2 rings such that x^2 = x for every element of the ring. Evidently, polybori is important in cryptography and circuit design (?). I have personally never succeeded in really understanding how or seeing for myself polybori do well on benchmarks. Polybori is why the boost-cropped package is in Sage, which provides some C++ datastructure and algorithms.

Combinatorics

symmetrica (C++ library): a C++ library for doing very fast arithmetic with symmetric functions. The combinatorics code in Sage builds on this.
cliquer (C library): for finding cliques (=maximal complete subgraphs) in a graph.

Geometry

gfan (C++ program): "groebner fans" -- for computing a geometric object that parametrizes all Groebner basis for an ideal.
cddlib (C library): for computing with convex polytopes that gfan requires, using the "double description" method (whatever that is). There is code in sage.geometry.polyhedra that also uses cddlib.
glpk (C library): the GNU linear programming toolkit. This is a good free open source program for "linear programming". There are also commercial programs (e.g., CPLEX) for linear programming, and whenever people talk about linear programming software they always point out that some commercial programs are way, way faster than glpk. But glpk is in Sage.
palp (C program): computes stuff about lattice polytopes, like all the integers points in one. There is a big Sage package that builds on top of this. Evidently, this is of substantial interest to some physicists.

Numerical applied math

gsl (C library): implements core algorithms of scientific computing, is very stable and considered "done" by its authors. New relevant code goes into other libraries that use GSL.

Non mathematics

readline: provides command line editing, which is used by IPython, PARI, Singular, etc. Often misbuilt or not available on various platforms, so included in Sage. Readline is perhaps famous for being a key reason that both PARI and clisp are licenced under the GPL.
bzip2: bzip2 compression; easily usable from Sage via "import bz2". NOTE: this is used by the Sage build system so it is in SAGE_ROOT/spkg/base instead of SAGE_ROOT/spkg/standard/.
zlib: gzip-compatible compression; makes it so you can easily "gzip" files, and even work with gzip'd files in memory from Python.

Sunday, September 26, 2010

The Sagemath Cluster's Fileserver

In addition to being involved in programming in Sage, I am also the systems administrator for the "Sagemath cluster" of computers, which are used for Sage development, and also support mathematics research. This is a cluster of 5 machines with 128GB RAM and 24 cores in each machine, along with a 24 terabyte Sun X4540 fileserver.

When we setup this cluster nearly 2 years ago, we installed OpenSolaris and ZFS on the fileserver. It's run very solidly, in that the uptime was over 600 days a few days ago. However, no software would really work well on OpenSolaris for me -- top crashed usually, emacs crashed, stuff just didn't work. I never got Sage to build. Also, USB disks were a major pain on this machine, which complicated backups. Also, I frankly found the performance of ZFS and NFS-from-ZFS to be disappointing. In addition, nobody had a clue how to do maintenance and security updates on this server, meaning it was probable a danger.

Then Sun got bought by Oracle, and Oracle killed OpenSolaris. Also, I started getting really into MongoDB for database work related to the modular forms database project, and it would be nice to be able to run a MongoDB server and other related software and servers directly on my fileserver (yes, MongoDB supports Solaris, but...). Getting anything to work on Solaris costs too much time and confusion. So I decided to delete OpenSolaris and install Linux. Since there's many terabytes of data on the fileserver, and dozens of people using it, this involved many rsync's back and forth.

I eventually succeed in installing Ubuntu 10.04.1 LTS Server on disk.math. I setup a big 20TB software RAID-5 array on disk.math (with 2 spares), and added it to an LVM2 (=logical volume management) volume group.

I created 3 partitions:

   home -- 3.5 terabytes: for people's home directories
   scratch -- 3.5 terabytes; for scratch that is available on all machines on the cluster
   lmfdb -- 3.5 terabytes; for the L-functions and modular forms database project.

Thus over 7.5 terabytes is not allocated at all right now. This could be added to the lmfdb partition as needed.

The RAID-5 array has impressive (to me) performance:

   root@disk:/dev# /sbin/hdparm -t /dev/md0
   /dev/md0:
   Timing buffered disk reads: 1638 MB in 3.00 seconds = 545.88 MB/sec

All the above 3.5TB partitions are NFS exported from disk.math, and (will all be) mounted on the other machines in the cluster.

By using LVM, we will still get snapshotting (like ZFS has), which is important for robust backups over rsync, and is great for users (like me!) who accidentally delete important files.

I chose 3.5TB for the partitions above, since that size is easy to backup using 4TB external USB disks. Now that I'm running linux on disk.math, it will be easy to just plug in a physical disk to the fileserver and make a complete backup, then unplug it and swap in another backup disk, etc.

----

Some general sysadmin comments about, in case people are interested.

   (1) Oracle shutdown access to firmware upgrades, which meant I couldn't upgrade the firmware on disk.math. Maybe it would have been possible via phone calls, etc., but I was so pissed off to see Oracle do something that lame. It's just evil to not give away firmware upgrades for hardware. Give me a break. Oracle sucks. I'm really glad I just happened to upgrade the firmware on all my other Sun boxes recently.

   (2) The remote console -- which is needed to install from a CD image on disk.math -- does not work correctly on 64-bit OS X or 64-bit Linux. It just fails with "not supported on this platform" errors, but with no hint as to why. By installing a 32-bit Linux into a virtual machine, I was able to get this to work.

  (3) Linux NFS + software RAID + Linux LVM2 (=logical volume management) does roughly "the same thing" as ZFS did on the old disk.math machine. I spent about equal time with the documentation for both systems, RAID+LVM2+NFS and ZFS, and the documentation for RAID+LVM2+NFS is definitely less professional, but at the same time vastly more helpful. There's lots of good step-by-step guides, forums, and just generally a *ton* of useful help out there. With the ZFS documentation, there is basically one canonical document, which though professional, is just really frustrating as soon as you have a question not answered in it. I personally greatly prefer the Linux ecosystem.

  (4) RAID+LVM2+NFS is much more modular than ZFS. Each part of the system can be used by itself without the other parts, each has its own documentation, and each can make use of the other in multiple flexible and surprising ways. It's very impressive. One of the reasons I chose ZFS+Solaris instead of Linux+whatever 2 years ago is that I didn't realize that Linux offers similar capabilities.... because it didn't back in 2001 when I was learning tons about Linux. But it does now. Linux on the server has really, really improved over the years. (I've been using Linux since 1993.)

Tuesday, August 17, 2010

*-Overflow

I spent some time obsessively browsing http://mathoverflow.net during the last few days (and posting too), and it is a really awesome site. I found out about it last December, when I visited Berkeley to give a talk, and had lunch with one of the guys that runs the site, but didn't pay much attention to it until recently. It's much more suitable for discussion of research-level mathematics than general use and development of Sage.

Mathoverflow is interesting in that their comment system supports jsmath with realtime preview, which is something the TinyMCE editor in the Sage notebook doesn't do. It's also something that http://stackoverflow.com doesn't do, as far as I can tell. The grad students at Berkeley that organize mathoverflow probably somehow hacked jsmath into the comment system.

I was very surprised to learn that Mathoverflow and Stackoverflow (and all the dozens and dozens of stackoverflow-based sites) are closed source. There is a company that runs stackoverflow and hosts related sites, and keeping their code closed is part of their business model.

Some people sort of forked something related to stackexchange at some point, and created this: CNPROG, which is fortunately based on Python. There are about 20 sites using CNPROG now, including one for all questions about the use of Python in science, mathematics, and engineering.
I tried this site, answering a question about Cython, and it is snappy and clean.

Perhaps I will add jsmath or mathjax support to cnprog and start another site (http://q.sagemath.org ?), say on using math software as an aid to mathematical research. The site would not be specific to Sage or restricted to open source. The FAQ would say that all questions must be of the form: "(How) can I use a computer to compute XYZ, where XYZ should be some sort of advanced mathematical question." The emphasis on "advanced" and "research" is that there are already other sites about how to do elementary stuff (=people's homework), and it will provide an excuse to keep that out, which will greatly raise the quality. It is, after all, exactly this uncompromising focus on research that makes http://mathoverflow.net so interesting.

EDIT: I've now created http://ask.sagemath.org which is based on http://askbot.org, which is in turn based on CNPROG.

Monday, June 7, 2010

PARI on the iPAD (a component of Sage)

I got PARI (http://pari.math.u-bordeaux.fr/), which is a very powerful C program for doing number theory, to run natively on my iPad and iPhone!

"All" I had to do was:

1. Jailbreak my iPad using Spirit.
2. Make a gcc wrapper script on my MAC laptop that can cross-compiler binaries for iPhoneOS, and put this script first in my path (I called it gcc):

#!/bin/sh

# the following is all on one line:

/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc -arch armv6 -mthumb -isysroot /Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS3.2.sdk/ "$@"

3. Download the PARI source code and extract it: pari-2.3.5.tar.gz

4. Configure and build PARI (this requires that you have the latex XCode installed).

5. The build fails, so I edited the file src/kernel/none/mp_indep.c in a somewhat random way to get it to compile. More precisely, I commented out lines 794-801, and replaced them by:

# define INDEX0 1

# define INDEX1 0

6. Copy Odarwin-i386/gp-sta over to the iPad/iPhone using ssh (I installed an ssh daemon using Cydia).

7. Run it, via iSSH (purchased from the App store) connected to localhost -- see above.

Here is a link to my gp-sta, in case you want to just drop it on your phone and run it.

If I can put a simple GUI on this, I may try to submit this to the App store, so that anybody can trivially install GP for free. Another possibility would be to get it included in Cydia.

This is of course just a tiny step on the way to porting Sage to the iPad.

Addendum

I was also able to build Pari *directly* on the iPhone with readline support. Installing GCC was pretty crazy.

William-A-Steins-iPad:~/pari/pari-2.3.5 mobile$ Odarwin-arm/gp-sta 
                                 GP/PARI CALCULATOR Version 2.3.5 (released)
                            arm running darwin (portable C kernel) 32-bit version
                      compiled: Jun  7 2010, gcc-4.2.1 (Based on Apple Inc. build 5555)
                             (readline v6.0 enabled, extended help not available)

                                    Copyright (C) 2000-2006 The PARI Group

PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY 
WHATSOEVER.

Type ? for help, \q to quit.
Type ?12 for how to get moral (and possibly technical) support.

parisize = 4000000, primelimit = 500000
? 2+3
%1 = 5
?

My notes for installing GCC on the iphone:

1) Install this

  http://files.getdropbox.com/u/876743/fake-libgcc_1.0_iphoneos-arm.deb

dpkg -i fake-libgcc_1.0_iphoneos-arm.deb

2) apt-get install iphone-gcc


3) Install headers:

Copy these over, which I got from http://www.megaupload.com/?d=55ZNOCKI, and extra to /usr/include

   sdk-2.0-headers.tar.gz

I literally just put them in /usr/include/   

4) Missing

  /usr/lib/libgcc_s.1.dylib

...

wget http://apt.saurik.com/debs/libgcc_4.2-20080410-1-6_iphoneos-arm.deb


5) Hmm:

  http://blog.aaronash.com/?p=15

I downloaded this and extracted it as lib, and had to use some funny password.   
"aksblog.co.nr"
Then I copied lib/libSystem.dylib to /usr/lib/ on my iPad

... and now gcc seems to work! 

6) This isn't needed for some reason: Make it so I can sign my apps:

    apt-get install ldid  

ldid -S main

Maybe I don't need this only because I'm not making a gui yet. 


7) Compiling runs into fork limits a lot, so raise the limit:

William-A-Steins-iPad:~/pari/pari-2.3.5 mobile$ sysctl -w kern.maxprocperuid=64
kern.maxprocperuid: 26
sysctl: kern.maxprocperuid: Operation not permitted
William-A-Steins-iPad:~/pari/pari-2.3.5 mobile$ sudo sysctl -w kern.maxprocperuid=64
Password:
kern.maxprocperuid: 26 -> 64

Friday, May 7, 2010

A Python on iPad benchmark

Now that my iPad is really mine, i.e., jailbroken, I installed Python on it and tried a large integer arithmetic benchmark. I'm curious because this will indicate something about the performance of fully porting Sage to the iPad (though of course Sage would use MPIR for large integers...).

First, on my 64-bit 2.6Ghz Intel Dunnington box (sage.math.washington.edu):

>>> import resource
>>> def cputime(s=0): return sum(resource.getrusage(resource.RUSAGE_SELF)[:2])-s
... 
>>> t=cputime()
>>> a=3**(10**6)
>>> cputime()-t
0.53000000000000003
>>> b=a*(a+1)
>>> cputime()-t
2.46

Next on my laptop, which is a top-end macbook air (64-bit intel core2 duo running 64-bit python 2.6.x):

>>> t=cputime()
>>> a=3**(10**6)
>>> cputime()-t
0.64609700000000003
>>> b=a*(a+1)
>>> cputime()-t
3.1051849999999996

And, on the iPad

>>> t=cputime()
>>> a=3**(10**6)
>>> cputime()-t
2.3500000000000014
>>> b=a*(a+1)
>>> cputime()-t
9.5899999999999963

Not bad!! Note that 32-bit versus 64-bit (and Python 2.5 versus 2.6) may be relevant, depending on how Python big integer arithmetic is implemented.

As a bonus, I have an older iPhone 3G (not 3Gs), where we get:

>>> t=cputime()
>>> a=3**(10**6)
>>> cputime()-t
7.6699999999999999
>>> b=a*(a+1)
>>> cputime()-t

Wow, so the iPhone 3G at this benchmark is about TEN TIMES slower than the iPad. No wonder the iPad feels snappier.

Note that I view genuinely porting Sage to the iPad/iPhone as something that will only ever be available to people who jailbreak their devices. This is because Apple bans running any interpreter except Javascript on the iPhone. (Of course, I can always hope that Apple will get sued into opening up their app store...) For a person like me, a jailbroken iPad is dramatically superior to the standard iPad -- full multitasking, ssh access, Python, being able to easily change screen brightness, having full access to the filesystem, etc. Now if only I had Sage! (Of course, Sage usable on the iPad via the web, but sometimes I don't have reliable network access.)