Thursday, December 15, 2011

How big is the core Sage library?

I just did the following with Sage-4.8.alpha5:
  1. Run "sudo apt-get install sloccount".
  2. Run "cp -rv SAGE_ROOT/devel/sage-main /tmp/x".
  3. Use a script [1] to rename all .pyx and .pxi files to .py.
  4. Run "sloccount *" in the /tmp/x directory; this ignores the autogenerated .c/.cpp files coming from Cython.

Here's the result for the full Sage library, which does not distinguish between Python and Cython. Note that sloccount really only counts lines of code -- comments and blank lines are ignored.

Totals grouped by language (dominant language first):
python:      530370 (96.41%)
ansic:        14538 (2.64%)
cpp:           5188 (0.94%)

This suggests that the core Sage library is just over half a million lines of Python and Cython source code, not counting comments and whitespace.

Here's the breakdown by module:
SLOC    Directory       SLOC-by-Language (Sorted)
88903   rings           python=87720,cpp=1183
72913   combinat        python=71629,cpp=1284
47747   schemes         python=46255,cpp=1492
39815   graphs          python=28377,ansic=11438
31540   matrix          python=31540
31019   modular         python=31012,ansic=7
24475   libs            python=21171,ansic=2845,cpp=459
20517   misc            python=20383,ansic=134
18006   interfaces      python=18006
17577   geometry        python=16936,cpp=641
12775   categories      python=12775
12093   server          python=12093
11971   groups          python=11971
11961   plot            python=11961
10686   crypto          python=10686
9920    modules         python=9920
8389    symbolic        python=8260,cpp=129
8150    algebras        python=8150
7260    ext             python=7198,ansic=62
7093    structure       python=7093
6364    coding          python=6364
5670    functions       python=5670
5249    homology        python=5249
4798    numerical       python=4798
4323    quadratic_forms python=4323
3919    gsl             python=3919
3911    calculus        python=3911
3879    sandpiles       python=3879
3003    sets            python=3003
2647    databases       python=2647
2074    logic           python=2074
1736    finance         python=1736
1608    games           python=1608
1465    monoids         python=1465
1435    tests           python=1383,ansic=52
1370    stats           python=1370
971     interacts       python=971
959     tensor          python=959
906     lfunctions      python=906
308     parallel        python=308
275     probability     python=275
219     media           python=219
197     top_dir         python=197

Here is the script [1]:
#!/usr/bin/env python

import os, shutil

for dirpath, dirnames, filenames in os.walk('.'):
    for f in filenames:
        if f.endswith('.pyx') or f.endswith('.pxi'):
            print(f)
            shutil.move(os.path.join(dirpath, f),
                        os.path.join(dirpath, os.path.splitext(f)[0] + '.py'))

Tuesday, December 13, 2011

Using Sage to Support Research Mathematics

When using Sage to support research mathematics, the most important point to make is to strongly encourage people to do the extra work to turn their "scruffy research code" into a patch that can be peer reviewed and included in Sage. They will have to 100% doctest it, and the quality of their code may improve dramatically as a result. Including code in Sage means that the code will continue to work as Sage is updated. Also, the code is peer reviewed and has to have examples and documentation for every function. That's a much higher bar than just "reproducible research".

Moreover, getting code up to snuff to include in Sage will often also reveal mistakes early, avoiding embarrassment later. I'm fixing some issues related to a soon-to-be-done paper right now that I found when doing just this for trac 11975.

This final step of turning snippets of research code into a peer-reviewed contribution to Sage is: (1) a surprisingly huge amount of very important, useful work, (2) something that is emphasized as an option for Sage more than for Magma or Mathematica or Pari (say), (3) something whose value people have to be sold on, since at present they usually get no real extra academic credit for it, and journal referees often don't care either way (I do, but I'm probably in the minority there), and (4) something that a *lot* of research mathematicians do not do. As an example of (4), in the last two months I've seen a ton of (separate!) bodies of semi-secret research code in various Dropbox repos, none of it currently squarely aimed at going into Sage. I've also seen a bunch of code related to Edixhoven et al.'s algorithm for computing Galois representations with a similar property (there is now trac 12132, due to my urging).

I did *not* do this step yet with this recently accepted paper. Instead I used "scruffy research code" in psage to do the fast L-series computations. The referee for Math Comp didn't care either way, actually... I hope this doesn't come back to haunt me, though there are many double checks here (e.g., BSD) so I'm not too worried. I will do this get-it-in-Sage step at some point though.

This will be better for the community in the long run, and better for individual researchers' credibility too. And there is a lot of value in having a stable refereed snapshot of code on which a published (=very stable) paper is based.

Saturday, November 12, 2011

Is the time ripe for http://sagenb.com?

On a Sage mailing list, Karl Crisman wrote: "Regarding the downtime issue [of http://sagenb.org], there have occasionally been rumors of someone or some organization starting a service which would guarantee uptime and provide support."

I might do this. It might be at http://sagenb.com (right now sagenb.com points to sagenb.org). Here's the "business plan". Probably, sagenb.com would appear almost the same as sagenb.org, except there would be Google (?) ads on the side of your worksheets, and the revenue would go toward paying for:
  1. Commercial server hosting:  Some Amazon EC2 instances
  2. An employee (or later, employees) to maintain the servers: at the beginning, this would be me in my free time, since I have a lot of experience with this.
  3. Advertising that sagenb.com exists and we want users!:  Unlike sagenb.org, we would very strongly encourage as many people as possible to use sagenb.com.   The advertising and landing page would explain that though sagenb.com generates money, all of that money is given back to the Sage development community (see below).
  4. Hire employees to improve the notebook: Fix bugs, implement features, etc. There would be a public commitment that all such work would be open sourced and included with Sage. This would include adding support for a for-pay "pro" subscription version, nicer "offline support" (via a Sage install on the user's computer), integrated spreadsheets and better data visualization and manipulation tools for science and business applications, and enabling development of Sage's core library and patch submission entirely through the notebook.


At some point, there could be a $X/year subscription version that would remove all ads, and increase disk space available to that user. There would also be a $Y/year university site license version with customized authentication (e.g., LDAP?) for each university.   The university site license version might also include Maple/Mathematica/Matlab/Magma pre-installed in their notebook server, assuming appropriate site licenses are honored, so sagenb.com could be something that goes beyond just a platform for "sage as math software".   We can of course also tap into the R market, given that Sage includes R.

I imagine the above being done as a not-for-profit effort, so if it brought in more revenue than needed for hosting and employees, the excess money would go to the Sage Foundation to support other Sage development activities. Regarding numbers, according to Google Analytics, right now on average well over 1000 people use sagenb.org every day. Standard commercial hosting costs for EC2 to support this load would be roughly $100/month. If each visitor generates on average 1 penny of ad revenue per day (is this a reasonable estimate? I have no clue), then we would expect to make $3,650 in one year, which would be enough to fund the EC2 service (at about $2,000/year), with a profit of $1,650.

Now let's dream big!  There might be 1,000,000 potential daily users out there for a service like this, if it worked sufficiently well, since there are many college students in the world, as well as people in the sciences who use math and stats programs like R (which is part of Sage).  Scaling up by a factor of 1,000 would yield over $1 million/year, after paying for hosting fees.  This would be enough to fund substantial work to improve Sage and the notebook, and to have a paid Patch Editor position (imagine buying out a top Sage developer professor, e.g., John Palmieri, from 50% of his teaching obligations, in exchange for him spending the time he would have spent teaching instead organizing that patches to Sage get properly refereed).  Maybe we could even hire back some of the (many!) people who were Sage developers when they were mathematics grad students or postdocs, but who then went to work at a top web technology company and acquired useful experience there (and are now way too busy to work on Sage).

This has the potential to make Sage a more longterm self-sustaining project. It's probably not possible to get traditional venture capital for a not-for-profit plan like the one above, but fortunately that is not needed due to (1) the generous support the National Science Foundation is currently providing toward development on the Sage notebook, and (2) private donations to the Sage Foundation.  In particular, (2) provides enough money to bootstrap a sagenb.com for a while.

I think the main potential downside is competition.  If somebody else does the same thing right now for profit without giving back their changes, and captures the market, it's hard to imagine how the above would work.  Since we don't use the Affero GPL for the Sage notebook, it is legal for somebody to do a lot of customization work to Sage and the notebook, create a web app using this customized version, and give back nothing to the community, so long as they don't redistribute their modified versions publicly. This isn't crazy -- not so long ago, I had a major company (I won't say who out of respect) tell me they planned to do something like that.    And "random people" suggest it somewhat regularly when I give talks about Sage.   I'm surprised it hasn't happened yet, but it definitely hasn't.

So the time to do this is today.  The notebook software we have is now finally reasonably scalable, primarily due to work of Mike Hansen and Rado Kirov.  Our funding situation is good this year.  We have strong goodwill and interest right now from both the Mathematical Association of America and the WeBWorK developers.   If we wait longer, the one chance to truly make Sage financially self-sustaining in a way that fits with my dreams and values will pass.

I hope that by making all plans open, and by having a commitment to put the profits back into Sage development, an enterprise like I describe here will be in a better position to attract users than a purely for profit venture by somebody else.

Wednesday, August 31, 2011

Sage: Creating a Viable Free Open Source Alternative to Magma, Maple, Mathematica, and MATLAB


The goal of the Sage project is to create a viable free open source alternative to Magma, Maple(TM), Mathematica(R), and MATLAB(R), which are the most popular non-free closed source mathematical software systems. (Maple is a trademark of Waterloo Maple Inc. Mathematica is a registered trademark of Wolfram Research Incorporated. MATLAB is a registered trademark of MathWorks. I will refer to the four systems together as "Ma" in the rest of this article.) Magma is (by far) the most advanced non-free system for structured abstract algebraic computation, Mathematica and Maple are popular and highly developed systems that shine at symbolic manipulation, and MATLAB is the most popular system for applied numerical mathematics. Together, over 3,000 employees work at the companies that produce the four Ma's listed above, which take in over a hundred million dollars of revenue annually.

By a viable free alternative to the Ma's, we mean a system that will have the important mathematical features of each Ma, with comparable speed. It will have 2d and 3d graphics, an interactive notebook-based graphical user interface, and documentation, including books, papers, school and college curriculum materials, etc. A single alternative to all of the Ma's is not necessarily a drop-in replacement for any of the Ma's; in particular, it need not run programs written in the custom languages of those systems. Thus it need not be like Octave or R, which (nearly) clone the languages of MATLAB and S, respectively. Development would instead focus on implementing functions that users demand, rather than systematically trying to implement every single function of the Ma's. The culture, architecture, and general look and feel of such a system would be very different than that of the Ma's.



Motivation for Starting Sage


Each of the Ma's costs substantial money, and is hence expensive for me, my collaborators, and students. The Ma's are not owned by the community like Sage is, or Wikipedia is, for that matter.


The Ma's are closed, which means that the implementations of some algorithms are secret, in which case you are not allowed to modify or extend them.

From the Mathematica documentation: "You should realize at the outset that while knowing about the internals of Mathematica may be of intellectual interest, it is usually much less important in practice than you might at first suppose. Indeed, in almost all practical uses of Mathematica, issues about how Mathematica works inside turn out to be largely irrelevant. Particularly in more advanced applications of Mathematica, it may sometimes seem worthwhile to try to analyze internal algorithms in order to predict which way of doing a given computation will be the most efficient. [...] But most often the analyses will not be worthwhile. For the internals of Mathematica are quite complicated."

The philosophy espoused in Sage, and indeed by the vast open source software community, is exactly the opposite. We want you to know about the internals, and when they are quite complicated, we want you to help make them more understandable. Indeed, Sage's growth depends on you analyzing how Sage works, improving it, and contributing your improvements back.
sage: crt(2, 1, 3, 5)  # Chinese Remainder Theorem
11
sage: crt?        # ? = documentation and examples
Returns a solution to a Chinese Remainder Theorem...
...
sage: crt??       # ?? = source code
def crt(...):
    ...
    g, alpha, beta = XGCD(m, n)
    q, r = (b - a).quo_rem(g)
    if r != 0:
        raise ValueError("No solution ...")
    return (a + q*alpha*m) % lcm(m, n)
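The elided body above can be sketched in plain Python. This is a hypothetical reimplementation for illustration only (the names `xgcd` and `crt` mirror the Sage source shown, but this is not the actual Sage library code):

```python
def xgcd(a, b):
    """Extended Euclid: return (g, alpha, beta) with g = gcd(a, b) = alpha*a + beta*b."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q*r
        old_s, s = s, old_s - q*s
        old_t, t = t, old_t - q*t
    return old_r, old_s, old_t

def crt(a, b, m, n):
    """Return x with x = a (mod m) and x = b (mod n), if a solution exists."""
    g, alpha, beta = xgcd(m, n)
    q, r = divmod(b - a, g)
    if r != 0:
        raise ValueError("No solution to crt problem")
    lcm = m * n // g
    return (a + q*alpha*m) % lcm

print(crt(2, 1, 3, 5))  # 11, matching the Sage session above
```

The key identity is alpha*m + beta*n = g, so alpha*m is congruent to g mod n, which makes a + q*alpha*m hit both residues at once.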
Moreover, by browsing the Mercurial repository, you can see exactly who wrote or modified any particular line of code in the Sage library, when they did it, and why. Everything included in Sage is free and open source, and it will forever remain that way.

Linus Torvalds: "I see open source as Science. If you don't spread your ideas in the open, if you don't allow other people to look at how your ideas work and verify that they work, you are not doing Science, you are doing Witchcraft. Traditional software development models, where you keep things inside a company and hide what you are doing, are basically Witchcraft. Open source is all about the fact that it is open; people can actually look at what you are doing, and they can improve it, and they can build on top of it. [...] One of my favorite quotes from history is Newton: `If I have seen further, it is by standing on the shoulders of giants.'"

The design decisions of the Ma's are not made openly by the community. In contrast, important decisions about Sage development are made via open public discussions and voting that is archived on public mailing lists with thousands of subscribers.

Every one of the Ma's uses a special mathematics-oriented interpreted programming language, which locks you into their product, makes writing some code outside mathematics unnecessarily difficult, and limits the number of software engineers who are expert at programming in that language. In contrast, the user language of Sage is primarily the mainstream free open source language Python, which is one of the world's most popular interpreted programming languages. The Sage project neither invented nor maintains the underlying Python language, but gains immediate access to the IPython shell, Python scientific libraries (such as NumPy, SciPy, CVXopt and Matplotlib), and a large Python community with major support from big companies such as Google. In comparison to Python, the Ma's are small players in terms of language development. Thus for Sage, most of the problems of language development are handled by someone else.

The bug tracking done for three of the four Ma's is currently secret (MATLAB has an open bug tracker, though it requires free registration to view), which means that there is no published accounting of all known bugs, the status of work on them, and how bugs are resolved. But the Ma's do have many bugs; see the release notes of each new version, which list bugs that were fixed. Sage also has bugs, which are all publicly tracked on the trac server, and there are numerous "Bug Days" workshops devoted entirely to fixing bugs in Sage. Moreover, all discussion about resolving a given bug, including peer review of solutions, is publicly archived. We note that sadly even some prize-winning free open source systems, such as GAP, do not have an open bug tracking system, resulting in people reporting the same bugs over and over again.

Each of the Ma's is a combination of secret unchangeable compiled code and less secret interpreted code. Users with experience programming in compiled languages such as Fortran or C++ may find the loss of a compiler to be frustrating. None of the Ma's has an optimizing compiler that converts programs written in their custom interpreted language to a fast executable binary format that is not interpreted at runtime. (MATLAB has a compiler, but "the source code is still interpreted at run-time, and performance of code should be the same whether run in standalone mode or in MATLAB." Mathematica also has a Compile function, but simply compiles expressions to a different internal format that is interpreted, much like Sage's fast_callable function.) In contrast, Sage is tightly integrated with Cython. The Cython project has received extensive contributions from Sage developers, and is very popular in the world of Python-based scientific computing. Cython is a Python-to-C/C++ compiler that speeds up code execution and has support for statically declaring data types (for potentially enormous speedups) and natively calling existing C/C++/Fortran code. For example, enter the following in a cell of the Sage notebook:
def python_sum2(n):
    s = int(0)
    for i in xrange(1, n+1):
        s += i*i
    return s
Then enter the following in another cell:
%cython
def cython_sum2(long n):
    cdef long i, s = 0
    for i in range(1, n+1):
        s += i*i
    return s
The second implementation, despite looking nearly identical, is nearly a hundred times faster than the first one (your timings may vary).
sage: timeit('python_sum2(2*10^6)')
5 loops, best of 3: 154 ms per loop
sage: timeit('cython_sum2(2*10^6)')
125 loops, best of 3: 1.76 ms per loop
sage: 154/1.76
87.5

Of course, it is better to choose a different algorithm. In case you don't remember a closed form expression for the sum of the first $n$ squares, Sage can deduce it:
sage: var('k, n')
sage: factor(sum(k^2, k, 1, n))
1/6*(n + 1)*(2*n + 1)*n
And now our simpler fast implementation is:
def sum2(n):
    return n*(2*n+1)*(n+1)/6
Just as above, we can also use the Cython compiler:
%cython
def c_sum2(long n):
    return n*(2*n+1)*(n+1)/6
Comparing times, we see that Cython is 10 times faster:
sage: n = 2*10^6
sage: timeit('sum2(n)')
625 loops, best of 3: 1.41 microseconds per loop
sage: timeit('c_sum2(n)')
625 loops, best of 3: 0.145 microseconds per loop
sage: 1.41/.145
9.72413793103448
In this case, the enhanced speed comes at a cost, in that the answer is wrong when the input is large enough to cause an overflow:
sage: c_sum2(2*10^6)   # WARNING: overflow
-407788678951258603
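For intuition, here is a pure-Python simulation of what goes wrong (this is an illustration, not the generated C code; it assumes a 64-bit C long and Cython's default floor-division semantics for `/` on C integers):

```python
def wrap64(v):
    """Reduce an integer into the signed 64-bit range, the way C overflow wraps it."""
    return (v + 2**63) % 2**64 - 2**63

def c_sum2_simulated(n):
    # Each intermediate product wraps at 64 bits, like C long arithmetic;
    # the product n*(2n+1)*(n+1) overflows long before the division by 6.
    t = wrap64(wrap64(n * (2*n + 1)) * (n + 1))
    return wrap64(t // 6)  # floor division, matching Cython's default

print(c_sum2_simulated(2 * 10**6))  # -407788678951258603, the wrong answer above
print(c_sum2_simulated(10))         # 385, correct while no overflow occurs
```

For n = 2*10^6 the intermediate product is about 1.6 x 10^19, which exceeds 2^63 - 1 and wraps negative before the division ever happens.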
Cython is very powerful, but to fully benefit from it, one must understand machine level arithmetic data types, such as long, int, float, etc. With Sage you have that option.

What is Sage?


The goal of Sage is to compete with the Ma's, and the intellectual property at our disposal is the complete range of GPL-compatibly licensed open source software.

Sage is a self-contained free open source distribution of about 100 open source software packages and libraries that aims to address all computational areas of pure and applied mathematics. See the list of packages in Sage, which includes R, Pari, Singular, GAP, Maxima, GSL, Numpy, Scipy, ATLAS, Matplotlib, and many other popular programs. The download of Sage contains all dependencies required for the normal functioning of Sage, including Python itself. Sage includes a substantial amount of code that provides a unified Python-based interface to these other packages. Sage also includes a library of new code written in Python, Cython and C/C++, which implements a huge range of algorithms.



History


I made the first release of Sage in February 2005, and at the time called it "Software for Arithmetic Geometry Experimentation." I was a serious user of, and contributor to, Magma at the time, and was motivated to start Sage for many of the reasons discussed above. In particular, I was personally frustrated with the top-down closed development model of Magma, the fact that several million lines of the source code of Magma are closed source, and the fees that my colleagues had to pay in order to use the substantial amount of code that I contributed to Magma. Despite my early naive hope that Magma would be open sourced, it never was. So I started Sage motivated by the dream that someday the single most important item of software I use on a daily basis would be free and open. David Joyner, David Kohel, Joe Wetherell, and Martin Albrecht were also involved in the development of Sage during the first year.

In February 2006, the National Science Foundation funded a 2-day workshop called "Sage Days 2006" at UC San Diego, which had about 40 participants and speakers from several open and closed source mathematical software projects. After doing a year of fulltime mostly solitary work on Sage, I was surprised by the positive reception of Sage by members of the mathematical research community. What Sage promised was something many mathematicians wanted. Whether or not Sage would someday deliver on that promise was (and for many still is) an open question.

I had decided when I started Sage that I would make it powerful enough for my research, with or without the help of anybody else, and was pleasantly surprised at this workshop to find that many other people were interested in helping, and understood the shortcomings of existing open source software, such as GAP and PARI, and the longterm need to move beyond Magma. Six months later, I ran another Sage Days workshop, which resulted in numerous talented young graduate students, including David Harvey, David Roe, Robert Bradshaw, and Robert Miller, getting involved in Sage development. I used startup money from the University of Washington to hire Alex Clemesha as a fulltime employee to implement 2d graphics and help create a notebook interface to Sage. I also learned that there was much broader interest in such a system, and stopped referring to Sage as being exclusively for "arithmetic geometry"; instead, Sage became "Software for Algebra and Geometry Experimentation." Today the acronym is deprecated.

The year 2007 was a major turning point for Sage. Far more people got involved with development, we had four Sage Days workshops, and, prompted by Craig Citro, we instituted a requirement that all new code must have tests for 100% of the functions touched by that code, and every modification to Sage must be peer reviewed. Our peer review process is much more open than in mathematical research journals; everything that happens is publicly archived on trac. During 2007, I also secured some funding for Sage development from Microsoft Research, Google, and NSF. Also, Martin Albrecht, a German graduate student studying cryptography, presented Sage at the Trophées du Libre competition in France, and Sage won first place in "Scientific Software", which led to a huge amount of good publicity, including articles in many languages around the world and, for example, a Slashdot article.

In 2008, I organized 7 Sage Days workshops at places such as IPAM (at UCLA) and the Clay Mathematics Institute, and for the first time, several people besides me made releases of Sage. In 2009, we had 8 more Sage Days workshops, and the underlying foundations of Sage improved, including development of a powerful coercion architecture. This coercion model systematically determines what happens when performing operations such as a + b, when a and b are elements of potentially different rings (or groups, or modules, etc.).
sage: R.<x> = PolynomialRing(ZZ)
sage: f = x + 1/2; f
x + 1/2
sage: parent(f)
Univariate Polynomial Ring in x over Rational Field
We compare this with Magma (V2.17-4), which has a more ad hoc coercion system:
> R := PolynomialRing(IntegerRing());
> x + 1/2;
      ^
Runtime error in '+': Bad argument types
Argument types given: RngUPolElt[RngInt], FldRatElt

Robert Bradshaw and I also added support for beautiful browser-based 3D graphics to Sage, which involved writing a 3D graphics library, and adapting the free open source Jmol Java library for rendering molecules to instead plot mathematical objects.

sage: f(x,y) = sin(x - y) * y * cos(x)
sage: plot3d(f, (x,-3,3), (y,-3,3), color='red')

In 2009, following a huge amount of porting work by Mike Hansen, development of algebraic combinatorics in Sage picked up substantial momentum, with the switch of the entire MuPAD-combinat group to Sage (forming sage-combinat), only months before the formerly free system MuPAD(R) (MuPAD is a registered trademark of SciFace Software GmbH & Co.) was bought out by MathWorks (makers of MATLAB). In addition to work on Lie theory by Dan Bump, this also led to a massive amount of work on a category-theoretic framework for Sage by Nicolas Thiery.

In 2010, there were 13 Sage Days workshops in many parts of the world, and grant funding for Sage significantly improved, including new NSF funding for undergraduate curriculum development. I also spent much of my programming time during 2010--2011 developing a number theory library called psage, which is currently not included in Sage, but can be easily installed.

Many aspects of Sage make it an ideal tool for teaching mathematics, so there's a steadily growing group of teachers using it: for example, there have been MAA PREP workshops on Sage for the last two years (and a third is likely to run next summer), there are regular posts on the Sage lists about setting up classroom servers, and there is an NSF-funded project called UTMOST devoted to creating undergraduate curriculum materials for Sage.

The publications page lists 101 accepted publications that use Sage, 47 preprints, 22 theses, and 16 books, and there are surely many more "in the wild" that we are not aware of. According to Google Analytics, the main Sage website gets about 2,500 absolute unique visitors per day, and the website http://sagenb.org, which allows anybody to easily use Sage through their web browser, has around 700 absolute unique visitors per day.

For many mathematicians and students, Sage is today the mature, open source, and free foundation on which they can build their research program.

Monday, November 15, 2010

Brief History and Motivation Behind the Sage Coercion Model

In Sage (like in Magma), most objects are either elements or parents. Think of a "parent" as a set. This Parent/Element idea is a powerful algebraic approach to implementing mathematical objects on a computer, which does not exist in Mathematica, Maple, PARI, Maxima, and many other math software platforms.

I learned about this approach from using Magma:

%magma
R<x> := PolynomialRing(Integers());
print Parent(x);
///
Univariate Polynomial Ring in x over Integer Ring

(In this blog post I'll put %magma above code that gets input to Magma; all other code gets input to Sage. The input and output is separated by ///.)

In Sage:

R.<x> = ZZ[]
parent(x)
///
Univariate Polynomial Ring in x over Integer Ring

x.parent()
///
Univariate Polynomial Ring in x over Integer Ring


isinstance(ZZ, Parent)
///
True

isinstance(2, Parent)
///
False


Automatic Coercions:

"The primary goal of coercion is to be able to transparently do arithmetic, comparisons, etc. between elements of distinct parents."

When I used to try to get people to use Magma, perhaps the number one complaint I heard about Magma was that doing arithmetic with objects having distinct parents was difficult and frustrating.

For the first year, in Sage, there was a very simple coercion system:

  • If you try to compute a + b or a * b, first somehow put b into the parent of a, then do the arithmetic.

That seriously sucked.  E.g., 

    Mod(2,7) + 6

was completely different than

    6 + Mod(2,7)!

The first was Mod(1,7), and the second was the integer 8.   This made code difficult to understand and its behavior unpredictable.

So I rewrote coercion to be a bit better (this was a really painful rewrite that I mostly did myself over several hard months of work):

  • If you try to compute a + b (or a*b), check for a "canonical coercion" from the parent of a into the parent of b, or failing that, from the parent of b into the parent of a.  If there isn't one, raise an error.  If there is one, use it.  There won't be both unless there is some canonical isomorphism.
  • There are some axioms about what a canonical coercion is.  At the least, it must be a homomorphism.

Then we decided that there is a canonical homomorphism Z --> Z/7Z, but not one Z/7Z --> Z, since there is no ring homomorphism in that direction; hence the sum above makes sense in either order, and gives Mod(1,7) both times.
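The symmetric rule can be illustrated with a toy pure-Python sketch (hypothetical class and method names, for illustration only; real Sage parents are vastly richer than this):

```python
class IntegerRing:
    """Toy parent standing in for Z."""
    def coerces_from(self, other):
        return False
    def element(self, v):
        return ('int', v)

class IntegersMod:
    """Toy parent standing in for Z/nZ."""
    def __init__(self, n):
        self.n = n
    def coerces_from(self, other):
        # The canonical ring homomorphism Z --> Z/nZ exists;
        # there is no ring homomorphism Z/nZ --> Z.
        return isinstance(other, IntegerRing)
    def element(self, v):
        return ('mod', v % self.n, self.n)

def add(pa, a, pb, b):
    """Symmetric rule: look for a canonical coercion in either direction."""
    if pa is pb or pa.coerces_from(pb):
        return pa.element(a + b)
    if pb.coerces_from(pa):
        return pb.element(a + b)
    raise TypeError("no canonical coercion between the parents")

ZZ_toy, Z7 = IntegerRing(), IntegersMod(7)
# Both orders now agree, giving Mod(1, 7):
print(add(Z7, 2, ZZ_toy, 6))   # ('mod', 1, 7)
print(add(ZZ_toy, 6, Z7, 2))   # ('mod', 1, 7)
```

Unlike the old asymmetric rule, neither operand's parent gets to "win" by position: the canonical map Z --> Z/7Z decides the result in both orders.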

One implication of this new model was that parent objects have to be immutable, i.e., you can't fundamentally change them after you make them.  This is why in Sage you must specify the name of the generator of a polynomial ring at creation time, and can't change it.  In Magma, it is typical to specify the name only later if you want.

Objects must be immutable because the canonical maps between them depend on the objects themselves, and we don't want them to just change left and right at runtime. 


%magma
R := PolynomialRing(RationalField(), 2);
f := R.1^3 + 3*R.2^3 - 4/5;
print f;
[R.1, R.2];
///
$.1^3 + 3*$.2^3 - 4/5
[ $.1, $.2 ]

%magma
AssignNames(~R, ["x", "y"]);
print f;
[R.1, R.2];
///
x^3 + 3*y^3 - 4/5
[x, y]

%magma
AssignNames(~R, ["z", "w"]);
print f;
///
z^3 + 3*w^3 - 4/5


Now in Sage:
R = PolynomialRing(QQ)
///
TypeError: You must specify the names of the variables.

R.<x,y> = PolynomialRing(QQ)
f = x^3 + 3*y^3 - 4/5; f
///
x^3 + 3*y^3 - 4/5

Note: In Sage, you can use a with block to temporarily change the names if you really need to for some reason.  This is allowed since at the end of the with block the names are guaranteed to be changed back.


with localvars(R, ['z','w']):
    print f
print "back?", f    
///
z^3 + 3*w^3 - 4/5
back? x^3 + 3*y^3 - 4/5


But this new model had a major problem too, e.g., if x is the generator of Z[x], then "x + 1/2" would FAIL!   This is because 1/2 does not coerce into Z[x] (the parent of x), and x does not coerce into Q (the parent of 1/2).

 

Maybe the implementors of Magma have the answers?  Evidently not. 


%magma
R<x> := PolynomialRing(Integers());
x + 1/2;
///
Runtime error in '+': Bad argument types
Argument types given: RngUPolElt[RngInt], FldRatElt

Robert Bradshaw did though, and now it is in Sage:


R.<x> = ZZ[]
x + 1/2
///
x + 1/2

His new design is (for the most part) what Sage actually uses now.

He launched an effort in 2008 (see the Dev Days 1 Wiki) to implement a rewrite of the coercion model to his new design.  This ended up swallowing up half the development effort at the workshop, and was a massive amount of work, since every parent structure and element had to have some modifications made to it. 

This meant people changing a lot of code all over Sage that they didn't necessarily understand, and crossing their fingers that the doctest test suite would catch their mistakes.    This was SCARY.   After much work, none of this went into Sage.  It was just way too risky.  This failure temporarily (!) burned out some developers. 

Robert Bradshaw, on the other hand, persisted and came up with a new approach that involved migrating Sage code gradually.  I.e., he made it so that the old coercion model was still fully supported simultaneously with the new one, then he migrated a couple of parent structures, and got the code into Sage.   I'm sure not everything is migrated, even today.  There are two points to what he did:

  1. He extended the rules so x + 1/2 works, i.e., the result of a+b need not live in the parent of a or the parent of b.
  2. He made implementing coercion much more top down: simply implement various methods in a class that derives from Parent.  This meant that instead of coercion being rules and conventions that people have to understand and implement in their own code all over Sage, they just implement a small amount of code and the rules (and benefits) are all enforced automatically. 
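To get a feel for the "top down" style, here is a toy sketch in plain Python.  This is an illustration only -- the class and method names below are made up for this example, not Sage's actual API -- but it shows the idea: each parent declares what it has a canonical map from, and a central model finds the common parent in which a + b should be computed.

```python
# Toy sketch of a top-down coercion model (illustration only --
# these class and method names are invented, not Sage's real API).
class Parent:
    name = "parent"
    def coerces_from(self, other):
        """True if there is a canonical map other -> self."""
        return other is self

class IntegerRing(Parent):
    name = "ZZ"

class RationalField(Parent):
    name = "QQ"
    def coerces_from(self, other):
        # the canonical inclusion ZZ --> QQ (and the identity on QQ)
        return isinstance(other, (RationalField, IntegerRing))

def common_parent(P, Q):
    """Find the parent in which a sum of elements of P and Q lives."""
    if P.coerces_from(Q):
        return P
    if Q.coerces_from(P):
        return Q
    raise TypeError("no canonical coercion between %s and %s" % (P.name, Q.name))

ZZ, QQ = IntegerRing(), RationalField()
print(common_parent(ZZ, QQ).name)  # QQ: the sum of an integer and a rational is rational
```

Each parent implements one small method, and the symmetric search, the error message, and the rules are enforced in one central place instead of being re-implemented all over the library.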


The Coercion Model

The coercion model is explained here: http://sagemath.org/doc/reference/coercion.html

 

Monday, November 8, 2010

Getting Started With Cython




Quote about Cython:

Andrew Tipton says "I'm honestly never going back to writing C again. Cython gives me all the expressiveness of Python combined with all the performance and close-to-the-metal-godlike-powers of C. I've been using it to implement high-performance graph traversal and routing algorithms and to interface with C/C++ libraries, and it's been an absolute amazing productivity boost."  Yep.


Cython has two major use cases

  1. Extending the CPython interpreter with fast compiled modules,
  2. Interfacing Python code with external C/C++ libraries.

Cython supports type declarations

  1. For changing code from having dynamic Python semantics into having static-and-fast (but less generic) C semantics.
  2. For directly manipulating C data types defined in external libraries.

Tutorial: Building Your First Cython Code by Hand

Compilation happens in two stages:
  1. A .pyx file is compiled by Cython to a .c or .cpp file.
  2. The .c or .cpp file is compiled by a C compiler (such as GCC) to a .so file.
Let's try it now:
First, create a file sum.pyx that contains the following code (see this directory for original code files):

def sum_cython(long n):
    cdef long i, s = 0
    for i in range(n):
        s += i
    return s
Then use Cython to compile it:


Since we're using Sage, you can do

bash$ sage -cython sum.pyx
bash$ ls
sum.c  sum.pyx
Notice the new file sum.c.  Compile it with gcc as follows on OS X: 

bash$ sage -sh
bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -bundle -undefined dynamic_lookup sum.c -o sum.so 

On Linux, do:

bash$ sage -sh
bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -shared -fPIC sum.c -o sum.so
Finally, try it out.


You must run Sage from the same directory that contains the file sum.so. When you type import sum below, the Python interpreter sees the file sum.so, opens it, and finds that it contains the functions and data that define a compiled "Python C-extension module", so Python can load it (just as it would a module like sum.py).

bash$ sage
-------------------------------------------------
| Sage Version 4.6, Release Date: 2010-10-30     
| Type notebook() for the GUI, and license() for
-------------------------------------------------
sage: import sum
sage: sum.sum_cython(101)
5050
sage: timeit('sum.sum_cython(101)')
625 loops, best of 3: 627 ns per loop
sage: timeit('sum.sum_cython(101)', number=10^6)    # better quality timing
1000000 loops, best of 3: 539 ns per loop

Finally, take a look at the (more than 1000 line) autogenerated C file sum.c:
bash$ wc -l sum.c
    1178 sum.c
bash$ less sum.c
...

Notice code like this, which illustrates that Cython generates code that supports both Python2 and Python3:


#if PY_MAJOR_VERSION < 3
  #define __Pyx_BUILTIN_MODULE_NAME "__builtin__"
#else
  #define __Pyx_BUILTIN_MODULE_NAME "builtins"
#endif

#if PY_MAJOR_VERSION >= 3
  #define Py_TPFLAGS_CHECKTYPES 0
  #define Py_TPFLAGS_HAVE_INDEX 0
#endif
The official Python docs say: "If you are writing a new extension module, you might consider Cython. It translates a Python-like language to C. The extension modules it creates are compatible with Python 3.x and 2.x."  

If you scroll down further you'll get past the boilerplate and see the actual code:

...
  /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":2
 * def sum_cython(long n):
 *     cdef long i, s = 0             # <<<<<<<<<<<<<<
 *     for i in range(n):
 *         s += i
 */
  __pyx_v_s = 0;

  /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":3
 * def sum_cython(long n):
 *     cdef long i, s = 0
 *     for i in range(n):             # <<<<<<<<<<<<<<
 *         s += i
 *     return s
 */
  __pyx_t_1 = __pyx_v_n;
  for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
    __pyx_v_i = __pyx_t_2;

    /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":4
 *     cdef long i, s = 0
 *     for i in range(n):
 *         s += i             # <<<<<<<<<<<<<<
 *     return s
 */
    __pyx_v_s += __pyx_v_i;
  }
...
There is a big comment that shows the original Cython code with context and a little arrow pointing at the current line (these comment blocks with context were I think the first thing I personally added to Pyrex... before, it just gave that first line with the .pyx filename and line number, but nothing else).  Below that big comment, there is the actual C code that Cython generates.  For example, the Cython code  s += i is turned into the C code __pyx_v_s += __pyx_v_i;.  


The Same Extension From Scratch, for Comparison

If you read Extending and Embedding Python you'll see how you could write a C extension module from scratch that does the same thing as sum.so above. Let's see what this is like, for comparison. Given how simple sum.pyx is, this isn't so hard. For more complicated Cython code---e.g., new extension classes, more complicated type conversions, and memory management---writing the C directly quickly becomes unwieldy.
First, create a file sum2.c as follows:


#include <Python.h>

static PyObject * 
sum2_sum_c(PyObject *self, PyObject *n_arg)
{
    long i, s=0, n = PyInt_AsLong(n_arg);
    
    for (i=0; i<n; i++)  {
        s += i;
    }
    PyObject* t = PyInt_FromLong(s);
    return t;
}

static PyMethodDef Sum2Methods[] = {
    {"sum_c", sum2_sum_c, METH_O, "Sum the numbers up to n."},
    {NULL, NULL, 0, NULL} /* Sentinel */
};

PyMODINIT_FUNC
initsum2(void)
{
    PyObject *m;
    m = Py_InitModule("sum2", Sum2Methods);
}
Now compile and run it as before: 
bash$ sage -sh
bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -bundle -undefined dynamic_lookup sum2.c -o sum2.so 
bash$ sage
...
sage: import sum2
sage: sum2.sum_c(101)
5050
sage: import sum
sage: sum.sum_cython(101)
5050
sage: timeit('sum.sum_cython(1000000r)')
125 loops, best of 3: 2.54 ms per loop
sage: timeit('sum2.sum_c(1000000r)')
125 loops, best of 3: 2.03 ms per loop
Note that this is a little faster than the corresponding Cython code. This is because the Cython code is more careful, checking various error conditions, etc.  

Note that the C code is 5 times as long as the Cython code.

Building Extensions using Setuptools Instead

In nontrivial projects, the Cython step of transforming your code from .pyx to .c is typically done by explicitly calling cython somehow (this will change in the newest version of Cython), but the step of running the C compiler is usually done using either distutils or setuptools. To use these tools, one creates a file "setup.py" that defines the extensions in your project, and Python itself then runs a C compiler for you, with the proper options, include paths, etc.

Let's create a new setuptools project that includes the sum and sum2 extensions that we defined above. First, create the following file and call it setup.py. This should be in the same directory as sum.c and sum2.c.
from setuptools import setup, Extension

ext_modules = [
    Extension("sum", ["sum.c"]),
    Extension("sum2", ["sum2.c"])
    ]

setup(
    name = 'sum',
    version = '0.1',
    ext_modules = ext_modules)
Then type 
bash$ rm *.so  # make sure something happens
bash$ sage setup.py develop
...
bash$ ls *.so
sum.so sum2.so


Notice that running
setup.py develop

resulted in Python generating the right gcc command lines for your platform. You don't have to do anything differently on Linux, OS X, etc.

If you change sum2.c, and want to rebuild it, just type sage setup.py develop again to rebuild sum2.so. If you change sum.pyx, you have to manually run Cython:
sage -cython sum.pyx

then again do sage setup.py develop to rebuild sum.so. Try this now. In sum.pyx, change
for i in range(n):

to
for i in range(1,n+1):

then rebuild:
 
bash$ sage -cython sum.pyx
...
bash$ sage setup.py develop
...
bash$ sage
...
sage: import sum
sage: sum.sum_cython(100)
5050

There are ways to make setup.py automatically notice when sum.pyx changes, and run Cython. A nice implementation of this will be in the next Cython release. See the setup.py and build_system.py files of Purple Sage for an example of how to write a little build system right now (before the new version of Cython).
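For what it's worth, newer Cython releases expose this directly as Cython.Build.cythonize, which runs Cython on any .pyx that changed as part of the build. Treat the following as a sketch that assumes a recent enough Cython is installed:

```python
# setup.py sketch -- assumes a Cython new enough to provide Cython.Build.cythonize.
# cythonize() re-runs Cython on the .pyx files whenever they change, so a single
# "sage setup.py develop" does both the .pyx -> .c and the .c -> .so steps.
from setuptools import setup, Extension
from Cython.Build import cythonize

ext_modules = [Extension("sum2", ["sum2.c"])]   # the handwritten C extension

setup(
    name='sum',
    version='0.1',
    ext_modules=cythonize("sum.pyx") + ext_modules,
)
```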

An Automated Way to Experiment

Given any single Cython file such as sum.pyx, in Sage you can do
sage: load sum.pyx
Compiling sum.pyx...
sage: sum_cython(100)
5050
Behind the scenes, Sage created a setup.py file, ran Cython, made a new module, compiled it, and imported everything it defines into the global namespace.   If you look in the spyx subdirectory of the directory listed below, before you exit Sage (!), then you'll see all this. 
sage: SAGE_TMP
'/Users/wstein/.sage//temp/deep.local/14837/'

You can also do
sage: attach sum.pyx

Then every time sum.pyx changes, Sage will notice this and reload it. This can be useful for development of small chunks of Cython code.

You can also use the Sage notebook, and put %cython as the first line of a notebook cell. The rest of the cell will be compiled exactly as if it were written to a .pyx file and loaded as above. In fact, that is almost exactly what happens behind the scenes.

Next Time

Now that we understand at a reasonably deep level what Cython really is and does, it is time to learn about the various constructs of the language:
  1. How to create extension classes using Cython.
  2. How to call external C/C++ library code.

We will rewrite our sum.pyx file first to use a class. Then we'll rewrite it again to make use of the MPIR (or GMP) C library for arithmetic, and again to make use of the C++ NTL library.

Wednesday, November 3, 2010

Cython, Sage, and the Need for Speed

Cython seriously rocks, at least for much of what I need. It's still the killer feature of Python/Sage, IMHO. And meetings like EuroScipy last summer really confirmed that, where almost every other talk used Cython.

History


Greg Ewing wrote "Pyrex" in 2002--2004..., which I guess he named
after the cookware. It is amazing, but to understand this you
must take a quick tour of Extending and embedding and the Python/C API reference. Pyrex let you write basically Python-ish code that gets magically turned into C extension code.

In 2007 I forked it (discuss why briefly) and named Cython after this punk rock guy.

At that time, Robert Bradshaw and Stefan Behnel spent a lot of time improving Cython, implementing tons of _optimizations_ and new features.

Cython is now very popular in the "Scientific computing using Python" world. It is also heavily used in Sage.

Are You Serious?

If you want to use a computer for math research, and you are serious (not some lazy person who fiddles then gives up), you will likely run into situations where you need code to run fast. Writing such code only in Python (or any other interpreted language) is often impossible.

If you want to write fast code on a computer, and don't want to mess with assembler, the only option right now is C, or something with equivalent speed... Cython! By "fast" I mean 100--1000 times faster than what you'll get out of Python on certain tasks. I also mean code that is evil, scary, and dangerous... if you aren't careful with preconditions.

Compiled versus Interpreted Code

Here's how interpreter code usually runs.
    1. Check a bunch of conditions then do one single thing.
    2. Check a bunch of conditions then do one single thing.
    ...
    10^6. Check a bunch of conditions then do one single thing.
Here's how compiled (C, Cython, etc.) code can be written:

    1. Check some conditions (optional, but a good idea);
    2. Do very unsafe stuff with no checks at all (but they 
       in theory should be safe given 1).
    ...
    10^6. Do very unsafe stuff with no checks at all (but they 
       in theory should be safe given 1).
The problem is that all the checks in step 1 (in either case) can easily take over 100 times as long as "do very unsafe stuff".

TYPICAL EXAMPLE:
sage: def sum_sage(n):
...       s = 1
...       for i in range(n):
...           s += i
...       return s
sage: timeit('sum_sage(100000r)')
5 loops, best of 3: 84.6 ms per loop
sage: %python
sage: def sum_python(n):
...       s = 1
...       for i in range(n):
...           s += i
...       return s    
sage: timeit('sum_python(100000r)')
125 loops, best of 3: 5.88 ms per loop
sage: 84.6/5.88
14.3877551020408
sage: %cython
sage: def sum_cython(int n):
...       cdef int i, s = 1
...       for i in range(n):
...           s += i
...       return s
sage: timeit('sum_cython(100000r)')
625 loops, best of 3: 61.6 µs per loop
sage: 5.88 / 0.061
96.3934426229508   

Let me explain what's going on in the above. In the first one (sum_sage), the program is doing a sort of monologue: "I have to add a Python int to a Sage int. I don't have any code to do that directly (that would get too complicated, and they are such big, complicated, and different objects, and they might change, oh my). So I'll convert the Python int to a Sage int, because that's the only conversion I know. OK, I do that via some code Gonzalo Tornaria wrote that is scary complicated (it used to be base 10 string parsing!)... and once that is done, I've got my new MPIR-based Sage integer, which I then add to s. The addition takes some memory that points to the two MPIR integers, and since Python numbers are supposed to be immutable, I make yet another MPIR number (wrapped in a Python object), which is the result of asking MPIR to add them. MPIR numbers are also very complicated objects, involving stuff like limbs and C structs, which hardly anybody fully understands. Despite these integers happening to be small, there is still quite some overhead in the addition, but it happens (taking a small fraction of the total runtime). Then we move on to the next step in the loop!"

With sum_python, the loop is similar, but MPIR isn't involved, and there are no conversions. This buys a 14-fold speedup. But it is still not super fast, since many new Python objects get created, the code is for "potentially huge integers", hence a potentially complicated data structure has to be checked for, etc.

With sum_cython, the integers are only C ints, which are just 32- or 64-bit locations in memory. Doing "s += i" just modifies that position in memory in place. There are no conversions or type checks done at all at run time. It's really fast... about 1370 times faster than the first version!!!

Key point: If you truly understand what is going on, you'll see that this isn't Sage being fundamentally broken. Instead, you'll hopefully be able to look at a block of Sage code and have a clue about how to figure out what it is really doing in order to see whether writing a new implementation of the same algorithm using Cython (which will likely mean directly working with C level data structures) is likely to give you a big speedup. If you look at the innermost statement in a loop, and there's a big monologue about what is really going on, then you might get a 1000-fold speedup by using Cython.

In mathematics, general theorems -- once we have them -- are almost always much better than proofs of special cases. In math, proving a special case can often seem more awkward and unnatural than proving the general case (e.g., how would you prove that every integer of the form a^2 + 7*a + 5 factors uniquely as a product of primes!?). With general theorems, the statements are often simple and clear, so applying them is easier than applying theorems that are only about some very special case, which often have more elaborate hypotheses. In mathematics, a general theorem is usually simply all-around better than a theorem about some very special cases (especially if both are available).

In contrast, when writing computer programs, algorithms to solve very general cases of problems often have significant drawbacks in terms of speed (and sometimes complexity) over algorithms for special cases. Since you are mathematicians, you should constantly guard against your instincts from math research which can point you in exactly the wrong direction for writing very fast code. Often implementations of very general algorithms _are_ easier to understand, and are much less buggy than a bunch of implementations of special cases. However, there are also usually very severe performance penalties in implementing only the general case. Watch out.

A huge part of understanding the point of Cython for writing fast math code is that you must accept that you're going to write a lot of "ugly" (from a mathematician's perspective) code that only deals with special cases. But it's beautiful from the perspective of somebody who absolutely needs fast code for a specific research application; your fast code can lead to whole new frontiers of research.

Continuing the example from above:
sage: sum_python(10^8)
4999999950000001
sage: sum_cython(10^8)
887459713

Yep, we just implemented only a special case in Cython!
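The wrong answer is nothing mysterious: it's 32-bit overflow. Since sum_cython declared s as a C int, the true sum wraps around modulo 2^32. We can reproduce the bogus value in plain Python (assuming a 32-bit two's-complement C int, as in the run above):

```python
# Reproduce the bogus sum_cython(10^8) answer in plain Python.
# sum_cython's s is a C int (32 bits here), so the true sum wraps
# around modulo 2^32, two's-complement style.
n = 10**8
true_sum = 1 + n * (n - 1) // 2   # s started at 1 in sum_cython above

wrapped = true_sum % 2**32        # truncate to 32 bits...
if wrapped >= 2**31:              # ...and reinterpret as a signed int
    wrapped -= 2**32
print(wrapped)  # 887459713 -- exactly the wrong answer we saw
```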

---------------

Useful things to do if you want Cython enlightenment (all at once, no order):
  • Definitely read/skim Extending and try all examples. This is critical if you want to understand Cython with any depth at all. Don't think: I don't need to know any of this, because I have Cython. Yes, after you play with this a bit you may never explicitly use it. But you do need to know it.
  • Oh, definitely learn the C language if you haven't already. This is
    the book. There are courses. It's crazy not to learn C, anyways, since it (along with C++) is hugely popular, and a massive amount of code is written in C/C++. (See, e.g., http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html where C and C++ together are the most popular, by a long margin.)

  • Obviously, read http://docs.cython.org/ until it is burned into your brain.


  • Look at the code that Cython generates (use "cython -a" for a nice HTML view).