Thursday, December 15, 2011

How big is the core Sage library?

I just did the following with Sage-4.8.alpha5:
  1. "sudo apt-get install sloccount".
  2. "cp -rv SAGE_ROOT/devel/sage-main /tmp/x"
  3. Use a script [1] to rename all .pyx and .pxi files to .py.
  4. Run "sloccount *" in the /tmp/x directory, which ignores autogenerated .c/.cpp files coming from Cython.

Here's the result for the full Sage library, which does not distinguish between Python and Cython. Note that sloccount really only counts lines of code -- comments and blank lines are ignored.

Totals grouped by language (dominant language first):
python:      530370 (96.41%)
ansic:        14538 (2.64%)
cpp:           5188 (0.94%)

This suggests that the core Sage library is just over a "half million lines of Python and Cython source code, not counting comments and whitespace".
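As a quick sanity check, the percentages reported by sloccount can be recomputed from the three raw counts. This is only a small sketch; the dictionary and the helper name share are mine, not part of sloccount. The counts add up to 550096 lines.

```python
# Raw SLOC counts copied from the sloccount totals above.
counts = {"python": 530370, "ansic": 14538, "cpp": 5188}

total = sum(counts.values())


def share(language):
    """Return the given language's share of the total, as a percentage."""
    return 100.0 * counts[language] / total


# Print the languages from largest to smallest, like sloccount does.
for language in sorted(counts, key=counts.get, reverse=True):
    print("%-8s %7d (%.2f%%)" % (language, counts[language], share(language)))
print("%-8s %7d" % ("total", total))
```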

Here's the breakdown by module:
SLOC    Directory       SLOC-by-Language (Sorted)
88903   rings           python=87720,cpp=1183
72913   combinat        python=71629,cpp=1284
47747   schemes         python=46255,cpp=1492
39815   graphs          python=28377,ansic=11438
31540   matrix          python=31540
31019   modular         python=31012,ansic=7
24475   libs            python=21171,ansic=2845,cpp=459
20517   misc            python=20383,ansic=134
18006   interfaces      python=18006
17577   geometry        python=16936,cpp=641
12775   categories      python=12775
12093   server          python=12093
11971   groups          python=11971
11961   plot            python=11961
10686   crypto          python=10686
9920    modules         python=9920
8389    symbolic        python=8260,cpp=129
8150    algebras        python=8150
7260    ext             python=7198,ansic=62
7093    structure       python=7093
6364    coding          python=6364
5670    functions       python=5670
5249    homology        python=5249
4798    numerical       python=4798
4323    quadratic_forms python=4323
3919    gsl             python=3919
3911    calculus        python=3911
3879    sandpiles       python=3879
3003    sets            python=3003
2647    databases       python=2647
2074    logic           python=2074
1736    finance         python=1736
1608    games           python=1608
1465    monoids         python=1465
1435    tests           python=1383,ansic=52
1370    stats           python=1370
971     interacts       python=971
959     tensor          python=959
906     lfunctions      python=906
308     parallel        python=308
275     probability     python=275
219     media           python=219
197     top_dir         python=197

Here is the script [1]:
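#!/usr/bin/env python
# Rename every Cython source file (.pyx, .pxi) under the current directory
# to .py, so that sloccount counts it as Python.

import os
import shutil

for dirpath, dirnames, filenames in os.walk('.'):
    for f in filenames:
        if f.endswith(('.pyx', '.pxi')):
            # print(f) with one argument works in both Python 2 and Python 3.
            print(f)
            shutil.move(os.path.join(dirpath, f),
                        os.path.join(dirpath, os.path.splitext(f)[0] + '.py'))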
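One caveat with renaming in place: if a directory held, say, both foo.pyx and foo.pxi, or already had a foo.py, two files would map to the same .py name, and one could end up replacing the other, which would undercount lines. Whether any such pairs exist in sage-main is not something I checked. The following is a hedged sketch of a dry run that lists such clashes before anything is moved; the function name find_collisions is hypothetical.

```python
import os


def find_collisions(root, extensions=('.pyx', '.pxi')):
    """Return the sorted .py paths that a rename would write more than once.

    A path is reported if a .py file of that name already exists, or if
    two Cython files in the same directory share the same base name.
    """
    collisions = set()
    for dirpath, dirnames, filenames in os.walk(root):
        existing = set(filenames)
        seen = set()  # targets already claimed in this directory
        for name in sorted(filenames):
            base, ext = os.path.splitext(name)
            if ext not in extensions:
                continue
            target = base + '.py'
            if target in existing or target in seen:
                collisions.add(os.path.join(dirpath, target))
            seen.add(target)
    return sorted(collisions)


if __name__ == "__main__":
    for path in find_collisions('.'):
        print(path)
```

An empty result means the rename can proceed without any file landing on top of another.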

Tuesday, December 13, 2011

Using Sage to Support Research Mathematics

When using Sage to support research mathematics, the most important thing is to strongly encourage people to do the extra work of turning their "scruffy research code" into a patch that can be peer reviewed and included in Sage. They will have to doctest 100% of it, and the quality of their code may improve dramatically as a result. Including code in Sage means that the code will continue to work as Sage is updated. Also, the code is peer reviewed and has to have examples and documentation for every function. That's a much higher bar than just "reproducible research".
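To make "examples and documentation for every function" concrete, here is a minimal sketch of a doctested function in plain Python. The function count_divisors is a made-up illustration, not code from Sage. Sage's own doctests use a "sage:" prompt where plain Python uses ">>>", and are run with Sage's test tooling rather than with doctest.testmod as below.

```python
import doctest


def count_divisors(n):
    """
    Return the number of positive divisors of the positive integer ``n``.

    EXAMPLES::

        >>> count_divisors(1)
        1
        >>> count_divisors(12)
        6
        >>> count_divisors(97)
        2

    TESTS::

        >>> count_divisors(0)
        Traceback (most recent call last):
        ...
        ValueError: n must be a positive integer
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    count = 0
    d = 1
    # Divisors come in pairs (d, n // d); a square root pairs with itself.
    while d * d <= n:
        if n % d == 0:
            count += 1 if d * d == n else 2
        d += 1
    return count


if __name__ == "__main__":
    # Run every example in the docstrings above and report any failures.
    print(doctest.testmod())
```

Every example in the docstring is executed and compared with the stated output, so the documentation cannot silently drift away from what the code does.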

Moreover, getting code up to snuff for inclusion in Sage will often reveal mistakes, which avoids embarrassment later. Right now I'm fixing some issues related to a soon-to-be-finished paper that I found while doing just this for trac 11975.

This final step of turning snippets of research code into a peer-reviewed contribution to Sage is: (1) a surprisingly huge amount of very important, useful work, (2) something that is emphasized as an option for Sage more than with Magma or Mathematica or Pari (say), (3) something whose value people have to be sold on, since at present they usually get no real extra academic credit for it, and journal referees often don't care either way (I do, but I'm probably in the minority there), and (4) something that a *lot* of research mathematicians do not do. As an example of (4), in the last two months I've seen a ton of (separate!) bodies of code, all of it more or less secret research code in various Dropbox repos, none of which is currently squarely aimed at going into Sage. I've also seen a bunch of code related to Edixhoven et al.'s algorithm for computing Galois representations with a similar property (there is now trac 12132, due to my urging).

I did *not* do this step yet with this recently accepted paper. Instead I used "scrappy research code" in psage to do the fast L-series computations. The referee for Math Comp didn't care either way, actually... I hope this doesn't come back to haunt me, though there are many double checks here (e.g., BSD) so I'm not too worried. I will do this get-it-in-Sage step at some point though.

This will be better for the community in the long run, and better for individual researchers' credibility too. And there is a lot of value in having a stable refereed snapshot of code on which a published (=very stable) paper is based.