Sage: Open Source Mathematics Software -- This is my blog about things related to Sage. By William Stein.<br /><br /><b>SageMathCloud's new storage architecture</b> (2014-04-15)<br /><br /><b>Keywords:</b> ZFS, bup, rsync, Sage <br /><br /><a href="https://cloud.sagemath.com/" target="_blank">SageMathCloud</a> (SMC) is a browser-based hosted cloud computing environment for easily collaborating on Python programs, IPython notebooks, Sage worksheets and LaTeX documents. I spent the last four months wishing very much that <b>fewer</b> people would use SMC. Today that has changed, and this post explains some of the reasons why. <br /><h2>Consistency Versus Availability</h2>Consistency and availability are competing requirements. It is trivial to keep the files in a <a href="https://cloud.sagemath.com/" target="_blank">SageMathCloud</a> project consistent if we store them in exactly one place; however, when the machine that the project is on goes down for any reason, the project stops working, and the users of the project are very unhappy. By making many copies of the files in a project, it's fairly easy to ensure that the project is always available, even if network switches in multiple data centers completely fail, etc. Unfortunately, if there are too many users and the synchronization itself puts too heavy a load on the overall system, then machines will fail more frequently, and though projects are available, files do not stay consistent and data is lost to the user (though still "out there" somewhere for <b>me</b> to find). <br /><br />Horizontal scalability of file storage and availability of files are also competing requirements. If there are a few compute machines in one place, then they can all mount user files from one central file server.
Unfortunately, this approach leads to horrible performance if the network is slow or has high latency; it also doesn't scale up to potentially millions of users. A benchmark I care about is downloading a <a href="http://boxen.math.washington.edu/home/sagemath/sage-mirror/linux/64bit/sage-6.1.1-x86_64-Linux-Ubuntu_12.04_x86_64.tar.lzma" target="_blank">Sage binary (630MB)</a> and extracting it (creating over 70,000 files); I want this to take at most 3 minutes total, which is hard using a networked filesystem served over the general Internet between data centers. Instead, in SMC, we store the files for user projects on the compute machines themselves, which provides optimal speed. Moreover, we use a compressed filesystem, so in many cases read and write speeds are nearly twice as fast as they might be otherwise. <br /><h2>New Architecture of <a href="https://cloud.sagemath.com/" target="_blank">SageMathCloud</a> </h2>An SMC project with id <tt>project_id</tt> consists of two directories of files, replicated across several machines using rsync: <br /><ol><li> The HOME directory: <tt>/projects/project_id</tt></li><li> A <a href="https://github.com/bup/bup" target="_blank">bup</a> repository: <tt>/bup/bups/project_id</tt></li></ol>Users can also create files they don't care too much about in <tt>/scratch</tt>, which is a compressed and deduplicated ZFS filesystem. It is not backed up in any way, and is local to that compute machine. <br /><br />The <tt>/projects</tt> directory is one single big <a href="http://zfsonlinux.org/" target="_blank">ZFS</a> filesystem, which is both lz4 compressed and deduplicated. ZFS compression is just plain awesome. ZFS deduplication is much more subtle, as deduplication is tricky to do right.
Since data can be deleted at any time, one can't just use a <a href="http://en.wikipedia.org/wiki/Bloom_filter" target="_blank">Bloom filter</a> to very efficiently tell whether data is already known to the filesystem, and instead ZFS uses a much less memory-efficient data structure. Nonetheless, deduplication works well in our situation, since the compute machines all have sufficient RAM (around 30-60GB), and the total data stored in <tt>/projects</tt> is well under 1TB. In fact, right now most compute machines have about 100GB stored in <tt>/projects</tt>. <br />The <tt>/bup/bups</tt> directory is also one single big ZFS filesystem; however, it is neither compressed nor deduplicated. It contains <a href="https://github.com/bup/bup" target="_blank">bup</a> repositories, where bup is an <b><i>awesome</i></b> git-based backup tool written in Python that is designed for storing snapshots of potentially large collections of arbitrary files in a compressed and highly deduplicated way. Since the git pack format is already compressed and deduplicated, and bup itself is highly efficient at deduplication, we would gain almost nothing by using compression or deduplication directly on this ZFS filesystem. When bup deduplicates data, it does so using a sliding window through the file, unlike ZFS, which simply breaks the file up into blocks, so bup does a much better job at deduplication. Right now, most compute machines have about 50GB stored in <tt>/bup/bups</tt>. <br /><br />When somebody actively uses a project, the "important" working files are snapshotted about once every two minutes. These snapshots are done using bup and stored in <tt>/bup/bups/project_id</tt>, as mentioned above. After a snapshot is successfully created, the files in the working directory and in the bup repository are copied via rsync to each replica node.
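The difference between fixed-block and sliding-window deduplication described above can be illustrated with a toy content-defined chunker. This is just an illustrative sketch, not bup's actual code: the window size, hash, and divisor here are made up and far smaller than anything real. It shows why inserting a single byte near the start of a file invalidates every fixed-size block, while a chunker whose cut points depend only on nearby content re-synchronizes almost immediately:

```python
import hashlib
import random

def fixed_chunks(data, size=64):
    # ZFS-style: split at fixed byte offsets.
    return [data[i:i + size] for i in range(0, len(data), size)]

def rolling_chunks(data, window=16, divisor=32):
    # bup-style (toy): cut wherever a rolling hash of the last
    # `window` bytes hits a magic value, so cut points depend on
    # nearby content rather than on absolute file offsets.
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h += data[i]
        if i >= window:
            h -= data[i - window]   # slide the window forward
        if i - start + 1 >= window and h % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return chunks

def dedup_ratio(versions):
    # Unique chunks divided by total chunks across all versions;
    # 1.0 means nothing deduplicated at all.
    digests = [hashlib.sha1(c).hexdigest() for v in versions for c in v]
    return len(set(digests)) / len(digests)

random.seed(0)
original = bytes(random.randrange(256) for _ in range(4096))
shifted = b"X" + original   # insert one byte at the very front

fixed = dedup_ratio([fixed_chunks(original), fixed_chunks(shifted)])
rolling = dedup_ratio([rolling_chunks(original), rolling_chunks(shifted)])

# Every fixed block shifts, so almost nothing deduplicates; the
# content-defined chunker re-synchronizes right after the insertion.
print("fixed:", fixed, "rolling:", rolling)
```

(If I recall bup's DESIGN notes correctly, its real chunker uses an rsync-style rolling checksum and splits when the low 13 bits match, giving roughly 8KB average chunks; the principle is the same as this toy.)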
The users of the project do not have direct access to <tt>/bup/bups/project_id</tt>, since it is of vital importance that these snapshots cannot be corrupted or deleted, e.g., if you are sharing a project with a fat-fingered colleague, you want peace of mind that even if they mess up all your files, you can easily get them back. However, all snapshots are mounted at <tt>/projects/project_id/.snapshots</tt> and browsable by the user; this uses bup's FUSE filesystem support, enhanced with some <a href="https://github.com/williamstein/bup-1" target="_blank">patches I wrote</a> to support file permissions, sizes, change times, etc. Incidentally, the bup snapshots have no impact on the user's disk quota. <br /><br />We also back up <i>all</i> of the bup archives (and the database nodes) to a single large bup archive, which we regularly back up offsite on encrypted USB drives. Right now, with nearly 50,000 projects, the total size of this large bup archive is under 250GB (!), and we can use it to efficiently recover any particular version of any file in any project. The size is relatively small due to the excellent deduplication and compression that bup provides. <br /><br />In addition to the bup snapshots, we also create periodic snapshots of the two ZFS filesystems mentioned above... just in case. Old snapshots are regularly deleted. These are accessible to users if they search around enough with the command line, but are not consistent between different hosts of the project, hence using them is not encouraged. This ensures that even if the whole replication/bup system were to somehow mess up a project, I can still recover everything exactly as it was before the problem happened; so far there haven't been any reports of problems. <br /><h2>Capacity</h2>Right now there are about 6000 unique weekly users of SageMathCloud and often about 300-400 simultaneous users, and there are nearly 50,000 distinct projects.
Our machines are at about 20% disk space capacity, and most of them can easily be expanded by roughly a factor of ten (from 1TB to 12TB). Similarly, disk space for our Google Compute Engine nodes is <a href="https://cloud.google.com/products/compute-engine/#pricing">$0.04 per GB per month</a>. So space-wise we could scale up by a factor of 100 without too much trouble. The CPU load is at about 10% as I write this, during a busy afternoon with 363 clients connected very actively modifying 89 projects. <b>The architecture that we have built could scale up to a million users, if only they would come our way...</b><br /><br /><b>What is SageMathCloud?</b> (2014-02-12)<br /><br /><h2>The two main reasons for the existence of <a href="https://cloud.sagemath.com">SageMathCloud (SMC)</a> are...</h2> <p><b>Goal 1.</b> Increase resources for <a href="http://sagemath.org">Sage</a>: Generate a different long-term revenue stream to support development of Sage, i.e., open source mathematical software. By "different", I mean different from government and foundation grants and donations, which are relatively limited for primarily pure mathematics software development, which is what Sage specializes in. Even in my wildest dreams, it is very unlikely Sage will get more than a million dollars a year in funding (and in practice it gets a lot less); however, a successful commercial product with wide adoption has the potential to generate significantly more than a million dollars a year in revenue -- of course most would go back into the product... but when the product is partly Sage, that's fine. The National Science Foundation (and other donors) have played a major part during the last 8 years in funding Sage, but I think everybody would benefit from another funding source.
<p><b>Goal 2.</b> Increase the usage of Sage: The number of unique visitors per month to <a href="http://sagemath.org">http://sagemath.org</a> grew nicely from 2005 (when I started Sage) until Summer 2011, after which it has remained fairly constant at 70,000 unique visitors. There is no growth at all: it was 70,332 in Jan 2011, and it was 70,449 last month (Jan 2014), both with a bounce rate of about 50%. A significant obstruction to growth is accessibility, which SMC helps to address for certain users (last month the SMC website had 17,700 unique visitors with a bounce rate of about 30%). <div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-69d24EsPxds/UvuG3pmFHwI/AAAAAAABKA0/OF5NNIbPyqI/s1600/Screen+Shot+2014-02-12+at+6.35.58+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-69d24EsPxds/UvuG3pmFHwI/AAAAAAABKA0/OF5NNIbPyqI/s400/Screen+Shot+2014-02-12+at+6.35.58+AM.png" /></a></div> <p>Here's an actual email I received from somebody literally as I was writing this, which I think illustrates how SMC addresses the second goal: <pre><br /> Hey William,<br /><br /> Today I stopped by cloud.sagemath.com because <br /> I wanted to do some computation with sage, and <br /> cloud is announced in a big way on sagemath.org<br /><br /> This is after a lengthy hiatus from computing<br /> with sage ( maybe a year ).<br /><br /> Using cloud.sagemath.com completely blew my <br /> mind. At first I did not really understand <br /> why sagenb was ditched after all the work that <br /> went into it. But man, cloud is really a <br /> pleasure to use !<br /><br /> I just wanted to share the joy :)<br /><br /> Thanks for all that you do !<br /></pre> <h2>Licensing and Reuse of the SageMathCloud Codebase</h2> The design and coding of SageMathCloud (SMC) have been mostly supported by the University of Washington (UW).
Due to goal 1 above, I have been working from the start (before a line of code was written) with the <a href="http://depts.washington.edu/uwc4c/about-c4c/">commercialization/tech transfer office of UW</a>, who (because of goal 1) are not enthusiastic about simply open sourcing the whole SMC codebase; keeping part of it closed is a condition for their help with commercialization. Some of SMC is <a href="https://github.com/sagemath/cloud/">open sourced</a>, mainly the code that runs on the VMs and some of the HTML5 client that runs in the browser. We also plan to make the HTML5 client and a mini server BSD licensed, and include them with Sage (say) as a new local graphical interface. Of course SMC builds on top of many standard open source libraries and tools (e.g., CodeMirror, Cassandra, ZFS, Node.js, etc.). <p>There is, however, a large amount of interesting backend code, which is really the "cloud" part of SMC, and which we do not intend to release as open source. We do intend to sell licenses (with support) for the complete package, when it is sufficiently polished, since many organizations want to run their own private SMC servers, mainly for confidentiality reasons. <p>Goal 2 above mainly impacts how we market SMC. However, it's easy to completely ignore Sage and still get a lot of value out of SMC. I just glanced at what people are doing as I write this, and the result seems pretty typical: latex'ing documents, some Sage worksheets, some IPython notebooks, editing a Perl script. <p>It's important to understand how SMC is different from other approaches to cloud computing. It's designed to make certain things very easy, but they are quite different things from what "traditional" cloud stacks like OpenStack are designed to make easy.
SMC is supposed to make the following easy: <ul> <li> using Sage and IPython, both command line and notebook interfaces.</li> <li>writing a paper using LaTeX (possibly with a specific private list of collaborators).</li> <li>editing source code, e.g., developing Python/C/etc. libraries, again possibly with realtime collaboration.</li> <li>creating collaborative "projects", each of which is really a Linux account on a machine, providing isolation from other projects.</li> <li> backups: all data is automatically snapshotted frequently.</li> <li>high availability: failure of a machine (or even a whole data center) results in at most a few minutes of lost time/work.</li> <li>speed: files are stored on a compressed local filesystem, which is snapshotted and replicated out regularly; thus the filesystem feels fast and is scalable, as compared to a networked filesystem.</li> </ul> <p>The above design goals are useful for certain target audiences, e.g., people doing Sage/Python/etc. development, teachers and students in courses that make use of Sage/Python/etc., and collaborative math research projects. SMC is designed so that a large number of people can make simultaneous small use of ever-expanding resources. SMC should also fully support the "social networks" that form in this context. At the same time, it's critical that SMC have excellent uptime and availability (and offsite backups, just in case), so that people can trust it. By trust, I don't mean so much in the sense of "trust it with proprietary info", but in the sense of "trust it to not just lose all my data and to be there when I'm giving a talk/teaching a class/need to do homework/etc.".
<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-MCCg-P5HRfc/UvuFYxydQ_I/AAAAAAABKAo/En6oZi_fXHM/s1600/Screen+Shot+2014-02-12+at+6.29.00+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-MCCg-P5HRfc/UvuFYxydQ_I/AAAAAAABKAo/En6oZi_fXHM/s320/Screen+Shot+2014-02-12+at+6.29.00+AM.png" /></a></div> <p>However, the above design goals are exactly at odds with some of the goals of large-scale scientific computing/supercomputing. The following are <b>not</b> design goals of SMC: <ul> <li> <b>supercomputing</b> -- have large data that many distributed processes operate on: exactly what people often do on supercomputers (or with Hadoop, etc.)</li> <li> <b>traditional "cloud computing"</b> -- dynamically spin up many VMs, run computations on them, then destroy them. With SMC, things tend to get created but not destroyed (e.g., projects and files in them), and a full VM is much too heavy given the number of users and type of usage that we have already (and plan to have).</li></ul> <p>What happens in practice with SMC is that people run smaller-scale computations on SMC (say, things that just take a few cores), and when they want to run something bigger, they ssh from SMC to other resources they have (e.g., a supercomputer account) and launch computations there. All project collaborators can see what anybody types in a terminal, which can be helpful when working with remote compute clusters.
<p>Anyway, I hope this helps to clarify what SMC actually is.<br /><br /><b>Holiday Coding the SageMath Cloud</b> (2013-12-16)<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-isYJONABqY4/Uq8vBRPmF5I/AAAAAAABIz0/TVt98dvbXfM/s1600/Screen+Shot+2013-12-16+at+8.41.43+AM.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="146" src="http://1.bp.blogspot.com/-isYJONABqY4/Uq8vBRPmF5I/AAAAAAABIz0/TVt98dvbXfM/s200/Screen+Shot+2013-12-16+at+8.41.43+AM.png" width="200" /></a></div><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18px;">I love the Holiday break. </span><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18px;">I get to work on <a href="https://cloud.sagemath.com">https://cloud.sagemath.com</a> (SMC) all day again! Right now I'm working on a multi-data center extension of <a href="http://www.gluster.org">http://www.gluster.org</a> for storing a large pool of sparse compressed deduplicated ZFS image files that are efficiently replicated between data centers. Soon SMC projects will all be hosted in it, which will mean that they can be moved between computers very quickly, will be available even if all but one data center goes down, and will have ZFS snapshots instead of the current snapshot system. ZFS snapshots are much better for this application, since you can force them to happen at a point in time, with tags, and also delete them if you want. A little later I'll even make it so you can do a full download (to your computer) of an SMC project (and all snapshots!) by just downloading the ZFS image file and mounting it yourself.
</span><br /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18px;">I'm also continuing to work on adding a Google Compute Engine data center; the web server part is hosted there right now at <a href="https://108.59.84.126/">https://108.59.84.126/</a>, but the really interesting part will be making compute nodes available, since the GCE compute nodes are very fast. I'll be making 30GB RAM 8-core instances available, so one can start a project there and just get access to that -- for free to SMC users, despite the official price being </span><span style="background-color: #fafafa; color: #222222; font-family: Arial; font-size: 13px; line-height: 17.546875px; text-align: center;">$0.829/hour. I hope this happens soon. </span><br /><br /><b>The Sagemath Cloud: a minute "elevator description"</b> (2013-12-10)<br /><br /><a href="https://cloud.sagemath.com/" target="_blank">The Sagemath Cloud</a> combines open source technology that has come out of cloud computing and mathematical software (e.g., web-based Sage and IPython worksheets) to make online mathematical computation easily accessible.
People can collaboratively use mathematical software, author LaTeX documents, use a full command line terminal, and edit complicated computer programs, all using a standard web browser with no special plugins. The core design goals of the site are collaboration and very high reliability, with data mirrored between multiple data centers. The current dedicated infrastructure should handle over a thousand simultaneous active users, and the plan is to scale up to tens of thousands of users as demand grows (about 100 users sign up each day right now). Most open source mathematical software is pre-installed, and users can also install their own copies of proprietary software, if necessary. There are currently around 1000 users on the site each day from all over the world. <br /><br />The Sagemath Cloud is under very active development, and there is an ongoing commercialization effort through the University of Washington, motivated by many users who have requested more compute power, disk space, or the option to host their own install of the site. Though the main focus is on mathematics, the website has also been useful to people in technical areas outside mathematics that involve computation. <br /><br /><h2>Jason Grout's description of <a href="https://cloud.sagemath.com/">the Sagemath Cloud</a> (2013-10-19)</h2>William Stein, the lead developer of Sage, has been developing a new online interface to Sage, the Sage Cloud at <a href="https://cloud.sagemath.com/">https://cloud.sagemath.com</a>. Currently in beta status, it is already a powerful computation and collaboration tool. Work is organized into projects, which can be shared with others.
Inside a project, you can create any number of files, folders, Sage worksheets, LaTeX documents, code libraries, and other resources. Real-time collaborative editing allows multiple people to edit and chat about the same document simultaneously over the web. <br /><br />The LaTeX editor features near real-time preview, forward and reverse search, and real-time collaboration. Also, it is easy to have Sage do computations or draw figures and have those automatically embedded into a LaTeX document using the SageTeX package (for example, after including the sagetex package, typing <tt>\sageplot{plot(sin(x))}</tt> in a TeX document inserts the plot of sin(x)). A complete Linux terminal is also available from the browser to work within the project directory. Snapshots are automatically saved and backed up every minute to ensure work is never lost. William is rapidly adding new features, often within days of a user requesting them.<br /><br /><b>"A Symphony of Cursors"</b> (guest post by Jason Grout, 2013-10-12)<br /><br />Today's post is from guest blogger Jason Grout, lead developer of the <a target="_blank" href="https://sagecell.sagemath.org/">Sage Cell Server</a>. <p><p>The other day some students and I met to do some development on the Sage cell server. We each opened up our shared project on <a target="_blank" href="https://cloud.sagemath.com">cloud.sagemath.com</a> on our own laptops, and started going through the code. We had a specific objective. The session went something like this: <p><p>Jason: Okay, here's the function that we need to modify. We need to change this line to do X, and we need to change this other line to do Y. We also need to write this extra function and put it here, and change this other line to do Z. <p>James: can you do X?
David: can you look up somewhere on the net how to do Y and write that extra function? I'll do Z. <p><p>Then in a matter of minutes, cursors scattering out to the different parts of the code, we had the necessary changes written. I restarted the development Sage cell server running inside the cloud account, and we were each able to test the changes. We realized a few more things needed to be changed, so we divided up the work, and in a few more minutes each had made the necessary changes. <p><p>It was amazing: watching all of the cursors scatter out into the code, each person playing a part to make the vision come true, and then quickly coming back together to regroup, reassess, and test the final complete whole. Forgive me for waxing poetic, but it was like a symphony of cursors, each playing their own tune in their lines of the code file, weaving together a beautiful harmony. This fluid syncing William wrote takes distributed development to a new level. <p><p>Thanks! <br><br> <div class="separator" style="clear: both; text-align: center;"><a href="http://bigbandtranscriptions.com/sym3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://bigbandtranscriptions.com/sym3.jpg" /></a></div><br /><br /><b>Backing up the Sagemath Cloud</b> (2013-10-03)<br /><br />The terms of usage of the <a href="https://cloud.sagemath.com/" target="_blank">Sagemath Cloud</a> say "This free service is <i><b>not guaranteed to have any uptime or backups</b></i>." That said, I do actually care a huge amount about backing up the data stored there, and ensuring that you don't lose your work. <br /><h2>Bup</h2>I spent a lot of time building a snapshot system for user projects on top of <a href="https://github.com/bup/bup" target="_blank">bup</a>.
Bup is a highly efficient de-duplicating compressed backup system built on top of git; unlike <a href="https://sparkleshare.org/" target="_blank">other approaches</a>, you can store arbitrary data, huge files, etc. <br /><br />I looked at many open source options for making efficient de-duplicated distributed snapshots, and I think bup is overall the best, especially because the source code is readable. Right now <a href="https://cloud.sagemath.com/" target="_blank">https://cloud.sagemath.com</a> makes several thousand bup snapshots every day, and it has saved people many, many hours of potentially lost work (due to them accidentally deleting or corrupting files). <br /><br />You can access these snapshots by clicking on the camera icon on the right side of the file listing page. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-06L7aS49Bsg/Uk3WY4PHw4I/AAAAAAABIHM/6QTMXLF64cs/s1600/snap.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="156" src="http://4.bp.blogspot.com/-06L7aS49Bsg/Uk3WY4PHw4I/AAAAAAABIHM/6QTMXLF64cs/s1600/snap.png" width="640" /></a></div><br /><h2>Some lessons learned when implementing the snapshot system</h2><ul><li> Avoid creating a large number of branches/commits -- creating an almost-empty repo, but with say 500 branches, even with very little in them, makes things <i>painfully</i> slow, e.g., due to an enormous number of separate calls to git. When users interactively get directory listings, it should take at most about 1 second to get a listing, or they will be annoyed.
I made some possibly-hackish optimizations -- mainly caching -- to offset this issue, which are here in case anyone is interested: <a href="https://github.com/williamstein/bup" target="_blank">https://github.com/williamstein/bup</a> (I think they are too hackish to be included in bup, but anybody is welcome to them.)<br /><br /> </li><li> Run a regular test of how long it takes to access the file listing in the latest commit, and if it gets above a threshold, create a new bup repo. So in fact the bup backup daemons really manage a sequence of bup repos. There are a bunch of these daemons running on different computers, and it was critical to implement locking, since in my experience bad things happen if you try to back up an account using two different bups at the same time. Right now, typically a bup repo will have about 2000 commits before I switch to another one.<br /><br /> </li><li> When starting a commit, I wrote code to save information about the current state, so that everything could be rolled back in case an error occurs, due to files moving, network issues, the snapshot being massive due to a nefarious user, power loss, etc. This was critical to avoid the bup repo getting corrupted, and hence broken.<br /><br /> </li><li> In the end, I stopped using branches, due to complexity and inefficiency, and just make all the commits in the same branch. I keep track of what is what in a separate database. Also, when making a snapshot, I record the changed files (as output by the command mentioned above) in the database with the commit, since this information can be really useful, and is impossible to get out of my backups, due to using a single branch, the bup archives being on multiple computers, and also there being multiple bup archives on each computer.
NOTE: I've been recording this information for cloud.sagemath for months; it is not yet exposed in the user interface, but will be soon.<br /><br /> </li></ul><h2>Availability</h2>The snapshots are distributed around the <a href="https://cloud.sagemath.com/" target="_blank">Sagemath Cloud</a> cluster, so failure of single machines doesn't mean that backups become unavailable. I also have scripts that automatically rsync all of the snapshot repositories to machines in other locations, and keep offsite copies as well. It is thus unlikely that any file you create in cloud.sagemath could just get lost. For better or worse, it is also impossible to permanently delete anything. Given the target audience of mathematicians and math students, and the terms of usage, I hope this is reasonable.<br /><br /><b>IPython Notebooks in the Cloud with Realtime Synchronization and Support for Collaborators</b> (2013-09-13)<br /><br />I spent the last two weeks implementing hosted IPython notebooks with sync for <a href="https://cloud.sagemath.com/" target="_blank">https://cloud.sagemath.com</a>. Initially I had just planned to simplify the port forwarding setup, since using multiple forward and reverse port forwards seemed complicated. But then I became concerned about multiple users (or users with multiple browsers) overwriting each other's notebooks; this is a real possibility, since projects are frequently shared between multiple people, and everything else does realtime sync. I had planned just to add some very minimal merge-on-save functionality to avoid major issues, but somehow got sucked into implementing full realtime sync (even with the other person's cursor showing).
<br /><h2>Here's how to try it out</h2><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-TvAWDaBjqqM/UjNOGZ1fifI/AAAAAAABHy8/NU5Fb13SE5Q/s1600/xkcd.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://2.bp.blogspot.com/-TvAWDaBjqqM/UjNOGZ1fifI/AAAAAAABHy8/NU5Fb13SE5Q/s1600/xkcd.png" height="146" width="200" /></a></div><div><br /></div><ul><li> Go to <a href="https://cloud.sagemath.com/" target="_blank">https://cloud.sagemath.com</a> and make an account; this is a free service hosted on computers at the University of Washington.</li><li> Create a new project.</li><li> Click +New, then click "IPython"; alternatively, paste in a link to an IPython notebook (e.g., anything here <a href="http://nbviewer.ipython.org/">http://nbviewer.ipython.org/</a> -- you might need to get the actual link to the ipynb file itself!), or upload a file. </li><li> An IPython notebook server will start, the given .ipynb file should load in a same-domain iframe, and then some of the IPython notebook code and iframe contents are monkey patched in order to support sync and better integration with <a href="https://cloud.sagemath.com/" target="_blank">https://cloud.sagemath.com</a>.</li><li>Open the ipynb file in multiple browsers, and see that changes in one appear in the other, including moving cells around, creating new cells, editing markdown (the rendered version appears elsewhere), etc. </li></ul>Since this is all very new and the first (I guess) realtime sync implementation on top of IPython, there are probably a lot of issues.
Note that if you click the "i" info button to the right, you'll get a link to the standard IPython notebook server dashboard.<br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" frameborder="0" height="180" src="https://www.youtube.com/embed/sDBbt8U4aJw?feature=player_embedded" width="320"></iframe></div><div><br /></div><h2>IPython development</h2>Regarding the monkey-patching mentioned above, the right thing to do would be to explain exactly what hooks/changes in the IPython html client I need in order to do sync, etc., make sure these make sense to the IPython devs, and send a pull request. As an example, in order to do sync <i>efficiently</i>, I have to be able to set a given cell from JSON -- it's critical to do this in place when possible, since the overhead of creating a new cell is huge (due probably to the overhead of creating CodeMirror editors); however, the fromJSON method in IPython assumes that the cell is brand new -- it would be nice to add an option to make a cell fromJSON without assuming it is empty. The ultimate outcome of this could be a clean, well-defined way of doing sync for IPython notebooks using any third-party sync implementation. IPython might provide its own sync service, and others are starting to become available these days -- e.g., Google <a href="https://developers.google.com/drive/realtime/">has one</a>, and maybe Guido van Rossum helped write one for Dropbox recently? <br /><h2>How it works</h2>Earlier this year, I implemented <a href="https://neil.fraser.name/writing/sync/" target="_blank">Neil Fraser's differential synchronization algorithm</a>, since I needed it for file and Sage worksheet editing in <a href="https://cloud.sagemath.com/" target="_blank">https://cloud.sagemath.com</a>. There are many approaches to realtime synchronization, and Fraser makes a good argument for his. 
For example, Google Wave involved a different approach (Operational Transforms), whereas Google Drive/Docs uses Fraser's approach (and code -- he works at Google), and you can see which succeeded. The main idea of his approach is an eventually stable iterative process that involves heuristically making and applying patches on a "best effort" basis; it allows for all live versions of the document to be modified simultaneously -- the only locking is during the moment when a patch is applied to the live document. He also explains how to handle packet loss gracefully. I did a complete implementation from scratch (except for using the beautiful Google <a href="https://code.google.com/p/google-diff-match-patch/" target="_blank">diff/patch/match</a> library). There might be a Python implementation of the algorithm as part of <a href="https://code.google.com/p/google-mobwrite/">mobwrite</a>. <br /><br />The hardest part of this project was using Fraser's algorithm, which is designed for unstructured text documents, to deal with IPython's notebook format, which is a structured JSON document. I ended up defining another less structured format for IPython notebooks, which gets used purely for synchronization and nothing else. It's a plain text file whose first line is a JSON object giving metainformation; all other lines correspond, in order, to the JSON for individual cells. When patching, it is in theory possible in edge cases involving conflicts to destroy the JSON structure -- if this happens, the destruction is isolated to a single cell, and that part of the patch just gets rejected. <br /><br />The IPython notebook is embedded as an iframe in the main <a href="https://cloud.sagemath.com/" target="_blank">https://cloud.sagemath.com</a> page, but with exactly the same domain, so the main page has full access to the DOM and Javascript of the iframe. 
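The line-oriented sync format described above is easy to sketch. The following is only an illustration (the real SMC implementation is CoffeeScript, and the exact field layout may differ): the first line holds the JSON metainformation, each subsequent line holds one cell's JSON, and a line destroyed by a bad patch loses only that one cell:

```python
import json

def notebook_to_syncdoc(nb):
    # First line: JSON metainformation; each following line: one cell's JSON.
    meta = {k: v for k, v in nb.items() if k != "cells"}
    return "\n".join([json.dumps(meta)] + [json.dumps(c) for c in nb["cells"]])

def syncdoc_to_notebook(doc):
    # Rebuild the structured notebook; a cell whose JSON was destroyed by a
    # conflicting patch is simply dropped, isolating the damage to that cell.
    lines = doc.split("\n")
    nb = json.loads(lines[0])
    cells = []
    for line in lines[1:]:
        try:
            cells.append(json.loads(line))
        except ValueError:
            pass  # reject just this part of the patch
    nb["cells"] = cells
    return nb
```

Because `json.dumps` never emits raw newlines, each cell is guaranteed to stay on one line, which is what lets an unstructured text-diff algorithm operate on a structured document.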
Here's what happens when a user makes changes to a synchronized IPython notebook (and at least 1 second has elapsed): <br /><ul><li>The outer page notices that the notebook's dirty flag is set for some reason, which could be caused by anything: typing a character, deleting a bunch of cells, output appearing, etc. </li><li>It computes the JSON representation of the notebook, and from that the document representation (with 1 line per cell) described above. This takes a couple of milliseconds, even for large documents, due to caching. </li><li>The document representation of the notebook gets synchronized with the version stored on the server that the client connected with. (This server is one of many node.js programs that handles many clients at once, and in turn synchronizes with another server that is running in the VM where the IPython notebook server is running. The sync architecture itself is complicated and distributed, and I haven't described it publicly yet.) </li><li>In the previous step, we in fact get a patch that we apply -- in a single atomic operation (so the user is blocked for a few milliseconds) -- to our document representation of the notebook in the iframe. If there are any changes, the outer page modifies the iframe's notebook in place to match the document. My first implementation of this update used IPython's notebook.fromJSON, which could easily take 5 seconds (!!) or more on some of the online IPython notebook samples. I spent about two days just optimizing this step. The main ideas are: <ol><li> Map each of the lines of the current document and the new document to a unicode character,</li><li> Use diff-match-patch to find an efficient sequence of deletions, insertions, swaps to transform one document to the other (i.e., swapping cells, moving cells, etc.) 
-- this is critical to do,</li><li> Change cells in place when possible.</li></ol>With these tricks (and more can be done), modifying the notebook in place takes only a few milliseconds in most cases, so you don't notice this as you're typing. </li><li> Send a broadcast message about the position of your cursor, so the other clients can draw it. (Symmetrically, render the cursor on receiving a broadcast message.)</li></ul><br /><br /><br /><br />William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-31892034003984381212013-09-02T10:47:00.000-07:002013-09-02T10:47:20.314-07:00Status report: integrating IPython into https://cloud.sagemath.com -- my approachI'm still working on the IPython notebook integration into <a href="https://cloud.sagemath.com/">https://cloud.sagemath.com</a> right now. This will be a valuable new feature for users, since there's a large amount of good content out there being developed as IPython notebooks, and the IPython notebook itself is fast and rock solid. <br /><br />I spent the last few days (it took longer than expected) creating a generic way to *securely* proxy arbitrary HTTP services from cloud projects, which is now done. I haven't updated the page yet, but I implemented code so that <br /><pre> <br /><br />https://cloud.sagemath.com/[project-id]/port/[port number]/...<br /><br /></pre>gets all http requests automatically proxied to the given port at the indicated project. Only logged-in users with write access to that project can access this url -- with a lot of work, I think I've set things up so that one can safely create password-less non-ssl web services for a group of collaborators, and all the authentication just piggybacks on cloud.sagemath accounts and projects: it's SSL-backed (with a valid cert) security almost for free, which solves what I know to be a big problem users have. 
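The routing rule above can be sketched as follows. This is a guess at the logic, not SMC's actual code (which runs in the node.js hub), and the names here are made up:

```python
import re

# Match /[project-id]/port/[port number]/... -- project ids are 36-character
# UUIDs, as in the examples below.  (Hypothetical sketch of the routing rule.)
PORT_URL = re.compile(
    r'^/(?P<project_id>[0-9a-f-]{36})/port/(?P<port>\d+)(?P<rest>/.*)?$')

def route(path, writable_projects):
    """Return (port, rest_of_path) to proxy to, or None to refuse.
    `writable_projects` stands in for the real access-control check:
    only logged-in users with write access may reach the proxied service."""
    m = PORT_URL.match(path)
    if not m:
        return None
    if m.group('project_id') not in writable_projects:
        return None  # not a collaborator on this project
    return int(m.group('port')), m.group('rest') or '/'
```

With this scheme the proxied service shares the parent page's origin, which is exactly what makes the iframe embedding described in the next paragraph work without firewall or same-origin trouble.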
<br /><br />The above approach is also nice, since I can embed IPython notebooks via an iframe in cloud.sagemath pages, and the url is exactly the same as cloud.sagemath's, which avoids subtle issues with firewalls, same-origin restrictions, etc. For comparison, here's what the iframe that contains a single ipynb worksheet looks like for wakari.io: <br /><pre> <br /><br /> iframe class="notebookiframe" id="" src="https://prod-vz-10.wakari.io:9014/auto_login/acd84627972f91a0838e512f32e09c9823782ec0?next=/notebook_relative/Listing 2.ipynb"<br /><br /> </pre>and here's what it's going to look like in cloud.sagemath: <br /><pre> <br /><br /> iframe class="notebookiframe" id="" src="https://cloud.sagemath.com/70a37ef3-4c3f-4bda-a81b-34b894c89701/port/9100/Listing 2.ipynb"<br /><br /> </pre>With the wakari.io approach, some users will find that notebooks just don't work, e.g., students at the University of Arizona, at least if their wifi still blocks connections to nonstandard ports, as it did when I tried to set up a Sage notebook server there once for a big conference. By having exactly the same page origin and no nonstandard ports, the way I set things up, the parent page can also directly call javascript functions in the iframe (and vice versa), which is potentially very useful. <br /><br />IPython notebook servers will be the first to use this framework, then I'll use something similar to serve static files directly out of projects. I'll likely also add the Sage cell server and the classic Sage notebook as well at some point, and maybe wikis, etc. <br /><br />Having read and learned a lot about the IPython notebook, my main concern now is their approach to multiple browsers opening the same document. If you open a single worksheet with multiple browsers, there is absolutely no synchronization at all, since there is no server-side state. Either browser can and will silently overwrite the work of the other when you (auto-)save. 
It's worse than the Sage Notebook, where at least there is a sequence number and the browser that is behind gets a forced refresh (and a visible warning message about there being another viewer). For running your own IPython notebook on your own computer, this probably isn't a problem (just like a desktop app), but for a long-running web service, where a single user may use a bunch of different computers (home laptop, tablet, office computer, another laptop, etc.) or there may be multiple people involved, I'm uncomfortable that it is so easy for all your work to just get overwritten, so I feel I must find some way to address this problem before releasing IPython support. With cloud.sagemath, a lot of people will likely quickly start running ipython notebook servers for groups of users, since it would take about 1 minute to set up a project with a few collaborators -- then they all get secure access to a collection of ipython notebooks (and other files). So I'm trying to figure out what to do about this. I'll probably just implement a mechanism so that the last client to open an ipython notebook gets that notebook, and all older clients get closed or locked. Maybe in a year IPython will implement proper sync, and I can remove the lock. (On the other hand, maybe they won't -- having no sync has its advantages regarding simplicity and *speed*.) <br /><br /><br />William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-75873849893325707682013-08-28T11:41:00.004-07:002013-08-28T11:41:32.406-07:00LaTeX in the CloudMotivated by work on <a href="http://wstein.org/rh">a book</a> and by <a href="http://math.columbia.edu/~dejong/wordpress/?p=3490">the Stacks project</a>, I just wrote a new web-based LaTeX editor, which I've now released. 
You can try it now by making a free account at <a href="https://cloud.sagemath.com/">https://cloud.sagemath.com</a>, then creating a project, and uploading or creating a .tex file, then opening it. <br /><h3>Features</h3><ul><li> Side-by-side LaTeX editing, with re-build on save (you can set the autosave interval if you want).</li><li> Forward and inverse search.</li><li> Parsing of the log file, with buttons to jump to the corresponding place in the tex file and pdf file.</li><li>Preview uses high-resolution color png's, so it will work in browsers that don't have any support for pdf.</li><li>The command to LaTeX your document is customizable.</li><li>The build process should run LaTeX, bibtex, and sagetex automatically if the log file says they need to be run; otherwise you can click a button to force bibtex or sagetex to run.</li><li>Scales up to large documents -- my test document is a book! -- for me sitting at home working on my 134-page book, the time from making a change and clicking "save" to when it appears in the preview pane in high resolution is less than 7 seconds.</li></ul><h3> Some advantages over <a href="https://www.sharelatex.com/">https://www.sharelatex.com</a> and <a href="https://www.writelatex.com/">https://www.writelatex.com/</a></h3><b>Disclaimer:</b> I'm not an expert with either of the editors mentioned above, so I may be completely wrong that the following are advantages: <br /><ul><li> This is free (unlimited collaborators and space, right now) -- my main motivation is to increase usage of <a href="http://sagemath.org/">Sage</a></li><li> <a href="http://www.sagemath.org/doc/tutorial/sagetex.html" target="_blank">Sagetex</a> is fully supported </li><li> Forward and inverse search: jump from a point in the .tex file to the corresponding point in the pdf and conversely (it seems the competition doesn't have this, but I bet they will implement it soon after they read this post)</li><li> High quality png preview with customizable resolution</li><li> Perfect quality 
embedded pdf view as well, if your browser supports embedded pdf's</li><li> If you need a full xterm for some reason, you have it: you can run programs for any purpose from that command line. This means you can download data (files, websites, databases, experimental result files, or anything via git), process it in the most general sense of computing, and generate files, or parts of them, for your LaTeX document. </li><li> It scales up to large documents more efficiently (in my limited tests), since I was pretty careful about using hashing tricks, parallel computation to generate png's, etc.</li><li> A different synchronization implementation for multiple people editing the same file at once; the others lock the editor when the network drops, or reset the document when the connection comes back; in real life, network connections are often dropping...</li><li> I put some effort into trying to make this latex editor work on iPad/Android, though you'll want to use a bluetooth keyboard since there are major issues with CodeMirror and touch still.</li></ul>And some disadvantages: <br /><ul><li> I've put little work into properly dealing with multi-file LaTeX documents</li><li> The error messages are not displayed embedded in the tex document (not sure I want this though).</li><li> You must have a cloud.sagemath account (free) -- you can't just start editing without signing up.</li><li> Single file download is limited to 12MB right now, so if your PDF is huge, you won't be able to just download it -- you can scp it anywhere though using the terminal.</li></ul><h3>Behind the Scenes</h3>As a professional mathematician, I've spent 20 years using LaTeX, often enhanced with little Python scripts I write to automate the build process somewhat. Also, I've spent way too much time over the years just configuring and re-configuring forward and inverse search under Linux, OS X, and Windows with various editors and previewers. 
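The log-driven rebuild logic listed in the features above (run bibtex, or pdflatex again, only when the log asks for it) can be sketched like this. This is a simplification -- the real editor also handles sagetex and parses the log much more carefully -- but the warning strings tested for are the ones LaTeX actually prints:

```python
def next_step(log_text, ran_bibtex):
    """Decide the build loop's next action from the LaTeX log (simplified)."""
    if not ran_bibtex and "Citation" in log_text and "undefined" in log_text:
        return "bibtex"      # undefined citations: run bibtex, then latex again
    if "Rerun to get cross-references right" in log_text:
        return "pdflatex"    # labels moved: one more pass
    return "done"
```

A driver would call pdflatex, read the `.log` file, invoke this function, and repeat until it returns `"done"` (with a cap on the number of passes, since pathological documents can ask for reruns forever).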
<br /><br />All the new code I wrote to implement the LaTeX editor is client-side CoffeeScript, HTML, and CSS, which builds on the infrastructure I've developed over the last year (so, e.g., it can run bash scripts on remote linux machines, etc.). Here are some specific problems I confronted: <em>none</em> of the solutions are what I expected two weeks ago or first tried! <br /><h4>Problem: How should we display a PDF in the browser?</h4>I investigated three approaches to displaying PDF files in the web browser: (1) show a bunch of images (png or jpg), (2) use a native pdf viewer plugin, and (3) use a javascript pdf renderer (namely pdf.js). Regarding (2), Chrome and Safari have a native plugin that efficiently shows a high-quality display of a complete PDF embedded in a web page, but Chromium has nothing by default. Regarding (3), the Firefox devs wrote pdf.js, which they include with Firefox by default; it looks good on Firefox, but looks like total crap in Chrome. In any case, after playing around with (2)-(3) for too long (and even adding a salvus.pdf command to Sage worksheets in cloud.sagemath), I realized something: the only possible solution is (1), for the following reasons: <br /><ul><li> Inverse and forward search: It is impossible to read mouse clicks or page location, or to control the location of the pdf viewer plugin, in some browsers, e.g., in Chrome. Thus only using a PDF plugin would make inverse and forward search completely impossible. Game over. </li><li> It might be possible to modify pdf.js to support what is needed for inverse and forward search, but this might be really, really hard (for me). Plus the rendering quality of pdf.js on Chrome is terrible. Game over. </li><li> My test document is <a href="http://wstein.org/rh">this book's PDF</a>, which is about 8MB in size. 
With PDF viewer plugins, every time the PDF file changes, the entire 8MB pdf file has to be transferred to the browser, which just doesn't scale -- especially if you replace 8MB by 60MB (say). I want people to be able to write their books and Ph.D. theses using this editor. When editing a LaTeX document, the PDF file often changes only a little -- usually only a few pages change and everything else remains identical; only the changes should get sent to the browser, so that even a 1000-page document could be efficiently edited. This sort of thing doesn't matter when working locally, but when working over the web it is critical. </li></ul>So we are stuck with (1) for the main PDF preview for a file we are actively editing using LaTeX. There is a long list of apparent drawbacks: <br /><ul><li> One substantial drawback to (1) for general PDF display is that there is no way to do full text search or copy text out of the PDF document. Neither of these drawbacks matters for the LaTeX editor application though, since you have the source file right there. Also, there's nothing stopping me from also providing the embedded PDF viewer, which has search and copy, and that's what I've done for cloud.sagemath. </li><li> Another potential drawback of (1) is that it takes a long time to generate jpg or png images for a large pdf file -- 5 pages is fine, but what about 150 pages? 1000 pages? I tried using ImageMagick and Ghostscript. ImageMagick is way too slow to be useful for this. Ghostscript is incredibly powerful for this, and has a wide range of parameters, with numerous different rendering devices. The solution I chose here is to: (1) generate a high quality PNG image just for the currently visible pages (and +/-1), then (2) generate medium quality pages in some neighborhood of the visible pages, then (3) generate low quality PNG's for all the other pages. All this is done in parallel, since the host VM's have many cores. 
Also, we compute the sha1 hashes of the previews, and if the browser already has them, don't bother to update those images. Finally, it turns out to be important to replace high quality images by lower quality ones as the user scrolls through the document, since otherwise the browser can end up using too much memory. A useful trick for the high quality pages is using ghostscript's downsampling feature, so the PDF is rendered at 600dpi (say) in memory, but output at 200dpi to the PNG. </li></ul>So the Preview tab in the LaTeX editor shows a png-based preview whose quality automatically enhances as you scroll through the document. This png preview will work on any browser (for which cloud.sagemath works), regardless of PDF plugins. <br /><b>Summary:</b> It is critical to realize exactly what problem we're trying to solve, which is viewing a PDF that is often changing locally. This is completely different from the general problem of viewing a static PDF, or editing a PDF itself, or even annotating one. <br /><h4>Problem: how to implement forward and inverse search in the browser</h4>Forward and inverse search let you easily jump back and forth between a point in the source tex file and the corresponding point in the rendered PDF preview. You <i>need</i> this because editing LaTeX documents is not WYSIWYG (unless you are using something like Lyx or Texmacs), and without this feature you might find yourself constantly being lost, doing fulltext search through the source or pdf file, etc., and generally wasting a lot of effort on something that should be automatic. The first time I used inverse and forward search was around 2004 with the <a href="http://www.winedt.com/">Winedt</a> and <a href="http://pages.uoregon.edu/koch/texshop/">TexShop</a> editors, which I think (at the time) used various heuristics to implement them, since they often didn't quite work right. 
I 100% assumed that I would have to use heuristics for cloud.sagemath, and started working on a heuristic approach based on pdftotext, page percentages, etc. <br /><br />Then one morning I searched and learned about <a href="http://mactex-wiki.tug.org/wiki/index.php/SyncTeX">synctex</a>, which was "recently" added to the core of pdflatex. The first thing I did was run it, look at the output file and try to parse it with my eyes -- that didn't work. I then searched everywhere and could not find any documentation about the format of the synctex files; however, I found a paper by the author of synctex and read straight through it. In that paper, they mention that they provide a C library and C program to parse the synctex files, and explicitly don't document the format since they don't want anybody to write programs to parse it, since they reserve the right to significantly change it. No problem -- so I just call out to the shell and run the synctex program itself with appropriate options. With a little research into scaling factors, etc., I'm able to map mouse clicks on the png to the data synctex needs to get the corresponding location in the source file. This is all actually pretty easy and provides forward and inverse search with absolutely no hacks or heuristics. Also, forward search works well since using PNG's to display the preview means one can precisely set the preview location. <br /><h4>Problem: making sense of the LaTeX log file</h4>When you build a LaTeX document, tex spits out a log file full of filenames, parentheses, warnings, errors, etc., sometimes stopping to ask you questions, sometimes refusing to exit, etc. This file is NOT (by default, at least) easy for a human to read, at least not for me! You can see an error message that refers to a specific location in a file, but which file that is is often listed hundreds of lines before, and you must manually balance parentheses to figure this out. 
I read some documents about its format, and fortunately found <a href="https://github.com/jpallen/latex-log-parser">this Javascript library</a>, which parses LaTeX logs. Cloud runs pdflatex using the option <tt>-interaction=nonstopmode</tt>, so that the whole file gets processed, then parses the log file, and displays first errors, then typesetting issues (overfull hboxes, etc.), and finally warnings. Each message has two buttons -- one to jump to the corresponding location in the tex file, and one to jump to the location in the pdf preview. This is all easy to use, and I've found myself for the first time ever actually going through tex files and cleaning up the overfull hboxes. <br />The log file also says when to run sagetex and bibtex, and whether or not to run pdflatex again to update cross references, and cloud parses that and automatically runs those tools. For some reason, sagetex doesn't say "run me again" even though it should when you update existing blocks, and then you have to do it manually by clicking a button. <br /><br /><b>Summary:</b> I hope this LaTeX editor in <a href="https://cloud.sagemath.com/" target="_blank">the Sagemath Cloud</a> is useful to people who just want to edit tex documents, play around with sagetex, and not have to worry about configuring anything. Implementing it was fun and interesting. If you have any questions about the technical details, please ask! Enjoy. William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-69472920520113867412012-12-17T09:00:00.005-08:002012-12-17T09:00:59.218-08:00BDFL?<br /><div style="font-family: arial; font-size: small;">I just read <a href="http://technicaldiscovery.blogspot.com/2012/12/passing-torch-of-numpy-and-moving-on-to.html">this blog post</a> about the direction of numpy development, which might be of interest to Sage Developers. 
TL;DR -- Travis Oliphant explains that he is stepping down as head steward of Numpy, and then explains the awesome new things he's working on instead.</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">This made me think about my current relationship with the Sage project, where I'm similarly considered "head steward". I have not been super-active in the last few months in day-to-day Sage development, and haven't posted a lot on the lists. However, in my case, this is because I'm doing some hard work to build a company (Salvus) that may be able to provide more sustainable funding for core Sage development later. For example, yesterday, Drew Sutherland and I worked on an approach to implementing computation of <i>q</i>-expansions of higher-weight modular forms (using modular symbols) which would be much, much faster than what's in Sage (or Magma) now. I'm not implementing it today, because I'm working on Salvus instead, so that hopefully in a year I will have all the time I need to implement exactly that algorithm and more <i>in Sage</i>, as a result of what I'm doing now. (Drew will implement a special case he needs for his research.)</div>William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-22578287326234206062012-03-10T07:57:00.000-08:002012-03-10T07:57:51.712-08:00Sage and Python 3?Have you ever wondered how difficult it might be to switch Sage from Python 2.7 to Python 3.x? Some students in my Sage course <a href="http://wstein.org/edu/2012/1062/projects/final/miloshevich-nason/">made a webpage</a> that summarizes the Python 3 support status of the Python packages on which Sage depends.<br /><br />Interesting examples:<br /><ul><li>Matplotlib doesn't support Python 3.x at all yet.</li><li>Mercurial doesn't support Python 3.x, and they don't have any plans to do so. 
(Which is another point in favor of switching Sage to Git.)</li></ul>William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-4621517359427724642012-03-08T14:57:00.000-08:002012-03-08T14:57:08.246-08:00How to get access to MagmaDespite <a href="http://www.sagemath.org/">Sage</a> doing a lot, people still write to me asking about how to get access to <a href="http://magma.maths.usyd.edu.au/magma/">Magma</a>. Here is my advice:<br /><ol><li>You can <a href="http://magma.maths.usyd.edu.au/magma/ordering/#sec_5">buy a copy of Magma here</a> for only $1180. <br /></li><li>If your calculation will take less than 20 seconds (?), you can do it <a href="http://magma.maths.usyd.edu.au/calc/">over the web for free</a>. (Historical note: I wrote the first version of that website.)</li><li> In Sage, use the command <tt>magma_free('some string')</tt>, though I just checked and the Magma calculator has changed in a way that breaks this command <a href="http://trac.sagemath.org/sage_trac/ticket/10499">again</a>. I'll be <a href="http://trac.sagemath.org/sage_trac/ticket/12642">posting a patch to fix this</a> in Sage.<br /></li><li> Most people I know who have Magma do not buy it, but get it in exchange for contributing to Magma (I used to get it free for that reason). Study the list of <a href="http://magma.maths.usyd.edu.au/magma/acknowledge/#sec_3">past</a> and <a href="http://magma.maths.usyd.edu.au/group/members/">current</a> contributors to Magma and ask the one you know the best. 
This is a list of people who definitely have a copy of Magma and know how to use it.<br /></li></ol>William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-8688992652823391232011-12-15T14:42:00.001-08:002011-12-15T14:42:33.854-08:00How big is the core Sage library?I just did the following with Sage-4.8.alpha5:<br /><ol><li> "sudo apt-get install sloccount".</li><li> "cp -rv SAGE_ROOT/devel/sage-main /tmp/x"</li><li> Use a script [1] to rename all .pyx and .pxi files to .py.</li><li> Ran "sloccount *" in the /tmp/x directory, which ignores autogenerated .c/.cpp files coming from Cython.</li></ol><br />Here's the result for the full Sage library, which does not distinguish between Python and Cython. Note that sloccount really only counts lines of code -- comments and blank lines are ignored.<br /><br /><pre>Totals grouped by language (dominant language first):<br />python: 530370 (96.41%)<br />ansic: 14538 (2.64%)<br />cpp: 5188 (0.94%)<br /></pre><br />This suggests that the core Sage library is just over a "half million lines of Python and Cython source code, not counting comments and whitespace".<br /><br />Here's the breakdown by module:<br /><pre>SLOC Directory SLOC-by-Language (Sorted)<br />88903 rings python=87720,cpp=1183<br />72913 combinat python=71629,cpp=1284<br />47747 schemes python=46255,cpp=1492<br />39815 graphs python=28377,ansic=11438<br />31540 matrix python=31540<br />31019 modular python=31012,ansic=7<br />24475 libs python=21171,ansic=2845,cpp=459<br />20517 misc python=20383,ansic=134<br />18006 interfaces python=18006<br />17577 geometry python=16936,cpp=641<br />12775 categories python=12775<br />12093 server python=12093<br />11971 groups python=11971<br />11961 plot python=11961<br />10686 crypto python=10686<br />9920 modules python=9920<br />8389 symbolic python=8260,cpp=129<br />8150 algebras python=8150<br />7260 ext python=7198,ansic=62<br />7093 structure 
python=7093<br />6364 coding python=6364<br />5670 functions python=5670<br />5249 homology python=5249<br />4798 numerical python=4798<br />4323 quadratic_forms python=4323<br />3919 gsl python=3919<br />3911 calculus python=3911<br />3879 sandpiles python=3879<br />3003 sets python=3003<br />2647 databases python=2647<br />2074 logic python=2074<br />1736 finance python=1736<br />1608 games python=1608<br />1465 monoids python=1465<br />1435 tests python=1383,ansic=52<br />1370 stats python=1370<br />971 interacts python=971<br />959 tensor python=959<br />906 lfunctions python=906<br />308 parallel python=308<br />275 probability python=275<br />219 media python=219<br />197 top_dir python=197<br /></pre><br />Here is the script [1]:<br /><pre>#!/usr/bin/env python<br /><br />import os, shutil<br /><br />for dirpath, dirnames, filenames in os.walk('.'):<br /> for f in filenames:<br /> if f.endswith('.pyx') or f.endswith('.pxi'):<br /> print f<br /> shutil.move(os.path.join(dirpath, f),<br /> os.path.join(dirpath, os.path.splitext(f)[0] + '.py'))<br /><br /></pre>William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-22757855212524272002011-12-13T14:06:00.000-08:002011-12-13T14:07:21.837-08:00Using Sage to Support Research MathematicsWhen using Sage to support research mathematics, the most important point to make is to strongly encourage people to do the extra work to turn their "scruffy research code" into a patch that can be peer reviewed and included in Sage. They will have to 100% doctest it, and the quality of their code may improve dramatically as a result. Including code in Sage means that the code will continue to work as Sage is updated. Also, the code is peer reviewed and has to have examples and documentation for every function. That's a much higher bar than just "reproducible research". 
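For readers who haven't contributed to Sage: "100% doctested" means every function's docstring contains examples that the test framework actually runs and checks. A generic illustration (not code from any of the projects mentioned here):

```python
def fibonacci_mod(n, m):
    """
    Return the n-th Fibonacci number modulo m.

    EXAMPLES::

        sage: fibonacci_mod(10, 7)
        6
        sage: fibonacci_mod(0, 5)
        0
    """
    a, b = 0, 1
    for _ in range(n):
        a, b = b, (a + b) % m
    return a
```

The `EXAMPLES::` block is not decoration: `sage -t` executes each `sage:` line and fails the build if the output differs, which is what keeps contributed code working as Sage is updated.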
<br /><br />Moreover, getting code up to snuff to include in Sage will often also reveal mistakes, avoiding embarrassment later. I'm fixing some issues related to a <a href="http://wstein.org/papers/chow_heegner/chowheeg1.pdf">soon-to-be-done paper</a> right now that I found when doing just this for <a href="http://trac.sagemath.org/sage_trac/ticket/11975">trac 11975</a>.<br /><br />This final step of turning snippets of research code into a peer-reviewed contribution to Sage is: (1) a surprisingly huge amount of very important useful work, (2) something that is emphasized as an option for Sage more than with Magma or Mathematica or Pari (say), (3) something whose value people have to be sold on, since at present they usually get no real extra academic credit for it, and journal referees often don't care either way (I do, but I'm probably in the minority there), and (4) something that a *lot* of research mathematicians do not do. As an example of (4), in the last two months I've seen a ton of (separate!) bodies of code that are all sort-of-secret research code in various Dropbox repos, and which aren't currently squarely aimed at going into Sage. I've also seen a bunch of code related to Edixhoven et al.'s algorithm for computing Galois representations with a similar property (there is now <a href="http://trac.sagemath.org/sage_trac/ticket/12132">trac 12132</a>, due to my urging). <br /><br />I did *not* do this step yet with this <a href="http://wstein.org/papers/shark/">recently accepted paper</a>. Instead I used "scruffy research code" in <a href="http://code.google.com/p/purplesage/">psage</a> to do the fast L-series computations. The referee for Math Comp didn't care either way, actually... I hope this doesn't come back to haunt me, though there are many double checks here (e.g., BSD) so I'm not too worried.
I will do this get-it-in-Sage step at some point though.<br /><br />This will be better for the community in the long run, and better for individual researchers' credibility too. And there is a lot of value in having a stable refereed snapshot of code on which a published (=very stable) paper is based.William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-21657005959635687232011-11-12T11:47:00.000-08:002011-11-15T08:58:17.214-08:00Is the time ripe for http://sagenb.com?On a Sage mailing list, Karl Crisman wrote: "Regarding the downtime issue [of http://sagenb.org], there have occasionally been rumors of someone or some organization starting a service which would guarantee uptime and provide support."<br /><br />I might do this. It might be at <a href="http://sagenb.com/">http://sagenb.com</a> (right now sagenb.com points to sagenb.org). Here's the "business plan". Probably, sagenb.com would appear almost the same as <a href="http://sagenb.org">sagenb.org</a>, except there would be Google (?) ads on the side of your worksheets, and the revenue would go toward paying for:<br /><ol><li><b> Commercial server hosting:</b> Some Amazon EC2 instances</li><li><b> An employee (or later, employees) to maintain the servers:</b> at the beginning, this would be me in my free time, since I have a lot of experience with this.</li><li><b> Advertising that sagenb.com exists and we want users!:</b> Unlike sagenb.org, we would very strongly encourage as many people as possible to use sagenb.com. The advertising and landing page would explain that though sagenb.com generates money, all that money is given back to the Sage development community (see below).</li><li><b> Hire employees to improve the notebook:</b> Fix bugs, implement features, etc. There would be a public commitment that <i>all</i> such work would be open sourced and get included with Sage.
This would include adding support for a for-pay "pro" subscription version, adding nicer "offline support" (via a Sage install on the user's computer), integrated spreadsheets and better data visualization and manipulation tools for science and business applications, and enabling development of Sage's core library and patch submission entirely through the notebook, etc.</li></ol><br /><br />At some point, there could be a $X/year subscription version that would remove all ads, and increase disk space available to that user. There would also be a $Y/year university site license version with customized authentication (e.g., LDAP?) for each university. The university site license version might also include Maple/Mathematica/Matlab/Magma pre-installed in their notebook server, assuming appropriate site licenses are honored, so sagenb.com could be something that goes beyond just a platform for "sage as math software". We can of course also tap into the R market, given that Sage includes R.<br /><br />I imagine the above being done as a not-for-profit effort, so if it brought in a lot of revenue (e.g., more than needed for hosting and employees), excess money would go to the Sage Foundation to support other Sage development activities. Regarding numbers, according to Google Analytics, right now on average well over 1000 people use sagenb.org every day. Standard commercial hosting costs for EC2 to support this load would be roughly $100/month. If each visitor generates an average of 1 penny of ad revenue per day (is this a reasonable estimate -- I have no clue?), then we would expect to make $3,650 in one year, which would be enough to fund the EC2 service (at about $2000/year), with a profit of $1,650. <br /><br />Now let's dream big! 
There might be 1,000,000 potential daily users out there for a service like this, if it worked sufficiently well, since there are many college students (and people that use math and stats programs like R in the sciences, and R is part of Sage) in the world. Scaling up by a factor of 1,000 would yield over $1 million/year, after paying for hosting fees. This would be enough to fund substantial work to improve Sage and the notebook, and to fund a paid Patch Editor position (imagine buying out a top Sage developer professor, e.g., John Palmieri, from 50% of his teaching obligations, in exchange for him spending the time he would otherwise spend teaching on making sure that patches to Sage get properly refereed). Maybe we could even hire back some of the (many!) people who were Sage developers when they were mathematics grad students or postdocs, but who then went to work at a top web technology company and acquired useful experience there (and are now way too busy to work on Sage).<br /><br />This has the potential to make Sage a long-term self-sustaining project. It's probably not possible to get traditional venture capital for a not-for-profit plan like the one above, but fortunately that is not needed due to (1) the generous support the National Science Foundation is currently providing toward development on the Sage notebook, and (2) private donations to the Sage Foundation. In particular, (2) provides enough money to bootstrap a sagenb.com for a while.<br /><br />I think the main potential downside is competition. If somebody else does the same thing right now for profit without giving back their changes, and captures the market, it's hard to imagine how the above would work. 
Since we don't use the <a href="http://www.affero.org/oagpl.html">Affero GPL</a> for the Sage notebook, it is legal for somebody to do a lot of customization work to Sage and the notebook, create a web app using this customized version, and give back nothing to the community, so long as they don't redistribute their modified versions publicly. This isn't crazy -- not so long ago, I had a major company (I won't say who out of respect) tell me they planned to do something like that. And "random people" suggest it somewhat regularly when I give talks about Sage. I'm surprised it hasn't happened yet, but it definitely hasn't.<br /><br />So the time to do this is today. The notebook software we have is now finally reasonably scalable, primarily due to work of Mike Hansen and Rado Kirov. Our funding situation is good <b><i>this year</i></b>. We have strong good will and interest right now from both the Mathematical Association of America and WeBWorK developers. If we wait longer, the one chance to truly make Sage financially self-sustaining in a way that fits with my dreams and values will pass. <br /><br />I hope that by making all plans open, and by having a commitment to put the profits back into Sage development, an enterprise like the one I describe here will be in a better position to attract users than a purely for-profit venture by somebody else.William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-69246470885005614272011-08-31T10:06:00.000-07:002011-08-31T10:06:30.361-07:00Sage: Creating a Viable Free Open Source Alternative to Magma, Maple, Mathematica, and MATLAB<br />The goal of the <a href="http://www.sagemath.org/">Sage project</a> is to create a viable free open source alternative to Magma, Maple(TM), Mathematica(R), and MATLAB(R), which are the most popular non-free closed source mathematical software systems. (Maple is a trademark of Waterloo Maple Inc. 
Mathematica is a registered trademark of Wolfram Research Incorporated. MATLAB is a registered trademark of MathWorks. I will refer to the four systems together as ``Ma'' in the rest of this article.) Magma is (by far) the most advanced non-free system for structured abstract algebraic computation, Mathematica and Maple are popular and highly developed systems that shine at symbolic manipulation, and MATLAB is the most popular system for applied numerical mathematics. Together there are over 3,000 employees working at the companies that produce the four Ma's listed above, which take in over a hundred million dollars of revenue annually.<br /><br />By a viable free alternative to the Ma's, we mean a system that will have the important mathematical features of each Ma, with comparable speed. It will have 2d and 3d graphics, an interactive notebook-based graphical user interface, and documentation, including books, papers, school and college curriculum materials, etc. A single alternative to all of the Ma's is not necessarily a drop-in replacement for any of the Ma's; in particular, it need not run programs written in the custom languages of those systems. Thus it need not be like Octave or R, which (nearly) clone the languages of MATLAB and S, respectively. Development would instead focus on implementing functions that users demand, rather than systematically trying to implement every single function of the Ma's. The culture, architecture, and general look and feel of such a system would be very different from that of the Ma's.<br /><br /><br /><br /><h1>Motivation for Starting Sage</h1><br />Each of the Ma's costs substantial money, and is hence expensive for me, my collaborators, and students. The Ma's are not <i> owned by the community</i> like Sage is, or Wikipedia is, for that matter.<br /><br /><br />The Ma's are closed, which means that the implementations of some algorithms are secret, in which case you are not allowed to modify or extend them. 
<br /><br />The Mathematica documentation says: <i>"You should realize at the outset that while knowing about the internals of Mathematica may be of intellectual interest, it is usually much less important in practice than you might at first suppose. Indeed, in almost all practical uses of Mathematica, issues about how Mathematica works inside turn out to be largely irrelevant. Particularly in more advanced applications of Mathematica, it may sometimes seem worthwhile to try to analyze internal algorithms in order to predict which way of doing a given computation will be the most efficient. [...] But most often the analyses will not be worthwhile. For the internals of Mathematica are quite complicated."</i><br /><br />The philosophy espoused in Sage, and indeed by the vast open source software community, is exactly the opposite. We want you to know about the internals, and when they are quite complicated, we want you to help make them more understandable. Indeed, Sage's growth depends on <i>you</i> analyzing how Sage works, improving it, and contributing your improvements back.<br /><pre>sage: crt(2, 1, 3, 5) # Chinese Remainder Theorem
<br />11
<br />sage: crt?            # ? = documentation and examples
<br />Returns a solution to a Chinese Remainder Theorem...
<br />...
<br />sage: crt??           # ?? = source code
<br />def crt(...):
<br />    ...
<br />    g, alpha, beta = XGCD(m, n)
<br />    q, r = (b - a).quo_rem(g)
<br />    if r != 0:
<br />        raise ValueError("No solution ...")
<br />    return (a + q*alpha*m) % lcm(m, n)
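</pre>The same algorithm works outside Sage too; here is a plain Python sketch of it (with hypothetical helper names, mirroring the source shown above):<br /><pre>def xgcd(a, b):
    # extended Euclidean algorithm: returns (g, s, t) with g = s*a + t*b
    r0, r1, s0, s1, t0, t1 = a, b, 1, 0, 0, 1
    while r1:
        q = r0 // r1
        r0, r1 = r1, r0 - q*r1
        s0, s1 = s1, s0 - q*s1
        t0, t1 = t1, t0 - q*t1
    return r0, s0, t0

def crt(a, b, m, n):
    # find x with x = a (mod m) and x = b (mod n), if one exists
    g, alpha, beta = xgcd(m, n)
    q, r = divmod(b - a, g)
    if r != 0:
        raise ValueError("No solution")
    return (a + q*alpha*m) % (m * n // g)

crt(2, 1, 3, 5)   # 11, as in the Sage session above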
<br /></pre>Moreover, by browsing <a href="http://hg.sagemath.org/sage-main/">the Mercurial repository</a>, you can see exactly who wrote or modified any particular line of code in the Sage library, when they did it, and why. Everything included in Sage is free and open source, and it will forever remain that way.<br /><br /><a href="http://www.youtube.com/watch?v=bt_Y4pSdsHw">Linus Torvalds</a>: <i>"I see open source as Science. If you don't spread your ideas in the open, if you don't allow other people to look at how your ideas work and verify that they work, you are not doing Science, you are doing Witchcraft. Traditional software development models, where you keep things inside a company and hide what you are doing, are basically Witchcraft. Open source is all about the fact that it is open; people can actually look at what you are doing, and they can improve it, and they can build on top of it. [...] One of my favorite quotes from history is Newton: `If I have seen further, it is by standing on the shoulders of giants.'"</i><br /><br />The design decisions of the Ma's are not made openly by the community. In contrast, important decisions about Sage development are made via open public discussions and voting that is archived on public mailing lists with thousands of subscribers.<br /><br />Every one of the Ma's uses a special mathematics-oriented interpreted programming language, which locks you into their product, makes writing some code outside mathematics unnecessarily difficult, and limits the number of software engineers that are experts at programming in that language. In contrast, the user language of Sage is primarily the mainstream free open source language Python, which is one of the world's most popular interpreted programming languages. 
The Sage project neither invented nor maintains the underlying Python language, but gains immediate access to the IPython shell, Python scientific libraries (such as NumPy, SciPy, CVXopt and MatPlotLib), and a large Python community with major support from big companies such as Google. In comparison to Python, the Ma's are small players in terms of language development. Thus for Sage most of the problems of language development are handled by someone else.<br /><br />The bug tracking done for three of four of the Ma's is currently secret (MATLAB has an open bug tracker, though it requires free registration to view.), which means that there is no published accounting of all known bugs, the status of work on them, and how bugs are resolved. But the Ma's do have many bugs; see the release notes of each new version, which lists bugs that were fixed. Sage also has bugs, which are all publicly tracked at <a href="http://trac.sagemath.org/">the trac server</a>, and there are numerous ``Bug Days'' workshops devoted entirely to fixing bugs in Sage. Moreover, all discussion about resolving a given bug, including peer review of solutions, is publicly archived. We note that sadly even some prize winning free open source systems, such as GAP, do not have an open bug tracking system, resulting in people reporting the same bugs over and over again.<br /><br />Each of the Ma's is a combination of secret unchangeable compiled code and less secret interpreted code. Users with experience programming in compiled languages such as Fortran or C++ may find the loss of a compiler to be frustrating. None of the Ma's has an optimizing compiler that converts programs written in their custom interpreted language to a fast executable binary format that is not interpreted at runtime. 
(MATLAB has a compiler, but ``the source code is still interpreted at run-time, and performance of code should be the same whether run in standalone mode or in MATLAB.'' Mathematica also has a Compile function, but simply compiles expressions to a different internal format that is interpreted, much like Sage's <tt>fast_callable</tt> function.) In contrast, Sage is tightly integrated with Cython. The Cython project has received extensive contributions from Sage developers, and is very popular in the world of Python-based scientific computing. Cython is a Python-to-C/C++ compiler that speeds up code execution and has support for statically declaring data types (for potentially enormous speedups) and natively calling existing C/C++/Fortran code. For example, enter the following in a cell of <a href="http://sagenb.org/">the Sage notebook</a>:<br /><pre>def python_sum2(n):
<br />    s = int(0)
<br />    for i in xrange(1, n+1):
<br />        s += i*i
<br />    return s
<br /></pre>Then enter the following in another cell:<br /><pre>%cython
<br />def cython_sum2(long n):
<br />    cdef long i, s = 0
<br />    for i in range(1, n+1):
<br />        s += i*i
<br />    return s
<br /></pre>The second implementation, despite looking nearly identical, is nearly a hundred times faster than the first one (your timings may vary).<br /><pre>sage: timeit('python_sum2(2*10^6)')
<br />5 loops, best of 3: 154 ms per loop
<br />sage: timeit('cython_sum2(2*10^6)')
<br />125 loops, best of 3: 1.76 ms per loop
<br />sage: 154/1.76
<br />87.5
<br /></pre><br />Of course, it is better to choose a different algorithm. In case you don't remember a closed form expression for the sum of the first <i>n</i> squares, Sage can deduce it:<br /><pre>sage: var('k, n')
<br />sage: factor(sum(k^2, k, 1, n))
<br />1/6*(n + 1)*(2*n + 1)*n
<br /></pre>And now our simpler fast implementation is:<br /><pre>def sum2(n):
<br />    return n*(2*n+1)*(n+1)/6
<br /></pre>Just as above, we can also use the Cython compiler:<br /><pre>%cython
<br />def c_sum2(long n):
<br />    return n*(2*n+1)*(n+1)/6
<br /></pre>Comparing times, we see that Cython is 10 times faster:<br /><pre>sage: n = 2*10^6
<br />sage: timeit('sum2(n)')
<br />625 loops, best of 3: 1.41 microseconds per loop
<br />sage: timeit('c_sum2(n)')
<br />625 loops, best of 3: 0.145 microseconds per loop
<br />sage: 1.41/.145
<br />9.72413793103448
<br /></pre>In this case, the enhanced speed comes at a cost, in that the answer is wrong when the input is large enough to cause an overflow:<br /><pre>sage: c_sum2(2*10^6) # WARNING: overflow
<br />-407788678951258603
<br /></pre>Cython is very powerful, but to fully benefit from it, one must understand machine level arithmetic data types, such as long, int, float, etc. With Sage you have that option.<br /><br /><h1>What is Sage?</h1><br />The goal of Sage is to compete with the Ma's, and the intellectual property at our disposal is the complete range of GPL-compatibly licensed open source software.<br /><br />Sage is a self-contained free open source distribution of about 100 open source software packages and libraries that aims to address all computational areas of pure and applied mathematics. See the <a href="http://sagemath.org/packages/standard/">list of packages in Sage</a>, which includes R, Pari, Singular, GAP, Maxima, GSL, Numpy, Scipy, ATLAS, Matplotlib, and many other popular programs. The download of Sage contains all dependencies required for the normal functioning of Sage, including Python itself. Sage includes a substantial amount of code that provides a unified Python-based <i>interface</i> to these other packages. Sage also includes a library of new code written in Python, Cython and C/C++, which implements a huge range of algorithms.<br /><br /><br /><br /><h1>History</h1><br />I made the first release of Sage in February 2005, and at the time called it "<b>S</b>oftware for <b>A</b>rithmetic <b>G</b>eometry <b>E</b>xperimentation." I was a serious user of, and contributor to, Magma at the time, and was motivated to start Sage for many of the reasons discussed above. In particular, I was personally frustrated with the top-down closed development model of Magma, the fact that <i>several million lines</i> of the source code of Magma are closed source, and the fees that my colleagues had to pay in order to use the substantial amount of code that I contributed to Magma. Despite my early naive hope that Magma would be open sourced, it never was. 
So I started Sage motivated by the dream that someday the single most important item of software I use on a daily basis would be free and open. David Joyner, David Kohel, Joe Wetherell, and Martin Albrecht were also involved in the development of Sage during the first year.<br /><br />In February 2006, the National Science Foundation funded a 2-day workshop called ``Sage Days 2006'' at UC San Diego, which had about 40 participants and speakers from several open and closed source mathematical software projects. After doing a year of fulltime mostly solitary work on Sage, I was surprised by the positive reception of Sage by members of the mathematical research community. What Sage promised was something many mathematicians wanted. Whether or not Sage would someday deliver on that promise was (and for many still is) an open question.<br /><br />I had decided when I started Sage that I would make it powerful enough for my research, with or without the help of anybody else, and was pleasantly surprised at this workshop to find that many other people were interested in helping, and understood the shortcomings of existing open source software, such as GAP and PARI, and the longterm need to move beyond Magma. Six months later, I ran another Sage Days workshop, which resulted in numerous talented young graduate students, including David Harvey, David Roe, Robert Bradshaw, and Robert Miller, getting involved in Sage development. I used startup money from University of Washington to hire Alex Clemesha as a fulltime employee to implement 2d graphics and help create a notebook interface to Sage. I also learned that there was much broader interest in such a system, and stopped referring to Sage as being exclusively for ``arithmetic geometry''; instead, Sage became ``<b>S</b>oftware for <b>A</b>lgebra and <b>G</b>eometry <b>E</b>xperimentation.'' Today the acronym is deprecated.<br /><br />The year 2007 was a major turning point for Sage. 
Far more people got involved with development, we had four Sage Days workshops, and, prompted by Craig Citro, we instituted a requirement that all new code must have tests for 100% of the functions touched by that code, and every modification to Sage must be peer reviewed. Our peer review process is much more open than in mathematical research journals; everything that happens is publicly archived on <a href="http://trac.sagemath.org/">trac</a>. During 2007, I also secured some funding for Sage development from Microsoft Research, Google, and NSF. Also, a German graduate student studying cryptography, Martin Albrecht, presented Sage at the Trophées du Libre competition in France, and Sage won first place in ``Scientific Software'', which led to a huge amount of good publicity, including articles in many languages around the world and, for example, <a href="http://science.slashdot.org/story/07/12/08/1350258/Open-Source-Sage-Takes-Aim-at-High-End-Math-Software">this Slashdot article</a>.<br /><br />In 2008, I organized 7 Sage Days workshops at places such as IPAM (at UCLA) and the Clay Mathematics Institute, and for the first time, several people besides me made releases of Sage. In 2009, we had 8 more Sage Days workshops, and the underlying foundations of Sage improved, including development of a powerful coercion architecture. This <i>coercion model</i> systematically determines what happens when performing operations such as <tt>a + b</tt>, when <tt>a</tt> and <tt>b</tt> are elements of potentially different rings (or groups, or modules, etc.).<br /><pre>sage: R.<x> = PolynomialRing(ZZ)
<br />sage: f = x + 1/2; f
<br />x + 1/2
<br />sage: parent(f)
<br />Univariate Polynomial Ring in x over Rational Field
<br /></x></pre>We <a href="http://magma.maths.usyd.edu.au/calc/">compare this with Magma</a> (V2.17-4), which has a more ad hoc coercion system:<br /><pre>> R<x> := PolynomialRing(IntegerRing());
<br /> > x + 1/2
<br /> ^
<br /> Runtime error in '+': Bad argument types
<br /> Argument types given: RngUPolElt[RngInt], FldRatElt
<br /></x></pre><br />Robert Bradshaw and I also added support for beautiful browser-based 3D graphics to Sage, which involved writing a 3D graphics library, and adapting the free open source <a href="http://jmol.sourceforge.net/">JMOL Java library</a> for rendering molecules to instead plot mathematical objects.<br /><br /><pre>sage: f(x,y) = sin(x - y) * y * cos(x)
<br />sage: plot3d(f, (x,-3,3), (y,-3,3), color='red')
<br /></pre><br />In 2009, following a huge amount of porting work by Mike Hansen, development of algebraic combinatorics in Sage picked up substantial momentum, with the switch of the entire MuPAD-combinat group to Sage (forming <a href="http://wiki.sagemath.org/combinat">sage-combinat</a>), only months before the formerly free system MuPAD® (a registered trademark of SciFace Software GmbH &amp; Co.) was bought out by MathWorks (makers of MATLAB). In addition to work on Lie theory by Dan Bump, this also led to a massive amount of work on a category-theoretic framework for Sage by Nicolas Thiéry.<br /><br />In 2010, there were 13 Sage Days workshops in many parts of the world, and grant funding for Sage significantly improved, including new NSF funding for undergraduate curriculum development. I also spent much of my programming time during 2010--2011 developing a number theory library called <a href="http://code.google.com/p/purplesage/">psage</a>, which is currently not included in Sage, but can be easily installed.<br /><br />Many aspects of Sage make it an ideal tool for teaching mathematics, so there's a steadily growing group of teachers using it: for example, there have been MAA PREP workshops on Sage for the last two years, and a third is likely to run next summer; there are regular posts on the Sage lists about setting up classroom servers; and there is an NSF-funded project called <a href="http://utmost.aimath.org/">UTMOST</a> devoted to creating undergraduate curriculum materials for Sage.<br /><br />The <a href="http://sagemath.org/library-publications.html">publications page</a> lists 101 accepted publications that use Sage, 47 preprints, 22 theses, and 16 books, and there are surely many more ``in the wild'' that we are not aware of. 
According to Google Analytics, the main Sage website gets about 2,500 absolute unique visitors per day, and the website <a href="http://sagenb.org/">http://sagenb.org</a>, which allows anybody to easily use Sage through their web browser, has around 700 absolute unique visitors per day.<br /><br />For many mathematicians and students, Sage is today the mature, open source, and free foundation on which they can build their research program.William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-71523086995186318502010-11-15T09:15:00.000-08:002010-11-19T08:57:32.648-08:00Brief History and Motivation Behind the Sage Coercion ModelIn Sage (like in Magma), most objects are either elements or parents. Think of a "parent" as a set. This Parent/Element idea is a powerful algebraic approach to implementing mathematical objects on a computer, which does not exist in Mathematica, Maple, PARI, Maxima, and many other math software platforms. <br /><br />I learned about this approach from using Magma:<br /><br /><pre>%magma<br />R<x> := PolynomialRing(Integers());<br />print Parent(x);<br />///<br />Univariate Polynomial Ring in x over Integer Ring<br /></pre><br />(In this blog post I'll put %magma above code that gets input to Magma; all other code gets input to Sage. The input and output is separated by ///.)<br /><br /><p>In Sage:</p><pre>R.<x> = ZZ[]<br />parent(x)<br />///<br />Univariate Polynomial Ring in x over Integer Ring<br /></pre><br /><pre>x.parent()<br />///<br />Univariate Polynomial Ring in x over Integer Ring<br /></pre><br /><br /><pre>isinstance(ZZ, Parent)<br />///<br />True<br /></pre><br /><pre>isinstance(2, Parent)<br />///<br />False<br /></pre><br /><br /><p>Automatic Coercions:</p><p><em>"The primary goal of coercion is to be able to transparently do arithmetic, comparisons, etc. 
between elements of distinct parents."</em></p><p>When I used to try to get people to use Magma, perhaps the number one complaint I heard about Magma was that doing arithmetic with objects having distinct parents was difficult and frustrating.</p><p>For the <strong>first year</strong>, Sage had a very simple coercion system:</p><ul><li>If you try to compute a + b or a * b, first somehow put b into the parent of a, then do the arithmetic.</li></ul><p>That seriously sucked. E.g., </p><p> <strong>Mod(2,7) + 6</strong></p><p>was completely different from</p><p> <strong>6 + Mod(2,7)</strong>!</p><p>The first was Mod(1,7), and the second was the integer 8. This made code difficult to understand and unpredictable.</p><p>So I rewrote coercion to be a bit better (this was a really painful rewrite that I mostly did myself over several hard months of work):</p><ul><li>If you try to compute a + b (or a*b), check for a "<strong>canonical coercion</strong>" from the parent of a into the parent of b, or failing that, from the parent of b into the parent of a. If there is neither, raise an error. If there is one, use it. There won't be both unless there is some canonical isomorphism. </li><li>There are some axioms about what a canonical coercion is. At the least, it must be a homomorphism. </li></ul><p>Then we decided that there is a canonical homomorphism Z --> Z/7Z, but there is not one Z/7Z --> Z since there is no ring homomorphism in this direction, hence the above makes sense in either order. </p><p>One implication of this new model was that parent objects have to be immutable, i.e., you can't fundamentally change them after you make them. This is why in Sage you must specify the name of the generator of a polynomial ring at creation time, and can't change it. In Magma, it is typical to specify the name only later if you want.</p><p>Objects must be immutable because the canonical maps between them depend on the objects themselves, and we don't want them to just change left and right at runtime. 
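</p><p>The symmetric lookup rule described above can be sketched in ordinary Python (toy classes for illustration only -- this is not Sage's actual implementation):</p><pre>class Parent(object):
    # toy parent: knows which other parents canonically coerce into it
    def __init__(self, name, coerces_from=()):
        self.name = name
        self.coerces_from = set(coerces_from)

    def has_coerce_map_from(self, other):
        return other is self or other in self.coerces_from

def common_parent(P, Q):
    # try P from Q, then Q from P; fail if neither direction works
    if P.has_coerce_map_from(Q):
        return P
    if Q.has_coerce_map_from(P):
        return Q
    raise TypeError("no canonical coercion between %s and %s" % (P.name, Q.name))

ZZ = Parent("Integer Ring")
Zmod7 = Parent("Ring of integers modulo 7", coerces_from=[ZZ])

# the answer no longer depends on the order of the operands:
common_parent(ZZ, Zmod7) is common_parent(Zmod7, ZZ)   # both give Z/7Z
</pre><p>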
</p><br /><pre>%magma<br />R := PolynomialRing(RationalField(), 2);<br />f := R.1^3 + 3*R.2^3 - 4/5;<br />print f;<br />print [R.1, R.2];<br />///<br />$.1^3 + 3*$.2^3 - 4/5<br />[ $.1, $.2 ]<br /></pre><br /><pre>%magma<br />AssignNames(~R, ["x", "y"]);<br />print f;<br />print [R.1, R.2];<br />///<br />x^3 + 3*y^3 - 4/5<br />[x, y]<br /></pre><br /><pre>%magma<br />AssignNames(~R, ["z", "w"]);<br />print f;<br />///<br />z^3 + 3*w^3 - 4/5<br /></pre><br /><br />Now in Sage:<br /><pre>R = PolynomialRing(QQ)<br />///<br />TypeError: You must specify the names of the variables.<br /></pre><br /><pre>R.<x,y> = PolynomialRing(QQ)<br />f = x^3 + 3*y^3 - 4/5; f<br />///<br />x^3 + 3*y^3 - 4/5<br /></pre><br /><p>Note: In Sage, you can use a <strong>with block</strong> to temporarily change the names if you <strong><em>really</em></strong> need to for some reason. This is allowed since at the end of the with block the names are guaranteed to be changed back.</p><br /><pre>with localvars(R, ['z','w']):<br />    print f<br />print "back?", f<br />///<br />z^3 + 3*w^3 - 4/5<br />back? x^3 + 3*y^3 - 4/5<br /></pre><br /><br /><p>But this new model had a major problem too: e.g., if x is in Z[x], then "x + 1/2" would FAIL! This is because 1/2 does not coerce into Z[x] (the parent of x), and x does not coerce into Q (the parent of 1/2). </p><p> </p><p>Maybe the implementors of Magma have the answers? Evidently not. </p><br /><pre>%magma<br />R<x> := PolynomialRing(Integers());<br />x + 1/2;<br />///<br />Runtime error in '+': Bad argument types<br />Argument types given: RngUPolElt[RngInt], FldRatElt<br /></pre><br /><p>Robert Bradshaw did though, and now it is in Sage:</p><br /><pre>R.<x> = ZZ[]<br />x + 1/2<br />///<br />x + 1/2<br /></pre><br /><p>His new design is (for the most part) what Sage actually uses now.</p><p>He launched an effort in 2008 (see the <a href="http://wiki.sagemath.org/dev1">Dev Days 1 Wiki</a>) to implement a rewrite of the coercion model to his new design. 
This ended up swallowing up half the development effort at the workshop, and was a massive amount of work, since <em><strong>every</strong></em> parent structure and element had to have some modifications made to it. </p><p>This meant people changing a lot of code all over Sage that they didn't necessarily understand, and crossing their fingers that the doctest test suite would catch their mistakes. This was SCARY. After much work, none of this went into Sage. It was just way too risky. This failure temporarily (!) burned out some developers. </p><p>Robert Bradshaw, on the other hand, persisted and came up with a new approach that involved migrating Sage code gradually. I.e., he made it so that the old coercion model was still fully supported simultaneously with the new one, then he migrated a couple of parent structures, and got the code into Sage. I'm sure not everything is migrated, even today. There are two points to what he did:</p><ol><li>He extended the rules so x + 1/2 works, i.e., the result of a+b need not live in the parent of a or the parent of b.</li><li>He made implementing coercion much more top down: simply implement various methods in a class that derives from Parent. This meant that instead of coercion being rules and conventions that people have to understand and implement in their own code all over Sage, they just implement a small amount of code and the rules (and benefits) are all enforced automatically. 
</li></ol><br /><br /><h2>The Coercion Model</h2><p>The coercion model is explained here: <a href="http://sagemath.org/doc/reference/coercion.html">http://sagemath.org/doc/reference/coercion.html</a></p>William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-77004672194090950782010-11-08T10:05:00.000-08:002010-11-08T14:44:34.101-08:00Getting Started With Cython<h1>Getting Started With Cython</h1><br /><br /><h2>Quote about Cython:</h2><a href="http://news.ycombinator.com/item?id=1212790">Andrew Tipton</a> says <i>"I'm honestly never going back to writing C again. Cython gives me all the expressiveness of Python combined with all the performance and close-to-the-metal-godlike-powers of C. I've been using it to implement high-performance graph traversal and routing algorithms and to interface with C/C++ libraries, and it's been an absolute amazing productivity boost.</i>" Yep.<br /><br /><br /><h2>Cython has two major use cases</h2><ol><li> Extending the CPython interpreter with fast compiled modules,</li><li> Interfacing Python code with external C/C++ libraries. </li></ol><br /><h2>Cython supports type declarations</h2><ol><li> For changing code from having dynamic Python semantics into having static-and-fast (but less generic) C semantics.</li><li> Directly manipulating C data types defined in external libraries.</li></ol><br /><h2>Tutorial: Building Your First Cython Code by Hand</h2>It happens in two stages:<br /><ol><li> A <tt>.pyx</tt> file is compiled by Cython to a <tt>.c</tt> or <tt>.cpp</tt> file.<br /></li><li> The <tt>.c</tt> or <tt>.cpp</tt> file is compiled by a C compiler (such as GCC) to a <tt>.so</tt> file. 
<br /></li></ol><b>Let's try it now:</b><br />First, create a file <tt>sum.pyx</tt> that contains the following code (see <a href="http://wstein.org/edu/2010/581d/notes/2010-11-08/">this directory for the original code files</a>): <br /><br /><pre>def sum_cython(long n):<br />    cdef long i, s = 0<br />    for i in range(n):<br />        s += i<br />    return s<br /></pre><b>Then use Cython to compile it:</b><br /><br />Since we're using Sage, you can do <br /><br /><pre>bash$ sage -cython sum.pyx<br />bash$ ls<br />sum.c sum.pyx<br /></pre>Notice the new file <tt>sum.c</tt>. Compile it with <tt>gcc</tt> as follows on <b>OS X</b>: <br /><br /><pre>bash$ sage -sh<br />bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -bundle -undefined dynamic_lookup sum.c -o sum.so <br /></pre><br />On <b>Linux</b>, do: <br /><br /><pre>bash$ sage -sh<br />bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -shared -fPIC sum.c -o sum.so<br /></pre><b>Finally, try it out.</b><br /><br />You must run Sage from the same directory that contains the file <tt>sum.so</tt>. When you type <tt>import sum</tt> below, the Python interpreter sees the file <tt>sum.so</tt>, opens it, and it contains functions and data that define a compiled "Python C-extension module", so Python can load it (just as it would load a module like <tt>sum.py</tt>). 
<br /><br /><pre>bash$ sage<br />-------------------------------------------------<br />| Sage Version 4.6, Release Date: 2010-10-30 <br />| Type notebook() for the GUI, and license() for<br />-------------------------------------------------<br />sage: import sum<br />sage: sum.sum_cython(101)<br />5050<br />sage: timeit('sum.sum_cython(101)')<br />625 loops, best of 3: 627 ns per loop<br />sage: timeit('sum.sum_cython(101)', number=10^6) # better quality timing<br />1000000 loops, best of 3: 539 ns per loop<br /></pre><br />Finally, take a look at the (more than 1000 line) autogenerated C file <tt>sum.c</tt>: <br /><pre></pre><pre>bash$ wc -l sum.c<br /> 1178 sum.c<br />bash$ less sum.c<br /></pre>...<br /><br />Notice code like this, which illustrates that Cython generates code that supports <em>both</em> Python2 and Python3: <br /><br /><br /><pre>#if PY_MAJOR_VERSION < 3<br /> #define __Pyx_BUILTIN_MODULE_NAME "__builtin__"<br />#else<br /> #define __Pyx_BUILTIN_MODULE_NAME "builtins"<br />#endif<br /><br />#if PY_MAJOR_VERSION >= 3<br /> #define Py_TPFLAGS_CHECKTYPES 0<br /> #define Py_TPFLAGS_HAVE_INDEX 0<br />#endif<br /></pre><pre></pre><pre></pre>The <a href="http://docs.python.org/release/2.6/howto/cporting.html">official Python docs</a> say: "If you are writing a new extension module, you might consider Cython. It translates a Python-like language to C. The extension modules it creates are compatible with Python 3.x and 2.x." 
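Outside of Sage, the standard library's timeit module gives comparable measurements. Here is a self-contained sketch that times a pure-Python version of the loop (substitute the compiled sum.sum_cython if you have built sum.so as above):

```python
import timeit

# Pure-Python version of the loop, so this snippet runs anywhere;
# replace sum_loop with sum.sum_cython if sum.so is on your path.
def sum_loop(n):
    s = 0
    for i in range(n):
        s += i
    return s

print(sum_loop(101))   # 5050
t = timeit.timeit("sum_loop(101)", globals=globals(), number=100000)
print("%.0f ns per loop" % (t / 100000 * 1e9))
```

The pure-Python timing will be far worse than the compiled module's, which is the whole point of the exercise.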
<br /><br />If you scroll down further you'll get past the boilerplate and see the actual code:<br /><br /><pre>...<br /> /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":2<br /> * def sum_cython(long n):<br /> * cdef long i, s = 0 # <<<<<<<<<<<<<<<br /> * for i in range(n):<br /> * s += i<br /> */<br /> __pyx_v_s = 0;<br /><br /> /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":3<br /> * def sum_cython(long n):<br /> * cdef long i, s = 0<br /> * for i in range(n): # <<<<<<<<<<<<<<<br /> * s += i<br /> * return s<br /> */<br /> __pyx_t_1 = __pyx_v_n;<br /> for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {<br /> __pyx_v_i = __pyx_t_2;<br /><br /> /* "/Users/wstein/edu/2010-2011/581d/notes/2010-11-08/sum.pyx":4<br /> * cdef long i, s = 0<br /> * for i in range(n):<br /> * s += i # <<<<<<<<<<<<<<<br /> * return s<br /> */<br /> __pyx_v_s += __pyx_v_i;<br /> }<br />...<br /></pre><pre></pre>There is a big comment that shows the original Cython code with context and a little arrow<span class="Apple-style-span" style="font-family: monospace;"> </span>pointing at the current line (these comment blocks with context were I think the first thing I personally added to Pyrex... before, it just gave that first line with the .pyx filename and line number, but nothing else). Below that big comment, there is the actual C code that Cython generates. For example, the Cython code <tt> s += i</tt> is turned into the C code <tt>__pyx_v_s += __pyx_v_i;</tt>. <br /><h2><br /></h2><h2>The Same Extension From Scratch, for Comparison</h2>If you read <a href="http://docs.python.org/extending/">Extending and Embedding Python</a> you'll see how you could write a C extension module from scratch that does the same thing as sum.so above. Let's see what this is like, for comparison. Given how simple sum.pyx is, this isn't so hard. 
When creating more complicated Cython code---e.g., new extension classes, more complicated type conversions, and memory management---writing C code directly quickly becomes unwieldy. <br />First, create a file <tt>sum2.c</tt> as follows: <br /><br /><br /><pre>#include <Python.h><br /><br />static PyObject *<br />sum2_sum_c(PyObject *self, PyObject *n_arg)<br />{<br />    long i, s=0, n = PyInt_AsLong(n_arg);<br /><br />    for (i=0; i<n; i++) {<br />        s += i;<br />    }<br />    PyObject* t = PyInt_FromLong(s);<br />    return t;<br />}<br /><br />static PyMethodDef Sum2Methods[] = {<br />    {"sum_c", sum2_sum_c, METH_O, "Sum the numbers up to n."},<br />    {NULL, NULL, 0, NULL} /* Sentinel */<br />};<br /><br />PyMODINIT_FUNC<br />initsum2(void)<br />{<br />    PyObject *m;<br />    m = Py_InitModule("sum2", Sum2Methods);<br />}<br /></pre>Now compile and run it as before: <br /><pre>bash$ sage -sh<br />bash$ gcc -I$SAGE_ROOT/local/include/python2.6 -bundle -undefined dynamic_lookup sum2.c -o sum2.so <br />bash$ sage<br />...<br />sage: import sum2<br />sage: sum2.sum_c(101)<br />5050<br />sage: import sum<br />sage: sum.sum_cython(101)<br />5050<br />sage: timeit('sum.sum_cython(1000000r)')<br />125 loops, best of 3: 2.54 ms per loop<br />sage: timeit('sum2.sum_c(1000000r)')<br />125 loops, best of 3: 2.03 ms per loop<br /></pre>Note that the hand-written C version is a little faster than the corresponding Cython code. This is because the Cython code is more careful, checking various error conditions, etc. <br /><br />Also note that the C code is about 5 times as long as the Cython code. <br /><h2>Building Extensions using Setuptools Instead</h2>In nontrivial projects, the Cython step of transforming your code from .pyx to .c is typically done by explicitly calling cython somehow (this will change in the newest version of Cython), but the step of running the C compiler is usually done using either distutils or setuptools. 
To use the tools, one creates a file "setup.py", which defines the extensions in your project, and Python itself then runs a C compiler for you, with the proper options, include paths, etc. <br /><br />Let's create a new setuptools project that includes the sum and sum2 extensions that we defined above. First, create the following file and call it <tt>setup.py</tt>. This should be in the same directory as sum.c and sum2.c. <br /><pre>from setuptools import setup, Extension<br /><br />ext_modules = [<br />    Extension("sum", ["sum.c"]),<br />    Extension("sum2", ["sum2.c"])<br />    ]<br /><br />setup(<br />    name = 'sum',<br />    version = '0.1',<br />    ext_modules = ext_modules)<br /></pre>Then type <br /><pre>bash$ rm *.so # make sure something happens<br />bash$ sage setup.py develop<br />...<br />bash$ ls *.so<br />sum.so sum2.so</pre><br /><br />Notice that running <br /><pre>setup.py develop</pre><br />resulted in Python generating the right gcc command lines for your platform. You don't have to do anything differently on Linux, OS X, etc. <br /><br />If you change <tt>sum2.c</tt>, and want to rebuild it, just type <tt>sage setup.py develop</tt> again to rebuild <tt>sum2.so</tt>. If you change <tt>sum.pyx</tt>, you have to manually run Cython: <br /><pre>sage -cython sum.pyx<br /></pre><br />then again do <tt>sage setup.py develop</tt> to rebuild sum.so. Try this now. In sum.pyx, change <br /><pre>for i in range(n):<br /></pre><br />to <br /><pre>for i in range(1,n+1):<br /></pre><br />then rebuild: <br /><pre>bash$ sage -cython sum.pyx<br />...<br />bash$ sage setup.py develop<br />...<br />bash$ sage<br />...<br />sage: import sum<br />sage: sum.sum_cython(100)<br />5050<br /></pre><br />There are ways to make setup.py automatically notice when sum.pyx changes, and run Cython. A nice implementation of this will be in the next Cython release. 
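Meanwhile, the staleness check such a build system needs is easy to sketch yourself. This is my own minimal illustration, not Cython's actual implementation:

```python
import os

def needs_cython(pyx_path, c_path):
    """Return True if c_path is missing or older than pyx_path.

    Minimal sketch of the staleness check a Cython build system
    performs; a real one would also track .pxd dependencies.
    """
    if not os.path.exists(c_path):
        return True
    return os.path.getmtime(pyx_path) > os.path.getmtime(c_path)
```

A little build_system.py would call this for each extension and invoke cython (e.g., via subprocess) only when it returns True, before handing the .c files to setup.py.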
See the setup.py and build_system.py files <a href="http://code.google.com/p/purplesage/source/browse/">of Purple Sage</a> for an example of how to write a little build system right now (before the new version of Cython). <br /><h2>An Automated Way to Experiment</h2>Given any single Cython file such as <tt>sum.pyx</tt>, in Sage you can do <br /><pre>sage: load sum.pyx<br />Compiling sum.pyx...<br />sage: sum_cython(100)<br />5050<br /></pre>Behind the scenes, Sage created a setup.py file, ran Cython, made a new module, compiled it, and imported everything it defines into the global namespace. If you look in the spyx subdirectory of the directory listed below, <b><i>before</i></b> you exit Sage (!), then you'll see all this. <br /><pre>sage: SAGE_TMP<br />'/Users/wstein/.sage//temp/deep.local/14837/'<br /></pre><br />You can also do <br /><pre>sage: attach sum.pyx<br /></pre><br />Then every time sum.pyx changes, Sage will notice this and reload it. This can be useful for development of small chunks of Cython code. <br /><br />You can also use the Sage notebook, and put <tt>%cython</tt> as the first line of a notebook cell. The rest of the cell will be compiled exactly as if it were written to a <tt>.pyx</tt> file and loaded as above. In fact, that is almost exactly what happens behind the scenes. <br /><h2>Next Time</h2>Now that we understand at a reasonably deep level what Cython really is and does, it is time to learn about the various constructs of the language: <br /><ol><li> How to create extension classes using Cython. </li><li> How to call external C/C++ library code.</li></ol><br />We will rewrite our sum.pyx file first to use a class. 
Then we'll rewrite it again to make use of the MPIR (or GMP) C library for arithmetic, and again to make use of the C++ NTL library.William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-20040350152650161552010-11-03T09:54:00.001-07:002010-11-03T09:54:55.857-07:00Cython, Sage, and the Need for SpeedCython seriously rocks, at least for much of what I need. It's still the killer feature of Python/Sage, IMHO. And meetings like EuroScipy last summer really confirmed that, where almost every other talk used Cython. <br /><br /><h2>History</h2><br />Greg Ewing wrote "Pyrex" in 2002--2004..., which I guess he named after some cookware. It is amazing, but to understand this you must take a quick tour of <a href="http://docs.python.org/extending/">Extending and embedding</a> and the <a href="http://docs.python.org/c-api/">Python/C API reference</a>. Pyrex let you write basically Python-ish code that gets magically turned into C extension code. <br /><br />In 2007 I forked it and named Cython after <a href="http://cython.livejournal.com/">this punk rock guy</a>.<br /> <br />At that time, Robert Bradshaw and Stefan Behnel spent a lot of time improving Cython, implementing tons of <em>optimizations</em> and new features.<br /><br />Cython is now very popular in the "Scientific computing using Python" world. It is also heavily used in Sage. <br /><br /><h2>Are You Serious?</h2>If you want to use a computer for math research, and you are serious (not some lazy person who fiddles then gives up), you will likely run into situations where you need code to run fast. Writing such code only in Python (or any other interpreted language) is often impossible.<br /><br />If you want to write fast code on a computer, and don't want to mess with assembler, the only option right now is C, or something with equivalent speed... Cython! 
By "fast" I mean 100-1000 times what you'll get out of Python on certain tasks. I also mean code that is evil, scary, and dangerous... if you aren't careful with preconditions.<br /><br /><h2>Compiled versus Interpreted Code</h2>Here's how interpreted code usually runs.<br /><pre> 1. Check a bunch of conditions then do one single thing.<br /> 2. Check a bunch of conditions then do one single thing.<br /> ...<br /> 10^6. Check a bunch of conditions then do one single thing.<br /></pre>Here's how compiled code (C, Cython, etc.) can be written:<br /><br /><pre> 1. Check some conditions (optional, but a good idea);<br /> 2. Do very unsafe stuff with no checks at all (but they in theory should be safe given 1).<br /> ...<br /> 10^6. Do very unsafe stuff with no checks at all (but they in theory should be safe given 1).<br /></pre>The problem is that all the checks in step 1 (in either case) can easily take over 100 times as long as "do very unsafe stuff".<br /><br />TYPICAL EXAMPLE:<br /><pre>sage: def sum_sage(n):<br />...     s = 1<br />...     for i in range(n):<br />...         s += i<br />...     return s<br />sage: timeit('sum_sage(100000r)')<br />5 loops, best of 3: 84.6 ms per loop<br />sage: %python<br />sage: def sum_python(n):<br />...     s = 1<br />...     for i in range(n):<br />...         s += i<br />...     return s<br />sage: timeit('sum_python(100000r)')<br />125 loops, best of 3: 5.88 ms per loop<br />sage: 84.6/5.88<br />14.3877551020408<br />sage: %cython<br />sage: def sum_cython(int n):<br />...     cdef int i, s = 1<br />...     for i in range(n):<br />...         s += i<br />...     return s<br />sage: timeit('sum_cython(100000r)')<br />625 loops, best of 3: 61.6 µs per loop<br />sage: 5.88 / 0.061<br />96.3934426229508 <br /></pre><br />Let me explain what's going on in the above. For example, in the first one (sum_sage), the program is carrying on a sort of monologue: "I have to add a Python int to a Sage int. 
I don't have any code to do that directly (that would get too complicated, and they are so big and complicated and different objects, and they might change, oh my). So I'll convert the Python int to a Sage int, because that's the only conversion I know. OK, I do that via (it used to be base 10 string parsing!) some code Gonzalo Tornaria wrote that is scary complicated... and once that is done, I've got my new MPIR-based Sage integer, which I then add to s. The addition takes some memory that points to the two MPIR integers, and since Python numbers are supposed to be immutable, I make yet another MPIR number (wrapped in a Python object), which is the result of asking MPIR to add them. MPIR numbers are also very complicated objects, involving stuff like limbs, and C structs, which hardly anybody fully understands. Despite these integers happening to be small, there is still quite some overhead in the addition, but it happens (taking a small fraction of the total runtime). Then we move on to the next step in the loop!"<br /><br />With sum_python, the loop is similar, but MPIR isn't involved, and there are no conversions. This buys a 14-fold speedup. But it is still not super fast, since many new Python objects get created, the code is for "potentially huge integers", hence a potentially complicated data structure has to be checked for, etc. <br /><br />With sum_cython, the integers are only C ints, which are just 32- or 64-bit locations in memory. Doing "s += i" just modifies that position in memory in place. There are no conversions or type checks done at all at run time. It's really fast... about 1373 times faster than the first version!!!<br /><br />Key point: If you truly understand what is going on, you'll see that this isn't Sage being fundamentally broken. 
Instead, you'll hopefully be able to look at a block of Sage code and have a clue about how to figure out what it is really doing, in order to see whether writing a new implementation of the same algorithm using Cython (which will likely mean directly working with C-level data structures) is likely to give you a big speedup. If you look at the innermost statement in a loop, and there's a big monologue about what is really going on, then you might get a 1000-fold speedup by using Cython. <br /><br />In mathematics, general theorems -- once we have them -- are almost always much better than proofs of special cases. In math, proving a special case can often seem more awkward and unnatural than proving the general case (e.g., how would you prove that every integer of the form a^2 + 7*a + 5 factors uniquely as a product of primes!?). With general theorems in math, the statements are often simple and clear, so applying them is easier than applying theorems that are only about some very special case, which often have more elaborate hypotheses. In mathematics, a general theorem is usually simply all-around better than a theorem about some very special case (especially if both are available). <br /><br />In contrast, when writing computer programs, algorithms to solve very general cases of problems often have significant drawbacks in terms of speed (and sometimes complexity) over algorithms for special cases. Since you are mathematicians, you should constantly guard against your instincts from math research, which can point you in exactly the wrong direction for writing very fast code. Often implementations of very general algorithms <em>are</em> easier to understand, and are much less buggy than a bunch of implementations of special cases. However, there are also usually very severe performance penalties in implementing only the general case. Watch out. 
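The "check a bunch of conditions then do one single thing" pattern can be made concrete with a toy sketch in plain Python (the names here are invented for illustration; real interpreter checks happen in C and are far more numerous): the same loop, once with checks repeated on every iteration, and once with the checks hoisted out front as compiled code effectively does.

```python
def add_checked(s, i):
    # the "check a bunch of conditions" step, repeated every iteration
    if not isinstance(s, int) or not isinstance(i, int):
        raise TypeError("expected ints")
    return s + i

def sum_interpreted_style(n):
    s = 1
    for i in range(n):
        s = add_checked(s, i)      # checks + dispatch each time
    return s

def sum_compiled_style(n):
    if not isinstance(n, int):     # check once, up front
        raise TypeError("expected an int")
    s = 1
    for i in range(n):
        s += i                     # "unsafe" fast path, no checks
    return s
```

Both compute the same answer; the difference is purely where the checking happens, which is exactly the difference the pseudocode above describes.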
<br /><br />A huge part of understanding the point of Cython for writing fast math code is that you must accept that you're going to write a lot of "ugly" (from a mathematician's perspective) code that only deals with special cases. But it's beautiful from the perspective of somebody who absolutely needs fast code for a specific research application; your fast code can lead to whole new frontiers of research.<br /><br />Continuing the example from above:<br /><pre>sage: sum_python(10^8)<br />4999999950000001<br />sage: sum_cython(10^8)<br />887459713<br /></pre><br />Yep, we just implemented only a special case in Cython!<br /><br />---------------<br /><br />Useful things to do if you want Cython enlightenment (all at once, no order):<br /><ul><li> Definitely read/skim <a href="http://docs.python.org/extending/">Extending</a> and try all examples. This is critical if you want to understand Cython with any depth at all. Don't think: I don't need to know any of this, because I have Cython. Yes, after you play with this a bit you may never explicitly use it. But you do need to know it. </li> <li> Oh, definitely learn the C language if you haven't already. This is <a href="http://www.amazon.com/Programming-Language-2nd-Brian-Kernighan/dp/0131103628">the book</a>. There are courses. It's crazy not to learn C anyway, since it (along with C++) is hugely popular, and a massive amount of code is written in C/C++. (See, e.g., http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html where C and C++ together are the most popular, by a long margin.)</li><br /><br /><li> Obviously, read <a href="http://docs.cython.org/">http://docs.cython.org/</a> until it is burned into your brain.</li><br /><br /><li> Look at the code that Cython generates (use "cython -a" for a nice HTML view). 
<br /></ul>William Steinhttps://plus.google.com/115360165819500279592noreply@blogger.comtag:blogger.com,1999:blog-6365588202025292315.post-1546958210949047062010-10-31T22:23:00.000-07:002010-11-01T16:00:05.888-07:00How to Referee Sage Trac TicketsThe current workflow and tools for peer review in Sage are not perfect, and it's clear they can be improved in many ways. However, the current system is also stable and works well, and though it will surely evolve over time, it is well worth learning. This post is about the current Sage peer review workflow. I will restrict my discussion to tickets that involve the Sage library, not standard or optional packages or other scripts not in the Sage library.<br /><br /><h2>Why Referee a Ticket?</h2>There are a few reasons that you might referee a ticket:<br /><br /><ol><li>You want to contribute to Sage development, but don't know where to start. There's a list <a target="_new" href="http://trac.sagemath.org/sage_trac/report/30">right here of about 200 tickets</a> (sorted by component) that need review, and some probably require only background you already know something about.<br /><br /><li> You want a specific chunk of code to get into Sage that somebody else wrote. If you care about this code, you are probably qualified to review it.<br /><br /><li> Somebody asked you to review a certain ticket, perhaps in exchange for them reviewing one of your tickets. <br /></ol>Refereeing trac tickets is different from refereeing papers for inclusion in a journal in several ways. There is no central editor that assigns tickets to reviewers -- anybody (you, right now!) can just jump in and start refereeing tickets. Also, everything about refereeing is completely open and archived (hopefully) forever. 
<h2>What To Expect</h2>The code attached to a trac ticket can do all kinds of things, including: <ol><li> Fix a little (or huge) bug.<br /><br /><li>Add documentation to a function.<br /><br /><li> Introduce a new feature to Sage (and possibly new bugs!).<br /><br /><li>Change the name of an existing function. <br /><br /><li> Raise (or lower) the "doctest coverage" of Sage. <br /></ol><h2>How to Referee a Ticket</h2>Refereeing a ticket involves several steps. The steps below are mainly aimed at tickets that add new capabilities to Sage, but I've included some remarks about other types of tickets in another section below. <ol><li> Use Mercurial queues to apply all the patches on the trac ticket to your own (very recent, usually alpha) version of Sage, then type "sage -br" to rebuild the Sage library. You download each foo.patch, then do "hg qimport foo.patch" followed by "hg qpush" (I personally use this <a target="_new" href="http://sage.math.washington.edu/home/wstein/bin/hgqimport">little script</a>). Note: this can be automated; type "sage -merge" to learn how. If you can't apply the patch, then the ticket may "need work", i.e., the author has to rebase the patches; note this and change the Action at the bottom of the ticket page to "needs work". <br /><br /><li>Test out the code to make sure the doctests work, using "sage -t affected_files" and "make ptest" in the SAGE_ROOT directory. Doing "make ptest" is a good idea, since often a change in one part of Sage breaks something far, far away that was written by somebody else long, long ago. (Again, see "sage -merge" for automating this.) If any tests fail, the ticket probably "needs work"; note this and change the action to "needs work". I often do this testing on sage.math.washington.edu, since it has 24 processors.<br /><br /><li>Make sure that every function introduced in the code, even ones that start with underscores (!), has examples that fully illustrate how it works. 
Make sure that all input parameters to these functions are tested in the examples, and ensure that the function is actually being used by the examples. If the patch is large, you can run "sage --coverageall" both before and after applying the patches, and make sure the coverage score doesn't go down. Also, make sure the docstrings are laid out according to our standard template, e.g.,<br /><pre>"""<br />Short description... (one sentence usually)<br /><br />Long description... (optional but sometimes good)<br /><br />INPUT:<br />    - param1 -- description of<br />      the first param<br />    - param2 -- description of<br />      the second param<br /><br />OUTPUT:<br />    - description of output<br /><br />AUTHORS:<br />    - Sage Developer1<br />    - Sage Developer2<br /><br />EXAMPLES::<br /><br />    sage: 2+2<br />    4<br /><br />TESTS::<br /><br />    sage: further examples nobody should look at.<br />"""<br /></pre>It is essential that there be two colons after "EXAMPLES" and a blank line!! Also, do not omit the INPUT/OUTPUT blocks, since I often get complaints from end users that these are missing.<br /><br /><li> If you've got this far without marking the ticket "needs work", then every function in the new code should work and have solid docstrings, including INPUT/OUTPUT blocks and examples that illustrate how the function works. Much of what you did in 1-3 is pretty routine (and yes, there are people working on trying to automate some of this). In this step, you finally get to have more fun and be creative: think seriously about whether the code is "probably correct" and sufficiently well documented to read. If not, do not give the code a positive review. The Sage library is already huge, and we don't want new code that is just going to cause a lot of pain to maintain later (there are other places for such code, such as <a target="_new" href="http://purple.sagemath.org">Purple Sage</a>). 
Basically, at this point, pretend you are a playtester or hacker and try to break the functionality introduced by the trac ticket. Some techniques include:<br /><br /><ul><li> Throw random input at the functions and see what happens.<br /><li> If you know any theorems or conjectures that lead to consistency checks on the functions, try them out. For example, if a function factors something as a product, try factoring products, then multiplying the factors back together. If the function computes the Ramanujan tau function, try it on large inputs and test the standard congruences.<br /><li>If the functions are implemented in other software (e.g., Magma or Mathematica), write a little program to systematically compare the output of Sage and the other system for some range of values. Also, compare timings; if code is 100 times slower in Sage than in Magma, there had better be a good reason for this. <br /><li> Very often when reading code, your "devious mind" will think of ways to potentially break it. Make sure that you have a sage prompt up, so you can try out your ideas for breaking the code. If anything succeeds at breaking the code, put that in your report and set the trac Action to "Needs work". Also, if you think of good doctests to add, note these.<br /></ul></ol><h2>Special Cases</h2>Most of the steps above assume that the code attached to the trac ticket is implementing new features. <ul><li> If the code fixes a bug, then if at all possible, there should be a new doctest that illustrates that the bug is fixed. This doctest should mention the relevant trac ticket.<br /><br /><li> If a trac ticket mainly adds new documentation, pay special attention to the English grammar, writing, etc. In my experience, almost nobody (definitely not me!) 
ever writes a few paragraphs of English without making several mistakes, typos, etc.<br /><br /><li> If the code changes the name of an existing function, then this must happen in accordance with our deprecation policy. You have to leave the old function name in place and add a deprecation warning. This can be removed after ONE YEAR. The proper way to do this is to put the following at the top of the function:<br /><pre>from sage.misc.misc import deprecation<br />deprecation("function_name is deprecated...")<br /></pre>Also, you may want to put in the Sage version number to give people a clue about when to remove the function (see the second argument to the deprecation function).<br /></ul>The core Sage library is meant to be mature, stable and highly polished, and function names, class names, etc., are not allowed to just be changed left and right. There are other places to post code that <em>uses</em> Sage, but which is not yet stable. <h2>Posting Patches as the Referee</h2>Often when going through the above steps, it is easiest to just make changes directly to the source code. E.g., if you see documentation misspelled, or think of examples to add. Just make a new patch that has all these changes and post a "referee patch". If you do this, then you should notify the original patch author(s), so they can referee your referee patch. This can get a little confusing, but usually sorts itself out. <h2>Summary</h2><br />I hope you can see from the above that refereeing tickets can be really fun. Think of it as a game to break whatever is posted on a trac ticket. It's much better if you find issues now when the author is still interested in the code, rather than some poor user finding bugs a few years later when the author of the code has moved on to other things and forgotten everything about the code. The core Sage library can only someday become high quality with your help as referees. 
And getting substantial code into Sage itself can only hope to attain a status similar to publishing in a peer-reviewed journal if this process of code refereeing gets refined and documented, and results in good quality mathematical software.<br /><br />Remember, the goal is that only high quality code satisfying the requirements outlined above goes into Sage; this is much more important than worrying about following "bureaucratic protocols".<br /><br />-- William Stein<br /><br /><h1>Sage as a Python Library? (2010-10-26)</h1>When I started Sage I viewed it as a distribution of a bunch of math software, and Python as just the interpreter language I happened to use at the time. I didn't even know if using Python as the language would last. However, it's also possible to think of Sage as a Python library. <br /><br />Anyway, it has occurred to me (a few times, and again recently) that it would be possible to make much of the Sage distribution, without Python of course, into a Python library. What I mean is the following. You would have a big Python library called "sagemath", say, and inside of that would be a huge HG repository. In that repository, one would check in the source code for many of the standard Sage spkg's, e.g., GAP, Pari, etc. When you type <br /><br />python setup.py install<br /><br />then GAP, Pari, etc., would all get built, controlled by some Python scripts, then installed as package_data in the sagemath directory of &lt;your python&gt;/site-packages/. <br /><br />From a technical perspective, I don't see any reason why this couldn't be made to work. HG can handle this much data, and "python setup.py install" can do anything. 
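A minimal sketch of the build step such a "sagemath" setup.py could drive, assuming a hypothetical layout in which each bundled spkg is a bzip2'd tarball containing a spkg-install build script (the spkg-install convention is borrowed from Sage's existing build system; all paths and names here are illustrative):

```python
import os
import subprocess
import tarfile

def build_spkgs(spkg_dir, build_dir, names):
    """Unpack each bundled spkg tarball and run its spkg-install
    script -- roughly what a "sagemath" setup.py would do before
    installing the results as package_data."""
    for name in names:
        tarball = os.path.join(spkg_dir, name + ".spkg")
        with tarfile.open(tarball) as t:   # an spkg is just a tarball
            t.extractall(build_dir)
        # each unpacked source tree is built by its own script
        subprocess.check_call(["sh", "spkg-install"],
                              cwd=os.path.join(build_dir, name))
```

A real setup.py would wire this into a custom build command (via cmdclass), so that "python setup.py install" triggers it before the normal install step.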
It does lead to a very different way of looking at Sage though, and it could help untangle things in interesting ways.<br /><br />(1) Have a Python library called "sagecore", which contains just the most important standard spkg's (e.g., Singular, PARI, etc.), perhaps eventually built *only* as shared object libraries (no standalone interpreters). <br /><br />(2) Have a Python library which is the current Sage library (we already have this), and which can be installed assuming sagecore is installed.<br /><br />(3) Have other Python libraries (like psage: http://code.google.com/p/purplesage/source/browse/), which depend on (2). Maybe a lot of the "sage-combinat" code could also be moved to such a library, so its developers can escape the "combinat patch queue" madness. Maybe many other research groups in algebraic topology, differential geometry, special functions, etc., will start developing such libraries... on their own, and share them with the community (but without having to deal directly with the Sage project until they want to). <br /><br />To emphasize (3), when people want to write a lot of mathematics code in some area, e.g., differential geometry, they would just make a new library that depends on Sage (the library in (2)). We would do the work needed to make it easy for people to write code that depends on Sage but lives outside the Sage library. Writing Cython code like this, especially, can be difficult and confusing, and we don't explain it all in any Sage documentation. It actually took me quite a while to figure out how to do it today (with psage). <br /><br />The core Sage library (2) above would continue to have an ever higher level of code review, a tough referee process, etc. However, the development models for (3) would be up to the authors of those libraries.<br /><br />The above is already how the ecosystems of Python (http://pypi.python.org/pypi), Perl (http://www.cpan.org/), R, etc., work. Fortunately, Python has reasonably good support already for this. 
<br /><br />I think without a shift in this direction, Sage is going to be very frustrating for people writing research-oriented code. <br /><br />Fortunately, it's possible to do everything I'm describing above without disturbing the mainline Sage project itself, at least for now.<br /><br />-- William Stein<br /><br /><h1>Revision Control and Sage (2010-10-25)</h1>The main target audience of this blog post is mathematics researchers who are relatively new to revision control in the context of Sage.<br /><br /><h1>Background</h1><br />In the 1990s when I first started using revision control, it was a major, major pain in the arse. Making a tarball snapshot of the project directory with a date-stamp was better! In 2005 when I started Sage, I didn't know any better, and didn't bother with revision control for the first year. In 2006, Gonzalo Tornaria told me that things had improved, and got me to use Darcs. That lasted just over a year, until I switched Sage to use Mercurial. The switch was due to major headaches installing Darcs (a Haskell program), and Darcs hanging when our repo got too big and complicated.<br /><br />We are fortunate that today many difficult problems and design decisions for revision control have been addressed through lots of work. Revision control is amazingly good now compared to what it was 15 years ago!<br /><br />Some milestones: RCS, CVS, Subversion, Bitkeeper & Monotone, Darcs, Git/Mercurial/Bazaar<br /><br />A reference: see <a href="http://hgbook.red-bean.com/read/how-did-we-get-here.html">the Mercurial guide</a>. <br /><br /><h1>Getting started with Mercurial</h1>In Sage we use "hg" (= Mercurial). Why?<br /><br />- Because I chose it in early 2007, when Darcs started sucking too much<br />- <i>Distributed</i> is the way to go these days. 
(Modern)<br />- Bazaar was far too unstable in 2007, to put it mildly. Bazaar is by the people who made Ubuntu, and is much more stable today.<br />- Git was badly documented, too hard to set up, etc., back in 2007; again, Git is much better today.<br /><br />---> so Mercurial was the only reasonable choice then. It was just "good enough" at the time that we could survive. Hence Mercurial.<br /><br />Mercurial comes with Sage, and you can use it like so:<br /> - "sage -hg" <br /> - sage: install_scripts command<br /> - sage: hg_sage.[tab]<br /><br />- Or install: easy to install standalone on OS X, Linux, <i>and</i> Windows. <br /><br />- You can do "$ sudo easy_install mercurial" to install the latest version systemwide into your default Python. It will probably appear in /usr/local/bin/hg.<br /><br />Documentation for hg:<br />- <a href="http://mercurial.selenic.com/">http://mercurial.selenic.com/</a><br />- <a href="http://hgbook.red-bean.com/">http://hgbook.red-bean.com/</a><br />- <a href="http://mercurial.selenic.com/learn/">http://mercurial.selenic.com/learn/</a><br />- <a href="http://hginit.com/">http://hginit.com/</a><br /><br /><h1>Project Hosting</h1><br />If you create your own HG projects, you can easily host them on <a href="http://code.google.com/">Google Code</a> (and also <a href="http://bitbucket.org/">http://bitbucket.org/</a>).<br /><br />I'm doing this a lot lately with smaller projects, including:<br />- <a href="http://code.google.com/p/purplesage/">Purple Sage</a><br />- <a href="http://code.google.com/p/sagenb/">Sage Notebook</a><br /><br />- Other systems. Google Code only does HG and Subversion. 
<br />But there are also similar hosting services for Git and Bazaar:<br />- <a href="http://github.com/">GitHub</a> -- free hosting for Git projects<br />- <a href="https://launchpad.net/">Bazaar's Launchpad</a><br /><br />- Sourceforge: The first major free hosting site I heard of was <a href="http://sourceforge.net">sourceforge.net</a>, which offers hosting for all sorts of repos, including Mercurial. It is more general, but not as "clean and focused" in abilities as the above, e.g., their mailing lists are often plagued by spam, and I seem to recall a lot of banner ads.<br /><br />These hosting services are a very popular and fairly <i>new thing</i> (but not too new); all but sourceforge became popular well after I started Sage. They provide much, much more than the revision control systems themselves do; they are more than just a place to host your code repository. They all provide:<br /><br />- code hosting (of course)<br />- bug tracking<br />- code review<br />- mailing lists, wikis, etc. <br /><br />Sage isn't hosted on one of these, probably because they are so new and Sage is older. This may change. Also, Sage is really, really big (Google Code has a 2GB limit).<br /><br />Main point: if you want to start a small project with a small number of other people -- for example, writing a research paper and some corresponding code -- you can in seconds <i>easily</i> get the full suite of hosting, bug tracking, code review, etc., totally for free, backed up, high quality, etc. 
Obviously, you are trusting your data to Google, GitHub, or Canonical, but you can easily back up most of it (e.g., the repos, mailing lists).<br /><br />They want you to use them: "431,000 people hosting over 1,330,000 git repositories" (from GitHub)<br /><br />Story: In 2006, when I came to UW, Randy LeVeque (a professor in applied math) found out about me running the Sage project, and wanted to know how to get Mercurial, wikis, trac, etc., set up, so that he could organize his software development project(s). I explained some basics, and I think he and his students set up various project infrastructure, and ran into assorted troubles. Today, I would just say "http://code.google.com" (or GitHub, etc.) and be done with it! And Randy would be up and running with more tools than before in a matter of minutes.<br /><br /><h1>Command line Mercurial</h1><br />The rest of this lecture will be about how to use Mercurial. It's very hard to beat the official Mercurial Quickstart, which we'll just go over together in class:<br /><br />See the <a href="http://mercurial.selenic.com/quickstart/">Mercurial Quickstart</a>.<br /><br />- Put a directory on your computer under revision control<br />- Set up your .hgrc file, if you haven't already.<br />- Try cloning a remote repository, e.g., <a href="http://code.google.com/p/purplesage/source/checkout">purple sage</a><br />- Try committing a change that spans multiple files, so you can see that all the changes are together. Then view the commit via "hg serve -p 8200". <br /><br />WARNING: on OS X with the Sage-included hg, this "hg serve" crashed for me with "Trace/BPT trap" when I visited http://localhost:8200. I opened <a href="http://trac.sagemath.org/sage_trac/ticket/10171">trac ticket 10171</a>. 
<br /><br /><br />So I did "sudo easy_install mercurial" to get a new version systemwide, and now this works for me: "/usr/local/bin/hg serve -p 8200" <br /><br />A future blog post may cover more advanced HG topics, including branching, cloning, queues, and bundles.<br /><br />-- William Stein<br /><br /><h1>Standalone C/C++ libraries that are included in Sage (2010-10-04)</h1>Every copy of <a href="http://sagemath.org/">Sage</a> includes many free open source C/C++ libraries, which can also be used standalone without Sage. They got into Sage because enough people really cared about using them, and felt it was worth the effort to convince other people that these should be included in Sage. Thus every single library provides some key functionality to Sage, and understanding the main points about these libraries should be of interest to anybody who knows C/C++ and wants to do mathematical computation. Moreover, most of these libraries were written in C/C++ because the authors had already mastered the underlying algorithms, and really wanted an implementation that was extremely fast, often potentially orders of magnitude faster than what they might implement using only an interpreter, such as Python, Maple or Magma.<br /><br />It's also possible, using Cython (or even ctypes), to directly call any of the functionality of the libraries below from Python (hence Sage). So it is vital that you know about these libraries if you want to write fast code using the Sage framework. 
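As a tiny illustration of the ctypes route -- shown here on the standard C math library libm, rather than one of the Sage libraries below, but the mechanism is identical on a Unix-like system: load the shared library, declare the C prototype, and call straight into C from Python.

```python
import ctypes
import ctypes.util

# Locate and load the C math library; the exact filename varies by
# platform, which ctypes.util.find_library abstracts away (falling
# back to the symbols already loaded into the process).
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)

# Declare the C prototype:  double cbrt(double)
libm.cbrt.argtypes = [ctypes.c_double]
libm.cbrt.restype = ctypes.c_double

root = libm.cbrt(27.0)  # a direct call into the C library
```

Cython accomplishes the same kind of binding at compile time, with far less per-call overhead, which is why Sage's library interfaces use it.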
<br /><br />When you want to look at the source code for any of these libraries, you can download the Sage source code from <a href="http://boxen.math.washington.edu/sage/src/">here</a> or <a href="http://sagemath.org/download-source.html">here</a>, unpack it, look in the directory sage-x.y.z/spkg/standard/, and for any "spkg" (= a bzip2'd tarball) file there, unpack it, e.g.,<br /><br /><pre>wstein@sage:$ tar jxvf mpir-1.2.2.p1.spkg<br />wstein@sage:$ cd mpir-1.2.2.p1/src/<br />wstein@sage:$ ls<br />acinclude.m4 doc missing <br />...<br /></pre><br />Note that you can't unpack the spkg files included in a binary of Sage, since they are actually empty placeholders. <br /><br />You can also find direct links to all of the standard spkg's by going to <a href="http://sagemath.org/packages/standard/">this page</a>, along with short descriptions of each. <br /><br /><br /><h1>Arithmetic</h1><ul><li> mpir (written in C and assembler; has a C++ interface): this is a library for doing fast arithmetic with arbitrary precision integers and rationals. It's very fast and cross-platform, and it's used by many other libraries. (NOTE: MPIR was started as an "angry fork" of GMP, the GNU Multiple Precision arithmetic library. GMP is used by Magma, Maple, and Mathematica; I've been told Mathematica doesn't use MPIR because of the overlap of Sage and MPIR developers, and their general suspicion toward the Sage project.)<br /></li><li> mpfr (C library with C++ interface): this is a library for doing arithmetic with fixed precision floating point numbers, i.e., decimals with a given fixed choice of precision. "Floating-point numbers are a lot like sandpiles: Every time you move one you lose a little sand and pick up a little dirt." [Kernighan and Plauger]. MPFR is used by Magma (for details, google for "mpfr magma"), and several other libraries and systems out there. 
The number theory system and library PARI does *NOT* use MPFR; since PARI also does arbitrary precision floating point arithmetic, it includes alternative implementations of some of the same algorithms (note: PARI also has its own alternative to MPIR). MPFR is unusual in that its main author -- Paul Zimmermann -- is exceptionally careful for somebody who works with floating point numbers, and my understanding is that all the algorithms in MPFR are proven to give the claimed precision. Basically, MPFR generalizes the IEEE standards (which define what floating point arithmetic means) to arbitrary precision. MPFR also supports numerous "rounding modes". The mpmath Python library is in some cases faster than MPFR, but it has very different standards of rigor.</li><li> mpfi (C library): mpfi is built on top of MPFR, and implements arbitrary precision floating point "interval arithmetic". Interval arithmetic is an idea where you do arithmetic on intervals [a,b],<br />and apply functions to them. Given two intervals, their sum, product, quotient, etc., is another interval. The actual "feel" of using this is that you're working with floating point numbers, and as you lose precision due to rounding errors, the computer keeps track of the loss every step of the way... and in some cases you can quickly discover you have no bits left at all! MPFI is nicely integrated into Sage (by Carl Witty), and is a pretty sophisticated and robust implementation of interval arithmetic, on which some other things in Sage are built, such as the field QQbar of all algebraic numbers.</li><li> flint (C library): "Flint" is an abbreviation for "Fast Library for Number Theory". However, I've listed it here under basic arithmetic, since so far in Sage at least we only use FLINT for arithmetic with polynomials in Z[x] and (Z/nZ)[x]. (Sage should use it for Q[x], but still doesn't; see the horrendously painful trac ticket, number 4000!) 
FLINT's main initial challenge, and its reason for existence, was to multiply polynomials in Z[x] more quickly than Magma (and hence than any other software on the planet), and FLINT succeeded. Now FLINT also does many other relevant polynomial algorithms, e.g., GCD, addition, etc. Most new development is going into "FLINT2", which is a total rewrite of FLINT, with an incompatible C interface. Bill Hart's vision is that FLINT2 will eventually have capabilities sort of like PARI. Sage users have found at least one serious bug involving f,g in Z[x] where f*g was computed incorrectly.</li><li> znpoly (C library): (by David Harvey) one version of znpoly is included in FLINT, and another is a standalone package in Sage. The point of this library is to do fast arithmetic in (Z/nZ)[x]. ("Monsky-Washnitzer!": we once found a very subtle bug in znpoly-in-FLINT that only appeared when multiplying polynomials of huge degree.)</li><li> givaro (C++ library): in Sage we use it entirely for fast arithmetic over small-cardinality finite fields, up to cardinality 2^16. Givaro makes a log table and does all multiplications as additions of small ints. Givaro actually does a lot more, e.g., arithmetic with algebraic numbers, and a bunch of matrix algorithms, including Block Wiedemann, which we don't even try to make available in Sage.<br /></li></ul><br />NOTE: The C library API conventions of mpir (gmp), mpfr, mpfi, flint and zn_poly are all pretty similar. Once you learn one, the others are comfortable to use. Givaro is totally different from the others because it is a C++ library, and hence has operator overloading available. The upshot: in Givaro, you type "c=a+b" to add two numbers, but in the other libraries (when used from C) you type something like "mpz_add(c,a,b)".<br /><br /><br /><h1>Number Theory</h1><br /><ul><li> pari (C library; interpreter): a hugely powerful C library (and interactive interpreter) for number theory. 
This library is very, very good and fast at computing many functions relevant to number theory, at computing class groups of number fields, and at certain computations with elliptic curves. It also has functions for univariate polynomial factorization and arithmetic over number fields and finite fields. PARI has a barbaric stack-based memory scheme, which can be quite annoying. Key Point: PARI is easy to build on my iPhone, which says something. On that note, here is Neil Sloane in 2007: ``I would like to thank everyone who responded to my question about installing PARI on an iMAC. The consensus was that it would be simplest to install Sage, which includes PARI and many other things. I tried this and it worked! Thanks! Neil (It is such a shock when things actually work!!)'' PARI is very tightly integrated into Sage.</li><li> ntl (C++ library): a C++ library aimed at providing foundations for implementing number theory algorithms. It provides C++ classes for polynomials, vectors, and matrices over ZZ, Z/pZ, GF(q), (Z/pZ)[x], and RR, and algorithms for each. Some algorithms, e.g., for polynomial factorization in ZZ[x], are very sophisticated. Others, e.g., Hermite form of matrices over ZZ, are naive. In fact, most of the matrix algebra in NTL is naive, at least compared to what is in IML, Linbox, Sage and Magma. NTL is very stable and considered "done" by its author.</li><li> eclib (C++, standalone program): This is John Cremona's prized collection of C++ code that he wrote for computing elliptic curves, both finding them (using "modular symbols") and computing tricky things about them (via the "mwrank" program). It's a large amount of C++ code that Cremona developed over nearly 2 decades. 
There is still much functionality in there that could be made more readily available in Sage via Cython, but nobody has put in the effort to do so (I'm sure Cremona would be very helpful in providing guidance).</li><li> lcalc (C++ library, standalone program): This is Mike Rubinstein's C++ library for computing with "motivic L-functions", which are generalizations of the Riemann zeta function, of great importance in number theory. It's particularly good at computing tons of zeros of such functions in their critical strip, hence getting information about generalizations of the Riemann Hypothesis. It's the best general purpose L-functions program for computing zeros.</li><li> sympow (standalone C program): Though technically not a library, it could be adapted into one. Sympow is a C program written by Mark Watkins for quickly computing special values of symmetric power L-functions (so related to what lcalc does). It has no dependencies (in particular, it does not use PARI), because Mark didn't want to have to license sympow under the GPL.</li><li> ratpoints (C library): Michael Stoll's highly optimized C program for searching for certain rational points on hyperelliptic curves (i.e., curves of the form y^2 = f(x)). This library goes back to the early 1990s and work of Elkies. I believe this is in Sage for one reason only: Robert Miller wanted to implement 2-descent, and he needed this. (Unfortunately, Miller's implementation is only done in the case when there is a rational 2-torsion point.) This library could be made more useful in Sage, by making the functionality in sage.libs.ratpoints easily available for hyperelliptic curves...</li><li> genus2reduction (standalone C program that uses PARI): Not technically a library. This is a C program that depends on PARI, and quickly computes "stuff" about genus 2 curves over the rational numbers... which as far as I know is fairly unique in what it does. A. 
Brumer used it recently for a large-scale computation, and pointed out that it has stability issues.</li><li> flintqs (C++ program): This is a standalone C++ program that implements the quadratic sieve algorithm for integer factorization (which is good at factoring p*q with p and q of similar size and p*q up to maybe 100 digits). I think that this algorithm is also implemented in PARI, though in some cases flintqs is faster. A silly fact is that a better version of this program is evidently in FLINT itself, but for some reason Sage doesn't use it. </li><li> ecm (C library and C program; uses MPFR, MPIR): This implements the elliptic curve integer factorization algorithm, which Hendrik Lenstra invented, and which is good at pulling a prime factor p of up to 30 (?) or so digits out of an integer. There is another highly optimized implementation of the same algorithm as part of PARI, but ECM is in some (all?) cases better. It also has a huge number of tunable parameters. The ECM program is also often run standalone by people, sometimes on clusters, to attempt to factor integers. NOTE: Sage has a "factor" command, which just calls PARI. One could imagine it calling ECM and flintqs, etc., but nobody has made it do so... though many have tried.</li></ul><br /><h1>Linear Algebra</h1><br /><ul><li> IML (C library): "Integer matrix library". This library solves A*X = B, with A and B matrices over the integers. It does this by solving "A*X = B (mod p^n)" for several prime powers p^n, which involves computing the reduced row echelon form of A mod p (for various p) quickly... which is in turn done via matrix multiplication of matrices with double precision entries. It is the fastest software in the world for solving AX=B in many cases, especially as the entries of A and B get huge. </li><li> Linbox (C++ library): asymptotically fast black box linear algebra algorithms. 
It was just a "research sandbox", but "we" (mainly Clement Pernet) got Linbox whipped into shape enough to be of use in Sage for certain things. It is used at least for: matrix multiplication and computation of characteristic polynomials of matrices over the integers and finite fields, and asymptotically fast echelon form of matrices over finite field. I don't think it is used for much else yet. Implementing fast versions of the above in general is a gargantuan task, and we had done much of it (but no good charpoly!) natively in Sage before using Linbox, so now multiple implementations are available -- Linbox's are usually faster. Until Linbox/Sage, there were no fast open source programs for asymptoically fast exact linear algebra -- Magma was the only game in town (neither Maple or Mathematica are good at this either).</li><li> m4ri (C library): "Mary" -- scary fast arithmetic with matrices over the finite field GF(2) with 2 elements. This library uses various tricks (involving so-called "Gray codes") and asymptotically fast divide-and-conquer algorithms to compute reduced row echelon form and do matrix multiplication with matrices over GF(2). It is memory efficient and is the world's fastest at many benchmarks. There is related work (on "m4ri(e)") to do arithmetic over GF(2^n) and also some work by Boothby and Bradshaw on GF(p^n) for small p and n. </li><li> libfplll (C++ library): floating point LLL. The whole point of fpLLL is to implement very fast algorithms for computing LLL reduced basis for lattices, possibly with rounding error if you are not careful. LLL gets used as a key algorithm in many other algorithms (e.g., polynomial factorization). Magma also uses some variant of libfplll, and an older rewritten version of fpLLL is in PARI. If you look, e.g., at Mathematica, it does have some sort of limited LLL implementation, but it is nowhere near as sophisticated as what is available overall in Sage... 
NTL and PARI also have their own interfaces to LLL, which can have various pros and cons over using fpLLL directly. (In Sage, you can make a matrix over the integers and call the LLL or LLL_gram methods on it.) </li><li> polybori (C++ library): commutative algebra with boolean rings, which are characteristic 2 rings such that x^2 = x for every element of the ring. Evidently, polybori is important in cryptography and circuit design (?). I have personally never succeeded in really understanding polybori, or in seeing it do well on benchmarks myself. Polybori is why the boost-cropped package (which provides some C++ data structures and algorithms) is in Sage. </li></ul><br /><h1>Combinatorics</h1><br /><ul><li> symmetrica (C++ library): a C++ library for doing very fast arithmetic with symmetric functions. The combinatorics code in Sage builds on this.</li><li> cliquer (C library): for finding cliques (=maximal complete subgraphs) in a graph.</li></ul><br /><br /><h1>Geometry</h1><br /><ul><li> gfan (C++ program): "groebner fans" -- for computing a geometric object that parametrizes all Groebner bases of an ideal.</li><li> cddlib (C library): for computing with convex polytopes, which gfan requires; it uses the "double description" method (whatever that is). There is code in sage.geometry.polyhedra that also uses cddlib.</li><li> glpk (C library): the GNU Linear Programming Kit. This is a good free open source program for "linear programming". There are also commercial programs (e.g., CPLEX) for linear programming, and whenever people talk about linear programming software they always point out that some commercial programs are way, way faster than glpk. But glpk is in Sage. </li><li> palp (C program): computes stuff about lattice polytopes, like all the integer points in one. There is a big Sage package that builds on top of this. 
Evidently, this is of substantial interest to some physicists.</li></ul><br /><h1>Numerical applied math</h1><ul><li> gsl (C library): implements core algorithms of scientific computing; it is very stable and considered "done" by its authors. New relevant code goes into other libraries that use GSL.</li></ul><br /><h1>Non-mathematics</h1><br /><ul><li> readline: provides command line editing, which is used by IPython, PARI, Singular, etc. Often misbuilt or not available on various platforms, so included in Sage. Readline is perhaps famous for being a key reason that both PARI and clisp are licensed under the GPL.</li><li> bzip2: bzip2 compression; easily usable from Sage via "import bz2". NOTE: this is used by the Sage build system so it is in SAGE_ROOT/spkg/base instead of SAGE_ROOT/spkg/standard/.</li><li> zlib: gzip-compatible compression; makes it so you can easily "gzip" files, and even work with gzip'd files in memory from Python.</li></ul><br /><br />-- William Stein<br /><br /><h1>The Sagemath Cluster's Fileserver (2010-09-26)</h1>In addition to being involved in programming in Sage, I am also the systems administrator for the "Sagemath cluster" of computers, which are used for Sage development, and also support mathematics research. This is a cluster of 5 machines with 128GB RAM and 24 cores in each machine, along with a 24 terabyte Sun X4540 fileserver. <br /><br />When we set up this cluster nearly 2 years ago, we installed OpenSolaris and ZFS on the fileserver. It's run very solidly, in that the uptime was over 600 days a few days ago. However, no software would really work well on OpenSolaris for me -- top usually crashed, emacs crashed, stuff just didn't work. I never got Sage to build. Also, USB disks were a major pain on this machine, which complicated backups. 
Also, I frankly found the performance of ZFS and NFS-from-ZFS to be disappointing. In addition, nobody had a clue how to do maintenance and security updates on this server, meaning it was probably a security risk. <br /><br />Then Sun got bought by Oracle, and Oracle killed OpenSolaris. Also, I started getting really into MongoDB for database work related to the modular forms database project, and it would be nice to be able to run a MongoDB server and other related software and servers directly on my fileserver (yes, MongoDB supports Solaris, but...). Getting anything to work on Solaris costs too much time and confusion. So I decided to delete OpenSolaris and install Linux. Since there are many terabytes of data on the fileserver, and dozens of people using it, this involved many rsync's back and forth. <br /><br /><br />I eventually succeeded in installing Ubuntu 10.04.1 LTS Server on disk.math. I set up a big 20TB software RAID-5 array on disk.math (with 2 spares), and added it to an LVM2 (=logical volume management) volume group.<br /><br />I created 3 partitions:<br /><br /> home -- 3.5 terabytes: for people's home directories<br /> scratch -- 3.5 terabytes: for scratch that is available on all machines on the cluster<br /> lmfdb -- 3.5 terabytes: for the L-functions and modular forms database project.<br /><br />Thus over 7.5 terabytes is not allocated at all right now. This could be added to the lmfdb partition as needed.<br /><br />The RAID-5 array has impressive (to me) performance:<br /><br /> root@disk:/dev# /sbin/hdparm -t /dev/md0<br /> /dev/md0:<br /> Timing buffered disk reads: 1638 MB in 3.00 seconds = 545.88 MB/sec<br /><br />All the above 3.5TB partitions are NFS exported from disk.math, and will all be mounted on the other machines in the cluster.<br /><br />By using LVM, we will still get snapshotting (like ZFS has), which is important for robust backups over rsync, and is great for users (like me!) who accidentally delete important files. 
<br /><br />I chose 3.5TB for the partitions above, since that size is easy to back up using 4TB external USB disks. Now that I'm running Linux on disk.math, it will be easy to just plug in a physical disk to the fileserver and make a complete backup, then unplug it and swap in another backup disk, etc. <br /><br />----<br /><br />Some general sysadmin comments, in case people are interested. <br /><br /> (1) Oracle shut down access to firmware upgrades, which meant I couldn't upgrade the firmware on disk.math. Maybe it would have been possible via phone calls, etc., but I was so pissed off to see Oracle do something that lame. It's just evil to not give away firmware upgrades for hardware. Give me a break. Oracle sucks. I'm really glad I just happened to upgrade the firmware on all my other Sun boxes recently.<br /><br /> (2) The remote console -- which is needed to install from a CD image on disk.math -- does not work correctly on 64-bit OS X or 64-bit Linux. It just fails with "not supported on this platform" errors, but with no hint as to why. By installing a 32-bit Linux into a virtual machine, I was able to get this to work.<br /><br /> (3) Linux NFS + software RAID + Linux LVM2 (=logical volume management) does roughly "the same thing" as ZFS did on the old disk.math machine. I spent about equal time with the documentation for both systems, RAID+LVM2+NFS and ZFS, and the documentation for RAID+LVM2+NFS is definitely less professional, but at the same time vastly more helpful. There are lots of good step-by-step guides, forums, and just generally a *ton* of useful help out there. With the ZFS documentation, there is basically one canonical document, which, though professional, is just really frustrating as soon as you have a question not answered in it. I personally greatly prefer the Linux ecosystem. <br /><br /> (4) RAID+LVM2+NFS is much more modular than ZFS. 
Each part of the system can be used by itself without the other parts, each has its own documentation, and each can make use of the others in multiple flexible and surprising ways. It's very impressive. One of the reasons I chose ZFS+Solaris instead of Linux+whatever 2 years ago is that I didn't realize that Linux offered similar capabilities... because it didn't back in 2001 when I was learning tons about Linux. But it does now. Linux on the server has really, really improved over the years. (I've been using Linux since 1993.)<br /><br />-- William Stein