Thursday, June 12, 2014

SageMathCloud Task Lists

I've added task list functionality to SageMathCloud (SMC), so you can keep track of a list of things to do related to a project, paper, etc. Task lists are saved as files on the filesystem, so they can be backed up in the usual way, automatically generated, etc. I doubt anybody is going to use SMC just for the task lists, but for people already using SMC in order to use Sage, write LaTeX documents, use IPython notebooks, etc., having a convenient integrated task list should come in handy.
To create a task list, in a project, click "+New", name the task list, then click the "Task List" button.  Here's what a task list looks like:

The Design

I used the task list quite a bit when implementing the task list, and significantly modified the interface many, many times. I also tried out numerous other todo list programs for inspiration. I eventually settled on the following key design choices, some of which are different than anything I've seen. In particular, the design targets highly technical users, which is not something I saw with any other todo list programs I tried.
  • Markdown editor: The task description is formatted using client-side rendered GitHub Flavored Markdown (using marked), including [ ] for checkboxes. I also include full MathJax support, and spent quite a while working around various subtleties of mixing MathJax and markdown. I'll be rewriting Sage worksheets to use this code. The editor itself is CodeMirror 4 in markdown mode, so it respects whatever theme you choose, has nice features like multiple cursors, paren matching, vim/emacs modes, etc. Multiple people can edit the same task at once and everybody will see the changes as they are made (note: I haven't implemented showing other cursors).
  • Relative dates and times: All dates and times are shown relative to right now. If something is due in 20 hours, it says "due about 20 hours from now". I also included a sortable column with the last time when a task was edited, also shown relative to the current time. This display uses the timeago jquery plugin. You can of course click on the due date to see the actual date.
  • Hashtags: As I was implementing (and removing) features such as priorities, a way to indicate which task you are currently working on, folders, etc, I decided that hashtags can provide every feature. Moreover, they are very text editor/programmer friendly. When editing a task, if you put #foo, then it is interpreted as a hashtag, which you can click on to show only tasks containing that tag. To disambiguate with markdown headings, to make a heading you have to include a whitespace, so # foo. I haven't added autocomplete for hashtags yet, but will. You can easily click on hashtags anywhere to select them, or use the bar at the top.
  • User-specified task order: The default sort order for tasks is custom. There is a drag handle so you can explicitly move tasks up and down in order to indicate how important they are to you (or whatever else). You can also click an up hand or down hand to knock the currently selected task to the top or bottom of the list of displayed tasks.
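The hashtag rule above (#foo is a tag, # foo is a heading) can be illustrated with a small sketch. This is not the SMC code; the regular expression and the function names are assumptions made for the example, and it only shows one plausible way to tell tags from headings and filter tasks by tag.

```python
import re

# Assumed rule (taken from the post): '#' followed directly by a word
# character is a hashtag; '#' followed by whitespace is a markdown heading.
# The lookbehind requires the '#' to start a word, so "a#b" is not a tag.
HASHTAG_RE = re.compile(r"(?<!\S)#(\w+)")

def extract_hashtags(description):
    """Return the set of hashtags (without the '#') in a task description."""
    return set(HASHTAG_RE.findall(description))

def filter_tasks(tasks, tag):
    """Keep only the tasks whose description contains the given hashtag."""
    return [t for t in tasks if tag in extract_hashtags(t)]

tasks = [
    "# Release notes\nwrite the changelog #docs #urgent",
    "fix sync bug #urgent",
    "read the issue about cursors",
]
print(filter_tasks(tasks, "urgent"))
```

Clicking a hashtag in the interface would then correspond to calling something like filter_tasks with that tag.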
Of course, I still have an enormous list of functionality I would like to add, but that will have to wait. For example, I need to enable a chat associated to each task list, like the chats associated to Sage worksheets and other files. I also want to make it so one can select a range of tasks and copy them, move them, paste them into another list, etc. It would also be nice to be able to export task lists to HTML, PDF, etc., which should be fairly easy using pandoc.  I'm also considering making a note list, which is like a task list but without the due date or "done" box.  Because of all the infrastructure already there, it would be easy to add code evaluation functionality, thus creating something like worksheets, but from a different point of view (with maybe hashtags determining the Python process).

Databases and Differential Synchronization

One interesting thing I noticed when implementing task lists is that there are many similarities with the original sagenb.org design (and also IPython notebook), in which a worksheet is a list of cells that get evaluated, can be manually reordered, etc. Similarly, a task list is a list of "cells" that you edit, manually reorder, etc. With sagenb we had longstanding issues involving the order of each cell and assigning an integer numerical id (0, 1, 2, ...) to the cells, which resulted in things like cells seemingly getting randomly reordered, etc. Also, having multiple simultaneous viewers with automatic synchronization is very difficult with that model. For task lists, I've introduced some new models to address these issues.

A natural way to store a task list is in a database, and I initially spent some time coming up with a good database schema and implementing basic lists using Cassandra for the backend. However, I couldn't pursue this approach further, since I was concerned about how to implement realtime synchronization, and also about the lack of easily backing up complete task lists via snapshots, in git, etc. So instead I created an "object database" API built on top of a file that is synchronized across clients (and the server) using differential synchronization. The underlying file format for the database is straightforward -- there is one line in JSON format for each record in the database. When objects are changed, the file is suitably modified, synchronized to other clients, and events are triggered.
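To make the file format concrete, here is a minimal sketch of a database stored as one JSON record per line. It is an illustration only: the class name LineDB, the "id" field, and the method names are hypothetical, and the synchronization and event-triggering parts are left out entirely.

```python
import json
import uuid

class LineDB:
    """Toy 'object database': one JSON record per line of a text document."""

    def __init__(self, text=""):
        self.records = {}
        for line in text.splitlines():
            if line.strip():
                rec = json.loads(line)
                self.records[rec["id"]] = rec

    def set(self, record):
        """Insert or update a record; assign a fresh uuid if it has no id."""
        record = dict(record)
        record.setdefault("id", str(uuid.uuid4()))
        self.records[record["id"]] = record
        return record["id"]

    def delete(self, record_id):
        self.records.pop(record_id, None)

    def to_text(self):
        # sort_keys keeps each line stable, so a line only changes when
        # its own record changes (which is what makes diffs small).
        return "\n".join(json.dumps(r, sort_keys=True)
                         for r in self.records.values())

db = LineDB()
task_id = db.set({"desc": "write paper #math", "done": False})
db.set({"desc": "fix sync bug", "done": True})
text = db.to_text()          # this string is what would be synchronized
restored = LineDB(text)      # another client rebuilds the same records
```

Because each record occupies exactly one line, changing one task touches one line of the file, which is the property that makes the format friendly to line-based diffs.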

Since differential synchronization has no trouble with files that have "a few thousand lines", this approach works well for our purposes (since personal or shared todo lists are typically fairly short). Also, having one line per record is at least more git friendly than something like a sqlite database. I'm considering rewriting my implementation of IPython notebook sync on top of this abstraction.
Since I view the task list as a database, each task has a globally unique uuid. Also, instead of viewing the task order as being defined by an integer 0,1,2,3, which leads to all manner of bugs, programming misery, race conditions, etc., we view the order as being determined by floating point positions. So to insert a task between tasks with positions 1 and 2, we just give the task position 1.5.
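A short sketch of the floating point ordering idea, with hypothetical field and function names (this is not the SMC implementation): each task carries a uuid and a position, and inserting between two tasks means taking the midpoint of their positions.

```python
import uuid

def new_task(desc, position):
    """A task is identified by a uuid; its place in the list is a float."""
    return {"id": str(uuid.uuid4()), "desc": desc, "position": position}

def position_between(before, after):
    """Midpoint of two positions, e.g. 1 and 2 give 1.5."""
    return (before + after) / 2.0

def insert_between(tasks, desc, before, after):
    """Add a task that sorts between two existing tasks; nothing is renumbered."""
    task = new_task(desc, position_between(before["position"], after["position"]))
    tasks.append(task)
    return task

def in_display_order(tasks):
    return sorted(tasks, key=lambda t: t["position"])

first = new_task("first", 1.0)
second = new_task("second", 2.0)
tasks = [first, second]
middle = insert_between(tasks, "middle", first, second)
print([t["desc"] for t in in_display_order(tasks)])
```

Only the new task is written, so two clients inserting at different places never have to agree on a renumbering. One caveat of this sketch: repeated midpoint insertion into the same gap eventually runs out of floating point precision (after roughly fifty halvings), so a long-lived list would occasionally need its positions respaced.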

Thursday, June 5, 2014

The Official Go Tutorial as a (41-page) SageMathCloud worksheet


Do you like using interactive SageMathCloud worksheets and want to learn the basics of the Go Programming language? I've added a %go magic to SMC worksheets, and translated the official Go tutorial into a single long worksheet.
  1. Open a SageMathCloud project and restart the project server if necessary (project settings --> Restart).
  2. Click +New and paste in this URL (then press enter): https://github.com/sagemath/cloud-examples.git
  3. You'll get a large collection of example worksheets in a directory cloud-examples. Navigate to the "Go" subdirectory and open go.sagews.
You can also directly browse a PDF of the worksheet here: https://github.com/sagemath/cloud-examples/blob/master/go/go.pdf?raw=true


Monday, June 2, 2014

Update to SageMathCloud - Codemirror 4.2, etc.

I've made the following updates to https://cloud.sagemath.com:

User interface changes (refresh your browser to get)

Sage Worksheets

  • Don't show spinner unless computation is taking more than a second.
  • Streamlined evaluation a little bit (never get stuck with the grey spinner when you're online)
  • 2d graphics now display using svg by default, since browser bugs have been fixed

Upgrade from CodeMirror 3.20 to CodeMirror 4.2, which is much better, faster, and has new features:

  • Multiple cursors (use control or command+click to create multiple cursors)
  • Sublime keyboard bindings
  • New editor features (you can turn these off in account settings):
    • Auto brackets: automatically close brackets, quotes, etc.
    • Show trailing whitespace: show spaces at ends of lines
Here's a 2-minute screencast that illustrates some of the above UI features: http://youtu.be/ykb12MGHOuk

Task Lists: There is now a preliminary task list implementation. To use it, make a file that ends in .tasks.

  • Task editing now uses full screen width
  • Fixed task deletion bug
  • Markdown list autocomplete

Backend changes (restart project server to get):

  • Automatically delete project log entries when the log grows beyond 7500 lines. In some cases of very active projects, the log would get enormous (say 30MB+) and just loading it would slow down everything for a while.
  • Clear a certain logfile that was getting huge whenever the project server is restarted.

Tuesday, May 6, 2014

Update to Differential Synchronization in SageMathCloud

I've just pushed out a major update to how synchronization works in https://cloud.sagemath.com.

This change is pretty significant behind the scenes, but the only difference you should notice is that everything should be better. In particular:

  • evaluation of code in Sage worksheet should feel a little snappier and more robust,
  • various random and hard to reproduce issues with synchronized editing should be fixed, e.g. chat messages out of order, etc.
  • everything should generally be a bit faster and more scalable overall.

Here's a short technical description of what changed. The basic architecture of SageMathCloud is that there are many web browsers connected to many hubs, which are in turn connected to your project (and to many other projects too):

  [web browser] <- websocket ----\/
  [web browser] <------------> [hub]<------ tcp -------\/
                                                     [project]
  [web browser] <------------> [hub]<------------------/\

Until today, the differential synchronization implementation involved having a copy of the document you're editing on:

  1. each hub pictured above,
  2. in each browser, and
  3. in the project itself.

In particular, there were three slightly different implementations of differential synchronization running all over the place. The underlying core code is the same for all three, but the way it is used in each case is different, due to different constraints. The implementations:

  • browser: running in a web browser, which mainly has to worry about dealing with the CodeMirror editor and a flaky Internet connection.
  • hub: running in a node.js server that's also handling a lot of other stuff, including worrying about auth, permissions, proxying, logging, account creation, etc.
  • project: running in the project, which doesn't have to worry about auth or proxying or much else, but does have to worry about the filesystem.

Because we're using Node.js, all three implementations are written in the same language (CoffeeScript), and run the same underlying core code (which I BSD licensed at https://github.com/sagemath/cloud/blob/master/diffsync.coffee). The project implementation was easiest to write, since it's very simple and straightforward, and has minimal constraints. The browser implementation is mainly difficult because the Internet comes and goes (as laptops suspend/resume), and because it involves patching and diff'ing a CodeMirror editor instance; CodeMirror is difficult, because it views the document as a list of lines instead of a single long string, and we want things to work even for documents with hundreds of thousands of lines, so converting back and forth to a string is not an option! Implementing the hub part of synchronization is the hardest, for various reasons -- and debugging it is particularly hard. Moreover, computing diffs can be computationally expensive if the document is large, so doing anything involving differential sync on the hub can result in nontrivial blocking CPU usage, hence slower handling of other user messages (node.js is single threaded). The hub part of the above was so hard to get right that it had some nasty locking code, which shouldn't be needed, and just looked like a mess.
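For readers unfamiliar with differential synchronization, here is a deliberately simplified sketch of one sync cycle between two endpoints, working on a list of lines rather than one long string. It is not the diffsync.coffee code: the class and function names are made up, it uses Python's difflib in place of the real diff library, and it patches by exact index, so it is only correct when the receiver has not diverged from the sender's shadow. A real implementation applies patches fuzzily and handles lost messages.

```python
import difflib

def make_patch(old_lines, new_lines):
    """Describe how to turn old_lines into new_lines as (start, end, replacement)."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [(i1, i2, new_lines[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_patch(lines, patch):
    """Apply edits back to front so that earlier indices stay valid."""
    result = list(lines)
    for i1, i2, replacement in reversed(patch):
        result[i1:i2] = replacement
    return result

class Endpoint:
    """One side of a sync pair: the live document plus a shadow copy."""

    def __init__(self, lines):
        self.live = list(lines)
        self.shadow = list(lines)

    def outgoing(self):
        """Diff live against shadow, then advance the shadow."""
        patch = make_patch(self.shadow, self.live)
        self.shadow = list(self.live)
        return patch

    def incoming(self, patch):
        """Apply the other side's edits to both the shadow and the live copy."""
        self.shadow = apply_patch(self.shadow, patch)
        self.live = apply_patch(self.live, patch)

browser = Endpoint(["a", "b", "c"])
project = Endpoint(["a", "b", "c"])
browser.live = ["a", "B", "c", "d"]     # the user types in the editor
project.incoming(browser.outgoing())    # one half of a sync cycle
```

Working on lines avoids joining a very large document into a single string, at the cost of a coarser diff.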

A lot of issues people ran into with sync involved two browsers connected to different hubs, which then connected to the same document in a project. The two hubs' crappy synchronization would appear to work right in this part of the picture "[web browser] <------------> [hub]", but have problems with this part "[hub]<-------------->[project]", which would lead to pain later on. In many cases, the only fix was to restart the hub (to kill its sync state) or for the user to switch hubs (by clearing their browser cookies).

Change: I completely eliminated the hub from the synchronization picture. Now the only thing the hub does related to sync is forward messages back and forth between the web browser and local hub. Implementing this was harder than one might think, because the project considered each client to be a single tcp connection, but now many clients can connect to a project via the same tcp connection, etc.
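One way to picture the new arrangement is a hub that keeps no sync state and only tags and forwards messages, while the project keeps one sync session per client even though many clients share a connection. The sketch below is purely illustrative: the message fields (client_id, payload, ack) and both class names are assumptions, and a direct method call stands in for the single tcp connection.

```python
import json

class ProjectSyncServer:
    """Project side: per-client sync state, keyed by an id carried in each message."""

    def __init__(self):
        self.sessions = {}

    def handle(self, raw):
        msg = json.loads(raw)
        session = self.sessions.setdefault(msg["client_id"], [])
        session.append(msg["payload"])
        return json.dumps({"client_id": msg["client_id"], "ack": len(session)})

class Hub:
    """Hub side: holds no document state; it only tags and forwards messages."""

    def __init__(self, project):
        self.project = project  # stands in for the one shared tcp connection

    def forward(self, client_id, payload):
        raw = json.dumps({"client_id": client_id, "payload": payload})
        return json.loads(self.project.handle(raw))

project = ProjectSyncServer()
hub = Hub(project)
hub.forward("browser-1", "edit A")
reply = hub.forward("browser-2", "edit B")
```

Since the hub holds nothing that can get out of date, restarting it or switching hubs no longer affects the state of a sync session in this model.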

With this fix, if there are any bugs left with synchronization, they should be much easier to debug. The backend scalability and robustness of sync have been my top priorities for quite a while now, so I'm happy to get this stuff cleaned up, and move onto the next part of the SMC project, which is better collaboration and course support.

Thursday, May 1, 2014

What can SageMathCloud (SMC) do?

The core functionality of SageMathCloud:

  • Color command-line Terminal with many color schemes, which several people can interact with at once, with state that survives browser refresh.
  • Editing of documents: with syntax highlighting, auto-indent, etc., for files with the following extensions:
    c, c++, cql, cpp, cc, conf, csharp, c#, coffee, css, diff, dtd, e, ecl, f, f90, f95, h, hs, lhs, html, java, jl, js, lua, m, md, mysql, patch, gp, go, pari, php, py, pyx, pl, r, rst, rb, ru, sage, sagews, scala, sh, spyx, sql, txt, tex, toml, bib, bbl, xml, yaml.
(It's easy for me to add more, as CodeMirror supports them.) There are many color schemes and Emacs and Vim bindings.
  • Sage Worksheets: a single-document, interactive way to evaluate Sage code. This is highly extensible, in that you can define % modes by simply making a function that takes a string as input, and use %default_mode to make that mode the default. Also, graphics actually work in %r mode automatically, exactly as in normal R (no mucking with devices or png's).
  • IPython notebooks: via an IPython session that is embedded in an iframe. This is synchronized, so that multiple users can interact with a notebook simultaneously, which was a nontrivial addition on top of IPython.
  • LaTeX documents: This fully supports sagetex, bibtex, etc. and the LaTeX compile command is customizable. This also has forward and inverse search, i.e., double click on preview to get to point in tex file and alt-enter in tex file to get to point in LaTeX document. In addition, this editor will work fine with 150+ page documents by design. (Editing documents split across multiple files is not properly supported yet.)
  • Snapshots: the complete state of all files in your project is snapshotted (using bup, which is built on git) every 2 minutes, when you're actively editing a file. All of these snapshots are also regularly backed up to encrypted disks offsite, just in case. I plan for these highly efficient deduplicated compressed snapshots to be saved indefinitely. Browse the snapshots by clicking "Snapshots" to the right when viewing your files or type cd ~/.snapshots/master in the terminal.
  • Replication: every project is currently stored in three physically separate data centers; if a machine or data center goes down, your project pops up on another machine within about one minute. A computer at every data center would have to fail for your project to be inaccessible. I've had zero reports of projects being unavailable since I rolled out this new system 3 weeks ago (note: there was a project that didn't work, but that was because I had set the quota to 0% cpu by accident).
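As a rough illustration of the % modes mentioned in the Sage Worksheets item above: a mode is just a function that receives the cell body as a string. The shout function below is such a function; run_cell is a hypothetical stand-in for whatever the worksheet does internally to dispatch on the first line, which the post does not describe.

```python
def shout(s):
    """A would-be %shout mode: receives the rest of the cell as one string."""
    return s.upper()

def run_cell(cell_text, modes):
    """Hypothetical dispatcher: if the first line is %name, call modes[name]."""
    first, _, rest = cell_text.partition("\n")
    name = first[1:].strip()
    if first.startswith("%") and name in modes:
        return modes[name](rest)
    return cell_text  # no mode line: leave the cell for the default evaluator

print(run_cell("%shout\nhello world", {"shout": shout}))
```

In a worksheet, defining the function should be all that is needed before using it as a % mode in a cell.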

Sage

The Sage install contains the following extra packages (beyond what is standard in Sage itself). When you use Sage or IPython, this will all be available.
basemap, biopython, bitarray, brian, cbc, chomp, clawpack, cluster_seed, coxeter3, cryptominisat, cunningham_tables, database_cremona_ellcurve, database_gap, database_jones_numfield, database_kohel, database_odlyzko_zeta, database_pari, database_symbolic_data, dot2tex, fabric, gap_packages, gnuplotpy, greenlet, guppy, h5py, httplib2, kash3, lie, lrs, lxml, mahotas, mercurial, mpld3, munkres, mysql-python, nauty, netcdf4, neuron, normaliz, nose, numexpr, nzmath, oct2py, p_group_cohomology, pandas, paramiko, patsy, phc, plotly, psutil, psycopg2, pybtex, pycryptoplus, pyface, pymongo, pyproj, pyx, pyzmq, qhull, quantlib, redis, requests, rpy2, scikit_learn, scikits-image, scimath, shapely, simpy, snappy, statsmodels, stein-watkins-ecdb, tables, theano, topcom, tornado, traits, xlrd, xlwt, zeromq

R

Also, I install the following extra packages into the R that is in Sage:
KernSmooth, Matrix, Rcpp, cairo, car, circular, cluster, codetools, e1071, fields, ggplot2, glmnet, lattice, mgcv, mvtnorm, plyr, reshape2, rpart, stringr, survival, zoo

It's Linux

SMC can do pretty much anything that can be done with Ubuntu 14.04 Linux and doesn't require X11. I've pre-installed the following packages, and if people want others, just let me know (and they will be available to all projects henceforth):
vim git wget iperf dpkg-dev make m4 g++ gfortran liblzo2-dev libssl-dev libreadline-dev  libsqlite3-dev libncurses5-dev git zlib1g-dev openjdk-7-jdk libbz2-dev libfuse-dev pkg-config libattr1-dev libacl1-dev par2 ntp pandoc ssh python-lxml  calibre  ipython python-pyxattr python-pylibacl software-properties-common  libevent-dev xfsprogs lsof  tk-dev  dstat emacs vim texlive texlive-* gv imagemagick octave mercurial flex bison unzip libzmq-dev uuid-dev scilab axiom yacas octave-symbolic quota quotatool dot2tex python-numpy python-scipy python-pandas python-tables libglpk-dev python-h5py zsh python3 python3-zmq python3-setuptools cython htop ccache python-virtualenv clang libgeos-dev libgeos++-dev sloccount racket libxml2-dev libxslt-dev irssi libevent-dev tmux sysstat sbcl gawk noweb libgmp3-dev ghc  ghc-doc ghc-haddock ghc-mod ghc-prof haskell-mode haskell-doc subversion cvs bzr rcs subversion-tools git-svn markdown lua5.2 lua5.2-*  encfs auctex vim-latexsuite yatex spell cmake libpango1.0-dev xorg-dev gdb valgrind doxygen haskell-platform haskell-platform-doc haskell-platform-prof  mono-devel mono-tools-devel ocaml ocaml-doc tuareg-mode ocaml-mode libgdbm-dev mlton sshfs sparkleshare fig2ps epstool libav-tools python-software-properties software-properties-common h5utils libnetcdf-dev netcdf-doc netcdf-bin tig libtool iotop asciidoc autoconf bsdtar attr  libicu-dev iceweasel xvfb tree bindfs liblz4-tool tinc  python-scikits-learn python-scikits.statsmodels python-skimage python-skimage-doc  python-skimage-lib python-sklearn  python-sklearn-doc  python-sklearn-lib python-fuse cgroup-lite cgmanager-utils cgroup-bin libpam-cgroup cgmanager cgmanager-utils cgroup-lite  cgroup-bin r-recommended libquantlib0 libquantlib0-dev quantlib-examples quantlib-python quantlib-refman-html quantlib-ruby r-cran-rquantlib  libf2c2-dev libpng++-dev libcairomm-1.0-dev r-cran-cairodevice x11-apps mesa-utils libpangox-1.0-dev
I've also put extra effort (beyond just apt-get) to install the following:
polymake, dropbox, aldor/"AXIOM", Macaulay2, Julia, 4ti2

Functionality that is currently under development

We're working hard on improving SageMathCloud right now.

  • Streamlining document sync: will make code evaluation much faster, eliminate some serious bugs when the network is bad, etc.
  • Geographic load balancing and adding data centers, so that, e.g., if you're in Europe or Asia you can use SMC with everything happening there. This will involve DNS load balancing via Amazon Route 53, and additionally moving projects to run on the DC that is nearest you on startup, rather than random. Right now all computers are in North America.
  • Mounting a folder of one project in another project, in a way that automatically fixes itself in case a machine goes down, etc. Imagine mounting the projects of all 50 students in your class, so you can easily assign and collect homework, etc.
  • Homework assignment and grading functionality with crowdsourcing of problem creation, and support for peer and manual grading.
  • BSD-licensed open source single-project version of SMC.
  • Commercial software support and instructions for how to install your own into SMC (e.g., Mathematica, Matlab, Magma, etc.)
  • ssh access into projects easily

Friday, April 25, 2014

The SageMathCloud Roadmap

Everything below is subject to change.

Implementation Goals

  • (by April 27) Major upgrades -- update everything to Ubuntu 14.04 and Sage-6.2. Also upgrade all packages in SMC, including Haproxy, nginx, stunnel, etc.

  • (by May 4) Streamline doc sync: top priority right now is to clean up sync, and eliminate bugs that show up when network is bad, many users, etc.

  • (by May 30) Snapshots:
    • more efficient way to browse snapshot history (timeline view)
    • browse snapshots of a single file (or directory) only
  • (by May 30) User-owned backups
    • way to download complete history of a project, i.e., the underlying bup/git repository with snapshot history.
    • way to update offline backup, getting only the changes since last download.
    • easy way to download all current files as a zip or tarball (without the snapshots).
  • (by June 30) Public content
    • Ability to make a read-only view of the project visible publicly on the internet. Only works after the account is at least n days old.
    • By default, users will have a "report spammer" button on each page. Proven 'good users' will have button removed. Any valid reported users will be permanently banned.
    • Only users with validated .edu accounts will be allowed to publish for starters. Maybe allow gmail.com at some point.
  • (by June 30) Fix all major issues (none major) that are listed on the github page: https://github.com/sagemath/cloud/issues

  • (by July 31) Group/social features:
    • Support for mounting directories between projects
    • Group management: combine a group of users and projects into a bigger entity:
      • a University course -- see feature list below
      • a research group: a collection of projects with a theme that connects them, where everybody has access to everybody else's projects
    • A feed that shows activity on all projects that users care about, with some ranking. Better notifications about chat messages and activity

Commercial Products

We plan four distinct products of the SMC project: increased quotas, enhanced university course support, license and support to run a private SMC cloud, supported open source BSD-licensed single-user version of SMC (and offline mode).

  • PRODUCT: Increase the quota for a specific project (launch by Aug 2014)
    • cpu cores
    • RAM
    • timeout
    • disk space
    • number of share mounts
  • Remarks:
    • There will be an option in the UI to change each of the above parameters that some project collabs see (maybe only owners initially).
    • Within moments of making a change it goes live and billing starts.
    • When the change is turned off, billing stops. When a project is not running it is not billed. (Obviously, we need to add a stop button for projects.)
    • There is a maximum amount that the user can pre-spend (e.g., $500 initially).
    • At the end of the month, the user is given a link to a Univ of Washington website and asked to pay a certain amount, and register there under the same email as they use with SMC.
    • When they pay, SMC receives an email and credits their account for the amount they pay.
    • There will also be a limit soon on the number of projects that can be associated with an account (e.g., 10); pay a small monthly to raise this.
  • PRODUCT: University course support (launch by Aug 2014 in time for Fall 2014 semester)

    • Free for the instructor and TA's
    • Each student pays $20 in exchange for:
      • one standard project (they can upgrade quotas as above), which TA and instructor are automatically collabs on
      • student is added as collaborator to a big shared project
      • in student's private project they get homework assignments (assigned, collected)
    • Instructor's project has all student projects as mounted shares
    • Instructor has a student data spreadsheet with student grades, project ids (links), etc.
    • Powerful modern tool for designing homework problems that can be automatically graded, with problems shared in a common pool, with ratings, and data about their usage.
    • A peer grading system for more advanced courses.
    • Tools to make manual grading more fun.
  • PRODUCT: License to run a private SMC cloud in a research lab, company, supercomputer, etc. (launch a BETA version by July 2014, with caveats about bugs).

    • base fee (based on organization size)
    • technical support fee
    • site visit support: install, run workshop/class
  • PRODUCT: Free BSD-licensed single-user account version of SMC (launch by December 2014)

    • a different way to do LaTeX editing, manage a group of IPython notebooks, use Sage worksheets, etc.
    • be included with Sage.
    • be included in many Linux distros
    • the doc synchronization code, local_hub, CoffeeScript client, terminal.
    • mostly Node.js application (with a little Python for Sage/IPython).
    • ability to sync with a cloud-hosted SMC project.
    • sell pre-configured (or just support) for a user to install standalone-SMC on some cloud host such as EC2 or Digital Ocean

Tuesday, April 15, 2014

SageMathCloud's new storage architecture

Keywords: ZFS, bup, rsync, Sage

SageMathCloud (SMC) is a browser-based hosted cloud computing environment for easily collaborating on Python programs, IPython notebooks, Sage worksheets and LaTeX documents. I spent the last four months wishing very much that fewer people would use SMC. Today that has changed, and this post explains some of the reasons why.

Consistency Versus Availability

Consistency and availability are competing requirements. It is trivial to keep the files in a SageMathCloud project consistent if we store them in exactly one place; however, when the machine that project is on goes down for any reason, the project stops working, and the users of the project are very unhappy. By making many copies of the files in a project, it's fairly easy to ensure that the project is always available, even if network switches in multiple data centers completely fail, etc. Unfortunately, if there are too many users and the synchronization itself puts too heavy a load on the overall system, then machines will fail more frequently, and though projects are available, files do not stay consistent and data is lost to the user (though still "out there" somewhere for me to find).

Horizontal scalability of file storage and availability of files are also competing requirements. If there are a few compute machines in one place, then they can all mount user files from one central file server. Unfortunately, this approach leads to horrible performance if instead the network is slow and has high latency; it also doesn't scale up to potentially millions of users. A benchmark I care about is downloading a Sage binary (630MB) and extracting it (creating over 70,000 files); I want this to take at most 3 minutes total, which is hard using a networked filesystem served over the general Internet between data centers. Instead, in SMC, we store the files for user projects on the compute machines themselves, which provides optimal speed. Moreover, we use a compressed filesystem, so in many cases read and write speeds are nearly twice as fast as they might be otherwise.

New Architecture of SageMathCloud

An SMC project with id project_id consists of two directories of files, replicated across several machines using rsync:
  1. The HOME directory: /projects/project_id
  2. A bup repository: /bup/bups/project_id
Users can also create files they don't care too much about in /scratch, which is a compressed and deduplicated ZFS filesystem. It is not backed up in any way, and is local to that compute.

The /projects directory is one single big ZFS filesystem, which is both lz4 compressed and deduplicated. ZFS compression is just plain awesome. ZFS deduplication is much more subtle, as deduplication is tricky to do right. Since data can be deleted at any time, one can't just use a bloom filter to very efficiently tell whether data is already known to the filesystem, and instead ZFS uses a much less memory efficient data structure. Nonetheless, deduplication works well in our situation, since the compute machines all have sufficient RAM (around 30-60GB), and the total data stored in /projects is well under 1TB. In fact, right now most compute machines have about 100GB stored in /projects.
The /bup/bups directory is also one single big ZFS filesystem; however, it is neither compressed nor deduplicated. It contains bup repositories, where bup is an awesome git-based backup tool written in Python that is designed for storing snapshots of potentially large collections of arbitrary files in a compressed and highly deduplicated way. Since the git pack format is already compressed and deduplicated, and bup itself is highly efficient at deduplication, we would gain almost nothing by using compression or deduplication directly on this ZFS filesystem. When bup deduplicates data, it does so using a sliding window through the file, unlike ZFS which simply breaks the file up into blocks, so bup does a much better job at deduplication. Right now, most compute machines have about 50GB stored in /bup/bups.
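The difference between block-based and sliding-window deduplication can be seen in a toy experiment. The sketch below is not bup's or ZFS's algorithm: a simple rolling sum stands in for bup's real rolling checksum, and the block size, window size, and mask are arbitrary small values chosen for the demonstration. It inserts one byte at the front of some data and counts how many chunks the two versions still have in common.

```python
import hashlib
import random

def fixed_blocks(data, size=64):
    """Block-style splitting: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def sliding_window_chunks(data, window=16, mask=63):
    """Cut after any position where the sum of the last `window` bytes,
    taken modulo mask+1, equals mask. Boundaries depend only on nearby content."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        if i >= window - 1 and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared(chunks_a, chunks_b):
    """Number of distinct chunks (by SHA-1) present in both lists."""
    digests_a = {hashlib.sha1(c).digest() for c in chunks_a}
    digests_b = {hashlib.sha1(c).digest() for c in chunks_b}
    return len(digests_a & digests_b)

rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(20000))
edited = b"X" + data   # one byte inserted at the very beginning

fixed_shared = shared(fixed_blocks(data), fixed_blocks(edited))
sliding_shared = shared(sliding_window_chunks(data), sliding_window_chunks(edited))
print(fixed_shared, sliding_shared)
```

With fixed blocks, the inserted byte shifts every later block, so almost nothing matches; with content-defined boundaries, the chunks realign after the first boundary and nearly all of them are shared.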

When somebody actively uses a project, the "important" working files are snapshotted about once every two minutes. These snapshots are done using bup and stored in /bup/bups/project_id, as mentioned above. After a snapshot is successfully created, the files in the working directory and in the bup repository are copied via rsync to each replica node. The users of the project do not have direct access to /bup/bups/project_id, since it is of vital importance that these snapshots cannot be corrupted or deleted, e.g., if you are sharing a project with a fat fingered colleague, you want peace of mind that even if they mess up all your files, you can easily get them back. However, all snapshots are mounted at /projects/project_id/.snapshots and browseable by the user; this uses bup's FUSE filesystem support, enhanced with some patches I wrote to support file permissions, sizes, change times, etc. Incidentally, the bup snapshots have no impact on the user's disk quota.
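The snapshot-then-replicate sequence described above might look roughly like the following. This is a sketch under stated assumptions, not the commands SMC actually runs: the branch name master is inferred from the ~/.snapshots/master path mentioned elsewhere on this blog, the rsync flags -a and --delete are a guess at a reasonable mirroring setup, and the replica host names are made up. The function only builds the command lists; it does not execute anything.

```python
def snapshot_commands(project_id, replicas):
    """Build (but do not run) the commands for one snapshot and replication pass."""
    home = f"/projects/{project_id}"   # working files, from the post
    repo = f"/bup/bups/{project_id}"   # bup repository, from the post
    commands = [
        # Index the working files, then save them as a new snapshot.
        ["bup", "-d", repo, "index", home],
        ["bup", "-d", repo, "save", "-n", "master", home],  # branch name assumed
    ]
    # After the snapshot succeeds, mirror both directories to every replica.
    for host in replicas:
        for path in (home, repo):
            commands.append(
                ["rsync", "-a", "--delete", path + "/", f"{host}:{path}/"]
            )
    return commands

for cmd in snapshot_commands("PROJECT_ID", ["replica1", "replica2"]):
    print(" ".join(cmd))
```

Each command list could be passed to subprocess.run on a machine that has bup and rsync installed; ordering matters, since replication should only start once the snapshot has been written.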

We also back up all of the bup archives (and the database nodes) to a single large bup archive, which we regularly back up offsite on encrypted USB drives. Right now, with nearly 50,000 projects, the total size of this large bup archive is under 250GB (!), and we can use it to efficiently recover any particular version of any file in any project. The size is relatively small due to the excellent deduplication and compression that bup provides.

In addition to the bup snapshots, we also create periodic snapshots of the two ZFS filesystems mentioned above... just in case. Old snapshots are regularly deleted. These are accessible to users if they search around enough with the command line, but are not consistent between different hosts of the project, hence using them is not encouraged. This ensures that even if the whole replication/bup system were to somehow mess up a project, I can still recover everything exactly as it was before the problem happened; so far there haven't been any reports of problems.

Capacity

Right now there are about 6000 unique weekly users of SageMathCloud and often about 300-400 simultaneous users, and there are nearly 50,000 distinct projects. Our machines are at about 20% disk space capacity, and most of them can easily be expanded by roughly a factor of 10 (from 1TB to 12TB). Similarly, disk space for our Google Compute Engine nodes costs $0.04 per GB per month. So space-wise we could scale up by a factor of 100 without too much trouble. The CPU load is at about 10% as I write this, during a busy afternoon with 363 clients connected very actively modifying 89 projects. The architecture that we have built could scale up to a million users, if only they would come our way...