BupI spent a lot of time building a snapshot system for user projects on top of bup. Bup is a highly efficient de-duplicating compressed backup system built on top of git; unlike other approaches, you can store arbitrary data, huge files, etc.
I looked at many open source options for making efficient de-duplicated distributed snapshots, and I think bup is overall the best, especially because the source code is readable. Right now https://cloud.sagemath.com makes several thousand bup snapshots every day, and it has practically saved people many, many hours in potentially lost work (due to them accidentally deleting or corrupting files).
You can access these snapshots by clicking on the camera icon on the right side of the file listing page.
Some lessons learned when implementing the snapshot system
- Avoid creating a large number of branches/commits -- creating an almost-empty repo, but with say 500 branches,
even with very little in them, makes things painfully slow, e.g., due to an enormous number of separate
calls to git. When users interactively get directory listings, it should take at most about 1 second to
get a listing, or they will be annoyed. I made some possibly-hackish optimization -- mainly caching --
to offset this issue, which are here in case anyone is interested: https://github.com/williamstein/bup
(I think they are too hackish to be included in bup, but anybody is welcome to them.)
- Run a regular test about how long it takes to access the file listing in the latest commit, and if it gets above a threshhold, create a new bup repo. So in fact the bup backup deamons really manage a sequence of bup repos. There are a bunch of these daemons running on different computers, and it was critical to implement locking, since in my experience bad things happen if you try to backup an account using two different bups at the same time. Right now, typically a bup repo will have about 2000 commits before I switch to another one.
- When starting a commit, I wrote code to save information about the current state, so that everything could be rolled back in case an error occurs, due to files moving, network issues, the snapshot being massive due to a nefarious user, power loss, etc. This was critical to avoid the bup repo getting corrupted, and hence broken.
- In the end, I stopped using branches, due to complexity and inefficiency, and just make all the commits in the same branch. I keep track of what is what in a separate database. Also, when making a snapshot, I record the changed files (as output by the command mentioned above) in the database with the commit, since this information can be really useful, and is impossible to get out of my backups, due to using a single branch, the bup archives being on multiple computers, and also there being multiple bup archives on each computer. NOTE: I've been recording this information for cloud.sagemath for months, but it is not yet exposed in the user interface, but will be soon.