This semester, as part of TAing 167/9, we are using Mercurial for SCM. Itay liked Hg so much that he started using it for nearly everything on his computer, and came up with a good reason for doing so. I am not organized enough to put all of my files into source control (though heaven knows I probably should), but I do like the idea of using Mercurial to keep my two copies of research code synchronized.
Since I can't compile and run said code on my home machine for lack of expensive libraries, and I'd rather not code in the department for a general lack of productivity, using Hg seemed like a better idea than just scping code around. This presents a problem on Mac OS X, though. For reasons I haven't yet discovered (I have not yet looked at the hg code), Mercurial, when cloning remotely from Linux, seems to ignore the PATH variable and looks for hg in /usr/bin, which is not where Mercurial resides for a whole lot of OS X users. Bad in terms of design, and probably a fairly simple fix, but hardly the end of the world, seeing as a simple ln -s solves the problem for now. Either way, once I was done with this I was ready to commit all my past work to a repository, which is when I discovered a fairly large roadblock that has little to do with Mercurial, and this got me thinking about a few other backup strategies I had been playing with in my mind...
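The symlink is not the only workaround. Mercurial's clone, pull and push commands accept a --remotecmd option that names the hg executable to run on the remote side, which sidesteps the remote PATH entirely. Below is a minimal sketch in Python that only builds the command line and does not run it; the host name, repository path and the /opt/local/bin/hg location are made-up placeholders, not details from my setup.

```python
import shlex


def clone_command(user_host, repo_path, remote_hg, dest):
    """Build an `hg clone` command that names the remote hg binary explicitly.

    All four arguments are hypothetical placeholders supplied by the caller.
    """
    url = f"ssh://{user_host}/{repo_path}"
    return ["hg", "clone", "--remotecmd", remote_hg, url, dest]


# Hypothetical values: a Mac reachable over ssh with hg installed outside /usr/bin.
cmd = clone_command("me@my-mac.example", "research/code", "/opt/local/bin/hg", "code")
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run would perform the actual clone. The same setting can also be made permanent through the remotecmd entry in the [ui] section of an hgrc file.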
At some point I had gone through the effort of creating a few hundred thousand test cases for the program I am writing. They are all pretty small (around 1K each), but there are something like 180000 of them. For reasons that involve everything from a slow hard disk to perhaps the general layout of the HFS+ filesystem, doing much of anything with these files, even lsing them, takes a while. Creating hashes and metadata for them probably did not make Hg very happy, and at some point in the middle of trying to commit all these changes I was struck by the realization that just listing every one of these 180000 files in a log file would take Hg a while.
Of course this realization hit me after the commit had already been running for a while, and it had probably been churning away at creating and writing all those hashes and metadata. Either way, I ended up trying to cancel the commit by hitting ctrl-c. For future reference: don't hit ctrl-c too many times when canceling a commit. Yes, it takes a while to respond to interrupts when processing a commit this big, but it really is trying to roll back its log and leave your system in a known state. Lots of ctrl-cs might save you time with canceling, but you will spend as much time, if not more, biting your nails as you run hg recover. Anyhow, this worked out, but it got me thinking...
Now obviously the sane way to deal with so many files is to tar them, so that you usually have to deal with no more than one file. In my case I need to use them individually, but even that is a poor excuse. Most people don't really end up with 180000 files in a single directory; bad structuring and the difficulty of finding anything usually prevent it.
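Tarring does not have to rule out individual access. A minimal sketch, assuming the test cases are small enough to hold in memory and using Python's standard tarfile module: it packs many tiny cases into one archive and then reads a single case back without unpacking the rest. The case names and contents are invented for illustration.

```python
import io
import tarfile


def bundle(cases):
    """Pack a {name: bytes} mapping into one uncompressed tar archive (as bytes)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in cases.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)          # size must be set before addfile
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_case(archive, name):
    """Read one member out of the archive without extracting the others."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return tar.extractfile(name).read()


# Invented stand-ins for the real test cases.
cases = {f"case_{i:06d}.txt": f"input {i}\n".encode() for i in range(1000)}
archive = bundle(cases)
print(len(cases), "cases in one archive of", len(archive), "bytes")
print(read_case(archive, "case_000042.txt"))
```

With an archive on disk, passing a file name to tarfile.open instead of fileobj works the same way, and the version control system then sees one file instead of 180000.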
However, and this is the fun part, there is at least one situation where the specs of a particular system make it hard to stay under such a limit. Amazon's S3 is described as a scalable, reliable, low-latency storage solution built on the same storage infrastructure Amazon itself uses. More importantly, S3 is fairly well priced, and I am running out of space (though there are all sorts of debates there: is 15 cents per GB per month really less expensive than spending even 300 dollars on a terabyte of storage?). While working on something for my father I did actually consider using S3 as a place to back stuff up, or even as backing storage for everything I have on my disk. In essence, you could get your disk to act as a really big cache for S3. Since networks aren't that fast, I think it is safe to assume that real disks have much lower latency than S3; besides, I don't like diskless nodes.
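The pricing question is easy to put numbers on. A rough sketch using only the two figures quoted above (15 cents per GB per month, 300 dollars for a terabyte); it assumes a decimal terabyte of 1000 GB and ignores transfer and request charges, electricity, and the fact that a single disk is not a redundant store.

```python
S3_DOLLARS_PER_GB_MONTH = 0.15   # figure quoted in the text
DISK_DOLLARS = 300.0             # figure quoted in the text, for one terabyte
DISK_GB = 1000                   # assumption: decimal terabyte


def s3_monthly_cost(stored_gb):
    """Monthly S3 storage bill in dollars for stored_gb gigabytes."""
    return stored_gb * S3_DOLLARS_PER_GB_MONTH


def break_even_months(stored_gb):
    """Months of S3 storage that cost as much as buying the disk outright."""
    return DISK_DOLLARS / s3_monthly_cost(stored_gb)


for gb in (100, 500, DISK_GB):
    print(f"{gb:5d} GB: ${s3_monthly_cost(gb):7.2f}/month, "
          f"disk pays for itself after {break_even_months(gb):5.1f} months")
```

Under these assumptions a full terabyte on S3 costs about as much as the disk every two months, while 100 GB takes roughly twenty months to reach the same total, so the answer depends heavily on how much is actually stored.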
Security is an obvious concern here, but there is already a single company controlling vast amounts of my information, and of the information of a vast majority of the people I know, and swapping one company for another is not the worst thing. More importantly, people are already working on the security problem.
However, and this is where the entire Mercurial discussion ties in, S3 has a flat namespace. You could perhaps try to get more than one bucket, but buckets are fairly hard to get, and a bucket merely contains keys, each pointing to a single object of up to 5 GB. So let's see: one could create a real file system on top of these buckets and use these 0 KB to 5 GB objects as actual file system blocks. However, for any reasonably recent file system that does versioning, you'd rather rapidly reach a point where you are using enough blocks that listing everything in your bucket takes a long time, just in terms of transferring data over the network. Fortunately, if you designed this intelligently, listing blocks should be a fairly rare operation, perhaps something that comes into play only when you lose your cache (disks are unreliable, things happen).
However, I don't actually know how S3 stores this index or how access to objects maps out, and I can't really find out. I am just not sure how Amazon handles an overly full bucket. It doesn't seem like a hard place to get to (hourly backups of entire 5 GB files might do it rather fast), and I am sort of curious about how this works out. Of course, seeing as MySQL now has an S3 backend, other people have already made various backup things for S3, and networks are still rare to come across when traveling, I am not really expecting to see much of this S3-backed-disk idea come about anytime soon. If someone's working on something similar, it'd be nice to know.
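To make the disk-as-cache idea concrete, here is a toy sketch. It does not talk to S3 at all: FlatStore is an in-memory stand-in for a bucket, and every class, method and key format in it is a hypothetical choice made for illustration. The point is the key naming. If block keys share a per-file prefix, rebuilding after a lost cache only needs to list keys under that prefix rather than the whole bucket (the real S3 listing call does accept a prefix and returns results in pages, though how it performs on a very full bucket is exactly the open question above).

```python
class FlatStore:
    """In-memory stand-in for a bucket: a flat map from key to bytes."""

    def __init__(self):
        self._objects = {}
        self.gets = 0                      # counts simulated network reads

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        self.gets += 1
        return self._objects[key]

    def list(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))


def block_key(file_id, index):
    """Hypothetical key scheme: all blocks of one file share a prefix."""
    return f"blocks/{file_id}/{index:08d}"


class CachedBlocks:
    """Write-through cache; the dict plays the role of the local disk."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def write(self, file_id, index, data):
        key = block_key(file_id, index)
        self.store.put(key, data)          # remote copy is always current
        self.cache[key] = data

    def read(self, file_id, index):
        key = block_key(file_id, index)
        if key not in self.cache:          # miss: fetch once, then keep locally
            self.cache[key] = self.store.get(key)
        return self.cache[key]

    def rebuild_index(self, file_id):
        """After losing the cache, list only this file's block keys."""
        return self.store.list(prefix=f"blocks/{file_id}/")


store = FlatStore()
fs = CachedBlocks(store)
fs.write("thesis", 0, b"first block")
fs.write("thesis", 1, b"second block")
fs.write("photos", 0, b"unrelated")
fs.cache.clear()                           # simulate a dead disk
print(fs.rebuild_index("thesis"))
print(fs.read("thesis", 1), "remote reads:", store.gets)
```

A real version would also need eviction, write batching and some answer to the versioning growth described above; none of that is attempted here.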
Panda