Archive for category Notes

Berserker is now available on NPM Registry

I had posted a while back about a web-based front-end to aria2c that I started building, primarily as an exercise in learning Node.js. While it has been functional and available on GitHub for some time, I’m pleased to announce that it is now available directly through the NPM Registry.

Happy downloading!


Getting Serious with JavaScript

My recent decision to teach myself Node.js has turned out to be a good one. It runs on Google Chrome’s V8 JS engine and leverages JavaScript’s event model to provide non-blocking IO, allowing for fast, responsive applications that scale extremely well, even on low-end hardware.
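To give an idea of what that looks like in practice, here is a minimal sketch of my own (illustrative only, not from any real project) using nothing but Node’s built-in http and fs modules; the file name is made up:

// Minimal sketch of Node's non-blocking style (illustrative; greeting.txt is hypothetical).
var http = require('http');
var fs = require('fs');

http.createServer(function (req, res) {
  // The file read is asynchronous: Node registers a callback and keeps
  // serving other requests instead of blocking on disk I/O.
  fs.readFile('greeting.txt', 'utf8', function (err, data) {
    if (err) {
      res.writeHead(500);
      res.end('Could not read greeting.txt');
      return;
    }
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end(data);
  });
}).listen(3000);

console.log('Listening on http://localhost:3000');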

Having been a web developer for a few years now, I was already familiar with basic JavaScript, so the learning curve wasn’t steepened much by having to learn a whole new language. I did have to make a quantum leap in my outlook towards JS as a serious programming language, though. What once was the silent underdog, used for writing a few onClick event handlers and sprinkling AJAX calls all over web pages, has now come into its own. Server-side JS programming requires a much deeper and more thorough understanding of the language and its associated design and programming paradigms than an alert(‘Your password must contain at least 6 numbers and 4 Egyptian hieroglyphs’) ever called for. With widespread adoption in both server-side and client-side programming, JavaScript is now a truly isomorphic language.

Needless to say, my JS knowledge base required a few upgrades before I could put together anything smarter than a ‘Hello World’ responder. I found said upgrades, among other excellent sources, here, here and here, along with the JS object graph learning trail (part 1, part 2 and part 3). In fact, most of How to Node is a must-read if you’re planning on serious Node programming.

But this post is not just about my experience learning server-side JS. In the process of upgrading my overall web development repertoire, I’ve had to undergo quite a steep ramp-up on client-side JS technologies as well…

Not least of which was AngularJS. I ran into this fellow while looking for a good client-side toolkit for building RIAs, especially one best suited for single-page applications. Back then, my feet were firmly planted on Java turf (in spite of it showing obvious signs of age), and I was looking for a client library that would work well with Java on the server side. I did a few rounds of GWT, ZK, Vaadin and kin until I realized that all of these frameworks came with a significant learning curve, tedious integration points needed to cooperate with custom Spring stacks, limited and buggy IDE support (among open-source IDEs, including STS and the IDEA Community Edition; I don’t know about IDEA Pro), and no small amount of code bloat on the server side in spite of auto-magic scaffolding and suchlike voodoo. Servers would have to carry out a lot of deeply nested processing to build the views for presenting to the client (especially in frameworks that took the ‘pain’ of writing client-side JavaScript away from developers by auto-generating it from Java code).

The biggest problem, still, was the lack of flexibility, in spite of all the automation and scaffolding (or perhaps because of it), in building a custom UI to one’s exact liking (especially with dynamic scaffolding, where you lose significant control over the finally generated DOM). Including client-side styling libraries like Bootstrap would require jumping through increasingly tight hoops to make the auto-generated templates adhere to the specific DOM and CSS requirements dictated by such libraries.

I don’t remember if it was by chance or deliberate intent (especially since I was already aware of jQuery UI) that my search led me to the discovery of pure client-side presentation technologies. The first one I tried (and was amply impressed by) was Knockout. This was a real ‘Aha!’ moment for me. One go at their live tutorial was enough to convince me of the potential of JavaScript to completely take over the client-side presentation/rendering business, greatly simplifying things on the server side, which now needs to bother about little more (w.r.t. presentation) than furnishing the HTML template and the JSON data upon which the JS framework can go to work.
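To give a flavour of what won me over, here is a minimal Knockout-style sketch of my own (the view model and the data-bind markup it assumes are hypothetical, not taken from their tutorial):

// Minimal Knockout-style sketch (hypothetical example).
// Assumes markup like: <span data-bind="text: fullName"></span>
// and inputs bound with data-bind="value: firstName" / data-bind="value: lastName".
function PersonViewModel() {
  this.firstName = ko.observable('Ada');
  this.lastName = ko.observable('Lovelace');
  // Recomputed automatically whenever either observable changes;
  // the DOM updates itself with no manual rendering code.
  this.fullName = ko.computed(function () {
    return this.firstName() + ' ' + this.lastName();
  }, this);
}

ko.applyBindings(new PersonViewModel());

All the server had to do here was hand over the template and the JSON that seeds the view model.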

Having discovered this entirely new (to me) method of structuring webapps, which basically addresses all the pain points mentioned above, I went about hunting for other similar offerings in the browser space; but by now I was convinced of JavaScript’s ability to be much more than just an errand boy.

As of today, I’m still delving deeper into the fascinating world of server-side JS runtimes, the accompanying middleware (Connect and Express being among the biggest names here), while also being repeatedly amazed by the power of modern client-side frameworks. The JS landscape is already incredibly expansive, and continues to grow at a frightful rate, as each day heralds the launch of several new libraries, frameworks and tools that make JS programming all the more exciting and enlightening.
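As a taste of the middleware style that Connect and Express popularized, here is a minimal sketch of my own (the route and middleware are illustrative, not from any real project):

// Minimal Express sketch showing the middleware chain (illustrative).
var express = require('express');
var app = express();

// Middleware: runs for every request before the route handlers.
app.use(function (req, res, next) {
  console.log(req.method + ' ' + req.url);
  next();
});

// Route handler returning JSON for a client-side framework to render.
app.get('/api/greeting', function (req, res) {
  res.json({ message: 'Hello from the service layer' });
});

app.listen(3000);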

My journey of exploration has already started bearing fruit, and has empowered me to give back to the wonderful open source community (JS or otherwise) that I have received so much from. My current efforts are focused on building a JS-based e-commerce platform, and I am confident I have selected the right platform for the necessary pace of development and superior performance that I demand of the product-to-be.


Which Open Source License should You Choose?

When I was looking for a suitable OSS license under which to release my download manager, I came across this very useful post by Ed Burnette. I converted his post, describing the decision process, into a (somewhat simplified) flowchart.

Open Source License Flowchart

As already mentioned in the original post, this information is not a substitute for professional legal advice.


An advanced web-based frontend for Aria2-JSONRPC

I’m building a full-featured web-based frontend to the powerful CLI download manager Aria2. It is still a work in progress but already supports HTTP(S)/FTP/Magnet downloads.

Screenshot: The ‘Downloads’ Page

It would be great to find some early adopters who would give it a try and share their feedback.

UPDATE: Berserker is now available for download through the NPM Registry!


What’s Going Wrong with Your Software Project?

I will take a detour from my usual trend of writing purely about technology, and instead reflect on how and why so many software projects get into a terribly messy state of affairs, often bad enough to be terminated.

I’m trying to understand why, in spite of there being such a wealth of information, training, tools and techniques available on how to run a smooth ship, so little is actually implemented or consistently exercised in most places. I will limit my focus to a subset of problems for which I may be able to hint at a solution through a process of selective elimination of complexities and implementation overheads. Technology can obviously never replace a competent human for addressing personnel-related matters like conflicts, assessment, morale, etc. (although it can still be used to gather crucial data points that help measure individual and team performance), and I am not trying to shoehorn it into such a role.

Except, perhaps in such situations…

Every project, small or big, starts out with a lot of energy, zeal and a headstrong optimism among the stakeholders regarding the successful and timely delivery of end products. Assuming there is no deliberate malice or vested interest in failure, no one in a decision-making role would set the ball rolling if they predicted near-certain collapse. How then, do so many projects spiral into an uncontrollable vortex of botched deadlines, leading to increased pressure and reduced morale, leading to more botched deadlines and ultimately, attrition, blame games, broken deliverables and even total shutdown? A bit of googling would provide a flood of information pointing to issues with personnel, techniques, tools and circumstance. I will focus on just a few, outlining how non-invasive, data-driven solutions can help in addressing some of them.

Poor Visibility

An often-observed chain of events in many projects I have worked on goes something like this:

  1. Project Manager asks for overall status from team.
  2. Team reports completion status and ETAs based on seat-of-the-pants guesstimates, often intentionally hiding trouble spots and backlogs.
  3. Project Manager gets an overall picture that’s rosier than what the hard ground realities would depict, and based on this information, assigns further tasks and deadlines to team, often too aggressive to be realistically achievable.
  4. Team accepts tasks and deadlines without question, mostly for the same reason they concealed the bad news in the first place.
  5. Deadlines fly by, most tasks are in a pitiful state of disarray, and no one can satisfactorily answer why, or what caused this mess in the first place.

Sound familiar? A point to note here is that this can happen even to the most competent and dedicated teams, since effort estimation and progress tracking over long periods of time are continuous, non-trivial tasks, and not always diligently performed. The delay or failure in pushing out today’s deliverables may have been caused by an insidious scope creep in the recent past from another project, eating into the time and resources of the project at hand. However, people can’t always connect the dots at a moment’s notice. The data is often just too thinly spread out, too deeply nested in several layers of loosely related or unrelated information, or simply too old and foggy, and hence not easily visible or traceable.

Another well-known tendency in any chain of command is for news to get more and more inaccurate and imprecise the further it percolates up the org pyramid. This is especially true when the most granular data is hard to get at or comprehend, and people at higher levels rely instead on summary information prepared by those immediately below them. Typically, since no one wants to share bad news too willingly, only the good bits are polished, often exaggerated, and passed on for consumption up the corporate ladder, while the bad ones are quietly swept under the rug in the hope that no one will go sniffing about.

I have personally observed this phenomenon even in very small teams with greatly flattened hierarchies. Human psychology is obviously playing a very prominent role here, and one needs to allow for the fact that the mere presence of distorted data does not necessarily imply deliberate falsification.

So what can one do to improve the overall visibility and predictability in attaining project milestones? At the very least, we need to make atomic data accessible to all layers in the management stack. Raw data, when coupled with advanced analytical tools, can reduce or even totally eliminate the chances of error buildup. Reporting and analysis tools abound, but it all has to begin with collecting the raw data from deep down in the trenches.

However, when it comes to collecting data at the grassroots level, one cannot reasonably expect all developers, designers and testers to continually push out micro-updates on the progress of their work items. It is disruptive, inefficient and basically extremely annoying to have to do so on a regular basis. Processes and tools that work in the background to automate the collection of atomic progress data, therefore, are key to enabling optimum visibility across the organization for any software project.
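To illustrate the kind of background collection I have in mind, here is a toy sketch of my own: it counts commits per author over the last day straight from git, the sort of atomic data point a tracker could ingest without anyone filling in a form. The tracker endpoint itself is left out, since that part is hypothetical:

// Hypothetical sketch: harvest per-author commit counts from git for the
// last 24 hours, the kind of raw progress data a tracker could ingest automatically.
var exec = require('child_process').exec;

exec('git log --since="24 hours ago" --pretty=format:%an', function (err, stdout) {
  if (err) { return console.error(err); }
  var counts = {};
  stdout.split('\n').filter(Boolean).forEach(function (author) {
    counts[author] = (counts[author] || 0) + 1;
  });
  // In a real setup this would be POSTed to the project tracker's API.
  console.log(counts);
});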

Real and Perceived Overheads in Supervision / Management

I am strongly opposed to micro-management: exercising too much control over how a developer works has been shown in many studies to be counter-productive, and I can corroborate that from my own experience. However, I don’t lobby for a total absence of management either. Rather, during the life of a running project, I think management needs to perform two jobs in particular, with minimal intrusion and high precision:

  1. Smartly allocate tasks to the resource pool based on availability and effective pairing with core competencies.
  2. Identify tasks that are lagging behind schedule, and selectively focus on them to facilitate speedy completion.

Or they could just carry on doing this…

Both these duties can be carried out most effectively when there is adequate, up-to-date data available from which to compute resource availability and the progress of individual items. A manager’s task can be further assisted by automated planning software that uses constraint-programming algorithms to suggest time and resource allocation strategies. Driving this automation, again, is the raw data that needs to be captured at every stage of the project life-cycle.

Real and Perceived Overheads in Using Project Tracking Tools

I’ve often noticed, especially in small companies with small teams and few, if any, rigidly enforced processes, a distinct tendency for issue trackers and sophisticated project tracking tools to be under-utilized. Very often, at the time of project kickoff, a brand new ‘tracker’ is created on a spreadsheet, and a quick RTFM is dispensed to the team before it is let out into the wild.

This ‘tracker’ is diligently used and updated for perhaps the first 2-3 weeks. Following this initial phase of rigorous adherence is a marked decline in its usage. Limited usability, constant maintenance of fickle formulas, lack of scalability and lack of integration with external tools like issue trackers, version control systems and continuous integration systems are just a few reasons why most developers quickly learn to loathe a spreadsheet-based project tracker. For them it is no more than a gigantic, metric butt-load of (barely justifiable) double data entry. Data is already being constantly captured in various centralized as well as local development tools (VCS, CI, Bug Trackers, IDE, etc.). This data can and should be automatically fed into the project tracker so that it remains up-to-date without much manual input. Developers would much prefer to work on a project that tracks itself.

Another common ailment that many non-trivial projects suffer from is the reluctance of non-technical stakeholders, often business/end users, to formally enter their requirements, issues and feedback into an issue tracker, and the confusion that arises from the disconnect. Developers are bad at remembering the minute details of feedback and change requests. Email can work only for the simplest projects, with no more than 5-6 people involved in all. Input often comes in through phone calls, instant messages, SMS and direct conversation. As the project grows, it becomes cumbersome and error-prone to organize and track this barrage of information in an overly simplistic tracking tool.

One can always argue that the end users should not have to learn a new and complicated system. However, the core issue here is that most project and issue tracking systems lack a simple, intuitive user interface, geared towards simplifying the chore of submitting feedback. A system with such an interface in place would be easier to pitch to the layman, and thus enjoy much better adoption. Whenever possible, this input mechanism should be very tightly integrated with the feedback channels offered by the end product. However, an easy interface can only partially lower the barrier to adoption of tracker usage. The rest of the data must be captured, as far as possible, directly from their sources. Integrating a tracker with these channels can be a challenging task though.

Lack of Continuous Integration and Automated Build Systems

Continuous integration systems and automated build and health-monitoring systems are usually set up only for large projects with sizable teams, in medium to large enterprise-scale operations. These tools ensure that mission-critical software goes through a good deal of testing and quality checks before it is shipped out. Smaller projects do not have such elaborate setups in place, mostly due to time and budget constraints, as well as a (misguided) belief that they can manage without them.

Truth is, no project lasting longer than a few months and involving more than 2-3 people can be effectively monitored in the long run for new bugs or build breakage in borderline test cases without some process facilitated by a continuous build system. Small projects, in spite of their relative simplicity, are still just as mission-critical to a small company as a big project is to a big company. Hence, it is a mistake to overlook the importance of putting these systems in place.

Summary

Overall, I have not proposed any radically new idea or magic cure to heal an ailing project. I am merely driving at the point that although the answers are out there for solving most problems of project management, they often exist in mutual isolation. The need of the hour is to bring them all under one banner and provide an integrated solution that, while being comprehensive and powerful enough to suit the needs of large and complex projects, is also affordable and easy to setup and use, so that smaller projects can reap the benefits of adopting them as well.


The ideal full-stack webapp framework?

I have been exploring several web development platforms for quite a few months now. It is not that there is a shortage of great frameworks out there (I am not averse to learning a new language in order to use a good framework), and I did play around with a few really good ones, but I have a few stringent conditions that I want the framework to satisfy:

Multi-tenancy

Since my goal is to build an application that provides hosted services to multiple organizations, the framework must be one that treats multi-tenancy as a first-class feature. There’s always scope for some heated debate on whether multi-tenancy is the best approach, especially when it comes to isolating data between multiple clients. One such discussion, which I found quite informative, is this.

Multi-tenancy is definitely not the best solution for all usage scenarios; one could argue that multiple single tenant databases are easier to scale out horizontally. However, scaling out is not entirely impossible with the multi-tenant model, and it does save me certain overheads like multiple-maintenance of common configuration data.

My reasons for opting for multi-tenancy are the lowered upfront and recurring infrastructure costs compared to running single-tenant-per-db solutions, and an easier maintenance/upgrade path. However, since the job of isolating data between clients is managed exclusively at the application level, that implementation has to be absolutely water-tight.
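To make the point concrete, here is a rough sketch of my own (the ./db module and the table names are hypothetical stand-ins for the real data layer) of how that isolation is typically enforced at the application level: resolve the tenant once per request, then scope every query by it.

// Rough sketch of application-level tenant isolation (illustrative names).
var express = require('express');
var db = require('./db'); // hypothetical module exposing query(sql, params, callback)

var app = express();

// Resolve the tenant from the request's host header,
// e.g. acme.example.com -> 'acme'.
app.use(function (req, res, next) {
  req.tenantId = req.headers.host.split('.')[0];
  next();
});

app.get('/invoices', function (req, res, next) {
  // Every query is scoped by tenant_id; this discipline has to be
  // absolutely water-tight for the isolation to hold.
  db.query('SELECT * FROM invoices WHERE tenant_id = ?', [req.tenantId],
    function (err, rows) {
      if (err) { return next(err); }
      res.json(rows);
    });
});

app.listen(3000);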

Native Application State

‘Shared-nothing’ platforms, like PHP, do not have a built-in application state. Once again, it is ultimately a matter of opinion as to whether or not this is a good thing, but I personally prefer systems where the bootstrap process ideally takes place just once in the lifetime of the application.

Devoid of a native provision for long-lived objects, a stateless platform has to bootstrap the entire framework for every single request it processes. This is because all objects, class definitions, and compiled bytecode in general are restricted to the scope of the request itself. While this does make thread-safe programming a no-brainer, it incurs a severe overhead in having to rebuild the object graph and other in-memory data structures for each request (even those whose data are request-independent). No wonder it performs poorly compared to a platform like Java, where an object, once loaded into memory (even when triggered by a request), can legitimately outlive the request itself, saving time on future requests, since they can reuse the already loaded object.

The lack of an application state can be offset by using opcode caches like APC (for PHP), which can cache compiled bytecode, and even objects, across multiple requests (note that doing this essentially violates the shared-nothing principle, one of the fundamental tenets of PHP). Memcache-based solutions can also be used as an alternative to, or in conjunction with, APC. However, these solutions are not built into PHP, and thus require additional modules and/or coding in order to use (which also means there is additional code to execute). Expiring externally cached objects is also a non-trivial issue, since a separate garbage collector must be designed for that. At the end of the day, nothing can beat the speed of directly accessible, in-process-memory caching (with no protocol overheads) that native application state offers. Here’s an interesting Q&A with David Strauss, creator of Pressflow (essentially Drupal on steroids). Just the following excerpt from one of his answers should drive home the point:

Because the overhead is largely on the PHP side, Pressflow is exploring ways to accelerate common functions by offloading select parts of core to Java (which blows away PHP + APC on a modern Java VM) and performing expensive page assembly and caching operations with systems like Varnish’s ESI and nginx’s SSI.

No prizes for guessing what gives Java this performance edge ;-). Even a simple PHP application needs help from external caching and other auxiliary mechanisms in order to satisfactorily serve anything more than a handful of requests per second.
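By way of contrast, here is a minimal sketch of my own (illustrative only) of what native application state looks like on Node: the configuration is parsed exactly once, when the module is first loaded, and every subsequent request reuses the same in-memory object.

// Minimal sketch of in-process application state in Node (illustrative;
// config.json is a made-up file). The bootstrap read happens once.
var http = require('http');
var fs = require('fs');

var config = JSON.parse(fs.readFileSync('config.json', 'utf8')); // loaded once at startup

http.createServer(function (req, res) {
  // No per-request re-parsing, no external cache round trip.
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ appName: config.appName }));
}).listen(3000);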

Independently accessible service layer

Say we have an application up and running, and it needs to be accessible through multiple devices and over multiple channels like HTML, REST/SOAP, RSS and what not. Most platforms come pre-packaged with a scaffolding system for building the presentation layer in HTML by default. This is not a bad thing in general, except when I’m accessing the app from something other than a web browser, such as a mobile app (with screens built in), or from another webapp. In cases like this, I would like not even the slightest overhead to be incurred in loading and building any part of a presentation layer that is not required for serving these requests.

This is possible only when the framework has been designed from the ground up with ‘headless’ operation in mind. Basically, this translates to a totally detachable service layer that can be invoked independently of the default presentation system.
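In code, that boils down to something like this hypothetical sketch: the service module knows nothing about HTTP, and any channel (an HTML controller, a REST endpoint, an RSS builder, another Node process) wraps it with a thin adapter.

// services/orders.js -- hypothetical module with pure business logic:
// no req/res, no templates, no HTML.
exports.getOrder = function (id, callback) {
  callback(null, { id: id, status: 'shipped' });
};

// routes/orders.js -- one of many possible thin adapters over the same service.
var orders = require('../services/orders');
module.exports = function (app) {
  app.get('/api/orders/:id', function (req, res, next) {
    orders.getOrder(req.params.id, function (err, order) {
      if (err) { return next(err); }
      res.json(order); // JSON out; no presentation layer loaded or built
    });
  });
};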

Lightweight domain objects

I’ve come across a few excellent frameworks that do something I find really strange, and the reason for it escapes me. Their domain modelling paradigm dictates that all business logic pertaining to a specific domain model be contained in the model itself. Really?! In cases where one is building large lists of objects of a particular ‘heavyweight’ domain class, this embedded business logic simply bloats memory usage. Now, I get the part about static methods and variables (before the slingshots come out 😉 ), which are instantiated just once per class and not per object, but static members were designed with different architectural goals in mind, not specifically as a memory-saving construct. Hence they do help reduce the overhead somewhat, but not by much (every object still needs to maintain pointers to reference the static members).

Another problem: where do you put logic whose concern spans multiple domain classes, or has nothing to do with domain classes at all? A never-ending source of confusion, that.

I would rather go for a design which treats domain objects as simple data beans, with no more than the simplest validation rules built in. The heavy lifting of business logic should be borne by a dedicated service layer. This approach also simplifies the implementation of independently accessible service methods that I outlined in the previous section.
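Roughly, the split I’m after looks like this (a hypothetical sketch; the names are mine):

// Domain object: a plain data bean with only trivial validation built in.
function Invoice(data) {
  this.id = data.id;
  this.customerId = data.customerId;
  this.amount = data.amount;
}
Invoice.prototype.isValid = function () {
  return typeof this.amount === 'number' && this.amount >= 0;
};

// Service layer: carries the heavy business logic, including logic that
// spans multiple domain classes or belongs to none of them.
var billingService = {
  applyLateFee: function (invoice, daysLate) {
    return new Invoice({
      id: invoice.id,
      customerId: invoice.customerId,
      amount: invoice.amount * (1 + 0.01 * daysLate) // 1% per day, illustrative
    });
  }
};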

Pluggable storage backend

This is a short one. Most frameworks support interchanging one RDBMS with another fairly smoothly. I want to throw NoSQL stores into the mix. I want to be able to plug in MongoDB or Couchbase, for instance, to supplement the RDBMS with certain functions that NoSQL DBs excel at, but I don’t want to change the way I use the persistence layer. Whatever the technology I use for abstracting the storage functionality in the application, its API must let me work seamlessly with non-relational data stores as well.
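Concretely, I’d want the persistence API to look the same no matter what sits behind it, along the lines of this hypothetical sketch (both adapters are illustrative; the SQL and MongoDB calls stand in for whichever drivers are actually in use):

// Hypothetical sketch of a storage-agnostic repository contract.
// Both adapters expose the same find(id, callback) method, so callers
// never care whether an RDBMS or a NoSQL store answers the query.
function SqlProductStore(db) { this.db = db; }
SqlProductStore.prototype.find = function (id, callback) {
  this.db.query('SELECT * FROM products WHERE id = ?', [id], callback);
};

function MongoProductStore(collection) { this.collection = collection; }
MongoProductStore.prototype.find = function (id, callback) {
  this.collection.findOne({ _id: id }, callback);
};

// Application code depends only on the contract, not the engine:
function loadProduct(store, id) {
  store.find(id, function (err, product) {
    if (err) { return console.error(err); }
    console.log(product);
  });
}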

Summary

That more or less covers my wishlist of things I’m looking for in a web framework. In order to keep the title short, I didn’t mention that it has to be open source as well (yes, it does 🙂 ). I think I might have found one that manages to check all the boxes on the list, but I’m open to suggestions.


Syncing with bitpocket – a flexible, open source alternative to Dropbox

This continues from my previous post on the various online storage/sync solutions available today.

I’ve been a Dropbox (and Box, and Google Drive) user for a while now, and I like it for its convenience. It is easy to set up and use, and lets you keep multiple devices in sync with next to no effort. However, I’ve always had some concerns over privacy and security issues. In light of the recent attack on the service provider, I started wondering how safe my files and accounts really are (not just with Dropbox, but with any online storage solution, including a home-brewed one).

I also have some concerns regarding the privacy of my documents. Say, I’ve got some sensitive data uploaded to an online storage service. Who’s to say these documents are safe from data mining, or (god forbid) human eyes? (I’m not pointing fingers at any individual storage provider here. Some may respect your privacy, others may not.) Many people would be extremely wary of the possibility of information harvesting (even if it is completely anonymized and automated) and/or leakage.

Then of course, there are some less critical, but nevertheless important limitations:

  1. Only x GB of (free) storage space. One can always upgrade to a paid package, but I don’t want to pay for 50 GB of storage when I’m only going to use 10 GB in the foreseeable future. There are services that provide a large amount of storage space for free, but most of them still charge you for bandwidth usage beyond a fraction of that amount.
  2. No support for multiple profiles. You have to put EVERYTHING you want to sync under one single top-level folder. This may not be a suitable or acceptable restriction in all situations.
  3. Lack of flexibility – you don’t get to move your repository around if you need to. Once you subscribe to a service, you’re locked into using their storage infrastructure exclusively.

Not all of the limitations I’ve described are present in every service, nor are they necessarily a matter of concern for everybody. These are just a few issues that got me going on a personal quest to find a better alternative.

There are actually quite a few ways of setting up your own personal online storage and sync solution, whose security is limited only by your ability to configure it. But the most visible benefit over any existing service is the flexibility –

  1. to use a storage infrastructure of your choice, and
  2. to manage multiple profiles.

The rest of this post documents my experiments with one such solution, named bitpocket. It performs two-way sync by using a wrapper script to run rsync twice (once on the master, once on the slave). It can also detect and correctly propagate file deletions. It does have one limitation: it doesn’t handle conflict resolution. You have been warned. (Unison is supposedly capable of this, but that is another post ;-).)

The basic setup instructions are right on the project landing page. Follow them and you’re all set. I’ll elaborate on two things here –

  1. how to do a multi-profile setup, and
  2. how to alleviate the problem of repeated remote lockouts when multiple slaves always try to sync at the same time.

Multiple profiles

I’ve got two folders on my laptop that I want to sync:

  1. /home/aditya/scripts
  2. /home/aditya/Documents

I want these two folder profiles to be self-contained, without requiring the tracking to be done at the common parent. Following the instructions on the project page, I did a bitpocket init inside each of the above folders. On the master side (I’m running an EC2 micro-instance on a 64-bit Amazon Linux AMI), I’ve got one folder, /home/ec2-user/syncroot, where I want to track all synced profiles. So in the config file of each profile folder on the slave machine, I set the REMOTE_PATH variable as follows:

  1. For /home/aditya/scripts
    REMOTE_PATH="/home/ec2-user/syncroot/scripts"
  2. For /home/aditya/Documents
    REMOTE_PATH="/home/ec2-user/syncroot/Documents"

That’s it! You can manage as many profiles as you want, with each slave deciding where to keep its local copy of each profile.

Preventing remote lockouts

Say all your slaves are configured to sync their system clocks from a network source. They are in sync with each other, often to the second (or finer). Now, if all crons are configured to run at 5-minute intervals, then all the slaves attempt to connect to the master at exactly the same time. The first one to establish a connection starts syncing, and all the others get locked out. This happens on every cron run. The problem is further exacerbated by the fact that even a blank sync takes a few seconds at the very least, and the lockout is in force for that duration. We’re thus left with a very inefficient system which can sync ONLY one slave on every cron run. If one slave is on a network that enjoys consistently lower lag to the master than all the others, the others basically never get a chance to connect! Even if that is not the case, the system overall has a success rate of 1/N for N slaves in each cron run. Not good.

One way to alleviate this (though not entirely) is to introduce a random delay (less than the cron interval) between when cron fires and when the connection is actually attempted. Over several cron runs, this scheme spreads out each slave’s odds of running into a remote lockout evenly (duh!). Local lockouts are not a problem: bitpocket uses a locking mechanism to prevent two local processes from syncing the same tracked directory at the same time, and if a new process encounters a lock on a tracked directory (meaning the previously spawned process hasn’t finished syncing yet), it simply exits. The random delay is introduced as shown below (assuming a cron frequency of 5 minutes):

#!/usr/bin/env bash
# Wrapper around 'bitpocket cron': stagger start-up with a random delay and
# guard against overlapping runs with a PID file.

cd "$1" || exit 1
PIDFILE="$1/.bitpocket/run.pid"

# Random delay of up to 300 seconds (one cron interval) to spread the slaves out.
sleep $(( RANDOM % 300 ))

# If the previously spawned process is still running, bail out.
if [ -e "${PIDFILE}" ] && ps -u "$USER" -f | grep "[ ]$(cat "${PIDFILE}")[ ]"; then
  echo "Already running."
  exit 99
fi

# Previously spawned process is now dead; there should be no lock at this point.
# Removing it corrects for an unclean shutdown.
rm -rf .bitpocket/tmp/lock
/usr/bin/bitpocket cron &

echo $! > "${PIDFILE}"
chmod 644 "${PIDFILE}"

That’s it! Assuming you’ve saved this file in /usr/bin/bpsync, edit your crontab entries like so, and you’re done:

*/5 * * * *     bpsync ~/Documents
*/5 * * * *     bpsync ~/scripts

Happy syncing!

EDIT: I ran into trouble with stale server-side locks preventing further syncs with any slave. This happens when a slave disconnects mid-sync for whatever reason. Lock cleanup is currently the responsibility of the slave process that created it. There is no mechanism on the server to detect and expire stale locks (See https://github.com/sickill/bitpocket/issues/16). This issue needs to be fixed before this syncing tool can be left to run indefinitely, without supervision.

EDIT #2: One quick way to dispose of stale master locks is by periodically running a little script on the server that checks each sync directory for any open files (i.e. some machine is currently running a sync). If none are found, it simply deletes the leftover lock files. The script and the corresponding crontab entries are as below:

#!/bin/bash
# Clean up stale bitpocket locks on the master: if no process currently has
# files open under a sync directory, any leftover lock there is stale.

cd ~/syncroot || exit 1
for DIR in *; do
  OUT=$(/usr/sbin/lsof +D "$DIR")
  if [ -z "$OUT" ]; then
    rm -rf "$DIR/.bitpocket/tmp/lock"
  fi
done

# Corresponding crontab entry:
*/5 * * * * /usr/bin/cleanup.sh


The ideal sync tool?

Of late, I’ve been thinking about what kind of backup/sync tool would serve the multi-pronged requirements of a power user. My wishlist looks something like this:

Support for multiple directory profiles

Most online sync tools let you sync just one top level folder and all its contents. What if I want to sync multiple folders at distinct locations, with low configuration overhead?

Configurable web storage

There are a handful of distinct backup/sync paradigms prevalent these days:

  1. Dropbox-like services, which lock you into using their own online storage infrastructure, with all its attendant security and privacy concerns.
  2. DejaDup-like software, which lets you configure an online storage infrastructure of your choice (S3, FTP, etc.) but is not very flexible in terms of one or more of:
    1. managing multiple profiles,
    2. instant updates,
    3. versioning,
    4. multi-platform support.

    Rsync and its GUI derivatives may be counted in this category as well.

  3. Full-fledged VCSs like git, svn, cvs, etc. Though powerful and designed from the ground up to support versioning, incremental updates, branching, tagging and much more, they have the disadvantage of introducing metadata into the tracked folders, and are in general overkill for the simple task of just syncing a bunch of folders. There’s also the non-trivial overhead of creating and managing the online repository.

Versioning

Some services like Dropbox (and VCSs, obviously) provide versioning support; others like DejaDup and rsync don’t. I think version control is an invaluable feature to have, especially when tracking important documents. A very basic but life-saving advantage is the ability to restore a previously deleted file.

Portability

Most of us own multiple devices running on multiple platforms. Many of us are saddled with one or more personal/office laptops and desktops, as well as tablets and smartphones, all running vastly different platforms. It is incredibly convenient when a document you create on one device becomes transparently available on another, so you can seamlessly switch from your office desktop, where you created that PPT, to your tablet or smartphone at home to add those last-minute finishing touches. Services like Dropbox win hands down on this front, with support for all major desktop OSes as well as mobile platforms.

Transparent sync

This is possibly the coolest feature in Dropbox, and was its chief attraction and selling point in amassing the large userbase it now serves. I know I was impressed by it more than by any other feature the service had to offer (other than being free, of course ;)). It’s a breeze to work with your tracked folder exactly as you would with any other folder and let the software take care of syncing transparently in the background. This frees us from having to configure scheduled tasks or (god forbid) manually push/pull changesets all the time.

Multi-user read/write support

Not sure if Dropbox supports this, but VCSs most certainly do, and DejaDup and rsync can be configured to support it with a little effort. For an individual user wanting to manage his own personal files, this is not a critical requirement. The ability to share read-only links to specific files and folders from time to time usually suffices.

Most of the parameters I’ve listed so far have one or another clear winner, but it’s quite obvious that no single service described here fulfills all the criteria to adequate satisfaction. To me, this indicates there could be an opening for a product/service which meets all the above requirements well enough to satisfy the exacting standards of a power user, while at the same time giving the normal user a handy, one-size-fits-all alternative.

