Sunday, May 03, 2009

S+S Synchronization

Software plus Services (S+S) solutions have a number of attractions, and a common one is the ability to work offline.  Some S+S solutions have clients that require persistent internet connections, or are read-only in offline mode.  However, most developers who choose S+S as their architectural blueprint do so, at least in part, out of a desire to give users strong offline capabilities.

The difficulty with offline editing is the possibility of conflicts and the need to provide conflict resolution.  Some applications ignore this difficulty because they are single-user, and are willing to leave that user responsible for any data loss that results from accessing the services directly, or from working on one client before the offline work done on another has been synchronized.  Other applications are entirely read-only, eliminating the difficulty in a different way.

Many applications don’t fit inside those constraints, and when they don’t, synchronization inevitably comes up as a topic.  Discussions of synchronization often start out with a bit of wishful thinking.  That is, one or more participants believe there is a synchronization black box they can throw data at, and every conflict will be resolved automatically.  To my knowledge, that doesn’t exist.  In fact, I’ll go as far as to say that I believe it cannot exist.  Unless you design your data to fit a strict structure that communicates your business rules, and those rules never require escalation to human judgment, such a system cannot correctly resolve all conflicts.

I’ve come to the conclusion that synchronization cannot be a black box.  Synchronization requires more than read/write to be exposed to the developer.  For S+S solutions, the synchronization architectures I prefer are those that expose more data to the client.  There are a number of ways to implement that, but I’ll explain one of the simplest.

First, begin with a centralized master, your service, that is the authoritative source for data.  This service needs to support two things for every synchronizable item.  First, it needs an identifier that is guaranteed to be unique and unchanging over the lifetime of the item.  Second, it needs a versioning identifier.  It could be a sequentially incremented number or a timestamp, but I prefer another unique id.
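To make the two identifiers concrete, here is a minimal in-memory sketch.  The names `Item` and `MasterService` are hypothetical, and a real service would sit behind a network API and a durable store; the only point is that each item carries a stable id plus an opaque version token.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str  # unique and unchanging over the item's lifetime
    version: str  # opaque token, replaced on every accepted write
    body: str


class MasterService:
    """Hypothetical stand-in for the authoritative service."""

    def __init__(self):
        self._items = {}

    def create(self, body):
        # Both identifiers are random unique ids rather than counters or timestamps.
        item = Item(item_id=uuid.uuid4().hex, version=uuid.uuid4().hex, body=body)
        self._items[item.item_id] = item
        return item

    def read(self, item_id):
        return self._items[item_id]


service = MasterService()
created = service.create("draft text")
```

Using a random id for the version means clients can only compare it for equality, which is all the rest of the design needs.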

Next, in your service, implement (or re-use) a read-only caching system.  If this sounds pretty vanilla so far, that’s because it is.  These first two steps can be achieved through the use of HTTP, URIs and ETags.
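As a rough illustration of that cache, the sketch below imitates an HTTP conditional GET without any networking.  `Service`, `ReadOnlyCache` and the item id `note-1` are made up for the example; the status codes mirror HTTP's 200 and 304, and the version string plays the role of an ETag sent back in `If-None-Match`.

```python
class Service:
    """Hypothetical master service: item_id -> (version, body)."""

    def __init__(self):
        self.items = {"note-1": ("v1", "hello")}

    def get(self, item_id, if_none_match=None):
        version, body = self.items[item_id]
        if if_none_match == version:
            return 304, version, None  # like HTTP 304 Not Modified
        return 200, version, body


class ReadOnlyCache:
    """Client-side cache that is only ever written from service responses."""

    def __init__(self, service):
        self._service = service
        self._entries = {}

    def refresh(self, item_id):
        cached = self._entries.get(item_id)
        known_version = cached[0] if cached else None
        status, version, body = self._service.get(item_id, if_none_match=known_version)
        if status == 200:
            self._entries[item_id] = (version, body)
        return status

    def read(self, item_id):
        return self._entries[item_id]


service = Service()
cache = ReadOnlyCache(service)
first = cache.refresh("note-1")   # nothing cached yet, full fetch
second = cache.refresh("note-1")  # version unchanged, nothing transferred
service.items["note-1"] = ("v2", "hello again")
third = cache.refresh("note-1")   # version changed, cache replaced
```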

Where many implementations go wrong is the next step, where they attempt to convert their cache into a read-write store.  The most obvious problem with that choice is that it breaks everything you’ve built so far, since a cache is one-way.  The other problem is that you’ve implemented a store that has two sources of data, and you have no mechanism to decide which one wins.  You could create a mechanism, but there isn’t any perfect mechanism.

Instead of forcing the cache to transform into something more complex, leave it alone and create a separate store for modifications.  The client is then responsible for attempting to keep that modified store as empty as possible by submitting the modifications to your services.  Every submission is tagged with the versioning identifier of the item that was present in the cache when modification began.  If this sounds like the HTTP “If-Match”, then you have the idea.  Services should not accept modifications that are unaware of the content of the latest version they have accepted.
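Here is one way the separate modification store could look, again as an in-memory sketch under assumptions of my own: `Service.put` stands in for an HTTP PUT with `If-Match`, status 412 mirrors HTTP's Precondition Failed, and `PendingStore` is a hypothetical name for the store of unsent edits.

```python
import uuid


class Service:
    """Hypothetical master service: item_id -> (version, body)."""

    def __init__(self):
        self.items = {}

    def put(self, item_id, body, if_match):
        current_version, _ = self.items[item_id]
        if if_match != current_version:
            return 412, current_version  # like HTTP 412 Precondition Failed
        new_version = uuid.uuid4().hex
        self.items[item_id] = (new_version, body)
        return 200, new_version


class PendingStore:
    """Separate from the cache: holds only unsent local modifications."""

    def __init__(self):
        self.pending = {}  # item_id -> (base_version, new_body)

    def record(self, item_id, base_version, new_body):
        # If the item is edited again offline, keep the earliest base version.
        base = self.pending.get(item_id, (base_version, None))[0]
        self.pending[item_id] = (base, new_body)

    def flush(self, service):
        """Submit every pending edit; return the ids the service rejected."""
        conflicts = []
        for item_id, (base_version, body) in list(self.pending.items()):
            status, _ = service.put(item_id, body, if_match=base_version)
            if status == 200:
                del self.pending[item_id]
            else:
                conflicts.append(item_id)
        return conflicts


service = Service()
service.items["a"] = ("v1", "one")
service.items["b"] = ("v1", "two")

store = PendingStore()
store.record("a", "v1", "one, edited offline")
store.record("b", "v1", "two, edited offline")

# Someone else updates "b" before this client reconnects.
service.items["b"] = ("v2", "two, edited by someone else")
conflicts = store.flush(service)
```

The rejected edit to `b` stays in the pending store, so nothing is lost on either side while the client decides what to do next.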

Now, if a submission is rejected, you’ve detected a conflict.  Your options are many at this point.  No option is perfect, and the choice is going to depend on many things, not the least of which are your users’ requirements.  But no option has been eliminated yet.  Without any additional implementation you have a first-in-wins strategy, which happens to be the safest bet without more complex insight into the data’s structure, or user intervention.

If you want a last-in-wins strategy, re-cache, update the versioning identifier and resubmit.  Since this would potentially destroy a previous set of updates, it would be bad practice to do so without some kind of user prompt or notification, but you and your users are in control, not the API.
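A last-in-wins resubmit might be sketched as follows.  Everything here is illustrative: the `Service` class is an in-memory stand-in, the cache is a plain dict, and `confirm` is a hypothetical callback representing whatever prompt or notification your application shows the user.

```python
import uuid


class Service:
    """Hypothetical master service: item_id -> (version, body)."""

    def __init__(self):
        self.items = {}

    def get(self, item_id):
        return self.items[item_id]

    def put(self, item_id, body, if_match):
        current_version, _ = self.items[item_id]
        if if_match != current_version:
            return 412, current_version
        new_version = uuid.uuid4().hex
        self.items[item_id] = (new_version, body)
        return 200, new_version


def submit_last_in_wins(service, cache, item_id, base_version, body, confirm):
    """Try a normal submit; on conflict, overwrite only if the user agrees."""
    status, version = service.put(item_id, body, if_match=base_version)
    if status == 200:
        cache[item_id] = (version, body)
        return "accepted"
    # Conflict: re-cache so the client knows what it is about to overwrite.
    cache[item_id] = service.get(item_id)
    latest_version, latest_body = cache[item_id]
    if not confirm(latest_body, body):
        return "kept-pending"
    # Resubmit against the version just fetched.
    status, version = service.put(item_id, body, if_match=latest_version)
    if status != 200:
        return "kept-pending"  # lost another race; the caller may retry
    cache[item_id] = (version, body)
    return "overwritten"


service = Service()
service.items["a"] = ("v2", "their edit")
cache = {"a": ("v1", "original")}

declined = submit_last_in_wins(
    service, cache, "a", "v1", "my edit", confirm=lambda theirs, mine: False
)
forced = submit_last_in_wins(
    service, cache, "a", "v1", "my edit", confirm=lambda theirs, mine: True
)
```

Note that the overwrite still goes through the version check, so even a forced update cannot silently clobber a change the client has never seen.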

If merging is necessary, the essential complexity of the merge itself remains, but not much else.  You have a copy of the original version, you can retrieve a copy of the current version, and you have a copy of the modifications.  Two-way merge, three-way merge, automated merge, manual merge: whatever is necessary is possible, and no more difficult than it absolutely has to be.
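As one example of what those three copies make possible, here is a small field-level three-way merge.  It assumes items are flat dictionaries, which is my simplification and not a requirement of the approach, and the function name and sample data are invented for illustration.

```python
_MISSING = object()  # marks a field that is absent in one of the versions


def three_way_merge(original, theirs, mine):
    """Merge field by field; return (merged, fields needing a human decision)."""
    merged, conflicts = {}, []
    for key in sorted(set(original) | set(theirs) | set(mine)):
        base = original.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        m = mine.get(key, _MISSING)
        if m == base:
            chosen = t  # I did not touch this field, so take the service's value
        elif t == base or t == m:
            chosen = m  # only I changed it, or we both made the same change
        else:
            conflicts.append(key)
            chosen = t  # placeholder: keep the service's value until someone decides
        if chosen is not _MISSING:
            merged[key] = chosen
    return merged, conflicts


original = {"title": "Plan", "owner": "ann", "status": "draft"}     # cached copy
theirs = {"title": "Plan", "owner": "bob", "status": "review"}      # current on service
mine = {"title": "Plan v2", "owner": "ann", "status": "final"}      # local modifications

merged, conflicts = three_way_merge(original, theirs, mine)
```

Here `title` and `owner` merge cleanly because only one side changed each, while `status` is reported as a conflict for the user or a business rule to settle.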
