Fossilize is an FFI-powered C-extension for Ruby that interfaces with the delta encoding algorithm created by D. Richard Hipp for the FOSSIL SCM project. It enables a Ruby program to quickly (and I mean quickly) generate a delta between files and strings, as well as apply those deltas.
Deltas can be created between a Ruby File object and a String and vice-versa, so you can read in some JSON from a remote server as a String, create a delta from your local File copy and then apply that delta to your local copy to merge the differences.
The project is currently considered a work-in-progress.
The algorithm itself is based on rsync and is a form of Delta Encoding (sometimes called Delta Compression). A delta encoding algorithm is designed to analyse two pieces of data and produce a delta (the differences between them) as an encoded string.
Here, I'll give an example. If I give the following two strings to the algorithm:
xiy needs to get a job!
maybe xiy needs to get a real job!
It spits out the following delta string (I sense sarcasm in its tone):
_
6:maybe J@0,B:*real* job!1rx1Az;
Although explaining the format of the delta string is out of the scope of this internets page, you can see why this algorithm is so damn cool.
For more info on the algorithm, see the excellent documentation over here.
Git uses a similar algorithm to only store the changes to tracked files between revisions. However, the deltas created by Git can sometimes be huge.
As a real world example, here are the differences between ruby/ruby/array.c@e3efce
and its previous commit (you can see the diff here).
WBD
N86@0,g:rb_random_ulong_limited((randgen), (max)-1)U@OvG,8:shuffle!H@OGW,_B@N9S,4:long49@Nii,F:i = RAND_UPTO(lb@NnL,4:<= iK@Nn~,N@Ntk,H: }
return ptr[i]b@3Rx,33@Npg,H:RAND_UPTO(len - iG@L00,8:k = len;~@NtF,C:len < k) {
R@O1i,K@DZ~,_:n; ++i) {
if (rnds[i] >= len) {
I@F9W,C:new2(0);
}H@QGF,U@NrC,16@Nu_,7:rnds[0]r@Nv~,L:rnds[0];
j = rnds[1]1F@Nxf,Z:rnds[0];
j = rnds[1];
k = rnds[2]40@N~E,I:rnds[0];
for (i=1J@OBW,B:k = rnds[i]86X@O4S,32IfMm;
In terms of diffing, it wouldn't be hard to parse this delta and determine where in a file modifications took place at a per-character level as opposed to the traditional per-line approach.
- Patching - the only data to transmit is the delta string.
- Diffing - although the format isn't human readable, it wouldn't be hard to make it so. Unlike diff, the algorithm compares by character, not by line.
- Syncing - my project
wormhole
uses this algorithm to ensure it only syncs the updated portions of my files, not the entire thing.
- Comparing differences between two almost completely different pieces of data (see Wikipedia).
- Binary diffing - while it can successfully create and apply binary patches, there are algorithms and tools better designed for these types of files.
bsdiff
will create a 2kb diff whereas Fossilize will create a 4kb diff. This is probably due to the fact that the algorithm uses base64 encoding of plain-text which ends up with binary artefacts popping up in the diff. Although they don't make it into the output, it's obviously better to use something like bsdiff. It would, however, be possible to modify the algorithm to use a different "mode" for binary files that uses binary encoding instead.
Add this line to your application's Gemfile:
gem 'fossilize'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install fossilize
Fossilized is distributed under the MIT License.
Fossil (and the Fossil delta encoding algorithm included within) are distributed under the Simplified BSD License/FreeBSD License:
Copyright (c) 2006 D. Richard Hipp
This program is free software; you can redistribute it and/or
modify it under the terms of the Simplified BSD License (also
known as the "2-Clause License" or "FreeBSD License".)
This program is distributed in the hope that it will be useful,
but without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.
Author contact information:
[email protected]
http://www.hwaci.com/drh/
- Fork it
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Added some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create new Pull Request