[rb-general] Comparing binary files

Eric Myhre hash at exultant.us
Sun Nov 11 00:13:56 CET 2018


Another form of algorithm that might be interesting for this purpose of 
conjuring chunks from arbitrary data is known as "rabin fingerprinting"; 
"rolling checksum" should also turn up quite a few search hits.  It's an 
approach that's "less clever", because it does not greatly consider the 
content up-front; but computes quickly, and still yields 
probabilistically similar lengths.

Many backup and synchronization tools use some variation of this 
approach (rsync, bup, IPFS, casync, and many others included).

Some common algorithms for doing rolling checksums include Adler32 and 
Buzhash (though I'm sure there are more).  Generally there are ways to 
tune the average chunk sizes to some ~2^N.

Cheers,


On 10.11.2018 14:16, Richard Parkins via rb-general wrote:
> I was asked to post this by Chris Lamb as a result of a discussion 
> about what diffoscope does or should do when it finds a binary file 
> whose format it doesn't understand. Most binary file comparators just 
> compare bytes, which doesn't detect deletions or insertions.
>
> Ideally what you would like is a tool which splits the binary files 
> into convenient chunks and then applies your favourite file comparison 
> algorithm, displaying the results in hex with addresses rather than 
> lines with line numbers.
>
> The problem with this is how to define convenient chunks. Individual 
> bytes don't work well because there are too many occurrences of each 
> byte and the diff algorithm gets confused. Fixed size chunks don't 
> work because you can't detect insertions or deletions whose lengths 
> aren't a multiple of the fixed size.
>
> Some years ago I built a modified version of the Microsoft windiff 
> example application to make it work on binary files. I copped out of 
> the convenient chunk problem by leaving in the code which uses a 
> linefeed as the chunk separator (see below for more thoughts on this) 
> and I never bothered to make it display results in hex with addresses.
>
> However it does work and it does detect insertions and deletions, 
> although the user interface is a bit old and clunky. I no longer use 
> windows and I don't currently have access to a Windows machine, so I 
> can't at present build it. The built version that I have does run 
> under "wine".
>
> Chris suggested that I post the code to this list. In order to avoid a 
> large attachment, I've put the files on 
> github.com/rparkins999/bindiff. Since I can't at present build it, I 
> can't check which files are actually needed, so I put everything 
> except the .obj files which I know aren't needed: the built 
> bindiff.exe is still there.
>
> Since I talked to Chris I have had some further thoughts about 
> convenient chunks. I think that you can guess a good separator 
> character by scanning through the files creating a set of 256 element 
> arrays which record for each byte value B
>
> 1) how many B's you have seen
> 2) how many non-B's you have seen since you last saw a B (needed for 3 
> and 4)
> 3) the shortest interval between two B's
> 4) the longest interval between two B's
>
> You can calculate the average length of the interval between two B's 
> from the total count of bytes and the number of B's seen.
>
> You can then apply some figure of merit calculation to find a 
> separator character which breaks the files into a set of chunks which 
> are not too big or too small and have reasonably similar lengths. A 
> good algorithm here should always choose linefeed as the separator 
> character if applied to text files. I may get round to doing some work 
> on this, but I'm busy with other things at the moment.
>
> Incidentally I have just tried kdiff3 and it does about as good a job 
> as bindiff, except that it knows about Unicode so it tries 
> (unsuccessfully) to interpret the files as UTF8.
>
> I don't know ho well diffoscope actually does because I haven't been 
> able to install it. sudo apt-get install diffoscope (running on 
> kubuntu 14.0.4) says unable to locate package.
>
> Share and enjoy
> Richard P. Parkins, M. A.
> Cambridge, England
>
>
> _______________________________________________
> rb-general at lists.reproducible-builds.org mailing list
>
> To change your subscription options, visit https://lists.reproducible-builds.org/listinfo/rb-general.
>
> To unsubscribe, send an email to rb-general-unsubscribe at lists.reproducible-builds.org.



More information about the rb-general mailing list