[rb-general] Comparing binary files

Sat Nov 10 21:16:35 CET 2018

I was asked to post this by Chris Lamb as a result of a discussion about what  diffoscope does or should do when it finds a binary file whose format it doesn't understand. Most binary file comparators just compare bytes, which doesn't detect deletions or insertions.

Ideally what you would like is a tool which splits the binary files into convenient chunks and then applies your favourite file comparison algorithm, displaying the results in hex with addresses rather than lines with line numbers.

The problem with this is how to define convenient chunks. Individual bytes don't work well because there are too many occurrences of each byte and the diff algorithm gets confused. Fixed size chunks don't work because you can't detect insertions or deletions whose lengths aren't a multiple of the fixed size.

Some years ago I built a modified version of the Microsoft windiff example application to make it work on binary files. I copped out of the convenient chunk problem by leaving in the code which uses a linefeed as the chunk separator (see below for more thoughts on this) and I never bothered to make it display results in hex with addresses.

However it does work and it does detect insertions and deletions, although the user interface is a bit old and clunky. I no longer use windows and I don't currently have access to a Windows machine, so I can't at present build it. The built version that I have does run under "wine".

Chris suggested that I post the code to this list. In order to avoid a large attachment, I've put the files on github.com/rparkins999/bindiff. Since I can't at present build it, I can't check which files are actually needed, so I put everything except the .obj files which I know aren't needed: the built bindiff.exe is still there.

Since I talked to Chris I have had some further thoughts about convenient chunks. I think that you can guess a good separator character by scanning through the files creating a set of 256 element arrays which record for each byte value B

1) how many B's you have seen
2) how many non-B's you have seen since you last saw a B (needed for 3 and 4)
3) the shortest interval between two B's
4) the longest interval between two B's

You can calculate the average length of the interval between two B's from the total count of bytes and the number of B's seen.

You can then apply some figure of merit calculation to find a separator character which breaks the files into a set of chunks which are not too big or too small and have reasonably similar lengths. A good algorithm here should always choose linefeed as the separator character if applied to text files. I may get round to doing some work on this, but I'm busy with other things at the moment.

Incidentally I have just tried kdiff3 and it does about as good a job as bindiff, except that it knows about Unicode so it tries (unsuccessfully) to interpret the files as UTF8.
I don't know ho well diffoscope actually does because I haven't been able to install it. sudo apt-get install diffoscope (running on kubuntu 14.0.4) says unable to locate package.

Share and enjoy
Richard P. Parkins, M. A.
Cambridge, England

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.reproducible-builds.org/pipermail/rb-general/attachments/20181110/d4ee4ce0/attachment.html>