[rb-general] Comparing binary files

Sun Nov 18 18:29:39 CET 2018

Richard Parkins wrote:
> Most binary file comparators just compare bytes, which doesn't detect deletions or insertions.

I've sketched a simple algorithm to detect deletions and insertions. I
know next to nothing about diff algorithms (binary or otherwise), so
please forgive me if this is something already known.

The basic idea is: keep reading bytes and skipping them, a pair at a
time, as long as they match. When a pair of non-matching bytes is
found, read on until you find a new matching sequence that starts
after N1 bytes into file 1 and N2 bytes into file 2. Repeat until EOF.

Of course all pairs (N1, N2) must be checked until a match is found,
so the cost of this algorithm is O(N^2). I figured that it can be made
more efficient by using a pair of hash tables: if a key is already
present in the other file's table then I've found a match, otherwise
insert it in this file's table and go ahead. Searching and inserting a
key in a hash table are O(log N) operations, so the overall cost drops
to O(N log N).

Attached is a simple prototype written in Perl. Well, less simple than
I had thought, but still just a couple of hundred lines of code.

A few notes:

- The output mimics that of diff (a for insertions, d for deletions, c
for changes) except that byte counts are printed instead of line
counts, and the non-matching bytes aren't printed -- being binary,
they would likely appear as garbage anyway.

- A new sequence is defined as at least K matching bytes, because
there are only 256 byte values and thus you might find a lot of
non-significant matches if you looked for just one byte. So the keys
stored in the hash tables are K-bytes words. I chose K==4 but of
course this can be changed. The optimal value might also depend on the
typical contents of a specific file type.

- Bytes are loaded into buffers instead of being processed directly
after reading from file, because when you find a new matching
sequence, you've probably already read past the matching point on the
other file, so you have to go back and process again bytes that you've
already processed. But if you want to be able to redirect input
streams from standard input, you can't seek back, so using buffers was
the only solution I could come up with.

- I've read in the documentation that Perl's getc() reads characters
rather than bytes. I guess that means it tries to interpret the file
as UTF-8 or some other encoding, so if you use it on actual binary
files, it may not work, or it may work but get wrong byte counts. In
fact, from my limited tests it seems to work provided that I use the
ord() function (converts character to its numerical value) to do the
comparisons. But I wouldn't give 100% guarantee.

Let me know if this can be useful for your purpose.

Thank you
Gerardo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bdiff.pl
Type: application/octet-stream
Size: 6701 bytes
Desc: not available
URL: <http://lists.reproducible-builds.org/pipermail/rb-general/attachments/20181118/ea7b8805/attachment.obj>