[diffoscope] Better html-dir output for very large diffs

Ximin Luo infinity0 at debian.org
Wed Jun 14 20:20:00 CEST 2017

At the moment the html-dir presenter will split individual unified_diffs, e.g. that of a single archive member, or the output of a single command.

However the structure of the overall Difference tree is all put into the top-level parent index.html, and there is no way to split this. (You can set a limit for this page (--max-report-size) and then it gets truncated, but I don't want to truncate anything.)

I plan to allow splitting of the top-level parent index.html as well. As I explain below, this will involve a heavy restructuring of how the html-dir presenter works, but not its output (except for when this splitting is requested). Please let me know if you have any feedback:


The overall motivation is to provide a good "overview" on the top-level page, i.e. to prefer to keep stuff near the root node on the same page, up to a certain limit. After that limit is reached, we'll start splitting off subtrees into child pages.

For example (sizes in brackets):

root (20)
 |- diff1 (50)
     |- diff 1.1 (5)
     |- diff 1.2 (3)
     |- diff 1.3 (25)
 |- diff2 (100)
     |- diff 2.1 (2000)
     |- diff 2.2 (4)
     |- diff 2.3 (2)
 |- diff3 (10)
     |- diff 3.1 (10)
     |- diff 3.2 (1)
     |- diff 3.3 (100000)

Suppose our limit is 100, then we'd prefer the top-level page to contain (root, diff1, diff3) at least, and then as many from layer 2 (diff x.y) as we can fit in. (It will still mention diff2 but instead of the content there will be a button to load it.)


However, the current algorithm does a depth-first traversal through the Difference tree, so it's unable to make decisions about whether to split subtrees into a different file, based on what comes later on in the tree. What this means is that, with above example the algorithm sees nodes in this order:

root (20)
diff1 (50)
diff 1.1 (5)
diff 1.2 (3)
diff 1.3 (25)

at which point it has already used up its limit of 100 and can't add diff3 to the output, even though that is preferable to adding diff 1.3.


So what I plan to do is to refactor the traversal algorithm to do a priority-queue-based breadth-first traversal, where layers near the root are traversed first, but smaller diffs *within each layer* are also traversed first. The traversal algorithm is already written and tested in WIP/humungous-diffs. For the above example, it would traverse like:

root, diff3, diff1, diff2, diff 3.2, diff 2.3, diff 1.2, diff 2.2, diff 1.1, diff 3.1, diff 1.3, diff 2.1, diff 3.3

Then I'll proceed by doing something like:

def generate_output(node):
  return [a string with holes to be filled in, where each hole is associated with a child node]

class StringWithHoles():
  def fill_in(self, node, string_with_holes):
     return # another string_with_holes

partial_outputs = {root: StringWithHoles(root)} # empty string with a single hole reserved for the given node
for node in root.traverse():
   k = which_partial_has_hole_for(node, partial_ouputs)
   node_output = generate_output(node)

   if partial_outputs[k].size + node_output.size < limit:
     # add it to an existing page
     partial_outputs[k] = partial_outputs[k].fill_in(node, node_output)

     # new subpage
     partial_outputs[node] = node_output
     partial_outputs[k] = partial_outputs[k].fill_in(node, "interactive placeholder to load the subpage")

   if partial_outputs[k] has no holes:
      delete partial_outputs[k]

assert partial_outputs is empty

Now the downside is that this algorithm has to buffer each page in the diffoscope python process, before writing it out to disk. By contrast, the depth-first algorithm writes each line to disk immediately as it is generated, so probably its memory usage would be lower than this approach. I'm assuming this won't be a big deal, but please let me know if it is.


GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE

More information about the diffoscope mailing list