[diffoscope] Better html-dir output for very large diffs

Wed Jun 14 20:20:00 CEST 2017

At the moment the html-dir presenter will split individual unified_diffs, e.g. that of a single archive member, or the output of a single command.

However the structure of the overall Difference tree is all put into the top-level parent index.html, and there is no way to split this. (You can set a limit for this page (--max-report-size) and then it gets truncated, but I don't want to truncate anything.)

I plan to allow splitting of the top-level parent index.html as well. As I explain below, this will involve a heavy restructuring of how the html-dir presenter works, but not its output (except for when this splitting is requested). Please let me know if you have any feedback:

Goal
====

The overall motivation is to provide a good "overview" on the top-level page, i.e. to prefer to keep stuff near the root node on the same page, up to a certain limit. After that limit is reached, we'll start splitting off subtrees into child pages.

For example (sizes in brackets):

root (20)
 |- diff1 (50)
     |- diff 1.1 (5)
     |- diff 1.2 (3)
     |- diff 1.3 (25)
 |- diff2 (100)
     |- diff 2.1 (2000)
     |- diff 2.2 (4)
     |- diff 2.3 (2)
 |- diff3 (10)
     |- diff 3.1 (10)
     |- diff 3.2 (1)
     |- diff 3.3 (100000)

Suppose our limit is 100, then we'd prefer the top-level page to contain (root, diff1, diff3) at least, and then as many from layer 2 (diff x.y) as we can fit in. (It will still mention diff2 but instead of the content there will be a button to load it.)

Problem
=======

However, the current algorithm does a depth-first traversal through the Difference tree, so it's unable to make decisions about whether to split subtrees into a different file, based on what comes later on in the tree. What this means is that, with above example the algorithm sees nodes in this order:

root (20)
diff1 (50)
diff 1.1 (5)
diff 1.2 (3)
diff 1.3 (25)
[..]

at which point it has already used up its limit of 100 and can't add diff3 to the output, even though that is preferable to adding diff 1.3.

Solution
========

So what I plan to do is to refactor the traversal algorithm to do a priority-queue-based breadth-first traversal, where layers near the root are traversed first, but smaller diffs *within each layer* are also traversed first. The traversal algorithm is already written and tested in WIP/humungous-diffs. For the above example, it would traverse like:

root, diff3, diff1, diff2, diff 3.2, diff 2.3, diff 1.2, diff 2.2, diff 1.1, diff 3.1, diff 1.3, diff 2.1, diff 3.3

Then I'll proceed by doing something like:

~~~~
def generate_output(node):
  return [a string with holes to be filled in, where each hole is associated with a child node]

class StringWithHoles():
  def fill_in(self, node, string_with_holes):
     return # another string_with_holes

partial_outputs = {root: StringWithHoles(root)} # empty string with a single hole reserved for the given node
for node in root.traverse():
   k = which_partial_has_hole_for(node, partial_ouputs)
   node_output = generate_output(node)

   if partial_outputs[k].size + node_output.size < limit:
     # add it to an existing page
     partial_outputs[k] = partial_outputs[k].fill_in(node, node_output)

   else:
     # new subpage
     partial_outputs[node] = node_output
     partial_outputs[k] = partial_outputs[k].fill_in(node, "interactive placeholder to load the subpage")

   if partial_outputs[k] has no holes:
      write_to_file(partial_outputs[k])
      delete partial_outputs[k]

assert partial_outputs is empty
~~~~

Now the downside is that this algorithm has to buffer each page in the diffoscope python process, before writing it out to disk. By contrast, the depth-first algorithm writes each line to disk immediately as it is generated, so probably its memory usage would be lower than this approach. I'm assuming this won't be a big deal, but please let me know if it is.

X

-- 
GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
https://github.com/infinity0/pubkeys.git