[Git][reproducible-builds/diffoscope][master] 3 commits: Fix missing diff output on large diffs.

Chris Lamb (@lamby) gitlab at salsa.debian.org
Mon Nov 15 19:02:58 UTC 2021



Chris Lamb pushed to branch master at Reproducible Builds / diffoscope


Commits:
6790469f by Brandon Maier at 2021-11-15T11:01:53-08:00
Fix missing diff output on large diffs.

When a diff chunk is large, match_lines() skips the call to
difflib.Differ.compare(). However, this causes the following issues:

- It does not empty the `self.buf` buffer. This means that every later
  call to match_lines() for that file sees a buffer that is still too
  large, so effectively no more diffs from the file get output.

- It outputs a debug message, but does not output anything to the
  side-by-side diff, so a user looking at the side-by-side diff may be
  misled into thinking the rest of the file has no differences.
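
To make the first issue concrete, here is a minimal toy model of a buffer
that is never emptied on the early-return path. ToyMatcher, LIMIT and
match_lines_buggy are hypothetical names for illustration only; this is
not diffoscope's SideBySideDiff code.

```python
class ToyMatcher:
    """Hypothetical stand-in for illustration, not diffoscope's class."""

    LIMIT = 4  # illustrative threshold, far smaller than the real one

    def __init__(self):
        self.buf = []  # pending (left, right) line pairs

    def match_lines_buggy(self):
        if len(self.buf) > self.LIMIT:
            # Early return: the buffer is NOT cleared, so it stays oversized.
            return []
        out, self.buf = self.buf, []
        return out


m = ToyMatcher()
m.buf.extend([("a", "b")] * 5)  # one oversized chunk
first = m.match_lines_buggy()   # nothing is emitted for the big chunk
m.buf.append(("c", "d"))        # a later, small chunk
second = m.match_lines_buggy()  # still nothing: the buffer never shrank
```

In this sketch both calls return an empty list, which mirrors the
"no more diffs from the file" symptom described above.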

We can fix these issues by falling back to a lazy line-by-line diff. This
produces suboptimal output, but it runs in linear, O(n), time and still
provides some form of output. We include a comment in the diff so the
user knows that the output that follows uses a lazy diff algorithm.
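
The fallback idea can be sketched in isolation as follows. This is a
simplified illustration, not the code from this commit: pair_lines is a
hypothetical helper, the small-input branch uses difflib.SequenceMatcher
as a stand-in for the real matching logic, and only the 750 threshold and
the use of itertools.zip_longest are taken from the diff.

```python
import difflib
import itertools

LIMIT = 750  # threshold taken from the diff; everything else is illustrative


def pair_lines(l0, l1, limit=LIMIT):
    """Yield (left, right) line pairs for a side-by-side view."""
    if len(l0) + len(l1) > limit:
        # Lazy fallback: pair lines purely by position, in O(n) time.
        # The shorter side is padded with empty strings.
        yield from itertools.zip_longest(l0, l1, fillvalue="")
        return
    # Small input: align the two sides with SequenceMatcher (an assumption
    # for this sketch), then pair the lines inside each opcode block.
    sm = difflib.SequenceMatcher(a=l0, b=l1, autojunk=False)
    for _tag, i1, i2, j1, j2 in sm.get_opcodes():
        yield from itertools.zip_longest(l0[i1:i2], l1[j1:j2], fillvalue="")


# Force the fallback with a tiny limit to see the positional pairing.
lazy = list(pair_lines(["a", "b", "c"], ["x"], limit=2))
aligned = list(pair_lines(["a", "b"], ["b"]))
```

The positional pairing can line up unrelated lines, which is why the
commit also emits a comment telling the user that a lazy diff is in use.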

- - - - -
11cdb97c by Chris Lamb at 2021-11-15T11:02:02-08:00
Apply Black to previous commit.

Gbp-dch: ignore

- - - - -
592c401b by Chris Lamb at 2021-11-15T11:02:38-08:00
Import itertools top-level directly.

Gbp-dch: ignore

- - - - -


1 changed file:

- diffoscope/diff.py


Changes:

=====================================
diffoscope/diff.py
=====================================
@@ -24,6 +24,7 @@ import errno
 import fcntl
 import hashlib
 import logging
+import itertools
 import threading
 import subprocess
 
@@ -551,11 +552,11 @@ class SideBySideDiff:
         if len(l0) + len(l1) > 750:
             # difflib.Differ.compare is at least O(n^2), so don't call it if
             # our inputs are too large.
-            logger.debug(
-                "Not calling difflib.Differ.compare(x, y) with len(x) == %d and len(y) == %d",
-                len(l0),
-                len(l1),
+            yield "C", "Diff chunk too large, falling back to line-by-line diff ({} lines added, {} lines removed)".format(
+                self.add_cpt, self.del_cpt
             )
+            for line0, line1 in itertools.zip_longest(l0, l1, fillvalue=""):
+                yield from self.yield_line(line0, line1)
             return
 
         saved_line = None



View it on GitLab: https://salsa.debian.org/reproducible-builds/diffoscope/-/compare/3ab6acb816fa5e38cc58e6ad69515eef1ae4fe61...592c401bcad2ebffff195e23640031539fdf3a94


