[diffoscope] Schema of diffoscope JSON output
Aman Sharma
amansha at kth.se
Wed Jan 15 19:04:31 UTC 2025
Hi Chris,
> The "source1" and "source2" fields are essentially free-form text
> descriptions
Good to know! Thanks!
> Is there a particular problem you are trying to solve here? Your
> question suggests there might be.
Nice of you to ask. I have thousands of diffoscope files for Maven Central artifacts that I am analysing, and I want to understand the reasons for the differences in each file. I can't do that manually given the volume, so I thought I would cluster them into groups. For example, if source1 is "javap", I can be sure that the diff is in JVM bytecode, and I would cluster all such diffs under "Difference in JVM bytecode". Then I would manually analyse some of them to learn the reason for the difference and, eventually, the root cause. However, source1/source2 is not a tool name in every diff, so I could not categorize all diffs that way. Eventually, I went for RegEx to cluster them. For this, the "comments" JSON attribute in diffoscope files also helped :)
For example, there are cases where some files are missing from, or additional in, the rebuilt version <https://github.com/algomaster99/reproducible-central/issues/16>. I created a RegEx to capture that pattern and classify which Maven releases are non-reproducible for this reason.
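A minimal sketch of this clustering approach, for illustration only. It assumes each diffoscope JSON node carries "source1", "source2" and an optional "comments" list, and that child differences are nested under a "details" key. The rule table, the patterns, and the function names are hypothetical; the actual RegExes used in the analysis are not shown in this thread.

```python
import re

# Hypothetical rules: (label, field group, pattern). Not the patterns
# actually used in the analysis; adjust them to the reports at hand.
RULES = [
    ("Difference in JVM bytecode", "source", re.compile(r"^javap\b")),
    ("Files missing or added in rebuild", "comments",
     re.compile(r"missing|only in", re.IGNORECASE)),
]


def walk(node):
    """Yield every difference node depth-first.

    Assumes child differences are stored in a list under "details".
    """
    yield node
    for child in node.get("details") or []:
        yield from walk(child)


def classify(report):
    """Return the set of labels whose pattern matches any node in the report."""
    labels = set()
    for node in walk(report):
        sources = [node.get("source1") or "", node.get("source2") or ""]
        comments = node.get("comments") or []
        for label, field, pattern in RULES:
            texts = sources if field == "source" else comments
            if any(pattern.search(text) for text in texts):
                labels.add(label)
    return labels or {"Unclassified"}
```

To use it on a file, load the report with `json.load` and pass the resulting dict to `classify`; counting the returned labels over all reports (for example with `collections.Counter`) gives the size of each cluster. Matching on both the source fields and the comments reflects the point above that source1/source2 alone is not enough to categorize every diff.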
Regards,
Aman Sharma
PhD Student
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science (EECS)
Department of Theoretical Computer Science (TCS)
<http://www.kth.se> <https://www.kth.se/profile/amansha>
https://algomaster99.github.io/
________________________________
From: Chris Lamb <chris at reproducible-builds.org>
Sent: Wednesday, January 15, 2025 7:00:15 PM
To: diffoscope
Cc: Aman Sharma
Subject: Re: [diffoscope] Schema of diffoscope JSON output
Hello Aman,
> I want to know if there is a schema for JSON output from diffoscope. I
> have understood that it always contains 'source1' and 'source2'.
> However, they can either mean the actual source files that are being
> diff-ed or the name of the tool that is run on the files before being
> diff-ed.
No, there is not a defined JSON schema à la json-schema.org (or
similar) beyond what you have observed. :)
The "source1" and "source2" fields are essentially free-form text
descriptions — as you outline, sometimes they are filenames and
sometimes they are descriptions of the tool being used to format some
difference.
Do remember that because of the way that diffoscope recursively
unpacks archives, you cannot rely on any filenames listed in these
fields being resolvable on the filesystem anyway, so it is unclear
what would be gained if this was somehow more... 'strict'.
Is there a particular problem you are trying to solve here? Your
question suggests there might be.
Best wishes,
--
o
⬋ ⬊ Chris Lamb
o o reproducible-builds.org 💠
⬊ ⬋
o