[diffoscope] Diffoscope falls back to xxd for two (seemingly) text files with identical content
Aman Sharma
amansha at kth.se
Thu Apr 10 17:48:53 UTC 2025
Hi Chris,
They still have not looked at my message, but I have sent a follow-up.
However, I found a similar case where diffoscope falls back to xxd. Here is the `file` output for the two files:
```
Reference.java: HTML document, ASCII text, with very long lines (6135)
Rebuild.java: Java source, ASCII text, with very long lines (6135)
```
The first strange thing here is that Reference.java is detected as an "HTML document". That seems like a bug. But both files are reported as ASCII text this time, so diffoscope should be able to use the diff tool, right?
Due to some "security issue", I was not able to attach the files to this email. So here is a link to a GitHub comment where both files are uploaded.
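One way to double-check the "ASCII text" claim independently of `file` is to scan both files for bytes outside plain printable ASCII. The following is a minimal sketch, not what `file` or diffoscope does internally: the function name `find_suspicious_bytes` and the choice of which control bytes count as harmless (tab, LF, CR) are assumptions made for illustration.

```python
import sys
from pathlib import Path

# Assumption: tab, LF and CR are treated as ordinary text control bytes.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}


def find_suspicious_bytes(data: bytes):
    """Return (line, column, byte) for each byte outside plain printable ASCII."""
    hits = []
    line, col = 1, 0
    for b in data:
        col += 1
        is_control = (b < 0x20 and b not in ALLOWED_CONTROL) or b == 0x7F
        if is_control or b >= 0x80:
            hits.append((line, col, b))
        if b == 0x0A:
            # A newline ends the current line; restart the column count.
            line += 1
            col = 0
    return hits


if __name__ == "__main__":
    # Hypothetical usage: python scan.py Reference.java Rebuild.java
    for name in sys.argv[1:]:
        for line, col, b in find_suspicious_bytes(Path(name).read_bytes()):
            print(f"{name}:{line}:{col}: byte 0x{b:02x}")
```

If this prints nothing for both files, they contain only printable ASCII plus tabs and line endings, and the misdetection would have to come from something else, such as the very long lines or the HTML classification.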
Regards,
Aman Sharma
PhD Student
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science (EECS)
Department of Theoretical Computer Science (TCS)
<http://www.kth.se>
<https://www.kth.se/profile/amansha>
https://algomaster99.github.io/
________________________________
From: Aman Sharma
Sent: Wednesday, March 19, 2025 3:46:31 PM
To: Chris Lamb; diffoscope
Subject: Re: [diffoscope] Diffoscope falls back to xxd for two (seemingly) text files with identical content
Hi Chris,
They take some time to approve my messages but should be there soon :)
Regards,
Aman Sharma
________________________________
From: Chris Lamb <chris at reproducible-builds.org>
Sent: Wednesday, March 19, 2025 2:27:30 PM
To: Aman Sharma; diffoscope
Subject: Re: [diffoscope] Diffoscope falls back to xxd for two (seemingly) text files with identical content
Hi Aman,
> > Quickly experimenting, it appears the cause of this is the strange
> > character (0x1e) on line 1185.
>
> I got a response from the file tool. It is on their mailing list -
> https://mailman.astron.com/pipermail/file/2025-March/001476.html. It
> contains UTF 8 characters \xc3\xa9\xc3\xa9 (éé) in ref and reb and that
> is why file returns data.
Yes - that was my suggestion in my previous message. :)
> I have asked them the possibility if it should return UTF 8 instead
> of data.
Hm, I don't see your request for that on the list. You might want to check
whether it was sent successfully?
Regards,
--
o
⬋ ⬊ Chris Lamb
o o reproducible-builds.org 💠
⬊ ⬋
o