Introducing: Semantically reproducible builds

Mon May 29 16:41:07 UTC 2023

On 5/29/23 05:15, David A. Wheeler wrote:
> Here's an example that might clarify the threat model.
> It's possible that a
> program could look for ".gitignore" and run it if present.
> The source code repo might not have a .gitignore file,
> but the malicious package added .gitignore and filled it with
> a malicious application. That would cause malicious code to
> be executed, but it would also be *highly* suspicious to
> run a ".gitignore" file (that's *not* what they are for), so
> it's reasonable to assume that the source code didn't do that.
> If an attacker can insert a file that *would* cause malicious code
> to execute in a reasonably-coded app, then that *would* be a problem.
> "What's reasonable" is hard to truly write down, but a
> whitelisted list of specific filenames seems like a reasonable place
> to start.

I think the PyPI example with the missing .gitignore file is more about
"git and PyPI are both a VCS, did the author commit the same source code
to both?". It's about "what's the canonical source code release" rather
than about an actual build.

It's the famous disconnect of "our engineers reviewed the source code 
they got from `git clone`, but our servers use source code from a 
package registry (or whatever source code a Debian maintainer uploaded 
into the Debian archive)".

For my "how to evade a semantic diff" exercise you would probably not 
bluntly add a new file, but instead find a complex file format (one that 
gets interpreted by some other, complex program maybe?) and then try to 
find blind spots in the diff tool that are useful for exploit development.

These aren't hard to find. For example, diffoscope doesn't have a good 
understanding of extended attributes in tar files and only flags them, 
as a binary diff, if it can't find any semantic differences.

If you intentionally introduce a benign difference for diffoscope to 
pick up on (like changing a timestamp by a few seconds), diffoscope is 
going to cite this as the explanation for why the files aren't 
binary-equal and stop investigating further.
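
To make the blind spot concrete, here is a minimal sketch using only Python's standard `tarfile` module. It is not how diffoscope works internally; it only illustrates why a comparison should enumerate every differing field instead of stopping at the first plausible explanation. The helper names (`make_tar`, `member_facts`, `all_differences`), the file name and the attribute value are made up for illustration, and the sketch assumes extended attributes are carried as pax header records with a `SCHILY.xattr.` prefix, which is the convention GNU tar uses.

```python
import io
import tarfile


def make_tar(mtime, xattrs=None):
    """Build a one-member pax archive in memory (hypothetical test fixture)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
        data = b"[metadata]\nname = demo\n"
        info = tarfile.TarInfo("pkg/setup.cfg")
        info.size = len(data)
        info.mtime = mtime
        # Assumption: xattrs travel as per-member pax header records.
        info.pax_headers = dict(xattrs or {})
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def member_facts(raw):
    """Collect the fields we want to compare for every archive member."""
    facts = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
        for member in tf:
            fileobj = tf.extractfile(member)
            facts[member.name] = {
                "mtime": member.mtime,
                "mode": member.mode,
                "size": member.size,
                "content": fileobj.read() if fileobj else b"",
                "pax_headers": dict(member.pax_headers),
            }
    return facts


def all_differences(raw_a, raw_b):
    """Return every (member, field) pair that differs, not just the first."""
    facts_a, facts_b = member_facts(raw_a), member_facts(raw_b)
    diffs = []
    for name in sorted(set(facts_a) | set(facts_b)):
        if name not in facts_a or name not in facts_b:
            diffs.append((name, "presence"))
            continue
        for field, value in facts_a[name].items():
            if value != facts_b[name][field]:
                diffs.append((name, field))
    return diffs


if __name__ == "__main__":
    original = make_tar(mtime=1685378467)
    # The timestamp bump is the decoy, the extra pax record is the payload.
    tampered = make_tar(
        mtime=1685378470,
        xattrs={"SCHILY.xattr.user.demo": "payload"},
    )
    for name, field in all_differences(original, tampered):
        print(f"{name}: {field} differs")
```

A tool that reported only the `mtime` change here would give a reviewer a harmless-looking explanation for the binary mismatch while the additional header record goes unmentioned.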

I've already explored semantic diff evasion for several months but 
unfortunately haven't had time to blog about it.

---

I don't think it's worthwhile to try to build security controls on top 
of it; it sounds more like a code-review problem. Source code inputs are 
commonly pinned by their sha256sum, so it's very clear what should be 
reviewed, with no ambiguity about whether some .gitignore is present or 
absent.
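
As a rough illustration of what pinning by sha256sum means in practice, the sketch below hashes a source archive and compares the result against a pinned value. The function names and the sample bytes are hypothetical; real build tools keep the pin in their own recipe or lock file format.

```python
import hashlib
import io


def sha256_of(fileobj, chunk_size=65536):
    """Stream a file-like object through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_pin(fileobj, pinned_sha256):
    """True only if the input is byte-for-byte the artifact that was pinned."""
    return sha256_of(fileobj) == pinned_sha256.lower()


if __name__ == "__main__":
    reviewed = b"pretend this is the reviewed source tarball"
    pin = hashlib.sha256(reviewed).hexdigest()

    print(matches_pin(io.BytesIO(reviewed), pin))
    # Any added, removed or modified byte (for example an extra
    # .gitignore inside the archive) changes the digest.
    print(matches_pin(io.BytesIO(reviewed + b"\x00"), pin))
```

Because the pin covers the exact bytes of the archive, the reviewer and the build agree on one input, and the question of which files are "allowed" to differ does not come up.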

cheers,
kpcyrd