Introducing: Semantically reproducible builds
kpcyrd
kpcyrd at archlinux.org
Mon May 29 16:41:07 UTC 2023
On 5/29/23 05:15, David A. Wheeler wrote:
> Here's an example that might clarify the threat model.
> It's possible that a
> program could look for ".gitignore" and run it if present.
> The source code repo might not have a .gitignore file,
> but the malicious package added .gitignore and filled it with
> a malicious application. That would cause malicious code to
> be executed, but it would also be *highly* suspicious to
> run a ".gitignore" file (that's *not* what they are for), so
> it's reasonable to assume that the source code didn't do that.
> If an attacker can insert a file that *would* cause malicious code
> to execute in a reasonably-coded app, then that *would* be a problem.
> "What's reasonable" is hard to truly write down, but a
> whitelisted list of specific filenames seems like a reasonable place
> to start.
I think the PyPI example with the missing .gitignore file is more about
"git and PyPI are both a VCS, did the author commit the same source code".
It's about "what's the canonical source code release" rather than about a
real build.
It's the famous disconnect of "our engineers reviewed the source code
they got from `git clone`, but our servers use source code from a
package registry (or whatever source code a Debian maintainer uploaded
into the Debian archive)".
For my "how to evade a semantic diff" exercise you would probably not
bluntly add a new file, but instead find a complex file format (one that
gets interpreted by some other, complex program maybe?) and then try to
find blind spots in the diff tool that are useful for exploit development.
These aren't hard to find. For example, diffoscope doesn't have a good
understanding of extended attributes in tar files and will only flag
them with a binary diff if it couldn't find any semantic differences.
If you intentionally introduce a benign difference for diffoscope to
pick up on (like changing a timestamp by a few seconds), diffoscope is
going to cite this as the explanation for why the files aren't
binary-equal and stop further investigation.
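To make the decoy idea concrete, here is a minimal sketch in Python. It is
not how diffoscope is implemented; the toy comparer below only mimics the
behaviour described above. The file name, timestamps and the
`SCHILY.xattr.user.example` attribute are made up for illustration (the
`SCHILY.xattr.` prefix is the pax header convention commonly used to store
extended attributes).

```python
import io
import tarfile


def make_tar(mtime, pax_headers=None):
    """Build a one-file pax tar archive in memory and return its bytes."""
    data = b"print('hello')\n"
    info = tarfile.TarInfo(name="pkg/main.py")
    info.size = len(data)
    info.mtime = mtime
    if pax_headers:
        # Extra pax records; this is where xattrs end up in a tar file.
        info.pax_headers = dict(pax_headers)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def describe(blob, include_pax):
    """Map member name -> the fields a comparer chooses to look at."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:") as tar:
        for member in tar:
            fields = {
                "mtime": member.mtime,
                "content": tar.extractfile(member).read(),
            }
            if include_pax:
                fields["pax"] = dict(member.pax_headers)
            out[member.name] = fields
    return out


def diff_fields(a, b, include_pax=False):
    """Return sorted (member, field) pairs that differ between two archives."""
    da, db = describe(a, include_pax), describe(b, include_pax)
    diffs = []
    for name in sorted(set(da) | set(db)):
        fa, fb = da.get(name, {}), db.get(name, {})
        for key in sorted(set(fa) | set(fb)):
            if fa.get(key) != fb.get(key):
                diffs.append((name, key))
    return diffs


original = make_tar(mtime=1685378467)
tampered = make_tar(
    mtime=1685378470,  # benign-looking decoy: timestamp off by 3 seconds
    pax_headers={"SCHILY.xattr.user.example": "payload"},  # hypothetical xattr
)

# A comparer that ignores pax records sees only the decoy.
print(diff_fields(original, tampered))
# A comparer that also inspects pax records sees the hidden change too.
print(diff_fields(original, tampered, include_pax=True))
```

The first call reports only the timestamp, which is a plausible
explanation for the archives not being binary-equal; the extended
attribute only shows up once the comparer is taught to look at it.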
I've already explored semantic diff evasion for several months but
unfortunately haven't had time to blog about it.
---
I don't think it's a worthwhile activity to try to build security
controls on top of it; it sounds more like a code-review problem. Source
code inputs are commonly pinned by their sha256sum, so it's very clear
what should be reviewed, with no ambiguity about whether some .gitignore
is present or absent.
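As a rough sketch of what pinning by sha256sum means in practice (the
helper names and the sample bytes are hypothetical, not taken from any
particular packaging tool):

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(data: bytes, pinned_hex: str) -> bool:
    """Check that the input matches the digest recorded at review time."""
    return hmac.compare_digest(sha256_hex(data), pinned_hex.lower())


# Stand-in for the exact source tarball bytes that were reviewed.
reviewed = b"source tarball bytes as reviewed"
pin = sha256_hex(reviewed)

# The same bytes pass; any added, removed or altered byte fails.
print(verify_pinned(reviewed, pin))
print(verify_pinned(reviewed + b"extra .gitignore entry", pin))
```

Because the pin covers every byte of the input, an extra file cannot be
slipped in without changing the digest, which is why the question of what
to review has an unambiguous answer.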
cheers,
kpcyrd