-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Standardising on a format for describing affected functions. #149
Comments
@di @sethmlarson thoughts? |
My concern is I'm not sure how we'd get that information automatically, it'd be a manual process for each advisory. Then we'd have to teach pip-audit how to parse ASTs to apply it? cc @woodruffw |
Yes, it'd be manual for now, but I can imagine automation helping out with various pieces here, based on identifying fix commits and extracting the list of changed functions as a rough starting point. We've been building support for Go (via govulncheck), and Rust in osv-scanner: google/osv-scanner#476, so there is some precendent there in other ecosystems. |
Yes! In particular: in addition to being able to relay vulnerable symbols/functions, one thing that would be valuable for Python specifically is being able to signal the kind of distribution that the vulnerability is relevant for. This has come up a few times in the past, particularly with PyCA Cryptography: they've released security advisories that pertain only to wheel distributions (since those distributions include a static build of OpenSSL), and it'd be great if non-wheel users weren't unnecessarily included in those advisories. CC @alex (I could have sworn he opened an issue for this somewhere before, but I can't immediately find it) |
I don't remember if I filed a bug, or just groused a bunch :-) Tracking affected functions makes sense to me, though making use of that information may be more challenging than in a statically typed language. |
To flag a few (not exhaustive) examples of this:
I wouldn't expect a standard format here to solve either of these perfectly (they're fundamentally too dynamic), but both are worth making considerations around. In particular, for re-exports, it's probably worth nailing down whether all re-exports (potentially many!) will be listed for a vulnerable package, or whether the single original site of export gets listed (which itself might be funky, since it could be a "private"-to-public re-export). |
Hi everyone, my name is Itamar Sher and this is my first time commenting here. So hello 👋 |
Python has a spec for a fully qualified name (from Python 3.3)
The one thing I would suggest be added to that form is that packages be rooted with their pep 503 normalized name |
Yeah, IMO the full
I might be misunderstanding, but I don't think this will be necessary: the OSV finding itself will contain the package name, so including it in the path will be redundant (as well as confusing, since the package name isn't guaranteed to be a valid Python identifier or the same as the top-level module name). This also leaves open the re-export and decorator/locals issues, but I think a 99% solution is probably better to get started with 🙂 |
I think we do need a root of some sort to capture the case where the package is a single object/function.
Can you expand on this? I think maybe I've confused package name with top level identifier. I was under the impression that the 503 package name was the top level identifier, but perhaps that's just a commonality and not a rule. If I'm in error there then yes maybe the top level identifier is the correct approach. Is there a pep for this? |
Correct: it's just a convention, not a rule. The two technically don't have to be similar at all, and the normalized package name still isn't guaranteed to be a valid Python identifier. A good canonical example of this is As far as I know, there's no actual PEP stating this anywhere, it's just a quirk of how Python packaging works 🙂 As a result of this, a |
Ah yes, pillow. Good example 👍 I guess the question then is; when reading code how does one determine what the top level module is?
Just to clarify; |
Yeah, the discovery process for matching a vulnerability to an environment (or source code) would roughly be:
Hmm, I thought it did, but from a quick look >>> from pip_audit import _cli
>>> _cli.AuditOptions.__qualname__
'AuditOptions' So we'd need to go a little beyond >>> _cli.AuditOptions.__module__
'pip_audit._audit' (this has the nice effect of also canonicalizing the fully qualified name, since |
So would this be a "simple" concatenation of |
Yep, it should be "just" that 99% of the time (famous last words). It gets a little bit more complicated when the target is a module or package itself (perhaps it should never be, as a matter of policy?) but even in those cases it should be just different concatenations of Sorry for the confusion with my example 😅 -- from pip_audit import _cli
_cli.AuditOptions(...) and we would want to catch that, despite the "canonical" path for |
So
We should iterate out those cases 😄 On your example; I guess So, then my question on |
Yep, 99% of the time. There are some items in Python that don't have a It'd be good to check what Python actually says about those attributes, but it's also worth noting that they're fundamentally dynamic anyways: a misbehaving class can do
Yep, exactly. In practice, Python's lack of module visibility or explicit exporting means that any internal use of a package by itself results in re-exports that users can accidentally depend on.
This is, unfortunately, a good question with no good answer 🙂 As far as I know, Python's module system has no actual definition of "canonical" for a particular resolution path. As such, a tool/ecosystem like OSV will need to make its own determinations about how best to canonicalize paths. A few options:
(There are also further problems, like packages that have canonical paths for public APIs but intentionally use |
Oh for sure. I'm not worried about modules that are intentionally obtuse for a v1 spec. 99% will get us a lot of value for most users and we can tackle the infinite regression of edge cases for v2 (or 3 maybe 😄) But yes, lets formalize
:(
I guess for the moment we could also say that it's a data providers obligation to do a best effort canonicalization. There will be aliases in practice of course, but maybe that's an acceptable choice. I tend to prefer to match whatever a user of a package would use so that scanning tools can be more easily maintained/made to be lower noise so if there's a re-export pattern in some library but also the developer documentation advises users to use that re-export then we prefer that export. That's human guidance I know. |
Sounds good! As a starting point: we can say that these names take the form:
...where the LHS of the The RHS is a gettable attribute within the importable module, meaning that The reason I suggest When an entire module is meant to be flagged in a finding, we can probably just drop the
...which follows the same
That seems very reasonable to me. There isn't a perfect answer here, so I think anything (like this) that gets us to the 99% case is perfectly suitable 🙂 |
Any chance you can give some examples of the |
Sure: using the example above:
would become:
On the other hand, a report for an entire module would remain the same:
That's really the only change; the main reasons I suggest it are because it's what other tools in the Python packaging ecosystem do, and because it disambiguates different parts of the path well. |
Assuming we're going off https://github.com/pypa/pip-audit/blob/main/pip_audit/_audit.py |
Not through the >>> import importlib
>>> importlib.import_module("pip_audit._audit.AuditOptions")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/william/.pyenv/versions/3.11.2/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1137, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'pip_audit._audit.AuditOptions'; 'pip_audit._audit' is not a package
>>> importlib.import_module("pip_audit._audit")
<module 'pip_audit._audit' from '/Users/william/devel/pip-audit/pip_audit/_audit.py'> Similarly with the
That's why the disambiguation with |
Ah ok, so the |
Right: it's an indicator that the thing on the RHS isn't a part of the path, but is rather a class, object, function, etc. (i.e. the things that are the actual vulnerable API surface, rather than the path to the vulnerable API surface).
On one level, validating this is the same as it would be with the If you're asking how you'd check that the LHS is a valid path and the RHS is a valid attribute: you can either do it dynamically with (This is true for the |
This. Intuitively it seems easier to discover valid |
Could you elaborate a bit more on what you mean by "discover valid"? In a strict sense, For example,
And the same is true for I think the TL;DR of my position is that Python's dynamic nature makes it almost impossible to actually "validate" anything here, whether you use |
Sure. Part of my job is in populating this data and discovery of paths like Now, if the |
Thanks for explaining! I understand your position, and I'm also very sympathetic to not adding more work to an already manual process 🙂. I think I might have given you the wrong impression about how difficult this is, by talking about pathological cases: in practice, 99.9% of paths that you discover as That being said, I recognize that even that may be a dealbreaker for you. Ultimately any amount of path/API information will be beneficial here, so I'll stop hammering on |
Alright, if you're comfortable with it I'd prefer to keep the spec location based (and slightly easier 😅) for v1 with just a dot separated path to describe the location of a vulnerable object. I still need to get some time to go find the source for |
Ok, so I may have 🎉 'd a bit early 😞 First off for That said, In doing some testing to validate what we've been talking about and to ensure we have some good examples I came across a django case of (I think) a re-export which will likely matter for our use case. Using django
Which seems to be down to this line in the Sorry for the delay on getting into the weeds here but any thoughts? |
Yep, this is what I mentioned in #149 (comment) -- Python will not make any guarantees about the values of these attributes, because (for CPython) they're just
Unfortunately, I don't know of a great generalization for handling these kinds of canonical paths 😞 -- there's no particular guarantee that a given attribute has only one "canonical" path, much less that multiple modules don't use the I know I sound like a broken record at this point, but my recommendation here would be to sidestep the entire "what is a valid For getting a potential (non-exhaustive) set of "canonical" paths, one possibility would be to use TL;DR: Whether you use |
Yep, I think you're right and I just got wrapped up in my head 🤦 Ok, so we ignore the idea of canonical paths for the time being and focus instead on import path syntax. I guess first question; is there a name for that? Going back to the django example; we would then have and the two would be independent and valid paths? |
I'm not aware of an official name for this; the There may be some additional guarantees documented under
Yep 🙂 |
Hmmmm. Feature request there I guess :)
Cool 👍 |
@woodruffw sorry for dropping off on my end. To keep this going, I think we have importlib paths rooted at the top level module of the form If all that makes sense and is agreeable I think the next question is; where does this live? Is this something to make a PR for in this repo or is there a better place? |
No problem!
To clarify:
Sorry again if this comes across as pedantic, but I want to make sure we're using the same terms here. The rest makes sense to me!
My first thought would be here, or in one of the OSV repositories (maybe they already document similar syntaxes for other ecosystems? CC @oliverchang). @sethmlarson may also have opinions 🙂 |
In terms of expressing this, I had another thought: rather than creating these pseudo-paths like
This is slightly more verbose upfront, but comes with a few benefits:
|
The information for affectedness @woodruffw works for me. Is there a reason you're using {
"ecosystem_specific": {
"affects": [
{
"attrs": ["JSONField"],
"module": "django.db.models"
}
]
}
} (example assumes that Also, are we expecting to fully enumerate all places a name is re-exported or is that the job of tooling to walk an import ast and determine affectedness (tools will need to do this anyways in order to determine affectedness from source files, so this doesn't feel like a huge additional burden for an easier experience encoding advisories?) In terms of where do these choices live, we can have that live in a |
Nope,
My understanding is the latter, with the idea that the format can be a "best effort" source of initial information (e.g. if we know that a project has a few really popular re-exports, it probably makes sense to list them rather than having end user tooling have to do that lifting.) |
I'm ++ on best effort too, doesn't /have/ to be the original definition location if the API is defined a certain way (like |
Gah, sorry you're correct. What I meant by the package part is that the osv path data will be subordinate to an affected product which itself will be a pypi package. That's me conflating things though.
I do like the idea of attributes, but again I really question my ability to provide those with high fidelity. 😞 re:
I do like that as an approach to method to de-dupe paths to an affected code point. |
I think it would make sense for this to belong in this DB, and the OSV schema can link to here as the authoritative source of this. e.g. the OSV schema links to https://go.dev/security/vuln/database for the Go-specific bits (including how they identify vulnerable symbols). |
Sorry if you've already explained this upthread, but I think this should really just be a mechanical transform on your end. In other words, if you have the ability to present identifiers of the form mod_path, attr_name = "foo.bar.Baz".rsplit(".", 1)
json.dumps({"module": [mod_path], "attribute": attr_name}) In other words: presenting it in this form doesn't imply anything different about the data quality; it's just a way to de-dupe.
Yep, that sounds good (and correct) to me! |
It's mostly my own apprehension and fear honestly :) That said I will pester you with a few questions too 😉
I think
Where I think the lack of attributes can be read as a lack of refinement on the look up. eg. |
Very understandable! I think the best way to think about this is as a purely mechanical operation, so there should be no additional context or complexity on the producer side.
In that case, I would express that as:
...since a module (like |
That does help quite a bit. I missed that modules are attributes. That clears up a lot on my end and I agree with the OSV style representation too. Do we want to say also that |
Yep! That was what I was sort of going for with the original
That makes sense to me -- maybe @sethmlarson has opinions here as well. |
Just to be 100% clear I'm suggesting both where the many modules, one attribute would essentially be a short form. I can just see the single line form being useful in a lot of more conversational contexts as well as possible in some tooling settings. |
Gotcha, makes sense to me! |
Following up on pypa#149 it seems like we have general agreement on what this format should be, so I've gone ahead and kicked off the PR 🎉 I took a liberty in how to deliniate two attributes (with a `;`). Happy to change that if there's disagreement on how to delimit multiple different attributes on the same osv payload. The osv payload is explicitly called out as equivalent to the dot-colon single line format as well. I also added a brief section linking to the osv schema.
@woodruffw & @oliverchang I went ahead and kicked off a PR over in #175 since this has been hanging for a bit. Shall we move the conversation over there? |
* Add affected attribute format Following up on #149 it seems like we have general agreement on what this format should be, so I've gone ahead and kicked off the PR 🎉 I took a liberty in how to deliniate two attributes (with a `;`). Happy to change that if there's disagreement on how to delimit multiple different attributes on the same osv payload. The osv payload is explicitly called out as equivalent to the dot-colon single line format as well. I also added a brief section linking to the osv schema. * Update README.md Add json syntax for the markdown codeblock Co-authored-by: William Woodruff <[email protected]> * Update README.md Add json syntax Co-authored-by: William Woodruff <[email protected]> * Add note about starting at top level module just to be explicit * change the ImageFont/ImageFont2 example based on feedback and make json examples a little more explicit * Update README.md Co-authored-by: Oliver Chang <[email protected]> * Update README.md Co-authored-by: Oliver Chang <[email protected]> --------- Co-authored-by: William Woodruff <[email protected]> Co-authored-by: Oliver Chang <[email protected]>
Context: ossf/osv-schema#127
Summary: Many ecosystems/databases are starting to encode more fine grained, "vulnerable symbol" information in advisories.
Go has defined their own
ecosystem_specific
format for describing affected functions: https://go.dev/security/vuln/database#schema. Most vulns in the Go vuln DB have this.RustSec has this in their security advisories: https://github.com/rustsec/advisory-db/blob/ee74d43ec2c6203750e38a254f4029f6181c4362/crates/RUSTSEC-2021-0027.json#L30
GHSA also has their own: https://github.com/github/advisory-database/blob/046aafb21a71eadbe8bdd5a72b1491a5f1209fb7/advisories/github-reviewed/2023/07/GHSA-57fc-8q82-gfp3/GHSA-57fc-8q82-gfp3.json#L25
Given that ecosystems have their own nuances for describing these accurately (e.g. OS, arch specifiers), it makes more sense for ecosystems to each define their own
ecosystem_specific
field. We should define anecosystem_specific
for Python as well as part of this DB.The text was updated successfully, but these errors were encountered: