Python utility for tracking third party dependencies within a library (github.com/ibm)
189 points by prashantgupta24 on May 25, 2022 | 21 comments



Tried this on one of my projects, it's neat.

    python3 -m import_tracker --name datasette --recursive | jq
    {
      "datasette": [
        "aiofiles",
        "click",
        "markupsafe",
        "mergedeep",
        "pluggy",
        "yaml"
      ],
      "datasette.version": [],
      "datasette.utils.shutil_backport": [
        "click",
        "markupsafe",
        "mergedeep",
        "yaml"
      ],
      "datasette.utils.sqlite": [
        "click",
        "markupsafe",
        "mergedeep",
        "yaml"
      ],
      "datasette.utils": [
        "click",
        "markupsafe",
        "mergedeep",
        "yaml"
      ],
      "datasette.utils.asgi": [
        "aiofiles",
        "click",
        "markupsafe",
        "mergedeep",
        "yaml"
      ],
      "datasette.hookspecs": [
        "aiofiles",
        "click",
        "markupsafe",
        "mergedeep",
        "pluggy",
        "yaml"
      ]
    }
Related tool: pipdeptree - here's the output from that against a project that installs a lot of extra stuff: https://github.com/simonw/latest-datasette-with-all-plugins/...


Is this the equivalent of Poetry's `poetry show --tree`?


This looks like a really useful tool for large projects; it can be quite easy to lose track of what a specific dependency is used for. I also like the idea of making an import lazy, so in a monolithic app you could have a deployment that excludes some functionality and excludes its dependencies.
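
(For reference, a generic way to get lazy imports with just the standard library — not specific to this tool — is importlib.util.LazyLoader, roughly following the recipe from the importlib docs:)

    import importlib.util
    import sys

    def lazy_import(name):
        # Create the module object now, but defer running its code
        # until the first attribute access.
        spec = importlib.util.find_spec(name)
        loader = importlib.util.LazyLoader(spec.loader)
        spec.loader = loader
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        loader.exec_module(module)
        return module

    np = lazy_import("numpy")  # nothing heavy has been imported yet
    print(np.pi)               # first attribute access triggers the real import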

When I read the title I was hoping for something else, though: what I would love is a tool that logs and potentially blocks unexpected IO operations on a per-library basis. With the increasingly common supply chain attacks we are seeing (there was a PyPI one just the other day), having a way to at least report on unexpected activity, if not help prevent it, would be brilliant. Has anyone ever found a tool like that?

(Obviously the ultimate solution would be an outbound firewall, but it seems that although you can easily do this in a VM or on bare metal, I haven't seen any PaaS platforms with that sort of capability.)


> When I read the title I was hoping for something else, though: what I would love is a tool that logs and potentially blocks unexpected IO operations on a per-library basis. With the increasingly common supply chain attacks we are seeing (there was a PyPI one just the other day), having a way to at least report on unexpected activity, if not help prevent it, would be brilliant. Has anyone ever found a tool like that?

You could do something close to that with Python's audit hooks, which were introduced in 3.8 [1]. One massive caveat: audit hooks can be disabled by an attacker with the ability to control the interpreter, and they are not perfect (there are plenty of things they don't cover).
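
For instance, here's a minimal sketch with sys.addaudithook that just logs a few sensitive events (the event names come from CPython's audit events table; this is illustrative, not a complete or tamper-proof policy):

    import sys

    WATCHED = {"socket.connect", "subprocess.Popen", "open"}

    def hook(event, args):
        # Log events of interest; raising an exception here instead
        # would block the operation for cooperative code.
        if event in WATCHED:
            print(f"[audit] {event}: {args!r}", file=sys.stderr)

    sys.addaudithook(hook)

    import urllib.request
    urllib.request.urlopen("https://example.com")  # fires socket.connect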

(More generally: this kind of auditing/restriction falls under the umbrella of "capability management." OpenBSD's pledge[2] is another example.)

[1]: https://peps.python.org/pep-0578/

[2]: https://man.openbsd.org/pledge.2


https://github.com/ossillate-inc/packj analyzes Python/NPM packages for risky code and metadata attributes using static code analysis. We found a bunch of malicious packages on PyPI with the tool, which have since been taken down; examples: https://packj.dev/malware [disclosure: I'm one of the developers]


You can use Syft [1], which generates a full software bill of materials, including package names and licenses, for a broad set of tech stacks ranging from the OS level (Alpine, Debian) through Go, Ruby, Python, Java, JavaScript, etc.

[1] https://github.com/anchore/syft


Since this is about Python specifically, I'll go ahead and highlight `pip-audit`[1] as a specialized tool for generating Python SBOMs and running audits against the official PyPI vulnerability feed.

FD: My company, my work.

[1]: https://github.com/trailofbits/pip-audit


This looks really neat. One thing I noticed on reading the source code is that it appears to actually import the modules:

Quoting the docstring on the `track_module` function:

    """This function executes the tracking of a single module by launching a
    subprocess to execute this module against the target module. The
    implementation of thie tracking resides in the __main__ in order to
    carefully control the import ecosystem.
Source: https://github.com/IBM/import-tracker/blob/67a1e84e5a609e52e...

Here's the actual subprocess call: https://github.com/IBM/import-tracker/blob/67a1e84e5a609e52e...

    # Launch the process
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, env=env)
I think this is clever, and maybe even necessary, but feels risky to do on unaudited third-party Python libraries.

Maybe I'm misunderstanding something?


> I think this is clever, and maybe even necessary, but feels risky to do on unaudited third-party Python libraries.

This is why my coworker built the project he called "dowsing"; it tries to understand as much as possible from the setup.py's AST, without actually executing it.

https://github.com/python-packaging/dowsing
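
(As a rough sketch of the idea — a toy illustration, not dowsing's actual implementation — you can often pull a literal install_requires list out of a setup.py with the ast module, without ever executing the file:)

    import ast

    tree = ast.parse(open("setup.py").read())
    for node in ast.walk(tree):
        # Match both setup(...) and setuptools.setup(...)
        if isinstance(node, ast.Call) and (
            getattr(node.func, "id", None) == "setup"
            or getattr(node.func, "attr", None) == "setup"
        ):
            for kw in node.keywords:
                if kw.arg == "install_requires":
                    # Only handles the easy case where the list is a
                    # literal; dowsing deals with much more than this.
                    print(ast.literal_eval(kw.value))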


Neat, I'll take a look! I thought I was going to need to write something similar!


Hi, I'm the main author of import_tracker. Thanks for taking the time to dig into it! It's a really interesting point that the subprocess.Popen could itself be a security concern. The command being executed runs the __main__ of the import_tracker library itself (which is not something a user can configure), so is your concern that import_tracker itself is untrusted and might be a concern for users running this on their machines?

For context on why I'm using the subprocess here: this allows the tracking to correctly allocate dependencies that are imported more than once (think my_lib.submod1 and my_lib.submod2 both needing tensorflow, but my_lib.submod3 not).
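
For that hypothetical my_lib, the output would look roughly like this (illustrative, not real output):

    python3 -m import_tracker --name my_lib --recursive | jq
    {
      "my_lib.submod1": ["tensorflow"],
      "my_lib.submod2": ["tensorflow"],
      "my_lib.submod3": []
    }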


Hi! I think that, in my cursory reading, I misunderstood what the code is doing. I thought it was importing the module you're trying to analyze... I'll have to read more closely when I have some spare time.


Makes sense! I think the commenter below correctly identified the true security concern here, which is importing arbitrary Python libraries. As is, import_tracker doesn't attempt to solve this problem (though it's an interesting one to consider for this or a similar library). Please feel free to reach out with any other questions if you're curious.


No, you understood it correctly. Indeed, by importing Python code you execute Python code, so there could be an execution path for malicious code to run.

FYI, pylint does something similar for native-code extension modules (unless this changed in the past few years): it imports them dynamically!

EDIT: after reading the code more closely and reading the rest of the comments: more precisely, it's not the subprocess call itself, but rather the import of an arbitrary Python module, that could be a path for code execution. But this is the case with Python generally: importing a module executes code, so even just importing (not otherwise executing) an untrusted module could be problematic.


Yep, this is spot on. As written, import_tracker does indeed do a dynamic import of the library in question and you're right that this introduces the possibility of arbitrary code execution. Currently, import_tracker is designed for library authors where the library in question is a trusted library that has dependency sprawl.

It's a very interesting use case to consider how a similar solution could work as a sandbox for investigating supply chain concerns with third-party libraries that have transitive dependencies. I think some of the static analysis tools referenced in other comments would address this better since the real concern there is detecting the presence of transitive dependencies which may be malicious as opposed to identifying exactly where in the target library those dependencies are used.


Related: "Probably the most complete python dependency database" - https://github.com/DavHau/pypi-deps-db

> This data allows deterministic dependency resolution which is required by mach-nix[1] to generate reproducible python environments.

[1]: https://github.com/DavHau/mach-nix/


This tool reminded me that a few years ago I created a helper web utility that let me search Python libraries and get a tree view of their dependencies, plus some license info. I had to do a lot of manual Python library compliance work before we had tools like Blackduck.

It accepts dependencies in requirements.txt format (e.g. Django==3.1 or tensorflow): https://pydepchecker.z33.web.core.windows.net/

It's got a few shortcomings: dependency resolution in Python is pretty difficult to work out when you've got a lot of libraries with common dependencies, and the license info on PyPI isn't always correct. But it's always been a quick, useful tool for me.


The couple of times I have done something similar I have found an odd outcome: the first was internal (a large company codebase) and the second was npm imports years ago. Both times you ended up pulling in huge numbers of dependencies (900+ for npm, 600 or so internal).

The point was that pretty much no matter what starting point you used, you pulled in much the same amount. There was a common core, but even so it was like a starfish: if you start at the tip of one limb, you pull in that limb and the core; start on another limb, same thing.

But all the limbs are about the same size.

It's just anecdata, but it has been at the back of my mind as some kind of rule.


Could I use the lazy import to define a single set of dependencies of a monorepo and then load only the required subset for each project?



That’s how Bazel Python works



