The key thing is interpretation of the diff. Is there a difference because they ran some code generator and the crate contains generated code not present in the repo, or did they add a backdoor?
Most of the diffs are probably innocuous. I suspect the most common diff would be the version line of Cargo.toml, both from CI that automatically updates that line and from people who forgot to update it before tagging a release in git.
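To get a feel for what those diffs actually contain, here is a rough sketch of the comparison step itself. The crate name, version, and repo path are placeholders; the download URL is the standard crates.io endpoint.

    # Rough sketch: download a published .crate and diff it against a local
    # checkout of the repo at the matching tag. Crate name/version and the
    # repo path are placeholders, not real project details.
    import io, tarfile, subprocess, urllib.request

    CRATE, VERSION = "example-crate", "1.2.3"   # hypothetical
    REPO_CHECKOUT = "/path/to/repo-at-v1.2.3"   # hypothetical

    url = f"https://crates.io/api/v1/crates/{CRATE}/{VERSION}/download"
    data = urllib.request.urlopen(url).read()

    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        tar.extractall("crate-unpacked")

    # git diff --no-index compares two arbitrary directories; the expectation
    # is that most hits are generated files or the Cargo.toml version line.
    subprocess.run([
        "git", "diff", "--no-index", "--stat",
        REPO_CHECKOUT, f"crate-unpacked/{CRATE}-{VERSION}",
    ])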
Heavily downvoted, which is fair because I didn't really explain what I meant, which was: would using LLMs to parse the generated diffs, as a first pass, be useful/efficient for spotting and interpreting discrepancies?
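Something like this is what I had in mind, as a sketch only: the model name and prompt are made up, and it only orders the review queue, it doesn't replace a human looking at anything flagged.

    # First-pass triage sketch: ask an LLM to label each diff hunk.
    # Anything not clearly a version bump or generated code goes to a human.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = (
        "You are reviewing the diff between a published crates.io package "
        "and its source repository. Classify the hunk as one of: "
        "VERSION_BUMP, GENERATED_CODE, SUSPICIOUS, UNCLEAR. "
        "Answer with the label only.\n\nHunk:\n"
    )

    def triage(hunk: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            messages=[{"role": "user", "content": PROMPT + hunk}],
        )
        return resp.choices[0].message.content.strip()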
I don't think this is a relevant take. The goal is to implement a system that automatically scans countless packages and runs a heuristic to decide whether a package is suspicious. You're complaining about false positives/false negatives while ignoring that not checking packages at all is not an improvement either.
Personally I think using LLMs to scan is a good idea, but one honest negative is the potential false sense of security. I think LLMs here are useful for finding unintentional security flaws; I don't think they're a great tool for finding intentional ones a la the xz situation. People might be less inclined to dig into the code directly if it has been stamped with a green check mark by a GPT.
Using machine learning, including LLMs, to detect and mitigate malicious code is of interest to a whole lot of people smarter than me, which really suggests your flippant rejection of their potential is premature.
It could work for classifying honest/innocent differences.
However, LLMs are incredibly naive, so they could be easily fooled by a malicious actor (probably as easily as adding a comment saying this is definitely NOT a backdoor).
The next step would be reproducible builds (if that's not already the case).
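A sketch of what that verification could look like, assuming `cargo package` as the rebuild step. Crate name and paths are placeholders, and `cargo package` output is not guaranteed to be byte-identical across environments, so a mismatch only means "go look at the diff", not an automatic red flag.

    # Reproducibility check sketch: rebuild the .crate from the repo tag and
    # compare its sha256 with the published artifact on crates.io.
    import hashlib, subprocess, urllib.request

    CRATE, VERSION = "example-crate", "1.2.3"   # hypothetical
    REPO_CHECKOUT = "/path/to/repo-at-v1.2.3"   # hypothetical

    # Build the tarball from the checkout (skip the verification build).
    subprocess.run(["cargo", "package", "--allow-dirty", "--no-verify"],
                   cwd=REPO_CHECKOUT, check=True)

    def sha256(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    with open(f"{REPO_CHECKOUT}/target/package/{CRATE}-{VERSION}.crate", "rb") as f:
        local = sha256(f.read())

    url = f"https://crates.io/api/v1/crates/{CRATE}/{VERSION}/download"
    published = sha256(urllib.request.urlopen(url).read())

    print("match" if local == published else "mismatch -- inspect the diff")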