The general problem here is not solvable with technology if there is no universally agreed definition for “a work” - and there isn’t (this touches on some profound issues of ontology).
And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.
You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).
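As a rough illustration of the competing-curations idea, here is a minimal sketch of a hash-to-work index in which each curator records their own opinion and each consumer decides whose opinion to trust first. The names `CurationIndex`, `assert_work`, `work_for`, and `opinions` are made up for this example, and SHA-256 is just one plausible choice of file hash.

```python
import hashlib
from collections import defaultdict


def file_hash(data: bytes) -> str:
    """Hex SHA-256 digest of a file's bytes (the hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class CurationIndex:
    """Hypothetical store of per-curator opinions about which work a file is."""

    def __init__(self):
        # curator name -> {file hash -> work label}
        self._by_curator = defaultdict(dict)

    def assert_work(self, curator: str, digest: str, work: str) -> None:
        """Record one curator's opinion that this file belongs to this work."""
        self._by_curator[curator][digest] = work

    def work_for(self, digest: str, trusted: list):
        """Return (curator, work) from the first trusted curator with an opinion."""
        for curator in trusted:
            work = self._by_curator.get(curator, {}).get(digest)
            if work is not None:
                return curator, work
        return None

    def opinions(self, digest: str) -> dict:
        """All curators' opinions about one file, for consumers who want to compare."""
        return {c: m[digest] for c, m in self._by_curator.items() if digest in m}
```

A consumer who values the literary expert would call `work_for(digest, ["literary_expert", "someone_else"])`; another consumer can pass a different order and get a different, equally valid answer for the same file.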
All analogies melt if they're pushed loudly enough. And all models are wrong, though some are useful.
The notion of a work has utility: it respects the notion of different forms, variations, and evolution over time. If you're looking at, say, multiple editions of a book, or something much more dynamic such as source code or a Wiki entry, then yes, the specific content may change at any point and passes through many versions, but those versions are connected through edit events. A good revision control system will capture much of that, if the history interests you.
Ultimately, I'd argue that "work" is defined in relationships and behaviours. A record intermediates between author(s) and reader(s) (or listeners, viewers, etc.), concerning some informational phenomenon, perhaps fictional, perhaps factual, perhaps itself an action (as in a marriage record, divorce decree, or court decision). The work in its total context matters. (At which point we discover most works have very little context...).
The file-hashes-to-work mapping is all but certain to play a large role, but even that is only a means of indicating a relationship that is established by some other means.
The notion of selecting an arbitrary set of ngram tuples to establish a highly probable relationship is likely to remain at least one of those means.
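For a sense of what the ngram approach might look like, here is a minimal sketch that compares two texts by the overlap of their word trigrams. It is a simplification: it compares every ngram rather than an arbitrary sample, and the function names, the Jaccard measure, and the 0.5 threshold are all assumptions made for illustration rather than anything established.

```python
import re


def word_ngrams(text: str, n: int = 3) -> set:
    """Set of lowercase word n-gram tuples found in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared n-grams divided by all distinct n-grams; 0.0 if both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def probably_same_work(text_a: str, text_b: str,
                       n: int = 3, threshold: float = 0.5) -> bool:
    """Guess that two texts are related if their n-gram overlap is high enough."""
    return jaccard(word_ngrams(text_a, n), word_ngrams(text_b, n)) >= threshold
```

A score like this only suggests a relationship. Deciding whether two overlapping files are the same work, or one derived from the other, is still left to whoever curates the mapping.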
And yes, the incremental / tuned approach is also likely a useful notion.
Paul Otlet had a lot to say about "documents", though I think "records" is a better term for what he had in mind, namely any persistent symbolic artefact: book, painting, music, photograph, film, etc.