It's a bit of an itch I've been scratching for a few years.
Most especially, given two or more instances of what you suspect to be the same or a substantively similar work, how can you assess this in a robust and format-independent manner, programmatically?
For works with well-formed metadata, this isn't an issue.
For identical duplicate copies of the same file, a hash is effective.
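To make the easy case concrete, here is a minimal sketch of hash-based duplicate detection using SHA-256 from Python's standard library. The function names are my own, for illustration only. It also shows why this breaks down: a single added newline gives a completely different digest.

```python
import hashlib


def bytes_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an in-memory byte string."""
    return hashlib.sha256(data).hexdigest()


def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Byte-identical content always hashes to the same value ...
same = bytes_digest(b"The Reference Document") == bytes_digest(b"The Reference Document")

# ... but any change at all, even a trailing newline, yields a different digest.
differs = bytes_digest(b"The Reference Document") != bytes_digest(b"The Reference Document\n")
```

So a hash answers "are these the same bytes?" and nothing more.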
But for the circumstance most often encountered in reality --- different forms and formats derived from different sources but containing substantially the same work --- there is no simple solution of which I'm aware. As examples, say you have a reference document The Reference Document.
How do I determine that:
- An ASCII-only text file
- Markdown, HTML, DocBook, and LaTeX sources
- PDF, MS Word (which version?), PS, DjVu, ePub, or .mobi files (sling in any other formats you care to mention).
- Hardbound and paperback physical copies
- Scans made from the same or different physical books or instances, versions, and/or translations.
- Audiobooks based on a work. By the same or different readers.
- Dramatic performances, films, video series, comic-book adaptations, etc., of a work. (Say: Hamlet or Romeo and Juliet. What is the relationship to "West Side Story" (and which version), or Pyramus and Thisbe?)
- Re-typed or OCRed text
... all refer to the same work?
How do you define "work"?
How do you define "differences between works"?
How do you distinguish intentional, accidental, and incidental differences between instances? (Say: translations, errata, corrections, additions for the one, transcription errors for the second, and scanning or rendering artefacts for the third.)
If you're working in an environment in which instances of works come from different sources with different provenances, these questions arise. At least some of these questions are prominent in library science itself. It's the technical mapping of digitised formats I'm focusing on most closely, so the physical instantiations aren't as critical here, though the presumption is that these could be converted to some machine-readable form.
In bibliographic / library science, the terms are "work, expression, manifestation".
The general problem here is not solvable with technology if there is no universally agreed definition for “a work” - and there isn’t (this touches on some profound issues of ontology).
And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.
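As a rough sketch of what such a curated mapping might look like as a data structure: each curator keeps their own table from file hash to work label, and a consumer can ask any curation whether two files are "the same work". The class, the curator names, and the placeholder hash strings are all hypothetical, purely for illustration.

```python
class Curation:
    """One curator's opinion about which file hashes belong to which work."""

    def __init__(self, curator):
        self.curator = curator
        self._work_of = {}  # file hash -> work label, as this curator sees it

    def assign(self, file_hash, work):
        self._work_of[file_hash] = work

    def work_of(self, file_hash):
        # None means this curator has expressed no opinion on the file.
        return self._work_of.get(file_hash)

    def same_work(self, hash_a, hash_b):
        work_a, work_b = self.work_of(hash_a), self.work_of(hash_b)
        return work_a is not None and work_a == work_b


def opinions(curations, file_hash):
    """Collect every curator's view of a single file."""
    return {c.curator: c.work_of(file_hash) for c in curations}


# Placeholder strings stand in for real file hashes.
strict = Curation("strict-curator")
strict.assign("hash-rj-pdf", "Romeo and Juliet")
strict.assign("hash-wss-film", "West Side Story")

loose = Curation("loose-curator")
loose.assign("hash-rj-pdf", "Romeo and Juliet")
loose.assign("hash-wss-film", "Romeo and Juliet")  # treated as an adaptation of the same work
```

Here `strict.same_work("hash-rj-pdf", "hash-wss-film")` is false while the same call on `loose` is true, which is the point: the mapping records an opinion, and consumers choose whose opinion to trust.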
You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).
All analogies melt if they're pushed loudly enough. And all models are wrong, though some are useful.
The notion of a work has utility: it respects the notion of different forms, variations, and evolution with time. If you're looking at, say, multiple editions of a book, or even of something much more dynamic, say, source code or a Wiki entry, yes, the specific content may change at any point and pass through many versions, but those versions are connected through edit events. A good revision control system will capture much of that, if the history interests you.
Ultimately, I'd argue that "work" is defined in relationships and behaviours. A record intermediates between author(s) and reader(s) (or listeners, viewers, etc.), concerning some informational phenomenon, perhaps fictional, perhaps factual, perhaps itself an action (as in a marriage record, divorce decree, or court decision). The work in its total context matters. (At which point we discover most works have very little context...).
The file-hashes-to-work mapping is all but certain to play a large role, but even that is only a means of indicating a relationship that is established by some other means.
The notion of selecting an arbitrary set of n-gram tuples to establish a highly probable relationship is likely to remain at least one of those means.
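A minimal sketch of that n-gram idea, assuming both instances have already been reduced to plain text: take word n-grams from one text, pick an arbitrary (seeded random) sample of them, and report what fraction also occur in the other text. The function names, the choice of word-level 5-grams, and the sample size are my assumptions, not an established method.

```python
import random
import re


def word_ngrams(text, n=5):
    """Return the set of word n-grams, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def sampled_overlap(text_a, text_b, n=5, k=20, seed=0):
    """Fraction of up to k sampled n-grams from text_a that also occur in text_b."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    if not grams_a:
        return 0.0  # text_a is too short to yield any n-gram
    rng = random.Random(seed)
    # Sort first so that the sample is reproducible for a given seed.
    sample = rng.sample(sorted(grams_a), min(k, len(grams_a)))
    hits = sum(1 for gram in sample if gram in grams_b)
    return hits / len(sample)


original = "To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer"
retyped = "TO BE OR NOT TO BE -- that is the question; whether 'tis nobler in the mind to suffer"
unrelated = "It was the best of times, it was the worst of times, it was the age of wisdom"

score_same = sampled_overlap(original, retyped)
score_other = sampled_overlap(original, unrelated)
```

Differences in case and punctuation vanish in the tokenising step, so the retyped copy scores a full match and the unrelated text scores nothing. OCR or transcription errors would lower the score rather than zero it, and where to set the threshold is a tuning question.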
And yes, the incremental / tuned approach is also likely a useful notion.
Paul Otlet had a lot to say about "documents", though I think "records" is a better term for what he had in mind, as any persistent symbolic artefact: book, painting, music, photograph, film, etc.
I have been dealing with the same problem for curating resources at https://learnawesome.org. Projects like Openlibrary do collect unique identifiers for _books_, but for everything else, it mostly takes manual effort. For example, I collect talks/podcasts by the author where they discuss ideas from their books. Then there are summaries written by others.
There's a lot of work toward this in library space, though it takes some adaptation to new media formats. Paul Otlet worked in a paper-only medium in the early 20th century but also has some excellent thinking. His books are now seeing translation from French. The Internet Archive and Library of Congress are also doing a lot of relevant work; the WARC format is one example.
What's particularly relevant now are ephemeral and/or continuously updated online content --- and not just the WWW (http/https), but other protocols (ftp, gemini, ipfs, torrents, ...), as well as apps.
A working truism I developed was that "identity is search that produces a single result". So if you can come up with something that uniquely identifies a work, then that can be a working identifier. I typically focus on what can be reasonably assessed of author, title, publication date, publisher (traditional, website/domain), and failing that, descriptive text. Remember that originally titles were simply the introductory lines of works (a practice that remains in use in some cases, such as the names of church masses or prayers, e.g., "Kyrie Eleison").
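Here is a small sketch of that truism, assuming a catalogue is just a list of records with author, title, and date fields. Fields are added to the query one at a time until the search returns exactly one record, and that set of fields becomes the working identifier. The helper names and the tiny catalogue are illustrative only.

```python
def find(catalogue, **criteria):
    """Return the records whose fields match every criterion (case-insensitive)."""
    def matches(record):
        return all(
            str(record.get(field, "")).lower() == str(value).lower()
            for field, value in criteria.items()
        )
    return [record for record in catalogue if matches(record)]


def working_identifier(catalogue, record, fields=("author", "title", "date")):
    """Add fields in order until the search yields a single result."""
    criteria = {}
    for field in fields:
        if field not in record:
            continue
        criteria[field] = record[field]
        if len(find(catalogue, **criteria)) == 1:
            return criteria
    return None  # these fields cannot single out the record


catalogue = [
    {"author": "William Shakespeare", "title": "Hamlet", "date": 1603},
    {"author": "William Shakespeare", "title": "Hamlet", "date": 1623},
    {"author": "William Shakespeare", "title": "Romeo and Juliet", "date": 1597},
]

hamlet_id = working_identifier(catalogue, catalogue[1])
romeo_id = working_identifier(catalogue, catalogue[2])
```

For the second Hamlet record all three fields are needed, while author plus title is already enough for Romeo and Juliet. Whether the identifier should stop at the work (title) or go down to the edition (date) is exactly the "what is a work" question again.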
In bibliographic / library science, the term is "work, expression, manifestation"
https://www.loc.gov/marc/marbi/2011/2011-dp03.html