It's a bit of an itch I've been scratching for a few years. Most especially, *gi...

jl6 · on June 10, 2021

The general problem here is not solvable with technology if there is no universally agreed definition for “a work” - and there isn’t (this touches on some profound issues of ontology).

And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.

You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).

dredmorbius · on June 10, 2021

Yes, it's complicated.

All analogies melt if they're pushed loudly enough. And all models are wrong, though some are useful.

The notion of a work has utility, it respects the notion of different forms, variations, and evolution with time. If you're looking at, say, multiple editions of a book, or even of something much more dynamic, say, source code or a Wiki entry, yes the specific content may change at any point, and stands through many versions, but those are connected through edit events. A good revision control system will capture much of that, if the history interests you.

Ultimately, I'd argue that "work" is defined in relationships and behaviours. A record intermediates between author(s) and reader(s) (or listeners, viewers, etc.), concerning some informational phenomenon, perhaps fictional, perhaps factual, perhaps itself an action (as in a marriage record, divorce decree, or court decision). The work in its total context matters. (At which point we discover most works have very little context...).

The file-hashes-to-work mapping is all but certain to play a large role, but even that is only a means of indicating a relationship that is established by some other means.

The notion of selecting an arbitrary set of ngram tuples to establish highly probable relationsip is likely to remain at least one of those means.

And yes, the incremental / tuned approach is also likely a useful notion.

Paul Otlet had a lot to say about "documents", though I think "records" is a better term for what he had in mind, as any persistent symbolic artefact: book, painting, music, photograph, film, etc.

mathnmusic · on June 10, 2021

I have been dealing with the same problem for curating resources at https://learnawesome.org. Projects like Openlibrary do collect unique identifiers for _books_, but for everything else, it mostly takes manual effort. For example, I collect talks/podcasts by the author where they discuss ideas from their books. Then there are summaries written by others.

dredmorbius · on June 10, 2021

There's a lot of work toward this in library space, though it takes some adaptation to new media formats. Paul Otlet worked in a paper-only medium in the early 20th century but also has some excellent thinking. His books are now seeing translation from French. The Internet Archive and Library of Congress are also doing a lot of relevant work, see the WARC format as an example.

What's particularly relevant now are ephemeral and/or continuously updated online content --- and not just the WWW (http/https), but other protocols (ftp, gemini, ipfs, torrents, ...), as well as apps.

A working truism I developed was that "identity is search that produces a single result". So if you can come up with something that uniquely identifies a work, then that can be a working identifier. I typically focus on what can be reasonably assessed of author, title, publication date, publisher (traditional, website/domain), and failing that, descriptive text. Remember that originally titles were simply the introductory lines of works (a practice that remains used in some cases, e.g., the names of church masses or prayers, e.g., "Kyrie Eleison").

The Superintendent of Documents (SuDoc) Classification Scheme (used by the US goverment and GAO) and operates by agency, type of publication, and further divisions, as well as date/year. https://www.fdlp.gov/about-fdlp/22-services/929-sudoc-classi...