According to the paper, the dataset goes up to 1978 because that's when copyright law was updated to automatically apply to newswires. It's unfortunate that we got into the situation where academia has to play by the rules wrt copyright while big private labs flout it.
This is, unfortunately, how our (US) justice system works.
The entire concept of an "NDA" has been bastardized in this manner, if you think about it. Conceptually you may think of an NDA as protecting sensitive data from disclosure, a sort of intellectual property right. However, it's been co-opted by folks to do nothing more than cover up inconvenient truths because they realize most people cannot afford to either (a) give up money they are promised in the future or (b) bankrupt themselves in their own defense.
So it's basically a game where whoever has the most money can ensure their narrative wins out in the end, because competing narratives can simply be "bought out".
The text states: "All code used to create Newswire is available at our public Github repository." But a check shows "(This repository is empty.)" So not yet open science.
Yes, the OCR problem is also very interesting to me. I have been looking into the OCR/labeling field recently, and it seems to still be an actively researched area. There is surya-ocr, recently posted to YC: it's transformer-based, but still expensive to run on a really large dataset like 100 years of newspapers. Tesseract doesn't seem to handle this kind of material very well. In the paper, they just mention using an active-learning type of method.
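For anyone who wants to poke at this themselves, here is a minimal sketch of running Tesseract on a single scan via pytesseract; the file path and the light preprocessing are just placeholders, not anything from the paper's pipeline:

```python
# Minimal OCR sketch: run Tesseract on one newspaper scan.
# Assumes the tesseract binary plus the pytesseract and Pillow packages are installed;
# "scan.png" is a placeholder path, not a file from the Newswire pipeline.
from PIL import Image, ImageOps
import pytesseract

image = Image.open("scan.png")
# Grayscale + autocontrast is a cheap preprocessing step that often helps on old microfilm scans.
image = ImageOps.autocontrast(ImageOps.grayscale(image))

text = pytesseract.image_to_string(image, lang="eng")
print(text)
```

Even with that, layout detection (columns, headlines, jump lines) is usually the harder part on newspaper pages, which is presumably why the authors needed something more involved.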
Exciting work. I hope they consider releasing the 138 million scans.
The FAQ at the end is a bit confusing, because it says:
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

All data is in the dataset.
and for each record there are "newspaper_metadata", "year", "date", and "article" fields that link back to _a_ LoC newspaper scan.
I stress _a_, singular, because much is made of their process for identifying articles with multiple reprints and multiple scans across multiple newspapers, as these repetitions of content (varied to a degree by local sub-editors) with varying layouts are used to make the conversion of the scans more robust.
I haven't investigated whether every duplicate article gets a separate record linking to a distinct scan source; a rough sketch of how one might check is below.
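If someone does want to check, something along these lines should do it, assuming the dataset is published on the Hugging Face hub; the repository id below is my guess at the name, not something taken from the paper:

```python
# Rough duplicate check: do reprinted articles appear as separate records?
# The dataset id is an assumption; substitute whatever id the authors actually use.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("dell-research-harvard/newswire", split="train")  # hypothetical id
print(ds.column_names)  # expect fields like "newspaper_metadata", "year", "date", "article"

# Count records sharing an identical article text.
counts = Counter(ds["article"])
print([(text[:60], n) for text, n in counts.most_common(10) if n > 1])
```

If the counts all come back as 1, the reprints were merged into single records; if not, each reprint links to its own scan.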
I've wondered for a while whether a new kind of news could be fashioned from the events of the world. Basically, if I were to try to define which items are most newsworthy, it'd be along the lines of what (according to my pop-culture understanding) Claude Shannon described: the items that are most "surprising". But while we normally treat surprise as a subjective thing that affects only one person, like an item that most conflicts with my current understanding and requires me to update my own model the most, we'd instead apply it to very large groups of people, or whatever group the "news" is being tuned to.
So, which items of news cause the most "change" (or surprise) for that group of people? All our understandings and notions of truth rest on long chains of reasoning built from premises and values. When a new event changes a premise our understanding rests on, which events cause the largest changes in those DAGs?
We're regularly subjected to "news" that doesn't change anything very much, while subtle events of deep impact are regularly missed. Maybe it would be a way to surface those things. I wonder if someone smarter than me could analyze a data set such as this and come up with a revised set of headlines and articles over the years that do the best job of communicating the most important changes.
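In case it helps anyone run with this: Shannon's surprisal of an item is just the negative log of the probability the audience assigned to it beforehand, so the crudest version of the idea is "rank stories by how improbable the audience found them". A toy sketch, with made-up probabilities standing in for whatever model of the group's expectations you'd actually need:

```python
# Toy "newsworthiness as surprisal" sketch.
# p is the probability the target audience assigned to the event before it happened;
# the numbers here are invented for illustration, not derived from any real model.
import math

def surprisal_bits(p: float) -> float:
    """Shannon surprisal in bits: -log2(p)."""
    return -math.log2(p)

headlines = {
    "Incumbent wins expected landslide": 0.90,                # widely anticipated -> low surprise
    "Obscure rule change with big downstream effects": 0.05,  # rarely anticipated -> high surprise
}

for headline, p in sorted(headlines.items(), key=lambda kv: surprisal_bits(kv[1]), reverse=True):
    print(f"{surprisal_bits(p):5.2f} bits  {headline}")
```

The hard part, of course, is estimating those probabilities for a real group of people rather than plugging in guesses.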
It's an interesting idea...most news outlets are pretty similar, but with variance in a few (of the same) dimensions. A radical change could be to vary along new and more dimensions, and in new ways, but also to introduce variance (or novelty) to the shared axioms it all sits on top of (~"the reality").
If it was all powered by AI I think you could get some really interesting results. I bet it would evoke extremely strong emotions in humans; they tend not to like their "reality" being messed with (well... in ways other than those they have become accustomed to; those they seem extremely fond of, and defend passionately).
Another angle no one's run with in any serious way would be a sort of meta-journalism, again with the techniques described above. I think a well-done implementation could steal/borrow 50% of Trump's base and 30% of the Democrats. Actually, with different variations of the parameters I'd think you could get a wide range of outcomes; it's often hard to know in advance what will strike a chord with people, and some of the weirdest things work like a charm.
Really good stuff. On the ground, in the weeds data collection and collation like this is sadly often thankless (and poorly funded) work, but it's a huge force multiplier to downstream research and should always be commended.
But are you correcting the OCR or miscorrecting the originals?
I want original text, including misspellings, and original regional / historical spellings, including slang (which may look like another word, but is not, and isn't in a dictionary).
You cannot fix OCR text without looking at the original.
With the spelling having been fixed, even if imperfectly, you could much more easily search for content and find relevant results, and then go on to look at the originals.
What you want is still possible, unless you unreasonably make it a requirement that the transcriptions should be perfect.
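To make that concrete, here is a toy comparison using only the Python standard library: an exact substring search misses the noisy OCR line but matches the corrected one, while a fuzzy ratio still flags the noisy version (both strings are invented examples, not taken from the dataset):

```python
# Toy illustration: searching raw OCR text vs. corrected text.
# Both example strings are invented; difflib is the standard-library fuzzy matcher.
from difflib import SequenceMatcher

query = "president announces treaty"
raw_ocr = "presldent announees treatv with foreign power"
corrected = "president announces treaty with foreign power"

for label, text in [("raw OCR", raw_ocr), ("corrected", corrected)]:
    exact = query in text
    fuzzy = SequenceMatcher(None, query, text[: len(query)]).ratio()
    print(f"{label:10s} exact match: {exact}   fuzzy ratio: {fuzzy:.2f}")
```

Fuzzy search gets you part of the way on uncorrected text, but it is slower and noisier at corpus scale, which is the argument for fixing the transcription layer while keeping the links back to the original scans.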
This is not what this phrase is about. I came to it working on the structural data of just under 100k Chinese characters. I'd spend hours, days and weeks proofreading and correcting formulas, so your "advocating and being a champion for inaccuracy" doesn't stick. But absent an automated, complete coverage of all records against a known error-free data set, there will likely be a small percentage of errors and dubious cases.
And thanks, by the way, for the readiness to jump to conclusions and fire a salvo of allegations, viz. "willingly", "knowingly", "introducing", "ridiculous".
You're making statements supporting the concept that errors are unavoidable, with an air of "oh well!", in a thread where someone is claiming AI is a solution... right after demonstrating a 10x error!
AI is a ridiculous answer, with its hallucinations and absurd error rates. If you didn't intend to support that level of absurd error rate, you shouldn't be replying in defence.
It sounds like you did not want to give that impression; if so, I suggest you look at the chain of replies and the context.
It's so nice and refreshing to see something like this, instead of the common "we tweaked this and that thing and got better results on this and that benchmark."
They do, but they charge a license fee just to access it (https://www.ap.org/content/archive/how-to-access/), let alone download it. Newswire seems to include a large subset of the portion of this archive that is out of copyright.