Newswire: A large-scale structured database of a century of historical news (arxiv.org)
165 points by h2odragon 4 months ago | 39 comments



According to the paper, the dataset goes up to 1978 because that's when copyright law was updated to automatically apply to newswires. It's unfortunate that we got into the situation where academia has to play by the rules wrt copyright while big private labs flaunt it.


They don't have to play by the rules either. They can also be sued by the NYTimes.


Justice at the service of whoever has the bigger wallet. :/


This is, unfortunately, how our (US) justice system works.

The entire concept of an "NDA" has been bastardized in this manner, if you think about it. Conceptually you may think of an NDA as protecting sensitive data from disclosure, a sort of intellectual property right. However, it's been co-opted by folks to do nothing more than cover up inconvenient truths because they realize most people cannot afford to either (a) give up money they are promised in the future or (b) bankrupt themselves in their own defense.

So it's basically a game where whoever has the most money can ensure their narrative wins out in the end, because competing narratives can simply be "bought out".


> They don't have to play by the rules either. They can also be sued by the NYTimes.

I understand a lot of other outlets have been folding and making deals with OpenAI, because they're too weak to sue and desperate for the revenue.

Which, if true, is really sad. It's like taking out a payday loan: you solve your short-term problem by giving yourself a bigger one in the future.


*flaut


*flout


*flute

:-)


The text states: "All code used to create Newswire is available at our public Github repository." But a check showed "This repository is empty." So not yet open science...


Yet another result of incentives in academia being misaligned.


This seems like a serious scholarly work and a contribution no matter how you cut it. Thanks to the team that put this out.



Just randomly searched for a term and the article data appears full of typos and OCR mistakes in the sample I used.

Makes me wonder if this is a bigger problem.


Yes, the OCR problem is also very interesting to me. I have been looking into the OCR/labeling field recently, and it still seems to be an actively researched area. There is surya-ocr, recently posted to YC, which is transformer-based, but it is still expensive to run on a really large dataset like 100 years of newspapers. Tesseract doesn't seem to handle this kind of material very well. In the paper, they just mention they used an active-learning type of method.
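
For reference, a minimal sketch of running Tesseract over one scanned page (assumes the tesseract binary plus the pytesseract and Pillow packages are installed; the file path is a placeholder):

  # Sketch: OCR a single scanned newspaper page with Tesseract.
  from PIL import Image
  import pytesseract

  image = Image.open("page_scan.png").convert("L")  # grayscale often helps with old scans
  text = pytesseract.image_to_string(image, lang="eng")
  print(text[:500])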


Exciting work. I hope they consider releasing the 138 million scans.

The FAQ at the end is a bit confusing, because it says:

  Was the “raw” data saved in addition to the 
  preprocessed/cleaned/labeled data (e.g., to support
  unanticipated future uses)? If so, please provide a link or 
  other access point to the “raw” data.

  All data is in the dataset.


The raw scans are in the Linrary of Congress digital collection.

(^- oops, example transcription error)

The "end of described process" dataset descrption is at https://huggingface.co/datasets/dell-research-harvard/newswi...

and for each record there's "newspaper_metadata", "year", "date", and "article" fields that link back to _a_ LoC newspaper scan.

I stress _a_ (singular) because much is made of their process for identifying articles with multiple reprints and multiple scans across multiple newspapers, as these repetitions of content (modified to a degree by local sub-editors) with varying layouts are used to make the conversion of the scans more robust.

I haven't investigated whether every duplicate article has a separate record linking to a distinct scan source...
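
For anyone who wants to poke at those fields, a minimal sketch using the Hugging Face datasets library (assuming the dataset id is dell-research-harvard/newswire, which the link above truncates, and that a "train" split exists):

  # Sketch: stream a few Newswire records and print the fields mentioned above.
  from datasets import load_dataset

  ds = load_dataset("dell-research-harvard/newswire", split="train", streaming=True)
  for i, record in enumerate(ds):
      print(record["year"], record["date"])
      print(record["newspaper_metadata"])
      print(record["article"][:200])
      if i == 2:
          break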


I've wondered for a while if a new kind of news could be fashioned from the events of the world. Basically, if I were to try to define what items are most newsworthy, it'd be along the lines of what (according to my pop-culture understanding) Claude Shannon described: the items that are most "surprising". But while we normally treat surprise as a subjective thing that affects only one person, like the item that most conflicts with my current understanding and requires me to update my own model the most, we'd instead apply it to very large groups of people, or whatever group the "news" is being tuned to.

So, what items of news cause the most "change" (or surprise) to that group of people? All our understandings and notions of truth rest on long chains of reasoning built from premises and values. When a new event changes a premise our understandings are based on, which events cause the largest changes in those DAGs?

We're regularly subjected to "news" that doesn't change anything very much, while subtle events of deep impact are regularly missed. Maybe it would be a way to surface those things. I wonder if someone smarter than me could analyze a data set such as this and come up with a revised set of headlines and articles over the years that do the best job of communicating the most important changes.
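
A toy sketch of that ranking idea (my own illustration, nothing from the paper): score each item by its Shannon surprisal, -log2 p, under whatever predictive model you have for the target group. The probabilities below are made-up placeholders:

  import math

  # Made-up probabilities standing in for a group's predictive model.
  group_model_probs = {
      "incumbent re-elected": 0.70,
      "minor cabinet reshuffle": 0.40,
      "central bank abandons gold standard": 0.02,
      "regional court ruling quietly shifts precedent": 0.01,
  }

  def surprisal(p):
      """Bits of surprise for an event the model gave probability p."""
      return -math.log2(p)

  # Rank items from most to least surprising for this group.
  for event, p in sorted(group_model_probs.items(), key=lambda kv: -surprisal(kv[1])):
      print(f"{surprisal(p):5.2f} bits  {event}")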


It's an interesting idea...most news outlets are pretty similar, but with variance in a few (of the same) dimensions. A radical change could be to vary along new and more dimensions, and in new ways, but also to introduce variance (or novelty) to the shared axioms it all sits on top of (~"the reality").

If it were all powered by AI, I think you could get some really interesting results. I bet it would evoke extremely strong emotions in humans; they tend not to like their "reality" being messed with (well... in ways other than those they have become accustomed to - those ways they seem extremely fond of, and defend passionately).

Another angle no one's run with in any serious way would be a sort of meta-journalism, again with the techniques described above. I think a well done implementation could steal/borrow 50% of Trump's base, and 30% of the Democrats. Actually, with different variations of parameters I'd think you could get a wide range of outcomes, it is often hard to know in advance what will strike a chord with people, some of the weirdest things work like a charm.


Really good stuff. On-the-ground, in-the-weeds data collection and collation like this is sadly often thankless (and poorly funded) work, but it's a huge force multiplier for downstream research and should always be commended.


First look at the data: https://pastila.nl/?05ee30a0/be7f1715c7de106b95cccd9385a6c2e...

TLDR: it makes sense :)


Seems to correlate nicely [0]:

> Prime Minister of the United Kingdom, from 1940 to 1945 during the Second World War, and 1951 to 1955

> Died 24 January 1965

[0] https://en.wikipedia.org/wiki/Winston_Churchill


Also uploaded it to the public playground for queries:

https://play.clickhouse.com/play?user=play#U0VMRUNUIHllYXIsI...


But the scan quality is subpar. Example:

> For belter safekecping Russta’s $2¢4,000,000 collection of crown jewels, probably (he finesl array of gems ever assem- bled at one tle

https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...


https://chatgpt.com/share/13f553a8-5cff-42a1-be95-4a9d33cd10...

May also be easy to correct a lot of it:

“For better safekeeping, Russia’s $24,000,000 collection of crown jewels, probably the finest array of gems ever assembled at one time,”
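
Roughly what that kind of post-OCR cleanup looks like as code (a sketch, assuming the openai package with the v1 client API and an OPENAI_API_KEY in the environment; the model name is a placeholder, and corrections can still silently get things like digits wrong):

  # Sketch: ask a chat model to fix obvious OCR errors in one passage.
  from openai import OpenAI

  client = OpenAI()
  noisy = ("For belter safekecping Russta's $2¢4,000,000 collection of crown jewels, "
           "probably (he finesl array of gems ever assem- bled at one tle")

  resp = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder model name
      temperature=0,
      messages=[
          {"role": "system",
           "content": "Fix OCR errors only; do not guess at digits you cannot recover."},
          {"role": "user", "content": noisy},
      ],
  )
  print(resp.choices[0].message.content)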


But are you correcting the OCR or miscorrecting the originals?

I want original text, including misspellings, and original regional / historical spellings, including slang (which may look like another word, but is not, and isn't in a dictionary).

You cannot fix OCR text without looking at the original.


With the spelling having been fixed, even if imperfectly, you could much more easily search for content and find relevant results, and then go on to look at the originals. What you want is still possible, unless you unreasonably make it a requirement that the transcriptions should be perfect.


Proper transcription to digital is to do so with accuracy, not "close enough".


to quote myself, "every interesting data set will have inaccuracies in it"


There is a vast difference between a rare, honest mistake and an attempt to mitigate such mistakes...

vs. willingly and knowingly introducing corrections that are ridiculously wrong.

Advocating and being a champion for inaccuracy, really isn't a positive. You should find a new thing to quote about yourself.


This is not what this phrase is about. I came to it working on the structural data of just under 100k Chinese characters. I'd spend hours, days and weeks proofreading and correcting formulas, so your "advocating and being a champion for inaccuracy" doesn't stick. But absent an automated, complete coverage of all records against a known error-free data set, there will likely be a small percentage of errors and dubious cases.

And thanks, by the way, for the readiness to jump to conclusions and fire a salvo of allegations, viz. "willingly", "knowingly", "introducing", "ridiculous".


You're making statements supporting the concept that errors are unavoidable, with an air of "oh well!", in a thread where someone is claiming AI is a solution... right after demonstrating a 10x error!

AI is a ridiculous answer, with its hallucinations and absurd error rates. If you didn't intend to support that level of absurd error rate, you shouldn't be replying in defence.

It sounds like you did not want to give that impression, if so, I suggest you look at the chain of replies, and the context.

AI hype is literally a danger to us all.


oh well


"$2¢4,000,000" should be "$204,000,000" rather than ChatGPT's "$24,000,000".


Are you aware of any models that perform as well as an LLM on this task at lower cost?


Self hosted LLM?


In case it wasn't intentional, you may be doxxing yourself.


Looks like fantastic work.

It's so nice and refreshing to see something like this, instead of the common "we tweaked this and that thing and got better results on this and that benchmark."

Thank you and congratulations to the authors!


Wait, so AP didn't/doesn't have an archive?


They do, but they charge a license fee just to access it (https://www.ap.org/content/archive/how-to-access/), much less download it. Newswire seems to include a large subset of the part of this archive that is out of copyright.



