I used to work as a web editor at a daily newspaper. One of my daily responsibilities was to write and post short blurbs about lottery results and beach surf conditions. They both were extremely formulaic, so I wrote push-button scripts that would fetch the data, parse it, and generate stories. Saved me time, which I used to do other, more important things.
Some people have said, "Why not just display the raw data?"
Well, to save you the trouble of having to analyze the data.
For instance, suppose I've written a Powerball results script that lists the winning numbers, along with which states had winners.
If I'm just spitting out the raw data, then people might miss the fact that one of the winners was from our state -- whereas if I'm generating a story, I'm going to make that the lede.
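That lede-picking logic is trivial to sketch. Here's a toy version in Python (the state codes, data shape, and wording are invented for illustration, not anyone's actual script):

```python
# Hypothetical home state for the paper's readership.
HOME_STATE = "CA"

def write_blurb(numbers, winner_states, home_state=HOME_STATE):
    """Generate a short Powerball blurb, leading with a local winner if any."""
    nums = ", ".join(str(n) for n in numbers)
    if home_state in winner_states:
        lede = (f"A {home_state} ticket holder is among the winners "
                "of last night's Powerball drawing.")
    else:
        states = ", ".join(winner_states) if winner_states else "no state"
        lede = f"Last night's Powerball drawing produced winners in {states}."
    return f"{lede} The winning numbers were {nums}."

print(write_blurb([3, 17, 22, 41, 58], ["CA", "TX"]))
```

The "analysis" is just one `if` statement here, but that's the point: even a tiny bit of editorial judgment encoded in the script beats dumping the raw feed.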
Similarly, the magic of a company like Narrative Science and its Quill service is not in taking a ton of data and filling in a bunch of blanks with the values, Mad Libs style. It's in analyzing the data and figuring out what the most important parts are, and constructing a story around those findings.
In other words, it's not hard for a bot to write, "The Tigers played the Wolves yesterday. The Tigers won 1-0. The Tigers' John Johnson scored a home run."
It's more difficult to write, "The Tigers' Tom Thompson pitched the first no-hitter of his career yesterday in a 1-0 game against the Wolves. Remarkably, the Wolves' pitcher, Dobbie Dobson, was moments away from forcing the game into extra innings with his own no-hitter, when, in the ninth inning with two outs and two strikes on the board, the Tigers' John Johnson hit a home run."
(I'm not a baseball writer, so I'm probably bungling it, but you get the point.)
News in this sense is just catching up to the financial world. Big finance has had bots that parse news feeds and raw data for YEARS as part of their high frequency trading (HFT) platforms as a form of arbitrage.
So let's say a major earthquake happens in San Francisco and the Google HQ falls into a bottomless pit. The trading algorithms will know about the earthquake within seconds, and they will likely also know about Google's obliteration from speech-to-text of police scanners. The millisecond these reports hit the wire, the HFT algorithms would short the hell out of Google's stock while placing positive bets on Google's competitors, construction companies, and building suppliers. And this was going on 10 years ago; they're probably light years ahead of this by now.
It's also not very hard to write the second piece given the right machine learning technology. Feed a decade's worth of human-generated articles into an algorithm (which could also be paired with marketing metrics like page views, avg time spent on page, etc.) and the algorithm picks out the types of statistics it should look up from the databases and how those statistics should be incorporated into the story. Heck, if you wanted to get really fancy, you could try to pick the major themes of the game out of a transcript of the play-by-play or specific emotional moments of the game by measuring the level of crowd noise or twitter traffic.
A computer will have better information recall, better research skills and better perspective to write stories that are more appealing to people. Heck, they could even write a story on the fly designed to appeal specifically to the individual's preferences. I understand that there's a place for reporters, but I don't think it's going to be written event-based reporting for much longer.
I'm new to programming and web development, but just out of curiosity, if I wanted to write a script that fetches data and writes/generates product descriptions, is there an online resource that explains how to do this in general terms?
Also, would writing it be something a beginner/novice like myself could do or does it require pretty advanced skills?
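In general terms, the pattern is very beginner-friendly: fetch structured data (from an API, a CSV, or a scrape), pick out the fields you need, and drop them into a sentence template. A minimal sketch in Python (the product record and template are invented for illustration):

```python
# A made-up product record, like what you'd get back from an API or a CSV row.
product = {
    "name": "Acme Travel Mug",
    "capacity_oz": 16,
    "material": "stainless steel",
    "price": 14.99,
}

TEMPLATE = (
    "The {name} is a {capacity_oz} oz {material} mug. "
    "It's available now for ${price:.2f}."
)

def describe(p):
    """Fill the template with the record's fields."""
    return TEMPLATE.format(**p)

print(describe(product))
```

In practice you'd fetch the record with something like `requests.get(url).json()` first; the template step is the easy part. The real work, as the rest of this thread argues, is deciding which fields matter and handling messy or missing data.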
A Chicago startup called Narrative Science makes scripts that turn raw data feeds such as seismological data or website analytics into accessible natural language reports. Some news outlets are already using them to get out earthquake and Little League baseball reports quickly and cheaply.
The cofounder, Kristian Hammond, has big ideas. "Pulitzer Prize by 2017." "90% of news written by bots in 2030."
I am incredibly unimpressed with Kris Hammond's hype machine (although I think that what Narrative Science does is cool).
But this isn't the first bot that the LA Times has operated either. Folks like Ben Welsh (who works at the LA Times, and did this awesome lightning talk: http://www.youtube.com/watch?v=iP-On8PzEy8 ) and Matt Waite (who won a Pulitzer Prize for Politifact & now teaches at University of Nebraska Lincoln) have also been writing bots to help with the reporting process (and yes, in some cases, actually generate copy) for years.
Narrative Science certainly was not the first or only group to do this.
What I wonder is this: Everybody who enters the information age slowly (and mostly without noticing) adapts to it and develops information processing skills. So why not just present a table and/or a graph? Maybe we're already capable enough to handle it.
If the data is already machine-readable, in many instances, what I want to see are a few slices through the database and a few visualizations. What I don't want is a big text blob that must first be piped through my brain to re-extract the information.
"If a writer never had to compose a fifty word earthquake report again—few would complain. Better to leave the short, dry, purely informational articles to the bots."
The logical extension of this software is to replace the phrase 'earthquake report'.
"If a writer never had to compose a fifty word Presidential press brief again—few would complain."
"If a writer never had to compose a fifty word news leak again—few would complain."
"If a writer never had to compose a fifty word announcement of a declaration of war again—few would complain."
Let the machine do the writing to expedite publishing, because the accuracy of the source is assumed before the information is even shared publicly, provided the source submits releases in a formulaic manner.
This removes the agency of humans that might ask complicated questions (journalists) – so I don't believe it qualifies as news or journalism, at all. This just helps move the words of the source to the front page of a news agency.
Why dress up the release as unique content at all? Just print the damn press release word for word and cite the source.
A lot of modern journalism is already like this: a middleman between a number of sources (news agencies, PR offices, politicians) and the public. What's worse is that usually the source wants the message spread as far and wide as possible; that's why journalism has been having so many economic problems recently.
Agreed. The internet burst the bubble that news agencies possessed editorial integrity, and that has cost them business. This just helps online media transition faster in the direction of link bait bullshit. The only people who benefit from bot composition are the wealthy and powerful. I am glad the LA Times has disclosed their use of bots, and I hope others will do the same, so I can stop reading their 'work'.
> The internet burst the bubble that news agencies possessed editorial integrity, and that has cost them business.
That's a nice explanation, and a particularly common one since the 2000s, but the problem with it is that the decline in perception of most mainstream news sources (particularly newspapers), and the resulting decline in the industry, was already widely discussed before the internet (as anything other than ARPANET) existed, and certainly before it was a major factor in public perception of the news.
> The logical extension of this software is to replace the phrase 'earthquake report'.
You're asking for a principled, rather than arbitrary, place to insert human writers into the mix? I'd say a human writer might be needed when the story involves human agency; or when there's continuing human interest in the story.
There's no value-add for involving a human in the initial notification of a small earthquake; no human angle, no human perspective. If geologists issue a press release saying this presages Los Angeles falling into the ocean, then you need to involve a human.
Robots can't yet parse human political language (neither can most humans), so there's a possible value-add for interpretation of a presidential press release by an expert human.
Leaks and declarations of war involve human agency, and are assured of continuing human interest. Get a human to interpret them.
> This removes the agency of humans that might ask complicated questions (journalists) – so I don't believe it qualifies as news or journalism, at all.
Journalism at most newspapers died long before the process of uncritically transcribing information from external sources into publication format started being automated. That's what makes the automation invisible -- and therefore, acceptable -- the transition to robotic regurgitation long preceded the actual use of robots to do it.
I would argue that there are several logical problems with a jump from 'earthquake report' to 'news leak' or 'declaration of war'. Using only fifty words for an earthquake report implies that there was no explanation or analysis to be made of its effects. Most leaks and war declarations tend to require much more information to be covered completely, and I would not expect to find any that are fifty words or less.
For a sense of scale, your comment was 188 words. My comment is 88 words.
My point is that such software is a nod to the primacy of expediency in the field of breaking news. I'm not certain that other kinds of breaking news would be treated differently, particularly those from official sources that utilize standard formats (like corporations and governments).
If I may dig at your comment a little to illustrate my point: Your comment is shorter – more expedient, you seem to think – but you have not actually laid out any of the 'several logical problems' with my 'jump', you've merely made a claim that leaks/war declarations 'tend to require much more information' for complete coverage.
I agree such issues should require that, however, I don't feel they get it. I humbly submit the coverage of the latest US invasion of Iraq, where the evidence overwhelmingly indicated a dearth of WMDs, and yet US government talking points were rarely disputed, and sometimes printed without critique.
So, my hypothesis is that similar issues will definitely not get said coverage if expediency-by-software is valued more highly than analysis.
There's no real solution to our difference of opinion, but you haven't actually refuted my logic, as you claim.
We both agree there are different types of breaking news, and that they need to be treated differently. You trust that they will be, while I do not.
I actually don't think our opinions differ that much. I just have trouble expressing my ideas. You seem to have hit it on the nose with your last statement. I am more optimistic that these scripts will be used to generate breaking news that doesn't (yet) require much human analysis, just information that must be quickly disseminated in a readable format (such as an earthquake's geophysical qualities). This would hopefully leave more resources that would allow more rigorous journalism (such as analysis of the effects of and response to an earthquake) to thrive in the future just as deficiently as now.
The word count was intended to display just how little information might be conveyed in fifty words. The inclusion of my comment's word count was simply to make the comment self-referential. Both can be safely ignored.
Aha! I understand now, and I also agree that our positions aren't that different. I happen to work for a media operation, and I tend to think they would outsource the whole operation (writing, editing, layout) to a computer, if given the chance. So, I'm a more of a Grinch about these innovations.
I hope you're right, because contemporary journalism clearly needs to be doing something better with its time.
I mean that's sort of what a lot of news is. Just a summarization of a press release, some additional details found on the internet, maybe a few quotes.
It's the summarization and finding additional details that is important, just printing the entire interview or entire press release is too much.
Most news stories are already being boiled down to < 140 characters by humans. The next frontier in journalism is an explosion in informed, opinionated analysis from multiple perspectives.
We're seeing this in sports (Grantland, Deadspin, 538, Baseball Prospectus), politics (Vox.com/Ezra Klein, 538 again, Politico), and tech (Thompson, Evans, Gruber, The Wirecutter, Anandtech, MG/Dixon/Wilson/Suster/Andreessen/Horowitz/etc/yesIleftamillionpeopleout).
Tech leads the way because those on the cutting edge are often interested in tech, and I believe it's a solid leading indicator of where all journalism is headed—a 'Cambrian explosion' of thoughtful, analytic, but not purely objective writing.
There are three areas of journalism I see as relatively "safe" from bots for the foreseeable future:
1) Big-picture Op/Ed (of the kind you're talking about)
2) Investigation (of the deep, involved, I-need-to-know-where-to-start, I-need-to-get-people-to-talk variety)
3) Narrative nonfiction (profiles, colorful takes on a subject; see: The New Yorker, Michael Lewis, etc.)
The interesting thing is that each of these domains could be enhanced by the use of better software. And that's the best way to start looking at bot-fueled journalism. I'd love a machine-based fact checker for my pieces. I'd love AI-driven help with compiling and analyzing data. I'd love to outsource the high-frequency grunt work to a machine, just as it's done in many other industries. It would give me more time to think. It would let me focus where my focus matters. Instead of spending 99% of my time chasing raw research, and 1% of my time scrambling to assemble the piece, I could get more data, better, faster, up front, and thus put deeper thought into the analysis.
On that note: the journalists who thrive in the machine age will be the ones who understand what the machine is doing. Data-literacy is already a big, differentiating factor in the market now. Statistical competency will be very important, and the bar for competency will be positioned higher every few years. Beautiful logic will trump beautiful fluff.
This will accelerate the bimodal shake-out of journalism that we're already starting to see. We'll have base, commodified, LCD-pandering content farms on one end, and deep thought on the other. Quantity shops and quality shops, both of them improved by the automation of certain functions. Anyone in the middle ground is in trouble. This will be a good thing for both poles -- provided we can find a way to make the economics of quality work as well as those of quantity.
"is there something small and replicable that a piece of software could help you do right away?"
Fact checking is a huge problem. I'm not sure I'd call it a hair-on-fire problem, but it's certainly pissed me off on any number of occasions. Typically, what happens is that I'll write a story mildly critical of BigCo. The PR department at BigCo will catch wind of it, then email me and my editor with as many small "corrections" as they can muster, in an attempt to take down the piece.
When I say corrections, I don't mean broad facts. I mean barely perceptible technicalities. For instance, I might report someone's title as the "VP of Strategy," because that's what's listed on this company's website and on the person's LinkedIn page. But the company's PR group will come at me, screaming bloody murder, because the guy's title is actually "VP of Strategic Initiatives." They'll do this because they're unhappy with the broader criticism I'm leveling against their business -- so they want to pick away at as many nits as possible. It's a guerrilla harassment tactic.
Now, you might well ask: aren't I just being sloppy? How hard can it be to fact check everything? The answer is: it's damned near impossible to catch everything. Some of this information is conflicting across multiple sources. Some of it's out of date or inaccurate. And some of it's not even publicly available, and you wouldn't know you're slightly off until someone corrects you. (Case in point: the "Strategic Initiative" example, a nonsensically titled internal promotion that was never made public.) Add to this the immense pressure of weekly deadlines, and factor in how much time I need to spend ideating, researching, writing, and editing in the first place, and you see how little time I have to get every last thing 100% accurate. I'd love to be 100% accurate 100% of the time, but given my constraints, that's rarely feasible. It's like asking an engineer to write flawless, bug-free code at a hackathon.
I'd love a piece of software I could run my stories through, which would analyze my piece against a public data set (Google, for instance) to check for factual accuracy in all these little places. Or at least a piece of software that would throw up red flags on unverified statements, for which multiple, conflicting data exists on the web. Sort of like the little, red underlines on automatic spell check.
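Nothing off-the-shelf does this well, but the flagging half is easy to sketch: compare claims in a draft against a reference source and underline the mismatches. Here's a toy version, using the title example above (the reference data and regex matching are crude stand-ins for a real entity-extraction pipeline):

```python
import re

# Stand-in for a real reference dataset (e.g., scraped org charts or filings).
KNOWN_TITLES = {
    "Jane Doe": "VP of Strategic Initiatives",
}

def flag_title_claims(text, known=KNOWN_TITLES):
    """Return (name, claimed, expected) for every job-title claim in the
    text that disagrees with the reference data."""
    flags = []
    for name, expected in known.items():
        # Look for claims shaped like "Jane Doe, the <title>".
        m = re.search(rf"{re.escape(name)}, the ([^,.]+)", text)
        if m and m.group(1).strip() != expected:
            flags.append((name, m.group(1).strip(), expected))
    return flags

draft = "We spoke with Jane Doe, the VP of Strategy, about the layoffs."
for name, claimed, expected in flag_title_claims(draft):
    print(f"FLAG: {name} listed as '{claimed}'; reference says '{expected}'")
```

The hard part, of course, is the reference data itself: conflicting, stale, or private sources are exactly what makes the manual version of this job impossible.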
Perhaps that's not a big enough use case or TAM. But broaden it out. I'd love a better way to compile Google research about any given query. This is why I'm rooting for the stuff Wolfram is working on, though I'd love a more modifiable and specialized front end for professional users.
Yeah, by no means an easy problem to solve. But I think better research tools, in general, are an interesting area. Search engines are beautiful things, but their output is limited by your input. They can't necessarily help you with the things you didn't know to look for in the first place. In a way, a search process is a very deterministic, almost teleological one. I believe there is room in the market for a more focused, yet more open-ended research methodology.
Thousands of Wikipedia articles regarding US cities were originally written by a bot years ago. Most still retain this content or traces of it, Englished from US Census data.
I found it shocking when I first learned about it. Previously I had thought that articles featuring in the news were actual journalism, not just copy and pasted stuff from company press releases. It changed my view of what the news is.
"With the help of Chicago startup and robot writing firm, Narrative Science, algorithms have basically been passing the Turing test online for the last few years." If an article written by a bot is indistinguishable from that written by human author, it does not pass the Turing test, as they are typically based on conversations. One of the claims is that is that the bots produce typo free articles, funnily enough some programs were able to fool judges of the Turing test by imitating human misspellings: http://en.wikipedia.org/wiki/Turing_test#Loebner_Prize
In a recent example, an LA Times writer-bot wrote and posted a snippet about an earthquake three minutes after the event. The LA Times claims they were first to publish anything on the quake, and outside the USGS, they probably were.
I've always thought news could be boiled down to a few important facts, and I wonder if having all that extra copy is really important in this day and age.
Taking this one step further: Why aren't "factual" news sites a thing, where news is computer-readable (think JSON), and you can filter for the information you want?
Not all news can be made computer readable easily, but there are several classes of news that can be that way, or at least have rich, computer readable meta data (sports results, releases of software, new music albums, movies, upcoming events, announcement of road maintenance, ...).
Update: Is anybody interested in working on such a system with me? If yes, drop me an email (moritz@faui2k3.org)!
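As a sketch of what a "factual" feed item and a reader-side filter might look like (the schema here is invented, not any existing standard):

```python
import json

# A hypothetical machine-readable news feed.
feed = json.loads("""
[
  {"type": "earthquake", "magnitude": 4.4, "region": "Los Angeles",
   "time": "2014-03-17T06:25:36Z"},
  {"type": "software_release", "project": "nginx", "version": "1.5.12"},
  {"type": "earthquake", "magnitude": 2.1, "region": "Nevada",
   "time": "2014-03-17T09:02:10Z"}
]
""")

# Reader-side filter: I only care about quakes at magnitude 3 or above.
alerts = [item for item in feed
          if item["type"] == "earthquake" and item["magnitude"] >= 3]

for q in alerts:
    print(f"M{q['magnitude']} earthquake near {q['region']} at {q['time']}")
```

The rendering into prose (or into a push notification, or a table) then becomes the reader's choice rather than the publisher's.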
Great idea! I'm not sure something good would come out of it but another type of news that follows a predictable pattern and could be made machine-readable is celebrity gossip.
The skill is figuring out what the important facts are, and where to get them from.
For an earthquake, at least for the initial announcement, it's pretty straightforward - where and how big. It's not a great deal different to "a robot" (or more precisely, my phone's weather app) telling me what temperature and likelihood of rain. Trying to do that for, say, the salient points of the banking collapse would be a lot more difficult.
BBC Monitoring offers an excellent world news digest service and used to release part of the output publicly. But governments and companies will pay very handsomely for such services, so now it's only available on subscription.
I care less about length and more about accuracy. Every news story I have happened to know about personally has been totally wrong - this does not give me much confidence in journalism as currently practiced.
I guess we can look at this from a positive angle - the robots can't do any worse than the human journalists.
There's a good Chrome extension called Churnalism that alerts you when stories on news sites you read appear to be copy-pasted from other news sources or Wikipedia.
Both this and the parent. The local rag, The New Zealand Herald, which is arguably the national paper, publishes stories with a headline you click on. It expands, and the story is just the headline: the headline text is repeated. It's awful. They're even too slack to remove American spellings (it's supposedly a New Zealand paper) or convert US dollar values into local currency, both things a script could do fairly easily. Here is an example; it's sad.
http://m.nzherald.co.nz/world/news/article.cfm?c_id=2&object...
This is terrible. What a waste of effort! We have a machine that translates a simple fact into few paragraphs of text, so that thousands of people can spend time trying to do exact reverse when they read it.
Why don't they just publish raw facts in the computer readable form?
That might work well for someone with domain expertise. But not everyone is prepared to look at raw data and instantly extrapolate meaning. Taken to an absurd extreme, do you think that machine translation between languages is also a terrible waste of effort? Why not just learn to read kanji fluently and skip the extra step?
If the computer autonomously adds some analysis to the data, then it's fair that the reader should understand the nature of that analysis. With humans, we kind of live with it because we know they have common sense, so we kind of know what to expect (or maybe humans understand what our needs are). With automated translation or automated headlines, we understand the process. But with automated writing? I am worried a lot of effort will be wasted trying to figure this out.
"Indeed, Kristian Hammond, cofounder and CTO of Narrative Science, thinks some 90% of the news could be written by computers by 2030."
I think there's more to a reporter's job than writing; in fact, I'd say the writing part is maybe the smallest aspect when it comes to the most interesting stories.
Investigation, interviews, and general "experiencing the world" is required, and computers aren't ready to do that yet. I'd be surprised if they are by 2030 too.
Of course, it could be a sign of how far down the quality of news has fallen. That 90% of it is now formulaic. I guess I had hopes that these new sites that are popping up would become a trend in quality. Artisanal news, as the hipsters might say.
True. And I wonder how computers/bots might assist journalists in the other aspects of journalism - investigation, interviews, and trend spotting, etc?
I'm not sure I hope for the day that computers replace the human analysis, emotion, or opinion in news, but I do believe they could help journalists better spot leads and dig into stories. This would free journalists up to do less leg work and more thinking about the content in front of them.
That said, the process of investigation, etc. does contribute to the experience a journalist brings to the table, so the question then is: does cutting out the legwork undermine a journalist's ability to deeply understand and analyze a given story?
If the news turned into reports that way, I think it would be better than opinionated news. It's okay to get opinionated news, as long as you know the frame and background of the speaker/paper/channel/etc.
All news is opinionated. Even the driest report has an implicit opinion on what facts were considered relevant to the audience and which ones should be left out. If in this case the algorithm is just turning data into English, they're just leaving the opinionating to whoever wrote the data report.
Isn't that a bit too ambitious? I admittedly don't know a lot about the Pulitzer Prize, but I'd assume the main qualities they're looking for in a journalist there are not 'fast', and 'factually correct'. The first one doesn't really relate to the quality of journalism, and the second one's just too basic a requirement.
What I'd assume is worthy of a prize in journalism, is the data collection and investigation process before even starting to write the article. This would include very complex tasks, most covering social interaction, which I just can't see a computer outperforming a human in by 2017.
It doesn't really matter; journalism is already dead. Free news means sponsored news, therefore it's not news, it's marketing.
The future of news is raw/unredacted/untransformed data with traceable origin that people can analyse themselves and decide to trust or not to trust. Exactly like the Snowden leaks. We don't need "journalists" to source/filter the data.
People then will be able to mashup the data with apps designed for that.
Next Step: Individuals have personal news writing software that scans sources of interest and generates human digestible stories. Completely bypassing news agencies all together.
There was manual approval involved; I guess that's what took 3 minutes (or maybe 15 seconds for polling the data, 5 seconds for actually composing the article, and 2m 40s waiting for approval).