Company using Mechanical Turk botches U.S. Senate campaign finance records

apendleton · on Sept 7, 2018

For a bit more background: in general, House and Senate campaigns use the same software to manage their campaign contribution data, and this software is capable of doing electronic submission (because the House requires it), but the Senate instead prints it all out and hands it in on paper, because they're the worst, and then it all gets typed back in again at a cost to the taxpayer of about $250k per filing deadline (how many of those per year there are varies -- they get more frequent closer to elections). The main motivation for the unwillingness to change things here isn't that they like the inaccuracies so much as that they like the delay -- it effectively means the last couple of pre-election filings aren't public until after the election, rather than, you know, instantly.

everybodyknows · on Sept 7, 2018

Article says McConnell is the blocker of electronic filing. Anyone know about his own campaign finances in the last few months before elections?

apendleton · on Sept 7, 2018

https://www.opensecrets.org/ is a good resource, but honestly (having worked on money-in-politics stuff for several years) none of it is that juicy in isolation. Everybody gets money from everybody. For the most part they're not scared of any particular story coming out, they'd just rather not have an "In the final days of the campaign, X got $Y from Z" story come out the day before the election at all, for any Y or Z (it looks bad pretty much regardless of who it is: banks, pharmaceuticals, defense, lobbyists, energy, whatever).

slics · on Sept 7, 2018

There are so many factors in play here. Government issues a contract for scanning and loading all the paperwork. Contractor wins contract (prime), then hires sub-contractors to do the work, who in turn can hire other sub-contractors. If you think about it from a requirements perspective, government asks for apples, and by the time those requirements get triaged three levels deep, it gets oranges. Government attitude toward a shitty product delivered: you can still make juice out of the fruit we got. They still keep pumping millions of dollars a year into that same minimal viable product.

vitorbaptistaa · on Sept 7, 2018

A bit off-topic, but as an example of what could be done with this kind of information, the Operation Serenata de Amor from Brazil created a robot that analyses the reimbursement claims from Brazilian politicians looking for outliers, filing complaints and tweeting the politicians to ask for clarifications. There have been some pretty funny conversations between politicians and twitter bots because of this :)

https://serenata.ai/en/

jessaustin · on Sept 7, 2018

They argue about reimbursements over Twitter? That seems like a topic for private communication methods. Email bots are also a thing.

mmt · on Sept 7, 2018

> That seems like a topic for private communication methods.

Since the topic is public officials acting in their official capacity, I disagree.

ohashi · on Sept 7, 2018

Captricity is terrible on MTurk. They reject and don't pay. A lot of people who regularly work on MTurk avoid them because they are a scummy company. Read their TurkOpticon reviews or on Reddit or other MTurk communities. I hope they disappear.

seveibar · on Sept 7, 2018

Mechanical Turk doesn't have built-in quality assurance so unless you're very skilled in QA systems or have a lot of money to burn on validation it's going to give poor results. Btw it's relatively hard to build a QA system that validates work, establishes trusted workers and optimizes for cost without significant redundancy.

I do the technology at an MTurk competitor that automates the quality assurance/training process and pays fair wages to refugees to do the work (workaround.online). If anyone is interested in an alternative to MTurk that is still relatively cheap, I would highly recommend it.

cdoxsey · on Sept 7, 2018

Shouldn't they be running each scan through the mechanical turk multiple times to remove errors? I guess that would double or triple their costs.

danpalmer · on Sept 7, 2018

This is the dirty secret of Mechanical Turk – you're supposed to run it through multiple times according to much of the documentation, but people don't because then it becomes expensive enough that it's no longer an attractive option.

I work in a company who have considered it many times for various things, but between the Mechanical Turk API/tech being pretty terrible, and it all being expensive and low quality, we always either end up getting a temp in for a day or two to sit in front of Excel and tidy up data, or if it's a bigger process we outsource it to a data processing company in Bangladesh where we can have dedicated people on our account who sit in a shared Slack channel and who we can train.

Kagerjay · on Sept 7, 2018

How expensive has mechanical turk gotten though over the years? I thought this was standard practice as well, running it multiple times to reduce human error and do random spot checking.

What is the current cost per HIT compared with previous years?

danpalmer · on Sept 7, 2018

The last time I looked at this, a few years ago, it was ~$0.10 per HIT, and 2-3 would be needed, and that was for very simple data processing. We have quite complex data processing requirements with multiple interdependent fields, and UI, which would have increased the processing time, so I'd have guessed $1 per item processed total, plus extensive integration time.

Our outsourcing gives us far better communication and the ability to train staff doing the processing over time, give feedback on their performance, and help them get better. I don't know the figures, but I suspect it's a similar price with far better accuracy. We do have enough consistent work for this to make sense, though - if we were more spiky in our demand then it might not.

electroly · on Sept 7, 2018

My experience with MTurk is that 3 isn't enough runs if you need the data to be correct and can't afford to pay someone (who ISN'T from MTurk) to validate every entry.

We regularly ran into these two situations:

- All three workers got different answers

- Two of the three workers agreed on the wrong answer

I think five or more runs may be necessary for data transcription on MTurk.
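A minimal sketch of the redundancy idea discussed above, assuming each field is transcribed by several workers and the answers are compared as plain strings. The function name `consensus` and the `min_agree` parameter are hypothetical, not part of any MTurk API. It returns no answer when workers disagree too much, which covers the "all three got different answers" case. It cannot detect two workers agreeing on the same wrong answer; raising the number of runs and `min_agree` only makes that less likely.

```python
from collections import Counter

def consensus(answers, min_agree=2):
    """Return the most common answer if enough workers agree, else None.

    answers: list of transcribed strings for one field (one per worker).
    min_agree: hypothetical threshold for how many workers must match.
    """
    # Count answers after trimming whitespace; keep the top two to detect ties.
    counts = Counter(a.strip() for a in answers).most_common(2)
    if not counts:
        return None
    top_value, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    if top_count < min_agree or tied:
        return None  # no clear winner: send the field for another run or manual review
    return top_value

print(consensus(["100", "100", "1000"]))               # two of three agree
print(consensus(["100", "1000", "10"]))                # everyone disagrees
print(consensus(["100"] * 3 + ["1000"] * 2, min_agree=3))  # five runs, stricter threshold
```

With five runs and `min_agree=3`, a wrong value is accepted only if three workers make the same mistake, at the price of paying for five HITs per field instead of three.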

ig1 · on Sept 7, 2018

You should consider using qualifications / simplifying the requests.

The error rate I get for data entry tasks is around 0.5%-1% discrepancy between double entry. If you use prior reliability of the worker to tie break between who's right it drops to <0.1% error rate.
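As a rough illustration of the tie-break described here, the sketch below resolves a double-entry discrepancy by trusting the worker with the better track record. The worker IDs, the `reliability` mapping (fraction of past entries judged correct), and the `default` score for unknown workers are all assumptions made for the example, not details from the comment.

```python
def resolve_double_entry(entry_a, entry_b, reliability, default=0.5):
    """Pick a value from two independent entries of the same field.

    entry_a, entry_b: (worker_id, value) tuples.
    reliability: dict mapping worker_id to a prior accuracy score in [0, 1].
    Returns the chosen value, or None if there is no basis to choose.
    """
    worker_a, value_a = entry_a
    worker_b, value_b = entry_b
    if value_a == value_b:
        return value_a  # entries agree, nothing to resolve
    score_a = reliability.get(worker_a, default)
    score_b = reliability.get(worker_b, default)
    if score_a == score_b:
        return None  # equally trusted workers disagree: escalate to a third entry
    return value_a if score_a > score_b else value_b

# Hypothetical prior accuracy per worker.
reliability = {"W1": 0.99, "W2": 0.90}
print(resolve_double_entry(("W1", "2500.00"), ("W2", "25000.00"), reliability))
```

The scores themselves would have to come from somewhere, for example from how often each worker's past entries matched the accepted value.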

imhoguy · on Sept 7, 2018

Does the MTurk API allow you to identify, rank, and exclude workers? By identifying I mean getting some common key for all of a given worker's submissions, etc.

RosanaAnaDana · on Sept 7, 2018

I mean, this is an issue in any annotation exercise. Most annotation work heads south due to a failure to create an entire, discrete and complete workflow/classification.

misiti3780 · on Sept 7, 2018

yep, and then using this type of analysis to determine who is good and who is not: https://en.wikipedia.org/wiki/Inter-rater_reliability
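One common inter-rater reliability measure from that page is Cohen's kappa, which compares how often two raters agree against how often they would agree by chance. The sketch below is a plain-Python version for two raters labelling the same items; it fits categorical labels better than free-form transcription, and the sample labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

rater_1 = ["y", "y", "n", "n"]
rater_2 = ["y", "n", "n", "n"]
print(cohens_kappa(rater_1, rater_2))
```

A kappa near 1 means the two raters agree far more than chance would predict; a value near 0 means their agreement is about what random labelling would produce.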

logfromblammo · on Sept 7, 2018

I once did campaign finance data entry as a child laborer for a newspaper reporter. I entered all the contribution reports for all the state house and senate races. It was crates upon boxes of public-record paper documents. According to the reporter, I probably only made one mistake in the dollar amounts, due to a single missing contribution report. Not bad for a kid.

But it took forever, and it was relatively expensive for the newspaper. This was way back in the early 1990s. Since then, paper filings tapered off, and electronic filings replaced them. It really is a far better way to do it. Clearly, these records need to be digitized to create public transparency, and that need is apparently being met by bottom-feeding tax-eaters doing a minimum-effort job at a top-shelf price. I am not surprised these records are being botched, but I am surprised this story is coming out only now, rather than back in 2001.

kokey · on Sept 7, 2018

Oh this reminded me immediately of this story about a data migration project https://thedailywtf.com/articles/Importing-Data-the-WTF-Way

ianhawes · on Sept 7, 2018

My understanding is that Senate campaigns can file electronically, but are not legally required and therefore do not.

eli · on Sept 7, 2018

Because they want their records to be hard to read and analyze. This is a policy/political problem not an OCR one.

cabaalis · on Sept 7, 2018

Also, it directly benefits the senators for the company hired to process the data to fail in doing so. Any issues raised from the dataset can immediately be refuted.

Dowwie · on Sept 7, 2018

This is a political issue more than a technological one.

komali2 · on Sept 7, 2018

Agreed, especially considering

>Reform advocates say this is in large part because of opposition by a small group of Senate Republicans, most notably Senate Majority Leader Mitch McConnell

which is exactly what I'd expect from the dude, but not something I feel like screaming about on HN.

danso · on Sept 7, 2018

I’d say it’s both. The availability of purportedly cheap labor and tech make it easier for the Senate to push off a real solution. It’s also an interesting bit of bureaucratic trivia, in how the House and Senate approach this issue (and other data problems) in completely different and independent ways.

esseti · on Sept 7, 2018

I did part of my PhD on crowdsourcing and the poor quality results still remain a problem ;)

mygo · on Sept 7, 2018

I remember back when Yahoo Answers was the go-to for crowd-sourced Q&A. Nobody was getting paid and most answers were absolutely terrible.

Then Quora came out. Nobody was getting paid but suddenly experts were answering questions in their field.

Wonder if there could be a Quora for MTurk

booleandilemma · on Sept 7, 2018

I think this is the first positive thing I’ve ever read about Quora anywhere.

AdvancedCarrot · on Sept 7, 2018

I mean, it's not perfect - I have seen my fair share of "experts" but I would say as a general rule the answers are fairly high quality.

stuaxo · on Sept 7, 2018

Mechanical Turk is basically laundering of grossly underpaid labour.

mirimir · on Sept 7, 2018

Checking data for amount=ID seems pretty trivial. Also looking for outliers. And of course, for records with nothing in name fields.
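A minimal sketch of the three checks listed here, assuming each transcribed record is a dict with the hypothetical keys `id`, `name`, and `amount`. The outlier rule (an amount more than `outlier_factor` times the median) is an arbitrary choice for illustration, not a recommended threshold.

```python
from statistics import median

def flag_suspect_records(records, outlier_factor=50):
    """Return (record id, reasons) pairs for records that fail basic sanity checks."""
    amounts = [r["amount"] for r in records]
    typical = median(amounts) if amounts else 0
    flagged = []
    for r in records:
        reasons = []
        if r["amount"] == r["id"]:
            reasons.append("amount equals id")  # ID likely typed into the amount field
        if not r["name"].strip():
            reasons.append("empty name")
        if typical > 0 and r["amount"] > outlier_factor * typical:
            reasons.append("outlier amount")
        if reasons:
            flagged.append((r["id"], reasons))
    return flagged

# Made-up sample records.
records = [
    {"id": 1, "name": "A Smith", "amount": 250},
    {"id": 2, "name": "", "amount": 500},
    {"id": 300045, "name": "B Jones", "amount": 300045},
    {"id": 4, "name": "C Lee", "amount": 100},
]
for record_id, reasons in flag_suspect_records(records):
    print(record_id, reasons)
```

Checks like these do not prove a record is wrong; they only pick out rows worth a second look before the data is published.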

txcwpalpha · on Sept 7, 2018

> Captricity turns these images into machine-readable text through what its website calls a “groundbreaking collaboration between humans and computers.”

> Captricity administers this kind of work through Mechanical Turk, an Amazon-owned online labor marketplace.

Ah yes, "groundbreaking collaboration between humans and computers", which in this case actually means "pay a human below minimum wage to type numbers on a keyboard".

Gotta love corporate marketing speak. The only thing that would make this even more ridiculous is if Captricity claimed they were using the mystical "machine learning" too. Hmm, let's check their website. [1]

>Captricity then uses sophisticated machine learning to package up these fields (we call them “shreds”) into quickly identifiable packets.

...lol

1: https://support.captricity.com/blog/captricitys-secret-weapo...

jessaustin · on Sept 7, 2018

>Captricity then uses sophisticated machine learning to package up these fields (we call them “shreds”) into quickly identifiable packets.

Reminds one of the "Librareome Project" from Rainbows End. That book is dense with the future...

ISTM a ML firm that can't get the ML to work and falls back to Mechanical Turk (how did that wonderful name get past corporate PR?) is kind of like Theranos using normal blood tests to stand in for their imaginary ones: at least as much of a scam on investors as on customers.

egypturnash · on Sept 7, 2018

I’m not sure Amazon really had much of a PR department back in 2005 when they launched the service and named it after a chess-playing “automaton” that had a human hidden in it to do all the thinking. https://en.m.wikipedia.org/wiki/The_Turk

jessaustin · on Sept 7, 2018

If an 11yo firm with billions in revenue could avoid having "much of a PR department", I'm impressed. Even one PR person would seem to be enough to avoid naming a product after a nationality...

pchristensen · on Sept 7, 2018

Of course Amazon has a PR department. But it also has thousands of teams and products, most of which were started by a handful of people and grown (or killed off if unsuccessful). There's a big difference between handling the global scale news stories Amazon is involved in and the naming of new products.

toomuchtodo · on Sept 7, 2018

Shouldn't your product naming go through a review process?

yorwba · on Sept 7, 2018

That doesn't guarantee that a name like "Mechanical Turk" will be rejected. Maybe they asked a Turk who believed it to be unlikely to be an issue.

Personally, I think the historical reference is clever, but then I'd say the same if someone named an autonomous ship-control system "Flying Dutchman". Maybe I just have bad taste.

therein · on Sept 8, 2018

As a Turk I do not find the name offensive at all and I am certain most Turks would find it amusing had they heard of the product/service.

wavefunction · on Sept 7, 2018

Turk is an ethnicity.

jessaustin · on Sept 7, 2018

It's both (there is a real nation called "Turkey"), but why would one be more "acceptable" than the other?

pavel_lishin · on Sept 7, 2018

But are the results stored in the blockchain?

magicnubs · on Sept 7, 2018

Democracy is a 51% attack

CalChris · on Sept 7, 2018

With the Electoral College it is less than a 51% attack.

redleggedfrog · on Sept 7, 2018

In the cloud?

kbenson · on Sept 7, 2018

No, no, no! They print the blockchain to paper so it can be duplicated and stored in multiple secure physical locations. Blockchain in the cloud is so last year.

SaltyBackendGuy · on Sept 7, 2018

A plant based media storage ledger, with redundancy. I smell a crypto startup.

nannal · on Sept 7, 2018

I wonder which generation the software is ... from?

stephengillie · on Sept 7, 2018

Chemtrail Blockchains

glitcher · on Sept 7, 2018

> we call them "shreds"

...as in, the results should be inserted into a paper shredder.

cloakandswagger · on Sept 7, 2018

What a disingenuous comment. The link you provided explains pretty clearly how machines are used, and that is in splicing up the documents and doing first-pass OCR on them. I imagine they're only delegated to humans under a certain confidence level.

You have zero idea what their technology looks like or what its level of sophistication is, but don't let that stop you from posting this cynical snark.

txcwpalpha · on Sept 7, 2018

>You have zero idea what their technology looks like or what it's level of sophistication is

I don't? I dunno if you missed it, but at the top of this page there is a linked article specifically talking about the sophistication of their technology, and particularly, how it failed to catch even the most egregious errors in their process. Maybe you should read it, it's really interesting.

>The link you provided explains pretty clearly how machines are used, and that is in splicing up the documents and doing first-pass OCR on them.

Yes, and you'll notice that my comment didn't say anything about the usage of the machines, and was focused on the marketing speak phrase of "groundbreaking collaboration between humans and computers" and "machine learning". Regardless of how the machines are being used, those phrases are meaningless bullshit.

>I imagine they're only delegated to humans under a certain confidence level.

Let me get this right: you're criticizing me for not knowing what their technology looks like, and then in the same comment, you yourself are speculating about what their technology looks like. Really?

cloakandswagger · on Sept 7, 2018

From your original comment:

> Ah yes, "groundbreaking collaboration between humans and computers", which in this case actually means "pay a human below minimum wage to type numbers on a keyboard".

I agree that this is marketing speech (which is a reality of any business selling a product), but you implied that their operation is human-powered with only the illusion of machine interaction, which you go on to confirm:

> Gotta love corporate marketing speak. The only thing that would make this even more ridiculous is if Captricity claimed they were using the mystical "machine learning" too. Hmm, let's check their website.

If you had bothered to read the site you linked (or if you knew what OCR stands for), you'd understand that there is machine learning at play here, and not just for chunking up the documents. You made empty speculations about a company's technology for the sole purpose of denigrating them.

You might think this is splitting hairs, but these sardonic swipes lower the bar of intellectual discourse on this forum. It's OK to speculate about the technology in play, but not when you're insinuating that it's essentially fraudulent with absolutely zero evidence.

txcwpalpha · on Sept 7, 2018

>If you had bothered to read the site you linked (or if you knew what OCR stands for), you'd understand that there is machine learning at play here, and not just for chunking up the documents. You made empty speculations about a company's technology for the sole purpose of denigrating them.

Yea, and again, none of that is relevant here. I didn't make any judgements about their actual process (other than the fact that it did, objectively, fail). My focus is on the marketing speak, and despite what you may think, "machine learning" is as useless and as meaningless of a word as "cloud" is.

>but you implied that their operation is human-powered with only the illusion of machine interaction, which you go onto confirm:

I didn't imply any such thing, but that's a nice straw man you're building over there. You know, these arguments like yours are the type of thing that really lower the bar of intellectual discourse on this forum. Wait a minute...

>You might think this is splitting hairs, but these sardonic swipes lower the bar of intellectual discourse on this forum. It's OK to speculate about the technology in play, but not when you're insinuating that it's essentially fraudulent with absolutely zero evidence.

Ohhh okay, so when I make a comment about the technology, it's "lowering the bar of intellectual discourse", but when you speculate about the technology, it's okay? Nice hypocrisy, bud. Talk about lowering the bar of intellectual discourse.

cloakandswagger · on Sept 7, 2018

I'm going to discontinue this, as you are getting upset and increasingly personal. I'd ask that you try to be more thoughtful with your comments in the future.

txcwpalpha · on Sept 7, 2018

Lol. You accuse me of "lowering the bar of intellectual discourse", and when I do the same to you, it's "increasingly personal"? More hypocrisy. That's two instances of hypocrisy in as many comments. Perhaps you should be a bit more thoughtful and try to avoid such mistakes.