Google and Amazon Vie for Big Inroad into Wall Street Data Trove (bloomberg.com)
84 points by walterbell on Sept 1, 2016 | 13 comments



The CAT (Consolidated Audit Trail) on HN! I went to one of the early meetings for this project (years ago now!). It was a question-and-answer session for potential bidders.

My favorite question: "How long is the contract for?" (The SEC reps look at each other, and then respond...)

"There is no term."

"And the bidder is committed to storing all generated data?"

"Yes."


This has come up on HN before... One of the bidders has apparently run a load test on Google Cloud with some impressive numbers: https://cloudplatform.googleblog.com/2016/03/financial-servi...


Yep, that's FIS, running atop our now generally available release of Cloud Bigtable (https://cloud.google.com/bigtable/). Thanks to the HBase compatibility, several folks have swapped out Cassandra for Bigtable (like Spotify, mentioned in our GA announcement https://cloudplatform.googleblog.com/2016/08/Google-Cloud-Bi...).
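For anyone curious what the swap actually looks like, here's a minimal sketch using the bigtable-hbase client; the project, instance, table, and column names are placeholders, not anything FIS actually uses:

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtableHBaseExample {
  public static void main(String[] args) throws Exception {
    // The one-line swap: replace HBase's ConnectionFactory.createConnection()
    // with BigtableConfiguration.connect(); the rest of the HBase API is unchanged.
    try (Connection conn = BigtableConfiguration.connect("my-project", "my-instance");
         Table table = conn.getTable(TableName.valueOf("orders"))) {
      Put put = new Put(Bytes.toBytes("AAPL#2016-09-01T09:30:00.001#order-42"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("px"), Bytes.toBytes("107.73"));
      table.put(put);
    }
  }
}
```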

Disclosure: I work on Google Cloud, so I want you to use Bigtable ;).


~20 GB/s read/write over a thousand-plus cores seems slow, especially for embarrassingly parallel data such as this (split on security). That works out to roughly 20 MB/s per core. Am I missing something?


They're not doing sequential scans of files on disk; they're doing random reads and writes in a database, where each write is replicated and durable, in parallel, across the entire key space of market transactions. The task was to reconcile market transactions end-to-end by matching orders with their parent/child orders (e.g., as orders get merged/split, or routed from broker-dealers to other broker-dealers or to exchanges to be executed), thus building millions (billions?) of graphs across the entire dataset. You can see more details in the video of the presentation at the bottom of this blog post: https://cloudplatform.googleblog.com/2016/03/financial-servi... but I presume you're much more familiar with the intricacies of the stock market than I am. :)
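To make the access pattern concrete, here's a toy sketch of the kind of lookup involved; the "link#<parentOrderId>#<childOrderId>" row-key scheme is my own illustration, not FIS's actual design:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LineageScan {
  /** Returns the row keys of all child orders routed from parentOrderId. */
  static List<String> childOrders(Table table, String parentOrderId) throws Exception {
    // With keys laid out as "link#<parent>#<child>", finding one order's
    // children is a short prefix scan -- a random lookup, not a bulk read,
    // which is why per-operation latency matters more than raw GB/s here.
    Scan scan = new Scan();
    scan.setRowPrefixFilter(Bytes.toBytes("link#" + parentOrderId + "#"));
    List<String> children = new ArrayList<>();
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result row : scanner) {
        children.add(Bytes.toString(row.getRow()));
      }
    }
    return children;
  }
}
```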

Here's the performance you can expect to see per Cloud Bigtable server node in your cluster, whether for random reads/writes or for sequential scans: https://cloud.google.com/bigtable/docs/performance
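Back-of-the-envelope: if you assume something like 10,000 QPS per node (a rough placeholder on my part; use the real figures from the doc above), sizing a cluster for a sustained load is simple division:

```java
public class ClusterSizing {
  public static void main(String[] args) {
    // Assumption: ~10,000 QPS per Bigtable node; see the performance
    // doc linked above for the actual, current figures.
    long perNodeQps = 10_000;
    long sustainedQps = 38_000_000; // the sustained rate from the FIS load test
    long nodes = (sustainedQps + perNodeQps - 1) / perNodeQps; // round up
    System.out.println("Nodes needed: " + nodes); // 3800
  }
}
```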

Here's a benchmark comparing Cloud Bigtable to HBase and Cassandra that may be of interest (on a different benchmark than presented in the FIS blog post, but shows the relative price/performance): https://cloudplatform.googleblog.com/2015/05/introducing-Goo...

Disclosure: I am the product manager for Google Cloud Bigtable. Let me know if you have any other questions, I'm happy to discuss further.


Here is a section from the article I found interesting:

"Some worry that any insight into what could be the world’s largest repository of securities transactions will provide ways for either company to profit beyond cloud services....It’s also specified in the CAT proposal that whoever wins the bid must ensure the security and confidentiality of the data, and agree to use it only for appropriate surveillance and regulatory activities."

How will they actually enforce such clauses? Who is going to monitor what goes on inside these big corporations?

And if there is such a desperate need, why not initiate a sort of public-private partnership to form an independent entity dedicated solely to this purpose?

If Wall Street was too big to fail during the last recession, doesn't that mean Amazon and Google will be conferred the same blessing once they become the repository of such information, especially if there really isn't any simple way to enforce these clauses? So two of the biggest tech companies become candidates for bailouts; is the threat they already pose with the data they possess not enough for people?

Would love to hear thoughts from folks already working in fintech who might be more familiar with how such clauses are enforced.


I don't have direct experience to answer your question, but I think perhaps audits are part of the answer.

Really, you can go a long way by asking: who has access to the system storing the data? What is the policy for granting and revoking that access? What are the policies for handling the data (to avoid leaking it)?

I've worked at big tech companies but never come across a customer credit card number because there are policies for handling that data and audits to make sure they are obeyed. I think basic checks will go a long way.

(Granted, you're talking about a situation in which a company would have an incentive to subvert the controls on data; that's not really the case for credit card data.)


I would expect as much. But in these cases, would the auditors be expected to make their findings public?

My understanding is that the typical audit is stakeholder-driven. For an audit of Google's and Amazon's data-handling policies in these kinds of scenarios, who is the stakeholder?


This is an excellent idea, and clearly it would also help the SEC investigate illegal trading. Prediction, based on the exceptional efficiency of Wall St. lobbying: Congress will refuse to give the SEC the funding for this.


There must be more to this than what the article says. Where I work, I can look at the exact state of the order book, for tens of thousands of securities, on dozens of exchanges, at any point in history, or live. I can run simulations over the data in a few minutes per simulated day.

It didn't cost nearly $100M to build.


Are there not feeds that already collect this data? I was just listening to a podcast that described something similar from Nanex called NxCore.


Existing market data doesn't include info like the actual names of the firms/individuals behind each order. From the article it sounds like they might be including (or proposing to include) that level of detail.

For anyone doing trading, or considering it, it would be immensely useful to know which orders came from the same firm, even without knowing the real names.
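A toy sketch of why that matters: even an opaque, anonymized firm tag lets you cluster order flow by origin (the Order record and field names here are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FirmGrouping {
  // Hypothetical order record; firmTag is an opaque, anonymized identifier.
  record Order(String firmTag, String symbol, long qty, double price) {}

  // Grouping by the opaque tag reveals which orders share an origin,
  // even though no real firm names are exposed.
  static Map<String, List<Order>> bySameFirm(List<Order> orders) {
    return orders.stream().collect(Collectors.groupingBy(Order::firmTag));
  }
}
```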


Here is the video from FIS (formerly SunGard) and Google Cloud discussing their approach:

https://www.youtube.com/watch?v=fqOpaCS117Q

TL;DR: They achieved a peak of 56M QPS and a sustained 38M QPS when processing market data.

(disc: I work at Google)





