
One of our simplest “screening” questions for DS roles at my company is: “Your model is 100% accurate. How do you feel?” If the answer is anything other than deep skepticism (data leakage, a trivial dataset, etc.), it’s a big red flag.
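To make it concrete, here's a toy sketch (entirely synthetic data; the "leaky" column stands in for any post-outcome field, like a refund flag) of how leakage produces a suspiciously perfect score:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)

    # The leak: a feature that is really just the label in disguise.
    X_leaky = np.column_stack([X, y])

    X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))  # ~1.0 - too good to be true

Drop the leaked column and the score falls back to something believable.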


These guys are saying something like 17M thefts in 2017, which to me is alarming and almost unbelievable, even if they’re off by a factor of 2. I mean, it’s not the end of the world if someone assumes my identity, but it certainly seems like a monumental hassle to deal with, and likely to cost several thousand dollars.

https://www.javelinstrategy.com/press-release/identity-fraud...


What I'm psyched about is OpenStreetMap data queryable with Athena. It's traditionally kind of a pain to convert PBFs into a queryable format.


Have you looked at Overpass API?

(it provides direct access to OSM data using a DSL: http://wiki.openstreetmap.org/wiki/Overpass_API/Overpass_QL )

For small-scale use the public servers are sufficient, and quite a few people seem to run private servers.
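For example, a quick query from Python (the coordinates and tag are just illustrative; overpass-api.de is one of the public endpoints):

    import requests

    OVERPASS_URL = "https://overpass-api.de/api/interpreter"
    # Overpass QL: drinking-water taps within 1 km of a point in Berlin.
    query = """
    [out:json][timeout:25];
    node["amenity"="drinking_water"](around:1000,52.5200,13.4050);
    out body;
    """
    resp = requests.post(OVERPASS_URL, data={"data": query})
    resp.raise_for_status()
    for el in resp.json()["elements"]:
        print(el["id"], el["lat"], el["lon"])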


In case you missed it, we just added support for 2D Geospatial Queries in Amazon Athena:

https://docs.aws.amazon.com/athena/latest/ug/geospatial-exam...

PS: I'm on the Athena team
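If you want to try it from code, here's a rough sketch of kicking off a geospatial query with boto3 - the table, database, and output bucket below are placeholders, not real resources (see the docs above for actual worked examples):

    import boto3

    athena = boto3.client("athena")

    # Find nodes inside a bounding polygon around San Francisco.
    query = """
    SELECT name, lat, lon
    FROM planet_nodes  -- hypothetical OSM-derived table
    WHERE ST_CONTAINS(
            ST_POLYGON('polygon ((-122.52 37.70, -122.35 37.70,
                                  -122.35 37.83, -122.52 37.83,
                                  -122.52 37.70))'),
            ST_POINT(lon, lat))
    """

    resp = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "osm"},  # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(resp["QueryExecutionId"])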


Out of pure curiosity, how so? I deal with Protobuf regularly, and as long as a decent library exists to dump it to JSON in a way that's domain-specific to your use case, it's trivial. Is that the only thing missing here?


For starters, the OSM PBF file format is not a single protobuf file! It's a collection of protobuf messages nested inside one another!

You can read more about the file format here: https://wiki.openstreetmap.org/wiki/PBF_Format

There are other problems, specific to OSM rather than PBF/protobuf, like needing to store the locations of nodes until the end of the file, because they can be referenced anywhere in the file.
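To give a feel for the nesting, here's a rough sketch of just the outer framing - it assumes you've generated fileformat_pb2 from the fileformat.proto on that wiki page:

    import struct
    import fileformat_pb2  # compiled from OSM's fileformat.proto

    def iter_blobs(path):
        with open(path, "rb") as f:
            while True:
                raw_len = f.read(4)
                if not raw_len:
                    break
                header_len = struct.unpack(">i", raw_len)[0]  # big-endian int32
                header = fileformat_pb2.BlobHeader()
                header.ParseFromString(f.read(header_len))
                blob = fileformat_pb2.Blob()
                blob.ParseFromString(f.read(header.datasize))
                yield header.type, blob  # "OSMHeader" or "OSMData"

    for blob_type, blob in iter_blobs("planet.osm.pbf"):
        # Each blob's payload is usually zlib-compressed and is itself
        # yet another protobuf message (PrimitiveBlock in osmformat.proto).
        print(blob_type, len(blob.zlib_data))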


Global OSM is ~40 GB - there are various libraries to translate it, but as you can imagine, the sheer size of the dataset causes challenges. You also have to make choices about how you translate the attributes - for example, whether to pull certain tags from the key:value field into separate columns in a table.

Yet another issue involves source and target geometries: features of the same type can be recorded inconsistently in OSM in terms of geometry, so translating disparate input types into a single output type involves choices. Yes, you can easily (after a wait!) get global OSM translated into something else, but making that something else exactly what you need can take effort.
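As one illustration of the tag-to-column choice, here's a sketch using pyosmium (one of those libraries; the file name and the tags pulled out are arbitrary):

    import osmium

    class AmenityHandler(osmium.SimpleHandler):
        def __init__(self):
            super().__init__()
            self.rows = []

        def node(self, n):
            # Promote two chosen tags into columns; everything else is dropped.
            if "amenity" in n.tags:
                self.rows.append((n.id, n.location.lon, n.location.lat,
                                  n.tags.get("amenity"), n.tags.get("name", "")))

    handler = AmenityHandler()
    handler.apply_file("extract.osm.pbf")
    print(len(handler.rows), "amenity nodes")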


Apache Airflow defines and runs DAGs in a sane way, IMO. Takes some configuration, but worth it for more complicated projects.
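A minimal sketch of what that looks like (recent Airflow 2.x API; the task bodies are placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pulling data")

    def load():
        print("writing data")

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_load = PythonOperator(task_id="load", python_callable=load)
        t_extract >> t_load  # extract runs before load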


The UK's NHS recently opened up a large amount of its data to Google [0]. In parallel efforts, a company called Nuna is gathering and unifying data from state level Medicaid programs so it can be analyzed similarly [1].

[0] https://www.newscientist.com/article/2086454-revealed-google...

[1] https://www.nytimes.com/2017/01/09/technology/medicaids-data...

(edit: formatting)


Actually a pretty good way to understand manual transmission. Very cool.


Does anyone know whether this is a degenerative disease or a condition one is born with? It's possible the whole brain is necessary for learning (including learning how to learn more abstract things) but can then be pruned heavily to move information into denser networks.


Whenever automated transportation becomes feasible - likely on the order of 5-10 years out - a good chunk of a multi-trillion dollar market will go to the few players ready to capture it. At this point, Uber is arguably the most promising bet: tons of data, name recognition, long-term vision, talent, and on-point execution. That's not to say they can't still miss the mark, but the market for transportation of people and goods is gigantic, and Uber is dead set on basically being the One to take it all. I can see why it'd be an appealing investment.


I've been a loyal Uber user since their second month of existence, but if Google offered me the same service for 10% less I would switch. I think this turns into a profitable but lower-margin business than people think.


Original paper: https://groups.csail.mit.edu/EVO-DesignOpt/groupWebSite/uplo...

Basically, here's how the "Data Science Machine" works:

1) "Feature synthesis": features related to the target are discovered by following foreign-key relationships in a relational database, automatically generating "deep" queries involving many joins. Aggregates like MIN, MAX, AVG, and STD are automatically computed to serve as additional features.

2) "Dimensionality reduction": truncated SVD reduces the feature length.

3) "Modeling": the remaining features are clustered and then modeled by random forests with learned hyperparameters.

There are a ton more optimizations, but that's the gist - a rough sketch of the flavor is below.
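The sketch (synthetic tables and target, heavily simplified): aggregates rolled up across a foreign-key relationship, then SVD, then a random forest:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import TruncatedSVD
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Child table: orders, joined to customers via customer_id.
    orders = pd.DataFrame({
        "customer_id": rng.integers(0, 200, size=2000),
        "amount": rng.exponential(50.0, size=2000),
    })

    # 1) Feature synthesis: MIN/MAX/AVG/STD aggregates per customer.
    feats = orders.groupby("customer_id")["amount"].agg(
        ["min", "max", "mean", "std"]).fillna(0.0)

    # Synthetic per-customer target, just to make the sketch runnable.
    y = (feats["mean"] + rng.normal(scale=10.0, size=len(feats)) > 50).astype(int)

    # 2) Dimensionality reduction: truncated SVD on the feature matrix.
    X = TruncatedSVD(n_components=2, random_state=0).fit_transform(feats)

    # 3) Modeling: a random forest on the reduced features.
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    print(clf.score(X, y))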

The machine can't be evaluated as a stand-alone product, though, because the researchers also mention manually generating features for the problems they tested, without reporting the effect of that manual intervention.

They do conclude the system is more of a time-saving device than the general pattern-recognition one that our journalists seem to think it is.

Edited: formatting



Thanks.

