Datasets You've Likely Never Seen (yhathq.com)
116 points by bane on July 2, 2015 | 13 comments



This is the site I usually turn to when I'm on the prowl for interesting datasets: http://rs.io/100-interesting-data-sets-for-statistics/ These datasets will make for some nice additions to this list.


I now have a spreadsheet of Spanish Silver Production 1720-1800 for no other reason than it was available to me.

Well, and a long lingering interest in Sid Meier's Pirates and Neal Stephenson's Baroque Cycle.


Well then, here's timing for you... Jimmy Maher, who has an excellent blog that chronicles the evolution of the game industry, posted an article about Pirates! today: http://www.filfre.net/2015/07/pirates/


That was fun. A couple of the sets were particularly interesting. No surprise that using LSD adversely affects cognitive performance, though the strength and linearity of the effect were perhaps more relentless than expected. It's a reasonable hypothesis that other drugs would produce similar outcomes, but we can't say for sure until the studies are done.

Eyeballing the marijuana data reveals that, except for Mississippi (and maybe Kentucky), the Pacific coast states have the lowest prices. Notably, OR was even lower than WA or CA. The graphs also show that west coast prices were consistently falling, whereas in the other low-cost states prices were either rising or fluctuating over time. OR (where I live) has a "laid back" reputation; maybe there's a connection.

Don't know what makes cannabis less expensive out here, though I've read our state is a leading pot grower/producer. Since Oregon is about to launch legal recreational marijuana sales, prices may drop even more.

Data: entertaining, and educational too.


California and Washington were long major suppliers, both among the top 5 producing states, with California first and Washington fifth. That data is from the 2006-2008 era, when I studied this more closely; however, I find it unlikely that either of the two has slipped significantly. In addition to the raw volume that they produce, they also serve as conduits for illegal drugs moved across the border (from Mexico and Canada). Oregon also produced, but the volumes were somewhat lower.

I have a conjecture that weed prices in those states are artificially high because the average buyer there is purchasing a premium product, one for which there simply isn't enough supply or demand in states that lack a) a direct line to the source, with first selection of the choice portions, and b) a culture where marijuana is the dominant cash crop in the state -- usually by a non-trivial amount -- making the whole culture of marijuana endemic to the state. Of note is that in 2006-2008, Washington (while fifth overall) was the state which produced the most hydroponically grown marijuana, which fetches a considerably higher price than much of the outdoor crop.

Oregon, then, is the first state that represents "the rest of the country" once you step away from weird effects right at the source, and seems to sit in a clear trough around California and Washington. (Such patterns existed back in 2006-2008, and also happened around Kentucky, which is another major producer state.)

tl;dr: Seattle people are pot snobs as well as coffee snobs, and the WA price of pot is high for the same reason the average cup of coffee in Seattle is high -- $4 lattes instead of $1 gas station coffee.


One of the big walls I hit as a data analytics person is how to turn data into actionable insights. I sent over the pigeon data to a friend who does pigeon research. Hope to see if it impacts his view of the pigeon world!


This is one of my pet topics to bore people with: how, for most subjects, we've now passed the point where data collection, or even accessibility, is the hard part. 10 years ago, for many things, there simply was no data; or if there was, you didn't know it existed, or it was very expensive. Today, the problem is that we don't know what to do with all the data. Of course loading it into R and making scatter plots is fine and dandy, and one can easily spend days writing elaborate dataset-specific analysis reports, trying out various techniques just because you've never used them.

But turning data into insights, or even further, actionable advice - that's a whole different story; and one that many people aren't really interested in (yet?), either, both researchers and practitioners...


I'm not sure I understand your last sentence; could you elaborate?


What I mean is that many people are still stuck at the 'we need more data' stage, or at least at the 'better data collection/verification' stage. And that the focus of much analysis and modeling is less on actionable advice and more on... well, how shall I put it, 'dissecting' data, without having a clear way in mind how that dissection will lead to insights that are relevant for the stakeholder.

I should probably mention that this is in the context of academia; I guess business analytics has an existential, intrinsic motivation to be actionable.


Couldn't it still be the case, though, that we have too much data, but it's also the wrong sort for actionable insights? As a scientist I find the most actionable data are often in the smallest, custom-made datasets, driven by some question, not trawling through masses of data collected without a goal, hoping that they'll have collected the right thing.


Sure, could very well be, and what is a perfect fit in one situation might be unusable in another situation that on the surface looks almost the same. It would be silly for me to claim that all the data we need is already being collected or something like that. But that's not at odds with my abstract point: the realization that data is usually no longer the problem, and that the problem is instead not knowing what to do with it, hasn't sunk in for most people. (This makes it sound like I think of myself as someone who has seen The Light and 'those others' are chumps, which would obviously be delusional of me, and I don't mean it that way.)

I guess what I'm failing to articulate here is the shift that has snuck up on us over the last 10 or so years. My bitching about the quality of datasets today is about increasingly marginal issues (at the macro scale of course, there are still crap individual datasets, obviously); whereas 15 years ago, I didn't even have datasets to bitch about.


This has been my experience while taking master's courses, but I assumed it was because it's important to learn about all the available tools and techniques. Has there been no research on methodologies that aid in discovering insights?

It reminds me of the difficulties in teaching someone how to prove statements in math. The basic approach is to learn as many techniques as possible (tools in your tool bag), review existing proofs, and practice. You really can't teach someone how to find the connections and insights that lead to a proof. I was once told that was the art/creativity in proofs.


Those datasets aren't very robust, though. The sample visualizations provided with them are about the extent of analysis possible.



