FOILing NYC’s Taxi Trip Data (chriswhong.com)
92 points by danso on June 15, 2014 | 49 comments



The post is from March 18th. Did Chris end up hosting the data somewhere?

Edit: Talked to Chris on Facebook. He is in the process of uploading it to someone who has volunteered to host, and I have offered to host on Summon.com.

If you can host/mirror this data please reply here.



Thank you!


Wouldn't BitTorrent make much more sense for this? At least then people can pick which files they want instead of downloading a 50GB file.


Yes, it would. I'll post here once I get a link to the original upload (which, as I understand, won't be BitTorrent).


Archive.org? Once you have an account: https://archive.org/create/


And if you don't have an account just go to #archiveteam in EFNet IRC and ask someone there to upload it for you.


I have an account. Get in touch (details in profile) if you want it uploaded.


Good thing mega.co.nz gives you 50GB for free. :-)

Also, BitTorrent was built for this, as mentioned elsewhere...


Happy to mirror this content as well. If uploading is an issue, we can also accept a hard disk sent to the datacenter.


You’re not only required to provide a large enough hard drive, but it must be “brand new, still in the box and unopened”, presumably for security reasons. This requirement is a bit silly in my opinion, and probably prevents a lot of would-be FOILers from getting this data.

That's a completely sensible stipulation. It's imperfect like all security, but it's a legitimate reason rather than just a barrier.


If it's for security, it would be theater. How hard would it be for someone to open the package, put their spyware on it and repackage the drive? Even easier are OEM drives that come in unsealed boxes.

I'm guessing it's more to make sure the drive is in working order and clean of any data. This would limit liability for accidentally deleted data and broken drives. If the drive doesn't work, it's still under warranty. It also saves time on trying to make an old drive work.

A new drive equals zero hassle for them.


It also stops Joe Shmoe from coming in with the same drive he's been using for a few years already but has no clue is filled with all sorts of malware and viruses.


Seems like they could close that gap easily enough by buying the drive themselves, and charging you the cost. Perhaps there are requirements of the law that prohibit this but allow restrictions on user-provided hardware.


A "new" drive in a box certainly wouldn't prevent your scenario from happening (somebody faking a new drive), but is that really the person's concern?

I'm thinking the biggest concern would be plugging in someone's drive who didn't know they had malware. In that case, the requirement works pretty well.


Wow, I'm impressed with the responsiveness. It could be better, but honestly, if the other municipalities in New York state were up to at least this standard, I'd be very happy. In my experience, however, FOIL requests are often delayed, "forgotten", or ridiculously stored (on reams of paper, in ancient data formats, etc).


Processing issues are usually only a problem if the information implicates the government in criminal activity.


Call me gobsmacked.

Since I've lived in NY, I've seen plenty of cool visualizations and stories...about where pickups happen, time of day, volume, etc., and I've periodically asked around, where does the data come from? Obviously I didn't ask well enough...because if all it took was an old-fashioned public request (and a brand new hard drive)...wow.

The trip data is interesting enough...but the fare data is really mind-blowing. Every time I get out of a cab, I wonder, "should I have tipped that much?" The (crowd-based) answer is apparently not that hard to find...


There are 2 possibilities:

1. People aren't tipping.
2. Cab drivers aren't reporting their tips.

Being a native New Yorker, legality aside, I'd bet most cash tips go unreported. Reporting a tip makes that tip taxable, so there is a very strong incentive to bury it if they think they can avoid trouble or suspicion.


Then you can filter the data to show only CC transactions. There might be some additional variance of CC vs cash tipping, but I think the overall trend will still be there in just the CC transactions.
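A minimal sketch of that filter, assuming the fare file is a CSV with `payment_type`, `fare_amount`, and `tip_amount` columns and that card payments are coded `CRD`. Those column names and the code value are assumptions for illustration, and the sample rows are made up:

```python
import csv
import io

def card_tip_rate(rows, card_code="CRD"):
    """Return total tips / total fares over card-paid trips only.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The column names and `card_code` are assumed, not confirmed.
    """
    fares = tips = 0.0
    for row in rows:
        # Skip cash (and any other non-card) trips.
        if row["payment_type"].strip().upper() != card_code:
            continue
        fares += float(row["fare_amount"])
        tips += float(row["tip_amount"])
    return tips / fares if fares else None

# Made-up sample standing in for the real fare file.
SAMPLE = """payment_type,fare_amount,tip_amount
CRD,10.00,2.00
CSH,10.00,0.00
CRD,20.00,4.00
"""

rate = card_tip_rate(csv.DictReader(io.StringIO(SAMPLE)))
print(rate)
```

Reading the real file would only mean swapping the `io.StringIO` sample for an opened file handle.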


Am I the only one that sees some significant privacy issues with exact pickup/drop off times and location being released? It seems like singling out a single passenger's data (e.g. to/from home) would not be that difficult.


Not many people take taxis from home to work every day, or even weekly. If you are the type, then likely, you are someone who lives in Manhattan (getting to work via cab in a borough is a tenuous situation) and in a dense enough area where you are one of dozens/hundreds of people who could conceivably be dropped off at your home spot (think of the density of high-rises).


The question is more inspired by someone I know who lives in Manhattan who has a psycho ex. This data would answer the question (if he were tech savvy enough to mine it) "Where does her new boyfriend live?" which is rather frightening IMO.


OK...but how would this psycho-ex track the new boyfriend down?

Presumably, the ex knows where the girlfriend lives...and I guess, he also knows what the new boyfriend looks like? So he watches the apartment until the BF leaves by taxi. The ex then notes the taxi's time of pickup. And then...

The ex waits a full month before calling up the TLC, buying a new hard drive, transferring a couple of GB, and then doing the data analysis to find that particular taxi that made a pickup within the vicinity of the girlfriend's apartment, and finding where that taxi made a dropoff?

And then the ex goes to those coordinates and...then what? Barges into one of the high-rises and knocks on every door until he finds the new boyfriend?

I think that if the psycho-ex were to act like a psycho, he probably will not do it through this kind of data analysis.


Not just a full month. Up to six months it seems. The response from TLC made it sound like new data is only available twice a year.


That's the sort of thing the city will deal with in retrospect after a leak leads to a robbery or murder.


Very cool article! I'm torn on the "bring your own hard drive" issue. In one way it is very anachronistic given today's cloud technology but the flip side is that the OP was dealt zero procedural roadblocks along the way. Nobody at the city said "No." and they seemed helpful at every step. I'd tally that as a Win in today's bureaucratic and overly secretive world.

I find myself wanting to make a FOI request to my city. I have seen tricked out Parking Enforcement cars trolling the streets this year. They have license plate reading cameras mounted along the car's perimeter. I want to know if that information is stored, for how long, and who has access to it. Have any law enforcement agencies queried the database?

I would appreciate all the pointers I can get for proceeding with a FOI request. So far, I have been using MuckRock as my primary source of tutorial.


> who has access to it

You can get the ANPR database under FOI: https://github.com/johnschrom/Minneapolis-ALPR-Data


Is he allowed to post the data online? If so, maybe we just need to collectively do the FOIL requests and upload the data to a community-managed site where it can be made available to anyone.


Isn't that muckrock?


Google doesn't know what mudrock is. Do you have a URL?



Did you edit a typo or was I just blind? I searched for mudrock, not muckrock. Thanks. :D


Yeah, I edited it. Sorry! :)


Medallion and hack license appear to be one-way unsalted hashes.


That will make things interesting. I think people elsewhere in the thread really aren't grasping the privacy implications.


I found that the first medallion in the trip data was the MD5 hash for 9N35, not a good sign for privacy
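To see why an unsalted hash over a small ID space is a weak sign for privacy, here is a rough sketch that assumes medallions follow a digit-letter-digit-digit pattern like the `9N35` above. That pattern is an assumption (real medallions may use other formats too), and the function names are hypothetical:

```python
import hashlib
import string
from itertools import product

def md5_hex(text):
    """Unsalted MD5 of an ASCII string, as a hex digest."""
    return hashlib.md5(text.encode("ascii")).hexdigest()

def build_lookup():
    """Map digest -> plaintext for every digit-letter-digit-digit ID.

    The ID pattern is assumed from the single example `9N35`.
    """
    table = {}
    for d1, letter, d2, d3 in product(string.digits, string.ascii_uppercase,
                                      string.digits, string.digits):
        plain = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(plain)] = plain
    return table

lookup = build_lookup()          # only 10 * 26 * 10 * 10 candidates
hashed = md5_hex("9N35")         # stand-in for a value read from the trip file
print(len(lookup), lookup[hashed])
```

With only a few tens of thousands of candidates, the whole table builds almost instantly, and without a salt the same table works for every row in the file.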


Can we get this put on S3? Would love to play around with it.


I think a torrent would be better. There will probably be lots of interest in this, so S3 can get expensive.


Useful tip, if you add "?torrent" to the end of the url for something stored in S3, you get a torrent. I think this only works for files < 5GB though.
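A small sketch of that tip, with a hypothetical bucket and file name. It only builds the URL; it does not check that the object is publicly readable, that it is under the size limit, or that S3 still offers the endpoint:

```python
from urllib.parse import urlsplit, urlunsplit

def s3_torrent_url(object_url):
    """Append the `torrent` query flag to an S3 object URL."""
    parts = urlsplit(object_url)
    # Preserve any existing query string rather than overwriting it.
    query = f"{parts.query}&torrent" if parts.query else "torrent"
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       query, parts.fragment))

# Hypothetical bucket and object name, for illustration only.
url = s3_torrent_url("https://example-bucket.s3.amazonaws.com/trip_data_1.csv.zip")
print(url)
```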


Any idea of the reason for the size limit? That's a pretty weird limit to put on a torrent feature.


Reducing the incentive to host uncompressed DVDs / BluRay?


It's an S3 torrent endpoint architectural limitation.


That is awesome. Thanks for the tip.


S3 can serve its contents through BitTorrent as well, so you only really need to serve up 1 copy of the file (realistically, it will end up being a bit more than that, even if we are dealing with good internet citizens).


Anybody got cool ideas for how to visualize such a dataset? I have a similar set of data I collected, but haven't gotten beyond the "trips per day, length, etc" basics. I feel there is something beyond the most basic visualisation, on a more meta level, but am not sure what.


Just a thought. You could do some interesting reporting on this data. What about finding all trips to a particular address, e.g. a politician's house? Or finding all taxis exiting a known crime scene.


Abortion clinics, drug clinics, psychiatrists, cancer specialists...

Now, Manhattan may be dense enough that it might not leak too much personal information, but the same granularity of location data in a suburban or rural area may be very intrusive.


I'm curious how large the data is after gzipping it.
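One way to get a rough feel for it without the full download is to gzip a sample and look at the ratio. The sketch below uses synthetic rows with made-up column names, so the number it prints says nothing definite about the real files:

```python
import gzip

def gzip_ratio(raw: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(raw)) / len(raw)

# Synthetic CSV rows; the columns and values are invented for illustration.
header = b"medallion,pickup_datetime,trip_distance\n"
rows = b"".join(
    b"ABC%03d,2013-01-01 00:%02d:00,1.%d\n" % (i, i % 60, i % 10)
    for i in range(1000)
)
ratio = gzip_ratio(header + rows)
print(f"{ratio:.2%}")
```

Running the same function over the first few megabytes of a real trip file would give a more meaningful estimate.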



