I'd like Caltrain to publish raw train data

tatsiana · on Sept 2, 2014

We've been working on the solution to this issue since our office is overlooking the tracks. You can read more here: http://svds.com/post/listening-caltrain and here: http://svds.com/post/railroad-modeling-hadoop-scale-hadoop-s...

bduerst · on Sept 2, 2014

That's pretty cool. I love the idea of scraping real-world information.

I did something similar using a GoPro and computer vision when I lived next to one of the 101 off-ramps in SF. I got it to work for most daylight hours (headlights screwed it up hardcore) before our landlord raised our rent by $1000/mo and we moved.

I figured it could have been a way to calculate ad impressions for billboards, but I also figured Clear Channel probably already knows those numbers.

winslow · on Sept 2, 2014

I wouldn't be so sure that Clear Channel knows those numbers. You'd be surprised how little data big companies know. Do you have a github or blog post on your experiment?

bduerst · on Sept 4, 2014

Sadly I don't, and it was a self-project I was doing to learn OCR (to scrape price tags in grocery stores) and thus I never backed it up on github.

Do you have industry experience with billboard advertising? I might rebuild it if there is an actual demand for it.

winslow · on Sept 4, 2014

Unfortunately I do not have experience in billboard advertising. However, I do have experience in massive companies (they suck) as a Software Engineer and I've realized how little they actually know about their own products and the data associated with them. You could probably contact Clear Channel as a prospective advertiser and just ask for simple data like expected eyeballs/views, population, etc. If they have absolutely no idea, then you have your answer that there might be a market for this. Your market might not even be the billboard company but rather someone trying to advertise. If I were to advertise, I would want some data behind the advertising platform along with some way to track its effectiveness. I assume you are the same @bduerst?

ChuckMcM · on Sept 2, 2014

Fascinating, so I wonder how hard it would be just to put a GPS bug on the trains. These things are magnetic, you can get them to send SMS messages to a number, throw it up on top of the train car when it stops at the station :-). Of course Caltrain could do this officially but I think that would be too efficient for them.

rachelbythebay · on Sept 3, 2014

Hide one inside the power outlets upstairs and you need not worry about batteries... Just gps reception and cell signals.

ChuckMcM · on Sept 3, 2014

Even better :-) GPS is nominally a pain on metal roofed train cars but it probably is doable.

mcpherrinm · on Sept 2, 2014

Heh, the author of this post's twitter says he works at Matasano, which appears to be downstairs from you. I guess working next to the tracks has that effect on people!

ZanyProgrammer · on Sept 2, 2014

Heh, I saw this on Twitter and responded to the author earlier-I'm working on a data mining project now with public transit times, comparing arrivals vs scheduled times. Since I live in the Bay Area, it made sense to use local data. However, 511.org, the repository (it seems) for all Bay Area transit APIs, doesn't publish any specific vehicle/route number, or what the actual scheduled time is for an arrival at a stop (though MUNI used to have a nextbus API that was really nicely detailed-I can't find any public hosting of it anymore though).

My solution, since I didn't want to do any screen scraping or make trying to identify individual buses/trains a project in and of itself, was to use Portland's TriMet API. That API actually returns specific route numbers, and estimated and scheduled times for each stop (interpolated in the case of non time points). I'm originally from the Portland area, so I'm pretty familiar with the geography and roads.

From what I remember in the 511.org Google developer group, people have raised this exact issue, i.e. Caltrain train numbers. The guy responding from the MTA said they'd try and integrate it in the future, but these posts were like back in 2012 (IIRC).
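
For anyone wanting to try the arrivals-vs-schedule comparison, here is a minimal sketch of the core calculation. It assumes you have already pulled (route, stop, scheduled, estimated) tuples out of whatever feed you use; the record layout, route numbers, and stop names below are made up for illustration and are not TriMet's actual response format.

```python
from datetime import datetime
from statistics import mean

# Hypothetical records: (route, stop, scheduled arrival, estimated arrival).
arrivals = [
    ("9", "stop-A", datetime(2014, 9, 2, 8, 0), datetime(2014, 9, 2, 8, 3)),
    ("9", "stop-B", datetime(2014, 9, 2, 8, 10), datetime(2014, 9, 2, 8, 11)),
    ("14", "stop-A", datetime(2014, 9, 2, 8, 5), datetime(2014, 9, 2, 8, 4)),
]


def delays_by_route(records):
    """Return the mean delay in minutes per route (negative means early)."""
    per_route = {}
    for route, _stop, scheduled, estimated in records:
        minutes = (estimated - scheduled).total_seconds() / 60
        per_route.setdefault(route, []).append(minutes)
    return {route: mean(values) for route, values in per_route.items()}


print(delays_by_route(arrivals))
```

Keeping early arrivals as negative numbers, rather than clamping them to zero, makes it possible to tell a line that runs ahead of schedule from one that is simply on time.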

simoncion · on Sept 3, 2014

If you're still interested in doing Muni data mining, you'll probably be interested in this: http://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf

NextBus is the source of bus position and predicted arrival times for MUNI, and appears to be the same for many other transit agencies. I can verify that (as of three minutes ago) it's still returning reasonable data.

However, if you're looking for the SFMTA schedule [0], I don't think you can get it through the NextBus API. I do know that you can get it through a GTFS "feed" found here: http://sfmta.com/about-sfmta/reports/gtfs-transit-data

Also, you might be interested in this, if you haven't seen it already: http://bdon.org/transit/ (SF MUNI transit delays. [This isn't my work.])

[0] Why would you want MUNI's schedule? It's not like any of the drivers care about it! ;)
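
If you go the GTFS route for the schedule, a rough sketch of reading scheduled arrivals for one stop out of `stop_times.txt` might look like the following. The column names come from the GTFS spec; the sample rows, trip IDs, and stop IDs are invented, and a real feed would be read from the unzipped file rather than an inline string. Note that GTFS times can run past 24:00:00 for trips that cross midnight, so they are kept as seconds since the start of the service day.

```python
import csv
import io


def gtfs_seconds(hms):
    """Convert a GTFS HH:MM:SS time (hours may exceed 23) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def scheduled_arrivals(stop_times_csv, stop_id):
    """Return sorted (arrival_seconds, trip_id) pairs for one stop."""
    reader = csv.DictReader(io.StringIO(stop_times_csv))
    rows = [
        (gtfs_seconds(row["arrival_time"]), row["trip_id"])
        for row in reader
        if row["stop_id"] == stop_id
    ]
    return sorted(rows)


# Invented sample in the stop_times.txt layout.
SAMPLE = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:05:00,08:05:30,S1,1
T2,25:10:00,25:10:30,S1,1
T1,08:12:00,08:12:30,S2,2
"""

print(scheduled_arrivals(SAMPLE, "S1"))
```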

rakoo · on Sept 2, 2014

Author, you should integrate your scraper into http://raildar.fr, they've already started to scratch that kind of itch for a similar problem.

deepsun · on Sept 3, 2014

Side note: instead of buying Burp Suite, check out just pure free Chrome or Firefox browsers to watch your HTTP traffic -- they both have pretty good Developer Tools, even IE does. They will show you the returned HTML formatted, and let you change it.

lstamour · on Sept 3, 2014

mitmproxy, Charles Web Proxy and Fiddler have also worked for me. I've never understood why someone would pay so much more for Burp if they're not going to use much of it. And for half of the rest, there's plenty of other tools or scripting languages you could use and save yourself a pile of money. I'd love to be convinced otherwise, after all I openly admit I haven't used Burp yet...

hansnielsen · on Sept 3, 2014

I use Burp (and much of its featureset) every day at work; that's why I used it here. The free edition does basically everything you could want except for saving / restoring states (request history, requests you modified, etc). I've also used Fiddler to great effect in the past, but the fact that Burp is written in Java makes it really convenient to use when you deal with multiple OSes.

guard-of-terra · on Sept 2, 2014

"But that’s just the planned schedule"

Why won't it match the real schedule?

britta · on Sept 2, 2014

Every day on Caltrain includes unexpected delays, and it's interesting to look at how those delays affect the system. Check out the linked visualization of Boston-area trains: http://mbtaviz.github.io/ - a single day had two major train problems.

Caltrain also runs extra trains for special events such as baseball games. Those are planned, but not on the regular schedule.

guard-of-terra · on Sept 2, 2014

Maybe they should be fixing that instead of providing raw data API? Or at least before.

Totally doable I should say.

mcpherrinm · on Sept 2, 2014

Sure, they could fix delays! A good start to doing so would be:

  - Eliminate at-grade crossings
  - Switch to level platform boarding
  - Double the number of trains
  - Replace the current trains with faster-accelerating electric ones

That ought to be done within a decade or two. Seems like more work than releasing the data they already have.

toomuchtodo · on Sept 2, 2014

You should also add "station-specific spurs", to allow trains to express through while trains are dropping off/picking up passengers at that station.

guard-of-terra · on Sept 2, 2014

This is a good start. Many railway systems in the world have that since, I don't know, 70s. Even in not first world exactly (Moscow railway in mind).

Sounds like doable in a few years. You may even keep some at-grade crossings.

mcpherrinm · on Sept 2, 2014

Electrification will be started by 2019. Level crossings will be eliminated by the time California HSR opens in 2029.

Whether or not level boarding happens, and what sort of service increases come with electrification are unclear.

jonemo · on Sept 3, 2014

Level boarding is currently not possible due to the requirements imposed by the one (or two?) freight trains going through per day.

guard-of-terra · on Sept 3, 2014

Why is that? It doesn't cause general problems, and specific problems may be treated if there are any.

Mentioned Moscow rail has lots of freight traffic and high-level boarding.

mcpherrinm · on Sept 3, 2014

It's an issue of "loading gauge".

If caltrain adopted higher platforms, they would interfere with the size of freight trains passing through.

There are ways around this: If all the high platforms were on 4-tracked sections, the freight could use the express tracks only. Mixing high and low boarding might be more painful though.

Not to mention the current train stock doesn't support level boarding, so Caltrain would be restricted to use them as express trains and only have high platforms at local stations, for example.

Or the platforms could be far away from the trains and they use extending platforms: That is potentially unreliable and costly.

There is a blog about these kinds of issues at http://caltrain-hsr.blogspot.com/ which is mostly reasonable, though it tends to be written as if its proposals were the only reasonable choice.

guard-of-terra · on Sept 4, 2014

If you have unusually wide freight trains - make separate tracks for those, restrict them from urban core, or make passenger trains as wide.

jonemo · on Sept 6, 2014

You use Moscow as an example so I'll assume you are from there. I am from Germany. That means we both have experienced countries with a well designed public infrastructure including public transport. Americans do not have this background. In America, the general attitude towards providers of public infrastructure (be it government or private like Caltrain) is that they are assumed to be incompetent, their actions should be opposed, and their business disrupted.

I ride Caltrain every day and the experience annoys me a lot (still less than driving), but I feel sorry for the people running Caltrain because despite providing a valuable infrastructure service that is in huge demand, they are considered the lowest priority by everyone they interact with and have to deal with the most ridiculous restrictions and regulations. For example, Caltrain knows that they are always behind schedule during rush hour and want to adjust their schedule. But to do that they have to consult everyone and their uncle over a year-long process where every nutjob's concerns about a five minute scheduling change can stall the entire process.

Infrastructure projects deal with such problems everywhere, but I have not seen a place where this attitude is so deeply engrained and systematic like here.

toomuchtodo · on Sept 2, 2014

Awesome! Get a gig with Caltrain and try to eliminate those unexpected delays.

GFK_of_xmaspast · on Sept 2, 2014

Many of Caltrain's problems are systematic, such as lots of at-grade crossings and mega-rich fucks in Atherton.

guard-of-terra · on Sept 2, 2014

Just curious how "mega-rich fucks in Atherton" prevent railway from functioning?

ZanyProgrammer · on Sept 2, 2014

NIMBYs preventing HSR, which in turn jeopardizes the funding course for Caltrain electrification?

kordless · on Sept 2, 2014

My advice is to never ask someone who is making blaming statements a question regarding the blaming statement.

superuser2 · on Sept 3, 2014

Because reality never matches plans like that. For starters, there is track maintenance, variable loading/unloading time, momentary delays or speed reductions in order to maintain separation when there are a lot of trains running at the same time, and the simple fact that drivers are not necessarily hitting exactly the same acceleration and deceleration curves every single time.

By the end of the day, you're going to be more than a few minutes off from how you started.

Nonstop long-distance rail service can exactly match its schedule because there are relatively few places for entropy to creep in - you just hold a constant speed across miles and miles of track that you pretty much have to yourself. A commuter rail system is much more complex and there is much more room for entropy.

snogglethorpe · on Sept 3, 2014

It's certainly not trivial to closely stick to a schedule (e.g. 95% of trains within a minute or two of scheduled times), but it's obviously possible for an urban railway to do so, because many do, often with far more aggressive schedules and higher volumes than Caltrain has.

Of course, many American systems are operating at a pretty severe disadvantage, being hamstrung by poor equipment and infrastructure, a lack of funding, understaffing, a hostile political environment, and even pervasive cultural attitudes that dismiss railroads as not being worth investing in. I suppose given all that, it's a wonder they do as well as they do...

But still, I think it's important to never forget: it's absolutely possible to do much better.

superuser2 · on Sept 3, 2014

The point is that it doesn't matter. Why should a transportation authority move heaven and earth to satisfy some moralistic concern about keeping exactly to schedules?

You don't use CTA (Chicago transit) based on schedules; you go to your stop, read the board with accurate realtime predictions, and wait for a train to show up. It doesn't matter whether the schedule has anything to do with reality, just that you don't have to wait too long.

If you'd prefer not to spend too much time on the platform, then you can check your phone, which is accessing the realtime prediction feed anyway, to figure out when you should head to the station.

No part of normal usage of CTA depends in any way on the preset schedule, so spending money to keep to the schedule would be waste. Not only is it a Hard Problem, it's one that can be worked around very easily by providing realtime train location and prediction data.

deathanatos · on Sept 3, 2014

> you go to your stop, read the board with accurate realtime predictions, and wait for a train to show up. It doesn't matter whether the schedule has anything to do with reality, just that you don't have to wait too long.

I've never ridden CTA, but this is the attitude that I treat Boston's T subway system with. It's wonderful, and like you say, you just don't care. Go to platform, board a train. Time-to-trains are low enough that you can go and wait. (Looking at a random T schedule, if you arrive on the platform randomly, it's ~2-5 minutes wait on average if things are on schedule.)

That's not CalTrain. Trains are not as frequent. (I transit daily from Mountain View to SF: during the morning rush from 7-9am, trains are anywhere from 7 minutes to 34 minutes apart.) Missing a train also doesn't just mean the time waiting for the next train, but also lost time due to the train itself. (For example, the 8:05 is 7 minutes behind the 7:57 in Mountain View, but 15 behind in SF.)

Exceptional delays are significant, and not that exceptional. Trains hit things, catch up to other trains and follow them at a snail's pace, break down, stop for blocked tracks, can't be boarded due to overcrowding, departed early due to being "full", or are just missing without cause.

That said, if there was good data that just told me when trains would leave and when they would arrive, that'd be nice. I don't know of such a thing, and given the article, what exists looks clunky, complicated, and incorrect.

> Why should a transportation authority move heaven and earth to satisfy some moralistic concern about keeping exactly to schedules?

Because you're wasting people's time?

snogglethorpe · on Sept 3, 2014

So far as I know (I haven't ridden it in a few years), CalTrain doesn't have frequent enough service to make schedules irrelevant. [I don't know the generally accepted figure, but I'd say anything with headways of greater than 10 minutes or so is not suitable for just-turn-up-and-go usage.]

Obviously frequent service is good, but there's a large variety of lines and sometimes very frequent service isn't warranted (or isn't possible because of funding/politics/etc).

For a line with non-frequent service, "just look at your smartphone" isn't a good solution, as (1) not everybody has a smartphone available, so it's a bad idea to make a transit service that requires one for decent service, and (2) more importantly, in many cases with less frequent lines not being able to plan can be a significant burden, especially when you need to make transfers along the way (in which case you always have to assume the worst case, and all the required uncertainty padding quickly adds up).

So for user convenience either you want frequent service, so planning isn't needed, or you want scheduling that's at least a little accurate, so planning is possible when necessary.

[There are also technical reasons for scheduling even very frequent trains in some cases, because when trains become frequent enough, the line itself can become a bottleneck, especially with complex services. For instance on many lines in Tokyo, both ordinary, express, and limited-express trains share the same tracks, with the expresses using in-station bypass tracks to overtake the non-expresses. The track network is also in many cases very non-linear, with many different lines sharing portions of tracks in some places, and different operators running onto each others' tracks. To do all this with high frequency services requires a delicate timing dance, and if you screw up the timing beyond some point, the whole thing very quickly falls apart.]

gkoz · on Sept 3, 2014

Isn't average waiting time lower with a regular schedule?
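
A quick way to check this, assuming riders turn up at uniformly random moments (my assumption, and only reasonable when people are not timing their arrival to a schedule): a gap of length h gets hit with probability proportional to h and costs h/2 on average, so the mean wait is sum(h^2) / (2 * sum(h)). The 7 and 34 minute gaps are the extremes quoted elsewhere in this thread, used here purely as an illustration.

```python
def expected_wait(headways):
    """Mean wait for a rider arriving at a uniformly random moment.

    Each gap of length h is hit with probability h / sum(headways)
    and costs h / 2 on average, giving sum(h^2) / (2 * sum(h)).
    """
    total = sum(headways)
    return sum(h * h for h in headways) / (2 * total)


regular = expected_wait([20.5, 20.5])  # two trains, evenly spaced
irregular = expected_wait([7, 34])     # same two trains, uneven gaps

print(regular, irregular)
```

With the same number of trains over the same 41 minutes, the even spacing gives a mean wait of 10.25 minutes and the uneven one about 14.7, so yes: for random arrivals, regular headways minimize the average wait.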

bowenli · on Sept 2, 2014

Caltrain is often behind schedule. Trains break down or hit cars. It's a huge pain for daily Caltrain commuters. See: https://twitter.com/Caltrainstatus

digitalchaos · on Sept 2, 2014

This is just the exceptional delays. People have given up reporting the "normal" delays we are now seeing every day due to overcrowding of the trains. The overcrowding slows down the onboarding/offboarding of trains by a lot.

Another crowdsourced Caltrain Twitter account is https://twitter.com/caltrain, where you can see some of the more granular delays. All these crowdsourced status accounts should be proof that Caltrain SHOULD publish the raw data for us to use.

markcerqueira · on Sept 2, 2014

Even aside from major issues like hitting cars or people, delays are not "exceptional" when it comes to Caltrain -- it is the norm. An exception is Caltrain arriving on time.

ak217 · on Sept 2, 2014

Among the things Caltrain has to contend with (aside from old equipment prone to breaking) are several dozen at grade crossings and freight train traffic on the same line (!)

jarek · on Sept 3, 2014

While I feel for the grade crossings, freight traffic does not necessarily prevent good service. To give one example, North London Line of the London Overground has freight trains in between every-15-minutes service outside peak hours.

bdamm · on Sept 2, 2014

Fortunately the freight traffic is mostly after commute hours. Imagine what will happen with those so-called "high-speed" trains coming through!

guard-of-terra · on Sept 2, 2014

Aren't you supposed to have separate high-speed track for high-speed trains?

bfung · on Sept 3, 2014

I also had this idea, but I never executed it as I haven't thought of a way to solve the real vs. estimated times perfectly. Probably can get close w/some data mining, but not sure if it's worth the effort.

RE: scraping - instead of putting logic in your scraper, just download the entire section you need, store it in file format. Then parse and shove into database whenever you feel like it. You could rerun the parsing since you'll have all the historically scraped website data on disk.
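
A minimal sketch of that split, assuming the fetch itself happens elsewhere and hands you the page body as a string. The function names, the file naming scheme, and the toy parser in the usage note are all made up for illustration.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional


def save_raw(html: str, out_dir: Path, fetched_at: Optional[float] = None) -> Path:
    """Write one fetched page to disk untouched, named by fetch time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    ts = time.time() if fetched_at is None else fetched_at
    # A short content hash keeps two fetches in the same second distinct.
    digest = hashlib.sha1(html.encode("utf-8")).hexdigest()[:8]
    path = out_dir / f"{int(ts)}-{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path


def reparse_all(out_dir: Path, parse):
    """Re-run any parser over every saved snapshot, oldest first."""
    for path in sorted(out_dir.glob("*.html")):
        yield path.name, parse(path.read_text(encoding="utf-8"))
```

The scraper only ever calls `save_raw`; when the parser turns out to be wrong, you fix it and call something like `list(reparse_all(Path("snapshots"), my_parser))` again over the full history without refetching anything.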

ZanyProgrammer · on Sept 2, 2014

It'd be neat if they published positional data. I know the old Nextbus public API for MUNI did that, and it was cool making maps of real time positions of vehicles. I'm sure the excuse now is security BS.

tzm · on Sept 2, 2014

I'd like Caltrain to accept mobile payments.

enos_feedler · on Sept 2, 2014

Use a clipper card? What is the pain?

tzm · on Sept 2, 2014

Yes, I have a few Clipper cards tied to travel bank accounts. Unfortunately, Clipper cannot be integrated into third-party vendors / apps and is prone to 24 hour account locks if transactions are declined. Adding money ad-hoc is troublesome as well: use a POTS terminal, go to an approved retailer (Walgreens, etc), or go online ('available within 3-5 days').

Their commerce system is not mobile friendly and is a pain for mobile users. It could be much more efficient.