That's pretty cool. I love the idea of scraping real-world information.
I did something similar using a GoPro and computer vision when I lived next to one of the 101 off-ramps in SF. I got it to work for most daylight hours (headlights screwed it up hardcore) before our landlord raised our rent by $1000/mo and we moved.
I figured it could have been a way to calculate ad impressions for billboards, but I also figured Clear Channel probably already knows those numbers.
I wouldn't be so sure that Clear Channel knows those numbers. You'd be surprised how little data big companies have. Do you have a GitHub repo or blog post on your experiment?
Unfortunately I do not have experience in billboard advertising. However, I do have experience in massive companies (they suck) as a Software Engineer and I've realized how little they actually know about their own products and the data associated with them. You could probably contact Clear Channel as a prospective advertiser and just ask for simple data like expected eyeballs/views, population, etc. If they have absolutely no idea, then you have your answer that there might be a market for this. Your market might not even be the billboard company but rather someone trying to advertise. If I were to advertise, I would want some data behind the advertising platform along with some way to track its effectiveness. I assume you are the same @bduerst?
Fascinating, so I wonder how hard it would be just to put a GPS bug on the trains. These things are magnetic, and you can get them to send SMS messages to a number; just throw one up on top of the train car when it stops at the station :-). Of course Caltrain could do this officially, but I think that would be too efficient for them.
Heh, the author of this post's Twitter says he works at Matasano, which appears to be downstairs from you. I guess working next to the tracks has that effect on people!
Heh, I saw this on Twitter and responded to the author earlier. I'm working on a data mining project now with public transit times, comparing arrivals vs scheduled times. Since I live in the Bay Area, it made sense to use local data. However, 511.org, the repository (it seems) for all Bay Area transit APIs, doesn't publish any specific vehicle/route number, or what the actual scheduled time is for an arrival at a stop (though MUNI used to have a NextBus API that was really nicely detailed; I can't find any public hosting of it anymore, though).
My solution, since I didn't want to do any screen scraping or make trying to identify individual buses/trains a project in and of itself, was to use Portland's TriMet API. That API actually returns specific route numbers, and estimated and scheduled times for each stop (interpolated in the case of non-time points). I'm originally from the Portland area, so I'm pretty familiar with the geography and roads.
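A minimal sketch of the arrivals-vs-scheduled comparison described above, assuming each prediction has already been fetched and boiled down to a scheduled and an estimated time in epoch milliseconds. The `Arrival` record and its field names are hypothetical, not TriMet's actual response format, and the sample values are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Arrival:
    # Hypothetical record layout; not the real API's field names.
    route: int
    stop_id: int
    scheduled_ms: int
    estimated_ms: int


def delay_seconds(a: Arrival) -> float:
    """Positive means late, negative means early."""
    return (a.estimated_ms - a.scheduled_ms) / 1000.0


def mean_delay_by_route(arrivals):
    """Group delays by route number and average them."""
    by_route = {}
    for a in arrivals:
        by_route.setdefault(a.route, []).append(delay_seconds(a))
    return {route: mean(delays) for route, delays in by_route.items()}


# Made-up sample data, just to exercise the functions.
sample = [
    Arrival(route=4, stop_id=100, scheduled_ms=1_000_000, estimated_ms=1_090_000),
    Arrival(route=4, stop_id=101, scheduled_ms=2_000_000, estimated_ms=2_030_000),
    Arrival(route=12, stop_id=200, scheduled_ms=1_500_000, estimated_ms=1_440_000),
]
print(mean_delay_by_route(sample))  # {4: 60.0, 12: -60.0}
```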
From what I remember in the 511.org Google developer group, people have raised this exact issue, i.e. Caltrain train numbers. The guy responding from the MTA said they'd try and integrate it in the future, but these posts were like back in 2012 (IIRC).
NextBus is the source of bus position and predicted arrival times for MUNI, and appears to be the same for many other transit agencies. I can verify that (as of three minutes ago) it's still returning reasonable data.
However, if you're looking for the SFMTA schedule [0], I don't think you can get it through the NextBus API. I do know that you can get it through a GTFS "feed" found here: http://sfmta.com/about-sfmta/reports/gtfs-transit-data
Also, you might be interested in this, if you haven't seen it already: http://bdon.org/transit/ (SF MUNI transit delays. [This isn't my work.])
[0] Why would you want MUNI's schedule? It's not like any of the drivers care about it! ;)
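For anyone who does want the schedule, here is a rough sketch of pulling scheduled arrivals for one stop out of a GTFS feed's stop_times.txt. The column names (trip_id, arrival_time, stop_id) come from the GTFS spec; the sample rows and IDs are made up, and a real feed would be read from the unzipped file rather than a string. Note that GTFS times can run past 24:00:00 for trips that continue after midnight, so they are converted to seconds instead of being parsed as clock times.

```python
import csv
import io


def gtfs_time_to_seconds(hms: str) -> int:
    """Convert a GTFS HH:MM:SS time (hours may exceed 23) to seconds."""
    h, m, s = (int(part) for part in hms.strip().split(":"))
    return h * 3600 + m * 60 + s


def scheduled_arrivals(stop_times_file, stop_id):
    """Return (trip_id, arrival_seconds) for one stop, sorted by time."""
    reader = csv.DictReader(stop_times_file)
    rows = [
        (row["trip_id"], gtfs_time_to_seconds(row["arrival_time"]))
        for row in reader
        # arrival_time may be blank for stops that are not timepoints
        if row["stop_id"] == stop_id and row["arrival_time"].strip()
    ]
    return sorted(rows, key=lambda r: r[1])


# Made-up rows in stop_times.txt layout.
SAMPLE = (
    "trip_id,arrival_time,departure_time,stop_id,stop_sequence\n"
    "T2,25:10:00,25:10:00,S1,1\n"
    "T1,08:05:00,08:05:30,S1,1\n"
    "T1,08:15:00,08:15:00,S2,2\n"
)
print(scheduled_arrivals(io.StringIO(SAMPLE), "S1"))
# [('T1', 29100), ('T2', 90600)]
```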
Side note: instead of buying Burp Suite, check out just pure free Chrome or Firefox browsers to watch your HTTP traffic -- they both have pretty good Developer Tools, even IE does. They will show you the returned HTML formatted, and let you change it.
mitmproxy, Charles Web Proxy and Fiddler have also worked for me. I've never understood why someone would pay so much more for Burp if they're not going to use much of it. And for half of the rest, there's plenty of other tools or scripting languages you could use and save yourself a pile of money. I'd love to be convinced otherwise, after all I openly admit I haven't used Burp yet...
I use Burp (and much of its featureset) every day at work; that's why I used it here. The free edition does basically everything you could want except for saving / restoring states (request history, requests you modified, etc). I've also used Fiddler to great effect in the past, but the fact that Burp is written in Java makes it really convenient to use when you deal with multiple OSes.
Every day on Caltrain includes unexpected delays, and it's interesting to look at how those delays affect the system. Check out the linked visualization of Boston-area trains: http://mbtaviz.github.io/ - a single day had two major train problems.
Caltrain also runs extra trains for special events such as baseball games. Those are planned, but not on the regular schedule.
Sure, they could fix delays! A good start to doing so would be:
- Eliminate at-grade crossings
- Switch to level platform boarding
- Double the number of trains
- Replace the current trains with faster-accelerating electric ones
That ought to be done within a decade or two. Seems like more work than releasing the data they already have.
If Caltrain adopted higher platforms, they would interfere with the size of freight trains passing through.
There are ways around this: If all the high platforms were on 4-tracked sections, the freight could use the express tracks only. Mixing high and low boarding might be more painful though.
Not to mention the current train stock doesn't support level boarding, so Caltrain would be restricted to using them as express trains and only having high platforms at local stations, for example.
Or the platforms could be set back from the trains, with extending platforms bridging the gap, but that is potentially unreliable and costly.
There is a mostly reasonable blog about these kinds of issues at http://caltrain-hsr.blogspot.com/ though it tends to be written as if its proposals are the only reasonable choice.
You use Moscow as an example so I'll assume you are from there. I am from Germany. That means we both have experienced countries with a well designed public infrastructure including public transport. Americans do not have this background. In America, the general attitude towards providers of public infrastructure (be it government or private like Caltrain) is that they are assumed to be incompetent, their actions should be opposed, and their business disrupted.
I ride Caltrain every day and the experience annoys me a lot (still less than driving), but I feel sorry for the people running Caltrain because despite providing a valuable infrastructure service that is in huge demand, they are considered the lowest priority by everyone they interact with and have to deal with the most ridiculous restrictions and regulations. For example, Caltrain knows that they are always behind schedule during rush hour and want to adjust their schedule. But to do that they have to consult everyone and their uncle over a year-long process where every nutjob's concerns about a five minute scheduling change can stall the entire process.
Infrastructure projects deal with such problems everywhere, but I have not seen a place where this attitude is as deeply ingrained and systematic as it is here.
Because reality never matches plans like that. For starters, there is track maintenance, variable loading/unloading time, momentary delays or speed reductions in order to maintain separation when there are a lot of trains running at the same time, and the simple fact that drivers are not necessarily hitting exactly the same acceleration and deceleration curves every single time.
By the end of the day, you're going to be more than a few minutes off from how you started.
Nonstop long-distance rail service can exactly match its schedule because there are relatively few places for entropy to creep in - you just hold a constant speed across miles and miles of track that you pretty much have to yourself. A commuter rail system is much more complex and there is much more room for entropy.
It's certainly not trivial to closely stick to a schedule (e.g. 95% of trains within a minute or two of scheduled times), but it's obviously possible for an urban railway to do so, because many do, often with far more aggressive schedules and higher volumes than Caltrain has.
Of course, many American systems are operating at a pretty severe disadvantage, being hamstrung by poor equipment and infrastructure, a lack of funding, understaffing, a hostile political environment, and even pervasive cultural attitudes that dismiss railroads as not being worth investing in. I suppose given all that, it's a wonder they do as well as they do...
But still, I think it's important to never forget: it's absolutely possible to do much better.
The point is that it doesn't matter. Why should a transportation authority move heaven and earth to satisfy some moralistic concern about keeping exactly to schedules?
You don't use CTA (Chicago transit) based on schedules; you go to your stop, read the board with accurate realtime predictions, and wait for a train to show up. It doesn't matter whether the schedule has anything to do with reality, just that you don't have to wait too long.
If you'd prefer not to spend too much time on the platform, then you can check your phone, which is accessing the realtime prediction feed anyway, to figure out when you should head to the station.
No part of normal usage of CTA depends in any way on the preset schedule, so spending money to keep to the schedule would be a waste. Not only is it a Hard Problem, it's one that can be worked around very easily by providing realtime train location and prediction data.
> you go to your stop, read the board with accurate realtime predictions, and wait for a train to show up. It doesn't matter whether the schedule has anything to do with reality, just that you don't have to wait too long.
I've never ridden CTA, but this is the attitude that I treat Boston's T subway system with. It's wonderful, and like you say, you just don't care. Go to platform, board a train. Time-to-trains are low enough that you can just go and wait. (Looking at a random T schedule, if you arrive on the platform randomly, it's a ~2-5 minute wait on average if things are on schedule.)
That's not Caltrain. Trains are not as frequent. (I transit daily from Mountain View to SF: during the morning rush from 7-9am, trains are anywhere from 7 minutes to 34 minutes apart.) Missing a train also doesn't just mean the time waiting for the next train, but also lost time due to the train itself. (For example, the 8:05 is 7 minutes behind the 7:57 in Mountain View, but 15 behind in SF.)
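To put a number on the "just turn up" wait: if you arrive at a uniformly random moment, you land in a given gap with probability proportional to its length and then wait half of it on average, so the expected wait is sum(h^2) / (2 * sum(h)) over the headways h. A quick sketch; the uneven headway list below is an illustrative guess within the 7 to 34 minute range mentioned above, not the actual timetable.

```python
def expected_wait(headways_min):
    """Mean wait in minutes for a rider arriving at a uniformly random time.

    A gap of length h is hit with probability h / total and costs h / 2
    on average, which gives sum(h^2) / (2 * total).
    """
    total = sum(headways_min)
    if total <= 0:
        raise ValueError("need at least one positive headway")
    return sum(h * h for h in headways_min) / (2 * total)


# Evenly spaced 5-minute service: wait is half the headway.
print(expected_wait([5, 5, 5, 5]))  # 2.5

# Hypothetical uneven morning-rush gaps (minutes); not real schedule data.
uneven = [7, 34, 8, 20, 15]
print(expected_wait(uneven))
print(sum(uneven) / len(uneven) / 2)  # half the mean headway, for comparison
```

The uneven case comes out worse than half the average headway, because long gaps are both more likely to be hit and more expensive when they are.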
Exceptional delays are significant, and not that exceptional. Trains hit things, catch up to other trains and follow them at a snail's pace, break down, stop for blocked tracks, can't be boarded due to overcrowding, depart early due to being "full", or are just missing without cause.
That said, if there was good data that just told me when trains would leave and when they would arrive, that'd be nice. I don't know of such a thing, and given the article, what exists looks clunky, complicated, and incorrect.
> Why should a transportation authority move heaven and earth to satisfy some moralistic concern about keeping exactly to schedules?
So far as I know (I haven't ridden it in a few years), CalTrain doesn't have frequent enough service to make schedules irrelevant. [I don't know the generally accepted figure, but I'd say anything with headways of greater than 10 minutes or so is not suitable for just-turn-up-and-go usage.]
Obviously frequent service is good, but there's a large variety of lines and sometimes very frequent service isn't warranted (or isn't possible because of funding/politics/etc).
For a line with non-frequent service, "just look at your smartphone" isn't a good solution, as (1) not everybody has a smartphone available, so it's a bad idea to make a transit service that requires one for decent service, and (2) more importantly, in many cases with less frequent lines not being able to plan can be a significant burden, especially when you need to make transfers along the way (in which case you always have to assume the worst case, and all the required uncertainty padding quickly adds up).
So for user convenience either you want frequent service, so planning isn't needed, or you want scheduling that's at least a little accurate, so planning is possible when necessary.
[There are also technical reasons for scheduling even very frequent trains in some cases, because when trains become frequent enough, the line itself can become a bottleneck, especially with complex services. For instance on many lines in Tokyo, ordinary, express, and limited-express trains all share the same tracks, with the expresses using in-station bypass tracks to overtake the non-expresses. The track network is also in many cases very non-linear, with many different lines sharing portions of tracks in some places, and different operators running onto each others' tracks. To do all this with high frequency services requires a delicate timing dance, and if you screw up the timing beyond some point, the whole thing very quickly falls apart.]
Caltrain is often behind schedule. Trains break down or hit cars. It's a huge pain for daily Caltrain commuters. See: https://twitter.com/Caltrainstatus
This is just the exceptional delays. People have given up reporting the "normal" delays we are now seeing every day due to overcrowding of the trains. The overcrowding slows down the onboarding/offboarding of trains by a lot.
Another crowdsourced Caltrain Twitter account is https://twitter.com/caltrain where you can see some of the more granular delays. All these crowdsourced status accounts should be proof that Caltrain SHOULD publish the raw data for us to use.
Even aside from major issues like hitting cars or people, delays are not "exceptional" when it comes to Caltrain -- it is the norm. An exception is Caltrain arriving on time.
Among the things Caltrain has to contend with (aside from old equipment prone to breaking) are several dozen at-grade crossings and freight train traffic on the same line (!)
While I feel for the grade crossings, freight traffic does not necessarily prevent good service. To give one example, the North London Line of the London Overground runs freight trains in between its every-15-minutes service outside peak hours.
I also had this idea, but I never executed it as I haven't thought of a way to solve the real vs. estimated times perfectly. Probably can get close w/some data mining, but not sure if it's worth the effort.
RE: scraping - instead of putting logic in your scraper, just download the entire section you need and store it as a file. Then parse it and shove it into a database whenever you feel like it. You can rerun the parsing any time, since you'll have all the historically scraped website data on disk.
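A rough sketch of that split, assuming the page bytes have already been fetched by whatever means. The function names, the one-file-per-fetch layout, and the toy parser are all just illustrative choices, not anything from the article.

```python
import tempfile
import time
from pathlib import Path


def save_snapshot(raw: bytes, out_dir: Path, fetched_at=None) -> Path:
    """Write the untouched response to disk, named by fetch time (epoch seconds)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    ts = int(time.time() if fetched_at is None else fetched_at)
    path = out_dir / f"{ts}.html"
    path.write_bytes(raw)
    return path


def reparse_all(out_dir: Path, parse):
    """Run a (possibly new or fixed) parser over every stored snapshot."""
    paths = sorted(out_dir.glob("*.html"), key=lambda p: int(p.stem))
    return [(int(p.stem), parse(p.read_bytes())) for p in paths]


# Demo with fake page bodies and a toy parser that counts table rows.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    save_snapshot(b"<tr>101</tr><tr>102</tr>", store, fetched_at=100)
    save_snapshot(b"<tr>101</tr>", store, fetched_at=200)
    print(reparse_all(store, lambda raw: raw.count(b"<tr>")))
    # [(100, 2), (200, 1)]
```

If the parser turns out to be wrong later, only the `parse` callable changes; the stored snapshots stay as they were fetched.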
It'd be neat if they published positional data. I know the old Nextbus public API for MUNI did that, and it was cool making maps of real time positions of vehicles. I'm sure the excuse now is security BS.
Yes, I have a few Clipper cards tied to travel bank accounts. Unfortunately, Clipper cannot be integrated into third-party vendors / apps and is prone to 24 hour account locks if transactions are declined. Adding money ad hoc is troublesome as well: use a POTS terminal, go to an approved retailer (Walgreens, etc.), or add it online ('available within 3-5 days').
Their commerce system is not mobile friendly and is a pain for mobile users. It could be much more efficient.