Question about the architecture though: what were the reasons for using HTTP as opposed to UDP (which is typically how these stats collection servers receive data)? It looks like it's possible to keep system load manageable since you aggregate the data and space out the HTTP requests, but why do this instead of just blasting the server with UDP requests?
Since we are cloud-based, our servers are remote from our customers'... and since there's no UDP availability over WAN, we had to find a solution that works nicely over HTTP. Does that make sense?
I definitely spoke too fast! You're right, it does work. UDP is a bit less reliable over WAN, but the main reason for the pre-aggregation is that fire-and-forget behavior is much harder to scale as a service and would waste a lot of WAN bandwidth.
Once you have pre-aggregation and only periodic updates... UDP doesn't make sense anymore.
HTTP is also a good fit for browser clients, which could make sense.
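To make the pre-aggregation point concrete, here's a minimal sketch of a client that aggregates in-process and flushes over HTTP periodically. The class name, endpoint, and JSON payload shape are all made up for illustration; this isn't DaTtSs's actual client, just the general pattern: many cheap in-memory increments, one small HTTP request per interval.

```python
# Minimal sketch: aggregate counters in-process, flush once per interval.
# The endpoint and JSON payload shape are illustrative assumptions.
import json
import threading
import time
import urllib.request

class AggregatingClient:
    def __init__(self, endpoint, flush_interval=60):
        self.endpoint = endpoint
        self.flush_interval = flush_interval
        self.lock = threading.Lock()
        self.counters = {}  # name -> running sum since the last flush

    def increment(self, name, value=1):
        # Cheap in-process aggregation: no packet leaves the box here.
        with self.lock:
            self.counters[name] = self.counters.get(name, 0) + value

    def flush(self):
        # One HTTP request per interval, however many increments happened.
        # This is what replaces per-event UDP fire-and-forget.
        with self.lock:
            snapshot, self.counters = self.counters, {}
        if not snapshot:
            return
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(snapshot).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    def run_forever(self):
        while True:
            time.sleep(self.flush_interval)
            self.flush()
```

With that loop in place, the WAN cost depends on the flush interval rather than the event rate, which is the whole point of the comment above.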
Some time ago I wrote a system to collect performance statistics for my company's internal use (we have 100+ application servers - 2-CPU hardware boxes with 4-6 cores each - and 1 server running the statistics). We then released it as open source.
Our system shows not only per-service or per-source-script statistics, but also which services are used by a particular script.
Drawback: all the documentation is in Russian.
It's an anagram of statsd, indeed. Tough on the eyes yet easy to remember, I think... Not the best name ever, I admit :)
- Combining graphs is one of the biggest missing features at the moment. We'll have simultaneous hover very soon to help with that somewhat, and we'll work on combining graphs right after! (useful for scale, of course!)
- On timers you have the main "filters" built in: min, max, the 10th/90th percentiles, and average (there's a quick sketch of how these might be computed after this comment). The filters are more a UI challenge than anything else, so we should be able to iterate on that. We wanted to get something out fast!
- Graph resizing: same as above. The sky's the limit; we're using d3.js, and it's pretty awesome, but we wanted to limit ourselves to the core features for this first release!
Thanks a bunch for your comments, they're really useful!
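For readers curious what those timer filters amount to, here's one plausible way to compute them over a single flush window. The nearest-rank-style percentile and the field names are assumptions for illustration, not necessarily how DaTtSs actually does it:

```python
# One way to compute the built-in timer "filters" over a flush window:
# min, max, average, and the 10th/90th percentiles.
def timer_filters(samples):
    if not samples:
        return None
    s = sorted(samples)
    n = len(s)

    def percentile(p):
        # Pick the sample at the p-th fraction of the sorted list,
        # clamped to the last index.
        return s[min(n - 1, int(p * n))]

    return {
        "min": s[0],
        "max": s[-1],
        "avg": sum(s) / n,
        "p10": percentile(0.10),
        "p90": percentile(0.90),
    }

# Example: latency samples (ms) collected during one window.
print(timer_filters([12, 7, 31, 9, 18, 25, 14, 11, 40, 16]))
# -> {'min': 7, 'max': 40, 'avg': 19.3, 'p10': 9, 'p90': 40}
```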
Do you support aggregating the data and showing graphs for percentiles? For example, I'd like to send you latency stats and then get a graph of latency at the 99th percentile (or 90th, or whatever).
Agreed! We are trying to communicate a clear vision, but at the same time we already are a replacement for statsd + graphite in PaaS (and some IaaS) environments where there is no UDP availability... So we're not entirely lying :)
It's a REST-based API with a very thin library. We're working on Ruby, Python, and Java drivers as we speak. Which language are you working with? I can keep you posted as soon as we have the driver ready!
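Since those drivers aren't out yet, the surface of a "very thin" REST client is easy to imagine. Everything below (class name, endpoint, auth header, field names) is hypothetical, just to show how little such a library needs to contain:

```python
# Hypothetical sketch of a thin REST driver; none of these names,
# endpoints, or fields are the real DaTtSs API.
import json
import urllib.request

class ThinStatsClient:
    def __init__(self, api_key, base_url="https://stats.example.invalid/v1"):
        self.api_key = api_key
        self.base_url = base_url

    def send(self, kind, name, value):
        # One generic call covers counters, gauges, and timers; the
        # library only handles encoding and authentication.
        payload = json.dumps({"type": kind, "name": name, "value": value})
        req = urllib.request.Request(
            self.base_url + "/stats",
            data=payload.encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer " + self.api_key,
            },
        )
        urllib.request.urlopen(req)

# e.g. ThinStatsClient("my-key").send("counter", "signups", 3)
```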
Oh come on man, don't tell me it's free. Tell me what numbers I can go up to before I get the "phone call" telling me I've hit some sort of API limit (if that; you could go the Google route and just shut stuff off without any notification or warning of any kind).
The ambiguity is like that Oracle Linux debacle.
I don't even really care if you intend to live in fantasy-land and never charge for the service (get real); at least give me some idea of what the limitations are.
I like the live demo though, good idea.
I want to use your service, but you need to set up some clearly defined boundaries. I'm not some sort of enterprise wonk trying to hold you to an uptime contract; I'm a startup guy trying to make certain I'm not hinging stuff my company uses on something overly uncertain.
Traffic is not a big issue since we have client-side pre-aggregation. You can aggregate 5 events per hour or 10 million; it's all the same for DaTtSs... and it doesn't kill your network stack.
As far as retention is concerned, let's say we aim for 1 month of history. It might be more if people say they want it, but we believe the value is in the analysis of what is happening now.
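A back-of-the-envelope calculation shows why the event rate stops mattering once aggregation happens client-side. All the numbers here (event rate, flush interval, payload and datagram sizes) are assumptions:

```python
# Illustrative arithmetic: pre-aggregated HTTP vs per-event UDP over WAN.
events_per_hour = 10_000_000   # raw increments on the client (assumed)
flush_interval_s = 60          # one HTTP flush per minute (assumed)
payload_bytes = 512            # one aggregated snapshot (assumed)
udp_datagram_bytes = 64        # typical small statsd packet (assumed)

flushes_per_hour = 3600 // flush_interval_s        # 60 requests/hour
http_bytes = flushes_per_hour * payload_bytes      # ~30 KB/hour
udp_bytes = events_per_hour * udp_datagram_bytes   # ~640 MB/hour

print(http_bytes, udp_bytes)  # 30720 vs 640000000
```

The HTTP side is constant in the event rate; only the per-event UDP side grows with it.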