Hacker News
Stack Exchange's monitoring system is now open source (github.com/opserver)
147 points by waffle_ss on Oct 15, 2013 | 56 comments



The anti-.net kneejerk reactions on HN really disturb me. You spend all day on Stack Overflow, then blindly bash their tech stack. Developing, deploying, and hosting .net apps is just fine. Many brilliant people choose .net and are plenty happy with it. Maybe rather than jumping to conclusions, you could give it a shot.


Stack Overflow (which I avoid as much as I can fwiw) is not a strong argument for the .net stack given the people behind it have moved to a different stack for their next project (http://www.codinghorror.com/blog/2013/03/why-ruby.html). That's not an isolated incident.

I've several friends working in that stack. While many defend it as the correct decision at the time (and I don't disagree with that), all would rather be using something else now.


Most of the people behind Stack Overflow are still here...and we haven't changed our stack and are quite happy with the performance we get...and there's always more to squeeze out.

We can and sometimes do run Stack Overflow (currently 3.3 billion hits a month) from 2 web servers and 1 SQL server...I think that's pretty good for any stack.


Performance isn't the reason I see people wanting to move away from that stack; it's more about library availability, tool ecosystem (including things like monitoring, so this is an improvement - though I doubt it will stem the tide), and language productivity.

(Not that your performance sounds like a compelling advantage for that stack. That's what, an average of 700 requests/second/web server? So the peak is probably around 1500? Pretty good indeed, but not outstandingly so - where I worked 3 years ago our system handled peaks of 600 requests/second/web server on a JVM-based stack. And I don't see .net topping the charts on http://www.techempower.com/benchmarks/).
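For the curious, the back-of-the-envelope arithmetic behind that ~700/second figure (assuming a 30-day month and an even split across the 2 web servers mentioned upthread):

```python
# Back-of-the-envelope check of the requests/second figure quoted above,
# assuming 3.3 billion hits/month, a 30-day month, and 2 web servers.
hits_per_month = 3.3e9
seconds_per_month = 30 * 24 * 3600
servers = 2

total_rps = hits_per_month / seconds_per_month
per_server_rps = total_rps / servers

print(f"total: {total_rps:.0f} req/s, per server: {per_server_rps:.0f} req/s")
# total: 1273 req/s, per server: 637 req/s -- roughly the ~700 quoted above
```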


First, that chart's for Linux systems...so it's not totally surprising to me that .Net isn't on it.

You have to keep in mind we're ridiculously over-provisioned; we're handling that load while maintaining 10-15% utilization on 3 year old web servers. Also, we're rendering quite a bit of dynamic data on each page in < 40ms (the average for a question page on Stack Overflow yesterday was 36ms).

We render every request we get in a very speedy manner, usually with 90% headroom and utilizing only 1 DB server (also at only 10-15% utilization)...we're pretty happy with that.


> You have to keep in mind we're ridiculously over-provisioned, we're handling that load while maintaining 10-15% utilization on 3 year old web servers.

So you've premature-optimized for performance far past the point where you gained anything from it?

I'm not saying you shouldn't be happy with your performance, but if the best thing you can say about your platform is "it has adequate performance" then, well, it's not a very compelling platform.


Not at all, we did the performance optimizations to make page loads faster for our users...the side effect is less load on our servers.

It's an awesome platform for us and we love it...you should choose whatever platform works for you.


C# is on that chart -- just Ctrl+F and search for 'mono'.

If you select the 'Win' hardware tab near the top of the page, you'll be able to see figures for Windows as well: http://www.techempower.com/benchmarks/#section=data-r6&hw=wi...


It can't top a table it's not in...


What anti .net kneejerk? At time of this comment there are 28 comments, only 2 of which are anti .net / Microsoft (with no replies) and they are already well on their way to being downvoted into oblivion.


Many of us have .NET experience and really like C#, but have sensible reasons for not recommending it. A Microsoft-only stack is nowhere near as flexible as Linux, and if you were to reach Stack Overflow's levels of traffic you wouldn't be able to avoid running part of your infrastructure on Linux. I seriously doubt Stack Overflow is running Redis, ElasticSearch & HAProxy* on Windows. Unless you already have a team proficient in .NET, there are few good reasons to use it instead of a JVM language.

*https://github.com/opserver/Opserver/tree/master/Opserver.Co...


Indeed we run all of these things (and more) on linux. We use the most appropriate architecture we can come up with at the time. When a better overall option appears, we do that. New information and technology makes our decisions change, that's how brains are supposed to work, I think.

We use linux for: redis, elasticsearch, HAProxy, DNS (bind), nginx, mail (exim), apache, wordpress, mysql, nexpose, backups, puppet, asterisk, android builds and our internal mercurial.

As Opserver grows we will be monitoring both Windows and Linux with our solution: simple to set up via polling, or more advanced monitoring via an agent (puppet and DSC configurable/installable). We plan to have agents for both Windows and Linux open sourced, both using a standard communication format so that anyone can write additional agents, or add to them, or...whatever really. We haven't started building this yet; a complete monitoring solution is what we'll be working on over the next 6-12 months. It will be in the open as we go, with lots of internal dogfooding to prove things out.
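That "standard communication format" hasn't been defined yet (the comment above says work hasn't started), so purely as an illustration of the idea, a polled agent payload could look something like this. Every field name here is hypothetical, not from Opserver:

```python
import json

# Hypothetical agent payload for a polling-based monitoring check.
# None of these field names come from Opserver; the project had not
# defined its wire format at the time of this thread.
payload = {
    "host": "web-01",
    "agent": {"os": "linux", "version": "0.1"},
    "metrics": [
        {"name": "cpu.utilization", "value": 12.5, "unit": "percent"},
        {"name": "mem.used_bytes", "value": 6442450944, "unit": "bytes"},
    ],
}

# A JSON body like this is trivial for any language to produce or consume,
# which is the point of standardizing the format across agents.
wire = json.dumps(payload)
decoded = json.loads(wire)
print(decoded["host"], len(decoded["metrics"]))
```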


I imagine that 99% of the users on here won't work on any projects that receive the same levels of traffic that Stack Overflow gets, and as such the worries of scaling to this level aren't really an issue.

For the vast majority of websites that developers will build, the .NET stack is absolutely fine. My only gripe with being a .NET developer is that if your professional experience is limited to a Windows based stack you may find it hard to move over to a Unix-based stack.


Just because you use C# doesn't mean your whole stack needs to be Microsoft. No reason why you couldn't have Windows web servers running asp.net/C#, but have a postgres DB, nginx, memcache etc. on linux. Azure actually seems to be encouraging this.

The cost of windows licenses for web-servers is not that much. Avoiding SQL Server is the big win in reducing cost.


All true but there is still quite a bit of inefficiency involved if you need a team that is both proficient in deploying & running Linux infrastructure tools as well as Windows servers.

And the fact that this discussion is taking place in the comments section of a link to a custom-made monitoring dashboard that quite possibly wouldn't be needed* if SO were a Linux-only shop doesn't exactly devalue my previous statement.

*I have no idea whether this could be replaced with Nagios or Munin


I imagine the Stack Overflow guys are familiar with Linux architectures; they also run the open source forum Discourse[1], which is a Ruby on Rails app that comes with a recommendation to set up and develop in a *NIX environment[2].

Certainly the point is to use the right tool for the job, and I wouldn't dismiss your final comment: having a team proficient in .NET is a great reason to build your tools with/for that platform.

But no, I don't think it would be fair to reply to this article saying "Why didn't they build this for [my preferred language]?"

[1]: http://www.discourse.org/ [2]: http://blog.discourse.org/2013/04/discourse-as-your-first-ra...


Maybe you're being over-sensitive as I don't see anything hostile to .NET. The closest I saw was the comment that reading C# seemed odd compared to most of today's start-ups (which you could probably say about Java as well).

EDIT: Somehow I had showdead off, and now see the "punched in the gut comment" ... the fact that it's dead shows that the community doesn't really put up with non-productive comments.


A problem with the Windows tech stack is the culture of automation (or lack thereof). Sure, there is PowerShell, but oftentimes critical tools insist on giving you a GUI instead of providing something that can be incorporated into a script. And even when they do offer scriptability, it's much less documented than the GUI.


There is Desired State Configuration (DSC) in Windows 2012+ (native in 2012 R2+), we are using it here at Stack Exchange and just deployed an entire data center via DSC. I highly suggest checking it out...think puppet on windows without ruby and with powershell. It is v1 and providers for smaller stuff are still coming, but we're already open sourcing the modules we're creating as we go: https://github.com/PowerShellOrg/DSC


This was a revelation to me when I first started working with Linux. Because you work on the command line all the time, it's relatively straightforward to write shell scripts to do repeated tasks for you (writing good shell scripts is, of course, another matter). Within the Windows environment it's much more difficult to switch between interactive and scripted modes.


It is so strange to look at that C# code.

I know they use it; it just looks kind of odd compared to the rest of the young (I mean less than 5 years old) web companies. No Ubuntu, nginx, node, jvm, but instead C#. I don't know, it just stands out.


I understand that it's not worth the cost to cleanup the history, but it's always frustrating to see such a project come with a 100kloc "initial" commit.


Please consider that this project started internal only, so there may be sensitive information in the full history.


Interesting. Why? What would be valuable about the commit history for you?


The same things that are valuable about any commit history?


I would imagine that if you're a random person only just finding the project now and aren't already involved in it (by working for or with Stack Exchange), then you probably wouldn't get all that much value out of the commit history.

Feel free to dispute that, though. There may be some scenarios I'm not considering.


    if (x < 42 && y > 1776)
        useConfabulator = false;
Why? Who knows? People don't comment their code for many reasons, not least of which is that it's not technically required and you can easily put it off and forget. Every checkin generally requires at least some comment. You can still enter meaningless messages like "fix", but you're not liable to do so by accident.


I think that the commit history says a lot about the development process and the company culture. First, it's interesting to see how these companies work internally. Second, it's easier to dive into the code by reading the latest commits. You can see what's going on and what pieces are changed together. Finally, there is a substantial difference between:

- a repository that's handled as a first class "source code ledger" (pick one or several of: proper branching/merging, atomic commits, issues referenced in commit messages, etc).

- a repository full of "git commit -a -m 'ill fix this bug later, TGIF'" whose history makes no sense.

I don't want to open the debate on which one is preferable or not (most people agree that git bisect is handy, but it is open to debate whether it's worth the cost).

What I'm wondering is how it looked and why it was rewritten. The most probable reason is that they wanted to make sure that nothing sensitive was exposed (non-redistributable embedded dependencies, information on their infrastructure, ...). But it's also possible that this was started as a type-2 repository and they don't want to expose this image.

I tend to think that all code should be written (and developed) as if it were to be open sourced one day, but as always there's a cost trade-off here.

Anyway, thanks for open sourcing this piece of software!


Guy who wrote it here...the only reason the history isn't public is security; at a few points there were various passwords in the repository. Also, the internal repository is Hg (still is, though this may change). I still want to dogfood major changes before breaking others. We could convert this history, sure...but the security reason remains.

There were many commits that probably don't make a lot of sense to anyone but me, since they were so massive in scale...especially leading up to this release. For example, moving the configuration from web.config-ish xml style to JSON in preparation for a larger system I wanted to have done before open sourcing it, so as not to hose adopters later.

That being said, most commits are decent-sized features, or several every day or two. This was completely a side project for me, happening while waiting on something or in the evening. Going forward, we're shifting focus to monitoring and will be giving some real dedicated time to it. We will be building our own monitoring system as a whole: polling, push, agents, etc...and Opserver is a large part of that big picture. Those commits will be much more interesting, and you're going to see them all.

If there are any questions though, I'm happy to answer them...we're pretty wide open, just short of sharing logins and passwords.


Thanks!


Having a quick poke around, I notice that they are storing all exceptions in an SQL database. I've been looking at storing all the errors we get in our various applications in a central repository and was wondering what the general consensus is?

Currently I'm running a centralised logstash server and using a logstash shipper on each of my servers to push the exceptions to it from a standard logfile. I was toying with the idea of pushing all my errors at source to an SQL database, but figured if I was having database problems I'd be missing all the exceptions that I could be using to trigger the alerts that I'm having database problems!


Seriously take a look at Sentry[1]. It supports just about every major language out there, is open source and used on some very large web properties (disqus, which powers comments for cnn.com amongst others), and is just generally awesome software. If you don't want to set it up yourself, use their hosted version[2].

If you are a .net / C# guy check out their csharp raven client[3]. Raven is the client to sentry which automatically sends all exceptions.

[1] https://github.com/getsentry/sentry

[2] https://getsentry.com/welcome

[3] https://github.com/getsentry/raven-csharp


There is no general consensus, but there's no rule that says you can only have one central watchdog. Collect errors one place, watch over your db somewhere else, watch over your watchdogs somewhere else still. This is basic sysadmin CYA.


We indeed use SQL for errors (though Exceptional has a couple of stores...and you can add a new one). The reason a SQL outage isn't an issue is a) we have other alarms for that, and b) Exceptional will fall back to an in-memory exception queue and flush to the DB when it's available again.

In the event of a store loss (file share, SQL server...whatever your store is), it queues exceptions in memory with rollups to reduce memory usage, and will flush to the store when it's available again. It does a connectivity test every 2 seconds in the event of failure.

Exceptional is open source and is the basis for what's used in Opserver...the UI is even very, very similar; it's just that Opserver has more features for a multi-application view. You can see the source here: https://github.com/NickCraver/StackExchange.Exceptional
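The fallback behaviour described above (queue exceptions in memory while the store is down, retest connectivity periodically, flush when it's back) can be sketched like this. This is a language-agnostic illustration in Python, not Exceptional's actual code, and the rollup behaviour is simplified to a bounded queue:

```python
import time
from collections import deque

class FallbackErrorStore:
    """Sketch of an error store that queues in memory while the backing
    store (e.g. SQL) is down, and flushes once it's reachable again."""

    def __init__(self, backing_store, retry_interval=2.0, max_queued=1000):
        self.backing_store = backing_store     # object with .save(error) that may raise
        self.retry_interval = retry_interval   # seconds between connectivity retries
        self.queue = deque(maxlen=max_queued)  # bounded: oldest entries dropped if full
        self.last_attempt = 0.0

    def log(self, error):
        if self.queue:                         # already in fallback mode
            self.queue.append(error)
            self._maybe_flush()
            return
        try:
            self.backing_store.save(error)
        except Exception:
            self.queue.append(error)           # store unreachable: start queueing

    def _maybe_flush(self):
        now = time.monotonic()
        if now - self.last_attempt < self.retry_interval:
            return                             # don't hammer a down store
        self.last_attempt = now
        while self.queue:
            try:
                self.backing_store.save(self.queue[0])
            except Exception:
                return                         # still down, keep the queue
            self.queue.popleft()               # saved: drop from queue
```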


We built our monitoring package at Zetetic to be extremely flexible about what events go where, with multiple levels of non-blocking filtering and routing, so that everything remotely interesting could go to SQL (with or without a durable local queue in front of it), and just very critical stuff could also go to a local embedded database / lpr / MQ / ZeroMQ publisher, etc. My preference is for decoupled ZMQ subscribers to handle anything very gnarly.

Internal errors in the monitoring software itself first surface in NLog, which could--but probably oughtn't--be configured to feed even more errors into the monitoring system; obviously this could create a terrible feedback loop if left unchecked.


Take a look at http://squash.io before you build your own.


There are a few issues with the Git repo right now that I'm helping Nick Craver work through. I see this project maturing over time to be quite awesome.


Easy to deploy as it's written in C# :)


Just because you're already on a .NET stack, or is there another reason?


Actually, this was irony, as all the systems that I have currently deployed run a classical Linux stack, and I would have to think a long time about whether deploying a .NET / Mono app would be worth the integration effort.


Is it GUI based ?

If so, are there screenshots ?


Yep.

Here are the screenshots from our Velocity presentation: http://imgur.com/a/dawwf


thanks


Wow... Nice one.

Does anybody know if it works with Mono?


I haven't pulled the code yet, but it looks like it's targeting MVC4. The documentation[1] states that Mono currently partially supports it (everything but the async stuff), but that doesn't mean there isn't anything else in there that will cause it to not be compatible.

[1]http://www.mono-project.com/Compatibility


Could someone please list the equivalent systems this could replace? (Linux or Windows)


I see parts of cacti, nagios/icinga + graphing, munin, zabbix, sentry (exceptions/errors), shinken, etc. There are many similar projects, but nothing quite as integrated as far as I know.


Looks nice, sadly it's just overkill for what I need.

Has anyone built a very simple solution to start/stop an arbitrary set of Windows services across several boxes, in a specific order? It'd be nice to have a simple GUI for this sort of thing. I've started working on it, but I suck at desktop programming (well, at programming in general, probably)...


Unless you really need a GUI, it's insanely easy to do exactly this with PowerShell workflows (http://technet.microsoft.com/en-us/library/jj134242.aspx).
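For what it's worth, the ordering logic itself is tiny in any language. A sketch in Python (the hostnames and service names below are made up; the actual start/stop calls would go through `Stop-Service`/`Start-Service` over PowerShell remoting, or `sc.exe \\host`):

```python
# Sketch: given services listed in the order they must START, produce the
# stop plan (reverse order) and start plan (forward order) across boxes.
# Hostnames and service names below are made-up examples.
services = [
    ("db-01",  "MSSQLSERVER"),
    ("app-01", "MyAppService"),
    ("web-01", "W3SVC"),
]

def start_plan(svcs):
    return [("start", host, name) for host, name in svcs]

def stop_plan(svcs):
    # Stop in the reverse of start order so dependents go down first.
    return [("stop", host, name) for host, name in reversed(svcs)]

for action, host, name in stop_plan(services):
    # In a real script: invoke sc.exe or PowerShell remoting here, and wait
    # for the service to reach the expected state before continuing.
    print(f"{action} {name} on {host}")
```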


They lost me at IIS.


Looks like a tangled mess of a decision engine. Why would anyone want this? Surely, there are other solutions out there that are more mature? (nagios comes to mind but is a poor example)


The fact that you're labeling the only alternative you can think of "a poor example" seems significant. Edit: and I agree that nagios is a poor example, and welcome newcomers.


Any and all answers to this question would be appreciated, even those that just qualify as 'promising' and 'active' (as opposed to mature). All I know of is nagios really.



Stackoverflow runs Microsoft?!? I feel like I was just punched in the gut by a best friend. Throwing up...


Wow, they weren't kidding when they said .NET



