Hacker News
Stack Exchange's monitoring system is now open source (github.com/opserver)
147 points by waffle_ss on Oct 15, 2013 | 56 comments



The anti-.net kneejerk reactions on HN really disturb me. You spend all day on Stack Overflow, then blindly bash their tech stack. Developing, deploying, and hosting .net apps is just fine. Many brilliant people choose .net and are plenty happy with it. Maybe rather than jumping to conclusions, you could give it a shot.


Stack Overflow (which I avoid as much as I can fwiw) is not a strong argument for the .net stack given the people behind it have moved to a different stack for their next project (http://www.codinghorror.com/blog/2013/03/why-ruby.html). That's not an isolated incident.

I've several friends working in that stack. While many defend it as the correct decision at the time (and I don't disagree with that), all would rather be using something else now.


Most of the people behind Stack Overflow are still here...and we haven't changed our stack and are quite happy with the performance we get...and there's always more to squeeze out.

We can and sometimes do run Stack Overflow (currently 3.3 billion hits a month) from 2 web servers and 1 SQL server...I think that's pretty good for any stack.


Performance isn't the reason I see people wanting to move away from that stack; it's more about library availability, tool ecosystem (including things like monitoring, so this is an improvement - though I doubt it will stem the tide), and language productivity.

(Not that your performance sounds like a compelling advantage for that stack. That's what, an average of 700 requests/second/web server? So the peak is probably around 1500? Pretty good indeed, but not outstandingly so - where I worked 3 years ago our system handled peaks of 600 requests/second/web server on a JVM-based stack. And I don't see .net topping the charts on http://www.techempower.com/benchmarks/).
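For the curious, the back-of-the-envelope arithmetic behind that ~700/second figure (assuming a 30-day month and an even split across the 2 web servers mentioned upthread):

```python
# Back-of-the-envelope check of the requests/second figure quoted above,
# assuming 3.3 billion hits/month, a 30-day month, and 2 web servers.
hits_per_month = 3.3e9
seconds_per_month = 30 * 24 * 3600
servers = 2

total_rps = hits_per_month / seconds_per_month
per_server_rps = total_rps / servers

print(f"total: {total_rps:.0f} req/s, per server: {per_server_rps:.0f} req/s")
# total: 1273 req/s, per server: 637 req/s -- roughly the ~700 quoted above
```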


First, that chart's for Linux systems...so it's not totally surprising to me that .Net isn't on it.

You have to keep in mind we're ridiculously over-provisioned; we're handling that load while maintaining 10-15% utilization on 3 year old web servers. Also, we're rendering quite a bit of dynamic data on each page in < 40ms (the average for a question page on Stack Overflow yesterday was 36ms).

We render every request we get in a very speedy manner, usually with 90% headroom and utilizing only 1 DB server (also at only 10-15% utilization)...we're pretty happy with that.


> You have to keep in mind we're ridiculously over-provisioned, we're handling that load while maintaining 10-15% utilization on 3 year old web servers.

So you've premature-optimized for performance far past the point where you gained anything from it?

I'm not saying you shouldn't be happy with your performance, but if the best thing you can say about your platform is "it has adequate performance" then, well, it's not a very compelling platform.


Not at all, we did the performance optimizations to make page loads faster for our users...the side effect is less load on our servers.

It's an awesome platform for us and we love it...you should choose whatever platform works for you.


C# is on that chart -- just Ctrl+F and search for 'mono'.

If you select the 'Win' hardware tab near the top of the page, you'll be able to see figures for Windows as well: http://www.techempower.com/benchmarks/#section=data-r6&hw=wi...


It can't top a table it's not in...


What anti .net kneejerk? At time of this comment there are 28 comments, only 2 of which are anti .net / Microsoft (with no replies) and they are already well on their way to being downvoted into oblivion.


Many of us have .NET experience and really like C#, but have sensible reasons for not recommending it. A Microsoft-only stack is nowhere near as flexible as Linux, and if you were to reach Stack Overflow's levels of traffic you wouldn't be able to avoid running part of your infrastructure on Linux. I seriously doubt Stack Overflow is running Redis, ElasticSearch & HAProxy* on Windows. Unless you already have a team proficient in .NET, there are few good reasons to use it instead of a JVM language.

*https://github.com/opserver/Opserver/tree/master/Opserver.Co...


Indeed we run all of these things (and more) on linux. We use the most appropriate architecture we can come up with at the time. When a better overall option appears, we do that. New information and technology makes our decisions change, that's how brains are supposed to work, I think.

We use linux for: redis, elasticsearch, HAProxy, DNS (bind), nginx, mail (exim), apache, wordpress, mysql, nexpose, backups, puppet, asterisk, android builds and our internal mercurial.

As Opserver grows we will be monitoring both Windows and Linux with our solution: simple to set up via polling, or more advanced monitoring via an agent (puppet and DSC configurable/installable). We plan to have agents for both Windows and Linux open sourced, both using a standard communication format so that anyone can write additional agents, or add to them, or...whatever really. We haven't started building this yet; a complete monitoring solution is what we'll be working on over the next 6-12 months. It will be in the open as we go, with lots of internal dogfooding to prove things out.
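That "standard communication format" hasn't been defined yet (the comment above says work hasn't started), so purely as an illustration of the idea, a polled agent payload could look something like this. Every field name here is hypothetical, not from Opserver:

```python
import json

# Hypothetical agent payload for a polling-based monitoring check.
# None of these field names come from Opserver; the project had not
# defined its wire format at the time of this thread.
payload = {
    "host": "web-01",
    "agent": {"os": "linux", "version": "0.1"},
    "metrics": [
        {"name": "cpu.utilization", "value": 12.5, "unit": "percent"},
        {"name": "mem.used_bytes", "value": 6442450944, "unit": "bytes"},
    ],
}

# A JSON body like this is trivial for any language to produce or consume,
# which is the point of standardizing the format across agents.
wire = json.dumps(payload)
decoded = json.loads(wire)
print(decoded["host"], len(decoded["metrics"]))
```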


I imagine that 99% of the users on here won't work on any projects that receive the same levels of traffic that Stack Overflow gets, and as such the worries of scaling to this level aren't really an issue.

For the vast majority of websites that developers will build, the .NET stack is absolutely fine. My only gripe with being a .NET developer is that if your professional experience is limited to a Windows based stack you may find it hard to move over to a Unix-based stack.


Just because you use C# doesn't mean your whole stack needs to be Microsoft. No reason why you couldn't have Windows web servers running asp.net/C#, but have a postgres DB, nginx, memcache etc. on linux. Azure actually seems to be encouraging this.

The cost of windows licenses for web-servers is not that much. Avoiding SQL Server is the big win in reducing cost.


All true but there is still quite a bit of inefficiency involved if you need a team that is both proficient in deploying & running Linux infrastructure tools as well as Windows servers.

And the fact that this discussion is taking place in the comments section of a link to a custom-made monitoring dashboard that quite possibly wouldn't be needed* if SO were a Linux-only shop doesn't exactly devalue my previous statement.

*I have no idea whether this could be replaced with Nagios or Munin


I imagine the Stack Overflow guys are familiar with Linux architectures; they also run the open source forum Discourse[1], which is a Ruby on Rails app that comes with a recommendation to set up and develop in a *NIX environment[2].

Certainly the point is to use the right tool for the job, and I wouldn't dismiss your final comment: having a team proficient in .NET is a great reason to build your tools with/for that platform.

But no, I don't think it would be fair to reply to this article saying "Why didn't they build this for [my preferred language]?"

[1]: http://www.discourse.org/ [2]: http://blog.discourse.org/2013/04/discourse-as-your-first-ra...


Maybe you're being over-sensitive as I don't see anything hostile to .NET. The closest I saw was the comment that reading C# seemed odd compared to most of today's start-ups (which you could probably say about Java as well).

EDIT: Somehow I had showdead off, and now see the "punched in the gut comment" ... the fact that it's dead shows that the community doesn't really put up with non-productive comments.


A problem with the Windows tech stack is the culture of automation (or lack thereof). Sure, there is PowerShell, but oftentimes critical tools insist on giving you a GUI instead of providing something that can be incorporated into a script. And even when they do offer scriptability, it's much less documented than the GUI.


There is Desired State Configuration (DSC) in Windows 2012+ (native in 2012 R2+), we are using it here at Stack Exchange and just deployed an entire data center via DSC. I highly suggest checking it out...think puppet on windows without ruby and with powershell. It is v1 and providers for smaller stuff are still coming, but we're already open sourcing the modules we're creating as we go: https://github.com/PowerShellOrg/DSC


This was a revelation to me when I first started working with Linux. Because you work on the command line all the time, it's relatively straightforward to write shell scripts to do repeated tasks for you (writing good shell scripts is, of course, another matter). Within the Windows environment it's much more difficult to switch between interactive and scripted modes.


It is so strange to look at that C# code.

I know they use it; it just looks kind of odd compared to the rest of the young (I mean less than 5 years old) web companies. No Ubuntu, nginx, node, jvm, but instead C#. I don't know, it just stands out.


I understand that it's not worth the cost to cleanup the history, but it's always frustrating to see such a project come with a 100kloc "initial" commit.


Please consider that this project started internal only, so there may be sensitive information in the full history.


Interesting. Why? What would be valuable about the commit history for you?


The same things that are valuable about any commit history?


I would imagine that if you're a random person only just finding the project now and aren't already involved in it (by working for or with Stack Exchange), then you probably wouldn't get all that much value out of the commit history.

Feel free to dispute that, though. There may be some scenarios I'm not considering.


    if (x < 42 && y > 1776)
        useConfabulator = false;
Why? Who knows? People don't comment their code for many reasons, not least of which is that it's not technically required and you can easily put it off and forget. Every checkin generally requires at least some comment. You can still enter meaningless messages like "fix", but you're not liable to do so by accident.


I think that the commit history says a lot about the development process and the company culture. First, it's interesting to see how these companies work internally. Second, it's easier to dive into the code by reading the latest commits. You can see what's going on and what pieces are changed together. Finally, there is a substantial difference between:

- a repository that's handled as a first class "source code ledger" (pick one or several of: proper branching/merging, atomic commits, issues referenced in commit messages, etc).

- a repository full of "git commit -a -m 'ill fix this bug later, TGIF'" whose history makes no sense.

I don't want to open the debate on which one is preferable or not (most people agree that git bisect is handy, but it is open to debate whether it's worth the cost).

What I'm wondering is how it looked and why it was rewritten. The most probable reason is that they wanted to make sure that nothing sensitive was exposed (non-redistributable embedded dependencies, information on their infrastructure, ...). But it's also possible that this was started as a type-2 repository and they don't want to expose this image.

I tend to think that all code should be written (and developed) as if it were to be open sourced one day, but as always there's a cost trade-off here.

Anyway, thanks for open sourcing this piece of software!


Guy who wrote it here...the only reason the history isn't public is security; at a few points there were various passwords in the repository. Also, the internal repository is Hg (still is, though this may change). I still want to dogfood major changes before breaking others. We could convert this history, sure...but the security reason remains.

There were many commits that probably don't make a lot of sense to anyone but me, since they were so massive in scale...especially leading up to this release. For example, moving the configuration from web.config-ish xml style to JSON in preparation for a larger system I wanted to have done before open sourcing it, so as not to hose adopters later.

That being said, most commits are decent-sized features, or several every day or two. This was completely a side project for me, happening while waiting on something or in the evening. Going forward, we're shifting focus to monitoring and will be giving some real dedicated time to it. We will be building our own monitoring system as a whole: polling, push, agents, etc...and Opserver is a large part of that big picture. Those commits will be much more interesting, and you're going to see them all.

If there are any questions though, I'm happy to answer them...we're pretty wide open, just short of sharing logins and passwords.


Thanks!


Having a quick poke around, I notice that they are storing all exceptions in an SQL database. I've been looking at storing all the errors we get in our various applications in a central repository and was wondering what the general consensus is?

Currently I'm running a centralised logstash server and using a logstash shipper on each of my servers to push the exceptions to it from a standard logfile. I was toying with the idea of pushing all my errors at source to an SQL database, but figured if I was having database problems I'd be missing all the exceptions that I could be using to trigger the alerts that I'm having database problems!


Seriously take a look at Sentry[1]. It supports just about every major language out there, is open source and used on some very large web properties (disqus, which powers comments for cnn.com amongst others), and is just generally awesome software. If you don't want to set it up yourself, use their hosted version[2].

If you are a .net / C# guy check out their csharp raven client[3]. Raven is the client to sentry which automatically sends all exceptions.

[1] https://github.com/getsentry/sentry

[2] https://getsentry.com/welcome

[3] https://github.com/getsentry/raven-csharp


There is no general consensus, but there's no rule that says you can only have one central watchdog. Collect errors one place, watch over your db somewhere else, watch over your watchdogs somewhere else still. This is basic sysadmin CYA.


We indeed use SQL for errors (though Exceptional has a couple of stores...and you can add a new one). The reason a SQL outage isn't an issue is a) we have other alarms for that, and b) Exceptional will fall back to an in-memory exception queue and flush to the DB when it's available again.

In the event of a store loss (file share, SQL server...whatever your store is), it queues exceptions in memory with rollups to reduce memory usage, and will flush to the store when it's available again. It does a connectivity test every 2 seconds in the event of failure.

Exceptional is open source and is the basis for what's used in Opserver...the UI is even very, very similar; it's just that Opserver has more features for a multi-application view. You can see the source here: https://github.com/NickCraver/StackExchange.Exceptional
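The fallback behaviour described above (queue exceptions in memory while the store is down, retest connectivity periodically, flush when it's back) can be sketched like this. This is a language-agnostic illustration in Python, not Exceptional's actual code, and the rollup behaviour is simplified to a bounded queue:

```python
import time
from collections import deque

class FallbackErrorStore:
    """Sketch of an error store that queues in memory while the backing
    store (e.g. SQL) is down, and flushes once it's reachable again."""

    def __init__(self, backing_store, retry_interval=2.0, max_queued=1000):
        self.backing_store = backing_store     # object with .save(error) that may raise
        self.retry_interval = retry_interval   # seconds between connectivity retries
        self.queue = deque(maxlen=max_queued)  # bounded: oldest entries dropped if full
        self.last_attempt = 0.0

    def log(self, error):
        if self.queue:                         # already in fallback mode
            self.queue.append(error)
            self._maybe_flush()
            return
        try:
            self.backing_store.save(error)
        except Exception:
            self.queue.append(error)           # store unreachable: start queueing

    def _maybe_flush(self):
        now = time.monotonic()
        if now - self.last_attempt < self.retry_interval:
            return                             # don't hammer a down store
        self.last_attempt = now
        while self.queue:
            try:
                self.backing_store.save(self.queue[0])
            except Exception:
                return                         # still down, keep the queue
            self.queue.popleft()               # saved: drop from queue
```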


We built our monitoring package at Zetetic to be extremely flexible about what events go where, with multiple levels of non-blocking filtering and routing, so that everything remotely interesting could go to SQL (with or without a durable local queue in front of it), and just very critical stuff could also go to a local embedded database / lpr / MQ / ZeroMQ publisher, etc. My preference is for decoupled ZMQ subscribers to handle anything very gnarly.

Internal errors in the monitoring software itself first surface in NLog, which could--but probably oughtn't--be configured to feed even more errors into the monitoring system; obviously this could create a terrible feedback loop if left unchecked.


Take a look at http://squash.io before you build your own.


There are a few issues with the Git repo right now that I'm helping Nick Craver work through. I see this project maturing over time to be quite awesome.


Easy to deploy as it's written in C# :)


Just because you're already on a .NET stack, or is there another reason?


Actually, this was irony, as all the systems that I have currently deployed run a classical Linux stack, and I would have to think a long time about whether deploying a .NET / Mono app would be worth the integration effort.


Is it GUI based ?

If so, are there screenshots ?


Yep.

Here are the screenshots from our Velocity presentation: http://imgur.com/a/dawwf


thanks


Wow... Nice one.

Does anybody know if it works with Mono?


I haven't pulled the code yet, but it looks like it's targeting MVC4. The documentation[1] states that Mono currently partially supports it (everything but the async stuff), but that doesn't mean there isn't anything else in there that will cause it to not be compatible.

[1]http://www.mono-project.com/Compatibility


Could someone please list the equivalent systems this could replace? (Linux or Windows)


I see parts of cacti, nagios/icinga + graphing, munin, zabbix, sentry (exceptions/errors), shinken, etc. There are many similar projects, but nothing quite as integrated as far as I know.


Looks nice, sadly it's just overkill for what I need.

Has anyone built a very simple solution to start/stop an arbitrary set of Windows services across several boxes, in a specific order? It'd be nice to have a simple GUI for this sort of thing. I've started working on it, but I suck at desktop programming (well, at programming in general, probably)...


Unless you really need a GUI, it's insanely easy to do exactly this with PowerShell workflows (http://technet.microsoft.com/en-us/library/jj134242.aspx).
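For what it's worth, the ordering logic itself is tiny in any language. A sketch in Python (the hostnames and service names below are made up; the actual start/stop calls would go through `Stop-Service`/`Start-Service` over PowerShell remoting, or `sc.exe \\host`):

```python
# Sketch: given services listed in the order they must START, produce the
# stop plan (reverse order) and start plan (forward order) across boxes.
# Hostnames and service names below are made-up examples.
services = [
    ("db-01",  "MSSQLSERVER"),
    ("app-01", "MyAppService"),
    ("web-01", "W3SVC"),
]

def start_plan(svcs):
    return [("start", host, name) for host, name in svcs]

def stop_plan(svcs):
    # Stop in the reverse of start order so dependents go down first.
    return [("stop", host, name) for host, name in reversed(svcs)]

for action, host, name in stop_plan(services):
    # In a real script: invoke sc.exe or PowerShell remoting here, and wait
    # for the service to reach the expected state before continuing.
    print(f"{action} {name} on {host}")
```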


They lost me at IIS.


Looks like a tangled mess of a decision engine. Why would anyone want this? Surely, there are other solutions out there that are more mature? (nagios comes to mind but is a poor example)


The fact that you're labeling the only alternative you can think of "a poor example" seems significant. Edit: and I agree that nagios is a poor example, and welcome newcomers.


Any and all answers to this question would be appreciated, even those that just qualify as 'promising' and 'active' (as opposed to mature). All I know of is nagios really.



Stackoverflow runs Microsoft?!? I feel like I was just punched in the gut by a best friend. Throwing up...


Wow, they weren't kidding when they said .NET



