Command-line tools can be faster than a Hadoop cluster (2014) (adamdrake.com)
463 points by matthberg on Jan 30, 2020 | 253 comments



With all the Hipster tech being released recently, the headline statement holds true for a lot of things, unfortunately.

We recently discussed new logging tools at work. It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare metal servers with rsyslog and grep for $400 monthly. The log ingestion and search performance was roughly the same...

EDIT: To give everyone a sense of scale, those bare metal servers ($200 each) are 2x Intel Xeon 6-core + 256GB RAM + 15x 10TB 7200 rpm drives. We retain logs for 30 days and handle 4-5TB per day.


What cloud advocates always say is that the $50k monthly will save you money from not needing to hire a team to manage it for you, and that over the course of 10+ years you will be ahead.

Is that true in anyone's experience? Every once in a while somebody posts about their competing bare-metal system, and it looks like a lot of people have managed to cut their server costs by 99% (based on the numbers they post) by avoiding cloud services.

Honestly curious


There's a lot -- and an increasing amount -- of knowledge that's specific to each cloud platform, and increasingly specialized (and complex) software. In the last couple of years or so I've definitely seen companies having to hire people to manage their cloud setups.

I suspect it depends on a lot of things -- the complexity of the project, its architecture, the development, maintenance and management practices. For a few years, a colleague and I used to manage a 30+ server setup without needing more than 8-10 hours a month. But we managed to pull it off not because of where the servers were, but because we chose a good and stable tech stack, we had the knowledge and experience required to manage it efficiently, and no one decreed that we shall henceforth move everything to the cloud because it cuts costs.

Given the same situation -- right stack, right experience, useful management practices -- I'm 100% sure you can pull off the same thing in the cloud, at least as far as efficiently managing the whole setup is concerned.

But IMHO the idea that cloud services give you all of that for free is snake oil. As soon as you need more than a virtual machine running a web server or whatever, what you end up with is exactly what you build. If what you build is crap, it's gonna run like crap, and you're going to need a crapload of money to keep it running, and two craploads of money to fix it.


I have an experience related to this. We used to run a [scalable SQL engine, PaaS] for ~$1000/mo, but it wasn't cool enough, so people started migrating stuff to Spark, Databricks, S3 etc. Now the infra costs are ~$8000/mo + several people to manage it and build tooling around it (~$6000 per person).

People just want to run their selects, man.


>>I have an experience related to this. We used to run a [scalable SQL engine, PaaS] for ~$1000/mo, but it wasn't cool enough, so people started migrating stuff to Spark, Databricks, S3 etc. Now the infra costs are ~$8000/mo + several people to manage it and build tooling around it (~$6000 per person).<<

IMHO cloud services are only really logical options for new startups or ventures. Existing IT shops that are already heavily invested in infrastructure ops will NOT readily move to the cloud unless there's an obvious attempt at a power grab or at subverting the IT/Ops fiefdom.

>>People just want to run their selects, man.<<

I read that in the "Dude's" voice. :-)


This is so true. Also, I see tremendous value in not being reliant on any of the three major cloud providers. They all kinda seem untrustworthy in their own, unique way.

And also:

That JOIN really tied the tables together.


In the venn diagram of "new startups or ventures", "Existing IT shops that are already heavily invested in infrastructure", and "every business", the first two do not cover even close to the entire area of the third.


Hey man there's a query here!


5 years ago I built our own company CDN using bare metal clusters hosted in multiple POPs, at 1/10th the cost of using commercial CDNs. It brought us ahead by literally hundreds of thousands of dollars per year and it serves us extremely well (continues to this day). The problem is that I am the CEO of the company, which has now expanded to 45 people, and to date I've been unsuccessful at assigning a team to manage this piece of tech due to economics. At the point where I have to get engineers to work on the CDN, it becomes cheaper to move to a commercial CDN where it becomes someone else's problem. But at the moment, it ain't broke, the team loves the stability of it, and it continues to save us hundreds of thousands of dollars every year....


Something to turn into a product? Economies of scale could help staff it then?


It's something to think about! :)


What product? A CDN?


woah, are you the ceo of artstation? I like that site


Yessir, reporting in. Thanks for the thumbs up.


Great, could you tell your Android contractor:

- "Magazine" in the sidebar just shows "No internet connection" (which is wrong) when clicking on it.

- Why can't the app remember my Filter settings? I'm only interested in digital 2D when on the "Browse" screen, but I have to apply the filter every single time I open the app.

- On Browse selected, click on Search, type in anything (e.g. Blender), then "View All Projects." Scroll through the first "page" (so that results show up that weren't on the screen before). Now these results begin to show up duplicated later when continuing scrolling (I assume pagination and/or the RecyclerView is implemented incorrectly).

- The app doesn't seem to cache images. Click on any project. Let the image load. Hit back. Click the same project again. The image needs to load again.

Looking at it again, some UI elements feel kind of uncanny, which makes me think that the app isn't fully native. There are some other small things that make the app look a little bit unpolished, which I think a site like ArtStation doesn't deserve.


I’ve sent this over for the team to take a look, thanks.


Is the stack too complicated? I mean, from a layman's perspective it doesn't look like fancy tech, am I wrong?


It depends on who is working on it. :) For devs who are comfortable with bare/close-to-metal clusters and networking, it's really not complex (which is why, over the years, it's just not been an issue for me to hack on it on the side). But for other devs, just the context switching from working on application domains becomes an overhead you can't ignore. Then when you factor in all the other overheads of assigning a "team" to it -- scrum events, refinement, just talking about it -- the cost ends up being driven up.

It is a little bit more complex now - e.g. I added on-the-fly image resizing routines. But the core concepts of caching on multiple POPs, being able to purge, etc. - no, it is not fancy tech by any stretch of the imagination.


It can be true, but not with a difference between $50k and $400; the chances of coming out ahead of that are not very large. Neither is the chance of your company still existing in 10 years.

As I have said here before: many people abuse AWS (etc.) while touting this reason, but in many cases it is used with a 'we have infinite resources' mindset (no constraints), so literally nothing is optimised -- and the case at hand is often something that would most likely not require much work on bare metal either. So for that price difference, you secure against the chance that you maybe have to fork out $2.5k -- ah, let's splurge, say $25k -- in those 10+ years for server management/problems etc. Of all the bare metal we have had running for 10+ years, only one thing broke, and that was not bare metal but an old-fashioned VPS. The rest has never had any issues. That's luck, I know, but even if something had broken, the cost to fix it would've been vastly less than $25k over 10 years. Let alone $50k/mo.

Sure, there are plenty of exceptions, but I dare say most companies, by far, don't need setups like that. But then again, let them set them up; usually they make things so needlessly complex, expensive and slow that I get hired to fix it somewhere in those 10 years (yes, fixing! in the cloud! so much for 'not needing to hire a team to manage it for you').


We maintain 30 bare metal servers at a colo center, and between me (primarily a developer) and the CTO we spend maybe 1 day per month "managing" them. The last time we had a hardware failure was months ago. The last hardware emergency was years ago.

Servers run on electricity, not sysadmin powered hamster wheels.


Yes, the maintenance is cheap. The changes are more costly.

We run a dozen bare metal servers and I see the difference between what it takes to spin up a new VM vs. set up a new physical server. There's planning, OS installation (we use preseed images but we weren't able to automate everything), and sometimes the redundant network setup doesn't play well with what the switches expect (so you need to call the datacenter).

Still, it works out in favor of the bare metal servers. But I'm looking forward to a bit bigger scale to justify a MaaS tool to avoid this gruntwork.


I completely agree that some types of changes are much more expensive with a bare metal architecture than with cloud.

6 years ago, I worked for a company in the mobile space. This was around the time of the Candy Crush boom, and our traffic and processing/storage needs doubled roughly every six months. Our primary data center was rented space co-located near our urban office. For a while, our sysadmins could simply drive over and rack more servers. We reached a point where our cages were full, and the data center was not willing to rent us adjacent space. We were now looking at either a very large project to stand up additional capacity elsewhere to augment what we had (with pretty serious implications for the architecture of the whole system beyond the hardware) or moving the whole operation to a larger space.

This problem ended up hamstringing the business for many months, as many of our decisions were affected by concern about hitting the scale ceiling. We also devoted significant engineering/sysadmin resources to dealing with this problem instead of building new features to grow the business. If the company had chosen a cloud provider or even VPS, it would have been less critical to try to guess how much capacity we'd need a few years down the road to avoid the physical ceiling we dealt with.


Yes, the cloud premium is also a kind of insurance - you know you'll probably be able to double your capacity anytime you need it.


OpenStack Ironic and Bifrost are pretty useful OSS tools for managing bare metal servers.


Yeah, "CAPEX always increases, OPEX always decreases".


[I'm speaking only from the US perspective]

The only difference is whether you own the gear or not. If you do own it, then it is CAPEX, and the gear goes on the balance sheet and you can only depreciate it according to the schedules (in some cases immediately at 100%, but most companies blow through that number really quickly).

In all other cases it is OPEX.

The rule of thumb is that all OPEX can be used to offset revenue, which is a godsend to most companies that aren't printing gobs of money.

So if you make some money and you are past the 100% deduction thresholds, owning gear beefs up the balance sheet and at best slightly decreases taxes, while spending money on OPEX significantly decreases taxes.


I didn't parse that in terms of bookkeeping. I took it to mean that the ongoing operational expenses -- salary for the ops team, dev team, expertise, etc. -- may be difficult to estimate, but once you figure out how to do something, it can be automated.


Under the 2017 Tax Cuts and Jobs Act, you can take a 100% deduction on new and old equipment in the first year. It expires in 2022, but that's because of the shenanigans to avoid formally violating deficit rules, which use a 10-year window for assessing budget impact. I think there's a good chance that it'll be extended as you can game the budget analysis that way indefinitely.


Does that time include security updates for the OS and installed services? I would assume so, but 8 (16?) hours a month seems lower than I'd expect given the frequency of vulnerability discovery and security patches.


Yes.


Thanks!


Everything in one colo facility with no geographic redundancy?


If you look beyond the marketing, a single top tier DC has better uptime than an AWS AZ. You are way more likely to be bitten by AWS control layer issues than by a DC in a good location dropping.


For most use cases there is just no need.


Some colo companies, even the small ones, offer multiple datacenters. You then have to either use public IPs for the inter-service traffic, maintain a VPN or contract it from them.


Your suggested solutions aren't reliable at all. The AWS/GCP/Azure backbones work completely differently.


What about software updates and reboots?


There is the same issue with a cloud provider unless you run immutable deployments. Then, you need to invest in the tooling to produce the immutable images.


We didn't find that managed elasticsearch saved us any money over running our own ES cluster, but the best part of the cloud for my company is the scalability -- our peak monthly load is 10X more than our light weekend load. So to accommodate that scalability, instead of buying and maintaining 1000 servers to handle our light base load, we'd need to own 10,000 servers that would be less than 30% utilized on average. And for redundancy, we'd need to spread those servers across multiple data centers, as well as manage a 2 petabyte storage system (also replicated across data centers) to replace our S3 usage.

And we'd need to have staff to manage the physical servers, run our network infrastructure, etc.

We've been through the numbers many times and the cloud always wins, even when compared to running our own servers for the base load with some hybrid cloud that lets us expand into AWS for peak loads (which saves some dollars on servers, but doesn't make up for the added complexity).

And nearly for "free", we have a cold standby site that has a replica of our hot data (a small portion of our full dataset), so if there was a full region outage, we could be back up and running in 30 minutes with reduced functionality.


As someone who doesn't know a lot about this stuff, is there a way to have the base load internal and use a cloud provider for elastic load?


Yes, with a caveat which is latency. For example it's hard to split a typical webapp so the application servers are far from the database. The latency kills performance.

On the other hand, for stuff like CI jobs, batch processing or anything else which doesn't depend on tight synchronization with the other location, you can mix and match.


I've found most large companies that use cloud at scale have one or more teams managing the cloud, providing tooling around management and deployment, or even full-on PaaS solutions on top of the providers. So you might not have your traditional sysadmins, but you have a team of what recruiters like to call DevOps engineers instead.

The main advantage of cloud is the flexibility: treat resources as ephemeral, and be able to click a button and get more or fewer resources. You don’t need to wait for a server to ship and be installed in a DC; if you don’t know what specs you need, just twiddle the dials until you get something that performs as you want. It allows you to deploy / release quicker and easier.

It’s possible to architect so that the cloud is cheaper, but it almost never happens. Optimising is intensive; often it’s called over-engineering by PMs/product owners. Costs normally start out low and blow up as products ramp out feature after feature and teams start onboarding; if performance is an issue, it’s easier to increase an instance size than to profile and change the architecture. Only when the person paying the bill says this is too much do architecture and other optimisations happen, sometimes at a point when it’s hard to retrofit.


DevOps tools are force multipliers, though. Cloud has its own set of problems to deal with, but manual-ops BAU work requires so much more labor to accomplish so much less.


> not needing to hire a team to manage it

You absolutely do need an experienced team to operate any significant cloud deployment, or misunderstanding the cost model will kill you.

Good cloud people cost more than good conventional sysadmins.

Unless cloud is simply an excuse to kill off a really, really complacent in-house entrenched IT dept, it will not save anyone anything cash-wise.

These are the facts.


I work for a massive corporation that's been on the cloud journey for about 10 years now with no signs of slowing. I think what my company loves most about the cloud is that you can throw money at the bad-management and volatility problems.

Even if a department spins up 500 servers for some executive's sacred cow when the project goes bust it's trivial to tear down the whole thing without requiring semi trucks.

The other very important thing is inventory auditing. Managing inventory in 10 datacenters built and populated by 100 teams over 30 years is a nightmare. Cloud provides a mechanism to build the coveted "single pane of glass" which large companies need desperately.


The answer to bad management is the cloud. Well put.


The answer is: it depends

Clearly in the example above, you can afford to hire two sys admins to manage the bare metal servers (you can probably afford more, but 2 is the minimum that gives some peace of mind)

We had an application that was a glorified quiz engine, and our customers would mass-enrol and mass-take the quizzes, so the scaling offered by Azure made it a no-brainer for us, especially as, given the nature of our customers, quizzes would only take place during working hours, so we'd scale right down for half the day and scale up and out at peak times.

Total costs were about 1/2 of what we estimated for bare metal


2 sysadmins for 2 servers? What are they going to do all day? Managing those two servers should probably be 5-10% of their job -- 2-4h per week for each of them seems plenty -- while doing 90-95% other stuff. Like a sibling comment said, "Servers run on electricity, not sysadmin powered hamster wheels." I'm our de facto sysadmin here (university lab) and I spend maybe an hour a week taking care of our ~dozen bare-metal servers.


> What are they going to do all day?

Be on-call. You need the sysadmins to be available to fix issues when they happen, even if the fixing itself takes little time.

> ... university lab ...

I developed a few web sites for universities, and they were hosted by the university. You really have to hope nothing breaks on Friday afternoon, or you have to wait until Monday morning to get it fixed...


Facebook with 400m users had just one DBA, if I remember correctly. I wonder if people ever run the numbers.


I was on the DBA team there around that MAU. I had lots of company and support from SRO (aka jr DBAs) as well as other teams (provisioning, etc...).


> Facebook with 400m users had just one DBA, if I remember correctly.

You don’t remember correctly. There are multiple and to my knowledge there hasn’t been just one in at least the past ten years.


Millions of Facebook accounts are a pretty harmless size metric. DBA workload increases with the number of employees, which is proportional to support requests, accidents, changes, strange needs that need to be addressed, and so on.

Buying and provisioning disks before space runs out is a small part of a DBA's job, and a modest constant-size task compared to predicting how fast space is running out.


Have you got a citation for that claim? Because that figure seems a little hard to believe.

In terms of your general point: it really depends on your business. In my last job there were 4 DBAs out of 12 total IT staff and they were constantly busy. In my current job there are no DBAs in a much larger team and yet no requirement to need one either. The two businesses produce vastly different products.


Yes, if you can benefit from frequent scaling up and down an order of magnitude, this is where the cloud really shines.

For more continuous workloads, you can overprovision on bare metal more cheaply to deal with the spikes.


Would that not just be part of your existing sysadmin team's job?

Back when I worked at BT, the Unix developers all went on the basic sysadmin course from Sun as part of their induction.


That might have worked for the specific hardware BT needed those developers to manage, but it's not good advice in the general sense. Systems administration is as much a detailed speciality as being a software developer. There are so many edge cases to learn -- particularly when it comes to hardening public-facing infrastructure -- that you really should be hiring an experienced sysadmin if your company is handling any form of user data.

As an aside, this is one of the other reasons company directors like the cloud -- or serverless specifically: it absolves responsibility for hardening host infrastructure. Except it doesn't, because you then need to manage AWS IAM policies, CloudWatch logs and security groups instead of UNIX user groups, syslog and iptables (to name a few arbitrary examples). But that reality is often not given as part of the cloud migration sales pitch.


True, but in BT it was SD's (Security Division) job to set standards, and our team's sysadmin handled that.

SD was the employer of Bruce Schneier for a few years, BTW.


Kahoot?


I am very happy to pay for Heroku, exactly because I don't want to be the poor guy getting up in the middle of the night to restart Apache.

But for a system that is invisible from customers, like our logging system here, I don't care that much about timely maintenance.


> I am very happy to pay for Heroku, exactly because I don't want to be the poor guy getting up in the middle of the night to restart Apache.

As an employee I am also happy if my company pays more, so I don't have to be woken up in the middle of the night.

As a small business owner with limited resources and liquidity, waking up once a month for $50K in savings looks like a good deal.


I have a very nice quote from a discussion I remember:

"If you need to get up at 3AM to keep services running, you're doing something wrong."

You can make sure that most of the services in the *NIX world take care of themselves while you're away, without using any fancy SaaS or PaaS offering.

Heck, you can even do failovers with heartbeatd. Even over a serial cable if you feel fancy.

Bonus: This is the first thing I try to teach anyone in system administration and software engineering. "Do your work with good quality, so you don't have to wake up at 3AM".


I'd pick a call in the morning any time, given that the cause of the call occurs rarely and the alternative is to spend a lot of time automating things with the possibility of blowing things up in a much bigger way. If a situation like [0] had happened to me at night, I'd happily take time off my sleep and do a manual standby server promotion (or no promotion at all) rather than spend days recovering from diverged servers that the Raft kool-aid was supposed to save me from.

[0] https://github.blog/2018-10-30-oct21-post-incident-analysis/


I'm not against your point of view, to be honest. It's perfectly rational and pragmatic to act this way.

I'm also not advocating that "complete, complex automation" is the definitive answer to this problem. On the contrary, I advocate "incremental automation", which solves a single problem in a single step. If well documented, it works much better and more reliably in the long run and can be maintained with ease.

Quoting John Gall:

> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work.


I'm a good engineer, but I'm not qualified as a system administrator. I know that I will do something wrong, and I don't have the time to learn everything.

So I'd rather pay amazon.


So instead of learning systems you have to learn the AWS spaghetti of microservices.


You can try and build and test redundancy and contingency management, and you can lower the frequency of surprises through good choices.

But you're still going to get woken up at 3am sometimes. Things break, in unexpected ways. Maybe the hot spare didn't actually work when a raid set started rebuilding onto it. Maybe third party software did something unexpected. Or maybe something broke and your failover didn't actually work because of subtle configuration drift since the last test.


We have a standard routine called The restart test. We reboot the machine in a normal way to see how it behaves, but in the middle of the workload. Also, if the system is critical, sometimes we just yank the power cables to see what happens.

Normally all plausible scenarios are tested and systems are well tortured before putting into production.

It also helps that our majority of servers are cattle, not pets. So a missing cattle can wait until morning. Also all "pet" servers have active and tested failover, so they can wait until morning too.

We once had a problem with a power outage when our generators failed to kick in, so we lost the whole datacenter. Even in this case we can return to all-operational in 2hrs or less.

I forgot to add: We use installations from well tested templates, so installations have no wiggle-room configuration wise. If something is working, we can replicate that pretty reliably.


Sure, this is typical of well-run environments.

But you probably don't yank power on critical things mid-load after making a trivial change. Excessive testing breeds its own risks.

But it's really, really easy to gank up a trivial change now and then.

In the past 10 years, I've been woken up three times. One was from third-party software having a certificate that we didn't know about expiring; one was from a very important RAID set degrading and failing to auto-rebuild to the hotspare (it was RAID-10, so didn't want to leave it with a single copy of one stripe any longer than necessary); and one was from a bad "trivial change" that actually wasn't. I don't see how you can get to a rate much lower than this if you are running critical, 24x7 infrastructure.


Doesn't look like you have much experience with what you're talking about - there is no such thing as heartbeatd; it's called keepalived (or Pacemaker if you prefer unnecessarily complex solutions). An ops person wouldn't even misspell that.


Sorry, you're right. I confused it with its cousin, which is indeed called heartbeatd [0].

I'm new at Linux and system administration. I've only been using Linux for 20 years and managing systems for 13.

[0]: https://www.manpagez.com/man/8/heartbeatd/


Having just set up HA Postgres with Patroni - I disagree. Honestly I think we should've just stuck with a single Postgres server.

Sure, you can have an orchestration tool to "make sure everything is running, and respond to failures", but that's yet another tool that can break, be misconfigured, etc.


> If you need to get up at 3AM to keep services running

I just turn off the phone until I get up. That way I don't have to get up at 3 AM, I don't even know they were down until five hours later. :)


Same here.

As a small business owner I run my own apps on Digital Ocean, which to me offers very nice balance between features, reliability and price.


If your application needs to be restarted in the middle of the night, how does Heroku help?


Looking at what the market rate for an AWS architect (I wear this hat as well) or DevOps engineer is, it doesn't work out cost-wise on that front either.

The fact is that cloud is expensive, and outside a few use cases (such as extreme horizontal scalability, aka elasticity, or machine learning) the costs just don't work out well.


I think cloud's way harder than managing actual hardware (and/or your own VMs and such on actual hardware, or even traditional VMs on someone else's hardware) since once you're beyond the trivial it quickly becomes a superset of traditional admin knowledge, not a replacement—you end up having to know how to do things the old way to understand WTF the cloud is doing and troubleshoot issues, or to integrate or transition some older system, or whatever.

And then of course every "cloud" is full of about a billion hidden gotchas ("well the marketing page says that'll work, but on Tuesdays in February it won't, so use this instead, but only if you're writing your logic in JavaScript because the tools for the other allegedly-supported SDKs are broken in weird ways half the time, so instead use this other thing, unless you're a Libra, then...") none of which knowledge transfers between "clouds", and each has a pile of dumb names to memorize and a ton of other per se useless, but necessary, knowledge you've gotta pick up, just to rub salt in the wound.


I have a friend with a rack of co-located servers I manage. I drive from San Diego to LA where they are located maybe once every 6 months or more. I don't believe I have been to the rack for about a year. The rack with 20mbs of bandwidth costs around $2,500 a month.


Did you really mean twenty megabits?


If it’s unmetered 20mbps I’d happily take that over the ingress/egress costs at a cloud provider such as AWS.


Yeah AWS bandwidth egress is extortionate. Digital Ocean is something like 5-7x less expensive. It's ridiculous.


That's kind of shocking when compared to European prices. Over a full order of magnitude of difference, comparing to the list price of the first Google result.


Yes, not sure of the abbreviation apparently! But 20 megabits.

I like to watch Neil Patel's SEO videos on Facebook occasionally, and he mentioned that for one service he runs he spends over 100k a month on hosting. It blows my mind because he could buy a top-end server with dozens of processors and terabytes of RAM, co-locate it and at least host some things on it, or even turn it into his own cloud hosting server.

Even if a person bought 2 over-the-top servers for $30k each and paid around $2,500 a month for hosting, it would save huge money.


20 motherboards?


Not in mine. All that stuff you build out of cloud provider tools still needs care and feeding.


Honestly, it depends on the situation. Skilled engineers who can maintain systems reliably are hard to find, in general. If it's not a core system and you have the capital, it makes a lot more sense to pay a cloud hosting provider and focus on your product rather than attempt to build something simply to save on costs.


I too never got this. I learned using bare metal before AWS was around. I took a look at AWS in 2008 and decided that it was interesting, but nothing I had a use for at the time. Since then, I have never felt any need for it, except for a few times when I wanted to emulate a distributed network for testing purposes.

If I look into the AWS dashboard today, I feel totally overwhelmed, while running my own servers with LXC containers on it feels effortless. I guess for many it's the other way round.

Don't really know what I'm missing, but I'm happy about the old school skillset I have, allows me to have a fraction of the infrastructure costs I would have otherwise for what I'm running.


My company builds out enterprise and b2b apps. For most of our clients, infrastructure just isn't a big enough expense to really think about or optimize for. If you're selling your web app at a $1,000-$10,000 a license per year it doesn't really matter what you spend on servers. But it still matters how much time/money you spend managing your servers. So here it makes a lot of sense to go cloud because it's a lot faster to go from having no infrastructure to running a web server and database with automatic backups.


$50k monthly would pay for a team of the world's best system administrators.


No.


You know sysadmins that make > $100K / year?

If we assume $20K fringe benefits, that $50K / month is $600K annually, which gets you 5 $100K/year sysadmins.


Yes, I know plenty of Devopsish people that make more than that. And I'm not even talking about SRE-minded folk there.


You live in a bubble. Outside of the rarefied air of SFBA, wages are much lower. I do know super sharp people who are paid far less.


And how much is your non-bubbled "far less"? I'm not even close to the Bay Area, btw.


Under 150k in a major urban area.


You also have to consider the features you either have to do without, or create yourself. Dashboards, alerts, saved searches, web GUI, etc... You might legitimately not need any of that, but simply implementing search and storage isn’t replacing the whole product on its own.


I work at a big enterprise SaaS company that's moving a couple billion dollars worth of hardware into cloud services, and the price difference is wild. Original unoptimized costs were 8x, dropping to 6x in our first round of optimizations. Even the rosiest middle-management projections put eventual costs at 2x owning the hardware.

Add to that the fact that we threw out plans to do things like virtualization on our owned hardware, and that engineering headcount has consistently grown faster than revenue, and it's not clear there are any savings to be had at our scale.


From my experience (consulting for a decent-sized company that has lots of its infrastructure on Azure), it looks to me like the amount of management involved is not less than if they had it on premises or on co-located bare metal.


Seems like the cloud services are complex enough that you're paying someone to manage them anyway.


I work at a small company; we have 5 programmers and 3 IT/server/network people.

I'm one of the programmers, and the past few months I filled the role of devops/deployment engineering for our new website.

One great example is Sentry, I love Sentry, and it's been invaluable for us. It saves the web developers a crapload of time. Now Sentry has a self hosted option, and that's what we're currently using.

Now, I know little about Sentry's internals, and frankly I don't really care. But sometimes it breaks, or we want it to be updated, etc. Sentry offers hosting at $25/mo, which would mean we don't have to worry about it at all; stability, upgrades, and scale are all handled by them.

$25 a month is less than 1 hour of my time. All it has to do is save me 1 hour a month to easily pay for itself.

---

Another example is that I just spent a large amount of time trying to set up an HA Postgres cluster. This meant I had to dive into the internals of Postgres, how our orchestrator (Patroni) works, setting up Consul to manage the state, etc. This has taken a significant amount of time (several weeks) - it's easy to get a POC working, but actually ironing the bugs out is a different story.

Also, nobody else on the team fully understands its setup, so if it breaks, welp....

All this to say, for us a hosted DB option would likely have been cheaper (compared to my time and pay) and would have better uptime and support than us rolling our own solution.

----

We don't have any central log store, and I really wish we did. Similar situation here, I could spend a month or two configuring and tuning Elastic Search, or we could just pay for a hosted option.

---

Tl;dr - I'm a developer at a small company and have spent little time doing application development the past few months because of all the time that has been required to setup the infrastructure for our application.


So why don't you move to cloud/managed services? I can tell you from my own experience that having an HA solution you don't really understand is a really bad idea. Just go with master/replica for Pg and a well-documented troubleshooting and manual failover procedure. It'll likely be more reliable than some HA black magic. And it will likely take less than several weeks to get up and running.


I think the real answer is that you don't have to trust an employee's expertise.

I can give you a real life example.

Years ago I did work for this company that built flight simulators for the government. Millions of dollars rolling through their company. One day I get called up by this company and the woman is panicking because their website went down.

Well, come to find out, the entire site was running on a server sitting in an office of their building and the electricity went out due to a winter storm. To say that I was floored is an understatement. I started asking questions.

Well, when I did, the "sys admin" (and I'm putting that in quotes...) started talking to the owner of this company and it was ultimately decided that I couldn't be trusted.

Fast forward a few years and around august of last year this company contacts me again for more work. They apparently scored another huge government contract, built a new facility to be able to house actual government planes so they can strip them down and turn them into flight simulators and so forth.

While I'm at the facility I learn that they're running all their software off of a server sitting in the building. Now, this isn't completely unreasonable, especially for government work. But I again started asking questions.

- Is it environmentally controlled?

- Do you have a generator?

- Do you have multiple lines in case your ISP goes down (which 100% will happen at some point)?

- Do you have backups?

- ARE YOU EXERCISING THOSE BACKUPS AT LEAST ONCE/QUARTER?

When this got back to the "sys admin", he was livid. I also found out that they didn't have the source code for the latest changes I made, despite me pushing said source code to a private git server this "sys admin" had stood up. Said virtual server apparently got removed when they moved facilities, but despite this the guy accused me of pushing it to my private GitHub repo, based purely on the fact that I had stated in no uncertain terms that I had pushed it to the git repo.

But this software was a part of the government contract.

I just sent out an email the next day thanking them for the opportunity, but that I would have to pass on it.

My point is this.

They're an engineering company, that's where their expertise lies. But due to the nature of what they're doing, they were forced into the software side of things. They hired an incompetent.

Companies like these are probably better off throwing money at the problem and putting it in the cloud. The skill level required to successfully run something in the cloud and not completely lose everything is much lower. That's not to say there isn't skill involved, but you don't have to hire someone who may or may not decide to run your software that's involved in a multi-million dollar contract in a closet in your building.


The cloud won't fix "hired an incompetent".


What I specifically said (with emphasis):

> Companies like these are probably better off throwing money at the problem and putting it in the cloud. __THE SKILL LEVEL REQUIRED__ to successfully run something in the cloud __AND NOT COMPLETELY LOSE EVERYTHING__ is much lower. That's not to say there isn't skill involved, but you don't have to hire someone who may or may not decide to run your software that's involved in a multi-million dollar contract in a closet in your building.


> What cloud advocates always say is that the $50k monthly will save you money from not needing to hire a team to manage it for you, and that over the course of 10+ years you will be ahead. Is that true in anyone's experience? Every once in a while somebody posts about their competing bare-metal system and it looks like a lot of people have managed to cut their server costs by 99% (based on the numbers they post) by avoiding the cloud as a service

Too many weasel words.


If you are using systemd, you don't even need to grep "manually" that much:

  journalctl -u nginx -u ssh --since today


I had never heard of that before. Thank you for pointing this out :)


And yet you're rsyslogging and grepping what is supposed to be structured and typed. Have you heard about backpressure, controlled retention, compliant log redundancy and all of this advanced stuff that you'll never get with aggregation via rsyslogd/syslog-ng?


rsyslog is far more powerful than you're making it out to be. You have to actually tell it what to do, but it's more expressive than Filebeat and Logstash.

* rsyslog in the use-case he's describing is just a method of pushing some subset of the logs generated on a client system to a directory on the collector, which has trade-offs, but the benefit is having really simple failure modes.

* both rsyslog and journald store structured data: rsyslog with lumberjack, and journald just always. And rsyslog can parse and structure the logs in-flight so you save computing power on the collector.

* rsyslog behaves exactly like filebeat when it comes to reliable delivery and can persist unsent messages to memory then disk. rsyslog's rate limiting, backoff, and retry options are super powerful.
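For reference, a minimal sketch of what that reliable-delivery piece might look like on a client, written in rsyslog's newer config syntax; the collector hostname, file names and queue sizes below are made-up placeholders, not recommendations:

  # hypothetical /etc/rsyslog.d/50-forward.conf: forward over TCP, spill
  # unsent messages to a disk-assisted queue while the collector is down,
  # and keep retrying instead of dropping
  action(type="omfwd" target="logs.example.internal" port="514" protocol="tcp"
         queue.type="LinkedList" queue.filename="fwd_queue"
         queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
         action.resumeRetryCount="-1")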


Yes, you are right, sorry. I was too fast with my assumptions; rsyslog (I don't know much about syslog-ng) has feature parity with ELK in terms of log delivery and processing. But I think that grep and its permutations aren't the right tools of choice for log analysis anyway.


>backpressure

Both syslog-ng and rsyslog apply backpressure just fine over a network or socket...

>controlled retention

logrotate? It's only been around for over a decade..

>compliant log redundancy

So like, a backup strategy?

All of this stuff has been around for approximately forever, ES just had their marketing team name it something else and people like you are falling for it...
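And for the retention point specifically: the 30-day case from upthread is a handful of lines of logrotate config. A sketch, with a made-up path:

  /var/log/remote/*/*.log {
      daily
      rotate 30
      compress
      missingok
      notifempty
  }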


Might be worth taking a look at ripgrep[1].

[1] https://github.com/BurntSushi/ripgrep


That tool looks great :) but since we're already seeing <1s search times and the tool is only used by internal support employees, I'm mostly going with "never touch a running system" these days.

While for a database like ES you'd put all of the data into one big pile and then filter by keywords, e.g. host=ftp service=ftp query=IP, for logfiles you usually search on a much smaller set. They are rotated by day and logs are broken down by host and service by rsyslog, so instead of filtering the full 150TB - which is what ES has to do - my grep only needs to look at the 1-2 GB of data inside the file where host, service, and date match.
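To make that concrete: a typical lookup is then just a grep against one day's file for one host and service. The directory layout below is a hypothetical example of such an rsyslog split, and $CUSTOMER_IP is just a placeholder:

  # one host, one service, one day -- a couple of GB instead of the full 150TB
  grep -i "$CUSTOMER_IP" "/var/log/remote/ftp01/ftp/$(date +%F).log"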


Do you understand anything about ES indices?


RG is awesome for the basic use case. Using it across platforms just makes you tear your hair out, though.


I'm curious, why? Looks pretty easy to get the binaries.


Why? Can you share your frustrations?


What kind of searches are you doing where the search performance of grep is the same as with Elasticsearch?


One common query was to check the FTP server on our side for access from a customer server, so that we can help with troubleshooting as to why they couldn't connect. Turns out, in many cases less-technical customers mistyped the hostname or left out the .com at the end.

  tail -n 10000 /var/log/ftp.log | grep -i $CUSTOMER_IP | tail -n 5


If that's your typical use-case, I can see why that's as fast as (or faster than) Elasticsearch, but it's not clear why you'd have gone with Elasticsearch in the first place.

When I last used Elasticsearch, we indexed ~10TB of log data a day, kept 14+ days, and a typical query was looking for log records that matched a unique session ID over the past 10 days or so, not an easy task for grep. But we didn't pay $50K/month for that cluster, it was closer to $12K/month.

Before we used ES, someone had written a parallel grep that would grep multiple files at once, and would run multiple greps at once through chunks of the file, but still it could take 30 minutes or longer to churn through the logs on a 32 core machine - ES took that query time down to 100 milliseconds. The ES cluster easily paid for itself in employee time savings.


We used to use the ELK (ElasticSearch, Logstash, Kibana) stack hosted locally. So when we outgrew that, we just tried to find a bigger ELK provider.

We did evaluate the elastic.co Cloud - which I concur would have been cheaper than Amazon - but since their demonstration cluster failed to boot correctly and as a result suffered a data loss during their sales presentation, my boss didn't feel comfortable going with them.

That's how I was left with the decision to either scale up our old ELK stack on AWS or go with something proprietary.


If you always want to search by one thing, you can manually index by that thing. In your case, arrange your log files by the first 6 hex characters of the user ID (/var/log/xxx/xxx/date.log), and grep will typically only then have a few megabytes to scan.
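A rough sketch of that manual-indexing idea, assuming the logs were bucketed at write time by the first 6 hex characters of the ID (all names and paths here are hypothetical):

  # look up one ID: only the one small bucket directory gets scanned
  prefix="${SESSION_ID:0:6}"
  grep -rF "$SESSION_ID" "/var/log/sessions/${prefix:0:3}/${prefix:3:3}/"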

If you need real indexes, or just want something industry standard and maintainable rather than "some guy's grep script", then Elasticsearch is probably the way to go.


That was just one thing we needed to search for (but by far the most common). The guy that wrote the parallel grep did try creating some indexes of common fields to speed searches, but quickly realized that he was re-implementing the wheel (poorly)

Plus we made good use of Kibana dashboards for the service


> it's not clear why you'd have gone with Elasticsearch in the first place.

Having been in a company with a similar issue (Hadoop/Spark, not ES), the issue is: you have a bunch of programmers hired to handle the 10TB of data/day. Then you need some work done processing some data that is, e.g., 100GB once a week. Rather than evaluate the best way to process that data, the thought process basically goes "it doesn't fit in my laptop's memory, so we'll use the cluster."


How did they connect at all with the wrong hostname?


They didn't, that's why they called us to complain that our FTP server was offline...

Some of them also had one of those DNS-grabbing ISPs, so by mistyping the hostname they would accidentally connect to the wrong IP.

EDIT: I think I now got what you meant. When a customer says "my password is not working" then the first thing we do is check in the logs that the customer actually did connect to the correct server. That's like the number one issue, correct username and password but wrong hostname.


> those $200 each bare metal servers are 2x Intel Xeon 6-core + 256GB RAM + 15x 10TB 7200 rpm.

That's really cheap for such a big server. I think the same specs on Hetzner and OVH cost at least twice that. Where do you rent those servers?


I think this guy has no idea what he's talking about.


Sounds similar to a Hetzner SX292 - Xeon 6-core, 256GB RAM, 15x10TB. Only one CPU though, at $285.


I've been fighting elasticsearch to parse some netflow traffic. Found a great tool for logging the traffic and it even set up an elasticsearch instance and kibana for me, pretty much plug-and-play.

It logs lots of data, but I'm only interested in subnet level at this stage. I'm not interested in direction either.

A text file (or multiple) with, say, a CSV format (other fields are fine; cut -f will strip those out):

  datetime,collector,srcip,dstip,bytes

I could easily pipe it into sed to strip the 4th octet, but in reality I'd probably parse it with perl. It would take about 10 minutes and output a nice simple spreadsheet showing me which subnets are busiest, which times are busy, etc., and apply accounting on a subnet basis (or, with a bit more perl, trivially assign different IPs to different accounting buckets).
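For instance, here is a sketch of that subnet rollup in awk rather than perl, assuming the CSV layout above and a hypothetical flows.csv file (add NR > 1 if the file has a header row):

  # sum bytes per source /24 and list the busiest subnets first
  awk -F, '{ split($3, o, "."); net = o[1]"."o[2]"."o[3]".0/24"; sum[net] += $5 }
    END { for (n in sum) printf "%s,%d\n", n, sum[n] }' flows.csv |
    sort -t, -k2 -rn | head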

I can only assume ELK is a completely different way of thinking. Colleagues think that grep is "hard", but click-click-click in Kibana is easy.

However, I know I'm a grumpy old fart. I find lots of new ways to reinvent the wheel tiring and pointless, but it feels like shouting into the wind. Recently, things that have made me simply sigh include Ubuntu switching from /etc/network/interfaces to netplan, or from debian-installer to subiquity. Or moving from init.d to systemd.

I'm sure there are good reasons for changing all of these, but for my use cases it just increases the workload.


Can you tell me which tool you use for getting netflows? I am trying to build something that is really close to what you describe but fail to find the time to do so.


Using https://gitlab.com/thart/flowanalyzer at the moment

However, I'm thinking it would be far easier to write something from scratch.


ElasticSearch also does a lot more than your CLI does.

Splunk style dashboards, multi-user access, full text searching e.g. stemming, ability to support non-structured formats.


I fully agree, but our business need was to stream log files (e.g. tail -f) and to search for snippets in them and to produce statistic counters (e.g. |grep|wc)

While I can imagine situations where ElasticSearch is the best solution, I have seen it mostly used in situations that would be better served with simple command-line tools, similar to what the author demonstrated with awk vs. Hadoop.


> ElasticSearch also does a lot more than your CLI does.

This is not necessarily a good thing. In fact, it is probably a really bad thing if you do not need the additional stuff.


We reached the point where NOT being hosted on AWS is a competitive advantage :)


You should really be packaging up what you built and selling it as a competing service.


The irony is that there is no service. It's a duct-tape solution reminding me of my first years in the Linux ops industry.


As a possible middle-ground: I'm using Graylog with a lot of success, but my log volumes are rather modest (medium sized corporate network switches/routers/firewalls/servers log aggregation and processing)


We have 5TB per day, which may or may not break Graylog. Before evaluating ES, we used Logstash, but that couldn't keep up at around 2TB+ per day.


> Before evaluating ES, we used Logstash

Huh? I thought one would use Logstash to ingest & massage data going into ES. I view it as a sort of "quick&dirty stream processing engine". How would one go about using Logstash _instead of_ ES, as opposed to "in addition to ES" ?


Sorry, I worded that badly.

We used to use a locally-hosted ELK stack, so yes, including ES as the data backend for Kibana, but there Logstash or more precisely filebeat and logstash-forwarder became the bottleneck. That was a horrible experience because then filebeat would keep the old (unsent, but deleted) log files open, so production servers crashed running out of HDD space even though "du -h -s /" was a lot smaller than the HDD size. It took me way too long to figure out that deleted but still open files were the problem.
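For anyone who hits the same thing: deleted-but-still-open files are easy to spot once you know to look, and there is nothing specific to filebeat about it; a couple of standard commands:

  # open files whose on-disk link count is 0, i.e. deleted but still held open
  lsof +L1 /var/log
  # or the blunt system-wide version
  lsof -nP | grep '(deleted)'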

So our decision to move from using ES invisibly as part of Logstash towards a full dedicated ES cluster was forced by the Logstash pipeline getting too slow.

I then evaluated Elastic Cloud and using Amazon EC2 for deploying ES, both of which were ultimately rejected. The first due to fears of data loss, the second due to costs.


I'm confused again - wouldn't Elastic Cloud cost more than EC2? It makes no sense to be the other way around.

And secondly - why use logstash if you had filebeat - why not send data to ES directly? (I mean, specifically for your usecase, where you don't seem to need to do much processing prior to ingestion). Yeah, I saw that big parts of Logstash are in ruby so I expect underwhelming performance from it - but Filebeat should in theory be able to ingest large volumes of data into ES directly.


We used a central logstash server to receive the data from filebeat and logstash-forwarder running on multiple other servers. So in the initial deployment, logstash only set up the host and service tags needed for Kibana plus it was like a central gateway for external servers to write into the locally running ES storage.

As for why the price that elastic.co quoted us was lower than EC2, I have no idea. My guess would be that they were hoping to get a reference customer onboard for their (then new) offering.

Or maybe they are internally running bare-metal and thereby purchasing their CPU power cheaper than EC2 pricing. I mean you don't really need a redundant fault-tolerant virtual machine if the database on top is redundant and fault-tolerant, too.


I'm piping 5TB/day through 2 Graylog hosts with plain stupid DNS round-robin, all good.


We used Graylog at a previous company and it worked perfectly.

Scaled to our needs but we might have used a cluster.


The cost savings with bare metal servers vs managed cloud can be ludicrous... like take away 2-3 zeroes. It's particularly true for bandwidth and raw CPU power where major cloud providers massively overcharge.

The real answer is like most other things: it depends.

If you have a highly variable workload or you really get a lot of value from specialized managed services then cloud will save you and generally work better. If you have a less variable workload then a more DIY approach can yield truly massive cost savings and even better uptime and performance in some cases.


Where do you get such low prices for dedicated rigs? Well, anyway, it looks like you haven't tried to roll your own ES cluster on bare metal hosts - I've got more throughput than yours with 2 Graylog hosts backed by 5x bare metal ES instances. With all the benefits of a web interface for less tech-savvy people, structured logs, analytics and data exposure through a unified API. All of this for less than $2k/mo. If I lived in wonderland with server prices like yours, our bill would be less than $1k/mo.


I personally use OVH.com and Hetzner (for EU based installations). Don't expect support, but you get good machines, unlimited traffic (OVH). If you can set up your infrastructure so that it's JBOS (Just a Bunch of Servers) where you can hot swap them if there's any hardware failure, etc. it's great price/performance.


Not quite that price, but with 308EUR/month not too far off:

https://www.hetzner.com/dedicated-rootserver/sx292

I guess if you buy and host the machines yourself, you should be able to beat hetzner's price.


How does the functionality compare, though? With ELK, you have a variety of options for monitoring (I have personal experience with Kibana, Grafana, and the API) and visualizing. You explicitly say “rsyslog and grep”; do you build dashboard systems on top of grep? What does the performance look like? How many dashboards does that scale to?

What’s your strategy for availability?


Do you mean “bare metal” as just basic AWS images using only rsyslog and grep? You call out physical disks, so it makes me think physical machines.

How are two physical machines $400/month? Is that assuming you already have the hardware laying around?

How much would it cost for basic aws ec2 instances running rsyslog and grep?


Why not run ES on the bare metal servers?


Tried and it was horribly slow. We add about 5TB of logs per day, and the ES index wasn't able to keep up, despite having 256 GB RAM and 2x Intel Xeon 6-core


That seems like a really good price for bare metal of that magnitude, where are you getting those from?


https://www.hetzner.com/dedicated-rootserver/matrix-sx

On their German website, you can bid for used-but-still-good servers and there you'll find great offers. There, you can also find hardware that they customized for others and then didn't need anymore. For example, I rented some GPU rendering servers for $110 monthly each from there.

https://www.hetzner.de/sb

Plus I can't praise their service and ops team enough. After we rented the new storage servers, I sent them an email to ask, and they then put the new storage servers on the same router inside the same datacenter as our other servers with them. Now we effectively have 2x redundant 1GBit LAN between our servers with them, plus a 10GBit internet uplink shared between our servers.

For that level of performance, they are ridiculously cheap.

BUT - and that is a big gotcha - you will be responsible for diagnosing hardware issues and asking for a replacement yourself. That's why we have RAID and two storage systems with them, because we tend to see one HDD fail every 3-6 months.


Bespoke solutions are always going to outperform generic ones. You need to choose the right tool for each job. ElasticSearch definitely has its advantages if you have to write complex queries; however, it might be less suitable for grep-like queries.


> We recently discussed new logging tools at work. It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare metal servers with rsyslog and grep for $400 monthly. The log ingestion and search performance was roughly the same...

If grep has the same search performance as elasticsearch, you should not be using elasticsearch and any comparison is bullshit.

> It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare metal servers with rsyslog and grep for $400 monthly.

$50K monthly is $69.44 an hour, or 8 instances of the most expensive thing from https://aws.amazon.com/elasticsearch-service/pricing/ (with 488GB RAM)

Please cite references, your numbers seem made up. Where can you get 256GB of RAM for $200/month?


https://www.hetzner.com/dedicated-rootserver/matrix-sx

They get slightly cheaper for large customers like us, plus you can get hardware upgrades like the 2x CPUs by paying for the difference upfront. So yes, it's $400 monthly + $2000 once, but those $2k don't really make much of a difference over the years that we've been running the system ^^


Are European hosting providers cheaper because they’re simpler, or because they’re European? I don’t see anything analogous for cheap dedicated servers in the US market. Wages, consumer prices, real estate, etc. are all lower so it makes sense that French and German companies charge less than Amazon.

Collocation can be a great deal at scale, but I don’t see a way to cheaply rent a handful of servers for substantially less than Amazon would charge. VPS providers maybe, but even the quintessential cheap American VPS (DigitalOcean) has a bunch of cloud features these days.

The last big dedicated provider I know of was SoftLayer, which is now $2,100/mo for a 242GB, 56-core dedicated box under IBM Cloud. Amazon’s m5 is around the same capacity at $3,700/mo.


I believe it's cheaper connectivity, due to the network monopoly in the US. For European hosting providers, traffic tends to be cheap or free. For Amazon, you'll pay through the nose for egress traffic.

And when we tried to rent a Verizon 1gbit dedicated line to our Boston office, they quoted us $15k monthly.

In Germany, we paid first €199 then €499 for a dedicated 1gbit fiber line from 1&1.


Even before network charges, though.

An Amazon m5.12xlarge with a 1-year reservation is $1,075/mo, about the same as OVHCloud's top-end HG3 in the US at $975. Both are 28 cores / 56 vCPUs and 192GB RAM.


I have some servers at Wholesale Internet - they're located in Kansas City. Good counterpart to Hetzner or SoYouStart; with these low end dedicated hosts you end up being multi-provider to have geographically redundant datacenters anyway, but if you can build within those restrictions (often you can't have a private network so you're doing host to host VPN, for instance) it's a cheap way to get beefy servers and unmetered bandwidth.

Wholesale lists with a transfer limit on their price page but on the server config page you can upgrade to an unmetered Gbit port. And the low end boxes have an unmetered 100mbit port already. My experience has been that they really don't mind you heavily using that bandwidth either.. did have a point a few years ago where they were bandwidth limited until more fiber could be turned up in their datacenters, but you're going to occasionally have that sort of thing with any single datacenter provider.

All in all, I'm very satisfied with them for their specific niche of CONUS cheap dedicated servers.

SoYouStart wins for use cases where being located in France or Canada is acceptable, largely because they have basic DDoS protection on their bandwidth and I host some game related servers - anything gaming related and you will get DDoS attacks semi-regularly. And game servers can't run behind CloudFlare which is my other go to for basic DDoS protection.


Really? Try Googling '256gb ram dedicated'


You're right, I replied too quickly.

> We recently discussed new logging tools at work. It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare metal servers with rsyslog and grep for $400 monthly. The log ingestion and search performance was roughly the same...

If grep has the same search performance as elasticsearch, you should not be using elasticsearch and any comparison is bullshit.


Right, though some people really do reach for heavy-weight tools on datasets that are way too small to justify their use.

But yeah, even ripgrep should get out-performed by any db/search daemon with a decent index.


Your setup is redundant and expensive. logrotate directly to s3 and use Athena.


Out of curiosity, is the 200$/month price tag for colocation?


Where are you getting such a bare metal machine for 200$ a month?


ELK could handle that workload pretty well and isn't too hard to set up. Plus you get a nice web interface. But I still find myself logging in and grepping for things.


Horizontal scaling is also where ES shines. If you don't need that among other things, then there are alternatives yea


But those "hipster" tools can scale. Perhaps you simply haven't hit the right scale yet?


You don't need to scale. You're not Google, Amazon or FB to have a billion daily active users.

And if you manage to make it to that scale, you can certainly hire engineers to refactor those systems. Fix the problem only when you encounter it.

Parsing 1TB of logs daily with a 14-day retention? A desktop with a NAS attached can do that.

The corps that developed these fancy tools, then open sourced them and gave them hipster-ish names? Of course they want engineers out there to use them - it's called vendor lock-in.


Yes, you are right. But there are a lot more use cases: e.g. imagine you run a website for a (small or large) government, or a website for a popular television show. Or perhaps you are a developer at a telecom company in some country. Scale is everywhere.

> you can certainly hire engineers to refactor those systems.

Wouldn't it be better if those engineers used standard tools like Hadoop?


I worked for the largest online news site in Switzerland. We were nowhere near the scale where such tools would provide significant benefit over the "boring", "outdated" technology we were using. We also had a fraction of the operational cost of our much, much smaller competitor within the same publisher that went full hipster on their tech.

Thing is, for a lot of use cases it is a whole lot harder to get to large scale from a business and product perspective than it is for good engineers to adapt to it.


how much dev time would both approaches take?


You are underestimating the value of indexed logs and a Kibana view into them. For example, if you wanted a histogram of error logs for the last 12 days, and then zoom in and see the actual logs.


You are underestimating how much the issues caused by Kibana and a slow ES can cost.

I used to have these nice Kibana diagrams set up with auth and a https proxy so that our team could easily check if the server error rates looked normal. I quickly found out two things:

1. Most of our employees never ever looked at it, dismissing it as "too technical" just from seeing a screenshot.

2. The ELK stack uses backpressure, meaning that when ES can't keep up, the servers running filebeat run out of HDD space.

BTW, with a bit of grep in front, the actual per-service working data sets are small enough that I can load them into R on a 256GB RAM server and produce nice diagrams. You can also script R from the command line and send the diagrams around as a cronjob.
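
A minimal sketch of that kind of pipeline (the log path, grep pattern and timestamp prefix length are made up; adapt them to your own logs):

    grep -h 'ERROR' /var/log/myservice/*.log | cut -d' ' -f1 > /tmp/error_timestamps.txt
    Rscript -e 'x <- readLines("/tmp/error_timestamps.txt"); png("/tmp/errors.png"); barplot(table(substr(x, 1, 13))); dev.off()'

Wrap those two lines plus a mail command in a script, call it from cron, and you have the "diagrams as a cronjob" part.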


Have you used the file input? With such throughput you really don't want it at all: it hangs both its host (because of iowait) and the remote endpoint too if the Beats protocol is used. All of my log traffic is sent as GELF UDP or syslog UDP.


Sure it’s nice and maybe it’s worth 50k/month if you’re running Uber... but grep and sed together can get you quite far.

Admittedly for the histogram you’d have to cut and paste into a spreadsheet but yeah, hardly something you need in real time (probably more for a postmortem presentation to management.)


You can do this with graylog


I did some testing on the same (kind of) dataset and task:

First test: A single 2.9GB file

    time rg Result all.pgn | sort --radixsort | uniq -c
         13 [Result ""]
    1106547 [Result "0-1"]
    1377248 [Result "1-0"]
    1077663 [Result "1/2-1/2"]
    rg Result all.pgn   1.12s user 0.55s system 99% cpu 1.680 total
    sort --radixsort    3.87s user 0.37s system 71% cpu 5.911 total
    uniq -c             2.69s user 0.02s system 45% cpu 5.909 total

Using Apache Flink and a naive implementation, it took 13.969 seconds.

Second test: same dataset, split between 4 files

    time rg Result chessdata/ | awk -F ':' '{print $2}' - | sort --radixsort | uniq -c
         13 [Result ""]
    1106547 [Result "0-1"]
    1377248 [Result "1-0"]
    1077663 [Result "1/2-1/2"]
    rg Result chessdata/         1.70s user 0.97s system 42% cpu  6.292 total
    awk -F ':' '{print $2}' -    5.47s user 0.07s system 88% cpu  6.289 total
    sort --radixsort             4.13s user 0.42s system 43% cpu 10.559 total
    uniq -c                      2.73s user 0.03s system 26% cpu 10.559 total

Flink: 12.724s

Conclusion: For this kind of workload, both approaches have comparable runtimes, even though taco bell programming has the upper hand (as it should for simply filtering a text file). It took me about equally long to implement both. I think both approaches have their use cases.

I ran this locally on my Laptop with 4 logical cores.


Hadoop is very slow because it persists the data to disk before every stage. You really wouldn't want to use Hadoop if you don't have a good reason to. More modern tools like Spark and Flink fare better there.



It's just the annual expedition for HN where everyone gets their turn to be smarter than everyone else.

Even more irrelevant now because Hadoop is largely a dead-end technology.


A classic from 2015 along the same lines: Scalability, but at what COST?

http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...


The author is experimenting with 1.75GB of data. At that scale, sure, a local machine will be faster. Hadoop's real use case, though, is when your data doesn't fit in memory, and even this is kind of debatable. It makes sense to measure the performance with some prototypes and then make a final design rather than just use whatever AWS offers. Besides, packaged services in AWS are also a bit more costly than basic services like EC2 instances and network goodies.


These days you can get servers with Terabytes of RAM, so a lot of people (most?) could fit their data in memory.

I just took a gander at HPE's website, and you can get ProLiant servers with up to 12TB of RAM (you might be able to get them with more; I did not check in detail).


HPE ?


Hewlett-Packard Enterprise


How much does it matter that the data doesn't fit in memory when consumer grade (Samsung 970 Pro) SSDs can deliver 3GB/s sequential read? On inexpensive hardware you can process a ~TB every 5 minutes per SSD.
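
(For reference: 3 GB/s × 300 s ≈ 900 GB, which is where the ~TB per 5 minutes per drive comes from.)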


Well, I know someone in the business where the average database is multiple TB, and they still have to somehow run queries against it; 1TB/minute is just far too slow.


Hadoop is useful at the scale where your data is already distributed, maybe even in different datacenters. At that point, it's faster to push the computation out to where the data is already stored. It doesn't make sense to me when the data starts out all in one place.


This reminds me of my experience from a company-internal hackathon. My colleague started writing a Spark program that would process the data we needed (a few hundred GB uncompressed). Before he finished writing it, I was able to process all the data on a single machine with a Unix pipeline. The computationally intensive steps were basically just grep, sort and uniq. When he finished the program, it couldn't run because of some operational issues on the cluster at the moment, so we never even found out the speed to compare.

For me, the moral is that the cheap hardware saves money/time twice:

1. It's faster if a program can run on a single machine.

2. It's easier to write a program that runs on a single machine.

With this in mind, cloud works great for analytical data processing. Just start a big enough machine, download the data, do the computation, upload the result and turn the machine off. If you develop the program on a sample of the data so you can do it locally, it will even be cheap, because you only use the powerful server for a short time.
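
A rough sketch of that workflow, assuming an AWS-style setup (bucket names and the actual computation are placeholders):

    aws s3 cp s3://my-bucket/raw/ ./data/ --recursive           # download
    grep -h 'Result' data/*.pgn | sort | uniq -c > counts.txt   # compute
    aws s3 cp counts.txt s3://my-bucket/out/counts.txt          # upload the result
    sudo shutdown -h now                                        # stop paying for the machine

Develop the pipeline locally on a sample, then run it once on the big machine.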


Given that you can soon get a beefy 64-core Threadripper-based workstation for under $10K, just running the analysis locally looks like a very decent option.


The two approaches aren't necessarily mutually exclusive. Spark can easily shell out using pipe(). Plus you can use that to compose and schedule arbitrarily large data sets through your bash pipeline across a multi-node cluster.

Beyond that, while the Unix tools are amazing for per-line FIFO-based processing, they really don't do a great job at anything requiring any sort of relational algebra.


join and comm would like a word with you.

You can't match SQL expressiveness, but you can definitely handle set-based stuff.
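
For example (file names are made up; both tools want their inputs sorted on the key column first):

    sort -o ids_a.txt ids_a.txt
    sort -o ids_b.txt ids_b.txt
    comm -12 ids_a.txt ids_b.txt                  # intersection
    comm -23 ids_a.txt ids_b.txt                  # only in A
    join -t, -1 1 -2 1 orders.csv customers.csv   # inner join on the first column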


Wait... are you telling me people over-engineer solutions to ultimately simple problems? You're kidding.


Very simple processing, not memory bound, tiny data set - of course it's going to be faster locally when the command itself takes less time than the networking, distribution, coordination and collation overhead of using any distributed tool...


You know that, I know that, and we can be happy that we have the experience to know what the right tool for this job would be by sizing up and describing the characteristics of the problem like you just did. But those with less experience may not be able to do that unless shown stuff like this in practice.

Some may think this problem requires MapReduce. The quote from the original implementation blog post certainly seems to indicate so.


MapReduce as a paradigm and technology was popular about a decade ago and then died shortly after in favour of Hive, Spark etc.

Pretty confident that not a single developer anywhere in this world would be first thinking of MapReduce. Just like they wouldn't jump straight to Cobol.


> in favour of Hive, Spark

Well, Hive and Spark create MapReduce jobs to satisfy your queries: it's still in the background, but you don't have to think about it (as much).


Some places still seem to go for large Kafka clusters just to calc some stats and forward some messages. I am sure some of their solutions are MapReduce below.


Very curious to understand more about these mythical developers who are recommending a technology/paradigm that stopped being used a decade ago.

You've seen people writing pages and pages of Java code to do ETL ?


I worked for a large multinational last year that had multiple teams rolling out terrible Hadoop solutions.

The rest of the world does not particularly care about SV fashion trends.


I'm very curious to know where you got this strange notion that MapReduce is "a technology/paradigm that stopped being used a decade ago". I barely knew what MapReduce was in 2010, and didn't touch my first Hadoop cluster until late 2012. Every place I have worked since has had a Hadoop cluster or two.


I'm on a team of a Fortune500 who uses a combination of Java and C# to do ETL. AMA.

Edit: We're also a fairly greenfield team, with a product in beta that's less than 4 years old. So like, the leaders of the team knowingly crafted us in this particular direction.


Back in high school I did a map reduce job using split, sort, and a handwritten transformation and accumulation tool. It worked!
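
Roughly the shape of it (map.sh and reduce.sh here stand in for the handwritten transformation and accumulation tools):

    split -l 1000000 input.txt part_                            # partition the input
    for f in part_*; do ./map.sh "$f" > "$f.out" & done; wait   # map each partition in parallel
    sort part_*.out | ./reduce.sh                               # shuffle/sort, then accumulate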


Yes, the article even states that in the first paragraph. Processing a few GB (the example in the post is not even in the two digit range) is easily done with basically any tool. Heck, someone could probably pull off something like that in Excel in a couple of minutes.

I think the point of the article is exactly that: unless you have a couple of TB to process, don't even bother. I remember a similar post in the same spirit that had a couple of GB in a Postgres database and ended with "Just do a simple query!" instead of a Hadoop cluster for simple map-reduce.


You can easily run a workload of 20TB/day on a single Threadripper.

On my last contract we wrote a C tool that did DSP on 10TB/day of data with a latency of 50 microseconds. We were using less than 10% of the CPU for that.


Yeah, the ThreadRippers are amazing. I was truly astonished when I noticed that one hour of V-Ray on 3970X outperformed $1000 in ChaosCloud rendering.

At a price of $2500 once for the TR versus $24,000 per day for cloud, it's a no-brainer to buy them. Sadly, availability in my area is pretty spotty and even Amazon currently predicts 2+ weeks of delivery time.


Not sure where you are getting your costs from.

Threadripper 3970x = 32 cores @ $1999

AWS c5.9xlarge = 36 vCPUs @ $8.34/day (spot)

You could buy a year of AWS for the same price as a Threadripper based PC.


I was comparing with ChaosCloud: https://www.chaosgroup.com/cloud

For EC2, one would need to rent licenses for the V-Ray renderer plus V-Ray tends to be heavily CPU-cache- and memory-bound, so exactly the parts that get slow if you enable the Intel security fixes. Due to c5.9xlarge being Xeons, they are like 80% slower than a comparable AMD CPU.

BTW, calculator.s3.amazonaws.com quotes me $1119 per month for a single c5.9xlarge, so I'm not sure how you would reach $8/day when I see $37/day.


Then switch to the AMD based c5a.9xlarge which is even cheaper.

And you can reach $8/day using Spot or Reserved Instances. Not sure where you get $1119.


what kind of DSP were you doing? can I come work for you haha


Hence the use of the word "can".

The fact that "deal with it locally" was obvious to the author, but not to the people who inspired the author to deal with it that way (he mentions an EMR job), means there are people it may be worth pointing this out to.

Certainly, being reminded that you probably should solve your problem without distribution, unless you actually demonstrate that you CAN'T solve your problem without distribution, is nothing but a positive.


Once you get to the stage where your laptop alone is just not enough anymore (or your laptop has some cores you want to add to the processing as well), GNU parallel might be of use.

https://www.gnu.org/software/parallel/
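
For example, to spread the article's chess counting over all cores (assuming one file per input chunk):

    find chessdata/ -name '*.pgn' | parallel 'grep -h Result {}' | sort | uniq -c

parallel runs one grep per file and keeps every core busy, while the sort/uniq at the end stays a single aggregation step.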


Is there any benefit of

  cat files* | grep pattern
over this

  grep -h pattern files*
aside from result color highlighting?


Even though it's a faux pas, I find it logically makes more sense to me.

Consider a pipeline like

    cat *.log | grep "apache" | grep -v "1.1.1.1" | wc -l 
It makes sense in that order: read the files, grep this, exclude that, count the total. Meanwhile

    grep "apache" *.log | grep -v "1.1.1.1" | wc -l
may be a few characters shorter but the order isn't really logical anymore.


With the first method, you can use xargs to run multiple copies of grep in parallel, like he did later in the article.
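
Roughly like this (pattern and glob are placeholders; adjust -P to your core count):

    find . -name '*.log' -print0 | xargs -0 -n1 -P4 grep -h pattern | wc -l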


The author switched to using find before doing the parallelization.


I've read this, but still can't see a reason to use 'cat | grep':

https://stackoverflow.com/questions/13507889/difference-betw...


I don't think there is.


When all you have is a hammer, every problem starts looking like a nail.

The basic premise is fine: If you have a simple problem, using simple tools will give you a good result. Here you have text files, you just want to iterate through them and find a result from ONE line that's the same in every file, collate the results. No further analysis required.

Every problem in the world can be solved by a bash one-liner, right!?

There's an interesting dichotomy with bash scripts: One school says any bash script over 100 lines should be rewritten in Python, because it's overcomplex already. Another school says any Python script used daily over 100 lines should be rewritten in bash so there are no delusions about it being easy to maintain.

The original article is from 2013, and doesn't try to do any optimization (I guess, the original article is unavailable at the time of writing of this comment), so it would be an interesting question to see what you could do at the Hadoop end to make the query faster. I would imagine quite a lot.


If you can fit your data on a single disk drive, you don't need Hadoop.


The bottom 90% of data users are in the gigabytes range. Anything works.


I've been in a meeting where the amazing scalable cloud solution for the huge data warehouse was laid out. Turned out to be 500GB. Judging by the death stares I got I don't think I was supposed to say "Wow, the whole thing would fit on the 512GB SD card I bought last week".


We had a poorly performing service which read from a number of REST endpoints and wrote to S3 in a date-prefixed format. The offshore team wrote 3,600 lines of code targeting Kinesis Firehose. By just piping the URL endpoints to a named pipe and cycling the S3 file in Python, my code was 55 lines and did the same thing without Kinesis. Wrapping things in GNU parallel and using bash flags, it handles failure cases gracefully, which is something the offshore code did not do. The India offshore code had a global exception catch-all and would print the error and return a success exit code... but I guess someone got to put Kinesis on their resume.
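
Not the actual code, but a very rough sketch of that shape (endpoint list, chunk size and bucket are all made up):

    set -euo pipefail
    mkfifo /tmp/ingest
    parallel -j8 'curl -sf {}' < endpoints.txt > /tmp/ingest &
    split -l 500000 - /tmp/chunk_ < /tmp/ingest                  # cycle the stream into chunks
    for f in /tmp/chunk_*; do
      aws s3 cp "$f" "s3://my-bucket/$(date +%Y/%m/%d)/$(basename "$f")" && rm "$f"
    done

The set -euo pipefail line is the "bash flags" part: a failed upload aborts the script instead of exiting 0 like the catch-all did.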


I maintain a very small command-line cheatsheet here that I keep coming back to for reference, mostly for data analysis tasks: https://tinyurl.com/tomercli


It's not accessible. Can you publish it?



Been saying that for years. Also, get this, 99.999% of companies do not need "big data" or distributed systems of any kind. I feel like the old "cheap commodity hardware" pendulum swung way too far. More expensive, less "commodity" hardware can often be cheaper, if correctly deployed. I.e. you don't need a distributed database if your database is below 1TB and QPS is reasonable (and what's "reasonable" can surprise you today with large NVME SSDs, hundreds of gigabytes of RAM, and 64-core machines being affordable).


This was a straw man article in 2014, it was a straw man article the other times it’s been posted to HN in the intervening years, and it’s still a straw man article in 2020. As noted in another comment here, the contemporary technology of Apache Flink really isn’t far off command-line tools running on a single machine. Meanwhile, HDFS has made a lot of progress on its overhead, particularly unnecessary buffer copies. There are datasets where a Hadoop approach makes sense. But not for ones where the data fits in RAM on a single system. No one has ever argued that.


While I personally would use a similar pipeline as OP for such a small data set, saying Hadoop would take 50min for this is just flat wrong. It shows a clear lack of understanding of how to use Hadoop.


Amen. You can do a lot with pipes, various utils (sed, awk, grep, gnu parallel, etc.), sockets, so on and so forth. I see folks abuse Hadoop way too often for simple jobs.


I am always tempted to say too that "vim can be faster than IDE x"... But I guess that is a bit more subjective.


That's because Hadoop is a big favorite of wining and dining suits, who scram at the sight of the command line.


If you're disappointed with the speed and complexity of your Hadoop cluster, and especially if you're trying to crack a bit, you should give ClickHouse a spin.


>crack a bit

What does that mean? I don't understand if you're trying to endorse ClickHouse or make fun of it.


Phone typo. Should have been 'nut'.

And yes, I'm endorsing ClickHouse; it scales down much better than Hadoop.


If you're doing Spark or Hadoop today and are a Python shop... you should definitely look at Dask: https://dask.org/

It works as well as Spark. Very lightweight. Works through Docker.

Integrated with Kubernetes from the ground up (runs on EKS/GKE, etc).

And no serialization between Java/Python, no fat-jar stuff, etc.


Command-line tools like grep, awk, sed, etc. are great for structured, line-based files like logs. For JSON documents I can add a recommendation for jq:

https://stedolan.github.io/jq/
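
For example, counting structured log lines by a field (the field names here are placeholders):

    jq -r '.status' access.json | sort | uniq -c | sort -rn   # histogram of a field
    jq -c 'select(.level == "error")' app.json | less         # filter to error records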


Cloud computing is kind of a joke. Yeah, keep paying someone for shared "virtual computers" - that sounds suspiciously similar to shared hosting from a decade or two ago... Oh, but this is different, you get isolation from containers/VMs! Yeah, OK, meanwhile new exploits emerge every couple of weeks. It's like tech-debt ideology on steroids... just keep pumping out instances until the company either goes hyperbolic or goes bankrupt. Realistically, just buy a few physical servers and actually work to build efficiency into the system instead of just throwing compute at your public-facing web app.

I recently bought a Dell R710 just for fun and was pleasantly surprised that even days after spinning up a bunch of VMs, I don't have a 30GB logfile of failed attempts at getting into my instance (this was my recent experience with two cloud providers!).

It'll be interesting to see how the market reacts when there's a "first-of-its-kind" massive, massive security breach that affects popular "pure play" internet companies hosted on top of the mythical "cloud."

Seriously, $READER, look at your cloud-computing-dependent startup and calculate the egress costs for your storage, as if you HAD to stop using the cloud tomorrow. How much does it cost you? How could you adapt? It's designed to keep you dependent on third parties... Idk, IMO it is really not great.


I see one really positive point in cloud computing, and that is that I can sleep soundly at night :)

Of course, cloud is overpriced, slow, and suffers from noisy neighbors. And keeping things running in the cloud is about the same amount of work as keeping it running on bare-metal. But for customer-visible things, I want to use cloud so that someone else has to get up in the middle of the night when apache crashes.

Sleeping peacefully makes it worth it for me to pay $5,000 monthly to Heroku when 2-3x $100 bare metal servers would do. Plus I can cheaply insure against supplier negligence, whereas insuring against employee negligence would be much more expensive.


Also important: turning over a finished project to a client who just wants to run the workload and doesn't want to hire full time sysadmins.


Noisy-neighbor tracking can be automated and partially mitigated, though. With cloud you gain access to the provisioning API, and with a Terraform/Ansible stack (for example) you are able to build up and manage infra quickly, efficiently and declaratively. Bare metal provisioning can also be automated (via a private cloud solution, for example), but you need a dedicated team for that (and good OpenStack people aren't cheap). I was solo-managing 500+ hosts once on a public cloud; there is no chance you can do that without what you call "hipster tech" and a modern devops toolchain.


I like to think that most innovation in IT is driven by people being woken up at 3am..


Innovations like... the cloud?


Yeah, and when a server dies at 2am on a Sunday, nobody is going to get calls from our partners asking why their clients cannot pay, why they are losing money because of us, and when our website is going to be back up at full capacity. Because the autoscaler dealt with it, and all I got was a text that a faulty instance is gone and a healthy one has been brought up.

I wonder how you would solve that with a few physical servers.


> I wonder how you would solve that with a few physical server.

They won't. I think a lot of the people who are skeptical of cloud computing have never worked on a product that serves millions of users every hour and gets smashed during rush hours. To me, "why not just get a bunch of bare metal computers" is laughable.


> I recently bought a DELL r710 just for fun and was pleasantly surprised how even days after spinning up a bunch of VMs, I don't have a 30gb logfile for all the failed attempts at getting into my instance (this was my experience recently with 2 cloud providers!)

That's weird. I run a couple of servers myself, recently installed Asterisk on one of them and in under a minute it was under a constant barrage of automated scanners.


I guess our firewall configuration must be different. What are you using to run VMs, if that's your use case? I'm using proxmox, but I'm curious to hear what your setup is like.


Bare metal at Strato. I have fail2ban in place, but still - services that you need to have reachable on the Internet on standard ports (web, Asterisk) will normally be plagued by automated scanners the second you open the ports.

Too bad that it never got implemented in DNS to specify the ports there (e.g. example.com would have an A record for the IP address and another record for the port)...



