Command-line tools can be faster than a Hadoop cluster (2014) (adamdrake.com)
463 points by matthberg on Jan 30, 2020 | 253 comments



With all the Hipster tech being released recently, the headline statement holds true for a lot of things, unfortunately.

We recently discussed new logging tools at work. It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare metal servers with rsyslog and grep for $400 monthly. The log ingestion and search performance was roughly the same...

EDIT: To give everyone a sense of scale, those bare metal servers ($200 each) are 2x Intel Xeon 6-core + 256GB RAM + 15x 10TB 7200 rpm drives. We retain logs for 30 days and handle 4-5TB per day.


What cloud advocates always say is that the $50k monthly will save you money from not needing to hire a team to manage it for you, and that over the course of 10+ years you will be ahead.

Is that true in anyone's experience? Every once in a while somebody posts about their competing bare-metal system, and it looks like a lot of people have managed to cut their server costs by 99% (based on the numbers they post) by avoiding cloud services.

Honestly curious


There's a lot -- and an increasing amount -- of knowledge that's specific to each cloud platform, and increasingly specialized (and complex) software. In the last couple of years or so I've definitely seen companies having to hire people to manage their cloud setups.

I suspect it depends on a lot of things -- the complexity of the project, its architecture, the development, maintenance and management practices. For a few years, a colleague and I used to manage a 30+ server setup without needing more than 8-10 hours a month. But we managed to pull it off not because of where the servers were, but because we chose a good and stable tech stack, we had the knowledge and experience required to manage it efficiently, and no one decreed that we shall henceforth move everything to the cloud because it cuts costs.

Given the same situation -- right stack, right experience, useful management practices -- I'm 100% sure you can pull off the same thing in the cloud, at least as far as efficiently managing the whole setup is concerned.

But IMHO the idea that cloud services give you all of that for free is snake oil. As soon as you need more than a virtual machine running a web server or whatever, what you end up with is exactly what you build. If what you build is crap, it's gonna run like crap, and you're going to need a crapload of money to keep it running, and two craploads of money to fix it.


I have an experience related to this. We used to run a [scalable SQL engine, PaaS] for ~$1000/mo, but it wasn't cool enough, so people started migrating stuff to Spark, Databricks, S3 etc. Now the infra costs are ~$8000/mo + several people to manage it and build tooling around it (~$6000 per person).

People just want to run their selects, man.


>>I have an experience related to this. We used to run a [scalable SQL engine, PaaS] for ~$1000/mo, but it wasn't cool enough, so people started migrating stuff to Spark, Databricks, S3 etc. Now the infra costs are ~$8000/mo + several people to manage it and build tooling around it (~$6000 per person).<<

IMHO cloud services are only really logical options for new startups or ventures. Existing IT shops that are already heavily invested in infrastructure ops will NOT readily move to the cloud unless there's an obvious attempt at a power grab or at subverting the IT/Ops fiefdom.

>>People just want to run their selects, man.<<

I read that in the "Dude's" voice. :-)


This is so true. Also, I see tremendous value in not being reliant on any of the three major cloud providers. They all kinda seem untrustworthy in their own, unique way.

And also:

That JOIN really tied the tables together.


In the venn diagram of "new startups or ventures", "Existing IT shops that are already heavily invested in infrastructure", and "every business", the first two do not cover even close to the entire area of the third.


Hey man there's a query here!


5 years ago I built our own company CDN using bare metal clusters hosted in multiple POPs, at 1/10th the cost of using commercial CDNs. It brought us ahead by literally hundreds of thousands of dollars per year and it serves us extremely well (continues to this day). The problem is that I am the CEO of the company, which has now expanded to 45 people, and to date I've been unsuccessful at assigning a team to manage this piece of tech due to economics. At the point where I have to get engineers to work on the CDN, it becomes cheaper to move to a commercial CDN where it becomes someone else's problem. But at the moment, it ain't broke, the team loves the stability of it, and it continues to save us hundreds of thousands of dollars every year....


Something to turn into a product? Economies of scale could help staff it then?


It's something to think about! :)


What product? A CDN?


woah, are you the ceo of artstation? I like that site


Yessir, reporting in. Thanks for the thumbs up.


Great, could you tell your Android contractor:

- "Magazine" in the sidebar just shows "No internet connection" (which is wrong) when clicking on it.

- Why can't the app remember my Filter settings? I'm only interested in digital 2D when on the "Browse" screen, but I have to apply the filter every single time I open the app.

- On Browse selected, click on Search, type in anything (e.g. Blender), then "View All Projects." Scroll through the first "page" (so that results show up that weren't on the screen before). Now these results begin to show up duplicated later when continuing scrolling (I assume pagination and/or the RecyclerView is implemented incorrectly).

- The app doesn't seem to cache images. Click on any project. Let the image load. Hit back. Click the same project again. The image needs to load again.

Looking at it again, some UI elements feel kind of uncanny, which makes me think that the app isn't fully native. There are some other small things that make the app look a little bit unpolished, which I think a site like ArtStation doesn't deserve.


I’ve sent this over for the team to take a look, thanks.


Is the stack too complicated? I mean, from a layman's perspective it doesn't look like fancy tech, am I wrong?


It depends on who is working on it. :) For devs who are comfortable with bare/close-to-metal clusters and networking, it's really not complex (which is why, over the years, it's just not been an issue for me to hack on it on the side). But for other devs, just the context switching from working on application domains becomes an overhead you can't ignore. Then when you factor in all the other overheads of assigning a "team" to it -- scrum events, refinement, just talking about it -- the cost ends up being driven up.

It is a little bit more complex now - e.g. I added on-the-fly image resizing routines. But the core concepts of caching on multiple POPs, being able to purge, etc. - no, it is not fancy tech by any stretch of the imagination.


It can be true, but not with a difference between $50k and $400; the chances of coming out ahead of that are not very large. Neither is the chance of your company still existing in 10 years.

As I have said here before: many people abuse AWS (etc.) while touting this reason, but in many cases it is used with a 'we have infinite resources' mindset (no constraints), so literally nothing is optimised -- and the case at hand is often something that would most likely not require much work on bare metal either. So for that price difference, you secure against the chance that you maybe have to fork out $2.5k -- ah, let's splurge, say $25k -- in those 10+ years for server management/problems etc. Of all the bare metal we have had running for 10+ years, only one thing broke, and that was not bare metal but an old-fashioned VPS. The rest has never had any issues. That's luck, I know, but even if something had broken, the cost to fix it would've been vastly less than $25k over 10 years. Let alone $50k/mo.

Sure, there are plenty of exceptions, but I dare say most companies, by far, don't need setups like that. But then again, let them set them up; usually they make things so needlessly complex, expensive and slow that I get hired to fix it somewhere in those 10 years (yes, fixing! in the cloud! so much for 'not needing to hire a team to manage it for you').


We maintain 30 bare metal servers at a colo center, and between me (primarily a developer) and the CTO we spend maybe 1 day per month "managing" them. The last time we had a hardware failure was months ago. The last hardware emergency was years ago.

Servers run on electricity, not sysadmin powered hamster wheels.


Yes, the maintenance is cheap. The changes are more costly.

We run a dozen bare metal servers and I see the difference between what it takes to spin up a new VM vs. set up a new physical server. There's planning, OS installation (we use preseed images but we weren't able to automate everything), and sometimes the redundant network setup doesn't play well with what the switches expect (so you need to call the datacenter).

Still, it works out in favor of the bare metal servers. But I'm looking forward to a bit bigger scale to justify a MaaS tool to avoid this gruntwork.


I completely agree that some types of changes are much more expensive with a bare metal architecture than with cloud.

6 years ago, I worked for a company in the mobile space. This was around the time of the Candy Crush boom, and our traffic and processing/storage needs doubled roughly every six months. Our primary data center was rented space co-located near our urban office. For a while, our sysadmins could simply drive over and rack more servers. We reached a point where our cages were full, and the data center was not willing to rent us adjacent space. We were now looking at either a very large project to stand up additional capacity elsewhere to augment what we had (with pretty serious implications for the architecture of the whole system beyond the hardware) or moving the whole operation to a larger space.

This problem ended up hamstringing the business for many months, as many of our decisions were affected by concern about hitting the scale ceiling. We also devoted significant engineering/sysadmin resources to dealing with this problem instead of building new features to grow the business. If the company had chosen a cloud provider or even VPS, it would have been less critical to try to guess how much capacity we'd need a few years down the road to avoid the physical ceiling we dealt with.


Yes, the cloud premium is also a kind of insurance - you know you'll probably be able to double your capacity anytime you need it.


OpenStack Ironic and Bifrost are pretty useful OSS tools for managing bare metal servers.


Yeah, "CAPEX always increases, OPEX always decreases".


[I'm speaking only from the US perspective]

The only difference is whether you own the gear or not. If you do own it, then it is CAPEX, and the gear goes on the balance sheet and you can only depreciate it according to the schedules (in some cases immediately at 100%, but most companies blow through that number really quickly).

In all other cases it is OPEX.

The rule of thumb is that all OPEX can be used to offset revenue, which is a godsend to most companies that aren't printing gobs of money.

So if you make some money and you are past the 100% deduction thresholds, owning gear beefs up the balance sheet and at best slightly decreases taxes, while spending money on OPEX significantly decreases taxes.


I didn't parse that in terms of bookkeeping. I took it to mean that the ongoing operational expenses -- salary for the ops team, dev team, expertise, etc. -- may be difficult to estimate, but once you figure out how to do something, it can be automated.


Under the 2017 Tax Cuts and Jobs Act, you can take a 100% deduction on new and old equipment in the first year. It expires in 2022, but that's because of the shenanigans to avoid formally violating deficit rules, which use a 10-year window for assessing budget impact. I think there's a good chance that it'll be extended as you can game the budget analysis that way indefinitely.


Does that time include security updates for the OS and installed services? I would assume so, but 8 (16?) hours a month seems lower than I'd expect given the frequency of vulnerability discovery and security patches.


Yes.


Thanks!


Everything in one colo facility with no geographic redundancy?


If you look beyond the marketing, a single top tier DC has better uptime than an AWS AZ. You are way more likely to be bitten by AWS control layer issues than by a DC in a good location dropping.


For most use cases there is just no need.


Some colo companies, even the small ones, offer multiple datacenters. You then have to either use public IPs for the inter-service traffic, maintain a VPN or contract it from them.


Your suggested solutions aren't reliable at all. The AWS/GCP/Azure backbones work completely differently.


What about software updates and reboots?


There is the same issue with a cloud provider unless you run immutable deployments. Then, you need to invest in the tooling to produce the immutable images.


We didn't find that managed elasticsearch saved us any money over running our own ES cluster, but the best part of the cloud for my company is the scalability -- our peak monthly load is 10X more than our light weekend load. So to accommodate that scalability, instead of buying and maintaining 1000 servers to handle our light base load, we'd need to own 10,000 servers that would be less than 30% utilized on average. And for redundancy, we'd need to spread those servers across multiple data centers, as well as manage a 2 petabyte storage system (also replicated across data centers) to replace our S3 usage.

And we'd need to have staff to manage the physical servers, run our network infrastructure, etc.

We've been through the numbers many times and the cloud always wins, even when compared to running our own servers for the base load with some hybrid cloud that lets us expand into AWS for peak loads (which saves some dollars on servers, but doesn't make up for the added complexity).

And nearly for "free", we have a cold standby site that has a replica of our hot data (a small portion of our full dataset), so if there was a full region outage, we could be back up and running in 30 minutes with reduced functionality.


As someone who doesn't know a lot about this stuff, is there a way to have the base load internal and use a cloud provider for elastic load?


Yes, with a caveat which is latency. For example it's hard to split a typical webapp so the application servers are far from the database. The latency kills performance.

On the other hand, for stuff like CI jobs, batch processing or anything else which doesn't depend on tight synchronization with the other location, you can mix and match.


I've found most large companies that use cloud at scale have one or more teams managing the cloud, providing tooling around management and deployment, or even full-on PaaS solutions on top of the providers. So you might not have your traditional sysadmins, but you have a team of what recruiters like to call DevOps engineers instead.

The main advantage of cloud is the flexibility: treat resources as ephemeral, and be able to click a button and get more or fewer resources. You don’t need to wait for a server to ship and be installed in a DC; if you don’t know what specs you need, just twiddle the dials until you get something that performs as you want. It allows you to deploy / release quicker and easier.

It’s possible to architect so that the cloud is cheaper, but it almost never happens. Optimising is intensive; often it’s called over-engineering by PMs/product owners. Costs normally start out low and blow up as products ramp out feature after feature and teams start onboarding; if performance is an issue, it’s easier to increase an instance size than to profile and change the architecture. Only when the person paying the bill says this is too much do architecture and other optimisations happen, sometimes at a point when it’s hard to retrofit.


DevOps tools are force multipliers, though. Cloud has its own set of problems to deal with, but manual-ops BAU work requires so much more labor to accomplish so much less.


> not needing to hire a team to manage it

You absolutely do need an experienced team to operate any significant cloud deployment, or misunderstanding the cost model will kill you.

Good cloud people cost more than good conventional sysadmins.

Unless cloud is simply an excuse to kill off a really, really complacent in-house entrenched IT dept, it will not save anyone anything cash-wise.

These are the facts.


I work for a massive corporation that's been on the cloud journey for about 10 years now with no signs of slowing. I think what my company loves most about the cloud is that you can throw money at the bad-management and volatility problems.

Even if a department spins up 500 servers for some executive's sacred cow when the project goes bust it's trivial to tear down the whole thing without requiring semi trucks.

The other very important thing is inventory auditing. Managing inventory in 10 datacenters built and populated by 100 teams over 30 years is a nightmare. Cloud provides a mechanism to build the coveted "single pane of glass" which large companies need desperately.


The answer to bad management is the cloud. Well put.


The answer is: it depends

Clearly in the example above, you can afford to hire two sys admins to manage the bare metal servers (you can probably afford more, but 2 is the minimum that gives some peace of mind)

We had an application that was a glorified quiz engine, and our customers would mass-enrol and mass-take the quizzes, so the scaling offered by Azure made it a no-brainer for us, especially as, given the nature of our customers, quizzes would only take place during working hours, so we'd scale right down for half the day and scale up and out at peak times.

Total costs were about 1/2 of what we estimated for bare metal


2 sysadmins for 2 servers? What are they going to do all day? Managing those two servers should probably be 5-10% of their job -- 2-4h per week for each of them seems plenty -- while doing 90-95% other stuff. Like a sibling comment said, "Servers run on electricity, not sysadmin powered hamster wheels." I'm our de facto sysadmin here (university lab) and I spend maybe an hour a week taking care of our ~dozen bare-metal servers.


> What are they going to do all day?

Be on-call. You need the sysadmins to be available to fix issues when they happen, even if the fixing itself takes little time.

> ... university lab ...

I developed a few web sites for universities, and they were hosted by the university. You really have to hope nothing breaks on Friday afternoon, or you have to wait until Monday morning to get it fixed...


Facebook with 400m users had just one DBA, if I remember correctly. I wonder if people ever run the numbers.


I was on the DBA team there around that MAU. I had lots of company and support from SRO (aka jr DBAs) as well as other teams (provisioning, etc...).


> Facebook with 400m users had just one DBA, if I remember correctly.

You don’t remember correctly. There are multiple and to my knowledge there hasn’t been just one in at least the past ten years.


Millions of Facebook accounts are a pretty harmless size metric. DBA workload increases with the number of employees, which is proportional to support requests, accidents, changes, strange needs that need to be addressed, and so on.

Buying and provisioning disks before space runs out is a small part of a DBA's job, and a modest constant-size task compared to predicting how fast space is running out.


Have you got a citation for that claim? Because that figure seems a little hard to believe.

In terms of your general point: it really depends on your business. In my last job there were 4 DBAs out of 12 total IT staff and they were constantly busy. In my current job there are no DBAs in a much larger team and yet no requirement to need one either. The two businesses produce vastly different products.


Yes, if you can benefit from frequent scaling up and down an order of magnitude, this is where the cloud really shines.

For more continuous workloads, you can overprovision on bare metal more cheaply to deal with the spikes.


Would that not just be part of your existing sysadmin team's job?

Back when I worked at BT, the Unix developers all went on the basic sysadmin course from Sun as part of their induction.


That might have worked for the specific hardware BT needed those developers to manage, but it's not good advice in the general sense. Systems administration is as much a detailed speciality as being a software developer. There are so many edge cases to learn -- particularly when it comes to hardening public-facing infrastructure -- that you really should be hiring an experienced sysadmin if your company is handling any form of user data.

As an aside, this is one of the other reasons company directors like the cloud -- or serverless specifically: it absolves responsibility for hardening host infrastructure. Except it doesn't, because you then need to manage AWS IAM policies, CloudWatch logs and security groups instead of UNIX user groups, syslog and iptables (to name a few arbitrary examples). But that reality is often not given as part of the cloud migration sales pitch.


True, but in BT it was SD's (Security Division) job to set standards, and our team's sysadmin handled that.

SD was the employer of Bruce Schneier for a few years, BTW.


Kahoot?


I am very happy to pay for Heroku, exactly because I don't want to be the poor guy getting up in the middle of the night to restart Apache.

But for a system that is invisible from customers, like our logging system here, I don't care that much about timely maintenance.


> I am very happy to pay for Heroku, exactly because I don't want to be the poor guy getting up in the middle of the night to restart Apache.

As an employee I am also happy if my company pays more, so I don't have to be woken up in the middle of the night.

As a small business owner with limited resources and liquidity, waking up once a month for $50K in savings looks like a good deal.


I have a very nice quote from a discussion I remember:

"If you need to get up at 3AM to keep services running, you're doing something wrong."

You can make sure that most of the services in the *NIX world take care of themselves while you're away, without using any fancy SaaS or PaaS offering.

Heck, you can even do failovers with heartbeatd. Even over a serial cable if you feel fancy.

Bonus: This is the first thing I try to teach anyone in system administration and software engineering. "Do your work with good quality, so you don't have to wake up at 3AM".


I'd pick a call in the morning any time, given that the cause of the call occurs rarely and the alternative is to spend a lot of time automating things with the possibility of blowing things up in a much bigger way. If a situation like [0] had happened to me at night, I'd happily take time off my sleep and do a manual standby server promotion (or no promotion at all) rather than spend days recovering from diverged servers that the Raft kool-aid was supposed to save me from.

[0] https://github.blog/2018-10-30-oct21-post-incident-analysis/


I'm not against your point of view, to be honest. It's perfectly rational and pragmatic to act this way.

I'm also not advocating that "complete, complex automation" is the definitive answer to this problem. On the contrary, I advocate "incremental automation", which solves a single problem in a single step. If well documented, it works much better and more reliably in the long run and can be maintained with ease.

Quoting John Gall:

> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work.


I'm a good engineer, but I'm not qualified as a system administrator. I know that I will do something wrong, and I don't have the time to learn everything.

So I'd rather pay amazon.


So instead of learning systems you have to learn the AWS spaghetti of microservices.


You can try and build and test redundancy and contingency management, and you can lower the frequency of surprises through good choices.

But you're still going to get woken up at 3am sometimes. Things break, in unexpected ways. Maybe the hot spare didn't actually work when a raid set started rebuilding onto it. Maybe third party software did something unexpected. Or maybe something broke and your failover didn't actually work because of subtle configuration drift since the last test.


We have a standard routine called The restart test. We reboot the machine in a normal way to see how it behaves, but in the middle of the workload. Also, if the system is critical, sometimes we just yank the power cables to see what happens.

Normally all plausible scenarios are tested and systems are well tortured before putting into production.

It also helps that our majority of servers are cattle, not pets. So a missing cattle can wait until morning. Also all "pet" servers have active and tested failover, so they can wait until morning too.

We once had a problem with a power outage when our generators failed to kick in, so we lost the whole datacenter. Even in this case we can return to all-operational in 2hrs or less.

I forgot to add: We use installations from well tested templates, so installations have no wiggle-room configuration wise. If something is working, we can replicate that pretty reliably.


Sure, this is typical of well-run environments.

But you probably don't yank power on critical things mid-load after making a trivial change. Excessive testing breeds its own risks.

But it's really, really easy to gank up a trivial change now and then.

In the past 10 years, I've been woken up three times. One was from third-party software having a certificate that we didn't know about expiring; one was from a very important RAID set degrading and failing to auto-rebuild to the hotspare (it was RAID-10, so didn't want to leave it with a single copy of one stripe any longer than necessary); and one was from a bad "trivial change" that actually wasn't. I don't see how you can get to a rate much lower than this if you are running critical, 24x7 infrastructure.


Doesn't look like you have much experience with what you're talking about - there is no such thing as heartbeatd; it's called keepalived (or Pacemaker if you prefer unnecessarily complex solutions). An ops person wouldn't even misspell that.


Sorry, you're right. I confused it with its cousin, which is indeed called heartbeatd [0].

I'm new at Linux and system administration. I've only been using Linux for 20 years and managing systems for 13.

[0]: https://www.manpagez.com/man/8/heartbeatd/


Having just set up HA Postgres with Patroni - I disagree. Honestly I think we should've just stuck with a single Postgres server.

Sure, you can have an orchestration tool to "make sure everything is running, and respond to failures", but that's yet another tool that can break, be misconfigured, etc.


> If you need to get up at 3AM to keep services running

I just turn off the phone until I get up. That way I don't have to get up at 3 AM, I don't even know they were down until five hours later. :)


Same here.

As a small business owner I run my own apps on Digital Ocean, which to me offers very nice balance between features, reliability and price.


If your application needs to be restarted in the middle of the night, how does Heroku help?


Looking at what the market rate for an AWS architect (I wear this hat as well) or DevOps engineer is, it doesn't work out cost-wise on that front either.

The fact is that cloud is expensive, and outside a few use cases (such as extreme horizontal scalability, aka elasticity, or machine learning) the costs just don't work out well.


I think cloud's way harder than managing actual hardware (and/or your own VMs and such on actual hardware, or even traditional VMs on someone else's hardware) since once you're beyond the trivial it quickly becomes a superset of traditional admin knowledge, not a replacement—you end up having to know how to do things the old way to understand WTF the cloud is doing and troubleshoot issues, or to integrate or transition some older system, or whatever.

And then of course every "cloud" is full of about a billion hidden gotchas ("well the marketing page says that'll work, but on Tuesdays in February it won't, so use this instead, but only if you're writing your logic in JavaScript because the tools for the other allegedly-supported SDKs are broken in weird ways half the time, so instead use this other thing, unless you're a Libra, then...") none of which knowledge transfers between "clouds", and each has a pile of dumb names to memorize and a ton of other per se useless, but necessary, knowledge you've gotta pick up, just to rub salt in the wound.


I have a friend with a rack of co-located servers I manage. I drive from San Diego to LA where they are located maybe once every 6 months or more. I don't believe I have been to the rack for about a year. The rack with 20mbs of bandwidth costs around $2,500 a month.


Did you really mean twenty megabits?


If it’s unmetered 20mbps I’d happily take that over the ingress/egress costs at a cloud provider such as AWS.


Yeah AWS bandwidth egress is extortionate. Digital Ocean is something like 5-7x less expensive. It's ridiculous.


That's kind of shocking when compared to European prices. Over a full order of magnitude of difference, comparing to the list price of the first Google result.


Yes, not sure of the abbreviation apparently! But 20 megabits.

I like to watch Neil Patel's SEO videos on Facebook occasionally, and he mentioned that for one service he runs he spends over 100k a month on hosting. It blows my mind because he could buy a top-end server with dozens of processors and terabytes of RAM, co-locate it and at least host some things on it, or even turn it into his own cloud hosting server.

Even if a person bought 2 over-the-top servers for $30k each and paid around $2,500 a month for hosting, it would save huge money.


20 motherboards?


Not in mine. All that stuff you build out of cloud provider tools still needs care and feeding.


Honestly, it depends on the situation. Skilled engineers who can maintain systems reliably are hard to find, in general. If it's not a core system and you have the capital, it makes a lot more sense to pay a cloud hosting provider and focus on your product rather than attempt to build something simply to save on costs.


I too never got this. I learned using bare metal before AWS was around. I took a look at AWS in 2008 and decided that it was interesting, but nothing I had a use for at the time. Since then, I have never felt any need for it, except for a few times when I wanted to emulate a distributed network for testing purposes.

If I look into the AWS dashboard today, I feel totally overwhelmed, while running my own servers with LXC containers on it feels effortless. I guess for many it's the other way round.

Don't really know what I'm missing, but I'm happy about the old school skillset I have, allows me to have a fraction of the infrastructure costs I would have otherwise for what I'm running.


My company builds out enterprise and b2b apps. For most of our clients, infrastructure just isn't a big enough expense to really think about or optimize for. If you're selling your web app at a $1,000-$10,000 a license per year it doesn't really matter what you spend on servers. But it still matters how much time/money you spend managing your servers. So here it makes a lot of sense to go cloud because it's a lot faster to go from having no infrastructure to running a web server and database with automatic backups.


$50k monthly would pay for a team of the world's best system administrators.


No.


You know sysadmins that make > $100K / year?

If we assume $20K fringe benefits, that $50K / month is $600K annually, which gets you 5 $100K/year sysadmins.


Yes, I know plenty of Devopsish people that make more than that. And I'm not even talking about SRE-minded folk there.


You live in a bubble. Outside of the rarefied air of SFBA, wages are much lower. I do know super sharp people who are paid far less.


And how much is your non-bubbled "far less"? I'm not even close to the Bay Area, btw.


Under 150k in a major urban area.


You also have to consider the features you either have to do without, or create yourself. Dashboards, alerts, saved searches, web GUI, etc... You might legitimately not need any of that, but simply implementing search and storage isn’t replacing the whole product on its own.


I work at a big enterprise SaaS company that's moving a couple billion dollars worth of hardware into cloud services, and the price difference is wild. Original unoptimized costs were 8x, dropping to 6x in our first round of optimizations. Even the rosiest middle-management projections put eventual costs at 2x owning the hardware.

Add to that the fact that we threw out plans to do things like virtualization on our owned hardware, and that engineering headcount has consistently grown faster than revenue, and it's not clear there are any savings to be had at our scale.


From my experience (consulting for a decent-sized company that has lots of its infrastructure on Azure), it looks to me like the amount of management involved is not less than if they had it on premises or on co-located bare metal.


Seems like the cloud services are complex enough that you're paying someone to manage them anyway.


I work at a small company; we have 5 programmers and 3 IT/server/network people.

I'm one of the programmers, and the past few months I filled the role of devops/deployment engineering for our new website.

One great example is Sentry, I love Sentry, and it's been invaluable for us. It saves the web developers a crapload of time. Now Sentry has a self hosted option, and that's what we're currently using.

Now, I know little about Sentry's internals, and frankly I don't really care. But sometimes it breaks, or we want it to be updated, etc. Sentry offers hosting at $25/mo, which would mean we don't have to worry about it at all; stability, upgrades, and scale are all handled by them.

$25 a month is less than 1 hour of my time. All it has to do is save me 1 hour a month to easily pay for itself.

---

Another example is that I just spent a large amount of time trying to set up an HA Postgres cluster. This meant I had to dive into the internals of Postgres, how our orchestrator (Patroni) works, setting up Consul to manage the state, etc. This has taken a significant amount of time (several weeks) - it's easy to get a POC working, but actually ironing the bugs out is a different story.

Also, nobody else on the team fully understands its setup, so if it breaks, welp....

All this to say, for us a hosted DB option would likely have been cheaper (compared to my time and pay) and would have better uptime and support than us rolling our own solution.

----

We don't have any central log store, and I really wish we did. Similar situation here, I could spend a month or two configuring and tuning Elastic Search, or we could just pay for a hosted option.

---

Tl;dr - I'm a developer at a small company and have spent little time doing application development the past few months because of all the time that has been required to setup the infrastructure for our application.


So why don't you move to cloud/managed services? I can tell you from my own experience that having an HA solution you don't really understand is a really bad idea. Just go with master/replica for Pg and a well-documented troubleshooting and manual failover procedure. It'll likely be more reliable than some HA black magic. And it will likely take less than several weeks to get up and running.


I think the real answer is that you don't have to trust an employee's expertise.

I can give you a real life example.

Years ago I did work for this company that built flight simulators for the government. Millions of dollars rolling through their company. One day I get called up by this company and the woman is panicking because their website went down.

Well, come to find out, the entire site was running on a server sitting in an office of their building and the electricity went out due to a winter storm. To say that I was floored is an understatement. I started asking questions.

Well, when I did, the "sys admin" (and I'm putting that in quotes...) started talking to the owner of this company and it was ultimately decided that I couldn't be trusted.

Fast forward a few years and around august of last year this company contacts me again for more work. They apparently scored another huge government contract, built a new facility to be able to house actual government planes so they can strip them down and turn them into flight simulators and so forth.

While I'm at the facility I learn that they're running all their software off of a server sitting in the building. Now, this isn't completely unreasonable, especially for government work. But I again started asking questions.

- Is it environmentally controlled?

- Do you have a generator?

- Do you have multiple lines in case your ISP goes down (which 100% will happen at some point)?

- Do you have backups?

- ARE YOU EXERCISING THOSE BACKUPS AT LEAST ONCE/QUARTER?

When this got back to the "sys admin", he was livid. I also found out that they didn't have the source code for the latest changes I made, despite me pushing said source code to a private git server this "sys admin" had stood up. Said virtual server apparently got removed when they moved facilities, but despite this the guy accused me of pushing it to my private GitHub repo, based purely on the fact that I had stated in no uncertain terms that I had pushed it to the git repo.

But this software was a part of the government contract.

I just sent out an email the next day thanking them for the opportunity, but that I would have to pass on it.

My point is this.

They're an engineering company, that's where their expertise lies. But due to the nature of what they're doing, they were forced into the software side of things. They hired an incompetent.

Companies like these are probably better off throwing money at the problem and putting it in the cloud. The skill level required to successfully run something in the cloud and not completely lose everything is much lower. That's not to say there isn't skill involved, but you don't have to hire someone who may or may not decide to run your software that's involved in a multi-million dollar contract in a closet in your building.


The cloud won't fix "hired an incompetent".


What I specifically said (with emphasis):

> Companies like these are probably better off throwing money at the problem and putting it in the cloud. __THE SKILL LEVEL REQUIRED__ to successfully run something in the cloud __AND NOT COMPLETELY LOSE EVERYTHING__ is much lower. That's not to say there isn't skill involved, but you don't have to hire someone who may or may not decide to run your software that's involved in a multi-million dollar contract in a closet in your building.


> What cloud advocates always say is that the $50k monthly will save you money from not needing to hire a team to manage it for you, and that over the course of 10+ years you will be ahead. Is that true in anyone's experience? Every once in a while somebody posts about their competing bare-metal system and it looks like a lot of people have managed to cut their server costs by 99% (based on the numbers they post) by avoiding the cloud as a service

Too many weasel words.


If you are using systemd, you don't even need to grep "manually" that much:

  journalctl -u nginx -u ssh --since today


I had never heard of that before. Thank you for pointing this out :)


And yet you're rsyslogging and grepping what is supposed to be structured and typed. Have you heard about backpressure, controlled retention, compliant log redundancy and all of this advanced stuff that you'll never get with aggregation via rsyslogd/syslog-ng?


rsyslog is far more powerful than you're making it out to be. You have to actually tell it what to do, but it's more expressive than Filebeat and Logstash.

* rsyslog in the use-case he's describing is just a method of pushing some subset of the logs generated on a client system to a directory on the collector, which has trade-offs, but the benefit is having really simple failure modes.

* both rsyslog and journald store structured data: rsyslog with lumberjack, and journald just always. And rsyslog can parse and structure the logs in-flight so you save computing power on the collector.

* rsyslog behaves exactly like filebeat when it comes to reliable delivery and can persist unsent messages to memory then disk. rsyslog's rate limiting, backoff, and retry options are super powerful.
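For reference, a minimal sketch of what that reliable-delivery piece might look like on a client, written in rsyslog's newer config syntax; the collector hostname, file names and queue sizes below are made-up placeholders, not recommendations:

  # hypothetical /etc/rsyslog.d/50-forward.conf: forward over TCP, spill
  # unsent messages to a disk-assisted queue while the collector is down,
  # and keep retrying instead of dropping
  action(type="omfwd" target="logs.example.internal" port="514" protocol="tcp"
         queue.type="LinkedList" queue.filename="fwd_queue"
         queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
         action.resumeRetryCount="-1")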


Yes, you are right, sorry. I was too fast with my assumptions; rsyslog (I don't know much about syslog-ng) has feature parity with ELK in terms of log delivery and processing. But I think that grep and its permutations aren't the right tools of choice for log analysis anyway.


>backpressure

Both syslog-ng and rsyslog apply backpressure just fine over a network or socket...

>controlled retention

logrotate? It's only been around for over a decade..

>compliant log redundancy

So like, a backup strategy?

All of this stuff has been around for approximately forever, ES just had their marketing team name it something else and people like you are falling for it...
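And for the retention point specifically: the 30-day case from upthread is a handful of lines of logrotate config. A sketch, with a made-up path:

  /var/log/remote/*/*.log {
      daily
      rotate 30
      compress
      missingok
      notifempty
  }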


Might be worth taking a look at ripgrep[1].

[1] https://github.com/BurntSushi/ripgrep


That tool looks great :) but since we're already seeing <1s search times and the tool is only used by internal support employees, I'm mostly going with "never touch a running system" these days.

While for a database like ES you'd put all of the data into one big pile and then filter by keywords, e.g. host=ftp service=ftp query=IP, for logfiles you usually search on a much smaller set. They are rotated by day and logs are broken down by host and service by rsyslog, so instead of filtering the full 150TB - which is what ES has to do - my grep only needs to look at the 1-2 GB of data inside the file where host, service, and date match.
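To make that concrete: a typical lookup is then just a grep against one day's file for one host and service. The directory layout below is a hypothetical example of such an rsyslog split, and $CUSTOMER_IP is just a placeholder:

  # one host, one service, one day -- a couple of GB instead of the full 150TB
  grep -i "$CUSTOMER_IP" "/var/log/remote/ftp01/ftp/$(date +%F).log"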


Do you understand anything about ES indices?


RG is awesome for the basic use case. Using it across platforms just makes you tear your hair out, though.


I'm curious, why? Looks pretty easy to get the binaries.


Why? Can you share your frustrations?


What kind of searches are you doing where the search performance of grep is the same as with Elasticsearch?


One common query was to check the FTP server on our side for access from a customer server, so that we can help with troubleshooting as to why they couldn't connect. Turns out, in many cases less-technical customers mistyped the hostname or left out the .com at the end.

  tail -n 10000 /var/log/ftp.log | grep -i $CUSTOMER_IP | tail -n 5


If that's your typical use-case, I can see why that's as fast as (or faster than) Elasticsearch, but it's not clear why you'd have gone with Elasticsearch in the first place.

When I last used Elasticsearch, we indexed ~10TB of log data a day, kept 14+ days, and a typical query was looking for log records that matched a unique session ID over the past 10 days or so, not an easy task for grep. But we didn't pay $50K/month for that cluster, it was closer to $12K/month.

Before we used ES, someone had written a parallel grep that would grep multiple files at once, and would run multiple greps at once through chunks of the file, but still it could take 30 minutes or longer to churn through the logs on a 32 core machine - ES took that query time down to 100 milliseconds. The ES cluster easily paid for itself in employee time savings.


We used to use the ELK (ElasticSearch, Logstash, Kibana) stack hosted locally. So when we outgrew that, we just tried to find a bigger ELK provider.

We did evaluate the elastic.co Cloud - which I concur would have been cheaper than Amazon - but since their demonstration cluster failed to boot correctly and as a result suffered a data loss during their sales presentation, my boss didn't feel comfortable going with them.

That's how I was left with the decision to either scale up our old ELK stack on AWS or go with something proprietary.


If you always want to search by one thing, you can manually index by that thing. In your case, arrange your log files by the first 6 hex characters of the user ID (/var/log/xxx/xxx/date.log), and grep will typically only then have a few megabytes to scan.
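A rough sketch of that manual-indexing idea, assuming the logs were bucketed at write time by the first 6 hex characters of the ID (all names and paths here are hypothetical):

  # look up one ID: only the one small bucket directory gets scanned
  prefix="${SESSION_ID:0:6}"
  grep -rF "$SESSION_ID" "/var/log/sessions/${prefix:0:3}/${prefix:3:3}/"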

If you need real indexes, or just want something industry standard and maintainable rather than "some guy's grep script", then Elasticsearch is probably the way to go.


That was just one thing we needed to search for (but by far the most common). The guy that wrote the parallel grep did try creating some indexes of common fields to speed searches, but quickly realized that he was re-implementing the wheel (poorly)

Plus we made good use of Kibana dashboards for the service


> it's not clear why you'd have gone with Elasticsearch in the first place.

Having been in a company with a similar issue (Hadoop/Spark, not ES), the issue is: you have a bunch of programmers hired to handle the 10TB of data/day. Then you need some work done processing some data that is, e.g., 100GB once a week. Rather than evaluate the best way to process that data, the thought process basically goes "it doesn't fit in my laptop's memory, so we'll use the cluster."


How did they connect at all with the wrong hostname?


They didn't, that's why they called us to complain that our FTP server was offline...

Some of them also had one of those DNS-grabbing ISPs, so by mistyping the hostname they would accidentally connect to the wrong IP.

EDIT: I think I now got what you meant. When a customer says "my password is not working" then the first thing we do is check in the logs that the customer actually did connect to the correct server. That's like the number one issue, correct username and password but wrong hostname.


> those $200 each bare metal servers are 2x Intel Xeon 6-core + 256GB RAM + 15x 10TB 7200 rpm.

That's really cheap for such a big server. I think the same specs on Hetzner and OVH cost at least twice that. Where do you rent those servers?


I think this guy has no idea what he's talking about.


Sounds similar to a Hetzner SX292 - Xeon 6-core, 256GB RAM, 15x10TB. Only one CPU though, at $285.


I've been fighting elasticsearch to parse some netflow traffic. Found a great tool for logging the traffic and it even set up an elasticsearch instance and kibana for me, pretty much plug-and-play.

It logs lots of data, but I'm only interested in subnet level at this stage. I'm not interested in direction either.

A text file (or multiple) with, say, a CSV format (other fields are fine; cut -f will strip those out):

  datetime,collector,srcip,dstip,bytes

I could easily pipe it into sed to strip the 4th octet, but in reality I'd probably parse it with perl. It would take about 10 minutes and output a nice simple spreadsheet showing me which subnets are busiest, which times are busy, etc., and apply accounting on a subnet basis (or, with a bit more perl, trivially assign different IPs to different accounting buckets).
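For instance, here is a sketch of that subnet rollup in awk rather than perl, assuming the CSV layout above and a hypothetical flows.csv file (add NR > 1 if the file has a header row):

  # sum bytes per source /24 and list the busiest subnets first
  awk -F, '{ split($3, o, "."); net = o[1]"."o[2]"."o[3]".0/24"; sum[net] += $5 }
    END { for (n in sum) printf "%s,%d\n", n, sum[n] }' flows.csv |
    sort -t, -k2 -rn | head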

I can only assume ELK is a completely different way of thinking. Colleagues think that grep is "hard", but click-click-click in Kibana is easy.

However, I know I'm a grumpy old fart. I find lots of new ways to reinvent the wheel tiring and pointless, but it feels like shouting into the wind. Recently, things that have made me simply sigh include Ubuntu switching from /etc/network/interfaces to netplan, or from debian-installer to subiquity. Or moving from init.d to systemd.

I'm sure there are good reasons for changing all of these, but for my use cases it just increases the workload.


Can you tell me which tool you use for getting netflows? I am trying to build something that is really close to what you describe but fail to find the time to do so.


Using https://gitlab.com/thart/flowanalyzer at the moment

However, I'm thinking it would be far easier to write something from scratch.


ElasticSearch also does a lot more than your CLI does.

Splunk style dashboards, multi-user access, full text searching e.g. stemming, ability to support non-structured formats.


I fully agree, but our business need was to stream log files (e.g. tail -f) and to search for snippets in them and to produce statistic counters (e.g. |grep|wc)

While I can imagine situations where ElasticSearch is the best solution, I have seen it mostly used in situations that would be better served with simple command-line tools, similar to what the author demonstrated with awk vs. Hadoop.


> ElasticSearch also does a lot more than your CLI does.

This is not necessarily a good thing. In fact, it is probably a really bad thing if you do not need the additional stuff.


We reached the point where NOT being hosted on AWS is a competitive advantage :)


You should really be packaging up what you built and selling it as a competing service.


The irony is that there is no service. It's a duct-tape solution reminding me of my first years in the Linux ops industry.


As a possible middle-ground: I'm using Graylog with a lot of success, but my log volumes are rather modest (medium sized corporate network switches/routers/firewalls/servers log aggregation and processing)


We have 5TB per day, which may or may not break Graylog. Before evaluating ES, we used Logstash, but that couldn't keep up at around 2TB+ per day.


> Before evaluating ES, we used Logstash

Huh? I thought one would use Logstash to ingest & massage data going into ES. I view it as a sort of "quick&dirty stream processing engine". How would one go about using Logstash _instead of_ ES, as opposed to "in addition to ES" ?


Sorry, I worded that badly.

We used to use a locally-hosted ELK stack, so yes, including ES as the data backend for Kibana, but there Logstash or more precisely filebeat and logstash-forwarder became the bottleneck. That was a horrible experience because then filebeat would keep the old (unsent, but deleted) log files open, so production servers crashed running out of HDD space even though "du -h -s /" was a lot smaller than the HDD size. It took me way too long to figure out that deleted but still open files were the problem.
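For anyone who hits the same thing: deleted-but-still-open files are easy to spot once you know to look, and there is nothing specific to filebeat about it; a couple of standard commands:

  # open files whose on-disk link count is 0, i.e. deleted but still held open
  lsof +L1 /var/log
  # or the blunt system-wide version
  lsof -nP | grep '(deleted)'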

So our decision to move from using ES invisibly as part of Logstash towards a full dedicated ES cluster was forced by the Logstash pipeline getting too slow.

I then evaluated Elastic Cloud and using Amazon EC2 for deploying ES, both of which were ultimately rejected. The first due to fears of data loss, the second due to costs.


I'm confused again - wouldn't Elastic Cloud cost more than EC2? It makes no sense to be the other way around.

And secondly - why use logstash if you had filebeat - why not send data to ES directly? (I mean, specifically for your usecase, where you don't seem to need to do much processing prior to ingestion). Yeah, I saw that big parts of Logstash are in ruby so I expect underwhelming performance from it - but Filebeat should in theory be able to ingest large volumes of data into ES directly.


We used a central logstash server to receive the data from filebeat and logstash-forwarder running on multiple other servers. So in the initial deployment, logstash only set up the host and service tags needed for Kibana plus it was like a central gateway for external servers to write into the locally running ES storage.

As for why the price that elastic.co quoted us was lower than EC2, I have no idea. My guess would be that they were hoping to get a reference customer onboard for their (then new) offering.

Or maybe they are internally running bare-metal and thereby purchasing their CPU power cheaper than EC2 pricing. I mean you don't really need a redundant fault-tolerant virtual machine if the database on top is redundant and fault-tolerant, too.


I'm piping 5TB/day through 2 Graylog hosts with plain stupid DNS round-robin, all good.


We used Graylog at a previous company and it worked perfectly.

Scaled to our needs but we might have used a cluster.


The cost savings with bare metal servers vs managed cloud can be ludicrous... like take away 2-3 zeroes. It's particularly true for bandwidth and raw CPU power where major cloud providers massively overcharge.

The real answer is like most other things: it depends.

If you have a highly variable workload or you really get a lot of value from specialized managed services then cloud will save you and generally work better. If you have a less variable workload then a more DIY approach can yield truly massive cost savings and even better uptime and performance in some cases.


Where do you get such low prices for dedicated rigs? Well, anyway, it looks like you haven't tried to roll your own ES cluster on bare metal hosts - I've got more throughput than yours with 2 Graylog hosts backed by 5x bare metal ES instances. With all the benefits of a web interface for less tech-savvy people, structured logs, analytics and data exposure through a unified API. All of this for less than $2k/mo. If I lived in wonderland with server prices like yours, our bill would be less than $1k/mo.


I personally use OVH.com and Hetzner (for EU based installations). Don't expect support, but you get good machines, unlimited traffic (OVH). If you can set up your infrastructure so that it's JBOS (Just a Bunch of Servers) where you can hot swap them if there's any hardware failure, etc. it's great price/performance.


Not quite that price, but with 308EUR/month not too far off:

https://www.hetzner.com/dedicated-rootserver/sx292

I guess if you buy and host the machines yourself, you should be able to beat hetzner's price.


How does the functionality compare, though? With ELK, you have a variety of options for monitoring (I have personal experience with Kibana, Grafana, and the API) and visualizing. You explicitly say “rsyslog and grep”; do you build dashboard systems on top of grep? What does the performance look like? How many dashboards does that scale to?

What’s your strategy for availability?


Do you mean “bare metal” as just basic AWS images using only rsyslog and grep? You call out physical disks, so it makes me think physical machines.

How are two physical machines $400/month? Is that assuming you already have the hardware laying around?

How much would it cost for basic aws ec2 instances running rsyslog and grep?


Why not run ES on the bare metal servers?


Tried and it was horribly slow. We add about 5TB of logs per day, and the ES index wasn't able to keep up, despite having 256 GB RAM and 2x Intel Xeon 6-core


That seems like a really good price for bare metal of that magnitude, where are you getting those from?


https://www.hetzner.com/dedicated-rootserver/matrix-sx

On their German website, you can bid for used-but-still-good servers and there you'll find great offers. There, you can also find hardware that they customized for others and then didn't need anymore. For example, I rented some GPU rendering servers for $110 monthly each from there.

https://www.hetzner.de/sb

Plus I can't praise their service and ops team enough. After we rented the new storage servers, I sent them an email to ask, and they then put the new storage servers on the same router inside the same datacenter as our other servers with them. Now we effectively have 2x redundant 1GBit LAN between our servers with them, plus a 10GBit internet uplink shared between our servers.

For that level of performance, they are ridiculously cheap.

BUT - and that is a big gotcha - you will be responsible for diagnosing hardware issues and asking for a replacement yourself. That's why we have RAID and two storage systems with them, because we tend to see one HDD fail every 3-6 months.


Bespoke solutions are always going to outperform generic ones. You need to choose the right tool for each job. ElasticSearch definitely has its advantages if you have to write complex queries; however, it might be less suitable for grep-like queries.


> We recently discussed new logging tools at work. It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare metal servers with rsyslog and grep for $400 monthly. The log ingestion and search performance was roughly the same...

If grep has the same search performance as elasticsearch, you should not be using elasticsearch and any comparison is bullshit.

> It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare metal servers with rsyslog and grep for $400 monthly.

$50K monthly is $69.44 an hour, or 8 instances of the most expensive thing from https://aws.amazon.com/elasticsearch-service/pricing/ (with 488GB RAM)

Please cite references, your numbers seem made up. Where can you get 256GB of RAM for $200/month?


https://www.hetzner.com/dedicated-rootserver/matrix-sx

They get slightly cheaper for large customers like us, plus you can get hardware upgrades like the 2x CPUs by paying for the difference upfront. So yes, it's $400 monthly + $2000 once, but those $2k don't really make much of a difference over the years that we've been running the system ^^


Are European hosting providers cheaper because they’re simpler, or because they’re European? I don’t see anything analogous for cheap dedicated servers in the US market. Wages, consumer prices, real estate, etc. are all lower so it makes sense that French and German companies charge less than Amazon.

Collocation can be a great deal at scale, but I don’t see a way to cheaply rent a handful of servers for substantially less than Amazon would charge. VPS providers maybe, but even the quintessential cheap American VPS (DigitalOcean) has a bunch of cloud features these days.

The last big dedicated provider I know of was SoftLayer, which is now $2,100/mo for a 242GB, 56-core dedicated box under IBM Cloud. Amazon’s m5 is around the same capacity at $3,700/mo.


I believe it's cheaper connectivity, due to the network monopoly in the US. For European hosting providers, traffic tends to be cheap or free. For Amazon, you'll pay through the nose for egress traffic.

And when we tried to rent a Verizon 1gbit dedicated line to our Boston office, they quoted us $15k monthly.

In Germany, we paid first €199 then €499 for a dedicated 1gbit fiber line from 1&1.


Even before network charges, though.

An Amazon m5.12xlarge with a 1-year reservation is $1,075/mo, about the same as OVHCloud's top-end HG3 in the US at $975. Both are 28 cores / 56 vCPUs and 192GB RAM.


I have some servers at Wholesale Internet - they're located in Kansas City. Good counterpart to Hetzner or SoYouStart; with these low end dedicated hosts you end up being multi-provider to have geographically redundant datacenters anyway, but if you can build within those restrictions (often you can't have a private network so you're doing host to host VPN, for instance) it's a cheap way to get beefy servers and unmetered bandwidth.

Wholesale lists with a transfer limit on their price page but on the server config page you can upgrade to an unmetered Gbit port. And the low end boxes have an unmetered 100mbit port already. My experience has been that they really don't mind you heavily using that bandwidth either.. did have a point a few years ago where they were bandwidth limited until more fiber could be turned up in their datacenters, but you're going to occasionally have that sort of thing with any single datacenter provider.

All in all, I'm very satisfied with them for their specific niche of CONUS cheap dedicated servers.

SoYouStart wins for use cases where being located in France or Canada is acceptable, largely because they have basic DDoS protection on their bandwidth and I host some game related servers - anything gaming related and you will get DDoS attacks semi-regularly. And game servers can't run behind CloudFlare which is my other go to for basic DDoS protection.


Really? Try Googling '256gb ram dedicated'


You're right, I replied too quickly.

> We recently discussed new logging tools at work. It was either a redundant Amazon EC2 cluster with ElasticSearch for $50K monthly, or two large bare metal servers with rsyslog and grep for $400 monthly. The log ingestion and search performance was roughly the same...

If grep has the same search performance as elasticsearch, you should not be using elasticsearch and any comparison is bullshit.


Right, though some people really do reach for heavy-weight tools on datasets that are way too small to justify their use.

But yeah, even ripgrep should get out-performed by any db/search daemon with a decent index.


Your setup is redundant and expensive. logrotate directly to s3 and use Athena.


Out of curiosity, is the 200$/month price tag for colocation?


Where are you getting such a bare metal machine for 200$ a month?


ELK could handle that workload pretty well and isn't too hard to set up. Plus you get a nice web interface. But I still find myself logging in and grepping for things.


Horizontal scaling is also where ES shines. If you don't need that among other things, then there are alternatives yea


But those "hipster" tools can scale. Perhaps you simply haven't hit the right scale yet?


You don't need to scale. You're not Google, Amazon or FB to have a billion daily active users.

And if you manage to make it to that scale, you can certainly hire engineers to refactor those systems. Fix the problem only when you encounter it.

Parsing 1TB of logs daily with a 14-day retention? A desktop with a NAS attached can do that.

The corps that developed these fancy tools, then open sourced them and gave them hipster-ish names? Of course they want engineers out there to use them - it's called vendor lock-in.


Yes, you are right. But there are a lot more use cases: e.g. imagine you run a website for a (small or large) government, or a website for a popular television show. Or perhaps you are a developer at a telecom company in some country. Scale is everywhere.

> you can certainly hire engineers to refactor those systems.

Wouldn't it be better if those engineers used standard tools like Hadoop?


I worked for the largest online news site in Switzerland. We were nowhere near the scale where such tools would provide significant benefit over the "boring", "outdated" technology we were using. We also had a fraction of the operational cost of our much, much smaller competitor within the same publisher that went full hipster on their tech.

Thing is, for a lot of use cases it is a whole lot harder to get to large scale from a business and product perspective than it is for good engineers to adapt to it.


how much dev time would both approaches take?


You are underestimating the value of indexed logs and a Kibana view into them. For example, if you wanted a histogram of error logs for the last 12 days, and then zoom in and see the actual logs.


You are underestimating how much the issues caused by Kibana and a slow ES can cost.

I used to have these nice Kibana diagrams set up with auth and a https proxy so that our team could easily check if the server error rates looked normal. I quickly found out two things:

1. Most of our employees never ever looked at it, dismissing it as "too technical" just from seeing a screenshot.

2. The ELK stack uses backpressure, meaning that when ES can't keep up, the servers running filebeat run out of HDD space.

BTW, with a bit of grep in front, the actual per-service working data sets are small enough that I can load them into R on a 256GB RAM server and produce nice diagrams. You can also script R from the command line and send the diagrams around as a cronjob.
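
A minimal sketch of that kind of pipeline (the log path, grep pattern and timestamp prefix length are made up; adapt them to your own logs):

    grep -h 'ERROR' /var/log/myservice/*.log | cut -d' ' -f1 > /tmp/error_timestamps.txt
    Rscript -e 'x <- readLines("/tmp/error_timestamps.txt"); png("/tmp/errors.png"); barplot(table(substr(x, 1, 13))); dev.off()'

Wrap those two lines plus a mail command in a script, call it from cron, and you have the "diagrams as a cronjob" part.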


Have you used the file input? With such throughput you really don't want it at all: it hangs both its host (because of iowait) and the remote endpoint too if the Beats protocol is used. All of my log traffic is sent as GELF UDP or syslog UDP.


Sure it’s nice and maybe it’s worth 50k/month if you’re running Uber... but grep and sed together can get you quite far.

Admittedly for the histogram you’d have to cut and paste into a spreadsheet but yeah, hardly something you need in real time (probably more for a postmortem presentation to management.)


You can do this with graylog


I did some testing on the same (kind of) dataset and task:

First test: A single 2.9GB file

    time rg Result all.pgn | sort --radixsort | uniq -c
         13 [Result ""]
    1106547 [Result "0-1"]
    1377248 [Result "1-0"]
    1077663 [Result "1/2-1/2"]
    rg Result all.pgn   1.12s user 0.55s system 99% cpu 1.680 total
    sort --radixsort    3.87s user 0.37s system 71% cpu 5.911 total
    uniq -c             2.69s user 0.02s system 45% cpu 5.909 total

Using Apache Flink and a naive implementation, it took 13.969 seconds.

Second test: same dataset, split between 4 files

    time rg Result chessdata/ | awk -F ':' '{print $2}' - | sort --radixsort | uniq -c
         13 [Result ""]
    1106547 [Result "0-1"]
    1377248 [Result "1-0"]
    1077663 [Result "1/2-1/2"]
    rg Result chessdata/         1.70s user 0.97s system 42% cpu  6.292 total
    awk -F ':' '{print $2}' -    5.47s user 0.07s system 88% cpu  6.289 total
    sort --radixsort             4.13s user 0.42s system 43% cpu 10.559 total
    uniq -c                      2.73s user 0.03s system 26% cpu 10.559 total

Flink: 12.724s

Conclusion: For this kind of workload, both approaches have comparable runtimes, even though taco bell programming has the upper hand (as it should for simply filtering a text file). It took me about equally long to implement both. I think both approaches have their use cases.

I ran this locally on my Laptop with 4 logical cores.


Hadoop is very slow because it persists the data to disk before every stage. You really wouldn't want to use Hadoop if you don't have a good reason to. More modern tools like Spark and Flink fare better there.



It's just the annual expedition for HN where everyone gets their turn to be smarter than everyone else.

Even more irrelevant now because Hadoop is largely a dead-end technology.


A classic from 2015 along the same lines: Scalability, but at what COST?

http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...


The author is experimenting with 1.75GB of data. At that scale, sure, a local machine will be faster. Hadoop's real use case, though, is when your data doesn't fit in memory, and even this is kind of debatable. It makes sense to measure the performance with some prototypes and then make a final design rather than just use whatever AWS offers. Besides, packaged services in AWS are also a bit more costly than basic services like EC2 instances and network goodies.


These days you can get servers with Terabytes of RAM, so a lot of people (most?) could fit their data in memory.

I just took a gander at HPE's website, and you can get ProLiant servers with up to 12TB of RAM (you might be able to get them with more; I did not check in detail).


HPE ?


Hewlett-Packard Enterprise


How much does it matter that the data doesn't fit in memory when consumer grade (Samsung 970 Pro) SSDs can deliver 3GB/s sequential read? On inexpensive hardware you can process a ~TB every 5 minutes per SSD.
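
(For reference: 3 GB/s × 300 s ≈ 900 GB, which is where the ~TB per 5 minutes per drive comes from.)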


Well, I know someone in the business where the average database is multiple TB, and they still have to somehow run queries against it; 1TB/minute is just far too slow.


Hadoop is useful at the scale where your data is already distributed, maybe even in different datacenters. At that point, it's faster to push the computation out to where the data is already stored. It doesn't make sense to me when the data starts out all in one place.


This reminds me of my experience from a company-internal hackathon. My colleague started writing a Spark program that would process the data we needed (a few hundred GB uncompressed). Before he finished writing it, I was able to process all the data on a single machine with a Unix pipeline. The computationally intensive steps were basically just grep, sort and uniq. When he finished the program, it couldn't run because of some operational issues on the cluster at the moment, so we never even found out the speed to compare.

For me, the moral is that the cheap hardware saves money/time twice:

1. It's faster if a program can run on a single machine.

2. It's easier to write a program that runs on a single machine.

With this in mind, cloud works great for analytical data processing. Just start a big enough machine, download the data, do the computation, upload the result and turn the machine off. If you develop the program on a sample of the data so you can do it locally, it will even be cheap, because you only use the powerful server for a short time.
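
A rough sketch of that workflow, assuming an AWS-style setup (bucket names and the actual computation are placeholders):

    aws s3 cp s3://my-bucket/raw/ ./data/ --recursive           # download
    grep -h 'Result' data/*.pgn | sort | uniq -c > counts.txt   # compute
    aws s3 cp counts.txt s3://my-bucket/out/counts.txt          # upload the result
    sudo shutdown -h now                                        # stop paying for the machine

Develop the pipeline locally on a sample, then run it once on the big machine.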


Given that you can soon get a beefy 64-core Threadripper-based workstation for under $10K, just running the analysis locally looks like a very decent option.


The two approaches aren't necessarily mutually exclusive. Spark can easily shell out using pipe(). Plus you can use that to compose and schedule arbitrarily large data sets through your bash pipeline across a multi-node cluster.

Beyond that, while the Unix tools are amazing for per-line FIFO-based processing, they really don't do a great job at anything requiring any sort of relational algebra.


join and comm would like a word with you.

You can't match SQL expressiveness, but you can definitely handle set-based stuff.
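
For example (file names are made up; both tools want their inputs sorted on the key column first):

    sort -o ids_a.txt ids_a.txt
    sort -o ids_b.txt ids_b.txt
    comm -12 ids_a.txt ids_b.txt                  # intersection
    comm -23 ids_a.txt ids_b.txt                  # only in A
    join -t, -1 1 -2 1 orders.csv customers.csv   # inner join on the first column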


Wait... are you telling me people over-engineer solutions to ultimately simple problems? You're kidding.


Very simple processing, not memory bound, tiny data set - of course it's going to be faster locally when the command itself takes less time than the networking, distribution, coordination and collation overhead of using any distributed tool...


You know that, I know that, and we can be happy that we have the experience to know what the right tool for this job would be by sizing up and describing the characteristics of the problem like you just did. But those with less experience may not be able to do that unless shown stuff like this in practice.

Some may think this problem requires MapReduce. The quote from the original implementation blog post certainly seems to indicate so.


MapReduce as a paradigm and technology was popular about a decade ago and then died shortly after in favour of Hive, Spark etc.

Pretty confident that not a single developer anywhere in this world would be first thinking of MapReduce. Just like they wouldn't jump straight to Cobol.


> in favour of Hive, Spark

Well, Hive and Spark create MapReduce jobs to satisfy your queries: it's still in the background, but you don't have to think about it (as much).


Some places still seem to go for large Kafka clusters just to calc some stats and forward some messages. I am sure some of their solutions are MapReduce below.


Very curious to understand more about these mythical developers who are recommending a technology/paradigm that stopped being used a decade ago.

You've seen people writing pages and pages of Java code to do ETL ?


I worked for a large multinational last year that had multiple teams rolling out terrible Hadoop solutions.

The rest of the world does not particularly care about SV fashion trends.


I'm very curious to know where you got this strange notion that MapReduce is "a technology/paradigm that stopped being used a decade ago". I barely knew what MapReduce was in 2010, and didn't touch my first Hadoop cluster until late 2012. Every place I have worked since has had a Hadoop cluster or two.


I'm on a team of a Fortune500 who uses a combination of Java and C# to do ETL. AMA.

Edit: We're also a fairly greenfield team, with a product in beta that's less than 4 years old. So like, the leaders of the team knowingly crafted us in this particular direction.


Back in high school I did a map reduce job using split, sort, and a handwritten transformation and accumulation tool. It worked!
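
Roughly the shape of it (map.sh and reduce.sh here stand in for the handwritten transformation and accumulation tools):

    split -l 1000000 input.txt part_                            # partition the input
    for f in part_*; do ./map.sh "$f" > "$f.out" & done; wait   # map each partition in parallel
    sort part_*.out | ./reduce.sh                               # shuffle/sort, then accumulate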


Yes, the article even states that in the first paragraph. Processing a few GB (the example in the post is not even in the two digit range) is easily done with basically any tool. Heck, someone could probably pull off something like that in Excel in a couple of minutes.

I think the point of the article is exactly that: unless you have a couple of TB to process, don't even bother. I remember a similar post in the same spirit that had a couple of GB in a Postgres database and ended with "Just do a simple query!" instead of a Hadoop cluster for simple map-reduce.


You can easily run a workload of 20TB/day on a single Threadripper.

On my last contract we wrote a C tool that did DSP on 10TB/day of data with a latency of 50 microseconds. We were using less than 10% of the CPU for that.


Yeah, the ThreadRippers are amazing. I was truly astonished when I noticed that one hour of V-Ray on 3970X outperformed $1000 in ChaosCloud rendering.

At a price of $2500 once for the TR versus $24,000 per day for cloud, it's a no-brainer to buy them. Sadly, availability in my area is pretty spotty and even Amazon currently predicts 2+ weeks of delivery time.


Not sure where you are getting your costs from.

Threadripper 3970x = 32 cores @ $1999

AWS c5.9xlarge = 36 vCPUs @ $8.34/day (spot)

You could buy a year of AWS for the same price as a Threadripper based PC.


I was comparing with ChaosCloud: https://www.chaosgroup.com/cloud

For EC2, one would need to rent licenses for the V-Ray renderer plus V-Ray tends to be heavily CPU-cache- and memory-bound, so exactly the parts that get slow if you enable the Intel security fixes. Due to c5.9xlarge being Xeons, they are like 80% slower than a comparable AMD CPU.

BTW, calculator.s3.amazonaws.com quotes me $1119 per month for a single c5.9xlarge, so I'm not sure how you would reach $8/day when I see $37/day.


Then switch to the AMD based c5a.9xlarge which is even cheaper.

And you can reach $8/day using Spot or Reserved Instances. Not sure where you get $1119.


what kind of DSP were you doing? can I come work for you haha


Hence the use of the word "can".

The fact that "deal with it locally" was obvious to the author, but not to the people who inspired the author to deal with it that way (he mentions an EMR job), means there are people it may be worth pointing this out to.

Certainly, being reminded that you probably should solve your problem without distribution, unless you actually demonstrate that you CAN'T solve your problem without distribution, is nothing but a positive.


Once you get to the stage where your laptop alone is just not enough anymore (or your laptop has some cores you want to add to the processing as well), GNU parallel might be of use.

https://www.gnu.org/software/parallel/
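
For example, to spread the article's chess counting over all cores (assuming one file per input chunk):

    find chessdata/ -name '*.pgn' | parallel 'grep -h Result {}' | sort | uniq -c

parallel runs one grep per file and keeps every core busy, while the sort/uniq at the end stays a single aggregation step.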


Is there any benefit of

  cat files* | grep pattern
over this

  grep -h pattern files*
aside from result color highlighting?


Even though it's a faux pas, I find it logically makes more sense to me.

Consider a pipeline like

    cat *.log | grep "apache" | grep -v "1.1.1.1" | wc -l 
It makes sense in that order: read the files, grep this, exclude that, count the total. Meanwhile

    grep "apache" *.log | grep -v "1.1.1.1" | wc -l
may be a few characters shorter but the order isn't really logical anymore.


With the first method, you can use xargs to run multiple copies of grep in parallel, like he did later in the article.
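
Roughly like this (pattern and glob are placeholders; adjust -P to your core count):

    find . -name '*.log' -print0 | xargs -0 -n1 -P4 grep -h pattern | wc -l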


The author switched to using find before doing the parallelization.


I've read this, but still can't see a reason to use 'cat | grep':

https://stackoverflow.com/questions/13507889/difference-betw...


I don't think there is.


When all you have is a hammer, every problem starts looking like a nail.

The basic premise is fine: If you have a simple problem, using simple tools will give you a good result. Here you have text files, you just want to iterate through them and find a result from ONE line that's the same in every file, collate the results. No further analysis required.

Every problem in the world can be solved by a bash one-liner, right!?

There's an interesting dichotomy with bash scripts: One school says any bash script over 100 lines should be rewritten in Python, because it's overcomplex already. Another school says any Python script used daily over 100 lines should be rewritten in bash so there are no delusions about it being easy to maintain.

The original article is from 2013, and doesn't try to do any optimization (I guess, the original article is unavailable at the time of writing of this comment), so it would be an interesting question to see what you could do at the Hadoop end to make the query faster. I would imagine quite a lot.


If you can fit your data on a single disk drive, you don't need Hadoop.


The bottom 90% of data users are in the gigabytes range. Anything works.


I've been in a meeting where the amazing scalable cloud solution for the huge data warehouse was laid out. Turned out to be 500GB. Judging by the death stares I got I don't think I was supposed to say "Wow, the whole thing would fit on the 512GB SD card I bought last week".


We had a poorly performing service which read from a number of REST endpoints and wrote to S3 in a date-prefixed format. The offshore team wrote 3,600 lines of code targeting Kinesis Firehose. By just piping the URL endpoints to a named pipe and cycling the S3 file in Python, my code was 55 lines and did the same thing without Kinesis. Wrapping things in GNU parallel and using bash flags, it handles failure cases gracefully, which is something the offshore code did not do. The India offshore code had a global exception catch-all and would print the error and return a success exit code... but I guess someone got to put Kinesis on their resume.
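
Not the actual code, but a very rough sketch of that shape (endpoint list, chunk size and bucket are all made up):

    set -euo pipefail
    mkfifo /tmp/ingest
    parallel -j8 'curl -sf {}' < endpoints.txt > /tmp/ingest &
    split -l 500000 - /tmp/chunk_ < /tmp/ingest                  # cycle the stream into chunks
    for f in /tmp/chunk_*; do
      aws s3 cp "$f" "s3://my-bucket/$(date +%Y/%m/%d)/$(basename "$f")" && rm "$f"
    done

The set -euo pipefail line is the "bash flags" part: a failed upload aborts the script instead of exiting 0 like the catch-all did.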


I maintain a very small command-line cheatsheet here that I keep coming back to for reference, mostly for data analysis tasks: https://tinyurl.com/tomercli


It's not accessible. Can you publish it?



Been saying that for years. Also, get this, 99.999% of companies do not need "big data" or distributed systems of any kind. I feel like the old "cheap commodity hardware" pendulum swung way too far. More expensive, less "commodity" hardware can often be cheaper, if correctly deployed. I.e. you don't need a distributed database if your database is below 1TB and QPS is reasonable (and what's "reasonable" can surprise you today with large NVME SSDs, hundreds of gigabytes of RAM, and 64-core machines being affordable).


This was a straw man article in 2014, it was a straw man article the other times it’s been posted to HN in the intervening years, and it’s still a straw man article in 2020. As noted in another comment here, the contemporary technology of Apache Flink really isn’t far off command-line tools running on a single machine. Meanwhile, HDFS has made a lot of progress on its overhead, particularly unnecessary buffer copies. There are datasets where a Hadoop approach makes sense. But not for ones where the data fits in RAM on a single system. No one has ever argued that.


While I personally would use a similar pipeline as OP for such a small data set, saying Hadoop would take 50min for this is just flat wrong. It shows a clear lack of understanding of how to use Hadoop.


Amen. You can do a lot with pipes, various utils (sed, awk, grep, gnu parallel, etc.), sockets, so on and so forth. I see folks abuse Hadoop way too often for simple jobs.


I am always tempted to say too that "vim can be faster than IDE x"... But I guess that is a bit more subjective.


That's because Hadoop is a big favorite of wining and dining suits, who scram at the sight of the command line.


If you're disappointed with the speed and complexity of your Hadoop cluster, and especially if you're trying to crack a bit, you should give ClickHouse a spin.


>crack a bit

What does that mean? I don't understand if you're trying to endorse ClickHouse or make fun of it.


Phone typo. Should have been 'nut'.

And yes, I'm endorsing ClickHouse; it scales down much better than Hadoop.


If you're doing Spark or Hadoop today and are a Python shop... you should definitely look at Dask: https://dask.org/

It works as well as Spark. Very lightweight. Works through Docker.

Integrated with Kubernetes from the ground up (runs on EKS/GKE, etc).

And no serialization between Java/Python, no fat-jar stuff, etc.


Command-line tools like grep, awk, sed, etc. are great for structured, line-based files like logs. For JSON documents I can add a recommendation for jq:

https://stedolan.github.io/jq/
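
For example, counting structured log lines by a field (the field names here are placeholders):

    jq -r '.status' access.json | sort | uniq -c | sort -rn   # histogram of a field
    jq -c 'select(.level == "error")' app.json | less         # filter to error records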


Cloud computing is kind of a joke. Yeah, keep paying someone for shared "virtual computers" - that sounds suspiciously similar to shared hosting from a decade or two ago... Oh, but this is different, you get isolation from containers/VMs! Yeah, OK, meanwhile new exploits emerge every couple of weeks. It's like tech-debt ideology on steroids... just keep pumping out instances until the company either goes hyperbolic or goes bankrupt. Realistically, just buy a few physical servers and actually work to build efficiency into the system instead of just throwing compute at your public-facing web app.

I recently bought a Dell R710 just for fun and was pleasantly surprised that even days after spinning up a bunch of VMs, I don't have a 30GB logfile of failed attempts at getting into my instance (this was my recent experience with two cloud providers!).

It'll be interesting to see how the market reacts when there's a "first-of-its-kind" massive, massive security breach that affects popular "pure play" internet companies hosted on top of the mythical "cloud."

Seriously, $READER, look at your cloud-computing-dependent startup and calculate the egress costs for your storage, as if you HAD to stop using the cloud tomorrow. How much does it cost you? How could you adapt? It's designed to keep you dependent on third parties... Idk, IMO it is really not great.


I see one really positive point in cloud computing, and that is that I can sleep soundly at night :)

Of course, cloud is overpriced, slow, and suffers from noisy neighbors. And keeping things running in the cloud is about the same amount of work as keeping it running on bare-metal. But for customer-visible things, I want to use cloud so that someone else has to get up in the middle of the night when apache crashes.

Sleeping peacefully makes it worth it for me to pay $5,000 monthly to Heroku when 2-3x $100 bare metal servers would do. Plus I can cheaply insure against supplier negligence, whereas insuring against employee negligence would be much more expensive.


Also important: turning over a finished project to a client who just wants to run the workload and doesn't want to hire full time sysadmins.


Noisy-neighbor tracking can be automated and partially mitigated, though. With cloud you gain access to the provisioning API, and with a Terraform/Ansible stack (for example) you are able to build up and manage infra quickly, efficiently and declaratively. Bare metal provisioning can also be automated (via a private cloud solution, for example), but you need a dedicated team for that (and good OpenStack people aren't cheap). I was solo-managing 500+ hosts once on a public cloud; there is no chance you can do that without what you call "hipster tech" and a modern devops toolchain.


I like to think that most innovation in IT is driven by people being woken up at 3am..


Innovations like... the cloud?


Yeah, and when a server dies at 2am on a Sunday, nobody is going to get calls from our partners asking why their clients cannot pay, why they are losing money because of us, and when our website is going to be back up at full capacity. Because the autoscaler dealt with it, and all I got was a text that a faulty instance is gone and a healthy one has been brought up.

I wonder how you would solve that with a few physical servers.


> I wonder how you would solve that with a few physical server.

They won't. I think a lot of the people who are skeptical of cloud computing have never worked on a product that serves millions of users every hour and gets smashed during rush hours. To me, "why not just get a bunch of bare metal computers" is laughable.


> I recently bought a DELL r710 just for fun and was pleasantly surprised how even days after spinning up a bunch of VMs, I don't have a 30gb logfile for all the failed attempts at getting into my instance (this was my experience recently with 2 cloud providers!)

That's weird. I run a couple of servers myself, recently installed Asterisk on one of them and in under a minute it was under a constant barrage of automated scanners.


I guess our firewall configuration must be different. What are you using to run VMs, if that's your use case? I'm using proxmox, but I'm curious to hear what your setup is like.


Bare metal at Strato. I have fail2ban in place, but still - services that you need to have reachable on the Internet on standard ports (web, Asterisk) will normally be plagued by automated scanners the second you open the ports.

Too bad that it never got implemented in DNS to specify the ports there (e.g. example.com would have an A record for the IP address and another record for the port)...



