Generally when they cross the international date line - for example when traveling from Sydney to San Francisco, I can take off at midday and land at ~7am on the same day local time.
Many years ago, I was writing Clojure for a job and took to finally learning emacs beyond just saving a file and closing the editor.
It took about a week to finally get used to paredit-mode (and rainbow-parens, because…) but once I did it is the most productive I have ever felt, I keep trying to find that experience again.
Unfortunately the dynamic types and slow build system (lein) made the build step quite tedious (this was 11-12 years ago) - but in terms of expressing ideas, man - paredit + lisp style was amazing!
I've worked places where it would be 1000x harder getting a spare laptop from the IT closet to run some processing than it would be to spend $50k-100k at Azure.
Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?
I’ve heard this anecdote on HN before but without ever seeing actual evidence it happened, it reads like an old wives tale and I’m not sure I believe it.
I’ve worked on a Hadoop cluster and setting it up and running it takes quite serious technical skills and experience and those same technical skills and experience would mean the team wouldn’t be doing it unless they needed it.
Can you really imagine some senior data and infrastructure engineers setting up 100 nodes knowing it was for 60GB of data? Does that make any sense at all?
each node in our hadoop cluster had 64GiB of ram (which is the max amount you should have for a single node java application, where 32G is allocated for heap FWIW), we had I think 6 of these nodes for a total of 384GiB memory.
Our storage was something like 18TiB across all nodes.
It would be a big machine, but our entire cluster could easily fit. Largest machine on the market right now is something like 128CPU's and 20TiB of Memory.
384GiB was available in a single 1U rackmount server at least as early as 2014.
Storage is basically unlimited with direct-attached-storage controllers and rackmount units.
I had an HP from 2010 that supported 1.5TB of ram with 40 cores, but it was 4U. I'm not sure what the height has to do with memory other than a 1U doesn't have the luxury of the backplane(s) being vertical or otherwise above the motherboard, so maybe it's limited space?
Theres different classes of servers, the 4U ones are pretty much as powerful as it gets, many sockets (usually 4) and a huge fabric.
1Us are extremely commodity, basically as “low end” as it gets, so I like to use them as if they are a baseline.
A 1U that can take 1.5TiB of ram might be part of the same series of machines that might have a 4U machine that could do 10TiB. But those are hugely expensive. Both to buy and to run
> Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?
I was a SQL Server DBA at Cox Automotive. Some director/VP caught the Hadoop around 2015 and hired a consultant to set us up. The consultant's brother worked at Yahoo and did foundational work with it.
Consultant made us provision 6 nodes for Hadoop in Azure (our infra was on Azure Virtual Machines) each with 1 TB of storage. The entire SQL Server footprint was 3 nodes and maybe 100 GB at the time, and most of that was data bloat. He complained about such a small setup.
The data going into Hadoop was maybe 10 GB, and consultant insisted we do a full load every 15 minutes "to keep it fresh". The delta for a 15 minute interval was less than 20 MB, maybe 50 MB during peak usage. Naturally his refresh script was pounding the primary server and hurting performance, so we spent additional money to set up a read replica for him to use.
Did I mention the loading process took 16-17 minutes on average?
You can quit reading now, this meets your request, but in case anyone wants a fuller story:
Hadoop was used to feed some kind of bespoke dashboard product for a customer. Everyone at Cox was against using Microsoft's products for this, while the entire stack was Azure/.Net/SQL Server...go figure. Apparently they weren't aware of PowerBI, or just didn't like it.
I asked someone at MS (might have been one of the GuyInACube folks, I know I mentioned it to him) to come in and demo PowerBI, and in a 15 minute presentation absolutely demolished everything they had been working on for a year. There was a new data group director who was pretty chagrined about it, I think they went into panic mode to ensure the customer didn't find out.
The customer, surprisingly, wasn't happy with the progress or outcome of this dashboard, and were vocally pointing out data discrepancies compared to the production system. Some of them days or even a week out of date.
Once the original contract was up, and time to renew, the Hadoop VP now had to pay for the project from his budget, and about 60 days later it was mysteriously cancelled. The infra group was happy, as our Azure expenses suddenly halved, and our database performance improved 20-25%.
The customer seemed to be happy, they didn't have to struggle with the prototype anymore, and wow, where did all these SSRS reports that were perfectly fine come from? What do you mean they were there all along?
In 2014 I was at Oracle Open World. A 3rd party hardware vendor was saying (and having customers) for Hadoop "clusters" that had 8 cpu cores. Basically their pitch was that Oracle Hardware (ex sun) started at a dense full rack of about a 1 million USD or so, but with the 3d party you could have a hadoop "cluster" in 2U and for 20K. The oracle thing was actually quite price competitive at the time, if you needed hadoop. The 3rd party thing was overpriced for what it was.
Yet, I am sure that 3rd party hardware vendor made out like bandits.
I worked at a corp that had built a Hadoop cluster for lots of different heterogeneous datasets used by different teams. It was part of a strategy to get "all our data in one place". Individually, these datasets were small enough that they would have fitted perfectly fine on single (albeit beefy for the time) machines. Together, they arguably qualified as big data, and justification for the decision to use Hadoop was because some analytics users occasionally wanted to run queries that spanned all of these datasets. In practice, these kind of queries were rare and not very high value, so the business would have been better off just not doing them, and keeping the data on a bunch of siloed SQL Servers (or, better, putting some effort into tiering the rarely used data onto object storage).
I wonder if companies built Hadoop clusters for large jobs and then also use them for small ones.
At work, they run big jobs on lots of data on big clusters. The processing pipeline also includes small jobs. It makes sense to write them in Spark and run them in the same way on the same cluster. The consistency is a big advantage and that cluster is going to be running anyway.
Moore's law and its analogues makes this harder to back-predict than one might think, though. A decade ago computers had only had about an eighth (rough upper bound) of the resources modern machines tend to have at similar price points.
This is exactly the point of the article. From the conclusion:
> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.
This will not stop BigCorp to spend weeks to setup a big ass data analytics pipeline to process a few hundred MB from their „Data Lake“ via Spark.
And this isn’t even wrong, bc what they need is a long-term maintainable method that scales up IF needed (rarely), is documented and survives loss of institutional knowledge three layoffs down the line.
Scaling _if_ needed has been the death knell of many companies. Every engineer wants to assume that they will need to scale to millions of QPS, most of the time this is incorrect, and when it is not then the requirement have changed and it needs to be rebuilt anyway.
I think it completely matters - yes these orgs are a lot more wasteful, but there is still an opportunity to save money here, especially is this economy, if not for the internal politics wins.
I’ve spent time in some of the largest distributed computing deployments and cost was always a constant factor we had to account for. The easiest promos were always “I saved X hundred million” because it was hard to argue against saving money. And these happened way more than you would guess.
> I’ve spent time in some of the largest distributed computing deployments
Yeah obviously if you run hundreds or thousands of severs then efficiency matters a lot, but then there isn't really the option to use a single machine with a lot of RAM instead, is there?
I'm talking about the typical BigCorp whose core business is something else than IT, like insurance, construction, mining, retail, whatever. Saving a single AKS cluster just doesn't move the needle.
Yeah I see your point where it just doesn’t matter, especially back the the original point where it may not be at scale now, but you don’t want to go through the budget / approval process when you need it etc.
I think my original point was more in the “engineers want to do cool, scalable stuff” realm - and so any solution has to support scaling out to the n’th degree.
Organisational factors pull a whole new dimension into this.
I mean yeah, definitely - it blows my mind how much tolerance for needless complexity the average engineer has. The principal/agent mismatch applies universally, and beyond that it is also a coordination problem - when every engineer plays by the "resume driven development" rules, opting out may not be best move, individually.
The long term maintainability is an important point that most comments here ignore. If you need to run the command once or twice every now and then in an ad hoc way then sure hack together a command line script. But "email Jeff and ask him to run his script" isn't scalable if you need to run the command at a regular interval for years and years and have it work long after Jeff quits.
Some times the killer feature of that data analytics pipeline isn't scalability, but robustness, reproducibility and consistency.
> "email Jeff and ask him to run his script" isn't scalable
Sure, it's not.
But the only alternative to that is not building some monster cluster to process a few gigabytes.
You can write a good script (instead of hacking one together), put it in source control and pull it from there automatically to the production server and run it regularly from cron. Now you have your robustness, reproducibility and consistency as well as much higher performance, for about one-ten-thousandth of the cost.
Oracle acquired Java with Sun Microsystems, it was originally designed for embedded systems and the dream of “write once, run everywhere”.
The idea of a “hardware JVM” always fascinated me, I seem to recall some parallax microcontrollers that could run a subset of jvm bytecode back in the 90s, but never actually got to play with them.
I don't think you can do better than a binary search here, so log2(1000000) questions, which happens to come out to 20 with rounding.
I tried thinking a bit outside the box for other questions that would break down the search space - but none of them were any improvement on the binary search.
This was winding down about 12 months ago, unsure of the current status of it.
It's the same model they use for operating in China.
From an operational perspective this makes things very hard - as you don't actually get access to the services - you need to run an operator from the independent company through the troubleshooting / mitigation for any incident.
It wasn't a point upgrade - Confluence 4.0 got rid of markdown in favour of an "XHTML" storage format, then a layer on the editor to autocomplete markdown into the rich-text as you typed.
Personally I preferred the ability to edit pages as markdown, and we toyed with ways to allow users to edit the rich pages in markdown (i.e. a new transformer from the storage format to wiki-markup, instead of the editor format) - but it wasn't possible to not come up with a diverging solution for the transformations here (at the time, given the time we had spent on it).
Whilst we copped a lot of flack from the die-hard Confluence users when this was done, the vast majority of users liked the change. Also from a different angle, this was a really fun time at Atlassian - we had run out of space in the CornX (office at the time) so we had leased the floors above the pub next door (The Dundee Arms), and there were five of us in there, all with out machines crammed around a single table in this tiny room. Good times.
It was a death knell for us. We moved all our documentation into Git, which the devs found more convenient and usable. I think there might’ve been some survival bias: the people who kept using it after the breaking change liked it. The people who dropped it immediately because it ruined their workflows just moved on (or put it in read-only mode).
Oh completely agree regarding the survival bias - I think the over-arching plan was to make it more accessible to more than just dev teams, and it seems to have worked. "Like MS Word" was thrown around a bit.
Congrats on the completion! I graduated from this program in May of 2021 after about a 15 year gap from my undergrad, coupled with two young kids, a job change, moving across the world, and military service in the middle of it!
Agree some of it was brutal, but I really enjoyed it all as a whole. The key to me getting through was being disciplined with when I would do the work / watch the lectures. I read the book Deep Work by Cal Newport just as I started the program in 2017 and I think it really helped to set me up for success here.
I'm glad it's done, and I really enjoyed some of the subjects - a couple I didn't get to cover in my undergrad - especially HPC, GA and AOS were my favorites.
https://www.xreal.com/air2/