Wait, so Ubisoft spent $1M on a company-wide backup system that apparently worked great, it's just that this sysadmin wanted incremental backups and the $1M system wasn't built to support that (blazingly fast ingest, slow online reads). So the sysadmin had to fight a lot to get dedicated hardware to do incremental backups, and eventually got it. I'm not sure Ubisoft is in the wrong at any point here?
Everything else mentioned makes me want to not work at Ubisoft, and roughly matches what I would have expected, just worse. The disregard for developers, the absurd NIH syndrome, etc. But I don't get the headline of the article.
Agreed. This reads like the $1 million system did what the company needed it to do (safely archive code to prevent more loss of old games) but it didn’t do exactly what this developer wanted.
There are various good points scattered in the article for the author’s specific use case, but it’s written as if the entire company was mistaken to not make this decision revolve around this one developer.
Hi, author here. Thanks for reading my poorly written rant!
The issue was that nobody ever said what the system was designed for and equally nobody was open to the idea of doing things differently until things went pop.
The main thesis of the article is supposed to be that it doesn't matter what something costs or how much money is invested if it doesn't solve your needs.
I wrote it a long time ago in a fit of aggravation over someone on Hacker News waxing poetic about how much Amazon invested in their security, without much regard as to what it was actually spent on.
Sorry, but it just looks like you picked the wrong solution from the start and stuck with it.
Vanilla pgsql backup + WAL shipping would just. work. with those constraints.
Backup solutions that take time to read data and only "ramp up" once you start restoring a full backup are nothing new; they have literally been in the industry for decades, first in the form of tape libraries, now in the form of Amazon Glacier and similar. Hell, a backup solution where you can mount the whole backup as a directory is on the fancier side, because generally priorities are elsewhere.
Very likely you're right. As mentioned in another comment (and in the article, I think), I didn't know the semantics of the backup system; I'd just tested with a couple of 400G HDDs - then the NFS endpoint I was given to replace those drives was behaving differently, and a deep investigation across wide timezones and uncommunicative teams began.
We have the benefit of hindsight now, so things can be clearer than they were at the time. Nonetheless:
WAL shipping without ever reading it back sounds optimistic; I don't personally believe in backups that are never verified.
It's particularly nice as it lets you do point-in-time recovery - you can tell PostgreSQL to replay WALs up to a given point, so if, say, corruption happened because of some code bug, you could play the database forward to the minute before.
The backup process has two parts:
* archive WAL segments as they come in - PostgreSQL has a hook that will run a program of your choice for each WAL segment to archive, so you just need to plug in whatever you want to back it up with
* start a backup from PostgreSQL - that will stop it writing to the DB files and only write to WALs - and just copy the database directory files. Then tell PostgreSQL to resume writing to them. No need for anything fancy like file snapshots.
So it's just copying files, really.
Restore is just restoring the above and feeding it WALs up to the chosen point.
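Roughly, and glossing over version differences (the exact function names and where the restore settings live have changed across PostgreSQL releases), the moving parts look something like this - paths here are made up:

  # postgresql.conf: ship every completed WAL segment somewhere safe
  wal_level = replica
  archive_mode = on
  archive_command = 'cp %p /backup/wal/%f'     # or rsync/scp to another box

  # base backup: tell PostgreSQL a backup is starting, copy the data
  # directory, tell it you're done (pg_basebackup packages this up for you)
  psql -c "SELECT pg_start_backup('nightly', true);"
  rsync -a /var/lib/postgresql/data/ /backup/base/today/
  psql -c "SELECT pg_stop_backup();"

  # restore: drop a base backup in place, then point recovery at the
  # WAL archive and at the moment just before things went wrong
  restore_command = 'cp /backup/wal/%f %p'
  recovery_target_time = '2024-01-01 12:00:00'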
We also run a slave, so the master being shot would only kill running transactions. Fancier setups I've seen also run a "delayed slave" - a slave that replays non-current WAL, so it basically presents a view of the database from, say, 1 hour or 1 day ago. That way, if something fucks up the DB, you already have a server that is running; you just need to replay WALs to the chosen point.
> I don't personally believe in backups that are never verified.
We ended up making a backup-job lottery: pick a random job out of the system and send an email to the ticketing system - "hey, admin, restore this job for testing". So far it has worked.
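It's basically a one-liner in cron; something like this would do (the job list path and address are made up):

  # once a month: pick a random backup job and open a restore-test ticket
  shuf -n 1 /etc/backup/job-list.txt \
    | xargs -I{} echo "hey, admin, restore job {} for testing" \
    | mail -s "Backup restore lottery" helpdesk@example.com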
One system also has its restore indirectly tested, as the production database is routinely anonymized and fed to the dev server.
We've also baked it into the automation we use for deploying stuff, so for most things not backing up is harder than backing up. Still, accidents have happened...
Yes, and that's how I know you never worked in the game industry. A lot of companies use Perforce; Perforce is a source control system that also stores assets and metadata, which you also have to back up and query.
I'm a bit green when it comes to DB backups. Do you have anything I can read on this approach? It sounds really promising, but I don't think I understand it.
I got that from the article, and it didn't come across as too ranty to me - just standard talk about what a pain something was to do in an organisation.
Maybe it's a different-country thing - devs in the US might be expected to be a lot more positive; in other countries we find complaining cathartic, and this comes across as standard chat about work.
My manager does almost nothing. I wonder why they keep him around. He has like two reports and we're both fully self-directed. He can't understand technical issues, and whenever he proposes something, it's completely untenable due to his lack of understanding of what's useful or possible.
It is definitely possible they indeed don't do anything.
Fwiw though - a decade ago when I was a sysadmin I had a manager that I was certain never did anything.
And then he was replaced.
And then... We realized how much politics, uncertainty, churn, screaming, changing requirements, ambiguous priorities and other crap he protected us from :-/
Not saying it's the case with your manager. But managers have duties, roles, and priorities that I, for one, hadn't always appreciated.
> And then... We realized how much politics, uncertainty, churn, screaming, changing requirements, ambiguous priorities and other crap he protected us from :-/
Indeed, have experienced that. At my best job experience ever, there was zero politics or distractions, I could focus all day long on interesting and relevant technical problems. I was both very happy and extremely productive.
And then.. my manager left. Whoa! The following months were insane mudslinging and relentless politics thrown at us in the team, could barely get any work done. Left the company soon after.
Turns out our previous manager was absorbing and firewalling all that from us so we could be happy and productive.
A great manager is like a great security team. You'll never notice what they're doing until they're not there.
The only bad managers I've ever really had are the ones that don't realize they're responsible just as much as their reports when shit goes sideways. To that point, I've had technical and non-technical managers and it doesn't matter, we can speak a different language but as long as we realize we're on the same boat and heading in more or less the same direction things are great. This requires trust, understanding of goals not implementation, and empathy.
Almost every bad manager I've had basically lacked at least two of those last three things if not all three.
I will say, the one big advantage of a technical manager is they can often be used as a good rubber duck for logistics around changes. Should we do {A, B, C} or {X, Y, Z}? You can get into the technical with them and explain the intricate trade-offs of the decision and how that might impact things. Their technical and political knowledge of the situation can give very good clarifying answers.
> And then... We realized how much politics, uncertainty, churn, screaming, changing requirements, ambiguous priorities and other crap he protected us from :-/
Over the years I've noticed that like 80% of those are because other managers are terrible, and it so happens that the dept. I'm in takes some of the fallout of their mess.
It's interesting you should say this, because this is one of the categories of bullshit jobs as defined by David Graeber in his book. And incidentally, only yesterday I was listening to him on a podcast[1] speaking about BS jobs.
His favourite example is corporate lawyers. Companies are forced to hire them just because other companies have them and they are in fear of getting sued. I hadn't thought of managers as falling into this category, but it makes sense.
Enabling people to be fully self directed and successful is a rare skill. Often it is done by diminishing their own importance; putting you on a good path and then getting out of your way; and giving you all the glory
A good manager should protect you from the bullshit from above them, while making sure you are still known to exist and get recognition for your work from above.
And a lot of that stuff is inevitable at larger companies, so you either go your own way or be prepared to deal with it as part of your job requirements without a manager shielding you.
One of the lessons I've learned though is that...
It's not all red tape and inefficiency. The big light bulb moment for me was realizing that users, customers, stakeholders, clients, etc., simply and validly may have different priorities and goals. That's not necessarily dysfunction. We are not ants. But it's the manager's job, not the sysadmin's or the coder's, to resolve those different goals and help provide unambiguous and stable requirements to the team. This may not immediately be obvious as contributing value to a person building a server or coding a function. It certainly wasn't to me... until it didn't happen anymore :-)
Exactly. A manager sitting by and letting their reports do whatever they want is hardly a worst case scenario. So many are actively sabotaging the work their team does.
If you think your manager doesn't do anything he's probably one of the good ones. Possibly not one of the great ones. But still, often one of the best things those who manage can do is get out of the way and allow their reports to do their jobs properly.
I'm beginning to think that making more money by becoming a better developer is hard, and that managing, while difficult in a totally different way, is less hard. At least for the money.
I think people often convince themselves that becoming a manager is a path for those who couldn’t hack it. In reality it’s a completely different set of problems. If you’re a decent developer with phenomenal soft skills, there is a very good chance you’d be better in management or sales than writing code.
My own observation is that it's more important for managers to be well-organized, remember details, and keep track of things.
Not sure if that's what you mean by soft skills, but I generally think of soft skills as things like the ability to communicate well, navigate conflict, and apply emotional intelligence to people management.
The "soft skills" as I understand them are more important for roles like team lead/staff, whereas the organizational skills (and maybe even a willingness to be unlikeable at times) are more important for management.
That's why I think a lot of engineers prefer the IC track (I personally think my organizational skills and focus don't qualify me for management, although I enjoy being a lead)
> My own observation is that it's more important for managers to be well-organized, remember details, and keep track of things
It's interesting, I feel like this applies to a line manager. Once you get beyond that you can just hire to fill in your own weaknesses. If you're bad with organization, just hire a good project manager.
I guess that's how you end up with the Peter principle. Even just moving from being a line manager to being a manager-of-managers is a different job that requires different skills, but the assumption is that if you're a good line manager then the next step is to manage managers. There are probably plenty of people out there who would make fine directors but are bad managers, so they don't get that chance. And then you have people who are good managers who end up as bad directors.
In my experience quite a few managers are there to do cargo-cult activities, mostly involving PowerPoint presentations and planning for their managers. In the worst situation, it seemed like the vast majority of effort and salary spend went towards satisfying the VP's appetite for reporting; when engineers were very understaffed, they focused on expanding to new layers of management instead of hiring ICs.
Some of this management can be useful, but often a lot of it is not, at best. As organizations age, companies become mostly about advancing the careers of middle management, and the things the company actually does become secondary.
The sad part is they think that all the reporting and metrics, and the juking of the stats which goes along with it, is making things better. In reality it’s generally keeping things the same but with a pretty coat of paint.
My experience with good managers at large companies is that they do a lot of work to make sure the rest of the company doesn't get in the way of their developers. Which, yes, mostly involves a lot of reporting and meetings with other people in the company. But it is probably more important for the productivity of their reports than what managers do at smaller companies, even if it seems like it accomplishes nothing.
I think they would spend their time better on improving processes and removing obstacles, but that almost seems taboo to point out. I have seen tons of scrum retrospectives that got ignored as soon as the problems were outside the reach of the immediate team. In the end you can do as much estimation and reporting as you want, but things still get done by doing them, not by talking about them.
Management and Sales also don't usually end up on the on-call rotation. While there are exceptions, usually particularly key individuals, in general developers are at the bottom of the white collar company social ladder. Thus status conscious individuals will view management as an upgrade from development.
Your conclusion sounds an awful lot like the people in your first sentence are right. Anecdotally, this is what I’ve seen. As people progress on the engineering side and the work gets more demanding, increasingly people bail out and make the transition to management.
Having been both, being a really good manager is hard in a very different set of ways.
Depending what itch you're trying to scratch, and what opportunities are available in your org, it might be just as rewarding to be a tech lead, feature lead, mentor, cross-team liaison, steering committee member, or possibly other roles rather than team lead / manager.
The only time I'm OK with being in a manager role is in my own company. Working for someone else, you couldn't pay me enough to do that job. That's because I'm a dev to the core, that's where my interest and skills are. Being a manager is a different beast entirely and, while I can do it, I find it very unpleasant.
I think what I struggled with most in my time managing a small team was feeling that I lacked agency.
Particularly with "soft" objectives like improving morale, building team cohesiveness, helping people set/advance career goals, etc, I just didn't feel like I had a ton in terms of levers to pull and resources to draw on. And I lacked the experience and training to know how I could either advocate for those resources, or achieve the objectives in ways that wouldn't have a lot of cost and disruption attached. The leadership course I took was focused on personality type assessments, and conflict resolution as a very abstract kind of exercise ("try to see and validate all sides, practice non-violent communication") rather than a concrete, like, here's how to navigate when your direct report says they're quitting if they don't get a 20% raise, and upper management says lol no way.
Anyway, the pay scale at my org has equivalent IC roles at each manager grade, and in the end I've found I can have more of the type of influence and leadership that I desire by exerting soft power over the decision making processes in the areas that I most care about— which are mostly related to technical direction and strategy.
I mean... the whole point is that that stuff is not a titled role. It's responsibility you can choose to take on as a senior IC that helps extend your reach, influence, and visibility into what is going on elsewhere in the company, making your voice that much more valuable and insightful when you're pulled into those meetings to talk long-term technical direction and strategy.
I don't think it's too hard to imagine how to explain this in a way that might excite a child, but at the end of the day, they look to us for their cues on whether to be excited about Daddy's promotion, whether it's to "Team Lead" or "Staff Engineer".
> I mean... the whole point is that that stuff is not a titled role. It's responsibility you can choose to take on as a senior IC that helps extend your reach, influence, and visibility into what is going on elsewhere in the company, making your voice that much more valuable and insightful when you're pulled into those meetings to talk long-term technical direction and strategy.
> I don't think it's too hard to imagine how to explain this in a way that might excite a child, but at the end of the day, they look to us for their cues on whether to be excited about Daddy's promotion, whether it's to "Team Lead" or "Staff Engineer".
I genuinely can’t tell if you’re serious or trolling.
Poe’s Law is in play, I guess.
Fwiw, the eyes of every child I have ever been around always glaze over whenever an adult whips out corporate speak. I think there is a good reason for this.
That is generally true. In most companies it’s way easier to get promoted to a certain salary level in the management track. From my experience the ceiling on the technical track is way lower than on the management track for people with similar levels of competence.
Underrated comment. Many people believe their manager does nothing... until they become a manager and realize all the BS their manager protected them from on the daily.
So the idea isn't that the manager is incompetent, but that his managers are so bad they are a net negative and cause so much disruption that a full-time employee is required to deal with it.
Alternatively: The broader organization has requirements around accounting, planning, forecasting, resource allocation, etc. that are orthogonal to the engineer's job but critical to the company for legal, regulatory, and financial reasons, and managers exist in part to handle that stuff so that the engineer does not need to.
They could all be great managers and literal geniuses and it wouldn't matter. Each person you add to an org increases the number of connections and communications required to do something.
To get anything done I have to get Safety, and Compliance, and Marketing, and Sales, and Project Management, and Vendor Management, and Product Management, plus the Product Dev Team, plus any sister Dev Teams, plus QA involved. Also gotta check with Security, and confirm our Infra team can host it.
Inevitably these people will have their own things going on and may not be available. They will have to wrestle with their own complexity. This leads to meeting after meeting. Then there will be paperwork. A lot of it. Maybe I don't need most of those teams involved but I need sign-offs about that.
Plus each of those teams is there for a reason. If I ignore Compliance we get sued -- and each lawsuit is going to cost a lot more than the cost of those meetings. If Marketing isn't involved then the product launch will screw up and the offering won't go anywhere. If QA isn't there then the product will suck, or maybe screw up catastrophically.
Even if every single manager, every single C-level, every single team lead and Sr., is a bona fide genius, it's still going to be a mountain of lame, pedantic work. If there were a way to ditch that then I'm sure, somewhere, one of these MIT Berkeley Disrupter-types would have done so, but the reality is that's how any org larger than like 100 people is going to work.
It’s the peers and stakeholders that are the problem; no one knows what they need, and everyone knows what they want, and if they aren’t responsible for the work they’ll always try to include everything and declare it a must-have. And you’ll get those who escalate over nothing, people actively working against you (because they disagree with the project goals and got overridden, or wanted to own it themselves), people who are necessary but apathetic / busy, people who are angry about something else but take it out wherever they get the chance to, changing requirements and sudden timeline shifts/deadlines because of events outside of your control… etc
At least when there’s someone with strict authority involved, these things can be resolved pretty quickly — everyone defaults to authority. But that’s rare; most projects involve multiple domains, each with authority over their dominion, and opinions on what the others need to be doing (and all trying to minimize their own responsibility and risks, by both simplifying the project, but also offloading it onto their peers).
Ultimately coordination between parties is inherently a bitch.
well look, I already told you, I deal with the customers so the engineers don't have to. I have people skills. I'm good at dealing with people, can't you understand that? What the hell is wrong with you people?
If your work is getting done, things aren't falling apart, and he leaves you alone, then he's doing a great job, whatever he does. At the very least his presence will keep other managers from meddling.
My manager isn’t a programmer but uses ChatGPT to educate himself on a topic and attempts to read the code to see what’s happening before asking one of us if anything is still unclear.
It definitely can do this, though obviously it can get details wrong here and there. For example:
Describe what this code does in clear english:
from datetime import datetime
import pytz
utcmoment_naive = datetime.utcnow()
utcmoment = utcmoment_naive.replace(tzinfo=pytz.utc)
# print "utcmoment_naive: {0}".format(utcmoment_naive) # python 2
print("utcmoment_naive: {0}".format(utcmoment_naive))
print("utcmoment: {0}".format(utcmoment))
localFormat = "%Y-%m-%d %H:%M:%S"
timezones = ['America/Los_Angeles', 'Europe/Madrid', 'America/Puerto_Rico']
for tz in timezones:
    localDatetime = utcmoment.astimezone(pytz.timezone(tz))
    print(localDatetime.strftime(localFormat))
ChatGPT Response:
This Python code does the following:
1. Imports two modules, datetime and pytz.
2. Gets the current time in UTC using the datetime module and assigns it to a variable named 'utcmoment_naive'.
3. Uses the pytz module to create a timezone-aware datetime object and assigns it to a variable named 'utcmoment'.
4. Prints the values of the 'utcmoment_naive' and 'utcmoment' variables.
5. Defines a date format to be used for printing local date and time.
6. Defines a list of three timezones.
7. Loops over the list of timezones and converts the UTC time to the local time in each timezone using the 'astimezone' method of the 'datetime' object and the 'pytz.timezone' method.
8. Prints the local time in each timezone in the specified date format.
Overall, this code gets the current time in UTC, converts it to the local time in three different timezones, and prints the local time in each timezone in a specified format.
So you're telling me that if one writes a plug-in that integrates with source control check-ins and provides these summaries, plus an additional terser summary for nontechnical users... I wonder if companies would buy that?
The obvious problem with this is that you'd be sending all your (presumably proprietary) source code directly to ChatGPT, i.e. OpenAI, i.e. Microsoft, with a dubiously vague license to use it to improve the ChatGPT product.
I can't think of a single company I or someone I know has worked at that has a security department/person that would allow this. Maybe very small businesses wouldn't care that much?
> I can't think of a single company I or someone I know has worked at that has a security department/person that would allow this. Maybe very small businesses wouldn't care that much?
Well there are lots of open source companies where this isn't a concern. Even outside that lots of businesses use GitHub as a code repository just fine, and the GitHub tools can read your code to eg scan for security vulnerabilities.
I absolutely do not envy any team who would steal our source code. No, it's not bad and is rather self-documenting, but the amount is huge and it's all interconnected. What use could they extract from it?
Add things like "summary of what each contributor did each week/month" at various levels of detail.
There is an evil and a less evil way to deploy this, of course. I have had jobs where I had to provide monthly summaries of what I did that month. I would like to start with a ChatGPT summary of all my commits and/or PRs over the month before editing it - why not?
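Gathering the raw material is the easy part; something along these lines (the author filter is just an example) gives you a month of commits to paste in:

  # one line per commit for the last month, by the current git user
  git log --since="1 month ago" --author="$(git config user.email)" \
      --no-merges --oneline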
Mostly what he is trying to do is run queries on the database to export csvs. So mostly he would ask one of the devs to write a query and he would then save and tweak it himself. Now ChatGPT can mostly write these basic queries and cut down on the number of requests to devs.
Generally, managers are hired to act as a buffer between higher-order management (HOM) and employees. HOM don't want to micromanage employees, but they do want responsibility, which lower-ranked employees (the ones who do the actual work) tend to carry the least. Most of the workforce would get defensive and insist that they work responsibly, but indeed most don't. So they hire this manager to micromanage employees on HOM's behalf. The minute most employees get a good offer, they will fly off as if it's none of their responsibility, so HOM hire this buffer manager to handle this small but important aspect of successful project delivery.
This rant starts off pretty badly, with "Windows bad, Unix good, Windows especially bad because it doesn't have Unix tools".
Reverse that statement. Is Unix bad because it doesn't have Windows tools?
> but simultaneously there was nothing to lean back on: no shell, no Unix tools like sed/awk, no SSH.
He joined in 2014, when Windows had a superior shell and remoting system that eliminates the requirement for low-level string parsing tools like sed/awk entirely!
PowerShell + WinRM + Desired State Configuration (DSC) was all pretty mature at the time, and I had used these technologies to manage huge fleets of servers solo without issues.
Then, almost none of these heroic database backup efforts would have been required if they had just used a commercial database product, e.g. one designed for Windows, such as SQL Server.
In 2014 it supported AlwaysOn Availability groups that allowed multiple synchronous and asynchronous replicas! It also has had true online backups since before 1997.
His second mistake is that backups can be incremental, but a restore in a (true) disaster is a full restore by definition! Differential + Log backups are great to capture data regularly during the day, but the business requirement is usually that a full restore must complete in a reasonable time.
Typical commercial databases, plus the product he mentioned, could easily back up and restore tens of terabytes in setups I've seen, directly to or from any proper database engine, without having to go through "NFS" on the way. Typically you'd use a native backup agent.
This whole article sounds like a self-important Unix engineer refusing to touch commercial products, lacking understanding of business requirements, and allergic to Windows.
I respectfully disagree. Having been an admin for both (I started out as a Windows developer and NT4 sysadmin, so you can't say I wasn't experienced), I've always found it vastly easier to manage servers on UNIX.
Yes, Bash has a lot of warts (so much so that I wrote my own shell) but Powershell creates lots of new warts of its own.
As for automation, Windows doesn't even come close to UNIX-like systems for ease of scripting and automation. There's just no contest.
Bash might be ugly at times, but at least its ugliness is consistent. I've lost count of the number of times I've found parameters parsed differently between applications (because Windows passes parameters as a single string rather than an array of arguments like POSIX systems do), or that so-called headless installers still have GUI prompts and/or spin off a non-blocking process so it's challenging figuring out when they're complete. Or PowerShell routines come back with an unexpected type, thus breaking your entire pipeline. Or don't even support basic help flags, so you cannot discover how to use the damn routine to begin with. Or the utterly ridiculous numeric error codes that MS returns rather than descriptive error messages. Or its over-engineered approach to logging that makes quick inspections of the system's state far more involved than it should be. Or even just managing simple things like software updates requiring whole other expensive 3rd-party solutions because the whole application and OS package management situation is fundamentally broken at its foundations... I could go on. But suffice to say there are so many painful edge cases to consider with Windows.
Windows does get some things right though: RDP is a great protocol, and its backwards compatibility story is second to none. But for server administration, Windows feels like a toy OS compared to most UNIX-like systems. As in, it has support for pretty much anything you'd want to do with modern server management processes, yet everything it does support, it does in a really awkward and (technologically speaking) immature way.
I’m sure it’ll get there though. But likely not before I retire
What are you talking about - Windows was always terrible for an admin. Unix was and is vastly superior when it comes to remote admin. I've used both: PowerShell is powerful, yes, but it's nowhere near as fluid as a Unix shell.
That's just not true. It feels true to Unix admins trying to do Unix things on Windows and getting frustrated when nothing feels familiar.
Let me quote the article directly: "no SSH"
WinRM is roughly equivalent and superior in many ways. E.g.: It takes quite the effort to produce a report of "some setting or flag" across a fleet of servers with SSH, Bash, Sed, Awk, etc..
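Something along these lines - the server list file and the process name are placeholders:

  # fans out over WinRM, 32 sessions at a time by default; every result
  # comes back tagged with the source computer name (PSComputerName)
  Invoke-Command -ComputerName (Get-Content .\servers.txt) -ScriptBlock {
      Get-Process 'MyGameServer'
  } | Export-Csv -Path .\gameserver-report.csv -NoTypeInformation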
This will connect in parallel to all the servers in the list (could be thousands!), 32 at a time by default. It'll get all of the process info (memory, cpu time, etc...) for a game server, and then add the source server computer name as an extra column. Dumping this to a valid CSV is a no-brainer.
Put this in a scheduled task and you've got something Excel or Power BI or whatever can pick up and chart, graph, or send to a manager.
Try this with Linux. Seriously. Sit down and do the equivalent: many servers in parallel, batched to prevent overload, server name as an extra column, valid CSV output with proper quoting of quote characters, etc...
Oh, did you notice the lack of authentication rigmarole? The whole thing is secured and encrypted via Kerberos automatically. Copy the equivalent of that too. (Assume you have 10K servers that you haven't manually added the SSH keys of into your local workstation config.)
Put the script in the reply.
PS: I know some Unix people think that brevity is more important than time-to-solution, so to head off that whole argument:
PS C:\> Get-Alias | ? ResolvedCommandName -in ( 'Invoke-Command', 'Get-Process', 'Export-Csv' )
CommandType Name
----------- ----
Alias epcsv -> Export-Csv
Alias gps -> Get-Process
Alias icm -> Invoke-Command
Alias ps -> Get-Process
> Try this with Linux. Seriously. Sit down and do the equivalent: many servers in parallel
On my phone so not going to type the full command line out; ostensibly it would be:
xargs ssh ‘ps process-name’
I’d need to pull the man pages up to double check which flags you’ve specifically requested. But the above is the general gist of it.
——
I think it's pointless to compare such granular features side by side though. For every microscopic thing Windows does better, Linux will have others it does better too. And vice versa. It's the overall administration at a higher level I'm more interested in:
How long does it take to go from a new requirement to a deployed service? How easy is it to automate? To adapt? To scale? To fix? Or even forget fixing entirely and just run as a fleet of ephemeral instances.
Windows can and does support everything Linux does too. But Linux is just generally easier to work with (in my 30 years of experience managing both).
I wouldn’t say Linux is the peak of UNIX-like systems either though. But that’s a whole other argument for another day :)
It's not even remotely the gist of it. The output of that is formatted (pretty-printed) text, not CSV. You won't know which computer the output came from. You can't use this output in any meaningful way. It's not useful to the business.
ssh can't connect to multiple servers at once (other tools like multissh can). However, most such tools aren't designed for and aren't capable of scaling up to a thousand or more servers.
Windows PowerShell and WinRM can, yet people like you just blindly dismiss it, assuming that it "must" be worse than Unix/Linux, since everyone knows that Windows doesn't even have a shell!
To quote the original article: "there was nothing to lean back on: no shell"
Windows has a shell, the author of the article just wasn't aware of its existence, let alone its comparative feature set.
I'm well aware of the existence of PowerShell and CMD; such things are not a real luxury when you're limited to whatever Windows provides you with as utilities, as opposed to the astonishingly wide plethora of command-line utilities available elsewhere (where everything is a file and is easy to read, parse, filter, save).
Windows 2012r2 came with Powershell 4.0, which I'm sure you'll agree was limited in ways that bash is not, since bash is a collection of utilities and doesn't require deep integration with the platform to function.
For example in your example above you use 'Export-Csv', but that was introduced in Powershell 7.0
Am I saying that it's not possible? Not really; PowerShell can probably do everything I need it to, but it's not as simple as `ssh user@host`, and I'm talking about doing it in a way that feels native to the system. It requires a paradigm shift in thinking that most of our developers didn't embrace, and discoverability is just lower there anyway.
Claiming that winrm is better than SSH is... just the most absurd thing I've heard this year.
The main reason I got annoyed with Windows, though, was not the egregious license cost, and it wasn't that it used an enormous amount of resources - the documentation was really nice!
It was that my iteration times when toying with things went from 5 minutes to 25 minutes at a minimum, and that doing any kind of exploratory work meant near certainty that I'd be RDP'ing.
This is not an "I hate Windows" rant, but the defensive position you have taken here is frankly unpalatable. Windows has its strengths, and perhaps I am not the person to chart a course to a fantastically well-designed "Windows-centric" solution to problems -- but fuck, it honestly worked, and it worked better than anything the Windows admins at Ubi could have mustered. Maybe that's a statement about the general competence of Windows admins or just the ones at Ubi, I don't really know.
But Windows is a poor fit for dynamic server workloads such as those. The only reason it was chosen was because the game was developed on Windows and against Windows APIs - some of which were even better than the Linux ones at the time, like IOCompletionPorts, whose virtues I have extolled - and rewriting the software to work on Linux would have been a hugely time-intensive process. That decision cost Ubisoft literally millions of euros, and it was a decision that was later reversed - after I left.
It is literally that simple. Nay, simpler, because authentication is automatic and secure:
Enter-PSSession 'hostname'
Oh, I'm sorry... too verbose and readable.
etsn 'hostname'
> since bash is a collection of utilities[1] and doesn't require deep integration with the platform to function.
You're also unaware of just how elegant PowerShell's extensibility model is. It runs circles around traditional shells, because you can whip up a new "native" command in just minutes, but with features that would take a week of C programming against the Unix byte-oriented pipeline.
The first time I saw something like this, it blew my mind:
Import-Csv 'serverspec.csv' | New-VM
I wrote some spectacularly complex scripts in bash, and it always felt like digging through dirt, looking for nuggets of gold that everyone promises are there, but that I just can't find.
[1] Those utilities are hilariously inconsistent not just between each other, but between distros and even distro versions. E.g.: SSH is so easy, right? Okay, write me a script that can create a sudo-enabled user and install an SSH key for them on every popular Linux made in the last decade. Include Redhat, Ubuntu, Amazon Linux, and whatever else is reasonable. Three "Linux admins" failed at this task, and I had to resort to using ChatGPT and then fixing up the result by testing it against a dozen VMs that took me a day to spin up.
> It is literally that simple. Nay, simpler, because authentication is automatic and secure:
You keep saying that as if there’s no such thing as secure automatic authentication on SSH yet people have been doing exactly that since before AD was even a thing.
> Three "Linux admins" failed at this task, and I had to resort to using ChatGPT and then fixing up the result by testing it against a dozen VMs that took me a day to spin up.
You do realise that you can get bad Linux sysadmins as well as bad Windows administrators?
Plus it's not really fair to compare managing different distributions to managing a single version of Windows (unless you're counting Windows 95 and CE as part of your fleet). Most businesses standardise on one Linux distro (or BSD flavour) and only diversify from that if they absolutely have to. It's not because Linux is harder than Windows. It's because managing multiple different core stacks is harder, regardless of whether that's Linux and Windows. Or Linux and BSD. Or Windows 95 and Windows XP. Or different base distributions of Linux.
The first rule of managing large fleets of servers is homogenise as much of it as you can.
This is a great discussion from everyone! I have learned a lot about Powershell here, so thank you!
That said, the comparison of a homogeneous Windows env to a mixed-flavor Linux environment is disingenuous. Using multiple flavors of Linux is just multiple OSes, of which Windows would be another. The best advice here is to stick to one flavor of Linux, which should be possible because there are no licensing fees!
There is an often overlooked downside to “options”: it means that even well-meaning people will pick inconsistent things unless ruthlessly forced into compliance.
Single-choice vendors like Microsoft or Apple remove this variability, which is a big benefit at the scale of large enterprise or “entire ecosystems”.
Linux is too flexible, and the inevitable consequence of this is that I have customers that each have a dozen incompatible variations in production.
Even if each customer had precisely one version of one distro, my scripts wouldn’t be safely portable between them if they don’t happen to pick the same distro!
I’ve got decade-old complex PowerShell scripts I’ve reused across a dozen unrelated customers. That’s a significant benefit.
People do standardise on a distro within companies though (I made this point to you elsewhere too the last time you claimed otherwise).
And it's not like Microsoft doesn't change things around between different releases of Windows. I've run into just as many compatibility issues with edge-case software between different versions of Windows as I have with Linux. The key is just understanding the base of your OS as well as your software. This is as true for Windows as it is for Linux.
And I have Bash scripts written in the 90s, before Powershell was even a thing, that still work now. In fact one of the biggest complaints about POSIX is the age of the specification. So it’s strange that you’re talking about Powershell as if it’s unique in its compatibility promise.
I’m honestly questioning just how much experience you have in this field because you talk as if everyone misunderstands Windows yet then go on to make the exact same misconceptions about Linux (and POSIX in general).
> It's not even remotely the gist of it. The output of that is formatted (pretty-printed) text, not CSV. You won't know which computer the output came from. You can't use this output in any meaningful way. It's not useful to the business.
You can add all of that easily. Printing hostnames or turning text into CSV isn't a hard job on any computer. I was doing this in the 90s on Windows, years before PowerShell. And on pre-OSX Macs (think System 8 and 9) connected via AppleTalk. So modern Linux systems are clearly not going to struggle in that regard.
And you’re now just doing exactly what you criticised the author for: making up bullshit requirements based on what you’re familiar with and then using that as some lame excuse to call everything else garbage.
> ssh can't connect to multiple servers at once (other tools like multissh can). However, most such tools aren't designed for and aren't capable of scaling up to a thousand or more servers.
You're making a distinction that doesn't exist. It doesn't matter if you're spinning up multiple SSH sessions from one program, or multiple programs each with a single SSH session. It's still the same number of outbound network connections (and likely the same number of processes too).
What matters is how those connections are managed. And the code I exampled does run those connections in parallel.
> Windows PowerShell and WinRM can
Bash can. It just needs to call a process. Just as PowerShell is calling a process to do it.
Saying SSH cannot whereas Powershell can is comparing apples to oranges. SSH isn’t itself a shell, it’s a secure client/server protocol that allows you to run remote code inside a pseudo-TTY. Generally that would be a shell, but it doesn’t have to be.
(Source: I’ve written multiple shells for both Windows and Linux over the years)
> yet people like you just blindly dismiss it,
That’s not what I did. You’re now making assumptions about me. Ironically “blindly dismissing” my comments in the process.
> assuming that it "must" be worse than Unix/Linux,
I don't need to make assumptions. I have a hell of a lot of experience managing Windows systems. 25+ years, in fact.
I've also written plenty of server-side software for Windows too. And network-enabled Windows malware (pre-NT4 college network. I mostly just used it to siphon off other people's quotas from the college ISDN line. But it taught me enough to earn a career in IT).
> since everyone knows that Windows doesn't even have a shell!
Literally no one said that outside of the article.
Windows clearly has several shells: explorer.exe, cmd.exe, Powershell. Plus lots of frontends for remote services via RPC.
I also credited RDP as being excellent too.
> To quote the original article: "there was nothing to lean back on: no shell"
I’m not the author of that article so I don’t really understand why you’re using that to prove a point about my own ignorance.
> Windows has a shell, the author of the article just wasn't aware of its existence, let alone its comparative feature set.
We aren’t talking about the article though. We were talking about your generalised point that Windows is easier to administer than UNIX-like platforms.
If truth be told, I stopped reading the article after about 3 paragraphs because it was pretty obvious that it wasn’t a well thought out piece. But that doesn’t mean I agree with you any more than the author. You’re both at the extreme opposite ends of reality.
No it isn’t. The problems you describe are literally what the Unix command line was invented to solve, way back in the 70s. And what every other pipe-supporting command line shell, PS included, has been inspired by since.
If you want to talk about which shell you personally prefer writing scripts in, then that is a different matter. I have zero issue with people preferring Powershell (heck, Linux can run PS these days). I even wrote my own shell because I felt I could do better than Bash. But don’t make bullshit up like “Linux can’t print host names nor CSV” when that’s basically the MVP for any command line interface.
> Less talk, more script source, paste below please.
> Do test it first though, several people have submitted hilariously incomplete, invalid, or partial solutions.
I honestly don’t see why I should bother. I’m on my phone, cooking breakfast and playing with my kids. And it’s pretty clear you’ve already made your mind up. So I don’t see why I should give up family time to win an argument with a random internet commentator who’s clearly never going to get anything beneficial from the exchange anyways.
The fact that you don’t understand how ssh nor xargs work just demonstrates that you’re not in position to make these kinds of comparisons in the first place.
So take my word for it or don't. I don't really care any more. But I'm not going to interrupt my morning and fire up a laptop just to write a script that you'll instantly dismiss for whatever arbitrary and biased reason you dream up next.
I'm sorry, but this comes across as exactly the same kind of bias as the author. "my knowledge of windows sysadmin methods means Unix methods are crap".
I'm afraid it's horses for courses; it always has been. We'll all pick our own preferences of course, but let's not assume total superiority over "them".
WinRM is nowhere near equivalent to SSH. It's terrible. Just try using anything that relies on Windows' DPAPI over WinRM. You have no such limitations with SSH. Also, its authentication mechanisms are vastly superior (SSH key pairs).
The whole point of UNIX is to use independent, composable tools that speak the same interface. If the interface is space-separated values, use space-separated values. If Excel or PowerBI don't use that then don't use Excel or PowerBI.
Your boss uses Excel and doesn't like it when the reports\ read\ like\ this because you ignored his requirements. Your coworker is also angry at you now, because when he scheduled your script to run regularly, the output occasionally turns into gibberish because of inconsistent handling of special characters and error conditions, breaking automations randomly.
I spotted several errors in your provided "solution", which ought not be possible, because I'm a person that uses Linux briefly every few years.
The problem statement isn't a made up, artificial, or toy scenario. The few times I have to use Linux, it's to do this kind of activity. "Run a command on all Linux servers in the data centre" is something I had to do just recently, and it turned into a multi-month exercise involving six people. It would have been an hour with Windows. Think "look for log4j", "install an agent", or "report on the kernel version used".
Problems like this seem trivial, but it is precisely the independent nature of those tools that makes them not composable.
For example, 'ps' has different output and different capabilities on different systems! You can't naively run it across multiple systems, because output will be an inconsistent mess. You can't then sort, filter, aggregate, or do anything useful with its output.
"aux" is the wrong option to use. It'll pretty-print the output, which forces it to truncate long values. It'll return date times with inconsistent offsets, depending on each server's individual time zone and regional settings. It'll strip the year off, so if you want the 'date' a process started, it won't be obvious if it has been running for 1 month or 13 months.
If any servers aren't already in your ssh keychain, then your command will... what? Freeze? Output errors into the report? Prompt for your input thousands of times, one per server? In parallel? How... did you expect this to work!?
Saying your thousands of servers won't have SSH keys installed when they have arbitrary software installed is disingenuous at best.
Thousands of servers are manageable only with orchestration solutions; this is a solved problem. The Unix world has also evolved beyond a bash for-loop that SSHes into servers and reads the output.
I can solve your task in comparable time with pyinfra or ansible, executing arbitrary Python code on every node, with not just CSV output but whatever you dream up.
The big problem with PowerShell vs Unix is the lack of documentation. Seriously, everything takes 100x less time to figure out on a Unix platform than in PowerShell. And my life is too short for that.
That being said powershell is much more discoverable _for me_ because it’s simple objects that get piped. I don’t have to figure out how to parse the text of a traditional shell. I can just get things done.
But where was PowerShell (introduced in 2006; on Unix only since 2016) when admins needed it since the epoch?
The tools I have learnt might have more cryptic names, but they were there from the beginning.
It took Microsoft an awfully long time to realize cmd.exe just won't cut it.
Also, PowerShell has no choice but to include the kitchen sink to be usable, because there is little composability in the Windows shell world compared to the Cambrian explosion of Unix utilities.
It depends on exactly what and how you're doing it.
I have, as devops, managed both a small fleet of Windows boxes (~20 machines) and a similar number of Unix (Linux-based) machines.
Remote management of Windows was initially a pain to set up but once it was running it was fantastic. Being able to manage the entire application layer with properly tested, versioned code written in C# is an absolute killer. Things like debugging/logging etc are also "just there", and SQL Server is a fantastic database.
The Linux alternative, on the other hand, was easier to update and had some benefits for the developers (you can sort of recreate a production-like environment in containers locally). The management after that was a huge pain because everything seems to have been built MacGyver-style. I like Postgres but prefer using SQL Server (probably personal preference at this point, tbh).
Things are different nowadays: systems like Kubernetes apparently simplify the Linux side and make it effectively trivial to deploy and do cool stuff such as dynamically scale systems. It's also more efficient, as you can remove the overhead of the virtual machines. It doesn't take much imagination to see that that would make a Unix-based system better.
I agree with everything you've said, but to add to that: there are these weird inflection points with both platforms.
A single "small business" server is easier to run on Windows and it'll do everything you need it to do with minimal fuss. Easy GUI admin consoles are available, etc...
For 2-3 servers it swings back to Linux because some things are just easier at that scale.
For 10-1000 servers that aren't all identical, I prefer Windows because of things like Active Directory, GPOs, DSC, etc...
Above 10K and especially above 100K scale the Windows licensing costs become absurd and you're likely deploying thousands of identical nodes (web farms, HPC clusters, etc...) and then Linux becomes very compelling. This is why all of the FAANGs run primarily Linux, because at the multi-million-node scale it's the best option by far.
Interestingly, container technology is maturing in Windows as well, especially with Server 2022 and Windows 11.
However, it still has a few rough edges, so I'm waiting to see how it all pans out...
> For 10-1000 servers that aren't all identical, I prefer Windows because of things like Active Directory, GPOs, DSC, etc...
I run over 1000 Windows boxes and AD/GPOs are terrible. GPOs are spaghetti by design. There are no concepts of infrastructure as code with these tools, and they're quickly becoming liabilities at any scale. There aren't any real ways of making repeatable and reversible changes to these infrastructures via code.
Broaden your horizons a bit and you'll notice these are not the way.
Using Chef on top to manage DSC is the only thing that allows me to keep my sanity, and the more config I pull out of GPO and AD the happier everyone becomes, and the safer all of our systems are. Windows is terrible because it's too heavyweight and too opaque. It is the opposite of an agile choice of operating system.
> I prefer Windows because of things like Active Directory, GPOs, DSC
I felt the same until we had a bug which broke the network adapters under high throughput of packets per second (of course, we didn't know the cause, but the machine was bumped off the network until a reboot which wiped away debug info).
Going via the IPMI yielded a "Can't connect to AD" message, and the rotating local "Administrator" account password was unrealistically hard to acquire (Ubisoft's fault, not Windows'), which led to me writing this: https://github.com/dijit/statham (originally in Go; this is a Rust reimplementation) so that we could all have local accounts. :|
> container technology is maturing in Windows as well, especially with Server 2022 and Windows 11.
I'm actually using Windows Kubernetes nodes today; it has two huge drawbacks:
1) The base container image needs to match the kernel version, meaning you need to build containers for Windows Server 2019-ltsc, Windows Server 2022, Windows 10 and Windows 11
2) The containers are enormous, one of mine is sitting at 6GiB and takes 4 minutes to pull on GKE.
And the minor drawback that the instructions for making a bare-metal Kubernetes Windows worker node are sporadic, outdated, and otherwise largely missing.
> The base container image needs to match the kernel version
The kernel ABI has been stabilised in 2022 & 11, which means this is no longer an issue. I tested it and it seems to work as advertised. E.g. Windows 11 and Windows 11 22H2 can both run a variety of 2022 and 2019 containers, and can run "RTM" and semi-annual 2022 containers in process isolation mode.
> the instructions to make a bare metal kubernetes windows worker node are sporadic, outdated and otherwise largely missing.
I wouldn't even try. Just use Azure Kubernetes Service (AKS), AWS EKS, or something similar.
That sounds great. As this should be very useful for my workplace, it would be great to have an official source to cite on this. Unfortunately, my Google-fu hasn't returned much right now.
If you know of any Microsoft documentation pages, blog posts or something similar on this, can you point me in the right direction? Thanks! :)
> I agree with everything you've said, but to add to that: there are these weird inflection points with both platforms.
Spot on.. and I do like both platforms with a slight favour for Linux because licensing can be a hassle (unless MS still do those special Enterprise deals and you have one) and you can kind of get around the messy Linux things and it's generally easier to work with unknowns/new things in Linux (as the MS world abstracts too much).
It's good that we have the options and any adult Developer/Systems Admin would recognise that and take advantage of it.
I know people will now argue into oblivion about PowerShell vs bash, but I actually wonder why this person was hired in the first place to manage Windows-based systems, which clearly seem not to be his domain of expertise.
In my experience it is often better to hire people who are curious, motivated and at least a little bit intelligent than people who already possess specific skills.
You can argue until you're blue in the face about whether my approach was the right one; ultimately, though, what we did worked, was incredibly lean, was easy to understand/troubleshoot/debug, and most importantly it served the goals of the business.
Could we have gotten up to speed with the Windows way of doing things?
Probably.
There were so many niggling problems with the platform that were not part of my prejudice that, had we continued down that route, I am certain thousands of our person-hours would have been consumed even if we had been absolute experts: things like the fact that bonding in Windows 2012 R2 was done in userland, that joining the domain requires a reboot (which will break your software if it registers to take workloads on boot), that PowerShell versions are often incompatible, and that interface devices are very difficult to reference deterministically.
Windows sysadmins seem to come in two varieties:
1) Experts who know exactly what Windows is thinking at all times
2) Button mashers
Unix/Linux engineers have a much broader range of knowledge, and there's certainly more of them on the senior end.
It's counterproductive to hire people who go against the established procedures and systems built at a company just because they have high-quality opinions about one piece of software or another.
>He joined in 2014, when Windows had a superior shell and remoting system that eliminates the requirement for low-level string parsing tools like sed/awk entirely!
These debates always remind me of this quote (referring to C++) here on HN by another user:
It seems like grappling with these arcane features feels productive, at first.
You enter a flow state and after a while, something works that didn’t work before.
It’s only when you step back and ask what actually got accomplished that you realize it’s time wasted.
That's what Linux feels like to me. People enjoy solving arcane problems, digging through man pages, downloading esoteric commands. Heck, I enjoy this stuff too! Recently I had a lot of fun playing around with the parallel compression tools available on Linux.
It's fun... but it's rarely as productive as not having to deal with the challenge in the first place.
There's not much of a debate. Windows as a server is dead outside SME shops. This isn't the 90s anymore. But it is very nostalgic hearing people talk up Microsoft; it's not something I've heard in the last 15 years.
I can comment directly on this as I work in the backup sector. First hand knowledge, yada yada yada.
Dell EMC DataDomains do have good ingest performance, you can typically throw hundreds of streams at them and they'll greedily gulp it down.
And it is true that they are dog slow at restoring data. The reason? They are deduplicating appliances: you have to rehydrate the data, and this can take a very long time depending on the number of blocks that constitute the backup. Some blocks may be shared between hundreds if not thousands of backups. Dell isn't one to talk about their downsides much.
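To make the rehydration point concrete, here's a toy sketch of how a deduplicating store behaves (illustrative only, not how a DataDomain is actually implemented; the names and the fixed-size chunking are made up): ingest is mostly hashing and skipping chunks it has already seen, while a restore has to fetch every chunk the backup references, often from scattered locations.

    import hashlib

    CHUNK = 64 * 1024  # toy fixed-size chunking; real appliances use smarter, variable-size chunking

    class DedupStore:
        def __init__(self):
            # hash -> chunk bytes; in a real appliance these live on disk, spread across containers
            self.chunks = {}

        def ingest(self, data: bytes) -> list[str]:
            """Fast path: hash each chunk, store only unseen ones, return the 'recipe'."""
            recipe = []
            for i in range(0, len(data), CHUNK):
                block = data[i:i + CHUNK]
                digest = hashlib.sha256(block).hexdigest()
                self.chunks.setdefault(digest, block)  # duplicate chunks cost almost nothing
                recipe.append(digest)
            return recipe

        def restore(self, recipe: list[str]) -> bytes:
            """Slow path: every referenced chunk has to be fetched again and reassembled."""
            return b"".join(self.chunks[digest] for digest in recipe)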
They are best used not as _primary_ storage, but secondary (think 3-2-1 backup rule). You should have some fast nearline storage available for recent backups that require really low RTO (recovery time objective).
Depending on the application used for backup, it may not have had native PostgreSQL WAL processing, and may have had to be image-based, which slows down the process and requires some additional scripting.
That's pretty cool. The DataDomain appliance is actually a neat piece of kit and I definitely appreciate its existence.
I definitely feel like we were using it wrong, and I'm not entirely convinced it was my fault.
In my ideal scenario I would have had a week's worth of point-in-time backups on a machine in the rack, and replicated the contents to the DD after verification.
Sadly I was denied that "in-rack" solution and was sold the DataDomain solution (without it being named) as if it were simply a remote disk, not a fancy appliance.
The main point I tried (and failed) to convey in the article is that a solution can be brilliant and expensive, but that doesn't mean it's what you need for the job at hand.
Oh yeah, for sure. I got the gist of your article. I thought it was pretty clear.
But overall, planning a backup strategy can take some time. You need to know your goals before you go buy some vendor's hardware. But, in your circumstance, you gotta do what you gotta do.
Our remote manager bought us an expensive HP floor-mount server like a dishwasher. It was over spec'd and underpowered. And we didn't need a server, not for work.
My colleague loaded up his entire ripped CD collection and we got Bluetooth headphones and had it serve up tunes all day. Back when that was a thing. Only thing it ever did.
> and begin real investigation you will quickly find that many databases that are popular are totally fine losing data. MongoDB being the most famous example that I can think off of the top of my head.
Always pisses me off. University teaches the principles of ACID and how hard databases work to adhere to these principles, and then so called "NoSQL" comes along and says "lol we have eventual consistency".
It is very frustrating that many of the NoSQL databases aren't ACID. Some even say they are, but they aren't. Frustrating to me because we have one that is, and it's a lot of work and harder to win benchmarks against the ones that fake it or aren't.
> Always pisses me off. University teaches the principles of ACID and how hard databases work to adhere to these principles, and then so called "NoSQL" comes along and says "lol we have eventual consistency".
99% of the organisations that pride themselves on using "real databases" and ACID aren't actually using those guarantees or gaining anything out of them. Transactions are inherently useless in a web service, for example, because you can't open the transaction on the client, so the part of the system where the vast majority of consistency issues happen (the client <-> server communication) will always be outside the transaction boundary. ACID fanboys love to talk about how all the big name internet companies are built on RDBMSes, and fail to mention that most of them were built on MySQL 3 which never actually had working ACID in the first place.
If MongoDB had come first and SQL/ACID RDBMSes had come after, we'd recognise them for what they are: a grossly overengineered non-solution to the wrong problem.
> Transactions are inherently useless in a web service, for example, because you can’t open the transaction on the client
(1) There is no reason you couldn’t, in a stateful, connected web app (or a stateful web service) open backend DB transactions controlled by activity on the client. You probably don’t want to, because, “ew, stateful”, and would prefer to use compensation strategies to deal with business transactions that span beyond a database transactions, but you could.
(2) The conclusion “transactions are inherently useless in a web service” does not follow from the premise “you can’t open the transaction on the client”. They are just two completely unrelated things that you’ve glued together with “because”. I write web services. The backends very often need a consistent view of state and to be able to assure that a sequence of mutating operations against the DB are applied all or nothing; transactions do both of those things. The fact that transactions are opened from the backend doesn’t make them useless.
> the part of the system where the vast majority of consistency issues happen (the client <-> server communication) will always be outside the transaction boundary.
“Consistency issues” in the ACID sense do not (cannot, in fact, since “consistency” is a property of database state) happen anywhere other than inside the database. Client <-> server communications have all kinds of issues, but ACID's C is not one of them.
> ACID fanboys love to talk about how all the big name internet companies are built on RDBMSes
Nah, actually, as an “ACID fanboy”, I’ll say that most big name internet companies were not built on ACID systems, and that if you are running a big internet company you have a much greater than usual chance of (1) having a case with a tradeoff that really does call for a non-ACID system, (2) having to invent revolutionary new technology if it turns out you actually do need an ACID system, because OTS RDBMS’s don’t scale the way you need.
But for everyone else, you probably aren’t Google.
> If MongoDB had come first and SQL/ACID RDBMSes had come after,
While they weren’t MongoDB specifically, non-relational, non-ACID key-value stores where the “value” could be arbitrary data did exist before RDBMS’s. OTOH, users of MongoDB and other similar NoSQL systems have often discovered that, oh yeah, they do want the things RDBMS’s provide, which is why the pendulum swung somewhat back in the RDBMS direction after peak NoSQL hype.
NoSQL has its place, too, but the relational model and ACID guarantees actually do solve real problems.
> (1) There is no reason you couldn’t, in a stateful, connected web app (or a stateful web service) open backend DB transactions controlled by activity on the client. You probably don’t want to, because, “ew, stateful”, and would prefer to use compensation strategies to deal with business transactions that span beyond a database transactions, but you could.
You'd have to break the rules of how web systems are meant to work. HTTP requests are supposed to be independent (and may arrive out of order with each other etc.), holding a long-lived connection between server and client has limited support.
> The backends very often need a consistent view of state and to be able to assure that a sequence of mutating operations against the DB are applied all or nothing;
You never need that if these things are the result of a fixed client request; the set of things to be done can't change, so eventual consistency will serve the purpose. The case where you need ACID is where there's a complex back and forth to be had with the thing that's driving the change within the same transaction - and to the extent that "the thing that's driving the change" is solely on the backend, it's not really a web system.
> “Consistency issues” in the ACID sense do not (cannot, in fact, since “consistency” is a property of database state) happen anywhere other than inside the database. Client <-> server communications have all kinds of issues, but ACID's C is not one of them.
Small-c consistency (actually I was thinking mainly in the CAP sense). Things like requesting to make a change based on stale data.
> You never need that if these things are the result of a fixed client request; the set of things to be done can't change, so eventual consistency will serve the purpose.
Yes, if I wanted to handle orders of magnitude more simultaneous requests than this app could conceivably ever need to, but with more app complexity and greater latency for substantive results even at its existing scale, I could use an EC solution instead of an ACID one, but... I don’t.
What are you getting out of ACID, and what is it costing you? The main practical effects of using an ACID datastore are that a) your writes don't commit until they've finished updating all your indices (sometimes what you want, but often not) b) if you get your dataflow wrong you'll get deadlocks, occasionally, probably starting weeks after you made the mistake that caused them
> Transactions are inherently useless in a web service, for example, because you can't open the transaction on the client, so the part of the system where the vast majority of consistency issues happen (the client <-> server communication) will always be outside the transaction boundary.
I think I need more help understanding this. In terms of ACID, a consistency issue is something like, for example: "The client ordered an item, and the count of item availability was reduced by 1, but no actual order was saved in the database. An item is now 'lost' with no trace of where it went and how it disappeared."
There is even another consistency hazard right in this very example: "The count of item availability was supposed to be reduced by 2 because 2 items were ordered by different clients at the same time, but a race condition (the two transactions both individually reading the same old value before writing back the new value) led to the item count only being reduced by 1. There is now an extra item 'available' that does not exist."
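For concreteness, a minimal sketch of what the transactional fix for both hazards looks like (Python's sqlite3 as a stand-in for a real RDBMS; the table and column names are invented): the stock decrement is a single atomic UPDATE guarded by a WHERE clause, and it is tied to the order INSERT in one transaction, so either both happen or neither does.

    import sqlite3

    def place_order(db: sqlite3.Connection, item_id: int, customer_id: int) -> bool:
        """Decrement stock and record the order atomically."""
        try:
            with db:  # opens a transaction; commits on success, rolls back on exception
                cur = db.execute(
                    "UPDATE inventory SET count = count - 1 "
                    "WHERE item_id = ? AND count > 0",
                    (item_id,),
                )
                if cur.rowcount == 0:
                    # out of stock: raising rolls the transaction back, so no phantom decrement
                    raise RuntimeError("out of stock")
                db.execute(
                    "INSERT INTO orders (item_id, customer_id) VALUES (?, ?)",
                    (item_id, customer_id),
                )
            return True
        except Exception:
            return False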
The canonical example is bank accounts and money transfers, but I figured I'd use something that was more likely in the domain of "99% of organizations" (though that also includes things like ERP, where a lot of other canonical examples apply).
Care to elaborate how transactions are "inherently useless" in a web service here, and how the consistency issue happens "between client and server" communication?
Presumably, in this example, the client/server communication triggering the transaction is the order of an item. There are not that many things that can realistically go wrong in that communication (with checksums on different layers, and nowadays even cryptography, guarding against most arbitrary corruption): The order was not received, or the order was received multiple times. In both those cases, the database remains internally consistent: Yes, the order happening 0 times or 2 times may have been wrong, but the database overall shows the state of what you would expect after 0 or 2 orders. No item has been going inexplicably "missing" or "created" until the next inventory, and the external issue can just be corrected by reissuing or canceling the orders, again with the prospect of an internally consistent database. And even a corrupted order would not change that.
> In terms of ACID, a consistency issue is something like, for example: "The client ordered an item, and the count of item availability was reduced by 1, but no actual order was saved in the database. An item is now 'lost' with no trace of where it went and how it disappeared."
> There is even another consistency hazard right in this very example: "The count of item availability was supposed to be reduced by 2 because 2 items were ordered by different clients at the same time, but a race condition (the two transactions both individually reading the same old value before writing back the new value) led to the item count only being reduced by 1. There is now an extra item 'available' that does not exist."
You don't need ACID to solve that kind of problem though; event sourcing and eventual consistency handle it fine. What you need ACID for is synchronous transactions; for example, you want to check that both item A and item B are available, order both of them, and have either both orders go through or neither of them, even though those are separate orders to separate tables (of course, if you have a concept of an "order" as a first-class entity then you can trivially solve this - but again, in that case you don't need ACID at all).
> Care to elaborate how transactions are "inherently useless" in a web service here, and how the consistency issue happens "between client and server" communication?
So the example above - the whole point of ACID is to let you do two separate things at once, e.g. you want to open the page for item A and for item B, check that both are available, and then order them both on their respective pages. With an old-fashioned synchronous client-server system with a dedicated fat client you could do that, but on a web system obviously it's not possible (other than by creating some kind of "batched request" or "combined order" concept for doing both orders in the same request - but again, once you've done that you don't need ACID at all) because there's no way for the client to do two page loads in the same transaction.
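For comparison, a rough sketch of the event-sourced shape of the inventory example (in-memory Python standing in for a log like Kafka or an append-only table; all names here are made up): the write path only records intent, and a single consumer that owns the stock projection resolves conflicts by log order rather than by locks, emitting a confirmation or rejection that the client sees later.

    import itertools
    from dataclasses import dataclass

    @dataclass
    class Event:
        seq: int
        kind: str      # "OrderRequested", "OrderConfirmed", "OrderRejected"
        payload: dict

    log: list[Event] = []        # append-only event log (stand-in for Kafka / a log table)
    _seq = itertools.count(1)

    def append(kind: str, payload: dict) -> Event:
        ev = Event(next(_seq), kind, payload)
        log.append(ev)
        return ev

    def request_order(item_id: int, customer_id: int) -> Event:
        # the web request handler never touches stock directly; it just records intent
        return append("OrderRequested", {"item": item_id, "customer": customer_id})

    stock = {42: 1}              # projection owned solely by the consumer below

    def consume(ev: Event) -> None:
        if ev.kind != "OrderRequested":
            return
        item = ev.payload["item"]
        if stock.get(item, 0) > 0:
            stock[item] -= 1
            append("OrderConfirmed", ev.payload)
        else:
            append("OrderRejected", ev.payload)   # the client finds out eventually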
How can eventual consistency handle the inventory use case as well as an ACID system? You can't tell the user "your order is placed successfully" and then later tell them "actually no, it was out of stock".
That's a different problem from your previous examples, and it's the same problem whether it's ACID or not: there's always an opportunity for the user to see the item as being in stock, then request to order it, and discover it's out of stock. Your "recovery" from the case where there's nothing in stock doesn't need to happen in a transaction (and can't meaningfully do so); you just tell the user their order failed, and again you do that the same way whether it's ACID or not.
I see what you mean, but with ACID you can tell the user right away if the order was successfully placed. With eventual consistency you don't know how long it'll take so you either show them a loading spinner that takes an indeterminate amount of time, or what?
Not saying that the spinner solution is wrong (as it definitely scales better), but the immediate feedback is something you lose with eventual consistency, by definition, right?
> I see what you mean, but with ACID you can tell the user right away if the order was successfully placed. With eventual consistency you don't know how long it'll take so you either show them a loading spinner that takes an indeterminate amount of time, or what?
It's doing the same thing either way though; in either world you can even wait for the order to be processed, or not. In an event sourcing system you'll commit the event quickly and then wait potentially arbitrarily long for the result to appear downstream; in an ACID RDBMS you'll wait potentially arbitrarily long for your commit to execute (and maybe if you're lucky your database has got a deadlock detector, but what are you going to do if it tells you you hit a deadlock? 100% of the time I've seen the answer is "backoff and retry").
> in an ACID RDBMS you'll wait potentially arbitrarily long for your commit to execute (and maybe if you're lucky your database has got a deadlock detector, but what are you going to do if it tells you you hit a deadlock? 100% of the time I've seen the answer is "backoff and retry").
I don't agree with the conclusions from this. re: deadlocks, these can be prevented as they are only possible in certain situations, and have mitigations (keep transactions short, never acquire the same locks in a different order, ...). For something "simple" like atomically decrementing an inventory count and then inserting a new row into an orders table, deadlocks (or locking problems at all) are not possible. Of course as you make your system more complex it becomes more likely though, and that's a very fair argument of why ACID systems won't scale as well in general.
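To make the "same lock order" mitigation concrete, a tiny sketch (assuming an sqlite3-style connection where "with db:" wraps a transaction; the table names are invented): sort the keys before updating, so two concurrent transactions may block each other but can never each hold a lock the other is waiting on.

    import sqlite3

    def apply_balance_changes(db: sqlite3.Connection, deltas: dict[int, int]) -> None:
        """Apply several balance changes in one transaction, locking rows in ascending id order."""
        with db:  # single transaction
            for account_id in sorted(deltas):  # the consistent ordering is the whole trick
                db.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                    (deltas[account_id], account_id),
                )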
But I would still maintain that, even if you have a commit that times out, it does so in an atomic way. It takes at most your "statement timeout" (should be a few seconds probably), and then you can (in deterministic time) show the user an error message. This is still an improvement for the user experience over showing a "your order has been placed" message and then later cancelling it due to overconsumption of inventory.
I appreciate your replies by the way! I haven't been convinced yet that eventual consistency can provide an equally good experience as ACID for this use case, but you've made me think about things in a new way.
> re: deadlocks, these can be prevented as they are only possible in certain situations, and have mitigations (keep transactions short, never acquire the same locks in a different order, ...).
In principle yes, but this relies on human vigilance; as far as I know there's no automatic checker that can reliably tell you whether your queries have the possibility of deadlocking. Do you review every query before it gets run? And when you miss a case like acquiring locks in the wrong order, it can be weeks or months before it actually bites you.
> But I would still maintain that, even if you have a commit that times out, it does so in an atomic way. It takes at most your "statement timeout" (should be a few seconds probably), and then you can (in deterministic time) show the user an error message. This is still an improvement for the user experience over showing a "your order has been placed" message and then later cancelling it due to overconsumption of inventory.
One of the most fun ways I've seen an SQL system break: user navigates to a page, gets a timeout in their browser; 23 days later the database falls over.
(the page initiated a query for 2 years' worth of data, the database server chugged away through its indices for 23 days and then started trying to stream all the data back).
I agree that it's good to have that kind of fallback behaviour - in the system I currently work on we have something like that, where if a complex process doesn't get an event for over 1 second (most likely because the thing computing it broke, but it could also just be slow) then we have a simple consumer downstream that just emits a cancel event in that case (and passes everything else through otherwise) and we take that stream as canonical. And having something like that by default is a good thing, and one of the things that SQL databases do right is that they're a lot more request-response, whereas event sourcing things can be a bit "shouting into the void". I'd like a system with better support for that kind of case. That said, I think in a lot of cases the SQL defaults aren't great - I don't think I've ever used a database that had a good default timeout setting, and the way SQL databases treat any validation failure as "drop the data on the floor" is rarely what you want in practice.
There is a difference between theoretical worst case and practical experience. Whether the direct-update experience or the event sourcing experience on this point is better depends on a lot of factors (many of which are dimensions of scale.) Neither is categorically ideal, you've got to understand the particular application.
Of course, when using an ACID RDBMS, you can also very trivially use a CQRS/event-sourcing approach for some flows with append-only event log tables and a separate process applying them to query tables that other clients only read and direct update for other flows.
> when using an ACID RDBMS, you can also very trivially use a CQRS/event-sourcing approach for some flows with append-only event log tables and a separate process applying them to query tables that other clients only read and direct update for other flows.
Sort of. You can't opt out of transactionality (transaction isolation settings are often global) and you can't defer index updates except by making your indices "manual".
The first job I worked for essentially built an event sourcing system on top of an RDBMS. We still had the production system brought down by an analyst leaving a transaction open for weeks (we weren't big enough to have a separate analytics datastore at that point). They should've known to enable autocommit (even though their queries were read only!), but it's still a footgun that was only there because of the ACID RDBMS.
I'm no expert, but my recent conclusion from writing an ordering system is that customers can instruct their banks to roll back many payments a month or even a year after the fact.
So actual ACID transactional consistency requires a year. So after basic checks, just say you'll give them what they want and accept payment -- since you'll need failure recovery anyway for lack of delivery, refunds, acts of God, etc.
The idea I've seen repeated here a lot is to aim for strong consistency internal to your system, even though the real world/things external to your system will be eventually consistent.
So the inventory <-> create order interactions are internal and can be strongly consistent, while the realities of the banking system are external and will be eventually consistent.
Just because you can't make everything strongly consistent doesn't mean you shouldn't bother at all; it can still eliminate entire classes of errors (in this case, errors around inventory being over-consumed leading to cancelled orders are completely avoided with ACID)
edit: I'm also no expert when it comes to distributed systems. Most of my experience is with transactional systems which I know quite well, and distributed is still a mystery to me. I am very open to new ideas around it and don't have a ton of confidence in my comments here.
A lot of things have improved in distributed databases as of late. There are distributed databases today (my co. provides one) that can have strict serializability (the highest data integrity level) for hyper scale applications. Tolerating "eventual consistency" is largely a choice, oftentimes forced because your DB application or implementation (NoSQL and traditional RDBMS, depending on their implementation) does not provide you strict transaction guarantees. There are third parties (https://jepsen.io/analyses is a great source) that can evaluate and validate such claims. A lot of the anecdotal information in this thread here is no longer accurate or applicable. Yes, there are databases (like Fauna) that don't compromise on the highest transactional levels, and yes, there are more widely used databases (MongoDB, DynamoDB, and many others) that either cannot make those guarantees, or their implementation by their service provider cannot demonstrate that guarantee. Happy to provide more info if interested.
> the part of the system where the vast majority of consistency issues happen
This is survivorship bias: the reason the vast majority of consistency issues happen between the client and the server is _because_ the server can eliminate issues on its side with transactions and similar.
No, you're inherently a lot more likely to get inconsistency over the higher-latency and less-reliable public internet than between the server application and database, which are probably sitting in the same rack, if not on the same machine.
In my observation, the game industry and tech industry are like humans and apes: they have the same ancestor, but at some point in our ancient past we diverged. You rarely see crossover between the two these days, and I doubt we'll see any in another decade.
What I think that means is there's probably a huge opportunity selling five year old tech ideas to the gaming industry
The implied notion that the manager really had no idea what they actually wanted, or at the very least were unable to sensibly describe the requirements, is entirely believable, but from a systems development perspective nothing in this article makes any sense.
Well, spending 3 months proving that MySQL can reliably store data raised an eyebrow. I mean, we are talking about gamer profiles here. Worrying about whether or not the log is on the same volume as the table really seems like a bit much in that context. Even worrying about fsync is honestly questionable.
If you buy something and we grant it to you, and then your profile gets corrupted?
We get sued.
Better not risk it.
Also, gamers are the walking definition of a hostile customer. Give them no opening to call you incompetent.
(also, we are talking enormous binary blobs, not easy to recover from if partially written- most databases don't store large rows in-line for performance reasons)
Some of the MariaDB 10.1 versions trivially corrupted themselves by just loading them up with writes a bit, no crashes or power issues involved.
One of the 10.6 versions had (has?) a bug where a connection sometimes simply never dies, even after disconnecting and killing the query with the KILL SQL command, so it never releases its slot (which matters if you have lots of connections), although to be fair I've only hit that once since we upgraded to 10.6.
Frankly I prefer MySQL and friends over Postgres simply because that's what I've always used and find familiar (plus I don't like the process-based concurrency that Postgres has; while I haven't really tested whether it matters, process creation is fairly expensive on Windows to use as a concurrency mechanism and limits the number of connections a lot more than threads do). But let's not deny reality here: MySQL has never been rock solid, and a glance at its source code should be enough to tell you why.
As for your disregard for the performance issues, it's hard to say whether it's warranted without knowing the numbers and the load profile, but it's probably preferable to spend the time to test it properly than to have to fix everything in a hurry later with customers screaming at you over the phone.
Plus I like programs built to be snappy and fast, a good practice that sadly has been lost to time.
So "nothing makes sense from a system developers perspective" because gamer profiles are supposedly not valuable, and it's fine to have a system where they may get lost or corrupted, instead of trying to solve the problem with the premise that they may not?
Putting aside that you seem to criticize that premise instead of the system development approach that fulfills the premise, I suspect that the resulting mess of support requests, manual intervention, and community unrest Ubisoft (or any other MMORPG developer) would find themselves in, if, say, "the shiny sword of atrocious doom" that players spent 240 hours sinking their life into obtaining were to disappear regularly, is probably not worth it.
I understand this despite not doing any online gaming myself.
If it's linked to things like achievements and stuff, yeah, it's pretty important. I understand this isn't money they're dealing with, but it is time, and I would be fairly annoyed if I had dozens of hours into a run and lost an achievement I was targeting due to some server-side screw-up.
I wouldn't run around cursing their name but I would probably stop playing for a period of time. Whether or not that's a problem depends on their business model.
This could be one of those big-corporate situations:
1. Someone pays a lot of money for something.
2. It turns out to be the wrong thing (either for new needs, or because they didn't consult or listen to the people who could've told them that in time).
3. Someone wants to avoid the political backlash of admitting they bought the wrong thing.
4. The org incurs much more costs than actually fixing the situation would cost, as people have to use the wrong thing and/or can't use the right thing. (Costs from lost productivity, lower quality/uptime, damaged morale, etc.)
(But what I really want to know is... Why did "The Division 2" not build upon "The Division", but instead make you start a new character, and then be more of a selling-brightly-colored hats game? Also, TD1's gritty survival mode was the most compelling gameplay of the franchise, IMHO.)
> Why did "The Division 2" not build upon "The Division",
Four main reasons; you can argue whether they were "that big of a deal", but ultimately they were. Please also remember that Division 2 was made in 2 years, and Div 1 was made in 7.
1) A clean slate of architecture built on Google Cloud meant that connecting back to the datacenters in Montreal would have been very time-consuming to get right.
2) The Division 1 continued, so a migration would have had to have some pretty clearly defined semantics: do we sync both backends constantly? Do we fork your character? Do we stop your previous account from working? Forking sounds like the right thing until you realise you can't fork again. :\
3) We would have to carry over every single item from the Division 1, most of which do not make sense in the setting.
4) We'd have to have a clear upgrade path for your binary save game (from a moving target as Div1 was still getting updates!)
But it was a topic that was discussed at great length.
> Also, TD1's gritty survival mode was the most compelling gameplay of the franchise
Yeah, I really liked that. I wish we could have released it as a standalone game, especially as it came out at the height of BR and, for me, it hit similar themes.
Thank you! (And thank you for contributing to great games!) It's good to hear that the question was taken seriously, even if I might've preferred the decision going a different way.
I get the same on Firefox on Linux (Fedora 37). It looks like the server is responding with an ECDHE key exchange using a SHA1 signature algorithm which is disabled on modern versions of OpenSSL. I'm not sure why it works on Chrome, maybe their TLS client is a bit more lax. You can see openssl fail to verify the signature locally with `openssl s_client -connect blog.dijit.sh:443` due to the wrong signature type.
I am not sure about that. On my system that openssl invocation fails because the site doesn't send the intermediate Let's Encrypt R3 certificate, and openssl doesn't retrieve it while browsers do.
The error from Firefox is SSL_ERROR_UNSUPPORTED_SIGNATURE_ALGORITHM which seems to align to what I'm seeing on the signature algorithm returned. Lastly by setting my system wide policy to allow the LEGACY (Fedora's term) algorithms the website starts to work on Firefox.
It could very well be that the server is at fault here (based on what I've read this seems to be the case) and that's due to the Let's Encrypt cross signed certificate. But the reason it's failing to load on the client side is because some clients block SHA1 based signing algorithms and that's what the server is offering here.
Same issue here. My Firefox on Fedora gives "SSL_ERROR_UNSUPPORTED_SIGNATURE_ALGORITHM".
I found a bug report[0] which suggests it is (mostly) a server issue. Fedora refuses to use SHA1, and the server ignores the client's attempt to negotiate a more secure algorithm.
This kind of thing seems to happen in very large organisations that have accreted many production IT systems, especially once an org realises it has half-a-dozen different services in production that are performing an apparently equivalent job. What an opportunity for architecture simplification! Why run half a dozen different variants of apparently the same thing when we could consolidate on one standard approach, and reap a bunch of cost savings and reductions in operational complexity.
Enterprise architecture can commission a project to review options and select the one true enterprise backup solution. After sufficient peer review / diligent testing / bribery from vendors, an enterprise backup solution is chosen, wrangled into production, and then inflicted upon the org.
Let's be pragmatic. For existing production systems, it may not make sense to migrate them over to the new enterprise backup solution. Much risk and cost, limited payoff. Perhaps it's simplest to just keep them running their existing legacy backup solution until the entire production system that depends on it is finally decommissioned, which is planned for only 3 years away. We'll definitely decommission it after 3 years.
But for any new teams trying to deliver a new service into production. Are you storing data? Well, you must have a backup strategy before you can go to prod, and further, your backup design must align with the enterprise backup strategy -- you must integrate with the enterprise backup solution for your design to be stamped approved by architecture, unless there is an exception granted by appeal and/or ritual sacrifice to your manager's manager's manager's manager's manager.
Reading this, one might think that this happens only at X or Y company, but in my experience, this is a lot more common.
One example: at a previous job, we spent $1M and almost 1 year to build a (new version of a) process (as in, a website) that served less than 50 customers. The project was launched to great fanfare internally (talks, videos, the whole package) but created no value, even those 50 customers didn't transition to it, as the old one worked fine.
The same job where my manager wrote in my review that I'm too negative and only point out problems.
I have so many of these stories, some funny, others more tragic - companies going under and people losing their jobs and falling into depression because a manager was stubborn and incompetent - that I think I should write a book, but then I realise I'm not that special and my stories are pretty boring.
The author makes a good progression in their reasoning but then somehow arrives at the wrong conclusion:
> So when someone says that Amazon has invested a lot of money into security I think about the fact that Ubisoft spent $1M on a backup solution that didn’t work for the game that would have had the best use of it.
The missing point here is that Ubisoft is/was spending money on something outside of its domain of expertise, and managed to make a suboptimal decision. Amazon, however, is spending money on two of its core domains of expertise: building and managing datacenters, and building and operating web services.
Not to mention, aws is very clear and explicit about what is their responsibility and what’s your responsibility (security of the cloud vs security in the cloud).
I wrote the article in a fit of anger because someone (on Hacker News) said that Amazon spent a lot of money on security, therefore it's better than whatever [you] could possibly need.
The main thesis of the post is that sometimes incentives are not aligned and you still need to think critically about what you bought and what its intended purpose is, the cost of a thing does not mean it fits your need.
You can pull a train with a Bugatti Veyron, but it's not the right tool just because it costs as much as or more than a train does.
"The main thesis of the post is that sometimes incentives are not aligned"
Sure, but your example of AWS and security is exactly the opposite where incentives are aligned.
"but its not the right tool just because it costs as much or more than a train does. "
Yes, and you didn't investigate the impedance mismatch between your tool (PgBackRest) and your backup system. What this has to do with your stray shot at AWS I have no idea.
If I had been streaming and not reading then Ubisoft's solution would have been the right one.
But the situation was such that I didn't know what the incentives or intentions were; and others didn't either, because all they thought was that "the investment had been made".
Turning off your brain is the wrong solution; you have to check that your incentives are the same as Amazon's.
> The main thesis of the post is that sometimes incentives are not aligned and you still need to think critically about what you bought and what its intended purpose is, the cost of a thing does not mean it fits your need.
The whole post is clickbait. The million dollar solution was to store source code that they never ever wanted to lose. It (presumably) worked for that. What it didn't work for was backing up user profiles and being able to quickly restore them.
The environment that the datadomain was hosted in was not allowed to host source code or raw assets, but the bank vault thing was because of lost source code.
The issue is that most games just fire backups off into the void when they need to back up, and restores are never tested.
> The games industry is weird: It simultaneously lags behind the rest of the tech industry by half-a-decade in some areas and yet it can be years ahead in others.
I'd love to see examples of "ahead" coz I literally never saw it. Game dev studios act like automated testing is a new thing...
> PostgresSQL performed much better and had the additional benefit of being able to cleanly split write-ahead logs (which are largely sequential) and data to separate RAID devices. Something that MySQL doesn’t really support and would have to be hacked in using Symlinks on every table create.
It's funny that you can guess this happened a good few years ago purely because the solution to the above wasn't just "slap an NVMe on it".
>I tested this and ordered the storage I would need to have a rolling 90 day window of backups (with older backups being taken off-site)..
>The hardware request was rejected.
> When I inquired as to why, I was told that Ubisoft has a standard backup solution which is replicated globally and sent to cold storage in a bank vault somewhere in Paris. I was told this is because we had lost some source code once upon a time and we could no longer build certain games because of that.
My first thought was "why not just ship pg backups + wal logs there" and just a full backup from time to time. No read-back needed.
The whole problem seems to be "the admin over-engineered a solution without knowing the constraints, then tried to headbutt the constraints instead of changing approach".
> Our EMC DataDomain system was optimised primarily for ingesting huge volumes of traffic, but if we want incremental backups then perhaps we needed a something a little more dynamic.
Nope, your incremental approach sucked. The majority of software that I've seen used to do incremental backups could generate incrementals on its own, without any read-back from the server.
The minority *was software that wholly managed backups*, so it was prepared for that, and often it was just "read metadata from the database" instead of actual data.
> I don’t know what else to take away from this.
That people use tapes for backups? To get your requirements before implementing a solution? To find a better solution once the requirements are known? To not get stuck on your initial idea?
Literally the basic, recommended way of doing PostgreSQL backups with no extra software would work.
The “years ahead” was described in the article. We had what amounts to a very well oiled kubernetes installation with mTLS, but on Windows, in C++ and 10 years ago (before Kubernetes was a thing).
Everything else you say is true, I could have chosen another solution for backups but the tradeoff between backup speed (replaying WAL can be time consuming) vs database load (full backups delay replication to replicas) was made before I knew the read/write characteristics of the datadomain. (I didn't even know it was a datadomain- nobody told me anything except giving me an NFS mount point until I started having problems, at which point I spent weeks debugging with a storage engineer from Montreal- this article is a 6-8month span of back and forth distilled).
Regardless: restore times are important. As the wise sages of our industry once said, you don't have a backup until it has been tested in a restore - which is what this system was effectively doing. How would you test without reading?
> The “years ahead” was described in the article. We had what amounts to a very well oiled kubernetes installation with mTLS, but on Windows, in C++ and 10 years ago (before Kubernetes was a thing).
The revolutionary thing about Kubernetes was not what it did, but that it was an enterprise supported open source version of what everyone was doing internally. And not tailored to a specific domain but fully generic.
Both you, I and the other person replying to your comment were running a bespoke system like this in 2014. That's just 3 people semi-randomly meeting in a thread. If you imagine how much engineering work was saved by Kubernetes since then it's mind boggling.
Of course not all of them were as fancy as you're describing yours was; mine was just a collection of bash scripts, but solving that set of problems was definitely a thing for many projects in 2014.
> Both you, I and the other person replying to your comment were running a bespoke system like this in 2014. That's just 3 people semi-randomly meeting in a thread. If you imagine how much engineering work was saved by Kubernetes since then it's mind boggling.
Not if you're the one running the k8s cluster. K8s benefits fall wholly on the side of the devs deploying apps, not ops.
Hell, in early versions the k8s "automation" to build a cluster was so bad it made clusters that would self-destruct after a year, because there was no proper cert management built in. Even now you have huge amounts of code dedicated purely to the "care and feeding" of a k8s cluster. Sure, you can deploy one with one command, but if something goes wrong or something needs to be debugged you're jumping into a swamp.
We already had a CM-automated CA so implementing k8s wasn't really a problem for us (just a bunch of learning), but it absolutely overcomplicates everything involved, because the vast majority of apps just aren't big enough to reap the benefits.
Especially when you need to herd a bunch of daemons just to have proper monitoring or logging inside it. Or how many apps now need a "sidecar" just to get some stats out of them, so you have 20 different apps whose only job is "fetch some stats from an app that doesn't support <current monitoring fad> and push them into <current monitoring fad>".
> The “years ahead” was described in the article. We had what amounts to a very well oiled kubernetes installation with mTLS, but on Windows, in C++ and 10 years ago (before Kubernetes was a thing).
I mean, we had those things back then too; that's why I was curious. But then I guess we're a bit of a minority: we've had near-everything under configuration management for over a decade now.
Doing it under Windows does seem like an achievement; it isn't exactly an OS that's nice for tinkering.
> Everything else you say is true, I could have chosen another solution for backups but the tradeoff between backup speed (replaying WAL can be time consuming) vs database load (full backups delay replication to replicas) was made before I knew the read/write characteristics of the datadomain. (I didn't even know it was a datadomain- nobody told me anything except giving me an NFS mount point until I started having problems, at which point I spent weeks debugging with a storage engineer from Montreal- this article is a 6-8month span of back and forth distilled).
I honestly expected the usual "cherry-picking files is slow, but if you tell the backup software to do a full restore it goes somewhat quickly", but it appears I misjudged how shitty "enterprise" backup software can really be.
I wanted to say "a lot of the fault lies on Ubisoft for not documenting the workings and quirks of the system upfront", but with restore speeds like that there is no excuse, and no amount of warnings could make up for something that bad...
There are days when I wish the company I work for was 100x bigger so I could justify going "fuck that shit, we will just make our own backup system"... we once priced a migration from OSS software to Veeam and it cost more than the running cost of the servers... and a few racks of new servers.
Anyway, our solution when we had that problem one time (backup too big to restore quickly except in "DC burning" cases) was a 2-tiered backup: a "cheapest used box with many HDDs" acting as the first tier, with the second tier being the usual data store. Which seems like what happened in your case in the end.
And yeah, we also had cases where big fuckups caused management to finally spend the money.
> I'd love to see examples of "ahead" coz I literally never saw it.
Game developers were deeply aware of data-oriented design and optimizing code and data structures for efficient CPU cache usage well before I saw most other areas of the industry aware of it. You can find counter-examples, of course, but overall this is an area of performance that many game developers understand in their bones, and that many programmers outside of games, even ones who care a lot about performance, are oblivious to.
Depends on which side you look from. If you look from the embedded realtime development side, it might be game developers being behind and the rest of the industry REALLY behind.
A game is essentially a soft realtime system that needs to deliver a frame every 16.6ms or the gamer is unhappy. Very few pieces of software are optimized the way games are, because it's just not needed that often.
Hi, I’m the author of the article and I’m horrified by how many spelling mistakes I made. I must have written this in a rage.
To answer your question: since we had control of the clients (it's a game, after all), we used a proof-of-work challenge on the TLS handshake which increased in complexity the more failed attempts you gave us.
Very cheap on the server, very expensive on the client, which effectively rate limited connection attempts.
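Something in the spirit of what's described, as a sketch (the real thing was inside the TLS handshake and written in C++; the hash choice, difficulty schedule and sizes here are all made up): the server hands out a nonce and a difficulty that grows with failed attempts, the client burns CPU finding a counter whose hash clears the difficulty, and the server verifies with a single hash.

    import hashlib
    import itertools
    import os

    def make_challenge(failed_attempts: int) -> tuple[bytes, int]:
        """Server side: random nonce plus a difficulty that grows with failed attempts."""
        difficulty_bits = min(8 + 2 * failed_attempts, 28)  # made-up tuning curve
        return os.urandom(16), difficulty_bits

    def solve(nonce: bytes, difficulty_bits: int) -> int:
        """Client side: brute-force a counter until the hash has enough leading zero bits."""
        target = 1 << (256 - difficulty_bits)
        for counter in itertools.count():
            digest = hashlib.sha256(nonce + counter.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") < target:
                return counter

    def verify(nonce: bytes, difficulty_bits: int, counter: int) -> bool:
        """Server side: one hash, so verification stays cheap even under attack."""
        digest = hashlib.sha256(nonce + counter.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))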
Drop incoming SYNs when the CPU hits thresholds. More specifically, every second the cpu is at or above the 'bad times' threshold, increase the probability that you'll drop a SYN; every second the cpu is at or below the 'good times' threshold, decrease the probability that you'll drop a SYN.
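As a sketch of that control loop (the actual dropping would live in the kernel or a firewall rule; the thresholds and step size here are invented): run the adjustment once a second and consult the probability for each incoming SYN.

    import random

    BAD_CPU, GOOD_CPU = 0.90, 0.70   # 'bad times' / 'good times' thresholds (made up)
    DROP_STEP = 0.05                 # how aggressively to react (made up)

    drop_probability = 0.0

    def adjust(cpu_load: float) -> None:
        """Call once per second with the current CPU load (0.0 - 1.0)."""
        global drop_probability
        if cpu_load >= BAD_CPU:
            drop_probability = min(1.0, drop_probability + DROP_STEP)
        elif cpu_load <= GOOD_CPU:
            drop_probability = max(0.0, drop_probability - DROP_STEP)

    def should_drop_syn() -> bool:
        """Consult for each incoming SYN."""
        return random.random() < drop_probability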
I don't think it's possible to do it without "any load". The hardware grabbing the packet, putting it into the socket buffers, and analysing whether it's valid requires some work.
I worked for a university that spent about a million dollars on data storage for a research project. They estimated that they required about a petabyte of storage, so a big expensive storage array was purchased.
In the end the project used just a few terabytes of storage because they miscalculated their requirements.
It was a bit of a running joke that they spent a million dollars, when a $150 SSD would have been fine. They ended up repurposing the storage array (which is when I got involved).
Luckily everyone has now moved to the cloud, and life is better.
> Ubisoft had built an organisation optimised for treating developers like fools
I think this whole write up really says more about how video game developers treat entry and mid-level career developers/programmers/graphic artists.
Because there seems to be an inexhaustible supply of young idealistic naive persons who will take any salary offered and work for 80 hours a week, because they've been offered a job at a big name video game company.
>>Because there seems to be an inexhaustible supply of young idealistic naive persons who will take any salary offered and work for 80 hours a week
As someone who also works at Ubisoft (and I have actually worked with the author), I want to point out that for all the failings of Ubisoft, our work-life culture is top notch and there is an incredibly strong focus on avoiding overtime**. It's drilled into our heads constantly that it's NOT normal to work more than the contracted 7.5h a day; if you so much as send someone an email at 8pm, someone will talk to you to make sure you aren't working late (and will tell you to avoid doing that in the future, because it makes it look like reading and sending work emails at 8pm is a normal thing). In my 9 years here I have worked overtime only a very small handful of times, usually around launches of our project, for a week or two. I haven't logged more than my standard 37.5h/week on any project for literally years now. I manage juniors now and I wouldn't let them work more than that even if they wanted to.
** at least in the studio where I work and the studios I have interacted with - Ubisoft has 40(?) studios across the world and I cannot possibly comment on every single one of them. But it does certainly seem to be the company policy worldwide.
Good old "we spent $1m on it" fallacies. "Exactly, we spent a million on a backup solution that's objectively bad, and we spent 100 times that on the actual code we're trying to secure. So: do we want to have wasted a trivial $1m by redoing the backup part, or do we want to have wasted $100m because of a terrible pretend-backup solution? Because this should be a business no-brainer".
Probably only sort of consistent, I imagine? Game servers tend not to support seamless failover. If the server crashes at the wrong moment, data is going to be lost, regardless of how politely behaved the data backend is.
That is, if you kill a boss and successfully pick up an item, you know it'll be saved to DB. If the game crashes before you can pick it up, it's probably just gone.
> Probably only sort of consistent, I imagine? Game servers tend not to support seamless failover. If the server crashes at the wrong moment, data is going to be lost, regardless of how politely behaved the data backend is.
You could just have, let's say, a "user profile service" where all of the transactions about the user profile (items, XP, etc.) go, on top of the internal game server data.
That way the important stuff could be sent immediately, like "epic or above item drops", and everything else either in a batched update (XP, achievements, etc.), periodically (stuff like the player rearranging their inventory), or at the end of the session.
The way we had it working was local storage (SQLite) and remote storage (PostgreSQL with a middleware).
If you pick something up, it flags an "interesting event" and syncs your profile to local storage. On game server crash, a crash collector syncs the local database with the remote one and removes the lock, meaning you can join a working server.
The crash collector also did things like… well, collect the crash dump and send it for debugging.
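Very roughly, the shape of that flow might look like this (the real implementation was C++ against a middleware in front of PostgreSQL; the SQLite table layout and the remote client calls below are invented for illustration):

    import sqlite3

    def record_interesting_event(local: sqlite3.Connection, player_id: int, profile_blob: bytes) -> None:
        """Game-server path: flush the profile to local storage whenever something notable happens."""
        with local:
            local.execute(
                "INSERT OR REPLACE INTO pending_profiles (player_id, profile, synced) "
                "VALUES (?, ?, 0)",
                (player_id, profile_blob),
            )

    def crash_collector(local: sqlite3.Connection, remote) -> None:
        """Run after a crash: push anything unsynced to the remote store, then release the lock."""
        rows = local.execute(
            "SELECT player_id, profile FROM pending_profiles WHERE synced = 0"
        ).fetchall()
        for player_id, profile_blob in rows:
            remote.upsert_profile(player_id, profile_blob)  # hypothetical middleware call
            with local:
                local.execute(
                    "UPDATE pending_profiles SET synced = 1 WHERE player_id = ?",
                    (player_id,),
                )
            remote.release_lock(player_id)  # hypothetical: lets the player join a working server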
>all of this, written by hand in C++, nothing off the shelf, very minimal dependencies (OpenSSL being the only one of note), everything running on Windows and completely bespoke
Great engineering, but this is time you could have spent building your core competency. Re-inventing the wheel is a good recipe for failure.
In 2014 Kubernetes didn't exist and the alternatives all had downsides. I was in charge of a non trivial deployment then and I also did extensive research and decided to build something bespoke on top of docker.
There's a reason Google had Borg and all the other big companies had their internal systems. Those were not reinventing the wheel, they were just building wheels round enough to hit the highway.
By the time our product launched Kubernetes had come out and we could migrate to that eventually.