bittermandel's comments

Any code base that doesn't use the advanced features of its language(s) is always better in my experience. Heavy usage of e.g. meta-programming in Python, or Uber's fx (dependency injection) in Go, makes projects infinitely harder to get into.


I worked at GOV.UK for a few years on what were effectively specialised CMSs, all written in Rails. Mostly basic CRUD stuff. A contractor came along and built the most insane CMS I've ever seen. I found out later it was a flavor of the Java Repository Pattern using dependency injection. It became so difficult to work with that the last thing I worked on there was deleting it and rebuilding it using standard Rails patterns.

The KISS philosophy exists for a reason, and that includes not over-using advanced language features just to show off.


Besides just KISS, a lot of messes I've seen have come from implementing patterns outside the framework, or implementing complex patterns that didn't add value.

Besides KISS (or maybe as an extension of it), try to keep framework-based codebases as close to the officially documented setup as possible. You automatically get tons of free, high-quality documentation available on the Internet.


I've had a consistent experience in Rails where I think for a day or two I've got a legitimate use case for one of the whacky things you can do... and then as I work the problem more, it turns out: nope, the simple stuff is still the Right Way.

Someday, Ruby shenanigans, someday...

PS - Being able to pull the shenanigans is super useful during the dev process, usually to skip some yak shaving during exploration, so it's nice to have anyway.


Props for gov.uk! I’ve looked at its documentation and design system and see both as peak user experience and clarity.


I disagree with this, but I understand where it's coming from. I think you have a form of whiplash from things like:

- Novices overusing the new shiny.
- Java/C++/etc junior programmers overusing design patterns.
- Perl programmers solving everything with regexes.
- Small startups with GraphQL, or any other large enterprise tool.
- Metaprogramming, macros, dependency injection, recursion, etc. when a simpler solution is a better fit.

IMHO, a "best codebase" will be just a bit more advanced than I am, with good resources for me to grok it. I want to be able to learn from it. I also don't want to be so far out of my depth that I can't make a reasonable contribution.


A pinch of salt can really liven up a dish. Not every dish needs it, but when used appropriately it's almost magic in how much difference it can make.

A lot of salt always makes everything disgusting.


Huh? Salt is one of the most important foundational elements of basically all cooking.


Depends who is providing it.

Django and pydantic meta programming usually make the code easier to deal with.

Meta-programming written in-house usually sucks.


Such a great point. I audibly groan when I come across Python meta-programming.

While they're not an advanced feature, I have a similar response when I see lots of decorators in Python. They quickly become a debugging nightmare.
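
A toy example of the failure mode (the names here are made up, not from any particular codebase): without functools.wraps, the wrapper swallows the decorated function's name, signature and docstring, so tracebacks and introspection point at "wrapper" instead of the real code.

    import functools

    def log_calls(func):
        def wrapper(*args, **kwargs):          # note: no @functools.wraps(func)
            print(f"calling {func.__name__}")
            return func(*args, **kwargs)
        return wrapper

    @log_calls
    def charge_card(amount):
        """Charge the customer."""
        return amount

    print(charge_card.__name__)   # 'wrapper' -- the original identity is gone
    print(charge_card.__doc__)    # None

    # The fix is one line, but every in-house decorator has to remember it:
    def log_calls_fixed(func):
        @functools.wraps(func)                 # preserves name/signature/docstring
        def wrapper(*args, **kwargs):
            print(f"calling {func.__name__}")
            return func(*args, **kwargs)
        return wrapper

Stack three or four of those on one function and stepping through it in a debugger gets painful fast.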


It would be cool if newer (imperative) languages had a clear protocol to delimit the simple core of the language from the advanced, library-building features. Just as Rust has `unsafe` blocks, imagine an `advanced` or `metaprogramming` block that is required in order to use advanced metaprogramming features. Then you could tell a junior to implement $MUNDANE_INTEGRATION without using `advanced`, and that constraint could be statically verified in PRs, etc.

It seems like it would vastly simplify language evolution. The core language can have rigid limits on changes and follow the Zen of Python or whatever, while the `advanced` language extensions can be a bit looser. Features could graduate from advanced to core without breaking anything. You get a single powerful+expressive language with a "porcelain interface" without necessarily giving juniors ammo to shoot themselves in the foot.
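
A rough sketch of what that static verification could look like today, using just Python's standard ast module (the "advanced" rule set and the script itself are hypothetical, only to show the shape of the idea):

    # Hypothetical "no advanced features" lint: flag metaclasses and decorators
    # in files that are supposed to stay in the plain core of the language.
    import ast
    import sys

    def check_file(path):
        with open(path) as f:
            tree = ast.parse(f.read(), filename=path)
        problems = []
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                if any(kw.arg == "metaclass" for kw in node.keywords):
                    problems.append((node.lineno, "metaclass"))
                if node.decorator_list:
                    problems.append((node.lineno, "class decorator"))
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                if node.decorator_list:
                    problems.append((node.lineno, "decorator"))
        return problems

    if __name__ == "__main__":
        failed = False
        for path in sys.argv[1:]:
            for lineno, kind in check_file(path):
                print(f"{path}:{lineno}: advanced feature used ({kind})")
                failed = True
        sys.exit(1 if failed else 0)

Run something like that over the junior's PR in CI and you get the "implement $MUNDANE_INTEGRATION without `advanced`" constraint enforced mechanically, just without the language-level blessing.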


I think the D language fits your description very well.

You get GC by default, Python-like and intuitive programming constructs, and for the most part D compilation is much faster than C++ or Rust compilation as long as you stick to the core non-CTFE features. Heck, you can even use D as a compiled scripting language with a REPL using rdmd [1].

Then if you want to go gung-ho, you can use the other advanced features like D's excellent template meta-programming, CTFE, bit-banging inline assembly, etc. D has so many advanced features that can be explored later on; even modern C++ is playing catch-up with D in every release. Nowadays the D compiler also supports C natively [2], and the GDC compiler is included in the venerable GCC ecosystem.

There are universities that teach software development and engineering classes with D due to its managed, on-demand and gradual complexity, unlike the "in your face" complexity of big-kahuna programming languages like C++ and Rust [3]. And like other compiled languages, you can build real-time systems and even firmware device drivers in D, unlike Python.

[1] https://dlang.org/areas-of-d-usage.html#academia

[2] Adding ANSI C11 C compiler to D so it can import and compile C files directly:

https://news.ycombinator.com/item?id=27102584

[3] Teaching Software Engineering in DLang [pdf]:

https://news.ycombinator.com/item?id=37620298


I think one should not use advanced language features just because, but I also think one should not avoid using advanced language features where they are useful.

Why would the code base be worse when advanced language features are used?


Because unless you hire steadily more intelligent developers you will be headed towards a mass of code that is hard and scary to change.


I can see a few cases where that depends...

Really simple languages: Ruling out meta-programming is really going to limit you in Lua for example. Just being able to do `mySocket:close()` instead of `Socket.close(mySocket)` involves meta-programming.

Older languages: For C++ the "simple" features are going to include raw pointers and macros. Maybe it's not so bad to allow smart pointers and templates to avoid those.


Both of these are examples of an underpowered core though. Lua is notorious for lacking batteries, so everyone has to reinvent their own. There's literally no serious Lua program without some sort of classes, but they still resist adding them to lauxlib.


I worked with someone who insisted on using fx for DI in Go. It's so antithetical to the entire Go philosophy, I don't even think it's an "advanced feature". It's just bringing Java cruft to a language where it isn't necessary and making everything worse.


Having just removed a metaclass from my Python code, I totally agree.
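
I don't know what yours did, but for what it's worth the classic case, a metaclass kept around just to register subclasses, can almost always be replaced with __init_subclass__ (available since Python 3.6; the class names below are invented for illustration):

    # Before: a metaclass whose only job is to keep a registry of subclasses.
    class PluginMeta(type):
        registry = {}

        def __init__(cls, name, bases, namespace):
            super().__init__(name, bases, namespace)
            if bases:  # skip the abstract base itself
                PluginMeta.registry[name] = cls

    class Plugin(metaclass=PluginMeta):
        pass

    # After: same behaviour, no metaclass, far easier to follow in a debugger.
    class PluginBase:
        registry = {}

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            PluginBase.registry[cls.__name__] = cls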


We're self-hosting Neon with an internal Kubernetes operator (keep your eyes out for more info) and we're incredibly happy with Neon's technical solutions. I'm not sure we'd be able to build our company without it :o


What's your added value in comparison to using plain Postgres, for example with the CloudnativePG operator for k8s?


By far the biggest is being able to scale it on top of Ceph while getting NVMe-disk performance on the Pageservers. This enables us to increase efficiency (aka cut costs) by 90%+ in a multi-tenant environment. Of course we don't get the same out-of-the-box experience as CNPG, but considering we have the engineering capacity to build those parts ourselves, it was a no-brainer!


Thanks for the explanation! This sounds like the right use case to use something more complex.

I'm always hesitant to add additional technology to the stack if it doesn't provide a bulletproof benefit. A lot of use cases are perfectly fine with plain Postgres, and I'm always fighting against polluting the stack with additional unneeded complexity.


Sounds like a good use case. Do you have any benchmarks or numbers you could share regarding the performance of the database (especially disk writes and reads)?


I'd love to learn more about what you're doing! If you haven't already, please send a message. Definitely want to hear lessons learned with your k8s operator.


Are you writing your own control plane or using neon_local?


Would love to see this!

Are you able to handle branching as well?


(From molnett.com) For those of us who self-host, the architecture of building cold storage on top of Object Storage and warm storage on top of NVMe drives is unbeatable. It enables us to build a much more cost-effective offering while keeping Postgres as our database.


Interesting timing considering the on-going process of unionizing.


This is one bit I don't understand. You either unionize or don't. If you do, you do it fast or you lose it.

All management hates unions, and especially the management of SV companies. They will do everything to crush it. So you either do it fast or lose employees as the management will attack first.


I don't see how it's related. If anything this will accelerate that.


It will be interesting to see whether this reinvigorates the unionization push that followed the layoffs early this year.


I have been waiting for this for a while. Their OSS code is of the highest quality and I'm hoping the hardware is at the same level!


Unless one is fine with relicensing.


Thank you so much for this write-up. It reminds me strongly of when we were exploring GKE container-native routing. It really makes things so much more efficient being able to route directly to the pods, rather than passing through all these Service layers.


We use Kata Containers to create Firecracker VMs from Kubernetes. Works really well for us. Though I am hoping there will be a more Firecracker-specific solution, as we don't need any other runtimes (which kind of defeats the purpose of Kata).


I was looking at this a while back and it was difficult/tedious at the time (and I think also not supported on Amazon EKS).

Has it gotten better? Any resources you recommend? Cursory Google searches on this have so much outdated info it can be hard to quickly wade back in.


Doesn't he do just that?

> They're already a well-explored user experience problem in existing products like Gerrit and Phabricator.


This doesn't look like an outage at all.

> Diagnosis: GKE customers using Kubernetes version 1.25 may experience issues with Persistent Disk creation or deletion failures.


Agree. The word outage isn't used in the notice. They say 'incident' which is more accurate. The title should be changed.


Don't read into announcements like this too much. Status pages and outage notices are often political.

Status pages are rarely dynamic and updates require blessing from upstairs. And more often than not complete outages are referred to as "degraded performance affecting some users".


I don't know how status pages work at Google, but I do work in reliability engineering and I sometimes make recommendations to update the status pages.

Some context before I go on: reliability is often measured by mapping critical features to services and their degradation. This gets more challenging as a feature starts to map to more than a couple of services and those services begin to have dependencies. When your reliability on average can be measured in its number of nines, as opposed to its significant preceding digits, your signal-interpretation game has to step up significantly. These two situations make it infinitely more complex to state whether a given service degradation in a chain of services is truly having external customer impact at a given time. That's why a human needs to make the call to update the status page, and why status page availability numbers are different from internal numbers.

I spend a good portion of nearly every sprint hunting down systemic issues that'll pop up across the ecosystem of services from a bird's-eye view. Often, knowing whether external customer impact will be felt for a given series of errors relies heavily on knowing the current configuration of services in a chain, their graceful-failure mechanisms, what the failure manifests as client-side, and whether that failure is critical to an SLA.

I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.


> I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.

The status page is tied to public SLAs = impact on $$$. Internally you can track anything. What's public is the problem.


No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.

The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.


> No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.

And how can I as a customer calculate this? We're not going to sue each time there's a breach of SLA to get the real data. Whatever the status page says will trigger customers to decide if they should claim SLA credits. A lower number (from a delayed status page update) will skip payouts or reduce them.

> The status pages purpose is generally to head off a flood of customer reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.

That's what you assume and that's what it's supposed to be. It's long been abused otherwise. Amazon, for example, requires explicit approval to update the page. They and others have famously delayed updating the status page as long as they can get away with, often not even calling it an outage. It will say something like "increased error rates".


Five nines of availability works out to roughly 5 minutes of downtime per year; you can calculate up and down from there. If you don't want to do the conversion from percentage to minutes, there are lots of calculators like this one: https://uptime.is/five-nines
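
A quick way to sanity-check those numbers yourself (this is just the standard availability arithmetic, nothing vendor-specific):

    # Downtime budget per year for N nines of availability.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines in range(2, 6):
        availability = 1 - 10 ** -nines           # e.g. 3 nines -> 0.999
        downtime_min = MINUTES_PER_YEAR * (1 - availability)
        print(f"{availability:.5%} -> {downtime_min:,.1f} minutes/year")

    # 99.00000% -> 5,256.0 minutes/year
    # 99.90000% -> 525.6 minutes/year
    # 99.99000% -> 52.6 minutes/year
    # 99.99900% -> 5.3 minutes/year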

I wasn't assuming what status pages are used for; I was speaking to my experience working in reliability engineering. I can't speak to Amazon's practices as I've not worked there, but when I've seen this happen it's because we struggled to identify customer impact. The systems you're talking about are vast, and a single application or even a subset of applications reporting errors doesn't mean there's going to be customer impact. That's why I mentioned it usually takes a human who knows that system and its upstreams to know if there'll be customer impact from a particular error.

I'd encourage you to read the wording of an SLA in a contract. They're often very specific in terms of time and the features they cover. Increased error rates tells me you'll probably run into retry scenarios, which depending on your contract may not actually affect an SLA. Error rates are generally an SLO or an SLI, which are not contractually actionable.


> And how can I as a customer calculate this?

Either your shit works, or it doesn’t. You do monitor, don’t you?


> Either your shit works, or it doesn’t. You do monitor, don’t you?

That then becomes a he-said-she-said problem with the vendor you're claiming against. Does everyone have time for that? You will submit the SLA credit claim and, chances are, unless it's WAY off you'll accept the vendor's nerfed version and move on. Something is better than nothing.


I'm an SRE and I've seen it firsthand at multiple companies.


Not sure why you're being downvoted. Status pages for big companies are never hooked up to automation. It's just bad PR to show red across the board.

If there is a networking outage, everything on the status page should be red, but that looks bad for PR. So you just mark "networking outage" while everything else stays green, even though everything is realistically down.


It's also not only bad PR; CSPs are subject to SLAs.


Cloud providers (and everyone else) are unfortunately always downplaying their incidents, so I don't trust that information. I have no idea about this particular case though, since I'm not a GKE user.

Would be interesting to hear from actual users how serious this is.


I run GKE clusters in europe-west2. Completely unaffected. I run on the latest k8s version. I have an uptime monitoring service which runs in AWS and it reports zero downtime over the last 24 hours.


> Completely unaffected. I run on the latest k8s version.

The announcement mentions specific version: "issue with Google Kubernetes Engine impacting customers using Kubernetes version 1.25"


Well yeah that's kind of their point. GKE makes it pretty easy to stay on a recent version.


We're using their "stable" release channel, which means most of our clusters are currently on a 1.25 version. We only have a few deployments using PVCs, so the impact is pretty minimal for us anyway, as far as I can tell.


Even ignoring the fact that this only affects one old version, and that Google makes it very easy to upgrade (and to know when you're behind), I can't imagine this is affecting many workloads. The vast majority of the workloads I run (and, I think, of workloads run generally) don't rely on PVs. And even then, you're generally not creating wholly new volumes all the time; you're mostly just consuming the already existing volumes.


A big part of the Kubernetes value proposition is “autoscaling”, used in the loosest sense. Pods will come and go over time, in response to events, etc. as part of normal operations for many systems.

If I still had to deploy to Kubernetes, I’d consider this an outage.


I would expect the Venn diagram of:

1) actually using autoscaling in prod

2) using it for a stateful workload (with PV’s)

3) needing to scale that workload during this time window

4) using this specific k8s version

To be razor thin.

I’m honestly surprised they’ve reported the impact on this dashboard; but good to see they did.


It seems like 1.25 is a month and a half out from retirement, maybe it's related to that.

Kubernetes 1.25 (in Danny Glover voice): I'm getting too old for this shit!


Autoscaling isn't necessary to run into downtime here.

Need to deploy a dev copy of your cluster? How about the CI/CD pipelines that are blocked? Go sit on your hands and wait it out. Almost 2 days now.


Woah, I completely missed that this has been going on since the 13th?! That’s outrageous, surely something can just be rolled back.


This isn't preventing pods from moving from node to node, and for a lot of workloads it wouldn't even prevent making new pods. The issue is that if you make a new storage claim (a PVC), the underlying storage that fills that claim isn't being created at the moment, for one particular version of Kubernetes. Existing volumes are not impacted. Every other version is not impacted.

Loads of pods don't even use PVCs. In the several hundred deployments I routinely manage, there's only a handful of PVCs, and they aren't exactly dynamically created and destroyed. I've got many GKE clusters and no workloads I'm running are affected, and I imagine most existing workloads in GKE clusters aren't affected. I'm then also doubly not affected, because I'm not running 1.25.
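
For concreteness, the thing that's broken is dynamic provisioning when a brand-new claim is made, i.e. roughly this (a sketch using the official kubernetes Python client; the claim name and storage class are placeholders, not anything from the incident notice):

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside a pod

    # Creating a new claim like this is what seems to fail on 1.25: the disk
    # that should back it doesn't get created. Already-bound volumes keep working.
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="demo-claim"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="standard-rwo",
            resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)

If your workloads never issue new claims like that, you'd never notice.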


Yeah, as a user of that service you may or may not be affected by this particular problem.

