The Safety Boat: Kubernetes and Rust (microsoft.com)
219 points by DeathArrow on May 2, 2020 | 96 comments



It's a bit weird to see "several weeks" of effort being described as a problematic learning curve. At least the blog post makes it clear that the effort pays off hugely, but still, "several weeks" is not rocket surgery. It's not learning Haskell or category theory! ISTM that they're just running with an assumption that most devs wouldn't be professional and committed enough for this, which strikes me as an unwitting gatekeeping attitude.


Go has a shorter time, and that’s the measuring stick in this area.


Go appears to have a shorter time, because you don't realize how much higher-level stuff you're just expected to do The Right Way, with no support from the language or libraries. So you're free to think you've finished learning Go, but then the actual learning begins.


Agreed. Worked with a senior who "learned Go in a week" and then faffed around for months deploying broken software straight to prod (he had the implicit trust of management because he was "a genius"), with tons of concurrency bugs, because he didn't know how to manage shared state.


Even accounting for what you say, it's a short learning curve compared to most languages. It's not like there's any programming language in the world where you just read the manual through once and, boom, instantly you know exactly how to architect a multi-person-century project right out of the gate or something.


Hmm, that's a bit of a straw man you're making here. No one is arguing better languages will magic away complex architecture problems; we're talking about basic state management here... garden-variety implementation details.


State in the context of concurrent programming is not a "garden variety implementation detail." It's the Great White Whale of our industry. No language does it especially well. I have my favorites in this arena, but it's still hunting a large sea mammal with a harpoon. What you're suggesting is essentially avoiding hunting it altogether. We're not there yet.


This is very true. I recently ported large portions of Go code to Rust, and while it was somewhat of a steep learning curve and a fight with the compiler, in the end I felt very comfortable that the result is correct, fast, and had good abstractions.

Go was so "easy" that I was immediately productive, but this resulted in often suboptimal and messy code that I had to refactor over and over again due to concurrency, abstraction, or performance issues.

True productivity is hard to measure.


Right, it's a false economy. Though, to be fair, the Java/C# ecosystem has the same problem to an even greater degree. Not to mention popular "dynamic" languages such as Python, Ruby etc.


"The Right Way" lies in a wide spectrum and depends on the project.

It's often about how fast one can deliver maintainable software that works well enough.


Learning should pay off. Bragging that one can learn your language quickly is like bragging that your toolbox is nearly empty.


Go is the gold standard for extracting the most value out of inexperienced computer science grads, but it is not the measuring stick, not by a long shot.


Java did it first, catching up with 1996 here.


Interestingly enough, Java caught hell from people for "pandering" to "average" programmers, where Go seems to be getting kudos for the same thing. Strange times.


I've only ever read through Go tutorials but never used it. I've used Java in a small capacity, and while I found some things convenient, other things were pointlessly restrictive. For example, the syntactic overhead of having to wrap everything in a class: you can't even have a global variable without typing your fingers sore.

Pretty sure that Go isn't equally restrictive (while still being garbage collected).


Java was derided as being for people who thought C++ was too hard. But those people were right. C++ is too hard, in that "undefined behavior" demands an inhuman degree of perfection.


And now Go is being praised for being for people that think Java and Python are too hard.

It is a better option than continuing to use C, and it would have been great in 1996, but that is about it.


Learning a new framework in a familiar language might take a few weeks. A few weeks for a new language is really fast!


But with rust you have to re-learn the language every six weeks as it changes so fast. (/s)


I think learning Rust will be harder than Haskell for a lot of people.


I have learned both and agree with this statement. I think that Rust is harder to learn if you've only worked with high-level, GC languages, and don't have a background doing lower-level programming in C/C++/Obj-C, as well as some experience with functional languages.

Going from something like Java or Python to Rust, one would have a lot to learn.


Hmm I don't think so. Haskell is so fundamentally different to imperative languages that it requires basically throwing away everything you know about programming.

To learn Rust you simply need to understand how values are kept track of by the compiler. Once you develop an intuition for this it's the same as any other modern imperative programming language.
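
To give a flavor of that (a toy sketch, nothing to do with the article's code): the compiler tracks which binding owns a value and when it may be borrowed, and rejects uses that violate that.

    fn main() {
        let s = String::from("hello");
        let t = s;             // ownership of the String moves from s to t
        // println!("{}", s);  // error[E0382]: borrow of moved value: `s`
        let r = &t;            // a shared borrow; t stays readable
        println!("{} {}", t, r);
    }                          // t is dropped here, freeing the String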


> we caught a significant race condition

It is a data race, not a race condition.

> and which passed the race checker for Go

No, it is not. https://github.com/helm/helm/pull/7820#issuecomment-60436062...

There is a comment by the issue author that is literally a Go data race detector warning, like "WARNING: DATA RACE".


Data races are a kind of race condition, no?


I can convince myself that data races need not be a race condition. Consider this simple program:

    var i int
    doneCh := make(chan struct{})
    go func() { i = 1; doneCh <- struct{}{} }() // a
    go func() { i = 1; doneCh <- struct{}{} }() // b
    <-doneCh
    <-doneCh
At the end of the program, i is always equal to 1 no matter which order a or b wrote to i. But it's a race because you are assigning to a shared variable without synchronization. A small modification to the program creates a race condition:

    var i int
    doneCh := make(chan struct{})
    go func() { i = 1; doneCh <- struct{}{} }() // a
    go func() { i = 2; doneCh <- struct{}{} }() // b
    <-doneCh
    <-doneCh
Is i 1 or 2? It depends.

It is correct for the race checker to complain about the first program, because after a bit of hacking the first program can very easily change into the second program.

(And I tried it, and it does complain.)


I wasn't sure. After a bit of research, this seems to be a debate [1]. Using the common definitions, it's possible to have a data race that doesn't cause a race condition. [2] It's also possible to have a race condition without a data race.

[1] https://en.wikipedia.org/wiki/Race_condition#Data_race

[2] https://blog.regehr.org/archives/490


I would argue that "a data race that doesn't cause a race condition" is still, itself, a tiny race condition- just a contained one.

But you're right, this is just choice of terminology. :)


Also, to be clear, by “we” they really mean “a contributor”


IIUC, the point is that the code has been in prod for a year, but the race detector only just now found the bug? But I could be wrong.


That's right: the race detector is not enabled by default, and you have to explicitly run tests with it or tell the compiler to enable it. It is a run-time check, not a compile-time one.

But still, it does detect this error.
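
For anyone who hasn't used it, a minimal made-up test that the detector flags when run with `go test -race` (the names here are invented for illustration):

    // counter_test.go -- run with: go test -race
    package counter

    import "testing"

    func TestUnsynchronizedWrites(t *testing.T) {
        n := 0
        done := make(chan struct{})
        go func() { n++; done <- struct{}{} }()
        go func() { n++; done <- struct{}{} }()
        <-done
        <-done
        // Without -race this usually passes; with -race it prints
        // "WARNING: DATA RACE" for the unsynchronized writes to n.
    }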


It is a data race. I'm guessing the race detector (go test -race) didn't detect it because they are layering multiple synchronization primitives (mutexes, channel i/o, and a WaitGroup) and their tests hit the "good" code path but production workloads didn't.

Here's what happens. Delete takes a ResourceList. It delegates to "perform" and then "batchPerform". perform calls batchPerform in a separate goroutine, which calls a helper function in another goroutine for every resource in the ResourceList. The helper function is defined in Delete and updates a data structure defined in Delete. This is a classic case where some synchronization is necessary. The function runs multiple times in multiple goroutines, and updates a single shared structure. (Perhaps not obvious because it delegates to two helper functions, and the list that the function is executed on is a "ResourceList" not a []Resource, so it isn't clear that there is a "for { go func() }" loop anywhere; the programmers did their best to make it non-obvious that a loop is occurring.)

The confounding factor here is that batchPerform tries to synchronize with a WaitGroup, but it's faulty and not enough to protect the data integrity. batchPerform creates a WaitGroup, but only calls Wait() on the WaitGroup when the "kind" of an individual resource is not equal to the "kind" passed to batchPerform. I am guessing that it's very natural to craft some test data where this condition is met, and the for loop in batchPerform only runs the function once at a time (perhaps a ResourceList of length 1). In that case, there is no race condition for the race detector to detect.

All in all, if I were reviewing this code, it would not be checked in in its current form. Splitting perform and batchPerform doesn't make sense to me, and they both implement faulty synchronization logic in a slightly different way. (batchPerform uses "for { wg.Add(); go f() }; wg.Wait", perform does "for range x { go func() { ch <- f() }() }; for range x { <- ch }". I consider these pretty much exactly equivalent, but neither prevents f() from running concurrently with itself.) The only reason this passed the race checker is that batchPerform doesn't actually use the WaitGroup in the normal way, instead degrading to "for range x { wg.Add(); go f(); wg.Wait() }", which DOES prevent f() from running concurrently with itself, with certain inputs.
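
To make that concrete, here is a self-contained sketch (invented names, not the actual Helm code) of the two loop shapes; the only difference is where Wait sits, and that decides whether the worker ever overlaps with itself:

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        items := []string{"a", "b", "c"}
        work := func(s string) { fmt.Println("deleting", s) }

        // Shape 1: start everything, then Wait once.
        // work runs concurrently with itself, so shared state it touches needs a lock.
        var wg sync.WaitGroup
        for _, it := range items {
            wg.Add(1)
            go func(it string) { defer wg.Done(); work(it) }(it)
        }
        wg.Wait()

        // Shape 2: Wait inside the loop.
        // Each call finishes before the next starts, so nothing ever overlaps,
        // and the race detector has nothing to report even if callers assume shape 1.
        for _, it := range items {
            wg.Add(1)
            go func(it string) { defer wg.Done(); work(it) }(it)
            wg.Wait()
        }
    }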

The root cause is that the caller of Delete isn't really sure about the semantics of "perform". Does it protect the body of the callback function? There is no documentation, and the author thought "yes". But the answer was "no". In general, the convention in go is to consider something thread-unsafe unless it's marked as thread safe. When you see something like "var foo Foo; f(list, func(bar){ foo = bar })" your spidey sense should be concerned about synchronization. But in this case, the code went out of its way to hide the existence of a loop and the existence of parallel processing, and so the programmer made a mistake. A bug or at least VERY confusing use of WaitGroup in batchPerform allowed the tests to pass. Should the compiler detect this? It would be nice. But a code reviewer should have been super concerned about this implementation.


The bug they caught [1] is one of the reasons some languages require you to explicitly name your captured variables. You still could have typed that code in, especially if you started with a for loop and then made it parallel (fwiw, perform should have been named something clearly suggesting it was parallel), but you'd at least be confronted with "oh, you went from serial, local state to a capture. Still think it's okay to explicitly borrow that state from this scope?". Then again, that's the point of Rust here :).

Fwiw, it's too bad the commit message didn't say something like "Since we're doing delete on many resources in parallel, we need to hold a lock while updating errs/res.Deleted". The reviewer was also obviously confused at first.

[1] https://github.com/helm/helm/pull/7820/commits/edb2b7511bcb9...


“For comparison, last week we caught a significant race condition in another Kubernetes-related project we maintain called Helm (written in Go) that has been there for a year or more, and which passed the race checker for Go. That error would never have escaped the Rust compiler, preventing the bug from ever existing in the first place.”

I’ve heard people brag that Haskell is a great language because it’s supposedly easier to write correct code.

Rust has this same reputation?


Yes, though I believe Rust has already proven this more in practice than Haskell has.

Rust has many advocates now at places like Mozilla, Amazon and Microsoft that have delivered critical software in Rust that they believe has made it safer.


Yes. The common blurb is: In safe rust the borrow checker encourages 'fearless concurrency' by statically preventing all data races.
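
A toy example of what that means in practice (not from the article): the commented-out version is rejected at compile time, and the accepted version has to spell out the sharing and locking.

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // Rejected at compile time: thread::spawn needs a 'static closure, and two
        // closures can't both borrow a local counter mutably anyway.
        // let mut n = 0;
        // thread::spawn(|| n += 1);
        // thread::spawn(|| n += 1);

        // Accepted: shared ownership (Arc) plus a lock (Mutex).
        let n = Arc::new(Mutex::new(0));
        let handles: Vec<_> = (0..2)
            .map(|_| {
                let n = Arc::clone(&n);
                thread::spawn(move || *n.lock().unwrap() += 1)
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        println!("{}", *n.lock().unwrap()); // always 2
    }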


It's harder to formally prove Rust code compared to Haskell. The company I work at prototyped in Rust and then used the domain knowledge gained to improve the parallel implementation in Haskell.


Yes, for much the same reasons. Pretty much any ML-family language has the same effect; just having proper sum types, polymorphism and first class functions (and not having null) goes a long way to preventing huge classes of bugs.
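
For instance (a toy sketch): with sum types and Option instead of null, both "which case is this" and "might be absent" become things the compiler forces you to handle.

    #[derive(Debug)]
    enum Payment {
        Card { last4: String },
        Invoice { days: u32 },
    }

    fn describe(p: &Payment) -> String {
        // The match must cover every variant, or this won't compile.
        match p {
            Payment::Card { last4 } => format!("card ending in {}", last4),
            Payment::Invoice { days } => format!("invoice, net {} days", days),
        }
    }

    fn main() {
        // No null: absence is spelled Option and has to be matched explicitly.
        let maybe: Option<Payment> = Some(Payment::Card { last4: "4242".into() });
        match maybe {
            Some(p) => println!("{}", describe(&p)),
            None => println!("no payment method on file"),
        }
    }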


The critical feature enabling fearless concurrency is Rust's borrow checker though, something that the other ML languages don't have.


In practice you have fearless concurrency in every other ML language I know, because they're all immutable-first. It's true that if you wrote some code that mutated data then it wouldn't be concurrency-safe, but why would you do that?


I think the common saying is that once your Haskell code compiles, it's usually correct.

The fine print is that nobody claimed it's easy to write Haskell code that compiles.



> One of the biggest ones to point out is that async runtimes are still a bit unclear. There are currently two different options to choose from, each of them with their own tradeoffs and problems. Also, many of the implementation details are tied to specific runtimes, meaning that if you have a dependency that uses one runtime over another, you’ll often be locked into that runtime choice.

My understanding of how async/await works in Rust is that you can have multiple async runtimes in one Rust program. Is that not the case?


That is the case, but it's super awkward to use. Basically, you cannot await a tokio future on an async-std runtime, or an async-std future on a tokio runtime. You can, however, have both runtimes running at the same time, and use some form of message-passing to bridge them.

It's definitely easier to only deal with one runtime. Ideally, we should have some kind of abstraction to allow crates to support both runtimes (e.g. a trait that'd allow creating an async TcpSocket of the right "kind" for your runtime), but AFAIK this is not currently done.
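
As a rough sketch of the shape such an abstraction might take (entirely hypothetical; nothing like this is standardized today), a library could code against a trait and let the application plug in a tokio- or async-std-backed implementation:

    use std::future::Future;
    use std::io;
    use std::pin::Pin;

    // Hypothetical trait a crate could depend on instead of a concrete runtime.
    pub trait Connector {
        type Stream; // would be bounded by async Read/Write traits, also not standardized yet

        fn connect<'a>(
            &'a self,
            addr: &'a str,
        ) -> Pin<Box<dyn Future<Output = io::Result<Self::Stream>> + Send + 'a>>;
    }

    // Library code written once, usable with any runtime's Connector implementation.
    pub async fn dial<C: Connector>(conn: &C, addr: &str) -> io::Result<C::Stream> {
        conn.connect(addr).await
    }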


> AFAIK this is not currently done.

That's correct; we're still working on these abstractions. It's the end goal that most folks have in mind, though.


Can you explain and/or link to the issues? I thought Futures were the abstraction that lets you choose a runtime?


Futures are part of the answer, and more specifically the way that the Wakers passed to Future::poll use dynamic dispatch to re-schedule the task.
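
Roughly (toy sketch, not how you'd normally write a future): a future only ever sees the Waker handed to it through Context, and calling wake() on that handle later is what tells whichever executor owns the task to poll again.

    use std::future::Future;
    use std::pin::Pin;
    use std::sync::{Arc, Mutex};
    use std::task::{Context, Poll, Waker};

    #[derive(Default)]
    struct Shared {
        done: bool,
        waker: Option<Waker>,
    }

    struct Signal(Arc<Mutex<Shared>>);

    impl Future for Signal {
        type Output = ();

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            let mut s = self.0.lock().unwrap();
            if s.done {
                Poll::Ready(())
            } else {
                // Stash the executor-provided waker; whoever later sets done = true
                // calls waker.wake(), and the owning executor re-polls this task.
                s.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }

The side that finishes the work flips done and calls wake() on the stored Waker; nothing in that contract names a particular runtime.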

Other major abstractions that are missing so far include async versions of the Read and Write traits, a Stream trait for the async equivalent of the Iterator trait, and perhaps a way to spawn new tasks.

This series of interviews covers these in more depth: http://smallcultfollowing.com/babysteps/blog/2020/04/30/asyn...


> That is the case, but it's super awkward to use.

That's really the case for any language where an event loop is not part of a built-in runtime (like it e.g. is with JavaScript or Dart). E.g. in C++ we also have Boost.Asio, libuv, libevent, Wangle, Seastar, the GUI framework event loops in GTK, Qt, etc.

The thing is once you are in async land, nothing is interoperable anymore in most environments. Whether that's ideal or not is a separate discussion.

What I observe, however, is that Rust users raise a lot more concerns about interoperability than I've seen so far in other ecosystems. It might stem from the fact that those users often have never used another native async environment.


IMO it’s more that the Rust community has fostered a culture of doing things carefully and doing them well whenever possible (I mean, it’s the language that will argue with you for hours over reference lifetimes, after all).


There are certainly high expectations in the Rust community about doing things perfectly. But I don't think those "async ecosystem" discussions are a good example of productive discussions. I think e.g. in C++ there has been far more expert talk on standardization, within expert groups, like for the standardization of executors or the networking TS. And yet after 5 years or so nothing has been standardized yet.

In Rust the number of people that actually work on the low-level details and try to make things better is likely < 5. But there are a lot of expectations from everyone else about having perfect interoperability.


I was looking at smol and it seems to have a good pattern for working with all the other runtimes.


Does anyone here have any experience using Kotlin and can compare concurrency (with coroutines) to either Go or Rust? When I was doing more Java I really liked the approach Kotlin took with concurrency, but reading the comments here I'm sure I didn't understand the issues at the depth that is needed.


It's possible; the downsides are a bit silly when you look at how the Future trait was designed to allow tasks to be runtime-agnostic.

There is ongoing work to standardize more runtime interfaces so that more libraries can be runtime-agnostic.


In theory you can, but in practice it would make your code very messy. If your dependency is using runtime A and you are using runtime B, how would they interact? Runtimes like tokio also provide convenience macros for your main method, kind of locking you into them (I think), at least for that codepath.

If your application has two parts or binaries that are completely separate you could potentially use two different runtimes, but otherwise I don't think it would make sense. And even then, it would just be a mess.

Right now, your runtime is essentially picked for you by your dependencies.
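
For example, with tokio (assuming a dependency with the macro and runtime features enabled in Cargo.toml) the typical entry point looks like this, and everything awaited under it runs on tokio's runtime:

    #[tokio::main]
    async fn main() {
        // tokio::spawn hands the task to the runtime the macro set up above;
        // a future built on a different runtime's I/O types can't just be awaited here.
        let handle = tokio::spawn(async { 40 + 2 });
        println!("{}", handle.await.unwrap());
    }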


Going off on a tangent, but this exact problem would be a worst-case scenario for Go getting user-defined generic types instead of only the current blessed ones.

t. C++ developer with a mixed std::string/QString/BSTR codebase.


Well, that's the current state of Go if the half dozen blessed versions don't fit your use case. Everyone just writes their own slightly incompatible versions.


> t. C++ developer with a mixed std::string/QString/BSTR codebase.

This has absolutely nothing to do with generics.

Neither std::string nor QString is a generic. They are just examples of historical alternative implementations that exist for "reasons" (portability/speed) and that teach a lesson for the long term.


It's also unavoidable. It's not like you can pass UTF-8 Go strings to UTF-16 COM interfaces. Somebody wrote code to convert a Go string to a BSTR and vice versa. You can do the exact same thing for std::string if you want.


But strings are not generic types...


So much this. Have seen apps where there were 4 diff string types brought by external dependencies and then like 3 or 4 more to deal with from diff Windows APIs.


I don't believe for one second that it takes just a couple of weeks to an average SE to be proficient in Rust.


It really depends on so many factors it’s extremely hard to tell. We’ve brought folks at Cloudflare up to speed roughly that fast.

“average” and “proficient” are both very variable in that statement, imho.


They started with >1 Klabnik units, and every person you bring up to speed creates a larger pool of folks to lean on for support.


I can’t take credit here, while I am around to answer questions, getting folks going is not my job.

It is true that we have a chat room with a bunch of folks, of which I’m part.


I wouldn't discount what GP is saying though: having one (or a few) experts on hand from the start can help train the first new convert "the right way", and they can then mentor the next one, and so on.

Even just by being available to answer questions or help with code review. Doing some pair programming sessions would probably be useful too.


Oh yeah, it’s helpful for sure. I just don’t want to take too much credit!


For me it was 3 to 4 months; I had switched from Golang to Rust. It's been 8 months now and I believe that I have the hang of things now.


Depends on your definition of average: I found that to be the case for someone with significant programming experience in traditional languages (notably not something like Haskell), so I think it's plausible, since the compiler, editor, and documentation are rather above average for newcomers. In particular, Cargo providing a lot of easy tooling and the compiler providing really helpful error messages seemed to shorten the time to a first real program which does something useful.

Edit: one other big factor - presumably in their environment you have coworkers to get advice from. That’s huge when you’re first starting.


And your definition of “proficient”!


Yes, I’m using it in the sense of “can successfully develop a program which does the job” with the assumption that it’ll still take more time to do it quickly, use more advanced techniques, etc.


We became pretty comfy with it in less than a month in Aug 2017. (Let's say the average person had a few years of Python and this-and-that before that, and a ~5-year CS degree before that.) Sure, there was no async/await anywhere yet, but no crossbeam channels either. And there were a lot fewer friendly tutorials and a few more rough edges. (Especially since we did "IoT", so cross-compiling was ... an experience.)


IME it takes 2-3 months for a talented senior developer to get comfortable with it.


"Several weeks of hard effort", they said. I can buy it, if they actually work hard and are basically competent. Rust is a difficult language in total, that's for sure, but you can get a lot done without knowing it all.


After reading this article, I'm excited about finding a reason to write a component in Rust and WASM. Can anyone recommend the best getting-started guide for dipping your toes in the water? This article didn't have a link to anything that seemed appropriate for that goal.


It is exciting to see Microsoft putting so much effort into Rust and WASM.

The Rust onboarding experience is incredibly explicit and once things start to click and code compiles, you're on the train.


I looked into WASM / WASI last week but couldn't find an answer to this anywhere: can I write a network service in Rust and compile it to WASM / WASI?

I know that wasmtime can execute a WASM module and give it access to a file system. Can that filesystem contain a socket that the WASM module can interact with?


Very curious as to why you would want to do that? If you want a network service, WASM does not seem to help with much, only complicate things?


https://wascc.dev/ has done some work there.


There’s a version of NGINX that’s compiled to WASM.

Conceivably you could compile all of the CPython runtime into WASM, just that you’d be left with a big binary that gets passed around all the time over the wire.


You could speak FastCGI (or plain HTTP) over stdin/stdout, although that won't get you accept(2) semantics without some other kind of layering.


Good article but it somehow suggests that because there is no garbage collection you would need to fight the borrow checker. This is not fully true because you could put your data in a Box (so that it is stored on heap instead of the function's stack) and you can wrap it in a mutex with reference counting (Arc+Mutex or Rc+RefCell), which roughly gives you what garbage collection does. Also cloning can avoid solving the borrow-check puzzle if you don't need a shared state. Of course you would not want to pack your code with Arc+Mutex or data copying if performance matters, but it's fine for a beginner to start with when writing Rust and then learn to do the optimized borrow version a bit later when needed.
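
A tiny sketch of that beginner-friendly style (single-threaded here; swap in Arc<Mutex<...>> for the threaded version):

    use std::cell::RefCell;
    use std::rc::Rc;

    fn main() {
        // "GC-like" sharing: Rc gives shared ownership via reference counting,
        // RefCell moves the borrow checking to run time.
        let log = Rc::new(RefCell::new(Vec::new()));

        let writer = Rc::clone(&log);        // another handle to the same Vec
        writer.borrow_mut().push("hello");
        log.borrow_mut().push("world");

        // Or sidestep sharing entirely by cloning the data.
        let snapshot = log.borrow().clone();
        println!("{:?} / {:?}", log.borrow(), snapshot);
    }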


It doesn't give the productivity that GC allows for writing GUI code and UI designers.

Imagine having Jetpack Compose, SwiftUI, Qt Designer, or WPF/UWP Blend in Rust.


Qt is actually a good example because C++ has a similar memory model to Rust (at least with respect to GC). The Qt solution was basically to give everything a “Cow” (copy on write) wrapper, and to use an event loop-based, somewhat manually-annotated GC for objects that want it.

Rust could totally do the same thing, and you could probably make it way easier to use than the mess that is Qt.
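
A minimal sketch of the copy-on-write idea in Rust terms, using Rc::make_mut, which only clones the underlying data when the handle is actually shared (a real GUI toolkit would need far more than this, of course):

    use std::rc::Rc;

    #[derive(Clone, Debug)]
    struct Style {
        color: String,
    }

    fn main() {
        let shared = Rc::new(Style { color: "blue".into() });
        let mut mine = Rc::clone(&shared);  // cheap handle copy, data still shared

        // Because another handle exists, make_mut clones the Style here,
        // then hands back a mutable reference to the private copy.
        Rc::make_mut(&mut mine).color = "red".into();

        println!("{:?} vs {:?}", shared.color, mine.color); // "blue" vs "red"
    }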


Having dealt with Gtk-rs, and their current solution being the clone! macro, I am not so sure.

Remember that not only is GUI development with proper tooling very interactive (unlike the FOSS alternative of code, compile, check visually), there is also a whole ecosystem of third parties selling component libraries, with no control over how they get integrated into the component toolbox.

So whatever solution one comes up with, it needs to be more productive than forcing users to scatter Rc<RefCell<>>, or to fix their code that broke compilation just because moving a widget in the GUI tree invalidated the borrow checker's assumptions.


It leaves a weird taste that Microsoft is preferring Rust over Golang, considering that Golang is a Google thing.

Don't get me wrong, all the technical arguments are correct and Rust does have advantages for cloud software. But this also comes in quite handy for MS. :)


VSCode support for Go, and some Delve improvements, were actually developed by Microsoft.


Rust and Kubernetes: a post where the mostly useless hype monsters unite.


> For comparison, last week we caught a significant race condition in another Kubernetes-related project we maintain called Helm (written in Go) that has been there for a year or more, and which passed the race checker for Go. That error would never have escaped the Rust compiler, preventing the bug from ever existing in the first place

While the possible security benefits of Rust is interesting in software like Kubernetes, it seems like this whole blog-post is an implicit RIIR proposal for the Kubernetes ecosystem from a Microsoft software engineer which isn’t going to happen anytime soon.

> Rust has made great progress in the past year with its async story, but there are still some issues that are being worked out.

On top of that, there are still many crates that aren’t using async-await yet and most are not even 1.0, thus are not stable. I would not touch such crates if they are still immature or even unsafe.

Realistically, a Rust Kubernetes is possible but practically the effort of a production ready version is measured in years.


Kubernetes is an ecosystem. It doesn’t need to be written in Rust for Rust components to play a part. Helm is not Kubernetes, for example, though your comment seems to blur the two. There are folks writing stuff to interact with the broader ecosystem in Rust. That’s one of the interesting bits of networked systems! You can be heterogeneous with languages more easily when the network/api is the boundary.


Doesn't Microsoft own Helm now? Nothing is stopping them from rewriting it in Rust, since it can easily interact with Kubernetes via the REST API.


No. Helm is owned by the CNCF.


CNCF doesn't "own" anything afaik. If you look at the maintainers list, most seem to still belong to the Deis org, which is part of MSFT now.


CNCF hold the copyrights.


That is absolutely not true. The contributors to helm retain full copyright. No assignment or even CLA is used (only a DCO).


I stand corrected!


No they don’t. Check any file header in their github repo


Plenty of Kubernetes stuff is actually written in Java, .NET and other languages, not necessarily Go. Thankfully.



