
Is this practically true? Yes, anyone can clone any repo from Github, but surely scraping all of Github would run into rate limits?

The terms and conditions say as much https://docs.github.com/en/site-policy/github-terms/github-t...


Well today you get to learn about the GitHub Archive project, which creates dumps of public GitHub data.

One example is the data hosted in Google Cloud.

https://cloud.google.com/blog/topics/public-datasets/github-...


How do you solve the context propagation issue with eBPF based instrumentation?

E.g. you get an RPC request coming in, and make an RPC request in order to serve the incoming one. The traced program needs to track some ID for that request from the time it comes in, through to the place where the HTTP request goes out. And then that ID has to get injected into a header on the wire so the next program sees the same request ID.

IME that's where most of the overhead (and value) from a manual tracing library comes from.
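
To make it concrete, here's roughly what the manual approach looks like in Go, just a sketch using OpenTelemetry's otelhttp wrappers (the route, downstream URL, and handler name are made up, and it assumes a tracer provider/exporter is configured elsewhere):

  package main

  import (
      "net/http"

      "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
      "go.opentelemetry.io/otel"
      "go.opentelemetry.io/otel/propagation"
  )

  func handleOrder(w http.ResponseWriter, r *http.Request) {
      // r.Context() carries the span extracted from the inbound traceparent header
      ctx := r.Context()

      // the instrumented transport re-injects traceparent on the outbound hop,
      // so the downstream service sees the same trace ID
      client := http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
      req, err := http.NewRequestWithContext(ctx, "GET", "http://inventory/check", nil)
      if err != nil {
          http.Error(w, err.Error(), http.StatusInternalServerError)
          return
      }
      if resp, err := client.Do(req); err == nil {
          resp.Body.Close()
      }
      w.WriteHeader(http.StatusOK)
  }

  func main() {
      // propagate via the W3C traceparent header
      otel.SetTextMapPropagator(propagation.TraceContext{})
      // the wrapper extracts/starts a span before the handler runs
      http.Handle("/orders", otelhttp.NewHandler(http.HandlerFunc(handleOrder), "handleOrder"))
      http.ListenAndServe(":8080", nil)
  }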


100%. Context propagation is _the_ key to distributed tracing, otherwise you're only seeing one side of every transaction.

I was hoping odigos was language/runtime-agnostic since it's eBPF-based, but I see it's mentioned in the repo that it only supports:

> Java, Python, .NET, Node.js, and Go

Apart from Go (that is a WIP), these are the languages already supported with Otel's (non-eBPF-based) auto-instrumentation. Apart from a win on latency (which is nice, but could in theory be combated with sampling), why else go this route?


eBPF instrumentation does not require code changes, redeployment, or restarting of running applications.

We are constantly adding more language support for eBPF instrumentation and are aiming to cover the most popular programming languages soon.

Btw, not sure that sampling is really the solution to combat overhead; after all, you probably do want that data. Trying to fix a production issue when the data you need is missing due to sampling is not fun.


All good points, thank you.

What's the limit on language support? Is it theoretically possible to support any language/runtime? Or does it come down to the protocol (HTTP, gRPC, etc) being used by the communicating processes?


We already solved compiled languages (Go, C, Rust) and JIT languages (Java, C#). Interpreted languages (Python, JS) are the only ones left, hopefully we will solve these as well soon. The big challenge is supporting all the different runtimes, once that is solved implementing support for different protocols / open-source libraries is not as complicated.


Got to get PHP on that list :)


FWIW it's theoretically possible to support any language/runtime, but since eBPF is operating at the level it's at, there's no magic abstraction layer to plug into. Every runtime and/or protocol involves different segments of memory and certain bytes meaning certain things. It's all in service towards having no additional requirements for an end-user to install, but once you're in eBPF world everything is runtime-and-protocol-and-library-specific.


It depends on the programming language being instrumented. For Go we are assuming the context.Context object is passed around between different functions or goroutines. For Java, we are using a combination of ThreadLocal tracing and Runnable tracing to support use cases like reactive and multithreaded applications.


That’s a very big assumption, at least for Go based applications.


I don't think it's unreasonable, you need a Context to make a gRPC call and you get one when handling a gRPC call. It usually doesn't get lost in between.


True for gRPC, but not necessarily for HTTP - the HTTP client and server packages that ship with Go predate the Context package by quite a long while.
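
A small sketch of what that difference looks like in practice (hypothetical helper function; http.NewRequestWithContext only arrived in Go 1.13):

  package example

  import (
      "context"
      "net/http"
  )

  func callDownstream(ctx context.Context, url string) (*http.Response, error) {
      // pre-context style: http.NewRequest has no ctx parameter, so whatever
      // trace IDs arrived with the inbound request simply stop here
      //   req, _ := http.NewRequest("GET", url, nil)

      // context-aware style, added in Go 1.13, which keeps propagation working
      req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
      if err != nil {
          return nil, err
      }
      return http.DefaultClient.Do(req)
  }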


We're also thinking of implementing a fallback mechanism to automatically propagate context on the same goroutine if context.Context is not passed.


Going to be rough for supporting virtual threads then?


We have a solution for virtual threads as well. Currently working on a blog post describing exactly how. Will update once released.



The eBPF programs handle propagating the context across requests by adding a field to the header, as you mentioned. The injected field follows the W3C Trace Context standard.
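
(For reference, the header that standard defines is traceparent; illustrative values:)

  traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  (format: version - trace-id, 16 bytes hex - parent span id, 8 bytes hex - trace flags)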


> So you have endless "Fix a" "Typo" "fixup" "revert redo" "add y missed in z" commits and then the squash pushes all that crap into the commit message for whatever the final mess will be?

In Github at least you can set the behavior to take the PR description by default as the squashed commit message. In fairness this is not the default. The default behavior for squash merges is to ask for a new commit message right as you hit the merge button, and the default is all of the messages from the commits being squashed together.

> make no effort to produce high-quality independent commits

I'm partial to squash merges when using github. I don't put much effort into the individual commit messages, instead I put lots of effort into the PR description (the thing reviewers will read, and what will eventually become the commit message in revision history). That said, one of my favorite features from gerrit at a past job was that the commit message itself could be reviewed.


> That said, one of my favorite features from gerrit at a past job was that the commit message itself could be reviewed.

Reviewable lets you review your commit messages just like any other file, BTW! (Disclosure: I'm the founder.)


Did Google interviews change dramatically around 2012?


I'm guessing 2012 because I thought that was the year they stopped doing brain teaser interviews, which were very controversial at the time. Googling it now, the exact year is fuzzy -- like, they originally stopped in 2006, but only fully stopped around... 2011-2012? My evidence is this thread and its linked article: https://www.reddit.com/r/programming/comments/1gq72n/comment...


I received an offer from Google around May 2012. My interviews were long and numerous with interactive coding and whiteboard sessions. No brain teasers. Larry Page was still giving final yes or no on hiring.


no, but that is about when Larry and Sergey checked out (wasn't an overnight thing, took a few years).


This isn't catching the panic though, this is propagating the panic through the parent goroutine. The whole program will still shut down, but the stacktrace that shows up in the panic contains information not only about the goroutine that panicked, but also about the launching goroutine. That can help you figure out why the panic happened to begin with.
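
One way to get that effect (just a sketch of the general idea, not necessarily how the library in question does it):

  package main

  import (
      "fmt"
      "runtime/debug"
  )

  // run executes f on a new goroutine and surfaces any panic to the caller
  func run(f func()) error {
      done := make(chan error, 1)
      go func() {
          defer func() {
              if r := recover(); r != nil {
                  // capture the child goroutine's stack before it unwinds
                  done <- fmt.Errorf("child panicked: %v\n%s", r, debug.Stack())
                  return
              }
              done <- nil
          }()
          f()
      }()
      return <-done
  }

  func main() {
      if err := run(func() { panic("boom") }); err != nil {
          // re-panicking here makes the launching goroutine's stack show up too
          panic(err)
      }
  }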


Something I've increasingly wondered is if the model of CI where a totally pristine container (or VM) gets spun up on each change for each test set imposes a floor on how fast CI can run.

Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.

If I had infinite time, I'd build a CI system that maintained some state (gasp!) about the build, and routed each job to a test runner that already had most of its local build cache downloaded, the source code cloned, and the toolchain bootstrapped.


You'd love a service like that, until you have some weird stuff working in CI but not locally (or vice versa). That's why things are built from scratch all the time: to prevent any such issues from happening.

Npm was (still is?) famously bad at installing dependencies, where sometimes the fix is to remove node_modules and simply reinstall. Back when npm was more brittle (yes, possible) it was nearly impossible to maintain caches of node_modules directories, as they ended up being different than if you reinstalled with no existing node_modules directory.


I think Nix could be leveraged to resolve this. If the dependencies aren't a perfect match, it downloads only the ones that differ, and reuses anything already downloaded locally.

So infra concerns are identical. Remove any state your application itself uses (clean slate, like a local DB), but your VM can functionally be persistent (perhaps you shut it off when not in use to reduce spend)?


You wouldn't catch it, it's true.

But it depends on whether you're willing to trade accuracy for speed. I suggest the correct reaction to this is... "How much speed?"

I presume the answer to be "a lot".


My immediate reaction is “correctness each and every time”.


I mean, given that my full build takes hours but my incremental build takes seconds--and given that my build system itself tends to only mess up the incremental build a few times a year (and mostly in ways I can predict), I'd totally be OK with "correctness once a day" or "correctness on demand" in exchange for having the CI feel like something that I can use constantly. It isn't like I am locally developing or testing with "correctness each and every time", no matter how cool that sounds: I'd get nothing done!


This really depends a lot on context and there's no right or wrong answer here.

If you're working on something safety critical you'll want correctness every time. For most things short of that it's a trade-off between risk, time, and money—each of which can be fungible depending on context.


Do you really need to build the whole thing to test?


In my experience, yes.

A small change in a dependency, essentially, bubbles or chains to all dependent steps. I.e., a change in the fizzbuzz source inherently means we must run the fizzbuzz tests. This cascades into your integration tests — we must run the integration tests that include fizzbuzz … but those now need all the other components involved; so, that sort of bubbles or chains to all reverse dependencies (i.e., we need to build the bazqux service, since it is in the integration test with fizzbuzz…) and now I'm building a large portion of my dependency graph.

And in practice, to keep the logic in CI reasonably simple … the answer is "build it all".

(If I had better content-aware builds, I could cache them: I could say, ah, bazqux's source hashes to $X, and we already have a build for that hash, excellent. In practice, this is really hard. It would work if all of bazqux were limited to some subtree, but inevitably one file decides to include some source from outside the spiritual root of bazqux, and now bazqux's hash is "the entire tree", which by definition we've never built.)

(There's bazel, but it has its own issues.)
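
For what it's worth, the content-hash idea from the parenthetical above, sketched out (hypothetical layout where each service lives under its own directory):

  package main

  import (
      "crypto/sha256"
      "encoding/hex"
      "fmt"
      "io/fs"
      "os"
      "path/filepath"
  )

  // treeHash derives a cache key from every file under a target's source tree
  func treeHash(root string) (string, error) {
      h := sha256.New()
      err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
          if walkErr != nil || d.IsDir() {
              return walkErr
          }
          data, readErr := os.ReadFile(path)
          if readErr != nil {
              return readErr
          }
          fmt.Fprintln(h, path) // include the path so renames change the key
          h.Write(data)
          return nil
      })
      return hex.EncodeToString(h.Sum(nil)), err
  }

  func main() {
      // if an artifact already exists for this key, skip rebuilding bazqux
      key, _ := treeHash("services/bazqux")
      fmt.Println("cache key:", key)
  }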


I work in games, our repository is ~100GB (20m download) and a clean compile is 2 hours on a 16 core machine with 32GB ram (c6i.4xlarge for any Aws friends). Actually building a runnable version of the game takes two clean compiles (one editor and one client) plus an asset processing task that takes about another 2 hours clean.

Our toolchain install takes about 30 minutes (although that includes making a snapshot of the EBS volume to make an AMI out of).

That's ~7 hours for a clean build.

We have a somewhat better system than this - our base ami contains the entire toolchain, and we do an initial clone on the ami to get the bulk of the download done too. We store all the intermediates on a separate drive and we just mount it, build incrementally and unmount again. Sometimes we end up with duplicated work but overall it works pretty well. Our full builds are down from 7 hours (in theory) to about 30 minutes, including artifact deployments.


This is how CI systems have always behaved traditionally. Just install a Jenkins agent on any computer/VM and it will maintain a persistent workspace on disk for each job to reuse in incremental builds. There are countless other tools that work in the same way. This also solves the problem of isolating builds if your CI only checks out the code and then launches a constrained Docker container to execute the build. This can easily be extended to use persistent network disks and scaled up workers, but is usually not worth the cost.

It's baffling to see this new trend of yaml actions running in pristine workers, redownloading the whole npm-universe from scratch on every change, birthing hundreds of startups trying to "solve" CI by presenting solutions to non-problems and then wrapping things in even more layers of lock-in and micro-VMs and detaching yourself from the integration.

While Jenkins might not be the best tool in the world, the industry needs a wake-up shower on how to simplify and keep in touch with reality, not hidden behind layers of SaaS-abstractions.


Agreed, this is more or less the inspiration behind Depot (https://depot.dev). Today it builds Docker images with this philosophy, but we'll be expanding to other more general inputs as well. Builds get routed to runner instances pre-configured to build as fast as possible, with local SSD cache and pre-installed toolchains, but without needing to set up any of that orchestration yourself.


This was the idea behind https://webapp.io (YC S20):

- Run a linear series of steps

- Watch which files are read (at the OS level) during each step, and snapshot the entire RAM/disk state of the MicroVM

- When you next push, just skip ahead to the latest snapshot

In practice this makes a generalized version of "cache keys" where you can snapshot the VM as it builds, and then restore the most appropriate snapshot for any given change.


I have zero experience with bazel, but I believe it offers the possibility of mechanisms similar to this? Or a mechanism that makes this "somewhat safe"?


Yes it does, but one should be warned that adopting Bazel isn't the lightest decision to make. But yeah, the CI experience is one of its best attributes.

We are using Bazel with Github self-hosted runners, and have consistently low build times with a growing codebase and test suite, as Bazel will only re-build and re-test what is affected by a change.

The CI experience compared to e.g. doing naive caching of some directories with Github managed runners is amazing, and it's probably the most reliable build/test setup I've had. The most common failure we have of the build system itself (still rare, at roughly once a week) is network issues with one of the package managers, rather than quirks introduced by one of the engineers (and there would be a straightforward path towards preventing those failures, we just haven't bothered to set that up yet).


> Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or download a giant container with the toolchain, and always have to download a big remote cache.

Couldn’t this be addressed if every node had a local caching proxy server container/VM, and all the other containers/VMs on the node used it for Git checkouts, image/package downloads, etc?


> the model of CI where a totally pristine container (or VM) gets spun on each change for each test set imposes an floor on how fast CI can run

I believe this is the motivation behind https://brisktest.com/


I'm using buildkite - which lets me run the workers myself. These are long-lived Ubuntu systems set up with the same code we use on dev and production, running all the same software dependencies. Tests are fast and it works pretty nice.


I'm not using it right now, but at a previous company we used Gitlab CI on the free tier with self-hosted runners. Kicked ass.


Self-hosted runners are brilliant, but have a poor security model for running containers or building them within a job. Whilst we're focusing on GitHub Actions at the moment, the same problems exist for GitLab CI, Drone, Bitbucket and Azure DevOps. We explain why in the FAQ (link in the post).


> poor security model for running containers or building them within a job

You mean Docker-in-Docker? If so, we used Kaniko to build images without Docker-in-Docker


There is a misconception that Kaniko means non-root, but in order to build a container it has to work with layers which requires root.

Using Kaniko also doesn't solve for:

How do you run containers within that build in order to test them? How do you run KinD/K3s within that build to validate the containers e2e?


The benefit of Kaniko (relative to Docker-in-Docker) is that you don't need to run in privileged mode.

We test our containers in our Dev environment after deploying


That is a benefit over DIND and socket sharing, however it doesn't allow for running containers or K8s itself within a job. Any tooling that depends on running "docker" (the CLI) will also break or need adapting.

This also comes to mind: "root in the container is root on the host" - https://suraj.io/post/root-in-container-root-on-host/


This reminds me of the erlang map-reduce "did you just tell me to fuck myself" meme


> Each job will always have to run a clone

You can create a base filesystem image with the code and tools checked out, then create a VM which uses that in a copy-on-write way


AWS Autoscaling groups with a custom AMI does this by default, fwiw.


Understanding amendments would be a very good use case. Often changes to the law are not new laws, but amendments to existing laws. Any time I've tried to actually parse the law, I've found that getting a good picture of its current state, or a snapshot of it at some point in the past, is tricky.

To take this example I found off the NYS assembly website legalizing adultery (it was the first one I found, I swear)

https://nyassembly.gov/leg/?default_fld=&leg_video=&bn=A0010...

It's phrased as "Section 255.17 of the penal law is REPEALED", but if you go try to find "the penal law" and look up a copy of the penal code, do you see 255.17 in it? If so, how can you find out what was actually repealed? If not, do you need to hunt through every possible amendment to figure out what the state of the law was at the time of reading?


Cool article, I'm not sure I agree with the headline.

I used to write low-scale Java apps, and now I write memory intensive Go apps. I've often wondered what would happen if Go did have a JVM style GC.

It's relatively common in Go to resort to idioms that let you avoid hitting the GC. Some things that come to mind:

* all the tricks you can do with a slice that have two slice headers pointing to the same block of memory [1]

* object pooling, something so common in Go it's part of the standard library [2]

Both are technically possible in Java, but I've never seen them used commonly (though in fairness I've never written performance critical Java.) If Go had a more sophisticated GC, would these techniques be necessary?
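
For anyone who hasn't seen it, the pooling idiom from [2] looks roughly like this (minimal sketch):

  package main

  import (
      "bytes"
      "sync"
  )

  // reuse buffers across calls instead of allocating a fresh one each time
  var bufPool = sync.Pool{
      New: func() any { return new(bytes.Buffer) },
  }

  func encode(payload []byte) []byte {
      buf := bufPool.Get().(*bytes.Buffer)
      buf.Reset()
      defer bufPool.Put(buf)

      buf.Write(payload) // no per-call buffer allocation once the pool is warm
      out := make([]byte, buf.Len())
      copy(out, buf.Bytes())
      return out
  }

  func main() {
      _ = encode([]byte("hello"))
  }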

Also Java is supposed to be getting value types soon (tm) [3]

[1] https://ueokande.github.io/go-slice-tricks/

[2] https://pkg.go.dev/sync#Pool

[3] https://openjdk.java.net/jeps/169


Object pooling in Java used to be fairly common. I don't see it much anymore in new code, but used to run into it all the time when writing code for Java 1.4/5. Even Sun used pooling when they wrote EJBs. Individual EJBs can be recycled instead of released to the GC.

Nowadays the GC implementations are good enough that it's not worth the effort and complexity.

Though now that I think about it Netty provides an object pooling mechanism.


Pooling objects (for the purposes of minimizing GC) is considered a bad practice in modern Java. The article suggests that compacting, generational collectors are a bad thing, but they can dramatically speed up the amount of time it takes to deallocate memory if most of your objects in a given region of memory are now dead. All you have to do is move out the objects that are still alive, and you're done: that region is now available for use again. The result is that long-lived objects have a greater overhead.


Does object pooling still make sense for direct ByteBuffers nowadays?


Yes. Those aren't GC-controlled, so any argument about GC is irrelevant for direct byte buffers.

Also, object pooling isn't really a GC related hack, it's more useful as a cache booster. Programmers like immutability and garbage collection but your CPU doesn't like these things at all. If you're constantly allocating new objects it doesn't matter if your GC kicks ass, because those objects will always be in parts of memory that are cold in the cache. If you allocate some up front and mutate them, they're more likely to be warm.

Obviously this isn't a language or even VM thing. It's a "mutable variables are good for performance" thing.


> Both are technically possible in Java, but I've never seen them used commonly (though in fairness I've never written performance critical Java.)

I don't know about the Java world, but in C#—especially in games written in Unity—object pooling is very common.


Writing High Performance .NET Code (https://www.writinghighperf.net/) has a chapter on this. In C#, time spent collecting depends on the number of still-living objects. That means you want objects you allocate to be short-lived (dead by the time GC happens) or to live forever (they go to the gen 2 heap and stay there). The book suggests object pooling when the lifetime of objects is between those two extremes, or when objects are big enough for the Large Object Heap.

But at the end of the section, the book says:

  I do not usually run to pooling as a default solution. As a general-purpose mechanism, it is clunky and error-prone. However, you may find that your application will benefit from pooling of just a few types.
What kind of things do you pool in Unity?


Unity aside, I believe it's useful in a lot of areas of gamedev.

Anecdotally, in a lower-level game engine I wrote at one point in C#, object pooling significantly reduced memory overhead (and IIRC increased framerates on complex scenes) when I scaled well past 1000 dynamic, moderately-lived entities. Particles, objects, projectiles, bad guys, etc. I believe can all benefit from pooling, assuming they aren't long-lived.

I do agree it can be error prone, but I'm convinced it's worth it for several places in gaming.


Ideally you would pool anything you might need to dynamically allocate during a level. You want to avoid allocations during game play entirely, if possible.

Unity itself will pool most media assets. Any given texture asset is shared between all object instances that use that texture. The programmer will end up pooling instances of their objects or just use structs and such. It can be tedious but I wouldn't call it more clunky than explicit memory management.

Large collections are actually not a problem at all in games as long as you only run the collection during a load screen.


I've seen it on a large site written in c#. Object pools of stream objects for serializing and deserializing data. This was 10 years ago.


Java has a pretty decent standard library with different list, map and set implementations and quite a few third party libraries with yet more data structures. Honestly, Go felt a bit primitive and verbose to me on that front on the few times I used it. Simplicity has a price and some limitations.

There are also other tricks you can do like for example using off heap memory (e.g. Lucene does this), using array buffers, or using native libraries. There obviously is a lot of very memory intensive, widely used software written for the JVM and no shortage of dealing with all sorts of memory related challenges. I'd even go as far as to argue that quite a few of those software packages might be a little out of the comfort zone for Go. Maybe if it were used more for such things, there would be increased demand for better GCs as well?

Object pooling is pretty common for things like connection pools. For example, Apache Commons Pool is used for doing connection pooling (database, http, redis, etc.) in Spring Boot and probably a lot more products. Also there are thread pools, worker pools and probably quite a few more that are pretty widely used, and quite a few of those come with the Java standard library. Caching libraries are also pretty common and well supported in popular web frameworks like Spring.

A typical Java-based search or database software product (Elasticsearch, Kafka, Cassandra, etc.) is likely to use all of the above. Likewise for things like Hadoop, Spark, Neo4j, etc.

Of course there's a difference between Java the language and the JVM, which is also targeted by quite a few other languages. For example, I've been using Kotlin for the last few years. There are functional languages like Scala and Clojure. And people even run scripting languages on jython, jruby, groovy, or javascript on it.

There even have been some attempts to make Go run on the JVM. Apparently performance, concurrency and memory management were big motivators for attempting that (you know, stuff the JVM does at scale): https://githubmemory.com/repo/golang-jvm/golang-jvm

Their pitch: "You can use go-jvm simply as a faster version of Golang, you can use it to run Golang on the JVM and access powerful JVM libraries such as highly tuned concurrency primitives, you can use it to embed Golang as a scripting language in your Java program, or many other possibilities."


Not sure if you're in on the joke, but for those who didn't go to the repo itself:

https://github.com/golang-jvm/golang-jvm

It's just a copy-paste of JRuby on April 1st and the readme now includes a rickroll.

Maybe it's irresponsible of them to leave it up in a way that Google still finds as a legitimate-looking search result.


LOL, I was not aware and stepped right into that.

There appear to be other attempts: for example https://github.com/zxh0/jvm.go (might be the same?)

Let's just say people have tried/joked about it but it never took off.


It's the other way round: implementing a JVM in Go, not running Go on the JVM.


> There even have been some attempts to make Go run on the JVM. Apparently performance, concurrency and memory management were big motivators for attempting that (you know, stuff the JVM does at scale):

This seems legit; it's just that the links to their website/wiki aren't working right now.


Object pooling used to be more common in Java. Now it is mainly used for objects that are expensive (incurs latency) to create, not for GC reasons.


How have you found Go in contrast to Java? Is the simplicity worth it?


yes. golang is actually less restrictive than java. and avoids a ton of the bullshit abstractions you see in every java code base.


>ton of the bullshit abstractions

Not sure how that's Java's problem. Most of these abstractions come from older frameworks. You can have the same abstractions/design patterns in Go too.


Other comments have linked newer language features that make it easy. But for years, the Java Way of handling discriminated unions was to use the visitor pattern [1]. It's very verbose, and is an insane amount of typing unless your IDE is doing the typing for you, but it has the compile-time guarantees that force each caller to handle every type without instanceof/Object.

[1] https://dzone.com/articles/design-patterns-visitor
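
To make the pattern concrete (sketched in Go rather than Java, since that's the comparison upthread; the type names are invented): every visitor is forced, at compile time, to handle each variant.

  package main

  import "fmt"

  // one Visit method per variant: adding a variant breaks every visitor at compile time
  type EventVisitor interface {
      VisitCreated(Created)
      VisitDeleted(Deleted)
  }

  type Event interface{ Accept(EventVisitor) }

  type Created struct{ ID string }

  func (e Created) Accept(v EventVisitor) { v.VisitCreated(e) }

  type Deleted struct{ ID string }

  func (e Deleted) Accept(v EventVisitor) { v.VisitDeleted(e) }

  // a concrete visitor: the compiler forces it to handle every variant
  type Logger struct{}

  func (Logger) VisitCreated(e Created) { fmt.Println("created", e.ID) }
  func (Logger) VisitDeleted(e Deleted) { fmt.Println("deleted", e.ID) }

  func main() {
      events := []Event{Created{"a"}, Deleted{"a"}}
      for _, e := range events {
          e.Accept(Logger{})
      }
  }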


Mesosphere was renamed and has shifted to supporting Kubernetes.

https://d2iq.com/blog/mesosphere-is-now-d2iq

