tomnicholas1's comments

This entire stack now exists for arrays as well as for tabular data. It's still S3 for storage, but Zarr instead of Parquet, Icechunk instead of Iceberg, and Xarray for queries in Python.
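
To give a flavour (a sketch only; the bucket path and variable name here are made up, and you'd need s3fs installed):

  import xarray as xr

  # lazily open a Zarr store on S3 (hypothetical path)
  ds = xr.open_zarr("s3://my-bucket/era5.zarr")
  # label-based query; nothing is loaded until .compute()
  monthly_mean = ds["t2m"].sel(time="2020-01").mean(dim="time").compute()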

Nice pointer, thanks! Putting Zarr/Icechunk/Xarray into my weekend projects queue.

Surely Zarr is already a long-term storage format for multidimensional data? It can even be mapped directly to netCDF, GRIB and GeoTIFF via VirtualiZarr[0].

Also if you like Iceberg and you like arrays you will really like Icechunk[1], which is Version-controlled Zarr!

[0] https://github.com/zarr-developers/VirtualiZarr

[1] https://icechunk.io/en/latest/
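
Rough sketch of what version-controlled writes look like, from memory of the Icechunk Python API (names may differ slightly; check the docs above):

  import icechunk
  import zarr

  # create a repo backed by local disk (S3-backed storage also exists)
  storage = icechunk.local_filesystem_storage("/tmp/demo-repo")
  repo = icechunk.Repository.create(storage)

  session = repo.writable_session("main")
  root = zarr.group(store=session.store)
  root.create_array("temp", shape=(100,), dtype="f8")
  session.commit("add temp array")  # an atomic, named snapshot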


I know Icechunk and I’m a huge fan of Earthmover. But a common binary format like Parquet seems nice… with interop for e.g. DuckDB and geo queries, you can “just load” ERA5 and do something like “get wind direction/speed along the following path for the last 5 years, grouped by day”, etc.

If you know the exact tensor shape of your data ahead of time, Zarr works well (we use it as the data format for our ML experiments). If you have dynamically growing data or irregular shapes, Zarr doesn't work as well.

Icechunk can handle growing dimensions with ACID transactions!

For irregular shapes in some cases using multiple groups + xarray.DataTree can help you, but in general yeah ragged data is hard.
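
For the growing case, the xarray-level operation is just an append along a dimension; with plain Zarr it looks like the sketch below (store path hypothetical), and Icechunk wraps the same write in a commit:

  import numpy as np
  import pandas as pd
  import xarray as xr

  store = "/tmp/growing.zarr"
  day1 = xr.Dataset(
      {"t": ("time", np.random.rand(24))},
      coords={"time": pd.date_range("2024-01-01", periods=24, freq="h")},
  )
  day1.to_zarr(store, mode="w")

  day2 = xr.Dataset(
      {"t": ("time", np.random.rand(24))},
      coords={"time": pd.date_range("2024-01-02", periods=24, freq="h")},
  )
  day2.to_zarr(store, append_dim="time")  # grow the time dimension in place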


> The future of Python's main open source data science ecosystem, numfocus, does not seem bright. Despite performance improvements, Python will always be a glue language.

Your first sentence is a scorching hot take, but I don't see how it's justified by your second sentence.

The community has always understood that Python is a glue language, which is why the bottleneck interfaces (with IO, or between array types) are implemented in lower-level languages or ABIs. The former was originally C but is now often Rust, and Apache Arrow is a great example of the latter.

The strength of using Python is that when you want to do anything beyond pure computation (e.g. networking), the rest of the world has already built a package for it.
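
As a small illustration of the Arrow point (assuming pyarrow and polars are installed):

  import pyarrow as pa
  import polars as pl

  # build the data once as Arrow buffers
  table = pa.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

  df_pandas = table.to_pandas()     # hand the same data to pandas...
  df_polars = pl.from_arrow(table)  # ...or zero-copy into polars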


So without the two-language problem, I think all of these low-level optimization efforts across dataframes, tensors, and distributed computing would be part of a unified ecosystem based on shared compatibility.

For example, the reason why numfocus is so great is that everything was designed to work with numpy as its underlying data structure.


I’ve been thinking a lot recently about how one thing science needs is a social network for sharing big data.

One thing the post gets at is that providing a decentralized global subscribable data catalog is fundamentally a network protocol problem, somewhat similar to RSS.

The social network analogy is particularly generative here: because the desired network structure is similar to that of federated social media, the protocol I want would be very similar in structure to the protocols underlying attempts to decentralize social media platforms. I therefore think it might well be possible to build what I’m suggesting by piggybacking off of BlueSky’s AT Protocol or the Fediverse/Mastodon’s ActivityPub protocol.

I’ve made a repo[1] for discussing ideas for how one might implement such a protocol.

Curious what people think of any of this!

[1] https://github.com/TomNicholas/FROST


Reporter imagines how we'd cover overseas what's happening to the U.S. right now.

This is by far the most clear-eyed description I've seen of the severity of what is happening.

Relevant to Hacker News partly because it includes descriptions of how Big Tech executives are effectively bribing the Trump Administration in exchange for removing federal watchdogs.


I was also going to say this looks similar to one layer of dask: dask takes arbitrary Python code and uses cloudpickle to serialise it in order to propagate dependencies to workers, and this seems to be an equivalent layer for Rust.
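
i.e. (a minimal sketch of the cloudpickle layer, not of dask itself):

  import cloudpickle

  scale = 10
  blob = cloudpickle.dumps(lambda x: x * scale)  # closures survive, unlike plain pickle
  f = cloudpickle.loads(blob)                    # ...on the worker side
  assert f(4) == 40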


This looks to be a degree more sophisticated than that.

Authors in the comments here mention that the flo compiler (?) will accept and rewrite Rust code to make it more amenable to distribution. It also appears to be building and optimising the data-flow rather than just distributing the work. There are also comparisons to timely, which I believe does some kind of incremental compute.


Is there a world in which GitHub used an open protocol for the social network part of their product like BlueSky's AT protocol[0]?

[0] https://docs.bsky.app/docs/advanced-guides/atproto


not p2p, but federated: https://forgefed.org (ActivityPub extension)

I believe Gitea has support for it, not sure to what extent.


Forgejo (Gitea fork) has been working for multiple years to add support for this. It will still take a lot of effort to finish, I doubt we will see anything usable this year.

Originally the plan was to PR the federation support to Gitea as well. I'm not sure if this is still the case, considering the rising tensions between the two projects and the fact that Forgejo is now a hard fork.


Forgejo, a Gitea fork that I use, has support for it according to the page you linked. But the FAQ for Forgejo mentions it's on the roadmap, so I'm not sure how complete ActivityPub support is in Forgejo either.

https://forgejo.org/faq/#is-there-a-roadmap-for-forgejo

I only use my Forgejo instance for myself currently so I haven't looked at the ActivityPub features of it before.


...this is an interesting thought exercise, thank you.


In Python there is lithops, which provides nice Executor primitives that can run on a wide range of cloud services (AWS Lambda, GCF, etc.)

https://github.com/lithops-cloud/lithops


Omg, the Python code examples are center-aligned. But it looks sweet.


That's extremely funny in the one language that cares about alignment.

(& the argument that I keep using against significant whitespace, which is that all sorts of other tools assume they can mess around with it with no downsides)


People should probably click before downvoting... this is what it looks like in the README:

  from lithops import FunctionExecutor

            def hello(name):
          return f'Hello {name}!'

   with FunctionExecutor() as fexec:
    f = fexec.call_async(hello, 'World')
             print(f.result())
If you copy/paste it the indentation is correct; it's just the display formatting for some reason.
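
For reference, the intended indentation (what you actually get on copy/paste) is:

  from lithops import FunctionExecutor

  def hello(name):
      return f'Hello {name}!'

  with FunctionExecutor() as fexec:
      f = fexec.call_async(hello, 'World')
      print(f.result())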


The Python package Hypothesis[0] already does a great job bringing property-based testing to the people! I've used it and it's extremely powerful.

[0]: https://github.com/HypothesisWorks/hypothesis
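
For anyone who hasn't seen it, a property-based test is only a few lines (toy example):

  from hypothesis import given, strategies as st

  # property: UTF-8 encode/decode round-trips any text
  @given(st.text())
  def test_utf8_roundtrip(s):
      assert s.encode("utf-8").decode("utf-8") == s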


I have used Python's `hypothesis` as well, and I wish it were better. We had to rip it out at work as we were running into too many issues.

I have also used Haskell's `QuickCheck` and Clojure's `spec` / `test.check` and have had a great experience with these. In my experience they "just work".

Conversely, if you're trying to generate non-trivial datasets, you will likely run into situations where your specification is correct but Hypothesis' implementation fails to generate data, or takes an unreasonable amount of time to generate data.

Example: Generate a 100x25 array of numeric values, where the only condition is that they must not all be zero simultaneously. [1]

[1] https://github.com/HypothesisWorks/hypothesis/issues/3493
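
The spec was roughly of this shape (paraphrased; the "obvious" way to say "not all zeros" is a whole-array filter, which shrinking fights against, since shrinking pushes values towards zero):

  import numpy as np
  from hypothesis import strategies as st
  from hypothesis.extra import numpy as hnp

  not_all_zero = hnp.arrays(
      np.float64, (100, 25), elements=st.floats(-1e6, 1e6)
  ).filter(lambda a: a.any())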


I understand your pain in one sense, but in another I feel like people with a decent amount of Hypothesis experience "know" how the generator works and would understand that you basically _never_ want to use `filter` if you can avoid it, instead constructing values that satisfy the condition by design.

A silly idea for your generator would be to generate an array, and if it's all zero, draw a random index and a random non-zero number and add it into the array. Leads to some weird non-convexity properties but is a workable hack.

In your own example you turned off the "data too slow" issue, probably because building up a dataframe (all just to do a column sum!) is actually kind of costly at large sizes! Your complaint is probably actually aimed at the pandas extras (or pandas itself) rather than Hypothesis as a concept.
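
The silly idea as a sketch (hypothetical strategy name, and note the patched cell skews the distribution):

  import numpy as np
  from hypothesis import strategies as st
  from hypothesis.extra import numpy as hnp

  @st.composite
  def not_all_zero(draw, shape=(100, 25)):
      arr = draw(hnp.arrays(np.float64, shape, elements=st.floats(-1e6, 1e6)))
      if not arr.any():
          # patch one random cell with a guaranteed non-zero value
          i = draw(st.integers(0, shape[0] - 1))
          j = draw(st.integers(0, shape[1] - 1))
          arr[i, j] = draw(st.sampled_from([-1.0, 1.0])) * draw(st.floats(1e-6, 1e6))
      return arr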


No, I ran into the same issues with basic data structures. The dataframe wasn’t necessary, it just matched the expected input of some function I wanted to test.


I took your case, and I got way better perf just generating a list of numbers and then reshaping it into a dataframe.

But! Even though it doesn't even get that much slower, at a certain number of rows it just starts hanging! Like at 49 rows everything is still fine and at 50 it no longer wants to work. It's very bizarre and I'll see if I can debug it. But I think your test case is indicative of some sort of bug rather than a fundamental issue with Hypothesis.
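
What I did, roughly (the assertion is just a placeholder):

  import numpy as np
  import pandas as pd
  from hypothesis import given, strategies as st

  ROWS, COLS = 100, 25

  # generate a flat list, reshape at the end; cheaper than building
  # the dataframe inside the strategy
  @given(st.lists(st.floats(-1e6, 1e6), min_size=ROWS * COLS, max_size=ROWS * COLS))
  def test_column_sum(flat):
      df = pd.DataFrame(np.asarray(flat).reshape(ROWS, COLS))
      assert df.sum().notna().all()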


That kind of behavior can happen at the threshold of Hypothesis' internal limit on entropy - though if you're not hitting HealthCheck.data_too_large then this seems unlikely.

Let me know if you have a reproducer, I'd be curious to take a look.


> Even though it doesn't even get that much slower at a certain number of rows it just starts hanging

Yes, this brings back memories. I've definitely seen this kind of behaviour as well, in many different, not-particularly-exotic, situations.

I am absolutely convinced the issue I raised on the github project was a bug or a defect, despite the maintainers not taking it seriously.

I find QuickCheck and Clojure spec/test.check much more straightforward to use. I just never ran into this sort of thing with these other tools.


Not weighing in on any particular tech (hypothesis or otherwise), but intrigued by your example...

My initial impulse is to pick a random cell which must not be zero, generate a random number for each other cell, and a random non-zero number for that one. I'm not immediately decided on whether it's uniformly distributed.


I would pick the number of non-zeros first, assert that it's non-zero, then continue filling in the values themselves. And probably not with a uniform distribution.

Any algorithm that cares about the number of non-zeros could have non-trivial interactions with their arrangement and count, so picking something that generates non-trivial sparsity (and doesn't just make the array look like white noise) is going to have the best chance of exposing interesting behavior. The tricky part is thinking through how to generate "interesting" patterns, which admittedly I haven't put enough thought into.
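
Something like this sketch (values positive for simplicity; the point is that drawing the count first makes "not all zero" hold by construction):

  import numpy as np
  from hypothesis import strategies as st

  @st.composite
  def sparse_nonzero(draw, shape=(100, 25)):
      n = shape[0] * shape[1]
      k = draw(st.integers(1, n))  # number of non-zeros, drawn first; always >= 1
      idx = draw(st.lists(st.integers(0, n - 1), min_size=k, max_size=k, unique=True))
      vals = draw(st.lists(st.floats(1e-6, 1e6), min_size=k, max_size=k))
      arr = np.zeros(n)
      arr[idx] = vals
      return arr.reshape(shape)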


Ah, yeah, generating for property testing probably doesn't want a uniform distribution. What patterns are interesting will surely depend on what we're doing with the array.


Indeed, the world is not IID. As such, test cases should not be a uniformly distributed sample of some mathematical distribution.


> Indeed, the world is not IID.

Right! And even if it were, in the sense that that's what we should expect as real world input, it wouldn't generally be the best distribution for finding bugs.


As the comments on your linked issue point out:

(a) Filtering is a last resort and is best avoided. As an example, the Gen type in Haskell's falsify package can't be filtered, since it's a bad idea. As another example, ScalaCheck's Gen type can be filtered, but they also allow "retries" (by default, up to 10,000 times), because filtering is very wasteful.

(b) If you're going to filter, scope it to be as small as possible (e.g. one comment points out that you're discarding and regenerating entire dataframes, when the filter only depends on one particular column); see the sketch after this list.

(c) Have some vague awareness of how your generators will shrink, to avoid infinite loops. In your case, shrinking will make it more likely to fail your filter; and the "smallest" dataframe (all zeros) will definitely fail.
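
To illustrate (b) concretely (toy strategies, not the dataframe case):

  from hypothesis import strategies as st

  # filtering the whole composite regenerates everything on each rejection...
  coarse = st.tuples(st.floats(), st.floats()).filter(lambda t: t[0] > 0)

  # ...whereas filtering just the offending component retries one value
  scoped = st.tuples(st.floats().filter(lambda x: x > 0), st.floats())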


It’s still very weird that the generator can’t avoid an all-zeros array. Figuring out the root cause might find something interesting.


Care to expand upon the issues you were running into with hypothesis? I'm genuinely curious as I may soon be evaluating whether to use it in a professional context.


As far as I'm aware, Hypothesis is fundamentally based around the idea of "generators are parsers of randomness" discussed in this paper; i.e. a Hypothesis "strategy" is essentially a function from bytestrings to values. To generate random values, those strategies are run on a random bytestring; to shrink a previous value, the bytestring that led to that value is shrunk.

Haskell's "falsify" package takes a similar approach, but uses a tree of random values. This has the advantage that composite generators can run each of their parts against a different sub-tree, and hence they can be shrunk independently without interfering.


Interesting - I'm curious whether you feel that Xarray covers these use cases already?

https://xarray.dev/

Especially as I've said before that Hyperspy shares so many features in common with Xarray that Hyperspy should just use Xarray under the hood.

https://github.com/hyperspy/hyperspy/discussions/3405
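
e.g. the named-axes slicing that Hyperspy signals give you falls out of a DataArray directly (toy data):

  import numpy as np
  import xarray as xr

  da = xr.DataArray(
      np.random.rand(10, 64, 64),
      dims=("energy", "y", "x"),
      coords={"energy": np.linspace(0.0, 2.0, 10)},
  )
  nearest_slice = da.sel(energy=1.0, method="nearest")  # label-based slicing
  print(nearest_slice.mean().values)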


Thank you for the info! I recall looking at the available tools and thought that none scratched my itch of both flexible interactive filtering and flexible interactive visualization. Great tools for either one, but not for both. But I will give Xarray another look.

