This looks incredibly useful for my current project! We have to deal with lots of arrays of objects whose trajectories have different lengths, and I often end up having to use numpy with dtype=object, which feels, well, "awkward". It can be very hard to convince numpy arrays to have items that are actually lists, but I like using the vectorized syntax. I've sometimes found it necessary to have Pandas dataframes containing lists too, which I also find awkward.
Just a note, I noticed that there is an unfortunate error in the page describing the bike route calculations. Right in the box where it's supposed to show how much faster things are after JIT, instead a traceback is displayed ending with:
> ImportError: Numba needs NumPy 1.20 or less
I don't think this is intentional. This is in the output box just above the phrase "But it runs 250× faster than the pure Python code:". The box below it shows no timings.
I guess this is a danger of having a "live" backend to render and serve documentation.
Don’t be too scared by all the TODOs in the docs; most of that stuff actually works, the docs just haven’t been written yet. Admittedly, those docs have had TODOs on them for over a year.
To see how it works, you’ve got to look at the repo or any of the talks. The developers are also quite active on the tracker, so do just ask if something is unclear.
Julia is there now. It seems absurd to me to build this on top of Python. You are building everything as a DSL that needs to be carefully crafted to never touch the data structures of the host language. Changing from Python to Julia, and suddenly being able to use data structures other than numpy arrays and to write things that should be for loops as for loops, was such a relief. But, I guess as JavaScript has shown, with an infinite amount of engineering resources dedicated to it you can build anything in any language.
I’m extremely sympathetic to this viewpoint. I spent the last couple years building a JIT compiler for PyTorch, and a constant battle was that people want to write “normal” idiomatic Python, but those idioms work to defeat optimizations, unless you go to really crazy lengths. Some of the recent work in this space (torch.fx in particular) has made things easier by allowing the user to “clean up” their model, but it was definitely an eye-opening experience.
Well, Python was at the right place at the right time. I am actually happy that array-oriented programming (via Numpy) went mainstream. A lot of great ideas came from the sheer number of people tinkering and doing great engineering in this domain. If only people had caught on to Iverson's 'Notation as a Tool of Thought' [0] and Backus' 'Can programming be liberated from the von Neumann style? A functional style and its algebra of programs'.
Hi! I'm the original author of Awkward Array (Jim Pivarski), though there are now many contributors with about five regulars. Two of my colleagues just pointed me here—I'm glad you're interested! I can answer any questions you have about it.
First, sorry about all the TODOs in the documentation: I laid out a table of contents structure as a reminder to myself of what ought to be written, but haven't had a chance to fill in all of the topics. From the front page (https://awkward-array.org/), if you click through to the Python API reference (https://awkward-array.readthedocs.io/), that site is 100% filled in. Like NumPy, the library consists of one basic data type, `ak.Array`, and a suite of functions that act on it, `ak.this` and `ak.that`. All of those functions are individually documented, and many have examples.
The basic idea starts with a data structure like Apache Arrow (https://arrow.apache.org/)—a tree of general, variable-length types, organized in memory as a collection of columnar arrays—but performs operations on the data without ever taking it out of its columnar form. (3.5 minute explanation here: https://youtu.be/2NxWpU7NArk?t=661) Those columnar operations are compiled (in C++); there's a core of structure-manipulation functions suggestively named "cpu-kernels" that will also be implemented in CUDA (some already have, but that's in an experimental stage).
A key aspect of this is that structure can be manipulated just by changing values in some internal arrays and rearranging the single tree organizing those arrays. If, for instance, you want to replace a bunch of objects in variable-length lists with another structure, it never needs to instantiate those objects or lists as explicit types (e.g. `struct` or `std::vector`), and so the functions don't need to be compiled for specific data types. You can define any new data types at runtime and the same compiled functions apply. Therefore, JIT compilation is not necessary.
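For a concrete feel of what that means from the user's side, here is a rough sketch (data invented for illustration): the type below is defined entirely at runtime, and restructuring it never compiles anything new.

import awkward as ak

# A list-of-records-of-lists type, defined at runtime from the data itself:
events = ak.Array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
    [],
    [{"x": 3.3, "y": [1, 2, 3]}],
])

events["x"]           # project out one field: [[1.1, 2.2], [], [3.3]]
events["y", ..., 1:]  # drop the first element of every "y" list

Both operations only rearrange the internal index arrays; no record or list objects are ever instantiated as concrete compiled types.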
We do have Numba extensions so that you can iterate over runtime-defined data types in JIT-compiled Numba, but that's a second way to manipulate the same data. By analogy with NumPy, you can compute many things using NumPy's precompiled functions, as long as you express your workflow in NumPy's vectorized way. Numba additionally allows you to express your workflow in imperative loops without losing performance. It's the same way with Awkward Array: unpacking a million record structures or slicing a million variable-length lists in a single function call makes use of some precompiled functions (no JIT), but iterating over them at scale with imperative for loops requires JIT-compilation in Numba.
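Here's a minimal sketch of the imperative side (invented data; the array-oriented equivalent would be a one-liner like `ak.sum(array, axis=1)`):

import awkward as ak
import numba as nb
import numpy as np

array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

@nb.njit
def sum_per_list(array):
    # Iterate over an Awkward Array inside JIT-compiled code; no Python objects are created.
    out = np.zeros(len(array))
    for i in range(len(array)):
        for x in array[i]:
            out[i] += x
    return out

sum_per_list(array)  # approximately array([6.6, 0.0, 9.9])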
Just as we work with Numba to provide both of these programming styles—array-oriented and imperative—we'll also be working with JAX to add autodifferentiation (Anish Biswas will be starting on this in January; he's actually continuing work from last spring, but in a different direction). We're also working with Martin Durant and Doug Davis to replace our homegrown lazy arrays with industry-standard Dask, as a new collection type (https://github.com/ContinuumIO/dask-awkward/). A lot of my time, with Ianna Osborne and Ioana Ifrim at my university, is being spent refactoring the internals to make these kinds of integrations easier (https://indico.cern.ch/event/855454/contributions/4605044/). We found that we had implemented too much in C++; more of the code (though not all of it) needs to be in Python to interact with third-party libraries.
If you have any other questions, I'd be happy to answer them!
Do you have some examples of slicing, masking, un-padding, and (I suppose) “Haskell-like” ops, eg fmap, but also eg treemap, vmap, pmap that are in Jax? Also grouping, cutting, and interweaving. And, this is kind of a weird ask, but suppose I had an operation in pure assembly that is extremely fast, taking two int64 parameters and outputting one int64: what’s the easiest path for me to get awkward to apply that to two Arrays and give me one Array back as output?
Slicing covers what I think you mean by "masking." (You mean keeping only the array elements that are `true` in a boolean array? There's another function we call ak.mask that keeps all array elements, but replaces the ones that line up with `false` with a missing value: https://awkward-array.readthedocs.io/en/latest/_auto/ak.mask...)
Mapping is implicit, as it is in NumPy. If you use an Awkward Array in any NumPy ufunc, including binary operators like `+`, `-`, `*`, `==`, `<`, `&`, etc., then all the arrays will be broadcasted and computed element-by-element. This is true whether the data structure is flat or a deep tree. ("Array-oriented" is a different style from "functional.")
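For example (a rough sketch, numbers rounded in the comments):

import awkward as ak
import numpy as np

array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
selection = array > 2            # a jagged boolean array with the same structure

array[selection]                 # drops elements:      [[2.2, 3.3], [], [4.4, 5.5]]
ak.mask(array, selection)        # keeps the structure: [[None, 2.2, 3.3], [], [4.4, 5.5]]

np.sqrt(array)                   # ufuncs map element-by-element: [[1.05, 1.48, 1.82], [], [2.1, 2.35]]
array + 100                      # scalar broadcasting:           [[101.1, 102.2, 103.3], [], [104.4, 105.5]]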
There hasn't been much call for grouping yet—Awkward Array is more like a NumPy extension than a Pandas extension—but there is a way to do it by combining a few functions, which is described in the ak.run_lengths documentation: https://awkward-array.readthedocs.io/en/latest/_auto/ak.run_...
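The recipe is roughly: sort, then cut the sorted array wherever the value changes. A sketch with invented data:

import awkward as ak

data = ak.Array([3, 1, 3, 2, 1, 1, 3])
sorted_data = ak.sort(data)                                      # [1, 1, 1, 2, 3, 3, 3]
groups = ak.unflatten(sorted_data, ak.run_lengths(sorted_data))  # [[1, 1, 1], [2], [3, 3, 3]]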
For wrapping a function with an ABI interface, I think the easiest way to do that would be to use Numba and ctypes.
import ctypes
import awkward as ak
import numba as nb
libm = ctypes.cdll.LoadLibrary("/lib/x86_64-linux-gnu/libm.so.6")
libm_exp = libm.exp
libm_exp.argtypes = (ctypes.c_double,)
libm_exp.restype = ctypes.c_double
libm_exp(0) # 1.0
libm_exp(1) # 2.718281828459045
libm_exp(10) # 22026.465794806718
@nb.vectorize([nb.float64(nb.float64)])
def ufunc_exp(x):
    return libm_exp(x)
array = ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])
ufunc_exp(array) # calls libm_exp on every value in array, returns an array of the same structure
Numba's @vectorize decorator (https://numba.pydata.org/numba-doc/latest/user/vectorize.htm...) makes a ufunc, and Awkward Array knows how to implicitly map ufuncs. (It is necessary to specify the signature in the @vectorize argument; otherwise, it won't be a true ufunc and Awkward won't recognize it.)
When Numba's JIT encounters a ctypes function, it goes to the ABI source and inserts a function pointer in the LLVM IR that it's generating. Unfortunately, that means that there is function-pointer indirection on each call, and whether that matters depends on how long-running the function is. If you mean that your assembly function is 0.1 ns per call or something, then yes, that function-pointer indirection is going to be the bottleneck. If you mean that your assembly function is 1 μs per call and that's fast, given what it does, then I think it would be alright.
If you need to remove the function-pointer indirection and still run on Awkward Arrays, there are other things we can do, but they're more involved. Ping me in a GitHub Issue or Discussion on https://github.com/scikit-hep/awkward-1.0
Amazing.. 100 gratitudes for you from me for taking the time out to explain all of that. Very impressed by this and all the work that’s been done.
Oh and for un-padding, I meant like how do I do the inverse of fill_none . pad_none
Also saw there was some stuff about algebraic types (eg semigroup reductions) - is that kind of algorithm-level type annotation a direction you all are interested in exploring further?
Un-padding: something like string-trimming (e.g. `str.rstrip`), but for missing values at the ends of lists... There isn't a function for that.
If you happen to know that the only uses of missing values are at the ends of lists, `ak.is_none` and `ak.sum` (with the appropriate `axis`) can count them, and you could perhaps construct a slice from that (negative to count from the end, and therefore slice off the missing values only). I'd have to think about it, but that would be the beginning of a columnar implementation of "unpad_none".
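Something like this sketch (untested, and assuming the missing values really are only at the ends of the lists):

import awkward as ak

array = ak.Array([[1.1, 2.2, None, None], [3.3, None], [4.4, 5.5, 6.6]])

# Count the missing values per list, then keep only indices before the padded tail:
n_missing = ak.sum(ak.is_none(array, axis=1), axis=1)            # [2, 1, 0]
keep = ak.local_index(array, axis=1) < ak.num(array, axis=1) - n_missing
array[keep]                                                      # [[1.1, 2.2], [3.3], [4.4, 5.5, 6.6]]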
As for the algebraic types, I was using the terminology to explain what the reducers do. Some operations, like sum and product, have identities, and some don't, like argmin.
As for type annotations, I don't know what you mean. We're not using Python type annotations, but they'd be too coarse to describe what these operations do. Awkward-specialized type annotations might be overkill. For Dask, which needs to be able to predict types, we're passing tracer objects through the codebase to observe the types change without actually computing values, so it's a type-propagation by execution.
Ah interesting. So with the algebraic stuff I meant: if you have a semigroup or a monoid homomorphism, it translates nicely into a parallel distributed computation problem; hence a semigroup flag works nicely with the reduction ops.
Like, for instance, what if I could make an array of heterogeneous ufuncs and apply that to a similarly shaped array (like an Applicative)... like if I wanted to implement eg graph re-writing by applying a rules ufunc array to an adjacency array, etc, or even, to get very meta, apply a rules function array to another rules function array.
Or if I wanted to compute eg the fixed point of a series of those applications, etc.
Or maybe if I wanted to use Arrow types to abstractly represent computations within each cell, do some fancy stuff in each cell, perform some rudimentary ’compiler optimization’ by inspecting which cells would end up doing unnecessary work (in the context of whatever problem I am doing; eg suppose I only permitted 3 chained ufunc calls per cell or something weird like that), that would be really cool too
Or eg if for some unknown reason I wanted each cell to fire off 2 concurrent ufuncs within each cell, and I only was interested in the result that ‘won’ the data race for each cell, I could use eg an Alternative in the style of the Concurrently library.
Or if I wanted eg each cell to be like a MonadPlus; do some work in the cell but also provide builtin “recovery” capabilities per cell if the cell evaluated to empty/missing/None
Ah now another interesting possibility could be a matrix of lambda calculus statements..!
Naturally, we considered this! :) We needed a larger set of tree node types than Apache Arrow in order to perform some of these operations without descending all the way down a subtree. For example, the implementation of the slice described in the video I linked above requires list offsets to be described as two separate arrays, which we call `starts` and `stops`, rather than a single set of `offsets`. So we have a ListArray (most general, uses `starts` and `stops`) and a ListOffsetArray (for the special case with `offsets`) and some operations produce one, other operations produce the other (as an internal detail, hidden from high-level users). Arrow's ListType is equivalent to the ListOffsetArray. If we had been forced to only use ListOffsetArrays, then the slice described in the video would have to propagate down to all of a tree node's children, and we want to avoid that because a wide record can have a lot of children.
So Awkward Array has a superset of Arrow's node types. However, one of the great things about columnar data is that transformation between formats can share memory and be performed in constant time. In the ak.to_arrow/ak.from_arrow functions (https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_a...), ListOffsetArrays are replaced with Arrow's list node type with shared memory (i.e. we give it to Arrow as a pointer). Our ListArrays are rewritten as ListOffsetArrays, propagating down the tree, before giving it to Arrow. If you're doing some array slicing and your goal is to end up with Arrow arrays, what we've effectively done is delayed the evaluation of the changes that have to happen in the list node's children until they're needed to fit Arrow's format. You might do several slices in your workflow, but the expensive propagation into the list node's children happens just once when the array is finally being sent to Arrow.
As far as what matters for users, Awkward Arrays are 100% compatible with Arrow through the ak.to_arrow/ak.from_arrow functions, and the conversion usually shares memory, with O(1) cost in the length of the array. When memory isn't shared and the conversion isn't O(1), it's because it's performing evaluations that you were saved from having to do earlier.
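In code, the round trip is just (a sketch with invented data):

import awkward as ak
import pyarrow as pa

ak_array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

pa_array = ak.to_arrow(ak_array)      # a pyarrow array, usually sharing memory with ak_array
isinstance(pa_array, pa.Array)        # True
roundtrip = ak.from_arrow(pa_array)   # back to an ak.Array, also zero-copy where possible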
That sounds great! I'm not working in Python though, but I am interested in the Awkward Array data structure. I'm trying to find example code that just uses libawkward from C++.
I think it would be great to develop this concept in other languages, and I even tried to do some standardization before implementation at the start of the project, but when it came down to it, it had to be implemented in some language. Even then, as users needed specific functions, we added them directly in Python whenever possible, since the development time of doing it in C++ and only exposing it in Python was prohibitive. Clement Helsens found this out when he tried to use the C++ layer without Python (https://github.com/clementhelsens/FCCAnalyses/blob/awkward/a...); a lot of what he needed wasn't there.
But the ideas are simple, and if you have a language in mind, building up a similar implementation could be pretty quick for a specific application—it's generality that takes a long time. This page: https://awkward-array.readthedocs.io/en/latest/ak.layout.Con... has links to all the node types that we've found useful and each link has a minimal, demo-style implementation—not the real implementation with all its baggage. They show what it looks like to start implementing such a thing.
If the language you're talking about is Julia, I'm doubly interested. In high-energy physics, there's a group of us who think that Julia would be a great alternative to the mix of Python and C++ we currently have. Despite the comments above about Julia having this already, Awkward Array is a rather different thing from JIT-compiled data structures, and it would be as valuable in Julia as it is in Python, not least because GPUs have to be addressed in a vectorized way.
> Despite the comments above about Julia having this already, Awkward Array is a rather different thing from JIT-compiled data structures, and it would be as valuable in Julia as it is in Python, not least because GPUs have to be addressed in a vectorized way.
This is just done through multiple dispatch with type parameters in Julia. Vector{Vector{Float64}} is an "array of lists". Vector{Dict{Symbol,Float64}} is an "array of records". Vector{Union{Missing,Vector{Float64}}} is an array of lists that can have missings, where the memory and generated code are optimized for this union. Vector{String} is a vector of strings. LazyArrays.jl provides lazy array implementations, which satisfy the AbstractArray interface so they can be used pretty much anywhere a standard Array can be. Primitives like flattening and grouping are just the built-in `reduce(hcat,x)` etc., which have dispatches that optimize the operation to allocate only once. Etc.
As standard arrays, you can do all of the `sum`, `prod`, broadcasting, linear algebra, `\`, `transpose`, etc. on these objects because of dispatches on `::AbstractArray`. You can put CuArrays into these kinds of objects, which is one of the building blocks that Flux.jl uses to represent neural network weights, BTW. The entire DataFrames.jl is built on such generically typed arrays, so data-filtering operations like Query.jl support these types along with their unions with missings. And these are the objects used across the whole language, so the JIT of course applies to them.
Then if you want heterogeneous arrays that don't mix implementation with representation, my favorite is https://github.com/jonniedie/ComponentArrays.jl which are arrays of arrays that are stored contiguously which thusly work well with linear algebra. https://jonniedie.github.io/ComponentArrays.jl/dev/examples/... that's an example of quickly implementing some neural networks and using it with AD, and it all works with broadcast etc. to build optimized GPU kernels.
I don't get what's missing? Seems like that's the list in the documentation.
I think the bigger piece is the reversal of structure. In Julia, extensions to arrays come from users and are not hardcoded into the Base library. Allowing GPUs, complex numbers, or dictionaries as elements is not special-cased. If you create a type for numbers with error bounds (https://github.com/JuliaPhysics/Measurements.jl), everything here still applies: Vector{Dict{Symbol,Measurement}} generates the efficient code. So by design it's impossible to enumerate all of the ways a Julia array can be used, and that's why it's built with multiple dispatch on parametric types. So while I listed out the examples going down the Awkward Array docs, note that by composability you also get all of the combinations of the above, along with combinations of any object a user builds in the language.
I commend the effort of hardcoding a lot of cases people might use in Python, but man this really makes clear how a general solution is usually the right idea after the 3rd specialization.
> I don't get what's missing? Seems like that's the list in the documentation.
The essential part of this is the fact that it is a columnar implementation. Although that doesn't show up at the interface level, it's not just a detail: the time complexities and the degree of memory sharing between input and output are different for these algorithms. The video link above is fast-forwarded to the 3.5 minute section that describes this in detail. (I've talked about it in about a dozen talks, but that one is a pretty good presentation.)
Keep in mind, I think Julia is great! The composability and generality of the array abstract type make short work of a wide variety of non-columnar algorithms. I wish we were using it more in high energy physics. As I said above, I'm doubly interested if someone is thinking about doing columnar algorithms in Julia. (The multiple dispatch would make it a bit different: you probably wouldn't want a high-level wrapper like ak.Array, just a uniform interface on all the node types, since they don't have class methods to expose internals.)
The columnar approach has two main advantages, only one of which is to make up for Python's necessarily slow implementation. The other is that it improves opportunities for vectorization, the hardware equivalent of what we're doing for Python. Doing it in Julia would not be superfluous!
Also note how I was able to tell Julia to just reinterpret a bunch of contiguous floating point values as objects of type `Muon`, which produced a `ReinterpretArray`. Nowhere in there did I ever copy any data from the original array produced by the `rand(3*7)` call.
As mentioned in my post, ComponentArrays.jl will represent it columnarly and uses that for doing this efficiently with GPUs and automatic differentiation.
This isn't really comparable to JAX afaict - JAX offers only masked dense computation, this is first-class ragged support with no mention of autodiff or jit (outside of numba)
Awkward is a complement to Numba, not an alternative.
Awkward is the container for data. Like Numpy, it supports storing data as low-level arrays rather than everything being a "boxed" Python object. While Numpy is for regular arrays, Awkward adds irregular arrays of JSON-like objects.
Numba is the computation layer. Its JIT compiler builds faster code by specialising for specific concrete data types, rather than Python allowing everything to be dynamically changed.
If Numba code is fed arrays of Numpy/Awkward data, your computations get much faster.
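For example, a sketch of that combination (invented data):

import awkward as ak
import numba as nb
import numpy as np

@nb.njit
def max_per_event(events):
    # Loops over Awkward data compile to machine code; nothing is boxed into Python objects.
    out = np.full(len(events), -np.inf)
    for i in range(len(events)):
        for x in events[i]:
            if x > out[i]:
                out[i] = x
    return out

events = ak.Array([[1.1, 2.2], [0.5], [3.3, 0.1, 2.0]])
max_per_event(events)  # array([2.2, 0.5, 3.3])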
Can I define a deep neural network as an Awkward array of matrices of different sizes? It looks very promising to compute a backpropagation step without ping-ponging with the Python interpreter.
I hadn't considered an Awkward array being used as a neural network, and I would expect specialized neural network frameworks to optimize it better: don't they fuse the matrices representing each layer into a single outside-of-Python computation? If they don't, I wonder why not.
Nevertheless, my ears perked up at "matrices of different sizes," since that was something I implemented a while back thinking it would find a use-case if it were well advertised.
You can matrix-multiply two arrays of matrices, in which all the matrices have different shapes, as long as matrix "i" in the first array is compatible with matrix "i" in the second array. Like this:
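A sketch of the kind of call I mean, where the entry point is assumed to be NumPy's `np.matmul` and the shapes are invented for illustration:

import awkward as ak
import numpy as np

left = ak.Array([
    [[1, 2], [3, 4], [5, 6]],    # 3×2
    [[1, 2, 3], [4, 5, 6]],      # 2×3
    [[1], [2], [3], [4]],        # 4×1
])
right = ak.Array([
    [[1, 2, 3], [4, 5, 6]],      # 2×3
    [[1], [2], [3]],             # 3×1
    [[1, 2]],                    # 1×2
])

output = np.matmul(left, right)  # output[0] is 3×3, output[1] is 2×1, output[2] is 4×2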
The left[0]'s 3×2 times right[0]'s 2×3 results in the output's 3×3, and so on for i=1 and i=2. This is the kind of operation you wouldn't even be able to talk about if you didn't have jagged arrays.
I meant to add that autodiff is a feature we'll be working on this upcoming spring, by using JAX arrays as the buffers. Until we have autodiff, backpropagation won't be possible.
It's a C++ library with a Python interface. How much C++ and how much Python is currently in flux (https://indico.cern.ch/event/855454/contributions/4605044/): we're moving toward having the majority of the library be in Python with only the speed-critical parts in C++ behind a C interface.
The main library has no JIT, so no LLVM, but there is a secondary method of access through Numba, which has JIT and LLVM.
The main method of access is to call array-at-a-time ("vectorized") functions on the arrays, like NumPy. The types of the data in the arrays can be defined at runtime and are stored in a columnar way, like Apache Arrow. This format allows operations to be performed on dynamically typed data at compile-time speeds without specialized compilation—it can be done with a suite of precompiled functions. That's possible because these data structures are never instantiated in the normal "record-oriented" way: they're sets of integer arrays that are only interpreted as data structures, and it's not hard to precompile a suite of functions that only operate on integer arrays.
A 3.5 minute explanation of the columnar format and what a columnar operation looks like can be found here: https://youtu.be/2NxWpU7NArk?t=661
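To make "sets of integer arrays interpreted as data structures" concrete, here is a rough, simplified sketch in plain NumPy (not the library's actual node classes):

import numpy as np

# The jagged array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] in columnar form:
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
offsets = np.array([0, 3, 3, 5])     # list i is content[offsets[i]:offsets[i+1]]

# "Drop the first element of every list" touches only the index arrays:
starts = np.minimum(offsets[:-1] + 1, offsets[1:])
stops  = offsets[1:]
# now list i is content[starts[i]:stops[i]]  ->  [[2.2, 3.3], [], [5.5]]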
It's a complex structure with many array elements (30 million), but those elements have a common type. It's 1.4 GB of uncompressed JSON. If I convert it to a Parquet file (a natural format for data like these), it's 1.6 GB of uncompressed Parquet. It could get much smaller with compression, but since the same numbers are repeated, compressing it would not be a fair comparison. (Note that I'm using 3 bytes per float in the JSON; uncompressed Parquet uses 8 bytes per float. I should generate something like it with random numbers and then compress the Parquet.)
Reading the JSON into Python dicts and lists using the standard library `json` module takes 70 seconds and uses 20 GB of RAM (for the dicts and lists, not counting the original string).
Reading the Parquet file into an Awkward Array takes 3.3 seconds and uses 1.34 GB of RAM (for just the array).
Reading the JSON file into an Awkward Array takes 39 seconds and uses the same 1.34 GB of RAM. If you have a JSONSchema for the file, an experimental method (https://github.com/scikit-hep/awkward-1.0/pull/1165#issuecom...) would reduce the reading time to 6.0 seconds, since most of that time is spent discovering the schema of the JSON data dynamically.
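In code, those reads look something like this (hypothetical file names; `ak.from_json` shown with a plain JSON string read from disk):

import awkward as ak

array_from_parquet = ak.from_parquet("data.parquet")        # columnar read into an ak.Array
array_from_json = ak.from_json(open("data.json").read())    # schema discovered dynamically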
The main thing, though, is that you can compute the already-loaded data in faster and more concise ways. For instance, if you wanted to slice and call a NumPy ufunc on the above data, like
output = np.square(array["y", ..., 1:])
the equivalent Python would be
output = []
for sublist in python_objects:
    tmp1 = []
    for record in sublist:
        tmp2 = []
        for number in record["y"][1:]:
            tmp2.append(np.square(number))
        tmp1.append(tmp2)
    output.append(tmp1)
The Python code runs in 140 seconds and the array expression runs in 2.4 seconds (current version; in the unreleased version 2 it's 1.5 seconds).
For both loading data and performing operations on them, it's orders of magnitude faster than pure Python—for the same reasons the same can be said of NumPy. What we're adding here is the ability to do this kind of thing on non-rectangular arrays of numbers.