Our Journey with Apache Arrow (Part 2): Adaptive Schemas and Sorting

h1t35h · on July 5, 2023

I've seen the power of moving systems to more machine-readable file formats such as Parquet and Arrow, as opposed to storing data as CSV, JSON, etc. For people who are actually making these design choices: in my experience it has always been a better idea to prefer smarter formats over readable ones for large-scale systems. They really help in the longer term with:

- cost (more maintainable, with less tech overhead)

- storage (lower size)

- compute (faster reads and indexing)
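A rough way to see the storage and read points, using only the Python standard library. This is not Parquet or Arrow (those add encodings, compression, and metadata on top); the packed `array` columns are just a stand-in for a typed, column-oriented binary layout, and the column names and data are made up for illustration.

```python
import array
import csv
import io
import random

random.seed(0)  # fixed seed so the comparison is repeatable
n = 10_000
ids = list(range(n))                                 # hypothetical "id" column
prices = [random.random() * 1000 for _ in range(n)]  # hypothetical "price" column

# Row-oriented text: every value is spelled out as characters.
text_buf = io.StringIO()
writer = csv.writer(text_buf)
writer.writerow(["id", "price"])
writer.writerows(zip(ids, prices))
csv_bytes = text_buf.getvalue().encode("utf-8")

# Column-oriented binary: one packed, typed array per column.
id_col = array.array("i", ids).tobytes()         # C int per value
price_col = array.array("d", prices).tobytes()   # C double per value
binary_size = len(id_col) + len(price_col)

# Reading one column back needs no text parsing and never touches the other column.
restored = array.array("d")
restored.frombytes(price_col)

print(f"csv: {len(csv_bytes)} bytes, packed columns: {binary_size} bytes")
```

With full-precision floats like these, the text form spends far more bytes per value than the packed form, and the floats come back bit-for-bit identical. Real columnar formats usually widen that gap further, but how much depends heavily on the data.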

taeric · on July 5, 2023

I've been seeing a sharp dichotomy between the people who originate data and the people who process it. For the folks that originate data, Excel seems to be such a native language that it is hard to move too far away from it. For processing, absolutely consider Parquet or similar.

So which format you should use really depends on who you are talking to. Ideally, it seems, support both. Allow rapid ingress and egress of data to/from Excel in whatever way you can. CSV is the common stopgap between the two, with loads of sharp edges when loading it into Excel.
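One of those sharp edges, sketched with the standard library only: CSV carries no type information, so whatever loads it has to guess. The field names and values below are invented for the example, and the `int(...)` call is a stand-in for a loader that guesses "this looks like a number", as spreadsheet imports commonly do.

```python
import csv
import io

# Hypothetical records: zip_code is text that happens to look numeric.
rows = [
    {"zip_code": "02134", "quantity": 12, "shipped": "2023-07-05"},
    {"zip_code": "10001", "quantity": 3, "shipped": "2023-07-06"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["zip_code", "quantity", "shipped"])
writer.writeheader()
writer.writerows(rows)

# Read the same data back: every field is now a plain string.
buf.seek(0)
loaded = list(csv.DictReader(buf))
print(loaded[0])

# A loader that guesses types turns the zip code into a number
# and silently drops the leading zero.
guessed = int(loaded[0]["zip_code"])
print(guessed)
```

A typed format stores the schema next to the data, so the reader does not have to guess. That is most of why the processing side prefers it, even if the originating side still wants a spreadsheet.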

h1t35h · on July 5, 2023

Having worked with some teams/stakeholders whose programs and SOPs are very tightly coupled to Excel and CSV data handling, I do have a few observations:

- These processes and SOPs are actually quite optimal for low-scale use cases, but as organizations grow they find themselves in a tough spot where, even with a hiring spree going on, everything seems to be falling apart and is not maintainable.

- You almost always lose the audit trail. You don't always need one, but if your org is sharing tons of Excel docs on a regular basis, you do need a tech integration to solve it. Without one, there is a gap in identifying things that could be made simpler, and no way to identify or fix things that may have gone wrong over time.

- I fully agree with the statement "Ideally, it seems, support both. Allow rapid ingress and egress of data to/from Excel." Data in software systems is useless on its own; it is almost always meant to be consumed by humans. That just means that while I do most of the authoritative processing in file formats that computers handle well, I allow the data to be converted into a more human-readable format, which happens to be Excel in some cases and visualizations in others.

taeric · on July 5, 2023

Fully agreed on the problems of scaling up places that share data using Excel. I harbor a fear that there is no good way to share data in a large organization, and that you are best off looking for ways to shrink the organization and reduce the number of people that need to be involved.

Same thing, largely, seems to apply to documents. Yes, you could farm out a book to a hundred authors and editors. We could build a tool that allows active collaboration that shows what everyone is doing. And we even have some exciting data structures that seem to make this a viable path forward. My experience has been that that isn't the case, though. Specifically, the data structures are fine and nice, but getting even a dozen folks working on the same doc at the same time usually reduces the document to a glorified database with each person doing an optimistic lock on their part so that they can do what they want to do.

guybedo · on July 5, 2023

Not directly related to the article, but might be useful to some people:

I recently used pymongoarrow to speed up data extraction from MongoDB in Python, as the deserialization from BSON to Python is super slow. Using pymongoarrow gave nice speed improvements (3x to 5x on average).

geodel · on July 5, 2023

Besides the performance you described, I love the name pymongoarrow so much that I'd like to give that name to so many things.

yevpats · on July 5, 2023

We also adopted Arrow recently for CloudQuery, our open source ELT framework: https://arrow.apache.org/blog/2023/05/04/adopting-apache-arr...