
Try pyarrow.ParquetFile.iter_batches()

Streams batches of rows

https://arrow.apache.org/docs/python/generated/pyarrow.parqu...

Edit — You may need to do some extra work with s3fs too, from what I recall of the default pandas S3 reading

Edit 2 — or check out pyarrow.fs.S3FileSystem :facepalm:
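
A rough sketch of what that combination might look like (the bucket, key, region, and column names are made up, and credentials are assumed to be configured for pyarrow's S3 filesystem):

    import pyarrow.parquet as pq
    from pyarrow import fs

    # Assumed region; S3FileSystem also picks up the usual AWS env vars / config.
    s3 = fs.S3FileSystem(region="us-east-1")

    # Open the remote file without downloading it in full.
    with s3.open_input_file("my-bucket/path/to/data.parquet") as f:
        pf = pq.ParquetFile(f)
        for batch in pf.iter_batches(batch_size=65_536, columns=["col_a", "col_b"]):
            df = batch.to_pandas()  # each batch is a pyarrow.RecordBatch
            ...                     # process df, then let it go out of scope

Note that pyarrow still reads a whole row group (for the selected columns) at a time under the hood, so if the file has one giant row group this won't help much with memory.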




I've spent many, many hours trying these suggestions and didn't have much luck. iter_batches loads the whole file (or some very large part of it) into memory.


It sounds like your parquet file may have no partitioning. Apart from iterating over row groups, as someone else suggested, I suspect there is no better solution than downloading the whole thing to your computer, partitioning it in a sane way, and uploading it again (sketched below). It's only 15 GB, so that should be fine even on an old laptop.

Of course, then you might as well do all the processing you're interested in while the file is on your local disk, since that is probably much faster than the cloud service's disk.
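
One hedged way to do the repartitioning step locally with pyarrow's dataset API, assuming a recent pyarrow and a hypothetical partition column "date" (paths are placeholders too):

    import pyarrow.dataset as ds

    # Scan the downloaded file lazily rather than loading all 15 GB into RAM.
    src = ds.dataset("data.parquet", format="parquet")

    # Rewrite it as a Hive-style partitioned dataset with smaller row groups,
    # so later readers can fetch only the pieces they need.
    ds.write_dataset(
        src,
        "data_partitioned/",
        format="parquet",
        partitioning=["date"],        # hypothetical partition column
        partitioning_flavor="hive",
        max_rows_per_group=128_000,
    )

Then upload the data_partitioned/ directory back to S3.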


What do you mean by the parquet file having no partitioning? Isn't the row group size the implicit partitioning?
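
For reference, the row-group layout of a file can be inspected like this (local path and column name are placeholders):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")
    meta = pf.metadata
    print(meta.num_row_groups, "row groups,", meta.num_rows, "rows total")

    # Individual row groups can be read without touching the rest of the file:
    first = pf.read_row_group(0, columns=["col_a"])  # hypothetical column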



