
Try pyarrow.ParquetFile.iter_batches()

Streams batches of rows

https://arrow.apache.org/docs/python/generated/pyarrow.parqu...

Edit — You may need to do some extra work with s3fs too, from what I recall of the default pandas S3 reading

Edit 2 — or check out pyarrow.fs.S3FileSystem :facepalm:
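
A rough sketch of what that combination might look like (the bucket, key, region, and column names are made up, and credentials are assumed to be configured for pyarrow's S3 filesystem):

    import pyarrow.parquet as pq
    from pyarrow import fs

    # Assumed region; S3FileSystem also picks up the usual AWS env vars / config.
    s3 = fs.S3FileSystem(region="us-east-1")

    # Open the remote file without downloading it in full.
    with s3.open_input_file("my-bucket/path/to/data.parquet") as f:
        pf = pq.ParquetFile(f)
        for batch in pf.iter_batches(batch_size=65_536, columns=["col_a", "col_b"]):
            df = batch.to_pandas()  # each batch is a pyarrow.RecordBatch
            ...                     # process df, then let it go out of scope

Note that pyarrow still reads a whole row group (for the selected columns) at a time under the hood, so if the file has one giant row group this won't help much with memory.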




I've spent many, many hours trying these suggestions and didn't have much luck. iter_batches loads the whole file (or some very large part of it) into memory.


It sounds like your parquet file may have no partitioning. Apart from iterating over row groups, as someone else suggested, I suspect there is no better solution than downloading the whole thing to your computer, partitioning it in a sane way, and uploading it again (sketched below). It's only 15 GB, so that should be fine even on an old laptop.

Of course, then you might as well do all the processing you're interested in while the file is on your local disk, since that is probably much faster than the cloud service's disk.
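
One hedged way to do the repartitioning step locally with pyarrow's dataset API, assuming a recent pyarrow and a hypothetical partition column "date" (paths are placeholders too):

    import pyarrow.dataset as ds

    # Scan the downloaded file lazily rather than loading all 15 GB into RAM.
    src = ds.dataset("data.parquet", format="parquet")

    # Rewrite it as a Hive-style partitioned dataset with smaller row groups,
    # so later readers can fetch only the pieces they need.
    ds.write_dataset(
        src,
        "data_partitioned/",
        format="parquet",
        partitioning=["date"],        # hypothetical partition column
        partitioning_flavor="hive",
        max_rows_per_group=128_000,
    )

Then upload the data_partitioned/ directory back to S3.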


What do you mean by the parquet file having no partitioning? Isn't the row group size the implicit partitioning?
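
For reference, the row-group layout of a file can be inspected like this (local path and column name are placeholders):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")
    meta = pf.metadata
    print(meta.num_row_groups, "row groups,", meta.num_rows, "rows total")

    # Individual row groups can be read without touching the rest of the file:
    first = pf.read_row_group(0, columns=["col_a"])  # hypothetical column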



