I work with pyspark and parquet quite a lot. I've never had to deal with parquet outside Spark, but this is how I would do it:

- Write a pandas_udf function in pyspark.

- Partition your data into smaller bits so that the pandas_udf does not get too much data at the same time.

Something like:

```
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pandas as pd

spark = SparkSession.builder.getOrCreate()

@f.pandas_udf("integer")
def ingest(doc: pd.Series) -> pd.Series:  # doc is a pandas Series now
    # your processing goes here -> write to DB etc.
    # return a Series of 0s just to make Spark happy
    return pd.Series([0] * len(doc))

df = spark.read.parquet("s3 path")
df = df.repartition(1000)  # bump up this number if you run into memory issues
df = df.withColumn("foo", ingest(f.col("doc_column")))
df.count()  # Spark is lazy, so you need some action to actually run the UDF
```

Now the trick is, you can limit how much data is given to your pandas_udf by repartitioning your data. The more the partitions, the smaller the pd.Series that your pandas_udf gets. There's also the `spark.sql.execution.arrow.maxRecordsPerBatch` config that you can set in spark to limit memory consumption.
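Something like this, for example (the 2000 is just a number I picked to illustrate, tune it to your memory budget):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # cap how many rows go into each Arrow batch handed to the pandas_udf
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "2000")
    .getOrCreate()
)

# it's a runtime SQL conf, so you can also change it on an existing session
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "2000")
```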

^ Probably overkill to bring spark into the equation, but this is one way to do it.

You can use a normal udf (i.e. `f.udf()`) instead of a pandas_udf, but apparently that's slower because rows are serialized between the JVM and Python one at a time instead of in Arrow batches.
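For comparison, a row-at-a-time sketch with a plain udf (same made-up `doc_column` as above):

```
import pyspark.sql.functions as f

@f.udf("integer")
def ingest_row(doc):  # called once per row with a plain Python value
    # process a single document here -> write to DB etc.
    return 0

df = df.withColumn("foo", ingest_row(f.col("doc_column")))
```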




Pyspark is probably the way to go.

I just wanted to mention that AWS Athena eats 15G parquet files for breakfast.

It is trivial to map the file into Athena.

But you can't connect it to anything other than file output. It can help you, for example, to split the file into smaller chunks, or to choose another output format such as csv (although arbitrary email content in a csv feels like you are set up for parsing errors).

The benefit is that there is virtually no setup cost. And processing cost for a 15G file will be just a few cents.
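If you'd rather script it than click through the console, mapping the file is just a DDL statement run through Athena. Rough boto3 sketch, where the table name, the single `doc` column, and the S3 paths are all made-up placeholders:

```
import boto3

athena = boto3.client("athena")

# map the existing parquet file(s) as an external table (placeholder schema/paths)
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS emails (
  doc string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/path/to/input/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
```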


Athena is probably my best bet tbh, especially if I can do a few clicks and just get smaller files. Processing smaller files is a no brainer / pretty easy and could be outsourced to lambda.


Yeah the big benefit is that it requires very little setup.

You create a new partitioned table/location from the originally mapped file using a CTAS like so:

  CREATE TABLE new_table_name
  WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://your-bucket/path/to/output/',
    partitioned_by = ARRAY['partition_column_name']
  ) AS
  SELECT *  -- partition columns must come last in the SELECT
  FROM original_table_name
You can probably create a hash and partition by the last character if you want 16 evenly sized partitions. Unless you already have a dimension to partition by.
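Something along these lines in the SELECT of the CTAS (sketch only; `message_id` is a made-up id column, and `bucket` would then go into `partitioned_by = ARRAY['bucket']`). You could run it through the same `start_query_execution` call as in the sketch above:

```
# hypothetical: derive a 16-way bucket from a hash of some id column,
# then partition the CTAS by it
ctas_select = """
SELECT *,
       substr(to_hex(md5(to_utf8(CAST(message_id AS varchar)))), -1) AS bucket
FROM original_table_name
"""
```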


It's been a while (~5yr) since I've done anything with Spark, but IIRC it used to be very difficult to make reliable jobs with the Java or Python APIs due to the impedance mismatch between Scala's lazy evaluation semantics and the eager evaluation of Java and Python. I'd encounter perplexing OOMs whenever I tried to use the Python or Java APIs, so I (reluctantly) learned enough Scala to make the Spark go brr and all was well. Is it still like this?


Same for me, the only reason to learn Scala was Spark. The Java API was messy. And still today, I like Scala, and many functional languages in general, but for jumping between projects they are a nightmare, as everything is dense and cluttered.



