>If you’re running a query like “give me the average price of all transactions over the past 5 years”, a relational database would have to load all the rows from the previous 5 years even though it merely wants to aggregate the price field; a columnar database would only have to examine one column — the price column.
Columnar databases can also have indices. If there is an index on date (or time), then your DB will know the row range from the index and will read only the price column within that range. If there is no index on date, it would read two columns, but that is still much less than a full row with many columns.
Maybe "past 5 years" means "over everything", in which case it might technically be one column.
Or if you're using something like S3 with Parquet as the "columnar database" and it is partitioned by date, then the date values are stored in metadata, so you would logically read two columns but physically read only one from storage. Same story for something like Redshift with date as the sortkey.
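To make the column-pruning point concrete, here's a toy sketch in plain Python (no real database; the table, column names, and values are all made up) contrasting a row layout with a columnar layout for the average-price query:

```python
# Toy illustration of row-store vs column-store scans (hypothetical data).
# A row store keeps each transaction together: to average one field it
# still touches every field of every row it reads.
rows = [
    {"id": 1, "date": "2021-03-14", "customer": "a", "price": 10.0},
    {"id": 2, "date": "2022-07-01", "customer": "b", "price": 20.0},
    {"id": 3, "date": "2023-11-30", "customer": "c", "price": 30.0},
]
avg_row_store = sum(r["price"] for r in rows) / len(rows)

# A column store keeps each field in its own contiguous array: the same
# query scans only the price column (plus date, if it must filter on it).
columns = {
    "id": [1, 2, 3],
    "date": ["2021-03-14", "2022-07-01", "2023-11-30"],
    "customer": ["a", "b", "c"],
    "price": [10.0, 20.0, 30.0],
}
avg_col_store = sum(columns["price"]) / len(columns["price"])

print(avg_row_store, avg_col_store)  # same answer, very different I/O
```

Both layouts return the same average; the difference is how many bytes the query engine has to pull off disk to get there.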
If the number of transactions is small enough to query quickly, a simple SQL query will give you that. If the number of transactions is very large, querying a random sample of perhaps 1 in 1000 will give you reasonably accurate results. You can calculate the probable error from the sample size.
There are questions which require examining all the data, but often you don't really need to.
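A minimal sketch of the sampling idea, with synthetic data (the million-row table and the 1-in-1000 rate are just for illustration). The "probable error" here is the standard error of the sample mean: the sample standard deviation divided by the square root of the sample size.

```python
import random
import statistics

random.seed(42)

# Synthetic "transactions": a million prices we pretend are too big to scan.
prices = [random.uniform(1, 100) for _ in range(1_000_000)]

# Sample roughly 1 in 1000 rows instead of reading everything.
sample = [p for p in prices if random.random() < 0.001]

mean = statistics.mean(sample)
# The error shrinks with sqrt(n): standard error of the sample mean.
stderr = statistics.stdev(sample) / len(sample) ** 0.5

print(f"estimate {mean:.2f} +/- {stderr:.2f} (n={len(sample)})")
```

With ~1000 sampled rows the estimate typically lands within a couple of standard errors of the true mean (50.5 for this uniform distribution), while reading 0.1% of the data.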
Wouldn't that be two columns?