
No, that's a completely different thing - an index access method, building a bloom filter on multiple columns of a single row. Which means a query then can have an equality condition on any of the columns.

That being said, building a bloom/cuckoo filter as a data type would be quite trivial - a basic version might take an hour or two, the more advanced stuff (to support partial aggregates etc) might take a bit more. Ultimately, this is not that different from what postgresql-hll does.
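
Just to illustrate the difference, the index access method version (contrib bloom) looks roughly like this, assuming a made-up table t(a, b, c):

  create extension bloom;

  -- one bloom filter per heap row, built from several columns; signature length
  -- and per-column bits are tunable
  create index on t using bloom (a, b, c)
    with (length = 80, col1 = 2, col2 = 2, col3 = 4);

  -- equality conditions on any subset of the indexed columns can use it
  select * from t where b = 1 and c = 2;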


Yeah, I certainly understand ZFS was designed for different types of storage, but we now have flash storage as a default ...

Which is why I do these benchmarks - to educate myself and learn what behavior to expect from different systems, and maybe even find stuff to optimize in Postgres. The goal definitely is not to "shame" some of the filesystems (or whatever I benchmark), I have huge respect for all of the engineers working on these things.

FWIW I really like the ideas behind ZFS, and many deployments chose to use it. But that's kinda the point - to decide what's the right trade-off you need some basic understanding of what kind of behavior to expect.


There's a link to a github repo [https://github.com/tvondra/pg-fs-benchmark] with all the configuration details, including the ZFS details for each of the runs.

For example https://github.com/tvondra/pg-fs-benchmark/blob/master/i5/20..., but it's all set by (admittedly ugly) scripts that are also in the repo: https://github.com/tvondra/pg-fs-benchmark/blob/master/i5/ru...

But yeah, I certainly did the tuning recommended on the vadosware.io page (I think I actually used it as one of the sources back then, I guess I should have mentioned that in the blog post).


Would be a great topic for pgconf.eu in June (pgcon moved to Vancouver). Too bad the CfP is over, but there's the "unconference" part (though topics are decided at the event, so no guarantees).


Did you mean pgconf.dev in May (which has the unconference), or pgconf.eu in October (which doesn't have an unconference, but the CfP will open sometime in the - hopefully near - future)?


Yeah, I meant May. Sorry :-( Too many conferences around that time, I got confused.

That being said, submitting this into the pgconf.eu CfP is a good idea too. It's just that it seems like a nice development topic, and the pgcon unconference was always a great place to discuss this sort of stuff. There are a couple more topics in the JIT area, so having a session or two to talk about those and how to move that forward would be beneficial.


Now try adding a new organization with ID 11 and timestamps on the tail end (>2020-01-01). And query for that ID.


Sure.

  -- 1e7 entries for organization_id = 10 distributed evenly over the 10
  -- years after '2020-01-01 00:00:00' as requested
  insert into foo (
    organization_id,
    created_at,
    updated_at)
  select
    10,    --9 was the previous max organization_id so presumably 10 is OK rather than 11
    generate_series('2020-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp + interval '10 years', (interval '10 years')/1e7),
    generate_series('2020-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp + interval '10 years', (interval '10 years')/1e7);
   
  cluster foo using foo_created_at_idx; --should be a no-op but why not
   
  vacuum analyze;   --same thing
   
  create index on foo (updated_at) where organization_id = 10; --stats will be highly-skewed so we had better fix that
   
  explain analyze select * from foo where organization_id = 10 order by updated_at limit 10;
   
                                                                    QUERY PLAN                                                                   
  -----------------------------------------------------------------------------------------------------------------------------------------------
   Limit  (cost=0.43..23.35 rows=10 width=57) (actual time=0.073..0.079 rows=10 loops=1)
     ->  Index Scan using foo_updated_at_idx1 on foo  (cost=0.43..23495662.12 rows=10254706 width=57) (actual time=0.071..0.075 rows=10 loops=1)
   Planning Time: 0.185 ms
   Execution Time: 0.101 ms
BTW, SQLite was no better at solving this riddle. It also failed miserably in its query plan until I created the same partial index.


Not sure, but I see this ...

I simplified the SQL a little bit - I don't think it's necessary to have two timestamps and cluster on one of them. The other timestamp is not correlated, so the access through that index will be random anyway. And I used 10M rows only.

  create unlogged table foo (
     id int primary key generated by default as identity,
     organization_id int,
     created_at timestamptz
  );
  
  -- 10M rows with organization_id 0-9, created_at spread uniformly over 2010-2019
  insert into foo (id, organization_id, created_at)
  select
    i, floor(random()*10),
    timestamp '2010-01-01' + random()*(timestamp '2020-01-01' - timestamp '2010-01-01') 
  from generate_series(1, 1e7) s(i);
  
  -- 100k extra rows for organization_id = 10, all within the year after 2020-01-01 (the skew)
  insert into foo (id, organization_id, created_at)
  select
    1e7 + i, 10,
    timestamp '2020-01-01' + random()*(timestamp '2021-01-01' - timestamp '2020-01-01') 
  from generate_series(1, 100000) s(i);

  -- see the timestamp range for each org
  select organization_id, min(created_at), max(created_at) from foo group by 1 order by 1;
  
  create index on foo (created_at);
  
  vacuum analyze;
  checkpoint;
  
  set max_parallel_workers_per_gather = 0;
  
  explain analyze select * from foo where organization_id = 1 order by created_at limit 10;
  explain analyze select * from foo where organization_id = 10 order by created_at limit 10;
For me, this produces:

                                                                  QUERY PLAN                                                                
  ------------------------------------------------------------------------------------------------------------------------------------------
   Limit  (cost=0.43..5.62 rows=10 width=16) (actual time=0.072..0.378 rows=10 loops=1)
     ->  Index Scan using foo_created_at_idx on foo  (cost=0.43..505826.61 rows=975331 width=16) (actual time=0.071..0.375 rows=10 loops=1)
           Filter: (organization_id = 1)
           Rows Removed by Filter: 76
   Planning Time: 0.040 ms
   Execution Time: 0.390 ms
  (6 rows)
  
                                                                      QUERY PLAN                                                                    
  --------------------------------------------------------------------------------------------------------------------------------------------------
   Limit  (cost=0.43..43.61 rows=10 width=16) (actual time=30316.079..30316.117 rows=10 loops=1)
     ->  Index Scan using foo_created_at_idx on foo  (cost=0.43..505826.61 rows=117161 width=16) (actual time=30316.078..30316.114 rows=10 loops=1)
           Filter: (organization_id = 10)
           Rows Removed by Filter: 10000000
   Planning Time: 0.041 ms
   Execution Time: 30316.136 ms
  (6 rows)
So perhaps there's something "wrong" with your data, because there's no "Filter" or "Rows Removed by Filter" in your query plans.


Now add a partial index and re-run your queries.

  create index on foo (created_at) where organization_id = 10;


Yeah. But the whole discussion here was about the dilemma the optimizer faces if it only has the two indexes on (created_at) and (organization_id), and why the assumption of independence/uniformity does not work for the skewed case.

Of course, adding a composite or partial index may help, but people are confused why the optimizer is not smart enough to just switch to the other index, which on my machine does this:

                                                             QUERY PLAN                                                         
    ----------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=182766.62..182766.65 rows=10 width=16) (actual time=441.647..441.650 rows=10 loops=1)
       ->  Sort  (cost=182766.62..182988.83 rows=88881 width=16) (actual time=441.642..441.644 rows=10 loops=1)
             Sort Key: created_at
             Sort Method: top-N heapsort  Memory: 25kB
             ->  Seq Scan on foo  (cost=0.00..180845.94 rows=88881 width=16) (actual time=210.901..437.292 rows=100000 loops=1)
                   Filter: (organization_id = 10)
                   Rows Removed by Filter: 10000000
     Planning Time: 0.327 ms
     Execution Time: 441.696 ms
    (9 rows)
That's ~2 orders of magnitude faster, but the optimizer has no way to predict this.

Sure, the composite/partial indexes will do much better, but my intent was to explain why LIMIT queries have this problem.


> Yeah. But the whole discussion here was about the dilemma the optimizer faces if it only has the two indexes on (created_at) and (organization_id), and why the assumption of independence/uniformity does not work for the skewed case.

Was it? The whole discussion started when zac23or said at the top of the thread that their biggest problem with PostgreSQL is its optimizer, that this is their worst example, that it needs an index on (organization_id, created_id) for it to be solved, and that SQLite and MS SQL Server do not need that index on (organization_id, created_at).

They didn't initially say how their data are distributed other than that there are millions of rows with organization_id=10, and they didn't initially say that their data are skewed. Later, they said that the data are "distributed equally" and added that organization_id is random, which would be consistent, though they did say that their data are clustered not by created_at but by organization_id. Consequently, the skewed distribution seems to be a special case that you added in this sub-thread. That's fine, but that's a different albeit related problem from the one zac23or posed.

If the original problem is, "Arrange for PostgreSQL to perform well on this query on this data model for equally-distributed data without adding an index on (organization_id, created_id)," that problem is solved. It can be done. If your additional problem is, "Arrange for PostgreSQL to perform well on this query on this data model for unequally-distributed data without adding an index on (organization_id, created_at)," that problem is also solved, with a partial index on just (created_at).

> Of course, adding a composite or partial index may help, but people are confused why the optimizer is not smart enough to just switch to the other index, which on my machine does this:

I'm curious how you forced the optimizer to switch to that query plan if it's not smart enough to do it on its own. I don't know, but I suspect you ordered by an expression, with something like

  explain analyze select * from foo where organization_id = 10 order by created_at + interval '0 day' limit 10;
That produces nearly identical results on my machine, which increases my suspicion.

                                                           QUERY PLAN                                                          
  -----------------------------------------------------------------------------------------------------------------------------
   Limit  (cost=183393.70..183393.73 rows=10 width=24) (actual time=310.827..310.828 rows=10 loops=1)
     ->  Sort  (cost=183393.70..183657.98 rows=105713 width=24) (actual time=307.195..307.196 rows=10 loops=1)
           Sort Key: ((created_at + '00:00:00'::interval))
           Sort Method: top-N heapsort  Memory: 26kB
           ->  Seq Scan on foo  (cost=0.00..181109.28 rows=105713 width=24) (actual time=294.054..300.537 rows=100000 loops=1)
                 Filter: (organization_id = 10)
                 Rows Removed by Filter: 10000000
   Planning Time: 0.116 ms
   JIT:
     Functions: 5
     Options: Inlining false, Optimization false, Expressions true, Deforming true
     Timing: Generation 0.594 ms, Inlining 0.000 ms, Optimization 0.347 ms, Emission 3.296 ms, Total 4.237 ms
   Execution Time: 311.490 ms
This plan is better for organization_id=10 than the one the optimizer chose, without resorting to a partial index, but it comes at a cost: this plan is worse for any of the other values of organization_id, whose data are not skewed.

I think a third question, which is implicit in your comments is, "Well then why doesn't the optimizer use one plan for organization_id=10 and the other plan for the other values of organization_id?" That's a good question. Is this possible? Does any other database do this? I don't know. I'll try to find out but if somebody already has the answer, I'd love to hear it.

A fourth question is, "How do other databases perform on this and similar queries?" In my experiments, the picture is still cloudy. With my original data, SQLite chose the same plan and performed just as badly in the skewed case, until I also gave it a partial index. With your data, SQLite actually outperforms PostgreSQL in the skewed case even without the partial index. Why is that? I don't know. These are just experiments for different cases that are highly context-dependent. So far, I don't see enough evidence to support the original claim, that PostgreSQL's optimizer has a problem and that this is an example.


>> Yeah. But the whole discussion here was about the dilemma the optimizer faces if it only has the two indexes on (created_at) and (organization_id), and why the assumption of independence/uniformity does not work for the skewed case.

> Was it? The whole discussion started when zac23or said at the top of the thread that their biggest problem with PostgreSQL is its optimizer, that this is their worst example, that it needs an index on (organization_id, created_id) for it to be solved, and that SQLite and MS SQL Server do not need that index on (organization_id, created_at). ...

I don't know, but that's how I understood the discussion - I tried to explain why the Postgres optimizer makes this choice, and why planning this type of query (LIMIT + WHERE) is a hard problem in principle.

>> Of course, adding a composite or partial index may help, but people are confused why the optimizer is not smart enough to just switch to the other index, which on my machine does this:

> I'm curious how you forced the optimizer to switch to that query plan if it's not smart enough to do it on its own.

I simply dropped the index on created_at, forcing the planner to use the other index.
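
A minimal way to reproduce that, using the index name from my script above:

  drop index foo_created_at_idx;
  explain analyze select * from foo where organization_id = 10 order by created_at limit 10;

  -- recreate it afterwards to get back to the original setup
  create index on foo (created_at);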

> I think a third question, which is implicit in your comments is, "Well then why doesn't the optimizer use one plan for organization_id=10 and the other plan for the other values of organization_id?" That's a good question. Is this possible? Does any other database do this? I don't know. I'll try to find out but if somebody already has the answer, I'd love to hear it.

The answer is that this does not happen because the optimizer does not have any concept of dependence between the two columns, or (perhaps more importantly) between values in these two columns. So it can't say that for some organization IDs it's not correlated but for ID 10 it is (and it should use a different plan).

Maybe it would be possible to implement some new multi-column statistics. I don't think any of the existing extended statistics types could provide this, but maybe some form of multi-dimensional histogram could? Or something like that? But ultimately, this can never be a complete fix - there'll always be some type of cross-column dependency the statistics can't detect.
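
For context, the existing extended statistics look like this - they improve cross-column estimates, but (as far as I know) none of the current kinds captures where the matching rows sit in the index order, which is what this case would need:

  -- existing extended statistics kinds (dependencies/ndistinct since PG 10, mcv since PG 12)
  create statistics foo_org_created (dependencies, ndistinct, mcv)
    on organization_id, created_at from foo;
  analyze foo;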

> A fourth question is, "How do other databases perform on this and similar queries?"

I'm not sure what all the other databases do, but I think handling this better will require some measure of "risk" or "plan robustness". In other words, some ability to say how sensitive the plan is to estimates not being perfectly accurate, or to some other assumptions being violated. But that's gonna be a "better worst case, worse average" solution.


> I don't know, but that's how I understood the discussion - I tried to explain why the Postgres optimizer makes this choice, and why planning this type of query (LIMIT + WHERE) is a hard problem in principle.

That may be a hard problem. If it is, then a harder problem still is limit + order by. That's why I almost always tell people that limit is not what you want and probably never should be used outside of ad-hoc data exploration.


Abandoned? Certainly not. But it's a complex part of a mature database product, with many existing deployments/users, which means any improvement affects literally everyone. And query planning in general is a hard problem. So it takes time to get new stuff in.


I agree with everything you said, but it's frustrating that the planner fails on a seemingly trivial query ("seemingly trivial", perhaps it's a very complex problem).

> So it takes time to get new stuff in.

It's true, I discovered this problem in PG 9, and it's true in PG 16, if I'm not mistaken.


I think this very much depends on what exactly the problem is - if it's with how costing for LIMIT works (https://news.ycombinator.com/item?id=39728826), then yeah, this hasn't changed since 9.x in a substantial way, so 16 has the same behavior.

On a technical level, this is hard because the optimizer has a fairly limited amount of information (optimizer stats) - in particular, it does not know if the columns are correlated. If they are not, then this costing is perfect, and the plan is the correct one. And this assumption of independence is what most databases do by default, so we do that. But if the columns are correlated, it blows up, unfortunately. That is, it's not a very robust plan :-(

But I think the really hard problem is the impact the change might have on existing systems. As I said, the problem is we have very little information to inform the decision. We could penalize this "risky" plan somehow (we don't really have any concept of "risk" separately from the cost), but inevitably that will make the query much worse for some of the systems where that plan is correct.

I understand the frustration, though.

(FWIW I love sqlite, it's an amazing piece of software, but claiming that it has a better optimizer based on a single query may be a bit ... premature.)


So, why are PG devs resistant to introducing hints, which would allow users to optimize based on their knowledge of the nature of their data and get predictable results?


I don't think there's uniform agreement to not have hints, and even among the devs opposing the idea of hints there is not a single universal reason - it's more like every single dev has their own reason(s) to not like them. For example:

1) The dev may be working on something else entirely, not caring about hints at all. I'd bet most devs are actually in this group.

2) PG makes the right decisions for the use cases the dev needs/supports, so there's not much motivation to make this more complex, or much feeling that hints are needed. Everyone scratches their own itch first.

3) There's a feeling we shouldn't be fixing planning issues by hints but by fixing the optimizer. I agree it's rather idealistic and perhaps not very practical (how does it help that in 2 years there might be a fix, when you have the problem on PROD now?) or supported by history (there's a bunch of cases that we've known to be a problem for years, with not much progress made).

4) Often bad past experience with applications overusing hints to force plans that the "smart" developer thought were great, but then the data grew to 100x the size over the years, and now the plans are insane. But the hints are part of the application code, that can't be changed, and it's obviously the fault of the DBA to "make it work". If I had $1 for every such application I had to deal with ...

5) These issues are fairly rare (depending on the application/schema/...), and there are other ways to force the database to use a different plan. You can use CTEs, set different cost parameters, disable some plan nodes, ... (see the sketch at the end of this comment). Not as explicit as hints, but good enough that no one spends enough time on hints.

6) Belief this can/should be solved out of core. The fact that pg_hint_plan exists suggests maybe it's not a bad idea.

7) Concerns about complexity - the optimizer is quite a complicated piece of code already, and hints would make it even more complex (both the code and the testing). If the community is not ready to accept this extra complexity and maintain it forever, it won't be added.

8) There are proprietary/commercial forks that actually do support hints - for example EDB Postgres Advanced Server supports this. (disclosure: I work for EDB, although not on EPAS.)

9) I don't think there ever was a formal patch adding this capability. And without a patch it's all just a theoretical discussion.

10) ... probably many more reasons ...

I'm not saying any of this is a good reason to (not) support hints, it's merely my observation of what devs say when asked about hints.
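
To make (5) a bit more concrete, these are the kinds of "good enough" knobs I mean - standard settings plus a materialized CTE as an optimizer fence, reusing the foo example from the other sub-thread:

  -- per-session planner knobs
  set random_page_cost = 1.1;    -- nudge the costing toward index scans
  set enable_seqscan = off;      -- disable a whole class of plan nodes

  -- or fence off part of the query as an optimization barrier (PG 12+)
  with skewed as materialized (
    select * from foo where organization_id = 10
  )
  select * from skewed order by created_at limit 10;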


imho, the lack of hints makes PG just dangerous to use in prod, since the DB can get stuck on a poorly optimized query at any moment and shut down the whole site/system. It would be priority #1 if I were the product manager of this project, and all your bullet points would be naturally resolved.


I don't know which "architectural peculiarities" you have in mind, but the issues we have are largely independent of the architecture. So I'm not sure what exactly you think we should be fixing architecture-wise ...

The way I see it the issues we have in query planning/optimization are largely independent of the overall architecture. It's simply a consequence of relying on statistics too much. But there's always going to be inaccuracies, because the whole point of stats is that it's a compact/lossy approximation of the data. No matter how much more statistics we might collect there will always be some details missing, until the statistics are way too large to be practical.

There are discussions about considering how "risky" a given plan is - some plan shapes may be more problematic - but I don't think anyone has submitted any patch so far. But even that would not be a perfect solution - no approach relying solely on a priori information can be.

IMHO the only way forward is to work on making the plans more robust to this kind of mistake. But that's very hard too.


It's really hard to say why this is happening without EXPLAIN ANALYZE (and even then it may not be obvious), but it very much seems like the problem with correlated columns we often have with LIMIT queries.

The optimizer simply assumes the matching rows are "uniformly" distributed, because LIMIT is costed as a linear approximation based on the startup/total cost of the input node. For example, consider this made-up query plan:

  Limit  (cost=0.0..100.0 rows=10)
    ->  Index Scan  (cost=0.0..10000000.0 rows=1000000)
          Filter: (dept_id = 10)
The table may have many more rows - say, 100x more. But the optimizer assumes the rows with (dept_id=10) are uniformly distributed in the index, so it expects to scan just the first 1/100000 of the index to get the 10 rows the limit needs. So it estimates the cost as 10*(0+10000000)/1000000 = 100, which is where the limit cost above comes from.

But chances are the dept_id=10 rows happen to be at the very end of the index, so the planner actually needs to scan almost the whole index, making the cost wildly incorrect.

You can verify this by looking at how far into the "order by created_id" results the dept_id rows are. If there are many other rows before the 10 rows you need, it's likely this.
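
Roughly something like this (the table name is made up, column names taken from the example plan above):

  -- how many rows sort before the first dept_id = 10 row in created_id order?
  -- a large number means the index scan has to read and filter out all of them
  -- before it can return the first matching row
  select count(*)
  from some_table
  where created_id < (select min(created_id)
                      from some_table
                      where dept_id = 10);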

Sadly, the optimizer is not smart enough to realize there's this risk. I agree it's an annoying robustness issue, but I don't have a good idea how to mitigate it ... I wonder what the other databases do.


One thing that I've run across: when doing a select on a large table with a serial or indexed timestamp column, ordered by that column, the plans are horrible when the criteria you're filtering on are unexpectedly rare, even if there's an index on the filter column. E.g. if col foo has low cardinality, with a bunch of common entries but a few rare ones, the planner will prefer using the ordering index over the column index even for the very low-frequency values.

Partial indexes can be really useful there, as well as order by (id+0). But it's a total hack.
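
For anyone who hasn't seen the (id+0) trick: ordering by an expression means the index on id can no longer provide the ordering, so the planner filters first and top-N sorts the few matches instead. A hypothetical sketch (big_table is made up, col foo as in the comment above):

  -- instead of:  select * from big_table where foo = 'rare value' order by id limit 10;
  select *
  from big_table
  where foo = 'rare value'   -- the unexpectedly rare criterion
  order by id + 0            -- expression, so the ordering index is not used
  limit 10;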


> It's really hard to say why this is happening without EXPLAIN ANALYZE

I can run EXPLAIN and paste it here, but it's just one of the problems with the planner (in a very basic query), I have many other problems with it.

On the other hand, I've never had any other serious problems with Postgres. It's a very good database.


> I can run EXPLAIN and paste it here ...

Might as well, could turn up something useful. :)


I'd suggest reporting it to the pgsql-performance mailing list [1] and continuing the discussion there - I'd expect more community members to join in.

[1] https://www.postgresql.org/list/pgsql-performance/



Looks very nice. I have three random questions:

1) How does this deal with backups? Presumably the deltalake tables can't be backed up by Postgres itself, so I guess there's some special way to do backups?

2) Similarly for replication (physical or logical). Presumably that's not supported, right? I guess logical replication is more useful for OLTP databases from which the data flow to datalakes, so that's fine. But what's the HA story without physical replication?

3) Presumably all the benefits are from compression at the storage level? Or are there some tweaks to the executor to do columnar stuff? I looked at the hooks in pg_analytics, but I see only stuff to handle DML. But I don't speak Rust, so maybe I missed something.


1) and 2) Backups and replication are on the roadmap. It's the next major feature we're working on.

3) We store data in Parquet files, which is a heavily compressed file format for columnar data. This ensures good compression and compatibility with Arrow for in-memory columnar processing. We also hook at the executor level and route queries on deltalake tables to the DataFusion query engine, which processes the data in a vectorized fashion for much faster execution.


Thanks. I wonder how you plan to replicate stuff, considering how heavily Postgres replication relies on WAL (I haven't found the answer in the code, but I'm not very familiar with Rust).

How large a part of the plan do you route to the deltalake tables? Just scans, or some more complex parts too? Can you point me to the part of the code doing that? I'm intrigued.


Hey! Sorry I missed this.

1) You can find the code here: https://github.com/paradedb/paradedb/blob/996f018e3258d3989f...

For deltalake tables, we send the full plan to DataFusion.

2) Re: WALs and replication -- We are currently adding support for WALs


The chances may be low, but either it's a draft or a final version.

There's clearly little pressure to rush this, considering it's not difficult to add a custom function generating UUIDv7 ...
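
For illustration, the kind of custom function I mean - one of the usual SQL-level sketches (assumes PG 13+ or pgcrypto for gen_random_uuid; this is not any official implementation): overwrite the first 48 bits of a random UUID with the Unix timestamp in milliseconds and set the version nibble to 7.

  create or replace function gen_uuid_v7() returns uuid as $$
  declare
    ts_ms bytea;
    bytes bytea;
  begin
    -- 48-bit big-endian unix timestamp in milliseconds
    ts_ms := substring(int8send((extract(epoch from clock_timestamp()) * 1000)::bigint) from 3);
    -- start from a random (v4) uuid, which already has the variant bits set
    bytes := uuid_send(gen_random_uuid());
    bytes := overlay(bytes placing ts_ms from 1 for 6);
    -- set the version nibble to 7, keep the rest of the byte random
    bytes := set_byte(bytes, 6, (get_byte(bytes, 6) & 15) | 112);
    return encode(bytes, 'hex')::uuid;
  end;
  $$ language plpgsql volatile;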

