
With your updated plan, you have

> Hash Cond: (tasks.id = t1.id)

> -> Seq Scan on tasks (cost=0.00..254.60 rows=48 width=14) (actual time=0.566..10.550 rows=10000 loops=1)

> Filter: (status = 'QUEUED'::"TaskStatus")

So it's still doing a seq scan on tasks when I'd expect it to join using the PK. It must be getting tripped up by the redundant filter on queued status. Try removing that.

I ninja-edited my previous comment, but if you move the LIMIT to the 2nd CTE, that should fix your issue with workers not getting work. If you do that and add the other index, then in principle it should be able to do everything by maintaining a priority queue of the heads of each partition (each of which is now pre-sorted by the index). I don't know whether Postgres actually does that, but if it does, that portion of the query can be streamed, so you don't need to limit early to avoid sorting 10k elements when you only need 10. Then, if you remove the redundant QUEUED check, it should be doing everything else through indexes.
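To make that concrete, here's roughly the shape I have in mind. I don't know the exact schema, so the column names (`group_key`, `priority`, `created_at`) and index name are invented; adjust to taste:

```sql
-- "The other index": pre-sorts each tenant's queued work,
-- restricted to queued rows so it stays small.
CREATE INDEX tasks_queued_idx
    ON tasks (group_key, priority DESC, created_at)
    WHERE status = 'QUEUED';

WITH ranked AS (
    SELECT id,
           row_number() OVER (
               PARTITION BY group_key
               ORDER BY priority DESC, created_at
           ) AS rn
    FROM tasks
    WHERE status = 'QUEUED'
),
eligible AS (
    SELECT id
    FROM ranked
    WHERE rn <= 1        -- at most one task per tenant per round
    LIMIT 10             -- the LIMIT lives in the 2nd CTE now
)
UPDATE tasks
SET status = 'RUNNING'
FROM eligible
WHERE tasks.id = eligible.id   -- plain PK join; no redundant QUEUED re-check
RETURNING tasks.id;
```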

Basically, if I were doing this by hand, I'd expect the "good" solution to start from an index and stream each row (i.e. no full sort) with log n complexity. So I look at it from the perspective of "how do I get the database to do what I'd do by hand?"




I created a gist with these recommendations - certainly an improvement, but I don't think it gets around the `WindowAgg` running across all 10k rows: https://gist.github.com/abelanger5/5c1a75755072239716cb587a2.... Does this accurately implement your suggestions?

Happy to try out other suggestions, but I haven't found a way to get a window function or `JOIN LATERAL` to scale in near-constant time for this queueing strategy.


It looks like it now only pulls 100 rows out of the sort, so moving the LIMIT into the 2nd CTE didn't hurt. It isn't doing all 10000 rows now though, which is interesting. By any chance, do you have 9200 different tenants? If so, that makes sense. What I suggested works well when you have a small number of tenants with queued work: it scales n log n with the number of tenants that have queued work, but log n with the number of tasks a single tenant has queued. So if you're currently testing with many tenants queueing at once, you could see how it behaves with, say, 20 tenants where one has 9k items and the others have ~50 each. Whether that's acceptable then depends on what your distribution looks like in practice.

You could also probably do tricks where individual workers filter to specific tenant IDs in the first CTE (e.g. filter group_key mod num_workers = worker_id) to reduce that cardinality if you expect it to be large. Or you could, say, select 100 random group_keys as a first step and use those to filter the window, but that needs another partial index on just `group_key where status = queued`.
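Sketching both tricks, with the same invented column names as before (and assuming an integer group_key; hash it first if it's text):

```sql
-- Worker-sharded first CTE: worker 2 of a pool of 4 only sees its
-- slice of tenants, shrinking the number of partitions the WindowAgg touches.
SELECT id,
       row_number() OVER (PARTITION BY group_key
                          ORDER BY priority DESC, created_at) AS rn
FROM tasks
WHERE status = 'QUEUED'
  AND mod(group_key, 4) = 2;

-- For the random-sample variant, the extra partial index:
CREATE INDEX tasks_group_key_queued_idx
    ON tasks (group_key)
    WHERE status = 'QUEUED';
```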

Edit: I see it's still doing a seq scan on tasks. There is a PK on id, right? It knows there are only 100 rows coming from the rest of the query, so it seems odd to me that it would decide to scan the table. You could try putting a hash index on id if it's refusing to use the btree index, I guess. Or it might change its mind if you add 1M SUCCEEDED tasks or something.
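i.e. something like this (hash indexes only support equality lookups, which is all this join needs):

```sql
-- Worth a try if the planner keeps refusing the btree PK index for the join.
CREATE INDEX tasks_id_hash_idx ON tasks USING hash (id);
```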

Another thing to consider is that out of the box, pg's default planner config is tuned for roughly 20-year-old hardware. You need to tweak the I/O costs for SSDs and tell it how much RAM you have, if you haven't done that already. See e.g. https://pgtune.leopard.in.ua/ for better starting values.
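For example (illustrative values for an SSD box with ~16 GB RAM; get numbers for your actual hardware from pgtune):

```sql
-- The defaults assume spinning disks and tiny RAM; these tell the
-- planner that random I/O is cheap and the cache is big.
ALTER SYSTEM SET random_page_cost = 1.1;         -- default is 4
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET effective_cache_size = '12GB';
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
```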



