API pagination design (solovyov.net)
353 points by fagnerbrack on Dec 27, 2020 | 139 comments



Another reason to use cursors is to avoid the problem of repeated or skipped elements caused by concurrent edits. If you use offsets and you’re on page 10, and someone deletes an item on page 1, the whole list shifts and you can accidentally skip an item on page 11. Likewise if someone adds an item on page 1 and you’re on page 10, one of the page 10 items will also show up on page 11.

Cursors elegantly sidestep these issues.


One way to get around this is to drop the notion of page number. Instead, query for items greater than the last element on the previous page.

But this has the drawback of only working as long as the sorting field doesn't have many duplicates. You have to check whether timestamps are high-resolution enough for your application, or whether you have 100+ entries with the same family name.

Usually users are smart (lazy) enough to realize they got more than 100 results and will narrow down the search instead. When was the last time you looked beyond even the first 5 results on Google? I'd rather adjust the search term than go to page 2.


> One way to get around this is to drop the notion of page number. Instead do greater than last element on last page.

That's the whole point of a cursor, isn't it?

And pagination just means querying the records to return a limited subset.


Depends how long the cursor is alive. A URL containing ?date=lt:2012-12-27 is stateless, user-readable, and more likely to outlive ?cursorId=xyaxjeasu&page=2.

Now it might not be as accurate as having a cursor, but it is a lot easier to implement, and it still beats simply ?page=2 without a cursor, which is the level you see in most APIs today.


> Depends how long the cursor is alive.

The cursor is never invalidated. You can always get a valid response from any cursor, whether the resource exists or was already deleted. That's the whole point of using a cursor.

Perhaps your confusion lies in assuming that a cursor's life cycle is tied to a specific object. It isn't. A cursor means "get me whatever resources would immediately follow this resource". It doesn't matter whether that resource still exists or not. When you run the query, you will get exactly the resources that would follow the cursor. That's by design.


If your query layer can manage sorting on two fields, this can be resolved by sorting on the field first, then on the primary key as a secondary field.

If I'm sorting by last name, I sort by last name, then id; I can save the cursor as last name + id and continue right where I left off.
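
For example (a sketch, assuming a hypothetical `people` table and Postgres-style row-value comparison), the cursor is simply the (last_name, id) pair of the last row on the current page:

    select id, last_name
    from people
    where (last_name, id) > (:last_name, :last_id)
    order by last_name, id
    limit 20;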


What if the last element on a page is deleted when you click "next"?


You use the sort key as the page selector. So if it's cities sorted by name, the query is "give me the first 20 cities with names after 'Constantinople'". If Constantinople gets the works, it doesn't matter, because there are still 20 cities whose names come after that lexicographically.
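
As a sketch (hypothetical `cities` table), that query is just:

    select name
    from cities
    where name > 'Constantinople'
    order by name
    limit 20;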


> If Constantinople gets the works

Relevant song: https://youtube.com/watch?v=Z0JhC3LO0-8


If it is updated, though (in this case, renamed to Istanbul), it can appear later once again, and newly inserted rows that sort before "Constantinople" will never appear.

I mean, these problems happen if you use offsets too, and my point is that they don't matter.

The most hardcore solution to those would be pre-computing pages but I doubt it makes sense to bring even more state to the search results.


Yes, pagination over a consistent view is a whole other problem! Either you store search results server side, in which case now you need to think about expiration and running out of space, or you are able to regenerate them on the fly somehow. A database which supports as-of queries might make that quite easy, but not many databases do.


> but not many databases do.

Really? The underlying storage engine of most databases stores all data necessary for an "as of" query. Simply ignore all data pages newer than the timestamp cutoff, and prevent the garbage collection of any page that could be part of such a query.

In fact that's the way [e.g. Postgres] can do a long-running query on a table that someone else is modifying. You are in effect looking at a snapshot of the table at the moment you started the query, even if it takes an hour to produce all the results.


> What if the last element on a page is deleted when you click "next"?

That makes absolutely no difference. You're still querying the database for elements that come after a value from a data type for which there is an order. You don't need that element to exist to run a comparison. For example, consider a timestamp-based query: you don't need an element with that specific date to exist to search for any element whose timestamp was taken after an arbitrary moment.


A B C D

2 items per page

first page is A&B. Next query is "WHERE ID>'B'".

What's the problem?


Aa and Ab are inserted after loading the first page.


That's an orthogonal problem: you need to define what correctness means for your application.

Because LIMIT...OFFSET will also arguably give you the wrong result. E.g. Aa is inserted. The user will see "B" both as the last item on the first page, and the first item on the second page.

Or with your example:

Aa Ab B C

Now "A" is inserted between page loads. The user will never see "A". What's right for your application? Maybe a notification saying "previous pages have gotten new items". Maybe not. It all depends.

Reddit can be annoying if you go page after page. As stories are bumped down you see them again. That, in my opinion, is a bug. But solving it requires something completely different from merely defining page boundaries.


Sure, and then B comes by again. Which just means there are different use cases with different definitions of correct, and you need to specify yours.


Though you can't avoid seeing an item just before it is deleted, or skipping an item just before it is created. This isn't a huge problem, but it does mean that the list you end up with may not be a coherent view of the database. But if you require guarantees like that, then you might be better off just using SQL directly.

Or just include everything in a single response; HTTP was designed to handle arbitrary-length payloads if I recall correctly. You might want to use a more streamable format than JSON though.


If you care about that kind of guarantee you might want to keep a connection open and stream new updates.


> If you care about that kind of guarantee you might want to keep a connection open and stream new updates.

You don't need that. You just need to get the collection resource to be cacheable and subsequently track your resources' state with conditional requests.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditiona...


> use a more streamable format than JSON

Did you mean we should stream data over WebSocket or use HTTP/2 or do we need to do something different altogether?


AFAIK chunked transfer works specifically with HTTP 1.1; I think Kubernetes, for example, uses this with JSON in a few prominent places for things like WATCH. But that setup is hardly ideal for the use case.


Check out MessagePack.

https://msgpack.org/


I don't think you need to do anything difficult; you can just treat it as one big file. Pretty sure online radio streams have been using this technique for ages, though I'm not too sure about the particulars.

This is annoying to do with JSON though, because you need to remember to close all your brackets. Something like CSV is a lot easier because you can just write the header once and then stream the data.

Edit: Looks like Python supports this using chunked transfer by simply providing the HTTP request data as an iterator.


I have a question. Let's say we have

page 1 elements: A B C D E

page 2 elements: F G H I J

so 5 elements in each page.

Suppose I'm on page 2. If I insert a new element Q and it gets pushed to the front, then page 1 will have Q A B C D. Now if I go back to page 1, I'll get A B C D E, and also a token/pointer to go back one more time only to retrieve Q.

So while cursors solve the issues you mentioned, there is still this case in which pagination gets broken. I'm interested in how we can tackle this.


A cursor is either valid in a transaction, or will cache its results - either way it will give a consistent view of data:

https://www.postgresql.org/docs/13/sql-declare.html


These cursors are not database cursors. I too was confused by the bad terminology...


Ah, I assumed from the topic, term and discussion that this would be implemented via SQL cursors. Apparently Elasticsearch has a similar concept - but like SQL cursors it cannot be stateless (because you cannot ask for the Nth to Mth result of searching the collection as of state Z, if you don't want to capture/supply state Z).

This means any stateless pagination api is fundamentally broken. Please don't make promises you cannot keep.

For elastic: https://www.elastic.co/guide/en/elasticsearch/reference/curr...


Another approach to solving this issue is with a temporal database (standardized in SQL:2011); supporting a “valid time” in your query gives your user a consistent view of the database tables even if others are modifying it concurrently. It’s also handy if you want to keep audit records of the database (e.g. so users can see who/when/what changes were made).

https://en.wikipedia.org/wiki/SQL:2011


The Use The Index Luke material about this is essential background reading, but in terms of actually implementing:

My favorite SQL implementation of cursor-based keyset pagination can be found on this Hasura issue: https://github.com/hasura/graphql-engine/issues/141, specifically this comment: https://github.com/hasura/graphql-engine/issues/141#issuecom.... There's something so elegant about this SQL. I enjoy it a lot.

Also really enjoy this answer about how to implement keyset pagination on multiple columns:

https://stackoverflow.com/questions/38017054/mysql-cursor-ba...


That multi column answer doesn't seem to address pagination of queries where some of the columns are in descending order and some aren't.

I implemented a keyset paging library for SQLAlchemy which supports this, and found the nicest way to do it is to swap the comparisons for the descending columns. For instance, if this is an all-ascending paginated query:

    select ... where row(c1, c2, c3) > row(11, 12, 13) order by c1, c2, c3
Then with c2 descending it would look like this:

    select ... where row(c1, 12, c3) > row(11, c2, 13) order by c1, c2 desc, c3


FYI, that MySQL multiple-column answer is incorrect; MySQL does indeed support row/tuple comparisons [1]:

> For row comparisons, (a, b) > (x, y) is equivalent to:

> (a > x) OR ((a = x) AND (b > y))

[1] https://dev.mysql.com/doc/refman/8.0/en/comparison-operators...
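
So the keyset page query can be written either with a row comparison or in the expanded form for engines that lack it. A sketch against a hypothetical `product` table, sorted by a non-unique column with id as the tie breaker:

    -- row-value form (e.g. MySQL, Postgres)
    select *
    from product
    where (created_at, id) > (:last_created_at, :last_id)
    order by created_at, id
    limit 100;

    -- equivalent expanded form
    select *
    from product
    where created_at > :last_created_at
       or (created_at = :last_created_at and id > :last_id)
    order by created_at, id
    limit 100;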


Just in case you might be able to help me out... is there any SQL flavor that supports selecting all columns except certain ones? For example, something like:

SELECT * EXCEPT FOO_ID FROM FOO;

I have always wanted this but have never come across it...


The correct answer is to just apply a view and enumerate the wanted column names once.

Otherwise, dynamic column names (that's your search engine fodder) require introspection…

    -- "Chinook" sample database
    select column_name
    from information_schema.columns
    where table_name = 'Track'
    except
    select 'UnitPrice'
… coupled with the moral equivalent of PL/pgSQL `execute` (i.e. run-time eval) statement.


Thanks for answering! Currently I do use views but it's just something I had wondered about.

I mean, the information from the information_schema must be updated anyway when one deletes a column or table, so I thought maybe a function like that which looks it up could exist.

I will try with PL/pgSQL, have long wanted to familiarise myself with it anyway.


Postgres has a clever trick [1] you can use, though I wouldn't say it's something I would ever use.

[1] https://blog.jooq.org/2018/05/14/selecting-all-columns-excep...


I've also wanted this on Postgres for years.

If there's a subset of fields you frequently want, you may just be able to define a view and use that. (At one point, I defined a text macro in my terminal to list the fields I usually wanted on our "orders" table.)


Never heard of such a thing but maybe a good question for SO.


Maybe

WHERE NOT EXISTS (SELECT ...)

can help?


I think the question was about excluding columns, not rows.

I don't know any way to do that personally.


That is good to know! I’ve only used this approach with PGSQL


I think cursors should be used more with realtime data, while more stable data like products in an e-commerce shop should use paginated queries.

There are many UX benefits to paginated queries. It gives the customer a clear overview of how many products there are, and the customer can also navigate faster to get a better understanding of what general prices are. If the table/list doesn't have good filters, the customer can easily carry out a binary search to find what they are looking for in an ordered list of items.

Same with highscores, forum threads and other similar things. Say you are browsing highscores and you want to see what scores are around the 10000th position, etc.

As a user I find having only prev/next buttons a bit claustrophobic in this case. I think UX should trump whatever performance gains there are from it.


A while ago, github replaced the pagination on their commits page to be based on a git commit ID instead of a page number. After that change, it made it a lot more difficult to find the oldest commit in a large repo - frustrating.

That said, I think average-ability internet users (i.e. people visiting an e-commerce shop) will virtually never attempt to navigate by changing values in a query string param. High scores on a gaming forum, maybe more likely. Either way, if users need the ability to jump through pages, I would hope that would be exposed in a UI rather than expecting people to edit the URL directly.


The move from page-based pagination to cursor based pagination on the commits page was so that the underlying queries could be moved to GraphQL, which use cursors (specifically the GraphQL Relay specification). That was one of the first pages that was moved over, and these days I think it would be relatively trivial to add a “last” page, since the Relay connection has the concept of totalCount and we have a few tricks around first/last cursors. Theoretically the ordering could also be flipped to view the first commit on the first page.


Interesting, but right now it seems they return HTML and use some sort of Turbolinks-like logic to make it seem like a single-page app? Or is GraphQL not yet implemented there / used in some other layer of the backend, for example?


Oh sorry, I should have clarified it’s GraphQL “under the hood” (backend). Most of GitHub is a Rails app, and in the case of the commits page, the controller makes a GraphQL query, and then takes the results and injects it into the view, all server side. The original idea was that if GitHub was using its own GraphQL API to build its pages, our users would have the exact same access as GitHub engineers.


Yeah, it's best if it's possible to do it both ways: with the URL, and also by clicking links, and maybe even a select/input to go to a certain page.

If you look at e-commerce systems, ranging from WooCommerce to Shopify, Amazon and Walmart, I think they all use pagination. Some Amazon pages do have infinite scroll, however.

I have also been frustrated by not having the ability to quickly go through commits... or, I think, releases for that matter?

For instance, browsing through tags/release versions, which should be a very valid use-case in my opinion, also uses cursor-based navigation.


> A while ago, github replaced the pagination on their commits page to be based on a git commit ID instead of a page number. After that change, it made it a lot more difficult to find the oldest commit in a large repo

git log has a reverse option to reverse the order of commits displayed. You can also provide a commit reference like HEAD~10 to get the 10th ancestor from the head commit.

Does Github have a way to specify URL parameters to achieve something similar?


Not on that page, no, but you could easily fetch this information via the GraphQL API.


From an API evolution standpoint, cursors are a no-brainer over explicit offsets:

1. Nothing stops you from having your cursors be implemented via a SQL offset under the hood.

2. You can have your cursors be base64 data, and embed a cursor version in the ID (a sketch follows below). You can use this to change how your pagination works without breaking clients during the transition period. (You'll need to document+enforce a maximum validity for pagination tokens to do this successfully.)

3. You can encrypt your pagination tokens, so clients don't get used to making assumptions about how your pagination works under the hood. (This isn't security by obscurity, it's just defending against Hyrum's Law [1].)

It doesn't cost you much to implement cursors early on, and unlike offsets they will grow with you across the evolution of your API.

[1]: https://www.hyrumslaw.com/
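
A minimal sketch of point 2 (Postgres syntax assumed; the "v1" version prefix and the embedded offset are made up for illustration):

    -- issue an opaque page token that wraps a cursor version and an offset
    select encode(convert_to('v1:' || (10100)::text, 'UTF8'), 'base64') as next_page_token;

    -- on the next request, unwrap the token and branch on the version prefix
    select convert_from(decode(:page_token, 'base64'), 'UTF8');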


We've even used cursors to migrate implementations; in our case, we were migrating from solr5 to solr7. When you started a session, you were allocated to solr5 or solr7 according to the business rules in effect; the solr7 cursors were prefixed with solr7-, so that in subsequent requests we could send you to the same major version. It could work the same way with load balancing too I guess.


Sometimes you want a cursor so you can resume where you left off without worrying about new entries that may have come in which would otherwise mess up your pagination.

Sometimes you want position-based queries, because you explicitly do want everything to be positional.

Sometimes you want to combine the two techniques, e.g. if you jump to the middle of a large, ever-changing list, and then want to retrieve the next batch of results after wherever you happen to be.

I like the design that JMAP ended up with <https://tools.ietf.org/html/rfc8620#page-45>: you can specify a position integer, or an anchor ID and optionally an anchorOffset integer. An anchor is a restricted case of a cursor, being the ID of an entity in the result set rather than an opaque type that could embed other information (such as coroutine addresses, thinking back to the old days of HN), but has the notable advantage of being client-controllable.

(Because JMAP is very much an object synchronisation protocol and not just an API for objects that don’t record their history in any way, like your common-or-garden REST API, this is also paired with change tracking so that you can be notified when the set of entities matching the query changes; this is how Fastmail’s webmail (probably the most-used JMAP client for now) updates its message lists for mailboxes (roughly `Email/query { filter: { inMailbox: inbox } }`) and search results (roughly `Email/query { filter: { text: "foo" } }`). Such a principled approach to changing state is extremely valuable for supporting live updating of a UI, and pretty much essential for offline support.)


This approach still requires a full table scan, right?


No: it becomes just an added `WHERE id >= anchor` constraint, which with the usual indexes is cheap.


Keyset pagination is described in a few places online in detail. The one that comes to mind is in Use the Index, Luke [0].

[0] https://use-the-index-luke.com/no-offset


One cannot go to an arbitrary page with this method - doesn't it destroy usability quite a bit?


Token-based pagination is how infinite-scrolling pages are implemented. It definitely has usability problems, but beyond a certain scale offset-based pagination is not viable from a performance point of view, especially if users are given the option to choose the sorting order.


Given the confusion over the reuse of the term “cursor”, I prefer the terms we (Google) have settled on for Pagination [1]: page token and page size. It’s often awkward to argue about “really, do I need to support pagination from the outset”, but the examples given in that AIP are from real-world battle scars. Even if you just set the page size to a big number, at least you prepare your callers for “one day, you’ll need to handle this”.

[1] https://google.aip.dev/158


Not contradicting, just adding that GCP (Datastore) has had cursors for at least 8 years. When I came across that using App Engine many years ago I really liked the cursor thing, and it stuck with me. I found it a little funny this article sounded a bit like "Here's this new thing"


Oh, but that’s an “API” for a Database. So that’s a fine exception to “you should offer cursors”, since that’s what you actually intend to present. A more generic API that happens to be backed by a database with a list of things in it, should not reflect the underlying database decisions if it can help it.


It seems to me you might have meant to reply to another comment.

This is not an "exception to 'you should offer cursors'" this is offering cursors.


It's definitely not new, but worth repeating.

But yes, I'm surprised it would show up on hackerNEWS.


Having used keyset pagination for a while, I can say it saves you from worrying about performance in most cases.

While the post claims that it is not possible to go back in the result set, that isn't true, though it isn't quite that simple either. One way is to encode both the current and the next pointer in the cursor argument, which can then be used to go back; I also think you can commonly reverse the ordering and play with the query conditions to go back.

On the other hand, the biggest drawbacks I have experienced are:

- Dealing with a cursor where the item involved is actually deleted before you do the next query; there are many strategies, but it is certainly something you need to plan for.

- Building APIs that allow sorting by different arguments in the result, which can easily make the query conditions tricky.

In any case, it's always worth exploring this mechanism for people not aware of it.


This is trumpeted around and actually put into production every once in a while.

The reason opaque pagination is an antipattern is that you can’t optimistically fetch resources.

So your customer, the person that is paying you for your product, needs to wait for some number of synchronous reads.

With non-opaque offsets these can be done in parallel. If the typical request requires 4 pages, these can be done 4 at a time, and if it is less than 4 pages the extras can be discarded.

This is a clever hack that ends up being user hostile in actual practice. Remember APIs are designed for the benefit of the consumer vs the benefit of the maintainers.


If the API takes a cursor and a number of items to fetch, the idea of a "page" or how large it is exists entirely on the client. You can fetch 40 results in one query and say it corresponds to the next four "pages", if you're configured to show ten items per page

It seems worth noting that a high number of concurrent queries to the same database shard as part of the same overall page load can be very wasteful of CPU as database load increases, due to the cost of context switching. https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-... dives into that.


> So your customer, the person that paying you for your product, needs to wait for some number of synchronous reads.

The customer is not always right.

It's also not clear what the practical issue with sequential requests is. Multiple parallel requests may get caught in a rate limiter, and impose much more work on the backend than a cursor (multiple unnecessary sorts/discards). It's not a given that spamming a service gets you all the data any faster than using a cursor.

> With non-opaque offsets these can be done in parallel. If the typical request requires 4 pages, these can be done 4 at a time and of it is less than 4 pages those can be discarded.

If the typical request requires 4 pages, then your page size is suboptimal.


I think supporting an explicit (large) page size addresses your concern. It’s not that you want to make four separate requests, it’s that you want the backend to give you more than a tiny amount of data. If you’re going to induce the load of 400 results on the backend, backend implementers may as well give you all the results in one go (assuming the output fits in whatever request/response limits you have).

I definitely agree that just having opaque page tokens without the ability to say “I want up to 100” leads to needless pain for clients and overall system inefficiency.


I think cursors should be used more with realtime data, while more stable data like products in an e-commerce shop should use paginated queries. There are many UX benefits to paginated queries. It gives the customer a clear overview of how many products there are, and the customer can also navigate faster to get a better understanding of what general prices are. If the table/list doesn't have good filters, the customer can find what they are looking for faster too. Same with highscores, forum threads and other similar things. Say you are browsing highscores and you want to see what scores are around the 10000th position, etc.

As a user I find having only prev/next buttons a bit claustrophobic in this case. I think UX should trump whatever performance gains there are from it.


Non-backend dev here. Could someone elaborate on what exactly is so opaque about pagination? And what makes offsets less opaque? Does this term have a meaning I don't know about?

And why does the DB need to wait for some number of synchronous reads?


"offset" can be passed transparently to db to retrieve a range of records while "cursor" is customarily implemented in the API layer. I think that's what "opaque" meant in GP's context.

A starting read is needed to obtain the first cursor, hence the synchronous read.


“opaque” here means we are hiding the real ID of the “cursor” item.


gotcha, thank you!


If you ask for the next page of results with parameters like offset=20&limit=10, then you, as a client, can try to reason about and manipulate those parameters: ask for multiple pages in parallel, ask for offset=18, etc., and make calculations on those parameters. If you only provide a token, like "next_page=abcdef1234" with some encoded structure, you're limiting your client in what it can actually do, but you simultaneously simplify the backend architecture and make it more forward compatible with future changes in the backend (after all, that next_page token can actually be just an offset and a limit, encoded).


Ah, I see, so some kind of tradeoff between power to the API consumer vs power to the API maintainer. Thank you!

I don't see anyone arguing for "next_page=abcdef1234". Realistically it's more like "cursor=abcdef1234&limit=10". Slightly less opaque. Still, your point about asking for multiple pages in parallel stands.

I think I agree with mostly everyone here that this is a fine tradeoff to make, and so would favor cursors over offsets. (Unclear how cursors relate to "keysets".)


> One deficiency you can see is that it’s impossible to generate a “previous page” link

It is perfectly possible to have a "previous page" link. Your pagination needs to support "before" cursor semantic, allowing a consumer to retrieve the N items before the row identified by the cursor.


Yep, just invert your conditions and sorting and you’ve got a previous page.


This.

It is also possible to create a "next page". You only need to take your current page's last id and use it in your query like "where id > current_last_element_id".

"first page" and "last page" are possible as well, simply do "where id > 0" for first and "where id <= (subselect max id)".

You may need to invert the order here and there, but it is perfectly possible to use cursor-based pagination in a GUI as long as you don't need to provide jumping to a specific page number.
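
As a sketch (hypothetical `product` table, pages ordered by id), "previous page" is the same query with the comparison and the sort direction flipped; the application then reverses the rows again before returning them:

    -- previous page, relative to the first id on the current page
    select *
    from product
    where id < :first_id_on_page
    order by id desc
    limit 100;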


I tried to make cursor-based pagination work in a GraphQL API over MySQL and failed. The problem was that it had to work for a very wide range of SQL statements, with an arbitrary order clause, where clause, and at least one join.

Offset pagination works fine in this scenario. But something like `where id > 42 limit 100` fails with an arbitrary sorting order on non-unique columns. All I could do would be to generate the unpaginated result, then find the cursor within it, and take the limit after that. Which is too inefficient to be practical.

If I missed a decent solution, please tell.


I implemented this for Datasette. The key is to use the primary key as a "tie breaker" - so you sort by the specified column first and then by the primary key.

I managed to get this working with compound primary keys too: click "next" at the bottom of this page for a demo: https://latest.datasette.io/fixtures/compound_three_primary_...

The code is pretty complicated - mostly here: https://github.com/simonw/datasette/blob/0.53/datasette/view...


You sort by your fields and then by id. You also include those field values along with id into your “cursor”. This is complicated for general case but possible to implement.


All the columns you order by should be part of your cursor value, and the id can be used to tie-break non-unique columns.


Cursor-based pagination and offset-based pagination are two different approaches for different purposes (hence not interchangeable).

Cursor-based is good when you know exactly where the next page goes (e.g. chat messages when you want to load previous messages). Offset-based is good when you want to browse data randomly (e.g. you want to go to page 10 when you are on page 1).


Yeah, it's usual UX for e-commerce sites for customers to have the possibility to choose which page they want to move to, same with forum threads etc. It gives a much better understanding of how many products there are in total and lets you just find things quicker if filters aren't perfect.

It gives you a great overview of how things are generally priced, etc., if you order by price.


I've worked with APIs (mainly based on ElasticSearch backends) that implement something like this pattern. There, the "cursor" is called a "ScrollId."

Here's what that looks like in ES: https://www.elastic.co/guide/en/elasticsearch/client/java-re...

The APIs I generally work with will abstract away most of the scroll context configuration stuff, and just return a page of results, plus the ScrollId for the next page.


Elasticsearch now recommends using “search_after” instead of scrolling for deep pagination, although mechanically they’re pretty similar to use.


Elasticsearch’s search_after is a pleasure to use. So easy especially with nullable fields, where the equivalent SQL can get really, really messy with just a few columns.


Yeah - search_after is the stateless version


Thanks for the heads-up!


Also, they have a concurrent version called "Sliced Scroll" which lets you do multi-threaded fetch in N streams, each with their own cursor.


We have explicit cursors and classic pagination in our public facing API. Classic paging is such a thorn in our side, but almost everyone uses it for their v1 unless they ask us for guidance first. We have guides about why pagination is bad, and discourage it in our API rate limits, but it still persists.

Looking back, I would exclude classic pagination from a public API and go full cursor. Possibly include an optional total count that can be requested (counts are expensive in PG though).


I think the author misses a crucial point while using OFFSET. At least the Postgres docs have this to say (https://www.postgresql.org/docs/13/queries-limit.html):

  When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows. You might be asking for the tenth through twentieth rows, but tenth through twentieth in what ordering? The ordering is unknown, unless you specified ORDER BY.
And it looks like the pagination queries provided by the author do not take sorting into account, which means that items on page 100 will have ids greater than 10000, but in undefined order.

  explain analyze select id from product where id > 10000 limit 100
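
An ordered version of that keyset query would look something like this (still against the author's product table):

    explain analyze
    select id
    from product
    where id > 10000
    order by id
    limit 100;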


I have written about this as well [1] and even built a GitHub PoC comparing both approaches regarding speed [2], but one of the main flaws is the lack of a deeper consideration for sorting and filtering examples and impact demonstration. Will look into adding more detail about that in an addendum.

[1] https://medium.com/swlh/why-you-shouldnt-use-offset-and-limi... [2] https://github.com/IvoPereira/Efficient-Pagination-SQL-PoC


Several recommendations on this thread have led me to buy the SQL Performance Explained book.


I tried running their exact example queries, since I conveniently have a >100k row PostgreSQL table of the same name available. The first (offset) was indeed on the order of 5x slower, but I suspect what the author saw with the 40ms to 0.1ms time difference was a first run that wasn't pre-warmed.

For me, with offset, the FIRST run was 25ms, but subsequent runs are around 2-3ms. For `id > 10000`, it was under 0.05ms.

It's also very odd that they don't do an `order by`. This seems important. Default ordering is semi-stable. It'll tend not to change, but it can at any point. Benchmarking without that may be misleading.


This is one of those obvious things that people have a blockheaded aversion towards. In every job I’ve consulted on, there has been a component where pagination is the bottleneck. And it’s rarely where pagination could be useful (like a static list of products). I typically see it when someone builds an aggregating (and aggravating) microservice that pulls all of the data from another service; that blocks the entire database.

Perhaps it’s the world of ‘Django developers’ I find myself in. But the fact this simple design pattern reaches the front page of Hackernews worries me.

I’ve personally had backend developers give confused looks and diatribes about simplicity when I’ve suggested this approach. Frontend developers frequently give the greatest resistance, which I think is because they most often act in concert with product managers who are welded to certain UX idioms that they chose without fully imagining the engineering consequences of.

One approach which I try to push people towards is fetching a big list (N≈5000) and paginating it however the UX designer wants. The response of the big list should be sparse, that is, without including additional fields that aren’t displayed by the front end (and fetching more data when the user navigates). You’ll frequently get puzzled looks suggesting this, but benchmarks will usually show that fetching 5000 rows and a small number of columns takes a few milliseconds, and can be serialised into 150kb. On balance it ends up being faster: you’ll get fewer network requests; fewer round trips to the database; fewer fetches; and less time spent serialising (which is a major bottleneck itself that Python developers ignore).


Sometimes it's a case of over-optimising / early optimization too. The BE doesn't want to create another endpoint with simplified fields; the FE is overly concerned with the size of JSON responses over the wire / being processed client-side (while the images for each item are not themselves optimized). Thanks for sharing the big-N approach, I have to try that.


I like the idea of a "cursor," or some kind of appointed "variable."

It has the potential to become complex, so I would assume it was an "also" capability, as opposed to an "instead of" capability.

I like the idea of a "range" syntax ("[X..<Y], [X...Y], [X<..Y], [X<.<Y], etc).

When I write APIs, server performance isn't really an issue for me. Most of the performance bottleneck, in my experience, is data transfer, so being able to optimize that, pays the greatest dividends.

Also, making the API easy to understand, and express semantically, is important.


Been doing it this way for a while actually, even when it's used as a limit and offset. I find it's a good way to hide the actual pagination implementation from the client and therefore easy to swap out later without the client changing. Here's an example of a `Pagination` object that I use server-side to generate and parse cursors (it's written in Kotlin):

https://gist.github.com/briandilley/a0715f9f85b632b7080f8921...


I implemented something like this in my early career days, about 15 years back. I implemented first, previous, next, and last, but no go-to-page-number. It works for multiple columns in the order by. The trick is to switch the order by to the reverse direction and limit the results to the page size; this way we get the results in reverse order, but when returning them we just send them back in reverse order again...

The assumption is that the primary key is implicitly included as a mandatory order-by column at the end.

This makes things a lot easier in other situations, like when multiple pages of records are found with the same value, etc.


What if your id column uses GUIDs?


How do you mean? How does that change the proposition?


If you order by id as a tie breaker, it's guaranteed that every newly inserted row has a higher id than the last id in the dataset. If the id is a GUID, it's possible to add a new row between the paginated API calls with a GUID lower than that of the last received row. Or does the cursor prevent the new row from being skipped? What if the pagination is ordered by id only?


Seems to me that you're assuming that everything should be paginated after being sorted by "ID". Why should it be? The user is not interested in your internal representation. An "ID" does not imply semantic meaning, or ordering, and you should not treat it as such.

If you want to order by creation time, then have a column with creation time. With resolution down to nanoseconds it'll be unique. And if not, well your query should be deterministic and "order by ctime,id".

But the problem is always possible. Doesn't matter what method you use. It applies to OFFSET … LIMIT, guid, name, or ID.

It's up to you to decide what you want the user experience to be if someone is on page 1 and presses "next page", but before they click, someone deleted the entire first page. And maybe there is no page 2 anymore.

These questions are application dependent, I'd say. And while I may have my opinions on the best way to handle it, it's an orthogonal question to what this article is talking about.

Edit: Oh, and there's GUID type 1, so "GUID" doesn't inherently mean "not sortable".


How long will the cursor be valid?


Indefinitely. It's not the kind of cursor that a database gives out, which is a resource. A cursor in this context is just a pointer to some position in a set of results, something like "all results where date created is greater than $X". For SQL implementations, check out:

https://github.com/hasura/graphql-engine/issues/141#issuecom...

or

https://stackoverflow.com/questions/38017054/mysql-cursor-ba...


It depends on the implementation. For example, the post mentions using Elasticsearch in production "which naturally supports this cursor stuff", but it has a max keep-alive context duration of 24 hours [1] by default. This limit can be extended at the cluster level, but it's unlikely they have it set to indefinite. Which means that their search context is likely only valid for 24 hours.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/curr...


Yep, that’s the case if you use the scroll parameter, but their search_after API is not stateful and that’s the sort of thing this article is about.


Can somebody elaborate on what the last paragraph means? If I have a limit of items per page and I know which page I'm on based on the URL, I can always generate a prev/next page link. I guess it is pointing out that limit/offset needs to be used in such a case. How did Reddit do it in terms of pagination? Such a large website wouldn’t survive on limit/offset.


I've been hitting paginated page limits. Why even bother to offer a paginated version if you can't grab an arbitrary page?


Depending on the data you are showing there are reasons.

One example is when the results are already ordered by relevance (this might be search matches, proximity, recency). In these cases most everybody will not need to go beyond the first page but you still get the benefit of faster loading and rendering due to reduced amount of data for the first load.


In addition to using cursors or whatever pagination method suits your backend in query params, it's also a good idea to include pagination URLs in the response itself if you can. Greatly simplifies what the client has to do and reduces coupling since now the client doesn't need to care what your query param structure is


I'm a fan of the "GraphQL Cursor Connections Specification", which could be applied even if not using GraphQL:

https://relay.dev/graphql/connections.htm


"If all you have is a hammer, everything looks like a nail."

Are GraphQL fans so incapable that you have to put your GraphQL spamming in every single post? Didn't you learn anything else in your life?


I mentally filter out "GraphQL spam" with the same rigor and aptitude I mentally filter out "bitcoin spam", but this unsubstantiated ad hominem does nothing to convince me that I'm smart or in-the-right for it.


No, and yes.


I think it is better to not use pagination at all.

Design your API or page so that you receive all the entries, or more than you could possibly need.

If the user wants them all, pagination is just in the way. If the user does not want them all, give them as many as they are likely to be able to handle.


"Woah, our database is slowing to a crawl, requests are deadlocked everywhere. Yeah, some asshole is hitting GET /messages and pulling 2TB in a single request again."

"Surely that's what he wanted."


Given that that is what the person is doing, it is likely what they are trying to do. Most of the time when I do things, I do it because I want to do it, not because I made an oopsie.


"let's allow users to DOS us in a single request if they want to" is not an acceptable design decision, I'm sure you are aware of that.


Most services don’t handle DDoS amounts of data; a few thousand rows of data is fine.

Don’t design like you are Twitter if you are not. It is a waste of man-hours. If I as a client need 10,000 rows of data, that is what I will ask for, either in one request or in 100 consecutive ones.


This is terrible advice; sending the client as little as reasonably possible (therefore, use pagination) is the most sane, performant, affordable, and secure default assumption. Any Rails/Laravel/FeathersJS-style framework has multiple types of pagination built in for free.


Performant for you, assuming the client gives up after the first page. Less performant for you and the world as a whole if the client actually needs the data it asks for. Because if it needs it, the client will just make the request 100 times until it has all the data. What is more performant, one query or a hundred?


Great article!

I wrote something similar, including a playground which serves as a benchmark to compare the various pagination options:

https://www.samu.space/api-pagination/


Cursors are stateful. This greatly complicates the design of the backend.

First, you have to maintain state in some kind of session on the app server.

Second, if you have more than one app server, you'll have to share the sessions across them using Redis or Hazelcast or something.

Third, cursors have to be closed, which means you have to know when the user is done with the result set. This is impossible to know. The user could go to page 2 and then go to lunch. So you have to leave it open for a while, and then have some kind of timeout. Have a lot of users doing the same thing? Memory fills, both on the app server and in the database server.

Fourth, your cursor is going to leave a transaction open, which interferes with updates and all kinds of other things.

Fifth, most users never go to page 2 anyway. They run another search.

Keeping cursors open in a web app is a spectacularly bad idea.

UPDATE -- I am in error. The article does not recommend a normal database cursor. Disregard my comment.


Not that kind of cursor. There is no state on the server / in the database and nothing to close.

> Generally, you have some ordering criteria, for example, product id. In this case, you’ll encode your product id with some reversible algorithm (let’s say hashids). And on receiving a request with the cursor you decode it and generate a query like `WHERE id > :cursor LIMIT 100`.


Oops -- this is embarrassing -- I did not read the article closely. You're right, the article does not recommend using a normal database cursor. Never mind.


This is not about database cursors.


You could enable a batch hint in the API. The server has the option to allocate a cursor if one is available. Best of both worlds.


A cursor or pointer works well if the pagination requests are in the same user session. However, cursors can't be used for deep linking in URLs, or when the data is dynamic.


Implying that anyone goes further than page 1.

Put a notification on the page "your search query returned too many results, please refine your criteria"


That's how Firebase does it. It's quite impossible to have proper pagination on realtime results; that's why I believe they used this approach: https://firebase.google.com/docs/firestore/query-data/query-...


I’ve spent more time reading this article, than this approach could save in practice if we accumulate all the saved milliseconds.


wow having this kind of mindset would probably have left us in the stone age.


Is an auto-increment sequential number good enough for a 'cursor', or is there something I'm missing here?


As long as it can be used to "seek" directly into where the last page ended.

The problem with offset and limit is that you have to actually find the previous N results in order to get the N+1th. But with a cursor you can just "continue".

(when I say "seek" I mean use indexes. I'm comparing to listing a tar file (requires reading the whole thing to find the file in the middle) vs zipfile (seek to where the file is, and read))


When you've got your JSON data in nested form, with arrays, how do you flatten it for storage in the database?

Do you create sub-tables to handle the arrays?


I'm pretty sure pagination APIs are the result of people writing services who haven't ever read Stevens.


Can you elaborate? I tried looking it up but nothing meaningful comes up.


The TCP book, as in Transmission Control Protocol. It solves the problem way better than any of the janky hacks called pagination.



