Wednesday, July 29, 2020

mysql pagination and its penalty

We usually use a MySQL SELECT query to get a list of database entities:
select `id`, `name` from client;

What happens if there are 1 million rows to query?

The general strategy is: don't query a table without a WHERE clause. Always add a filter.

select `id`, `name` from client where `created` between '2012-03-11 00:00:00' and '2012-05-11 23:59:59';

If `created` is indexed, the above query will run fast and return a small number of rows.
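
If `created` is not indexed yet, we can add a secondary index first (a minimal sketch; the index name idx_client_created is just an illustration):

-- add a secondary index on `created` so the BETWEEN filter can use it
ALTER TABLE client ADD INDEX idx_client_created (created);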

MySQL provides pagination through the LIMIT and OFFSET clauses.

The following query will return the first 500 of the 1 million rows in the client table.
select `id`, `name` from client limit 500;

In order to get the second 500 rows, we can add an offset:
select `id`, `name` from client limit 500 offset 500;

For the next 500, we increase the offset again:
select `id`, `name` from client limit 500 offset 1000;
...
The same query has a shortcut syntax:
select `id`, `name` from client limit 500 offset 100000;
is equivalent to
select `id`, `name` from client limit 100000, 500;

This approach has a penalty. If we run the query
explain select `id`, `name` from client limit 100000, 500;

the result shows that MySQL actually reads 100,500 rows by the primary key and then throws away the first 100,000. What a waste: memory is occupied and lots of rows are read. It is acceptable for a small offset, but bad for a large one. This kind of query should not happen in production; it risks draining the database's resources.
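
A common way to dodge the large offset entirely is keyset (a.k.a. seek) pagination: remember the last id of the previous page and filter on it, so the index seeks straight to the start of the next page (a sketch; :last_seen_id is a placeholder for the largest id of the previous page):

-- seek directly past the previous page instead of reading and discarding 100,000 rows
select `id`, `name` from client where `id` > :last_seen_id order by `id` limit 500;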

Alternatively, we can avoid large-offset queries by redesigning the UI. We allow the user to access only the first few pages, where the queries have small offsets anyway. If the user has to click into deeper pages, we'd better redesign the data model: redirect the user to a different page, where a different data model is used. Say, we can partition the table:
CREATE TABLE client (
    id INT NOT NULL,
    name VARCHAR(20) NOT NULL,
    created DATETIME NOT NULL )
PARTITION BY RANGE( YEAR(created) )(
    PARTITION from_2013_or_less VALUES LESS THAN (2014),
    PARTITION from_2014 VALUES LESS THAN (2015),
    PARTITION from_2015 VALUES LESS THAN (2016),
    PARTITION from_2016_and_up VALUES LESS THAN MAXVALUE
);

For the first few pages, we use
SELECT * FROM client PARTITION (from_2016_and_up) WHERE created >= '2016-01-01' LIMIT N, 500;

Then for deeper pages we just use a different query
SELECT * FROM client PARTITION (from_2015) WHERE created BETWEEN '2015-01-01 00:00:00' AND '2015-12-31 23:59:59' LIMIT N, 500;

If we make more partitions, we can switch among more queries in the same way.

This way, we don't have to read a huge number of rows; we only read from the relevant partition.
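
We can verify the pruning with EXPLAIN; since MySQL 5.7 the plan includes a partitions column listing the partitions actually touched, and here it should show only from_2015:

EXPLAIN SELECT * FROM client PARTITION (from_2015) WHERE created BETWEEN '2015-01-01 00:00:00' AND '2015-12-31 23:59:59' LIMIT N, 500;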

Another good question to ask is: why do we need to present this large amount of data to the user at all? Should we split the tables or add some filter, such as a date range, to limit the total dataset? Does the user really want to click through lots of pages to find the information, instead of refining the filtering criteria? If they do, let's not allow random page jumps; just provide prev and next (maybe 1-to-10 page jump) hyperlinks, which will reasonably encourage the user to think harder. If the number of rows to display has to be large, why not use a reactive HTTP client with reactive MySQL support? Let's use WebFlux to stream the DB entities to the client side, so that as the user scrolls down the page, new rows get loaded dynamically.


