Thursday, December 29, 2011

Reddit pagination scheme and addressing concerns about mysql performance and SEO

Pagination of search results or of blog pages is a simple technique yet you need to be aware of few performance concerns especially if your backend is an RDBMS like MySQL. The simplest scheme goes like


  • Get a page size
  • Get a page offset (or page number)
  • sort on some column
  • use LIMIT with offset


The queries you form are like,

SELECT post.title, post.summary from t_post post order by post.created_on DESC LIMIT 100,20

where 100 is the offset and 20 is result size.  The problem with such queries is the offset. The performance is fine for smaller offsets but for bigger offsets (first number after LIMIT) you are examining a lot of rows even with an index. Naturally that has a performance impact. check LIMIT performance optimization on mysql performance blog.

So what to do? The remedy is to drop the offset from queries. You needed the offset to locate the starting rows (pointer to your results) and you need something else to locate the range of rows for you. The solution is to use WHERE clause to locate rows. You should check this nice presentation by Surat Bhati and Rick James done during Percona 2009 conference.

www.percona.com/ppc2009/PPC2009_mysql_pagination.pdf 

To locate the right range using where clause you need a column that can sort your data. For date based sorting an auto increment key is a nice proxy.

SEO concerns

Suppose you had 3 pages and you served rows 1-20 from /page/3, rows 21-40 from /page/2 and rows 41-60 from /page/1. After inserting one more "page of data", /page/1 will fetch records 61-80, /page/2 will fetch 41-60, /page/3 will fetch 21-40 etc. The problem here is that all the pages have changed as a result of inserting "one more page". You can run into crawler indexing issues if that happens. (Every day every URL on your site is changing)

A better scheme is to use a script with query parameters to generate such summary pages (Check slashdot)
or use static pagination URL (check Digg)

Also, it may not be advisable to go on ad infinitum and you can refuse to show more results after user has browsed a certain number of pages (deep into your site)

Page link rendering concerns 


You will typically see two schemes. First is  a simple Next and Previous base scheme and second is rendering links to all the individual pages (1,2,3...10) with a total count. I believe the modern way is to use just the previous and Next links. Displaying total number of records is an absolute dud and user would any day prefer  to locate results directly via search than browsing.

Reddit site pagination

Now, armed with all this knowledge, lets see why reddit uses a  pagination scheme like (URL of type http://www.reddit.com/?count=25&after=t3_nu9we)

I know reddit source code is out there in open but I am not going to read it. Let's take an educated guess here. Most likely after parameter is in base36 and you can just use the php base_convert function to see the value in decimal. This number is used in where clause to locate the start row with a standard page size.

 count is keeping track of how deep you are and at some point the site will refuse to paginate further. Again, I am not going to click link after link but you can do the verification :D


© Life of a third world developer
Maira Gall