MySQL query/indexing tip

I originally learned about this MySQL shortcoming in the book High Performance MySQL, and luckily I remembered it for an issue that struck me today.

I have two tables – one containing blog-posts, and one containing a record for each tag on these posts. Now the requirement was to show a blog post AND to show the five latest related posts. Related posts in this scenario are other posts that share one or more tags with this post.

Now, it does not take one second to find a blog post, and it does not take one second to find tags related to it. And actually it does not even take a second to find a list of posts having a list of tags. The reason why all these requests take less than a second is because the tables have been supplied with efficient indexes – the table of blogposts has more than 3.5 million records and there are about 700.000 tags.

The problem was that when we tried to generate a list of related posts, the query took up to twelve seconds to run. It sucked and looked like this:

SELECT distinctrow i.id, i.item_title, i.item_url
FROM rss_items_master i
INNER JOIN rss_item_tag t ON i.id=t.item_id
WHERE t.item_tag IN (SELECT item_tag FROM rss_item_tag WHERE item_id=$itemid)
AND t.item_id != $itemid
ORDER BY i.item_date DESC
LIMIT 5;

In plain english we request a list of items that does not match the specific post, but they must have at least one tag that match one of the tags from the specific post. The nested SELECT-statement produce a list of tags to look for, and the outer SELECT statement pulls blog posts containing tags found in the nested SELECT. This does not perform, because only one index can be used per query – so either the index from the rss_item_tag table or from the rss_items_master table is ignored (i did not bother figuring out which one was ignored). To fix it, I split the SQL in two (PHP example):

$sql_tags = "SELECT item_tag FROM rss_item_tag WHERE item_id=$itemid;";
$arr_tags = query_array($sql_tags, $link);
// $tag_o will be an array of tags from the original post.
$tag_o = Array();
foreach($arr_tags as $tag_i) {
array_push($tag_o, mysql_real_escape_string($tag_i[0]));
}
$tags = implode("','", $tag_o);

Now $tags contain a list og tags to look for – actually it contains exactly what the nested SELECT-part did in the slow query. Now these two queries will actually complete in less than a second:

select distinctrow i.id, i.item_title, i.item_url
from rss_items_master i inner join rss_item_tag t on i.id=t.item_id
and t.item_tag in ('$tags')
and t.item_id != $itemid
order by i.item_date desc
limit 5;

So even though the SQL has been split in two, and even though we are passing a lot of information through variables that we have to build from the first query and added quite a lot of code – this outperforms the first single query by a factor larger than 10 – simply because it enables MySQL to use the indexes it should.

For more MySQL performance tips, check out MySQL Performance Blog.

Add your own comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.