Use pv as your MySQL import progress indicator

There is only one relief for the pain of a long-running command: a progress indicator. But commands may or may not show their progress, and that is by design – each is good at one thing, and through regular pipes, data can pass through a lot of processing steps. Processing a file may take time, but how much? Here is how you can add a progress indicator to a MySQL data import.

In my case, I had an 11 gigabyte gzipped MySQL database dump that needed to be restored. The creates and inserts do not take much time by themselves, but the amount of data was huge, and there was no way of telling how complex the indexes were (and they would add to the time needed for the restore). This could take 1 hour, 3 hours or 3 days – I had no idea. To restore, I would simply do:

gunzip -c databasebackup-file.sql.gz | mysql -uxxx -p databasename

In my search for answers, quite a few of the results mentioned pv – which you may have encountered before, as some tools actually use a similar output to display progress, like:

1.43GiB 1:43:13 [ 392KiB/s] [=========> ] 13% ETA 10:56:22

The restore above has processed 1.43 gigabytes of data so far, which took 1 hour, 43 minutes and 13 seconds. The current throughput is 392 kilobytes per second. We are 13% complete, and based on the processing history so far, the restore is estimated to take almost 11 more hours to complete (the ETA).

As the file is restored, the numbers are updated – one second at a time. Some statements may run longer than others, initially causing quite a few “jumps” up and down in the ETA, but as time passes, the bumps even out and the ETA becomes more accurate. In this particular case the restore ended up taking just over 4 hours, but even the percentage indicator alone gives a great deal of usable insight.

So how is this done?

It is quite simple actually – instead of running

gunzip -c databasebackup-file.sql.gz | mysql -uxxx -p databasename

you would run

pv databasebackup-file.sql.gz | gunzip -c | mysql -uxxx -p databasename

The indicator starts running right away – even though you still have to enter the database user’s password before the restore can begin. That may mess up the terminal output a bit, but after that you are good to go. Useful switches include:

  • -I, --fineta – display the estimated local time when the transfer will finish, instead of the remaining time (ETA)
  • -a, --average-rate – show the average data throughput instead of the current rate
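For example, a restore showing the average rate and the projected local finish time could look like the line below. This is just a sketch of combining the two switches – note that once you pass pv any explicit display switch, it shows only the displays you asked for instead of its default set, and -I requires a reasonably recent pv:

pv -a -I databasebackup-file.sql.gz | gunzip -c | mysql -uxxx -p databasename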

If you use pv for a database export, the size of the output is unknown until the export is complete, so pv cannot calculate an ETA. But if you know the expected size of the export, you can tell pv about it with the -s SIZE (--size SIZE) argument, which lets it calculate the ETA and percentage complete. Read the man page for further tips on using pv.
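For example, an export piped through pv could look like this – the 11g is just an assumed expected size of the raw SQL for the sake of the example, so adjust it to your own data:

mysqldump -uxxx -p databasename | pv -s 11g | gzip > databasebackup-file.sql.gz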

This is just one example of using pv to display the progress of an operation. You can easily use it for any operation involving a pipe, and even when copying large files or synchronizing with rsync.
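A plain file copy with a progress bar, for instance, could look like this (the paths are made up for the example):

pv /data/largefile.iso > /mnt/backup/largefile.iso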

Will ideas of the Massachusetts law spread?

On reddit, I found this article about a new law in Massachusetts that is meant to increase the security of personally identifiable information:

If you have personally identifiable information (PII) about a Massachusetts resident, such as a first and last name, then you have to encrypt that data on the wire and as it’s persisted. Sending PII over HTTP instead of HTTPS? That’s a big no no. Storing the name of a customer in SQL Server without the data being encrypted? No way, Jose. You’ll get a fine of $5,000 per breach or lost record. If you have a database that contains 1,000 names of Massachusetts residents and lose it without the data being encrypted that’s $5,000,000. Yikes.

The idea is both good (yay – better security for anyone registering on a website) and bad (expensive). As a consultant in a country not covered by the law, I couldn’t care less, but is this in general a good idea? Securing data and sending sensitive information across secure connections is always a good idea, and similar rules could hit the EU soon. Documenting how you secure the personally identifiable information? Who can argue with that being a good idea? As an IT consultant, you won’t hear me cry myself to sleep.

But the proportions of the legislation seem unclear. If you log on to my website (securely, of course) and I then feed you pages over clear-text HTTP – can I be fined if your name appears on a page, as in “Reader XYZ is logged in”?

I guess so.

It strikes me that no level of sensitivity is defined – anything considered personal must be secured. It seems very easy to legislate like that when you are not the one picking up the bill.

If this kind of legislation ever hits Europe, I hope someone will elaborate a bit on the do’s and don’ts:

  • Are all kinds of data included? Can I be fined if you, as a Massachusetts resident, post a comment here and include your name? (The database is secured, but nothing is sent over HTTPS.)
  • Would it make sense to allow users to waive their rights, thereby letting them work with online applications that are not 100% secure?

MySQL query/indexing tip

I originally learned about this MySQL shortcoming in the book High Performance MySQL, and luckily I remembered it when an issue struck me today.

I have two tables – one containing blog posts, and one containing a record for each tag on those posts. The requirement was to show a blog post AND the five latest related posts. Related posts in this scenario are other posts that share one or more tags with the current post.

Finding a blog post takes less than a second, and so does finding the tags related to it. It does not even take a second to find a list of posts having a list of tags. The reason all these requests are fast is that the tables have efficient indexes – the table of blog posts holds more than 3.5 million records, and there are about 700,000 tags.

The problem was that when we tried to generate a list of related posts, the query took up to twelve seconds to run. It sucked and looked like this:

SELECT distinctrow i.id, i.item_title, i.item_url
FROM rss_items_master i
INNER JOIN rss_item_tag t ON i.id=t.item_id
WHERE t.item_tag IN (SELECT item_tag FROM rss_item_tag WHERE item_id=$itemid)
AND t.item_id != $itemid
ORDER BY i.item_date DESC
LIMIT 5;

In plain English, we request a list of items that do not match the specific post, but that have at least one tag matching one of the tags from that post. The nested SELECT statement produces the list of tags to look for, and the outer SELECT statement pulls blog posts containing the tags found by the nested SELECT. This does not perform well because of the MySQL shortcoming mentioned above: MySQL does not run the nested SELECT once and reuse the result – it treats it as a dependent subquery and re-evaluates it for rows in the outer query, so the indexes are not used the way you would expect. To fix it, I split the SQL in two (PHP example):

$sql_tags = "SELECT item_tag FROM rss_item_tag WHERE item_id=$itemid;";
// query_array() is a local helper that returns all result rows as an array.
$arr_tags = query_array($sql_tags, $link);

// $tag_o will hold the escaped tags from the original post.
$tag_o = array();
foreach ($arr_tags as $tag_i) {
    array_push($tag_o, mysql_real_escape_string($tag_i[0]));
}

// Glue the tags together for use inside the IN ('...') clause below.
$tags = implode("','", $tag_o);

Now $tags contains the list of tags to look for – in fact, it contains exactly what the nested SELECT produced in the slow query. With this in place, the two queries combined complete in less than a second:

SELECT DISTINCTROW i.id, i.item_title, i.item_url
FROM rss_items_master i
INNER JOIN rss_item_tag t ON i.id=t.item_id
AND t.item_tag IN ('$tags')
AND t.item_id != $itemid
ORDER BY i.item_date DESC
LIMIT 5;

So even though the SQL has been split in two, even though we pass a lot of information through variables built from the first query, and even though we have added quite a lot of code – this outperforms the original single query by a factor of more than 10, simply because it lets MySQL use the indexes it should.
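As an aside, the same result can sometimes be had in a single statement by turning the nested SELECT into a self-join on rss_item_tag. This is just a sketch against the same schema – untested here, and whether it beats the two-query approach depends on the indexes, so compare the plans with EXPLAIN:

SELECT DISTINCTROW i.id, i.item_title, i.item_url
FROM rss_item_tag orig
INNER JOIN rss_item_tag t
  ON t.item_tag = orig.item_tag AND t.item_id != orig.item_id
INNER JOIN rss_items_master i ON i.id = t.item_id
WHERE orig.item_id = $itemid
ORDER BY i.item_date DESC
LIMIT 5;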

For more MySQL performance tips, check out MySQL Performance Blog.