[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Monday, September 28, 2009

Scaling writes in MySQL

We use MySQL on most of our projects. One of these projects has an access pattern unlike any other I've worked on. Several million records a day need to be written to a table. These records are then read out once at the end of the day, summarised, and then very rarely touched again. Each record is about 104 bytes long (there's one VARCHAR column, everything else is fixed), and that's after squeezing out every byte possible. The average number of records that we write in a day is 40 million, but this could go up.

A little bit about the setup: we have fairly powerful boxes with large disks in RAID 1/0 and 16GB of RAM, though at the time of these tests they only had 4GB. For BCP (business continuity planning), we have a multi-master setup across two colos with statement-level replication. We used MySQL 5.1.

My initial tests with various parameters that affect writes showed that while MyISAM performed slightly better than InnoDB when the tables were small, its performance deteriorated quickly once the table size crossed a certain point. InnoDB performance deteriorated as well, but at a larger table size. That threshold turned out to be related to innodb_buffer_pool_size, which in turn was capped by the amount of RAM we had on the system.

I decided to go with InnoDB since we also needed transactions for the summary tables, and I preferred not to divide my RAM between two different engines. I stripped out all indexes and retained only the primary key. Since InnoDB clusters the table on the primary key, I decided that rather than use an auto_increment column, I'd cover several columns with the primary key to guarantee uniqueness. This had the added advantage that if the same record was inserted more than once, it would not result in duplicates. This small point was crucial for BCP, because it meant that we did not have to keep track of which records had already been inserted. If something crashed, we could just reinsert the last 30 minutes worth of data, possibly into the secondary master, and not have any duplicates at the end of it. I used INSERT IGNORE to get this done automatically.
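To make that concrete, here's a minimal sketch of the idea. The real schema was never published, so the table and column names below are hypothetical:

-- Hypothetical schema: the real table had ~104-byte rows with one VARCHAR
-- column; uniqueness comes from the data itself, not an auto_increment id.
CREATE TABLE hits (
    time   INT UNSIGNED NOT NULL,      -- unix timestamp of the record
    src_id INT UNSIGNED NOT NULL,      -- source of the record
    url    VARCHAR(64) NOT NULL,       -- the single VARCHAR column
    status SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (time, src_id, url)    -- composite PK doubles as the dedup key
) ENGINE=InnoDB;

-- Replaying the last half hour after a crash is safe: rows whose primary
-- key already exists are silently skipped.
INSERT IGNORE INTO hits VALUES (1254121200, 42, '/index.html', 200);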

Now to get back to the table size limit we were facing. Initial tests showed that we could insert at most 2100 records per second until the table size grew to a little over the innodb_buffer_pool_size, at which point the rate degraded rapidly to around 150 records per second. This was unacceptable, because records were coming into the system at an average rate of 1000 per second. Since we only needed to read these records at the end of the day, it was safe to accumulate them in a text file and periodically insert them in bulk. I decided to insert 40,000 records at a time. The number I chose was arbitrary, but later tests that I ran on batches of 10K, 20K and 80K showed no difference in insert rates. With batch inserts, we managed an insert rate of 10,000 records per second, but this too degraded to 150 records per second as soon as we hit the limit.
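The bulk loads were shaped roughly like this (using the hypothetical hits table sketched above); a single multi-row INSERT amortises parsing, network round-trips and log flushes across the whole batch:

INSERT IGNORE INTO hits VALUES
    (1254121200, 42, '/index.html', 200),
    (1254121200, 43, '/about.html', 200),
    (1254121201, 44, '/index.html', 304);
-- ...and so on; each statement would carry about 40,000 rows.

One caveat worth noting: max_allowed_packet has to be large enough to hold the whole statement.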

System stats on the database box showed that the disk was almost idle for most of the run and then suddenly shot up to 90-100% activity once we hit this limit, so it was obvious that at this point the DB was constantly shuttling data between the buffer pool and disk.

At this point, someone suggested that we try partitioning, which was available in MySQL 5.1. My first instinct was to partition based on the primary key so that we could read data out easily. However, reads weren't really our problem since we had no restriction on how fast they needed to be (at least not as much as writes). Instead, I decided to partition my table based on the pattern of incoming data.

The first part was obvious: use a separate table for each day's data. On a table of this size, DROP TABLE is much faster than DELETE FROM <table> WHERE ..., and it also reclaims the lost space. I should mention at this point that we used innodb_file_per_table as well, to make sure that each table had its own file rather than sharing the single InnoDB tablespace file.
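To make the drop-versus-delete point concrete, a sketch using hypothetical per-day table names:

-- With innodb_file_per_table set in my.cnf, each table lives in its own .ibd
-- file, so dropping a day's table hands the space straight back to the
-- filesystem:
DROP TABLE hits_20090726;

-- The slow alternative on a single big table: this rewrites the clustered
-- index row by row and leaves the freed pages inside the tablespace:
DELETE FROM hits WHERE time < UNIX_TIMESTAMP('2009-07-27');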

Secondly, each table was partitioned on time: 12 partitions per day, with 2 hours of data per partition. The MySQL docs on partitioning were quite useful in understanding what to do. The command ended up looking like this:
CREATE TABLE ... (
    ...
) PARTITION BY RANGE( ( time DIV 3600 ) MOD 24 ) (
    PARTITION p0  VALUES LESS THAN (2),
    PARTITION p1  VALUES LESS THAN (4),
    PARTITION p2  VALUES LESS THAN (6),
    PARTITION p3  VALUES LESS THAN (8),
    PARTITION p4  VALUES LESS THAN (10),
    PARTITION p5  VALUES LESS THAN (12),
    PARTITION p6  VALUES LESS THAN (14),
    PARTITION p7  VALUES LESS THAN (16),
    PARTITION p8  VALUES LESS THAN (18),
    PARTITION p9  VALUES LESS THAN (20),
    PARTITION p10 VALUES LESS THAN (22),
    PARTITION p11 VALUES LESS THAN (24)
);
The time field is the timestamp of incoming records, and since time always moves forward (at least in my universe), this meant that I would never write to more than 2 partitions at any point in time.
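A quick sanity check of the partitioning expression, with an illustrative timestamp (this assumes the session time zone is UTC):

-- (time DIV 3600) MOD 24 is the hour of the day, so a record stamped 14:35
-- lands in partition p7, which covers hours 14 and 15:
SELECT (UNIX_TIMESTAMP('2009-09-28 14:35:00') DIV 3600) MOD 24;  -- returns 14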
Now for a little back-of-the-envelope calculation:
44M records x 102 bytes = approx 4.2GB
x2 for InnoDB overhead = approx 8.4GB
+10% for partitioning overhead = approx 9.2GB
/12 partitions = approx 760MB per partition
This turned out to be more or less correct. In most cases the total table size ranges between 8 and 10GB, though it sometimes goes up to 13GB. Partition sizes range from under 700MB to over 1GB depending on the time of day. With 4GB of RAM, we had innodb_buffer_pool_size set to 2.7GB, which was enough to hold two partitions, but not enough to work on any other tables or do anything else on the box. Boosting the RAM to 16GB meant that we could have a 12GB buffer pool and leave 4GB for the system. This was enough for 2 partitions even if the total number of records went up, and we could work on other tables as well.

After partitioning, tests showed that we could sustain an insert rate of 10K rows per second for some time. As the table size grew past 10 million records, the insert rate dropped to about 8500 rows per second, but it stayed at that rate for well over 44 million records. I tested inserts up to 350 million records and we were able to sustain an insert rate of around 8500 rows per second. Coincidentally, during Michael Jackson's memorial service, we actually did hit an incoming rate of a little over 8000 records per second for a few hours.

One more BotE calculation:
8500 rows per second x 86400 seconds per day = 734.4 million records per day
Considering that before this system was redesigned it was handling about 7 million records per day, I'd say that we did pretty well.

Update: If you want to see charts, they're in my ConFoo presentation on slideshare.

23 comments:

Morgan Tocker
September 29, 2009 11:07 PM

Thank you for sharing! I wonder if you could share a sanitized version of the schema you were inserting into?

There's a benchmark that exists to test a storage engine's sustained insert performance called iibench. Results for InnoDB are available here (spoiler: it ain't all pretty):

http://mysqlha.blogspot.com/2008/12/innodb-insert-performance.html

The random IO of updating B-tree indexes has got to be one of the big factors for the slow down once you exceed buffer pool size, but I wonder if there's anything else to it that could potentially be improved. I know I've seen cases where table scans can just wipe the contents of the buffer pool because InnoDB doesn't have any mid-point insertion strategies to avoid this.

Tom Hotchkiss
October 01, 2009 12:44 PM

You might want to give the TokuDB storage engine a try since it is a perfect fit for your application. Take a look at the first graph in:

http://tokutek.com/mysql-performance-brief.pdf

Using the iiBench benchmark mentioned by Morgan, TokuDB sustained an insert rate of more than 16,000 rows per second with three fully random indexes on a 1B row table without any partitioning.

Contact me if you are interested at hotchkiss at tokutek dot com

zDenis
October 14, 2009 11:26 AM

You write: "I should mention at this point that we used innodb_file_per_table as well, to make sure that each table had its own file rather than sharing the single InnoDB tablespace file."

Since you don't insert records into different tables at the same time, what were the main reasons that led you to this decision?

Thanks

Philip
October 14, 2009 11:30 AM

file_per_table is always a good idea. It allows you to move some tables off to a different file system. Also, if you drop a table (we drop tables older than 2 months), the space is reclaimed immediately instead of fragmenting the main InnoDB tablespace.

Lastly, if I need to clone a database, I can cheat a little and only copy over the tables that are critical to day-to-day system operation, and then copy over the older tables when the system is idle.

Unknown
March 21, 2010 12:36 AM

This seems to be a good fit for SQLite: the inserts are fast and it's easy to aggregate at the end.

YesThatTom
March 21, 2010 8:24 AM

Dude, that's a really impressive bit of engineering. However, you failed the interview.

It was a trick question. The right answer is, "MySQL isn't the right tool for this job." You just created a monster. A monster that would require a senior-level or higher DBA to maintain, and someone even more advanced to debug.

7 million writes, read once at the end of the day, then deleted. You put it in a database? Really?

Here's how to solve this.

It's called a text file.

For each record, you append it to a text file.
At the end of the day, you read the text file. Then you delete the text file. You handle day changes by rotating the text file.

Do you need to occasionally access records in the middle of the file? Build a simple index. Before writing each record, append the ftell() and the key to another file. Or, if you are indexing by timestamp, just record the ftell() of the first record in each 1-minute block. That gives you a good starting point to start searching for the record you want. Heck, if your index data is fixed-size records, you can fseek() to BLOCKSIZE * minutes-after-midnight to go directly to the right per-minute record. Yes, you could also store this index data in a MySQL database; the index mark for every 10-minute block might be "good enough" and MySQL should certainly be able to handle that.

I'm sorry to bash your excellent design. However, debugging, by definition, is more difficult than creating. Therefore, when you create a system that is the height of your skill, you have, by definition, created a system that you yourself cannot debug.

Think of the sysadmin that has to maintain this. The horror. The horror.

Philip
March 21, 2010 9:29 AM

Hi Tom,

thanks for your comments, but what you suggest won't work (we already tried it). I've had to simplify the explanation for this blog post (both for brevity and because there's a lot that I can't reveal). It's a little more complex than "occasionally access records in the middle of the file". It's more like: I have multiple fields, and I need to run SQL queries with a WHERE clause that includes one or more of these fields and an ORDER BY on a few others. If I were to put all of this in a flat file and write my own indexing system, pretty soon I'd be implementing MyISAM, and I'd still have a problem fitting everything I need in RAM at the same time.

What will work is multiple text files thrown onto a hadoop cluster... but then I'm getting a cluster to do the work of one box.

PS: This isn't the height of my skill ;)

Anonymous
March 21, 2010 9:41 AM

There are a couple other things you could have tried that typically speed up MySQL performance in scenarios like yours.

By default InnoDB flushes every transaction to disk -- this is usually the limiting factor when you're doing lots of inserts. It's necessary for a true ACID system, but in many cases, it's not worth the large overhead cost.

The first thing you can do is bundle lots of inserts (e.g., one second's worth of data) into a single transaction. It's usually a fairly simple code change, just inserting begin and end transaction statements at the right place.

A simpler (but less safe) alternative is to set the flag innodb_flush_log_at_trx_commit=0 in the config file. If you have a complete OS crash, you can potentially lose up to one second's worth of data. However, in many cases, the benefit of much faster performance is worth losing a little bit of data in infrequent, extreme situations.
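For illustration, minimal sketches of both suggestions, reusing the hypothetical hits table from the post above:

-- Option 1: one transaction per batch, so InnoDB flushes its log once per
-- COMMIT instead of once per row:
START TRANSACTION;
INSERT INTO hits VALUES (1254121202, 45, '/index.html', 200);
INSERT INTO hits VALUES (1254121202, 46, '/faq.html', 200);
COMMIT;

-- Option 2: relax durability. The log buffer is then flushed to disk about
-- once per second, so an OS crash can lose up to a second of commits:
SET GLOBAL innodb_flush_log_at_trx_commit = 0;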

These are pretty standard techniques that many sysadmins are familiar with.

Philip
March 21, 2010 9:54 AM

We use flush_log_at_trx_commit=2. There's no need for a transaction, because we do bulk inserts: 40K records per insert, which takes 4-6 seconds.

I have a bunch of charts in my ConFoo presentation on this topic here: http://www.slideshare.net/bluesmoon/scaling-mysql-writes-through-partitioning-3397422

Anonymous
March 21, 2010 10:58 AM

Sorry, but that answer about using a text file is full of fail.

To the author, thanks for sharing this info. It's quite valuable.

Anonymous
March 21, 2010 11:32 AM

In your presentation, you mention that you are using replication. The binary logging for replication can be a huge bottleneck for writes -- it's similar to the issue of flushing transactions. If you're setting flush_log_at_trx_commit, then you should also set sync_binlog=0.
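For illustration, that setting can be applied at runtime (it can also go in my.cnf):

SET GLOBAL sync_binlog = 0;  -- requires SUPER privilege; 0 leaves binlog syncing to the OS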

Here's some sample data that explores various combos:
http://www.mysqlperformanceblog.com/2010/02/28/maximal-write-througput-in-mysql/

Usually it makes sense to tweak these settings first, then move on to partitioning and bulk inserts only if you have to.

Philip
March 21, 2010 6:58 PM

I did try sync_binlog=0. It did improve individual insert rate, but not significantly higher than the 2000 inserts per second that we required. Once I switched to bulk inserts, the difference made by sync_binlog was negligible.

Partitioning was required because no matter what we tried we'd always break down at the same point. Adding more RAM was the first solution, and that worked. Right now we still have enough RAM to manage even with data growth, but I estimate that we'll run out of RAM in 6-9 months and that's when partitioning will be the only solution.

Thanks for your comments and the link to the my-p-b.

Bradford
March 22, 2010 12:21 AM

I actually think HBase may make your life easier :)

Anonymous
March 22, 2010 12:59 AM

Interesting. Although myself, I would have dumped the data to a text file, installed Hadoop, and then written some simple Python scripts to pipe into Hadoop. You can install it on one machine without requiring a cluster (it's actually easier to set up) and it lets you scale out when you need to.

Les
March 23, 2010 10:09 AM

The problem that it sounds like you are trying to solve (from the comments above) is actually tailor-made for business intelligence cube technology. It takes care of the aggregation and slicing and dicing.

I have fairly often seen this kind of problem solved with Teradata and big hardware. Your solution, though, is MUCH cheaper. Given the lack of changes after insert, I wonder if there is a cube implementation that would allow you to pull in the data periodically (every 15 or 20 minutes) without requiring it to be retained in a relational DB. This would give you more flexibility to analyze the data in the long term.

Jelmer
March 23, 2010 12:08 PM

Seems to me a NoSQL solution like CouchDB, Mongo, HBase or Cassandra would have been more appropriate here.

Anonymous
May 20, 2010 12:05 PM

@Jelmer - how is it that NoSQL would be a better solution when you have NFI what all the requirements are? Your moronity astounds me.

dat
May 21, 2010 7:51 AM

I don't know if the queries that you need to perform are suitable for a database like Redis (http://code.google.com/p/redis/wiki/Benchmarks), but I'm curious to know whether you tried that solution and, if so, why you chose MySQL in the end.
thanks,
dat

DB Web Dev
September 04, 2010 2:42 AM

I saw your ConFoo presentation, then read through the comments here. Was there a reason you could not use the ARCHIVE storage engine for this?

You said "lots of writes throughout the day, and one huge read at the end of the day".

So I would think ARCHIVE would let you insert fast while using much less disk space than InnoDB. And your select at the end of the day wouldn't be that bad even if it was a full table scan.

Philip
September 04, 2010 3:35 AM

That's a good question. I don't remember the exact reason, but it may have to do with the fact that we can't do one giant select -- there's just too much data to hold in RAM. We need to do multiple selects on various keys.

When I say read once at the end of the day, I really mean in one batch do several selects on different keys and summarise the data at that point, and not one single select.

Anonymous
April 21, 2011 12:34 AM

Consider an example where a table gets a high rate of both inserts and selects. You could have two tables with different storage engines: Table A would be InnoDB for fast inserts, with a trigger copying each row into Table B (MyISAM), and data would be deleted from Table A periodically.
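For clarity, a rough sketch of what this suggests, using the hypothetical hits table from earlier (the reply below explains why it doesn't help here):

-- Shadow table on MyISAM (CREATE TABLE ... LIKE copies the engine, so it
-- has to be changed afterwards):
CREATE TABLE hits_archive LIKE hits;
ALTER TABLE hits_archive ENGINE=MyISAM;

-- Copy every new row into the MyISAM table as it arrives:
CREATE TRIGGER copy_row AFTER INSERT ON hits
FOR EACH ROW INSERT INTO hits_archive VALUES (NEW.time, NEW.src_id, NEW.url, NEW.status);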

Philip
April 21, 2011 1:25 AM

@Anonymous: that doesn't work since they're both on the same disk. Deletions are also not possible since on InnoDB that requires rebalancing the index. The best way to delete rows is to drop the table completely, which is why we partition by day. Remember that space is not an issue, the bottleneck is the rebalancing of the innodb primary key.

Gabriel Francisco
August 29, 2011 4:00 PM

Great article! Thanks man!

