bulk load and +=

Let’s suppose that you have log files of some sort pouring in and you want to put aggregate data representing the logs into an RDBMS.  To begin, let’s start with a blank slate, i.e., just dumping the data in.  And let’s have a simple table, which in MySQL is created via

CREATE TABLE `history` (
`id` int(11) NOT NULL auto_increment,
`hits` int(11) NOT NULL,
PRIMARY KEY  (`id`)
);

I did a pass each for MyISAM and InnoDB with a million inserts, varying how many rows went into each INSERT statement.

engine   rows per INSERT statement   seconds (lower is better)
MyISAM   10000                       7.05
MyISAM   1000                        7.34
MyISAM   100                         8.52
MyISAM   10                          31.45
MyISAM   1                           135.30
MyISAM   LOAD DATA INFILE            4.93
InnoDB   10000                       19.76
InnoDB   1000                        30.58
InnoDB   100                         89.55
InnoDB   10                          723.14
InnoDB   LOAD DATA INFILE            17.26
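The LOAD DATA INFILE rows were a straight bulk load rather than INSERT statements.  A minimal sketch of that case, assuming the hit counts sit one per line in /tmp/hits.txt (the path and file layout are placeholders of mine, not from the benchmark):

-- bulk load one hits value per line straight into the table
LOAD DATA INFILE '/tmp/hits.txt'
INTO TABLE history (hits);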

A multi-value insert for three values looks like this:

INSERT INTO today (hits) VALUES (?), (?), (?)

Then I execute the statement with the three values bound to the placeholders.
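So the equivalent unprepared statement, with made-up hit counts of 5, 12, and 7 purely for illustration, would be:

INSERT INTO today (hits) VALUES (5), (12), (7);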

The fact that inserts with 1000 values start to approach the LOAD DATA INFILE numbers is a little compelling.  But let’s suppose that we want to do every insert via a bulk load and still maintain a table (like history above) that holds aggregate data, += style.  Is it possible?  Sure.

Here is one approach for MySQL (sketched in full after the list):

  1. Create a temp table, which I will call today
  2. Bulk load the data into today
  3. Run the query INSERT INTO history (SELECT * FROM today) ON DUPLICATE KEY UPDATE history.hits = history.hits + today.hits;
  4. Drop today
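Put together, the whole flow is just a handful of statements.  A minimal sketch, assuming the aggregated rows arrive as tab-separated (id, hits) pairs in /tmp/agg.tsv (that path is a placeholder):

-- stage the batch in a session-local table with the same shape as history
CREATE TEMPORARY TABLE today LIKE history;

-- bulk load the pre-aggregated rows
LOAD DATA INFILE '/tmp/agg.tsv' INTO TABLE today (id, hits);

-- merge: insert new ids, += the hits for ids already present
INSERT INTO history (SELECT * FROM today)
ON DUPLICATE KEY UPDATE history.hits = history.hits + today.hits;

DROP TEMPORARY TABLE today;

Using a TEMPORARY table keeps the staging data local to the session, so two loaders running at once won’t step on each other’s today tables.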

I would like to apply this strategy and contribute some Pig code that allows for bulk insert.  This would (I think) allow for some pretty large scale aggregating, all from within a “simple” Pig script.  I would also like to start using Chukwa, but it looks a little tough.  I think the architecture would then look something like

web servers -> Chukwa -> Pig -> MySQL

I think I would then be pretty well at Yahoo! or Facebook scale.

Guess we’ll see how it all goes 🙂

Enjoy!

Earl
