bulk load and +=

Let’s suppose that you have log files of some sort pouring in and you want to put aggregate data representing the logs into an RDBMS.  To begin, let’s start with a blank slate, i.e., just dumping the data in.  And let’s have a simple table, which in MySQL is created via

CREATE TABLE `history` (
`id` int(11) NOT NULL auto_increment,
`hits` int(11) NOT NULL,
PRIMARY KEY  (`id`)
);

I did a pass each for MyISAM and InnoDB with a million inserts, varying how many rows went into each INSERT statement.

engine   rows per INSERT statement   seconds (lower is better)
MyISAM   10000                       7.05
MyISAM   1000                        7.34
MyISAM   100                         8.52
MyISAM   10                          31.45
MyISAM   1                           135.30
MyISAM   LOAD DATA INFILE            4.93
InnoDB   10000                       19.76
InnoDB   1000                        30.58
InnoDB   100                         89.55
InnoDB   10                          723.14
InnoDB   LOAD DATA INFILE            17.26
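The LOAD DATA INFILE rows were a straight bulk load rather than INSERT statements.  A minimal sketch of that case, assuming the hit counts sit one per line in /tmp/hits.txt (the path and file layout are placeholders of mine, not from the benchmark):

-- bulk load one hits value per line straight into the table
LOAD DATA INFILE '/tmp/hits.txt'
INTO TABLE history (hits);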

A multi-value insert for three values looks like this:

INSERT INTO today (hits) VALUES (?), (?), (?)

Then I execute the statement with the three values bound to the placeholders.
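So the equivalent unprepared statement, with made-up hit counts of 5, 12, and 7 purely for illustration, would be:

INSERT INTO today (hits) VALUES (5), (12), (7);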

The fact that inserts with 1000 values start to approach the LOAD DATA INFILE numbers is a little compelling.  But let’s suppose that we want to do every insert via a bulk load and still maintain a table (like history above) that holds aggregate data, += style.  Is it possible?  Sure.

Here is one approach for MySQL (sketched in full after the list):

  1. Create a temp table, which I will call today
  2. Bulk load the data into today
  3. Run the query INSERT INTO history (SELECT * FROM today) ON DUPLICATE KEY UPDATE history.hits = history.hits + today.hits;
  4. Drop today
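Put together, the whole flow is just a handful of statements.  A minimal sketch, assuming the aggregated rows arrive as tab-separated (id, hits) pairs in /tmp/agg.tsv (that path is a placeholder):

-- stage the batch in a session-local table with the same shape as history
CREATE TEMPORARY TABLE today LIKE history;

-- bulk load the pre-aggregated rows
LOAD DATA INFILE '/tmp/agg.tsv' INTO TABLE today (id, hits);

-- merge: insert new ids, += the hits for ids already present
INSERT INTO history (SELECT * FROM today)
ON DUPLICATE KEY UPDATE history.hits = history.hits + today.hits;

DROP TEMPORARY TABLE today;

Using a TEMPORARY table keeps the staging data local to the session, so two loaders running at once won’t step on each other’s today tables.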

I would like to apply this strategy and contribute some Pig code that allows for bulk insert.  This would (I think) allow for some pretty large scale aggregating, all from within a “simple” Pig script.  I would also like to start using Chukwa, but it looks a little tough.  I think the architecture would then look something like

web servers -> Chukwa -> Pig -> MySQL

I think I would then be pretty well at Yahoo! or Facebook scale.

Guess we’ll see how it all goes 🙂

Enjoy!

Earl
