Tuesday, May 28, 2013

Batch Writing, and Dynamic vs Parametrized SQL, how well does your database perform?

One of the most effective database optimizations is batch writing. Batch writing is supported by most modern databases and part of the JDBC standard and is supported by most JPA providers.

Normal database access consists of sending each DML (insert, update, delete) statement to the database in a separate database/network access. Each database access has a certain amount of overhead to it, and the database must process each statement independently. Batch writing has two forms, dynamic and parametrized. Parametrized is the most common, and normally provides the best benefit, as dynamic can have parsing issues.

To understand batch writing, you must first understand parametrized SQL. SQL execution is composed of two parts, the parse and the execute. The parse consists of turning the string SQL representation to the database representation. The execute consists of executing the parsed SQL on the database. Databases and JDBC support bind parameters, so the the arguments to the SQL (the data) does not have to be embedded in the SQL. This avoids the cost of converting the data into text, and allows for the same SQL statement to be reused, with multiple executions. This allows for a single parse and multiple executes, aka "parametrized SQL". Most JDBC DataSource implementations and JPA providers support parametrized SQL and statement caching, this effectively avoids ever having a parse in a running application.

Example dynamic SQL

INSERT INTO EMPLOYEE (ID, NAME) VALUES (34567, "Bob Smith")

Example parametrized SQL

INSERT INTO EMPLOYEE (ID, NAME) VALUES (?, ?)

Parametrized batch writing involves executing a single DML statement, but with a set of bind parameters for multiple homogenous statements, instead of bind parameters for a single statement. This effectively allows for a large batch of homogenous inserts, updates, or deletes, to be processed by the database and network as a single operation, instead of n operations. The database only needs to perform the minimal amount of work, as there is only a single statement, so at most only a single parse. It is also compatible with statement caching, so no statement parsing needs to occur at all. The limitation is that all of the statement's SQL must be identical. So, it works really good for say inserting 1,000 Orders, as the insert SQL is the same for each Order, only the bind parameters differ. But it does not help for inserting 1 Order, or for inserting 1 Order, 1 OrderLine, and 1 Customer. Also, all of the statements must be part of the same database transaction.

Dynamic batch writing involves chaining a bunch of heterogeneous dynamic SQL statements into a single block, and sending the entire block to the database in a single database/network access. This is beneficial in that there is only a single network access, so if the database is remote or across a slow network, this can make a big difference. The drawback is that parameter binding is not allowed, and the database must parse this huge block of SQL when it receive it. It some cases the parsing costs can outweigh the network benefits. Also, dynamic SQL is not compatible with statement caching, as each SQL is different.

JDBC standardizes batch writing through its Statement and PrepareStatement batch APIs (as of JDBC 2.0, which was JDK 1.2, aka a long time ago). The JDBC batch API requires different JDBC code, so if you are using raw JDBC, you need to rewrite your code to switch between batching and non-batching APIs. Most JDBC drivers now support these APIs, but some do not actually send the DML to the database as a batch, they just emulate the APIs. So how do you know if you are really getting batch writing? The only real way is to test it, and measure the performance difference.

The JPA specification does not standardize batch writing configuration, but most JPA providers support it. Normally batch writing is enabled in JPA through persistence unit properties, so turning it on or off is a simple matter of configuration, and requires no coding changes. Some JPA providers may not support batch writing when using optimistic locking, and may not re-order SQL to enable it to be batched, so even with batch writing enabled, you may still not be getting batching writing. Always test your application with batch writing on and off, and measure the difference to ensure it is actually functioning.

EclipseLink supports both parametrized and dynamic batch writing (since EclipseLink 1.0). In EclipseLink, batch writing is enabled through the "eclipselink.jdbc.batch-writing" persistence unit property. EclipseLink provides three options, "JDBC", "Buffered", and "Oracle-JDBC". The "JDBC" option should always be used.

"Buffered" is for JDBC drivers that do not support batch writing, and chains dynamic SQL statements into a single block itself. "Buffered" does not support parametrized SQL, and is not recommended.

"Oracle-JDBC" uses the Oracle database JDBC API that predates the JDBC standard API, and is now obsolete. Previous to EclipseLink 2.5, this option allowed batch writing when using optimistic locking, but now the regular "JDBC" option supports optimistic locking.

EclipseLink 2.5 supports batch writing with optimistic locking on all (compliant) database platforms, where as previously it was only supported on selected database platforms. EclipseLink 2.5 also provides a "eclipselink.jdbc.batch-writing" query hint to disable batch writing for native queries that cannot be batched (such as DDL or stored procedures on some database platforms).

EclipseLink supports parametrized SQL through the "eclipselink.jdbc.bind-parameters", and "eclipselink.jdbc.cache-statements" persistence unit properties. However, these don't normally need to be set, as parameter binding is the default, so you would only set the property to disable binding. Statement caching is not on by default, but only relevant to EclipseLink if using EclipseLink's connection pooling, if you are using a JDBC or Java EE DataSource, then you must configure statement caching in your DataSource config.

When batch writing is enabled in EclipseLink, by default it is parametrized batch writing. To enable dynamic batch writing, you must disable parameter binding. This is the same to enable buffered batch writing.

Supporting batch writing is not incredibly difficult, most JPA providers support this, ordering the SQL such that it can be batched is the difficult part. During a commit or flush operation, EclipseLink automatically groups SQL by table to ensure homogenous SQL statements can be batched (and at the same time still maintains referential integrity constraints and avoids dead locks). Most JPA providers do not do this, so even if they support batch writing, a lot of the time the SQL does not benefit from batching.

To enabled batch writing in EclipseLink, add the following to persistence unit property;

"eclipselink.jdbc.batch-writing"="JDBC"
You can also configure the batch size using the "eclipselink.jdbc.batch-writing.size" persistence unit property. The default size is 100.
"eclipselink.jdbc.batch-writing.size"="1000"

Batch writing is very database, and JDBC driver dependent. So I was interested in which databases, drivers it worked with, and what the benefit was. I made two tests, one does a batch of 50 inserts, and one does a batch of 100 updates (using optimistic locking). I tried all of the batch writing options, as well as not using any batching.

Note, this is not a database benchmark, I am not comparing the databases between each other, only to themselves.

Each database is running on different hardware, some are local, and some are across a network, so do not compare one database to another. The data of interest is the percentage benefit enabling batch writing has over not using batch writing. For the insert test I also measured the difference between using parametrized versus dynamic SQL, and parametrized SQL without statement caching. The result is the number of transactions processed in 10 seconds (run 5 times, and averaged), so a bigger number is a better result.

Database: MySQL Version: 5.5.16
Driver: MySQL-AB JDBC Driver Version: mysql-connector-java-5.1.22

Insert Test
OptionAverage Result% Difference from non batched
parametrized-sql, no batch4830%
dynamic-sql, no batch4993%
parametrized-sql, no statement caching478-1%
dynamic-sql, batch4993%
parametrized-sql, batch5095%

Update Test

OptionAverage Result% Difference from non batched
parametrized-sql2450%
dynamic-sql, batch2440%
parametrized-sql, batch2481%

So the results seem to indicating batch writing has no affect whatsoever (5% is within the variance). What this really means, is that the MySQL JDBC driver does not actually use batch processing, it just emulates the JDBC batch APIs and executes statements one by one underneath.

MySQL does have batch processing support though, it just requires different SQL. The MySQL JDBC driver does support this, but requires the rewriteBatchedStatements=true JDBC connect property to be set. This can easily be set by modifying your connect URL, such as;

jdbc:mysql://localhost:3306/test?rewriteBatchedStatements=true

MySQL: rewriteBatchedStatements=true

Insert Test
OptionAverage Result% Difference from non batched
parametrized-sql, no batch5040%
dynamic-sql, no batch5080%
parametrized-sql, no statement caching483-4%
dynamic-sql, batch1292156%
parametrized-sql, batch2181332%

Update Test

OptionAverage Result% Difference from non batched
parametrized-sql2500%
dynamic-sql, batch669167%
parametrized-sql, batch699179%

So, it appears batch writing does make a big difference in MySQL, if configured correctly (why the JDBC driver does not do this by default, I have no idea). Parametrized batch writing does the best, being 332% faster for inserts, and 179% faster for updates. Dynamic batch writing also performs quite well. Interestingly there appears to be little difference between dynamic and parametrized SQL on MySQL (my guess is either MySQL is really faster at parsing, or does little optimization for prepared statements).

PostgreSQL Version: 9.1.1
PostgreSQL 8.4 JDBC4

Insert Test
OptionAverage Result% Difference from non batched
parametrized-sql, no batch4790%
dynamic-sql, no batch418-12%
parametrized-sql, no statement caching428-10%
dynamic-sql, buffered1127135%
dynamic-sql, batch1127135%
parametrized-sql, batch2037325%

Update Test

OptionAverage Result% Difference from non batched
parametrized-sql2330%
dynamic-sql, batch39569%
parametrized-sql, batch707203%

The results show batch writing makes a big difference on PostgreSQL. Parametrized batch writing performs the best, being 325% faster for inserts, and 203% faster for updates. Dynamic batch writing also performs quite well. For PostgreSQL I also measure the performance of EclipseLink's buffered batch writing, which performs the same as dynamic JDBC batch writing, so I assume the driver is doing the same thing. Parametrized SQL outperforms dynamic SQL by about 10%, but parametrized SQL without statement caching performs similar to dynamic SQL.

Oracle Database 11g Enterprise Edition Release 11.1.0.7.0
Oracle JDBC driver Version: 11.2.0.2.0

Insert Test
OptionAverage Result% Difference from non batched
parametrized-sql, no batch5480%
dynamic-sql, no batch494-9%
parametrized-sql, no statement caching452-17%
dynamic-sql, buffered383-30%
dynamic-sql, batch489-10%
parametrized-sql, batch3308503%

Update Test

OptionAverage Result% Difference from non batched
parametrized-sql2820%
dynamic-sql, batch258-8%
parametrized-sql, batch1672492%

The results show parametrized batch writing makes a big difference on Oracle, being 503% faster for inserts, and 492% faster for updates. Dynamic batch writing does not provide any benefit, this is because Oracle's JDBC driver just emulates dynamic batch writing and executes statements one by one, so it has the same performance as dynamic SQL. Buffered batch writing actually has worse performance than not batching at all. This is because of the parsing cost for the huge block of dynamic SQL, this may vary in different configurations, if the database is remote or across a slow network, I have seen this provide a benefit.

Parametrized SQL with statement caching provides about a 10% benefit over dynamic SQL, and points out that to benefit from parametrized you need to use statement caching, otherwise the performance can be worse than dynamic SQL. Of coarse there are other benefits to parametrized SQL, as it removes CPU processing from the server, which may not help much in this single threaded case, but can make a huge difference in a multi-threaded case where the database is a bottleneck.

Apache Derby Version: 10.9.1.0 - (1344872)
Apache Derby Embedded JDBC Driver Version: 10.9.1.0 - (1344872)
(local)

Insert Test
OptionAverage Result% Difference from non batched
parametrized-sql, no batch30270%
dynamic-sql, no batch24-99%
parametrized-sql, no statement caching50-98%
dynamic-sql, batch24-99%
parametrized-sql, batch32527%

Update Test

OptionAverage Result% Difference from non batched
parametrized-sql14370%
dynamic-sql, batch6-99%
parametrized-sql, batch217251%

The results show parametrized batch writing makes a difference on Derby, being 7% faster for inserts, and 51% faster for updates. This result difference is not as much as other database because my database was local. For a networked database, it would be a bigger difference, but this does show that batch writing can provide a benefit even for local databases, so it is not just a network optimization. The really interesting results from Derby are the horrible performance of the dynamic and non-cached statements. This shows the Derby has a huge parsing cost, so if you are using Derby, using parametrized SQL with statement caching is really important.

DB2/NT64 Version: SQL09070
IBM Data Server Driver for JDBC and SQLJ Version: 4.0.100

The results are basically similar to Oracle, in that parametrized batch writing gives a big performance benefit. Dynamic batch writing has worse performance then no batching with parametrized SQL, and dynamic SQL and parametrized SQL without statement caching result in worse performance.

Microsoft SQL Server Version: 10.50.1617
Microsoft SQL Server JDBC Driver 2.0 Version: 2.0.1803.100

The results were similar to PostgreSQL, showing both parametrized and dynamic batch writing providing a significant benefit. Parametrized batch writing performed the best, and parametrized SQL outperformed dynamic SQL, and no statement caching.

** UPDATE **

It was requested that I also test H2 and HSQL, so here are the results.

Database: H2 Version: 1.3.167 (2012-05-23)
Driver: H2 JDBC Driver Version: 1.3.167 (2012-05-23)
(local)

Insert Test
OptionAverage Result% Difference from non batched
parametrized-sql, no batch47570%
dynamic-sql, no batch3210-32%
parametrized-sql, no statement caching47570%
dynamic-sql, buffered1935-59%
dynamic-sql, batch3293-30%
parametrized-sql, batch575320%

The results show H2 performs 20% faster with parametrized batch writing. H2 is an in-memory database (backed by a persistent log file), so is not expected to benefit as much as there is no network involved. Dynamic batch writing, and dynamic SQL perform worse the parametrized SQL. Interestingly using statement caching with parametrized SQL makes no difference. My assumption is that H2 is always caching prepared statements in its connection, so the user does not need to do their own statement caching.

Database: HSQL Database Engine Version: 1.8.1
Driver: HSQL Database Engine Driver Version: 1.8.1
(local)

Insert Test
OptionAverage Result% Difference from non batched
parametrized-sql, no batch73190%
dynamic-sql, no batch5054-30%
parametrized-sql, no statement caching6776-7%
dynamic-sql, batch5500-24%
parametrized-sql, batch917625%

The results show HSQL performs 25% faster with parametrized batch writing. HSQL is an in-memory database (backed by a persistent log file), so is not expected to benefit as much as there is no network involved. Dynamic batch writing, and dynamic SQL perform worse the parametrized SQL.