Tuesday, August 13, 2013

Optimizing Java Serialization - Java vs XML vs JSON vs Kryo vs POF

Perhaps I'm naive, but I always thought Java serialization must surely be the fastest and most efficient way to serialize Java objects into binary form. After all, Java is on its 7th major release, so this is not new technology, and since every JDK seems to be faster than the last, I incorrectly assumed serialization must be very fast and efficient by now. I thought that since Java serialization is binary, and language dependent, it must be much faster and more efficient than XML or JSON. Unfortunately, I was wrong: if you are concerned about performance, I would recommend avoiding Java serialization.

Now, don't get me wrong, I'm not trying to dis Java. Java serialization has many requirements, the main one being able to serialize anything (or at least anything that implements Serializable), into any other JVM (even a different JVM version/implementation), even running a different version of the classes being serialized (as long as you set a serialVersionUID). The main thing is it just works, and that is really great. Performance is not the main requirement, and the format is standard and must be backward compatible, so optimization is very difficult. Also, for many types of use cases, Java serialization performs very well.

I got started on this journey into the bowels of serialization while working on a three tier concurrency benchmark. I noticed a lot of the CPU time being spent inside Java serialization, so I decided to investigate. I started by serializing a simple Order object that had a couple of fields. I serialized the object and output the bytes. Although the Order object only had a few bytes of data, I was not so naive as to think it would serialize to only a few bytes; I knew enough about serialization to know it would at least need to write out the full class name, so it would know what it had serialized and could read it back. So I expected maybe 50 bytes or so. The result was over 600 bytes, and that's when I realized Java serialization was not as simple as I had imagined.

Java serialization bytes for Order object

----sr--model.Order----h#-----J--idL--customert--Lmodel/Customer;L--descriptiont--Ljava/lang/String;L--orderLinest--Ljava/util/List;L--totalCostt--Ljava/math/BigDecimal;xp--------ppsr--java.util.ArrayListx-----a----I--sizexp----w-----sr--model.OrderLine--&-1-S----I--lineNumberL--costq-~--L--descriptionq-~--L--ordert--Lmodel/Order;xp----sr--java.math.BigDecimalT--W--(O---I--scaleL--intValt--Ljava/math/BigInteger;xr--java.lang.Number-----------xp----sr--java.math.BigInteger-----;-----I--bitCountI--bitLengthI--firstNonzeroByteNumI--lowestSetBitI--signum[--magnitudet--[Bxq-~----------------------ur--[B------T----xp----xxpq-~--xq-~--
(note "-" means an unprintable character)

As you may have noticed, Java serialization writes out not only the full class name of the object being serialized, but also the entire class definition of the class being serialized, and all of the referenced classes. The class definition can be quite large, and seems to be the main performance and efficiency issue, especially when writing out a single object. If you are writing out a large number of objects of the same class, then the class definition overhead is not normally a big issue. One other thing I noticed is that if your object has a reference to a class (such as a meta-data object), then Java serialization will write out the entire class definition, not just the class name, so using Java serialization to write out meta-data is very expensive.

Externalizable

It is possible to optimize Java serialization by implementing the Externalizable interface. Implementing this interface avoids writing out the entire class definition; just the class name is written. It requires that you implement the readExternal and writeExternal methods, so it requires some work and maintenance on your part, but it is faster and more efficient than just implementing Serializable.

One interesting note on the results for Externalizable is that it was much more efficient for a small number of objects, but actually output slightly more bytes than Serializable for a large number of objects. I assume the Externalizable format is slightly less efficient for repeated objects.

Externalizable class

public class Order implements Externalizable {
    private long id;
    private String description;
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List orderLines = new ArrayList();
    private Customer customer;

    public Order() {
    }

    public void readExternal(ObjectInput stream) throws IOException, ClassNotFoundException {
        this.id = stream.readLong();
        this.description = (String)stream.readObject();
        this.totalCost = (BigDecimal)stream.readObject();
        this.customer = (Customer)stream.readObject();
        this.orderLines = (List)stream.readObject();
    }

    public void writeExternal(ObjectOutput stream) throws IOException {
        stream.writeLong(this.id);
        stream.writeObject(this.description);
        stream.writeObject(this.totalCost);
        stream.writeObject(this.customer);
        stream.writeObject(this.orderLines);
    }
}

Externalizable serialization bytes for Order object

----sr--model.Order---*3--^---xpw---------psr--java.math.BigDecimalT--W--(O---I--scaleL--intValt--Ljava/math/BigInteger;xr--java.lang.Number-----------xp----sr--java.math.BigInteger-----;-----I--bitCountI--bitLengthI--firstNonzeroByteNumI--lowestSetBitI--signum[--magnitudet--[Bxq-~----------------------ur--[B------T----xp----xxpsr--java.util.ArrayListx-----a----I--sizexp----w-----sr--model.OrderLine-!!|---S---xpw-----pq-~--q-~--xxx

Other Serialization Options

I started to investigate what other serialization options there were in Java. I started with EclipseLink MOXy, which supports serializing objects to XML or JSON through the JAXB API. I was not expecting XML serialization to outperform Java serialization, so I was quite surprised when it did for certain use cases. I also found Kryo, an open source project for optimized serialization. I also investigated the Oracle Coherence POF serialization format. There are pros and cons to each product, but my main focus was on how their performance compared, and how efficient they were.

EclipseLink MOXy - XML and JSON

The main advantage of using EclipseLink MOXy to serialize to XML or JSON is that both are standard, portable formats. You can access the data from any client using any language, so are not restricted to Java, as with Java serialization. You can also integrate your data with web services and REST services. Both formats are also text based, so human readable. No coding or special interfaces are required, only meta-data. The performance is quite acceptable, and outperforms Java serialization for small data-sets.

The drawbacks are that the text formats are less efficient than optimized binary formats, and JAXB requires meta-data, so you need to annotate your classes with JAXB annotations or provide an XML configuration file. Also, circular references are not handled by default; you need to use @XmlIDREF to handle cycles.

JAXB annotated classes

@XmlRootElement
public class Order {
    @XmlID
    @XmlAttribute
    private long id;
    @XmlAttribute
    private String description;
    @XmlAttribute
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List orderLines = new ArrayList();
    private Customer customer;
}

public class OrderLine {
    @XmlIDREF
    private Order order;
    @XmlAttribute
    private int lineNumber;
    @XmlAttribute
    private String description;
    @XmlAttribute
    private BigDecimal cost = BigDecimal.valueOf(0);
}

EclipseLink MOXy serialization XML for Order object

<order id="0" totalCost="0"><orderLines lineNumber="1" cost="0"><order>0</order></orderLines></order>

EclipseLink MOXy serialization JSON for Order object

{"order":{"id":0,"totalCost":0,"orderLines":[{"lineNumber":1,"cost":0,"order":0}]}}

Kryo

Kryo is a fast, efficient serialization framework for Java. Kryo is an open source project hosted on Google Code and provided under the New BSD license. It is a small project with only 3 members; it first shipped in 2009 and last shipped the 2.21 release in February 2013, so it is still actively being developed.

Kryo works similarly to Java serialization, and respects transient fields, but does not require a class to be Serializable. I found Kryo to have some limitations, such as requiring classes to have a default constructor, and I encountered some issues serializing the java.sql.Time, java.sql.Date and java.sql.Timestamp classes.
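
Basic Kryo usage looks something like the following. This is a minimal sketch against the Kryo 2.x API; the file name and stream handling are just for illustration:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

Kryo kryo = new Kryo();

// write the class and object graph to a stream
Output output = new Output(new FileOutputStream("order.bin"));
kryo.writeClassAndObject(output, order);
output.close();

// read it back
Input input = new Input(new FileInputStream("order.bin"));
Order copy = (Order) kryo.readClassAndObject(input);
input.close();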

Kryo serialization bytes for Order object

------java-util-ArrayLis-----model-OrderLin----java-math-BigDecima---------model-Orde-----

Oracle Coherence POF

The Oracle Coherence product provides its own optimized binary format called POF (portable object format). Oracle Coherence is an in-memory data grid solution (distributed cache). Coherence is a commercial product, and requires a license. EclipseLink supports an integration with Oracle Coherence through the Oracle TopLink Grid product that uses Coherence as the EclipseLink shared cache.

POF provides a serialization framework, and can be used independently of Coherence (if you already have Coherence licensed). POF requires that your class implement a PortableObject interface and read/write methods. You can also implement a separate Serializer class, or use annotations in the latest Coherence release. POF requires that each class be assigned a constant id ahead of time, so you need some way to determine this id. The POF format is a binary format: very compact, efficient, and fast, but it does require some work on your part.

The total bytes for POF was 32 bytes for a single Order/OrderLine object, and 1593 bytes for 100 OrderLines. I'm not going to give the results, as POF is part of a commercially licensed product, but it was very fast.

POF PortableObject

public class Order implements PortableObject {
    private long id;
    private String description;
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List orderLines = new ArrayList();
    private Customer customer;

    public Order() {
    }
    
    public void readExternal(PofReader in) throws IOException {
        this.id = in.readLong(0);
        this.description = in.readString(1);
        this.totalCost = in.readBigDecimal(2);
        this.customer = (Customer)in.readObject(3);
        this.orderLines = (List)in.readCollection(4, new ArrayList());
    }

    public void writeExternal(PofWriter out) throws IOException {
        out.writeLong(0, this.id);
        out.writeString(1, this.description);
        out.writeBigDecimal(2, this.totalCost);
        out.writeObject(3, this.customer);
        out.writeCollection(4, this.orderLines);
    }
}

POF serialization bytes for Order object

-----B--G---d-U------A--G-------

Results

So how does each perform? I made a simple benchmark to compare the different serialization mechanisms. I compared the serialization of two different use cases. The first is a single Order object with a single OrderLine object. The second is a single Order object with 100 OrderLine objects. I compared the average serialization operations per second, and measured the size in bytes of the serialized data. Different object models, use cases, and environments will give different results, but this gives you a general idea of the performance differences between the serializers.

The results show that Java serialization is slow for a small number of objects, but good for a large number of objects. Conversely, XML and JSON can outperform Java serialization for a small number of objects, but Java serialization is faster for a large number of objects. Kryo and other optimized binary serializers outperform Java serialization with both types of data.

You may wonder why something that takes less than a millisecond is of any relevance to performance, and you may be right. In general, you would only have a real performance problem if you wrote out a very large number of objects, and in that case Java serialization performs quite well, so is the fact that it performs very poorly for a small number of objects relevant? For a single operation, probably not, but if you execute many small serialization operations, then the cost is relevant. A typical server servicing many clients will typically process many small requests, so while the cost of serialization is not great enough to make any single request take a long time, it will dramatically affect the scalability of the server.

Order with 1 OrderLine

Serializer           | Size (bytes) | Serialize (operations/second) | Deserialize (operations/second) | % Difference (from Java serialize) | % Difference (deserialize)
Java Serializable    | 636          | 128,634                       | 19,180                          | 0%                                 | 0%
Java Externalizable  | 435          | 160,549                       | 26,678                          | 24%                                | 39%
EclipseLink MOXy XML | 101          | 348,056                       | 47,334                          | 170%                               | 146%
Kryo                 | 90           | 359,368                       | 346,984                         | 179%                               | 1,709%

Order with 100 OrderLines

Serializer           | Size (bytes) | Serialize (operations/second) | Deserialize (operations/second) | % Difference (from Java serialize) | % Difference (deserialize)
Java Serializable    | 2,715        | 16,470                        | 10,215                          | 0%                                 | 0%
Java Externalizable  | 2,811        | 16,206                        | 11,483                          | -1%                                | 12%
EclipseLink MOXy XML | 6,628        | 7,304                         | 2,731                           | -55%                               | -73%
Kryo                 | 1,216        | 22,862                        | 31,499                          | 38%                                | 208%

EclipseLink JPA

In EclipseLink 2.6 development builds, and to some degree 2.5, we have added the ability to choose your serializer anywhere that EclipseLink does serialization.

One such place is in serialized @Lob mappings. You can now use the @Convert annotation to specify a serializer such as @Convert(XML), @Convert(JSON), @Convert(Kryo). In addition to optimizing performance, this provides an easy mechanism to write XML and JSON data to your database.

Also for EclipseLink cache coordination you can choose your serializer using the "eclipselink.cache.coordination.serializer" property.

The source code for the benchmarks used in this post can be found here, or downloaded here.

Thursday, June 20, 2013

EclipseLink supports HQL and several advanced new JPQL features

EclipseLink 2.5 added several new JPQL features and supported syntax.
These include:
  • HQL compatibility
  • Array expressions
  • Hierarchical selects
  • Historical queries

HQL Compatibility

A common issue for users migrating from Hibernate to EclipseLink is that the HQL syntax of not having a SELECT clause is not supported, as it is not standard JPQL syntax. EclipseLink now supports this syntax. For queries that will return the entire object, this allows you to omit the SELECT clause, and start the query with the FROM clause.
from Employee e where e.salary > 50000

Array Expressions

Previously in JPQL you could use the IN operation and sub-selects to compare a single value, but what if you wanted to compare multiple values, such as composite ids? You would then have to dynamically generate very long AND/OR expression trees to compare each value one by one, which is difficult and cumbersome. Many databases provide a much better solution to this problem with array expressions; these allow arrays in sub-select comparisons, or nested within an IN comparison.

EclipseLink now supports array expressions with JPQL. EclipseLink also uses array expressions internally when object comparisons are done in JPQL and objects with composite ids.

Select e from Employee e where (e.firstName, e.lastName) IN :names

Select e from Employee e where (e.firstName, e.lastName) IN (Select m.firstName, m.lastName from Managers m)

Hierarchical Selects

Traditionally it has been very difficult to query hierarchical trees in relational data. For example querying all employees who work under a particular manager. Querying one level is simple, two levels is possible, but querying the entire depth of the tree, when you don't know how deep the tree is, is very difficult.

Some databases support a special syntax for this type of query. In Oracle the CONNECT BY clause allows for hierarchical queries to be expressed. EclipseLink now supports a CONNECT BY clause in JPQL, to support hierarchical queries on databases that support CONNECT BY.

Select e from Employee e START WITH e.id = :id CONNECT BY e.managedEmployees

Select e from Employee e START WITH e.id = :id CONNECT BY e.managedEmployees ORDER SIBLINGS BY e.salary where e.salary > 100000

Historical Queries

Historical queries allow you to query back in time. This requires that you use EclipseLink's history support, or use a database that supports historical queries such as Oracle's flashback support. Historical queries use the AS OF clause to query an entity as of a point in time (or an Oracle SCN for flashback). This provides the ability to do some pretty cool queries and analytic on your data.
Select e from Employee e AS OF :date where e.id = :id

Select e from Employee e, Employee e2 AS OF :date where e = e2 and e.salary > e2.salary

Monday, June 10, 2013

Cool performance features of EclipseLink 2.5

The main goal of the EclipseLink 2.5 release was the support of the JPA 2.1 specification, as EclipseLink 2.5 was the reference implementation for JPA 2.1. For a list of JPA 2.1 features look here, or here.

Most of the features that went into the release were to support JPA 2.1 features, so there was not a lot of development time for other features. However, I was still able to sneak in a few cool new performance features. The features are not well documented yet, so I thought I would outline them here.

Indexing Foreign Keys

The first feature is auto indexing of foreign keys. Most people incorrectly assume that databases index foreign keys by default. Well, they don't. Primary keys are auto indexed, but foreign keys are not. This means any query based on the foreign key will be doing full table scans. This is any OneToMany, ManyToMany or ElementCollection relationship, as well as many OneToOne relationships, and most queries on any relationship involving joins or object comparisons. This can be a major performance issue, and you should always index your foreign key fields.

EclipseLink 2.5 makes indexing foreign key fields easy with a new persistence unit property:

"eclipselink.ddl-generation.index-foreign-keys"="true"
This will have EclipseLink create an index for all mapped foreign keys if EclipseLink is used to generate the persistence unit's DDL. Note that DDL generation is now standard in JPA 2.1, so to enable DDL generation in EclipseLink 2.5 you can now use:
"javax.persistence.schema-generation.database.action"="create"
EclipseLink 2.5 and JPA 2.1 also support several new DDL generation features, including allowing user scripts to be executed. See, DDL generation for more information.

Query Cache Invalidation

EclipseLink has always supported a query cache. Unlike the object cache, the query cache is not enabled by default, but must be enabled through the query hint "eclipselink.query-results-cache". The main issue with the query cache, is that the results of queries can change when objects are modified, so the query cache could become out of date. Previously the query cache did support time-to-live and daily invalidation through the query hint "eclipselink.query-results-cache.expiry", but would not be kept in synch with changes as they were made.

In EclipseLink 2.5 automatic invalidation of the query cache was added. So if you had a query "Select e from Employee e" and had enabled query caching, every execution of this query would hit the cache and avoid accessing the database. Then if you inserted a new Employee, in EclipseLink 2.5 the query cache for all queries for Employee will automatically get invalidated. The next query will access the database, and get the correct result, and update the cache so all subsequent queries will once again obtain cache hits. Since the query cache is now kept in synch, the new persistence unit property "eclipselink.cache.query-results"="true" was added to enable the query cache on all named queries. If, for some reason, you want to allow stale data in your query cache, you can disable invalidation using the QueryResultsCachePolicy.setInvalidateOnChange() API.
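
For example, a sketch of enabling the query cache on a named query through the query hint (the query and entity names are illustrative):

@Entity
@NamedQuery(
    name = "Employee.findAll",
    query = "Select e from Employee e",
    hints = { @QueryHint(name = "eclipselink.query-results-cache", value = "true") })
public class Employee {
  ...
}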

Query cache invalidation is also integrated with cache coordination, so even if you modify an Employee on another server in your cluster, the query cache will still be invalidated. The query cache invalidation is also integrated with EclipseLink's support for Oracle Database Change Notification. If you have other applications accessing your database, you can keep the EclipseLink cache in synch with an Oracle database using the persistence unit property "eclipselink.cache.database-event-listener"="DCN". This support was added in EclipseLink 2.4, but in EclipseLink 2.5 it will also invalidate the query cache.

Tuners

EclipseLink 2.5 added an API to make it easier to provide tuning configuration for a persistence unit. The SessionTuner API allows a set of tuning properties to be configured in one place, and provides deployment time access to the EclipseLink Session and persistence unit properties. This makes it easy to have a development, debug, and production configuration of your persistence unit, or provide different configurations for different hardware. The SessionTuner is set through the persistence unit property "eclipselink.tuning".

Concurrent Processing

The most interesting performance feature provided in EclipseLink 2.5 is still in a somewhat experimental stage. The feature allows for a session to make use of concurrent processing.

There is no public API to configure it as of yet, but if you are interested in experimenting it is easy to set through a SessionCustomizer or SessionTuner.


import org.eclipse.persistence.config.SessionCustomizer;
import org.eclipse.persistence.internal.sessions.AbstractSession;
import org.eclipse.persistence.sessions.Session;

public class MyCustomizer implements SessionCustomizer {
  public void customize(Session session) {
    // enable the (experimental) concurrent result set processing and load group loading
    ((AbstractSession)session).setIsConcurrent(true);
  }
}

Currently this enables two main features, one is the concurrent processing of result sets. The other is the concurrent loading of load groups.

In any JPA object query there are three parts. The first is the execution of the query, the second is the fetching of the data, and the third is the building of the objects. Normally the query is executed, all of the data is fetched, then the objects are built from the data. With concurrency enabled two threads will be used instead, one to fetch the data, and one to build the objects. This allows two things to be done at the same time, allowing less overall time (but the same amount of CPU). This can provide a benefit if you have a multi-CPU machine, or even if you don't, it allows the client to be doing processing at the same time as the database machine.

The second feature allows all of the relationships for all of the resulting objects to be queried and built concurrently (only when using a shared cache). So, if you queried 32 Employees and also wanted each Employee's address, the address queries could all be executed and built concurrently, resulting in significantly less response time. This requires the usage of a LoadGroup to be set on the query. LoadGroup defines a new API setIsConcurrent() to allow concurrency to be enabled (this defaults to true when a session is set to be concurrent).

A LoadGroup can be configured on a query using the query hint "eclipselink.load-group", "eclipselink.load-group.attribute", or through the JPA 2.1 EntityGraph query hint "javax.persistence.loadgraph".
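
For example, a sketch using the JPA 2.1 entity graph hint to have the address relationship of each Employee loaded (and, with a concurrent session, loaded concurrently); the entity and attribute names are illustrative:

EntityGraph<Employee> graph = entityManager.createEntityGraph(Employee.class);
graph.addAttributeNodes("address");

List<Employee> employees = entityManager
    .createQuery("Select e from Employee e", Employee.class)
    .setHint("javax.persistence.loadgraph", graph)
    .getResultList();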

Note that for concurrency to improve your application's performance you need to have spare CPU time. So, to benefit the most you need multiple-CPUs. Also, concurrency will not help you scale an application server that is already under load from multiple client requests. Concurrency does not use less CPU time, it just allows for the CPUs to be used more efficiently to improve response times.

Tuesday, May 28, 2013

Batch Writing, and Dynamic vs Parametrized SQL, how well does your database perform?

One of the most effective database optimizations is batch writing. Batch writing is supported by most modern databases, is part of the JDBC standard, and is supported by most JPA providers.

Normal database access consists of sending each DML (insert, update, delete) statement to the database in a separate database/network access. Each database access has a certain amount of overhead to it, and the database must process each statement independently. Batch writing has two forms, dynamic and parametrized. Parametrized is the most common, and normally provides the best benefit, as dynamic can have parsing issues.

To understand batch writing, you must first understand parametrized SQL. SQL execution is composed of two parts, the parse and the execute. The parse consists of turning the string SQL representation into the database representation. The execute consists of executing the parsed SQL on the database. Databases and JDBC support bind parameters, so the arguments to the SQL (the data) do not have to be embedded in the SQL. This avoids the cost of converting the data into text, and allows the same SQL statement to be reused with multiple executions. This allows for a single parse and multiple executes, aka "parametrized SQL". Most JDBC DataSource implementations and JPA providers support parametrized SQL and statement caching; this effectively avoids ever having a parse in a running application.

Example dynamic SQL

INSERT INTO EMPLOYEE (ID, NAME) VALUES (34567, 'Bob Smith')

Example parametrized SQL

INSERT INTO EMPLOYEE (ID, NAME) VALUES (?, ?)

Parametrized batch writing involves executing a single DML statement, but with a set of bind parameters for multiple homogenous statements, instead of bind parameters for a single statement. This effectively allows a large batch of homogenous inserts, updates, or deletes to be processed by the database and network as a single operation, instead of n operations. The database only needs to perform the minimal amount of work, as there is only a single statement, so at most only a single parse. It is also compatible with statement caching, so no statement parsing needs to occur at all. The limitation is that all of the statements' SQL must be identical. So, it works really well for, say, inserting 1,000 Orders, as the insert SQL is the same for each Order, only the bind parameters differ. But it does not help for inserting 1 Order, or for inserting 1 Order, 1 OrderLine, and 1 Customer. Also, all of the statements must be part of the same database transaction.

Dynamic batch writing involves chaining a bunch of heterogeneous dynamic SQL statements into a single block, and sending the entire block to the database in a single database/network access. This is beneficial in that there is only a single network access, so if the database is remote or across a slow network, this can make a big difference. The drawback is that parameter binding is not allowed, and the database must parse this huge block of SQL when it receives it. In some cases the parsing costs can outweigh the network benefits. Also, dynamic SQL is not compatible with statement caching, as each SQL statement is different.

JDBC standardizes batch writing through its Statement and PrepareStatement batch APIs (as of JDBC 2.0, which was JDK 1.2, aka a long time ago). The JDBC batch API requires different JDBC code, so if you are using raw JDBC, you need to rewrite your code to switch between batching and non-batching APIs. Most JDBC drivers now support these APIs, but some do not actually send the DML to the database as a batch, they just emulate the APIs. So how do you know if you are really getting batch writing? The only real way is to test it, and measure the performance difference.
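
For reference, a minimal sketch of the JDBC parametrized batch API, assuming an open Connection and an Employee class with getId()/getName() accessors (both are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;

PreparedStatement statement = connection.prepareStatement(
    "INSERT INTO EMPLOYEE (ID, NAME) VALUES (?, ?)");
for (Employee employee : employees) {
    statement.setLong(1, employee.getId());
    statement.setString(2, employee.getName());
    statement.addBatch();  // queue this row's bind parameters
}
statement.executeBatch();  // send all queued rows to the database as one batch
statement.close();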

The JPA specification does not standardize batch writing configuration, but most JPA providers support it. Normally batch writing is enabled in JPA through persistence unit properties, so turning it on or off is a simple matter of configuration, and requires no coding changes. Some JPA providers may not support batch writing when using optimistic locking, and may not re-order SQL to enable it to be batched, so even with batch writing enabled, you may still not be getting batching writing. Always test your application with batch writing on and off, and measure the difference to ensure it is actually functioning.

EclipseLink supports both parametrized and dynamic batch writing (since EclipseLink 1.0). In EclipseLink, batch writing is enabled through the "eclipselink.jdbc.batch-writing" persistence unit property. EclipseLink provides three options, "JDBC", "Buffered", and "Oracle-JDBC". The "JDBC" option should always be used.

"Buffered" is for JDBC drivers that do not support batch writing, and chains dynamic SQL statements into a single block itself. "Buffered" does not support parametrized SQL, and is not recommended.

"Oracle-JDBC" uses the Oracle database JDBC API that predates the JDBC standard API, and is now obsolete. Previous to EclipseLink 2.5, this option allowed batch writing when using optimistic locking, but now the regular "JDBC" option supports optimistic locking.

EclipseLink 2.5 supports batch writing with optimistic locking on all (compliant) database platforms, whereas previously it was only supported on selected database platforms. EclipseLink 2.5 also provides an "eclipselink.jdbc.batch-writing" query hint to disable batch writing for native queries that cannot be batched (such as DDL or stored procedures on some database platforms).

EclipseLink supports parametrized SQL through the "eclipselink.jdbc.bind-parameters", and "eclipselink.jdbc.cache-statements" persistence unit properties. However, these don't normally need to be set, as parameter binding is the default, so you would only set the property to disable binding. Statement caching is not on by default, but only relevant to EclipseLink if using EclipseLink's connection pooling, if you are using a JDBC or Java EE DataSource, then you must configure statement caching in your DataSource config.

When batch writing is enabled in EclipseLink, it is parametrized batch writing by default. To enable dynamic batch writing, you must disable parameter binding. The same applies to enabling buffered batch writing.

Supporting batch writing is not incredibly difficult, and most JPA providers support it; ordering the SQL such that it can be batched is the difficult part. During a commit or flush operation, EclipseLink automatically groups SQL by table to ensure homogenous SQL statements can be batched (while still maintaining referential integrity constraints and avoiding deadlocks). Most JPA providers do not do this, so even if they support batch writing, a lot of the time the SQL does not benefit from batching.

To enable batch writing in EclipseLink, add the following persistence unit property:

"eclipselink.jdbc.batch-writing"="JDBC"
You can also configure the batch size using the "eclipselink.jdbc.batch-writing.size" persistence unit property. The default size is 100.
"eclipselink.jdbc.batch-writing.size"="1000"

Batch writing is very database, and JDBC driver dependent. So I was interested in which databases, drivers it worked with, and what the benefit was. I made two tests, one does a batch of 50 inserts, and one does a batch of 100 updates (using optimistic locking). I tried all of the batch writing options, as well as not using any batching.

Note, this is not a database benchmark, I am not comparing the databases between each other, only to themselves.

Each database is running on different hardware, some are local, and some are across a network, so do not compare one database to another. The data of interest is the percentage benefit enabling batch writing has over not using batch writing. For the insert test I also measured the difference between using parametrized versus dynamic SQL, and parametrized SQL without statement caching. The result is the number of transactions processed in 10 seconds (run 5 times, and averaged), so a bigger number is a better result.

Database: MySQL Version: 5.5.16
Driver: MySQL-AB JDBC Driver Version: mysql-connector-java-5.1.22

Insert Test
Option                                 | Average Result | % Difference from non batched
parametrized-sql, no batch             | 483            | 0%
dynamic-sql, no batch                  | 499            | 3%
parametrized-sql, no statement caching | 478            | -1%
dynamic-sql, batch                     | 499            | 3%
parametrized-sql, batch                | 509            | 5%

Update Test

Option                  | Average Result | % Difference from non batched
parametrized-sql        | 245            | 0%
dynamic-sql, batch      | 244            | 0%
parametrized-sql, batch | 248            | 1%

So the results seem to indicate batch writing has no effect whatsoever (5% is within the variance). What this really means is that the MySQL JDBC driver does not actually use batch processing; it just emulates the JDBC batch APIs and executes statements one by one underneath.

MySQL does have batch processing support though, it just requires different SQL. The MySQL JDBC driver does support this, but requires the rewriteBatchedStatements=true JDBC connect property to be set. This can easily be set by modifying your connect URL, such as:

jdbc:mysql://localhost:3306/test?rewriteBatchedStatements=true

MySQL: rewriteBatchedStatements=true

Insert Test
Option                                 | Average Result | % Difference from non batched
parametrized-sql, no batch             | 504            | 0%
dynamic-sql, no batch                  | 508            | 0%
parametrized-sql, no statement caching | 483            | -4%
dynamic-sql, batch                     | 1,292          | 156%
parametrized-sql, batch                | 2,181          | 332%

Update Test

Option                  | Average Result | % Difference from non batched
parametrized-sql        | 250            | 0%
dynamic-sql, batch      | 669            | 167%
parametrized-sql, batch | 699            | 179%

So, it appears batch writing does make a big difference in MySQL, if configured correctly (why the JDBC driver does not do this by default, I have no idea). Parametrized batch writing does the best, being 332% faster for inserts, and 179% faster for updates. Dynamic batch writing also performs quite well. Interestingly there appears to be little difference between dynamic and parametrized SQL on MySQL (my guess is either MySQL is really fast at parsing, or does little optimization for prepared statements).

Database: PostgreSQL Version: 9.1.1
Driver: PostgreSQL 8.4 JDBC4

Insert Test
Option                                 | Average Result | % Difference from non batched
parametrized-sql, no batch             | 479            | 0%
dynamic-sql, no batch                  | 418            | -12%
parametrized-sql, no statement caching | 428            | -10%
dynamic-sql, buffered                  | 1,127          | 135%
dynamic-sql, batch                     | 1,127          | 135%
parametrized-sql, batch                | 2,037          | 325%

Update Test

Option                  | Average Result | % Difference from non batched
parametrized-sql        | 233            | 0%
dynamic-sql, batch      | 395            | 69%
parametrized-sql, batch | 707            | 203%

The results show batch writing makes a big difference on PostgreSQL. Parametrized batch writing performs the best, being 325% faster for inserts, and 203% faster for updates. Dynamic batch writing also performs quite well. For PostgreSQL I also measured the performance of EclipseLink's buffered batch writing, which performs the same as dynamic JDBC batch writing, so I assume the driver is doing the same thing. Parametrized SQL outperforms dynamic SQL by about 10%, but parametrized SQL without statement caching performs similarly to dynamic SQL.

Database: Oracle Database 11g Enterprise Edition Release 11.1.0.7.0
Driver: Oracle JDBC driver Version: 11.2.0.2.0

Insert Test
Option                                 | Average Result | % Difference from non batched
parametrized-sql, no batch             | 548            | 0%
dynamic-sql, no batch                  | 494            | -9%
parametrized-sql, no statement caching | 452            | -17%
dynamic-sql, buffered                  | 383            | -30%
dynamic-sql, batch                     | 489            | -10%
parametrized-sql, batch                | 3,308          | 503%

Update Test

Option                  | Average Result | % Difference from non batched
parametrized-sql        | 282            | 0%
dynamic-sql, batch      | 258            | -8%
parametrized-sql, batch | 1,672          | 492%

The results show parametrized batch writing makes a big difference on Oracle, being 503% faster for inserts, and 492% faster for updates. Dynamic batch writing does not provide any benefit, this is because Oracle's JDBC driver just emulates dynamic batch writing and executes statements one by one, so it has the same performance as dynamic SQL. Buffered batch writing actually has worse performance than not batching at all. This is because of the parsing cost for the huge block of dynamic SQL, this may vary in different configurations, if the database is remote or across a slow network, I have seen this provide a benefit.

Parametrized SQL with statement caching provides about a 10% benefit over dynamic SQL, and points out that to benefit from parametrized SQL you need to use statement caching, otherwise the performance can be worse than dynamic SQL. Of course there are other benefits to parametrized SQL, as it removes CPU processing from the server, which may not help much in this single threaded case, but can make a huge difference in a multi-threaded case where the database is a bottleneck.

Database: Apache Derby Version: 10.9.1.0 - (1344872)
Driver: Apache Derby Embedded JDBC Driver Version: 10.9.1.0 - (1344872)
(local)

Insert Test
Option                                 | Average Result | % Difference from non batched
parametrized-sql, no batch             | 3,027          | 0%
dynamic-sql, no batch                  | 24             | -99%
parametrized-sql, no statement caching | 50             | -98%
dynamic-sql, batch                     | 24             | -99%
parametrized-sql, batch                | 3,252          | 7%

Update Test

Option                  | Average Result | % Difference from non batched
parametrized-sql        | 1,437          | 0%
dynamic-sql, batch      | 6              | -99%
parametrized-sql, batch | 2,172          | 51%

The results show parametrized batch writing makes a difference on Derby, being 7% faster for inserts, and 51% faster for updates. This difference is not as big as on other databases because my database was local. For a networked database it would be a bigger difference, but this does show that batch writing can provide a benefit even for local databases, so it is not just a network optimization. The really interesting results from Derby are the horrible performance of the dynamic and non-cached statements. This shows that Derby has a huge parsing cost, so if you are using Derby, using parametrized SQL with statement caching is really important.

Database: DB2/NT64 Version: SQL09070
Driver: IBM Data Server Driver for JDBC and SQLJ Version: 4.0.100

The results are basically similar to Oracle, in that parametrized batch writing gives a big performance benefit. Dynamic batch writing has worse performance than no batching with parametrized SQL, and dynamic SQL and parametrized SQL without statement caching also result in worse performance.

Database: Microsoft SQL Server Version: 10.50.1617
Driver: Microsoft SQL Server JDBC Driver 2.0 Version: 2.0.1803.100

The results were similar to PostgreSQL, showing both parametrized and dynamic batch writing providing a significant benefit. Parametrized batch writing performed the best, and parametrized SQL outperformed dynamic SQL, and no statement caching.

** UPDATE **

It was requested that I also test H2 and HSQL, so here are the results.

Database: H2 Version: 1.3.167 (2012-05-23)
Driver: H2 JDBC Driver Version: 1.3.167 (2012-05-23)
(local)

Insert Test
Option                                 | Average Result | % Difference from non batched
parametrized-sql, no batch             | 4,757          | 0%
dynamic-sql, no batch                  | 3,210          | -32%
parametrized-sql, no statement caching | 4,757          | 0%
dynamic-sql, buffered                  | 1,935          | -59%
dynamic-sql, batch                     | 3,293          | -30%
parametrized-sql, batch                | 5,753          | 20%

The results show H2 performs 20% faster with parametrized batch writing. H2 is an in-memory database (backed by a persistent log file), so it is not expected to benefit as much, as there is no network involved. Dynamic batch writing and dynamic SQL perform worse than parametrized SQL. Interestingly, using statement caching with parametrized SQL makes no difference. My assumption is that H2 is always caching prepared statements in its connection, so the user does not need to do their own statement caching.

Database: HSQL Database Engine Version: 1.8.1
Driver: HSQL Database Engine Driver Version: 1.8.1
(local)

Insert Test
Option                                 | Average Result | % Difference from non batched
parametrized-sql, no batch             | 7,319          | 0%
dynamic-sql, no batch                  | 5,054          | -30%
parametrized-sql, no statement caching | 6,776          | -7%
dynamic-sql, batch                     | 5,500          | -24%
parametrized-sql, batch                | 9,176          | 25%

The results show HSQL performs 25% faster with parametrized batch writing. HSQL is an in-memory database (backed by a persistent log file), so it is not expected to benefit as much, as there is no network involved. Dynamic batch writing and dynamic SQL perform worse than parametrized SQL.

Thursday, March 7, 2013

But what if I'm not querying by id? (database and cache indexes)

Most data models define a sequence generated numeric primary key. This is the most efficient key to use as it is a single efficient guaranteed unique value. Some applications also use a UUID, which is a little less efficient in terms of time and space, but has its advantages in distributed systems and databases.

In JPA the primary key is defined as the JPA Id, and since JPA's object cache is indexed by Id, any queries by this Id obtain cache hits, and avoid database access. This is great when querying by id, and for traversing relationships across foreign keys based on the id, but what about queries not using the id?

Sequence generated ids are great for computers and databases, but are not very useful to people. How many times have you been to a store or website and, to look up your record, been asked for your sequence generated id?

Probably not very often; you are more likely to be asked for your phone number, email address, ssn or other such key that is easy to remember. These keys can be considered alternative or secondary keys, and are very common in databases. Querying by alternative keys is very common, so it is important that these queries perform optimally.

The most important thing to ensure is that the columns are properly indexed in the database, otherwise each query will require a full table scan. The second thing to ensure is that the columns are indexed in the JPA object cache.

EclipseLink Database Indexes

To create a database index on a column you can use your own DDL script, or use the @Index annotation in EclipseLink (since EclipseLink 2.2). JPA 2.1 will also define its own @Index annotation. For the EclipseLink index annotation, you can simply put it on the attributes you would like to index.

@Entity
@Index(columnNames={"F_NAME", "L_NAME"})
public class Employee {
  @Id
  private long id;
  @Index
  @Column(name="F_NAME")
  private String firstName;
  @Index
  @Column(name="L_NAME")
  private String lastName;
  @Index(unique=true)
  private String ssn;
  @Index
  private String phoneNumber;
  @Index
  private String emailAddress;
  ...
}

JPA 2.1 Database Indexes

The upcoming JPA 2.1 spec (draft) also defines support for database indexes. EclipseLink 2.5 is the reference implementation for JPA 2.1, so in EclipseLink 2.5 (dev builds) you can also create database indexes using the JPA 2.1 annotations. This is a little bit more complex, as you cannot define the @Index on attributes, only inside table annotations.

@Entity
@Table(indexes={
  @Index(name="EMP_SSN_INDEX", unique=true, columnList={"SSN"}),
  @Index(name="EMP_PHONE_INDEX", columnList="PHONENUMBER"),
  @Index(name="EMP_EMAIL_INDEX", columnList="EMAILADDRESS"),
  @Index(name="EMP_F_NAME_INDEX", columnList="F_NAME"),
  @Index(name="EMP_L_NAME_INDEX", columnList="L_NAME"),
  @Index(name="EMP_NAME_INDEX", columnList={"F_NAME", "L_NAME"}) })
public class Employee {
  @Id
  private long id;
  @Column(name="F_NAME")
  private String firstName;
  @Column(name="L_NAME")
  private String lastName;
  private String ssn;
  private String phoneNumber;
  private String emailAddress;
  ...
}

EclipseLink Cache Indexes

EclipseLink also supports indexes on the object cache (since EclipseLink 2.4). This allows JPQL and Criteria queries on indexed fields to obtain cache hits, and avoid all database access. The @CacheIndex annotation is used to index an attribute, or set of columns. When a set of columns are indexed, any queries using the attributes that map to those columns will use the index.

Cache indexes only provide a benefit to queries that expect a single result. Indexing a field such as firstName would provide no benefit, as there are many results with the same first name, and EclipseLink could never be certain it has them all loaded in the cache, so it must access the database.

@Entity
@CacheIndex(columnNames={"F_NAME", "L_NAME"})
public class Employee {
  @Id
  private long id;
  @Column(name="F_NAME")
  private String firstName;
  @Column(name="L_NAME")
  private String lastName;
  @CacheIndex
  private String ssn;
  @CacheIndex
  private String phoneNumber;
  @CacheIndex
  private String emailAddress;
  ...
}
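
With the ssn attribute indexed in the cache, a query such as the following can obtain a cache hit and avoid the database entirely. This is a sketch; it uses the entity above and, as noted, expects a single result:

Employee employee = entityManager
    .createQuery("Select e from Employee e where e.ssn = :ssn", Employee.class)
    .setParameter("ssn", ssn)
    .getSingleResult();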

So, is indexing worth it? I created an example benchmark that shows the difference between indexed and non-indexed queries. These results were obtained accessing an Oracle database across a LAN from a desktop machine. Results are queries per second, so a bigger number is better. The test consisted of querying a Customer object by name, with a database size of 1,000 customers.

Config                      | Average Result (q/s) | % Difference
No index, no cache          | 7,155                | 0%
No index, cache             | 8,095                | 13%
Database index, cache       | 9,900                | 38%
Database index, cache index | 137,120              | 1,816%

The results show that the object cache indexed by id provides a small benefit (13%), as it avoids having to rebuild the object, but still has to access the database. Note the test did not access any relationships on Customer; if it had, the object cache would still provide a major benefit in avoiding relationship queries. The database index provides a better benefit (38%); this is a factor of the size of the table: the bigger the table, the bigger the benefit. The object cache index provides the best benefit, almost a 20x improvement.

This shows it is important to properly index your database and your cache. Avoid indexing everything, as there is a cost in maintaining indexes. For the database, index any commonly queried columns, for the cache, index any secondary key fields.

See the EclipseLink UserGuide for more info on caching.

The source code for the benchmarks used in this post can be found here, or downloaded here.

Thursday, January 3, 2013

Got Cache? improve data access by 97,322%

Caching is the most valuable optimization one can make in software development. Making something run faster is nice, but it can never beat not having to run anything at all, because you already have the result cached.

JPA caches many things. The most important thing to cache is the EntityManagerFactory, this is generally done for you in JavaEE, but in JavaSE you need to do this yourself, such as storing it in a static variable. If you don't cache your EntityManagerFactory, then your persistence unit will be redeployed on every call, which will really suck.
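
A minimal sketch of caching the EntityManagerFactory in Java SE (the persistence unit name and holder class are just placeholders):

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class PersistenceHolder {
    // deployed once per JVM, not once per request
    private static final EntityManagerFactory FACTORY =
        Persistence.createEntityManagerFactory("order-pu");

    public static EntityManager newEntityManager() {
        return FACTORY.createEntityManager();
    }
}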

Other caches in JPA include the cache of JDBC connections, the cache of JDBC statements, the result set cache, and most importantly the object cache, which is what I would like to discuss today.

JPA 1.0 did not define caching, although most JPA providers did support a cache in some form or another. JPA 2.0 defined caching through the @Cacheable annotation and the <shared-cache-mode> persistence.xml element. Some describe caching in JPA as two levels. Conceptually there is the L1 cache on an EntityManager, and the L2 cache on the EntityManagerFactory.

The EntityManager cache is an isolated, transactional cache, that only caches the objects read by that EntityManager, and shares nothing with other EntityManagers. The main purpose of the L1 cache is to maintain object identity (i.e. person == person.getSpouse().getSpouse()), and maintain transaction consistency. The L1 cache will also improve performance by avoiding querying the same object multiple times. The only way to avoid the L1 cache is to refresh, create a new EntityManager, or call clear().

The EntityManagerFactory cache is a shared cache across all EntityManagers, and reflects the current committed state of the database (stale data can be possible depending on your configuration and if you have other applications accessing the database). The main purpose of the L2 cache is to improve performance by avoiding queries for objects that have already been read. The L2 cache is normally what is referred to when caching is discussed in JPA, and what the JPA <shared-cache-mode> and @Cacheable refer to.

There are many types of caches provided by the various JPA providers. Some provide data caches, some provide object caches, some have relationship caches, some have query caches, some have distributed caches, or coordinated caches.

EclipseLink provides an object cache, what I would call a "live" object cache. I believe most other JPA providers provide a data cache. The difference between a data cache, and an object cache, is that a data cache just caches the object's row, where as an object cache caches the entire object, including its relationships.

Caching relationships is normally more important than caching the object's data, as each relationship normally represents a database query, so saving n database queries to build an object's relationships is more important than saving the 1 query for the object itself. Some JPA providers augment their data cache with a relationship cache, or a query cache. If a data cache caches relationships at all, it is normally in the form of caching only the ids of the related objects. This can be a major issue: consider caching a OneToMany relationship; if you only have a set of ids, then you need to query the database for each id that is not in the cache, causing n database queries. With an object cache, you have the related objects, so you never need to query the database.

The other advantage to caching objects is that you also save the cost of building the objects from the data. If the object or query is read-only, the cached object can be used directly, otherwise it only needs to be copied, not rebuilt from data.

EclipseLink also supports not caching relationships through the @Noncacheable annotation. Also the @Cache(isolation=PROTECTED) option can be used to ensure read-only entities and queries always copy the cached objects. So you can simulate a data cache with EclipseLink.
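
For example, a sketch of simulating a data cache on the Order entity from the benchmark (the mapping details are illustrative):

@Entity
@Cache(isolation = CacheIsolationType.PROTECTED)
public class Order {
    @Id
    private long id;

    // cache only the Order's own data, not the relationship
    @OneToMany(mappedBy = "order")
    @Noncacheable
    private List<OrderLine> orderLines;
    ...
}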

One should not underestimate the performance benefits of caching. Whereas other JPA optimizations may improve performance by 10-20%, or 2-5x for the major ones, caching has the potential to improve performance by factors of 100x or even 1000x.

So what are the numbers? In this simple benchmark I compare reading a simple Order object and its relationships (orderLines, customer). I compared the various caching options.
(Results are queries per second, so a bigger number is better; the test was single threaded, randomly querying an order from a data set of 1,000 orders; tests were run 5 times and averaged; the database was an Oracle database over a local area network; low end hardware was used.)

Cache Option              | Cache Config                                       | Average Result (q/s) | % Difference
No Cache                  | @Cacheable(false)                                  | 965                  | 0%
Object Cache              | @Cacheable(true)                                   | 36,544               | 3,686%
Object Cache              | @Cache(isolation=PROTECTED)                        | 35,107               | 3,538%
Data Cache                | @Cache(isolation=PROTECTED) + @Noncacheable(true)  | 1,889                | 95%
Read Only Cache           | @ReadOnly                                          | 940,123              | 97,322%
Protected Read Only Cache | @ReadOnly + @Cache(isolation=PROTECTED)            | 625,602              | 64,729%

The results show that although a data cache provides a significant benefit (~2x), it does not compare with an object cache (~100x). Marking the objects as @ReadOnly provides a significant additional benefit (~1000x).

The object cache, caches objects by their Id. This is great for find() or merge() operations, but does not help as much with queries. In EclipseLink any query by Id will also hit the object cache, but queries not by Id will have to hit the database. For each database result the object cache will still be checked, so the cost of building the objects and most importantly their relationships can still be avoided.

EclipseLink also supports a query cache. The query cache is configured independently of the object cache, and is configured per query, and not enabled by default. The query cache caches query results by query name and query parameters. This allows any query to obtain a cache hit. The query cache is configured through the "eclipselink.query-results-cache" query hint.

EclipseLink can execute queries in-memory against the object cache. This is not used by default, but can be configured on any query. Since the object cache does not normally contain the entire database, this works best with a FULL cache type, that has been preloaded. This is configured on the query through the "eclipselink.cache-usage" query hint.
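
For example, a sketch of an in-memory query executed against the object cache using the hint (the query is illustrative):

List<Order> orders = entityManager
    .createQuery("Select o from Order o where o.customer.id = :id", Order.class)
    .setParameter("id", customerId)
    .setHint("eclipselink.cache-usage", "CheckCacheOnly")
    .getResultList();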

This next benchmark compares the various caching options with a query. Each query is for the orders for a customer id, which results in 10 Order objects per query. Random customer ids are used.

Cache Option                      | Cache Config                                                  | Average Result (q/s) | % Difference
No Cache                          | @Cacheable(false)                                             | 186                  | 0%
Object Cache                      | @Cacheable(true)                                              | 1,021                | 448%
Object Cache                      | @Cache(isolation=PROTECTED)                                   | 1,085                | 483%
Data Cache                        | @Cache(isolation=PROTECTED) + @Noncacheable(true)             | 198                  | 6%
Read Only Query                   | "eclipselink.read-only"="true"                                | 1,391                | 647%
Read Only Query - Protected Cache | "eclipselink.read-only"="true" + @Cache(isolation=PROTECTED)  | 1,351                | 626%
Query Cache                       | "eclipselink.query-results-cache"="true"                      | 5,114                | 2,649%
In-memory Query                   | "eclipselink.cache-usage"="CheckCacheOnly"                    | 2,397                | 1,188%

This shows that the object cache can still provide a significant benefit to queries through the benefit of caching the relationships (~5x). The query cache performs the best with a ~25x benefit, and in-memory querying also performs well with a ~10x benefit. A data cache provides little benefit to queries.

I have measured the performance of several caching options in this post, but by no means have detailed all of the caching options in EclipseLink.
Other caching options available in EclipseLink include:

  • @Cache - type : FULL, WEAK, SOFT, SOFT_CACHE, HARD_CACHE
  • @Cache - size : size of cache in number of objects
  • @Cache - expiry : millisecond time to live expiry
  • @Cache - expiryTimeOfDay : daily expiry
  • @CacheIndex : non-id cache indexing
  • "eclipselink.cache.coordination" : clustered cache synchronization or invalidation
  • "eclipselink.cache.database-event-listener" : database event driven cache invalidation (Oracle DCN)
  • "eclipselink.query-results-cache.expiry" : query cache time to live expiry
  • "eclipselink.query-results-cache.expiry-time-of-day" : query cache daily expiry
  • TopLink Grid : integration with Oracle Coherence distributed cache
See the EclipseLink UserGuide for more info on caching.

The source code for the benchmarks used in this post can be found here, or downloaded here.

What caching options to use depends on the application and its data. Caching may not be suitable to all types of applications or data, but for those in which it is applicable, it will normally provide the biggest performance benefit that is attainable.