Java Persistence Performance
A blog on Java, performance, scalability, concurrency, object-relational mapping (ORM), Java Persistence API (JPA), persistence, databases, caching, Oracle, MySQL, NoSQL, XML, JSON, EclipseLink, TopLink, and other fun stuff. By James.
<h2>Optimizing Java Serialization - Java vs XML vs JSON vs Kryo vs POF</h2>
(2013-08-13)
Perhaps I'm naive, but I always thought Java serialization must surely be the fastest and most efficient way to serialize Java objects into binary form. After all, Java is on its 7th major release, so this is not new technology, and since every JDK seems to be faster than the last, I incorrectly assumed serialization must be very fast and efficient by now. I thought that since Java serialization is binary, and language dependent, it must be much faster and more efficient than XML or JSON. Unfortunately, I was wrong: if you are concerned about performance, I would recommend avoiding Java serialization.
<p>
Now, don't get me wrong, I'm not trying to dis Java. Java serialization has many requirements, the main one being able to serialize anything (or at least anything that implements <code>Serializable</code>), into any other JVM (even a different JVM version/implementation), even running a different version of the classes being serialized (as long as you set a <code>serialVersionUID</code>). The main thing is it just works, and that is really great. Performance is not the main requirement, and the format is standard and must be backward compatible, so optimization is very difficult. Also, for many types of use cases, Java serialization performs very well.
<p>
I got started on this journey into the bowels of serialization while working on a three-tier concurrency benchmark. I noticed a lot of the CPU time was being spent inside Java serialization, so I decided to investigate. I started by serializing a simple <code>Order</code> object that had a couple of fields. I serialized the object and output the bytes. Although the Order object only had a few bytes of data, I was not so naive as to think it would serialize to only a few bytes; I knew enough about serialization to expect it to at least write out the full class name, so it would know what it had serialized and could read it back. So I expected maybe 50 bytes or so. The result was over 600 bytes; that's when I realized Java serialization was not as simple as I had imagined.
<h4>Java serialization bytes for Order object</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>----sr--model.Order----h#-----J--idL--customert--Lmodel/Customer;L--descriptiont--Ljava/lang/String;L--orderLinest--Ljava/util/List;L--totalCostt--Ljava/math/BigDecimal;xp--------ppsr--java.util.ArrayListx-----a----I--sizexp----w-----sr--model.OrderLine--&-1-S----I--lineNumberL--costq-~--L--descriptionq-~--L--ordert--Lmodel/Order;xp----sr--java.math.BigDecimalT--W--(O---I--scaleL--intValt--Ljava/math/BigInteger;xr--java.lang.Number-----------xp----sr--java.math.BigInteger-----;-----I--bitCountI--bitLengthI--firstNonzeroByteNumI--lowestSetBitI--signum[--magnitudet--[Bxq-~----------------------ur--[B------T----xp----xxpq-~--xq-~--
</code></pre>
(note "-" means an unprintable character)
<p>
As you may have noticed, Java serialization writes out not only the full class name of the object being serialized, but also the entire class definition of the class being serialized, and all of the referenced classes. The class definition can be quite large, and seems to be the main performance and efficiency issue, especially when writing out a single object. If you are writing out a large number of objects of the same class, then the class definition overhead is not normally a big issue. One other thing I noticed is that if your object has a reference to a class (such as a meta-data object), then Java serialization will write the entire class definition, not just the class name, so using Java serialization to write out meta-data is very expensive.
<p>
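You can verify this overhead yourself by serializing a small object to a byte array and checking the size. A minimal sketch, using a hypothetical <code>Person</code> class as a stand-in for the benchmark's Order model:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSize {
    public static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        long id;
        String name;
        public Person(long id, String name) { this.id = id; this.name = name; }
    }

    // Serializes an object to a byte array and returns the byte count.
    public static int serializedSize(Object object) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws Exception {
        // The stream header and full class descriptor dominate the size of a
        // single small object; the actual field data is only a few bytes.
        System.out.println(serializedSize(new Person(1, "Bob")) + " bytes");
    }
}
```

The byte count for a single object is dominated by the class descriptor, which is exactly the overhead discussed above.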
<h3>Externalizable</h3>
It is possible to optimize Java serialization by implementing the <code>Externalizable</code> interface. Implementing this interface avoids writing out the entire class definition; just the class name is written. It requires that you implement the <code>readExternal</code> and <code>writeExternal</code> methods, so it requires some work and maintenance on your part, but it is faster and more efficient than just implementing <code>Serializable</code>.
<p>
One interesting note on the results for Externalizable is that it was much more efficient for a small number of objects, but actually output slightly more bytes than Serializable for a large number of objects. I assume the Externalizable format is slightly less efficient for repeated objects.
<p>
<h4>Externalizable class</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>public class Order implements Externalizable {
    private long id;
    private String description;
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List<OrderLine> orderLines = new ArrayList<OrderLine>();
    private Customer customer;

    public Order() {
    }

    public void readExternal(ObjectInput stream) throws IOException, ClassNotFoundException {
        this.id = stream.readLong();
        this.description = (String)stream.readObject();
        this.totalCost = (BigDecimal)stream.readObject();
        this.customer = (Customer)stream.readObject();
        this.orderLines = (List)stream.readObject();
    }

    public void writeExternal(ObjectOutput stream) throws IOException {
        stream.writeLong(this.id);
        stream.writeObject(this.description);
        stream.writeObject(this.totalCost);
        stream.writeObject(this.customer);
        stream.writeObject(this.orderLines);
    }
}</code></pre>
<h4>Externalizable serialization bytes for Order object</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>----sr--model.Order---*3--^---xpw---------psr--java.math.BigDecimalT--W--(O---I--scaleL--intValt--Ljava/math/BigInteger;xr--java.lang.Number-----------xp----sr--java.math.BigInteger-----;-----I--bitCountI--bitLengthI--firstNonzeroByteNumI--lowestSetBitI--signum[--magnitudet--[Bxq-~----------------------ur--[B------T----xp----xxpsr--java.util.ArrayListx-----a----I--sizexp----w-----sr--model.OrderLine-!!|---S---xpw-----pq-~--q-~--xxx
</code></pre>
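To see the Externalizable contract end to end, here is a minimal, self-contained round trip. The <code>Point</code> class is a stand-in for illustration, not part of the benchmark model:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

public class ExternalizableDemo {
    public static class Point implements Externalizable {
        public int x, y;
        public Point() {}                       // Externalizable requires a public no-arg constructor
        public Point(int x, int y) { this.x = x; this.y = y; }
        @Override public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(x);                    // only the fields we choose; no class definition is written
            out.writeInt(y);
        }
        @Override public void readExternal(ObjectInput in) throws IOException {
            x = in.readInt();
            y = in.readInt();
        }
    }

    public static byte[] write(Object object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) { out.writeObject(object); }
        return bytes.toByteArray();
    }

    public static Object read(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = write(new Point(3, 4));
        Point copy = (Point) read(data);
        System.out.println(data.length + " bytes, point=" + copy.x + "," + copy.y);
    }
}
```

Note the no-arg constructor: unlike Serializable, Externalizable instantiates the object through it before calling <code>readExternal</code>.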
<p>
<h3>Other Serialization Options</h3>
I started to investigate what other serialization options there were in Java. I started with EclipseLink MOXy, which supports serializing objects to XML or JSON through the JAXB API. I was not expecting XML serialization to outperform Java serialization, so I was quite surprised when it did for certain use cases. I also found Kryo, an open source project for optimized serialization, and investigated the Oracle Coherence POF serialization format. There are pros and cons with each product, but my main focus was on how their performance compared, and how efficient they were.
<p>
<h3>EclipseLink MOXy - XML and JSON</h3>
The main advantage of using EclipseLink MOXy to serialize to XML or JSON is that both are standard, portable formats. You can access the data from any client using any language, so you are not restricted to Java, as you are with Java serialization. You can also integrate your data with web services and REST services. Both formats are also text based, so human readable. No coding or special interfaces are required, only meta-data. The performance is quite acceptable, and outperforms Java serialization for small data-sets.
<p>
The drawbacks are that the text formats are less efficient than optimized binary formats, and JAXB requires meta-data. So you need to annotate your classes with JAXB annotations, or provide an XML configuration file. Also, circular references are not handled by default; you need to use <code>@XmlIDREF</code> to handle cycles.
<h4>JAXB annotated classes</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@XmlRootElement
public class Order {
    @XmlID
    @XmlAttribute
    private long id;
    @XmlAttribute
    private String description;
    @XmlAttribute
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List<OrderLine> orderLines = new ArrayList<OrderLine>();
    private Customer customer;
}

public class OrderLine {
    @XmlIDREF
    private Order order;
    @XmlAttribute
    private int lineNumber;
    @XmlAttribute
    private String description;
    @XmlAttribute
    private BigDecimal cost = BigDecimal.valueOf(0);
}
</code></pre>
<h4>EclipseLink MOXy serialization XML for Order object</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code><order id="0" totalCost="0"><orderLines lineNumber="1" cost="0"><order>0</order></orderLines></order>
</code></pre>
<h4>EclipseLink MOXy serialization JSON for Order object</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>{"order":{"id":0,"totalCost":0,"orderLines":[{"lineNumber":1,"cost":0,"order":0}]}}
</code></pre>
<p>
<h3>Kryo</h3>
<a href="http://code.google.com/p/kryo/">Kryo</a> is a fast, efficient serialization framework for Java.
Kryo is an open source project on Google Code that is provided under the New BSD license. It is a small project with only 3 members; it first shipped in 2009, and the 2.21 release shipped in February 2013, so it is still actively being developed.
<p>
Kryo works similarly to Java serialization, and respects transient fields, but does not require that a class be Serializable. I found Kryo to have some limitations, such as requiring classes to have a default constructor, and encountered some issues in serializing the <code>java.sql.Time</code>, <code>java.sql.Date</code> and <code>java.sql.Timestamp</code> classes.
<h4>Kryo serialization bytes for Order object</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>------java-util-ArrayLis-----model-OrderLin----java-math-BigDecima---------model-Orde-----
</code></pre>
<p>
<h3>Oracle Coherence POF</h3>
The <a href="http://www.oracle.com/technetwork/middleware/coherence/overview/index.html">Oracle Coherence</a> product provides its own optimized binary format called POF (portable object format). Oracle Coherence is an in-memory data grid solution (distributed cache). Coherence is a commercial product, and requires a license. EclipseLink supports an integration with Oracle Coherence through the <a href="http://www.oracle.com/technetwork/middleware/ias/tl-grid-097210.html">Oracle TopLink Grid</a> product that uses Coherence as the EclipseLink shared cache.
<p>
POF provides a serialization framework, and can be used independently of Coherence (if you already have Coherence licensed). POF requires that your class implement a <code>PortableObject</code> interface and read/write methods. You can also implement a separate Serializer class, or use annotations in the latest Coherence release. POF requires that each class be assigned a constant id ahead of time, so you need some way to determine this id. The POF format is a binary format, very compact, efficient, and fast, but it does require some work on your part.
<p>
The total bytes for POF was 32 bytes for a single Order/OrderLine object, and 1,593 bytes for 100 OrderLines. I'm not going to give the timing results, as POF is part of a commercially licensed product, but it was very fast.
<h4>POF PortableObject</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>public class Order implements PortableObject {
    private long id;
    private String description;
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List<OrderLine> orderLines = new ArrayList<OrderLine>();
    private Customer customer;

    public Order() {
    }

    public void readExternal(PofReader in) throws IOException {
        this.id = in.readLong(0);
        this.description = in.readString(1);
        this.totalCost = in.readBigDecimal(2);
        this.customer = (Customer)in.readObject(3);
        this.orderLines = (List)in.readCollection(4, new ArrayList());
    }

    public void writeExternal(PofWriter out) throws IOException {
        out.writeLong(0, this.id);
        out.writeString(1, this.description);
        out.writeBigDecimal(2, this.totalCost);
        out.writeObject(3, this.customer);
        out.writeCollection(4, this.orderLines);
    }
}</code></pre>
<h4>POF serialization bytes for Order object</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>-----B--G---d-U------A--G-------
</code></pre>
<h3>Results</h3>
So how does each perform? I made a simple benchmark to compare the different serialization mechanisms. I compared the serialization of two different use cases. The first is a single Order object with a single OrderLine object. The second is a single Order object with 100 OrderLine objects. I compared the average serialization operations per second, and measured the size in bytes of the serialized data. Different object models, use cases, and environments will give different results, but this gives you a general idea of the performance differences among serializers.
<p>
The results show that Java serialization is slow for a small number of objects, but good for a large number of objects. Conversely, XML and JSON can outperform Java serialization for a small number of objects, but Java serialization is faster for a large number of objects. Kryo and other optimized binary serializers outperform Java serialization with both types of data.
<p>
You may wonder why something that takes less than a millisecond is of any relevance to performance, and you may be right. In general you would only have a real performance problem if you wrote out a very large number of objects, and in that case Java serialization performs quite well, so is the fact that it performs very poorly for a small number of objects relevant? For a single operation, probably not, but if you execute many small serialization operations, then the cost <i>is</i> relevant. A typical server servicing many clients will typically send out many small responses, so while the cost of serialization is not great enough to make any single request take a long time, it will dramatically affect the scalability of the server.
<p>
<h4>Order with 1 OrderLine</h4>
<table cellspacing=0 cellpadding=4 border=1><tr><th>Serializer</th><th>Size (bytes)</th><th>Serialize (operations/second)</th><th>Deserialize (operations/second)</th><th>% Difference (from Java serialize)</th><th>% Difference (deserialize)</th></tr>
<tr><td>Java Serializable</td><td align=right>636</td><td align=right>128,634</td><td align=right>19,180</td><td align=right>0%</td><td align=right>0%</td></tr>
<tr><td>Java Externalizable</td><td align=right>435</td><td align=right>160,549</td><td align=right>26,678</td><td align=right>24%</td><td align=right>39%</td></tr>
<tr><td>EclipseLink MOXy XML</td><td align=right>101</td><td align=right>348,056</td><td align=right>47,334</td><td align=right>170%</td><td align=right>146%</td></tr>
<tr><td>Kryo</td><td align=right>90</td><td align=right>359,368</td><td align=right>346,984</td><td align=right>179%</td><td align=right>1709%</td></tr>
</table>
<p>
<h4>Order with 100 OrderLines</h4>
<table cellspacing=0 cellpadding=4 border=1><tr><th>Serializer</th><th>Size (bytes)</th><th>Serialize (operations/second)</th><th>Deserialize (operations/second)</th><th>% Difference (from Java serialize)</th><th>% Difference (deserialize)</th></tr>
<tr><td>Java Serializable</td><td align=right>2,715</td><td align=right>16,470</td><td align=right>10,215</td><td align=right>0%</td><td align=right>0%</td></tr>
<tr><td>Java Externalizable</td><td align=right>2,811</td><td align=right>16,206</td><td align=right>11,483</td><td align=right>-1%</td><td align=right>12%</td></tr>
<tr><td>EclipseLink MOXy XML</td><td align=right>6,628</td><td align=right>7,304</td><td align=right>2,731</td><td align=right>-55%</td><td align=right>-73%</td></tr>
<tr><td>Kryo</td><td align=right>1,216</td><td align=right>22,862</td><td align=right>31,499</td><td align=right>38%</td><td align=right>208%</td></tr>
</table>
<p>
<h3>EclipseLink JPA</h3>
In EclipseLink 2.6 development builds, and to some degree 2.5, we have added the ability to choose your serializer anywhere that EclipseLink does serialization.
<p>
One such place is in serialized @Lob mappings. You can now use the @Convert annotation to specify a serializer such as @Convert(XML), @Convert(JSON), @Convert(Kryo). In addition to optimizing performance, this provides an easy mechanism to write XML and JSON data to your database.
<p>
Also for EclipseLink cache coordination you can choose your serializer using the "eclipselink.cache.coordination.serializer" property.
<p>
The source code for the benchmarks used in this post can be found <a href="http://git.eclipse.org/c/eclipselink/examples/performance.git/tree/serialization">here</a>, or downloaded <a href="http://git.eclipse.org/c/eclipselink/examples/performance.git/plain/serialization/serialization.zip">here</a>.
<p>
<h2>EclipseLink supports HQL and several advanced new JPQL features</h2>
(2013-06-20)
EclipseLink 2.5 added several new JPQL features and supported syntax.<br>
These include:
<ul>
<li>HQL compatibility
<li>Array expressions
<li>Hierarchical selects
<li>Historical queries
</ul>
<p>
<h3>HQL Compatibility</h3>
A common issue for users migrating from Hibernate to EclipseLink is that the HQL syntax of not having a SELECT clause is not supported, as it is not standard JPQL syntax. EclipseLink now supports this syntax. For queries that will return the entire object, this allows you to omit the SELECT clause, and start the query with the FROM clause.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>from Employee e where e.salary > 50000
</code></pre>
<p>
<h3>Array Expressions</h3>
Previously in JPQL you could use the IN operation and sub-selects to compare a single value, but what if you wanted to compare multiple values, such as composite ids? You would have to dynamically generate very long AND/OR expression trees to compare each value one by one, which is difficult and cumbersome. Many databases provide a much better solution to this problem: array expressions, which allow arrays within the SQL in sub-select comparisons, or nested within an IN comparison.
<p>
EclipseLink now supports array expressions in JPQL. EclipseLink also uses array expressions internally when object comparisons are done in JPQL on objects with composite ids.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select e from Employee e where (e.firstName, e.lastName) IN :names
Select e from Employee e where (e.firstName, e.lastName) IN (Select m.firstName, m.lastName from Managers m)
</code></pre>
<p>
<h3>Hierarchical Selects</h3>
Traditionally it has been very difficult to query hierarchical trees in relational data. For example querying all employees who work under a particular manager. Querying one level is simple, two levels is possible, but querying the entire depth of the tree, when you don't know how deep the tree is, is very difficult.
<p>
Some databases support a special syntax for this type of query. In Oracle the CONNECT BY clause allows for hierarchical queries to be expressed. EclipseLink now supports a CONNECT BY clause in JPQL, to support hierarchical queries on databases that support CONNECT BY.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select e from Employee e START WITH e.id = :id CONNECT BY e.managedEmployees
Select e from Employee e START WITH e.id = :id CONNECT BY e.managedEmployees ORDER SIBLINGS BY e.salary where e.salary > 100000
</code></pre>
<p>
<h3>Historical Queries</h3>
Historical queries allow you to query back in time. This requires that you use EclipseLink's history support, or use a database that supports historical queries such as Oracle's flashback support. Historical queries use the AS OF clause to query an entity as of a point in time (or an Oracle SCN for flashback). This provides the ability to do some pretty cool queries and analytics on your data.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select e from Employee e AS OF :date where e.id = :id
Select e from Employee e, Employee e2 AS OF :date where e = e2 and e.salary > e2.salary
</code></pre>
<h2>Cool performance features of EclipseLink 2.5</h2>
(2013-06-10)
The main goal of the EclipseLink 2.5 release was the support of the JPA 2.1 specification, as EclipseLink 2.5 was the reference implementation for JPA 2.1. For a list of JPA 2.1 features look <a href="http://wiki.eclipse.org/EclipseLink/Release/2.5/JPA21">here</a>, or <a href="http://en.wikibooks.org/wiki/Java_Persistence/What_is_new_in_JPA_2.1%3F">here</a>.
<p>
Most of the features that went into the release were to support JPA 2.1 features, so there was not a lot of development time for other features. However, I was still able to sneak in a few cool new performance features. The features are not well documented yet, so I thought I would outline them here.
<p>
<h3>Indexing Foreign Keys</h3>
The first feature is auto indexing of foreign keys. Most people incorrectly assume that databases index foreign keys by default. Well, they don't. Primary keys are auto indexed, but foreign keys are not. This means any query based on a foreign key will be doing full table scans. This affects any OneToMany, ManyToMany or ElementCollection relationship, as well as many OneToOne relationships, and most queries on any relationship involving joins or object comparisons. This can be a major performance issue, and you should always index your foreign key fields.
<p>
EclipseLink 2.5 makes indexing foreign key fields easy with a new persistence unit property:
<pre><code>"eclipselink.ddl-generation.index-foreign-keys"="true"</code></pre>
This will have EclipseLink create an index for all mapped foreign keys if EclipseLink is used to generate the persistence unit's DDL. Note that DDL generation is now standard in JPA 2.1, so to enable DDL generation in EclipseLink 2.5 you can now use:
<pre><code>"javax.persistence.schema-generation.database.action"="create"</code></pre>
EclipseLink 2.5 and JPA 2.1 also support several new DDL generation features, including allowing user scripts to be executed.
See <a href="http://wiki.eclipse.org/EclipseLink/Release/2.5/JPA21#DDL_generation">DDL generation</a> for more information.
<p>
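Putting the two properties above together, a persistence.xml sketch might look like the following (the persistence unit name is illustrative):

```xml
<persistence-unit name="my-pu">
  <properties>
    <!-- have EclipseLink generate the schema at deployment (standard in JPA 2.1) -->
    <property name="javax.persistence.schema-generation.database.action" value="create"/>
    <!-- and create an index for every mapped foreign key it generates -->
    <property name="eclipselink.ddl-generation.index-foreign-keys" value="true"/>
  </properties>
</persistence-unit>
```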
<h3>Query Cache Invalidation</h3>
EclipseLink has always supported a query cache. Unlike the object cache, the query cache is not enabled by default, but must be enabled through the query hint <code>"eclipselink.query-results-cache"</code>. The main issue with the query cache, is that the results of queries can change when objects are modified, so the query cache could become out of date. Previously the query cache did support time-to-live and daily invalidation through the query hint <code>"eclipselink.query-results-cache.expiry"</code>, but would not be kept in synch with changes as they were made.
<p>
In EclipseLink 2.5 automatic invalidation of the query cache was added. So if you had a query <code>"Select e from Employee e"</code> and had enabled query caching, every execution of this query would hit the cache and avoid accessing the database. Then if you inserted a new Employee, in EclipseLink 2.5 the query cache for all queries for Employee will automatically get invalidated. The next query will access the database, and get the correct result, and update the cache so all subsequent queries will once again obtain cache hits. Since the query cache is now kept in synch, the new persistence unit property <code>"eclipselink.cache.query-results"="true"</code> was added to enable the query cache on all named queries. If, for some reason, you want to allow stale data in your query cache, you can disable invalidation using the <code>QueryResultsCachePolicy.setInvalidateOnChange()</code> API.
<p>
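As a configuration sketch, the new property can be set alongside the other persistence unit properties in persistence.xml:

```xml
<properties>
  <!-- enable the (now auto-invalidated) query results cache on all named queries -->
  <property name="eclipselink.cache.query-results" value="true"/>
</properties>
```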
Query cache invalidation is also integrated with cache coordination, so even if you modify an Employee on another server in your cluster, the query cache will still be invalidated. The query cache invalidation is also integrated with EclipseLink's support for Oracle Database Change Notification. If you have other applications accessing your database, you can keep the EclipseLink cache in synch with an Oracle database using the persistence unit property <code>"eclipselink.cache.database-event-listener"="DCN"</code>. This support was added in EclipseLink 2.4, but in EclipseLink 2.5 it will also invalidate the query cache.
<p>
<h3>Tuners</h3>
EclipseLink 2.5 added an API to make it easier to provide tuning configuration for a persistence unit. The <code>SessionTuner</code> API allows a set of tuning properties to be configured in one place, and provides deployment time access to the EclipseLink Session and persistence unit properties. This makes it easy to have a development, debug, and production configuration of your persistence unit, or provide different configurations for different hardware. The <code>SessionTuner</code> is set through the persistence unit property <code>"eclipselink.tuning"</code>.
<p>
<h3>Concurrent Processing</h3>
The most interesting performance feature provided in EclipseLink 2.5 is still in a somewhat experimental stage. The feature allows for a session to make use of concurrent processing.
<p>
There is no public API to configure it as of yet, but if you are interested in experimenting it is easy to set through a <code>SessionCustomizer</code> or <code>SessionTuner</code>.
<pre><code>
public class MyCustomizer implements SessionCustomizer {
    public void customize(Session session) {
        ((AbstractSession)session).setIsConcurrent(true);
    }
}
</code></pre>
<p>
Currently this enables two main features, one is the concurrent processing of result sets. The other is the concurrent loading of load groups.
<p>
In any JPA object query there are three parts. The first is the execution of the query, the second is the fetching of the data, and the third is the building of the objects. Normally the query is executed, all of the data is fetched, then the objects are built from the data. With concurrency enabled two threads will be used instead, one to fetch the data, and one to build the objects. This allows two things to be done at the same time, allowing less overall time (but the same amount of CPU). This can provide a benefit if you have a multi-CPU machine, or even if you don't, it allows the client to be doing processing at the same time as the database machine.
<p>
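The row-fetch/object-build split can be pictured as a plain producer/consumer pipeline. The sketch below is only an illustration of the idea using a <code>BlockingQueue</code>, with strings standing in for JDBC rows and built objects; it is not EclipseLink's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedQuery {
    private static final String END = "__END__"; // poison pill marking end of data

    // One thread "fetches" rows while the caller builds objects from them
    // at the same time, so the two phases overlap instead of running serially.
    public static List<String> fetchAndBuild(List<String> rows) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        Thread fetcher = new Thread(() -> {
            try {
                for (String row : rows) {
                    queue.put(row);          // stand-in for reading the next JDBC row
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();

        List<String> objects = new ArrayList<>();
        String row;
        while (!(row = queue.take()).equals(END)) {
            objects.add("built:" + row);     // stand-in for building the object
        }
        fetcher.join();
        return objects;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(fetchAndBuild(Arrays.asList("row1", "row2", "row3")));
    }
}
```

As the post notes, this uses the same total CPU; the win is overlapping the waits, which only helps when there is spare CPU time.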
The second feature allows all of the relationships for all of the resulting objects to be queried and built concurrently (only when using a shared cache). So, if you queried 32 Employees and also wanted each Employee's address, the address queries could all be executed and built concurrently, resulting in significantly less response time. This requires a <code>LoadGroup</code> to be set on the query. <code>LoadGroup</code> defines a new API <code>setIsConcurrent()</code> to allow concurrency to be enabled (this defaults to true when a session is set to be concurrent).
<p>
A <code>LoadGroup</code> can be configured on a query using the query hint <code>"eclipselink.load-group"</code>, <code>"eclipselink.load-group.attribute"</code>, or through the JPA 2.1 <code>EntityGraph</code> query hint <code>"javax.persistence.loadgraph"</code>.
<p>
Note that for concurrency to improve your application's performance you need to have spare CPU time. So, to benefit the most you need multiple-CPUs. Also, concurrency will not help you scale an application server that is already under load from multiple client requests. Concurrency does not use less CPU time, it just allows for the CPUs to be used more efficiently to improve response times.
<h2>Batch Writing, and Dynamic vs Parametrized SQL, how well does your database perform?</h2>
(2013-05-28)
One of the most effective database optimizations is batch writing. Batch writing is supported by most modern databases, is part of the JDBC standard, and is supported by most JPA providers.
<p>
Normal database access consists of sending each DML (insert, update, delete) statement to the database in a separate database/network access. Each database access has a certain amount of overhead to it, and the database must process each statement independently. Batch writing has two forms, dynamic and parametrized. Parametrized is the most common, and normally provides the best benefit, as dynamic can have parsing issues.
<p>
To understand batch writing, you must first understand parametrized SQL. SQL execution is composed of two parts, the parse and the execute. The parse consists of turning the string SQL representation into the database representation. The execute consists of executing the parsed SQL on the database. Databases and JDBC support bind parameters, so the arguments to the SQL (the data) do not have to be embedded in the SQL. This avoids the cost of converting the data into text, and allows the same SQL statement to be reused with multiple executions. This allows for a single parse and multiple executes, aka "parametrized SQL". Most JDBC DataSource implementations and JPA providers support parametrized SQL and statement caching; this effectively avoids ever having a parse in a running application.
<h4>Example dynamic SQL</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;">
INSERT INTO EMPLOYEE (ID, NAME) VALUES (34567, 'Bob Smith')
</pre>
<h4>Example parametrized SQL</h4>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;">
INSERT INTO EMPLOYEE (ID, NAME) VALUES (?, ?)
</pre>
<p>
Parametrized batch writing involves executing a single DML statement, but with a set of bind parameters for multiple homogenous statements, instead of bind parameters for a single statement. This effectively allows for a large batch of homogenous inserts, updates, or deletes, to be processed by the database and network as a single operation, instead of n operations. The database only needs to perform the minimal amount of work, as there is only a single statement, so at most only a single parse. It is also compatible with statement caching, so no statement parsing needs to occur at all. The limitation is that all of the statement's SQL must be identical. So it works really well for, say, inserting 1,000 Orders, as the insert SQL is the same for each Order; only the bind parameters differ. But it does not help for inserting 1 Order, or for inserting 1 Order, 1 OrderLine, and 1 Customer. Also, all of the statements must be part of the same database transaction.
<p>
Dynamic batch writing involves chaining a bunch of heterogeneous dynamic SQL statements into a single block, and sending the entire block to the database in a single database/network access. This is beneficial in that there is only a single network access, so if the database is remote or across a slow network, this can make a big difference. The drawback is that parameter binding is not allowed, and the database must parse this huge block of SQL when it receives it. In some cases the parsing cost can outweigh the network benefit. Also, dynamic SQL is not compatible with statement caching, as each SQL is different.
<p>
JDBC standardizes batch writing through its <code>Statement</code> and <code>PreparedStatement</code> batch APIs (as of JDBC 2.0, which was JDK 1.2, aka a long time ago). The JDBC batch API requires different JDBC code, so if you are using raw JDBC, you need to rewrite your code to switch between the batching and non-batching APIs. Most JDBC drivers now support these APIs, but some do not actually send the DML to the database as a batch, they just emulate the APIs. So how do you know if you are really getting batch writing? The only real way is to test it, and measure the performance difference.
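In code, the JDBC batch API boils down to <code>addBatch()</code> and <code>executeBatch()</code> on a <code>PreparedStatement</code>. Here is a minimal sketch reusing the EMPLOYEE example above; the helper method and names are illustrative, not from any particular driver:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Sketch of JDBC 2.0 parametrized batch writing (table and column names
// are hypothetical). One parse, one round trip for the whole batch.
class JdbcBatchSketch {
    // Builds the single parametrized INSERT reused for every row.
    static String buildInsert(String table, List<String> columns) {
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + placeholders + ")";
    }

    // Queues every row with addBatch(), then executes them as one batch.
    static int[] insertAll(Connection con, List<Object[]> rows) throws SQLException {
        String sql = buildInsert("EMPLOYEE", List.of("ID", "NAME"));
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();
            }
            return ps.executeBatch(); // one round trip instead of rows.size()
        }
    }
}
```

Whether <code>executeBatch()</code> actually batches on the wire is up to the driver, which is exactly why measuring is the only reliable check.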
<p>
The JPA specification does not standardize batch writing configuration, but most JPA providers support it. Normally batch writing is enabled in JPA through persistence unit properties, so turning it on or off is a simple matter of configuration, and requires no coding changes. Some JPA providers may not support batch writing when using optimistic locking, and may not re-order SQL to enable it to be batched, so even with batch writing enabled, you may still not be getting batch writing. Always test your application with batch writing on and off, and measure the difference to ensure it is actually functioning.
<p>
EclipseLink supports both parametrized and dynamic batch writing (since EclipseLink 1.0). In EclipseLink, batch writing is enabled through the <code>"eclipselink.jdbc.batch-writing"</code> persistence unit property. EclipseLink provides three options, <code>"JDBC"</code>, <code>"Buffered"</code>, and <code>"Oracle-JDBC"</code>. The <code>"JDBC"</code> option should always be used.
<p>
<code>"Buffered"</code> is for JDBC drivers that do not support batch writing, and chains dynamic SQL statements into a single block itself. <code>"Buffered"</code> does not support parametrized SQL, and is not recommended.
<p>
<code>"Oracle-JDBC"</code> uses the Oracle database JDBC API that predates the JDBC standard API, and is now obsolete. Prior to EclipseLink 2.5, this option allowed batch writing when using optimistic locking, but now the regular <code>"JDBC"</code> option supports optimistic locking.
<p>
EclipseLink 2.5 supports batch writing with optimistic locking on all (compliant) database platforms, whereas previously it was only supported on selected database platforms. EclipseLink 2.5 also provides an <code>"eclipselink.jdbc.batch-writing"</code> query hint to disable batch writing for native queries that cannot be batched (such as DDL or stored procedures on some database platforms).
<p>
EclipseLink supports parametrized SQL through the <code>"eclipselink.jdbc.bind-parameters"</code> and <code>"eclipselink.jdbc.cache-statements"</code> persistence unit properties. However, these don't normally need to be set, as parameter binding is the default, so you would only set the property to disable binding. Statement caching is not on by default, and is only relevant to EclipseLink if you are using EclipseLink's connection pooling; if you are using a JDBC or Java EE DataSource, then you must configure statement caching in your DataSource configuration.
<p>
When batch writing is enabled in EclipseLink, it defaults to parametrized batch writing. To enable dynamic batch writing, you must disable parameter binding. The same applies to buffered batch writing.
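In a persistence.xml, this might look like the following sketch (the property names are those described above; the second property is only needed if you specifically want dynamic batch writing):

```xml
<!-- sketch of a persistence.xml excerpt -->
<!-- enables batch writing; parametrized by default -->
<property name="eclipselink.jdbc.batch-writing" value="JDBC"/>
<!-- only add this to switch to dynamic batch writing -->
<property name="eclipselink.jdbc.bind-parameters" value="false"/>
```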
<p>
Supporting batch writing is not incredibly difficult, and most JPA providers support it; ordering the SQL such that it can be batched is the difficult part. During a commit or flush operation, EclipseLink automatically groups SQL by table to ensure homogeneous SQL statements can be batched (while still maintaining referential integrity constraints and avoiding deadlocks). Most JPA providers do not do this, so even if they support batch writing, a lot of the time the SQL does not benefit from batching.
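Conceptually, the grouping step is just bucketing pending writes by table so each bucket can be sent as one parametrized batch. A simplified sketch of that idea (this is an illustration, not EclipseLink internals, and the real ordering must also respect foreign key constraints, which is omitted here):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Groups pending (table, row) writes by table, so that all homogeneous
// statements for one table can be executed as a single batch.
class BatchGrouper {
    static Map<String, List<Object[]>> groupByTable(List<Map.Entry<String, Object[]>> writes) {
        // LinkedHashMap keeps the first-seen table order stable
        Map<String, List<Object[]>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Object[]> write : writes) {
            groups.computeIfAbsent(write.getKey(), k -> new ArrayList<>()).add(write.getValue());
        }
        return groups;
    }
}
```

With this grouping, inserting 1 Order, 1 OrderLine, and 1 Customer still cannot batch, but 1,000 Orders mixed with 1,000 OrderLines become two batches instead of 2,000 statements.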
<p>
To enable batch writing in EclipseLink, add the following persistence unit property:
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;">
"eclipselink.jdbc.batch-writing"="JDBC"
</pre>
You can also configure the batch size using the <code>"eclipselink.jdbc.batch-writing.size"</code> persistence unit property. The default size is 100.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;">
"eclipselink.jdbc.batch-writing.size"="1000"
</pre>
<p>
Batch writing is very database and JDBC driver dependent, so I was interested in which databases and drivers it worked with, and what the benefit was. I made two tests: one does a batch of 50 inserts, and one does a batch of 100 updates (using optimistic locking). I tried all of the batch writing options, as well as no batching.
<p>
<b><i>Note, this is not a database benchmark, I am not comparing the databases between each other, only to themselves</i></b>.
<p>
Each database is running on different hardware, some are local, and some are across a network, so do not compare one database to another. The data of interest is the percentage benefit enabling batch writing has over not using batch writing. For the insert test I also measured the difference between using parametrized versus dynamic SQL, and parametrized SQL without statement caching. The result is the number of transactions processed in 10 seconds (run 5 times, and averaged), so a bigger number is a better result.
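For reference, the "% Difference from non batched" columns in the tables that follow appear to be the gain over the non-batched baseline, truncated to a whole percent. A small sketch of that arithmetic:

```java
// How the "% Difference from non batched" columns appear to be computed:
// the gain of a result over the baseline, truncated to a whole percent
// (the cast truncates toward zero, which matches the tables).
class PercentDiff {
    static long percentDifference(double result, double baseline) {
        return (long) ((result - baseline) / baseline * 100);
    }
}
```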
<h3>
Database: MySQL Version: 5.5.16<br>
Driver: MySQL-AB JDBC Driver Version: mysql-connector-java-5.1.22
</h3>
Insert Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql, no batch</td><td align=right>483</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, no batch</td><td align=right>499</td><td align=right>3%</td></tr>
<tr><td>parametrized-sql, no statement caching</td><td align=right>478</td><td align=right>-1%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>499</td><td align=right>3%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>509</td><td align=right>5%</td></tr>
</table>
<p>
Update Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql</td><td align=right>245</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>244</td><td align=right>0%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>248</td><td align=right>1%</td></tr>
</table>
<p>
So the results seem to indicate that batch writing has no effect whatsoever (5% is within the variance). What this really means is that the MySQL JDBC driver does not actually use batch processing; it just emulates the JDBC batch APIs and executes statements one by one underneath.
<p>
MySQL does have batch processing support, it just requires different SQL. The MySQL JDBC driver supports this, but requires the <code>rewriteBatchedStatements=true</code> JDBC connect property to be set. This can easily be set by modifying your connect URL, such as:
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;">
jdbc:mysql://localhost:3306/test?rewriteBatchedStatements=true
</pre>
<h3>MySQL: rewriteBatchedStatements=true</h3>
Insert Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql, no batch</td><td align=right>504</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, no batch</td><td align=right>508</td><td align=right>0%</td></tr>
<tr><td>parametrized-sql, no statement caching</td><td align=right>483</td><td align=right>-4%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>1292</td><td align=right>156%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>2181</td><td align=right>332%</td></tr>
</table>
<p>
Update Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql</td><td align=right>250</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>669</td><td align=right>167%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>699</td><td align=right>179%</td></tr>
</table>
<p>
So, it appears batch writing does make a big difference in MySQL, if configured correctly (why the JDBC driver does not do this by default, I have no idea). Parametrized batch writing does the best, being 332% faster for inserts, and 179% faster for updates. Dynamic batch writing also performs quite well. Interestingly, there appears to be little difference between dynamic and parametrized SQL on MySQL (my guess is either MySQL is really fast at parsing, or does little optimization for prepared statements).
<h3>
PostgreSQL Version: 9.1.1<br>
PostgreSQL 8.4 JDBC4
</h3>
Insert Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql, no batch</td><td align=right>479</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, no batch</td><td align=right>418</td><td align=right>-12%</td></tr>
<tr><td>parametrized-sql, no statement caching</td><td align=right>428</td><td align=right>-10%</td></tr>
<tr><td>dynamic-sql, buffered</td><td align=right>1127</td><td align=right>135%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>1127</td><td align=right>135%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>2037</td><td align=right>325%</td></tr>
</table>
<p>
Update Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql</td><td align=right>233</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>395</td><td align=right>69%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>707</td><td align=right>203%</td></tr>
</table>
<p>
The results show batch writing makes a big difference on PostgreSQL. Parametrized batch writing performs the best, being 325% faster for inserts, and 203% faster for updates. Dynamic batch writing also performs quite well. For PostgreSQL I also measured the performance of EclipseLink's buffered batch writing, which performs the same as dynamic JDBC batch writing, so I assume the driver is doing the same thing. Parametrized SQL outperforms dynamic SQL by about 10%, but parametrized SQL without statement caching performs similarly to dynamic SQL.
<h3>
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0<br>
Oracle JDBC driver Version: 11.2.0.2.0
</h3>
Insert Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql, no batch</td><td align=right>548</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, no batch</td><td align=right>494</td><td align=right>-9%</td></tr>
<tr><td>parametrized-sql, no statement caching</td><td align=right>452</td><td align=right>-17%</td></tr>
<tr><td>dynamic-sql, buffered</td><td align=right>383</td><td align=right>-30%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>489</td><td align=right>-10%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>3308</td><td align=right>503%</td></tr>
</table>
<p>
Update Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql</td><td align=right>282</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>258</td><td align=right>-8%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>1672</td><td align=right>492%</td></tr>
</table>
<p>
The results show parametrized batch writing makes a big difference on Oracle, being 503% faster for inserts, and 492% faster for updates. Dynamic batch writing does not provide any benefit; this is because Oracle's JDBC driver just emulates dynamic batch writing and executes statements one by one, so it has the same performance as dynamic SQL. Buffered batch writing actually has worse performance than not batching at all. This is because of the parsing cost for the huge block of dynamic SQL. This may vary in different configurations; if the database is remote or across a slow network, I have seen this provide a benefit.
<p>
Parametrized SQL with statement caching provides about a 10% benefit over dynamic SQL, and shows that to benefit from parametrized SQL you need to use statement caching, otherwise the performance can be worse than dynamic SQL. Of course there are other benefits to parametrized SQL, as it removes CPU processing from the database server, which may not help much in this single-threaded case, but can make a huge difference in a multi-threaded case where the database is a bottleneck.
<h3>
Apache Derby Version: 10.9.1.0 - (1344872)<br>
Apache Derby Embedded JDBC Driver Version: 10.9.1.0 - (1344872)<br>
(local)
</h3>
Insert Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql, no batch</td><td align=right>3027</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, no batch</td><td align=right>24</td><td align=right>-99%</td></tr>
<tr><td>parametrized-sql, no statement caching</td><td align=right>50</td><td align=right>-98%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>24</td><td align=right>-99%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>3252</td><td align=right>7%</td></tr>
</table>
<p>
Update Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql</td><td align=right>1437</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>6</td><td align=right>-99%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>2172</td><td align=right>51%</td></tr>
</table>
<p>
The results show parametrized batch writing makes a difference on Derby, being 7% faster for inserts, and 51% faster for updates. This difference is not as big as on other databases because my database was local. For a networked database it would be bigger, but this does show that batch writing can provide a benefit even for local databases, so it is not just a network optimization. The really interesting results from Derby are the horrible performance of the dynamic and non-cached statements. This shows that Derby has a huge parsing cost, so if you are using Derby, using parametrized SQL with statement caching is really important.
<p>
<h3>
DB2/NT64 Version: SQL09070<br>
IBM Data Server Driver for JDBC and SQLJ Version: 4.0.100
</h3>
The results are basically similar to Oracle, in that parametrized batch writing gives a big performance benefit. Dynamic batch writing has worse performance than no batching with parametrized SQL, and both dynamic SQL and parametrized SQL without statement caching result in worse performance.
<p>
<h3>
Microsoft SQL Server Version: 10.50.1617<br>
Microsoft SQL Server JDBC Driver 2.0 Version: 2.0.1803.100
</h3>
The results were similar to PostgreSQL, showing both parametrized and dynamic batch writing providing a significant benefit. Parametrized batch writing performed the best, and parametrized SQL outperformed dynamic SQL, and no statement caching.
<p>
<h3>** UPDATE **</h3>
It was requested that I also test H2 and HSQL, so here are the results.
<h3>
Database: H2 Version: 1.3.167 (2012-05-23)<br>
Driver: H2 JDBC Driver Version: 1.3.167 (2012-05-23)<br>
(local)
</h3>
Insert Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql, no batch</td><td align=right>4757</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, no batch</td><td align=right>3210</td><td align=right>-32%</td></tr>
<tr><td>parametrized-sql, no statement caching</td><td align=right>4757</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, buffered</td><td align=right>1935</td><td align=right>-59%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>3293</td><td align=right>-30%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>5753</td><td align=right>20%</td></tr>
</table>
<p>
The results show H2 performs 20% faster with parametrized batch writing. H2 is an in-memory database (backed by a persistent log file), so it is not expected to benefit as much, as there is no network involved. Dynamic batch writing and dynamic SQL perform worse than parametrized SQL. Interestingly, using statement caching with parametrized SQL makes no difference. My assumption is that H2 always caches prepared statements in its connection, so the user does not need to do their own statement caching.
<p>
<h3>
Database: HSQL Database Engine Version: 1.8.1<br>
Driver: HSQL Database Engine Driver Version: 1.8.1<br>
(local)
</h3>
Insert Test
<table cellspacing=0 cellpadding=4 border=1><tr><th>Option</th><th>Average Result</th><th>% Difference from non batched</th></tr>
<tr><td>parametrized-sql, no batch</td><td align=right>7319</td><td align=right>0%</td></tr>
<tr><td>dynamic-sql, no batch</td><td align=right>5054</td><td align=right>-30%</td></tr>
<tr><td>parametrized-sql, no statement caching</td><td align=right>6776</td><td align=right>-7%</td></tr>
<tr><td>dynamic-sql, batch</td><td align=right>5500</td><td align=right>-24%</td></tr>
<tr><td>parametrized-sql, batch</td><td align=right>9176</td><td align=right>25%</td></tr>
</table>
<p>
The results show HSQL performs 25% faster with parametrized batch writing. HSQL is an in-memory database (backed by a persistent log file), so it is not expected to benefit as much, as there is no network involved. Dynamic batch writing and dynamic SQL perform worse than parametrized SQL.
<p>
<h3>But what if I'm not querying by id? (database and cache indexes)</h3>
Most data models define a sequence-generated numeric primary key. This is the most efficient key to use, as it is a single, compact, guaranteed-unique value. Some applications also use a UUID, which is a little less efficient in terms of time and space, but has its advantages in distributed systems and databases.
<p>
In JPA the primary key is defined as the JPA Id, and since JPA's object cache is indexed by Id, any queries by this Id obtain cache hits, and avoid database access. This is great when querying by id, and for traversing relationships across foreign keys based on the id, but what about queries not using the id?
<p>
Sequence-generated ids are great for computers and databases, but are not very useful to people. How many times have you been to a store or website, and to look up your record they asked you for your sequence-generated id?
<p>
Probably not very often; you are more likely to be asked for your phone number, email address, SSN, or other such key that is easy to remember. These keys can be considered alternative or secondary keys, and are very common in databases. Querying by alternative keys is very common, so it is important that such queries perform optimally.
<p>
The most important thing to ensure is that the columns are properly indexed in the database, otherwise each query will require a full table scan. The second thing to ensure is that the columns are indexed in the JPA object cache.
<h3>EclipseLink Database Indexes</h3>
To create a <i>database</i> index on a column you can use your own DDL script, or use the <code>@Index</code> annotation in EclipseLink (since EclipseLink 2.2). JPA 2.1 will also define its own <code>@Index</code> annotation. With the EclipseLink index annotation you can just put it on the attributes you would like to index.
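If you prefer to maintain your own DDL script, the equivalent indexes are plain CREATE INDEX statements; a sketch for an EMPLOYEE table (the index, table, and column names here are illustrative):

```sql
-- hypothetical DDL equivalents of @Index on ssn and the name columns
CREATE UNIQUE INDEX EMP_SSN_INDEX ON EMPLOYEE (SSN);
CREATE INDEX EMP_NAME_INDEX ON EMPLOYEE (F_NAME, L_NAME);
```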
<p>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;">
@Entity
@Index(columnNames={"F_NAME", "L_NAME"})
public class Employee {
@Id
private long id;
@Index
@Column(name="F_NAME")
private String firstName;
@Index
@Column(name="L_NAME")
private String lastName;
@Index(unique=true)
private String ssn;
@Index
private String phoneNumber;
@Index
private String emailAddress;
...
}
</pre>
<h3>JPA 2.1 Database Indexes</h3>
The upcoming JPA 2.1 spec (draft) also defines support for database indexes. EclipseLink 2.5 is the reference implementation for JPA 2.1, so in EclipseLink 2.5 (dev builds) you can also create database indexes using the JPA 2.1 annotations. This is a little bit more complex, as you cannot define the <code>@Index</code> on attributes, only inside table annotations.
<p>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;">
@Entity
@Table(indexes={
    @Index(name="EMP_SSN_INDEX", unique=true, columnList="SSN"),
@Index(name="EMP_PHONE_INDEX", columnList="PHONENUMBER"),
@Index(name="EMP_EMAIL_INDEX", columnList="EMAILADDRESS"),
@Index(name="EMP_F_NAME_INDEX", columnList="F_NAME"),
@Index(name="EMP_L_NAME_INDEX", columnList="L_NAME"),
    @Index(name="EMP_NAME_INDEX", columnList="F_NAME, L_NAME") })
public class Employee {
@Id
private long id;
@Column(name="F_NAME")
private String firstName;
@Column(name="L_NAME")
private String lastName;
private String ssn;
private String phoneNumber;
private String emailAddress;
...
}
</pre>
<h3>EclipseLink Cache Indexes</h3>
EclipseLink also supports indexes on the object cache (since EclipseLink 2.4). This allows JPQL and Criteria queries on indexed fields to obtain cache hits, and avoid all database access. The <code>@CacheIndex</code> annotation is used to index an attribute, or a set of columns. When a set of columns is indexed, any query using the attributes that map to those columns will use the index.
<p>
Cache indexes only provide a benefit to queries that expect a single result. Indexing a field such as <code>firstName</code> would provide no benefit, as there are many results with the same first name, and EclipseLink could never be certain it has them all loaded in the cache, so it must access the database.
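Conceptually, a cache index is just a second map keyed by the alternative key, alongside the primary map keyed by id. A simplified sketch of the idea (an illustration of the concept, not EclipseLink's actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified object cache with a secondary (cache) index on a unique key.
class IndexedCache<T> {
    private final Map<Long, T> byId = new HashMap<>();    // primary index (JPA Id)
    private final Map<String, T> bySsn = new HashMap<>(); // cache index on a unique key

    void put(long id, String ssn, T entity) {
        byId.put(id, entity);
        bySsn.put(ssn, entity);
    }

    T findById(long id) { return byId.get(id); }

    // A unique-key lookup can trust a hit: one ssn maps to one entity.
    // A non-unique field (e.g. firstName) could never prove the cache
    // holds every match, so such queries must still go to the database.
    T findBySsn(String ssn) { return bySsn.get(ssn); }
}
```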
<p>
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;">
@Entity
@CacheIndex(columnNames={"F_NAME", "L_NAME"})
public class Employee {
@Id
private long id;
@Column(name="F_NAME")
private String firstName;
@Column(name="L_NAME")
private String lastName;
@CacheIndex
private String ssn;
@CacheIndex
private String phoneNumber;
@CacheIndex
private String emailAddress;
...
}
</pre>
<p>
So, is indexing worth it? I created an example benchmark that shows the difference between indexed and non-indexed queries. These results were obtained accessing an Oracle database across a LAN from a desktop machine. Results are queries per second, so a bigger number is better. The test consisted of querying a Customer object by name, with a database size of 1,000 customers.
<table cellspacing=0 cellpadding=4 border=1><tr><th>Config</th><th>Average Result (q/s)</th><th>% Difference</th></tr>
<tr><td>No index, no cache</td><td align=right>7,155</td><td align=right>0%</td></tr>
<tr><td>No index, cache</td><td align=right>8,095</td><td align=right>13%</td></tr>
<tr><td>Database index, cache</td><td align=right>9,900</td><td align=right>38%</td></tr>
<tr><td>Database index, cache index</td><td align=right>137,120</td><td align=right>1,816%</td></tr>
</table>
<p>
The results show that the object cache indexed by id provides a little benefit (13%), as it avoids having to rebuild the object, but still has to access the database. Note the test did not access any relationships on Customer; if it had, the object cache would still provide a major benefit by avoiding relationship queries. The database index provides a bigger benefit (38%); this is a factor of the table size: the bigger the table, the bigger the benefit. The object cache index provides the best benefit, almost a 20x improvement.
<p>
This shows it is important to properly index your database and your cache. Avoid indexing everything, as there is a cost in maintaining indexes. For the database, index any commonly queried columns; for the cache, index any secondary key fields.
<p>
See the <a href="http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Caching">EclipseLink UserGuide</a> for more info on caching.
<p>
The source code for the benchmarks used in this post can be found <a href="http://git.eclipse.org/c/eclipselink/examples/performance.git/tree/cache">here</a>, or downloaded <a href="http://git.eclipse.org/c/eclipselink/examples/performance.git/plain/cache/benchmark.zip">here</a>.
<h3>Got Cache? improve data access by 97,322%</h3>
Caching is the most valuable optimization one can make in software development. Making something run faster is nice, but it can never beat not having to run anything at all, because you already have the result cached.
<p/>
JPA caches many things. The most important thing to cache is the <code>EntityManagerFactory</code>; this is generally done for you in Java EE, but in Java SE you need to do it yourself, such as storing it in a static variable. If you don't cache your <code>EntityManagerFactory</code>, then your persistence unit will be redeployed on every call, which will really suck.
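In Java SE the usual pattern is a lazy, synchronized holder. Sketched here generically with a Supplier so the idiom is clear; in real code T would be <code>EntityManagerFactory</code> and the supplier would call <code>Persistence.createEntityManagerFactory("my-pu")</code> (the unit name is hypothetical):

```java
import java.util.function.Supplier;

// Lazy holder sketch: the expensive factory is created once and reused.
class CachedFactory<T> {
    private final Supplier<T> creator;
    private T instance;
    private int creations; // exposed only to demonstrate single creation

    CachedFactory(Supplier<T> creator) { this.creator = creator; }

    synchronized T get() {
        if (instance == null) {
            instance = creator.get(); // happens once, not on every call
            creations++;
        }
        return instance;
    }

    synchronized int creations() { return creations; }
}
```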
<p/>
Other caches in JPA include the cache of JDBC connections, the cache of JDBC statements, the result set cache, and most importantly the object cache, which is what I would like to discuss today.
<p/>
JPA 1.0 did not define caching, although most JPA providers did support a cache in some form or another. JPA 2.0 defined caching through the <code>@Cacheable</code> annotation and the <code>&lt;shared-cache-mode&gt;</code> persistence.xml element. Caching in JPA is often described as two levels: conceptually there is the L1 cache on an <code>EntityManager</code>, and the L2 cache on the <code>EntityManagerFactory</code>.
<p/>
The <code>EntityManager</code> cache is an isolated, transactional cache that only caches the objects read by that <code>EntityManager</code>, and shares nothing with other <code>EntityManagers</code>. The main purpose of the L1 cache is to maintain object identity (i.e. <code>person == person.getSpouse().getSpouse()</code>) and transaction consistency. The L1 cache also improves performance by avoiding querying the same object multiple times. The only ways to avoid the L1 cache are to refresh, create a new <code>EntityManager</code>, or call <code>clear()</code>.
<p/>
The <code>EntityManagerFactory</code> cache is a shared cache across all <code>EntityManagers</code>, and reflects the current committed state of the database (stale data is possible depending on your configuration and whether other applications access the database). The main purpose of the L2 cache is to improve performance by avoiding queries for objects that have already been read. The L2 cache is normally what is referred to when caching is discussed in JPA, and is what the JPA <code>&lt;shared-cache-mode&gt;</code> and <code>@Cacheable</code> refer to.
<p/>
There are many types of caches provided by the various JPA providers. Some provide data caches, some provide object caches, some have relationship caches, some have query caches, some have distributed caches, or coordinated caches.
<p/>
EclipseLink provides an object cache, what I would call a "live" object cache. I believe most other JPA providers provide a data cache. The difference between a data cache, and an object cache, is that a data cache just caches the object's row, where as an object cache caches the entire object, including its relationships.
<p/>
Caching relationships is normally more important than caching the object's data, as each relationship normally represents a database query, so saving <i>n</i> database queries to build an object's relationships is more important than saving the <i>1</i> query for the object itself. Some JPA providers augment their data cache with a relationship cache, or a query cache. If a data cache caches relationships at all, it is normally in the form of caching only the ids of the related objects. This can be a major issue: consider caching a OneToMany relationship. If you only have a set of ids, then you need to query the database for each id that is not in the cache, causing <i>n</i> database queries. With an object cache, you have the related objects, so you never need to query the database.
<p/>
The other advantage to caching objects is that you also save the cost of building the objects from the data. If the object or query is read-only, the cached object can be used directly, otherwise it only needs to be copied, not rebuilt from data.
<p/>
EclipseLink also supports not caching relationships through the <code>@Noncacheable</code> annotation. Also the <code>@Cache(isolation=PROTECTED)</code> option can be used to ensure read-only entities and queries always copy the cached objects. So you can simulate a data cache with EclipseLink.
<p/>
One should not underestimate the performance benefits of caching. Where as other JPA optimization may improve performance by 10-20%, or 2-5x for the major ones, caching has the potential to improve performance by factors of 100x even 1000x.
<p/>
So what are the numbers? In this simple benchmark I compare reading a simple Order object, and it relationships (orderLines, customer). I compared the various caching options.<br/>
(result is queries per second, so bigger number is better, test was single threaded, randomly querying an order from a data set of 1000 orders, tests were run 5 times and averaged, database was an Oracle database over a local area network, low end hardware was used).
<p/>
<table cellspacing=0 cellpadding=4 border=1><tr><th>Cache Option</th><th>Cache Config</th><th>Average Result (q/s)</th><th>% Difference</th></tr>
<tr><td>No Cache</td><td>@Cacheable(false)</td><td align=right>965</td><td align=right>0%</td></tr>
<tr><td>Object Cache</td><td>@Cacheable(true)</td><td align=right>36,544</td><td align=right>3,686%</td></tr>
<tr><td>Object Cache</td><td>@Cache(isolation=PROTECTED)</td><td align=right>35,107</td><td align=right>3,538%</td></tr>
<tr><td>Data Cache</td><td>@Cache(isolation=PROTECTED) + @Noncacheable(true)</td><td align=right>1,889</td><td align=right>95%</td></tr>
<tr><td>Read Only Cache</td><td>@ReadOnly</td><td align=right>940,123</td><td align=right>97,322%</td></tr>
<tr><td>Protected Read Only Cache</td><td>@ReadOnly + @Cache(isolation=PROTECTED)</td><td align=right>625,602</td><td align=right>64,729%</td></tr>
</table>
<p/>
The results show that although a data cache provides a significant benefit (~2x), it does not compare with an object cache (~100x). Marking the objects as @ReadOnly provides a significant additional benefit (~1000x).
<p/>
The object cache caches objects by their Id. This is great for find() or merge() operations, but does not help as much with queries. In EclipseLink any query by Id will also hit the object cache, but queries not by Id will have to hit the database. For each database result the object cache will still be checked, so the cost of building the objects, and most importantly their relationships, can still be avoided.
<p/>
EclipseLink also supports a query cache. The query cache is configured independently of the object cache; it is configured per query, and is not enabled by default. The query cache caches query results by query name and query parameters. This allows any query to obtain a cache hit. The query cache is configured through the "eclipselink.query-results-cache" query hint.
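<p/>
For example, the query cache hint might be set as follows (a sketch; the query and parameter names are illustrative):
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Query query = em.createQuery("Select o from Order o where o.customer.id = :id");
query.setHint("eclipselink.query-results-cache", "true"); // cache results by query + parameters
query.setParameter("id", customerId);
List<Order> orders = query.getResultList();
</code></pre>
A subsequent execution of the same query with the same parameter value can then obtain a cache hit without going to the database.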
<p/>
EclipseLink can execute queries in-memory against the object cache. This is not used by default, but can be configured on any query. Since the object cache does not normally contain the entire database, this works best with a FULL cache type that has been preloaded. This is configured on the query through the "eclipselink.cache-usage" query hint.
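<p/>
An in-memory query against a preloaded FULL cache might look like this (a sketch; the entity name is illustrative):
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Entity
@Cache(type=CacheType.FULL) // cache all instances, never garbage collected
public class Order { ... }

Query query = em.createQuery("Select o from Order o where o.customer.id = :id");
query.setHint("eclipselink.cache-usage", "CheckCacheOnly"); // evaluate against the cache, no SQL
List<Order> orders = query.getResultList();
</code></pre>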
<p/>
This next benchmark compares the various caching options with a query. Each query requests the orders for a customer id, resulting in 10 Order objects per query. Random customer ids are used.
<p/>
<table cellspacing=0 cellpadding=4 border=1><tr><th>Cache Option</th><th>Cache Config</th><th>Average Result (q/s)</th><th>% Difference</th></tr>
<tr><td>No Cache</td><td>@Cacheable(false)</td><td align=right>186</td><td align=right>0%</td></tr>
<tr><td>Object Cache</td><td>@Cacheable(true)</td><td align=right>1,021</td><td align=right>448%</td></tr>
<tr><td>Object Cache</td><td>@Cache(isolation=PROTECTED)</td><td align=right>1,085</td><td align=right>483%</td></tr>
<tr><td>Data Cache</td><td>@Cache(isolation=PROTECTED) + @Noncacheable(true)</td><td align=right>198</td><td align=right>6%</td></tr>
<tr><td>Read Only Query</td><td>"eclipselink.read-only"="true"</td><td align=right>1,391</td><td align=right>647%</td></tr>
<tr><td>Read Only Query - Protected Cache</td><td>"eclipselink.read-only"="true" + @Cache(isolation=PROTECTED)</td><td align=right>1,351</td><td align=right>626%</td></tr>
<tr><td>Query Cache</td><td>"eclipselink.query-results-cache"="true"</td><td align=right>5,114</td><td align=right>2,649%</td></tr>
<tr><td>In-memory Query</td><td>"eclipselink.cache-usage"="CheckCacheOnly"</td><td align=right>2,397</td><td align=right>1,188%</td></tr>
</table>
<p/>
This shows that the object cache can still provide a significant benefit to queries through the benefit of caching the relationships (~5x). The query cache performs the best, with a ~25x benefit, and in-memory querying also performs well, with a ~10x benefit. A data cache provides little benefit to queries.
<p/>
I have measured the performance of several caching options in this post, but by no means have I detailed all of the caching options in EclipseLink.<br/>
Other caching options available in EclipseLink include:
<ul>
<li>@Cache - type : FULL, WEAK, SOFT, SOFT_CACHE, HARD_CACHE
<li>@Cache - size : size of cache in number of objects
<li>@Cache - expiry : millisecond time to live expiry
<li>@Cache - expiryTimeOfDay : daily expiry
<li>@CacheIndex : non-id cache indexing
<li>"eclipselink.cache.coordination" : clustered cache synchronization or invalidation
<li>"eclipselink.cache.database-event-listener" : database event driven cache invalidation (Oracle DCN)
<li>"eclipselink.query-results-cache.expiry" : query cache time to live expiry
<li>"eclipselink.query-results-cache.expiry-time-of-day" : query cache daily expiry
<li>TopLink Grid : integration with Oracle Coherence distributed cache
</ul>
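Several of these options can be combined on a single entity. For example (a sketch only; the values and entity name are illustrative, not recommendations):
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Entity
@Cache(
    type=CacheType.SOFT, // allow the JVM to reclaim cached objects under memory pressure
    size=10000,          // cache size in number of objects
    expiry=600000,       // 10 minute time to live
    coordinationType=CacheCoordinationType.INVALIDATE_CHANGED_OBJECTS)
public class Order {
    ...
}
</code></pre>
<p/>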
See the <a href="http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Caching">EclipseLink UserGuide</a> for more info on caching.
<p/>
The source code for the benchmarks used in this post can be found <a href="http://git.eclipse.org/c/eclipselink/examples/performance.git/tree/cache">here</a>, or download <a href="http://git.eclipse.org/c/eclipselink/examples/performance.git/plain/cache/benchmark.zip">here</a>.
<p/>
What caching options to use depends on the application and its data. Caching may not be suitable for all types of applications or data, but for those in which it is applicable, it will normally provide the biggest performance benefit that is attainable.
<p/>
Jameshttp://www.blogger.com/profile/07275512393744882781noreply@blogger.com1tag:blogger.com,1999:blog-6877629428951398731.post-77346738087352978332012-11-13T12:01:00.000-08:002013-04-10T12:20:28.071-07:00EclipseLink 15x faster than other JPA providersI'm sorry to disappoint those of you looking to see some canned benchmark showing how EclipseLink dominates over all competitors in performance and scalability.
<p>
It is not that this would be difficult to do, or that I don't have such a benchmark that shows this, but that I give the JPA community credit for not attributing much weight to a benchmark produced by a single JPA provider that shows their JPA product excels against their competitors. If you really want such a benchmark, scroll to the bottom of this post.
<p>
There are many such provider produced benchmarks out there. There are websites, blog posts, and public forum posts from these vendors marketing their claims. If you believe these claims, and wish to migrate your application to some previously unheard of JPA provider, then I have an email that I would like to send to you from a Nigerian friend who needs help transferring money into your country.
<p>
The only respectable benchmark that I know of that heavily utilizes JPA is the <a href="http://www.spec.org/jEnterprise2010/">SPECjEnterprise2010</a> benchmark from the Standard Performance Evaluation Corporation, a respected, independent standards body that produces many industry standard benchmarks. SpecJ is more than a JPA benchmark, as it measures the performance of the entire JavaEE stack, but JPA is a major component in it.
<p>
One could argue that SpecJ is too broad of a benchmark, and defining the performance of an entire JavaEE stack with a single number is not very meaningful. I have helped in producing Oracle's SpecJ results for several years, both with TopLink/EclipseLink JPA on WebLogic and previously with TopLink CMP on OC4J. I can honestly tell you that producing good numbers on the benchmark is not easy, and does require you to significantly optimize your code, and in particular optimize your concurrency and scalability. Having good JPA performance will not guarantee you good results on SpecJ, as there are other components involved, but having poor performance, concurrency or scalability in <i>any</i> component in your stack will prevent you from having good results. SpecJ emulates a real multi-user application, with a separate driver machine (or many) that simulate a large number of concurrent users hitting the system. It is very good at finding any performance, concurrency, or scalability bottlenecks in your stack.
<p>
One could also argue that it is very difficult for small JPA providers to publish a <a href="http://www.spec.org/jEnterprise2010/results/jEnterprise2010.html">SpecJ result</a>. Publishing a result requires having a full JavaEE stack and being integrated with it, and having access to <i>real</i> hardware, and investing the time and having the expertise to analyze and optimize the implementation on the <i>real</i> hardware. Of course, one could also argue if you don't have access to <i>real</i> hardware and <i>real</i> expertise, can you really expect to produce a performant and scalable product?
<p>
Ideally if you are looking for the most performant JPA provider for your application, you would benchmark it on your own application, and on your production hardware. Writing a good performance benchmark and tests can be a difficult task. It can be difficult to get consistent results, and easy to misinterpret things. In general, if a result doesn't make sense, there is probably something fishy going on. I have seen many such invalid performance comparisons from our users over the years (and from competitors as well). One that I remember was a user had a test showing TopLink was 100x slower than their JDBC code for a simple query. After looking at the test, I found their JDBC code executed a statement, but did not actually fetch any rows, or even close the statement. Once their JDBC code was fixed to be valid, TopLink was actually faster than their non-optimized JDBC code.
<p>
In EclipseLink and Oracle TopLink we take performance and scalability very seriously. In EclipseLink we run a large suite of performance tests every week, and every release to ensure our performance is better than our previous release, and better than any of our main competitors. Although our tests show we have the best performance today, this has not always been the case. We did not write the tests to showcase our performance, but to find problem areas where we were slower than our competitors. We then optimized those areas to ensure we were then faster than our competitors. I'm not claiming EclipseLink, or Oracle TopLink are faster than every other JPA implementation, in every use case. There will be some use cases where we are not faster, and some where our default configuration differs from another JPA provider that makes us slower until the configuration is fixed.
<p>
For example, EclipseLink does not automatically join fetch EAGER relationships; we view this as wrong, as EAGER does not mean JOIN FETCH, and we have a separate @JoinFetch annotation to specify this. So there are some benchmarks, such as one from an object database that supports JPA, that show EclipseLink slower for ElementCollection mappings. This is solely because of the configuration, and we would be faster with the simple addition of a @JoinFetch.
<p>
EclipseLink also does not enable batch writing by default. You can configure this easily in the persistence.xml, but some other JPA providers use batch writing by default on some databases, so this can show a difference. EclipseLink also requires the use of an agent to enable LAZY and other optimizations; some "benchmarks" fail to use this agent, which will seriously affect EclipseLink's performance.
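<p>
For example, batch writing can be enabled through persistence unit properties in the persistence.xml (the batch size value here is illustrative):
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>&lt;property name="eclipselink.jdbc.batch-writing" value="JDBC"/&gt;
&lt;property name="eclipselink.jdbc.batch-writing.size" value="100"/&gt;
</code></pre>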
<p>
The raw throughput of JPA is rarely ever the main factor in an application's performance. How optimized the database access is, what features the JPA provider supports, and how well the application makes use of JPA all have a big impact on an application's performance. EclipseLink and Oracle TopLink have a large set of performance and scalability features. I'm not going to go over all of these today, but you can browse my blog for other posts on some of them.
<p>
Finally, if you really want a benchmark showing EclipseLink is 15x faster than any other JPA implementation, (even ones that claim this fact), you can find it <a href="http://git.eclipse.org/c/eclipselink/examples/performance.git/tree/canned-benchmark">here</a>. When running this benchmark on my desktop, accessing a local Derby database, it does 1,016,884 queries per second. This is over 15x (88x in fact) faster than other JPA providers, even 145x faster than the self claimed "fastest JPA implementation on the face of the Earth".
<p>
Granted, EclipseLink contains an integrated, feature rich object cache, whereas other JPA providers do not. So I also ran the "benchmark" with caching disabled. Even then EclipseLink was 2.7x faster. Of course EclipseLink also supports statement caching, which other JPA providers do not, so I also disabled this. Again EclipseLink was 2.2x faster than the self claimed "fastest JPA implementation on the face of the Earth".
<p>
Was it a fair, objective benchmark that emulates a real application environment? Probably not... but don't take my, or anyone else's word for it.
Jameshttp://www.blogger.com/profile/07275512393744882781noreply@blogger.com4tag:blogger.com,1999:blog-6877629428951398731.post-81147967144540936482012-05-15T12:29:00.003-07:002013-04-10T12:21:06.042-07:00JPQL vs SQL, why not have bothOne of the most common questions I see on JPA is users wanting to know how to write some specific SQL query as a JPQL query. Some of the time they just need to learn JPQL, and their SQL can easily be converted to JPQL. Other times their SQL uses one of the many features of SQL that are not provided in JPQL, and their only option is to use a native SQL query.
<p/>
In EclipseLink 2.4, we have greatly enhanced our JPQL support to support most features of SQL. I refer to EclipseLink's JPQL extensions as the EclipseLink Query Language, or EQL.
<p/>
The JPQL support in EclipseLink 1.0 followed the JPA 1.0 BNF. It was not until the 2.1 release that we started adding some extensions, and removing restrictions. We introduced FUNC and TREAT in 2.1. FUNC allows calling any database function, and TREAT allows casting to a subclass for entities with inheritance.
<p/>
In the upcoming 2.4 release several new features have been added: (available for download today <a href="http://www.eclipse.org/eclipselink/downloads/milestones.php">here</a>)
<ul>
<li>ON</li>
<li>UNION</li>
<li>INTERSECT</li>
<li>EXCEPT</li>
<li>NULLS FIRST/LAST</li>
<li>CAST</li>
<li>EXTRACT</li>
<li>REGEXP</li>
<li>FUNCTION</li>
<li>OPERATOR</li>
<li>SQL</li>
<li>COLUMN</li>
<li>TABLE</li>
</ul>
<p/>
EQL not only allows usage of more of the SQL syntax and functionality, but also allows the mixing of SQL, and SQL constructs within JPQL. EQL provides a hybrid query language that is object-oriented and database platform independent, but can still access raw data and database specific functionality when required.
<p/>
<h3>ON</h3>
SQL defines an ON clause for joins, but JPA 2.0 JPQL does not.
JPQL does not require an ON clause because, when a relationship is joined, the ON clause comes from the join columns already defined in the mapping. Sometimes, however, it is desirable to append additional conditions to the join condition, normally in the case of outer joins.
<p/>
EclipseLink supports the ON clause, both for relationships joins, and to define joins between two independent objects.
<p/>
The JPA 2.1 draft defines an ON clause, but only on relationship joins, not on joins between independent objects.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT e FROM Employee e LEFT JOIN e.address a ON a.city = :city
SELECT e FROM Employee e LEFT JOIN MailingAddress a ON e.address = a.address</code></pre>
<p/>
<h3>UNION, INTERSECT, EXCEPT</h3>
SQL supports UNION, INTERSECT and EXCEPT, but JPA 2.0 JPQL does not.
Most unions can be done in terms of joins, but some cannot, and some are more difficult to express using joins.
<p/>
EclipseLink supports UNION, INTERSECT and EXCEPT, including the ALL option to include duplicates.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT MAX(e.salary) from Employee e where e.address.city = :city1
UNION SELECT MAX(e.salary) from Employee e where e.address.city = :city2
SELECT e from Employee e join e.phones p where p.areaCode = :areaCode1
INTERSECT SELECT e from Employee e join e.phones p where p.areaCode = :areaCode2
SELECT e from Employee e
EXCEPT SELECT e from Employee e WHERE e.salary > e.manager.salary
</code></pre>
<p/>
<h3>NULLS FIRST</h3>
SQL supports NULLS FIRST, and NULLS LAST ordering options but JPA 2.0 JPQL does not.
<p/>
EclipseLink supports NULLS FIRST, and NULLS LAST in the ORDER BY clause.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT e FROM Employee e LEFT JOIN e.manager m ORDER BY m.lastName NULLS FIRST
</code></pre>
<p/>
<h3>CAST</h3>
SQL supports a CAST function to convert between datatypes, JPA 2.0 JPQL does not support this function.
<p/>
EclipseLink supports CAST allowing any value to be cast to any database type supported by the database.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT CAST(e.salary NUMERIC(10,2)) FROM Employee e
</code></pre>
<p/>
<h3>EXTRACT</h3>
SQL supports an EXTRACT function for accessing date/time values, JPA 2.0 JPQL does not support any functions for accessing date/time values.
<p/>
EclipseLink supports EXTRACT allowing any database supported date/time part value to be extracted from the date/time.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT EXTRACT(YEAR, e.startDate) FROM Employee e
</code></pre>
<p/>
<h3>REGEXP</h3>
Regular expression comparisons are supported by many databases, although there is no standard SQL syntax that has been adopted by major databases yet. JPA 2.0 JPQL does not support regular expressions.
<p/>
EclipseLink supports REGEXP on Oracle, PostgreSQL, MySQL, MongoDB, and other supporting databases.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT e FROM Employee e WHERE e.lastName REGEXP '^Dr\.*'
</code></pre>
<p/>
<h3>FUNC and FUNCTION</h3>
SQL supports a lot more database functions than JPQL. Specific database vendors also provide their own set of functions. Users and libraries can also define their own database functions. JPA 2.0 JPQL provides no mechanism to call database specific functions.
The JPA 2.1 draft defines a special FUNCTION operator in JPQL to allow calling a specific database function.
<p/>
EclipseLink 2.1 provided this support using the FUNC operator. EclipseLink 2.4 will also provide the same functionality using the JPA 2.1 FUNCTION operator.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT FUNC('YEAR', e.startDate) AS year, COUNT(e) FROM Employee e GROUP BY year
</code></pre>
<p/>
<h3>OPERATOR</h3>
Some SQL functions have special syntax, such as EXTRACT, which takes a date part keyword (YEAR, MONTH, DAY), or CAST, which takes a type name (NUMERIC(10,2)). The FUNCTION operator cannot be used to call these special functions because of their special syntax. FUNCTION is also database specific, and does not allow the JPA provider to map the function to a different name on a different database platform.
<p/>
EclipseLink has support for over 80 database functions through defined EclipseLink ExpressionOperators. EclipseLink has always supported these operators using EclipseLink Expression queries, and now supports them using the JPQL OPERATOR keyword. OPERATOR allows calling any EclipseLink ExpressionOperator. EclipseLink ExpressionOperators are database independent, in that if a database provides an equivalent function it is used. ExpressionOperators support generating any syntax, allowing calls to database functions that require special syntax. The list of supported EclipseLink ExpressionOperators is <a href="http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Querying/JPQL#OPERATOR">here</a>. Users can also define their own ExpressionOperators.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT e FROM Employee e WHERE OPERATOR('ExtractXml', e.resume, '@years-experience') > 10
</code></pre>
<p/>
<h3>SQL</h3>
The SQL operator allows for any SQL to be embedded in the JPQL query. This allows a hybridization of JPQL and SQL, giving the advantages of both in the same query. Previously if any part of the query required something not supported by JPQL, the entire query would need to be rewritten as a native SQL query. Now JPQL can still be used, and the SQL operator can be used just for the parts that require SQL. The SQL operator accepts a variable number of arguments, which are translated into the SQL string using the <code>?</code> marker.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT p FROM Phone p WHERE SQL('CAST(? AS CHAR(3))', p.areaCode) = '613'
SELECT SQL('EXTRACT(YEAR FROM ?)', e.startDate) AS year, COUNT(e) FROM Employee e GROUP BY year
SELECT e FROM Employee e ORDER BY SQL('? NULLS FIRST', e.startDate)
SELECT e FROM Employee e WHERE e.startDate = SQL('(SELECT SYSDATE FROM DUAL)')
</code></pre>
<p/>
<h3>COLUMN</h3>
The COLUMN operator allows any unmapped column to be referenced in JPQL. This can be used to access unmapped columns such as foreign key columns, inheritance discriminators, or system columns such as ROWID.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT e FROM Employee e WHERE COLUMN('MANAGER_ID', e) = :id
SELECT e FROM Employee e WHERE COLUMN('ROWID', e) = :id
</code></pre>
<p/>
<h3>TABLE</h3>
The TABLE operator allows any unmapped table to be referenced in JPQL. This can be used to access join, collection, history, auditing, or system tables in JPQL queries.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT e, a.LAST_UPDATE_USER FROM Employee e, TABLE('AUDIT') a WHERE a.TABLE = 'EMPLOYEE' AND a.ROWID = COLUMN('ROWID', e)
</code></pre>
<p/>
<h3>JOIN FETCH</h3>
JPQL does not allow using an alias on a JOIN FETCH, and does not allow nested join fetches. EclipseLink allows these.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>SELECT e FROM Employee e JOIN FETCH e.address a ORDER BY a.city
SELECT e FROM Employee e JOIN FETCH e.manager m JOIN FETCH m.manager
</code></pre>
<p/>
<h3>Criteria API</h3>
You may be asking yourself, this is all great, but what about the Criteria API, can these extensions be used with that?
<p/>
EclipseLink 2.4 will provide a JpaCriteriaBuilder interface that allows access to EclipseLink specific functionality. Both Criteria queries and JPQL queries are translated to EclipseLink Expression queries before being translated to SQL. EclipseLink Expressions support all of the above functionality through their API. JpaCriteriaBuilder defines two APIs, toExpression() and fromExpression(), to convert between Criteria Expression objects and EclipseLink Expression objects.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>JpaCriteriaBuilder cb = (JpaCriteriaBuilder)em.getCriteriaBuilder();
CriteriaQuery<Employee> query = cb.createQuery(Employee.class);
Root<Employee> emp = query.from(Employee.class);
query.where(cb.fromExpression(cb.toExpression(emp).get("firstName").regexp("^Dr\.*")));
</code></pre>
<p/>
<h3>Summary</h3>
EQL offers a lot of functionality to make querying easier and more powerful. There are still a few features of SQL that would be difficult to express with EQL, but the majority of SQL is feasible. Of course, there is nothing in JPA that forces you to use JPQL, and if you love SQL, you are still free to use native SQL queries.Jameshttp://www.blogger.com/profile/07275512393744882781noreply@blogger.com13tag:blogger.com,1999:blog-6877629428951398731.post-70209609576075835282012-04-23T11:43:00.001-07:002013-04-10T12:22:09.690-07:00Objects vs Data, and Filtering a JOIN FETCHJPA deals with objects, and SQL deals with rows. Some developers love JPA because it allows them to use objects in an object-oriented fashion. However, other developers have trouble understanding JPA precisely because it is object-oriented.<br />
<br />
Even though object-oriented programming has been popular for several decades, there are still a lot of procedural languages out there. Developers used to procedural programming have a different mindset than object-oriented developers. This is also true of developers with a lot of experience with SQL. JPA and JPQL can be difficult to understand for developers used to SQL and JDBC.<br />
<br />
This can lead to some odd usages and misunderstandings of JPA. In this blog post, I would like to highlight a couple of these misunderstandings, and provide a solution to my favorite, which I call <a href="#Filtering"><b><i>Filtering a JOIN FETCH</i></b></a>.<br />
<br />
<h3>Dobject Model</h3>
One JPA usage that I find aggravating is what I call the Dobject model. This is a data model that has been made into an object model. Sometimes this comes across as a class that has the same name as the database table, with all of the same field names, and no relationships. Sometimes there are relationships, but they have the same name as the foreign key columns.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Entity
public class EMP {
@Id
long EMP_ID;
String F_NAME;
String L_NAME;
long MGR_ID;
long ADDR_ID;
}
</code></pre><br />
The above is an unusual class, and not very object-oriented. It is probably not as useful as it could be if it had relationships instead of foreign keys.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Entity
public class employees {
@Id
long empId;
String fName;
String lName;
@ManyToOne
Employee mgrId;
@OneToOne
Address addrId;
}
</code></pre><br />
This one is very confused. First of all, it seems to be named after its table, where the name <code>employees</code> might make sense, but an object is a single entity, so it should not be pluralized. Also, classes in Java should start with an upper case letter, as classes are proper names of the real world entities that they represent.<br />
<br />
This class at least has relationships, which I suppose is an improvement, but they are named like they are foreign keys, not objects. This normally leads the user to try to query them as values instead of as objects.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select e from employees e where e.mgrId = 4
</code></pre><br />
This does not work, because <code>mgrId</code> is an Employee object, not an Id value. The relationship should be named after what it represents, i.e. <code>manager</code>, not the foreign key.<br />
<br />
A more object-oriented way to define the class would be:<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Entity
public class Employee {
@Id
long id;
String firstName;
String lastName;
@ManyToOne(fetch=FetchType.LAZY)
Employee manager;
@OneToMany(mappedBy="manager")
List<Employee> managedEmployees;
@OneToOne
Address address;
}
</code></pre><br />
<h3>How not to write a DAO</h3>
In JPA you do not normally execute queries to insert, update or delete objects. To update an object you just find or query it, change it through its set methods, and commit the transaction. JPA automatically keeps track of what changed, and updates what is required to be updated, in the order that it is required to be updated.
<p/>
To insert an object you call persist() on the EntityManager. To delete an object you call remove() on the EntityManager. This is different from SQL or JDBC, which require you to execute a query to perform any modification.
<p/>
JPA does allow UPDATE and DELETE queries through JPQL. These are for batch updates and deletes, not for the deletion or updating of single objects. This can lead to the very confused Data Access Object below:
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>public class EmployeeDAO {
public void insert(Employee employee) {
em.persist(employee);
}
public void update(Employee employee) {
Query query = em.createQuery("Update Employee e set e.firstName = :firstName, e.lastName = :lastName where e.id = :id");
query.setParameter("id", employee.getId());
query.setParameter("firstName", employee.getFirstName());
query.setParameter("lastName", employee.getLastName());
query.executeUpdate();
}
public void delete(Employee employee) {
Query query = em.createQuery("Delete from Employee e where e.id = :id");
query.setParameter("id", employee.getId());
query.executeUpdate();
}
}
</code></pre><br />
This is wrong. You do not execute queries to update or delete objects. This will leave the objects in your persistence context in an invalid state, as they are not aware of the query updates. It is also not using JPA correctly, or as it was intended, and not benefiting from its full functionality.
A better Data Access Object would be:
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>public class EmployeeDAO {
public void persist(Employee employee) {
em.persist(employee);
}
public Object merge(Employee employee) {
return em.merge(employee);
}
public void remove(Employee employee) {
em.remove(employee);
}
}
</code></pre><br />
Note that there is no update(). JPA does not have or require an update(); merge() can be used for detached objects, but it is not the equivalent of update(). You do not need to call update in JPA; this is one of its benefits.
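<p/>
For example, updating an object requires no update call at all (a sketch, assuming a resource-local transactional EntityManager <code>em</code> and illustrative entity and variable names):
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>em.getTransaction().begin();
Employee employee = em.find(Employee.class, id);
employee.setLastName("Smith"); // the change is tracked automatically
em.getTransaction().commit(); // JPA computes and executes the required UPDATE
</code></pre>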
<p/>
<h3>JPQL vs SQL</h3>
JPQL is not SQL. It looks a lot like SQL, has similar syntax and uses the same standard naming for operators and functions, but it is not SQL. This can be very confusing for someone experienced with SQL. When they try to use their SQL in place of JPQL it does not work.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select * from Employee e join Address a on e.addressId = a.id where a.city = 'Ottawa'
</code></pre><br />
This is SQL, not JPQL.<br />
The equivalent JPQL would be:<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select e from Employee e where e.address.city = 'Ottawa'
</code></pre><br />
Of course, if you prefer SQL, JPA fully allows you to use SQL; you just need to call <code>createNativeQuery</code> instead of <code>createQuery</code>. However, most users prefer to use JPQL. I suppose this is because JPQL lets them deal with objects, and even if they don't quite understand objects, they do understand there is some benefit there.<br />
<br />
JPQL also defines the JOIN syntax, but it does not have an ON clause, and JOIN is based on relationships, not foreign keys.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select e from Employee e join e.address a where a.city = 'Ottawa'
</code></pre><br />
The JOIN syntax in JPQL allows you to query collection relationships, and to use OUTER joins and FETCH. A JOIN FETCH allows you to read an object and its relationships in a single query (as opposed to the possible dreaded N+1 queries).<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select e from Employee e left join fetch e.address where e.address.city = 'Ottawa'
</code></pre><br />
Notice that I did not use an alias on this JOIN FETCH. This is because JPQL does not allow one, which I will get into later. There is also no ON clause in JPQL, because the relationship is always joined by the foreign keys defined in the relationship's join columns in its mapping.
<p/>
Sometimes it is desirable to place additional conditions in a JOIN ON clause. This is normally the case with OUTER joins, where placing the condition in the WHERE clause would result in the empty joins being filtered out. An ON clause is part of the JPA 2.1 draft, so it is coming. EclipseLink already supports the ON clause, as well as aliasing a JOIN FETCH, in its 2.4 development milestones; see:<br />
<br />
<a href="http://www.eclipse.org/eclipselink/downloads/milestones.php">www.eclipse.org/eclipselink/downloads/milestones</a><br />
<br />
<h3 id="Filtering">Filtering a JOIN FETCH</h3>
A common misunderstanding I see users make occurs when querying a OneToMany relationship.<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select d from Department d join d.employees e where e.address.city = 'Ottawa'
</code></pre><br />
This query results in all department objects that have any employee living in Ottawa. The confusion comes when they access the department's employees. Each department contains all of its employees, however they were expecting just the employees that live in Ottawa.<br />
<br />
This is because JPA deals with objects, and a Department object represents a specific department, and has a specific identity. If I issue two queries for a particular department, I get back the same identical (==) instance (provided both queries use the same EntityManager). A specific department always has the same employees; it represents the real-world department, and does not change just because you queried it differently. This is important for caching, but also within the same persistence context: if you query the same department two different ways, you should get back the same exact department.<br />
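This identity guarantee can be sketched with a plain map playing the role of the persistence context's identity map (MiniContext and Department here are hypothetical stand-ins, not EclipseLink classes):

```java
import java.util.HashMap;
import java.util.Map;

public class IdentitySketch {
    static class Department {
        final int id;
        Department(int id) { this.id = id; }
    }

    // A minimal stand-in for a persistence context: one instance per id.
    static class MiniContext {
        private final Map<Integer, Department> identityMap = new HashMap<>();
        Department find(int id) {
            // Real providers check the identity map before going to the database.
            return identityMap.computeIfAbsent(id, Department::new);
        }
    }

    public static void main(String[] args) {
        MiniContext em = new MiniContext();
        Department byId = em.find(42);         // "query 1"
        Department byOtherQuery = em.find(42); // "query 2", different route, same id
        System.out.println(byId == byOtherQuery); // identical (==) instance
    }
}
```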
<br />
If you really want the department, and only the employees of the department that live in Ottawa, you can use the following query:<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Select d, e from Department d join d.employees e where e.address.city = 'Ottawa'
</code></pre><br />
This will give you a List of Object[] that contain the Department and the Employee. For each department you will get back n Object[] (rows), where n is the number of its employees that live in Ottawa. The same department will be duplicated n times, each paired with a different employee. If you access the employees of any of the departments, they will still contain all of that department's employees, not just the ones in Ottawa.
<p/>
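If you do want per-department employee lists without altering the Department objects, one option is to group the Object[] rows in memory; a minimal plain-Java sketch, with strings standing in for the entities:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupRowsSketch {
    public static void main(String[] args) {
        // Hypothetical result of "Select d, e ..." : each row is an Object[] of {department, employee}.
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[]{"Sales", "Bob"});
        rows.add(new Object[]{"Sales", "Ann"});   // same department repeated, one row per matching employee
        rows.add(new Object[]{"Support", "Joe"});

        // Collapse the pairs into department -> employees that matched the WHERE clause.
        Map<String, List<String>> byDepartment = new LinkedHashMap<>();
        for (Object[] row : rows) {
            byDepartment.computeIfAbsent((String) row[0], k -> new ArrayList<>())
                        .add((String) row[1]);
        }
        System.out.println(byDepartment);
    }
}
```

This keeps the entities untouched while still giving you the filtered view of employees per department.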
But, if you really, really want the department to only have the employees that live in Ottawa, you can do this. I'm not sure I would recommend it, but it is possible, at least in EclipseLink 2.4 it will be. EclipseLink allows you to use an alias on a JOIN FETCH. This support was intended for OneToOne and ManyToOne relationships, to avoid having to join twice just to get an alias, as well as to allow using the alias in an ORDER BY or in other ways that would not filter the results and alter how the objects are built. But, there is nothing stopping you from using it with a OneToMany to filter the contents of the fetched OneToMany results.
<p/>
If you are going to do this, you should be careful, and at least ensure you set the "javax.persistence.cache.storeMode" and "javax.persistence.cache.retrieveMode" to BYPASS, to avoid corrupting the shared cache.
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Query query = em.createQuery("Select d from Department d join fetch d.employees e where e.address.city = 'Ottawa'");
query.setHint("javax.persistence.cache.storeMode", "BYPASS");
query.setHint("javax.persistence.cache.retrieveMode", "BYPASS");
List<Department> departmentsWithFilteredEmployees = query.getResultList();
</code></pre><br />
EclipseLink 2.4 is currently under development, but its milestone builds already contain this functionality.
The EclipseLink 2.4 milestone builds can be downloaded from:<br />
<a href="http://www.eclipse.org/eclipselink/downloads/milestones.php">www.eclipse.org/eclipselink/downloads/milestones</a><br />
<br />
For more information on the many JPQL extensions and enhancements in EclipseLink 2.4 see:<br />
<a href="http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Querying/JPQL">wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Querying/JPQL</a><br />
<br />
<h3>EclipseLink JPA supports MongoDB (2012-04-02)</h3>
EclipseLink 2.4 will support JPA access to NoSQL databases. This support is already part of the EclipseLink development trunk and can be tried out using the <a href="http://www.eclipse.org/eclipselink/downloads/milestones.php">milestone</a> or <a href="http://www.eclipse.org/eclipselink/downloads/nightly.php">nightly</a> builds. Initial support is provided for MongoDB and Oracle NoSQL. A pluggable platform and adapter layer allows other databases to be supported.<br />
<br />
NoSQL is a classification of database systems that do not conform to the relational database or SQL standard. They have various roots, from distributed internet databases, to object databases, XML databases and even legacy databases. They have recently become popular because of their use in large-scale distributed databases at Google, Amazon, and Facebook.<br />
<br />
There are various NoSQL databases including:<br />
<ul><li>MongoDB</li>
<li>Oracle NoSQL</li>
<li>Cassandra</li>
<li>Google BigTable</li>
<li>CouchDB</li>
</ul><br />
EclipseLink's NoSQL support allows the JPA API and JPA annotations/xml to be used with NoSQL data. EclipseLink also supports several NoSQL-specific annotations/xml, including @NoSql, which defines a class as mapping to NoSQL data.<br />
<br />
EclipseLink's NoSQL support is based on previous EIS support offered since EclipseLink 1.0. EclipseLink's EIS support allowed persisting objects to legacy and non-relational databases. EclipseLink's EIS and NoSQL support uses the Java Connector Architecture (JCA) to access the data-source similar to how EclipseLink's relational support uses JDBC. EclipseLink's NoSQL support is extendable to other NoSQL databases, through the creation of an EclipseLink EISPlatform class and a JCA adapter.<br />
<br />
Let's walk through an example of using EclipseLink's NoSQL support to persist an ordering system's object model to a MongoDB database.<br />
<br />
The source for the example can be found <a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.persistence/trunk/examples/org.eclipse.persistence.example.jpa.nosql.mongo/org.eclipse.persistence.example.jpa.nosql.mongo.zip">here</a>, or from the EclipseLink SVN <a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.persistence/trunk/examples/org.eclipse.persistence.example.jpa.nosql.mongo/">repository</a>.<br />
<br />
<h4>Ordering object model</h4>The ordering system consists of four classes, Order, OrderLine, Address and Customer. The Order has a billing and shipping address, many order lines, and a customer.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
<code>public class Order">
text-align:left;"><code>public class Order implements Serializable {
    private String id;
    private String description;
    private double totalCost = 0;
    private Address billingAddress;
    private Address shippingAddress;
    private List<OrderLine> orderLines = new ArrayList<OrderLine>();
    private Customer customer;
    ...
}
public class OrderLine implements Serializable {
    private int lineNumber;
    private String description;
    private double cost = 0;
    ...
}
public class Address implements Serializable {
    private String street;
    private String city;
    private String province;
    private String country;
    private String postalCode;
    ...
}
public class Customer implements Serializable {
    private String id;
    private String name;
    ...
}
</code></pre><br />
<h4>Step 1 : Decide how to store the data</h4>There is no standard on how NoSQL databases store their data. Some NoSQL databases only support key/value pairs, others support structured hierarchical data such as JSON or XML.<br />
<br />
MongoDB stores data as BSON (binary JSON) documents. The first decision that must be made is how to store the objects. Normally each independent object would compose a single document, so a single document could contain Order, OrderLine and Address. Since customers can be shared amongst multiple orders, Customer would be its own document.<br />
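That layout can be sketched in plain Java, with Maps standing in for BSON documents; the field names and id values below are illustrative, not EclipseLink's actual defaults:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DocumentLayoutSketch {
    public static void main(String[] args) {
        // Maps stand in for BSON documents; field names are illustrative only.
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("CITY", "Ottawa");

        Map<String, Object> line = new LinkedHashMap<>();
        line.put("LINENUMBER", 1);
        line.put("DESCRIPTION", "pinball machine");
        line.put("COST", 2500.0);

        Map<String, Object> order = new LinkedHashMap<>();
        order.put("SHIPPINGADDRESS", address);  // Address nests inside the Order document
        List<Object> lines = new ArrayList<>();
        lines.add(line);
        order.put("ORDERLINES", lines);         // the OrderLine collection nests as well
        order.put("CUSTOMER_ID", "c123");       // Customer is a separate document, referenced by id

        System.out.println(order.containsKey("SHIPPINGADDRESS")); // embedded, not a reference
        System.out.println(order.get("CUSTOMER_ID"));             // external reference
    }
}
```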
<br />
<h4>Step 2 : Map the data</h4>The next step is to map the objects. Each root object in the document will be mapped as an @Entity in JPA. The objects that are stored by being embedded within their parent's document are mapped as @Embeddable. This is similar to how JPA maps relational data, but in NoSQL embedded data is much more common because of the hierarchical nature of the data format. In summary, Order and Customer are mapped as @Entity, OrderLine and Address are mapped as @Embeddable.<br />
<br />
The @NoSql annotation is used to map NoSQL data. This tags the classes as mapping to NoSQL data instead of traditional relational data. It is required on each persistent class, both entities and embeddables. The @NoSql annotation allows the dataType and the dataFormat to be set.<br />
<br />
The dataType is the equivalent of the table in relational data, its meaning can differ depending on the NoSQL data-source being used. With MongoDB the dataType refers to the collection used to store the data. The dataType is defaulted to the entity name (as upper case), which is the simple class name.<br />
<br />
The dataFormat depends on the type of data being stored. Three formats are supported by EclipseLink, XML, Mapped, and Indexed. XML is the default, but since MongoDB uses BSON, which is similar to a Map in structure, Mapped is used. In summary, each class requires the @NoSql(dataFormat=DataFormatType.MAPPED) annotation.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Entity
@NoSql(dataFormat=DataFormatType.MAPPED)
public class Order
@Embeddable
@NoSql(dataFormat=DataFormatType.MAPPED)
public class OrderLine
</code></pre><br />
<h4>Step 3 : Define the Id</h4>JPA requires that each Entity define an Id. The Id can either be a natural id (an application-assigned id) or a generated id (an id assigned by EclipseLink). MongoDB also requires an _id field in every document. If no _id field is present, then Mongo will auto-generate and assign the _id field using an OID (object identifier), which is similar to a UUID (universally unique identifier).<br />
<br />
You are free to use any field or set of fields as your Id in EclipseLink with NoSQL, the same as with a relational Entity. To use an application-assigned id as the Mongo id, simply name its field "_id". This can be done through the @Field annotation, which is similar to the @Column annotation (which will also work), but without all of the relational details; it has just a name. So, to define the field Mongo will use for the id, include @Field(name="_id") in your mapping.<br />
<br />
To use the generated Mongo OID as your JPA Id, simply include @Id, @GeneratedValue, and @Field(name="_id") on your object's id field mapping. The @GeneratedValue tells EclipseLink to use the Mongo OID to generate the id value. @SequenceGenerator and @TableGenerator are not supported with MongoDB, so these cannot be used, nor are the generation types IDENTITY, TABLE and SEQUENCE. You can use the EclipseLink @UUIDGenerator if you wish to use a UUID instead of the Mongo OID. You can also use your own custom generator. The id value for a Mongo OID or a UUID is not numeric; it can only be mapped as String or byte[].<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Id
@GeneratedValue
@Field(name="_id")
private String id;
</code></pre><br />
<h4>Step 4 : Define the mappings</h4>Each attribute in your object has to be mapped. If no annotation/xml is defined for an attribute, then its mapping will be defaulted. Defaulting rules for NoSQL data follow the JPA defaulting rules, so most simple mappings do not require any configuration if defaults are used. The field names used in the Mongo BSON document will mirror the object attribute names (as uppercase). To provide a different BSON field name, the @Field annotation is used.<br />
<br />
Any embedded value stored in the document is persisted using the JPA @Embedded annotation. An embedded collection uses the JPA @ElementCollection annotation. The @CollectionTable of the @ElementCollection is not used or supported in NoSQL; as the data is stored within the document, no separate table is required. The @AttributeOverride is also neither required nor supported with NoSQL, as the embedded objects are nested in the document and do not require unique field names. The @Embedded annotation/xml is normally not required, as it is defaulted; the @ElementCollection is required, as defaulting does not currently work for @ElementCollection in EclipseLink.<br />
<br />
The relationship annotations/xml @OneToOne, @ManyToOne, @OneToMany, and @ManyToMany are only to be used with external relationships in NoSQL. Relationships within the document use the embedded annotations/xml. External relationships are supported to other documents. To define an external relationship a foreign key is used. The id of the target object is stored in the source object's document. In the case of a collection, a collection of ids is stored. To define the name of the foreign key field in the BSON document the @JoinField annotation/xml is used.<br />
<br />
The mappedBy option on relationships is not supported for NoSQL data; for bi-directional relationships, the foreign keys would need to be stored on both sides. It is also possible to define a relationship mapping using a query, but this is not currently supported through annotations/xml, only through a DescriptorCustomizer.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Basic
private String description;
@Basic
private double totalCost = 0;
@Embedded
private Address billingAddress;
@Embedded
private Address shippingAddress;
@ElementCollection
private List<OrderLine> orderLines = new ArrayList<OrderLine>();
@ManyToOne(fetch=FetchType.LAZY)
private Customer customer;
</code></pre><br />
<h4>Step 5 : Optimistic locking</h4>Optimistic locking is supported with MongoDB. It is not required, but if locking is desired, the @Version annotation can be used.<br />
<br />
Note that MongoDB does not support transactions, so if a lock error occurs during a transaction, any objects that have been previously written will not be rolled back.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Version
private long version;
</code></pre><br />
<h4>Step 6 : Querying</h4>MongoDB has its own JSON-based query-by-example language. It does not support SQL (i.e. NoSQL), so querying has limitations.<br />
<br />
EclipseLink supports both JPQL and the Criteria API on MongoDB. Not all aspects of JPQL are supported: most basic operations are supported, but joins to other documents, sub-selects, group bys, and certain database functions are not. Querying embedded values and element collections is supported, as well as ordering, like, and selecting attribute values.<br />
<br />
Not all NoSQL database support querying, so EclipseLink's NoSQL support only supports querying if the NoSQL platform supports it.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
<code>Query query">
text-align:left;"><code>Query query = em.createQuery("Select o from Order o where o.totalCost > 1000");
List<Order> orders = query.getResultList();

query = em.createQuery("Select o from Order o where o.description like 'Pinball%'");
orders = query.getResultList();

query = em.createQuery("Select o from Order o join o.orderLines l where l.description = :desc");
query.setParameter("desc", "shipping");
orders = query.getResultList();
</code></pre><br />
Native queries are also supported in EclipseLink NoSQL. For MongoDB the native query is in MongoDB's command language.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>Query query = em.createNativeQuery("db.ORDER.findOne({\"_id\":\"" + oid + "\"})", Order.class);
Order order = (Order)query.getSingleResult();
</code></pre><br />
<h4>Step 7 : Connecting</h4>The connection to a Mongo database is done through the JPA persistence.xml properties. The "eclipselink.target-database" property must define the Mongo platform "org.eclipse.persistence.nosql.adapters.mongo.MongoPlatform". A connection spec must also be defined through "eclipselink.nosql.connection-spec" to be "org.eclipse.persistence.nosql.adapters.mongo.MongoConnectionSpec". Other properties can also be set such as the "eclipselink.nosql.property.mongo.db", "eclipselink.nosql.property.mongo.host" and "eclipselink.nosql.property.mongo.port". The host and port can accept a comma separated list of values to connect to a cluster of Mongo databases.<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
<code><persistence-unit">
text-align:left;"><code><persistence-unit name="mongo-example" transaction-type="RESOURCE_LOCAL">
    <class>model.Order</class>
    <class>model.OrderLine</class>
    <class>model.Address</class>
    <class>model.Customer</class>
    <properties>
        <property name="eclipselink.target-database" value="org.eclipse.persistence.nosql.adapters.mongo.MongoPlatform"/>
        <property name="eclipselink.nosql.connection-spec" value="org.eclipse.persistence.nosql.adapters.mongo.MongoConnectionSpec"/>
        <property name="eclipselink.nosql.property.mongo.port" value="27017"/>
        <property name="eclipselink.nosql.property.mongo.host" value="localhost"/>
        <property name="eclipselink.nosql.property.mongo.db" value="mydb"/>
        <property name="eclipselink.logging.level" value="FINEST"/>
    </properties>
</persistence-unit>
</code></pre><br />
<h4>Summary</h4>The full source code to this demo is available from <a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.persistence/trunk/examples/org.eclipse.persistence.example.jpa.nosql.mongo/">SVN</a>.<br />
<br />
To run the example you will need a Mongo database, which can be downloaded from <a href="http://www.mongodb.org/downloads">http://www.mongodb.org/downloads</a>.<br />
<br />
EclipseLink also supports NoSQL access to other data-sources, including:<br />
<ul><li>Oracle NoSQL<br />
<li>XML files<br />
<li>JMS<br />
<li>Oracle AQ<br />
</ul>
<h3>NoSQL (2012-03-29)</h3>
In the beginning data was free and wild. It was not confined to rows and columns and not bound by standardization. Data access was unruly and proprietary. These were the first "NoSQL" databases. They consisted of flat file, hierarchical and network databases such as VSAM, IMS and Adabas.<br />
<br />
Then there was SQL, and things were good.<br />
<br />
SQL was developed during the golden age of data in the 1970s. Database access became standardized through the SQL language and the relational model. The 1970s saw the birth of relational database products such as System R, Ingres and Oracle. The 1980s saw ANSI standardization of the SQL language, the release of DB2, and the adoption of client-server computing.<br />
<br />
However, the legacy databases still existed, as well as the legacy applications that accessed them. New applications needed to access the old data, and this was in general a very painful experience.<br />
<br />
Back in the good old Smalltalk days during the 1990s, Smalltalk was unofficially adopted as the programming language of choice for large corporate projects. It was the beginning of the commercial adoption of object-oriented programming, in Smalltalk, C++ and other OO languages. Things were great, but there was a dark side. All of the data was stored in relational databases, or worse, legacy mainframe databases. Fitting round objects into square relational tables was difficult and cumbersome. Two solutions emerged: object-oriented databases, and object-relational mapping.<br />
<br />
New commercial object-oriented database management systems (OODBMS) emerged in the 1990s, including Versant, Gemstone and ObjectStore. They were integrated with their respective languages, Smalltalk and C++, and stored data as it was represented in memory, instead of in the relational model. These were the 2nd generation of "NoSQL" databases. There was little standardization and solutions were mainly proprietary. Access to the data from non object-oriented languages was difficult. The world did not adopt this new model, as it had previously adopted the relational model. The world's data remained in the trusted, standardized and universally accessible relational model.<br />
<br />
Object-relational mapping allowed objects to be used in the programming model, but converted them to relational rows and SQL when persisted. A lot of OR mapping frameworks were built, including many corporate in-house solutions. TopLink, a product from The Object People, became the leading OR mapper for Smalltalk. In C++ there was Persistence, as well as various other products in various languages.<br />
<br />
Although the relational model was the industry standard for any new applications, much of the world's data remained in mainframe databases. The data was slowly being migrated, but most corporations still had mainframe data. Consulting at TopLink clients in the 90s, I found most clients were building applications on relational databases, but still had to get some data from the mainframe. This is when we created the first version of TopLink's "NoSQL" support. Of course NoSQL was not a buzz word at the time, so the offering was called TopLink for the Mainframe. The main problem was that everyone's mainframe data and access was different, so the product involved lots of consulting.<br />
<br />
When Java came along, TopLink moved from Smalltalk to Java. OR mapping became very popular in Java and many new products came to market. The first real OR standard came in the form of EJB CMP. It had its "issues" to say the least, and was coupled with the J2EE platform. A new competing standard, JDO, was created in retaliation to CMP. To reconcile the issue of having two competing Java standards, JPA was created to replace them both, and was adopted by most OR mapping products.<br />
<br />
In response to the popularity of object-oriented computing, the relational database vendors created the object-relational paradigm. This allowed storage of structured object types and collections in relational tables. SQL3 (SQL 1999) defined new query syntax to access this data. Despite some initial hype, the object-relational paradigm was not successful, and although the features remain in Oracle, DB2 and Postgres, the world stayed with the trusted relational model.<br />
<br />
The panic around Y2K had the good fortune of getting most corporations and governments off mainframe databases, and into relational databases. Some legacy data still remained, so we also offered TopLink for the Mainframe in Java. At that time the Internet was taking off, and XML was becoming popular. Since XML is hierarchical data that you could convert any mainframe data to, it became part of our solution for accessing legacy data and the TopLink SDK was born.<br />
<br />
With the explosion of the Internet, XML was becoming increasingly popular. This led once again to the questioning of the relational model, and the creation of XML databases (the 3rd generation of NoSQL). There were several XML databases that achieved much hype, but limited market success. The relational database vendors responded by adding support for storage of XML in relational tables.<br />
<br />
Again the world stayed with the relational model.<br />
<br />
The TopLink SDK also provided simple object to XML mapping, perhaps the first such product to do so. As XML usage in Java became mainstream, the TopLink SDK was split into two products. TopLink MOXy became TopLink's object-to-XML mapping solution. TopLink EIS became TopLink's legacy data persistence solution.<br />
<br />
Around 2009 the term NoSQL was used to categorize the new distributed databases being used at Google, Amazon and Facebook. These databases characterized themselves as being highly scalable, not adhering to ACID transaction semantics, and having limited querying. The NoSQL term grew to include the various other non-relational databases that have emerged throughout the ages.<br />
<br />
Is the relational model dead? Will the world switch to the NoSQL model, and will data once again be free? Only time will tell. If history teaches us anything, one would expect the relational model to persist. NoSQL has already been renamed in some circles to "Not Only SQL", to leave room for the NoSQL databases to support the SQL standard. In fact, some NoSQL databases already have support for JDBC drivers. My intuition is a union of the two models; perhaps this has already begun, with some NoSQL databases adding SQL support, and some relational databases extending their clustering support, such as MySQL Cluster.<br />
<br />
EclipseLink 2.4 will contain JPA support for NoSQL databases. Initial support will include MongoDB and Oracle NoSQL. This support is already available in the EclipseLink nightly builds. Technically, this is not new functionality, as EclipseLink (formerly TopLink) has been supporting non-relational data for over a decade, but the JPA support is new.<br />
<br />
In the upcoming months I will be blogging about some of the new features in EclipseLink to support NoSQL. This blog post is solely an introduction, so sorry to those expecting hard content.<br />
<br />
<h3>How to improve JPA performance by 1,825% (2011-06-09)</h3>
The Java Persistence API (JPA) provides a rich persistence architecture. JPA hides much of the low-level drudgery of database access, freeing application developers from worrying about the database and allowing them to concentrate on developing the application. However, this abstraction can lead to poor performance if the application programmer does not consider how their implementation affects database usage.<br />
<br />
JPA provides several optimization features and techniques, and some pitfalls waiting to snag the unwary developer. Most JPA providers also provide a plethora of additional optimization features and options. In this blog entry I will explore the various optimization options and techniques, and a few of the common pitfalls.<br />
<br />
The application is a simulated database migration from a MySQL database to an Oracle database. Perhaps there are more optimal ways to migrate a database, but it is surprising how good JPA's performance can be, even when processing hundreds of thousands or even millions of records. Perhaps it is not a straightforward migration, or the application's business logic is required, or perhaps the application is already persisted through JPA, so using JPA to migrate the database is just easiest. Regardless, this fictitious use case is a useful demonstration of how to achieve good performance with JPA.<br />
<br />
The application consists of an Order processing database. The model contains a Customer, Order and OrderLine. The application reads all of the Orders from one database, and persists them to the second database. The source code for the example can be found <a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.persistence/trunk/examples/org.eclipse.persistence.example.jpa.performance/">here</a>.<br />
<br />
The initial code for the migration is pretty simple:<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
<code>EntityManagerFactory emf">
text-align:left;"><code>EntityManagerFactory emf = Persistence.createEntityManagerFactory("order");
EntityManager em = emf.createEntityManager();
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
Query query = emOld.createQuery("Select o from Order o");
List<Order> orders = query.getResultList();
em.getTransaction().begin();
// Reset old Ids, so they are assigned from the new database.
for (Order order : orders) {
    order.setId(0);
    order.getCustomer().setId(0);
}
for (Order order : orders) {
    em.persist(order);
    for (OrderLine orderLine : order.getOrderLines()) {
        em.persist(orderLine);
    }
}
em.getTransaction().commit();
em.close();
emOld.close();
emf.close();
emfOld.close();
</code></pre><br />
The example test runs this migration using 3 variables for the number of Customers, Orders per Customer, and OrderLines per Order. So, 1000 customers, each with 10 orders, and each with 10 order lines, would be 111,000 objects.<br />
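The object count for a run is simple arithmetic:

```java
public class ObjectCountSketch {
    public static void main(String[] args) {
        int customers = 1000;
        int ordersPerCustomer = 10;
        int linesPerOrder = 10;

        int orders = customers * ordersPerCustomer;  // 10,000 orders
        int orderLines = orders * linesPerOrder;     // 100,000 order lines
        int total = customers + orders + orderLines; // all persisted objects
        System.out.println(total);
    }
}
```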
<br />
The test was run on a virtualized 64 bit Oracle Sun server with 4 virtual cores and 8 gigs of RAM. The databases run on similar machines. The test is single threaded, running in Oracle Sun JDK 1.6. The tests are run using EclipseLink JPA 2.3, and migrating from a MySQL database to an Oracle database.<br />
<br />
This code functions fine for a small database migration, but as the database size grows, some issues become apparent. It actually handles 100,000 objects surprisingly well, taking about 2 minutes. This is surprising, given it is thoroughly unoptimized and persists all 100,000 objects in a single persistence context and transaction.<br />
<br />
<h4>Optimization #1 - Agent</h4>EclipseLink implements LAZY for OneToOne and ManyToOne relationships using byte code weaving. EclipseLink also uses weaving to perform many other optimizations, such as change tracking and fetch groups. The JPA specification provides the hooks for weaving in EJB 3 compliant application servers, but in Java SE or other application servers weaving is not performed by default. To enable EclipseLink weaving in Java SE for this example the EclipseLink agent is used. This is done using the Java <code>-javaagent:eclipselink.jar</code> option. If dynamic weaving is unavailable in your environment, another option is to use static weaving, for which EclipseLink provides an ant task and command line utility.<br />
<br />
<h4>Optimization #2 - Pagination</h4>In theory, at some point you should run out of memory by bringing the entire database into memory in a single persistence context. So next I increased the size to 1 million objects, and this gave the expected out-of-memory error. Interestingly, this was with only a 512 meg heap. If I had used the entire 8 gigs of RAM, I could, in theory, have persisted around 16 million objects in a single persistence context. If I gave the virtualized machine the full 98 gigs of RAM available on the server, perhaps it would even be possible to persist 100 million objects. Perhaps the days when it made no sense to pull an entire database into RAM are behind us, and perhaps this is no longer such a crazy thing to do. But, for now, let's assume it is an idiotic thing to do, so how can we avoid it?<br />
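The heap estimates above are just proportional scaling from the observed 1 million objects in a 512 meg heap (a rough linear assumption):

```java
public class HeapScalingSketch {
    public static void main(String[] args) {
        // Roughly 1 million objects filled a 512 meg heap; assume linear scaling.
        long objectsPerMeg = 1_000_000L / 512;          // ~1953 objects per meg
        System.out.println(objectsPerMeg * 8 * 1024);   // 8 gig heap -> ~16 million objects
        System.out.println(objectsPerMeg * 98 * 1024);  // 98 gig heap -> well over 100 million
    }
}
```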
<br />
JPA provides a pagination feature that allows a subset of a query to be read. This is supported in JPA through the <code>Query</code> <code>setFirstResult</code> and <code>setMaxResults</code> API. So instead of reading the entire database in one query, the objects will be read page by page, and each page will be persisted in its own persistence context and transaction. This avoids ever having to read the entire database, and should also, in theory, make the persistence context more efficient by reducing the number of objects it needs to process together.<br />
<br />
Switching to pagination is relatively easy for the original orders query, but some issues crop up with the relationship to Customer. Since orders can share the same customer, it is important that each order does not insert a new customer, but uses the existing one. If the customer for the order was already persisted on a previous page, then the existing one must be used. This requires a query to find the matching customer in the new database, which introduces some performance issues we will discuss later.<br />
<br />
The updated code for the migration using pagination is:<br />
<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>EntityManagerFactory emf = Persistence.createEntityManagerFactory("order");
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
Query query = emOld.createQuery("Select o from Order o order by o.id");
int pageSize = 500;
int firstResult = 0;
query.setFirstResult(firstResult);
query.setMaxResults(pageSize);
List<Order> orders = query.getResultList();
boolean done = false;
while (!done) {
    if (orders.size() < pageSize) {
        done = true;
    }
    EntityManager em = emf.createEntityManager();
    em.getTransaction().begin();
    Query customerQuery = em.createNamedQuery("findCustomByName");
    // Reset old Ids, so they are assigned from the new database.
    for (Order order : orders) {
        order.setId(0);
        customerQuery.setParameter("name", order.getCustomer().getName());
        try {
            Customer customer = (Customer)customerQuery.getSingleResult();
            order.setCustomer(customer);
        } catch (NoResultException notPersistedYet) {
            // Customer does not yet exist, so null out id to have it persisted.
            order.getCustomer().setId(0);
        }
    }
    for (Order order : orders) {
        em.persist(order);
        for (OrderLine orderLine : order.getOrderLines()) {
            em.persist(orderLine);
        }
    }
    em.getTransaction().commit();
    em.close();
    firstResult = firstResult + pageSize;
    query.setFirstResult(firstResult);
    if (!done) {
        orders = query.getResultList();
    }
}
emOld.close();
emf.close();
emfOld.close();
</code></pre><br />
<h4>Optimization #3 - Query Cache</h4>This will introduce a lot of queries for customer by name (10,000 to be exact), one for each order. This is not very efficient, and can be improved through caching. In EclipseLink there is both an object cache and a query cache. The object cache is enabled by default, but objects are only cached by Id, so this does not help us on the query using the customer's name. So, we can enable a query cache for this query. A query cache is specific to the query, and caches the query results keyed on the query name and its parameters. A query cache is enabled in EclipseLink through using the query hint <code>"eclipselink.query-results-cache"="true"</code>. This should be set where the query is defined, in this case in the orm.xml. This will reduce the number of queries for customer to 1,000, which is much better.<br />
<br />
There are other solutions besides the query cache. EclipseLink also supports in-memory querying. In-memory querying means evaluating the query on all of the objects in the object cache, instead of accessing the database. In-memory querying is enabled through the query hint <code>"eclipselink.cache-usage"="CheckCacheOnly"</code>. If you enabled a full cache on customer, then as you persisted the orders all of the existing customers would be in the cache, and you would never need to access the database. Another manual solution is to maintain a Map in the migration code keying the new customers by name. For all of the above solutions, if the cache is made fixed sized (the query cache defaults to a size of 100), you would never need all of the customers in memory at the same time, so there would be no memory issues.<br />
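The manual Map alternative could be sketched like this. The class, its method names, and the choice of keying customer ids by name are all illustrative, not EclipseLink API; the fixed size simply mirrors the query cache's default of 100 entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical manual cache for the migration: keys new customer ids by name,
// evicting the least recently used entry once the cap is reached, so memory
// use stays bounded no matter how many customers are migrated.
public class CustomerCache {
    private static final int MAX_ENTRIES = 100; // mirrors the query cache default

    private final Map<String, Long> idsByName =
        new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > MAX_ENTRIES;
            }
        };

    public Long getId(String name) {
        return idsByName.get(name); // null means: fall back to a database query
    }

    public void putId(String name, long id) {
        idsByName.put(name, id);
    }

    public int size() {
        return idsByName.size();
    }
}
```

The migration loop would consult this map before issuing the customer query, and record each customer it persists.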
<br />
<h4>Optimization #4 - Batch Fetch</h4>The most common performance issue in JPA is the fetch of relationships. If you query n orders, and access their order-lines, you get n queries for order-line. This can be optimized through join fetching and batch fetching. Join fetching joins the relationship in the original query and selects from both tables. Batch fetching executes a second query for the related objects, but fetches them all at once, instead of one by one. Because we are using pagination, this makes optimizing the fetch a little more tricky. Join fetching will still work, but since order-lines is join fetched, and there are 10 order-lines per order, the page size that was 500 orders is now only 50 orders (and their 500 order-lines). We can resolve this by increasing the page size to 5000, but given that in a real application the number of order-lines is not fixed, this becomes a bit of a guess. But the page size was just a heuristic number anyway, so no real issue. Another issue with join fetching with pagination is that the first and last object may not have all of its related objects, if it falls in-between a page. Fortunately EclipseLink is smart enough to handle this, but it does require 2 extra queries for the first and last order of each page. Join fetching also has the drawback that it selects more data when a OneToMany is join fetched. Join fetching is enabled in JPQL using <code>join fetch o.orderLines</code>.<br />
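The page-size arithmetic above can be sketched as a toy calculation (my own illustration, not EclipseLink code): with join fetching, each result row corresponds to one order-line, so a page of maxResults rows covers far fewer whole orders.

```java
// Toy model of join-fetch pagination: each result row is an order-line,
// so the number of whole orders in a page shrinks by the lines-per-order factor.
public class JoinFetchPage {
    public static int ordersPerPage(int maxResults, int linesPerOrder) {
        return maxResults / linesPerOrder;
    }
}
```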
<br />
Batch fetching normally works by joining the original query with the relationship query, but because the original query used pagination, this will not work. EclipseLink supports three types of batch fetching, JOIN, EXISTS, and IN. IN works with pagination, so we can use IN batch fetching. Batch fetch is enabled through the query hint <code>"eclipselink.batch"="o.orderLines"</code>, and <code>"eclipselink.batch.type"="IN"</code>. This will reduce the n queries for order-line to 1. So for each batch/page of 500 orders, there will be 1 query for the page of orders, and 1 query for the order-lines, and 50 queries for customer.<br />
<br />
<h4>Optimization #5 - Read Only</h4>The application is migrating from the MySQL database to the Oracle database, so it is only reading from MySQL. When you execute a query in JPA, all of the resulting objects become managed as part of the current persistence context. This is wasteful here, as managed objects are tracked for changes and registered with the persistence context. EclipseLink provides a <code>"eclipselink.read-only"="true"</code> query hint that allows the persistence context to be bypassed. This can be used for the migration, as the objects read from MySQL will never be written back to MySQL.<br />
<br />
<h4>Optimization #6 - Sequence Pre-allocation</h4>We have optimized the first part of the application, reading from the MySQL database. The second part is to optimize the writing to Oracle.<br />
<br />
The biggest issue with the writing process is that the Id generation is using an allocation size of 1. This means that for every insert there will be an update and a select for the next sequence number. This is a major issue, as it effectively doubles the amount of database access. By default JPA uses a pre-allocation size of 50 for TABLE and SEQUENCE Id generation, and 1 for IDENTITY Id generation (a very good reason to never use IDENTITY Id generation). But frequently applications are unnecessarily paranoid about holes in their Id values and set the pre-allocation size to 1. By changing the pre-allocation size from 1 to 500, we reduce about 1,000 database accesses per page.<br />
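In annotation form, the allocation size is set on the generator. This is a mapping sketch mirroring the orm.xml used later in this article (the ORD_SEQ generator name comes from that file); it is configuration, so there is nothing to execute here.

```java
import javax.persistence.*;

@Entity
public class Order {
    // With allocationSize=500 the provider only hits the sequence table once
    // per 500 ids, instead of an update and a select for every single insert.
    @Id
    @GeneratedValue(strategy = GenerationType.TABLE, generator = "ORD_SEQ")
    @TableGenerator(name = "ORD_SEQ", allocationSize = 500)
    private long id;
}
```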
<br />
<h4>Optimization #7 - Cascade Persist</h4>I must admit I intentionally added the next issue to the original code. Notice that in the for loop persisting the orders, I also loop over the order-lines and persist them. This would be required if the order did not cascade the persist operation to order-line. However, I also made the <code>orderLines</code> relationship cascade, as well as order-line's <code>order</code> relationship. The JPA spec defines somewhat unusual semantics for its persist operation, requiring that the cascade persist be called every time persist is called, even if the object is an existing object. This makes cascading persist a potentially dangerous thing to do, as it could trigger a traversal of your entire object model on every persist call. This is an important point, and I added this issue purposefully to highlight it, as it is a common mistake in JPA applications. The cascade persist causes each persist call to an order-line to persist its order, and <i>every</i> order-line of the order again. This results in an n^2 number of persist calls. Fortunately there are only 10 order-lines per order, so this only results in about 100 extra persist calls per order. It could have been much worse: if the customer defined a relationship back to its orders, you would have 1,000 extra calls per order. The persist does not need to do anything, as the objects are already persisted, but the traversal can be expensive. So, in JPA you should either mark your relationships cascade persist, or call persist in your code, but not both. In general I would recommend only cascading persist for logically dependent relationships (i.e. things that would also cascade remove).<br />
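The blow-up can be illustrated with a toy traversal counter. This is my own sketch of the spec's cascade semantics, not JPA internals: each top-level persist call re-traverses the cascade graph, and only a per-operation visited set stops the recursion.

```java
import java.util.*;

// Toy model of JPA cascade-persist traversal (not a real JPA implementation).
// Nodes represent objects, edges represent cascade=PERSIST relationships.
// Each top-level persist call re-traverses the graph, since persist must
// cascade even when the target is already persisted.
public class CascadeModel {
    static int persistCalls = 0;

    static void persist(int node, Map<Integer, List<Integer>> cascades, Set<Integer> visited) {
        if (!visited.add(node)) {
            return; // already traversed in this persist operation
        }
        persistCalls++;
        for (int related : cascades.getOrDefault(node, Collections.<Integer>emptyList())) {
            persist(related, cascades, visited);
        }
    }
}
```

With 10 order-lines cascading back to their order, the redundant explicit persist loop multiplies the traversal roughly tenfold.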
<br />
<h4>Optimization #8 - Batch Writing</h4>Many databases provide an optimization that allows a batch of write operations to be performed as a single database access. There is both parametrized and dynamic batch writing. With parametrized batch writing a single parametrized SQL statement can be executed with a batch of parameter values instead of a single set of parameter values. This is very optimal, as the SQL only needs to be executed once, and all of the data can be passed optimally to the database.<br />
<br />
Dynamic batch writing requires dynamic (non-parametrized) SQL that is batched into a single big statement and sent to the database all at once. The database then needs to process this huge string and execute each statement. This requires the database to do a lot of work parsing the statement, so it is not always optimal. It does reduce the number of database accesses, so if the database is remote or poorly connected with the application, this can result in an improvement.<br />
<br />
In general parametrized batch writing is much more optimal, and on Oracle it provides a huge benefit, whereas dynamic batch writing does not. JDBC defines the API for batch writing, but not all JDBC drivers support it; some support the API but then execute the statements one by one, so it is important to test that your database supports the optimization before using it. In EclipseLink batch writing is enabled using the persistence unit property <code>"eclipselink.jdbc.batch-writing"="JDBC"</code>.<br />
<br />
Another important aspect of using batch writing is that you must have the same SQL (DML actually) statement being executed in a grouped fashion in a single transaction. Some JPA providers do not order their DML, so you can end up ping-ponging between two statements, such as the order insert and the order-line insert, making batch writing ineffective. Fortunately EclipseLink orders and groups its DML, so usage of batch writing reduces the database accesses from 500 order inserts and 5,000 order-line inserts to 55 (the default batch size is 100). We can increase the batch size using <code>"eclipselink.jdbc.batch-writing.size"</code>; increasing the batch size to 1,000 reduces the database accesses to 6 per page.<br />
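A toy count of the resulting round trips (my own model, not EclipseLink internals) shows where the numbers above come from, and why DML ordering matters: only consecutive identical statements can share a batch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of batch writing: consecutive identical statements share one
// JDBC batch (up to the batch size); every change of statement, or full
// batch, costs another database access.
public class BatchCount {
    public static int countBatches(List<String> dml, int batchSize) {
        int batches = 0;
        int i = 0;
        while (i < dml.size()) {
            String current = dml.get(i);
            int run = 0;
            while (i < dml.size() && dml.get(i).equals(current) && run < batchSize) {
                i++;
                run++;
            }
            batches++;
        }
        return batches;
    }
}
```

With grouped DML, 500 order inserts and 5,000 order-line inserts collapse to 55 accesses at a batch size of 100; interleaved (unordered) DML defeats batching almost entirely.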
<br />
<h4>Optimization #9 - Statement caching</h4>Every time you execute an SQL statement, the database must parse that statement and execute it. Most of the time an application executes the same set of SQL statements over and over. By using parametrized SQL and caching the prepared statement, you can avoid the cost of having the database parse the statement.<br />
<br />
There are two levels of statement caching: one done on the database, and one done on the JDBC client. Most databases maintain a parse cache automatically, so you only need to use parametrized SQL to make use of it. Caching the statement on the JDBC client normally provides the bigger benefit, but requires some work. If your JPA provider is providing you with your JDBC connections, then it is responsible for statement caching. If you are using a DataSource, such as in an application server, then the DataSource is responsible for statement caching, and you must enable it in your DataSource config. In EclipseLink, when using EclipseLink's connection pooling, you can enable statement caching using the persistence unit property <code>"eclipselink.jdbc.cache-statements"="true"</code>. EclipseLink uses parametrized SQL by default, so this does not need to be configured.<br />
<br />
<h4>Optimization #10 - Disabling Caching</h4>By default EclipseLink maintains a shared 2nd level object cache. This normally is a good thing, and improves read performance significantly. However, in our application we are only inserting into Oracle, and never reading, so there is no point to maintaining a shared cache. We can disable this using the EclipseLink persistence unit property <code>"eclipselink.cache.shared.default"="false"</code>. However, we are reading customer, so we can enable caching for customer using, <code>"eclipselink.cache.shared.Customer"="true"</code>.<br />
<br />
<h4>Optimization #11 - Other Optimizations</h4>EclipseLink provides several other more specific optimizations. I would not really recommend all of these in general as they are fairly minor, and have certain caveats, but they are useful in use cases such as migration where the process is well defined.<br />
<br />
These include the following persistence unit properties:<br />
<ul><li><code>"eclipselink.persistence-context.flush-mode"="commit"</code> - Avoids the cost of flushing on every query execution.<br />
<li><code>"eclipselink.persistence-context.close-on-commit"="true"</code> - Avoids the cost of resuming the persistence context after the commit.<br />
<li><code>"eclipselink.persistence-context.persist-on-commit"="false"</code> - Avoids the cost of traversing and persisting all objects on commit.<br />
<li><code>"eclipselink.logging.level"="off"</code> - Avoids some logging overhead.<br />
</ul>The fully optimized code: <pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>EntityManagerFactory emf = Persistence.createEntityManagerFactory("order-opt");
EntityManagerFactory emfOld = Persistence.createEntityManagerFactory("order-old");
EntityManager emOld = emfOld.createEntityManager();
System.out.println("Migrating database.");
Query query = emOld.createQuery("Select o from Order o order by o.id");
// Optimization #2 - batch fetch
// #2 - a - join fetch
//Query query = emOld.createQuery("Select o from Order o join fetch o.orderLines");
// #2 - b - batch fetch (batch fetch is more optimal, as it avoids duplication of Order data)
query.setHint("eclipselink.batch", "o.orderLines");
query.setHint("eclipselink.batch.type", "IN");
// Optimization #3 - read-only
query.setHint("eclipselink.read-only", "true");
// Optimization #4 - pagination
int pageSize = 500;
int firstResult = 0;
query.setFirstResult(firstResult);
query.setMaxResults(pageSize);
List<Order> orders = query.getResultList();
boolean done = false;
while (!done) {
    if (orders.size() < pageSize) {
        done = true;
    }
    EntityManager em = emf.createEntityManager();
    em.getTransaction().begin();
    Query customerQuery = em.createNamedQuery("findCustomByName");
    // Reset old Ids, so they are assigned from the new database.
    for (Order order : orders) {
        order.setId(0);
        customerQuery.setParameter("name", order.getCustomer().getName());
        try {
            Customer customer = (Customer)customerQuery.getSingleResult();
            order.setCustomer(customer);
        } catch (NoResultException notPersistedYet) {
            // Customer does not yet exist, so null out id to have it persisted.
            order.getCustomer().setId(0);
        }
    }
    for (Order order : orders) {
        em.persist(order);
        // Optimization #5 - avoid n^2 persist calls
        //for (OrderLine orderLine : order.getOrderLines()) {
        //    em.persist(orderLine);
        //}
    }
    em.getTransaction().commit();
    em.close();
    firstResult = firstResult + pageSize;
    query.setFirstResult(firstResult);
    if (!done) {
        orders = query.getResultList();
    }
}
emOld.close();
emf.close();
emfOld.close();
</code></pre>The optimized persistence.xml: <pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code><persistence-unit name="order-opt" transaction-type="RESOURCE_LOCAL">
    <!-- Optimization #7, 8 - sequence preallocation, query result cache -->
    <mapping-file>META-INF/order-orm.xml</mapping-file>
    <class>model.Order</class>
    <class>model.OrderLine</class>
    <class>model.Customer</class>
    <properties>
        <!-- Change this to access your own database. -->
        <property name="javax.persistence.jdbc.driver" value="oracle.jdbc.OracleDriver" />
        <property name="javax.persistence.jdbc.url" value="jdbc:oracle:thin:@ottvm028.ca.oracle.com:1521:TOPLINK" />
        <property name="javax.persistence.jdbc.user" value="jsutherl" />
        <property name="javax.persistence.jdbc.password" value="password" />
        <property name="eclipselink.ddl-generation" value="create-tables" />
        <!-- Optimization #9 - statement caching -->
        <property name="eclipselink.jdbc.cache-statements" value="true" />
        <!-- Optimization #10 - batch writing -->
        <property name="eclipselink.jdbc.batch-writing" value="JDBC" />
        <property name="eclipselink.jdbc.batch-writing.size" value="1000" />
        <!-- Optimization #11 - disable caching for batch insert (caching only improves reads, so only adds overhead for inserts) -->
        <property name="eclipselink.cache.shared.default" value="false" />
        <!-- Except for Customer which is shared by orders -->
        <property name="eclipselink.cache.shared.Customer" value="true" />
        <!-- Optimization #12 - turn logging off -->
        <!-- property name="eclipselink.logging.level" value="FINE" /-->
        <property name="eclipselink.logging.level" value="off" />
        <!-- Optimization #13 - close EntityManager on commit, to avoid cost of resume -->
        <property name="eclipselink.persistence-context.close-on-commit" value="true" />
        <!-- Optimization #14 - avoid auto flush cost on query execution -->
        <property name="eclipselink.persistence-context.flush-mode" value="commit" />
        <!-- Optimization #15 - avoid cost of persist on commit -->
        <property name="eclipselink.persistence-context.persist-on-commit" value="false" />
    </properties>
</persistence-unit>
</code></pre>The optimized orm.xml: <pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code><?xml version="1.0" encoding="UTF-8"?>
<entity-mappings version="2.1"
    xmlns="http://www.eclipse.org/eclipselink/xsds/persistence/orm"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <named-query name="findCustomByName">
        <query>Select c from Customer c where c.name = :name</query>
        <hint name="eclipselink.query-results-cache" value="true"/>
    </named-query>
    <entity class="model.Order">
        <table-generator name="ORD_SEQ" allocation-size="500"/>
    </entity>
    <entity class="model.Customer">
        <table-generator name="CUST_SEQ" allocation-size="500"/>
    </entity>
</entity-mappings>
</code></pre>So, what is the result? The original unoptimized code took on average 133,496 milliseconds (~2 minutes) to process ~100,000 objects. The fully optimized code took only 6,933 milliseconds (~7 seconds). This is very good, and means it could process 1 million objects in about 1 minute. The optimized code is a 1,825% improvement over the original code. <p/>But how much did each optimization affect this final result? To answer this question I ran the test 3 times with the fully optimized version, but with each optimization removed in turn. This worked out better than starting with the unoptimized version and adding each optimization separately, as some optimizations get masked by the lack of others. So, in the table below, the bigger the % difference, the better the optimization (that was removed) was. <table cellspacing=0 cellpadding=4 border=1><tr><th>Optimization</th><th>Average Result (ms)</th><th>% Difference</th></tr>
<tr><td>None</td><td align=right>133,496</td><td align=right>1,825%</td></tr>
<tr><td>All</td><td align=right>6,933</td><td align=right>0%</td></tr>
<tr><td>1 - no agent</td><td align=right>7,906</td><td align=right>14%</td></tr>
<tr><td>2 - no pagination</td><td align=right>8,679</td><td align=right>25%</td></tr>
<tr><td>3 - no read-only</td><td align=right>8,323</td><td align=right>20%</td></tr>
<tr><td>4a - join fetch</td><td align=right>11,836</td><td align=right>71%</td></tr>
<tr><td>4b - no batch fetch</td><td align=right>17,344</td><td align=right>150%</td></tr>
<tr><td>5 - no sequence pre-allocation</td><td align=right>30,396</td><td align=right>338%</td></tr>
<tr><td>6 - no persist loop</td><td align=right>7,947</td><td align=right>14%</td></tr>
<tr><td>7 - no batch writing</td><td align=right>75,751</td><td align=right>992%</td></tr>
<tr><td>8 - no statement cache</td><td align=right>7,233</td><td align=right>4%</td></tr>
<tr><td>9 - with cache</td><td align=right>7,925</td><td align=right>14%</td></tr>
<tr><td>10 - other</td><td align=right>7,332</td><td align=right>6%</td></tr>
</table><br />
This shows that batch writing was the best optimization, followed by sequence pre-allocation, then batch fetching.<br />
<br />
<h2>Data Partitioning - Scaling the Database</h2>In Enterprise Java most of the effort normally goes into scaling the mid-tier application and its server. Pretty much every Java application server supports clustering and scaling out the application to several mid-tier machines. Even if the application server does not officially support clustering, it is normally pretty easy to have a "cluster" of application servers fronted by a round-robin load-balancer. This would even work with something as simple as Tomcat.<br />
<br />
The mid-tier normally scales very well to a cluster, as it does not have any shared in-memory data, as this data is normally stored in a database. All of the mid-tier cluster members access the same database, and life is good. The application can have unlimited performance, simply by adding more mid-tier machines. But what happens when the poor database machine suddenly can't take any more requests?<br />
<br />
The most common solution to scaling the database seems to be to buy a bigger and badder database machine. If 8 cores is not cutting it, then perhaps 16 cores will. This solution in general works pretty well, assuming hardware vendors can keep stuffing more cores into their machines. This solution seems to be used in Enterprise Java performance benchmarks such as SpecJ; if you look at the nodes column of the SpecJ 2010 results, all the results have a single database node, some with as many as 40 cores, even though some have 8 mid-tier nodes.<br />
<br />
<a href="http://www.spec.org/jEnterprise2010/results/jEnterprise2010.html">http://www.spec.org/jEnterprise2010/results/jEnterprise2010.html</a><br />
<br />
<br />
But what happens when you can't stuff any more cores into a machine, or the cost of an insanely multi-core machine greatly outweighs the cost of multiple lower end hardware machines? Perhaps this is just a hardware problem, and you just need to wait for the hardware vendors to make a bigger and badder machine, but there are other solutions from the wonderful world of software.<br />
<br />
The first solution to look into is to optimize your own application (as always). Generally the application's code is not so efficient in its database access, and by optimizing the number and types of queries hitting the database, using parametrized SQL, using batch writing, using lazy, join and batch fetching, a significant load can be removed from the database. But perhaps you already did that, or don't have the expertise, or just don't feel like it.<br />
<br />
The second solution is to optimize your database. By ensuring your database is configured optimally, has the correct indexes, queries are using the optimal query plan, and the disk access optimally, its performance, and thus scalability can be improved. But perhaps you already did that, or don't have the expertise, or just don't feel like it.<br />
<br />
The third solution is to investigate caching in the mid-tier. By caching objects and data in the mid-tier, you can offload a lot of the queries hitting the database, and improve your application's performance to boot. Most <a href="http://en.wikipedia.org/wiki/Java_Persistence_API">JPA </a>providers support caching, and some such as <a href="http://www.eclipse.org/eclipselink/">EclipseLink </a>offer quite advanced caching functionality including invalidation, and coordinated clustered caches. JPA 2.0 defines some basic caching annotations to enable and access the cache.<br />
<br />
Caching mainly benefits reads, but some caching solutions such as <a href="http://www.oracle.com/technetwork/middleware/coherence/overview/index.html">Oracle Coherence</a> offer the ability to offload writes as well. <a href="http://www.oracle.com/technetwork/middleware/toplink/overview/index.html">Oracle TopLink Grid</a> provides JPA support for Coherence.<br />
<br />
Caching can be a good solution, but there can be issues with stale data, clustering, and mid-tier contention. Some caching solution are very good, but not always as good at managing concurrent access to data as relational databases that have been doing it for decades. Also if your database has become a bottleneck because of writes, then caching reads may not be a solution.<br />
<br />
The best solution is to scale the database by clustering the database across multiple machines. This could be a real clustered database, such as <a href="http://www.oracle.com/technetwork/database/clustering/overview/index.html">Oracle RAC</a>, or just multiple regular database instances. Clustered databases are good, and can improve your scalability without much work, but depending on your application you may also have to partition your data across the database nodes for optimal scalability. Without partitioning, if you write a row on one node, then access it on another, the other node must request the latest copy of the data from the first node, which can potentially make performance worse.<br />
<br />
Partitioning splits your data across each of the database nodes. There is horizontal partitioning, and vertical partitioning. Vertical partitioning is normally the easiest to implement. You can just put half of your classes on one database, and the other half on another. Ideally the two sets would be isolated from each other and not have any cross database relationships.<br />
<br />
For horizontal partitioning you need to split your data across multiple database nodes. Each database node will have the same tables, but each node's table will only store part of the data. You can partition the data by the data values, such as range partitioning, value partitioning, hash partitioning, or even round robin.<br />
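The routing idea behind hash partitioning can be sketched as follows. The pool names and the simple modulo hash here are illustrative only, not EclipseLink's actual implementation; the point is that the partition field's value deterministically picks a node, so rows sharing that value land together.

```java
// Toy hash-partition router: the partition field's value picks a connection
// pool, so an Order and its OrderLines (sharing the same ORDER_ID) are
// always routed to the same database node.
public class HashRouter {
    private final String[] pools;

    public HashRouter(String... pools) {
        this.pools = pools;
    }

    public String poolFor(long partitionValue) {
        return pools[(int) (Math.abs(partitionValue) % pools.length)];
    }
}
```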
<br />
To enable data partitioning you require your persistence solution to be aware of how to partition the data. <a href="http://wiki.eclipse.org/EclipseLink/Release/2.2.0">EclipseLink 2.2</a> added support for partitioning data. Both vertical and horizontal partitioning is supported. Several partitioning options are provided at the Session, Entity and Query level,<br />
<ul><li>Range partitioning - each partition maps a range of field values.</li>
<li>Value partitioning - each field value maps to a partition.</li>
<li>Hash partitioning - the field value is hashed to determine its partition.</li>
<li>Pinned partitioning - allows an Entity or query to be vertically partitioned.</li>
<li>Round robin - allows load balancing of requests across multiple database nodes.</li>
<li>Replication - allows data to be replicated across multiple database nodes.</li>
</ul>EclipseLink supports partitioning across any database, including both clustered databases such as Oracle RAC, and standard databases such as MySQL. Relationships and queries across database partitions are supported, but joins across partitions are only supported for clustered databases.<br />
<br />
So how does data partitioning with JPA and EclipseLink scale? To determine the answer, I developed a simple order processing example. The example defines an Order, OrderLine and Customer. The example client processes orders for a Customer using 16 client threads. The application is primarily insert oriented, so it heavily uses the database. I first ran the application without partitioning on a single MySQL database instance. To give the poor database no chance of keeping up with the mid-tier client, I ran the mid-tier on a virtualized 8 core machine with 16 GB of RAM (Oracle Sun hardware, Oracle Linux OS). I ran the MySQL database on a similar machine, but only gave it 1 virtual core and 8 GB of RAM. So, I was pretty sure the application would be bottlenecking on the database. This was the goal: to simulate a cluster of mid-tier machines accessing a single database machine.<br />
<br />
Next, I enabled partitioning of the Order and OrderLine by the ORDER_ID using hash partitioning across two database nodes. I also hash partitioned Customer by its ID. This resulted in about half of the transactions going to one database, and half to the other. Because the Order and the OrderLine shared the same ORDER_ID, they were partitioned to the same database node, so I did not need to worry about transactions spanning multiple nodes. The read for the Customer could go to either node, but because it was a non-transactional read, it was just routed separately by EclipseLink, which has support for using different connections for non-transactional reads versus transactional writes. Having writes span multiple nodes in a single transaction is normally not desirable. EclipseLink allows this, and can be integrated with JTA to give 2-phase commit across the nodes. If JTA is not used, EclipseLink still does a 2-stage commit, but there are no guarantees if the commit fails after some of the writes have succeeded.<br />
<br />
The resulting order was mapped as,<br />
<pre style="background:lightyellow none repeat scroll 0 0;
border:1px inset orange;
margin:10px;
overflow:auto;
padding:6px;
text-align:left;"><code>@Entity
@Table(name="PART_ORDER")
@HashPartitioning(
    name="HashPartitionByOrderId",
    partitionColumn=@Column(name="ORDER_ID"),
    connectionPools={"default","node2"})
@Partitioned("HashPartitionByOrderId")
public class Order implements Serializable {
    @Id
    @GeneratedValue(strategy=GenerationType.TABLE)
    @Column(name="ORDER_ID")
    private long id;
    @Basic
    private String description;
    @Basic
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    @OneToMany(mappedBy="order", cascade=CascadeType.ALL, orphanRemoval=true)
    @OrderBy("lineNumber")
    private List<OrderLine> orderLines = new ArrayList<OrderLine>();
    @ManyToOne(fetch=FetchType.LAZY)
    private Customer customer;
}
</code></pre><br />
For the full source code for the example see <a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.persistence/trunk/examples/org.eclipse.persistence.example.jpa.partitioned/">here</a>.<br />
<br />
For the second run, a second MySQL database was added, running on a separate single-core machine. The result showed a 66% increase in scalability for the application, processing close to 2x as many orders. The test application was run for 1 minute and the total number of processed orders for all 16 client threads was totaled. This was run 5 times and the results averaged for each configuration.<br />
<br />
The results:<br />
<table><tbody>
<tr><td>Configuration</td><td>Threads</td><td>Average processed orders</td><td>%STD</td><td>%DIF </td></tr>
<tr><td>Single database</td><td>16</td><td>11,150</td><td>0.4%</td><td>0%</td></tr>
<tr><td>2 node partition</td><td>16</td><td>18,583</td><td>2.2%</td><td>66%</td></tr>
</tbody></table><br />
The results show that through effective partitioning the database can be scaled out to multiple machines as well as the mid-tier.Jameshttp://www.blogger.com/profile/07275512393744882781noreply@blogger.com7tag:blogger.com,1999:blog-6877629428951398731.post-38010764426792450452011-03-07T08:33:00.000-08:002013-04-11T10:54:16.903-07:00JVM Performance - Part III - Concurrent Maps<h2>Concurrent Maps</h2>The main difference between Hashtable and HashMap is that Hashtable is synchronized. For this reason Hashtable is still used in a lot of concurrent code because it is, in theory, thread safe. This theory, however, is normally just a theory, because if you don't write your concurrent code correctly, it will still not be thread safe no matter how many synchronized methods you use.<br />
<br />
For example, you could call get() on the Hashtable, then if the key is not there call put(). Both operations are synchronized and thread-safe, but in between your get() and put() another thread could have done the same thing and put something there already, in which case your code may be incorrect and may have thrown away another thread's data. With a Hashtable, the solution is to synchronize the whole operation on the map.<br />
<br />
<span style="font-family: arial;"><code></code><pre>Object value = map.get(key);
if (value == null) {
    synchronized (map) {
        value = map.get(key);
        if (value == null) {
            value = buildValue(key);
            map.put(key, value);
        }
    }
}
</pre></span><br />
JDK 1.5 added ConcurrentHashMap, an implementation of the ConcurrentMap interface that is thread safe, and designed and optimized for concurrent access. It is internally split into pages (segments), so concurrent access to different pages does not contend on a single lock. It also provides useful API such as putIfAbsent(), which atomically puts a value in the map only if nothing is already there. Using putIfAbsent() is better than a synchronized get() and put(), in both concurrency and performance.<br />
<br />
<span style="font-family: arial;"><code></code><pre>Object value = map.get(key);
if (value == null) {
    Object newValue = buildValue(key);
    // putIfAbsent() returns the existing value, or null if the new value was stored.
    value = map.putIfAbsent(key, newValue);
    if (value == null) {
        value = newValue;
    }
}
</pre></span><br />
So, how does the performance and concurrency of HashMap, Hashtable and ConcurrentHashMap stack up? This test compares the performance of gets and puts in various Map implementations using 1 to 32 threads. It does 100 gets or puts in a Map of size 100. Two machines were tested. The first machine is my Windows XP desktop, which has two cores. The second machine is an 8 core Linux server. All tests were run 5 times and averaged; Oracle Sun JDK 1.6.23 was used.<br />
<br />
Threads is the number of threads running the test. The average is the total number of operations performed in the time period by all threads. The %STD is the percentage standard deviation in the results. The %DIF is the percentage difference between the run and the single threaded run (for the same Map type).<br />
<br />
<h3>Concurrent Map Performance Comparison (desktop, 2cpu)</h3><table><tbody>
<tr><td>Map</td><td>Operation</td><td>Threads</td><td>Average</td><td>%STD</td><td>%DIF (with 1 thread)</td></tr>
<tr><td>HashMap</td><td>get</td><td>1</td><td>3551306</td><td>0.06%</td><td>0%</td></tr>
<tr><td>HashMap</td><td>get</td><td>2</td><td>4121102</td><td>0.03%</td><td>16%</td></tr>
<tr><td>HashMap</td><td>get</td><td>4</td><td>4132506</td><td>0.15%</td><td>16%</td></tr>
<tr><td>HashMap</td><td>get</td><td>8</td><td>4227485</td><td>0.68%</td><td>19%</td></tr>
<tr><td>HashMap</td><td>get</td><td>16</td><td>4402532</td><td>1.36%</td><td>23%</td></tr>
<tr><td>HashMap</td><td>get</td><td>32</td><td>4426514</td><td>1.61%</td><td>24%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>1</td><td>1132956</td><td>0.06%</td><td>0%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>2</td><td>364236</td><td>0.08%</td><td>-211%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>4</td><td>274603</td><td>0.14%</td><td>-312%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>8</td><td>277188</td><td>1.08%</td><td>-308%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>16</td><td>277881</td><td>0.78%</td><td>-307%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>32</td><td>296779</td><td>2.51%</td><td>-281%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>1</td><td>2771098</td><td>0.04%</td><td>0%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>2</td><td>3466451</td><td>2.30%</td><td>25%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>4</td><td>3458492</td><td>0.33%</td><td>24%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>8</td><td>3510282</td><td>0.31%</td><td>26%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>16</td><td>3613182</td><td>2.36%</td><td>30%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>32</td><td>3599489</td><td>2.23%</td><td>29%</td></tr>
<tr><td>HashMap</td><td>put</td><td>1</td><td>3897925</td><td>0.07%</td><td>0%</td></tr>
<tr><td>HashMap</td><td>put</td><td>2</td><td>2614784</td><td>0.01%</td><td>-49%</td></tr>
<tr><td>HashMap</td><td>put</td><td>4</td><td>2473011</td><td>0.21%</td><td>-57%</td></tr>
<tr><td>HashMap</td><td>put</td><td>8</td><td>2482743</td><td>0.30%</td><td>-57%</td></tr>
<tr><td>HashMap</td><td>put</td><td>16</td><td>2506519</td><td>0.53%</td><td>-55%</td></tr>
<tr><td>HashMap</td><td>put</td><td>32</td><td>2579715</td><td>0.30%</td><td>-51%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>1</td><td>1042076</td><td>0.33%</td><td>0%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>2</td><td>474199</td><td>3.06%</td><td>-119%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>4</td><td>179550</td><td>6.71%</td><td>-480%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>8</td><td>183102</td><td>1.63%</td><td>-469%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>16</td><td>393085</td><td>0.68%</td><td>-165%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>32</td><td>398277</td><td>1.10%</td><td>-161%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>1</td><td>1336292</td><td>0.21%</td><td>0%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>2</td><td>557880</td><td>3.71%</td><td>-139%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>4</td><td>390736</td><td>1.64%</td><td>-241%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>8</td><td>362653</td><td>1.21%</td><td>-268%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>16</td><td>1492123</td><td>0.20%</td><td>11%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>32</td><td>1564926</td><td>0.10%</td><td>17%</td></tr>
</tbody></table><br />
<h3>Concurrent Map Performance Comparison (server, 8cpu)</h3><table><tbody>
<tr><td>Map</td><td>Operation</td><td>Threads</td><td>Average</td><td>%STD</td><td>%DIF (with 1 thread)</td></tr>
<tr><td>HashMap</td><td>get</td><td>1</td><td>3047533</td><td>0.0%</td><td>0%</td></tr>
<tr><td>HashMap</td><td>get</td><td>2</td><td>7500603</td><td>0.1%</td><td>146%</td></tr>
<tr><td>HashMap</td><td>get</td><td>4</td><td>14080828</td><td>0.01%</td><td>362%</td></tr>
<tr><td>HashMap</td><td>get</td><td>8</td><td>25160569</td><td>0.01%</td><td>769%</td></tr>
<tr><td>HashMap</td><td>get</td><td>16</td><td>17215757</td><td>1.2%</td><td>464%</td></tr>
<tr><td>HashMap</td><td>get</td><td>32</td><td>11797330</td><td>7.7%</td><td>287%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>1</td><td>1165834</td><td>0.06%</td><td>0%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>2</td><td>434485</td><td>16.9%</td><td>-168%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>4</td><td>203231</td><td>2.7%</td><td>-473%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>8</td><td>201290</td><td>2.1%</td><td>-479%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>16</td><td>358459</td><td>2.3%</td><td>-225%</td></tr>
<tr><td>Hashtable</td><td>get</td><td>32</td><td>303975</td><td>4.7%</td><td>-283%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>1</td><td>2119602</td><td>0.0%</td><td>0%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>2</td><td>5044317</td><td>0.1%</td><td>137%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>4</td><td>9422460</td><td>0.09%</td><td>344%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>8</td><td>10195480</td><td>0.0%</td><td>381%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>16</td><td>9799273</td><td>1.2%</td><td>362%</td></tr>
<tr><td>ConcurrentHashMap</td><td>get</td><td>32</td><td>9557975</td><td>0.1%</td><td>350%</td></tr>
<tr><td>HashMap</td><td>put</td><td>1</td><td>1729801</td><td>0.02%</td><td>0%</td></tr>
<tr><td>HashMap</td><td>put</td><td>2</td><td>1347323</td><td>0.1%</td><td>-28%</td></tr>
<tr><td>HashMap</td><td>put</td><td>4</td><td>1267770</td><td>0.02%</td><td>-36%</td></tr>
<tr><td>HashMap</td><td>put</td><td>8</td><td>1056226</td><td>0.0%</td><td>-63%</td></tr>
<tr><td>HashMap</td><td>put</td><td>16</td><td>1055462</td><td>0.01%</td><td>-63%</td></tr>
<tr><td>HashMap</td><td>put</td><td>32</td><td>1055139</td><td>0.01%</td><td>-63%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>1</td><td>1391458</td><td>0.08%</td><td>0%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>2</td><td>211793</td><td>13.1%</td><td>-556%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>4</td><td>191052</td><td>2.8%</td><td>-628%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>8</td><td>200480</td><td>3.4%</td><td>-594%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>16</td><td>399748</td><td>2.3%</td><td>-248%</td></tr>
<tr><td>Hashtable</td><td>put</td><td>32</td><td>400840</td><td>3.2%</td><td>-247%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>1</td><td>1503588</td><td>0.2%</td><td>0%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>2</td><td>441143</td><td>0.8%</td><td>-240%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>4</td><td>380565</td><td>1.1%</td><td>-295%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>8</td><td>354054</td><td>11.8%</td><td>-324%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>16</td><td>1736618</td><td>2.8%</td><td>15%</td></tr>
<tr><td>ConcurrentHashMap</td><td>put</td><td>32</td><td>1699647</td><td>5.7%</td><td>13%</td></tr>
</tbody></table><br />
Very interesting results. Given the desktop machine has 2 CPUs, I would have expected the 2+ threaded tests to have at most 2x the throughput of the single threaded test. For the server results with 8 CPUs, I would expect the results to double until 8 threads, then flatten out.<br />
<br />
Given Hashtable is synchronized, HashMap is not, and ConcurrentHashMap is partially synchronized, I would have expected HashMap to be about 2x, Hashtable to be about 1x, and ConcurrentHashMap to be somewhere in between. The thing I like best about running performance tests is that you rarely get what you expect, and these results are not what I would have expected.<br />
<br />
My basic premise holds, in that for get() HashMap had the best concurrency, then ConcurrentHashMap, and then Hashtable. I would not have expected Hashtable to do so badly though. Given it is synchronized, only one thread can perform a get() at a time, so naively one would expect the same results as a single thread. The reality is that it had 5x worse performance than a single thread. The reasons for this include that synchronization has a certain overhead, which modern JVMs optimize out when only a single thread is accessing the object, but with multiple threads this overhead becomes very apparent. Also, contention in general has a huge performance cost, as the threads are busy waiting for the lock to become available. In addition, just having multiple threads adds some overhead for context switching and such, so in general the more threads, the worse the performance, unless the threads are doing something useful.<br />
<br />
The desktop results for get() are worse than I would have expected with 2 CPUs, but perhaps the second CPU was busy with other things, such as the OS and garbage collection, etc. <br />
<br />
The put() results are much more perplexing. First of all, I know that running concurrent puts on a HashMap does not make much sense, as HashMap does not support concurrent puts. I ran the test anyway, just to see what would happen, and the result is quite surprising. I would have expected similar results to the get() test. However, HashMap had much worse results, similar to Hashtable, as if it were experiencing some contention. How could this be, given that HashMap has no synchronized methods? My only explanation is that the puts required modifying the same memory locations, so were experiencing contention on the memory access. If you have a better explanation, please comment.<br />
<br />
I would have also expected the put() concurrency for ConcurrentHashMap to be better. I think the primary reason for this was that my tests looped over the same set of keys in the same order, so each thread was trying to put the same key at the same time. Since ConcurrentHashMap works by having multiple pages and only having to lock a page on put() instead of the entire Map, it still had contention because for the most part the threads were accessing the same keys at the same time. I think this is also why the runs with more than 8 threads fared better. With more threads they got more out of sync, and had fewer page conflicts. This is an important point: ConcurrentHashMap will only perform well when it is large enough to have many pages, and when access to it is random. If all the threads are accessing a single page, it is really no better than a Hashtable.<br />
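The number of pages (segments) can be tuned through ConcurrentHashMap's three-argument constructor; a minimal sketch (the class name and values here are illustrative, not from the test code):

```java
import java.util.concurrent.ConcurrentHashMap;

public class TunedMap {
    public static void main(String[] args) {
        // initialCapacity, loadFactor, concurrencyLevel. The concurrencyLevel
        // is the estimated number of concurrently writing threads, and sizes
        // the number of internal segments (pages): more segments means less
        // chance of two threads contending on the same lock, at the cost of
        // some extra memory.
        ConcurrentHashMap<String, Integer> map =
                new ConcurrentHashMap<String, Integer>(100, 0.75f, 32);
        map.put("a", 1);
        System.out.println(map.get("a")); // prints 1
    }
}
```

Even with many segments, the access pattern matters: threads hammering the same keys still serialize on the same segment lock, as the results above show.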
<br />
So, what does all this mean? In general, if you have static meta-data that is read-only and requires concurrent access, then use HashMap (it is only unsafe for concurrent puts; concurrent gets are fine). If you require concurrent read-write access, then use ConcurrentHashMap. If you just like being old school, then use Hashtable, but beware the hidden costs of concurrency.Jameshttp://www.blogger.com/profile/07275512393744882781noreply@blogger.com7tag:blogger.com,1999:blog-6877629428951398731.post-28788335556958284212011-01-26T11:48:00.000-08:002013-04-11T10:56:07.702-07:00JVM Performance, Part II - JVM BazaarAs promised, I have results from running some of the tests in other JVMs. The most interesting results were from the Map test and the Method execution test, so I have re-run those on various JVMs and environments.<br />
<br />
The first result comes from running with Oracle (Sun) JDK 1.6.23 on the same Windows XP machine. Next I ran on Oracle (Sun) JDK 1.5.5, and then Oracle JRockit 1.6. I then ran the tests on another machine, an old 4 cpu 8 core AMD machine running Linux. The results are very surprising, each different machine and environment seemed to have very different behavior.<br />
<br />
<b>Note</b>: These tests are not in any way attempting to compare Windows vs Linux performance, or Oracle Sun JDK vs Oracle JRockit, completely different hardware, JVMs and environments are used. The goal is to see how different operations compare on the same JVM, and how one JVM and environment can differ with another. Some of the Windows results are faster most likely because the machine is newer, and a faster CPU, some of the Linux results are faster possibly because the machine has 8 CPUs instead of 2, so can process garbage collection on other threads. <br />
<br />
The test uses a Map of size 100 testing instantiation of the Map, 100 puts and 100 gets. The average result is included. The %DIF is between the Map and the Hashtable results for that same JVM.<br />
<br />
<h3>Map Operation Performance Comparison</h3><table><tbody>
<tr><td>Map</td><td>WXP-JDK1.6.7</td><td>%DIF</td><td>WXP-JDK1.6.23</td><td>%DIF</td><td>WXP-JDK1.5.5</td><td>%DIF</td></tr>
<tr><td>Hashtable</td><td>357229</td><td>0%</td><td>379673</td><td>0%</td><td>219961</td><td>0%</td></tr>
<tr><td>HashMap</td><td>274587</td><td>-30%</td><td>403501</td><td>+6.2%</td><td>238916</td><td>+8.6%</td></tr>
<tr><td>LinkedHashMap</td><td>269840</td><td>-32%</td><td>356207</td><td>-6.5%</td><td>228787</td><td>+4.0%</td></tr>
<tr><td>IdentityHashMap</td><td>110801</td><td>-222%</td><td>121691</td><td>-211%</td><td>60670</td><td>-262%</td></tr>
<tr><td>ConcurrentHashMap</td><td>119068</td><td>-200%</td><td>183159</td><td>-107%</td><td>109912</td><td>-100%</td></tr>
<tr><td>HashSet</td><td>281960</td><td>-26%</td><td>407709</td><td>+7.3%</td><td>228936</td><td>+4.0%</td></tr>
</tbody></table><br />
<table><tbody>
<tr><td>Map</td><td>LX-JDK1.6.20</td><td>%DIF</td><td>LX-JDK1.5.5</td><td>%DIF</td><td>LX-JRK1.6.5</td><td>%DIF</td></tr>
<tr><td>Hashtable</td><td>343250</td><td>0%</td><td>307227</td><td>0%</td><td>474378</td><td>0%</td></tr>
<tr><td>HashMap</td><td>496258</td><td>+44%</td><td>375805</td><td>+22%</td><td>642267</td><td>+35%</td></tr>
<tr><td>LinkedHashMap</td><td>476535</td><td>+38%</td><td>417273</td><td>+35%</td><td>595331</td><td>+25%</td></tr>
<tr><td>IdentityHashMap</td><td>552437</td><td>+60%</td><td>478350</td><td>+55%</td><td>716368</td><td>+51%</td></tr>
<tr><td>ConcurrentHashMap</td><td>293529</td><td>-16%</td><td>299675</td><td>-2.5%</td><td>331730</td><td>-43%</td></tr>
<tr><td>HashSet</td><td>451615</td><td>+31%</td><td>381843</td><td>+24%</td><td>633824</td><td>+33%</td></tr>
</tbody></table><br />
The first good thing to notice is that for the same machine, the results get better with newer JVMs. That is very nice, and something that most Java developers are familiar with, each new JVM version having better performance than the previous.<br />
<br />
The second thing to notice is that the Map types have huge variations based on the JVM. HashMap was faster than Hashtable in 1.6.23 and even 1.5.5, but somehow slower in 1.6.7. I found this very perplexing, but no matter how many times, or in which order, I ran the tests the results were the same. I also compared the code between the releases, and did not find any code changes that could account for the change in performance. I assume it has to do with how the JVM manages memory and synchronized methods. One would think that the code for Hashtable, HashMap, and IdentityHashMap would all be the same, since the functionality is essentially identical, but in fact they share none of the same code, and use quite different data structures. I assume this accounts for the difference in performance under the different JVMs, so none of them is inferior, they are just different.<br />
<br />
Note that these tests are still single threaded tests. So although they compare the raw performance difference between Hashtable and HashMap, the far bigger issue is the effect Hashtable's synchronized methods have on concurrency. This is also true for ConcurrentHashMap; although its raw throughput seems to be worse, its concurrency is far better, and in a multi-threaded environment it will fare much better. I will hopefully explore this in another blog post.<br />
<br />
<h3>Method Execution Performance Comparison</h3><table><tbody>
<tr><td>Execution type</td><td>WXP-JDK1.6.7</td><td>%DIF</td><td>WXP-JDK1.6.23</td><td>%DIF</td><td>WXP-JDK1.5.5</td><td>%DIF</td></tr>
<tr><td>Normal</td><td>25533130</td><td>0%</td><td>29608720</td><td>0%</td><td>18557941</td><td>0%</td></tr>
<tr><td>synchronized</td><td>13383707</td><td>-93%</td><td>14373780</td><td>-105%</td><td>1254183</td><td>-1379%</td></tr>
<tr><td>Block synchronized</td><td>8244087</td><td>-203%</td><td>8812159</td><td>-235%</td><td>1253624</td><td>-1380%</td></tr>
<tr><td>final</td><td>26873873</td><td>+6.1%</td><td>30812911</td><td>+4.0%</td><td>18853714</td><td>+1.5%</td></tr>
<tr><td>In-lined</td><td>26816109</td><td>+5.2%</td><td>30815433</td><td>+4.0%</td><td>19734119</td><td>+6.3%</td></tr>
<tr><td>volatile</td><td>1503727</td><td>-1539%</td><td>3290550</td><td>-799%</td><td>1448561</td><td>-1181%</td></tr>
<tr><td>Reflection</td><td>159069</td><td>-15952%</td><td>145585</td><td>-20237%</td><td>82713</td><td>-22336%</td></tr>
</tbody></table><br />
<table><tbody>
<tr><td>Execution type</td><td>LX-JDK1.5.5</td><td>%DIF</td><td>LX-JDK1.6.20</td><td>%DIF</td><td>LX-JRK1.6.5</td><td>%DIF</td></tr>
<tr><td>Normal</td><td>4666753</td><td>0%</td><td>4840361</td><td>0%</td><td>4257995</td><td>0%</td></tr>
<tr><td>synchronized</td><td>2250871</td><td>-107%</td><td>4413458</td><td>-9.6%</td><td>3465398</td><td>-22%</td></tr>
<tr><td>Block synchronized</td><td>2252474</td><td>-107%</td><td>4410721</td><td>-9.7%</td><td>3672440</td><td>-15%</td></tr>
<tr><td>final</td><td>4666384</td><td>0.0%</td><td>4911130</td><td>+1.4%</td><td>4392700</td><td>+3.1%</td></tr>
<tr><td>In-lined</td><td>4668538</td><td>0.0%</td><td>4874539</td><td>0.7%</td><td>4042412</td><td>-5.3%</td></tr>
<tr><td>volatile</td><td>3236491</td><td>-44%</td><td>3231105</td><td>-49%</td><td>3032121</td><td>-40%</td></tr>
<tr><td>Reflection</td><td>2645911</td><td>-76%</td><td>2568412</td><td>-88%</td><td>132232</td><td>-3120%</td></tr>
</tbody></table><br />
The first good thing to note again is that newer JVMs have, for the most part, better performance. This is one of the great aspects of developing on the Java platform.<br />
<br />
The second thing to notice is that, once again, the behavior of the various JVMs is very different. They all seem to be consistent in that synchronized methods, volatile and reflection are slower, but the degree of the difference is pretty major. Synchronized methods seem to be much better in JDK 1.6 vs 1.5, but in the latest Linux JVM have almost no overhead at all. Reflection also improved a lot from 1.5 to 1.6, but again the Linux overhead is less than a tenth of the Windows JVM results (except for JRockit). Volatile has similar differences. This could have more to do with the hardware than the OS or JVM, as the Linux machine has 8 CPUs, so may be doing some memory management using its other CPUs (but this does not explain the JRockit result...).<br />
<br />
<h3>Summary</h3>So what have we learned? I think the most important thing is that different JVMs and environment can have very different behavior, so it is important to test in the environment that you will go into production in. If you do not have that luxury, then it is important to test in a variety of different environments to ensure you are not making trade offs in one environment that could hurt you in another.<br />
<br />
I would not get too obsessed with worrying about how your application's performance may differ in different environments. By and large, the main performance optimizations of reducing CPU usage, reducing message sends, reducing memory and garbage, and improving concurrency will improve your application's performance no matter what the environment is.Jameshttp://www.blogger.com/profile/07275512393744882781noreply@blogger.com2tag:blogger.com,1999:blog-6877629428951398731.post-50299137524136567762010-12-17T08:03:00.000-08:002013-04-11T10:56:27.755-07:00What is faster? JVM PerformanceJava performance optimization is part analysis, and part superstition and witchcraft.<br />
<br />
In deciding how to optimize their applications, a lot of developers search the web for what people say is faster, go by what they have heard from their co-worker's sister's boyfriend's cousin, or by what seems to be the "public opinion". Sometimes the public opinion is correct, and their application benefits; sometimes it is not, and they waste their time and effort, make little performance improvement, or make things worse.<br />
<br />
True performance optimization involves measuring the current performance, profiling the application to determine the bottlenecks, and investigating and implementing optimizations to avoid them.<br />
<br />
But how should one code on a day to day basis for optimal performance? Are synchronized methods slow? Are final methods fast? Which is faster: Vector, ArrayList, or LinkedList? What is the optimal way to iterate a List? Is reflection slow? I will attempt to answer these questions in this post.<br />
<br />
In the EclipseLink project we are generally very concerned about performance, as we are a persistence library used by other applications, so like the JDK we are part of the plumbing and need to be as optimized as possible.<br />
<br />
We have a number of performance test suites in EclipseLink, mainly to measure our persistence performance, but we have one test suite that has nothing to do with persistence. Our Java performance test suite just measures the performance of different operations in Java. These tests help us determine how to best optimize our code.<br />
<br />
The tests function by running a certain operation for a set amount of time and measuring the number of operations completed in the time period. Then the next operation is run and measured the same way. This is repeated 5 times, the max/min values are rejected, the average of the middle 3 is computed, the standard deviation is computed, and the averages of the two operations are compared. Tests are single threaded.<br />
<br />
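The timing methodology above can be sketched roughly as follows (class and method names are illustrative; this is not the actual EclipseLink test code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ThroughputHarness {

    // Count how many times the operation completes in the given time window.
    public static long measure(Runnable operation, long millis) {
        long count = 0;
        long end = System.currentTimeMillis() + millis;
        while (System.currentTimeMillis() < end) {
            operation.run();
            count++;
        }
        return count;
    }

    // Run the measurement 5 times, reject the min and max results,
    // and average the middle 3.
    public static double average(Runnable operation, long millis) {
        List<Long> results = new ArrayList<Long>();
        for (int i = 0; i < 5; i++) {
            results.add(measure(operation, millis));
        }
        Collections.sort(results);
        return (results.get(1) + results.get(2) + results.get(3)) / 3.0;
    }
}
```

The time window is a parameter here so the harness can be exercised quickly; the actual tests used a 10 second window.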
The tests were run on Oracle Sun JDK 1.6.0_07 on 32 bit Windows.<br />
<br />
<h2>Maps</h2>There are several different Map implementations in Java. This test compares the performance for various sizes of Maps. The test instantiates the Map, does n (size) puts, then n gets and n removes.<br />
<h3>Map Operation Performance Comparison</h3><table><tbody>
<tr><td>Map</td><td>Size</td><td>Average (operations/10 seconds)</td><td>%STD</td><td>%DIF (with Hashtable)</td></tr>
<tr><td>Hashtable</td><td>10</td><td>3605235</td><td>0.03%</td><td>0%</td></tr>
<tr><td>HashMap</td><td>10</td><td>2908854</td><td>0.02%</td><td>-23%</td></tr>
<tr><td>LinkedHashMap</td><td>10</td><td>2594492</td><td>0.03%</td><td>-38%</td></tr>
<tr><td>IdentityHashMap</td><td>10</td><td>1346278</td><td>0.01%</td><td>-167%</td></tr>
<tr><td>ConcurrentHashMap</td><td>10</td><td>1009259</td><td>0.0%</td><td>-257%</td></tr>
<tr><td>HashSet</td><td>10</td><td>2927254</td><td>0.02%</td><td>-23%</td></tr>
<tr><td>Hashtable</td><td>100</td><td>357229</td><td>0.01%</td><td>0%</td></tr>
<tr><td>HashMap</td><td>100</td><td>274587</td><td>0.03%</td><td>-30%</td></tr>
<tr><td>LinkedHashMap</td><td>100</td><td>269840</td><td>0.03%</td><td>-32%</td></tr>
<tr><td>IdentityHashMap</td><td>100</td><td>110801</td><td>0.02%</td><td>-222%</td></tr>
<tr><td>ConcurrentHashMap</td><td>100</td><td>119068</td><td>0.01%</td><td>-200%</td></tr>
<tr><td>HashSet</td><td>100</td><td>281960</td><td>0.04%</td><td>-26%</td></tr>
<tr><td>Hashtable</td><td>1000</td><td>34034</td><td>6.2%</td><td>0%</td></tr>
<tr><td>HashMap</td><td>1000</td><td>27818</td><td>0.06%</td><td>-22%</td></tr>
<tr><td>LinkedHashMap</td><td>1000</td><td>25514</td><td>0.03%</td><td>-33%</td></tr>
<tr><td>IdentityHashMap</td><td>1000</td><td>11650</td><td>0.04%</td><td>-192%</td></tr>
<tr><td>ConcurrentHashMap</td><td>1000</td><td>12420</td><td>2.9%</td><td>-174%</td></tr>
<tr><td>HashSet</td><td>1000</td><td>26888</td><td>0.04%</td><td>-26%</td></tr>
</tbody></table><br />
These results are quite interesting; I never would have expected such a big difference in performance between classes doing basically the same well defined thing, that have been around for quite some time. It is surprising that Hashtable performs better than HashMap, when HashMap is supposed to be the replacement for Hashtable, and does not suffer from its limitation of using synchronized methods, which are supposed to be slow. <b>Note</b>: After exploring other JVMs in my next post, it seems that this anomaly only exists in JDK 1.6.7; in the latest JVM, and most other JVMs, HashMap is faster than Hashtable.<br />
<br />
I would expect LinkedHashMap and ConcurrentHashMap to be somewhat slower, as they have additional overhead and perform better in specific use cases, but it is odd that IdentityHashMap is so much slower, given my test key object did not define a hashCode(), so its identity was used, and the map was doing essentially the same thing as HashMap and Hashtable. It is also interesting that HashSet has the same performance as HashMap, given it in theory could be a simpler data structure (in reality it is backed by a HashMap, so performs the same).<br />
<br />
Note that these are single threaded results. Although Hashtable outperformed HashMap in this test, in a multi-threaded (and multi-CPU) environment Hashtable would perform much worse because of the method synchronization. Assuming read-only access of course, as concurrent write access to a HashMap would blow up. ConcurrentHashMap does perform the best in a concurrent read-write environment.<br />
<br />
<h2>Lists</h2>For Maps, Hashtable turned out to have the best performance. Let's now investigate Lists: will the old Vector implementation have better performance than ArrayList? The list test instantiates a list, then does n (size) adds followed by n gets by index.<br />
<br />
<h3>List Operation Performance Comparison</h3><table><tbody>
<tr><td>List</td><td>Size</td><td>Average (operations/10 seconds)</td><td>%STD</td><td>%DIF (with Vector)</td></tr>
<tr><td>Vector</td><td>10</td><td>11625453</td><td>0.006%</td><td>0%</td></tr>
<tr><td>ArrayList</td><td>10</td><td>18448453</td><td>0.08%</td><td>+56%</td></tr>
<tr><td>LinkedList</td><td>10</td><td>12362290</td><td>0.03%</td><td>+6.0%</td></tr>
<tr><td>Vector</td><td>100</td><td>1241420</td><td>0.2%</td><td>0%</td></tr>
<tr><td>ArrayList</td><td>100</td><td>1893581</td><td>0.1%</td><td>+52%</td></tr>
<tr><td>LinkedList</td><td>100</td><td>1012740</td><td>0.01%</td><td>-22%</td></tr>
<tr><td>Vector</td><td>1000</td><td>132182</td><td>0.2%</td><td>0%</td></tr>
<tr><td>ArrayList</td><td>1000</td><td>223969</td><td>0.1%</td><td>+69%</td></tr>
<tr><td>LinkedList</td><td>1000</td><td>12689</td><td>0.02%</td><td>-941%</td></tr>
</tbody></table><br />
So, it seems that ArrayList is much faster than Vector. Given both Vector and Hashtable are synchronized, I assume it is not the synchronized methods, just the implementation. LinkedList surprisingly performs as well as Vector for small sizes, but then performs much worse on larger sizes because of the indexed get. This is expected, as LinkedList performs well in specific use cases, such as removing from the head or middle, but this was not tested.<br />
<br />
<h2>Iteration</h2>There are many ways to iterate a List in Java. For a Vector an Enumeration can be used, or an Iterator for any List. A simple for loop with an index can also be used for any List. Java 5 also defines a simplified <code>for</code> syntax for iterating any Collection. This test compares the performance of iteration; the List has already been pre-populated.<br />
<br />
<h3>Iteration Performance Comparison</h3><table><tbody>
<tr><td>List</td><td>Average (operations/10 seconds)</td><td>%STD</td><td>%DIF (with Vector)</td></tr>
<tr><td>for index (Vector)</td><td>6079272</td><td>0.06%</td><td>0%</td></tr>
<tr><td>for index (ArrayList)</td><td>6048324</td><td>0.03%</td><td>-0.5%</td></tr>
<tr><td>Enumeration (Vector)</td><td>4736049</td><td>0.02%</td><td>-28%</td></tr>
<tr><td>Iterator (Vector)</td><td>2840094</td><td>0.05%</td><td>-114%</td></tr>
<tr><td>Iterator (ArrayList)</td><td>1935122</td><td>0.008%</td><td>-214%</td></tr>
<tr><td>for (Vector)</td><td>2841567</td><td>0.05%</td><td>-113%</td></tr>
<tr><td>for (ArrayList)</td><td>1933576</td><td>0.04%</td><td>-214%</td></tr>
</tbody></table><br />
So using a for loop with an index is faster than an Enumeration or Iterator. This is expected, as it avoids the cost of the iteration object instance. It is interesting that an Enumeration is faster than an Iterator, and a Vector Iterator is faster than an ArrayList Iterator. Unfortunately there is no magic in the Java 5 <code>for</code> syntax; it just calls iterator().<br />
<br />
So the optimal way to iterate a List is:<br />
<pre><code>
int size = list.size();
for (int index = 0; index &lt; size; index++) {
    Object object = list.get(index);
    ...
}
</code></pre><br />
However, the Java 5 syntax is simpler, so I must admit I would use it in any non-performance-critical code, and maybe some day the JVM will do some magic so that <code>for</code> has better performance.<br />
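For reference, the Java 5 syntax and the explicit Iterator loop are equivalent. This small sketch (illustrative, not the benchmark source) shows the two forms side by side:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IterationExample {
    // The Java 5 simplified for syntax.
    public static int sumForEach(List<Integer> list) {
        int sum = 0;
        for (Integer value : list) {
            sum += value;
        }
        return sum;
    }

    // What it compiles down to: a plain Iterator loop.
    public static int sumIterator(List<Integer> list) {
        int sum = 0;
        for (Iterator<Integer> it = list.iterator(); it.hasNext();) {
            sum += it.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4);
        // Both forms produce the same result.
        System.out.println(sumForEach(list) == sumIterator(list)); // prints true
    }
}
```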
<br />
<h2>Method Execution</h2>A method in Java can be defined in several different ways. It can be synchronized, final, called reflectively, or not called at all (in-lined). This next test tries to determine the performance overhead of a method call. The test executes a simple method that just increments an index. The test executes the method 100 times to avoid the test call overhead. For the reflective usage the Method object and arguments array are cached.<br />
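As a rough sketch (the class and method names here are illustrative, not the benchmark's actual source), the method variants being compared look like this:

```java
public class MethodVariants {
    private int index;
    private volatile int volatileIndex;

    // Normal: a plain instance method that increments a field.
    public void increment() { index++; }

    // synchronized: acquires this object's monitor on every call.
    public synchronized void incrementSynchronized() { index++; }

    // Block synchronized: same effect, monitor acquired inside the method.
    public void incrementBlockSynchronized() {
        synchronized (this) { index++; }
    }

    // final: cannot be overridden, so a good candidate for JIT in-lining.
    public final void incrementFinal() { index++; }

    // volatile field: each increment forces a read and write to main memory.
    public void incrementVolatile() { volatileIndex++; }

    public int getIndex() { return index; }
}
```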
<br />
<h3>Method Execution Performance Comparison</h3><table><tbody>
<tr><td>Method</td><td>Average (100 operations/10 seconds)</td><td>%STD</td><td>%DIF (with normal)</td></tr>
<tr><td>Normal</td><td>25533130</td><td>0.7%</td><td>0%</td></tr>
<tr><td>synchronized</td><td>13383707</td><td>4.3%</td><td>-93%</td></tr>
<tr><td>Block synchronized</td><td>8244087</td><td>4.7%</td><td>-203%</td></tr>
<tr><td>final</td><td>26873873</td><td>1.2%</td><td>+6.1%</td></tr>
<tr><td>In-lined</td><td>26816109</td><td>1.5%</td><td>+5.2%</td></tr>
<tr><td>volatile</td><td>1503727</td><td>0.1%</td><td>-1539%</td></tr>
<tr><td>Reflection</td><td>159069</td><td>3.2%</td><td>-15646%</td></tr>
</tbody></table><br />
Interesting results. Synchronized methods are slower, and using a synchronized block inside a method seems to be slower than just making the method synchronized. Final methods seem to be slightly faster, but only by a very minor amount, and seem to have the same improvement as in-lining the method, so I assume that is what the JVM is doing.<br />
<br />
Calling a method through reflection is significantly slower. Reflection has improved much since 1.5 added byte-code generation, but if you profile a reflective call you will see several isAssignable() checks before the method is invoked to ensure the object and arguments are of the correct type. It is odd that these cannot be handled through casts in the generated code, with typing errors simply trapped.<br />
<br />
<b>CORRECTION</b><br />
Originally volatile was showing an improvement in performance because of a bug in the volatile test. After correcting the bug, the test now shows a huge performance overhead. The usage of volatile on the int field that is being incremented in the test method is causing a 15x performance overhead. This is surprising, and much larger than the synchronized overhead, but still smaller than the reflection overhead. This is especially odd as the field is an int, which one would think is atomic anyway. I imagine the effect the volatile has on the JVM's memory usage is somehow causing the overhead. In spite of the overhead, I would still use volatile when required, in concurrent pieces of code where the same variable is concurrently modified. I have done multi-threaded tests comparing it with synchronizing the method, and using volatile is better than using synchronization for concurrency.<br />
<br />
Note that this difference in cost of execution is on a method that does basically nothing. Generally, performance bottlenecks in applications are in places that do <i>something</i>, not nothing, and it is the <i>something</i> that is the bottleneck, not how the method is called. Even in the reflective call, the cost was less than 1 millionth of a second, so on a method that took 1 millisecond the overhead would be irrelevant. Also, I assume these results are very JVM specific; other JVMs, older JVMs, and the JVMs of the future may behave much differently.<br />
<br />
<h3>Reflection</h3>There are two types of reflection, field reflection and method reflection. In JPA this normally corresponds to using FIELD access or PROPERTY access (annotating the fields or the get methods). Which is faster? This test sets a single variable, either directly, by calling a set method, or through field or set method reflection.<br />
<br />
<h3>Reflection Performance Comparison</h3><table><tbody>
<tr><td>Type</td><td>Average (operations/10 seconds)</td><td>%STD</td><td>%DIF (with in-lined)</td></tr>
<tr><td>In-lined</td><td>44105730</td><td>0.7%</td><td>0%</td></tr>
<tr><td>Set method</td><td>46189638</td><td>1.2%</td><td>+4.7%</td></tr>
<tr><td>Field Reflection</td><td>23488987</td><td>1.4%</td><td>-87%</td></tr>
<tr><td>Method Reflection</td><td>10862684</td><td>0.8%</td><td>-306%</td></tr>
</tbody></table><br />
So it appears that in JDK 1.6 field reflection is faster than set method reflection. Part of this difference is that for method reflection an Object array must be created to call the set method, whereas field reflection does not require this. Method reflection improved a lot in JDK 1.5, but it still seems slower than field reflection, at least for set methods. Both have an overhead over a compiled method execution or direct variable access.<br />
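To illustrate the difference (a minimal sketch with an illustrative Target class, not the benchmark code), note that the reflective method call needs an arguments array and boxing of the int, while the reflective field access does not:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

public class ReflectionExample {
    public static class Target {
        private int value;
        public void setValue(int value) { this.value = value; }
        public int getValue() { return value; }
    }

    public static void main(String[] args) throws Exception {
        Target target = new Target();

        // Field reflection: cache the Field once, then set the int directly.
        Field field = Target.class.getDeclaredField("value");
        field.setAccessible(true);
        field.setInt(target, 1);

        // Method reflection: cache the Method once, but each invoke still
        // needs an arguments array and boxes the int into an Integer.
        Method method = Target.class.getMethod("setValue", int.class);
        method.invoke(target, new Object[] { Integer.valueOf(2) });

        System.out.println(target.getValue()); // prints 2
    }
}
```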
<br />
If you look at the results, the difference between reflection and normal execution is much less than in the previous test. This is because in the previous test the method was executed 100 times inside the test, so the overhead of calling the test method was reduced, whereas this test only called it once, so the test method call overhead is included. This highlights an important fact: how you call the method that sets the field has a big impact on the performance. If you use a complex layering of interfaces and generated code to call a set method directly, versus calling it reflectively, the overhead of the layering may be worse than the reflective call.<br />
<br />
In EclipseLink field reflection is used by default, but it depends on whether you annotate your fields or get methods. If you annotate your get methods, then method reflection is used. If you use weaving (enabled through the EclipseLink agent, static weaving, JEE, or Spring) and use FIELD access, then EclipseLink will weave a generic get and set method into your classes to avoid reflection. This amounts to a minor optimization in the context of database persistence, and can be disabled using the persistence unit property "eclipselink.weaving.internal"="false".<br />
<br />
<h3>Summary</h3>So this post has presented a lot of data on how different things perform in Oracle Sun JDK 1.6 on Windows. It would be interesting to see how the same tests perform in different JVMs in different environments. Perhaps I will investigate that next.<br />
<br />
The source code to the above tests is in the EclipseLink SVN repository,<br />
<a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.persistence/trunk/foundation/eclipselink.core.test/src/org/eclipse/persistence/testing/tests/performance/java/">here</a>.<br />
<br />
<h2>Batch Fetching, part II - Journey to Big Data</h2>In my last post I investigated different ways to optimize the loading of objects and their relationships. I presented some performance results, but from a very small database with small query result sets. In this post I will investigate how those same query optimization strategies scale to larger data sets.<br />
<br />
The previous post only had a database of 12 rows. So for this run I will increase the database size by 5,000 times, to 60,000 rows. Still not a huge database, but it should be big enough to highlight any performance differences in the results. I will also increase the size of the query result set from 6 employees to 5,000 employees.<br />
<br />
The first thing that I noticed in this run was that Oracle has a limit of 1,000 parameters per statement. Since the IN batch fetching binds a large array, and I'm reading 5,000 objects, this limit was exceeded and the run blew up with a database error. The <code>BatchFetchPolicy</code> in EclipseLink accounts for this and defines a <code>size</code> for the max number of ids to include in an IN. The default size limit in EclipseLink was supposed to be 500, but I think I remember increasing it to 100,000 to test something when I was developing the feature, and, well..., I guess I never set it back, oops, I will fix this...<br />
<br />
EclipseLink defines a JPA Query hint <code>"eclipselink.batch.size"</code> that allows the size to be set. So I will set this to 500 for the test. This means that to read in all of the 5,000 objects, the IN batch fetch will need to execute 10 queries per batch fetched relationship. It will be interesting to see how it compares to the other query optimization techniques.<br />
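Putting the hints together, reading the employees with IN batch fetching and a capped batch size would look something like this (a sketch; the query and entity names are illustrative):

```java
Query query = em.createQuery(
    "Select e from Employee e where e.status = 'Part-time'");
query.setHint("eclipselink.batch", "e.address");
query.setHint("eclipselink.batch", "e.phones");
query.setHint("eclipselink.batch.type", "IN");
// Cap the number of ids per IN clause; 5,000 results with a size
// of 500 means 10 batch queries per relationship.
query.setHint("eclipselink.batch.size", 500);
List<Employee> employees = query.getResultList();
```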
<br />
The run time was also increased to 10 minutes from 1 minute because reading 5,000 objects obviously takes longer than reading 6. Also, since the last run our lab got a new database machine. The old database was running on Linux on an old server machine, and the new database is running on a new Oracle Sun server machine running Linux in a virtualized environment. The client machine is the same, my old desktop running Oracle Sun JDK 1.6 on Windows XP. Both databases were Oracle 11g.<br />
<br />
<h3>Big Data Results, run 1, simple (fetch address, phoneNumbers)</h3><table><tbody>
<tr><td>Query</td><td>Average (queries/10 minutes)</td><td>%STD</td><td>%DIF (of standard)</td></tr>
<tr><td>standard</td><td>27</td><td>4.5%</td><td>0%</td></tr>
<tr><td>join fetch</td><td>307</td><td>0.1%</td><td>+1037%</td></tr>
<tr><td>batch fetch (JOIN)</td><td>310</td><td>0.2%</td><td>+1048%</td></tr>
<tr><td>batch fetch (EXISTS)</td><td>309</td><td>0.1%</td><td>+1044%</td></tr>
<tr><td>batch fetch (IN)</td><td>261</td><td>0.1%</td><td>+866%</td></tr>
</tbody></table><br />
The results show that join fetching and batch fetching have basically equivalent performance, and about 10x better performance than the non-optimized query. IN batch fetching does not perform as well as the others with this larger result set. It performs better than I expected, given it has a huge IN statement and has to execute 10 queries per relationship. Note that these results differ from the previous post, which showed IN batch fetching performing the best for queries with small result sets.<br />
<br />
The second run uses the complex query which fetches 9 different relationships.<br />
<br />
<h3>Big Data Results, run 2, complex (fetch address, phoneNumbers, projects, manager, managedEmployees, emailAddresses, responsibilities, jobTitle, degrees)</h3><table><tbody>
<tr><td>Query</td><td>Average (queries/10 minutes)</td><td>%STD</td><td>%DIF (of standard)</td></tr>
<tr><td>standard</td><td>6</td><td>0.0%</td><td>0%</td></tr>
<tr><td>join fetch</td><td>59</td><td>1.5%</td><td>+383%</td></tr>
<tr><td>batch fetch (JOIN)</td><td>125</td><td>0.3%</td><td>+1983%</td></tr>
<tr><td>batch fetch (EXISTS)</td><td>124</td><td>0.3%</td><td>+1966%</td></tr>
<tr><td>batch fetch (IN)</td><td>80</td><td>3.2%</td><td>+1455%</td></tr>
</tbody></table><br />
The results show batch fetching having about 2x the performance of join fetching, and 20x the performance of the non-optimized query. Join fetching still performs 10x faster than the non-optimized case, which is different than the small result set run, which gave it worse performance than the non-optimized query. IN batch fetching again did not perform as well as JOIN and EXISTS batch fetching, but still outperformed join fetching and was 15x faster than the non-optimized query.<br />
<br />
Note that these results are not universal. Expect that every database, every machine, every environment, every query, and every object model will give different results. The basics should be the same though, batch fetch should have better performance than non-optimized queries, and better performance than join fetching for objects with complex relationships. IN batch fetching will perform worse for large result sets, but have similar performance for small result sets. Join fetching will perform well for objects with a small number of relationships.<br />
<br />
To see how the results differ in different environments, I did a few more runs on different databases. The next run is for a local Oracle 10g database installed on my desktop. This database is slower than the new server, but will not require a network trip since it is on the same machine as the Java client.<br />
<br />
<h3>Big Data Results, run 3, simple (fetch address, phoneNumbers)</h3><table><tbody>
<tr><td>Query</td><td>Average (queries/10 minutes)</td><td>%STD</td><td>%DIF (of standard)</td></tr>
<tr><td>standard</td><td>11</td><td>0.0%</td><td>0%</td></tr>
<tr><td>join fetch</td><td>346</td><td>0.7%</td><td>+3045%</td></tr>
<tr><td>batch fetch (JOIN)</td><td>310</td><td>0.0%</td><td>+2718%</td></tr>
<tr><td>batch fetch (EXISTS)</td><td>343</td><td>0.6%</td><td>+3018%</td></tr>
<tr><td>batch fetch (IN)</td><td>260</td><td>0.1%</td><td>+2263%</td></tr>
</tbody></table><br />
<h3>Big Data Results, run 4, complex (fetch address, phoneNumbers, projects, manager, managedEmployees, emailAddresses, responsibilities, jobTitle, degrees)</h3><table><tbody>
<tr><td>Query</td><td>Average (queries/10 minutes)</td><td>%STD</td><td>%DIF (of standard)</td></tr>
<tr><td>standard</td><td>1</td><td>0.0%</td><td>0%</td></tr>
<tr><td>join fetch</td><td>39</td><td>1.1%</td><td>+3800%</td></tr>
<tr><td>batch fetch (JOIN)</td><td>106</td><td>0.8%</td><td>+10500%</td></tr>
<tr><td>batch fetch (EXISTS)</td><td>113</td><td>0.0%</td><td>+11200%</td></tr>
<tr><td>batch fetch (IN)</td><td>36</td><td>1.2%</td><td>+3500%</td></tr>
</tbody></table><br />
These results show similar results to the previous simple run, but with about a 30x improvement over the non-optimized query, which is a bigger difference. The JOIN batch fetch did not seem to perform as well as the EXISTS batch fetch or join fetch.<br />
<br />
The complex run only completed a single run of the non-optimized query in the 10 minutes. The JOIN and EXISTS batch fetching performed the best, over 100x faster than the non-optimized query. Join fetching and IN batch fetching did not perform as well, but were both still over 30x faster than the non-optimized query.<br />
<br />
<h2>Batch fetching - optimizing object graph loading</h2>Probably the biggest impedance mismatch between <span style="font-style: italic;">object-oriented</span> object models and <span style="font-style: italic;">relational</span> <span style="font-style: italic;">database</span> data models is the way that data is accessed.<br />
<br />
In a relational model, generally a single <span style="font-weight: bold;">big database query</span> is constructed to join all of the desired data for a particular use case or service request. The more <span style="font-style: italic;">complex </span>the data, the more <span style="font-style: italic;">complex </span>the SQL. Optimization is done by avoiding <span style="font-style: italic;">unnecessary </span>joins and avoiding fetching <span style="font-style: italic;">duplicate </span>data.<br />
<br />
In an object model an object or set of objects are obtained and the desired data is collected by <span style="font-style: italic;">traversing </span>the object's <span style="font-style: italic;">relationships</span>.<br />
<br />
With object-relational mapping (<span style="font-weight: bold;">ORM</span>/<span style="font-weight: bold;">JPA</span>) this typically leads to multiple queries being executed, and typically the dreaded <span style="font-weight: bold;">"N+1 queries"</span> problem: executing a separate query per object in the original result set.<br />
<br />
Consider an example relational model with <span style="font-weight: bold;">EMPLOYEE</span>, <span style="font-weight: bold;">ADDRESS</span>, and <span style="font-weight: bold;">PHONE </span>tables. <span style="font-weight: bold;"> EMPLOYEE</span> has a foreign key <span style="font-weight: bold;">ADDR_ID</span> to <span style="font-weight: bold;">ADDRESS ADDRESS_ID</span>, and <span style="font-weight: bold;">PHONE </span>has a foreign key <span style="font-weight: bold;">OWNER_ID</span> to <span style="font-weight: bold;">EMPLOYEE EMP_ID</span>.<br />
<br />
To display employee data, including address and phone, for all part-time employees you would have the following SQL,<br />
<br />
<h3>Big database query</h3><span style="font-family:arial;"><code>SELECT E.*, A.*, P.* FROM EMPLOYEE E, ADDRESS A, PHONE P WHERE E.ADDR_ID = A.ADDRESS_ID AND E.EMP_ID = P.OWNER_ID AND E.STATUS = 'Part-time'</code></span><br />
<br />
You would then need to group and format this data to get all the phone numbers for each employee to display.<br />
<br />
<table><tbody>
<tr><td>ID</td><td>Name</td><td>Address</td><td>Phone</td></tr>
<tr><td>6578</td><td>Bob Jones</td><td>17 Mountainview Dr., Ottawa</td><td>519-456-1111, 613-798-2222</td></tr>
<tr><td>7890</td><td>Jill Betty</td><td>606 Hurdman Ave., Ottawa</td><td>613-711-4566, 613-223-5678</td></tr>
<tr><td>...</td></tr>
</tbody></table><br />
The corresponding object model defines <span style="font-weight: bold;">Employee</span>, <span style="font-weight: bold;">Address</span>, <span style="font-weight: bold;">Phone </span>classes.<br />
In JPA, Employee has a <span style="font-style: italic;">OneToOne </span>to Address, and a <span style="font-style: italic;">OneToMany </span>to Phone, Phone has a <span style="font-style: italic;">ManyToOne </span>to Employee.<br />
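A sketch of these mappings (simplified, with field and column names assumed from the example):

```java
import java.util.List;
import javax.persistence.*;

@Entity
public class Employee {
    @Id
    private long id;

    // EMPLOYEE.ADDR_ID -> ADDRESS.ADDRESS_ID
    @OneToOne
    @JoinColumn(name = "ADDR_ID")
    private Address address;

    // PHONE.OWNER_ID -> EMPLOYEE.EMP_ID
    @OneToMany(mappedBy = "owner")
    private List<Phone> phones;
}

@Entity
class Phone {
    @Id
    private long id;

    @ManyToOne
    @JoinColumn(name = "OWNER_ID")
    private Employee owner;
}
```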
<br />
To display employee data including address and phone for all part-time employees you would have the following JPQL,<br />
<br />
<h3>Simple JPQL</h3><span style="font-family:arial;"><code>Select e from Employee e where e.status = 'Part-time'</code></span><br />
<br />
To display this data you would write the Employee data, get and write the Address from the Employee, and get and write each of the Phones. The displayed result is of course the same as using SQL, but the SQL generated is quite different.<br />
<br />
<br />
<table><tbody>
<tr><td>ID</td><td>Name</td><td>Address</td><td>Phone</td></tr>
<tr><td>6578</td><td>Bob Jones</td><td>17 Mountainview Dr., Ottawa</td><td>519-456-1111, 613-798-2222</td></tr>
<tr><td>7890</td><td>Jill Betty</td><td>606 Hurdman Ave., Ottawa</td><td>613-711-4566, 613-223-5678</td></tr>
<tr><td>...</td></tr>
</tbody></table><br />
<h3>N+1 queries problem</h3><span style="font-family:arial;"><code>SELECT E.* FROM EMPLOYEE E WHERE E.STATUS = 'Part-time'</code></span><br />
... followed by N selects to ADDRESS<br />
<span style="font-family:arial;"><code>SELECT A.* FROM ADDRESS A WHERE A.ADDRESS_ID = 123</code></span><br />
<span style="font-family:arial;"><code>SELECT A.* FROM ADDRESS A WHERE A.ADDRESS_ID = 456</code></span><br />
...<br />
... followed by N selects to PHONE<br />
<span style="font-family:arial;"><code>SELECT P.* FROM PHONE P WHERE P.OWNER_ID = 789</code></span><span style="font-family: monospace;"><br />
</span><span style="font-family:arial;"><code>SELECT P.* FROM PHONE P WHERE P.OWNER_ID = 135</code></span><br />
...<br />
<br />
This will of course have very poor performance (unless all of the objects were already in the cache). There are a few ways to optimize this in JPA. The most common method is to use <span style="font-style: italic;">join fetching</span>. A join fetch is where an object and its related objects are fetched in a single query. This is quite easy to define in JPQL, and is similar to defining a join.<br />
<br />
<h3>JPQL with join fetch</h3><span style="font-family:arial;"><code>Select e from Employee e join fetch e.address join fetch e.phones where e.status = 'Part-time'</code></span><br />
<br />
This produces the same SQL as in the SQL case,<br />
<br />
<h3>SQL for JPQL with join fetch</h3><span style="font-family:arial;"><code>SELECT E.*, A.*, P.* FROM EMPLOYEE E, ADDRESS A, PHONE P WHERE E.ADDR_ID = A.ADDRESS_ID AND E.EMP_ID = P.OWNER_ID AND E.STATUS = 'Part-time'</code></span><br />
<br />
The code to display the Employee results is the same, the objects are just loaded more efficiently.<br />
<br />
JPA only defines join fetches using JPQL; with EclipseLink you can also use the <span style="font-style: italic;">@JoinFetch</span> annotation to have a relationship always be join fetched. Some JPA providers will always join fetch any <span style="font-style: italic;">EAGER </span>relationship. This may seem like a good idea, but is generally a very bad idea. EAGER defines if the relationship should be loaded, not how it should be accessed from the database. A user may want every relationship loaded, but join fetching every relationship, in particular every <span style="font-style: italic;">ToMany</span> relationship, will lead to a huge join (outer joins at that), fetching a huge amount of duplicate data. Also for <span style="font-style: italic;">ManyToOne </span>relationships such as <span style="font-style: italic;">parent</span>, <span style="font-style: italic;">owner</span>, or <span style="font-style: italic;">manager</span>, where there is a shared reference (that is probably already in the cache), join fetching this duplicate parent for every child will perform much worse than a separate select (or cache hit).<br />
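Using the annotation on the example model might look like this (a sketch; the imports and enum are EclipseLink-specific, and the field names are illustrative):

```java
import javax.persistence.Entity;
import javax.persistence.OneToOne;
import org.eclipse.persistence.annotations.JoinFetch;
import org.eclipse.persistence.annotations.JoinFetchType;

@Entity
public class Employee {
    // Always fetch the address in the same SQL as the employee;
    // OUTER handles employees that have no address.
    @OneToOne
    @JoinFetch(JoinFetchType.OUTER)
    private Address address;
    ...
}
```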
<br />
JPQL does not allow the aliasing of the join fetch, so if you wish to also query on the relationship, you have to join it twice. This is optimized out to a single join for ToOne relationships, and for ToMany relationships you really need the second join to avoid filtering the object's related objects. Some JPA providers do support using an alias for a join fetch, but the JPA spec does not allow it, and EclipseLink does not support this as of yet (but there is a <a href="https://bugs.eclipse.org/bugs/show_bug.cgi?id=293775">bug </a>logged).<br />
<br />
Nested join fetches are not directly supported by JPQL either (there is also a <a href="https://bugs.eclipse.org/bugs/show_bug.cgi?id=321722">bug </a>for this). EclipseLink supports nested join fetches through the Query hints <code>"eclipselink.join-fetch"</code> and <code>"eclipselink.left-join-fetch"</code>.<br />
<br />
<h3>Nested join fetch query hint</h3><span style="font-family:arial;"><code>query.setHint("eclipselink.join-fetch", "e.projects.milestones");</code></span><br />
<br />
Join fetching is fine, and the best solution in many use cases, but it is a very relational database centric approach. Another, more creative and object-oriented solution is to use <span style="font-style: italic;">batch fetching</span>. Batch fetching is much harder for the traditional relational mindset to comprehend, but once understood is quite powerful.<br />
<br />
JPA only defines join fetches, not batch fetches. To enable batch fetching in EclipseLink the Query hint <code>"eclipselink.batch"</code> is used, in our example this would be,<br />
<br />
<h3>Batch fetch query hint</h3><span style="font-family:arial;"><code>query.setHint("eclipselink.batch", "e.address");</code></span><br />
<span style="font-family:arial;"><code>query.setHint("eclipselink.batch", "e.phones");</code></span><br />
<br />
In a batch fetch the original query is executed normally, the difference is how the related objects are fetched. Once the employees are retrieved and their first address is accessed, <span style="font-weight: bold;">ALL </span>of the addresses for <span style="font-weight: bold;"></span><span style="font-weight: bold;">ONLY </span>the <span style="font-weight: bold;">SELECTED </span>employees are fetched. There are several different forms of batch fetching, for a <span style="font-style: italic;">JOIN </span>batch fetch the SQL will be,<br />
<br />
<h3>SQL for batch fetch (JOIN)</h3><span style="font-family:arial;"><code>SELECT E.* FROM EMPLOYEE E WHERE E.STATUS = 'Part-time'</code></span><br />
<span style="font-family:arial;"><code>SELECT A.* FROM EMPLOYEE E, ADDRESS A WHERE E.ADDR_ID = A.ADDRESS_ID AND E.STATUS = 'Part-time'</code></span><br />
<span style="font-family:arial;"><code>SELECT P.* FROM EMPLOYEE E, PHONE P WHERE E.EMP_ID = P.OWNER_ID AND E.STATUS = 'Part-time'</code></span><br />
<br />
The first observation is that 3 SQL statements occurred instead of 1 with a join fetch. This may lead one to think that this is less efficient, but it is actually more efficient in most cases. The difference between 1 and 3 selects is pretty minimal; the main issue with the unoptimized case was that N selects were executed, which could be 100s or even 1000s.<br />
<br />
The main benefit to batch fetching is that only the desired data is selected. In the join fetch case, the EMPLOYEE and ADDRESS data were duplicated in the result for <span style="font-style: italic;">every </span>PHONE. If each employee had 5 phone numbers, 5 times as much data would be selected. This is true for any ToMany relationship, and becomes exacerbated if you join fetch multiple or nested ToMany relationships. For example, if an employee's projects were join fetched, and the projects' milestones, for say 5 projects per employee and 10 milestones per project, you get the employee data duplicated 50 times (and project data duplicated 10 times). For a complex object model, this can be a major issue.<br />
<br />
Join fetching typically needs to use an <span style="font-style: italic;">outer </span>join to handle the case where an employee does not have an address or phone. Outer joins are generally much less efficient in the database, and add a row of nulls to the result. With batch fetching, if an employee does not have an address or phone, it is simply not in the batch result, so less data is selected. Batch fetching also allows for a <span style="font-style: italic;">distinct </span>to be used for ManyToOne relationships. For example, if the employee's manager was batch fetched, then the distinct would ensure that only the unique managers were selected, avoiding the selection of any duplicate data.<br />
<br />
The drawbacks to <span style="font-style: italic;">JOIN </span>batch fetching are that the original query is executed multiple times, so if it is an expensive query, join fetching could be more efficient. Also, if only a single result is selected, then batch fetching does not provide any benefit, whereas join fetching can still reduce the number of SQL statements executed.<br />
<br />
There are a few other forms of batch fetching. EclipseLink 2.1 supports three different batch fetching types, <span style="font-style: italic;">JOIN</span>, <span style="font-style: italic;">EXISTS</span>, <span style="font-style: italic;">IN </span>(defined in the BatchFetchType enum). The batch fetch type is set using the Query hint "eclipselink.batch.type". Batch fetching can also be always enabled for a relationship using the <span style="font-style: italic;">@BatchFetch</span> annotation.<br />
<br />
<h3>Batch fetch query hints and annotations</h3><code>query.setHint("eclipselink.batch.type", "EXISTS");</code><br />
<br />
<code>@BatchFetch(type=BatchFetchType.EXISTS)</code><br />
<br />
The <span style="font-style: italic;">EXISTS </span>option is similar to the <span style="font-style: italic;">JOIN </span>option, but the batch fetch uses an exists and a sub-select instead of a join. The advantage of this is that no distinct is required, which can be an issue with lobs or complex queries.<br />
<br />
<h3>SQL for batch fetch (EXISTS)</h3><span style="font-family:arial;"><code>SELECT E.* FROM EMPLOYEE E WHERE E.STATUS = 'Part-time'</code></span><br />
<span style="font-family:arial;"><code>SELECT A.* FROM ADDRESS A WHERE EXISTS (SELECT E.EMP_ID FROM EMPLOYEE E WHERE E.ADDR_ID = A.ADDRESS_ID AND E.STATUS = 'Part-time')</code></span><br />
<span style="font-family:arial;"><code>SELECT P.* FROM PHONE P WHERE EXISTS (SELECT E.EMP_ID FROM EMPLOYEE E WHERE E.EMP_ID = P.OWNER_ID AND E.STATUS = 'Part-time')</code></span><br />
<br />
The <span style="font-style: italic;">IN </span>option is quite different than the <span style="font-style: italic;">JOIN </span>and <span style="font-style: italic;">EXISTS </span>options. For the IN option the original select is not included; instead an IN clause is used to filter the related objects to just the original objects' ids. The advantage of the IN option is that the original query does not need to be re-executed, which can be more efficient if the original query is complex. The IN option also supports pagination and usage of cursors, whereas the other options do not work effectively, as they must select all of the related objects, not just the page. In EclipseLink, IN also makes use of the cache, so that if a related object can be retrieved from the cache, it is, and its id is not included in the IN, so only the minimal required data is selected.<br />
<br />
The issues with the IN option are that the set of ids in the IN can be very big, and inefficient for the database to process. If the set of ids is too big, it must be split up into multiple queries. Also, composite primary keys can be an issue. EclipseLink supports nested INs for composite ids on databases such as Oracle that support them, but some databases do not support this. IN also requires the IN part of the SQL to be dynamically generated, and if the IN batch sizes are not always the same, this can lead to dynamic SQL on the database.<br />
<br />
<h3>SQL for batch fetch (IN)</h3><span style="font-family:arial;"><code>SELECT E.* FROM EMPLOYEE E WHERE E.STATUS = 'Part-time'</code></span><br />
<span style="font-family:arial;"><code>SELECT A.* FROM ADDRESS A WHERE A.ADDRESS_ID IN (1, 2, 5, ...)</code></span><br />
<span style="font-family:arial;"><code>SELECT P.* FROM PHONE P WHERE P.OWNER_ID IN (12, 10, 30, ...)</code></span><br />
<br />
Batch fetching can also be nested, by using the dot notation in the hint.<br />
<br />
<h3>Nested batch fetch query hint</h3><span style="font-family:arial;"><code>query.setHint("eclipselink.batch", "e.projects.milestones");</code></span><br />
<br />
Something that batch fetching allows that join fetching does not is the optimal loading of a tree. If you set the <span style="font-style: italic;">@BatchFetch</span> annotation on a <span style="font-style: italic;">children </span>relationship in a tree structure, then a single SQL statement will be used for each level.<br />
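For example (a sketch; Category is an illustrative entity, not part of the example model, and the annotation form follows the EclipseLink usage shown above):

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;
import org.eclipse.persistence.annotations.BatchFetch;
import org.eclipse.persistence.annotations.BatchFetchType;

@Entity
public class Category {
    @Id
    private long id;

    @ManyToOne
    private Category parent;

    // Traversing children from a root issues one batch SQL statement
    // per level of the tree, instead of one query per node.
    @OneToMany(mappedBy = "parent")
    @BatchFetch(type = BatchFetchType.JOIN)
    private List<Category> children;
}
```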
<br />
So, what does all of this mean? Well, every environment and use case is different, so there is no perfect solution to use in all cases. Different types of query optimization will work better in different situations. The following results are provided as an example of the potential performance improvements from following these approaches.<br />
<br />
The results were obtained running in a single thread in JSE, accessing an Oracle database over a local network on low end hardware. Each test was run for 60 seconds, and the number of operations recorded. Each test was run 5 times. The high/low results were rejected and the middle 3 results averaged; the % standard deviation between the runs is included. The numbers themselves are not important, only the % difference between the results.<br />
<br />
The source code for this example and comparison run can be found <a href="http://dev.eclipse.org/svnroot/rt/org.eclipse.persistence/trunk/examples/org.eclipse.persistence.example.jpa.batch-fetch/">here</a>.<br />
<br />
The example performs a simple query for Employees using an unoptimized query, join fetching, and each type of batch fetching. After the query, each employee's address and phone numbers are accessed. The JPQL NamedQuery queries a small result set of 6 employees by salary from a tiny database of 12 employees.<br />
<br />
<h3>Simple run (fetch address, phoneNumbers)</h3><table><tbody>
<tr><td>Query</td><td>Average (queries/minute)</td><td>%STD</td><td>%DIF (of standard)</td></tr>
<tr><td>standard</td><td>5897</td><td>0.5%</td><td>0</td></tr>
<tr><td>join fetch</td><td>14024</td><td>1.1%</td><td>+137%</td></tr>
<tr><td>batch fetch (JOIN)</td><td>11190</td><td>4.5%</td><td>+89%</td></tr>
<tr><td>batch fetch (EXISTS)</td><td>13764</td><td>0.4%</td><td>+133%</td></tr>
<tr><td>batch fetch (IN)</td><td>14341</td><td>0.6%</td><td>+143%</td></tr>
</tbody></table><br />
<br />
From the first run's results, join fetching and batch fetching were similar, and about 1.9 to 2.4 times faster than the non-optimized query. IN batch fetching performed the best, and JOIN batch fetching the worst, for this small data set.<br />
<br />
The first run was kind of simple though. It only fetched two relationships. What happens when more relationships are fetched? This next run fetches all 9 of the employee relationships, including OneToOnes, OneToManys, ManyToOnes, ManyToManys, and ElementCollections.<br />
<br />
<h3>Complex run (fetch address, phoneNumbers, projects, manager, managedEmployees, emailAddresses, responsibilities, jobTitle, degrees)</h3><table><tbody>
<tr><td>Query</td><td>Average (queries/minute)</td><td>%STD</td><td>%DIF (of standard)</td></tr>
<tr><td>standard</td><td>1438</td><td>0.7%</td><td>0%</td></tr>
<tr><td>join fetch</td><td>1121</td><td>0.4%</td><td>-22%</td></tr>
<tr><td>batch fetch (JOIN)</td><td>3395</td><td>3.8%</td><td>+136%</td></tr>
<tr><td>batch fetch (EXISTS)</td><td>3768</td><td>2.6%</td><td>+162%</td></tr>
<tr><td>batch fetch (IN)</td><td>3893</td><td>0.5%</td><td>+170%</td></tr>
</tbody></table><br />
<br />
The second run's results show that as more relationships are fetched, join fetching starts to have major issues. Join fetching actually had worse performance than the non-optimized query (-22%). This is because the join becomes very big, and a lot of data must be processed. Batch fetching, on the other hand, did even better than in the simple run (2.3 to 2.7x faster). IN batch fetching still performed the best, and JOIN batch fetching the worst.<br />
<br />
The amount of data is very small however. What happens when the size of the database and the query are scaled up? I will investigate that next.