Thursday, January 3, 2013

Got Cache? Improve data access by 97,322%

Caching is the most valuable optimization one can make in software development. Making something run faster is nice, but it can never beat not having to run anything at all, because you already have the result cached.

JPA caches many things. The most important thing to cache is the EntityManagerFactory. This is generally done for you in JavaEE, but in JavaSE you need to do it yourself, for example by storing it in a static variable. If you don't cache your EntityManagerFactory, then your persistence unit will be redeployed on every call, which will really suck.
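A minimal sketch of the static-variable approach for JavaSE. To keep it runnable without a JPA provider on the classpath, ExpensiveFactory is a hypothetical stand-in for the EntityManagerFactory; in a real application the holder would store the result of Persistence.createEntityManagerFactory(...) instead.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for EntityManagerFactory: building one is expensive,
// because the whole persistence unit is deployed each time.
final class ExpensiveFactory {
    static final AtomicInteger DEPLOYMENTS = new AtomicInteger();

    ExpensiveFactory() {
        DEPLOYMENTS.incrementAndGet(); // counts how often the unit is "deployed"
    }
}

// Holds the single shared factory in a static variable.
final class FactoryHolder {
    // Initialization-on-demand holder: the JVM creates the instance once,
    // lazily and thread-safely, the first time get() is called.
    private static final class Lazy {
        static final ExpensiveFactory INSTANCE = new ExpensiveFactory();
    }

    private FactoryHolder() {}

    static ExpensiveFactory get() {
        return Lazy.INSTANCE;
    }
}
```

Every caller goes through FactoryHolder.get(), so the deployment cost is paid once per JVM rather than once per call.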

Other caches in JPA include the cache of JDBC connections, the cache of JDBC statements, the result set cache, and most importantly the object cache, which is what I would like to discuss today.

JPA 1.0 did not define caching, although most JPA providers did support a cache in some form or another. JPA 2.0 defined caching through the @Cacheable annotation and the <shared-cache-mode> persistence.xml element. Some describe caching in JPA as two levels. Conceptually there is the L1 cache on an EntityManager, and the L2 cache on the EntityManagerFactory.

The EntityManager cache is an isolated, transactional cache, that only caches the objects read by that EntityManager, and shares nothing with other EntityManagers. The main purpose of the L1 cache is to maintain object identity (i.e. person == person.getSpouse().getSpouse()), and maintain transaction consistency. The L1 cache will also improve performance by avoiding querying the same object multiple times. The only way to avoid the L1 cache is to refresh, create a new EntityManager, or call clear().
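The identity guarantee can be illustrated with a toy model. This is a simplified sketch, not how any JPA provider actually implements its L1 cache; PersistenceContextModel and its loader are hypothetical names, and the loader stands in for a database read.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Simplified model of an EntityManager's L1 cache (not EclipseLink's actual
// implementation): one identity map per context, keyed by Id.
final class PersistenceContextModel<T> {
    private final Map<Long, T> identityMap = new HashMap<>();
    private final LongFunction<T> loader; // stands in for a database read
    private int databaseReads = 0;

    PersistenceContextModel(LongFunction<T> loader) {
        this.loader = loader;
    }

    // Returns the same instance for the same Id until clear() is called.
    T find(long id) {
        T cached = identityMap.get(id);
        if (cached == null) {
            cached = loader.apply(id);
            databaseReads++;
            identityMap.put(id, cached);
        }
        return cached;
    }

    // Like EntityManager.clear(): drops every cached object.
    void clear() {
        identityMap.clear();
    }

    int databaseReads() {
        return databaseReads;
    }
}
```

Two finds for the same Id in one context return the same instance and cost one read. A second context, or the same context after clear(), gets a fresh instance.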

The EntityManagerFactory cache is a shared cache across all EntityManagers, and reflects the current committed state of the database (stale data is possible depending on your configuration and whether other applications access the database). The main purpose of the L2 cache is to improve performance by avoiding queries for objects that have already been read. The L2 cache is normally what is referred to when caching is discussed in JPA, and what the JPA <shared-cache-mode> and @Cacheable refer to.

There are many types of caches provided by the various JPA providers. Some provide data caches, some provide object caches, some have relationship caches, some have query caches, some have distributed caches, or coordinated caches.

EclipseLink provides an object cache, what I would call a "live" object cache. I believe most other JPA providers provide a data cache. The difference between a data cache and an object cache is that a data cache just caches the object's row, whereas an object cache caches the entire object, including its relationships.

Caching relationships is normally more important than caching the object's data, as each relationship normally represents a database query, so saving n database queries to build an object's relationships is more important than saving the 1 query for the object itself. Some JPA providers augment their data cache with a relationship cache, or a query cache. If a data cache caches relationships at all, it is normally in the form of caching only the ids of the related objects. This can be a major issue. Consider caching a OneToMany relationship: if you only have a set of ids, then you need to query the database for each id that is not in the cache, causing n database queries. With an object cache, you have the related objects, so you never need to query the database.
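The n-queries problem can be made concrete with a small counting model. This is a simplified sketch, not any provider's real implementation; RelationshipCacheModel, queryLineById and the "line-N" strings are hypothetical stand-ins for order lines and their database reads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model (not any provider's real implementation) comparing how a
// OneToMany relationship is resolved from an id-only data cache versus an
// object cache that holds the related objects themselves.
final class RelationshipCacheModel {
    int databaseQueries = 0;

    // Stands in for a single-row database query for one related object.
    String queryLineById(long lineId) {
        databaseQueries++;
        return "line-" + lineId;
    }

    // Data cache: only the ids of the related rows were cached, so every id
    // missing from the line cache costs one database query.
    List<String> linesFromDataCache(List<Long> cachedLineIds, Map<Long, String> lineCache) {
        List<String> lines = new ArrayList<>();
        for (Long id : cachedLineIds) {
            String line = lineCache.get(id);
            if (line == null) {
                line = queryLineById(id);
                lineCache.put(id, line);
            }
            lines.add(line);
        }
        return lines;
    }

    // Object cache: the cached order already references its lines, so no
    // database access is needed.
    List<String> linesFromObjectCache(List<String> cachedLines) {
        return cachedLines;
    }
}
```

With three cached ids and an empty line cache, the data cache path issues three queries, while the object cache path issues none.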

The other advantage to caching objects is that you also save the cost of building the objects from the data. If the object or query is read-only, the cached object can be used directly, otherwise it only needs to be copied, not rebuilt from data.

EclipseLink also supports not caching relationships through the @Noncacheable annotation. Also the @Cache(isolation=PROTECTED) option can be used to ensure read-only entities and queries always copy the cached objects. So you can simulate a data cache with EclipseLink.

One should not underestimate the performance benefits of caching. Whereas other JPA optimizations may improve performance by 10-20%, or 2-5x for the major ones, caching has the potential to improve performance by factors of 100x or even 1000x.

So what are the numbers? In this simple benchmark I compare reading a simple Order object and its relationships (orderLines, customer) under the various caching options.
(Results are in queries per second, so a bigger number is better. The test was single threaded, randomly querying an order from a data set of 1000 orders. Tests were run 5 times and averaged. The database was an Oracle database over a local area network, and low end hardware was used.)

Cache Option              | Cache Config                                      | Average Result (q/s) | % Difference
--------------------------|---------------------------------------------------|----------------------|-------------
No Cache                  | @Cacheable(false)                                 | 965                  | 0%
Object Cache              | @Cacheable(true)                                  | 36,544               | 3,686%
Object Cache              | @Cache(isolation=PROTECTED)                       | 35,107               | 3,538%
Data Cache                | @Cache(isolation=PROTECTED) + @Noncacheable(true) | 1,889                | 95%
Read Only Cache           | @ReadOnly                                         | 940,123              | 97,322%
Protected Read Only Cache | @ReadOnly + @Cache(isolation=PROTECTED)           | 625,602              | 64,729%

The results show that although a data cache provides a significant benefit (~2x), it does not compare with an object cache (~38x). Marking the objects as @ReadOnly provides a significant additional benefit (~1000x over no cache).
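The "% Difference" column and the "~Nx" factors are two views of the same raw q/s numbers. The sketch below shows one way to derive both. The truncating integer formula is an assumption (the post does not state one), although it does reproduce the values in the table above; BenchmarkMath is a hypothetical helper name.

```java
// Sketch of how the "% Difference" and "~Nx" figures relate to the raw
// queries-per-second numbers. Truncating integer division is an assumption;
// it reproduces the table values but the post does not state the formula.
final class BenchmarkMath {
    private BenchmarkMath() {}

    // Percentage gain over the baseline, truncated to a whole percent.
    static long percentDifference(long baseline, long result) {
        return (result - baseline) * 100 / baseline;
    }

    // Speedup factor, e.g. 2.0 means twice as many queries per second.
    static double speedup(long baseline, long result) {
        return (double) result / baseline;
    }

    public static void main(String[] args) {
        long noCache = 965; // "No Cache" row of the table above
        System.out.println(percentDifference(noCache, 940_123)); // Read Only Cache
        System.out.println(speedup(noCache, 36_544));            // Object Cache
    }
}
```

Note that a gain of N% corresponds to a factor of (N / 100) + 1, so 95% is roughly 2x and 97,322% is roughly 974x.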

The object cache caches objects by their Id. This is great for find() or merge() operations, but does not help as much with queries. In EclipseLink any query by Id will also hit the object cache, but queries not by Id will have to hit the database. For each database result the object cache will still be checked, so the cost of building the objects and, most importantly, their relationships can still be avoided.

EclipseLink also supports a query cache. The query cache is configured independently of the object cache, and is configured per query, and not enabled by default. The query cache caches query results by query name and query parameters. This allows any query to obtain a cache hit. The query cache is configured through the "eclipselink.query-results-cache" query hint.
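To show why keying by query name and parameters lets any query get a cache hit, here is a toy model. It is a simplified sketch, not EclipseLink's implementation; QueryResultsCacheModel and the Supplier standing in for the database query are hypothetical. With the real provider you would only set the hint on the query, for example through Query.setHint("eclipselink.query-results-cache", "true").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Simplified model of a query results cache (not EclipseLink's actual
// implementation): results are keyed by query name plus parameter values.
final class QueryResultsCacheModel<R> {
    private final Map<List<Object>, List<R>> results = new HashMap<>();
    private int databaseQueries = 0;

    // Returns cached results on a hit; otherwise runs the query and caches it.
    List<R> execute(String queryName, Supplier<List<R>> databaseQuery, Object... parameters) {
        List<Object> key = new ArrayList<>();
        key.add(queryName);
        key.addAll(Arrays.asList(parameters));

        List<R> cached = results.get(key);
        if (cached == null) {
            cached = databaseQuery.get();
            databaseQueries++;
            results.put(key, cached);
        }
        return cached;
    }

    int databaseQueries() {
        return databaseQueries;
    }
}
```

Running the same named query twice with the same customer id costs one database query; a different customer id is a different key and misses.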

EclipseLink can execute queries in-memory against the object cache. This is not used by default, but can be configured on any query. Since the object cache does not normally contain the entire database, this works best with a FULL cache type, that has been preloaded. This is configured on the query through the "eclipselink.cache-usage" query hint.
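The reason a preloaded cache matters can be seen in a toy model of in-memory querying. Again this is a simplified sketch, not EclipseLink's implementation; InMemoryQueryModel and the string-encoded orders are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Simplified model of querying the object cache in memory (not EclipseLink's
// actual implementation): only objects already in the cache can match.
final class InMemoryQueryModel<T> {
    private final Map<Long, T> objectCache = new LinkedHashMap<>();

    void cache(long id, T object) {
        objectCache.put(id, object);
    }

    // Evaluates the query criteria against cached objects without touching
    // the database, so the result is complete only if the cache is.
    List<T> queryCacheOnly(Predicate<T> criteria) {
        List<T> matches = new ArrayList<>();
        for (T candidate : objectCache.values()) {
            if (criteria.test(candidate)) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```

If only some of a customer's orders have been loaded, the in-memory query silently returns only those, which is why this option works best when the whole data set has been preloaded.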

This next benchmark compares the various caching options with a query. Each query is for the orders for a customer id, which results in 10 Order objects per query. Random customer ids are used.

Cache Option                      | Cache Config                                                | Average Result (q/s) | % Difference
----------------------------------|-------------------------------------------------------------|----------------------|-------------
No Cache                          | @Cacheable(false)                                           | 186                  | 0%
Object Cache                      | @Cacheable(true)                                            | 1,021                | 448%
Object Cache                      | @Cache(isolation=PROTECTED)                                 | 1,085                | 483%
Data Cache                        | @Cache(isolation=PROTECTED) + @Noncacheable(true)           | 198                  | 6%
Read Only Query                   | "eclipselink.read-only"="true"                              | 1,391                | 647%
Read Only Query - Protected Cache | "eclipselink.read-only"="true" + @Cache(isolation=PROTECTED) | 1,351                | 626%
Query Cache                       | "eclipselink.query-results-cache"="true"                    | 5,114                | 2,649%
In-memory Query                   | "eclipselink.cache-usage"="CheckCacheOnly"                  | 2,397                | 1,188%

This shows that the object cache can still provide a significant benefit to queries by caching the relationships (~5x). The query cache performs the best with a ~25x benefit, and in-memory querying also performs well with a ~10x benefit. A data cache provides little benefit to queries.

I have measured the performance of several caching options in this post, but by no means have detailed all of the caching options in EclipseLink.
Other caching options available in EclipseLink include:

  • @Cache - type : FULL, WEAK, SOFT, SOFT_WEAK, HARD_WEAK
  • @Cache - size : size of cache in number of objects
  • @Cache - expiry : millisecond time to live expiry
  • @Cache - expiryTimeOfDay : daily expiry
  • @CacheIndex : non-id cache indexing
  • "eclipselink.cache.coordination" : clustered cache synchronization or invalidation
  • "eclipselink.cache.database-event-listener" : database event driven cache invalidation (Oracle DCN)
  • "eclipselink.query-results-cache.expiry" : query cache time to live expiry
  • "eclipselink.query-results-cache.expiry-time-of-day" : query cache daily expiry
  • TopLink Grid : integration with Oracle Coherence distributed cache
See the EclipseLink UserGuide for more info on caching.

The source code for the benchmarks used in this post can be found here, or downloaded here.

What caching options to use depends on the application and its data. Caching may not be suitable to all types of applications or data, but for those in which it is applicable, it will normally provide the biggest performance benefit that is attainable.