Wednesday, May 4, 2011

Data Partitioning - Scaling the Database

In Enterprise Java most of the effort is normally done to scale the mid-tier application and its server.  Pretty much every Java application server supports clustering and scaling out the application to several mid-tier machines.  Even if the application server does not officially support clustering, it is normally pretty easy to have a "cluster" of application servers fronted by a round-robin load-balancer.  This would even work with something as simple as Tomcat.

The mid-tier normally scales very well to a cluster, as it does not have any shared in-memory data, as this data is normally stored in a database.  All of the mid-tier cluster members access the same database, and life is good.  The application can have unlimited performance, simply by adding more mid-tier machines.  But what happens when the poor database machine suddenly can't take any more requests?

The most common solution to scaling the database seems to be to buy a bigger and badder database machine.  If 8 cores is not cutting it, then perhaps 16 cores will.  This solution in general works pretty good, assuming hardware vendors can keep stuffing more cores into their machines.  This solution seems to be used in Enterprise Java performance benchmarks such as SpecJ, if you look at the nodes column of the SpecJ 2010 results, all the results have a single database node, some with as many as 40 cores, even though some have 8 mid-tier nodes.

But what happens when you can't stuff any more cores into a machine, or the cost of an insanely multi-core machine greatly outweighs the cost of multiple lower end hardware machines?  Perhaps this is just a hardware problem, and you just need to wait for the hardware vendors to make a bigger and badder machine, but there are other solutions from the wonderful world of software.

The first solution to look into is to optimize your own application (as always).  Generally the application's code is not so efficient in its database access, and by optimizing the number and types of queries hitting the database, using parametrized SQL, using batch writing, using lazy, join and batch fetching, a significant load can be removed from the database.  But perhaps you already did that, or don't have the expertise, or just don't feel like it.

The second solution is to optimize your database.  By ensuring your database is configured optimally, has the correct indexes, queries are using the optimal query plan, and the disk access optimally, its performance, and thus scalability can be improved.    But perhaps you already did that, or don't have the expertise, or just don't feel like it.

The third solution is to investigate caching in the mid-tier.  By caching objects and data in the mid-tier, you can offload a lot of the queries hitting the database, and improve your application's performance to boot.  Most JPA providers support caching, and some such as EclipseLink offer quite advanced caching functionality including invalidation, and coordinated clustered caches.  JPA 2.0 defines some basic caching annotations to enable and access the cache.

Caching mainly benefits reads, but some caching solutions such as Oracle Coherence offer the ability to offload writes as well.  Oracle TopLink Grid provides JPA support for Coherence.

Caching can be a good solution, but there can be issues with stale data, clustering, and mid-tier contention.  Some caching solution are very good, but not always as good at managing concurrent access to data as relational databases that have been doing it for decades.  Also if your database has become a bottleneck because of writes, then caching reads may not be a solution.

The best solution is to scale the database through clustering the database across multiple machines.  This could be a real clustered database, such as Oracle RAC, or just multiple regular database instances.  Clustered database are good, and can improve your scalability without much work, but depending on your application you may also have to partition your data across the database nodes for optimal scalability. Without partitioning, if you write a row on one node, then access it on another, the other node must request the latest copy of the data from the other node, this can potentially make performance worse.

Partitioning splits your data across each of the database nodes.  There is horizontal partitioning, and vertical partitioning.  Vertical partitioning is normally the easiest to implement.  You can just put half of your classes on one database, and the other half on another.  Ideally the two sets would be isolated from each other and not have any cross database relationships.

For horizontal partitioning  you need to split your data across multiple database nodes.  Each database node will have the same tables, but each node's table will only store part of the data.  You can partition the data by the data values, such as range partitioning, value partitioning, hash partitioning, or even round robin.

To enable data partitioning you require your persistence solution to be aware of how to partition the data.   EclipseLink 2.2 added support for partitioning data.  Both vertical and horizontal partitioning is supported.  Several partitioning options are provided at the Session, Entity and Query level,
  • Range partitioning - each partition maps a range of field values.
  • Value partitioning - each field value maps to a partition.
  • Hash partitioning - the field value is hashed to determine its partition.
  • Pinned partitioning - allows an Entity or query to be vertically partitioned.
  • Round robin - allows load balancing of requests across multiple database nodes.
  • Replication - allows data to be replicated across multiple database nodes.
EclipseLink supports partitioning across any database, including both clustered databases such as Oracle RAC, and  standard databases such as MySQL. Relationships and queries across database partitions are supported, but joins across partitions are only supported for clustered databases.

So how does data partitioning with JPA and EclipseLink scale?  To determine the answer, I developed a simple order processing example.  The example defines an Order, OrderLine and Customer.  The example client processes orders for a Customer using 16 client threads.  The application is primarily insert oriented, so heavily uses the database.  I first ran the application without partitioning on a single MySQL database instance.  To give the poor database no chance of keeping up to the mid-tier client, I ran the mid-tier on a virtualized 8 core machine with 16g of ram (Oracle Sun hardware, Oracle Linux OS).  I ran the MySQL database on a similar machine, but only gave it 1 virtual core and 8g or RAM.  So, I was pretty sure the application would be bottlenecking on the database.  This was the goal, to simulate a cluster of mid-tier machines accessing a single database machine.

Next, I enable partitioning of the Order and OrderLine by the ORDER_ID using hash partitioning across two database nodes.  I also hash partitioned Customer by its ID.  This resulted in about half of the transactions going to one database, and half to the other.  Because the Order and the OrderLine shared the same ORDER_ID, they were partitioned to the same database node, so I did not need to worry about transactions spanning multiple nodes.  The read for the Customer could go to either node, but because it was a non-transactional read, this was just routed separately by EclipseLink, which has support for using different connections for non-transactional reads versus transactional writes.  Having writes span multiple nodes in a single transaction is normally not desirable.  EclipseLink allows this and can be integrated with JTA to give 2-phase commit across the nodes.  If JTA is not used EclipseLink still does a 2-stage commit, but there are no gaurentees if all of the writes succeed, but the commit transaction fails.

The resulting order was mapped as,
public class Order implements Serializable {
    private long id;
    private String description;
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    @OneToMany(mappedBy="order", cascade=CascadeType.ALL, orphanRemoval=true)
    private List orderLines = new ArrayList();
    private Customer customer;

For the full source code for the example see here.

For the second run a 2nd MySQL database was added running on a separate 1 core machine.  The result showed a 66% increase in scalability for the application, processing close to 2x as many orders. The test application was run for 1 minute and the total number of processed orders for all 16 client threads totaled. This was run 5 times and the results averaged for each configuration.

The results:
ConfigurationThreadsAverage processed orders%STD%DIF
Single database1611,1500.4%0%
2 node partition1618,5832.2%66%

The results show that through effective partitioning the database can be scaled out to multiple machines as well as the mid-tier.


  1. Great post. Thanks for sharing the whole source code as it will be more easy to understand by examining the code. The partitioning process is successfully carried out and it will be more easy to do the same for other people just by following your piece of code. Cheers !
    sap erp system

  2. Eclipselink documentation is scarce and allways send you to Javadoc so, partitioning can be set via xml?


  3. Yes, you can use the eclipselink orm.xml to configure partitioning. The xml elements mirror the annotations.

    The schema is here,

  4. Excellent pieces. Keep posting such kind of information on your blog. I really impressed by your blog.
    Vee Eee Technologies

  5. Hi James, nice post.

    I have a question, is there any way one can change the Partition configuration dynamically?

    A question/request, could you please provide us with an example about how to 'properly' use the @RoundRobinPartitioning attribute?

    And one last question, at EclipseLink documentation says "Partitioning can be set at the session, entity, relationship and query level", how can a partition be set at a query level?


  6. You can access the EclipseLink metadata through the Session and ClassDescriptor API to change the descriptor's PartitioningPolicy. Obviously you need to be very careful changing anything at runtime. A better way is to use EclipseLink's external meta-data, You could also implement your own adaptive PartitioningPolicy.

    RoundRobinPartitioning can be used for reads, normally you must replicate the writes, so the object can be read from any database partition (unless you are using a RAC).

    The query hint is, "eclipselink.partitioning" to set the PartitioningPolicy for a query.