Tuesday, August 13, 2013

Optimizing Java Serialization - Java vs XML vs JSON vs Kryo vs POF

Perhaps I'm naive, but I always thought Java serialization must surely be the fastest and most efficient way to serialize Java objects into binary form. After all, Java is on it's 7th major release, so this is not new technology, and since every JDK seems to be faster than the last, I incorrectly assumed serialization must be very fast and efficient by now. I thought, since Java serialization is binary, and language dependent, it must be much faster and more efficient than XML or JSON. Unfortunately, I was wrong, if you concerned about performance, I would recommend avoiding Java serialization.

Now, don't get me wrong, I'm not trying to dis Java. Java serialization has many requirements, the main one being able to serialize anything (or at least anything that implements Serializable), into any other JVM (even a different JVM version/implementation), even running a different version of the classes being serialized (as long as you set a serialVersionUID). The main thing is it just works, and that is really great. Performance is not the main requirement, and the format is standard and must be backward compatible, so optimization is very difficult. Also, for many types of use cases, Java serialization performs very well.

I got started on this journey into the bowels of serialization while working on a three tier concurrency benchmark. I noticed a lot of the CPU time being spent inside Java serialization, so I decided to investigate. I started by serializing a simple Order object that had a couple of fields. I serialized the object and output the bytes. Although the Order object only had a few bytes of data, I was not that naive to think it would serialize to only a few bytes, I knew enough about serialization that it would at least need to write out the full class name, so it knew what it had serialized, so it could read it back. So I expected, maybe 50 bytes or so. The result was over 600 bytes, that's when I realized Java serialization was not as simple as I had imagined.

Java serialization bytes for Order object

----sr--model.Order----h#-----J--idL--customert--Lmodel/Customer;L--descriptiont--Ljava/lang/String;L--orderLinest--Ljava/util/List;L--totalCostt--Ljava/math/BigDecimal;xp--------ppsr--java.util.ArrayListx-----a----I--sizexp----w-----sr--model.OrderLine--&-1-S----I--lineNumberL--costq-~--L--descriptionq-~--L--ordert--Lmodel/Order;xp----sr--java.math.BigDecimalT--W--(O---I--scaleL--intValt--Ljava/math/BigInteger;xr--java.lang.Number-----------xp----sr--java.math.BigInteger-----;-----I--bitCountI--bitLengthI--firstNonzeroByteNumI--lowestSetBitI--signum[--magnitudet--[Bxq-~----------------------ur--[B------T----xp----xxpq-~--xq-~--
(note "-" means an unprintable character)

As you may have noticed, Java serialization writes out not only the full class name of the object being serialized, but also the entire class definition of the class being serialized, and all of the referenced classes. The class definition can be quite large, and seems to be the main performance and efficiency issue, especially when writing out a single object. If you are writing out a large number of objects of the same class, then the class definition overhead is not normally a big issue. One other thing that I noticed, is that if your object has a reference to a class (such as a meta-data object), then Java serialization will write the entire class definition, not just the class name, so using Java serialization to write out meta-data is very expensive.

Externalizable

It is possible to optimize Java serialization through implementing the Externalizable interface. Implementing this interface avoids writing out the entire class definition, just the class name is written. It requires that you implement the readExternal and writeExternal methods, so requires some work and maintenance on your part, but is faster and more efficient than just implementing Serializable.

One interesting note on the results for Externalizable is that it was much more efficient for a small number of objects, but actually output slightly more bytes than Serializable for a large number of objects. I assume the Externalizable format is slightly less efficient for repeated objects.

Externalizable class

public class Order implements Externalizable {
    private long id;
    private String description;
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List orderLines = new ArrayList();
    private Customer customer;

    public Order() {
    }

    public void readExternal(ObjectInput stream) throws IOException, ClassNotFoundException {
        this.id = stream.readLong();
        this.description = (String)stream.readObject();
        this.totalCost = (BigDecimal)stream.readObject();
        this.customer = (Customer)stream.readObject();
        this.orderLines = (List)stream.readObject();
    }

    public void writeExternal(ObjectOutput stream) throws IOException {
        stream.writeLong(this.id);
        stream.writeObject(this.description);
        stream.writeObject(this.totalCost);
        stream.writeObject(this.customer);
        stream.writeObject(this.orderLines);
    }
}

Externalizable serialization bytes for Order object

----sr--model.Order---*3--^---xpw---------psr--java.math.BigDecimalT--W--(O---I--scaleL--intValt--Ljava/math/BigInteger;xr--java.lang.Number-----------xp----sr--java.math.BigInteger-----;-----I--bitCountI--bitLengthI--firstNonzeroByteNumI--lowestSetBitI--signum[--magnitudet--[Bxq-~----------------------ur--[B------T----xp----xxpsr--java.util.ArrayListx-----a----I--sizexp----w-----sr--model.OrderLine-!!|---S---xpw-----pq-~--q-~--xxx

Other Serialization Options

I started to investigate what other serialization options there were in Java. I started with EclipseLink MOXy, which supports serializing objects to XML or JSON through the JAXB API. I was not expecting XML serialization to outperform Java serialization, so was quite surprised when it did for certain use cases. I also found a product Kryo, which is an open source project for optimized serialization. I also investigated the Oracle Coherence POF serialization format. There are pros and cons with each product, but my main focus was on how their performance compared, and how efficient they were.

EclipseLink MOXy - XML and JSON

The main advantage of using EclipseLink MOXy to serialize to XML or JSON is that both are standard, portable formats. You can access the data from any client using any language, so are not restricted to Java, as with Java serialization. You can also integrate your data with web services and REST services. Both formats are also text based, so human readable. No coding or special interfaces are required, only meta-data. The performance is quite acceptable, and outperforms Java serialization for small data-sets.

The draw backs are that the text formats are less efficient than optimized binary formats, and JAXB requires meta-data. So you need to annotate your classes with JAXB annotations, or provide an XML configuration file. Also, circular references are not handled by default, you need to use an @XmlIDREF to handle cycles.

JAXB annotated classes

@XmlRootElement
public class Order {
    @XmlID
    @XmlAttribute
    private long id;
    @XmlAttribute
    private String description;
    @XmlAttribute
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List orderLines = new ArrayList();
    private Customer customer;
}

public class OrderLine {
    @XmlIDREF
    private Order order;
    @XmlAttribute
    private int lineNumber;
    @XmlAttribute
    private String description;
    @XmlAttribute
    private BigDecimal cost = BigDecimal.valueOf(0);
}

EclipseLink MOXy serialization XML for Order object

<order id="0" totalCost="0"><orderLines lineNumber="1" cost="0"><order>0</order></orderLines></order>

EclipseLink MOXy serialization JSON for Order object

{"order":{"id":0,"totalCost":0,"orderLines":[{"lineNumber":1,"cost":0,"order":0}]}}

Kryo

Kryo is a fast, efficient serialization framework for Java. Kryo is an open source project on Google code that is provided under the New BSD license. It is a small project, with only 3 members, it first shipped in 2009 and last shipped the 2.21 release in Feb 2013, so is still actively being developed.

Kryo works similar to Java serialization, and respects transient fields, but does not require a class be Serializable. I found Kryo to have some limitations, such as requiring classes to have a default constructor, and encountered some issues in serializing java.sql.Time, java.sql.Date and java.sql.Timestamp classes.

Kryo serialization bytes for Order object

------java-util-ArrayLis-----model-OrderLin----java-math-BigDecima---------model-Orde-----

Oracle Coherence POF

The Oracle Coherence product provides its own optimized binary format called POF (portable object format). Oracle Coherence is an in-memory data grid solution (distributed cache). Coherence is a commercial product, and requires a license. EclipseLink supports an integration with Oracle Coherence through the Oracle TopLink Grid product that uses Coherence as the EclipseLink shared cache.

POF provides a serialization framework, and can be used independently of Coherence (if you already have Coherence licensed). POF requires that your class implement a PortableObject interface and read/write methods. You can also implement a separate Serializer class, or use annotations in the latest Coherence release. POF requires that each class be assigned an constant id ahead of time, so you need some way determine this id. The POF format is a binary format, very compact, efficient, and fast, but does require some work on your part.

The total bytes for POF was 32 bytes for a single Order/OrderLine object, and 1593 bytes for 100 OrderLines. I'm not going to give the results, as POF is part of a commercially licensed product, but is was very fast.

POF PortableObject

public class Order implements PortableObject {
    private long id;
    private String description;
    private BigDecimal totalCost = BigDecimal.valueOf(0);
    private List orderLines = new ArrayList();
    private Customer customer;

    public Order() {
    }
    
    public void readExternal(PofReader in) throws IOException {
        this.id = in.readLong(0);
        this.description = in.readString(1);
        this.totalCost = in.readBigDecimal(2);
        this.customer = (Customer)in.readObject(3);
        this.orderLines = (List)in.readCollection(4, new ArrayList());
    }

    public void writeExternal(PofWriter out) throws IOException {
        out.writeLong(0, this.id);
        out.writeString(1, this.description);
        out.writeBigDecimal(2, this.totalCost);
        out.writeObject(3, this.customer);
        out.writeCollection(4, this.orderLines);
    }
}

POF serialization bytes for Order object

-----B--G---d-U------A--G-------

Results

So how does each perform? I made a simple benchmark to compare the different serialization mechanisms. I compared the serialization of two different use cases. The first is a single Order object with a single OrderLine object. The second is a single Order object with 100 OrderLine objects. I compared the average serialization operations per second, and measure the size in bytes of the serialized data. Different object models, use cases, and environments will give different results, but this gives you a general idea of the performance differences in different serializers.

The results show that Java serialization is slow for a small number of objects, but good for a large number of objects. Conversely, XML and JSON can outperform Java serialization for a small number of objects, but Java serialization is faster for a large number of objects. Kryo and other optimized binary serializers outperform Java serialization with both types of data.

You may wonder why it is relevant that something that takes less than a millisecond is of any relevance to performance, and you may be right. In general you would only have a real performance problem if you wrote out a very large number of objects, and then Java serialization performs quite well, so is the fact that it performs very poorly for a small number of object relevant? For a single operation, this is probably true, but if you execute many small serialization operations, then the cost is relevant. A typical server servicing many clients will typically send out many small requests, so while the cost of serialization is not great enough to make any of these single requests take a long time, it will dramatically effects the scalability of the server.

Order with 1 OrderLine

SerializerSize (bytes)Serialize (operations/second)Deserialize (operations/second)% Difference (from Java serialize)% Difference (deserialize)
Java Serializable636128,63419,1800%0%
Java Externalizable435160,54926,67824%39%
EclipseLink MOXy XML101348,05647,334170%146%
Kryo90359,368346,984179%1709%

Order with 100 OrderLines

SerializerSize (bytes)Serialize (operations/second)Deserialize (operations/second)% Difference (from Java serialize)% Difference (deserialize)
Java Serializable2,71516,47010,2150%0%
Java Externalizable2,81116,20611,483-1%12%
EclipseLink MOXy XML6,6287,3042,731-55%-73%
Kryo121622,86231,49938%208%

EclipseLink JPA

In EclipseLink 2.6 development builds, and to some degree 2.5, we have added the ability to choose your serializer anywhere that EclipseLink does serialization.

One such place is in serialized @Lob mappings. You can now use the @Convert annotation to specify a serializer such as @Convert(XML), @Convert(JSON), @Convert(Kryo). In addition to optimizing performance, this provides an easy mechanism to write XML and JSON data to your database.

Also for EclipseLink cache coordination you can choose your serializer using the "eclipselink.cache.coordination.serializer" property.

The source code for the benchmarks used in this post can be found here, or downloaded here.