The previous post only had a database of 12 rows. So for this run I will increase the database size by 5,000 to 60,000 rows. Still not a huge database, but should be big enough to highlight any performance differences in the results. I will also increase the size of the query result set from 6 employees, to 5,000 employees.
The first thing that I noticed in this run was that Oracle has a limit of 1,000 parameters per statement. Since the IN batch fetching binds a large array, and I'm reading 5,000 objects, this limit was exceeded and the run blew up with a database error. The
BatchFetchPolicy
in EclipseLink accounts for this and defines a size
for the max number of ids to include the an IN. The default size limit in EclipseLink was suppose to be 500, but I think I remember increasing it to 100,000 to test something when I was developing the feature, and, well..., I guess never set it back, oops, I will fix this...EclipseLink defines a JPA Query hint
"eclipselink.batch.size"
that allows the size to be set. So I will set this to 500 for the test. This means that to read in all of the 5,000 objects, the IN batch fetch will need to execute 10 queries per batch fetched relationship. It will be interesting to see how it compares to the other query optimization techniques.The run time was also increased to 10 minutes from 1 minute because reading 5,000 objects obviously takes longer than reading 6. Also, since the last run our lab got a new database machine. The old database was running on Linux on an old server machine, and the new database is running on a new Oracle Sun server machine running Linux on a virtualized environment. The client machine is the same, my old desktop running Oracle Sun JDK 1.6 on Windows XP. Both databases were Oracle 11g.
Big Data Results, run 1, simple (fetch address, phoneNumbers)
Query | Average (queries/10 minutes) | %STD | %DIF (of standard) |
standard | 27 | 4.5% | 0% |
join fetch | 307 | 0.1% | +1037% |
batch fetch (JOIN) | 310 | 0.2% | +1048% |
batch fetch (EXISTS) | 309 | 0.1% | +1044% |
batch fetch (IN) | 261 | 0.1% | +866% |
The results show that join fetching and batch fetching have basically equivalent performance, and about 10x better performance than the non-optimize query. IN batch fetching does not perform as well as the others with this larger result set. It performs better than I expected, given it has huge IN statement and has to execute 10 queries per relationship. Note that these results differ from the previous post that showed IN batch fetching performing the best for queries with small result sets.
The second run uses the complex query which fetches 9 different relationships.
Big Data Results, run 2, complex (fetch address, phoneNumbers, projects, manager, managedEmployees, emailAddresses, responsibilities, jobTitle, degrees)
Query | Average (queries/10 minutes) | %STD | %DIF (of standard) |
standard | 6 | 0.0% | 0% |
join fetch | 59 | 1.5% | +383% |
batch fetch (JOIN) | 125 | 0.3% | +1983% |
batch fetch (EXISTS) | 124 | 0.3% | +1966% |
batch fetch (IN) | 80 | 3.2% | +1455% |
The results show batch fetching having about 2x the performance of join fetching, and 20x the performance of the non-optimize query. Join fetching still performs 10x faster than the non-optimize case, which is different than the small result set run which gave it worse performance than the non-optimized query. IN batch fetching again did not perform as well as JOIN and EXISTS batch fetching, but still out performed join fetching and was 15x faster than the non-optimized query.
Note that these results are not universal. Expect that every database, every machine, every environment, every query, and every object model will give different results. The basics should be the same though, batch fetch should have better performance than non-optimized queries, and better performance than join fetching for objects with complex relationships. IN batch fetching will perform worse for large result sets, but have similar performance for small result sets. Join fetching will perform well for objects with a small number of relationships.
To see how the results differ in different environments, I did a few more runs on different databases. The next run is for a local Oracle 10g database installed on my desktop. This database is slower than the new server, but will not require a network trip since it is on the same machine as the Java client.
Big Data Results, run 3, simple (fetch address, phoneNumbers)
Query | Average (queries/10 minutes) | %STD | %DIF (of standard) |
standard | 11 | 0.0% | 0% |
join fetch | 346 | 0.7% | +2718% |
batch fetch (JOIN) | 310 | 0.0% | +2718% |
batch fetch (EXISTS) | 343 | 0.6% | +3018% |
batch fetch (IN) | 260 | 0.1% | +2263% |
Big Data Results, run 4, complex (fetch address, phoneNumbers, projects, manager, managedEmployees, emailAddresses, responsibilities, jobTitle, degrees)
Query | Average (queries/10 minutes) | %STD | %DIF (of standard) |
standard | 1 | 0.0% | 0% |
join fetch | 39 | 1.1% | +3800% |
batch fetch (JOIN) | 106 | 0.8% | +10500% |
batch fetch (EXISTS) | 113 | 0.0% | +11200% |
batch fetch (IN) | 36 | 1.2% | +3500% |
These results show similar results for the previous simple run, but about a 30x improvement over the non-optimize query, which is a bigger difference. The JOIN batch fetch did not seem to perform as well as the EXISTS or join fetch.
The complex run only completed a single run in the 10 minutes for the non-optimized query. The JOIN and EXISTS batch fetching performed the best, over 100x faster than the non-optimized query. Join fetching and IN batch fetching did not perform as well, but were both still over 30x faster than the non-optimized query.
Thank you so much for this information!!
ReplyDeleteThis kind of insight is so valuable, and yet so rare to difficult to find online
Hi My Batch query is working fine , but not able to iterate result set class cast exception thrown
ReplyDeleteFINE: SELECT DISTINCT t1.ID AS a1, t1.CREATE_TIME AS a2, t1.CREATOR AS a3, t1.DELETABLE AS a4, t1.IS_PARTIAL AS a5, t1.NAME AS a6, t1.STATE AS a7, t1.SCOPE AS a8, t1.UPDATE_TIME AS a9, t1.UPDATEABLE AS a10, t1.UPDATER AS a11, t1.VERSION AS a12, t1.DATATABLE_ID AS a13 FROM STRINGATTRIBUTE t0, DATACOLUMN t2, DATARECORD t1 WHERE (((t1.DATATABLE_ID = ?) AND ((t1.STATE <> ?) AND (t2.ID LIKE ? AND t0.STRING_VALUE LIKE ?))) AND ((t0.DATARECORD_ID = t1.ID) AND (t2.ID = t0.DATACOLUMN_ID))) ORDER BY t1.UPDATE_TIME DESC LIMIT ?, ?
FINE: SELECT DISTINCT STRING_VALUE, DATACOLUMN_ID, DATARECORD_ID FROM STRINGATTRIBUTE WHERE (DATARECORD_ID IN (?,?,?,?,?))
But when I start iterating batch attribute then below exception is thrown I used eclipselink 2.4
Batch is set as
query.setHint(QueryHints.BATCH, TABLE_ALIAS+".strAttributes");
query.setHint(QueryHints.BATCH_TYPE, BatchFetchType.IN);
WARNING: java.lang.ClassCastException: org.eclipse.persistence.queries.ComplexQueryResult cannot be cast to java.util.Collection