Thursday, March 29, 2012

NoSQL

In the beginning data was free and wild. It was not confined to rows and columns and not bound by standardization. Data access was unruly and proprietary. These were the first "NoSQL" databases. They consisted of flat-file, hierarchical and network databases such as VSAM, IMS and ADABAS.

Then there was SQL, and things were good.

SQL was developed during the golden age of data in the 1970s. Database access became standardized through the SQL language and the relational model. The 1970s saw the birth of relational database products such as RDBMS, Ingres, Oracle and DB2. The 1980s saw ANSI standardization of the SQL language, and the adoption of client-server computing.

However, the legacy databases still existed, as well as the legacy applications that accessed them. New applications needed to access the old data, and this was in general a very painful experience.

Back in the good old Smalltalk days during the 1990s, Smalltalk was unofficially adopted as the programming language of choice for large corporate projects. It was the beginning of the commercial adoption of object-oriented programming, in Smalltalk, C++ and other OO languages. Things were great, but there was a dark side. All of the data was stored in relational databases, or worse, legacy mainframe databases. Fitting round objects into square relational tables was difficult and cumbersome. Two solutions emerged: object-oriented databases and object-relational mapping.

New commercial object-oriented database management systems (OODBMS) emerged in the 1990s, including Versant, Gemstone and ObjectStore. They were integrated with their respective languages, Smalltalk and C++, and stored data as it was represented in memory, instead of in the relational model. These were the 2nd generation of "NoSQL" databases. There was little standardization and solutions were mainly proprietary. Access to the data from non-object-oriented languages was difficult. The world did not adopt this new model, as it had previously adopted the relational model. The world's data remained in the trusted, standardized and universally accessible relational model.

Object-relational mapping allowed objects to be used in the programming model, while converting them to relational rows and SQL when persisted. A lot of OR mapping frameworks were built, including many corporate in-house solutions. TopLink, a product from The Object People, became the leading OR mapper in the Smalltalk language. In C++ there was Persistence, as well as various other products in various languages.

Although the relational model was the industry standard for any new applications, much of the world's data remained in mainframe databases. The data was slowly being migrated, but most corporations still had mainframe data. Consulting at TopLink clients in the 90s, I found most clients were building applications on relational databases, but still had to get some data from the mainframe. This is when we created the first version of TopLink's "NoSQL" support. Of course NoSQL was not a buzzword at the time, so the offering was called TopLink for the Mainframe. The main problem was that everyone's mainframe data and access was different, so the product involved lots of consulting.

When Java came along, TopLink moved from Smalltalk to Java. OR mapping became very popular in Java and many new products came to market. The first real OR standard came in the form of EJB CMP. It had its "issues" to say the least, and was coupled with the J2EE platform. A new competing standard, JDO, was created in retaliation to CMP. To reconcile the issue of having two competing Java standards, JPA was created to replace them both, and was adopted by most OR mapping products.

In response to the popularity of object-oriented computing, the relational database vendors created the object-relational paradigm. This allowed storage of structured object types and collections in relational tables. SQL3 (SQL 1999) defined new query syntax to access this data. Despite some initial hype, the object-relational paradigm was not successful, and although the features remain in Oracle, DB2 and Postgres, the world stayed with the trusted relational model.

The panic around Y2K had the good fortune of getting most corporations and governments off mainframe databases, and into relational databases. Some legacy data still remained, so we also offered TopLink for the Mainframe in Java. At that time the Internet was taking off, and XML was becoming popular. Since XML is hierarchical data that you could convert any mainframe data to, it became part of our solution for accessing legacy data and the TopLink SDK was born.

With the explosion of the Internet, XML was becoming increasingly popular. This once again led to the questioning of the relational model, and the creation of XML databases (the 3rd generation of NoSQL). There were several XML databases that achieved much hype, but limited market success. The relational database vendors responded by adding XML support for storage of XML in relational tables.

Again the world stayed with the relational model.

The TopLink SDK also provided simple object to XML mapping, perhaps the first such product to do so. As XML usage in Java became mainstream, the TopLink SDK was split into two products. TopLink Moxy became TopLink's object to XML mapping solution. TopLink EIS became TopLink's legacy data persistence solution.

Around 2009 the term NoSQL was used to categorize the new distributed databases being used at Google, Amazon and Facebook. The databases characterized themselves as being highly scalable, not adhering to ACID transaction semantics, and having limited querying. The NoSQL term grew to include the various other non-relational databases that have emerged throughout the ages.

Is the relational model dead? Will the world switch to the NoSQL model, and will data once again be free? Only time will tell. If history teaches us anything, one would expect the relational model to persist. NoSQL has already been renamed in some circles to "Not Only SQL", to leave room for the NoSQL databases to support the SQL standard. In fact, some NoSQL databases already have JDBC drivers. My intuition is that the two models will unite. Perhaps this has already begun, with some NoSQL databases adding SQL support, and some relational databases extending their clustering support, such as MySQL Cluster.

EclipseLink 2.4 will contain JPA support for NoSQL databases. Initial support will include MongoDB and Oracle NoSQL. This support is already available in the EclipseLink nightly builds. Technically, this is not new functionality, as EclipseLink (formerly TopLink) has been supporting non-relational data for over a decade, but the JPA support is new.

In the upcoming months I will be blogging about some of the new features in EclipseLink to support NoSQL. This blog post is solely an introduction, so sorry to those expecting hard content.

4 comments :

  1. SQL is not dead, but SQL products must evolve (and already are evolving) to support new things.

    SQL products currently suffer from a few important deficiencies:

    - Things like sharding and rebalancing, replication across multiple data centers, scaling horizontally for reads, writes, and querying are limited, hard or expensive (usually all three). In practice, you lose most of the querying flexibility if you try to scale SQL products. Forget about expensive joins beyond a few million rows. Order by simply is not practical either. Etc.

    - SQL is great if you have a schema that won't change a lot, and schemaless relational databases don't really exist. This makes schema evolution difficult/tedious, and a lot of applications have little need for an overly restrictive schema. Workarounds basically mean giving up on what makes SQL useful (e.g. storing XML or JSON in a blob).

    - Related to that, some queries that are really complicated in SQL are really easy in e.g. SOLR with a proper schema. Officially it's a text indexing solution but I find it extremely useful as an alternative means for querying structured data as well.

    - ACID is nice to have but not always feasible. Eventual consistency is a more relaxed model that comes at a price but that also helps you scale. Do you really need to be able to read your own writes immediately? Is it OK for data to take a few minutes to replicate world wide?

    An OSS relational product that scales, replicates & shards and provides flexibility and choice with regard to schemaless operation, indexing and eventual consistency would convert a lot of current NoSQL products back into the SQL camp. But I wouldn't necessarily call such a product a relational database anymore.

    On the other hand, not all data is relational or even object relational. Graph databases for example are really tedious to retrofit into a relational model. Indexing random attributes in a json tree (like mongodb does) is actually kind of neat and saves you from the usual object modeling & mapping to tables. A lot of relational databases could be document oriented instead. Couchdb style map reduce querying is a nice way to do stuff that would cripple most SQL databases, even for modest data sets.

  2. Hi James.

    Question: Is there any work planned (or underway)...that you know of...to support Cassandra with new NoSQL JPA?

    Thanks.

    Replies
    1. Cassandra support is something we are interested in, but it will most likely not be in the 2.4 release.

      You can build your own platform for it if you are adventurous. I can provide you with any help that you need. The best way to start would be to just copy and rename the MongoPlatform and Mongo JCA adapter classes.

    2. https://github.com/impetus-opensource/Kundera

      is already moving in this direction
