A Scala Compiler Plugin for Avro Records

Background

We are building a Scala compiler plugin to auto-generate Avro classes based on some simple definitions. The plugin targets the Scala 2.8 compiler and the Avro 1.3.0 runtime.

Usage

Let us define a simple record class. Normally in Avro, you would write a JSON file which looks like this:

{"namespace" : "localhost.test","protocol"  : "Test","types" : [   { "name" : "Item", "type" : "record", "fields" : [      { "name" : "name", "type" : "string" },      { "name" : "cost", "type" : "double" }   ]},   { "name" : "ItemList", "type" : "record", "fields" : [      { "name" : "items" , "type" : { "type" : "array", "items" : "Item" } }   ]}]}

From this definition, the Avro compiler generates Item.java and ItemList.java, which you can then use in your Scala application.

With our plugin, you can instead write Scala case classes which accomplish the same task, but look a lot cleaner:

package localhost.test

import com.googlecode.avro.annotation.AvroRecord

@AvroRecord
case class Item (var name: String, var cost: Double)

@AvroRecord
case class ItemList (var items: List[Item])

That’s all you need to do! Our compiler plugin automatically generates the methods that make your case classes Avro serializable. Just run scalac with the plugin:

$ scalac -classpath target/avro-scala-compiler-plugin-1.0-SNAPSHOT.jar -Xpluginsdir target -d target/classes test.scala
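
Under the hood, the generated methods are roughly the SpecificRecord plumbing you would otherwise have to write by hand. A hand-written sketch for Item gives the idea (illustrative only; the schema string, the Utf8 conversion, and the no-argument constructor are my assumptions about what the plugin emits, not its actual output):

import org.apache.avro.{Schema, AvroRuntimeException}
import org.apache.avro.specific.SpecificRecordBase
import org.apache.avro.util.Utf8

// Roughly what the plugin would add to Item (a sketch, not the actual generated code).
case class Item(var name: String, var cost: Double) extends SpecificRecordBase {
  def this() = this("", 0.0)   // Avro needs a no-argument constructor

  def getSchema: Schema = Schema.parse(
    """{"type": "record", "name": "Item", "namespace": "localhost.test",
        "fields": [{"name": "name", "type": "string"},
                   {"name": "cost", "type": "double"}]}""")

  def get(i: Int): AnyRef = i match {
    case 0 => new Utf8(name)                  // Avro strings travel as Utf8
    case 1 => cost.asInstanceOf[AnyRef]       // box the primitive double
    case _ => throw new AvroRuntimeException("Bad index")
  }

  def put(i: Int, v: AnyRef): Unit = i match {
    case 0 => name = v.toString
    case 1 => cost = v.asInstanceOf[Double]
    case _ => throw new AvroRuntimeException("Bad index")
  }
}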

Now you can use the classes as follows:

import localhost.test._
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import org.apache.avro.specific._
import org.apache.avro.io._
import scala.reflect.Manifest

def toByteArray[T <: SpecificRecord](obj: T): Array[Byte] = {
  // NOTE: the original example is truncated at this point; the body below is a
  // reconstruction that uses Avro 1.3's public BinaryEncoder constructor.
  val out = new ByteArrayOutputStream
  val writer = new SpecificDatumWriter[T](obj.getSchema)
  writer.write(obj, new BinaryEncoder(out))
  out.toByteArray
}
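
A matching deserialization helper, plus the round trip that produces the output below, would look something like this (a sketch: the fromByteArray name and the Manifest-based instantiation are illustrative, I am assuming Avro 1.3's public BinaryDecoder(InputStream) constructor, and the record needs the no-argument constructor mentioned above):

import java.io.ByteArrayInputStream

def fromByteArray[T <: SpecificRecord](bytes: Array[Byte])(implicit m: Manifest[T]): T = {
  // Build an empty record via its no-arg constructor and let Avro fill it in.
  val record = m.erasure.newInstance.asInstanceOf[T]
  val reader = new SpecificDatumReader[T](record.getSchema)
  reader.read(record, new BinaryDecoder(new ByteArrayInputStream(bytes)))
}

val itemList = ItemList(List(Item("Pen", 0.5), Item("Chair", 15.99)))
println("Original itemList: " + itemList)

val newItemList = fromByteArray[ItemList](toByteArray(itemList))
println("New itemList: " + newItemList)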

Which produces the following output (assuming that the generated classes are in your classpath):

Original itemList: {"items": [{"name": "Pen", "cost": 0.5}, {"name": "Chair", "cost": 15.99}]}
New itemList: {"items": [{"name": "Pen", "cost": 0.5}, {"name": "Chair", "cost": 15.99}]}

We can also handle unions in a typesafe manner, by modeling the union as a sealed trait whose subtypes are the members of the union:

package localhost.testunion

import com.googlecode.avro.annotation.{AvroRecord, AvroUnion}

@AvroUnion
sealed trait Car

@AvroRecord
case class Honda(var model: String) extends Car

@AvroRecord
case class Toyota(var model: String) extends Car

@AvroRecord
case class Dealer(var cars: List[Car])

Which we can then use as follows:

import localhost.testunion._

val dealer = Dealer(List(Honda("Pilot"), Toyota("Camry")))
println("Dealer: " + dealer)

Which will output:

Dealer: {"cars": [{"model": "Pilot"}, {"model": "Camry"}]}
Performance

I ran some very preliminary performance tests comparing Avro serialization with classes generated by the stock Avro Java compiler, Avro serialization with classes generated by this plugin, and native Java serialization. Here are the results:

See http://code.google.com/p/avro-scala-compiler-plugin/source/browse/trunk/script_test_classes.scala for the test code. The record of interest was:

case class Record(var x: Int, var y: String, var z: Boolean)

See http://code.google.com/p/avro-scala-compiler-plugin/source/browse/trunk/test_classes.scala for the actual code of the records.

  • Stock indicates a SpecificRecord generated by the supplied Avro compiler.
  • Plugin indicates a SpecificRecord generated by this compiler plugin.
  • Java indicates a Scala case class which implements java.io.Serializable.

In addition to performance, I measured the length of the byte string generated by each of these methods for the record given above (a sketch of such a measurement follows the list). The results were:

  • Stock - 18 bytes
  • Plugin - 18 bytes
  • Java - 100 bytes
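
These sizes can be checked with a quick measurement along the following lines (a sketch: it reuses the toByteArray helper and the plugin-generated Record from above, and JRecord is a hypothetical stand-in for the java.io.Serializable variant; exact byte counts depend on class and field names):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Avro size, via the plugin-generated Record.
val avroLen = toByteArray(Record(1, "hello", true)).length

// Plain Java serialization size, via an equivalent Serializable case class.
case class JRecord(var x: Int, var y: String, var z: Boolean) extends java.io.Serializable

val bos = new ByteArrayOutputStream
val oos = new ObjectOutputStream(bos)
oos.writeObject(JRecord(1, "hello", true))
oos.close()
val javaLen = bos.toByteArray.length

println("Avro: " + avroLen + " bytes, Java: " + javaLen + " bytes")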
Known Limitations
  • This only works for Scala case classes defined at the top level.
  • Several Avro features are punted on. These include enumerations, fixed record fields, and maps.
  • Right now, you have to add the companion object class manually. I plan to fix this soon.
  • null fields are not properly handled right now.
TODO List
  • Move the schema to a Scala object, so that schemas do not have to be reparsed every time a new instance is created
  • To implement the above fix, I introduced a hack where you have to add the companion object class manually. See Known Limitations.
  • Handle maps (which Avro supports).
  • Handle null fields.
  • Do the right thing when it comes to byte arrays and strings. Right now the user is forced to use an Array[Byte] for a byte array, and a String for a string. However, since these are represented in Avro as java.nio.ByteBuffer and org.apache.avro.util.Utf8 internally, the user should be able to specify said types and get the intended meaning (right now this is an error).
  • Use the Scala Option[T] class to handle fields which can be null.
  • Consider using the Scala Either[A,B] class as a shorthand for a simple two part union.
  • Be more robust with error handling (rather than throwing random exceptions).
  • Find a way to make this work in the Scala REPL.
  • Recover the lost performance when using the plugin versus the Avro Java compiler.
  • Find a way to integrate this into a build process better (right now you have to set the correct classpaths and stuff yourself).