Sunday, August 31, 2008

More on "Towards a Streaming SQL Standard"

The title of the presentation "Towards a Streaming SQL Standard" might be a little misleading, in my opinion. The title of the paper might have been "Oracle does this, Streambase does that, let's make a new language construct that enables both products to do this and that". And this is a great example of cooperation between vendors who researchers probably share a common academic heritage.

I don't pretend to understand why the new SPREAD construct is important. It will probably dawn on me when my team has an actual use case that the SPREAD construct solves. This is the case that I experienced a few months ago with Coral8 when we needed windows that flushed themselves when a certain column changed value (to be fair, Streambase already had this in their language).

A few points:

1) Despite what I think about Streambase's marketing and sales organization, you must admit that Zdonik and Cherniack are first-class researchers, and have contributed a lot to the field of CEP.

2) I get confused about Oracle's CEP offering. Is this paper talking about the CEP product that came with the BEA purchase or the original Oracle CEP product? Does any of this include work that they may have incorporated from the purchase of ESPER?

3) I would love to see Coral8's and Aleri's responses to this paper. Do their versions of Streaming SQL already do what the new SPREAD operator is purporting to do?

4) Will any of this standards work bubble up to the work that the STAC A1 council is doing? The STAC A1 council must be vigilant to ensure that we don't include benchmarks that might show off a certain, vendor-specific feature, unless this new feature solves an important business case. Likewise, if the SPREAD operator is important enough that all vendors rush to implement it, then this should be part of the STAC benchmark suite.

5) If the SPREAD operator is important, I expect that Coral8 will implement it right away.

©2008 Marc Adler - All Rights Reserved.
All opinions here are personal, and have no relation to my employer.


Jon said...

Thanks much for pointing out the VLDB paper, Marc. I read it this afternoon and have mostly digested it. And I think we can do everything in the paper, in the Aleri Streaming Platform, with one addition that I've toyed with before.

Please correct me if I'm wrong, but the concept of "batch" in the paper seems to correspond directly with our notion of "transaction". In the Aleri Streaming Platform, a "transaction" is simply a sequential block of events. Events have timestamps carried in the "rowtime" field. Since all events in the batch/transaction have the same timestamp by design, it holds that batch equivalence is a refinement of timestamp equivalence. In other words, the definition in the paper of "batch" coincides with "transaction".

[NB. You probably remember that an "event" in Aleri is a record with an operation (insert/update/delete/upsert). The "tuples" in the paper appear to be all "inserts". Also, our notion of "transaction" is not as complicated as one in a relational database, which might be nested, have rollbacks, etc.]

All of our streams work on batches/transactions. That is, a stream processes a batch/transaction and produces a batch/transaction. It's an invariant. That's the way we maintain atomicity of joins: a single event, for instance, might force multiple rows to change in the join. To communicate the atomicity of the change to downstream streams, you have to keep the multiple changes together. You get led to the invariant almost immediately: you might as well treat single events as a degenerate, single-event batch/transaction.

As for the SPREAD operation, I think that can be implemented with one addition to our programmable FlexStreams. Inside a FlexStream, you specify one block of code, for each input stream, for processing events. When a transaction hits a FlexStream, the block of code is run for each event, and the resulting output events (if any) are gathered up as another transaction block. That preserves the one-transaction-in, one-transaction-out invariant. If we add another "commit" operation---one that gathers up the events and creates a transaction block, so that multiple transaction blocks could be produced from a single block---you'd have everything you need for the single-stream SPREAD operation. The multi-stream SPREAD of section 5.1.2 could be programmed with the new data structures in the 3.0 release: either with the "eventCache" data structure, or the simpler "dictionary" data structure, depending on the use case.

Of course, the fact that SPREAD is programmable from simpler operations means that there's (potentially) more flexibility in the Aleri Streaming Platform.

I've got an example of the "eventCache" data structure that will appear in the next Aleri CEP Newsletter (hopefully in the next few days!) It's a cool example for maintaining the top of an order book. I'll try to ping you when it's up.

Thanks again,
-Jon Riecke
Lead Platform Architect
Aleri Inc.

Mark Tsimelzon said...

Marc, I've responded to this paper in my Coral8 blog. The crux of this paper, in my opinion, is not so much the new SPREAD operator, but rather it is a good explanation of how different CEP engines may have very different execution models, and what can be done about it, at least in theory.

Jon said...

Hi Marc,

I promised a link to the "eventCache" example. It's at Aleri CEP Newsletter, Issue 4, in the article "Top o' the Book to You." Comments are always welcome.

-Jon Riecke
Lead Platform Architect
Aleri Inc.