Saturday, February 07, 2009

Request: Streaming R or R-like Functions in CEP Systems

I have been looking a bit into R lately. I have written here before about being able to transform R into something that can take real-time streams of data, or integrating R (or SPlus) into a CEP system. Here is a small example. Let's say that you want to take a sample of a "window" of streaming data every few seconds and compute some function on that sample.

Let's see how we can do this in R using static data.

In R, we can define a range of stock prices that range from $10.00 to $100.00, in increments of 25 cents.

> StockPrices<-seq(10.0, 100.0, 0.25)
> StockPrices
[1] 10.00 10.25 10.50 10.75 11.00 11.25 11.50 11.75 12.00 12.25 12.50 12.75 13.00 13.25 13.50 13.75 14.00
......
[358] 99.25 99.50 99.75 100.00


We can take a sample of 10 random stock prices like this:

> sample(StockPrices, 10, rep=F)
[1] 20.50 19.00 78.75 44.75 62.75 45.25 66.25 70.00 71.00 91.00

We can compute the average price of a sample of 10 random prices like this:

> mean(sample(StockPrices, 10, rep=F))
[1] 52.775


Using the Coral8 term of "window", let's say that we have a window called StockPrices that contains one minute's worth of prices. Every 5 seconds, we want to take a random sampling of 10 prices in that window and compute the average price of that random sample.

How can we do it in your CEP system? I would welcome examples in Coral8, Apama, Streambase, Aleri, Esper, StreamSQL, and any other CEP system.


©2009 Marc Adler - All Rights Reserved.
All opinions here are personal, and have no relation to my employer.

9 comments:

Hans said...

Do you like the R syntax particularly? I mean, you could write a component to do this in C# pretty easily, right?

Anonymous said...

If you just want to take a random sample of size from a window and compute some statistic (such as the mean) then you could do this with a custom aggregation UDF. I think it would be fairly straightforward in Coral8 using the C++ SDK. Of course it wouldn't look anything like the original R code, but it would be a lot faster than R.

Hans said...

It's a good point that R is not very fast in general, not to mention the cost of pushing the data to and from the runtime.

Anonymous said...

I would also like to know how you can cooperate between Coral8 and R, Matlab and alike. Hans earlier posted in his blog the window technique. If this is easy to implement it would be a good first step. On the other hand I would argue that such window technique does not allow online processing, by definition, because everytime the integrated app would compute the passed window in the whole. If we would pass pointers that specify which data is new the external app would have the chance to compute an online algorithm.

btw, just recently tim bass discussed online algorithms and cep in his cep blog

marc said...
This comment has been removed by the author.
marc said...

Maybe CEP systems like Esper can just expand the language (or pre-defined functions) to include some of the more popular functions of R.

For example, a Coral8 statement could be:

CREATE WINDOW MyWindow
KEEP 5 MINUTES;

ON EXPIRATION
INSERT INTO MySampleWindow
SELECT SAMPLE(10)
FROM MyWindow


Imagine that MyWindow has a retention policy of 5 minutes. Every 5 minutes, as a bunch of rows expire from MyWindow, you want to take a sample of 10 of these rows and put them into MySampleWindow for further analysis.

(Of course, this is not proper CCL syntax, but you can get my drift.)

Hans said...

About online versus offline processing:

With my suggestion, online processing is a special case of window processing, with a window of size 1. MATLAB could do a bit better because you can compile the code as a library and incorporate that directly into the CEP engine (actually, I like the term EPP and I think I'll start using it).

But I am curious why anyone would want to do this. I am sure that there are valid use cases. I wonder why use an EPP if you are doing an online algorithm in R or MATLAB. Why not just process the events in R or MATLAB and save the effort of pushing back and forth? Or similarly, if you want the performance, why not code the algorithm in the EPL? After all, EPLs do incremental algorithms pretty well.

About how vendors could solve the problem:

Whatever the solution, they should allow users to incorporate their own code (C#,C++,Java,compiled to a library) as SQL functions without the incredibly burdensom syntax that we see today. Then we can write our own sample() function and don't have to rely on them to produce every single function that we need.

Damir Sudarevic said...

Not sure if this is exactly what you are looking for, but here is an example of 'randomly' sampling a stream into a window; so sampling happens as data rows enter the window -- as opposed to "out of the window".
Randomly Sampling a Stream
Regards.

Alex said...

@Marc
You can do that today in Esper without extending the EPL. Simply use a named window (similar to Coral8 public window) and use the iterator API or JDBC driver (possibly from a remote process) to iterate over the window and do the sampling (in Java for Esper) at regular intervals. I believe Coral8 can n do pretty much the same.

@Hans - Hans - you see that some solutions like Esper do allow this use case without dependance on vendor' roadmap to extend their EPL.

This said there are also advantages in doing it at the EPL level as Marc seems to ask for - f.e. to ensure simple reuse from many different applications without going thru custom code. Esper' EPL can be extended by the user f.e. with a custom window implementation or custom aggregation - depending on the use case semantics (let 10% of data flowing thru randomly - sliding or tumbling; extract 10% (or fixed number of events) randomly out of a stream etc)
Some examples are documented here
http://esper.codehaus.org/esper-3.0.0/doc/reference/en/html/extension.html