CEP and MapReduce: Connected in complex ways

by Andrew Brust, ZDNet.com

In a recent post, I discussed MPP (Massively Parallel Processing) data warehouse technology and how it too is a Big Data technology. I also explained some things MPP has in common, architecturally, with Hadoop’s MapReduce parallel computation approach to handling large data sets.

There’s another three-letter acronym that is at once its own category and yet also a legitimate Big Data player: CEP (Complex Event Processing). CEP engines process continuous data flows (such as feeds, sensor readings and other event-driven data that is large in volume and very fast-flowing) in real time. CEP engines are available from so-called mega-vendors like Oracle, Sybase/SAP and Microsoft, integration players like TIBCO, and pure-play companies like StreamBase and OneMarketData.
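To make the idea of processing a continuous flow concrete, here is a minimal sketch in Python of the kind of query a CEP engine evaluates: a sliding time window over sensor readings that raises an alert when the windowed average crosses a threshold. It does not use any vendor's API; the function name, the 10-second window and the threshold of 30 are all assumptions chosen for illustration.

```python
from collections import deque

def sliding_window_alerts(events, window_seconds=10.0, threshold=30.0):
    """Yield (timestamp, average) whenever the average reading over the
    last `window_seconds` exceeds `threshold` (both values are illustrative)."""
    window = deque()  # (timestamp, value) pairs currently inside the window
    total = 0.0
    for ts, value in events:  # events arrive one at a time, in time order
        window.append((ts, value))
        total += value
        # Evict readings that have fallen out of the time window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        average = total / len(window)
        if average > threshold:
            yield ts, average

# Hypothetical sensor feed: (timestamp in seconds, reading).
readings = [(0, 20.0), (5, 25.0), (9, 50.0), (12, 60.0), (30, 10.0)]
alerts = list(sliding_window_alerts(readings))
```

The point to notice is that the query never sees the whole data set. It keeps only the events inside the current window and produces results as each new event arrives.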

As with other data-related domains, there’s a lot of cross-pollination and conflation happening between CEP and Hadoop/MapReduce. As an extreme example, I’ve seen one blog post that asserted Hadoop is a CEP engine. But there are other, perhaps less controversial, examples as well.

In August of 2011, Twitter acquired a company called BackType, and with it a product called Storm. Storm processes streaming data in parallel, over a cluster of machines, using processing structures it calls topologies. Technology like this works well for a social media data hotbed like Twitter. Just think of the real-time processing necessary to determine trending topics, and you’ll start to get the idea. The keepers of Storm point out that MapReduce processes “jobs” in batch while Storm processes streams in real time. Storm and MapReduce have divide-and-conquer parallelism in common, and Storm even uses the ZooKeeper distributed process coordinator, which began as an Apache sub-project of Hadoop. Twitter made Storm open source in September 2011. So look for mashups of Storm and other data-centric technologies on the horizon.
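The batch-versus-stream distinction can be sketched in a few lines of plain Python. This is not Storm or Hadoop code, and every function name here is hypothetical. The batch version needs the complete set of tweets before it can answer; the streaming version updates its answer after every tweet, which is what a trending-topics feature requires.

```python
from collections import Counter
from itertools import chain

def extract_hashtags(text):
    """Return the lower-cased hashtags found in one tweet."""
    return [word.lower() for word in text.split() if word.startswith("#")]

def batch_hashtag_counts(tweets):
    """Batch style: map every tweet to its hashtags, then reduce once."""
    mapped = (extract_hashtags(t) for t in tweets)   # map step
    return Counter(chain.from_iterable(mapped))      # reduce step

def streaming_top_hashtag(tweet_stream):
    """Streaming style: keep running counts and report the current
    leader after each tweet arrives."""
    counts = Counter()
    for tweet in tweet_stream:
        counts.update(extract_hashtags(tweet))
        if counts:
            yield counts.most_common(1)[0]

# Made-up sample data for illustration.
tweets = ["#storm is fast", "#hadoop batch job", "#Storm topologies", "#storm #hadoop"]
batch_result = batch_hashtag_counts(tweets)
stream_results = list(streaming_top_hashtag(iter(tweets)))
```

Both versions arrive at the same final counts. The difference is when the answer becomes available: once at the end of the job, or continuously as data flows in.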

Another interesting CEP-MapReduce crossover comes from Microsoft. It has devised a way for its StreamInsight CEP engine (which ships as part of the SQL Server relational database) to work in concert with the Microsoft-Hortonworks Hadoop on Windows distribution. Specifically, Microsoft has figured out a way to shim the StreamInsight engine into the reducer step of Hadoop’s MapReduce jobs. This is explained, complete with examples, in a Microsoft blog post.
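To give a rough feel for what running event-processing logic inside a reducer might look like, here is a conceptual sketch in Python. It is emphatically not Microsoft's implementation and uses neither the StreamInsight nor the Hadoop API: the map, shuffle and reduce functions, the sensor records and the 60-second tumbling window are all assumptions. The idea is that the mapper keys each event, the shuffle groups events by key, and the reducer replays each group in time order through a windowed query.

```python
from collections import defaultdict

def map_phase(records):
    """Map: key each (sensor_id, timestamp, value) record by its sensor."""
    for sensor_id, ts, value in records:
        yield sensor_id, (ts, value)

def shuffle(pairs):
    """Shuffle: group mapped values by key, as the framework would."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups

def reduce_with_windows(key, events, window_seconds=60):
    """Reduce: replay one key's events in time order and emit the peak
    value seen in each tumbling window."""
    windows = {}
    for ts, value in sorted(events):
        start = (ts // window_seconds) * window_seconds
        windows[start] = max(value, windows.get(start, value))
    return [(key, start, peak) for start, peak in sorted(windows.items())]

# Made-up records: (sensor_id, timestamp in seconds, reading).
records = [("a", 5, 1.0), ("b", 10, 7.0), ("a", 50, 3.0), ("a", 65, 2.0)]
results = []
for key, events in sorted(shuffle(map_phase(records)).items()):
    results.extend(reduce_with_windows(key, events))
```

In this arrangement the temporal query runs over stored, historical events rather than a live feed, while the MapReduce framework takes care of spreading the keys across machines.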

These are just a couple of examples of how CEP and Big Data technologies can be combined in ways that are logical and beneficial. Lots of other data technologies, like data visualization, predictive analytics and more, fit in with Big Data, and stretch its definition, in a similar way. We’ll certainly look at more of them in future posts.