by Jack Vaughan, TechTarget
Interview with Tony Baer of Ovum
You write that machine learning will be the biggest disruptor for big data analytics in 2017. Still, one wonders if machine learning projects will be limited to the top echelon of companies, or if use will be much broader than that.
Tony Baer: It’s broad in that, in many cases, businesses and consumers are already using services that have machine learning embedded in them — they just don’t realize it. But in terms of how many companies have data scientists on board, ones that are writing or using machine learning algorithms and doing their own internal development, that will still be limited. That’s even though there are libraries available for machine learning, so you no longer have to write the algorithms from scratch.
There are also emerging collaboration tools that are designed to connect the data scientist to the data engineer or the business. You’re seeing an upswing of tooling, but largely, the appeal of that is going to be limited to those organizations that have very deep resources — the same types of organizations, really, who were the pioneers with Hadoop.
Tooling is one thing. But it sometimes seems people don’t realize machine learning projects require a learning phase that can be time-consuming and full of trial and error.
Baer: That’s right. There’s also an interesting thing going on. A few years ago, data science was the hot thing. Everyone wanted to be called a data scientist and wanted that on their business card. Now, the shiny new thing is machine learning, and so all these would-be data scientists want to jump on.
What they may be forgetting is step one: You really have to learn the data science. It’s not synonymous with machine learning. It’s synonymous with science, in that you are constantly testing hypotheses. It’s the blocking and tackling of the scientific method. It requires a lot of patience and perseverance.
The spectrum in machine learning goes all the way from anomaly detection and clustering on one end to deep learning and cognitive [computing] at the very deep end of the pool. But you need to get data science mastered before you can go on to using machine learning, which includes advanced pattern recognition and many different approaches along that spectrum.
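The shallow end of that spectrum can be sketched in a few lines. The example below is a hypothetical illustration, not anything from the interview: it flags anomalies with a simple z-score test, the kind of statistical "blocking and tackling" that precedes any deep learning work. The function name, sample readings and threshold are all my own choices.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean.

    A deliberately loose threshold is used here because the sample is tiny
    and a large outlier inflates the standard deviation.
    """
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0, 10.1]
print(zscore_anomalies(readings))  # → [42.0]
```

Even this toy version shows why the science comes first: the outlier drags the mean and standard deviation toward itself, so a naive three-sigma cutoff would miss it, and choosing a sensible threshold is a hypothesis to be tested, not a default to be trusted.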
For machine learning, in the short term, the widest impact is going to be through capabilities that are packaged into analytics or applications, such as supply chain optimization, or the smart electric grid, or threat and fraud detection. It will be embedded in these applications. The headlines will talk about individual companies with courageous data scientists that are writing brilliant models. But, when it comes to broad impact, it is going to be via capabilities that are packaged under the hood.
You mentioned machine learning adopters being similar to Hadoop adopters. That technology has taken a while to germinate. Now, it seems bound for the cloud. At what pace do you think Hadoop can move to the cloud?
Baer: What I would call Hadoop is a multicomponent operating system. It’s very much about mix and match, which made it hard to explain, and probably confused the market quite a bit. Now, in the cloud, it’s even harder to explain because, when you go into the Amazon cloud, you may not be using [the Hadoop Distributed File System] — you’re probably using S3 (i.e., Amazon Simple Storage Service).
Hadoop wasn’t born to be on the cloud, but that is going to be the key adoption trend. From my conversations with vendors, it seems that a year ago, 15% to 20% of new workloads were going to the cloud. Now, it’s one-third. And I’m basically expecting that we’re going to hit the 50% mark for new workloads in 12 to 18 months.
It’s fair to say data streaming bears a resemblance to complex event processing (CEP), in which the emphasis was somewhat on the “complex.” We’re dealing with different events these days, mostly things like cell phone activity and clickstreams. But are things really different this time?
Baer: Complex event processing was a solution looking for a problem — well, except in some specialized cases, like financial services, where bleeding edge is part of what they do, part of how they compete. But now, we have the perfect storm.
That’s because infrastructure has become more accessible and inexpensive, especially with the cloud. And with CEP, when you worked on a small number of events, that wasn’t too intriguing. But when you can scale out with infrastructure like we now have, it might become a viable idea. IoT alone is really moving this.
There are IoT use cases with real value, and IoT is increasing the urgency for real-time streaming analytics. Examples include anything that involves the physical movement of things, whether it be supply chains, network optimization or smart cities and the like. Or, for example, any working assets out in the field; that is asset management and fleet management. Events such as clickstreams are drivers, too. All of these use cases are tangible and have clear business value.
We have more smart devices out there that are generating real information. That’s ultimately what’s driving streaming analytics. It’s a mix of open source and proprietary technologies. On the other hand, with CEP, the processing was expensive. The few tools that were out there were proprietary and required very specialized skills. With open source, the barriers to learning and experimenting come down. It’s kind of a perfect storm in that all those things are happening.
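The core pattern behind much of the streaming analytics Baer describes is simpler than the CEP tooling of old: keep a sliding window over an event stream and react when the window crosses a threshold. The sketch below is a toy illustration under my own assumptions (class name, window size and threshold are all invented for the example), not a real streaming product's API.

```python
from collections import deque

class SlidingWindowRate:
    """Track events in a rolling time window and flag bursts.

    A minimal stand-in for the windowed aggregations that streaming
    engines perform at scale; timestamps are plain seconds here.
    """

    def __init__(self, window_seconds, burst_threshold):
        self.window = window_seconds
        self.threshold = burst_threshold
        self.events = deque()  # timestamps, oldest first

    def observe(self, timestamp):
        """Record one event; return True if the window now exceeds the threshold."""
        self.events.append(timestamp)
        # Evict events that have fallen out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold

# Hypothetical usage: four events in quick succession trip the alert;
# by t=20 the old events have aged out of the 10-second window.
monitor = SlidingWindowRate(window_seconds=10, burst_threshold=3)
print([monitor.observe(t) for t in [1, 2, 3, 4, 20]])
# → [False, False, False, True, False]
```

What made this uneconomical in the CEP era was not the logic but the scale: keeping such windows over millions of cell phone or clickstream events per second required expensive proprietary infrastructure, which cheap scale-out clusters and the cloud have since commoditized.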
DCL: Mr. Baer is incorrect in saying CEP was “a solution looking for a problem”. Even twenty years ago there were many examples of event processing systems that needed CEP in order to improve their effective use by humans.

One such was chip fabrication. The use of event-driven simulation in early phases of chip design was paramount in detecting bugs before manufacture. But the analysis tools of the day could not deal with the large numbers of events produced by these simulations. So when this step failed Intel in one famous case and a bug appeared only after a chip had been manufactured and deployed in products, it cost the company a lot of money. Later review of the simulation libraries showed the bug had been caught by the simulations, but missed by the analyzers. CEP was inspired in part by the need for better analysis tools for simulator output in this particular area. (See the Proceedings of the 29th IEEE Design Automation Conference, pp. 414-419, 1992, for a presentation of this application of CEP.)

Stock market trading was another area of CEP applications at this time. So were the control systems for automated sewage plants, believe it or not! And real-time fraud detection has always been at the forefront of using “simple CEP” techniques.

True enough, as Mr. Baer says, today there are many new challenges for CEP applications in our evolving event-driven mobile world. CEP is essential in building smart grids and defenses against cyber attacks.