W. Roy Schulte 13 November 2016. The opinions in this document are my own, and do not represent the position of my employer or any other company.
Strata + Hadoop World in New York on September 27–29, 2016 was not just a conference about Hadoop, HDFS and batch processing. It covered a broad swath of data and analytics, including aspects of stream analytics (complex-event processing, or CEP), conventional business intelligence (BI), advanced analytics, machine learning (ML), and artificial intelligence (AI). Among the many vendors who participated were data discovery companies (such as Tableau), BI and analytics vendors (such as Hitachi Pentaho and MicroStrategy), and vendors of emerging data science platforms (such as DataRobot) that do not depend on Hadoop or Spark (although they may be used with them).
Open source was a key theme of the conference. However, most of the exhibit floor and many of the presentations came from commercial vendors, such as Collibra, Hewlett Packard Enterprise, IBM, Informatica, Intel, SAS Institute, Syncsort, Teradata, and others. These commercial vendors described their support for open source technologies such as Hadoop, HDFS, Python, R and RStudio, and Spark. They mostly focused on two aspects of open source:
• Making open source technologies easier to use, and
• Integration between open source technologies and traditional products and data (e.g., CSV, relational and even mainframe data in the case of Syncsort).
Vendor examples included:
• IBM uses Spark as the core of its “Analytics Operating System”. It rolled out Watson DataWorks (renamed a month later to Watson Data Platform), an end-to-end “cognitive” (AI-based) data integration tool suite. IBM also says that it is the leading contributor to Spark’s machine learning libraries.
• SAS Institute presented its “4th generation” Viya product, a cloud-friendly, Docker-based platform that supports Python and R in addition to the SAS language (DATA step).
• Teradata has integrated its Aster Analytics with Hadoop (in the past it ran only on the Aster database), and Aster Analytics will support additional DBMSs in the future.
• Informatica offers Data Lake Management and still supports data integration pipelines, but in addition to its traditional, mostly batch extract-transform-load (ETL), it is promoting real-time stream data ingestion via a new stream analytics engine, Informatica Intelligent Streaming (IIS). IIS, which is based on Spark Streaming, appears to supersede Informatica’s older RulePoint stream analytics engine.
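Spark Streaming, which underpins IIS, processes a stream as a sequence of micro-batches rather than one record at a time. A minimal pure-Python sketch of the micro-batch idea (all names here are illustrative, not Informatica's or Spark's API):

```python
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group an unbounded stream of records into fixed-size micro-batches,
    the core idea behind Spark Streaming's processing model."""
    batch: List[dict] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def ingest(batch: List[dict]) -> int:
    """Stand-in for a per-batch transform/load step; returns rows written."""
    return len(batch)

# Simulated click events arriving on a stream
events = ({"user": u, "page": "/home"} for u in range(10))
written = [ingest(b) for b in micro_batches(events, batch_size=4)]
print(written)  # [4, 4, 2]
```

The trade-off a real engine makes is latency versus throughput: larger micro-batches amortize per-batch overhead but delay results.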
An informal poll of the 200-person audience at a session on the first day revealed:
• The majority of attendees at this particular session were using Spark (at least for proofs of concept, sometimes in production). This was the first Strata conference at which Spark users outnumbered Hadoop users in such a session.
• Only one attendee in this session was using DynamoDB and one was using Kudu (however, a few other companies at Strata mentioned Kudu, including GE, which said it is considering Kudu for its Predix IoT platform). Kudu is faster than HDFS with Parquet (which doesn’t say much, because HDFS is slow), but it is slower than Cassandra, a popular platform that was widely discussed.
• Many attendees were using Kafka; none at this particular session used Kinesis.
• Amazon was by far the most widely used cloud provider at this session, and a few people were on the Microsoft and Google clouds. None in this particular audience were using IBM or Oracle cloud infrastructure, and none indicated that they were using SAP’s cloud service. Some new vendors (e.g., Anodot and iguaz.io) said later in the conference that their products initially run only on Amazon, but they will port to other cloud platforms or run on-premises in the future.
• Many people (25% or more?) use Solr, and some use Elasticsearch.
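The heavy Kafka usage reported in the poll reflects the appeal of its core abstraction: an append-only log that many consumers read independently, each tracking its own offset. A toy in-memory sketch of that model (this is not the real Kafka client API, just an illustration of the idea):

```python
class Log:
    """Append-only record log -- the core abstraction behind a Kafka
    topic partition (toy in-memory version)."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record

class Consumer:
    """Each consumer tracks its own offset, so readers never interfere
    with each other or with the producer."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.records[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

topic = Log()
for msg in ("a", "b", "c"):
    topic.append(msg)

fast, slow = Consumer(topic), Consumer(topic)
print(fast.poll())               # ['a', 'b', 'c']
topic.append("d")
print(fast.poll())               # ['d'] -- resumes at its own offset
print(slow.poll(max_records=2))  # ['a', 'b'] -- independent position
```

Because records are never mutated and offsets belong to consumers, the same stream can feed batch jobs, stream analytics, and slow downstream systems at the same time.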
SQL seems to have made a comeback in popularity at Strata. No one had ever stopped using it, of course, but now many presenters explicitly recommended SQL for handling structured data. CSV remains the most common format for data exchange.
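The recommendation is easy to demonstrate: no big data stack is needed to apply SQL to structured data arriving as CSV. A minimal sketch using only Python's standard library (the table and column names are invented for illustration):

```python
import csv
import io
import sqlite3

# Structured data arriving as CSV, still the most common exchange format
raw = "region,revenue\neast,100\nwest,250\neast,50\n"
rows = list(csv.DictReader(io.StringIO(raw)))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, revenue INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(r["region"], int(r["revenue"])) for r in rows])

# SQL expresses the structured aggregation directly
totals = con.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('east', 150), ('west', 250)]
```

The same `GROUP BY` query would run essentially unchanged on Hive, Spark SQL, or BigQuery, which is exactly why presenters kept pointing back to SQL.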
Internet of Things (IoT)
There was considerable discussion of the IoT. A few examples (there were many others):
• Cisco, Dell EMC, GE, Intel and other companies described their IoT architectures. Pretty much all of them leverage some open source alongside closed source technologies. Stream analytics, Kafka and various implementations of advanced analytics are at the core of all of these architectures, and Spark and Hadoop usually play a big role as well.
• Teradata is partnering with Siemens and three other IoT vendors. Teradata edge analytics can be used for real-time scoring in distributed IoT topologies. Teradata acquired Think Big Analytics, which does deep learning projects on TensorFlow.
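Real-time scoring at the edge typically means shipping a centrally trained model's parameters to devices and evaluating them locally, near the sensors, with no round-trip to the cloud. A minimal sketch with an invented linear model (the weights, feature names, and threshold below are illustrative only):

```python
# Illustrative, "pretrained" parameters pushed out to the edge device
WEIGHTS = {"temp": 0.8, "vibration": 1.5}
BIAS = -20.0
THRESHOLD = 0.0

def score(reading: dict) -> float:
    """Linear score computed on the edge device itself."""
    return sum(WEIGHTS[k] * reading[k] for k in WEIGHTS) + BIAS

def needs_maintenance(reading: dict) -> bool:
    """Local decision: only flagged readings need to leave the device."""
    return score(reading) > THRESHOLD

print(needs_maintenance({"temp": 20.0, "vibration": 1.0}))  # False
print(needs_maintenance({"temp": 30.0, "vibration": 5.0}))  # True
```

The design point is bandwidth and latency: raw sensor streams stay local, and only scores or alerts flow upstream to the central platform.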
Ultra-high performance bragging rights
• At least two GPU-based DBMSs (MapD and Kinetica) were demonstrating their products. MapD keeps up to 1–3 TB in GPU RAM; larger datasets of 10–15 TB are handled in multi-CPU RAM. The company announced a benchmark at GTC Europe with IBM SoftLayer, Bitfusion.io and Nvidia in which it reportedly scanned 147 billion rows in one second, in the cloud.
• Google set a new bar for big benchmarks, demonstrating BigQuery scanning 1 trillion rows (1.09 petabytes) in 223 seconds. Where does it all end? Google suggested that companies use BigQuery for their enterprise data warehouses.
Spark was everywhere at the conference, including attention to Spark ML and MLlib. Quite a few companies are also working with Spark Streaming. Spark 2.0 is a major advance over Spark 1.6 in terms of analytics, the merging of DataFrames with Datasets, structured stream processing, and faster performance. However, 2.0 is still early and is reportedly not ready for serious production use. Maybe in a year?
In addition to Spark Streaming, there was considerable discussion of other stream analytics platforms and hybrid batch/streaming platforms including Apache Apex, Apache Beam, Apache Flink, Apache Samza, IBM Streams and other products. Not much mention of Apache Storm which seems to be past its prime.
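What all of these engines share is windowed computation over unbounded streams. A pure-Python sketch of a tumbling (fixed, non-overlapping) event-time window, the simplest case; real engines such as Flink and Beam add watermarks, triggers, and out-of-order handling on top of this idea:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per key within fixed, non-overlapping event-time
    windows of `window_size` seconds. events: (timestamp, key) pairs."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

# (timestamp-in-seconds, page) pairs from a simulated clickstream
clicks = [(1, "/home"), (3, "/buy"), (7, "/home"), (11, "/home"), (14, "/buy")]
windowed = tumbling_window_counts(clicks, window_size=10)
print(windowed)
# {(0, '/home'): 2, (0, '/buy'): 1, (10, '/home'): 1, (10, '/buy'): 1}
```

Grouping by the window's start time is what turns an infinite stream into a series of finite aggregations that can be emitted as each window closes.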
Many young companies used Strata to emerge from stealth mode to public mode, or to go from a quiet public mode to make a major marketing splash. A few examples (there were many more):
• Alation announced a new release of its Alation Data Catalog, 4.0.
• Anodot showed a sophisticated tool for advanced anomaly detection. It uses multiple algorithms to find anomalies in large data sets, such as time series data.
• Bitwise announced a next-generation data integration tool, Hydrograph, that provides ETL-like functionality on Hadoop.
• Compellon’s 20/20 Platform is a new cloud service that uses AI to generate custom predictive models with a minimum of human direction. It provides prescriptive insights (decision management) that recommend which scenarios to pursue, whom to target and what to offer. It is based on Spark and runs on the Amazon and Google clouds.
• Immuta announced a leading-edge virtual data fabric that makes sensitive data available for analytics without introducing data leaks or allowing access to attributes that should be private according to regulations and policies.
• DataRobot showed its predictive modeling tool that is assisted by machine learning. It enables non-expert, “citizen data scientists” to build hundreds of models in one click and explore the top models on a “leaderboard.”
• iguaz.io came out of stealth mode with a new big data platform for IoT. It runs only on Amazon for now. It uses Spark Streaming, DataFrames and RDDs, Kinesis and DynamoDB, although it can also integrate with MQTT and other IoT technologies.
• Zoomdata announced deals with Cloudera, Teradata and Google to support more data sources and platforms for Zoomdata’s real-time visual analytics.
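Anomaly detection of the kind Anodot sells starts from a simple statistical idea: flag points that deviate too far from the series' typical behavior. A minimal z-score sketch of that building block (Anodot's actual algorithms are proprietary and far more sophisticated):

```python
import statistics

def zscore_anomalies(series, threshold=3.0):
    """Return indices whose values deviate from the series mean by more
    than `threshold` sample standard deviations."""
    mean = statistics.mean(series)
    stdev = statistics.stdev(series)
    return [i for i, x in enumerate(series)
            if abs(x - mean) / stdev > threshold]

# A steady metric with one spike at index 6
metric = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 42.0, 10.1, 9.9, 10.0]
spikes = zscore_anomalies(metric, threshold=2.0)
print(spikes)  # [6]
```

Production tools replace the global mean and deviation with rolling or seasonal baselines so that daily and weekly patterns are not themselves flagged as anomalies.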