What is QDrill?

Consumable analytics is the attempt to address the shortage of skilled data analysts in many organizations by offering analytic functionality in a form more familiar to in-house experts. Providing consumable analytics for Big Data faces three main challenges. The first is making analytics algorithms run in a distributed fashion so that Big Data can be analyzed in a timely manner. The second is providing an easy interface that lets in-house experts run these algorithms in a distributed fashion while minimizing the learning curve and rewrites of existing code. The third is running the analytics on data of different formats stored in heterogeneous data stores.

QDrill addresses these challenges. It introduces the Analytics Adaptor extension for Apache Drill, a schema-free SQL query engine for non-relational storage. The Analytics Adaptor uses the proposed Distributed Analytics Query Language (DAQL) to invoke data mining algorithms from within Drill's standard SQL query statements. The adaptor allows any sequential single-node data mining library (e.g., WEKA) to be used and makes its algorithms run in a distributed fashion without having to rewrite them.

We evaluate QDrill against Apache Mahout. The evaluation shows that QDrill outperforms Mahout in training Updatable Aggregatable models and in the scoring phase, while delivering almost the same performance for training Updatable Non-Aggregatable models. QDrill is also more scalable and offers an easier interface, no storage overhead, and WEKA's entire algorithm repository, with the ability to extend to algorithms from other data mining libraries.
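To make the term "Updatable Aggregatable" more concrete, here is a minimal sketch in Python. It assumes that "updatable" means a model can absorb records one at a time and "aggregatable" means partial models trained on separate partitions can be merged into a single model. The `CountModel` class and its methods are hypothetical illustrations and are not part of QDrill or WEKA.

```python
from collections import Counter
from typing import Iterable, Tuple

Row = Tuple[str, str]  # (feature value, class label); a toy one-feature dataset


class CountModel:
    """Hypothetical count-based classifier whose state is only co-occurrence counts."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()  # (feature, label) -> occurrences

    def update(self, row: Row) -> None:
        """Updatable: absorb one record at a time."""
        self.counts[row] += 1

    def aggregate(self, other: "CountModel") -> "CountModel":
        """Aggregatable: merging two partial models is a sum of their counts."""
        merged = CountModel()
        merged.counts = self.counts + other.counts
        return merged

    def predict(self, feature: str) -> str:
        """Return the label seen most often with this feature value."""
        candidates = {lab: n for (f, lab), n in self.counts.items() if f == feature}
        if not candidates:
            raise ValueError("unseen feature value")
        # sorted() makes tie-breaking deterministic (alphabetical order)
        return max(sorted(candidates), key=candidates.get)


def train(rows: Iterable[Row]) -> CountModel:
    model = CountModel()
    for row in rows:
        model.update(row)
    return model


data = [("sunny", "no"), ("sunny", "no"), ("rainy", "yes"),
        ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no")]

# Train one partial model per partition, then merge them.
partial_models = [train(data[:3]), train(data[3:])]
merged = partial_models[0].aggregate(partial_models[1])

# Reference: one model trained sequentially on all the data.
single = train(data)
```

In this toy case the merged model is identical to the sequentially trained one, which is the property that lets training be split across nodes without changing the result. Models without such a merge operation would need a different combination strategy.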
QDrill Architecture

Analytics Adaptor

Supported Algorithms

Distributed Analytics Query Language (DAQL)
Train a WEKA model in a distributed fashion using our TRAIN MODEL and qdm_ensemble_weka additions to Apache Drill:

SQL> USE dfs.tmp;
SQL> ALTER SESSION SET `store.format`='model';
SQL> TRAIN MODEL [model name] AS
SELECT qdm_ensemble_weka(mymodel) FROM (
   SELECT qdm_ensemble_weka('[algorithm]', '[args]', mydata.columns) AS mymodel FROM (
     SELECT org_data.columns, qdm_ladp(,org_data.columns) AS sample FROM `[Data Source]` AS org_data
   ) AS mydata GROUP BY sample );
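One way to read the nested statement above: the innermost query tags each record with a sample identifier, the middle query trains one model per sample group, and the outermost query combines those models into an ensemble. That reading is an interpretation, and the Python sketch below only illustrates the partition, train, and combine pattern. The round-robin assignment, the nearest-class-mean learner, and all function names are assumptions made for illustration and are not QDrill's implementation.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Record = Tuple[float, str]       # (single numeric feature, class label)
Model = Callable[[float], str]   # a trained model maps a feature to a label


def assign_sample(index: int, n_samples: int) -> int:
    """Stand-in for the partitioning step: round-robin assignment (an assumption)."""
    return index % n_samples


def train_centroid_model(records: Sequence[Record]) -> Model:
    """Toy sequential learner: predict the class whose mean is nearest."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Counter = Counter()
    for x, label in records:
        sums[label] += x
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in counts}
    return lambda x: min(sorted(means), key=lambda label: abs(means[label] - x))


def train_ensemble(records: Sequence[Record], n_samples: int) -> List[Model]:
    groups: Dict[int, List[Record]] = defaultdict(list)
    for i, record in enumerate(records):
        groups[assign_sample(i, n_samples)].append(record)  # tag each record
    # One model per group, mirroring "GROUP BY sample".
    return [train_centroid_model(group) for _, group in sorted(groups.items())]


def predict(ensemble: Sequence[Model], x: float) -> str:
    """Combine the per-sample models by majority vote."""
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]


records = [(1.0, "low"), (9.0, "high"), (2.0, "low"), (8.0, "high"),
           (1.5, "low"), (9.5, "high"), (2.5, "low"), (8.5, "high")]
ensemble = train_ensemble(records, n_samples=3)
```

Because each group is trained independently, the per-group training calls could run on different nodes. Only the final voting step needs all of the models.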

Score testing data in a distributed fashion with the trained WEKA model using our APPLYING and qdm_score_weka additions to Apache Drill:

SQL> USE dfs.tmp;
SQL> ALTER SESSION SET `store.format`='json';
SQL> CREATE TABLE results AS
SELECT data.columns, qdm_score_weka('[args]', mymodel.columns[0], data.columns)
FROM `[Data Source]` AS data APPLYING [model name] AS mymodel;
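Scoring lends itself to distribution because each record can be scored independently once the model is available. The Python sketch below illustrates that idea under stated assumptions: a thread pool stands in for cluster nodes, `toy_model` stands in for a trained model, and each output row pairs the input with its predicted label, loosely mirroring the SELECT list above. None of these names come from QDrill.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], str]
Scored = Tuple[float, str]  # (input record, predicted label)


def score_chunk(model: Model, chunk: Sequence[float]) -> List[Scored]:
    """Score one partition of the test data with a shared model."""
    return [(x, model(x)) for x in chunk]


def score_in_parallel(model: Model, records: Sequence[float],
                      n_workers: int) -> List[Scored]:
    """Split records into contiguous chunks, score each chunk, keep input order."""
    chunk_size = max(1, -(-len(records) // n_workers))  # ceiling division
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda chunk: score_chunk(model, chunk), chunks)
        return [row for part in parts for row in part]


def toy_model(x: float) -> str:
    """Hypothetical trained model: a fixed threshold classifier."""
    return "high" if x >= 5.0 else "low"


test_data = [1.0, 7.5, 4.9, 5.0, 9.2]
results = score_in_parallel(toy_model, test_data, n_workers=2)
```

No communication between workers is needed during scoring, so the work divides cleanly across however many workers hold the data.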

Team Members
Shadi Khalifa
PhD Candidate at Queen's University

khalifa@cs.queensu.ca
http://cs.queensu.ca/~khalifa/

Patrick Martin
Professor at Queen's University

martin@cs.queensu.ca
http://research.cs.queensu.ca/home/martin/

Analytics Adaptor
Drill is powerful at accessing and joining data from heterogeneous sources, which is usually a cumbersome task in data mining libraries. On the other hand, Drill has no data mining capabilities. Developing data mining algorithms for Drill is time-consuming, so the effort would likely be limited to a handful of algorithms, far fewer than those available in well-established data mining libraries. QDrill, with its Analytics Adaptor, solves these issues by using Drill to load and join data from heterogeneous sources and using the existing algorithms of well-established data mining libraries to train and score data mining models.

The proposed Analytics Adaptor optimizes and provides access to various data mining libraries. The Analytics Adaptor works with Analytics Plugins that transform the data loaded by Drill to a data structure understandable by the data mining libraries. This way, algorithms from more than one library can be used together, leaving it to the Analytics Adaptor to resolve the inter-library data format conversion. In addition, the plugins invoke the APIs of the data mining library to train and score data mining models. All these details are hidden from users.
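The division of labor between the adaptor and its plugins can be sketched as a small interface. The Python sketch below assumes that each plugin offers three operations (convert rows, train, score) and that the adaptor only dispatches to the plugin registered for a library. The class names, method names, and the toy majority-class "library" are all hypothetical and do not reflect QDrill's actual classes.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]  # a record as loaded by the query engine


class AnalyticsPlugin(ABC):
    """Hypothetical plugin contract: one plugin per data mining library."""

    @abstractmethod
    def convert(self, rows: Sequence[Row]) -> Any:
        """Turn loaded rows into the library's own data structure."""

    @abstractmethod
    def train(self, algorithm: str, args: str, dataset: Any) -> Any:
        """Invoke the library's training API and return a model."""

    @abstractmethod
    def score(self, model: Any, dataset: Any) -> List[Any]:
        """Invoke the library's scoring API and return predictions."""


class ToyMajorityPlugin(AnalyticsPlugin):
    """Stand-in 'library' that always predicts the majority training label."""

    def convert(self, rows: Sequence[Row]) -> List[Any]:
        return [row.get("label") for row in rows]

    def train(self, algorithm: str, args: str, dataset: List[Any]) -> Any:
        labels = [label for label in dataset if label is not None]
        return Counter(labels).most_common(1)[0][0]

    def score(self, model: Any, dataset: List[Any]) -> List[Any]:
        return [model for _ in dataset]


class AnalyticsAdaptor:
    """Dispatches to the registered plugin so callers never see library details."""

    def __init__(self) -> None:
        self._plugins: Dict[str, AnalyticsPlugin] = {}

    def register(self, library: str, plugin: AnalyticsPlugin) -> None:
        self._plugins[library] = plugin

    def train(self, library: str, algorithm: str, args: str,
              rows: Sequence[Row]) -> Any:
        plugin = self._plugins[library]
        return plugin.train(algorithm, args, plugin.convert(rows))

    def score(self, library: str, model: Any, rows: Sequence[Row]) -> List[Any]:
        plugin = self._plugins[library]
        return plugin.score(model, plugin.convert(rows))


adaptor = AnalyticsAdaptor()
adaptor.register("toy", ToyMajorityPlugin())
training_rows = [{"x": 1, "label": "yes"}, {"x": 2, "label": "no"},
                 {"x": 3, "label": "yes"}]
model = adaptor.train("toy", "majority", "", training_rows)
predictions = adaptor.score("toy", model, [{"x": 4}, {"x": 5}])
```

With this shape, supporting another library means registering one more plugin. The calling code and the query syntax stay the same.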
Supported Algorithms



As a prototype, we extended Apache Drill 1.2 as outlined previously and created a plugin for the weka-dev-3.7.13 data mining library in the Analytics Adaptor to give DAQL access to the WEKA data mining algorithms. The plugin also converts the data loaded by Drill to the ARFF format accepted by WEKA. In addition, we created a Model Storage Plugin that can store and load WEKA models from any data store supported by Drill.
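For readers unfamiliar with ARFF, the Python sketch below shows what converting tabular rows to ARFF text can look like. It is a simplified illustration, not the plugin's code: it infers only numeric and nominal attribute types, does not quote names or values containing spaces, and ignores missing values. The function name `to_arff` and the sample data are assumptions.

```python
from typing import List, Sequence, Union

Value = Union[int, float, str]


def to_arff(relation: str, columns: Sequence[str],
            rows: Sequence[Sequence[Value]]) -> str:
    """Render rows as ARFF text: a header of attributes, then a @data section."""
    lines: List[str] = [f"@relation {relation}", ""]
    for idx, name in enumerate(columns):
        values = [row[idx] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the distinct values in a stable order.
            nominal = ",".join(sorted({str(v) for v in values}))
            lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


columns = ["sepal_length", "sepal_width", "species"]
rows = [(5.1, 3.5, "setosa"), (7.0, 3.2, "versicolor")]
arff = to_arff("iris", columns, rows)
```

The header declares each column's type once, and the data section then carries plain comma-separated rows. This is why a conversion step is needed when the source data is schema-free.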

QDrill now supports 100% of WEKA's classification algorithms without any code rewrites, unlike Mahout (2 algorithms) and MLlib (5 algorithms), which had to rewrite the algorithms' code to run in a distributed fashion.

QDrill distributes the execution of all of WEKA's classification algorithms, making them run 70% to 95% faster and yield up to 18% better prediction accuracy.