Using big data processing tools from the Hadoop or Spark ecosystem in a xap has dual advantages. As usual with Hadoop or Spark, you get the power to work with all of your data. But putting those tools inside an interactive application opens up the Hadoop/Spark engine to other users and their own questions about the data.

This article explains how to approach working with these tools in Exaptive.

Working with Hadoop/Spark at the Application Level

SQL: The most common way applications interact with big data engines is through SQL, often described as the English of data. SQL is understood by many parts of the Hadoop/Spark stack, including Hive, Drill, Spark SQL, Impala, and Phoenix on HBase. Using out-of-the-box Exaptive components to dynamically generate SQL for your database(s) could not be easier. Exaptive includes components to dynamically template query strings, as well as to make an ODBC connection to your Hadoop/Spark system, letting your application request exactly what it needs.

With just a few small components, application inputs like a Select List, Radio Buttons, and a Text Box can be combined into a SQL query string.

Your user sees a simple set of controls, but underneath you dynamically generate a query that can be passed to Hive, Spark SQL, or any other part of your Hadoop system, as in the sketch below.
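For example, a server-side Python component might assemble that query from the input values and run it over ODBC. This is only a sketch: the DSN name, table, and column names are assumptions, and anything spliced into the SQL text should be whitelisted or parameterized.

```python
import pyodbc  # generic ODBC access from Python

# Values arriving from hypothetical UI components.
metric = "revenue"   # from a Select List
region = "EMEA"      # from Radio Buttons
limit = 100          # from a Text Box

# Whitelist anything interpolated into the SQL text so UI input
# cannot change the structure of the query.
if metric not in {"revenue", "units_sold"}:
    raise ValueError("unsupported metric: %s" % metric)

sql = ("SELECT region, SUM({metric}) AS total FROM sales "
       "WHERE region = ? GROUP BY region LIMIT {limit}"
       ).format(metric=metric, limit=int(limit))

# "hive_dsn" is an assumed ODBC data source pointing at Hive, Impala, or Spark SQL.
conn = pyodbc.connect("DSN=hive_dsn", autocommit=True)
rows = conn.cursor().execute(sql, [region]).fetchall()
conn.close()
```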

Building Your Own Hadoop/Spark Components

Each server-side component runs standard Python or R code, so all of the powerful Hadoop and Spark libraries are available. This opens big data engines up to the world of connection libraries for these languages, from low-level, direct MapReduce design to high-level machine learning control such as Spark MLlib.
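As one illustration of the low-level end, a component could submit a MapReduce job through a Python library such as mrjob (one option among several; the input path in the comment is an assumption):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Classic word count expressed as a MapReduce job."""

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts emitted for each word across all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    # Runs locally by default; on a cluster, something like:
    #   python word_count.py -r hadoop hdfs:///data/input.txt
    MRWordCount.run()
```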

Example: working with HBase. HBase is an incredibly powerful columnar storage engine built on the Hadoop file system. The HappyBase library for Python uses the native Thrift API that ships with HBase to drive its query and aggregation features. A custom component built with this library gets full access to the HBase engine, and opens the application it is used in to the full power of that engine. The same approach works for other non-SQL, direct interactions with the Hadoop ecosystem, such as Cassandra, MapReduce, Pig, Sqoop, and Flume.
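A minimal HappyBase sketch might look like the following; the host, table, row keys, and column family names are assumptions, and the HBase Thrift server must be running:

```python
import happybase

# Connect to the HBase Thrift server (hostname is an assumption; default port is 9090).
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("events")

# Write a row: column names are "family:qualifier" byte strings.
table.put(b"user-42|2017-01-01", {b"cf:event_type": b"login", b"cf:duration": b"37"})

# Scan a row-key prefix and read back a selected column.
for key, data in table.scan(row_prefix=b"user-42|", columns=[b"cf:event_type"]):
    print(key, data[b"cf:event_type"])

connection.close()
```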


When it comes to big data processing engines, nothing is hotter than Apache Spark. Spark has native support for both Python and R (as well as SQL, as discussed above), which allows your component to serve the driver role in a SparkR or PySpark context. Currently, neither library is pip/CRAN installable, and we are working to make this a first-class feature of Exaptive components. Please check back soon as we bring this uniquely powerful capability online.
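Once that support is in place, a component acting as the PySpark driver could look roughly like this sketch (the master URL, file path, and column names are assumptions):

```python
from pyspark.sql import SparkSession

# Build a session against an assumed standalone cluster; the component acts as the driver.
spark = (SparkSession.builder
         .appName("xap-spark-component")
         .master("spark://spark-master:7077")
         .getOrCreate())

# Read a dataset from HDFS and compute a simple aggregate on the cluster.
df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
summary = df.groupBy("event_type").count().orderBy("count", ascending=False)

# Collect the small result back to the driver to pass along as the component's output.
results = [row.asDict() for row in summary.collect()]

spark.stop()
```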