In a previous blog, we showed how ultra-fast visualization of big data is achieved with in-memory, in-datasource, and on-demand data access and aggregation using out-of-the-box Spotfire data connectors. In this blog, we provide more details on how this can be achieved with the TIBCO Spotfire Connector for Apache Spark SQL.
Your company stores business data on a Hadoop cluster file system, and data scientists in your organization use this data to perform transformations and fit models with the Apache Spark platform. As a business analyst, you can also use Spark to access this data. In Spotfire, it is as easy as connecting to any other data source. Let’s see how.
Apache Spark™ is a fast and general engine for large-scale data processing. Spark brings high performance to a variety of operations on data by leveraging in-memory processing. With this and other optimizations, Spark can be 100x faster than Hadoop for large scale data processing. We’ll discuss a common configuration, in which Spark is running on a Hadoop cluster along with Hive, and show how to access data on the Hadoop cluster using the TIBCO Spotfire Connector for Apache Spark SQL.
We are working with data from a retailer. Each record in our tables represents a shipment of a single type of product. Along with customer information, the 12 columns of shipment information include the number and value of cases shipped, and the number and value of shipped cases that went missing, if any. The row counts of our tables range from 100,000 to 10,000,000. We would like to identify the characteristics of shipments that are most likely to go missing
Open Spotfire and click Apache Spark SQL on the Add Data page.
In the Apache Spark SQL Connection dialog, enter the server address and user credentials. The Authentication method may be one of the following: No Authentication, Kerberos, Username, Username and password, or Username and password (SSL). Our authentication method is Username only, and our username is hive.
Click Connect, and a list of databases will be shown in the drop-down list. Our data is stored in the “default” database. We choose “default” from the list and click OK.
The Views in Connection dialog lists available tables, and allows you to push one or more of them into the middle column. A view in the middle column can be based on a custom query. You can also create relations between views, to execute joins.
Select one of the views, and its columns are shown in the right-hand column, as above. Here you can designate Primary Keys and Prompts. If you have used other Data Connectors in Spotfire, this may be familiar. We choose the rawcfr5c view, which will return the 12 columns from the rawcfr5c table.
In Spotfire, data can be accessed either by loading the individual records into memory and operating on them locally (“in-memory” option), or by keeping the row-level information in the data source and dynamically issuing aggregation queries to render Bar Charts, Cross Tables, and other visualizations (“in-database” option). You can restrict the view to allow loading in only one of these ways; we keep the default, Allow All Methods, and click OK.
The Add Data Tables is the last stop before viewing the data in Spotfire. Here we will specify the load method “Import data table” for the rawcfr5c view, which will load row-level information into memory for use in Spotfire. Note that the rawcfr5 and rawcfr5c views differ only in that rawcfr5c is in-memory, while rawcfr5 is in-database. Whether you load the row-level information into memory, or keep the data table external, you always have the option to trigger a refresh using specified criteria by checking the “Load on demand” checkbox. Trigger criteria may be based on filtering or marking in a related table, or by changing a property value with a user control. We click OK.
In Spotfire, the two loaded tables have 100,000 records. We use a Treemap to investigate missing shipments within nested categories of channel, country, and sector. Below we see that missing cases make up a large fraction of the value of online shipments, and that resellers of yard and garden products in the UK should be analyzed further. Drilling down into UK Yard and Garden Resellers shows which sites are having the worst problems with missing cases.
Each time we refresh our Data Table in Spotfire, the Spark SQL Connector launches a Spark job. Part of that Spark job loads the underlying Hive table into the distributed memory that Spark manages. By pre-caching the underlying Hive table, the Spark SQL Connector can query it directly from Spark distributed memory, skipping the load step. Your Hadoop administrator can perform this caching with a single command in the beeline interface, as below.
[hadoop@ip-172-31-29-255 ~]$ /usr/lib/spark/bin/beeline
beeline> !connect jdbc:hive2://172.31.29.255:20013
… <authenticate> …
0: jdbc:hive2://126.96.36.199:20013> cache table rawcfr5c;
Caching will be particularly helpful with in-database connections because Spotfire generates frequent refreshes for these, as noted earlier. (In our examples, the Hive tables rawcfr5c, rawcfr6c, rawcfr7c are cached.) In tests with the shipment data, we have seen up to 5 times faster refresh performance between cached and uncached versions of the tables.
Business analysts can benefit from the speed of Spark SQL by accessing data with the TIBCO Spotfire Connector for Apache Spark SQL. No familiarity with Spark or SQL is needed. As with other Data Connectors in Spotfire, relations, custom queries, primary keys, prompting, and load methods are all available. Finally, by caching tables in Spark distributed memory beforehand, the speed of refreshing data in Spotfire is greatly improved. This is especially important for “in-database” Spotfire Data Tables.