Thursday, September 20, 2012

Using Hive Scripts to Analyze and Summarize BAM data


Having got as far as publishing usage data, I now need to analyze and summarize it. This can be done simply by writing a Hive script and scheduling it within BAM. In the main menu of BAM you will find a Manage menu, and under it an Analytics menu item. Analytics has two sub menus: one to list existing scripts and one to add new scripts.

Now go to the 'Add' sub menu (Main>Manage>Analytics>Add). Here you can write your script and schedule it.

Below is a simple script written by Shariq Muhammed, SE @ WSO2. I used it to summarize data in one of the tables created while pumping data into BAM. I have removed some parts of it, as they won't be relevant to you.


CREATE EXTERNAL TABLE IF NOT EXISTS UsageStatsTable (id STRING,
        payload_ServerName STRING,
        payload_TenantID STRING,
        payload_Data STRING,
        payload_Value BIGINT,
        timestamp BIGINT) 
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH SERDEPROPERTIES ( 
        -- sort Properties
);

CREATE EXTERNAL TABLE IF NOT EXISTS UsageStatsHourFact (id String, 
        hour_fact STRING,
        payload_ServerName STRING, 
        payload_TenantID STRING,        
        payload_Data STRING,
        payload_Value BIGINT) 
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH SERDEPROPERTIES ( 
        -- sort properties
);

select #some columns from my table#

insert into table #some other table#
As shown above, you can select and group the pumped data and insert the summary data into a new table. If you don't know Hive syntax, it is similar to SQL, and there is a great tutorial at the following link: https://cwiki.apache.org/Hive/tutorial.html.
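
To make the elided select and insert steps a little more concrete, here is a rough sketch of an hourly roll-up between the two tables defined above. It is not the original script. The assumption that timestamp holds milliseconds since the epoch, the hour_fact format, and the way the id row key is built are all guesses for illustration, so adjust them to your own data. If your Hive version treats timestamp as a reserved word, you may need to quote it with backticks.

```sql
-- Sketch only: one summary row per server, tenant, data key and hour.
-- Assumes timestamp is in milliseconds since the epoch (hence / 1000).
-- The id is a hypothetical row key: the grouping columns joined with '_'.
INSERT INTO TABLE UsageStatsHourFact
SELECT
        concat(payload_ServerName, '_', payload_TenantID, '_',
               payload_Data, '_', hour_fact) AS id,
        hour_fact,
        payload_ServerName,
        payload_TenantID,
        payload_Data,
        sum(payload_Value) AS payload_Value
FROM (
        SELECT payload_ServerName,
               payload_TenantID,
               payload_Data,
               payload_Value,
               from_unixtime(cast(timestamp / 1000 AS BIGINT),
                             'yyyy-MM-dd HH:00:00') AS hour_fact
        FROM UsageStatsTable
) hourly
GROUP BY hour_fact, payload_ServerName, payload_TenantID, payload_Data;
```

The inner query truncates each event time to the start of its hour, and the outer query sums payload_Value for every combination of hour, server, tenant and data key.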





