Friday, November 30, 2012

Hive Summarization script for billing needs

This contains links for the posts related to the Hive Summarization script for billing needs 
Hive & Me Part 1
Hive & Me Part 2

Database Space usage monitoring for storage server


For our new product “Storage Server” we needed monitoring functionality. It was possible to change my main project to cater to that need. In my main project I collect usage data for all the tenants and publish the data of the tenants that exceeded their usage to the BAM server. I just needed to publish all the data to BAM so it could work as a monitoring feature for Storage Server.
I removed some parts from my main project to suit the current need. I didn't need the “tenantBillingService”, which had held me back in the main project. And I no longer need to change “stratos common” (which is used to get the package details needed in the calculations), as we are now publishing all the details that we collect. The component-level architecture of the project will be:




WSO2 Test Automation Hackathon - Summary

This contains links for the posts related to the WSO2 Test Automation Hackathon
Beginning the Test Automation
Clarity Framework
Things that you should remember when test automation
Ending The Automation Hackathon

Measuring/Billing database usage in StratosLive - Summary

This article collects all the posts under Measuring/Billing database usage in StratosLive.

My Job
WSO2 Data Services Server User Guide
Need to find I/O rates, bandwidth used by each Database user
Limiting The Resource Use
I continued
Suggestions and replies
Collecting and summarizing the captured data
Followed the BAM samples.
Do you need data to play with?
Prototype version 1
Prototype version 1 has to be verified.
1st Verification
OSGi Services
Publishing to BAM
Using OSGi console to debug things

[Break for Test Automation]

Back to the Frozen project
WSO2 Storage Server
The Inevitable Change
Strange things do happen
Using Hive Scripts to Analyze and Summarize BAM data
Difference between two time ignoring the date
Replacing for ntask(quartz-scheduler), using timer task
It is almost 'THE END'

Wednesday, November 28, 2012

BB @ WSO2

We had a great basketball tournament lately, in which our house came first. The schedule below gives some idea of the games we played and the four houses we have. BTW, I am a Wild Boar.


Match - Who won
match 1 - Cloud Bots
match 2 - Wild Boars
match 3 - Cloud Bots
match 4 - Wild Boars
match 5 - Titans
match 6 - Cloud Bots

With the above, match 5 and match 6 were repeated as the 3rd-place match and the final. Cloud Bots were fully confident of their win in match 5.

Unusual things happened in the final matches. Heroes changed, Legions became 3rd, and we (Wild Boars) became the champions. I wasn't there for the finals (bad luck), but I was there for all the other games and even the practice matches. It was such a wonderful experience.

Below are some fine clicks by Harindu

CEO @ play

I am in blue

Wild Boars Captain (in blue) 


GReg Basic Filter Improvement - Continues

If you haven't read the first parts of the project, the following links will help you.

GReg Basic Filter Improvement - Starting
GReg Basic Filter Improvement - Feedbacks

I completed the project for GReg generic artifacts (go to the bottom of the page if you don't know what generic artifacts are). Now I have to do the same thing with the inbuilt basic artifacts like Service, WSDL, Policy and Schema. In the future all the artifact types will be defined with RXTs, and then my current implementation will work for all of them. Till then I have to make the same changes to the other inbuilt types.
So I decided to add the filter by LC and name to all inbuilt artifacts. But since services are the most important ones, we decided to give them the full basic filter, as in generic artifacts. Following are some screenshots from the current state of the project.

Life cycle Filter (Use of negative filtering 'NOT')

Life cycle Filter

Other 'filter by' criteria will be followed by a text-box

GReg generic artifacts - When you deploy a GReg instance you get some inbuilt artifact types. But if you need your own metadata type, you can define it yourself. WSO2 Governance Registry provides the flexibility to configure and extend its functionalities and capabilities. One of its configurable capabilities is its metadata model, which can be extended so that anyone can use it to store any custom type of data alongside the types that are already there, such as Services and WSDLs. To do this, you only need to add an XML file (a registry extension, or .rxt, file), which defines the new metadata model artifact, as a resource to the registry.
This section contains detailed information on how to create the registry extension file up to the content element, how to create a content element which describes the data model of the artifact, how to deploy your file in the WSO2 Governance Registry, and how to add a menu item by adding a menu element to the registry extension file. More information: http://docs.wso2.org/wiki/display/Governance411/Configurable+Governance+Artifacts
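To give a feel for what such a file looks like, here is a rough, hypothetical sketch of a minimal .rxt; the exact attributes and elements vary by GReg version, so treat this as illustrative only and refer to the documentation linked above for the real format.

```xml
<!-- Hypothetical minimal .rxt sketch. Attribute and element names may vary
     by GReg version; see the documentation linked above for the real format. -->
<artifactType type="application/vnd.wso2-example+xml" shortName="example"
              singularLabel="Example" pluralLabel="Examples" hasNamespace="false">
    <!-- Where instances of this artifact type are stored in the registry -->
    <storagePath>/examples/@{name}</storagePath>
    <nameAttribute>overview_name</nameAttribute>
    <!-- The data model: one "Overview" table with two text fields -->
    <content>
        <table name="Overview">
            <field type="text" required="true">
                <name>Name</name>
            </field>
            <field type="text">
                <name>Version</name>
            </field>
        </table>
    </content>
</artifactType>
```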

GReg Basic Filter Improvement - Feedbacks

As I said before, projects change with the feedback they get. So I would like to share some comments I got from the other members.

I got the below comment  from Senaka (GReg - Team Lead)
Hi Malinga,
Looks good. Now, say I take a random RXT Foo, it has three columns (in addition to Lifecycle Status),
1. Name
2. Version
3. Domain
How will this work? Will it be an exact match for name? or will what you enter only need to match a portion of the name (i.e. starts with). Will the same work for Version/Domain?
Lifecycle piece looks good, but there can be assets without an LC. For those, I think the "Select LC" box needs to also have a "None" in addition to what you have. Also, if no asset has an LC we don't display the LC-status column at all. In such a situation, the Filter-By should also not have it. Is that accounted for already?
So I decided to use the advanced filter within GReg to filter by all fields other than life-cycle. That way, I don't have to think about how the matching works. And I hadn't taken the 2nd and 3rd points into account.

So, like the above, some more feedback came in and the project changed accordingly. I have listed some more comments below.

Vijitha: "Can we make the drop down (second form left) "Is" & "Is Not" ?"
Senaka: "So this would now read as "Lifecycle is PatchLifeCycle in Any". May be its better to change Any in the state dropdown to "Any State", which will make it clearer to the user. Or he/she will have to expand the dropdown to understand what we have got there. I'm talking about a first time user." 
I am really thankful to each and every one who gave me feedback and advice. Here I have listed some that I found in the mail thread. I also got feedback from other people when I met them face to face.
I will blog about what I build in future posts.

GReg Basic Filter Improvement - Starting


Introduction:
Currently in GReg we have a basic filter that can be used to filter services by name. It is used to filter services on the fly, without going into the advanced filter. This project will improve that basic search so it can search the few main columns selected by the RXT (the columns shown on the list-artifact page, which are defined in the RXT). Here our main concern will be searching by life-cycle.

The sketch below will give you an idea of how the UI will change after this project.

Search by life-cycle:
When you select life-cycle from the first drop-down (to filter by life-cycle), the text-box next to it changes into a drop-down menu listing the available LCs. When you select an LC, another drop-down appears containing the possible states within the selected LC. Both of these drop-downs will have an <any> item that filters without considering that drop-down.



Search by other fields:
When you select search by any other field, it will show a text-box or a drop-down according to the selected column.

This is only the start:
This is only the start; the design and functionality will change with others' feedback. I might end up with something that is not even close to what you see here. Keep in touch to see what happens.
Comment with any ideas you have; your ideas might be reflected in the next GReg release.

Thursday, November 15, 2012

Hive & Me Part 2

Continued from Hive & Me Part 1.....

After creating the required MySQL and Hive tables, I moved on to the logic part of the script. I have to get the sums of all the bandwidth-In and bandwidth-Out entries separately. Then sum(bandwidth-In) - sum(bandwidth-Out) gives the current value and sum(bandwidth-Out) gives the history value. But doing this over all entries every hour is extremely costly. It is better if we sum only the entries from the last hour and calculate the current and history values based on the earlier current and history values. I got to know that we keep the time of the last run of the script in a MySQL table, and that we write it into the Hive configuration using a Java class. I used that value to sum only the entries from the last hour. But it is not possible to add this last-hour summarization to the previous current and history values in the same query. So I insert the last hour's summarization with a new ID, and then sum the final and last-hour rows in the table.

-- Insert a "LastHour" row per tenant: HISTORY_USAGE is the Bandwidth-Out sum
-- for entries newer than the last run, and the current delta is the
-- Bandwidth-In sum minus the Bandwidth-Out sum.
INSERT INTO TABLE REGISTRY_USAGE_HOURLY_ANALYTICS
SELECT concat(TID, "LastHour"), TID, HISTORY_USAGE, CURRENT_SUM - HISTORY_USAGE FROM
(SELECT TENANT_ID AS TID,
        sum(PAYLOAD_VALUE) AS HISTORY_USAGE
 FROM USAGE_STATS_TABLE
 WHERE USAGE_STATS_TABLE.PAYLOAD_TYPE = 'ContentBandwidth-Out' AND Timestmp > ${hiveconf:last_hourly_ts}
 GROUP BY SERVER_NAME, PAYLOAD_TYPE, TENANT_ID) table1
JOIN
(SELECT TENANT_ID AS TID2,
        sum(PAYLOAD_VALUE) AS CURRENT_SUM
 FROM USAGE_STATS_TABLE
 WHERE USAGE_STATS_TABLE.PAYLOAD_TYPE = 'ContentBandwidth-In' AND Timestmp > ${hiveconf:last_hourly_ts}
 GROUP BY SERVER_NAME, PAYLOAD_TYPE, TENANT_ID) table2
ON (table2.TID2 = table1.TID);


The above script gets the summary of the usage in the last hour and inserts it into the table. The query below adds the last-hour summary to the final values (as of the previous hour) and creates the final values for the current hour.

-- Collapse the "Final" and "LastHour" rows per tenant into the new totals
-- for the current hour.
INSERT INTO TABLE REGISTRY_USAGE_HOURLY_ANALYTICS
SELECT concat(TENANT_ID, "Final"),
        TENANT_ID,
        sum(HISTORY_USAGE) AS HISTORY_USAGE,
        sum(CURRENT_USAGE) AS CURRENT_USAGE
FROM REGISTRY_USAGE_HOURLY_ANALYTICS
GROUP BY TENANT_ID;

This query results in a MySQL table where each tenant has two rows, 'final' and 'last hour'. The final row gives the current value (the size of all the data the user currently has in his directory) and the history value (the size of all the data the user has deleted up to now). This information is available for each tenant, correct to the last hour.
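The arithmetic behind the two queries above can be sketched in plain Python. This is a minimal simulation to check the logic, not the actual Hive job; the entry tuples and dictionary keys are simplified stand-ins for the real table columns.

```python
# Sketch of the two-query summarization: per tenant, a "LastHour" row holds the
# deltas from the entries since the last run, and the new "Final" row is the
# sum of the previous final row and the last-hour row.

def last_hour_row(entries):
    """entries: list of (payload_type, value) tuples seen since the last run."""
    bw_in = sum(v for t, v in entries if t == 'ContentBandwidth-In')
    bw_out = sum(v for t, v in entries if t == 'ContentBandwidth-Out')
    # history delta = bytes deleted; current delta = bytes added minus deleted
    return {'HISTORY_USAGE': bw_out, 'CURRENT_USAGE': bw_in - bw_out}

def new_final(prev_final, last_hour):
    """Collapse the previous Final row and the LastHour row, as the
    GROUP BY TENANT_ID query does."""
    return {k: prev_final[k] + last_hour[k] for k in prev_final}

prev = {'HISTORY_USAGE': 100, 'CURRENT_USAGE': 400}
hour = last_hour_row([('ContentBandwidth-In', 50), ('ContentBandwidth-Out', 20)])
print(new_final(prev, hour))  # {'HISTORY_USAGE': 120, 'CURRENT_USAGE': 430}
```

The point of the incremental form is that each run only scans the last hour's entries instead of the whole table, while the running totals stay exact.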

Hive & Me Part 1

I started a new project to summarize registry bandwidth data (which refers to the space used in the registry). As you might know, BAM can summarize data in Cassandra keyspaces using Hive scripts. It was not easy to work with, given the lack of Hive examples.

What I have to do
There was a table in Cassandra that contains registry usage data. When a user adds or removes something from his registry, an entry is marked as “registryBandwidth-In” (when he adds something) or “registryBandwidth-Out” (when he deletes something). I have to summarize those records in such a way that we have access to the current value (the size of all the data the user currently has in his directory) and the history value (the size of all the data the user has deleted up to now). This information should be available for each tenant, correct to the last hour.

Implementation Plan
If I can write the current and history values into a MySQL table, where each tenant has a separate row, that is good enough. First I thought of having a table in Hive with the current and history values, and a MySQL table mapped to it.

The code below uses the JDBC Storage Handler for Hive; more information on how to use it can be found in Kasun's blog: http://kasunweranga.blogspot.com/2012/06/jdbc-storage-handler-for-hive.html

CREATE EXTERNAL TABLE IF NOT EXISTS REGISTRY_USAGE_HOURLY_ANALYTICS ( 
        ID STRING,
        TENANT_ID STRING,      
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE BIGINT)
        STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler' TBLPROPERTIES (
        "mapred.jdbc.driver.class" = "com.mysql.jdbc.Driver",
        "mapred.jdbc.url" = "jdbc:mysql://localhost:3306/WSO2USAGE_DB",
        "mapred.jdbc.username" = "root",
        "mapred.jdbc.password" = "root",
        "hive.jdbc.update.on.duplicate" = "true",
        "hive.jdbc.primary.key.fields" = "ID",
        "hive.jdbc.table.create.query" = "CREATE TABLE REGISTRY_USAGE_HOURLY_ANALYTICS (
        ID VARCHAR(50),
        TENANT_ID VARCHAR(50),
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE  BIGINT)"
);

This creates two tables: one is a Hive table and the other is a MySQL table. Both have the name "REGISTRY_USAGE_HOURLY_ANALYTICS". Whatever we write to the Hive table will be written to the MySQL table. In the next code block I create a mapping to the MySQL table. Using this temporary Hive table, I can query the MySQL table.

CREATE EXTERNAL TABLE IF NOT EXISTS REGISTRY_USAGE_HOURLY_ANALYTICS_TEMP (
        ID STRING,
        TENANT_ID STRING,      
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE BIGINT)
        STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler' TBLPROPERTIES (
        "mapred.jdbc.driver.class" = "com.mysql.jdbc.Driver",
        "mapred.jdbc.url" = "jdbc:mysql://localhost:3306/WSO2USAGE_DB",
        "mapred.jdbc.username" = "root",
        "mapred.jdbc.password" = "root",
        "hive.jdbc.primary.key.fields" = "TENANT_ID",
        "mapred.jdbc.input.table.name" = "REGISTRY_USAGE_HOURLY_ANALYTICS"
);

Continued in part 2....

Friday, November 2, 2012

End of the quiet October to a hopeful November

It was a really quiet October if you look at the blog; nothing was written. Actually it was a busy October, which is why I didn't have time to write articles for the blog. I worked with BAM and Hive summarization scripts for BAM, so I am thinking about writing on "Hive and summarization scripts for BAM". The next project I worked on was to improve the basic filter functionality of the GReg basic filter and add filter by LC (life-cycle) to it.

What you should expect in the upcoming month
About Hive
About BAM summarization scripts
About Bandwidth usage data summarization
About Greg Basic Filter improvement
About Greg LC filtering feature.

Looking forward to a lot of articles this November! Hopefully :)