Showing posts with label billing. Show all posts

Thursday, November 15, 2012

Hive & Me Part 1

I started a new project to summarize registry bandwidth data (that is, the space used in the registry). As you might know, we can use BAM to summarize data in Cassandra keyspaces using Hive scripts. It was not easy to work on, given the lack of Hive examples.

What I have to do
There is a table in Cassandra that contains registry usage data. When a user adds something to or removes something from his registry, an entry is recorded as “registryBandwidth-In” (when he adds something) or “registryBandwidth-Out” (when he deletes something). I have to summarize those records so that we have access to the current value (the size of all the data the user currently has in his registry) and the history value (the size of all the data the user has deleted up to now). This information should be available for each tenant, correct to the last hour.
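The summarization rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Hive script itself: the event shape `(tenant_id, event_type, size)` and the function name `summarize` are assumptions made for the example, and it assumes current usage equals everything added minus everything deleted.

```python
from collections import defaultdict

# Event type names taken from the description above.
IN_EVENT = "registryBandwidth-In"
OUT_EVENT = "registryBandwidth-Out"


def summarize(events):
    """Return {tenant_id: {"current": bytes, "history": bytes}}.

    Each event is assumed to be a (tenant_id, event_type, size) tuple;
    this shape is hypothetical, not the real Cassandra column layout.
    """
    added = defaultdict(int)
    deleted = defaultdict(int)
    for tenant_id, event_type, size in events:
        if event_type == IN_EVENT:
            added[tenant_id] += size
        elif event_type == OUT_EVENT:
            deleted[tenant_id] += size
    tenants = set(added) | set(deleted)
    # current = what was added and not yet deleted; history = what was deleted
    return {
        t: {"current": added[t] - deleted[t], "history": deleted[t]}
        for t in tenants
    }


events = [
    ("1", IN_EVENT, 500),
    ("1", IN_EVENT, 300),
    ("1", OUT_EVENT, 200),
    ("2", IN_EVENT, 100),
]
summary = summarize(events)
```

For tenant "1" this gives a current value of 600 and a history value of 200, which is the pair of numbers each tenant's row should end up holding.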

Implementation Plan
If I can write the current and history values into a MySQL table where each tenant has a separate row, that is good enough. At first I thought of having a table in Hive with the current and history values and a MySQL table mapped to it.

The code below uses the JDBC Storage Handler for Hive. More information on how to use it can be found on Kasun's blog: http://kasunweranga.blogspot.com/2012/06/jdbc-storage-handler-for-hive.html

CREATE EXTERNAL TABLE IF NOT EXISTS REGISTRY_USAGE_HOURLY_ANALYTICS ( 
        ID STRING,
        TENANT_ID STRING,      
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE BIGINT)
        STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler' TBLPROPERTIES (
        "mapred.jdbc.driver.class" = "com.mysql.jdbc.Driver",
        "mapred.jdbc.url" = "jdbc:mysql://localhost:3306/WSO2USAGE_DB",
        "mapred.jdbc.username" = "root",
        "mapred.jdbc.password" = "root",
        "hive.jdbc.update.on.duplicate" = "true",
        "hive.jdbc.primary.key.fields" = "ID",
        "hive.jdbc.table.create.query" = "CREATE TABLE REGISTRY_USAGE_HOURLY_ANALYTICS (
        ID VARCHAR(50),
        TENANT_ID VARCHAR(50),
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE  BIGINT)"
);

This creates two tables: a Hive table and a MySQL table, both named "REGISTRY_USAGE_HOURLY_ANALYTICS". Whatever we write to the Hive table is written to the MySQL table. In the next code block I create a mapping to the MySQL table. Using this temporary Hive table I can query the MySQL table.

CREATE EXTERNAL TABLE IF NOT EXISTS REGISTRY_USAGE_HOURLY_ANALYTICS_TEMP (
        ID STRING,
        TENANT_ID STRING,      
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE BIGINT)
        STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler' TBLPROPERTIES (
        "mapred.jdbc.driver.class" = "com.mysql.jdbc.Driver",
        "mapred.jdbc.url" = "jdbc:mysql://localhost:3306/WSO2USAGE_DB",
        "mapred.jdbc.username" = "root",
        "mapred.jdbc.password" = "root",
        "hive.jdbc.primary.key.fields" = "TENANT_ID",
        "mapred.jdbc.input.table.name" = "REGISTRY_USAGE_HOURLY_ANALYTICS"
);
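The first table definition sets "hive.jdbc.update.on.duplicate" together with "ID" as the primary key field. Assuming this makes the handler update an existing row rather than insert a duplicate, the effect on the MySQL side can be sketched as below. The sketch uses an in-memory SQLite database as a stand-in for MySQL, declares ID as a primary key (the original create query does not), and the helper name `write_summary` is hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE REGISTRY_USAGE_HOURLY_ANALYTICS ("
    "ID VARCHAR(50) PRIMARY KEY, "
    "TENANT_ID VARCHAR(50), "
    "HISTORY_USAGE BIGINT, "
    "CURRENT_USAGE BIGINT)"
)


def write_summary(conn, row_id, tenant_id, history, current):
    """Insert a summary row, overwriting any row with the same ID.

    INSERT OR REPLACE is SQLite syntax; it only imitates the assumed
    update-on-duplicate behaviour of the storage handler.
    """
    conn.execute(
        "INSERT OR REPLACE INTO REGISTRY_USAGE_HOURLY_ANALYTICS "
        "(ID, TENANT_ID, HISTORY_USAGE, CURRENT_USAGE) VALUES (?, ?, ?, ?)",
        (row_id, tenant_id, history, current),
    )


write_summary(conn, "1", "1", 200, 600)
write_summary(conn, "1", "1", 250, 700)  # a later hourly run for the same tenant
rows = conn.execute(
    "SELECT ID, TENANT_ID, HISTORY_USAGE, CURRENT_USAGE "
    "FROM REGISTRY_USAGE_HOURLY_ANALYTICS"
).fetchall()
```

After two hourly runs the table still holds a single row for the tenant, carrying the latest values, which is what "one row per tenant" needs.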

Continued in Part 2.

Wednesday, September 26, 2012

It is almost 'THE END'

Now, this is a summary of what I have done.


It was agreed to measure the database space used by each tenant. We will not limit a tenant's database access based on its DB usage, but we will keep track of the excess DB space used by each tenant.

Component level view of the process.



Changes to each component:

Rss-manager: This component gathers usage data from the RSS and adds it to a queue, from which the usage agent component retrieves it. Usage data collection is handled by a couple of newly added classes and is scheduled to run daily. It can be configured to start at a given time and repeat at a given interval (currently set to 24 hours). Here we are only interested in tenants that have exceeded their usage limits, so we need to know the usage plan of each such tenant in order to get its limits. We decided to publish information only about tenants that exceed their space limits, for two reasons:
  1. To reduce the data transfer between components and to the BAM server.
  2. Exceeded DB size is all we need for billing calculations.
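The "publish only the exceeded tenants" idea can be sketched as follows. This is an illustration in Python rather than the actual rss-manager classes: the function name `collect_exceeded`, the dictionary-based inputs, and the shape of the queued record are all assumptions made for the example.

```python
from queue import Queue


def collect_exceeded(usage_by_tenant, limit_by_tenant, out_queue):
    """Queue a record for every tenant whose DB usage is over its plan limit.

    usage_by_tenant and limit_by_tenant map a tenant id to a size in bytes
    (hypothetical shapes). Tenants within their limit are not published.
    """
    for tenant_id, used in usage_by_tenant.items():
        limit = limit_by_tenant.get(tenant_id)
        if limit is None:
            continue  # no plan known for this tenant; skipped in this sketch
        if used > limit:
            out_queue.put({"tenant_id": tenant_id, "exceeded_bytes": used - limit})


usage = {"1": 1200, "2": 300, "3": 900}
limits = {"1": 1000, "2": 1000, "3": 900}
usage_queue = Queue()
collect_exceeded(usage, limits, usage_queue)
```

Only tenant "1" ends up in the queue here, with an excess of 200 bytes, so less data moves between the components and the billing side receives only what it needs.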

Usage-agent: This component retrieves usage data from the above-mentioned queue in the rss-manager. This is handled by a newly added class, DatabaseUsageDataRetrievalTask. It is also scheduled to run daily, and can be configured to start at a given time and repeat at a given interval (currently 24 hours).
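The "start at a given time, repeat at a given interval" schedule and the queue retrieval can be sketched like this. It is a Python illustration only, not the DatabaseUsageDataRetrievalTask code; the names `next_run` and `drain` and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta
from queue import Queue, Empty


def next_run(start, interval, now):
    """Return the first scheduled run time at or after `now`."""
    if now <= start:
        return start
    elapsed = now - start
    periods = -(-elapsed // interval)  # ceiling division on timedeltas
    return start + periods * interval


def drain(q):
    """Remove and return everything currently waiting in the queue."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except Empty:
            return items


start = datetime(2012, 9, 26, 2, 0)        # configured start time (example value)
interval = timedelta(hours=24)             # configured repeat gap
now = datetime(2012, 9, 27, 3, 0)
upcoming = next_run(start, interval, now)  # 2012-09-28 02:00

pending = Queue()
pending.put({"tenant_id": "1", "exceeded_bytes": 200})
retrieved = drain(pending)
```

Because both the collection and the retrieval use the same kind of schedule, the start times can be offset so that the retrieval runs after the collection has filled the queue.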

Stratos-commons: This is where usage plan details are handled. Plan details are read from 'multitenancy-packages.xml' and made available through a service. I changed the XML file, the XML reading class, and the data-storing bean to include DB usage data.

Dependencies: This work depends on a component that is yet to be developed (one that returns a tenant's usage plan given the tenant domain or ID). That component is required for the changed RSS-Manager component to work properly.


Tuesday, September 18, 2012

The Inevitable Change


Change is there in everything you see; that is why 'change' is known as the only thing that does not change. When I moved back to my old (main) project, I felt that I had to start from scratch again. The code I had written didn't work, and almost every line had errors in it. The errors were due to missing classes, missing methods, changed signatures, etc. It was a hard job to bring it back to its earlier state. By now it is collecting and publishing usage data as intended. I worked on this for over a week but added nothing extra; I only took it back to where it was.

Project Progress

What I have done
Completed collecting database usage data.
Completed publishing them to BAM

What more to do
I don't yet have a way to get the usage plan of each tenant, so for the sake of testing I have hard-coded it.
Need to analyze and summarize the usage data that was sent to BAM; this is done using Hive scripts.
Need to clean up and reformat the code according to best practices where I have missed them.