Showing posts with label mysql. Show all posts
Showing posts with label mysql. Show all posts

Thursday, November 15, 2012

Hive & Me Part 2

Continued from Hive & Me Part 1.....

After creating the required MySQL and Hive tables I moved in to the logic part of the script. I have to get the sum of all the bandwidth-In and bandwidth-Out entries separately. Then sum (bandwidth-In)-sum (bandwidth-Out) will give the current value and sum (bandwidth-Out) will give the history value. But doing it hourly is extremely costly. But if we can sum the entries from the last hour and calculate the current and history values based on the early current and history values, it will be better. I got to know we are keeping the time of the last run of the script in a MySQL table, and we write it to the hive configuration file using a Java class. I used that value to get the sum of the entries in the last hour. But it is not possible to add this last hour summarization to the previous current, and history values in the same query. So I add the summarization of the last hour with new id and sum the final and last hour rows in the table.

INSERT INTO TABLE REGISTRY_USAGE_HOURLY_ANALYTICS 
SELECT concat(TID, "LastHour"), TID, HISTORY_USAGE, CURRUNT-HISTORY_USAGE FROM
(SELECT TENANT_ID AS TID,
        sum(PAYLOAD_VALUE) AS HISTORY_USAGE
FROM USAGE_STATS_TABLE
WHERE USAGE_STATS_TABLE.PAYLOAD_TYPE = 'ContentBandwidth-Out' AND Timestmp > ${hiveconf:last_hourly_ts}
GROUP BY SERVER_NAME, PAYLOAD_TYPE, TENANT_ID) table1
JOIN
(SELECT TENANT_ID AS TID2,
        sum(PAYLOAD_VALUE) AS CURRUNT
FROM USAGE_STATS_TABLE
WHERE USAGE_STATS_TABLE.PAYLOAD_TYPE = 'ContentBandwidth-In' AND Timestmp > ${hiveconf:last_hourly_ts}
GROUP BY SERVER_NAME, PAYLOAD_TYPE, TENANT_ID) table2
ON(table2.TID2 = table1.TID);


 Above script get the summery of the usage in the last hour and inset it in to the table. Below query add the last our summary to final(in the last hour) and create the final value for the current hour.

INSERT INTO TABLE REGISTRY_USAGE_HOURLY_ANALYTICS 
SELECT concat(TENANT_ID, "Final"),
        TENANT_ID,
        sum(HISTORY_USAGE) as HISTORY_USAGE,
        sum(CURRENT_USAGE) as CURRENT_USAGE
FROM REGISTRY_USAGE_HOURLY_ANALYTICS
GROUP BY TENANT_ID

This query results in a MySQL table where each tenant has two rows as 'final' and 'last hour'. Final row gives the current (size of all the data that user currently have in his directory) and history (size of all the data that user has deleted up to now). This information should be available to for each tenant correct to the last hour.

Hive & Me Part 1

Started with the new project to summarize registry bandwidth data (refers to the space used in the registry). As you might know we can have BAM to summarize data in Cassandra key spaces using hive scripts. It was not easy to work with lack of examples under hive.

What I have to do
There was a table in Cassandra that contains registry usage data. When a user adds or remove something from his registry a entry is marked as “registryBandwidth-In” (when we adds something) or “registryBandwidth-Out”(when he deletes something). I have to summarize those recodes in such a way that we have access to the current (size of all the data that user currently have in his directory) and history (size of all the data that user has deleted up to now). This information should be available to for each tenant correct to the last hour.

Implementation Plan
If I can write the current and history values in to a MySQL table, where each tenant will have a separate row, it is good enough. First I thought of having a table in hive with current and history values and a MySQL table mapped to it.

Below code uses the JDBC Storage Handler for Hive and more information on how to use it can be found in Kasun's blog: http://kasunweranga.blogspot.com/2012/06/jdbc-storage-handler-for-hive.html

CREATE EXTERNAL TABLE IF NOT EXISTS REGISTRY_USAGE_HOURLY_ANALYTICS ( 
        ID STRING,
        TENANT_ID STRING,      
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE BIGINT)
        STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler' TBLPROPERTIES (
        "mapred.jdbc.driver.class" = "com.mysql.jdbc.Driver",
        "mapred.jdbc.url" = "jdbc:mysql://localhost:3306/WSO2USAGE_DB",
        "mapred.jdbc.username" = "root",
        "mapred.jdbc.password" = "root",
        "hive.jdbc.update.on.duplicate" = "true",
        "hive.jdbc.primary.key.fields" = "ID",
        "hive.jdbc.table.create.query" = "CREATE TABLE REGISTRY_USAGE_HOURLY_ANALYTICS (
        ID VARCHAR(50),
        TENANT_ID VARCHAR(50),
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE  BIGINT)"
);

This will create a  2 tables, One is a Hive table and other is a mySQL table. Both will have the name "REGISTRY_USAGE_HOURLY_ANALYTICS" What ever we write the to the hive table will be written to the MySQL table. In the next code block I create a mapping to the MySQL table. Using this temporary hive table I can query the MySQL table.

CREATE EXTERNAL TABLE IF NOT EXISTS REGISTRY_USAGE_HOURLY_ANALYTICS_TEMP (
        ID STRING,
        TENANT_ID STRING,      
        HISTORY_USAGE BIGINT,
        CURRENT_USAGE BIGINT)
        STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler' TBLPROPERTIES (
        "mapred.jdbc.driver.class" = "com.mysql.jdbc.Driver",
        "mapred.jdbc.url" = "jdbc:mysql://localhost:3306/WSO2USAGE_DB",
        "mapred.jdbc.username" = "root",
        "mapred.jdbc.password" = "root",
        "hive.jdbc.primary.key.fields" = "TENANT_ID",
        "mapred.jdbc.input.table.name" = "REGISTRY_USAGE_HOURLY_ANALYTICS"
);

Continued to the part 2....

Saturday, June 30, 2012

1st Verification – Verification of “index length + data length = actual disk space used”


I think I found where DBs are stored. As I am working in XAMPP there were stored in /opt/lampp/var/mysql there where folders for each database. When tried to open them it gives me a error saying “you don't have permission” Next thing I tried was trying to open it in the terminal using sudo. Unlucky me it gave me a error saying “sudo: cd: command not found”. Have to find a way to open such folders. I'll write a post on this, If I find a way (Or any one who know a way can help me by commenting below). Till then I used “sudo nautilus path/”

There were 3 files for each table in each db. A .frm file, .myd file, .myi file. All three are contributing to the space used
  • FRM files contain table structure information.
  • MYD contains the actual table data.
  • MYI files contain the indexes.
Using “sudo ls -hs path” list the files inside the folder given by the path with sizes.



-s, --summarize (display only a total for each argument)
-h, --human-readable ( print sizes in human readable format (e.g., 1K 234M 2G))



so It showed that figures given in the information schema is similar to the results given by the above command. But size of the FRM file was not take in to account by me. But looks like all the .frm files are having the same length (12K) so I can sum them up If I know the number of tables. However I have to check why others ignore this file when they are calculating.

Some databases only have .frm file

Going through the database folders, I saw that some folders only have a .frm file in it. Searching for that I found out that there is two major engines within MySQL: MyISAM and InnoDB. If the table belongs to InnoDB, it only includes .frm file. Data of that kind of DBs are stored in a single or multiple (you can configure that) .idb files.

Prototype version 1 has to be verified.


This prototype is tested on my own machine with several MySQL users created by me. So this is not guaranteed to work in the real scenario. This should work with the actual billing architecture, to be any use for my work. This is the second and biggest verification that I have to do.

Before that there is a simple and basic verification that I have to consider. In prototype code I calculate a table size as follows (here information from information_schema.TABLES is used)

Table Size = Data Length + Index Length

Both Data Length and Index Length can be found in information_schema.TABLES. I have to verify that figure given by the above calculation is the correct table size.

Prototype version 1


As I was asked to build a proof of concept(POC), as I mentioned in before posts, I started working on it. By now I have a working prototype that logs the DB size information. Here I will create two tables one contain user details like user name, user DB size limit and binary information column 'exceeded' that will indicate whether subject user has exceeded his limits (this column is not used feather in my currant implementation, but I thought it would be helpful to have such a column in future).

Second table is a logging table, this logs the disk space used by each user, time to time. This again has 3 columns. A time stamp, user name and exceeded number of bytes (used number of bytes-user limit in bytes) are the information that is included in that table.

Below is the simple code that do that job
include code here.

import java.sql.*;

public class sizeCheck {

 public static void main(String args[]) {
  Connection con = null;
  Statement stat, stat2;
  boolean debug = true;
  try {
   Class.forName("com.mysql.jdbc.Driver").newInstance();
   con = DriverManager.getConnection("jdbc:mysql://localhost/", "root", "root");

   if (debug && !con.isClosed())
    System.out.println("Successfully connected to " + "MySQL server using TCP/IP...");

   stat = con.createStatement();
   stat2 = con.createStatement();

   // query to select all of the data from a table
   String selectQuery = "SELECT user,sizelimit FROM `Quota`.`Quota`";
   // get the results
   ResultSet results = stat.executeQuery(selectQuery);

   // output the results
   while (results.next()) {
    String getDBsizes =
                        "SELECT SUM(DATA_LENGTH) as sumData, SUM(INDEX_LENGTH) as sumIndex FROM `information_schema`.`TABLES` WHERE TABLE_SCHEMA LIKE '" +
                                results.getString("user") + "%'";
    ResultSet sizeResult = stat2.executeQuery(getDBsizes);
    sizeResult.next();
    if (debug) {
     System.out.println(sizeResult.getInt("sumData"));
     System.out.println(sizeResult.getInt("sumIndex"));
     System.out.println(sizeResult.getInt("sumData") + sizeResult.getInt("sumIndex"));
    }
    if (sizeResult.getInt("sumData") + sizeResult.getInt("sumIndex") > results.getInt("sizelimit")) {
     int exccededGB =
                      sizeResult.getInt("sumData") + sizeResult.getInt("sumIndex") -
                              results.getInt("sizelimit");
     String setExceeded =
                          "UPDATE `Quota`.`Quota` SET `exceeded` = '1' WHERE `Quota`.`user` = '" +
                                  results.getString("user") + "';";
     stat2.executeUpdate(setExceeded);
     String logExceeded =
                          "INSERT INTO `Quota`.`exceed` (`datetime` ,`user` ,`exceedbytes` ,`other`) VALUES (CURRENT_TIMESTAMP , '" +
                                  results.getString("user") +
                                  "', '" +
                                  exccededGB +
                                  "', '');";
     stat2.execute(logExceeded);
     // String revoke =
     // "REVOKE INSERT, UPDATE, CREATE, ALTER ON `"+
     // results.getString("user") +"\\_%` . * FROM '"+
     // results.getString("user") +"'@'localhost';";
     // if(debug)
     // System.out.println(revoke);
     // stat2.execute(revoke);
    } else {

     String setExceeded =
                          "UPDATE `Quota`.`Quota` SET `exceeded` = '0' WHERE `Quota`.`user` = '" +
                                  results.getString("user") + "';";
     stat2.executeUpdate(setExceeded);
     // String grant = "GRANT ALL PRIVILEGES ON `"+
     // results.getString("user") +"\\_%` . * TO '"+
     // results.getString("user") +"'@'localhost';";
     // if(debug)
     // System.out.println(grant);
     // stat2.execute(grant);
    }
   }

  } catch (Exception e) {
   System.err.println("Exception: " + e.getMessage());
  } finally {
   try {
    if (con != null)
     con.close();
   } catch (SQLException e) {
   }
  }

 }

}
you can download it form below


Monday, June 11, 2012

Do you need data to play with?


If you are doing any data related work, you need some big data to work with. As a example I am working with mySQL, and I want to try a big query that will populate my 'SLOW QUERY LOG'. To get in to the slow query log that query must run for some time (although we can define that time, my tables were to small to produce at least a 1ms). I needed some big tables that will populate my databases with some big data.

Thats where I found the following .sql file that have lot of data in it. This is some kind of a sampple given by mySQL. Anyway here it is.

https://docs.google.com/a/wso2.com/file/d/0B4VQdLMBav1WTXQ3cnY5RHdWWkE/edit
go to it and save it..

Who thought that installing MySQL will that that long?


I wanted to see that list of users, so as most of you know it is done by ,
select * from mysql.user;

But this was very tedious, so I thought I should go to a GUI based server like PHPmyadmin. SO I thought of installing WAMP like thing here. As I am working in ubuntu there is no WAMP, I have to work with XAMPP, which is very similar.

It was a easy job with installing it. Just go to http://www.apachefriends.org/en/xampp-linux.html and follow there 4 steps and you will be done.


OR
use below

Then When I tried to start the servers it gave me the following massege
XAMPP is currently only availably as 32 bit application. Please use a 32 bit compatibility library for your system.

To fix that problem I had to install 32 bit libraries (which took several minutes to download [<70MB]) to my computer. I got help from following site (http://preprocess.me/comment/386)

you have to run the below command to install 23 bit libs
sudo apt-get install ia32-libs


But when I started PHPmyadmin it gave me this error.
#2002 - The server is not responding (or the local MySQL server's socket is not correctly configured)

I always knew that something like this will popup as I had a already installed mySQL server in my computer.

So I thought of removing that existing mySQL sever. This page helped me in that process.
A little problem came with the commands. When copying and pasting it to he bash. So Here are the corrected ones.
apt-get --purge remove mysql-server
apt-get --purge remove mysql-client
apt-get --purge remove mysql-common
apt-get autoremove
apt-get autoclean
//befor purge it should be -- not –

Now restart the server and Everything will work fine.

Saturday, June 2, 2012

Suggestions and replies


With my suggestion, others started to give their feedback. There were few interesting and informative feedbacks as listed below.

Sanjeewa Malalgoda: I was looking into this subject some times back and found some points. AFAIU only number of transactions is not enough .found some interesting tools like dbtuna[1] and jetprofiler[2] I hope we can have a look at them and get some idea. I have tested jetprofiler and it gives
lot of information related the db usage.



I went through them and got to this below conclusion which I made as my reply.

“It is always good to know a person who has worked in the same area. I went through those 2 tools and they mainly target the management and administrative aspects of the db server. It gives us nice graphical representations about existing data. This can be very useful when understanding the usage patterns of users. But this do not give us any new informations, this only presents the date found in the information_schema, logs. As you have used it you might know more about it, so correct me if I am wrong.”

Jet Profiler

  1. Install Java 1.6 separately
  2. Unzip jetprofiler_v2.0.5.zip to the desired folder (e.g. /usr/local/bin/jetprofiler or /home/USER/bin/jetprofiler).
  3. Run ./jetprofiler

I continued


So my last day suggestion way not up to the standard, so Amilam asked me to restructure it, so this is the version 2

Problem:

Until now we don't have a way to measure the usage of the StratosLive RSS. There can be many views of the usage like I/O per unit time, accumulated DB size per user, Bandwidth used by a user (i think this is taken in to account by now). We need to start measuring(tracking) the usage at least under one view (ideally in all). Real problem comes with the limitations of mySQL. In mySQL, it do not have in built support for above mention things.

we cannot limit DB space: http://www.howtoforge.com/forums/showthread.php?t=1944
we cannot limit/measure bandwidth/IO rate of a individual database/user : http://forums.mysql.com/read.php?10,543847,543847
There is a way to set the MAX_QUERIES_PER_HOUR, MAX_CONNECTIONS_PER_HOUR, but there is no way to get the current state of those variables (without messing with mySQL code base, http://forums.mysql.com/read.php?35,179219)

Solutions:

There are external suggestions for limiting database size, we can do something like that, as a example,
MySQL Quota Daemon: http://lrem.net/software/mysql-quota-daemon.xhtml
MySQL Quota-Tool: http://projects.marsching.org/mysql_quota/

And we can limit the user to a MAX_QUERIES_PER_HOUR using in built support in mySQL (http://dev.mysql.com/doc/refman/5.5/en/grant.html). This limit will not be a problem to the user while service is protected from getting extreme number of request per unit time (kind of DOS)

And there are suggested ways to limit the DB size per user, using cronjobs that will notify the user of excess usage of space.

How we can use what we have:

1. We can limit the size of the DB at the time of DB creation based on the usage plan.
2. We can define transactions per hour for all or selected usage plans

Please reply with your feedback on this view, it will be very helpful...

Limiting The Resource Use


Resuming from where I stopped yesterday, I started by looking how to limit the resource use.

Current state of the MAX_QUERIES_PER_HOUR, MAX_UPDATES_PER_HOUR, MAX_CONNECTIONS_PER_HOUR, MAX_USER_CONNECTIONS variables.


Yes, I can limit it, but the ultimate goal of this project is to measure the resource use. So it is always better to have something that can do my prime goal. So I needed to know whether I can get any information about the current state of MAX_QUERIES_PER_HOUR, MAX_UPDATES_PER_HOUR, MAX_CONNECTIONS_PER_HOUR, MAX_USER_CONNECTIONS variables.

I checked in MYSQL metadata schema(information_schema, performance_schema)

If mySQL can limit it, it should have a count somewhere right?. That was how I thought, so I started looking through some default schema like information_schema, performance_schema, etc. Sorry to say that I found nothing. But still I think it should save this data somewhere.


Limiting the database size (MySQL Quota Daemon)




I went though that library I found last day (MySQL Quota Daemon : http://lrem.net/software/mysql-quota-daemon.xhtml) and it is a good tool (it requires perl). Another very small piece of code for doing the same thing was MySQL Quota-Tool (http://projects.marsching.org/mysql_quota/)