Sunday, May 15, 2016

Focusing on implementing govt policies using the big data tool zeppelin

It was good to know from the goverment that it published lots of data collected over the period of time at

I picked and amenities data about the villages from to do some analysis.

I believe govterment is doing sufficient analysis to find where and with what force it should use its machinery to promote its schemes.

I have been doing some analysis using the Apache Spark and eco system around it. But was interested in a quick visualization, which would help to understand the data quickly. A possible use would be using R as I wanted to build the reports quickly. I explored some of the capabilities of R and Shiny App in my earlier post of Custer Analysis of banking data.

Recently I came to know about a fantastic tool, its a web based notebook, with the in-built support for Apache-Spark, with a support of multiple langues like Scala, Python, spark sql and so on and most important that this it is opensource.
I picked one of the csv from the the whole data, and which is for one of the district in Karnataka state is Gulbarga and started doing some analysis.

Loading the data into the dataframe/table

It is easy to accomodate spark sql also in the notebook paragraph/sections.
Following is a very simple query to show the population spread in the villages of Gulbarga district.

Goverment make policies and spend money on that, and find the effectiveness of it based on the result. We can use the collected data to understand where should be the maximum penetration of the schemes, i.e. find the villages which needs the goverment schemes most. One of the example where goverment can initiates its policies to reduce the gap of male-female ratio, we can understand from the data available, where should be the more focus.

Changed the minbenchmark to 80% and same got updated on the fly

I stated to analyse this data to check for the education facilities in the villages which is in progress, would be publishing that information in later posts.

Installation details:
a) Zeppelin was deployed on Ubuntu VirtualBox with Windows as host.
b) Set your java home (1.7) before starting Zeppelin.
c) To start execute ' start' in the ZEPPELIN_HOME\bin

Monday, September 15, 2014

Cluster Analysis using R with banking customer balance distribution

Once a wise man (Yachine Phuneli) taught me while teaching me about java, semiconductor and wafer fabrication that how to teach/explain.

First tell to your audience/readers what are going to to tell.
Than tell the thing which you planned to tell.
And than tell what you just told.

I am going to describe how to do Cluster Analysis using R

Cluster Analysis
Cluster analysis has a vital role in numerous fields we are going to see it in the banking business to segment customers into small groups that can later be targeted for future marketing activities.
In machine learning and data mining it is used to efficiently find nearest neighbours and in summarization.

Cluster analysis aims to group data objects based on the information that is available that describes the objects and their relationships. The main goal is to group similar objects together, and the greater the similarity within a group the better and the greater the difference between group the more diverse the clustering.  A clustering is an entire collection of clusters; a cluster on the other hand is just one part of the entire picture. There are different types of clusters and also different types of clustering.

Types of Clustering Algorithms

1.     Partitioning-based clustering: are algorithms that determine all the clusters at once in most cases.
o    K-means clustering
o    K-medoids clustering
o    EM (expectation maximization) clustering
2.     Hierarchical clustering: these algorithms find successive clusters using previously established ones.
o    Divisive clustering is a top down approach.
o    Agglomerative clustering is a bottom up approach.
With the help of data mining methods, such as clustering algorithm, it is possible to discover the key characteristics from the bank’s data and possibly use those characteristics for future prediction also.
According to the facts established based on the information released by banks, attracting new customer costs five to six times more than customer retention. Retaining existing customers is the best core marketing strategy to become profitable in the very competitive banking industry. In order to maximise the profit, how to retain the existing customers has become a subject to be urgently solving for banks.

Use clustering to produce an initial working hypothesis, refine this hypothesis, then use prediction to generalize the refined hypothesis to data and evaluate how well it performs.

For this study following was the Bank customer data for the Balance distribution.

Before doing the cluster analysis above data was transformed based on the type or range of values. This would help us to do the scaling and find number of cluster quickly. We can keep the data as it is but I opted to find clustering based on the simplified data.
Balance conversion based on range:
=IF(AND(Sheet2!I2>=0,Sheet2!I2<=1500),"1",IF(AND(Sheet2!I2>1500,Sheet2!I2<=3000),"2",IF(AND(Sheet2!I2>3000,Sheet2!I2<6000 -=""> If Balance between 0 to 1500 : 1, If Balance between 1501 to 3000:2, If Balance between 3000 to 6000:3, Greater than 6000 :4
Age conversion based on the range:
Customer type conversion based on type::
Product conversion ::
=IF(Sheet2!C6="SavingAccount","1","2")  -> 1 if SavingAccount, 2 if CurrentAccount

So we got the data transformed in the following way:

    Find No Of Cluster

1) Load csv data from file:

 2)  Once data is loaded, per form the scaling of data by executing 
Ø  bDataScale <- ankcustomerdata="" o:p="" scale="">

 3)     Loading NBClust library and find the number of clusters for the data available in bdata,
Ø  library(NbClust)
Ø  nc <- bdatascale="""15," method="kmeans""2," nbclust="" o:p="">
Above command  takes a while to calculate.

Following is the output of NbClust function

The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.

The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.

All 1000 observations were used.

* Among all indices:                                               
* 4 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 2 proposed 9 as the best number of clusters
* 2 proposed 11 as the best number of clusters
* 2 proposed 12 as the best number of clusters
* 2 proposed 15 as the best number of clusters

                   ***** Conclusion *****                           
 * According to the majority rule, the best number of clusters is  3 

4)      Plot the chart with the number of cluster we have obtained.

     Calculate K-means cluster
   Now to find the cluster based on Quarterly average balance, bdata loaded from .csv file is

Here I want to consider only AvgBalQ1, AvgBalQ2, AvgBalQ3, AvgBalQ4 for kmeans cluster.
Ø   test <-bdata c="" o:p="">
Scale data
Ø  scaledata <-scale o:p="" test="">
Set seet so that every time we calculate the kmeans it would be consistent
Ø   set.seed(1234)
Find the number of cluster/center as 3,
Ø   km <- centers="3," iter.max="500)<o:p" kmeans="" nstart="10," scaledata="">
Check the cluster size data, 401 data in cluster 1, 310 is cluster 2 and 289 in cluster 3.
Ø   km$size
[1] 401    310     289

Let’s have a look of the function kmeans

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

Input to kmeans function
x: A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
centers: Either the number of clusters or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
iter.max: The maximum number of iterations allowed.
nstart: If centers is a number, nstart gives the number of random sets that should be chosen.
algorithm: The algorithm to be used. It should be one of "Hartigan-Wong", "Lloyd", "Forgy" or "MacQueen". If no algorithm is specified, the algorithm of Hartigan and Wong is used by default.

Result returned from kmeans function call
cluster: A vector of integers indicating the cluster to which each point is allocated.
centers: A matrix of cluster centers.
whithnss: The within-cluster sum of squares for each cluster.
size: The number of points in each cluster.

(Fig: Quarter wise Product data distribution)
Now if bank wants take some necessary actions on the data by grouping the customer base and start a product or campaign for them, For e.g.
a)     If we see the above data distribution where bank wants to suggest the high net worth/higher account balance customer from the saving account to opt for other product, or
b)     If bank wants to offer those customers or a group of customers (cluster) additional services without charges.
Partitioning of data and making cluster/group would be helpful to take necessary actions for every cluster. It would be easy to group customers in cluster and then plan business activities on respective clusters accordingly.
K-means clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract. 
 So if we group data in 3 clusters (see the next chapter to see how to do we get 3 as a number of cluster)

(Fig: Quarterly Average Balance with 3 cluster data distribution)
If we see these four graphs for every quarter, Cluster 3 (in blue color) has some customer whose balance is greater than customer in other clusters. We can mark these clusters back to the original data and find those specific customers from the cluster to take appropriate action. 
Kmeans clustering with 3 clusters of sizes 401, 310, 289
Cluster means:
           AvgBalQ1 AvgBalQ2 AvgBalQ3 AvgBalQ4
Cluster 1     4019     2189     5059     5497
Cluster 2     2808     6413     3630     4044
Cluster 3     5611     6527     7222     6638
Percentage of within cluster variance accounted for by each cluster:
Cluster 1: 42.22%,  Cluster 2: 26.07%  Cluster 3: 31.72%

For every measure or non-factor or numeric value based parameter we choose these graph would be keep                                                              changing. Because their cluster means would be changing and so their variance.
For e.g. If we try to obtain data with 3 cluster but by including age also in the data, than our cluster size, mean and cluster data percentage changes.
Kmeans clustering with 3 clusters of sizes 353, 366, 281
Cluster means:
            Age    AvgBalQ1   AvgBalQ2   AvgBalQ3 AvgBalQ4
Cluster 1                31.5      3658       3858       5972        4869
Cluster 2                59.8      3366        4640     2972        4882
Cluster 3                57.1      5624        6023     7279        6659
Percentage of within cluster variance accounted for by each cluster:
Cluster 1: 34.21%, Cluster 2: 34.84%, Cluster 3: 30.95%

Now lets do some more analysis with the with the distribution of the data.

Data Interpretation: Understand the data based on the different combination of factors/dimensions and measures.
Factors/Dimensions: A dimension is a broad grouping of related data about a major aspect of your business. For example, you have a dimension called Products.
Measure: A measure is a performance indicator that is quantifiable and used to determine how well a business is operating. For example, useful measures may be Average yearly balance.
In the following figures, I tried to figure out only two interpretation (there could be more) and two possible actions which can be taken.
Note: I not sure which city exists in which state in US. I have just generated cities and states using talend (how to generate data using talend, check in the previous post) for the purpose of this study, as these cities are states were available in RowGenerator for generating records and moved some of them to make data look better.

(Fig: Product wise Gender data distribution)

If we try to analyse the above graph, it suggest that there are few female customer for a bank who has current account as product. Which are interesting and a bank can interrupt it in two ways:
I.        Interpretation:
a.     Either these customers are unware about which product they should have and they are paying unnecessary charges for CA, rather than going for SA.
b.     Or, they are the some entrepreneur or running small business.
II.        Actions:
a.     Customer can be notified by the bank executive about the possible change in their product from CA to SA and win the loyalty of customer by showing them bank care for their customer’s money.
b.     Marketing team target these customers with their specific needs or keep these customer’s in mind for the future product which would meet for these specific female customer who has CurrentAccount and are entrepreneur.

(Fig: Age wise Gender data distribution)
I.        Data Interpretation:
a.     Male customers are more rather than female or enterprise customers.
b.     Enterprise customer’s mainly lying in the range of age from 38-67.
II.        Possible Actions:
a.     Either products for Enterprises and Female customer are not very effective, so few new product which is more suitable to them can be launched with some campaign, etc.
b.     Start-up(Enterprise) does not have any enough accounts with the bank as the most of the customers are in the range of 40-70 and few of them are from 20-40. If an organization/startup can open account with the bank, there would be chances to get more salary accounts also from the same enterprise customer.

(Fig: City wise Gender data distribution)
I.        Data Interpretation
a.     Albany city has more enterprise customers than all other cities.
b.     Atlanta city has more personal customer (male + female)
II.        Possible Actions:
a.     Any campaigns which are targeted for personal customers should include cities like Atlanta.
b.     Any campaigns which are targeted for enterprise customer should include cities like Albany.
(Fig: City wise Gender and State wise Gender data distribution)

(Fig: Per Quarter Average Balance per Gender)
I.        Data Interpretation
a.     Q3 and Q4 average balance increases in comparison to Q1 and Q2. See the movement of boxplots not only the dots, in all quarters.
b.     Q1 and Q2 Enterprise customer are more towards lesser balance in comparison to Q3 and Q4.
II.        Possible Actions:
a.     If these are the regular trends for every year, that shows that cash flow for these customer increases  in last two quarters than it would be more appropriate time for having campaigns. Bank would be able to get better return on their investment on those campaigns.
ROA = Margin * Asset Velocity
Asset velocity = Sales / Assets
More sales with more competitive product.
More sales mean more asset velocity.
More asset velocity means more return on assets.
More ROA is more profit.

b.     Bank can offer few additional facilities for the customer who are not using overdraft or limit facility. Or based on the economic environment (linking of external data with bank dataset) in the country if organisations are looking for funds for capital expenditure, bank can offer products accordingly.

  (Fig1: Gender wise Product data distribution)                           (Fig2: Yearly Average Balance per Product)
I.        Data Interpretation
a.     There are more female current account customers than male. (Fig 1)
b.     There are few customers who have saving account but their average balance is more than the normal data population. (Fig 2)
II.        Possible Actions:
a.     Create new product or launch campaign or correct current products to increase the male customer base with current account.
b.     Customer can be suggested to move their funds from savings accounts to fixed deposit this would help bank to win the loyalty of customer by showing them bank care for their customer’s money.

(Fig: Quarter wise State data distribution)
I.        Data Interpretation
a.     Alaska and Louisiana has more customer base. Florida has the least customer base.
II.        Possible Actions:
a.     Specific marketing campaigns should be planned to get customer base in the states where number of customer are not enough.
b.     If banking operations (operational efficiency) are the problem for the lower number of customer which can be identified by
                                          i.    Checking the number of accounts/customer trend every year and see if customer base has a diminishing trend in every year. Find the exact problem and take corrective actions for it, one of the example for reducing customer base is the operational efficiency. So perform check about operational efficiency of every branch in Florida and other lower customer base states.
                                         ii.    There was not enough marketing or campaign done as these were not the area of focus in the previous years.

Cluster analysis is one of the important technique to analyse your data, which is not easy specially if you are not statistician/mathematician. R is wonderful tool to do all the analytics work with a ease. I have used R, RStudio and ShinyApp (Web-based framework for R for more practical visualization and statistical analysis )

Thanks to my friend and colleague "Alex" at Misys to introduce me to this wonderful tool and technology  The "R".

Sunday, August 17, 2014

Talend Big Data Integration with Hadoop

Hadoop can be downloaded from the Apache Hadoop website at This would include core modules like Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. Additional Hadoop-related projects like Hive, Pig, Hbase and many more can be downloaded from their respective Apache websites. But setting up and doing all this work is not easy.

Once this is done, doing the data-integration is another challenge.
Writing each and everything for the data-integration directly with the pig or hive or other script is bit difficult route to do the required work.

Instead of worrying and doing too much, I think easier way would be to use the sandbox provided  by any of hadoop vendor and use the data integration platform by talend. Good thing is this that both hadoop installation and talend big data integration studio software is free and available under apache license :-)

Some of the hadoop vendors are HortonWorks, Cloudera, MapR, etc.

When I first tried the HortonWorks with talend big data integration platform, it made things very easy.
Of-course for enterprise or cloud there is license. But as I mentioned eaerlier, these are available under apache license and free.

So thought of describing how this can be done. In the later post I can explain how talend and hortonworks can be used for their different features. So this post is kind of overview and later post can be in detail for different features.

So what all we need.
1) Virtual machine where hadoop can be installed.
2) Hadoop applicance to be installed on virtual machine, i.e. HortonWorks sandbox.
3) Talend Open Studio for Big Data
4) Create a job for data load on HDFS.

1) Virtual Machine:
  I have use oracle virtual box from There was an issue with 4.3 latest version so i used 4.3.12 from the link:
  One you are finished with installation of virtual box, proceed for next step.

2) Hadoop applicance to be installed on virtual machine, i.e. HortonWorks sandbox:
  a) Download:
  There are multiple sandbox available from HortonWorks for virtualbox, vmvare, Hyper-v. As we have
   installed virtualbox, download that one. It can be donwloaded from the following link:
   b) Setup:
   Start virtualbox, Go to file menu and click on Import Alliance.
Fig: Importing of HortonWorks hadoop appliance-1.

      Select your appliance(.ova) file and click on next. You would see the import virtual appliance wizard,
      select the appropriate memory and other parameters based on your need. You can set the network

Fig: Importing of HortonWorks hadoop appliance-2.

       Change the network settings to get it's own IP for virtual box. By default it would be NAT change it to          Bridge Adapter.
Fig: Setting of the Network Adapter.
  c) Verify  

  •       You can verify your installation and find the IP of VM running hadoop.
  •       Click on the HortonWorks sandbox and click on start button. It will take some time to boot start all services of hadoop, pig, hive, etc and will prompt for authetication.
  •       Username is "root" and password is "hadoop".
  •       On the command prompt type ifconfig (as its linux) and note down the ip which would be used in talend studio during the HDFSConnection creation.
  •       You can use the IP with Port 8000 (default) to check about the successful installation and running        of services.

You can browse the various services at :
Fig: Accessing hadoop services using the browser from remote machine.

You can execute some script or browse the older scripts from the pig menu.
                                                  Fig: Accessing and running pig script from remote machine.

3) Talend open studio:
   You can download the talend open studio for big data (not the data integration ) from the following link:

    If you want you can download Talend Big Data Sandbox for any of the hadoop provider but as i wanted to keep these two on separate machine which would be the ideal case.
    It would be an .exe file if you are installation on windows, execute it and select the installation location and you are done with the talend open studio installation.

 4Create a job for data load on HDFS:
     What we want to do here is to create a job  which generate few records and write it to HDFS on the hadoop we have installed in the step 2. Createing and running the job is happening on my local box and hadoop is running on some remote machine.

      At the end it would look like following and execution of the job would create a file of 100 records in HDFS of hadoop.
     a) Start talend by starting TOS_BD-win-x86_64.exe
     b) Create a new project and open it.
     c) Design a job: Following steps would make job designing complete.
 i)) Create a new job by right click on the Repository->Job Design
Fig: Create a new job
ii) Create a new HDFSConnection to connect the the HDFS. As this is going to connect to the sandbox which we have installed earlier, provide the IP and Port (8020) for NameNode URI.
Fig: Create a HDFS Connection and set the URI to the sandbox IP and Port (8020)

  iii) Create a tRowGenerator object which would generate the required rows based on the schema created for it. For this example i have created 100 rows. Other information about the schema is available the below images.

Fig: tRowGenerator which would generate 100 rows data.

     Fig: Schema for the data generation

     iv) Link tHDFSConnection to tRowGenerator: Right click on the tHDFSConnection_1 and click on trigger-> On subjob ok and connect to tRowGenerator_1. This would move start the generator work after connection is successful.
  v) Write data to HDFS using tHDFSOutput. Right click on the tRowGenerator_1 and drag it to tHDFSOutput_1. If you need to see the generated output, you can add tLogRow_1 and connect this also in the similar way.

Fig: tHDFSOutput object to write to 

    Now our design is complete, save it.

     d) Run/Execute job to create the file in HDFS. You can either click on the F6 or Run the job by going the Run tab and click on Run button. On Successful execution you would see the filter with the mygeneratedout.csv in the location provided in tHDFSOutput_1, which is "/user/hue/mygeneratedout.csv"

   You can click on the file to see the content.

Thanks for reading the blog. :-)

Soon would be write few more blogs on Hadoop and Talend which explains their specific feature in more details.