Cluster Analysis

Cluster analysis has a vital role in numerous fields we are going
to see it in the banking business to segment customers into small groups that
can later be targeted for future marketing activities.

In machine learning and data mining it is used to efficiently find nearest neighbours
and in summarization.

Cluster analysis aims to group
data objects based on the information that is available that describes the
objects and their relationships. The main goal is to group similar objects
together, and the greater the similarity within a group the better and the
greater the difference between group the more diverse the clustering. A
clustering is an entire collection of clusters; a cluster on the other hand is
just one part of the entire picture. There are different types of clusters and
also different types of clustering.

##
Types of Clustering Algorithms

1.
**Partitioning-based clustering:** are
algorithms that determine all the clusters at once in most cases.

o K-means
clustering

o K-medoids
clustering

o EM
(expectation maximization) clustering

2.
**Hierarchical clustering:** these algorithms find successive clusters using previously
established ones.

o Divisive
clustering is a top down approach.

o Agglomerative
clustering is a bottom up approach.

With the help of data mining methods, such as clustering
algorithm, it is possible to discover the key characteristics from the bank’s
data and possibly use those characteristics for future prediction also.

According to the facts established based on the information
released by banks, attracting new customer costs five to six times more than
customer retention. Retaining existing customers is the best core marketing
strategy to become profitable in the very competitive banking industry. In
order to maximise the profit, how to retain the existing customers has become a
subject to be urgently solving for banks.

Use
clustering to produce an initial working hypothesis, refine this hypothesis,
then use prediction to generalize the refined hypothesis to data and evaluate
how well it performs.

For this study
following was the Bank customer data for the Balance distribution.

Before doing the cluster analysis above data was transformed based
on the type or range of values. This would help us to do the scaling and find
number of cluster quickly. We can keep the data as it is but I opted to find
clustering based on the simplified data.

**Balance
conversion based on range:**

=IF(AND(Sheet2!I2>=0,Sheet2!I2<=1500),"1",IF(AND(Sheet2!I2>1500,Sheet2!I2<=3000),"2",IF(AND(Sheet2!I2>3000,Sheet2!I2<6000 -=""> If Balance between 0 to 1500 : 1, If Balance between 1501 to 3000:2, If
Balance between 3000 to 6000:3, Greater than 6000 :4

**Age
conversion based on the range:**

=IF(AND(Sheet2!E2>=20,Sheet2!E2<=35),"1",IF(AND(Sheet2!E2>35,Sheet2!E2<=45),"2",IF(AND(Sheet2!E2>45,Sheet2!E2<=60),"3","4")))

**Customer
type conversion based on type::**

=IF(Sheet2!C6="Enterprise","1","2")

**Product
conversion ::**

=IF(Sheet2!C6="SavingAccount","1","2") -> 1 if SavingAccount, 2 if CurrentAccount

So
we got the data transformed in the following way:

Find No Of Cluster

1) Load csv data from file:

2) Once data is loaded, per form the scaling of data by
executing

Ã˜
bDataScale <- ankcustomerdata="" o:p="" scale="">

3)
Loading NBClust library and find the number of clusters for the
data available in bdata,

Ã˜
library(NbClust)

Ã˜
nc <- bdatascale="" max.nc="15," method="kmeans" min.nc="2," nbclust="" o:p="">

Above
command takes a while to calculate.

Following is the output of
NbClust function

The Hubert index is a graphical
method of determining the number of clusters.

In the plot of Hubert index, we
seek a significant knee that corresponds to a significant increase of the value
of the measure i.e the significant peak in Hubert index second differences plot.

The D index is a graphical method
of determining the number of clusters. In the plot of D index, we seek a
significant knee (the significant peak in Dindex second differences plot) that
corresponds to a significant increase of the value of the measure.

All 1000 observations were used.

*******************************************************************

* Among all indices:

* 4 proposed 2 as the best number
of clusters

* 9 proposed 3 as the best number
of clusters

* 1 proposed 7 as the best number
of clusters

* 1 proposed 8 as the best number
of clusters

* 2 proposed 9 as the best number
of clusters

* 2 proposed 11 as the best
number of clusters

* 2 proposed 12 as the best
number of clusters

* 2 proposed 15 as the best
number of clusters

***** Conclusion *****

* According to the majority rule, the best
number of clusters is 3

4)
Plot the chart with the number of cluster we have obtained.

Calculate K-means cluster

Now to find the cluster based on Quarterly average balance, bdata loaded from .csv file is

Here I want to consider only
AvgBalQ1, AvgBalQ2, AvgBalQ3, AvgBalQ4 for kmeans cluster.

Ã˜ test <-bdata c="" o:p="">

Scale data

Ã˜ scaledata
<-scale o:p="" test="">

Set seet so that every time we calculate the kmeans it would be
consistent

Ã˜ set.seed(1234)

Find the number of cluster/center as 3,

Ã˜ km <- centers="3," iter.max="500)<o:p" kmeans="" nstart="10," scaledata="">

Check the cluster size data, 401 data in cluster 1, 310 is cluster
2 and 289 in cluster 3.

Ã˜ km$size

[1] 401 310 289

Let’s have a look of the function kmeans

kmeans(x, centers, iter.max = 10,
nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd",
"Forgy", "MacQueen"))

__Input to kmeans function__

**x:** A numeric matrix of data, or an object that can be coerced
to such a matrix (such as a numeric vector or a data frame with all numeric
columns).

**centers:** Either the number of clusters or a set of initial (distinct)
cluster centers. If a number, a random set of (distinct) rows in x is chosen as
the initial centers.

**iter.max:** The maximum number of iterations allowed.

**nstart:** If centers is a number, nstart gives the
number of random sets that should be chosen.

**algorithm:** The algorithm to be used. It should be one
of "Hartigan-Wong", "Lloyd", "Forgy" or
"MacQueen". If no algorithm is specified, the algorithm of Hartigan
and Wong is used by default.

__Result returned from kmeans function call__

**cluster:** A vector of integers indicating the cluster to which each
point is allocated.

**centers:** A matrix of cluster centers.

**whithnss:** The within-cluster sum of squares for each cluster.

**size:** The number of points in each cluster.

**(Fig: Quarter
wise Product data distribution)**

Now if bank wants take some necessary actions on the data by
grouping the customer base and start a product or campaign for them, For e.g.

a)
If we see the above data distribution where bank wants to suggest
the high net worth/higher account balance customer from the saving account to opt
for other product, or

b)
If bank wants to offer those customers or a group of customers
(cluster) additional services without charges.

Partitioning of data and making cluster/group would be helpful to
take necessary actions for every cluster. It would be easy to group customers
in cluster and then plan business activities on respective clusters
accordingly.

**K-means** clustering is the most popular partitioning method. It
requires the analyst to specify the number of clusters to extract.

So if we group data in 3
clusters (see the next chapter to see how to do we get 3 as a number of cluster)

**(Fig: Quarterly
Average Balance with 3 cluster data distribution)**

If we see these four
graphs for every quarter, Cluster 3 (in blue color) has some customer whose
balance is greater than customer in other clusters. We can mark these clusters
back to the original data and find those specific customers from the cluster to
take appropriate action.

Kmeans clustering with 3 clusters of sizes 401, 310, 289

Cluster means:

AvgBalQ1 AvgBalQ2 AvgBalQ3 AvgBalQ4

Cluster 1 4019 2189
5059 5497

Cluster 2 2808 6413
3630 4044

Cluster 3 5611 6527
7222 6638

Percentage of within cluster variance accounted for by each
cluster:

Cluster 1: **42.22%, ** Cluster 2: **26.07%** Cluster 3: **31.72%**

**Note:**

For every measure or non-factor or numeric value based parameter
we choose these graph would be keep
changing. Because their cluster means would be changing and so their
variance.

For e.g. If we try to obtain data with 3 cluster but by including
age also in the data, than our cluster size, mean and cluster data percentage
changes.

Kmeans clustering with 3 clusters of sizes 353, 366, 281

Cluster means:

Age AvgBalQ1 AvgBalQ2 AvgBalQ3 AvgBalQ4

Cluster 1 31.5
3658
3858 5972 4869

Cluster 2 59.8 3366 4640 2972
4882

Cluster 3 57.1 5624 6023 7279
6659

Percentage of within cluster variance accounted for by each
cluster:

Cluster 1: **34.21%**, Cluster
2: **34.84%**, Cluster 3: **30.95%**

Now lets do some more analysis with the with the distribution of the data.

Data Interpretation: Understand the data based on the different
combination of factors/dimensions and measures.

**Factors/Dimensions:** A
dimension is a broad grouping of related data about a major aspect of your
business. For example, you have a dimension called Products.

**Measure:** A measure
is a performance indicator that is quantifiable and used to determine how well
a business is operating. For example, useful measures may be Average yearly
balance.

In the following figures, I tried to figure out only two
interpretation (there could be more) and two possible actions which can be
taken.

Note: I not sure which city exists in which state in US. I have
just generated cities and states using talend (how to generate data using talend, check in the previous post) for the purpose of this study, as
these cities are states were available in RowGenerator for generating records and
moved some of them to make data look better.

**(Fig: Product
wise Gender data distribution)**

If we try to analyse the above graph, it suggest that there are
few female customer for a bank who has current account as product. Which are
interesting and a bank can interrupt it in two ways:

I.
Interpretation:

a.
Either these customers are unware about which product they should
have and they are paying unnecessary charges for CA, rather than going for SA.

b.
Or, they are the some entrepreneur or running small business.

II.
Actions:

a.
Customer can be notified by the bank executive about the possible
change in their product from CA to SA and win the loyalty of customer by
showing them bank care for their customer’s money.

b.
Marketing team target these customers with their specific needs or
keep these customer’s in mind for the future product which would meet for these
specific female customer who has CurrentAccount and are entrepreneur.

**(Fig: Age
wise Gender data distribution)**

I.
Data Interpretation:

a.
Male customers are more rather than female or enterprise
customers.

b.
Enterprise customer’s mainly lying in the range of age from 38-67.

II.
Possible Actions:

a.
Either products for Enterprises and Female customer are not very
effective, so few new product which is more suitable to them can be launched
with some campaign, etc.

b.
Start-up(Enterprise) does not have any enough accounts with the
bank as the most of the customers are in the range of 40-70 and few of them are
from 20-40. If an organization/startup can open account with the bank, there
would be chances to get more salary accounts also from the same enterprise
customer.

**(Fig: City
wise Gender data distribution)**

I.
Data Interpretation

a.
Albany city has more enterprise customers than all other cities.

b.
Atlanta city has more personal customer (male + female)

II.
Possible Actions:

a.
Any campaigns which are targeted for personal customers should
include cities like Atlanta.

b.
Any campaigns which are targeted for enterprise customer should
include cities like Albany.

**(Fig: City
wise Gender and State wise Gender data distribution)**

**(Fig: Per Quarter
Average Balance per Gender)**

I.
Data Interpretation

a.
Q3 and Q4 average balance increases in comparison to Q1 and Q2.
See the movement of boxplots not only the dots, in all quarters.

b.
Q1 and Q2 Enterprise customer are more towards lesser balance in
comparison to Q3 and Q4.

II.
Possible Actions:

a.
If these are the regular trends for every year, that shows that cash
flow for these customer increases in
last two quarters than it would be more appropriate time for having campaigns.
Bank would be able to get better return on their investment on those campaigns.

ROA = Margin * Asset Velocity

Asset velocity = Sales / Assets

More sales with more competitive product.

More sales mean more asset velocity.

More asset velocity means more return on
assets.

More ROA is more profit.

b.
Bank can offer few additional facilities for the customer who are
not using overdraft or limit facility. Or based on the economic environment (linking* of external data with bank dataset*) in
the country if organisations are looking for funds for capital expenditure,
bank can offer products accordingly.

** (Fig1: Gender wise Product data distribution) (Fig2: Yearly Average Balance per Product)**

I.
Data Interpretation

a.
There are more female current account customers than male. (Fig 1)

b.
There are few customers who have saving account but their average
balance is more than the normal data population. (Fig 2)

II.
Possible Actions:

a.
Create new product or launch campaign or correct current products
to increase the male customer base with current account.

b.
Customer can be suggested to move their funds from savings
accounts to fixed deposit this would help bank to win the loyalty of customer
by showing them bank care for their customer’s money.

**(Fig: Quarter
wise State data distribution)**

I.
Data Interpretation

a.
Alaska and Louisiana has more customer base. Florida has the least
customer base.

II.
Possible Actions:

a.
Specific marketing campaigns should be planned to get customer
base in the states where number of customer are not enough.

b.
If banking operations (operational efficiency) are the problem for
the lower number of customer which can be identified by

i. Checking
the number of accounts/customer trend every year and see if customer base has a
diminishing trend in every year. Find the exact problem and take corrective
actions for it, one of the example for reducing customer base is the
operational efficiency. So perform check about operational efficiency of every
branch in Florida and other lower customer base states.

ii. There was
not enough marketing or campaign done as these were not the area of focus in
the previous years.

Cluster analysis is one of the important technique to analyse your data, which is not easy specially if you are not statistician/mathematician. R is wonderful tool to do all the analytics work with a ease. I have used R, RStudio and ShinyApp (Web-based framework for R for more practical visualization and statistical analysis )