Data Mining, Classification, And Clustering: The Building Blocks Of Analytics And Business Intelligence By Pankaj Dikshit, SVP (IT) at Goods and Services Tax Network

Pankaj Dikshit, SVP (IT) at Goods and Services Tax Network | Monday, 14 October 2019, 06:55 IST

We are all familiar with the term database. It brings to mind a structured data set with columns, rows, and sheets that can be used to store information and retrieve it at will. Databases store known information, organized in a well-formed template or schema. When needed, a database can be queried to reach a specific point in the schema and extract a particular piece of data. This is made possible by indexing and by knowing the schema of the database. Hence it is said that databases are used to extract trivial information.

However, how does one extract information that is unknown? Consider the question: “What were the factors that led to people getting a job in this company?” A database query cannot answer it, because the answer is non-trivial information that is not readily available at the intersection of a row and a column in a particular sheet. The answer has to be inferred from the data, and the science of doing so is called Data Mining. In the 1980s, when the retail boom was gathering momentum in the USA, a leading retail chain approached a well-known IT giant to research how it could increase sales of its merchandise. This led to the authoring of many papers on ‘Data Mining’. Sales data from the retail chain was used to develop algorithms that assess patterns of sales. For example, look at the table below:

From this set of data, the task was to find which items form the best combinations, such that when one is bought the other is most likely to be bought as well. The aim was that, if the association could be deduced, placing those items together would maximize their sales. The first basic algorithm that helped to answer that question was the Apriori algorithm. It builds candidate item sets level by level (1-item sets, then 2-item sets, then 3-item sets), counts how frequently each one occurs, discards those that fall below the required minimum support, and so arrives at the final item set that is the most frequently selling combination. In the set of items listed above, milk, beer, and sugar are the ones that were deduced to be the most likely purchases (at a minimum support of 30% and a confidence of 70%). With this basic algorithm, the journey of data mining began and never looked back. Data mining is computationally very expensive, so algorithms were needed that use less and less computing power and memory. Naturally, Apriori was followed by improvements and better algorithms, e.g. AprioriTID, Frequent Pattern (FP) Growth, Trie-based Apriori, Hash-based Apriori, Parallel Apriori, and the Pincer-Search algorithm. These algorithms attempt to reduce the number of steps and the order of computation, searching top-down, bottom-up, or in both directions.
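The level-by-level search described above can be sketched in a few lines of Python. This is only a minimal illustration, not an optimized implementation: the transactions are invented for the example (they are not the table referred to in the article), and the names `support` and `apriori` and the 60% support threshold are assumptions chosen for readability.

```python
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
TRANSACTIONS = [
    {"milk", "bread", "sugar"},
    {"milk", "beer", "sugar"},
    {"bread", "butter"},
    {"milk", "beer", "sugar", "bread"},
    {"beer", "sugar"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]  # candidate 1-item sets
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the minimum support.
        level = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join frequent k-item sets into (k+1)-item candidates, then prune
        # any candidate that has an infrequent k-item subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


frequent_itemsets = apriori(TRANSACTIONS, min_support=0.6)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 2))
```

The pruning step is what keeps the search tractable: an item set can only be frequent if all of its subsets are frequent, so most larger candidates are never counted at all.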

Businesses, not content with knowing only the most frequently selling set of items, were also keen to know the relationships that lead a buyer from one item to another. Association Rule Mining deals with this relationship. It works from the sales frequency of items and follows on from the results of the data mining process. Businesses wanted to know, for example from the above table, the best combination of products: if I bought milk, what would I most likely purchase with it? Or if I bought bread and milk, what would be my next most likely purchase? These answers are provided by Association Rule Mining, where antecedent/consequent rules are formed to identify the most likely combinations.
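An antecedent/consequent rule is usually scored by its confidence, that is, the share of transactions containing the antecedent that also contain the consequent. The sketch below illustrates the idea under stated assumptions: the transactions are invented for the example, the helper names `confidence` and `best_consequent` are hypothetical, and only single-item consequents are considered.

```python
# Hypothetical transactions, invented for illustration only.
TRANSACTIONS = [
    {"milk", "bread", "sugar"},
    {"milk", "beer", "sugar"},
    {"bread", "butter"},
    {"milk", "beer", "sugar", "bread"},
    {"beer", "sugar"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


def best_consequent(antecedent, transactions):
    """Return (item, confidence) for the single item most likely to
    accompany the antecedent. The antecedent must occur at least once."""
    candidates = {item for t in transactions for item in t} - antecedent
    scored = [
        (confidence(antecedent, {item}, transactions), item)
        for item in sorted(candidates)
    ]
    conf, item = max(scored)
    return item, conf


# "If I bought milk, what would I most likely purchase with it?"
print(best_consequent({"milk"}, TRANSACTIONS))
# "If I bought bread and milk, what is my next most likely purchase?"
print(best_consequent({"bread", "milk"}, TRANSACTIONS))
```

In practice the rules are generated from the frequent item sets found earlier and filtered by a minimum confidence, rather than scored against every item as this sketch does.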

The next step in the evolution towards machine learning is classification. Classification is the most common machine learning activity. It takes a set of data that has already been classified using established methods and verified, and builds a model from this verified data. The model is then run on new, similar data to assign classes to this unclassified data. An example data set which is already classified is tabled below:

The data provides information on a set of loanees who failed to fully repay their loans. Using this data set, the classification algorithms build a model and train themselves. The model is then applied to a similar set of live data (known as test data) to assess how likely loan seekers with similar characteristics are to repay their loans. The most common classification algorithms are decision trees, which can be binary or multi-way. Data sets for training classification models are available from multiple sources. The University of California, Irvine (UCI) repository is popular, as it offers a wide variety of data sets to choose from. Other sources of data also exist that can be used to train classification models. Popular decision tree algorithms include ID3, C4.5, and SLIQ. They grow the tree by choosing split points (decision points) according to a split measure. The leaf nodes of a decision tree are the classes. Classification is termed supervised learning, since it uses a data set with verified classes to train its model, on the basis of which it predicts the classes of the test data sets.
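To make the idea of a split measure concrete, the sketch below computes information gain, the measure used by ID3, and picks the attribute that best separates the classes. It is a minimal illustration only: the loan records and the attribute names `owns_home`, `employed`, and `repaid` are invented for the example and are not the article's data.

```python
from collections import Counter
from math import log2

# Hypothetical, already-classified loan records, invented for illustration.
RECORDS = [
    {"owns_home": "yes", "employed": "yes", "repaid": "yes"},
    {"owns_home": "yes", "employed": "no", "repaid": "no"},
    {"owns_home": "no", "employed": "yes", "repaid": "yes"},
    {"owns_home": "no", "employed": "no", "repaid": "no"},
    {"owns_home": "yes", "employed": "yes", "repaid": "yes"},
    {"owns_home": "no", "employed": "no", "repaid": "no"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Reduction in entropy of target after splitting on attribute."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder


def best_split(records, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(records, a, target))


root = best_split(RECORDS, ["owns_home", "employed"], "repaid")
print("split on:", root)
```

A full decision tree learner would apply `best_split` recursively to each resulting subset until the subsets are pure or no attributes remain, at which point the majority class becomes the leaf.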

Clustering is unsupervised learning. It does not use verified, classified data sets to train its models. Instead, clustering attempts to group data points according to the proximity, or distance, between their features. The distance can be measured using techniques such as Manhattan distance, Euclidean distance, or Minkowski distance. If the groups in the data set are clearly demarcated, the result is referred to as crisp clustering; if a data point may belong to more than one group, it is termed fuzzy clustering.
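The three distance measures are closely related: Minkowski distance of order p reduces to Manhattan distance for p = 1 and to Euclidean distance for p = 2. A minimal sketch, with function names chosen for this example:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)


origin, point = (0, 0), (3, 4)
print(manhattan(origin, point), euclidean(origin, point))
```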

K-Means clustering is a popular clustering algorithm that uses Euclidean distance measured over the features. The number of clusters is defined in advance. The initial cluster centers are chosen randomly, the distance between each pattern and each cluster center is calculated, and every pattern is assigned to its nearest center. After the first iteration, new cluster centers are computed as the mean of the patterns assigned to each cluster, and the distances are calculated again. The algorithm terminates once the cluster centers remain the same in two consecutive iterations. Other popular techniques include Hierarchical Clustering (whose results are visualized with dendrograms) and Max-Min Clustering, while silhouette analysis is commonly used to validate the resulting clusters.
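The iteration can be sketched as follows. This is a minimal illustration under stated assumptions: the points are invented, the function name `kmeans` and its `seed` and `max_iter` parameters are choices made for this example, and squared Euclidean distance is used for the comparison because it gives the same nearest center as Euclidean distance.

```python
import random

# Hypothetical 2-D patterns forming two well-separated groups.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]


def squared_euclidean(a, b):
    """Squared Euclidean distance; preserves the nearest-center ordering."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, clusters) after running the K-Means iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each pattern joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its patterns.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Stop when the centers are unchanged between two iterations.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


centers, clusters = kmeans(POINTS, k=2)
print(sorted(centers))
```

Because the starting centers are random, K-Means can settle on different groupings for less clearly separated data, which is why it is often run several times with different seeds.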

Data mining, classification, and clustering are the basic building blocks for advanced data processing and for non-trivial data extraction, which is not possible through simple database querying. More advanced methods built on these same building blocks employ neural networks to classify data. Neural networks are well suited to this because they can be trained and subsequently applied to real data sets for classification.