Tuesday, February 18, 2020

Understanding Data Types and Levels of Measurement for Machine Learning Data


In today's world, a good data scientist needs sound knowledge of statistics and of a computer programming language (for data extraction, handling big data, data wrangling, data visualization, model creation, etc.).

To understand statistics, we first have to understand data. In this blog, we will look at the types of data and their measurement scales.

1. Types of data/variables:

A statistician/data scientist will come across different types of data. Broadly, there are two types:

a. Categorical variable (qualitative variable): This kind of variable holds qualitative values rather than numerical values. For example:

I. A variable recording whether a person has an e-mail account. It will contain the value 'Yes' or 'No'.

II. A variable recording the internet provider for residential houses. It will contain values such as 'Airtel', 'Vodafone', 'Idea', etc.



b. Numerical variable (quantitative variable): These variables contain numerical data. Numerical variables can be further categorised into two:

I. Discrete variable: These variables contain counts. For example:

A variable recording the number of children in a family. It will contain values such as 1, 2, 3, 4, etc. The number can't be fractional, because the value comes from a counting process.

II. Continuous variable: These variables contain data that come from a measuring process. For example: a variable recording the waiting time in a movie theatre queue. It can take values such as 5 min, 5.1 min, 5.6 min, etc.
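To make this concrete, here is a minimal sketch in Python (assuming pandas is installed; the column names and values are illustrative, taken from the examples above) showing how the four kinds of variables might be represented:

    import pandas as pd

    df = pd.DataFrame({
        "has_email": ["Yes", "No", "Yes"],           # categorical
        "provider": ["Airtel", "Vodafone", "Idea"],  # categorical
        "children": [1, 2, 3],                       # numerical, discrete (counts)
        "wait_min": [5.0, 5.1, 5.6],                 # numerical, continuous (measurements)
    })

    # Cast the qualitative columns to pandas' category dtype so that
    # downstream code treats them as labels, not as free text or numbers.
    df["has_email"] = df["has_email"].astype("category")
    df["provider"] = df["provider"].astype("category")
    print(df.dtypes)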

[Figure: the types of data/variables]



2. Levels of measurement:

The different types of variables can be further classified based on their level of measurement, or measurement scale.

Categorical variables can be measured on a nominal scale or an ordinal scale.

· Nominal scale: The nominal level represents categories that cannot be put in any order. For example: a variable containing the season, i.e. winter, spring, summer, autumn.

· Ordinal scale: The ordinal scale represents categories that can be ordered. For example:

A variable containing the rating of a meal. It can hold values like 'bad', 'good', 'excellent', etc. Similarly, the different designations in an office. These have an inherent order.
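As a small illustration, pandas can record this inherent order explicitly with an ordered categorical (a minimal sketch; the rating values are the ones from the example above):

    import pandas as pd

    # Ordinal: categories with a declared order, so comparisons make sense.
    ratings = pd.Categorical(
        ["good", "bad", "excellent", "good"],
        categories=["bad", "good", "excellent"],  # the inherent order
        ordered=True,
    )
    print(ratings.min(), ratings.max())  # bad excellent

    # Nominal: no order is declared, so ordering operations are not meaningful.
    seasons = pd.Categorical(["winter", "summer", "spring"], ordered=False)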

Numerical variables can be measured on an interval scale or a ratio scale.

· Interval scale: Here the variable is measured on an ordered scale in which the difference between measurements is meaningful, but there is no true zero point.

For example: temperature measured in degrees Celsius or Fahrenheit. Here we can't say that 20°C is twice as hot as 10°C. In the same way, 0°C does not mean there is no temperature.



· Ratio scale: Here the variable is measured on an ordered scale in which the difference between measurements is meaningful and there is a true zero point.

For example: age. Here we can say that a 10-year-old boy is twice as old as a 5-year-old boy, and 0 years means the person does not exist yet.
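A quick numeric sketch makes the difference clear: on an interval scale the ratio of two values changes when the unit changes, while on a ratio scale it does not.

    # Interval scale: Celsius ratios do not survive a change of unit.
    c1, c2 = 10.0, 20.0
    f1, f2 = c1 * 9 / 5 + 32, c2 * 9 / 5 + 32  # 50 F and 68 F
    print(c2 / c1)  # 2.0  -> looks like "twice as hot"
    print(f2 / f1)  # 1.36 -> same temperatures, different ratio, so the ratio is meaningless

    # Ratio scale: age ratios survive a change of unit because zero is a true zero.
    a1, a2 = 5, 10
    print(a2 / a1)                # 2.0 in years
    print((a2 * 12) / (a1 * 12))  # still 2.0 in months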

[Figure: the levels of measurement]




References:

1. Book: Statistics for Managers by David M. Levine, David F. Stephan and Kathryn A. Szabat

Sunday, February 9, 2020

The DCOVA Framework of Machine Learning



If we need to study data for a given goal, certain steps need to be followed. These steps are collectively called a framework. At a high level, the DCOVA framework lays out the steps that need to be performed for machine learning.


What does the DCOVA framework stand for?


DCOVA stands for:

D - Define

C - Collect

O - Organize

V - Visualize

A – Analyze





Let’s discuss the above points in detail:

Define:


The first step is to define the problem statement clearly and identify the data required to solve the problem.


For example: the problem statement is to detect cancer in a patient at an early stage.

To solve this problem, we first need to identify the features that could be useful for detecting cancer. Note that feature selection should be done with the help of a domain expert.


Note: There is no guarantee that a selected feature actually has an impact on the outcome, i.e. the detection of cancer. But selecting it with the help of a domain expert gives us the best initial guess. It is in the realm of machine learning to identify the relation between a feature and the outcome and to see how closely they are related.

Once the features have been identified, we need to see what the sources of the data are. For the given example, the data could be residing in disparate systems or on the storage devices of multiple hospitals.


Collect:

The second step of the framework is to collect data from the different sources. It may require retrieving structured data (databases, etc.) or unstructured data (Twitter, WhatsApp, etc.).
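A minimal sketch of this step in Python (the file, database and table names below are illustrative assumptions, not real sources):

    import sqlite3
    import pandas as pd

    # Structured data: a flat file and a relational database.
    sales = pd.read_csv("sales.csv")
    conn = sqlite3.connect("hospital.db")
    patients = pd.read_sql("SELECT * FROM patients", conn)

    # Unstructured data: raw text collected from a feed, one record per line.
    with open("tweets.txt", encoding="utf-8") as f:
        tweets = f.readlines()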



Organize and Visualize:

The next step is to organize and visualize the data; these two go hand in hand. The collected data needs to be organized and brought into a machine-readable format so that it can be fed to statistical algorithms for analyzing and predicting the output. Organizing the data also involves tackling problems such as missing values, outliers and duplicates. Visualization is used to understand how a feature's data is spread around the mean, to understand the relationships between different features, and so on. Different feature engineering techniques are used to bring the data into a format that conforms to the requirements of the different statistical algorithms. Feature selection techniques are also used to narrow down to the features that have a say in the output.
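A minimal sketch of these two steps (assuming an illustrative file "patients.csv" with a numeric column "tumor_size"; pandas and matplotlib are required):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("patients.csv")

    # Organize: duplicates, missing values and outliers.
    df = df.drop_duplicates()
    df["tumor_size"] = df["tumor_size"].fillna(df["tumor_size"].median())
    z = (df["tumor_size"] - df["tumor_size"].mean()) / df["tumor_size"].std()
    outliers = df[z.abs() > 3]  # rows far from the mean

    # Visualize: the feature's spread around the mean.
    df["tumor_size"].hist()
    plt.show()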


Analyze:

Once the data conforms to the requirements of the statistical algorithms, it is split into two sets: training data and testing data. The training data is used to train the model, and the model's accuracy is measured against the test data. Various algorithms are tried to find the one that gives the maximum accuracy, and the best model is selected for deployment in production.
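A minimal sketch of this step with scikit-learn (using a built-in data set as a stand-in for the organized data from the earlier steps):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)

    # Split into training data and testing data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Train on the training data, measure accuracy on the test data.
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))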


Machine learning is mainly used to solve three kinds of problems (a short sketch of all three follows the list below):

  1. Regression problem: Here ML is used to predict a future outcome based on historical data. This is part of supervised learning, where we have features to predict the label (outcome).



           Ex: predicting sales volume for a future date using historical sales data.


   2. Classification problem: Here ML is used to classify data into different categories, like Yes/No, etc. This is also part of supervised learning, where we have features to predict the label (outcome).

Ex: predicting cancer in a person based on different parameters. This is a classification problem where the outcome will be Yes or No.


  3. Clustering problem: Here ML is used to group data into different clusters based on the commonness of the data. This is part of unsupervised learning: features are collected, but there is no label to predict.

Ex: for marketing a product, understanding the demography or commonness of the population. A clustering exercise is used to make clusters (groups) based on the similarity/dissimilarity within the population.
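Here is a minimal sketch of all three problem types with scikit-learn, using its built-in toy data sets as stand-ins for real business data:

    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_diabetes, load_breast_cancer, load_iris

    # 1. Regression: predict a continuous outcome from features.
    X, y = load_diabetes(return_X_y=True)
    print(LinearRegression().fit(X, y).predict(X[:1]))

    # 2. Classification: predict a Yes/No style label.
    X, y = load_breast_cancer(return_X_y=True)
    print(LogisticRegression(max_iter=5000).fit(X, y).predict(X[:1]))

    # 3. Clustering: group unlabelled data by similarity (no y is used).
    X, _ = load_iris(return_X_y=True)
    print(KMeans(n_clusters=3, n_init=10).fit_predict(X)[:10])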



Note: This is the high-level understanding of the ML framework I wanted to share with you. More insight into ML/DL will follow. Keep an eye on the blog.

 


Thursday, February 6, 2020

Data Science (DS) Vs Artificial intelligence (AI) Vs Machine Learning (ML) Vs Deep Learning (DL)


Nowadays we often hear buzzwords like Data Science (DS), Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL). In this blog, we will try to understand each term to avoid any confusion.

What is Data Science?

“Data science is a field of study of data.” Here data is studied, analyzed and processed so as to gain more information from it.

As per Wikipedia:
“The term ‘data science’ has appeared in various contexts over the past thirty years but did not become an established term until recently.”

Why has data science become so important lately?

As you hear very often, data is the new oil. With the Industrial Revolution, oil (the main source of energy) became so important that every country came to rely on it to run its economy. Even at present, oil is the main source of energy, and a small disruption in its flow brings chaos to the world.
In a similar way, with the advent of the technological revolution, data has become one of the most important tools for gaining an edge in the highly competitive world of the market economy. Businesses with more insight into their data thrive with competitive advantages. Insight into a business scenario comes from analyzing and understanding the data, and timely action based on that insight helps businesses beat their competition. Nowadays, many companies have a data science department to grow their business.

Currently, data is used to do the following types of analytics:

1. Descriptive analytics: Here business data is studied to get more insight into it, such as understanding trends, biases, variations, etc. This is mainly about data mining and data aggregation/disaggregation to understand ‘What has happened?’ (see the sketch after this list).

For example: slicing and dicing sales data to understand which area/product contributes most to overall sales.

2. Predictive analytics: Here data is used to forecast with the help of statistical and forecasting tools. This is mainly about ‘What could happen?’

For example: using past sales data to forecast future sales.

3. Prescriptive analytics: Here optimization and simulation algorithms are used to advise on the possible steps that need to be taken in a given scenario.

For example: recommending the promotion that should be used to increase sales.
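As a small illustration of descriptive analytics, here is a sketch that slices illustrative (made-up) sales data to see which region and product contribute most to overall sales:

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["North", "North", "South", "South"],
        "product": ["A", "B", "A", "B"],
        "amount": [120, 80, 200, 150],
    })

    print(sales.groupby("region")["amount"].sum())   # sales by region
    print(sales.groupby("product")["amount"].sum())  # sales by product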


What is Artificial Intelligence (AI)?

Artificial intelligence is about equipping a machine with person-like intelligence.
A person's intelligence comes from data gathered through various senses, such as sight, touch and smell, which the brain then processes to make decisions or act.
Similarly, artificial intelligence comes from gathering, understanding and learning from data, and making decisions based on it. With technological advancements in computing/processing power, machines are able to gather, store and process a lot of data to get the insight to describe, predict and prescribe. Artificial intelligence has become a buzzword since it is now widely used in almost all sectors, like health, finance, manufacturing, retail, etc. It has entered our day-to-day life too, in the form of Google Assistant, Apple's Siri, Amazon's Alexa, etc.

Machine Learning (ML)

Machine learning is a subset of AI, or can be termed an implementation of AI.
Here the machine learns data patterns from the large amount of data provided to it, and then uses that information to understand new incoming data. Mathematical/statistical models and programming languages are used to implement it. There are different forms of machine learning:

1. Supervised learning:
Here historical data (training data) is used to understand the relationship between the independent features (input) and the labelled data (outcome). A statistical model is built from the training data, and the outcome for any new data (test data) is predicted with this model.

2. Unsupervised learning:
In unsupervised learning, the data does not have an outcome column. Here the model uses the intrinsic patterns of the data to learn and give insight. Clustering is one such technique, where data points are grouped together based on their similarity (see the sketch after this list).

3. Semi-supervised learning:
Here the model uses concepts from both supervised and unsupervised learning to gain insight from the data.

4. Reinforcement learning:
This kind of learning doesn't use an answer key to guide the execution of a function. The lack of labelled training data results in learning from experience: a process of trial and error that is steered towards long-term rewards.
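A minimal sketch contrasting the first two forms on the same toy data set (scikit-learn's iris data as a stand-in):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X, y = load_iris(return_X_y=True)

    # Supervised: the labels y guide the model.
    supervised = LogisticRegression(max_iter=1000).fit(X, y)

    # Unsupervised: only the intrinsic pattern of X is used; y is ignored.
    unsupervised = KMeans(n_clusters=3, n_init=10).fit(X)
    print(supervised.predict(X[:5]), unsupervised.labels_[:5])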

Deep Learning (DL):

Deep learning is a subset of machine learning that tries to mimic the human brain. Just as the brain receives information and compares it with known items before making sense of it, deep learning compares predicted values with outcomes and tries to self-learn. It uses the concept of neural networks to do so, learning from the given data without much human intervention. It is an evolution of ML in the sense that, apart from what ML can do, it can also work on large data sets as well as complex scenarios where classical machine learning struggles, like speech recognition, image recognition, etc.
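As a small taste, here is a minimal sketch of a neural network using scikit-learn's MLPClassifier on a small image data set; real deep learning typically uses dedicated libraries such as TensorFlow or PyTorch, with much deeper networks and far more data:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)  # 8x8 images of handwritten digits
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Two hidden layers of neurons; the network learns features by itself.
    net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    net.fit(X_train, y_train)
    print(net.score(X_test, y_test))  # accuracy on unseen images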



 

[Figure: diagrammatic depiction of DS/AI/ML/DL]