If we need to study data for a given goal, there are certain steps that need to be followed. These steps are collectively called a framework. At a high level, the DCOVA framework describes the steps that need to be performed in a machine learning project.
What does DCOVA stand for?
DCOVA stands for:
D - Define
C - Collect
O - Organize
V - Visualize
A - Analyze
Let’s discuss each of these steps in detail:
Define:
The first step is to define the problem statement clearly and identify the data required to solve the problem.
For example: the problem statement is to detect cancer in a patient at an early stage.
To solve the problem in this example, we first need to identify the features that could be useful for detecting cancer. Please note that feature selection should be done with the help of a domain expert.
Note: There is no guarantee that the selected features have an impact on the outcome, i.e. the detection of cancer. But selecting them with the help of a domain expert gives us the best initial guess. It is then the job of machine learning to identify the relationship between each feature and the outcome and to see how closely they are related.
Once the features have been identified, we need to find out what the sources of the data are. For the given example, the data could reside in disparate systems or on the storage devices of multiple hospitals.
Collect:
The second step of the framework is to collect the data from the different sources. This may require retrieving structured data (databases etc.) or unstructured data (Twitter, WhatsApp etc.).
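To make the collection step a little more concrete, here is a minimal sketch. It assumes two hypothetical hospital exports, one in CSV and one in JSON. The field names (patient_id, age, tumor_size_mm) and the values are made up for illustration only; real sources will look different.

```python
import csv
import io
import json

# Hypothetical export from hospital A (structured, CSV).
HOSPITAL_A_CSV = """patient_id,age,tumor_size_mm
A1,54,12.5
A2,61,8.0
"""

# Hypothetical export from hospital B (JSON with a different key name).
HOSPITAL_B_JSON = '[{"id": "B1", "age": 47, "tumor_size_mm": 15.2}]'


def collect_records(csv_text, json_text):
    """Read both sources and return rows that share one set of keys."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "patient_id": row["patient_id"],
            "age": int(row["age"]),
            "tumor_size_mm": float(row["tumor_size_mm"]),
            "source": "hospital_a",
        })
    for item in json.loads(json_text):
        records.append({
            "patient_id": item["id"],  # hospital B calls this field "id"
            "age": int(item["age"]),
            "tumor_size_mm": float(item["tumor_size_mm"]),
            "source": "hospital_b",
        })
    return records


records = collect_records(HOSPITAL_A_CSV, HOSPITAL_B_JSON)
```

Keeping a "source" field on every row is a small design choice that makes it easier to trace a problem back to the system the data came from.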
Organize and Visualize:
The next step is to organize and visualize the data. These two go hand in hand. The collected data needs to be organized and brought into a machine-readable format so that it can be fed to statistical algorithms for analysis and prediction. Organizing the data also means tackling problems such as missing values, outliers and duplicates. Visualization is used to understand how a feature’s data is spread around the mean, how different features relate to each other, and so on. Different feature engineering techniques are used to bring the data into a format that conforms to the requirements of the different statistical algorithms. Feature selection techniques are also used to narrow down to the features that have a say in the output.
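As a rough illustration of the organizing part, the sketch below handles duplicates, missing values and outliers on a tiny made-up dataset. The rows, the choice of the median for filling gaps and the 1.5 * IQR rule for outliers are my assumptions here; other choices are just as valid depending on the data.

```python
import statistics

# Hypothetical raw rows; None marks a missing value.
raw_rows = [
    {"patient_id": "A1", "age": 54},
    {"patient_id": "A2", "age": None},
    {"patient_id": "A1", "age": 54},   # duplicate of the first row
    {"patient_id": "A3", "age": 61},
    {"patient_id": "A4", "age": 47},
    {"patient_id": "A5", "age": 58},
    {"patient_id": "A6", "age": 430},  # likely a data-entry error
]


def organize(rows):
    # 1. Drop duplicates, keeping the first row seen for each patient_id.
    seen, unique = set(), []
    for row in rows:
        if row["patient_id"] not in seen:
            seen.add(row["patient_id"])
            unique.append(dict(row))

    # 2. Fill missing ages with the median of the known ages.
    known = [r["age"] for r in unique if r["age"] is not None]
    median_age = statistics.median(known)
    for r in unique:
        if r["age"] is None:
            r["age"] = median_age

    # 3. Flag outliers with the 1.5 * IQR rule.
    q1, _, q3 = statistics.quantiles([r["age"] for r in unique], n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    for r in unique:
        r["is_outlier"] = not (low <= r["age"] <= high)
    return unique


clean_rows = organize(raw_rows)
outlier_ids = [r["patient_id"] for r in clean_rows if r["is_outlier"]]
```

The outlier is only flagged, not deleted. Whether to drop or correct it is usually a decision to take with the domain expert.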
Analyze:
Once the data conforms to the requirements of the statistical algorithms, it is split into two parts: training data and testing data. The training data is used to train the model, and the model’s accuracy is measured against the test data. Several algorithms are tried and compared to see which one gives the maximum accuracy. The best model is selected for deployment in production.
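The split, train, measure and select loop described above can be sketched as follows. This is a toy example under my own assumptions: the data is synthetic and deliberately easy, and the two "models" (a majority-class baseline and a single-feature threshold rule) are stand-ins for whatever algorithms you would really compare.

```python
import random

# Synthetic, deliberately easy data: (feature value, label) pairs.
data = [(i * 0.1, 0) for i in range(50)] + [(15 + i * 0.1, 1) for i in range(50)]

# Shuffle with a fixed seed, then keep 80% for training and 20% for testing.
rng = random.Random(42)
rng.shuffle(data)
split = len(data) * 8 // 10
train, test = data[:split], data[split:]


def fit_majority(rows):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(rows):
    """Predict 1 when x is above the midpoint of the two class means.

    Assumes class 1 has the larger feature values, as in the data above.
    """
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > cut else 0


def accuracy(model, rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


# Train every candidate on the training data, score it on the test data.
candidates = {"majority": fit_majority, "threshold": fit_threshold}
scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
best_name = max(scores, key=scores.get)
```

Printing scores and best_name shows which candidate would be picked. The important point is that the models never see the test rows during training.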
Machine learning is mainly used to solve 3 kinds of problems:
1. Regression problem: This is the area where ML is used to predict a numeric outcome, often a future value, based on historical data. This is part of supervised learning, where we have features to predict the label (outcome).
Ex: Predicting sales volume for a future date using historical sales data.
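Here is a small sketch of the sales example using a straight-line (least squares) fit. The sales figures are invented and perfectly linear on purpose so the result is easy to check by hand; real sales data will be noisier and may need a richer model.

```python
# Hypothetical history: (month index, units sold).
history = [(1, 100.0), (2, 110.0), (3, 120.0), (4, 130.0)]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(history)

# Predict the label (sales volume) for a future month.
forecast_month_5 = slope * 5 + intercept
```

With this data the fitted line has a slope of 10 units per month and the forecast for month 5 is 140 units.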
2. Classification problem: This is the area where ML is used to classify data into different categories like Yes/No etc. This is part of supervised learning, where we have features to predict the label (outcome).
Ex: Predicting cancer in a person based on different parameters. This is a classification problem where the outcome will be Yes or No.
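A Yes/No classifier can be sketched with a simple k-nearest-neighbours vote. The two numeric features and all the values below are invented toy numbers, not medical data, and k-nearest-neighbours is just one of many classification techniques I could have picked for the illustration.

```python
import math
from collections import Counter

# Hypothetical labelled examples: ((feature_1, feature_2), label).
training = [
    ((2.0, 1.0), "No"),
    ((1.5, 2.0), "No"),
    ((3.0, 1.5), "No"),
    ((7.0, 8.0), "Yes"),
    ((8.0, 7.5), "Yes"),
    ((6.5, 9.0), "Yes"),
]


def predict(point, rows, k=3):
    """Return the majority label among the k closest training rows."""
    nearest = sorted(rows, key=lambda r: math.dist(point, r[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction_a = predict((7.5, 8.0), training)
prediction_b = predict((2.0, 2.0), training)
```

Here prediction_a comes out as "Yes" and prediction_b as "No", because each new point sits inside one of the two groups. In practice the features should be brought to a comparable scale first, otherwise the feature with the largest numbers dominates the distance.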
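3. Clustering problem: This is the area where ML is used to group data based on similarity/dissimilarity. This is part of unsupervised learning, where there is no label (outcome) to predict.
Ex: For marketing a product, understanding the demography or commonalities of a population. A clustering exercise is used to make clusters (groups) based on the similarity/dissimilarity among the population.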
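To show what such a grouping exercise can look like, here is a bare-bones k-means sketch. The customers (age, monthly spend) are made up, and starting the centroids at the first and last customer is a simplification I chose so the example is repeatable; real implementations usually pick the starting points more carefully.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then re-average.

    Returns the labels from the last assignment step and the centroids.
    """
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers: (age, monthly spend).
customers = [(22, 30), (25, 35), (24, 28), (61, 90), (58, 85), (63, 95)]
labels, centroids = kmeans(customers, [customers[0], customers[-1]])
```

Note that no label is given to the algorithm. The two groups (younger, lower-spending customers and older, higher-spending ones) come out of the similarity between the rows alone.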
Note: This is the high-level understanding of the ML framework I wanted to share with you. More insights into ML/DL will follow. Keep an eye on the blog.