Machine Learning Model Overview
Introduction to ML
Model
Performing machine learning involves creating a model, which is trained on training data and can then process additional data to make predictions. A machine learning model can be a mathematical representation of a real-world process. To generate a machine learning model, you provide training data to a machine learning algorithm to learn from.
The model finds the patterns in the training data, compares those patterns with the input test data, and gives the output (predictions). The algorithm together with the patterns it has learned from the training data is called a model. There are several algorithms that can be chosen depending on the need.
Data handling
Preparing data files before applying machine learning algorithms used to take a great deal of time. Data handling refers to data cleaning and processing. It means handling missing values in the dataset: rows containing missing values are dropped, or alternatively the missing values are filled in with the column average or another suitable method.
Next is dropping duplicate values in the dataset: if there are rows with the same values in all columns, they can be dropped. Other steps include binning data (i.e. data bucketing, classifying data based on a label value) and detecting and removing outliers, since outliers can change the true nature of the dataset and also affect the output.
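The steps above can be sketched with pandas; the tiny dataset here is made up purely for illustration:

```python
import pandas as pd
import numpy as np

# Made-up toy dataset with a missing value, a duplicate row, and an outlier (200).
df = pd.DataFrame({
    "age": [25, 27, np.nan, 30, 30, 200],
    "score": [88, 92, 75, 92, 92, 60],
})

# Handle missing values: here we fill with the column average instead of dropping.
df["age"] = df["age"].fillna(df["age"].mean())

# Drop rows whose values are identical in all columns.
df = df.drop_duplicates()

# Bin (bucket) the score column into labelled ranges.
df["grade"] = pd.cut(df["score"], bins=[0, 80, 90, 100], labels=["C", "B", "A"])

# Detect and remove outliers with the interquartile-range (IQR) rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["age"] >= q1 - 1.5 * iqr) & (df["age"] <= q3 + 1.5 * iqr)]
```

Each step corresponds to one of the cleaning operations described above; in practice the fill strategy and outlier rule depend on the dataset.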
Data Preprocessing
Data preprocessing is an integral step in machine learning, as the quality of data and the useful information that can be derived from it directly affect the ability of our model to learn; therefore, it is extremely important that we preprocess our data before feeding it into our model.
In a dataset there are almost always a few null values. It does not matter whether the task is regression or classification: the model cannot process data with null values, so they must either be removed using the drop function or filled in by imputation. Standardization means transforming our values such that the mean of the values is 0 and the standard deviation is 1.
Handling categorical variables is another step: categorical variables are variables that are discrete rather than continuous. Finally, multicollinearity occurs in our dataset when we have features that are strongly dependent on each other.
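The preprocessing steps above can be sketched with scikit-learn; the small arrays here are made-up examples:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Made-up numeric feature matrix containing a null value.
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0]])

# Imputation: replace nulls with the column mean instead of dropping rows.
X = SimpleImputer(strategy="mean").fit_transform(X)

# Standardization: rescale each column to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Categorical variables are discrete labels; one-hot encode them into numbers.
colors = np.array([["red"], ["green"], ["red"]])
encoded = OneHotEncoder().fit_transform(colors).toarray()
```

Imputation, standardization, and encoding are usually chained in a `Pipeline` so the same transformations are applied to training and test data.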
Types of ML models:
Models are essentially the algorithms used in the machine learning process. There are many algorithms for each type of machine learning. Overall, the models are of three types: supervised learning models, unsupervised learning models, and reinforcement learning models.
Each addresses different problem types, and the models are built according to the need; that is, the algorithm is trained with the training data to give the output, and the output may be based on regression, classification, association, or clustering. The models lie in one of these types.
Supervised and Unsupervised
In supervised learning, the algorithm learns from labelled data, i.e. there is a set of labelled training examples for the dataset. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new input data. The problem types of supervised learning are classification and regression. The output is known in supervised learning.
The popular supervised learning algorithms are:
- Logistic regression:
Logistic regression is used to predict an output that is binary. It is called regression but performs classification, since based on the regression output it classifies the dependent variable into one of the classes.
- Support vector machine:
Support vector machines are used for both regression and classification. They are based on the concept of decision planes that define decision boundaries. An SVM performs classification by finding the hyperplane that maximizes the margin between the two classes with the help of support vectors.
- K-nearest neighbor:
The K-NN algorithm is one of the simplest classification algorithms. It uses data points that are already separated into several classes to predict the classification of a new sample point, classifying new cases based on a similarity measure.
- Decision tree classification:
A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
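All four supervised algorithms above are available in scikit-learn behind the same fit/score interface; as a sketch (the Iris dataset and the 80:20 split here are only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Labelled data: features X and known outputs y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each supervised model learns a mapping from X_train to y_train,
# then is scored on held-out test data.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

Swapping one classifier for another changes only the constructor line, which is why these models are easy to compare on the same dataset.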
Unsupervised learning
An unsupervised learning algorithm learns from unlabeled data: it identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. The problem types of unsupervised learning are association and clustering. Unsupervised learning finds the hidden structure in the data.
The popular unsupervised algorithms are:
- K-means:
K-means clustering is a type of unsupervised learning used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of the K groups based on the features that are provided; data points are clustered based on feature similarity. The results of the K-means clustering algorithm are the centroids of the K clusters, which can be used to label new data, and labels for the training data (each data point is assigned to a single cluster).
- Fuzzy C-means:
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method is frequently used in pattern recognition.
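A minimal K-means sketch with scikit-learn, using made-up data with two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up unlabeled data: 20 points near (0, 0) and 20 points near (10, 10).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(10, 0.5, (20, 2))])

# K-means assigns each point to the nearest of K=2 centroids, iterating
# until the assignments stop changing.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

centroids = km.cluster_centers_   # can be used to label new data
labels = km.labels_               # one cluster label per training point
```

The two outputs match the two results described above: centroids for labeling new data, and a cluster label for every training point.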
How to test your data?
Testing the data means splitting the dataset so that part of it can be used to train the model and part to evaluate it. The dataset is partitioned into training and test data.
The test set must meet the following two conditions:
- It is large enough to yield statistically meaningful results.
- It is representative of the dataset as a whole. In other words, don't pick a test set with different characteristics than the training set.
The larger the training dataset, the more the model can learn. The dataset can be partitioned in a ratio of 80:20. If, after training the model on the training set, it gives surprisingly high accuracy on the test data, check the test data: you may find that many of the examples in the test set are duplicates of examples in the training set. In that case we have inadvertently trained on some of our test data, and as a result we are no longer accurately measuring how well our model generalizes to new data.
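The 80:20 partition and the duplicate check described above can be sketched as follows (the arrays are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset: 50 distinct rows with a binary label.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Partition the dataset 80:20 into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Sanity check: no test row should duplicate a training row, otherwise
# the measured test accuracy is inflated.
train_rows = {tuple(row) for row in X_train}
assert not any(tuple(row) in train_rows for row in X_test)
```

If real data contains exact duplicates, dropping them before splitting avoids this kind of leakage.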
Cross-validation techniques:
Random subsampling
Random subsampling is based on randomly splitting the data into subsets, where the size of the subsets is defined by the user. The random partitioning of the data can be repeated arbitrarily often.
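One way to do such repeated random splits is scikit-learn's `ShuffleSplit`; a sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Made-up dataset of 20 samples.
X = np.arange(20)

# Random subsampling: the user sets the subset size (test_size) and the
# random split is repeated n_splits times.
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
splits = [(train, test) for train, test in ss.split(X)]
```

Unlike K-fold, the test subsets here may overlap between repetitions, since each split is drawn independently at random.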
K-Fold Cross-Validation:
In K-fold cross-validation, the data is divided into k subsets. The holdout method is then performed k times (the holdout method means partitioning the dataset into train and test sets and using the test set to estimate the accuracy of the model, though some error is induced), such that each time, one of the k subsets is used as the test set/validation set and the other k-1 subsets are put together to form the training set.
Leave-one-out Cross-Validation:
Leave-one-out cross-validation works as follows, illustrated with 10 data sets. Parameter optimization is performed (automatically) on 9 of the 10 data sets, and the performance of the tuned algorithm is then tested on the 10th. In this step, the 10th data set is the test set and the other nine are the training data for optimizing the free parameters of your algorithm. The process is repeated 10 times, each time leaving out a different data set to use as the single test case. You then have test performance for all 10 data sets. That is how leave-one-out cross-validation works.
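The procedure above corresponds to scikit-learn's `LeaveOneOut` splitter; a sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Made-up dataset of 10 samples, matching the 10-data-set example above.
X = np.arange(10)

# Each sample is held out once as the single test case; the other 9
# form the training set for that round.
loo = LeaveOneOut()
splits = [(train, test) for train, test in loo.split(X)]
```

Leave-one-out is just K-fold with k equal to the number of samples, so it is expensive on large datasets but wastes no data.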




