Machine Learning Dictionary

Jul 5, 2020

Let’s understand some common terms that come up while building a machine learning model.

Coefficient of correlation

The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of 2 variables. Its value ranges from -1 to 1.

Coefficient of determination

Coefficient of determination, or R-squared, is the proportion of the variance in the dependent variable (y) that can be predicted from the independent variables. For a least-squares linear model with an intercept, it ranges from 0 to 1 on the data the model was fitted to.

If R-squared is 0, the dependent variable can't be predicted from the independent variables.
If R-squared is 1, the dependent variable can be completely explained by the independent variables without any error.
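
To make the two extremes concrete, here is a minimal sketch that computes R-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are made up for illustration.

```python
from statistics import fmean

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    # Sum of squared errors of the predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Sum of squared deviations of y from its own mean
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

y = [3.0, 5.0, 7.0, 9.0]  # illustrative values, not real data

perfect = r_squared(y, y)                     # predictions match y exactly
baseline = r_squared(y, [fmean(y)] * len(y))  # always predict the mean of y

print(perfect, baseline)  # 1.0 0.0
```

Predicting the mean for every row explains none of the variance, which is why the baseline lands at 0.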

Pearson’s R

Pearson’s R, or Pearson’s correlation coefficient, is the statistic that measures linear correlation between 2 variables. The values of Pearson’s R range from -1 to 1.

A value of 0 represents no linear correlation between the 2 variables
A value of 1 represents total positive correlation between the 2 variables
A value of -1 represents total negative correlation between the 2 variables

Pearson’s R equation

$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}$$

cov is the covariance of x and y.
Sigma x and sigma y are the standard deviations of x and y respectively.
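
Pearson's R is the covariance divided by the product of the two standard deviations, and that ratio can be sketched in plain Python. The function name `pearson_r` and the sample values are assumptions made for illustration. Population (divide by n) formulas are used, which gives the same ratio as the sample versions.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Pearson's R: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    # Population covariance of x and y
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Population standard deviations of x and y
    sigma_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)

x = [1.0, 2.0, 3.0, 4.0]                    # illustrative values
r_pos = pearson_r(x, [2.0, 4.0, 6.0, 8.0])  # y rises exactly with x
r_neg = pearson_r(x, [8.0, 6.0, 4.0, 2.0])  # y falls exactly as x rises
```

Here `r_pos` comes out at (approximately) 1 and `r_neg` at (approximately) -1, matching the two extremes listed above.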

What is Scaling?

Scaling is one of the most important transformations we apply to a data set. It is a data preprocessing technique that brings the features of a dataset onto a comparable scale.

Why do we perform scaling?

Many machine learning algorithms don't perform well when the input numerical attributes have very different scales. If scaling is not done, an algorithm might give more weight to features with larger values and less weight to features with smaller values, regardless of their units.
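
One place this shows up is in distance-based methods. The sketch below uses three hypothetical (age, income) rows, made up for illustration, and compares Euclidean distances on the raw values.

```python
from math import dist

# Hypothetical (age in years, income in dollars) rows, invented for this example
a = (25.0, 50_000.0)
b = (60.0, 50_500.0)  # very different age, similar income
c = (26.0, 52_000.0)  # similar age, somewhat different income

d_ab = dist(a, b)  # Euclidean distance between a and b
d_ac = dist(a, c)  # Euclidean distance between a and c

# Income differences swamp age differences because of the units
print(d_ab < d_ac)  # True
```

On unscaled data, a looks closer to b than to c even though their ages differ by 35 years, simply because income is measured in much larger numbers.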

Different methods of scaling?

There are two common ways to get all the attributes to have the same scale:

Min Max Scaling
Standardization

Min Max Scaling

Min Max Scaling shifts and rescales the values so that they end up ranging from 0 to 1. This is done by subtracting the minimum from each value and then dividing by the maximum minus the minimum. Min max binds the values to a fixed range (0 to 1) and can be strongly affected by outliers.
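
A minimal sketch of that calculation, using a hypothetical helper `min_max_scale` and made-up numbers. It assumes the values are not all identical, since otherwise the denominator would be zero.

```python
def min_max_scale(values):
    """Rescale values to the range 0 to 1: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    # Assumes hi != lo, i.e. the feature is not constant
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10.0, 20.0, 30.0, 50.0])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

# Adding one extreme value squeezes the other values towards 0
with_outlier = min_max_scale([10.0, 20.0, 30.0, 50.0, 1000.0])
```

In `with_outlier`, the first four values all end up below 0.05, which illustrates how a single outlier can compress the rest of the feature.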

Standardization

Standardization doesn't bind the values to a fixed range. It first subtracts the mean value and then divides by the standard deviation, so the resulting distribution always has a mean of 0 and a variance of 1. Standardization is much less affected by outliers.
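
The same idea as a short sketch, with a hypothetical helper `standardize` and made-up values. It uses the population standard deviation and assumes the feature is not constant.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation; assumed non-zero
    return [(v - mu) / sigma for v in values]

# Illustrative values with mean 5 and standard deviation 2
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z_mean = fmean(z)   # approximately 0
z_std = pstdev(z)   # approximately 1
```

Unlike min-max scaling, the results are not confined to 0 to 1: the largest value here maps to 2 and the smallest to -1.5.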

VIF (Variance Inflation Factor)

Variance inflation factor (VIF) quantifies how well one independent variable is explained by all the other independent variables combined. It is a measure of the amount of multicollinearity in a set of multiple regression variables. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.

Sometimes VIF is infinite. Why does this happen?

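The formula for VIF is VIF = 1 / (1 - R-squared), where R-squared comes from regressing one independent variable on all the others. VIF can therefore be infinite only if this R-squared value is 1. So we can conclude that if one independent variable is perfectly explained by the others, then VIF = infinity. More generally, a large value of VIF indicates strong correlation between the variables.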
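
A small sketch of why VIF blows up, using VIF = 1 / (1 - R-squared). To stay within the standard library it covers only the simplified case of two predictors, where the R-squared from regressing one on the other equals the squared Pearson correlation. The function names and the data are made up for illustration.

```python
import math
from statistics import fmean

def r_squared_simple(x, y):
    """R-squared from regressing y on a single predictor x (squared Pearson R)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def vif(r2):
    """VIF = 1 / (1 - R-squared); infinite when R-squared reaches 1."""
    return math.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_collinear = [2.0, 4.0, 6.0, 8.0, 10.0]  # exactly 2 * x1
x2_unrelated = [2.0, 1.0, 3.0, 1.0, 2.0]   # chosen to be uncorrelated with x1

vif_collinear = vif(r_squared_simple(x2_collinear, x1))  # infinite
vif_unrelated = vif(r_squared_simple(x2_unrelated, x1))  # approximately 1
```

With more than two predictors, the R-squared would come from a multiple regression of one predictor on all the others, but the last step, 1 / (1 - R-squared), stays the same.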

Q-Q plot

A Q-Q plot, or Quantile-Quantile plot, plots the quantiles of two distributions against each other. It helps us assess whether a set of data plausibly comes from some theoretical distribution, such as a normal or exponential distribution.

Where can we use Q-Q plots?

While building a linear regression model, we assume that the residuals are normally distributed. We can use a Normal Q-Q plot to check that assumption.
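
As a rough sketch of what goes into a Normal Q-Q plot, the code below pairs each sorted residual with the matching quantile of a standard normal distribution. The residuals are made up, the function name `normal_qq_points` is hypothetical, and the plotting positions (i + 0.5) / n are one common choice among several.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a Normal Q-Q plot."""
    n = len(sample)
    ordered = sorted(sample)  # sample quantiles
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Theoretical quantiles at plotting positions (i + 0.5) / n
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Made-up residuals, for illustration only
residuals = [-1.2, 0.3, 0.8, -0.4, 1.5, -0.1, 0.2, -0.9]
points = normal_qq_points(residuals)
```

Plotting these pairs as a scatter plot gives the Q-Q plot. If the residuals are roughly normal, the points fall close to a straight line.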

A Q-Q plot also helps in linear regression when the training and test data sets are received separately: we can use it to check whether both data sets come from populations with the same distribution.