When it comes to data exploration and model building, there are multiple ways to perform certain tasks and often, it all boils down to the goals and the experience or flair of the Data Scientist. For Example, you may want to normalize data via the L1 (Manhattan Distance) or L2 (Euclidean Distance) or even a combination of both.
It’s common practice to interchange certain terms in Data Science. Methods are frequently interchanged with functions and vice-versa. These have similar meanings but by behaviour, functions take in one or more parameters, while methods are usually called upon objects… print(‘hello’) # function df.head # method We see the same interplay in the words mean and average… The words “mean” and “average” are often used interchangeably.
The substitution of one word for the other is common practice. The technical term is “arithmetic mean,” and “average” is technically a center location.
However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.” (openstax.org) Overview: Scaling and normalization are often used interchangeably. And to make matters more interesting, scaling and normalization are very similar!
In both scaling and normalization, you’re transforming the values of numeric variables so that the transformed data points have specific helpful properties. These properties can be exploited to create better features and models. Differences: free image from pixabay In Scaling, we’re changing the range of the distribution of the data… While in normalizing, we’re changing the shape of the distribution of the data.
Range is the difference between the smallest and largest element in a distribution. Scaling and normalization are so similar that they’re often applied interchangeably, but as we’ve seen from the definitions, they have different effects on the data. As Data Professionals, we need to understand these differences and more importantly, know when to apply one rather than the other.
Why Do We Scale Data?
Remember that in scaling, we’re transforming the data so that it fits within a specific scale, like 0-100 or 0-1. Usually 0-1. You want to scale data especially when you’re using methods based on measures of how far apart data points are.
For example, while using support vector machines (SVM) or clustering algorithms like k-nearest neighbors (KNN)… With these algorithms, a change of “1” in any numeric feature is given the same importance.
Let’s take an example from Kaggle. Imagine you’re looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don’t scale your prices, algorithms like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar!
This clearly doesn’t fit our intuitions of the world. So generally, we may need to scale data for machine learning problems so that all variables have quite similar distribution range to avoid such issues. By scaling your variables, you can help compare different variables on equal footing…(Kaggle)
Some Common Types of Scaling: 1. Simple Feature Scaling: This method simply divides each value by the maximum value for that feature…The resultant values are in the range between zero(0) and one(1) Simple-feature scaling is the defacto scaling method used on image-data. When we scale images by dividing each image by 255 (maximum image pixel intensity) Let’s define a simple-feature scaling function …
This scaler takes each value and subtracts the minimum and then divides by the range(max-min). The resultant values range between zero(0) and one(1). Let’s define a min-max function…
Just like before, min-max scaling takes a distribution with range[1,10] and scales it to the range[0.0, 1]. Apply Scaling to a Distribution: Let’s grab a data set and apply Scaling to a numerical feature.
We’d use the Credit-One Bank credit loan customers dataset.
This time, we’ll use the minmax_scaling function from mlxtend.preprocessing. Let’s see the head of the data set. Ok, let’s for the sake of practice, scale the ‘Age’ column of the data
After scaling the data, we can see that the original dataset has a minimum age of 19 and a maximum of 75. And, the scaled dataset has a minimum of [0.] and maximum of [1.] The only thing that changes, when we scale the data is the range of the distribution… The shape and other properties remain the same. Why Do We Normalize Data? Normalization is a more radical transformation.
The point of normalization is to change your observations so that they can be described as a normal distribution… (Kaggle) Normal distribution: AKA, “bell curve”, is a specific statistical distribution where roughly equal observations fall above and below the mean, the mean and the median are about the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.
In general, you’ll normalize your data if you’re going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with “Gaussian” in the name probably assumes normality.) Note that Normalization is also referred to as Standardization in some Statistical journals. Standardization aims to normalize the distribution by considering how far away each observation is from the mean in terms of the standard deviation. An example is the Z-Score. Some Common Types of Normalization:
1. Z-Score or Standard Score: For each value in the distribution, we subtract the average or mean…And then divide by the Standard deviation. This gives a range from about minus 3 to 3, could be more, or less. We can easily code it up, let’s define a Z-score method…
2. Box-Cox Normalization: image credit| link A Box-Cox transformation is a transformation of a non-normal dependent variable into a normal shape.
The Box-Cox transformation is named after statisticians George Box and Sir David Roxbee Cox who collaborated on a 1964 paper and developed the technique… (link) How it Works… At the heart of the box-cox normalization is an exponent lambda (λ), which varies from -5 to 5.
All values of λ are considered and the optimal value for your data is selected; The “optimal value” is the one which results in the best approximation of a normal distribution curve
For those of us with some ML skills, this process is akin to tuning the learning-rate alpha (α), in order to produce a finer fit to the data Box-Cox by default works for only positive values, but there’s a variant that can approximate negative values too. See this link. For more see this article.
Apply Normalization to a Distribution: Let’s continue with the Credit-One Bank credit loan customers dataset. This time, we’d apply box cox transformation to the same Age Column. We’d use the boxcox function from scipy.stats.
So we use the stats.boxcox function, which returns a tuple with the normalized Series as the first element. The original dataset minimum is 19 and the maximum is 75. While the normalized dataset minimum is 1.300 and maximum is 1.4301 Notice that in addition to changing the range of the Age distribution, the normalization