Data Preparation#

We did some exploratory data analysis in the previous section, found a few patterns, and cleaned the data to some extent. Let's go deeper into some of the more advanced preparation techniques.

Data Cleaning#

Most of the time, real-world data is damaged or has missing entries. We need to take care of this, since Machine Learning models don't work when values are missing or are not numbers.

import pandas as pd
from sklearn.impute import SimpleImputer
import numpy as np

Imputing missing values#

df = pd.read_csv('Data.csv')
df.head()
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
# Replace every occurrence of missing_values with one defined by strategy,
# which can be 'mean', 'median', 'most_frequent', or 'constant'.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df.iloc[:, 1:3] = imputer.fit_transform(df.iloc[:, 1:3])
df.head()
Country Age Salary Purchased
0 France 44.0 72000.000000 No
1 Spain 27.0 48000.000000 Yes
2 Germany 30.0 54000.000000 No
3 Spain 38.0 61000.000000 No
4 Germany 40.0 63777.777778 Yes
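
The same imputer can handle categorical columns too; a minimal sketch, assuming the Country column could also contain missing entries:

# strategy='most_frequent' fills each missing entry with the column's mode,
# which works for string/categorical columns as well
cat_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df[['Country']] = cat_imputer.fit_transform(df[['Country']])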

Encoding categorical data#

Our dataset has both numerical and categorical features. We need to convert the categorical features to numbers, which we can do with LabelEncoder or OneHotEncoder.

One-hot encoding is a popular technique for converting categorical variables to numerical. It creates a separate column for every category and puts a 1 in the rows where that category is present and a 0 everywhere else.

Label encoding is another popular technique for converting categorical variables to numerical. It replaces each category with an integer.

# Label Encoder replaces each category with an integer. Useful for replacing yes with 1 and no with 0.
# One Hot Encoder creates a separate column for every category and puts a 1 in the rows where that category is present.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder()
temp = df.copy()
temp.iloc[:, 0] = label_encoder.fit_transform(df.iloc[:, 0])
temp.head()
Country Age Salary Purchased
0 0 44.0 72000.000000 No
1 2 27.0 48000.000000 Yes
2 1 30.0 54000.000000 No
3 2 38.0 61000.000000 No
4 1 40.0 63777.777778 Yes
# In recent scikit-learn versions, OneHotEncoder no longer takes a
# categorical_features argument; instead you encode the selected column(s)
# directly (see the working example after the get_dummies output below).

# you can achieve the same thing using get_dummies
pd.get_dummies(df.iloc[:, :-1])
Age Salary Country_France Country_Germany Country_Spain
0 44.000000 72000.000000 True False False
1 27.000000 48000.000000 False False True
2 30.000000 54000.000000 False True False
3 38.000000 61000.000000 False False True
4 40.000000 63777.777778 False True False
5 35.000000 58000.000000 True False False
6 38.777778 52000.000000 False False True
7 48.000000 79000.000000 True False False
8 50.000000 83000.000000 False True False
9 37.000000 67000.000000 True False False
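
For completeness, here is a minimal sketch of one-hot encoding the Country column with OneHotEncoder itself; the sparse_output argument assumes scikit-learn 1.2 or newer (older versions call it sparse):

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input, so select the column as a DataFrame
one_hot_encoder = OneHotEncoder(sparse_output=False)
country_encoded = one_hot_encoder.fit_transform(df[['Country']])
# one column per country, with a 1 where that country appears in the row
print(one_hot_encoder.categories_)
print(country_encoded[:5])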

Binarizing#

Often we need to do the reverse of what we've done above, i.e., convert continuous features to discrete values. For instance, we may want to convert a value to 0 or 1 depending on whether it is above or below a threshold.

from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target
feature_names = iris_dataset.feature_names

Now we'll binarize the sepal width, with 0 or 1 indicating whether the value is below or above the mean.

X[:, 1]
array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3. ,
       3. , 4. , 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3. ,
       3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.6, 3. ,
       3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3. , 3.8, 3.2, 3.7, 3.3, 3.2, 3.2,
       3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2. , 3. , 2.2, 2.9, 2.9,
       3.1, 3. , 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3. , 2.8, 3. ,
       2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3. , 3.4, 3.1, 2.3, 3. , 2.5, 2.6,
       3. , 2.6, 2.3, 2.7, 3. , 2.9, 2.9, 2.5, 2.8, 3.3, 2.7, 3. , 2.9,
       3. , 3. , 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3. , 2.5, 2.8, 3.2, 3. ,
       3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3. , 2.8, 3. ,
       2.8, 3.8, 2.8, 2.8, 2.6, 3. , 3.4, 3.1, 3. , 3.1, 3.1, 3.1, 2.7,
       3.2, 3.3, 3. , 2.5, 3. , 3.4, 3. ])
from sklearn.preprocessing import Binarizer
X[:, 1:2] = Binarizer(threshold=X[:, 1].mean()).fit_transform(X[:, 1].reshape(-1, 1))
X[:, 1]
array([1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0.])

Feature Scaling#

In Machine Learning models, features are mapped into an n-dimensional space. Say there are two variables (x, y) mapped into a 2D coordinate system. If one variable, say y, takes very large values and the other, x, takes very small ones, then the Euclidean distance will be dominated by the bigger one and the smaller one will effectively be ignored. In that case we lose valuable information, so feature scaling is used to solve this problem.

Additional reasons for transformation:

  1. To more closely approximate a theoretical distribution that has nice statistical properties.

  2. To spread out data more evenly.

  3. To make the data distribution more symmetric.

  4. To make relationships between variables more linear.

  5. To make data more constant in variance (homoscedasticity).

Commonly used ways to scale features#

  1. Min-Max Scaling: Scales the input to have a minimum of 0 and a maximum of 1, i.e., it maps the data into the range [0, 1]. This is useful when the parameters have to be on the same positive scale, but it is sensitive to outliers, which end up determining the minimum and maximum. $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$

  2. Standardization: Scales the input to have a mean of 0 and a variance of 1. $X_{stand} = \frac{X - \mu}{\sigma}$

  3. Normalizing: Scales each sample to have a norm of 1. For instance, for 3D data each sample will lie on the unit sphere.

  4. Log Transformation: Taking the log of the data after any of the above transformations (a short sketch follows the scaling example below).

Scaling inputs to unit norm is a common operation for text classification and clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors, which is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.

For most applications, Standardization is recommended. Min-Max Scaling is often recommended for Neural Networks, and Normalizing is recommended for clustering algorithms such as KMeans.

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler

df = pd.read_csv('Data.csv').dropna()
print(df)
X = df[["Age", "Salary"]].values.astype(np.float64)
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
standard_scaler = StandardScaler()
normalizer = Normalizer()
min_max_scaler = MinMaxScaler()

print("Standardization")
print(standard_scaler.fit_transform(X))

print("Normalizing")
print(normalizer.fit_transform(X))

print("MinMax Scaling")
print(min_max_scaler.fit_transform(X))
Standardization
[[ 0.69985807  0.58989097]
 [-1.51364653 -1.50749915]
 [-1.12302807 -0.98315162]
 [-0.08137885 -0.37141284]
 [-0.47199731 -0.6335866 ]
 [ 1.22068269  1.20162976]
 [ 1.48109499  1.55119478]
 [-0.211585    0.1529347 ]]
Normalizing
[[6.11110997e-04 9.99999813e-01]
 [5.62499911e-04 9.99999842e-01]
 [5.55555470e-04 9.99999846e-01]
 [6.22950699e-04 9.99999806e-01]
 [6.03448166e-04 9.99999818e-01]
 [6.07594825e-04 9.99999815e-01]
 [6.02409529e-04 9.99999819e-01]
 [5.52238722e-04 9.99999848e-01]]
MinMax Scaling
[[0.73913043 0.68571429]
 [0.         0.        ]
 [0.13043478 0.17142857]
 [0.47826087 0.37142857]
 [0.34782609 0.28571429]
 [0.91304348 0.88571429]
 [1.         1.        ]
 [0.43478261 0.54285714]]
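
The log transformation from item 4 above isn't covered by these scalers; a minimal sketch using NumPy's log1p (which computes log(1 + x) and so is safe at 0):

# compress the range of the Salary column and reduce right skew
salary_log = np.log1p(X[:, 1])
print(salary_log)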

Feature extraction#

Let's explore some feature extraction techniques. Feature extraction is the process of transforming raw data, such as text, into numerical features that machine learning models can work with.

CountVectorizer#

CountVectorizer converts a collection of documents to vectors so that we can use them with models. It simply counts the number of times each word occurs in each document.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Mayur is a nice boy.", "Mayur rock! wohooo!", "My name is Mayur, and I am a Pythonista!"]
cv = CountVectorizer()
X = cv.fit_transform(docs)
print(X.todense())
print(cv.vocabulary_)
[[0 0 1 1 1 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1]
 [1 1 0 1 1 1 1 0 1 0 0]]
{u'and': 1, u'boy': 2, u'name': 6, u'is': 3, u'mayur': 4, u'am': 0, u'wohooo': 10, u'rock': 9, u'nice': 7, u'my': 5, u'pythonista': 8}

DictVectorizer#

DictVectorizer converts mappings (e.g., Python dictionaries of feature names to values) to vectors.

from sklearn.feature_extraction import DictVectorizer

docs = [{"Mayur": 1, "is": 1, "awesome": 2}, {"No": 1, "I": 1, "dont": 2, "wanna": 3, "fall": 1, "in": 2, "love": 3}]
dv = DictVectorizer()
X = dv.fit_transform(docs)
print(X.todense())
[[ 0.  1.  0.  2.  0.  0.  0.  1.  0.  0.]
 [ 1.  0.  1.  0.  2.  1.  2.  0.  3.  3.]]

TfidfVectorizer#

In many text analytics applications, we need to convert the text into vectors to use with Machine Learning algorithms. This is known as the Vector Space Model.

While CountVectorizer could be a solution, words like “the”, “a”, “in”, etc. are common and appear in all kinds of documents. CountVectorizer gives such words large counts even though they carry little relevant information.

You could circumvent this problem by using stop_words="english", which filters out common English words. But say you have a different vocabulary: a conversation between two Computer Science students would mention words like “RAM”, “processor”, and “GPU” very often, and you would have to add such stop words manually for every problem you solve.

In such scenarios it is recommended to use TfidfVectorizer, which takes care of this automatically. Every word is given a weight according to the following formula:

\[ \text{tfidf }\left(\text{word}\right)=\text{tf}\left(\text{word},\text{document}_i\right)\cdot\text{idf}\left(\text{word}\right) \]

Where,

  1. tf(word, document_i) = Term Frequency of a word in the specific document i.

  2. idf(word) = Inverse Document Frequency of the word.

Inverse Document Frequency is defined as the log of the ratio of the total number of documents to the number of documents in which the word occurs.

\[ \text{idf }\left(w\right)=\log\left(\frac{n_d}{df\left(w\right)}\right)\]

Where,

  1. df(w) = the number of documents in which the word w occurs.

Intuitively, if a word occurs in many other documents as well (common words like “the”, “is”), it is given less weight, in contrast to words that occur many times in a single document but rarely elsewhere. In other words, if a particular word occurs often in only one document, it is likely an important feature of that document.

Note that 1 is added to both the numerator and the denominator (as if an extra document containing every term had been seen), which avoids division by zero when a word's document frequency is 0.
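
With scikit-learn's defaults (smooth_idf=True), the idf is effectively computed as:

\[ \text{idf}\left(w\right)=\ln\left(\frac{1+n_d}{1+df\left(w\right)}\right)+1 \]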

scikit-learn additionally normalizes the tf-idf output to have a norm of 1. This is important since we're interested in similarities: vectors like (1, 1) and (3, 3) are really the same (they point in the same direction, just with different magnitudes), which is achieved by dividing each vector by its length.

\[v_i=\frac{v_i}{\left\lVert v\right\rVert_2}=\frac{v_i}{\sqrt{v_1^2+v_2^2+v_3^2+\dots+v_n^2}}\]
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vectorizer = TfidfVectorizer()
cv_vectorizer = CountVectorizer()
docs = ["Mayur is a Guitarist", "Mayur is Musician", "Mayur is also a programmer"]
X_idf = tfidf_vectorizer.fit_transform(docs)
X_cv = cv_vectorizer.fit_transform(docs)
print(X_idf.todense())
print(tfidf_vectorizer.vocabulary_)
print(X_cv.todense())
[[0.         0.76749457 0.45329466 0.45329466 0.         0.        ]
 [0.         0.         0.45329466 0.45329466 0.76749457 0.        ]
 [0.6088451  0.         0.35959372 0.35959372 0.         0.6088451 ]]
{'mayur': 3, 'is': 2, 'guitarist': 1, 'musician': 4, 'also': 0, 'programmer': 5}
[[0 1 1 1 0 0]
 [0 0 1 1 1 0]
 [1 0 1 1 0 1]]

We can see that “Mayur” and “is” are given less weight than “guitarist”, “musician”, and “programmer”.
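
As a sanity check, here is a minimal sketch that reproduces the first row of the tf-idf output by hand, assuming the defaults (smooth_idf=True and l2 normalization):

import numpy as np

n_docs = 3
# "mayur" and "is" appear in all 3 documents, "guitarist" in only 1
idf_common = np.log((1 + n_docs) / (1 + 3)) + 1   # = 1.0
idf_rare = np.log((1 + n_docs) / (1 + 1)) + 1     # ~ 1.693

# raw tf-idf for the first document: one occurrence of each word
vec = np.array([idf_rare, idf_common, idf_common])  # guitarist, is, mayur
vec = vec / np.linalg.norm(vec)                     # l2-normalize
print(vec)  # ~[0.767, 0.453, 0.453], matching the first row above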