Data Preparation#

We did some exploratory data analysis in the previous section, found a few patterns, and cleaned the data to some extent. Let's go deeper into some of the more advanced preparation techniques.

Data Cleaning#

Most of the time, real-world data is damaged or has missing entries. We need to take care of this, since Machine Learning models don't work when values are missing or are not numbers.

import pandas as pd
from sklearn.impute import SimpleImputer
import numpy as np

Imputing missing values#

df = pd.read_csv('Data.csv')
df.head()
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
# Replace every occurrence of missing_values with one defined by strategy,
# which can be 'mean', 'median', 'most_frequent', or 'constant'.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df.iloc[:, 1:3] = imputer.fit_transform(df.iloc[:, 1:3])
df.head()
Country Age Salary Purchased
0 France 44.0 72000.000000 No
1 Spain 27.0 48000.000000 Yes
2 Germany 30.0 54000.000000 No
3 Spain 38.0 61000.000000 No
4 Germany 40.0 63777.777778 Yes
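
The same imputer can handle categorical columns too; a minimal sketch, assuming the Country column could also contain missing entries:

# strategy='most_frequent' fills each missing entry with the column's mode,
# which works for string/categorical columns as well
cat_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df[['Country']] = cat_imputer.fit_transform(df[['Country']])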

Encoding categorical data#

Our dataset has both numerical and categorical features. We need to convert the categorical features to numbers, which we can do with LabelEncoder or OneHotEncoder.

One-hot encoding is a popular technique for converting categorical variables to numerical. It creates a separate column for every category and puts a 1 in the rows where that category is present and a 0 everywhere else.

Label encoding is another popular technique for converting categorical variables to numerical. It replaces each category with an integer.

# Label Encoder replaces each category with an integer. Useful for replacing yes with 1 and no with 0.
# One Hot Encoder creates a separate column for every category and puts a 1 in the rows where that category is present.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder()
temp = df.copy()
temp.iloc[:, 0] = label_encoder.fit_transform(df.iloc[:, 0])
temp.head()
Country Age Salary Purchased
0 0 44.0 72000.000000 No
1 2 27.0 48000.000000 Yes
2 1 30.0 54000.000000 No
3 2 38.0 61000.000000 No
4 1 40.0 63777.777778 Yes
# In recent scikit-learn versions, OneHotEncoder no longer takes a
# categorical_features argument; instead you encode the selected column(s)
# directly (see the working example after the get_dummies output below).

# you can achieve the same thing using get_dummies
pd.get_dummies(df.iloc[:, :-1])
Age Salary Country_France Country_Germany Country_Spain
0 44.000000 72000.000000 True False False
1 27.000000 48000.000000 False False True
2 30.000000 54000.000000 False True False
3 38.000000 61000.000000 False False True
4 40.000000 63777.777778 False True False
5 35.000000 58000.000000 True False False
6 38.777778 52000.000000 False False True
7 48.000000 79000.000000 True False False
8 50.000000 83000.000000 False True False
9 37.000000 67000.000000 True False False
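
For completeness, here is a minimal sketch of one-hot encoding the Country column with OneHotEncoder itself; the sparse_output argument assumes scikit-learn 1.2 or newer (older versions call it sparse):

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input, so select the column as a DataFrame
one_hot_encoder = OneHotEncoder(sparse_output=False)
country_encoded = one_hot_encoder.fit_transform(df[['Country']])
# one column per country, with a 1 where that country appears in the row
print(one_hot_encoder.categories_)
print(country_encoded[:5])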

Binarizing#

Often we need to do the reverse of what we've done above, i.e., convert continuous features to discrete values. For instance, we may want to convert a value to 0 or 1 depending on whether it is above or below a threshold.

from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target
feature_names = iris_dataset.feature_names

Now we'll binarize the sepal width, with 0 or 1 indicating whether the value is below or above the mean.

X[:, 1]
array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3. ,
       3. , 4. , 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3. ,
       3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.6, 3. ,
       3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3. , 3.8, 3.2, 3.7, 3.3, 3.2, 3.2,
       3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2. , 3. , 2.2, 2.9, 2.9,
       3.1, 3. , 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3. , 2.8, 3. ,
       2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3. , 3.4, 3.1, 2.3, 3. , 2.5, 2.6,
       3. , 2.6, 2.3, 2.7, 3. , 2.9, 2.9, 2.5, 2.8, 3.3, 2.7, 3. , 2.9,
       3. , 3. , 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3. , 2.5, 2.8, 3.2, 3. ,
       3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3. , 2.8, 3. ,
       2.8, 3.8, 2.8, 2.8, 2.6, 3. , 3.4, 3.1, 3. , 3.1, 3.1, 3.1, 2.7,
       3.2, 3.3, 3. , 2.5, 3. , 3.4, 3. ])
from sklearn.preprocessing import Binarizer
X[:, 1:2] = Binarizer(threshold=X[:, 1].mean()).fit_transform(X[:, 1].reshape(-1, 1))
X[:, 1]
array([1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0.])

Feature Scaling#

In Machine Learning models, features are mapped into an n-dimensional space. Say there are two variables (x, y) mapped into a 2D coordinate system. If one variable, say y, takes very large values and the other, x, takes very small ones, then the Euclidean distance will be dominated by the bigger one and the smaller one will effectively be ignored. In that case we lose valuable information, so feature scaling is used to solve this problem.

Additional reasons for transformation:

  1. To more closely approximate a theoretical distribution that has nice statistical properties.

  2. To spread out data more evenly.

  3. To make the data distribution more symmetric.

  4. To make relationships between variables more linear.

  5. To make data more constant in variance (homoscedasticity).

Commonly used ways to scale features#

  1. Min-Max Scaling: Scales the input to have a minimum of 0 and a maximum of 1, i.e., it maps the data into the range [0, 1]. This is useful when the parameters have to be on the same positive scale, but it is sensitive to outliers, which end up determining the minimum and maximum. $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$

  2. Standardization: Scales the input to have a mean of 0 and a variance of 1. $X_{stand} = \frac{X - \mu}{\sigma}$

  3. Normalizing: Scales each sample to have a norm of 1. For instance, for 3D data each sample will lie on the unit sphere.

  4. Log Transformation: Taking the log of the data after any of the above transformations (a short sketch follows the scaling example below).

Scaling inputs to unit norm is a common operation for text classification and clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors, which is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.

For most applications, Standardization is recommended. Min-Max Scaling is often recommended for Neural Networks, and Normalizing is recommended for clustering algorithms such as KMeans.

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler

df = pd.read_csv('Data.csv').dropna()
print(df)
X = df[["Age", "Salary"]].values.astype(np.float64)
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
standard_scaler = StandardScaler()
normalizer = Normalizer()
min_max_scaler = MinMaxScaler()

print("Standardization")
print(standard_scaler.fit_transform(X))

print("Normalizing")
print(normalizer.fit_transform(X))

print("MinMax Scaling")
print(min_max_scaler.fit_transform(X))
Standardization
[[ 0.69985807  0.58989097]
 [-1.51364653 -1.50749915]
 [-1.12302807 -0.98315162]
 [-0.08137885 -0.37141284]
 [-0.47199731 -0.6335866 ]
 [ 1.22068269  1.20162976]
 [ 1.48109499  1.55119478]
 [-0.211585    0.1529347 ]]
Normalizing
[[6.11110997e-04 9.99999813e-01]
 [5.62499911e-04 9.99999842e-01]
 [5.55555470e-04 9.99999846e-01]
 [6.22950699e-04 9.99999806e-01]
 [6.03448166e-04 9.99999818e-01]
 [6.07594825e-04 9.99999815e-01]
 [6.02409529e-04 9.99999819e-01]
 [5.52238722e-04 9.99999848e-01]]
MinMax Scaling
[[0.73913043 0.68571429]
 [0.         0.        ]
 [0.13043478 0.17142857]
 [0.47826087 0.37142857]
 [0.34782609 0.28571429]
 [0.91304348 0.88571429]
 [1.         1.        ]
 [0.43478261 0.54285714]]
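
The log transformation from item 4 above isn't covered by these scalers; a minimal sketch using NumPy's log1p (which computes log(1 + x) and so is safe at 0):

# compress the range of the Salary column and reduce right skew
salary_log = np.log1p(X[:, 1])
print(salary_log)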

Feature extraction#

Let's explore some feature extraction techniques. Feature extraction is the process of transforming raw data, such as text, into numerical features that machine learning models can work with.

CountVectorizer#

CountVectorizer converts a collection of documents to vectors so that we can use them with models. It simply counts the number of times each word occurs in each document.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Mayur is a nice boy.", "Mayur rock! wohooo!", "My name is Mayur, and I am a Pythonista!"]
cv = CountVectorizer()
X = cv.fit_transform(docs)
print(X.todense())
print(cv.vocabulary_)
[[0 0 1 1 1 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1]
 [1 1 0 1 1 1 1 0 1 0 0]]
{u'and': 1, u'boy': 2, u'name': 6, u'is': 3, u'mayur': 4, u'am': 0, u'wohooo': 10, u'rock': 9, u'nice': 7, u'my': 5, u'pythonista': 8}

DictVectorizer#

DictVectorizer converts mappings (e.g., Python dictionaries of feature names to values) to vectors.

from sklearn.feature_extraction import DictVectorizer

docs = [{"Mayur": 1, "is": 1, "awesome": 2}, {"No": 1, "I": 1, "dont": 2, "wanna": 3, "fall": 1, "in": 2, "love": 3}]
dv = DictVectorizer()
X = dv.fit_transform(docs)
print(X.todense())
[[ 0.  1.  0.  2.  0.  0.  0.  1.  0.  0.]
 [ 1.  0.  1.  0.  2.  1.  2.  0.  3.  3.]]

TfidfVectorizer#

In many text analytics applications, we need to convert the text into vectors to use with Machine Learning algorithms. This is known as the Vector Space Model.

While CountVectorizer could be a solution, words like “the”, “a”, “in”, etc. are common and appear in all kinds of documents. CountVectorizer gives such words large counts even though they carry little relevant information.

You could circumvent this problem by using stop_words="english", which filters out common English words. But say you have a different vocabulary: a conversation between two Computer Science students would mention words like “RAM”, “processor”, and “GPU” very often, and you would have to add such stop words manually for every problem you solve.

In such scenarios it is recommended to use TfidfVectorizer, which takes care of this automatically. Every word is given a weight according to the following formula:

\[ \text{tfidf }\left(\text{word}\right)=\text{tf}\left(\text{word},\text{document}_i\right)\cdot\text{idf}\left(\text{word}\right) \]

Where,

  1. tf(word, document_i) = Term Frequency of a word in the specific document i.

  2. idf(word) = Inverse Document Frequency of the word.

Inverse Document Frequency is defined as the log of the ratio of the total number of documents to the number of documents in which the word occurs.

\[ \text{idf }\left(w\right)=\log\left(\frac{n_d}{df\left(w\right)}\right)\]

Where,

  1. df(w) = the number of documents in which the word w occurs.

Intuitively, if a word occurs in many other documents as well (common words like “the”, “is”), it is given less weight, in contrast to words that occur many times in a single document but rarely elsewhere. In other words, if a particular word occurs often in only one document, it is likely an important feature of that document.

Note that 1 is added to both the numerator and the denominator (as if an extra document containing every term had been seen), which avoids division by zero when a word's document frequency is 0.
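
With scikit-learn's defaults (smooth_idf=True), the idf is effectively computed as:

\[ \text{idf}\left(w\right)=\ln\left(\frac{1+n_d}{1+df\left(w\right)}\right)+1 \]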

scikit-learn additionally normalizes the tf-idf output to have a norm of 1. This is important since we're interested in similarities: vectors like (1, 1) and (3, 3) are really the same (they point in the same direction, just with different magnitudes), which is achieved by dividing each vector by its length.

\[v_i=\frac{v_i}{\left\lVert v\right\rVert_2}=\frac{v_i}{\sqrt{v_1^2+v_2^2+v_3^2+\dots+v_n^2}}\]
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vectorizer = TfidfVectorizer()
cv_vectorizer = CountVectorizer()
docs = ["Mayur is a Guitarist", "Mayur is Musician", "Mayur is also a programmer"]
X_idf = tfidf_vectorizer.fit_transform(docs)
X_cv = cv_vectorizer.fit_transform(docs)
print(X_idf.todense())
print(tfidf_vectorizer.vocabulary_)
print(X_cv.todense())
[[0.         0.76749457 0.45329466 0.45329466 0.         0.        ]
 [0.         0.         0.45329466 0.45329466 0.76749457 0.        ]
 [0.6088451  0.         0.35959372 0.35959372 0.         0.6088451 ]]
{'mayur': 3, 'is': 2, 'guitarist': 1, 'musician': 4, 'also': 0, 'programmer': 5}
[[0 1 1 1 0 0]
 [0 0 1 1 1 0]
 [1 0 1 1 0 1]]

We can see that “Mayur” and “is” are given less weight than “guitarist”, “musician”, and “programmer”.
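
As a sanity check, here is a minimal sketch that reproduces the first row of the tf-idf output by hand, assuming the defaults (smooth_idf=True and l2 normalization):

import numpy as np

n_docs = 3
# "mayur" and "is" appear in all 3 documents, "guitarist" in only 1
idf_common = np.log((1 + n_docs) / (1 + 3)) + 1   # = 1.0
idf_rare = np.log((1 + n_docs) / (1 + 1)) + 1     # ~ 1.693

# raw tf-idf for the first document: one occurrence of each word
vec = np.array([idf_rare, idf_common, idf_common])  # guitarist, is, mayur
vec = vec / np.linalg.norm(vec)                     # l2-normalize
print(vec)  # ~[0.767, 0.453, 0.453], matching the first row above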