5 Tips For Better Feature Engineering

Feature engineering is an important part of machine learning because it helps you extract meaningful information from raw data using mathematics, statistics, and domain knowledge. Domain knowledge means applying what is known about a specialized field to decide which features to derive from the raw data. When additional relevant features would increase the predictive power of a machine learning algorithm, we create them to improve its performance. In this article, I will not only describe five tips for feature engineering that can increase the accuracy of a machine learning model, but also implement them, so you can see how much they matter.

Five Tips for Better Feature Engineering

  1. Learn as much as possible about your data; the more you know about it, the easier it is to create useful additional features.
  2. Feature selection is a very important task in feature engineering because it lays the foundation for your model’s accuracy.
  3. Create additional features based on the information you gained from the data.
  4. Handle missing values in the data, for example by filling them with the median.
  5. Transform variables into numerical ones, because machine learning models work on numerical data.

If you apply these five tips correctly to your data, you can expect a noticeably more accurate model. Now that you know the techniques, let’s jump into the code and demonstrate them in practice.

import matplotlib.pyplot as plt

import seaborn as sns

import pandas as pd

import numpy as np

from sklearn import tree

from sklearn.model_selection import GridSearchCV

import re

sns.set()

We have imported all the libraries we are going to use. We also call seaborn’s sns.set() to apply its default aesthetic parameters in one step.

data_train = pd.read_csv('train.csv')

data_test = pd.read_csv('test.csv')

survived_data_train = data_train.Survived

data = pd.concat([data_train.drop(['Survived'], axis=1), data_test])

Here we load the training and test datasets into the data_train and data_test variables, and we set the target variable aside from the start in survived_data_train. In the last line we drop the target column from the training set and concatenate the training and test sets, so that every feature engineering step is applied to both.
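The combine-then-split pattern can be tried on a toy example first. This is only a sketch: the two small frames below are made-up stand-ins for train.csv and test.csv, and names such as toy_train, combined, and n_train are illustrative, not part of the original code.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up).
toy_train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})
toy_test = pd.DataFrame({
    "PassengerId": [4, 5],
    "Sex": ["male", "female"],
})

# Keep the target aside, then stack the feature columns of both sets.
toy_target = toy_train["Survived"]
n_train = len(toy_train)
combined = pd.concat(
    [toy_train.drop(columns=["Survived"]), toy_test],
    ignore_index=True,
)

# ... feature engineering on `combined` would go here ...

# Split back by position: the first n_train rows are the training rows.
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
```

Remembering the number of training rows is what makes it possible to separate the two sets again after the shared feature engineering steps.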

Now we will inspect our data using the data_train.head() and data_test.head() methods.

data_train.head()

data_test.head()

Before we go further with feature engineering, we first need to learn about our data. We have the dataset of the passengers who were onboard the Titanic. Here we will use the .info() method to take a look inside our data.

1) Gaining Information About the Data:

data.info()

data['Sex'].value_counts()

sns.countplot(x='Sex', data=data);

Here we counted how many males and females were onboard and then displayed the counts with countplot().

Now we will check the names of the passengers and the titles they carried. Remember, we are gathering information from the data so that we can extract relevant features from it.

data.Name.head(20)

Here you can see passenger titles such as ‘Mr.’, ‘Master.’, and ‘Miss.’. These titles tell us something about a passenger’s social status, profession, and so on. So we will store them in a separate column, ‘Title’.

To do that, we have to extract the titles from the names and store them in a new column.

2) Feature Selection and Creating Additional Features:

In feature selection, we reduce the number of input variables. In this case, we will reduce the number of columns by dropping ones such as Name, Cabin, Age, and Fare once we have derived more useful features from them. We will also merge SibSp and Parch, which both describe family members onboard, into a single column. First we need to extract the titles from the Name column; we will drop the unnecessary columns in the next steps. So let’s extract the titles from the names.

data['Title'] = data.Name.apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))
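The extraction above assumes that every name contains a title followed by a period; if re.search finds nothing, calling .group(1) raises an error. A more defensive variant might look like the sketch below. The helper extract_title and its default value "Unknown" are assumptions made for illustration, and the last name in the sample was invented to show the fallback.

```python
import re
import pandas as pd

# Same pattern as above: a space, a capitalized word, then a period.
TITLE_PATTERN = re.compile(r" ([A-Z][a-z]+)\.")

def extract_title(name, default="Unknown"):
    """Return the title found in a name, or `default` if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else default

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "No Title Here",  # made-up entry without a title
])
titles = names.apply(extract_title)
print(titles.tolist())
```

On data where every name is known to carry a title, the one-line lambda is enough; the fallback only matters for messier inputs.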

We have extracted the titles from the names and stored them in a new column, ‘Title’. Now we can draw a count plot to show how often each title occurs.

sns.countplot(x='Title', data=data);

plt.xticks(rotation=45);

You can see in the output below that the new column has been added to our data. This ‘Title’ column is a new feature for our dataset.

data.head()

Next comes another important step: reducing the number of categories. As the Title plot shows, a few titles occur very often, while many others occur only rarely. We will merge the variants of the common titles and put the rarely occurring titles into a single category.

data['Title'] = data['Title'].replace({'Mlle':'Miss', 'Mme':'Mrs', 'Ms':'Miss'})

data['Title'] = data['Title'].replace(['Don', 'Dona', 'Rev', 'Dr', 'Major', 'Lady', 'Sir', 'Col', 'Capt', 'Countess', 'Jonkheer'], 'Special')

sns.countplot(x='Title', data=data);

plt.xticks(rotation=45);
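The replacement above lists the rare titles by hand. As an alternative (not what this article does), rare categories can also be grouped by a frequency threshold. The sketch below assumes a hypothetical helper, group_rare, and a made-up sample of titles.

```python
import pandas as pd

def group_rare(values, min_count=2, other="Special"):
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep a value where it is not rare; otherwise substitute `other`.
    return values.where(~values.isin(rare), other)

toy_titles = pd.Series(["Mr", "Mr", "Miss", "Miss", "Mrs", "Mrs", "Rev", "Dr"])
grouped = group_rare(toy_titles, min_count=2)
print(grouped.value_counts())
```

A threshold avoids maintaining the list manually, but the hand-written list gives more control over which titles count as special.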

Here we have replaced ‘Mlle’ and ‘Ms’ with ‘Miss’, and ‘Mme’ with ‘Mrs’; ‘Mlle’ and ‘Mme’ are the French forms of these titles. All the other titles, such as ‘Don’, ‘Dona’, and ‘Rev’, have been put into a separate category called ‘Special’. Below is the tail of our data after adding the Title feature. Take a close look: the Title column contains values such as Mr, Special, and Master.

data.tail()

3) Handling Missing Values and Dropping Unnecessary Features:

You can also see in the output above that there is a column named ‘Cabin’ with a lot of NaN values. One possible interpretation is that these passengers did not have a cabin. So we will create a new column, ‘Has_Cabin’, that records whether a cabin is listed for each passenger, which also takes care of these missing values. If Has_Cabin is True, the passenger had a cabin; if it is False, the passenger did not.

data['Has_Cabin'] = ~data.Cabin.isnull()

data.head()

You can see in the output above that there is now a column named ‘Has_Cabin’ containing True and False values. Now we will drop the columns that are no longer useful to us. For example, ‘Cabin’ is no longer needed because the ‘Has_Cabin’ column carries its useful information. We will also drop the ‘PassengerId’ and ‘Ticket’ columns because they do not tell us anything useful. Finally, we will drop the ‘Name’ column because we have already extracted the titles from it.

data.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1, inplace=True)

data.head()

So far we have engineered our data and added new features such as ‘Title’ and ‘Has_Cabin’. Now we will handle the missing values. To do that, we need to look inside our data to find out whether any values are missing.

data.info()

The output above shows that there are 1309 entries in total. Looking at the Age, Fare, and Embarked columns, Age has 263 missing values, Fare has 1 missing value, and Embarked has 2 missing values. We will fill the missing Age and Fare values with the median of each column using the following code. The missing Embarked values will be filled with ‘S’, which stands for Southampton.
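Before applying this to the full data, the same fill strategy can be tried on a toy frame. This is a sketch with made-up values; it also fills the categorical column with its most frequent value rather than a hard-coded 'S', which is a variation introduced here and not what the article's code does.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Fare": [7.25, 8.05, np.nan, 71.28],
    "Embarked": ["S", None, "C", "S"],
})

# Numeric columns: fill gaps with the column median (median() skips NaN).
for col in ["Age", "Fare"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Categorical column: fill gaps with the most frequent value (assumption:
# the mode is a reasonable default for this column).
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy.isna().sum())
```

The median is less sensitive to extreme values than the mean, which is why it is a common default for skewed columns such as fares.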

data['Age'] = data.Age.fillna(data.Age.median())

data['Fare'] = data.Fare.fillna(data.Fare.median())

data['Embarked'] = data['Embarked'].fillna('S')

data.info()

Now you can see in the output above that we have no missing values. That means we can bin the numerical data: numerical values can fluctuate, and binning reduces the effect of minor observation errors.

data['CatAge'] = pd.qcut(data.Age, q=4, labels=False)

data['CatFare'] = pd.qcut(data.Fare, q=4, labels=False)

data.head()

We have binned the ‘Age’ and ‘Fare’ columns and stored the results as new columns, ‘CatAge’ and ‘CatFare’. We specified q=4, so each column is split into four quantile-based bins, and passengers whose age or fare falls within the same range end up in the same bin. We also set the labels argument to False, so the bins are encoded as integers.
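To see what pd.qcut produces, here is a small sketch on made-up ages. The retbins=True argument is an addition for illustration; it returns the bin edges alongside the integer codes so the cut points can be inspected.

```python
import pandas as pd

ages = pd.Series([2, 10, 18, 25, 30, 38, 50, 70])  # made-up ages

# labels=False returns integer bin codes; retbins=True also returns the edges.
codes, edges = pd.qcut(ages, q=4, labels=False, retbins=True)

print(codes.tolist())
print(edges)
```

Because the bins are quantile-based, each of the four codes covers roughly the same number of rows, regardless of how the raw values are spread.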

As you can see, the information from the ‘Age’ and ‘Fare’ columns is now summarized by the new ‘CatAge’ and ‘CatFare’ columns, so we can drop Age and Fare because they are no longer useful to us.

data = data.drop(['Age', 'Fare'], axis=1)

data.head()

data['Fam_Size'] = data.Parch + data.SibSp

data = data.drop(['SibSp', 'Parch'], axis=1)

data.head()

Here we have merged the two columns into one, Fam_Size, which tells us how many family members each passenger had onboard. We then dropped the original ‘SibSp’ and ‘Parch’ columns.

4) Transformation of Variables:

Now we will transform our variables into numerical ones, because the machine learning model expects numerical data. For that purpose, we use the pandas function get_dummies(), which converts the categorical columns into numerical indicator columns.

data_dum = pd.get_dummies(data, drop_first=True)

data_dum.head()
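To see on a small scale what get_dummies with drop_first=True does, consider the sketch below. The three-row frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Text columns become indicator columns; numeric columns pass through.
# drop_first=True drops the first category of each encoded column.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first category removes a redundant column: if a passenger is not ‘Embarked_Q’ and not ‘Embarked_S’, they must have embarked at ‘C’.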

Finally, it’s time to build a model on our newly engineered data. We will train a decision tree, using a grid search with five-fold cross-validation. The training data is split into five groups: first we hold out the first group and train the model on the remaining four, then we hold out the second group and train on the rest, and so on for the third, fourth, and fifth groups. We use this cross-validation to choose the best max_depth for the dataset.
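The hold-one-group-out scheme described above can be made visible with scikit-learn's KFold on ten made-up samples. This is only an illustration of the splitting; for a classifier, GridSearchCV with cv=5 uses a stratified variant that also balances the classes across the groups.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(10, 1)  # ten made-up samples

kfold = KFold(n_splits=5)
folds = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_demo), start=1):
    # Each group is held out once while the other four are used for training.
    folds.append((train_idx.tolist(), val_idx.tolist()))
    print(f"fold {fold}: hold out {val_idx.tolist()}, train on {train_idx.tolist()}")
```

Every sample is held out exactly once, so the averaged score reflects performance on data the model did not see during training.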

data_train = data_dum.iloc[:891]

data_test = data_dum.iloc[891:]

# Transform the data into arrays for scikit-learn

X = data_train.values

test = data_test.values

y = survived_data_train.values

dep = np.arange(1,9)

param_grid = {'max_depth': dep}

clf = tree.DecisionTreeClassifier()

clf_cv = GridSearchCV(clf, param_grid=param_grid, cv=5)

clf_cv.fit(X, y)

print("Accuracy w.r.t. Decision Tree Parameters: {}".format(clf_cv.best_params_))

print("Best score is {}".format(clf_cv.best_score_))

What happens if we train our model without the feature engineering we have done so far? The accuracy of the model will be lower. Let’s see what happens.
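The code for this comparison is not shown in the article. One way such a comparison might be set up is sketched below: a hypothetical helper, best_cv_accuracy, runs the same grid search on any feature table, so it can be called once with the raw columns and once with the engineered ones. The data here is random and synthetic, not the Titanic set, so the printed scores say nothing about the real result.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV

def best_cv_accuracy(features, target, max_depths=range(1, 9), folds=5):
    """Grid-search max_depth for a decision tree; return the best mean CV accuracy."""
    X = pd.get_dummies(features, drop_first=True).astype(float).values
    search = GridSearchCV(
        tree.DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": list(max_depths)},
        cv=folds,
    )
    search.fit(X, target)
    return search.best_score_

# Synthetic stand-in data (random, not the Titanic set), just to show the call.
rng = np.random.default_rng(0)
n = 100
raw = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "SibSp": rng.integers(0, 3, size=n),
    "Parch": rng.integers(0, 3, size=n),
})
target = (raw["Sex"] == "female").astype(int).values

# "Engineered" version: merge the two family columns as done earlier.
engineered = raw.assign(Fam_Size=raw["SibSp"] + raw["Parch"]).drop(
    columns=["SibSp", "Parch"]
)

baseline_score = best_cv_accuracy(raw, target)
engineered_score = best_cv_accuracy(engineered, target)
print(baseline_score, engineered_score)
```

Using one helper for both runs keeps the model and the search grid identical, so any difference in score comes from the features alone.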

We can see that the accuracy of the model drops from 82% to 79%. This shows how important feature engineering is and how strongly it affects the accuracy and performance of the model.

Conclusion:

In this article, we discussed five tips for feature engineering and walked through the process step by step. There are many other tips and techniques out there that can also affect the performance of a machine learning model. We have tried to explain these techniques simply enough for a beginner to follow, and we improved the model’s accuracy with their help. This accuracy could be improved further, but our main goal was to show how important feature engineering is for getting accurate models out of large and complex data.
