
RESUME CLASSIFICATION: A MULTI-CLASSIFICATION NLP PROJECT

Natural Language Processing (NLP) has gained popularity for multiple reasons and it is an exciting technology that is here to stay for a long time. NLP deals with machines understanding the way humans speak and write the language in their everyday lives. If this Artificial Intelligence (AI) subdomain grabs your attention, you can start with some textbook projects. In this article, I am going to go over one of the simple projects of that kind: classifying an applicant’s resume.

The conventional techniques of hiring a candidate for a position are becoming more labor intensive, and therefore less efficient, as online recruitment grows. Companies receive an excessive number of resumes across multiple categories for their vacant positions. Using NLP and Machine Learning (ML) techniques, categorizing applicants’ resumes for the available positions can be automated. In this article, let’s develop a simplified version of such a multi-class classification in Python using NLP. We will briefly go over the following steps:

1- Acquiring data

2- Cleaning/preprocessing/exploring text data

3- Vectorizing text data

4- Developing the ML algorithm

As a note, performance evaluation/comparison of the results will not be covered here since the main goal is to demonstrate basic NLP steps. Assuming you already have a Python environment, let’s begin by importing pandas and matplotlib.

import pandas as pd
import matplotlib.pyplot as plt

Handling the data

The first step in an ML project is to acquire the data. In this case, the data is conveniently provided to you in the form of a csv file. (Here is the link to the dataset)

Once we read the data (pd.read_csv()) into a dataframe df, we take a quick peek at it using the head and tail methods. This gives us a rough idea of what types of preprocessing our text data will need.

df = pd.read_csv('./data/UpdatedResumeDataset2.csv')
df.head()
df.tail()

It is always nice to see more of the text early in the game. For that reason, let’s set the maximum column width to 800 (pd.set_option()). Notice that the amount of text you can see below is increased. By peeking into the text, we already notice non-alphanumeric characters, numbers, and backslashes that need to be cleaned.

There are two columns of data (df.columns), and both are text fields: Category and Resume. To process the data with our machine learning model, we need to convert both of these text columns into numbers.

pd.set_option('max_colwidth', 800)
df.tail()
df.columns

Let’s see how many different classes of resumes we have in our dataset using the value_counts() method of the pandas library.

len(df.Category.value_counts())

Out of 25 different categories, the Java Developer class is the largest with 84 resumes, whereas the Advocate class is the smallest with only 20. We have 40 resumes in the Data Science class. Our goal is to develop a machine learning classifier that correctly predicts the class of an applicant’s resume. Since we have 25 different categories, we will develop a multi-class classification algorithm.

df.Category.value_counts()

To get information about non-null values and memory usage, we can rely on the info() method of pandas. We do not have any null values.

df.info()

It is always a good idea to visualize your findings. We will use a seaborn countplot to see the number of resumes in each category.

import seaborn as sns

plt.figure(figsize=(8,6))
sns.countplot(y="Category", palette="Set3", data=df, order=df['Category'].value_counts().index);

Text Cleaning

We need to apply some data cleaning before vectorizing our text data. Just a quick reminder, in NLP, we need to convert text data to numbers before applying any machine learning. That process is called the vectorization of the text data.

Let’s have a quick look at one of the resumes to better understand what types of steps we need to perform.

df["Resume"][1]

After carefully looking at the text data, we notice a lot of punctuation, numbers, non-ASCII characters, and extra white space that need to be removed. For a list of punctuation marks, you can import punctuation from the string module and use that list of characters in your cleaning function.

from string import punctuation
print(punctuation)

During text cleaning, we mainly find and replace patterns in the text. This process is easily handled by regular expressions. Simply put, a regular expression is an “instruction” given to a function describing what to match, search for, or replace in a set of strings.

As a side note, regular expressions should be understood well if you want to be good at NLP, since they are used in various tasks such as:

  • Data pre-processing,
  • Rule-based information mining systems,
  • Pattern matching,
  • Text feature engineering,
  • Web scraping,
  • Data extraction, etc.
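
To make that find-and-replace idea concrete, here is a tiny, self-contained re.sub example (separate from the cleaning function we build below):

import re
# replace every run of digits with nothing: "python3 and spark2" becomes "python and spark"
print(re.sub(r'[0-9]+', '', 'python3 and spark2'))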

Python’s built-in re module handles regular expressions. In our cleanResume function, we mostly use the re.sub method to substitute one piece of text with another, and re.escape to escape the punctuation marks so that we can eventually eliminate them from the text. In cleanResume, we remove punctuation, non-ASCII characters, extra whitespace, and numbers as basic text cleaning for now. For normalization purposes, we convert all letters to lowercase as well.

import re

def cleanResume(resumeText):
    resumeText = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), ' ', resumeText)  # remove punctuation
    resumeText = re.sub(r'[^\x00-\x7f]', r' ', resumeText)  # remove non-ascii characters
    resumeText = re.sub(r'\s+', ' ', resumeText)  # remove extra whitespace
    resumeText = re.sub(r'[0-9]+', '', resumeText)  # remove numbers
    return resumeText.lower()

Now, we can clean the text. All we need to do is to define a lambda function, apply the cleanResume function to each resume, and save the cleaned text in a new column “Cleaned Resume”.

df["Cleaned Resume"] = df["Resume"].apply(lambda x: cleanResume(x))len(df["Cleaned Resume"][1])
df.head(1)

Word Clouds

To explore the text data visually, let’s create a word cloud for the data science category. This will help us see the most frequently used words. A word cloud is a simple yet powerful visual representation tool in text processing: the larger and bolder a word’s font, the more frequent that word is in the data science category.

Before applying the word cloud, it is always a good idea to remove the stop words. Stop words are frequently used words in a language regardless of the context; because they are so common, they carry little information. To better tune your algorithm, you can also create a list of your own stop words (a quick sketch of this follows after the imports below). For now, we are going to use the NLTK stop word list, which includes 179 English words.

If it is your first time using NLTK and you get the error “NLTK stop words not found”, make sure to download the stop words after importing nltk by uncommenting the nltk.download line below.

import nltk
# nltk.download('stopwords')
# nltk.download('punkt')  # also needed by word_tokenize below
import string
from nltk.corpus import stopwords
from nltk import word_tokenize

len(stopwords.words('english'))
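
As mentioned above, you can extend the standard list with your own stop words. Here is a minimal sketch, assuming the stopwords corpus is already downloaded; the extra words are only placeholders:

my_stopwords = set(stopwords.words('english'))
my_stopwords.update({'etc', 'eg'})  # hypothetical additions that carry little signal in your domain
print(len(my_stopwords))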

Since we are interested in the data science category only, we save the data science resumes in a separate data frame, ds_df. The for loop tokenizes each resume using the word_tokenize() method of nltk and removes the stop words and punctuation from the text.

ds_df = df[df.Category == 'Data Science']
resumes = ""
total_words = []

for resume in ds_df["Cleaned Resume"]:
    resumes += resume
    words = word_tokenize(resume)
    for word in words:
        if word not in set(stopwords.words('english')) and word not in string.punctuation:
            total_words.append(word)

To visualize the words in a word cloud, all we need to do is use the WordCloud class from the wordcloud package and call its generate method on the resumes.

from wordcloud import WordCloud

wordcloudimage = WordCloud(
    font_step=2,
    max_font_size=500,
    collocations=False,
    # collocation_threshold=1
).generate(resumes)

plt.figure(figsize=(15,15))
plt.imshow(wordcloudimage, interpolation='bilinear')
plt.axis("off")
plt.show()

In addition to the word cloud, you can also look at the frequency distribution of each word in the text, that is, the number of times a word appears in total_words. The FreqDist function is available in the nltk library, and total_words is fed into it as an argument. The most frequent word is ‘data’, with a frequency of 396. During this analysis, it becomes obvious that some words, such as ‘months’, ‘experience’, ‘year’, and ‘less’, are not specific to the category; they are most probably frequent in the other categories as well. For that reason, you might want to consider removing them from your feature set before classification.

freq_word = nltk.FreqDist(total_words)
freq_word.most_common(10)
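
As a quick, hypothetical sketch of that last idea, you could drop such category-agnostic words from total_words before building features; the word set below is only an example:

generic_words = {'months', 'experience', 'year', 'less'}  # assumed to carry little category signal
filtered_words = [w for w in total_words if w not in generic_words]
nltk.FreqDist(filtered_words).most_common(10)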

Converting label field to numbers

Our label field, Category, is text and it needs to be converted to numbers as well. One way to do that is to import LabelEncoder from the sklearn.preprocessing module. After fitting and transforming the Category column, we have integer labels. Now, all we need to do is vectorize our text, which means converting the text itself into numbers as well.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['Labels'] = encoder.fit_transform(df.Category)

To check whether we successfully converted our labels into integers, let’s run the info() method of pandas one more time.

df.info()
df.Labels.value_counts()[:5]

df.sample(2) will help us have a look at two random resumes and their labels.

df.sample(2)

Now, the next thing is to split the data into train and test sets before vectorizing, using train_test_split() from sklearn.model_selection. We keep the ratio at 0.75 vs 0.25. Since our dataset is not balanced, we stratify the split.

from sklearn.model_selection import train_test_split

text = df["Cleaned Resume"].values
labels = df["Labels"].values

text_train, text_test, y_train, y_test = train_test_split(text, labels, random_state=0, test_size=0.25, stratify=df.Labels)

print(text_train.shape)
print(y_train.shape)
print(text_test.shape)
print(y_test.shape)

Vectorization

The simplest form of text vectorization is the Bag of Words (BoW) model. The sklearn library makes BoW very easy to apply with CountVectorizer, TfidfVectorizer, and TfidfTransformer. Let’s use TfidfVectorizer with the default tokenizer and remove the English stop words.

from sklearn.feature_extraction.text import TfidfVectorizer

word_vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True, stop_words='english', max_features=1000)

As a rule of thumb, after we instantiate the vectorizer, we fit it ONLY on the train dataset to create the vocabulary. After fitting the train data, a dictionary of words and their matching indices is created and saved in vocabulary_. We then transform both the train and the test set to produce the tf-idf vectors.

Since we limited max_features, the resulting train matrix has 1000 columns (features, or words) and 769 rows (documents).

X_train = word_vectorizer.fit_transform(text_train)
X_train.shape

X_test = word_vectorizer.transform(text_test)
X_test.shape

import itertools
dict(itertools.islice(word_vectorizer.vocabulary_.items(), 10))

Multi-class Classification

Now, we have the vectorized train and test datasets. All we need to do is to use a multiclass classifier on our dataset.

One such algorithm is One-vs-Rest. Let’s use One-vs-Rest with different binary classifiers, such as KNeighborsClassifier and MultinomialNB.

from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(MultinomialNB()).fit(X_train, y_train)
prediction_mnb = clf.predict(X_test)
print('Accuracy of MultinomialNB Classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of MultinomialNB Classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
from sklearn.neighbors import KNeighborsClassifier

model = OneVsRestClassifier(KNeighborsClassifier()).fit(X_train, y_train)
prediction_knc = model.predict(X_test)
print('Accuracy of KNeighbors Classifier on training set: {:.2f}'.format(model.score(X_train, y_train)))
print('Accuracy of KNeighbors Classifier on test set: {:.2f}'.format(model.score(X_test, y_test)))

The initial results from both classifiers look promising in terms of accuracy. However, this dataset is not balanced, since each class is not represented equally well.

There are definitely some areas of improvement in this implementation. For instance, our dataset is imbalanced and we have not done anything to handle that so far. We could also try hyperparameter tuning with GridSearchCV for each classifier.
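
As an illustration of the second point, here is a minimal GridSearchCV sketch (not part of the original pipeline) that tunes the Naive Bayes smoothing parameter inside the One-vs-Rest wrapper; the parameter grid is just an example:

from sklearn.model_selection import GridSearchCV

param_grid = {'estimator__alpha': [0.01, 0.1, 0.5, 1.0]}  # hypothetical values to try
grid = GridSearchCV(OneVsRestClassifier(MultinomialNB()), param_grid, cv=5, scoring='f1_macro')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)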

Lastly, let’s check the classification reports and see the precision, recall, and f1-scores for each class under each classifier.

from sklearn import metrics
print(metrics.classification_report(y_test, prediction_mnb))

print(metrics.classification_report(y_test, prediction_knc))

# map each integer label back to its category name
l1 = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
list(zip(encoder.classes_, l1))
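
Alternatively (this is not in the original notebook), LabelEncoder’s inverse_transform maps predicted integers straight back to category names:

print(encoder.inverse_transform(prediction_mnb[:5]))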

Nihal Sahan

This code is available at

Follow our publication MagniData for more!

Subscribe to receive our top stories here.

Join our new Slack community: AI-ML-DataScience-Lovers
