Resume Filtering using NLP
Suppose you own a company and have just won a project for which two data scientists are required. You post the job on LinkedIn and receive 400 resumes. In such a scenario it is a hectic task to pick the resumes that fulfill your need. You might then wonder, "Is there a way to select the best of them without manually checking one by one?" This is exactly where Natural Language Processing (NLP) comes in handy: it provides the functionality to manipulate text at scale.
Problem statement:
Your company needs one candidate with Deep Learning as her/his core competency who also knows how Machine Learning algorithms work. The other candidate is required to have experience with Scala, AWS, Docker, Kubernetes, etc.
Approach to solve this:
After identifying the problem statement, you have to devise an approach that correctly addresses it, as follows:
1. Maintain a table or dictionary that groups skills into categories, i.e. if we encounter words like CNN, RNN, TensorFlow, or Keras, we segregate them under one column titled 'Deep Learning'.
2. Build an NLP pipeline that scans the entire resume and searches for the words mentioned in the table or dictionary.
3. Count the occurrences of the words belonging to each category, producing a per-candidate summary like the one below.
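Before looking at the full implementation, the three steps above can be sketched with a plain dictionary and `collections.Counter` (the category names and keywords below are illustrative, not the article's full keyword table):

```python
import re
from collections import Counter

# Hypothetical keyword dictionary: category -> skill phrases
SKILL_CATEGORIES = {
    "Deep Learning": ["cnn", "rnn", "tensorflow", "keras"],
    "Data Engineering": ["scala", "aws", "docker", "kubernetes"],
}

def count_category_hits(resume_text):
    """Count how many keyword occurrences fall under each category."""
    words = re.findall(r"[a-z0-9]+", resume_text.lower())
    counts = Counter()
    for category, keywords in SKILL_CATEGORIES.items():
        counts[category] = sum(words.count(k) for k in keywords)
    return counts

hits = count_category_hits(
    "Built CNN and RNN models in Keras, deployed on AWS with Docker")
print(hits)  # Deep Learning: 3, Data Engineering: 2
```

The real implementation below does the same thing, but uses spaCy's `PhraseMatcher` so that multi-word phrases (e.g. "deep learning") match reliably as well.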
Now, getting into the real task of implementing the approach programmatically, we need a few pre-built Python libraries, such as spaCy (for the NLP-related manipulations) and PyPDF2 (for reading the resumes).
First, import the required modules (libraries) and read the folder that comprises the resumes:

```python
# Required modules (libraries)
import os
from collections import Counter
from io import StringIO

import pandas as pd
import PyPDF2
import en_core_web_sm
from spacy.matcher import PhraseMatcher

nlp = en_core_web_sm.load()

# Read the folder that comprises the resumes
path = 'enter the path where you saved the resumes'
files = [os.path.join(path, f) for f in os.listdir(path)
         if os.path.isfile(os.path.join(path, f))]


def pdfextract(file):
    """Extract the text of every page of a PDF resume."""
    reader = PyPDF2.PdfReader(open(file, 'rb'))
    text = []
    for page in reader.pages:
        text.append(page.extract_text())
    return text
```

Next, build the phrase matching and the candidate profile:

```python
def create_profile(file):
    text = str(pdfextract(file))
    text = text.replace("\\n", "").lower()

    # 'template_new.csv' holds all the keywords; you can customize your own
    dict_of_keywords = pd.read_csv('template_new.csv')
    categories = {            # matcher label -> CSV column
        'Stats': 'Statistics',
        'NLP': 'NLP',
        'ML': 'Machine Learning',
        'DL': 'Deep Learning',
        'R': 'R Language',
        'Python': 'Python Language',
        'DE': 'Data Engineering',
    }
    matcher = PhraseMatcher(nlp.vocab)
    for label, column in categories.items():
        words = [nlp(w) for w in dict_of_keywords[column].dropna(axis=0)]
        matcher.add(label, words)  # spaCy v3 signature; v2 was add(label, None, *words)

    doc = nlp(text)
    d = []
    for match_id, start, end in matcher(doc):
        rule_id = nlp.vocab.strings[match_id]  # category label, e.g. 'DL'
        span = doc[start:end]                  # the matched slice of the doc
        d.append((rule_id, span.text))
    key_words = "\n".join(f'{i[0]} {i[1]} ({j})' for i, j in Counter(d).items())

    # Dataframe from the string of counted words
    dataframe1 = pd.read_csv(StringIO(key_words), names=['Keywords_List'])
    dataframe2 = pd.DataFrame(dataframe1.Keywords_List.str.split(' ', n=1).tolist(),
                              columns=['Subject', 'Keyword'])
    dataframe3 = pd.DataFrame(dataframe2.Keyword.str.split('(', n=1).tolist(),
                              columns=['Keyword', 'Count'])
    dataframe4 = pd.concat([dataframe2['Subject'], dataframe3['Keyword'],
                            dataframe3['Count']], axis=1)
    dataframe4['Count'] = dataframe4['Count'].apply(lambda x: x.rstrip(")"))

    # The candidate's name is taken from the file name, e.g. 'smith_resume.pdf' -> 'smith'
    base = os.path.basename(file)
    filename = os.path.splitext(base)[0]
    name = filename.split('_')[0].lower()
    name_df = pd.read_csv(StringIO(name), names=['Candidate Name'])
    dataf = pd.concat([name_df['Candidate Name'], dataframe4['Subject'],
                       dataframe4['Keyword'], dataframe4['Count']], axis=1)
    dataf['Candidate Name'].fillna(dataf['Candidate Name'].iloc[0], inplace=True)
    return dataf
```

Finally, call the functions above on every resume, count the words under each category, and visualize the result using Matplotlib:

```python
final_database = pd.DataFrame()
for file in files:
    dat = create_profile(file)
    final_database = pd.concat([final_database, dat], ignore_index=True)
print(final_database)

# Count the words under every category
final_database2 = (final_database['Keyword']
                   .groupby([final_database['Candidate Name'],
                             final_database['Subject']])
                   .count().unstack())
final_database2.reset_index(inplace=True)
final_database2.fillna(0, inplace=True)
new_data = final_database2.iloc[:, 1:]
new_data.index = final_database2['Candidate Name']
# To see the candidate profiles in CSV format, uncomment the line below
# new_data.to_csv('sample.csv')

import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 10})
axis = new_data.plot.barh(title="Resume keywords by category", legend=False,
                          figsize=(25, 7), stacked=True)
labels = []
for j in new_data.columns:
    for i in new_data.index:
        labels.append(f'{j}: {new_data.loc[i][j]}')
for label, rect in zip(labels, axis.patches):
    width = rect.get_width()
    if width > 0:
        x = rect.get_x()
        y = rect.get_y()
        height = rect.get_height()
        axis.text(x + width / 2., y + height / 2., label,
                  ha='center', va='center')
plt.show()
```
Now that we have the whole code, I would like to emphasize two things.
1. The Keywords csv
The keywords CSV can be identified in the code as 'template_new.csv'.
If you want, you can replace it with any database of your choice (making the required changes in the code), but CSV format keeps things simple. Below is the table of words used to do the phrase matching against the resumes.
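To make the expected shape of that file concrete, here is a toy, cut-down version of 'template_new.csv' built in memory (the real file has more columns and keywords; these entries are illustrative):

```python
from io import StringIO
import pandas as pd

# A toy version of 'template_new.csv': one column per category,
# each row a keyword. Columns may have different lengths, which
# is why the code calls .dropna() before building the matcher.
csv_text = """Deep Learning,Data Engineering
cnn,scala
rnn,aws
tensorflow,docker
keras,kubernetes
"""
template = pd.read_csv(StringIO(csv_text))
dl_words = template['Deep Learning'].dropna().tolist()
print(dl_words)  # ['cnn', 'rnn', 'tensorflow', 'keras']
```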
2. The Candidate — Keywords table
new_data.to_csv('sample.csv')
The above code snippet produces a CSV file that shows each candidate's keyword counts per category. Here is how it looks.
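The pivot from one-row-per-match to one-row-per-candidate is done by the `groupby(...).count().unstack()` chain in the code. A tiny self-contained example (with made-up candidates and matches) shows what that produces:

```python
import pandas as pd

# Toy match table in the same shape as final_database:
# one row per (candidate, category, keyword) match
df = pd.DataFrame({
    'Candidate Name': ['alice', 'alice', 'alice', 'bob'],
    'Subject':        ['DL',    'DL',    'DE',    'DE'],
    'Keyword':        ['keras', 'cnn',   'aws',   'docker'],
})

# Count matches per (candidate, category), then pivot the
# categories into columns; missing combinations become 0
table = (df['Keyword']
         .groupby([df['Candidate Name'], df['Subject']])
         .count().unstack().fillna(0))
print(table)
```

Each row is now a candidate and each column a category, which is exactly the shape the stacked bar chart below needs.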
Below is the data visualization produced with Matplotlib.
'DE' stands for Data Engineering; the other labels are self-explanatory. This approach has three main advantages:
1. Automatic reading of resume
Instead of manually scrutinizing resumes one by one, the program automatically opens each resume and parses its content. Done manually, this would take a lot of time.
2. Phrase matching and categorization
It would be very difficult to check all the resumes manually and say whether a person has expertise in Data Engineering or Machine Learning, because we do not keep a count of phrases while reading. The code, on the other hand, hunts for the keywords, keeps a tab on their occurrences, and then categorizes them.
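This keyword hunting is what spaCy's `PhraseMatcher` does in `create_profile`. A minimal standalone demonstration, using a blank English pipeline instead of `en_core_web_sm` (a blank pipeline only tokenizes, which is all phrase matching needs; the sentence and labels are made up):

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline is enough here; the full article loads en_core_web_sm
nlp = spacy.blank("en")

# attr="LOWER" makes the matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DL", [nlp(w) for w in ["deep learning", "tensorflow", "keras"]])

doc = nlp("Built Deep Learning models with TensorFlow and Keras.")
hits = [(nlp.vocab.strings[mid], doc[s:e].text) for mid, s, e in matcher(doc)]
print(hits)
```

Note that the multi-word phrase "deep learning" is matched as a single hit, which a naive word-by-word count would miss.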
3. Data Visualization
Data visualization is a key aspect, as it speeds up the decision-making process in the following ways:
a. We can see which candidate has more keywords under a particular category, letting us infer that she/he might have extensive experience in that area.
b. We can compare candidates relative to one another, thereby helping us filter out the candidates that don't meet our requirements.