Resume Filtering using NLP
Suppose you own a company and have just won a project for which two data scientists are required. You post the job on LinkedIn and receive 400 resumes. In such a scenario it is a hectic task to pick the resumes that fulfill your need. You might then wonder, "Is there a way to select the best of them without manually checking one by one?" This is exactly where Natural Language Processing (NLP) comes in handy: it provides the functionality to manipulate text at scale.
Problem statement:
Your company needs one candidate with Deep Learning as her/his core competency who also knows how Machine Learning algorithms work. The other candidate is required to have experience with Scala, AWS, Docker, Kubernetes, etc.
Approach to solve this:
After identifying the problem statement, you have to devise an approach that correctly addresses it, as follows:
1. Maintain a table or dictionary that groups skills into categories, i.e. if we encounter words like CNN, RNN, TensorFlow, or Keras, we segregate them under one column titled 'Deep Learning'.
2. Build an NLP pipeline that scans the entire resume and searches for the words mentioned in the table or dictionary.
3. Count the occurrences of the words belonging to each category, producing a per-candidate summary like the one below.
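Before looking at the full implementation, the three steps above can be sketched with a plain dictionary and `collections.Counter` (the category names and keywords below are illustrative, not the article's full keyword table):

```python
import re
from collections import Counter

# Hypothetical keyword dictionary: category -> skill phrases
SKILL_CATEGORIES = {
    "Deep Learning": ["cnn", "rnn", "tensorflow", "keras"],
    "Data Engineering": ["scala", "aws", "docker", "kubernetes"],
}

def count_category_hits(resume_text):
    """Count how many keyword occurrences fall under each category."""
    words = re.findall(r"[a-z0-9]+", resume_text.lower())
    counts = Counter()
    for category, keywords in SKILL_CATEGORIES.items():
        counts[category] = sum(words.count(k) for k in keywords)
    return counts

hits = count_category_hits(
    "Built CNN and RNN models in Keras, deployed on AWS with Docker")
print(hits)  # Deep Learning: 3, Data Engineering: 2
```

The real implementation below does the same thing, but uses spaCy's `PhraseMatcher` so that multi-word phrases (e.g. "deep learning") match reliably as well.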
Now, getting into the real task of implementing the approach programmatically, we need a few pre-built Python libraries, such as spaCy (for the NLP-related manipulations) and PyPDF2 (for reading the resumes).
First, import the required modules (libraries) and read the folder that comprises the resumes:

```python
# Required modules (libraries)
import os
from collections import Counter
from io import StringIO

import pandas as pd
import PyPDF2
import en_core_web_sm
from spacy.matcher import PhraseMatcher

nlp = en_core_web_sm.load()

# Read the folder that comprises the resumes
path = 'enter the path where you saved the resumes'
files = [os.path.join(path, f) for f in os.listdir(path)
         if os.path.isfile(os.path.join(path, f))]


def pdfextract(file):
    """Extract the text of every page of a PDF resume."""
    reader = PyPDF2.PdfReader(open(file, 'rb'))
    text = []
    for page in reader.pages:
        text.append(page.extract_text())
    return text
```

Next, build the phrase matching and the candidate profile:

```python
def create_profile(file):
    text = str(pdfextract(file))
    text = text.replace("\\n", "").lower()

    # 'template_new.csv' holds all the keywords; you can customize your own
    dict_of_keywords = pd.read_csv('template_new.csv')
    categories = {            # matcher label -> CSV column
        'Stats': 'Statistics',
        'NLP': 'NLP',
        'ML': 'Machine Learning',
        'DL': 'Deep Learning',
        'R': 'R Language',
        'Python': 'Python Language',
        'DE': 'Data Engineering',
    }
    matcher = PhraseMatcher(nlp.vocab)
    for label, column in categories.items():
        words = [nlp(w) for w in dict_of_keywords[column].dropna(axis=0)]
        matcher.add(label, words)  # spaCy v3 signature; v2 was add(label, None, *words)

    doc = nlp(text)
    d = []
    for match_id, start, end in matcher(doc):
        rule_id = nlp.vocab.strings[match_id]  # category label, e.g. 'DL'
        span = doc[start:end]                  # the matched slice of the doc
        d.append((rule_id, span.text))
    key_words = "\n".join(f'{i[0]} {i[1]} ({j})' for i, j in Counter(d).items())

    # Dataframe from the string of counted words
    dataframe1 = pd.read_csv(StringIO(key_words), names=['Keywords_List'])
    dataframe2 = pd.DataFrame(dataframe1.Keywords_List.str.split(' ', n=1).tolist(),
                              columns=['Subject', 'Keyword'])
    dataframe3 = pd.DataFrame(dataframe2.Keyword.str.split('(', n=1).tolist(),
                              columns=['Keyword', 'Count'])
    dataframe4 = pd.concat([dataframe2['Subject'], dataframe3['Keyword'],
                            dataframe3['Count']], axis=1)
    dataframe4['Count'] = dataframe4['Count'].apply(lambda x: x.rstrip(")"))

    # The candidate's name is taken from the file name, e.g. 'smith_resume.pdf' -> 'smith'
    base = os.path.basename(file)
    filename = os.path.splitext(base)[0]
    name = filename.split('_')[0].lower()
    name_df = pd.read_csv(StringIO(name), names=['Candidate Name'])
    dataf = pd.concat([name_df['Candidate Name'], dataframe4['Subject'],
                       dataframe4['Keyword'], dataframe4['Count']], axis=1)
    dataf['Candidate Name'].fillna(dataf['Candidate Name'].iloc[0], inplace=True)
    return dataf
```

Finally, call the functions above on every resume, count the words under each category, and visualize the result using Matplotlib:

```python
final_database = pd.DataFrame()
for file in files:
    dat = create_profile(file)
    final_database = pd.concat([final_database, dat], ignore_index=True)
print(final_database)

# Count the words under every category
final_database2 = (final_database['Keyword']
                   .groupby([final_database['Candidate Name'],
                             final_database['Subject']])
                   .count().unstack())
final_database2.reset_index(inplace=True)
final_database2.fillna(0, inplace=True)
new_data = final_database2.iloc[:, 1:]
new_data.index = final_database2['Candidate Name']
# To see the candidate profiles in CSV format, uncomment the line below
# new_data.to_csv('sample.csv')

import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 10})
axis = new_data.plot.barh(title="Resume keywords by category", legend=False,
                          figsize=(25, 7), stacked=True)
labels = []
for j in new_data.columns:
    for i in new_data.index:
        labels.append(f'{j}: {new_data.loc[i][j]}')
for label, rect in zip(labels, axis.patches):
    width = rect.get_width()
    if width > 0:
        x = rect.get_x()
        y = rect.get_y()
        height = rect.get_height()
        axis.text(x + width / 2., y + height / 2., label,
                  ha='center', va='center')
plt.show()
```
Now that we have the whole code, I would like to emphasize two things.
1. The Keywords csv
The keywords CSV can be identified in the code as 'template_new.csv'.
If you want, you can replace it with any database of your choice (making the required changes in the code), but CSV format keeps things simple. Below is the table of words used to do the phrase matching against the resumes.
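To make the expected shape of that file concrete, here is a toy, cut-down version of 'template_new.csv' built in memory (the real file has more columns and keywords; these entries are illustrative):

```python
from io import StringIO
import pandas as pd

# A toy version of 'template_new.csv': one column per category,
# each row a keyword. Columns may have different lengths, which
# is why the code calls .dropna() before building the matcher.
csv_text = """Deep Learning,Data Engineering
cnn,scala
rnn,aws
tensorflow,docker
keras,kubernetes
"""
template = pd.read_csv(StringIO(csv_text))
dl_words = template['Deep Learning'].dropna().tolist()
print(dl_words)  # ['cnn', 'rnn', 'tensorflow', 'keras']
```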
2. The Candidate — Keywords table
new_data.to_csv('sample.csv')
The above code snippet produces a CSV file that shows each candidate's keyword counts per category. Here is how it looks.
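The pivot from one-row-per-match to one-row-per-candidate is done by the `groupby(...).count().unstack()` chain in the code. A tiny self-contained example (with made-up candidates and matches) shows what that produces:

```python
import pandas as pd

# Toy match table in the same shape as final_database:
# one row per (candidate, category, keyword) match
df = pd.DataFrame({
    'Candidate Name': ['alice', 'alice', 'alice', 'bob'],
    'Subject':        ['DL',    'DL',    'DE',    'DE'],
    'Keyword':        ['keras', 'cnn',   'aws',   'docker'],
})

# Count matches per (candidate, category), then pivot the
# categories into columns; missing combinations become 0
table = (df['Keyword']
         .groupby([df['Candidate Name'], df['Subject']])
         .count().unstack().fillna(0))
print(table)
```

Each row is now a candidate and each column a category, which is exactly the shape the stacked bar chart below needs.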
Below is the data visualization produced with Matplotlib.
'DE' stands for Data Engineering; the other labels are self-explanatory. This approach has three main advantages:
1. Automatic reading of resume
Instead of manually scrutinizing resumes one by one, the program automatically opens each resume and parses its content. Done manually, this would take a lot of time.
2. Phrase matching and categorization
It would be very difficult to check all the resumes manually and say whether a person has expertise in Data Engineering or Machine Learning, because we do not keep a count of phrases while reading. The code, on the other hand, hunts for the keywords, keeps a tab on their occurrences, and then categorizes them.
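This keyword hunting is what spaCy's `PhraseMatcher` does in `create_profile`. A minimal standalone demonstration, using a blank English pipeline instead of `en_core_web_sm` (a blank pipeline only tokenizes, which is all phrase matching needs; the sentence and labels are made up):

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline is enough here; the full article loads en_core_web_sm
nlp = spacy.blank("en")

# attr="LOWER" makes the matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DL", [nlp(w) for w in ["deep learning", "tensorflow", "keras"]])

doc = nlp("Built Deep Learning models with TensorFlow and Keras.")
hits = [(nlp.vocab.strings[mid], doc[s:e].text) for mid, s, e in matcher(doc)]
print(hits)
```

Note that the multi-word phrase "deep learning" is matched as a single hit, which a naive word-by-word count would miss.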
3. Data Visualization
Data visualization is a key aspect, as it speeds up the decision-making process in the following ways:
a. We can see which candidate has more keywords under a particular category, letting us infer that she/he might have extensive experience in that area.
b. We can compare candidates relative to one another, thereby helping us filter out the candidates that don't meet our requirements.