Python Program to read the PDF Files

May 8, 2020

A PDF file can be read in Python with PyPDF2. PyPDF2 is a Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.

It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together.

Step 0: Before we start working on the program, make sure the two libraries below are installed. If they are not, install them with the following commands.

pip install PyPDF2
pip install textract
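To confirm that both installs worked before moving on, a quick check like the one below can help. It is only a sketch using the standard library; the helper name is_installed is made up for this example and is not part of PyPDF2 or textract.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when the module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# Module names taken from the pip commands in Step 0.
for name in ("PyPDF2", "textract"):
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing, run: pip install " + name)
```

If either line reports "missing", rerun the matching pip command in the same Python environment you use to run the program.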

Step 1: First, import the PyPDF2 library.

import PyPDF2

Step 2: Create a file object with the built-in “open” function, opening the PDF in binary read mode ('rb'), and pass that file object to PdfFileReader.

pdfFileObj = open('C:\\Users\\admin\\Sample.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
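The 'rb' mode matters because a PDF is binary data, not plain text. As an optional sanity check before handing a file to PdfFileReader, you could confirm that it starts with the usual PDF signature "%PDF-". The sketch below uses only the standard library and assumes the signature sits at the very start of the file, which is the normal case. The helper name looks_like_pdf and the throwaway demo files are made up for illustration.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file at `path` begins with the bytes b'%PDF-'."""
    # 'rb' opens the file as raw bytes, the same mode Step 2 uses.
    # The with-block closes the file automatically when it ends.
    with open(path, "rb") as pdf_file:
        return pdf_file.read(5) == b"%PDF-"


# Demo on two throwaway files so the sketch runs without a real PDF.
with tempfile.TemporaryDirectory() as tmp_dir:
    good = os.path.join(tmp_dir, "good.pdf")
    bad = os.path.join(tmp_dir, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.4\n")
    with open(bad, "wb") as f:
        f.write(b"plain text, not a PDF")
    good_result = looks_like_pdf(good)
    bad_result = looks_like_pdf(bad)

print(good_result, bad_result)
```

The same with-block pattern can also be used around the open call in Step 2, as long as all reading from pdfReader happens inside the block, while the file is still open.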

Step 3: You can get the number of pages in the PDF document with “numPages” and the document information (metadata such as the title and author) with “documentInfo”.

print(pdfReader.numPages)
print(pdfReader.documentInfo)
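The document information prints as a dictionary whose keys usually start with a slash, such as '/Author'. If you want a tidier printout, a small helper along these lines may be useful. It is a sketch, not part of PyPDF2: describe_pdf is a hypothetical name, and it works on any page count and dictionary-like object, so the sample values below are made up for the demo.

```python
def describe_pdf(num_pages, info):
    """Build a list of printable lines from a page count and a metadata mapping."""
    lines = ["Pages: {}".format(num_pages)]
    # The metadata may be None when a PDF carries no document information.
    for key, value in (info or {}).items():
        # Strip the leading '/' that PDF metadata keys usually have.
        lines.append("{}: {}".format(key.lstrip("/"), value))
    return lines


# Made-up sample values standing in for pdfReader.numPages and pdfReader.documentInfo.
sample_info = {"/Author": "admin", "/Title": "Sample"}
for line in describe_pdf(3, sample_info):
    print(line)
```

With the reader from Step 2, the call would be describe_pdf(pdfReader.numPages, pdfReader.documentInfo).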

Step 4: Now, create a page object with getPage, passing the number of the page you want to extract text from as the argument. Page numbers start at 0, so getPage(0) returns the first page.