Python Program to Read PDF Files
A PDF file can be read in Python with PyPDF2. PyPDF2 is a Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
It can add custom data, viewing options, and passwords to PDF files. It can also retrieve text and metadata from PDFs, as well as merge entire files together.
Step 0: Before we start working with the program, let us make sure the two libraries below are installed. If they are not, install them with the following commands.
pip install PyPDF2
pip install textract
Step 1: First, we have to import the PyPDF2 library.
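As a minimal sketch of Step 1, the snippet below checks that PyPDF2 can be found before importing it, so a missing installation gives a readable message instead of a traceback. The helper name pypdf2_installed is hypothetical and not part of PyPDF2.

```python
import importlib.util


def pypdf2_installed():
    """Return True if the import system can locate the PyPDF2 package."""
    # find_spec returns None when no top-level package of that name exists.
    return importlib.util.find_spec("PyPDF2") is not None


if pypdf2_installed():
    import PyPDF2  # the import that Step 1 asks for
    print("PyPDF2 is available")
else:
    print("PyPDF2 is missing; run: pip install PyPDF2")
```

In a script where the library is known to be installed, the single line import PyPDF2 is all that is needed.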
Step 2: Create a Python file object with the built-in open() function, using binary read mode ('rb'), and pass that file object to PdfFileReader.
pdfFileObj = open('C:\\Users\\admin\\Sample.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
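The doubled backslashes in the path above each stand for one literal backslash. As a small illustration that uses only the standard library, the sketch below writes the same path as a raw string and inspects it with pathlib. The variable names are illustrative and do not come from the original program.

```python
from pathlib import PureWindowsPath

# Escaped form used in Step 2: each "\\" is one literal backslash.
escaped_path = 'C:\\Users\\admin\\Sample.pdf'

# Raw-string form: backslashes are taken literally, so no doubling is needed.
raw_path = r'C:\Users\admin\Sample.pdf'

# Both spellings produce the same string.
same_path = escaped_path == raw_path

# PureWindowsPath parses Windows-style paths on any operating system.
pdf_path = PureWindowsPath(raw_path)
file_name = pdf_path.name    # final path component
extension = pdf_path.suffix  # file extension, including the dot

print(same_path, file_name, extension)
```

Either spelling can be passed to open(); the raw string is usually easier to read for Windows paths.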
Step 3: You can get the number of pages in the PDF document from the reader's “numPages” attribute, and the document information (such as title and author) from its “documentInfo” attribute.
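Step 3 can be sketched as follows. The helper name pdf_summary is hypothetical, and the code uses the legacy attribute names from this article (numPages, documentInfo), so it assumes a PyPDF2 release older than 3.0.

```python
try:
    import PyPDF2  # installed in Step 0
except ImportError:
    PyPDF2 = None  # lets this sketch load even where the library is absent


def pdf_summary(path):
    """Return (page_count, info) for the PDF at `path`.

    Hypothetical helper; assumes the legacy PyPDF2 API (before 3.0).
    """
    # The file has to stay open while the reader is being used.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        page_count = reader.numPages
        doc_info = reader.documentInfo  # may be None if the PDF has no metadata
        # Convert the metadata to plain strings while the file is still open.
        info = {key: str(doc_info[key]) for key in doc_info} if doc_info else {}
    return page_count, info
```

Calling pdf_summary('C:\\Users\\admin\\Sample.pdf') would return the page count and a dictionary of metadata entries with keys such as '/Title' or '/Author', when the file defines them.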
Step 4: Now, we have to create the page object using getPage, passing as its argument the number of the page you want to extract text from. Page numbers start at 0.
pageObj = pdfReader.getPage(0)
Step 5: Finally, get the text out of the first page, which is indexed as ‘0’, by calling the extractText() method on the page object.
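Steps 2 to 5 can be combined into one small function, sketched below. The name first_page_text is hypothetical. The legacy names used in this article (PdfFileReader, getPage, extractText) were removed in PyPDF2 3.0 in favor of PdfReader, reader.pages, and extract_text(), so the sketch tries the newer names first and falls back to the older ones. It assumes that any release exposing PdfReader also exposes pages and extract_text().

```python
try:
    import PyPDF2  # installed in Step 0
except ImportError:
    PyPDF2 = None  # lets this sketch load even where the library is absent


def first_page_text(path, page_index=0):
    """Return the text of one page (default: the first) of the PDF at `path`."""
    # The with-statement closes the file even if extraction fails.
    with open(path, "rb") as pdf_file:
        if hasattr(PyPDF2, "PdfReader"):
            # Newer releases: snake_case API (assumption stated above).
            reader = PyPDF2.PdfReader(pdf_file)
            return reader.pages[page_index].extract_text()
        # Older releases: the names used in Steps 2 to 5.
        reader = PyPDF2.PdfFileReader(pdf_file)
        return reader.getPage(page_index).extractText()
```

For example, print(first_page_text('C:\\Users\\admin\\Sample.pdf')) would print the text of the first page. Scanned PDFs that contain only images may yield an empty string, since there is no text layer to extract.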