PDF to Text Conversion Script

Installation Instructions

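To run this script on Ubuntu, install the following dependencies. pdf2image relies on Poppler's command-line tools to render PDF pages as images, and pytesseract is a wrapper around the Tesseract OCR engine: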

Step 1: Update the package list

sudo apt update

Step 2: Install Poppler and Tesseract OCR

sudo apt install poppler-utils tesseract-ocr

Step 3: Install Required Python Libraries

python3 -m pip install pdf2image pytesseract Pillow
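
Before running the OCR script, it can help to confirm that the three installation steps above actually took effect. The following is a minimal sketch, not part of the original script. The function names missing_tools and missing_modules are hypothetical, and it assumes that pdftoppm (shipped in poppler-utils) and tesseract are the executables to look for, and that Pillow is imported under the name PIL.

```python
import importlib.util
import shutil


def missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the command-line tools from `tools` that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=("pdf2image", "pytesseract", "PIL")):
    """Return the top-level Python modules from `modules` that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Collect everything that is missing and report it in one line
    problems = missing_tools() + missing_modules()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All dependencies found.")
```

If anything is reported as missing, repeat the matching installation step and run the check again.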

Python Script

Use the following Python script to convert a PDF file to text using OCR:

#!/usr/bin/env python3

from pdf2image import convert_from_path
import pytesseract

#-----------------------------------------------------------------#
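# Path to the Tesseract executable. The Ubuntu package installs it at
# /usr/bin/tesseract; change this if Tesseract lives elsewhere on your system.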
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

#-----------------------------------------------------------------#
def pdf_to_text_with_ocr(pdf_file, text_file):
    """Run OCR on every page of pdf_file and write the text to text_file."""
    # Convert PDF pages to images
    images = convert_from_path(pdf_file)

    # Initialize empty string to store the text
    text_content = ""

    # Extract text from each image using OCR
    for image in images:
        text_content += pytesseract.image_to_string(image)

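    # Save the extracted text to a file. UTF-8 is set explicitly so that
    # non-ASCII characters are written correctly whatever the system locale.
    with open(text_file, 'w', encoding='utf-8') as output_file: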
        output_file.write(text_content)

    print(f"Text extracted and saved to {text_file}")

#-----------------------------------------------------------------#

if __name__ == "__main__":
    # Input PDF and output text file; edit these names to match your files
    pdf_to_text_with_ocr('Abram-Hoffer-Orthomolecular-Medicine.pdf',
                         'Abram-Hoffer-Orthomolecular-Medicine.txt')
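
The script above renders every page into memory before any OCR starts, which can be heavy for a long scanned book. The following is a hedged sketch of one alternative: converting and recognizing a few pages at a time. It is not part of the original script. The names page_ranges and pdf_to_text_in_batches are hypothetical, the batch size of 10 and the 300 dpi setting are arbitrary choices, and it assumes that pdf2image's pdfinfo_from_path reports the page count under the "Pages" key.

```python
def page_ranges(total_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs covering all pages."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def pdf_to_text_in_batches(pdf_file, text_file, batch_size=10, dpi=300):
    """OCR pdf_file a few pages at a time and write the text to text_file."""
    # Imported here so page_ranges can be used without the OCR libraries
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    # Assumption: the page count is available under the "Pages" key
    total_pages = pdfinfo_from_path(pdf_file)["Pages"]

    with open(text_file, "w", encoding="utf-8") as output_file:
        for first, last in page_ranges(total_pages, batch_size):
            # Render only the pages in this batch
            images = convert_from_path(pdf_file, dpi=dpi,
                                       first_page=first, last_page=last)
            for image in images:
                output_file.write(pytesseract.image_to_string(image))
```

For example, page_ranges(25, 10) yields (1, 10), (11, 20) and (21, 25), so a 25-page PDF is processed in three passes and only one batch of images is held in memory at a time.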
    

Usage

To use this script:

  1. Save the Python script to a file (e.g., pdf_to_text.py).
  2. Make sure the dependencies listed above are installed.
  3. Put the PDF in the directory you run the script from, or edit the input and output file names in the call at the bottom of the script.

Run the script:

python3 pdf_to_text.py

This runs OCR on each page of the PDF and writes the recognized text to the output file named in the script.
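
If you would rather pass PDF names on the command line than edit the script, the following sketch shows one way to do it. It is an illustration under stated assumptions, not the original author's method. The helper output_path_for is hypothetical, each output file is assumed to sit next to its PDF with the extension changed to .txt, and the OCR function is repeated in compact form only so that the example is self-contained.

```python
import sys
from pathlib import Path


def output_path_for(pdf_file):
    """Return the PDF's path with its final extension replaced by .txt."""
    return str(Path(pdf_file).with_suffix(".txt"))


def pdf_to_text_with_ocr(pdf_file, text_file):
    """Same steps as the script above, in compact form."""
    # Imported here so output_path_for works without the OCR libraries
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_file)
    text = "".join(pytesseract.image_to_string(page) for page in pages)
    Path(text_file).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Every command-line argument is treated as the path to one PDF
    for pdf_file in sys.argv[1:]:
        text_file = output_path_for(pdf_file)
        pdf_to_text_with_ocr(pdf_file, text_file)
        print(f"Text extracted and saved to {text_file}")
```

With this variant saved as pdf_to_text.py, several files can be converted in one run:

python3 pdf_to_text.py first.pdf second.pdf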