To use this script on Ubuntu, you need to install the following dependencies:
sudo apt update
sudo apt install poppler-utils tesseract-ocr
pip install pdf2image pytesseract Pillow
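Before running the script, it can help to confirm that the two command-line tools are actually on the PATH. The snippet below is a minimal sketch using only the standard library; the helper name missing_tools is hypothetical, and it assumes pdf2image relies on the pdftoppm binary that ships in poppler-utils.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are both on PATH")
```

If either name is reported as missing, re-running the apt install step above is the likely fix.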
Use the following Python script to convert a PDF file to text using OCR:
#!/usr/bin/env python3
from pdf2image import convert_from_path
import pytesseract
#-----------------------------------------------------------------#
# Specify the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
#-----------------------------------------------------------------#
def pdf_to_text_with_ocr(pdf_file, text_file):
    # Convert PDF pages to images
    images = convert_from_path(pdf_file)
    # Initialize empty string to store the text
    text_content = ""
    # Extract text from each image using OCR
    for image in images:
        text_content += pytesseract.image_to_string(image)
    # Save the extracted text to a file (UTF-8, since OCR output may contain non-ASCII characters)
    with open(text_file, 'w', encoding='utf-8') as output_file:
        output_file.write(text_content)
    print(f"Text extracted and saved to {text_file}")
#-----------------------------------------------------------------#
pdf_to_text_with_ocr('Abram-Hoffer-Orthomolecular-Medicine.pdf',
                     'Abram-Hoffer-Orthomolecular-Medicine.txt')
To use this script:
1. Save it to a file (for example, pdf_to_text.py) in the same directory as the PDF.
2. Run it from that directory:
python3 pdf_to_text.py
This will convert the PDF to a text file using OCR.
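The script above hard-codes the input and output file names. As a rough sketch of one way to avoid editing the file for every PDF, the variant below reads the names from the command line and, when no output name is given, reuses the PDF name with a .txt extension. The helpers default_output_path, parse_args and main are hypothetical names introduced for illustration; the OCR calls are the same two used in the original script.

```python
import argparse
from pathlib import Path


def default_output_path(pdf_file):
    """Return the PDF path with its extension replaced by .txt."""
    return str(Path(pdf_file).with_suffix(".txt"))


def parse_args(argv=None):
    """Parse the PDF path and an optional output path from the command line."""
    parser = argparse.ArgumentParser(description="OCR a PDF into a text file.")
    parser.add_argument("pdf_file", help="path to the input PDF")
    parser.add_argument("text_file", nargs="?", help="path to the output text file")
    args = parser.parse_args(argv)
    if args.text_file is None:
        args.text_file = default_output_path(args.pdf_file)
    return args


def pdf_to_text_with_ocr(pdf_file, text_file):
    """Convert each PDF page to an image, OCR it, and write the text out."""
    # Imported here so that --help still works if the packages are not installed yet
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_file)
    with open(text_file, "w", encoding="utf-8") as output_file:
        for image in images:
            output_file.write(pytesseract.image_to_string(image))


def main(argv=None):
    args = parse_args(argv)
    pdf_to_text_with_ocr(args.pdf_file, args.text_file)
    print(f"Text extracted and saved to {args.text_file}")
```

To run this as a script, call main() at the bottom of the file under the usual if __name__ == "__main__": guard, then invoke it as python3 pdf_to_text.py Abram-Hoffer-Orthomolecular-Medicine.pdf. Writing each page's text as it is produced, rather than building one large string first, is a design choice of this sketch and not part of the original script.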