About this project
Open
We are seeking a Python developer to write a script that automates the collection and processing of environmental permit data from a public state government website. The script must access a dynamic website, download the Notice of Intent (NOI) PDF that is revealed after clicking a 'View' button for each entry, and extract specific contact information from each PDF.
The script must handle PDFs that contain scanned or handwritten text, so it needs Optical Character Recognition (OCR). After extracting the data, the script should compile it into a structured Excel (.xlsx) file.
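One possible way to handle a mix of digital and scanned PDFs is to try the embedded text layer first and fall back to OCR only when little or no text comes back. The sketch below illustrates that decision only. The function name `extract_text`, the `min_chars` threshold, and the two callables `read_text_layer` and `run_ocr` are hypothetical placeholders for whichever PDF and OCR libraries are chosen; they are not part of any specific library.

```python
def extract_text(pdf_path, read_text_layer, run_ocr, min_chars=40):
    """Return (text, source) for one PDF.

    read_text_layer and run_ocr are caller-supplied functions (assumptions):
    each takes a file path and returns a string. min_chars is an arbitrary
    threshold below which the text layer is treated as missing.
    """
    try:
        text = read_text_layer(pdf_path) or ""
    except Exception:
        # An unreadable text layer is not fatal; OCR is tried next.
        text = ""

    if len(text.strip()) >= min_chars:
        return text, "text-layer"

    try:
        return run_ocr(pdf_path) or "", "ocr"
    except Exception:
        # Both routes failed; the caller can log the file and move on.
        return "", "failed"
```

Returning the source alongside the text makes it easy to flag OCR-derived rows for manual review, which may matter for handwritten forms, since handwriting is generally much harder for OCR than printed text.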
Key requirements:
- Automate navigation and interaction with a dynamic website to access PDF links.
- Download PDF documents programmatically.
- Implement OCR to read content from potentially scanned or handwritten PDFs.
- Extract the following data points from each PDF: Permit ID, Company Name, First Contact Name and Email, Second Contact Name and Email, Third Contact Name and Email.
- Generate an Excel file with a single row per processed document, containing the extracted data in clearly defined columns.
- Keep the script modular and well documented, with basic error handling for issues like unreadable files or missing data.
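As a rough illustration of the extraction and one-row-per-document requirements, the sketch below turns the text of one PDF into a row with fixed columns, leaving a cell blank when a value is missing. The column names, the field labels in the regular expressions ("Permit ID", "Company Name", "Contact Name"), and the assumption that names and emails appear in the same order are all guesses; the real NOI layout would need to be checked against example documents.

```python
import re

# Illustrative column layout (an assumption, not the client's final format).
COLUMNS = [
    "Permit ID", "Company Name",
    "Contact 1 Name", "Contact 1 Email",
    "Contact 2 Name", "Contact 2 Email",
    "Contact 3 Name", "Contact 3 Email",
]

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical field labels; adjust to the wording on the actual form.
PERMIT_RE = re.compile(r"Permit\s*(?:ID|No\.?|Number)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE)
COMPANY_RE = re.compile(r"Company\s*Name\s*:\s*(.+)", re.IGNORECASE)
CONTACT_RE = re.compile(r"Contact\s*Name\s*:\s*(.+)", re.IGNORECASE)


def extract_row(text):
    """Build one spreadsheet row (a dict keyed by COLUMNS) from PDF text."""
    row = dict.fromkeys(COLUMNS, "")  # missing data stays as an empty cell

    match = PERMIT_RE.search(text)
    if match:
        row["Permit ID"] = match.group(1)

    match = COMPANY_RE.search(text)
    if match:
        row["Company Name"] = match.group(1).strip()

    # Keep at most three contacts, pairing names and emails by order found.
    names = [name.strip() for name in CONTACT_RE.findall(text)][:3]
    emails = EMAIL_RE.findall(text)[:3]
    for i, name in enumerate(names, start=1):
        row[f"Contact {i} Name"] = name
    for i, email in enumerate(emails, start=1):
        row[f"Contact {i} Email"] = email
    return row
```

Each row can then be written to the .xlsx file in `COLUMNS` order, for example with a spreadsheet library such as openpyxl.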
A lightweight approach to simulating web interaction, such as direct HTTP requests, is preferred over full browser automation if feasible. The final deliverable is a working Python script and, possibly, an initial output file. Example entries and the desired spreadsheet format can be provided.
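A minimal sketch of what the lightweight route could look like, assuming the PDF address behind each 'View' button is present in the page markup as an `href` or a `data-url` attribute. That is an assumption: if the site builds the link in JavaScript, the underlying request would have to be identified in the browser's network tools instead. The class name, function name, and example URLs are hypothetical, and only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of PDF links found in a listing page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # "data-url" is a guessed attribute name for button-driven links.
            if name not in ("href", "data-url") or not value:
                continue
            path = value.split("?")[0].lower()  # ignore any query string
            if path.endswith(".pdf"):
                self.links.append(urljoin(self.base_url, value))


def find_pdf_links(html, base_url):
    """Return every PDF URL referenced in the given HTML, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The returned URLs could then be downloaded with an ordinary HTTP client, which avoids launching a browser at all.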
Category: IT & Programming
Subcategory: Web development
Delivery term: Not specified