Python web scraping script development with... - Freelance Job in Web development

About this project

Open

Seeking an experienced Python developer to create a custom web scraping script. The script needs to extract a specific container node from designated websites. Freelancers should be proficient in Python and familiar with common web scraping libraries and anti-bot measures. Please detail your experience with similar projects and any specific libraries or frameworks you prefer to use. The project will involve understanding the structure of common web novel websites, handling dynamic content (if any), and delivering a clean, structured .txt file. Further detailed requirements will be provided upon selection.

As an example:

Constants at top of script

START_URL — string; manually set to the first chapter URL.

TOC_MODE — boolean; manually set.

PREFERRED_CONTAINER — list of CSS selectors; manually set (example: ["div.chp_raw"]).

If empty or no selector matches, fallback to ["div.entry-content"].

CHAPTER_HEADER_FORMAT — # Chapter {n} (1‑indexed).

OUTPUT_TXT — novel.txt.

MEDIA_DIR — media.

HTML_BACKUP_DIR — html_backup.

CHAPTER_HTML_COMBINED — novel_fullbackup.html.

CHAPTER_HTML_COPY — novel_fullbackup - Copy.html.

TXT_COPY — novel - Copy(original).

TARGET_CLASSES_TO_REMOVE — set of class names to remove (will be provided).

CHAPTER_LINK_KEYWORDS — for ToC mode: ["Chapter","story","chp","Intermission","Skill Table","illust","prologue","epilogue"].
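For illustration only, the constants above could sit at the top of the script roughly as follows. This is a sketch, not a required layout: the START_URL value is a placeholder, FALLBACK_CONTAINER, candidate_selectors and is_chapter_link are hypothetical names, and case-insensitive keyword matching is an assumption.

```python
# Constants at top of script (values taken from the list above).
START_URL = "https://example.com/novel/chapter-1"  # placeholder; set manually
TOC_MODE = False                                   # set manually
PREFERRED_CONTAINER = ["div.chp_raw"]              # CSS selectors, set manually
FALLBACK_CONTAINER = ["div.entry-content"]         # hypothetical name for the fallback
CHAPTER_HEADER_FORMAT = "# Chapter {n}"            # n is 1-indexed
OUTPUT_TXT = "novel.txt"
MEDIA_DIR = "media"
HTML_BACKUP_DIR = "html_backup"
CHAPTER_HTML_COMBINED = "novel_fullbackup.html"
CHAPTER_HTML_COPY = "novel_fullbackup - Copy.html"
TXT_COPY = "novel - Copy(original)"
TARGET_CLASSES_TO_REMOVE = set()                   # class names will be provided
CHAPTER_LINK_KEYWORDS = [
    "Chapter", "story", "chp", "Intermission",
    "Skill Table", "illust", "prologue", "epilogue",
]


def candidate_selectors():
    """Selectors to try in order: preferred first, then the fallback.

    An empty PREFERRED_CONTAINER yields only the fallback; if no preferred
    selector matches, the caller reaches the fallback at the end of the list.
    """
    return list(PREFERRED_CONTAINER) + list(FALLBACK_CONTAINER)


def is_chapter_link(text, href):
    """ToC mode: True if the link text or URL contains any chapter keyword.

    Case-insensitive substring matching is an assumption.
    """
    haystack = f"{text} {href}".lower()
    return any(keyword.lower() in haystack for keyword in CHAPTER_LINK_KEYWORDS)
```

Returning the selectors as one ordered list keeps the fallback rule in a single place, whichever HTML parser is chosen.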

Chapter formatting

Each chapter begins with # Chapter {n}.

Preserve layout and grouping in a .txt output:

Paragraphs: single blank line between.

Tables/lists: preserve structure; insert ------ above and below.

Table cells: join with an em dash (—) unless the previous cell ends with : or the cell is empty.

Inline media markers: insert #media <filename> at the exact position where the media was found.

Write each chapter to novel.txt as soon as it is completed.
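As a rough sketch of the formatting rules above, using only the standard library: the function names are hypothetical, and two details are assumptions. Cells following a cell that ends with : are joined with a single space, and the em dash separator is padded with spaces.

```python
CHAPTER_HEADER_FORMAT = "# Chapter {n}"
RULE = "------"
CELL_SEPARATOR = " — "  # em dash; the surrounding spaces are an assumption


def join_cells(cells):
    """Join the cells of one table row into a single line of text."""
    out = ""
    for cell in (c.strip() for c in cells):
        if not cell:
            continue                # empty cell: skip it, no separator
        if not out:
            out = cell              # first non-empty cell
        elif out.endswith(":"):
            out += " " + cell       # previous cell ends with ":" (space is assumed)
        else:
            out += CELL_SEPARATOR + cell
    return out


def format_block(rows):
    """Wrap table rows or list items with ------ above and below."""
    return "\n".join([RULE, *rows, RULE])


def format_chapter(n, blocks):
    """Chapter header followed by blocks, one blank line between each."""
    header = CHAPTER_HEADER_FORMAT.format(n=n)
    return "\n\n".join([header, *blocks]) + "\n\n"


def append_chapter(path, n, blocks):
    """Append one finished chapter to the output file immediately."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(format_chapter(n, blocks))
```

For example, join_cells(["Name:", "Fireball", "", "Rank B"]) gives "Name: Fireball — Rank B". Opening the file in append mode for each chapter means a crash partway through still leaves the earlier chapters on disk.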

Download rules

Download all media larger than 50 KB inside the chosen container.

Always select the highest quality available (srcset).

Deduplicate by content hash (SHA‑256).

Filenames: Chp{N}-{XXXX} (zero‑padded sequence per chapter).

Insert #media Chp{N}-{XXXX} inline in novel.txt.
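The download rules above might be sketched as follows, again with the standard library only. All names here are hypothetical. The sketch assumes that 50 KB means 50 * 1024 bytes, that XXXX is a four-digit sequence starting at 0001, and that a duplicate file reuses the filename of its first occurrence so the marker can still be inserted. The srcset parser is simplified and does not handle URLs that contain commas. File extensions and the actual HTTP download are left out.

```python
import hashlib

MIN_MEDIA_BYTES = 50 * 1024  # assumes "50 KB" means 50 * 1024 bytes


def best_srcset_url(srcset, fallback_src=""):
    """Pick the srcset candidate with the largest width (w) or density (x)."""
    best_url, best_score = fallback_src, 0.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        score = 1.0  # a candidate without a descriptor counts as 1x
        if len(parts) > 1 and parts[1][-1] in "wx":
            try:
                score = float(parts[1][:-1])
            except ValueError:
                continue  # malformed descriptor: ignore this candidate
        if score > best_score:
            best_url, best_score = parts[0], score
    return best_url


class MediaRegistry:
    """Tracks content hashes and per-chapter sequence numbers."""

    def __init__(self):
        self._by_hash = {}  # SHA-256 hex digest -> filename
        self._seq = {}      # chapter number -> last sequence number used

    def register(self, chapter, data):
        """Return the filename for this payload, or None if it is too small."""
        if len(data) <= MIN_MEDIA_BYTES:
            return None
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_hash:
            return self._by_hash[digest]  # duplicate: reuse the first name
        self._seq[chapter] = self._seq.get(chapter, 0) + 1
        name = f"Chp{chapter}-{self._seq[chapter]:04d}"
        self._by_hash[digest] = name
        return name


def media_marker(filename):
    """Inline marker written to novel.txt at the media's position."""
    return f"#media {filename}"
```

For example, best_srcset_url("a.jpg 480w, b.jpg 1080w", "a.jpg") returns "b.jpg", and registering the same bytes twice in chapter 3 returns "Chp3-0001" both times.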

Category: IT & Programming
Subcategory: Web development
What is the scope of the project? Medium-sized change

Delivery term: Not specified

Skills needed

Python, HTML, CSS, Data Mining

Python Web Scraping Script Development with Specific Requirements
