Seeking an experienced Python developer to create a custom web scraping script. The script needs to extract a specific container node from designated websites. Freelancers should be proficient in Python and familiar with common web scraping libraries and anti-bot countermeasures. Please detail your experience with similar projects and any specific libraries or frameworks you prefer to use. The project will involve understanding common web novel site structures, handling dynamic content (if any), and delivering a clean, structured .txt file. Further detailed requirements will be provided upon selection.
As an example:
Constants at top of script
START_URL — string; manually set to the first chapter URL.
TOC_MODE — boolean; manually set.
PREFERRED_CONTAINER — list of CSS selectors; manually set (example: ["div.chp_raw"]).
If empty or no selector matches, fall back to ["div.entry-content"].
CHAPTER_HEADER_FORMAT — # Chapter {n} (1‑indexed).
OUTPUT_TXT — novel.txt.
MEDIA_DIR — media.
HTML_BACKUP_DIR — html_backup.
CHAPTER_HTML_COMBINED — novel_fullbackup.html.
CHAPTER_HTML_COPY — novel_fullbackup - Copy.html.
TXT_COPY — novel - Copy(original).
TARGET_CLASSES_TO_REMOVE — set of class names to remove (will be provided).
CHAPTER_LINK_KEYWORDS — for ToC mode: ["Chapter","story","chp","Intermission","Skill Table","illust","prologue","epilogue"].
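As a rough sketch, the constants above could sit at the top of the script as shown below. The values are copied from this list. The names FALLBACK_CONTAINER, find_container and is_chapter_link are hypothetical helpers, not part of the requirements. The sketch assumes that select_one behaves like BeautifulSoup's soup.select_one (returning None when nothing matches) and that keyword matching is a case-insensitive substring check.

```python
from typing import Any, Callable, Optional, Sequence

START_URL = ""  # set manually to the first chapter URL
TOC_MODE = False  # set manually
PREFERRED_CONTAINER = ["div.chp_raw"]  # example value from the brief
FALLBACK_CONTAINER = ["div.entry-content"]  # hypothetical name for the fallback
CHAPTER_HEADER_FORMAT = "# Chapter {n}"  # n is 1-indexed
OUTPUT_TXT = "novel.txt"
MEDIA_DIR = "media"
HTML_BACKUP_DIR = "html_backup"
CHAPTER_HTML_COMBINED = "novel_fullbackup.html"
CHAPTER_HTML_COPY = "novel_fullbackup - Copy.html"
TXT_COPY = "novel - Copy(original)"
TARGET_CLASSES_TO_REMOVE: set = set()  # class names to be provided later
CHAPTER_LINK_KEYWORDS = [
    "Chapter", "story", "chp", "Intermission",
    "Skill Table", "illust", "prologue", "epilogue",
]


def find_container(
    select_one: Callable[[str], Optional[Any]],
    preferred: Sequence[str] = PREFERRED_CONTAINER,
) -> Optional[Any]:
    """Try the preferred selectors in order, then the fallback selector."""
    for selector in list(preferred) + FALLBACK_CONTAINER:
        node = select_one(selector)
        if node is not None:
            return node
    return None


def is_chapter_link(link_text: str) -> bool:
    """Assumed rule for ToC mode: any keyword appears in the link text."""
    lowered = link_text.lower()
    return any(keyword.lower() in lowered for keyword in CHAPTER_LINK_KEYWORDS)
```

With BeautifulSoup, the call would look like find_container(soup.select_one). A plain substring check also matches words such as "history", so the final matching rule may need to be stricter.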
Chapter formatting
Each chapter begins with # Chapter {n}.
Preserve layout and grouping in a .txt output:
Paragraphs: single blank line between.
Tables/lists: preserve structure; insert ------ above and below.
Table cells: join with an em dash (—) unless the previous cell ends with : or the cell is empty.
Inline media markers: insert #media <filename> at the exact position where the media was found.
Write each completed chapter to novel.txt as soon as it is done.
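One possible reading of the formatting rules above, as a minimal sketch. All function names are hypothetical. It assumes that empty cells are skipped, that a cell following one ending in ":" is joined with a single space, that the em dash is padded with spaces, and that each block (paragraph, table, list) has already been converted to a string.

```python
from typing import Iterable, List

EM_DASH = "\u2014"
RULE = "------"


def join_cells(cells: Iterable[str]) -> str:
    """Join the cells of one table row according to the assumed rule."""
    row = ""
    for cell in cells:
        cell = cell.strip()
        if not cell:
            continue  # empty cell: no separator, no text
        if not row:
            row = cell
        elif row.endswith(":"):
            row += " " + cell  # previous cell ends with ":", so no dash
        else:
            row += f" {EM_DASH} {cell}"
    return row


def format_table(rows: Iterable[Iterable[str]]) -> str:
    """Render a table with a ------ rule above and below."""
    lines: List[str] = [RULE]
    lines.extend(join_cells(row) for row in rows)
    lines.append(RULE)
    return "\n".join(lines)


def render_chapter(n: int, blocks: Iterable[str]) -> str:
    """Header first, then blocks separated by a single blank line."""
    return "\n\n".join([f"# Chapter {n}", *blocks])


def append_chapter(path: str, chapter_text: str) -> None:
    """Append a finished chapter so that progress survives a crash."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(chapter_text + "\n\n")
```

Opening the file in append mode for every chapter keeps novel.txt up to date after each chapter, at the cost of one extra open and close per chapter.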
Download rules
Download all media larger than 50 KB inside the chosen container.
Always select the highest quality available (srcset).
Deduplicate by content hash (SHA‑256).
Filenames: Chp{N}-{XXXX} (zero‑padded sequence per chapter).
Insert #media Chp{N}-{XXXX} inline in novel.txt.
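The download rules could be sketched as follows. The names best_from_srcset, MediaRegistry and media_marker are hypothetical. The sketch assumes 1 KB = 1024 bytes, a sequence that starts at 0001 in each chapter, duplicates that reuse the filename of the first copy, and filenames without an extension, since none is specified. The srcset parser is deliberately simple: it splits on commas, so it would mishandle URLs that themselves contain commas.

```python
import hashlib
from typing import Dict, Optional, Tuple

MIN_MEDIA_BYTES = 50 * 1024  # assumption: 1 KB = 1024 bytes


def best_from_srcset(srcset: str, fallback_src: Optional[str] = None) -> Optional[str]:
    """Pick the candidate with the largest width (w) or density (x) descriptor."""
    best_url, best_score = fallback_src, -1.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        score = 1.0  # a candidate without a descriptor counts as 1x
        if len(parts) > 1 and parts[1][-1] in "wx":
            try:
                score = float(parts[1][:-1])
            except ValueError:
                continue  # malformed descriptor: ignore this candidate
        if score > best_score:
            best_url, best_score = parts[0], score
    return best_url


class MediaRegistry:
    """Deduplicate by SHA-256 and hand out Chp{N}-{XXXX} names."""

    def __init__(self) -> None:
        self._name_by_hash: Dict[str, str] = {}
        self._last_seq: Dict[int, int] = {}

    def register(self, chapter: int, data: bytes) -> Optional[Tuple[str, bool]]:
        """Return (filename, is_new), or None if the file is 50 KB or smaller."""
        if len(data) <= MIN_MEDIA_BYTES:
            return None
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._name_by_hash:
            return self._name_by_hash[digest], False
        seq = self._last_seq.get(chapter, 0) + 1
        self._last_seq[chapter] = seq
        name = f"Chp{chapter}-{seq:04d}"
        self._name_by_hash[digest] = name
        return name, True


def media_marker(filename: str) -> str:
    """Inline marker to place in novel.txt where the media appeared."""
    return f"#media {filename}"
```

When register returns is_new as False, the script can skip writing the file to the media folder and still insert the marker that points at the first copy.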
Delivery term: Not specified