
rewrote scrapers for new ui #3

Open
mueslimak3r wants to merge 10 commits into master

Conversation

@mueslimak3r commented Jun 2, 2024

Opening this as a draft because it hasn't been thoroughly tested and hasn't yet been made to respect the "debug" flag.
Also, because I scrape the series-part list from a specific series' page, the series parts/stories don't have stats such as rating; only one-shot stories get those stats.

This was easier than handling a click on the "View Full 128 Part Series" button on the author's works page before parsing the list from there. The list on the author's page is the one that carries the stats in each story's card.
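For anyone picking this up, here is a rough sketch of the series-page approach described above. The URL handling and CSS selector are hypothetical placeholders, not the actual markup handled in this branch:

```python
# Illustrative only: the selector below is a placeholder, not the real
# class names used by the new UI or by the scraper in this PR.
import bs4
import requests

def parse_series_parts(series_url):
    """Collect (title, url) pairs for every part listed on a series page.

    Because the list is taken from the series page rather than the author's
    works page, per-story stats such as rating are not available here.
    """
    html = requests.get(series_url).text
    soup = bs4.BeautifulSoup(html, 'html.parser')

    parts = []
    # Placeholder selector: the real class names depend on the new UI markup.
    for link in soup.select('a.series-part-link'):
        parts.append((link.text.strip(), link['href']))
    return parts
```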

In its current state this is adequate for me, so I'll leave it as-is.
If anyone wants to finish what I've started, I'll keep an eye out and update this PR as needed.

Closes #2

@mueslimak3r marked this pull request as ready for review July 15, 2024 02:29
@domaniko

Thank you for the PR.

It works for quite a few texts, but I found some issues with others.

These small changes improved it a lot for me:

@@ -390,7 +390,7 @@ def parse_series_page(page_url, author):
 
 def parse_author_works_page(html):
     soup = bs4.BeautifulSoup(html, 'html.parser')
-    author_element = soup.find('h1', class_='headline__title')
+    author_element = soup.find('title')
     if not author_element:
         error("Cannot determine author on member page.")
     if "Stories by " in author_element.text.strip():

and

@@ -478,16 +478,16 @@ def get_story_text(st):
     #[0].select("div[class^=_item_title]")[0]['href']
 
     #vals = re.findall('<option value=".*?">(\d+)</option>', sel_match.group(1))
-    if not paginator_elements: # just one page
-        error("Couldn't find paginator elements.")
     complete_text = ""
 
     end = 1
-    for pe in paginator_elements:
-        if pe.text.strip() == '' or not pe.text.strip().isnumeric():
-            continue
-        if int(pe.text.strip()) > end:
-            end = int(pe.text.strip())
+    if paginator_parent_element:
+        for pe in paginator_elements:
+            if pe.text.strip() == '' or not pe.text.strip().isnumeric():
+                continue
+            if int(pe.text.strip()) > end:
+                end = int(pe.text.strip())
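For reference, here is a standalone sketch of what these two patches do, keeping the names from the diffs. The exact `<title>` format and the rest of both functions aren't shown in this PR, so treat the details as assumptions:

```python
# Illustrative sketch mirroring the intent of the two diffs above.
import bs4

def author_from_title(soup: bs4.BeautifulSoup):
    """Read the author from the <title> tag, since the new UI no longer
    exposes an <h1 class="headline__title"> element on member pages."""
    title_element = soup.find('title')
    if not title_element:
        return None
    text = title_element.text.strip()
    # Assumption: works pages are titled roughly "Stories by <author> - ...".
    if "Stories by " in text:
        return text.split("Stories by ", 1)[1].split(" - ", 1)[0].strip()
    return None

def page_count(paginator_parent_element, paginator_elements):
    """Default to a single page when no paginator exists (instead of raising
    an error); otherwise return the highest numeric page label."""
    end = 1
    if paginator_parent_element:
        for pe in paginator_elements:
            label = pe.text.strip()
            if label.isnumeric() and int(label) > end:
                end = int(label)
    return end
```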

The generated EPUBs now do not have an additional line break between paragraphs, which makes reading on some e-book readers a little more awkward.

@mueslimak3r (Author)

@domaniko can you open a PR to merge your changes into my branch so they can be added to this PR with proper attribution?

I'm also happy to just make your changes on my end and push them.

@domaniko

Done: see mueslimak3r#1

@domaniko

@mueslimak3r Could you please also consider mueslimak3r#2?
