Handle author dict Bug #641

Merged
merged 4 commits into from
Oct 22, 2024
Conversation

addie9800
Collaborator

When crawling Braunschweiger Zeitung, I encountered a rather odd bug. Take this article, for example: https://www.braunschweiger-zeitung.de/article238597927/Von-Natur-aus-Harzer-Honig-aus-Willensen-ist-Typisch-Harz.html, whose ld_json contains this author element:

"author":[{"@type":"Organization","name":"FUNKE Mediengruppe","url":"https://www.harzkurier.de/autoren"}]

The previous code raised an AttributeError stating that str has no attribute get. Stepping through the ld element, I could see that the author element is of type dict rather than a list. After some more searching, I found that the HTML Fundus actually receives contains this instead:

"author":{"@type": "Person","name": "hn","url": "https://www.braunschweiger-zeitung.de/autoren/"}

I am not entirely sure why this is happening; my guess is that it's some kind of redirection issue. Nevertheless, this PR fixes it.
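To illustrate the two shapes described above, here is a minimal sketch of normalizing an ld+json author value that may arrive either as a list of objects or as a single dict. The helper name `extract_author_names` is hypothetical and is not the code changed in this PR, nor part of the Fundus API. A plausible reason for the reported error, offered only as an assumption, is that iterating over a dict yields its string keys, so calling `.get` on each item fails.

```python
from typing import Any, List


def extract_author_names(author: Any) -> List[str]:
    """Hypothetical helper (not Fundus code): collect names from an ld+json author value."""
    if author is None:
        return []
    # Wrap a single object or a bare string so every shape is handled as a list.
    if isinstance(author, (dict, str)):
        author = [author]
    # Anything that is still not a list or tuple is an unexpected shape; ignore it.
    if not isinstance(author, (list, tuple)):
        return []
    names: List[str] = []
    for entry in author:
        # Objects carry the name under the "name" key; plain strings are the name itself.
        name = entry.get("name") if isinstance(entry, dict) else entry
        if isinstance(name, str) and name.strip():
            names.append(name.strip())
    return names


# The two shapes quoted in this PR description.
as_list = [{"@type": "Organization", "name": "FUNKE Mediengruppe", "url": "https://www.harzkurier.de/autoren"}]
as_dict = {"@type": "Person", "name": "hn", "url": "https://www.braunschweiger-zeitung.de/autoren/"}

print(extract_author_names(as_list))  # ['FUNKE Mediengruppe']
print(extract_author_names(as_dict))  # ['hn']
```

Wrapping the single-object case into a list first keeps one code path for both shapes, so a dict is never iterated key by key.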

@addie9800 addie9800 requested a review from MaxDall October 21, 2024 17:19
@MaxDall
Collaborator

MaxDall commented Oct 22, 2024

@addie9800 I have to admit this sounds strange. I would feel more comfortable if we understood what was happening before fixing it. Further, I cannot reproduce this bug on the current master branch. Can you confirm this?

Edit:
Additionally, I think generic_author_parsing should cover both cases.

@MaxDall MaxDall merged commit 346aae8 into master Oct 22, 2024
5 checks passed
@MaxDall MaxDall deleted the handle-bsz-bug branch October 22, 2024 18:10