Store KIT IPD files in folders according to HTML heading structure #99

Merged
merged 2 commits into from
Nov 4, 2024

@Scriptim Scriptim commented Nov 3, 2024

Until now, the kit-ipd crawlers have grouped all files presented in the same HTML table into a single folder, named after the HTML heading that precedes the table. This is specifically tailored to sites such as this.

This PR introduces a more general approach that can construct a nested folder structure. The folder names are derived from the HTML structure of (nested) headings and do not rely on the presence of table elements.
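To illustrate the idea, here is a minimal sketch of mapping links to folders based on the headings above them. It is not the code from this PR: `HeadingPathParser` and `folder_paths` are hypothetical names, the sketch uses only the standard library's `html.parser`, and it assumes that every `<a href>` outside a heading is a downloadable file.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingPathParser(HTMLParser):
    """Hypothetical helper: assigns each link a folder built from the headings above it."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (level, title) of the headings that currently apply
        self.files = []      # (folder, href) pairs in document order
        self._level = None   # level of the heading being read, if any
        self._text = []      # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._text = []
        elif tag == "a" and self._level is None:
            href = dict(attrs).get("href")
            if href:
                folder = PurePosixPath(*(title for _, title in self.stack))
                self.files.append((folder, href))

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Assumption: slashes in titles are replaced so one heading is one folder.
            title = "".join(self._text).strip().replace("/", "_")
            # A new heading ends every heading of the same or a deeper level.
            while self.stack and self.stack[-1][0] >= self._level:
                self.stack.pop()
            if title:
                self.stack.append((self._level, title))
            self._level = None


def folder_paths(html):
    """Return (folder, href) pairs for all links found in the given HTML string."""
    parser = HeadingPathParser()
    parser.feed(html)
    parser.close()
    return parser.files
```

In this sketch, a link below an `<h2>` that itself follows an `<h1>` ends up in `<h1 title>/<h2 title>`, whether or not it sits inside a table.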

For the example website above, the following file tree is created:
(image: propa2022ws_pferd_kit_ipd_headings)

This might be considered a breaking change, as it is incompatible with the file trees produced by previous runs. (?)

@I-Al-Istannen I-Al-Istannen merged commit 5983200 into Garmelon:master Nov 4, 2024
5 checks passed
@Scriptim Scriptim deleted the kit-ipd-headings branch November 4, 2024 22:55