Add free access attribute #362

Merged
merged 16 commits into from
Feb 24, 2024
Changes from 7 commits
8 changes: 8 additions & 0 deletions docs/attribute_guidelines.md
@@ -58,4 +58,12 @@ Those attributes will be validated with unit tests when used.
<td><code>List[str]</code></td>
<td><code>generic_topic_parsing</code></td>
</tr>
<tr>
<td>free_access</td>
<td>A boolean that is <code>False</code> if the article is restricted to users with a subscription. This usually indicates
that the article cannot be crawled completely.
<i>This attribute is implemented by default.</i></td>
<td><code>bool</code></td>
<td><code></code></td>
</tr>
</table>
18 changes: 18 additions & 0 deletions docs/how_to_add_a_publisher.md
@@ -469,6 +469,24 @@ Instead, we recommend referring to [this](https://devhints.io/xpath) documentati
Make sure to examine other parsers and consult the [attribute guidelines](attribute_guidelines.md) for specifics on attribute implementation.
We strongly encourage utilizing these utility functions, especially when parsing the `ArticleBody`.

### Checking the free_access attribute

If your new publisher does not have a subscription model, you can skip this step. If it does,
verify that the `<script type="application/ld+json">` blocks in the source code of premium articles contain an
`isAccessibleForFree` key that is set to either `false` or `False`. It does not matter if the key is missing from
freely accessible articles. If the key is present and set as described, continue with the next step. If not, override
the default implementation by adding the following snippet to your parser:

```python
@attribute
def free_access(self) -> bool:
    # Your personalized logic goes here
    pass
```

Usually, you can identify a premium article by an indicator within the URL, or by using XPath or CSSSelector to select
the element that asks the reader to purchase a subscription to view the article.
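
To carry out the verification described at the start of this section, a small standalone check along the following lines may help. It is only a sketch using the Python standard library and is not part of the project: `find_is_accessible_for_free`, `is_premium_url`, and the URL markers `/premium/` and `/plus/` are hypothetical names and values chosen for illustration. Replace the markers with whatever your publisher actually uses.

```python
import json
from html.parser import HTMLParser
from typing import Any, List, Optional, Sequence
from urllib.parse import urlparse


class _LdJsonCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data


def _search(node: Any) -> Optional[Any]:
    """Depth-first search for an isAccessibleForFree key in parsed JSON."""
    if isinstance(node, dict):
        if "isAccessibleForFree" in node:
            return node["isAccessibleForFree"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = _search(child)
        if found is not None:
            return found
    return None


def find_is_accessible_for_free(html: str) -> Optional[Any]:
    """Return the first isAccessibleForFree value in any ld+json block, or None."""
    collector = _LdJsonCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks instead of failing the whole check
        found = _search(data)
        if found is not None:
            return found
    return None


def is_premium_url(url: str, markers: Sequence[str] = ("/premium/", "/plus/")) -> bool:
    """Hypothetical URL check: True if the URL path contains one of the markers."""
    path = urlparse(url).path
    return any(marker in path for marker in markers)
```

If `find_is_accessible_for_free` returns `None` for the source of a premium article, the key is missing and an override of `free_access` is likely needed. A check such as `is_premium_url` is one possible starting point for the logic inside that override.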

### Finishing the Parser

Bringing all of the above together, the Los Angeles Times parser now looks like this.