Commit f851cd6

Merge branch 'unbatch-fundus' into ci_test_matrix
# Conflicts:
#   .github/workflows/tests.yml
#   setup.cfg
2 parents be14db1 + 8c36a71 commit f851cd6

98 files changed, +4492 -1205 lines changed

+78
@@ -0,0 +1,78 @@
+name: Publisher Coverage
+
+on:
+  schedule:
+    - cron: '0 1 * * *' # Runs at 01:00
+
+  workflow_dispatch:
+
+jobs:
+  validate_crawlers:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Set up Git repository
+        uses: actions/checkout@v3
+        with:
+          ref: ${{ github.head_ref }}
+
+      - name: Set up Python 3.9
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.9
+
+      - name: Install Fundus
+        run: pip install -e .
+
+      - name: Validate Crawlers
+        run: |
+          set -o pipefail
+          exec python scripts/publisher_coverage.py | tee publisher_coverage.txt
+
+      - name: Upload Coverage Report
+        if: success() || failure()
+        uses: actions/upload-artifact@v3
+        with:
+          name: Publisher Coverage
+          path: publisher_coverage.txt
+
+
+  create_badge:
+    runs-on: ubuntu-latest
+    needs: validate_crawlers
+    if: success() || failure()
+
+    steps:
+      - name: Set up Git repository
+        uses: actions/checkout@v3
+        with:
+          ref: ${{ github.head_ref }}
+
+      - name: Download Coverage Report
+        uses: actions/download-artifact@v3
+        with:
+          name: Publisher Coverage
+
+      - name: Get Success Rate
+        run: echo "SUCCESS_RATE=$(tail -n 1 publisher_coverage.txt | grep -P -o '\d+\/\d+')" >> $GITHUB_ENV
+
+      - name: Get Coverage Bounds
+        run: |
+          echo "TOTAL_PUBLISHERS=$(echo ${{ env.SUCCESS_RATE }} | grep -P -o '\d+' | tail -1)" >> $GITHUB_ENV
+          echo "PASSED_PUBLISHERS=$(echo ${{ env.SUCCESS_RATE }} | grep -P -o '\d+' | head -1)" >> $GITHUB_ENV
+
+      - name: Get Red Threshold
+        # We set the badge colour to red when at least one publisher failed the tests.
+        run: echo "RED_THRESHOLD=$(( ${{ env.TOTAL_PUBLISHERS }} - 1 ))" >> $GITHUB_ENV
+
+      - name: Create Badge
+        uses: schneegans/[email protected]
+        with:
+          auth: ${{ secrets.DOBBERSC_GIST_SECRET }}
+          gistID: ca0ae056b05cbfeaf30fa42f84ddf458
+          filename: fundus_publisher_coverage.json
+          label: Publisher Coverage
+          message: ${{ env.SUCCESS_RATE }}
+          valColorRange: ${{ env.PASSED_PUBLISHERS }}
+          maxColorRange: ${{ env.TOTAL_PUBLISHERS }}
+          minColorRange: ${{ env.RED_THRESHOLD }}
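
The `create_badge` job derives its badge inputs from the last line of the coverage report using `tail`, `grep`, and shell arithmetic. As a rough illustration, the same derivation can be sketched in Python. The names `CoverageBounds` and `coverage_bounds` and the sample report are hypothetical; the only assumption taken from the workflow is that the report's last line contains a `passed/total` fraction.

```python
import re
from typing import NamedTuple


class CoverageBounds(NamedTuple):
    success_rate: str   # badge message, e.g. "37/39"
    passed: int         # corresponds to valColorRange
    total: int          # corresponds to maxColorRange
    red_threshold: int  # corresponds to minColorRange


def coverage_bounds(report: str) -> CoverageBounds:
    """Extract 'passed/total' from the last report line, like `tail -n 1 | grep -o`."""
    lines = report.splitlines()
    if not lines:
        raise ValueError("empty coverage report")
    # Unlike `grep -o`, this sketch keeps only the first fraction on the line.
    match = re.search(r"(\d+)/(\d+)", lines[-1])
    if match is None:
        raise ValueError("no 'passed/total' fraction in the last report line")
    passed, total = int(match.group(1)), int(match.group(2))
    # The workflow sets the red threshold to total - 1, so that a single
    # failing publisher is enough to turn the badge red.
    return CoverageBounds(match.group(0), passed, total, total - 1)


# Hypothetical report text; the real output of scripts/publisher_coverage.py
# is not shown in this commit view.
sample_report = "FoxNews: PASSED\nFreeBeacon: FAILED\nPassed tests: 1/2\n"
bounds = coverage_bounds(sample_report)
```

Here `bounds.passed` equals `bounds.red_threshold`, which is the case the workflow comment describes: at least one publisher failed.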

.github/workflows/tests.yml

+2 -4
@@ -31,8 +31,7 @@ jobs:
         id: cache
         with:
           path: ${{ env.pythonLocation }}
-          key: ${{ matrix.os }}-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('setup.cfg') }}
-
+          key: ${{ matrix.os }}-${{ matrix.python-version }}-${{ hashFiles('pyproject.toml') }}
       - name: Install dependencies
         if: steps.cache.outputs.cache-hit != 'true'
         run: |
@@ -62,8 +61,7 @@ jobs:
         id: cache
         with:
           path: ${{ env.pythonLocation }}
-          key: ${{ matrix.os }}-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('setup.cfg') }}
-
+          key: ${{ matrix.os }}-${{ matrix.python-version }}-${{ hashFiles('pyproject.toml') }}
       - name: Install dependencies
         if: steps.cache.outputs.cache-hit != 'true'
         run: |
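
The cache key in these hunks is meant to combine the runner OS, the Python version, and a hash of `pyproject.toml`, so that a change to any of the three yields a fresh cache. A minimal Python sketch of that idea follows. `cache_key` is a hypothetical helper, and the plain SHA-256 digest only stands in for GitHub's `hashFiles`, whose exact output is not reproduced here.

```python
import hashlib


def cache_key(os_name: str, python_version: str, pyproject: bytes) -> str:
    """Build an '<os>-<python>-<digest>' key from the three inputs."""
    digest = hashlib.sha256(pyproject).hexdigest()
    return f"{os_name}-{python_version}-{digest}"


# Hypothetical inputs: the same OS and file contents, two interpreter versions.
pyproject = b'[project]\nname = "fundus"\n'
key_38 = cache_key("ubuntu-latest", "3.8", pyproject)
key_39 = cache_key("ubuntu-latest", "3.9", pyproject)
```

The two keys differ only in the version segment, which is what keeps each matrix entry's cache separate. If the version expression were not expanded, every Python version on the same OS would share one key.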

.gitignore

+1
@@ -16,6 +16,7 @@ __pycache__/
 *.so
 
 # Distribution / packaging
+.pypirc
 .Python
 build/
 develop-eggs/

LICENSE

+21
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Max Dallabetta, Conrad Dobberstein, Aaron Wey
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

+81 -33
@@ -1,50 +1,71 @@
-<img alt="alt text" src="resources/fundus_logo.png" width="180"/>
+<p align="center">
+<picture>
+<source media="(prefers-color-scheme: dark)" srcset="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_darkmode_with_font_and_clear_space.svg">
+<source media="(prefers-color-scheme: light)" srcset="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_lightmode_with_font_and_clear_space.svg">
+<img src="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_lightmode_with_font_and_clear_space.svg" alt="Logo" width="50%" height="50%">
+</picture>
+</p>
+
+<p align="center">A very simple <b>news crawler</b> in Python.
+Developed at <a href="https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/">Humboldt University of Berlin</a>.
+</p>
+<p align="center">
+<img alt="version" src="https://img.shields.io/badge/version-0.1-green">
+<img alt="python" src="https://img.shields.io/badge/python-3.8-blue">
+<img alt="Static Badge" src="https://img.shields.io/badge/license-MIT-green">
+<img alt="Publisher Coverage" src="https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/dobbersc/ca0ae056b05cbfeaf30fa42f84ddf458/raw/fundus_publisher_coverage.json">
+</p>
+<div align="center">
+<hr>
+
+[Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](/docs/supported_publishers.md)
+
+</div>
 
-[![PyPI version](https://badge.fury.io/py/fundus.svg)](https://badge.fury.io/py/fundus)
-[![GitHub Issues](https://img.shields.io/github/issues/flairNLP/fundus.svg)](https://github.com/flairNLP/fundus/issues)
-[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](docs/how_to_contribute.md)
-[![License: MIT](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT)
-
-A very simple **news crawler**.
-Developed at [Humboldt University of Berlin](https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/).
 
 ---
 
 Fundus is:
 
-* A crawler for news ...
+* **A static news crawler.**
+Fundus lets you crawl online news articles with only a few lines of Python code!
+Be it from live websites or the CC-NEWS dataset.
 
-* A Python ...
+* **An open-source Python package.**
+Fundus is built on the idea of building something together.
+We welcome your contribution to help Fundus [grow](docs/how_to_contribute.md)!
 
-## Quick Start
+<hr>
 
-### Requirements and Installation
+## Quick Start
 
-In your favorite virtual environment, simply do:
+To install from pip, simply do:
 
 ```
 pip install fundus
 ```
 
 Fundus requires Python 3.8+.
 
-### Example 1: Crawl a bunch of German-language news articles
 
-Let's use Fundus to crawl 2 articles of English-language news publishers based in the US.
+## Example 1: Crawl a bunch of English-language news articles
 
-```python
+Let's use Fundus to crawl 2 articles from publishers based in the US.
 
+```python
 from fundus import PublisherCollection, Crawler
 
-# initialize the crawler for news publisher based in the us
+# initialize the crawler for news publishers based in the US
 crawler = Crawler(PublisherCollection.us)
 
 # crawl 2 articles and print
 for article in crawler.crawl(max_articles=2):
     print(article)
 ```
 
-This should print something like:
+That's already it!
+
+If you run this code, it should print out something like this:
 
 ```console
 Fundus-Article:
@@ -53,6 +74,7 @@ Fundus-Article:
 through committee votes on Thursday thanks to a last-minute [...]"
 - URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
 - From: FreeBeacon (2023-05-11 18:41)
+
 Fundus-Article:
 - Title: "Northwestern student government freezes College Republicans funding over [...]"
 - Text: "Student government at Northwestern University in Illinois "indefinitely" froze
@@ -61,50 +83,76 @@ Fundus-Article:
 - From: FoxNews (2023-05-09 14:37)
 ```
 
-This means that you crawled 2 articles from different US publishers.
+This printout tells you that you successfully crawled two articles!
 
-### Example 2: Crawl specific news source
+For each article, the printout details:
+- the "Title" of the article, i.e. its headline
+- the "Text", i.e. the main article body text
+- the "URL" from which it was crawled
+- the news source it is "From"
 
-Maybe you want to crawl a specific news source instead. Let's crawl news articles form Washington Times only:
 
-```python
+## Example 2: Crawl a specific news source
 
+Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:
+
+```python
 from fundus import PublisherCollection, Crawler
 
-# initialize the crawler for Washington Times
-crawler = Crawler(PublisherCollection.us.WashingtonTimes)
+# initialize the crawler for The New Yorker
+crawler = Crawler(PublisherCollection.us.TheNewYorker)
 
-# crawl 5 articles and print
-for article in crawler.crawl(max_articles=5):
+# crawl 2 articles and print
+for article in crawler.crawl(max_articles=2):
     print(article)
 ```
 
+## Example 3: Crawl articles from CC-NEWS
+
+If you're not familiar with CC-NEWS, check out their [paper](https://paperswithcode.com/dataset/cc-news).
+
+````python
+from fundus import PublisherCollection, CCNewsCrawler
+
+# initialize the crawler for news publishers based in the US
+crawler = CCNewsCrawler(*PublisherCollection.us)
+
+# crawl 2 articles and print
+for article in crawler.crawl(max_articles=2):
+    print(article)
+````
+
+
 ## Tutorials
 
 We provide **quick tutorials** to get you started with the library:
 
-1. [**Tutorial 1: How to crawl news with Fundus**](docs/...)
-2. [**Tutorial 2: The Article Class**](docs/...)
-3. [**Tutorial 3: How to add a new news-source**](docs/how_to_contribute.md)
+1. [**Tutorial 1: How to crawl news with Fundus**](docs/1_getting_started.md)
+2. [**Tutorial 2: How to crawl articles from CC-NEWS**](docs/2_crawl_from_cc_news.md)
+3. [**Tutorial 3: The Article Class**](docs/3_the_article_class.md)
+4. [**Tutorial 4: How to filter articles**](docs/4_how_to_filter_articles.md)
+5. [**Tutorial 5: How to search for publishers**](docs/5_how_to_search_for_publishers.md)
 
-The tutorials explain how ...
+If you wish to contribute, check out these tutorials:
+1. [**How to contribute**](docs/how_to_contribute.md)
+2. [**How to add a publisher**](docs/how_to_add_a_publisher.md)
 
 ## Currently Supported News Sources
 
 You can find the publishers currently supported [**here**](/docs/supported_publishers.md).
 
-Also: **Adding a new source is easy - consider contributing to the project!**
+Also: **Adding a new publisher is easy - consider contributing to the project!**
 
 ## Contact
 
-Please email your questions or comments to ...
+Please email your questions or comments to [**Max Dallabetta**](mailto:[email protected]?subject=[GitHub]%20Fundus)
 
 ## Contributing
 
 Thanks for your interest in contributing! There are many ways to get involved;
 start with our [contributor guidelines](docs/how_to_contribute.md) and then
 check these [open issues](https://github.com/flairNLP/fundus/issues) for specific tasks.
 
-## [License](/LICENSE)
+## License
 
-?
+[MIT](LICENSE)
