- <img alt="alt text" src="resources/fundus_logo.png" width="180" />
+ <p align="center">
+ <picture>
+ <source media="(prefers-color-scheme: dark)" srcset="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_darkmode_with_font_and_clear_space.svg">
+ <source media="(prefers-color-scheme: light)" srcset="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_lightmode_with_font_and_clear_space.svg">
+ <img src="https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_lightmode_with_font_and_clear_space.svg" alt="Logo" width="50%" height="50%">
+ </picture>
+ </p>
+
+ <p align="center">A very simple <b>news crawler</b> in Python.
+ Developed at <a href="https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/">Humboldt University of Berlin</a>.
+ </p>
+ <p align="center">
+ <img alt="version" src="https://img.shields.io/badge/version-0.1-green">
+ <img alt="python" src="https://img.shields.io/badge/python-3.8-blue">
+ <img alt="Static Badge" src="https://img.shields.io/badge/license-MIT-green">
+ <img alt="Publisher Coverage" src="https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/dobbersc/ca0ae056b05cbfeaf30fa42f84ddf458/raw/fundus_publisher_coverage.json">
+ </p>
+ <div align="center">
+ <hr>
+
+ [Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](/docs/supported_publishers.md)
+
+ </div>
- [![PyPI version](https://badge.fury.io/py/fundus.svg)](https://badge.fury.io/py/fundus)
- [![GitHub Issues](https://img.shields.io/github/issues/flairNLP/fundus.svg)](https://github.com/flairNLP/fundus/issues)
- [![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](docs/how_to_contribute.md)
- [![License: MIT](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT)
-
- A very simple **news crawler**.
- Developed at [Humboldt University of Berlin](https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/).
---
Fundus is:
- * A crawler for news ...
+ * **A static news crawler.**
+ Fundus lets you crawl online news articles with only a few lines of Python code!
+ Be it from live websites or the CC-NEWS dataset.
- * A Python ...
+ * **An open-source Python package.**
+ Fundus is built on the idea of creating something together.
+ We welcome your contributions to help Fundus [grow](docs/how_to_contribute.md)!
- ## Quick Start
+ <hr>
- ### Requirements and Installation
+ ## Quick Start
- In your favorite virtual environment, simply do:
+ To install from pip, simply do:
```
pip install fundus
```
Fundus requires Python 3.8+.
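+
+ If you instead want the latest development state, you can also install Fundus directly from the GitHub repository (this sketch assumes `git` is available on your machine):
+
+ ```
+ pip install git+https://github.com/flairNLP/fundus.git
+ ```
+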
- ### Example 1: Crawl a bunch of German-language news articles
- Let's use Fundus to crawl 2 articles of English-language news publishers based in the US.
+ ## Example 1: Crawl a bunch of English-language news articles
- ``` python
+ Let's use Fundus to crawl 2 articles from publishers based in the US.
+ ``` python
from fundus import PublisherCollection, Crawler
- # initialize the crawler for news publisher based in the us
+ # initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)
# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```
- This should print something like:
+ That's already it!
+
+ If you run this code, it should print out something like this:
``` console
Fundus-Article:
@@ -53,6 +74,7 @@ Fundus-Article:
through committee votes on Thursday thanks to a last-minute [...]"
- URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From: FreeBeacon (2023-05-11 18:41)
+
Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text: "Student government at Northwestern University in Illinois "indefinitely" froze
@@ -61,50 +83,76 @@ Fundus-Article:
- From: FoxNews (2023-05-09 14:37)
```
- This means that you crawled 2 articles from different US publishers.
+ This printout tells you that you successfully crawled two articles!
- ### Example 2: Crawl specific news source
+ For each article, the printout details:
+ - the "Title" of the article, i.e. its headline
+ - the "Text", i.e. the main article body text
+ - the "URL" from which it was crawled
+ - the news source it is "From"
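+
+ You don't have to stop at printing: `crawl()` yields the articles as objects, so the same fields can be read programmatically. The attribute names used below (`title`, `publishing_date`) are assumptions for illustration only; see the tutorial on the Article class for the authoritative interface.
+
+ ``` python
+ from fundus import PublisherCollection, Crawler
+
+ crawler = Crawler(PublisherCollection.us)
+
+ for article in crawler.crawl(max_articles=2):
+     # attribute names are assumed for illustration; see the Article class tutorial
+     print(article.title)            # the "Title" line from the printout
+     print(article.publishing_date)  # the date shown in the "From" line
+ ```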
- Maybe you want to crawl a specific news source instead. Let's crawl news articles form Washington Times only:
- ``` python
+ ## Example 2: Crawl a specific news source
+ Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:
+
+ ``` python
from fundus import PublisherCollection, Crawler
- # initialize the crawler for Washington Times
+ # initialize the crawler for The New Yorker
- crawler = Crawler(PublisherCollection.us.WashingtonTimes)
+ crawler = Crawler(PublisherCollection.us.TheNewYorker)
- # crawl 5 articles and print
- for article in crawler.crawl(max_articles=5):
+ # crawl 2 articles and print
+ for article in crawler.crawl(max_articles=2):
    print(article)
```
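+
+ You can also combine several specific publishers in one crawler. The snippet below is a sketch: passing several publishers at once mirrors how the CC-NEWS example below unpacks a whole collection, and the publisher attribute names `FoxNews` and `FreeBeacon` are assumptions based on the example output above.
+
+ ``` python
+ from fundus import PublisherCollection, Crawler
+
+ # initialize the crawler for two specific publishers at once
+ # (publisher attribute names are assumptions based on the example output above)
+ crawler = Crawler(PublisherCollection.us.FoxNews, PublisherCollection.us.FreeBeacon)
+
+ # crawl 2 articles and print
+ for article in crawler.crawl(max_articles=2):
+     print(article)
+ ```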
+ ## Example 3: Crawl articles from CC-NEWS
+
+ If you're not familiar with CC-NEWS, check out their [paper](https://paperswithcode.com/dataset/cc-news).
+
+ ``` python
+ from fundus import PublisherCollection, CCNewsCrawler
+
+ # initialize the crawler for news publishers based in the US
+ crawler = CCNewsCrawler(*PublisherCollection.us)
+
+ # crawl 2 articles and print
+ for article in crawler.crawl(max_articles=2):
+     print(article)
+ ```
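+
+ If you only want articles from a certain time span of the CC-NEWS dumps, the crawl can be narrowed down. This is only a sketch: the `start`/`end` datetime parameters used below are assumptions, see Tutorial 2 for the exact options.
+
+ ``` python
+ from datetime import datetime
+
+ from fundus import PublisherCollection, CCNewsCrawler
+
+ # restrict the crawl to a time span of CC-NEWS dumps
+ # (the start/end parameters are assumptions; see Tutorial 2 for the exact options)
+ crawler = CCNewsCrawler(*PublisherCollection.us, start=datetime(2022, 1, 1), end=datetime(2022, 3, 1))
+
+ # crawl 2 articles and print
+ for article in crawler.crawl(max_articles=2):
+     print(article)
+ ```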
+
+
## Tutorials
We provide **quick tutorials** to get you started with the library:
- 1. [**Tutorial 1: How to crawl news with Fundus**](docs/...)
- 2. [**Tutorial 2: The Article Class**](docs/...)
- 3. [**Tutorial 3: How to add a new news-source**](docs/how_to_contribute.md)
+ 1. [**Tutorial 1: How to crawl news with Fundus**](docs/1_getting_started.md)
+ 2. [**Tutorial 2: How to crawl articles from CC-NEWS**](docs/2_crawl_from_cc_news.md)
+ 3. [**Tutorial 3: The Article Class**](docs/3_the_article_class.md)
+ 4. [**Tutorial 4: How to filter articles**](docs/4_how_to_filter_articles.md)
+ 5. [**Tutorial 5: How to search for publishers**](docs/5_how_to_search_for_publishers.md)
- The tutorials explain how ...
+ If you wish to contribute, check out these tutorials:
+ 1. [**How to contribute**](docs/how_to_contribute.md)
+ 2. [**How to add a publisher**](docs/how_to_add_a_publisher.md)
## Currently Supported News Sources
You can find the publishers currently supported [**here**](/docs/supported_publishers.md).
- Also: **Adding a new source is easy - consider contributing to the project!**
+ Also: **Adding a new publisher is easy - consider contributing to the project!**
## Contact
- Please email your questions or comments to ...
+ Please email your questions or comments to [**Max Dallabetta**](mailto:[email protected]?subject=[GitHub]%20Fundus)
## Contributing
Thanks for your interest in contributing! There are many ways to get involved;
start with our [contributor guidelines](docs/how_to_contribute.md) and then
check these [open issues](https://github.com/flairNLP/fundus/issues) for specific tasks.
- ## [License](/LICENSE)
+ ## License
- ?
+ [MIT](LICENSE)