Maybe treat these as timeouts too?
<urlopen error [Errno 113] No route to host>
<urlopen error [Errno 111] Connection refused>
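A minimal sketch of how such a check might look, assuming Python 3's urllib (rawdog's own fetch code may be structured differently). The names is_timeout_like and TRANSIENT_ERRNOS are made up for illustration. The errno constants are used rather than the raw numbers, since 113 and 111 are the Linux values.

```python
import errno
import socket
import urllib.error

# Assumption: these are the errors worth treating like timeouts.
# On Linux, EHOSTUNREACH is 113 and ECONNREFUSED is 111.
TRANSIENT_ERRNOS = {errno.EHOSTUNREACH, errno.ECONNREFUSED}


def is_timeout_like(exc):
    """Return True if exc should be handled the same way as a timeout."""
    if isinstance(exc, socket.timeout):
        return True
    if isinstance(exc, urllib.error.URLError):
        reason = exc.reason
        if isinstance(reason, socket.timeout):
            return True
        # urlopen wraps the underlying OSError in URLError.reason.
        if isinstance(reason, OSError):
            return reason.errno in TRANSIENT_ERRNOS
    return False
```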
Add a --change option to change a feed's URL. Problem: getopt can't handle
parsing this -- so it would mean changing to argparse.
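For reference, a rough sketch of what the argparse version could look like. getopt options take at most one argument, whereas argparse's nargs=2 accepts the old and new URL together. The metavar names and the choice to allow the option several times are assumptions, not a settled design.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="rawdog")
    # Hypothetical option: takes the old URL followed by the new URL.
    # action="append" lets it be given more than once on one command line.
    parser.add_argument(
        "--change",
        nargs=2,
        metavar=("OLDURL", "NEWURL"),
        action="append",
        default=[],
        help="change a feed's URL from OLDURL to NEWURL",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    for old_url, new_url in args.change:
        print("would change", old_url, "to", new_url)
```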
Rework the template mechanism so that it keeps strings as Unicode objects until
output time. This'd reduce a lot of the current encoding complexity.
A common complaint of packagers (e.g. Gentoo #345485) is that you have to
explicitly install a config file for rawdog to work.
Build the default config file and stylesheet into rawdog somehow?
Show the chain of HTTP statuses, not just the last one.
Show feedparser version in --help.
Add --version.
Can we do without ensure_unicode? Having instrumented it, it looks like it's
mostly converting strings that feedparser itself has created.
... however, it also flattens out wacky classes that feedparser inserts, so
yes, it's useful.
Handle maxage working on article.date/added -- make this a config option? Merge with one of the existing options?
Make maxarticles work as a per-feed option.
Plugin hook to allow the articles list to be sorted again after filtering -- so
you can filter out duplicates then sort by originally-published date.
Duplicate removal by article title.
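A sketch of the filter-then-sort idea from the two items above, written as plain functions rather than against rawdog's plugin API. Articles are assumed to be dicts with "title" and "date" keys, which is not how rawdog stores them. Titles are compared after trimming and lowercasing, and newest-first ordering is also an assumption.

```python
def dedupe_by_title(articles):
    """Keep the first article seen for each (normalised) title."""
    seen = set()
    kept = []
    for article in articles:
        key = article["title"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(article)
    return kept


def resort_by_date(articles):
    """Sort again after filtering, newest first."""
    return sorted(articles, key=lambda a: a["date"], reverse=True)
```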
gzip the state file.
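This could be as small as swapping open() for gzip.open() around the pickle calls. A sketch, with save_state and load_state as placeholder names (rawdog's persistence code is organised differently):

```python
import gzip
import pickle


def save_state(path, state):
    """Pickle state into a gzip-compressed file."""
    with gzip.open(path, "wb") as f:
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_state(path):
    """Read back a state object written by save_state."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

Existing uncompressed state files would still need handling on upgrade; that is not covered here.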
Optionally use a better backend: a real transactional database rather than a
huge file and pickle.
An alternative approach would be to keep a second lightweight cache of
articles that rawdog knows it's already seen -- it would only need to
store hashes (or article guids) rather than the full articles.
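A sketch of such a cache, assuming articles can be identified by feed URL plus guid. SeenCache is a made-up name, and SHA-1 is just one possible hash. Only the hex digests are kept, so the cache stays small however large the articles are.

```python
import hashlib


class SeenCache:
    """Remembers which articles have been seen, storing only hashes."""

    def __init__(self):
        self.hashes = set()

    @staticmethod
    def key(feed_url, guid):
        data = (feed_url + "\n" + guid).encode("utf-8")
        return hashlib.sha1(data).hexdigest()

    def check_and_add(self, feed_url, guid):
        """Return True if already seen; otherwise record it and return False."""
        k = self.key(feed_url, guid)
        if k in self.hashes:
            return True
        self.hashes.add(k)
        return False
```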
Or could use fuzzy comparison against previous articles in the same feed -- do
this as a plugin.
Daemon mode -- keep a pidfile, and check the mtime of the state file to avoid
having to reread it.
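The mtime check might look something like this. StateWatcher is a hypothetical helper; pidfile handling is left out.

```python
import os


class StateWatcher:
    """Tracks a file's mtime so the caller only rereads it after a change."""

    def __init__(self, path):
        self.path = path
        self.mtime = None  # mtime seen at the last reload, in nanoseconds

    def needs_reload(self):
        try:
            mtime = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return False  # nothing on disk to reread
        if mtime != self.mtime:
            self.mtime = mtime
            return True
        return False
```

The first call always returns True for an existing file, which gives the initial load for free.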
Option to limit update runtime
Fix rawdog -a https://www.fsf.org/blogs/rms/
... specifically, the problem is that it lists lots of feeds that aren't
related to that page:
<link rel="alternate" title="FSF News" href="//static.fsf.org/fsforg/rss/news.xml" type="application/rss+xml" />
<link rel="alternate" title="FSF Events" href="//static.fsf.org/fsforg/rss/events.xml" type="application/rss+xml" />
... I'm not sure there's anything that can reasonably be done about that.
Look at Expires header to automatically select "periods"
- Longer-term future: split features out to plugins
- refresh header
- HTML4 output
- removing duplicate articles
- DayWriter
- mx.Tidy
- proxy auth
- per-feed proxies
- length limiting in templates
- mx.Tidy flag to output xhtml
- nuke future pubDates (Rick van Rein)
- And new features to write as plugins
- sort by feeds first (should work since sort is stable)
- Article numbering (probably not)
- Ctrl-C shouldn't print the "error loading" warning.
If rawdog crashes while updating a feed, it shouldn't forget the feeds it's
already updated. Perhaps have an exception handler that keeps a safety copy of
the state file and saves where it's got to so far?
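One possible shape for the "save where it's got to" half of this, using try/finally so progress is persisted whether or not an update raises. update_one and save_state are placeholder callables supplied by the caller; the safety copy of the old state file is not shown.

```python
def update_all(feeds, update_one, save_state):
    """Update each feed in turn; persist progress even if one update raises."""
    done = []
    try:
        for feed in feeds:
            update_one(feed)
            done.append(feed)
    finally:
        # Runs on success and on a crash alike, so feeds already
        # updated are not forgotten. The exception still propagates.
        save_state(done)
    return done
```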
Improve efficiency -- memoise stuff before comparing articles.
__comments__ from the feed (and anything else worth having?).
Option to choose whether full content or summary is preferred.
Review expiry logic: is maxage=0 the same as currentonly?
OPML listing -- needs feed type.
Do for now as a separate program, once the config parser has been split out.
Error reporting as feed (or as separate display?) (Dorward)
RSS output.
Feed schedule spreading.
Timeouts should be ignored unless it's been getting timeouts for more than a
configurable amount of time.
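A sketch of the bookkeeping this would need per feed. TimeoutTracker and grace_seconds are invented names; times are plain numbers (seconds) passed in by the caller.

```python
class TimeoutTracker:
    """Suppresses timeout reports until they have persisted long enough."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.first_timeout = None  # start of the current run of timeouts

    def record_success(self):
        # Any successful fetch ends the run of timeouts.
        self.first_timeout = None

    def record_timeout(self, now):
        """Record a timeout at time now; return True if it should be reported."""
        if self.first_timeout is None:
            self.first_timeout = now
        return now - self.first_timeout > self.grace_seconds
```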
Add a needs_update() method to Feed; make Rawdog call that on all the feeds
(when not being forced) and then call update() on each of them that needs it.
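Roughly, with a cut-down stand-in for the Feed class (the period and last_update attributes here are assumptions, and fetching is omitted):

```python
class Feed:
    def __init__(self, url, period):
        self.url = url
        self.period = period    # seconds between updates
        self.last_update = 0    # time of the last update

    def needs_update(self, now):
        return now - self.last_update >= self.period

    def update(self, now):
        # The actual fetch and parse would go here.
        self.last_update = now


def update_feeds(feeds, now, force=False):
    """Ask every feed first, then update only the ones that need it."""
    to_update = [f for f in feeds if force or f.needs_update(now)]
    for feed in to_update:
        feed.update(now)
    return [f.url for f in to_update]
```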
For rawdog 3:
- newer Python features
- use unicode.encode('ascii','xmlcharrefreplace') if possible?
- "for line in file"
- "item in dict"
- include author in article hash
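For the xmlcharrefreplace item above, a small sketch in Python 3 terms (where unicode is simply str). to_ascii_html is a made-up helper name.

```python
def to_ascii_html(text):
    """Replace non-ASCII characters with numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_html("caf\u00e9"))  # prints caf&#233;
```

Note that this only handles non-ASCII characters; it does not escape "&" or "<".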
Abandoned:
- __feednumber_mod_N__ (for easier default styles, i.e. _mod_4 counts 0-3,0-3)
(didn't really work for lots of feeds)
## Notes from the splitstate conversion
The objective here is to significantly reduce rawdog's memory usage in favour
of IO. (Although the IO usage may actually go down, since we don't have to
rewrite feed states that didn't change.)
The plan is to enable split state while keeping regular behaviour around as the
default (for now, to be removed in rawdog 3).
-- Stage 1: making update memory usage O(biggest #articles) --
Feed stays as is -- i.e. persisted as part of Rawdog, containing the feed info,
and so forth. (These may change in rawdog 3 -- there's a tradeoff, because if
we store the update time/eTag/... in the feed state then we have to rewrite it
every time we update, rather than just if the content's changed. Actually, we
don't want to do this, since we don't want to read the FeedState at all if it
doesn't need updating.)
There's a new FeedState class, persisted into STATEDIR/feeds/12345678.state
(where 12345678 is the feed URL hash as currently used).
(FIXME: when changing feed URL, we need to rename the statefile too.)
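A sketch of the path scheme and the rename the FIXME asks for. The hash used here (first 8 hex digits of SHA-1) is an assumption for illustration only; rawdog's existing feed URL hash should be used instead.

```python
import hashlib
import os


def feed_hash(url):
    # Assumption: stand-in for rawdog's real feed URL hash.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]


def state_path(statedir, url):
    """Return STATEDIR/feeds/<hash>.state for the given feed URL."""
    return os.path.join(statedir, "feeds", feed_hash(url) + ".state")


def rename_state(statedir, old_url, new_url):
    """Move a feed's state file when its URL changes."""
    old = state_path(statedir, old_url)
    new = state_path(statedir, new_url)
    if os.path.exists(old):
        os.replace(old, new)
    return new
```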
Feed.update() takes an article-dict argument, which might be the existing
Rawdog.articles hash or might be from a FeedState, just containing that feed's
articles. (It doesn't care either way.)
When doing updates, if we're in split-state mode, it loads and saves the
FeedState around each article.
(FIXME: optimisation: only mark a FeedState as modified if it was actually
modified, not if it was updated but nothing changed.)
-- Stage 2: making write memory usage O(#articles on page) --
Article gets a new method to return the date that should be used for sorting
(i.e. this logic gets moved out of the write code).
Get the list of articles eligible for output -- as (sort-date, feed-hash,
sequence-number, article-hash) tuples (for ease of sorting). Then fetch the
articles for each feed.
(FIXME: the implementation of this is rather messy; it should be done, perhaps,
at the Feed level, then it would be sufficiently abstract to let us do this
over a database at some point in the future...)
Rawdog.write() then collects the list of articles from all the feeds, sorts it,
and retrieves only the appropriate set of articles from each feed state before
writing them.
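To illustrate why the tuples are convenient: they sort directly, and the survivors can then be grouped by feed so each feed state is opened once. This is a sketch only; select_articles and limit are invented names, and newest-first ordering is assumed.

```python
from collections import defaultdict


def select_articles(entries, limit):
    """Pick the newest entries and group the wanted article hashes by feed.

    entries: iterable of (sort_date, feed_hash, sequence, article_hash).
    """
    newest = sorted(entries, reverse=True)[:limit]
    wanted = defaultdict(list)
    for _date, feed_hash, _seq, article_hash in newest:
        wanted[feed_hash].append(article_hash)
    return newest, dict(wanted)
```

Only the article hashes in the returned dict need to be loaded from each feed's state, so memory use follows the number of articles on the page.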
(FIXME: optimisation: have a dict available at update and write time into which
the current article lists get stashed as the update progresses, to avoid
opening the state file three times when we update a feed.)
(FIXME: the sort hook will need to be changed -- use a different hook when in
split-state mode.)
-- Stage 3: making fetch memory usage O(biggest #articles * #threads) --
Give the fetcher threads a "shared channel" to the main thread that's doing the
updates, so that updates and fetches can proceed in parallel, and the only
buffers used are by active threads.
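One way the "shared channel" could be done is a bounded queue.Queue between the fetcher threads and the main thread. This is a sketch under that assumption; fetch and update are placeholder callables, and fetch_and_update is not an existing rawdog function.

```python
import queue
import threading


def fetch_and_update(urls, fetch, update, num_threads=2):
    """Fetch in worker threads while the main thread applies updates."""
    todo = queue.Queue()
    for url in urls:
        todo.put(url)
    # Bounded, so finished fetches cannot pile up faster than they are used.
    results = queue.Queue(maxsize=num_threads)

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return
            try:
                result = fetch(url)
            except Exception as exc:
                result = exc  # hand the error to the main thread
            results.put((url, result))  # blocks while the channel is full

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    # Exactly one result arrives per URL, so the main thread knows when to stop.
    for _ in range(len(urls)):
        url, result = results.get()
        update(url, result)
    for t in threads:
        t.join()
```

With this shape, at most num_threads results wait in the queue, plus one in hand per worker, so buffered data stays proportional to the thread count.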