<!DOCTYPE html>
<html>
<head><meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>sklearn_webSpider</title><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.1.10/require.min.js"></script>
<link rel="stylesheet" type="text/css" href="/static/css/md_notebook.css" />
<!-- Load mathjax -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/latest.js?config=TeX-AMS_CHTML-full,Safe"> </script>
<!-- MathJax configuration -->
<script type="text/x-mathjax-config">
init_mathjax = function() {
if (window.MathJax) {
// MathJax loaded
MathJax.Hub.Config({
TeX: {
equationNumbers: {
autoNumber: "AMS",
useLabelIds: true
}
},
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true,
processEnvironments: true
},
displayAlign: 'center',
CommonHTML: {
linebreaks: {
automatic: true
}
}
});
MathJax.Hub.Queue(["Typeset", MathJax.Hub]);
}
}
init_mathjax();
</script>
<!-- End of mathjax configuration --></head>
<body class="jp-Notebook" data-jp-theme-light="true" data-jp-theme-name="JupyterLab Light">
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>Chapter 14. From data acquisition to topic extraction: web crawling (Requests / bs4 / RegExp)</p>
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h1 id="%E7%AE%80%E5%8D%95%E9%A1%B5%E9%9D%A2%E7%9A%84%E7%88%AC%E5%8F%96">Crawling a simple page<a class="anchor-link" href="#%E7%AE%80%E5%8D%95%E9%A1%B5%E9%9D%A2%E7%9A%84%E7%88%AC%E5%8F%96">¶</a></h1>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [1]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span> pip list <span class="p">|</span> grep -i request
<span class="c1"># The Requests library is built on urllib3 and is easier to use than the standard-library urllib.</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>requests 2.25.1
requests-oauthlib 1.3.0
requests-unixsocket 0.2.0
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="%E6%9F%A5%E8%AF%A2User-agent">Finding your User-Agent<a class="anchor-link" href="#%E6%9F%A5%E8%AF%A2User-agent">¶</a></h2>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [2]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># (1) Copy it from a web page</span>
<span class="c1"># https://www.ip138.com/useragent/</span>
<span class="c1"># Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36</span>
<span class="c1"># (2) Copy it from the browser's developer tools (F12)</span>
<span class="c1"># Open Baidu, press F12 and go to the Network tab, press F5 to refresh, pick any loaded resource, then open Headers - Request Headers and read the User-Agent value:</span>
<span class="n">UserAgent</span><span class="o">=</span><span class="s2">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36"</span>
</pre></div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="%E7%86%9F%E6%82%89%E7%BD%91%E7%AB%99%E7%BB%93%E6%9E%84">Getting familiar with the site structure<a class="anchor-link" href="#%E7%86%9F%E6%82%89%E7%BD%91%E7%AB%99%E7%BB%93%E6%9E%84">¶</a></h2>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [3]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Policies - Latest: http://www.gov.cn/zhengce/zuixin.htm</span>
<span class="n">url</span><span class="o">=</span> <span class="s2">"http://www.gov.cn/zhengce/content/2021-11/03/content_5648645.htm"</span>
</pre></div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="%E7%88%AC%E5%8F%96%E5%B9%B6%E4%BF%9D%E5%AD%98%E5%88%B0%E6%9C%AC%E5%9C%B0">Crawling a page and saving it locally<a class="anchor-link" href="#%E7%88%AC%E5%8F%96%E5%B9%B6%E4%BF%9D%E5%AD%98%E5%88%B0%E6%9C%AC%E5%9C%B0">¶</a></h2>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [4]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">requests</span>
<span class="n">headers</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="n">UserAgent</span>
<span class="p">}</span>
<span class="c1"># Send the request</span>
<span class="n">r</span><span class="o">=</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span>
<span class="c1">#print(r.text)  # the main content comes out garbled</span>
<span class="c1"># Print the encoding that Requests guessed</span>
<span class="nb">print</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">encoding</span><span class="p">)</span> <span class="c1">#ISO-8859-1</span>
<span class="c1"># But the page itself declares charset="utf-8"</span>
<span class="c1"># So set the encoding explicitly</span>
<span class="n">r</span><span class="o">.</span><span class="n">encoding</span><span class="o">=</span><span class="s2">"utf-8"</span>
<span class="nb">print</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">encoding</span><span class="p">)</span> <span class="c1">#utf-8</span>
<span class="c1">#print(r.text)  # now prints correctly</span>
<span class="c1"># The text still contains HTML tags, which we do not need.</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>ISO-8859-1
utf-8
</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [5]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
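<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Option 1: save the page as HTML</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s2">"dustbin"</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="c1"># make sure the output directory exists</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"dustbin/test.html"</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">'utf8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="c1"># Option 2: parse out the text and save it as a CSV file, see below.</span>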
</pre></div>
</div>
</div>
</div>
</div>
</div>
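<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>The ISO-8859-1 value printed by <code>r.encoding</code> is most likely the fallback that Requests applies when the response's Content-Type header names a text type without a charset. The sketch below runs offline and uses a made-up sample string (an assumption, not text from the crawled page) to show why decoding UTF-8 bytes with that fallback gives garbled output, and why setting the declared charset fixes it.</p>

```python
# Hypothetical sample text standing in for a page body; not taken from the crawled site.
original = "政策文件"

# What a server would send: the text encoded as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Decoding with the wrong charset never raises for ISO-8859-1,
# but every byte becomes its own character, so the text is garbled.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the charset the page declares recovers the text.
recovered = raw_bytes.decode("utf-8")

print(garbled == original)    # False
print(recovered == original)  # True
print(len(raw_bytes), len(garbled), len(recovered))  # 12 12 4
```

<p>With a real response, the same correction is what assigning <code>r.encoding</code> before reading <code>r.text</code> achieves.</p>
</div>
</div>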
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h1 id="%E7%A8%8D%E5%BE%AE%E5%A4%8D%E6%9D%82%E7%9A%84%E7%88%AC%E8%99%AB">A slightly more complex crawler<a class="anchor-link" href="#%E7%A8%8D%E5%BE%AE%E5%A4%8D%E6%9D%82%E7%9A%84%E7%88%AC%E8%99%AB">¶</a></h1>
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
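<p>The example above actually makes us less efficient: for a single URL, reading the page in a browser is faster.</p>
<p>What we want is a list (department, title, link) that lets us spot the items of interest at a glance and then open them for details. How do we get such a list?</p>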
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>Goal: crawl the list of latest policies, url="<a href="http://www.gov.cn/zhengce/zuixin.htm">http://www.gov.cn/zhengce/zuixin.htm</a>"</p>
<ul>
<li>Analyze the page: open it, press F12 to inspect it, and locate the news list.</li>
</ul>
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F%E7%AE%80%E4%BB%8B">A brief introduction to regular expressions<a class="anchor-link" href="#%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F%E7%AE%80%E4%BB%8B">¶</a></h2>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [1]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">re</span>
<span class="n">pattern</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s2">"\d+"</span><span class="p">)</span> <span class="c1"># raw string, so \d reaches the regex engine unchanged</span>
<span class="n">rs1</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="s2">"是数字吗0123"</span><span class="p">)</span> <span class="c1"># match only matches at the start of the string</span>
<span class="nb">print</span><span class="p">(</span><span class="n">rs1</span><span class="p">)</span>
<span class="n">rs2</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="s2">"456text78"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">rs2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">rs2</span><span class="o">.</span><span class="n">group</span><span class="p">())</span> <span class="c1"># use .group() to get the matched text</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>None
<re.Match object; span=(0, 3), match='456'>
456
</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [2]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># To match anywhere in the string, use re.search</span>
<span class="n">rs1</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="s2">"是数字吗0123"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">rs1</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">rs1</span><span class="o">.</span><span class="n">group</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">rs1</span><span class="o">.</span><span class="n">span</span><span class="p">())</span> <span class="c1"># use .span() to get the (start, end) positions of the match</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre><re.Match object; span=(4, 8), match='0123'>
0123
(4, 8)
</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [3]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">str1</span><span class="o">=</span><span class="s2">"是数字吗0123"</span>
<span class="n">span</span><span class="o">=</span><span class="n">rs1</span><span class="o">.</span><span class="n">span</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">rs1</span><span class="o">.</span><span class="n">span</span><span class="p">())</span> <span class="c1">#(4, 8)</span>
<span class="n">str1</span><span class="p">[</span><span class="n">span</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span><span class="n">span</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="c1">#'0123'</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>(4, 8)
</pre>
</div>
</div>
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt">Out[3]:</div>
<div class="jp-RenderedText jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/plain">
<pre>'0123'</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [4]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Split the string on runs of digits</span>
<span class="n">rs3</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="s2">"这是123一个1漂亮的3鼠标"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">rs3</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>['这是', '一个', '漂亮的', '鼠标']
</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [5]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Extract all of the numbers</span>
<span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="s2">"这是123一个1漂亮的3鼠标"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt">Out[5]:</div>
<div class="jp-RenderedText jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/plain">
<pre>['123', '1', '3']</pre>
</div>
</div>
</div>
</div>
</div>
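<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>The cells above cover <code>match</code>, <code>search</code>, <code>split</code> and <code>findall</code>. As a small extension, the sketch below shows capture groups, <code>finditer</code> and <code>sub</code> from the same <code>re</code> module. The sample string, including its date and document number, is made up for illustration.</p>

```python
import re

# Hypothetical sample string; the date and document number are invented.
text = "issued 2021-11-03, document no. 22"

# A raw string keeps the backslashes intact for the regex engine.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# search scans the whole string; groups() returns the captured pieces.
m = date_pattern.search(text)
year, month, day = m.groups()
print(year, month, day)  # 2021 11 03

# finditer yields one match object per occurrence, each with its position.
numbers = [(n.group(), n.span()) for n in re.finditer(r"\d+", text)]
print(numbers)

# sub replaces every match; here each run of digits becomes "#".
masked = re.sub(r"\d+", "#", text)
print(masked)  # issued #-#-#, document no. #
```

<p>Capture groups are convenient when a link or title embeds structured pieces, such as the date inside a policy URL.</p>
</div>
</div>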
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="%E4%BD%BF%E7%94%A8-BeautiSoup-%E8%BF%9B%E8%A1%8C-html-%E8%A7%A3%E6%9E%90">Parsing HTML with BeautifulSoup<a class="anchor-link" href="#%E4%BD%BF%E7%94%A8-BeautiSoup-%E8%BF%9B%E8%A1%8C-html-%E8%A7%A3%E6%9E%90">¶</a></h2>
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>Two Python libraries for parsing HTML are lxml and BeautifulSoup.</p>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [6]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># $ pip3 install beautifulsoup4 -i https://pypi.douban.com/simple/</span>
<span class="c1"># $ pip3 list | grep -i soup</span>
<span class="c1">#beautifulsoup4 4.10.0</span>
<span class="c1"># $ pip3 list | grep -i xml</span>
<span class="c1">#defusedxml 0.7.1</span>
<span class="c1">#lxml 4.6.3</span>
<span class="c1"># Download the HTML</span>
<span class="n">url</span><span class="o">=</span> <span class="s2">"http://www.gov.cn/zhengce/content/2021-11/03/content_5648645.htm"</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="n">headers</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36"</span>
<span class="p">}</span>
<span class="n">r</span><span class="o">=</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span>
<span class="c1"># Set the encoding explicitly</span>
<span class="n">r</span><span class="o">.</span><span class="n">encoding</span><span class="o">=</span><span class="s2">"utf-8"</span>
<span class="c1"># Parse the HTML</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s1">'lxml'</span><span class="p">)</span> <span class="c1"># use lxml as the parser; it is faster than the built-in html.parser. r.text is already str, so from_encoding would be ignored and is omitted</span>
<span class="c1">#print(soup)</span>
</pre></div>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [7]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Get the title</span>
<span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="p">)</span> <span class="c1"># includes the HTML tags</span>
<span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">string</span><span class="p">)</span> <span class="c1"># text only, tags removed</span>
<span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">get_text</span><span class="p">())</span> <span class="c1"># an alternative that gives the same text here</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre><title>国务院关于2020年度国家科学技术奖励的决定(国发〔2021〕22号)_政府信息公开专栏</title>
国务院关于2020年度国家科学技术奖励的决定(国发〔2021〕22号)_政府信息公开专栏
国务院关于2020年度国家科学技术奖励的决定(国发〔2021〕22号)_政府信息公开专栏
</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [8]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Get the first paragraph</span>
<span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
<span class="c1"># How do we get all of the paragraphs?</span>
<span class="n">texts</span><span class="o">=</span><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">'p'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">text</span><span class="o">.</span><span class="n">string</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span> <span class="c1"># skip paragraphs without a single text string (empty, or containing nested tags)</span>
        <span class="k">continue</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>国务院关于2020年度
国务院关于2020年度
国家科学技术奖励的决定
国发〔2021〕22号
各省、自治区、直辖市人民政府,国务院各部委、各直属机构:
为深入贯彻落实习近平新时代中国特色社会主义思想,全面贯彻党的十九大和十九届二中、三中、四中、五中全会精神,坚定实施科教兴国战略、人才强国战略和创新驱动发展战略,国务院决定,对为我国科学技术进步、经济社会发展、国防现代化建设作出突出贡献的科学技术人员和组织给予奖励。
根据《国家科学技术奖励条例》的规定,经国家科学技术奖励评审委员会评审、国家科学技术奖励委员会审定和科技部审核,国务院批准并报请国家主席习近平签署,授予顾诵芬院士、王大中院士国家最高科学技术奖;国务院批准,授予“纳米限域催化”等2项成果国家自然科学奖一等奖,授予“面心立方材料弹塑性力学行为及原子层次机理研究”等44项成果国家自然科学奖二等奖,授予“超高清视频多态基元编解码关键技术”等3项成果国家技术发明奖一等奖,授予“良种牛羊卵子高效利用快繁关键技术”等58项成果国家技术发明奖二等奖,授予“嫦娥四号工程”等2项成果国家科学技术进步奖特等奖,授予“400万吨/年煤间接液化成套技术创新开发及产业化”等18项成果国家科学技术进步奖一等奖,授予“厘米级型谱化移动测量装备关键技术及规模化工程应用”等137项成果国家科学技术进步奖二等奖,授予苏·欧瑞莉教授等8名外国专家和国际热带农业中心中华人民共和国国际科学技术合作奖。
全国科学技术工作者要向顾诵芬院士、王大中院士及全体获奖者学习,不忘初心、牢记使命,秉持国家利益和人民利益至上,继承和发扬老一辈科学家胸怀祖国、服务人民的优秀品质,主动肩负起历史重任,坚持创新在我国现代化建设全局中的核心地位,把科技自立自强作为国家发展的战略支撑,以与时俱进的精神、革故鼎新的勇气、坚忍不拔的定力,面向世界科技前沿、面向经济主战场、面向国家重大需求、面向人民生命健康,加快建设科技强国,为夺取全面建设社会主义现代化国家新胜利、实现中华民族伟大复兴作出新的更大贡献。
国务院
2021年10月19日
(此件公开发布)
</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [9]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Get a link</span>
<span class="n">link</span><span class="o">=</span><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># the last link on the page</span>
<span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'href'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt">Out[9]:</div>
<div class="jp-RenderedText jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/plain">
<pre>'http://www.gov.cn/home/2014-02/18/content_5046260.htm'</pre>
</div>
</div>
</div>
</div>
</div>
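<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>The link printed above happens to be absolute, but <code>href</code> values on a page can also be relative. A minimal sketch, assuming you want full URLs in the final list, is to resolve each one against the page URL with the standard library's <code>urljoin</code>. The two relative values below are hypothetical examples, not links taken from the site.</p>

```python
from urllib.parse import urljoin

# The page URL comes from the notebook; the relative hrefs are hypothetical.
page_url = "http://www.gov.cn/zhengce/content/2021-11/03/content_5648645.htm"
hrefs = [
    "http://www.gov.cn/home/2014-02/18/content_5046260.htm",  # already absolute
    "/zhengce/zuixin.htm",                                    # root-relative
    "content_example.htm",                                    # relative to the page's directory
]

# urljoin leaves absolute URLs untouched and resolves the relative ones.
absolute = [urljoin(page_url, h) for h in hrefs]
for link in absolute:
    print(link)
```

<p>Doing this before writing rows to a file keeps every saved link directly clickable.</p>
</div>
</div>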
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="%E7%88%AC%E5%8F%96%E6%96%B0%E9%97%BB%E7%9B%AE%E5%BD%95%E9%A1%B5%E9%9D%A2%E5%B9%B6%E4%BF%9D%E5%AD%98%E4%B8%BAcsv">Crawling the news index page and saving it as CSV<a class="anchor-link" href="#%E7%88%AC%E5%8F%96%E6%96%B0%E9%97%BB%E7%9B%AE%E5%BD%95%E9%A1%B5%E9%9D%A2%E5%B9%B6%E4%BF%9D%E5%AD%98%E4%B8%BAcsv">¶</a></h2>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [10]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">requests</span><span class="o">,</span> <span class="nn">csv</span><span class="o">,</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="c1"># Download</span>
<span class="n">url</span><span class="o">=</span><span class="s2">"http://www.gov.cn/zhengce/zuixin.htm"</span>
<span class="n">user_agent</span><span class="o">=</span><span class="s2">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36"</span>
<span class="n">policies</span><span class="o">=</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">"User-Agent"</span><span class="p">:</span> <span class="n">user_agent</span><span class="p">})</span>
<span class="n">policies</span><span class="o">.</span><span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span>
<span class="c1">#解析</span>
<span class="n">p</span><span class="o">=</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">policies</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s1">'lxml'</span><span class="p">)</span>
<span class="c1"># 用正则表达式匹配所有包含 content 的单词的a标签</span>
<span class="n">contents</span> <span class="o">=</span><span class="n">p</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s2">"a"</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'content'</span><span class="p">))</span>
<span class="c1"># 定义一个空列表</span>
<span class="n">rows</span><span class="o">=</span><span class="p">[]</span>
<span class="c1">#设计一个for循环,提取每个链接汇总的标题</span>
<span class="k">for</span> <span class="n">content</span> <span class="ow">in</span> <span class="n">contents</span><span class="p">:</span>
<span class="n">href</span><span class="o">=</span><span class="n">content</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'href'</span><span class="p">)</span>
<span class="n">row</span><span class="o">=</span><span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">string</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">4</span><span class="p">],</span> <span class="n">content</span><span class="o">.</span><span class="n">string</span><span class="p">,</span> <span class="n">href</span><span class="p">)</span>
<span class="n">rows</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="c1"># 定义表头</span>
<span class="n">header</span><span class="o">=</span><span class="p">[</span><span class="s2">"发文部门"</span><span class="p">,</span> <span class="s2">"标题"</span><span class="p">,</span> <span class="s2">"链接"</span><span class="p">]</span>
<span class="c1"># 保存文件;原文 encoding='gb18030'</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"dustbin/policies.csv"</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">'utf8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">writer</span><span class="o">=</span><span class="n">csv</span><span class="o">.</span><span class="n">writer</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>
<span class="n">writer</span><span class="o">.</span><span class="n">writerows</span><span class="p">(</span> <span class="n">rows</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"==done=="</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>==done==
</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [11]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># check</span>
<span class="o">!</span>head -n <span class="m">3</span> dustbin/policies.csv
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>
</pre>
</div>
</div>
</div>
</div>
</div>
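<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>The shell check above depends on <code>head</code> being available. As a portable alternative, the file can be read back with <code>csv.reader</code>. The following is a minimal sketch, not the notebook's own code: the sample rows, the temporary path, and the helper names <code>write_csv</code> and <code>read_csv</code> are all assumptions made for illustration.</p>

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the scraped (department, title, link) tuples.
header = ["department", "title", "link"]
rows = [
    ("Dept", "Dept A notice on an example policy", "http://example.com/content_1.htm"),
    ("Dept", "Dept B notice, with a comma", "http://example.com/content_2.htm"),
]


def write_csv(path, header, rows):
    # newline='' lets the csv module control line endings itself.
    with open(path, "w", encoding="utf8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path):
    # Returns every row, header included, as a list of lists of strings.
    with open(path, encoding="utf8", newline="") as f:
        return list(csv.reader(f))


# Write to a throwaway directory so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "policies.csv")
write_csv(path, header, rows)
loaded = read_csv(path)

print(len(loaded) - 1, "data rows")
for line in loaded[:3]:
    print(line)
```

<p>A count of zero data rows would suggest that the link filter matched nothing on the downloaded page, which is worth ruling out before moving on.</p>
</div>
</div>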
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h1 id="%E5%AF%B9%E6%96%87%E6%9C%AC%E6%95%B0%E6%8D%AE%E8%BF%9B%E8%A1%8C%E8%AF%9D%E9%A2%98%E6%8F%90%E5%8F%96">Topic extraction from text data<a class="anchor-link" href="#%E5%AF%B9%E6%96%87%E6%9C%AC%E6%95%B0%E6%8D%AE%E8%BF%9B%E8%A1%8C%E8%AF%9D%E9%A2%98%E6%8F%90%E5%8F%96">¶</a></h1>
</div>
</div>
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
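<p>When a crawl returns a lot of content, reading all of it takes a long time. Is there a quick way to grasp the core content of tens of thousands of characters?</p>
<ul>
<li>Latent Dirichlet Allocation (LDA) can be used to extract topics from the text.</li>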
</ul>
</div>
</div>
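<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>To make the idea concrete before any data is downloaded, here is a minimal LDA sketch with scikit-learn. It is only an illustration under stated assumptions: the four toy documents, the choice of two topics, and the variable names are made up here, and the tokens are already separated by spaces. Chinese text would first need to go through a word segmenter, because <code>CountVectorizer</code> splits on word boundaries by default.</p>

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; tokens are already separated by spaces.
docs = [
    "cat dog pet animal cat",
    "dog pet animal vet",
    "stock market price trade stock",
    "market trade price bank",
]

# Turn the documents into a document-term count matrix.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

n_topics = 2  # assumed number of topics, chosen by hand for this toy corpus
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topic = lda.fit_transform(counts)  # one row of topic weights per document

# Invert the vocabulary mapping so column indices can be turned back into words.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

top_words = []
for weights in lda.components_:
    best = weights.argsort()[::-1][:3]  # indices of the three heaviest words
    top_words.append([index_to_word[int(i)] for i in best])

for topic_id, words in enumerate(top_words):
    print("topic", topic_id, ":", " ".join(words))
```

<p>Each row of <code>doc_topic</code> sums to 1 and can be read as the mix of topics in that document; the top words per topic give a rough summary of what the corpus is about.</p>
</div>
</div>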
<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="%E4%B8%8B%E8%BD%BD%E5%A4%A7%E9%87%8F%E5%86%85%E5%AE%B9">Download a large amount of content<a class="anchor-link" href="#%E4%B8%8B%E8%BD%BD%E5%A4%A7%E9%87%8F%E5%86%85%E5%AE%B9">¶</a></h2>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [1]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Search Baidu for "段子" (short jokes) and pick a site that consists entirely of plain-text jokes.</span>
<span class="kn">import</span> <span class="nn">requests</span><span class="o">,</span> <span class="nn">csv</span><span class="o">,</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="n">user_agent</span><span class="o">=</span><span class="s2">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36"</span>
<span class="c1"># Start with empty lists for the rows and the URLs already seen</span>
<span class="n">rows</span><span class="o">=</span><span class="p">[]</span>
<span class="n">urls</span><span class="o">=</span><span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">):</span>
    <span class="n">url</span><span class="o">=</span><span class="s2">"http://www.duanziku.com/gaoxiaoduanzi/gxdz</span><span class="si">{}</span><span class="s2">.html"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="n">webpage</span><span class="o">=</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">"User-Agent"</span><span class="p">:</span> <span class="n">user_agent</span><span class="p">})</span>
    <span class="n">webpage</span><span class="o">.</span><span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span>
    <span class="c1"># Parse</span>
    <span class="n">p</span><span class="o">=</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">webpage</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s1">'lxml'</span><span class="p">)</span>
    <span class="c1"># Match every a tag whose href contains gaoxiaoduanzi and ends in digits followed by .html</span>
    <span class="n">contents</span> <span class="o">=</span><span class="n">p</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s2">"a"</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s1">'gaoxiaoduanzi.*\d+\.html$'</span><span class="p">))</span>
    <span class="c1"># Loop over the links and extract the title and URL of each one</span>
    <span class="k">for</span> <span class="n">content</span> <span class="ow">in</span> <span class="n">contents</span><span class="p">:</span>
        <span class="n">href</span><span class="o">=</span><span class="n">content</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'href'</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">content</span><span class="o">.</span><span class="n">string</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">row</span><span class="o">=</span><span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">string</span><span class="p">,</span> <span class="n">href</span><span class="p">)</span>
        <span class="c1"># Skip links that were already collected</span>
        <span class="k">if</span> <span class="n">href</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">urls</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">href</span><span class="p">)</span>
        <span class="c1"># Print the first four rows as a preview</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">urls</span><span class="p">)</span><span class="o">&lt;</span><span class="mi">5</span><span class="p">:</span>
            <span class="nb">print</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
        <span class="n">rows</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="nb">len</span><span class="p">(</span><span class="n">rows</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>('少小离家老大回,安能辨我是雌雄', 'http://www.duanziku.com/gaoxiaoduanzi/1532571462.html')
('据说女人很喜欢这个笑话', 'http://www.duanziku.com/gaoxiaoduanzi/1532050935.html')
('小明,你出去!(四)', 'http://www.duanziku.com/gaoxiaoduanzi/1532047928.html')
('小明,你出去!(三)', 'http://www.duanziku.com/gaoxiaoduanzi/1532007455.html')
</pre>
</div>
</div>
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt">Out[1]:</div>
<div class="jp-RenderedText jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/plain">
<pre>65</pre>
</div>
</div>
</div>
</div>
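<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>The link filter and the duplicate check can be tried offline on a handful of hrefs. This is a sketch under assumptions: the helper name <code>unique_matches</code> is invented here, two hrefs are copied from the output above, and the listing-page and <code>about.html</code> entries are illustrative. It uses a set for the membership test, and it shows that the pattern also accepts listing pages such as <code>gxdz2.html</code>.</p>

```python
import re

# Same regular expression as in the crawl loop, written as a raw string.
link_pattern = re.compile(r"gaoxiaoduanzi.*\d+\.html$")

# Sample hrefs: two from the printed output, plus illustrative extras.
hrefs = [
    "http://www.duanziku.com/gaoxiaoduanzi/1532571462.html",
    "http://www.duanziku.com/gaoxiaoduanzi/1532050935.html",
    "http://www.duanziku.com/gaoxiaoduanzi/1532571462.html",  # duplicate
    "http://www.duanziku.com/gaoxiaoduanzi/gxdz2.html",  # listing page, also matches
    "http://www.duanziku.com/about.html",  # hypothetical, does not match
]


def unique_matches(hrefs, pattern):
    seen = set()  # constant-time membership test instead of scanning a list
    result = []  # keeps first-seen order
    for href in hrefs:
        if pattern.search(href) is None or href in seen:
            continue
        seen.add(href)
        result.append(href)
    return result


kept = unique_matches(hrefs, link_pattern)
print(kept)
```

<p>If listing pages should be excluded, one option is a stricter pattern that requires the file name to be all digits; whether that suits the site's URL scheme is an assumption to verify against real pages.</p>
</div>
</div>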
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [2]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">getJoke</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">webpage</span><span class="o">=</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">"User-Agent"</span><span class="p">:</span> <span class="n">user_agent</span><span class="p">})</span>
    <span class="n">webpage</span><span class="o">.</span><span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span>
    <span class="n">p</span><span class="o">=</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">webpage</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s1">'lxml'</span><span class="p">)</span>
    <span class="n">contents</span> <span class="o">=</span><span class="n">p</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">"span"</span><span class="p">,</span> <span class="n">style</span><span class="o">=</span><span class="s2">"font-size:16px;"</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">contents</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">[]</span> <span class="c1"># return an empty list so callers can still slice the result</span>
    <span class="n">contents</span><span class="o">=</span><span class="n">contents</span><span class="o">.</span><span class="n">text</span>
    <span class="c1"># Remove Windows-style line breaks</span>
    <span class="n">w</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s2">"</span><span class="se">\r\n</span><span class="s2">"</span><span class="p">,</span> <span class="s2">""</span><span class="p">,</span> <span class="n">contents</span><span class="p">)</span>
    <span class="c1"># Split the text into individual jokes</span>
    <span class="n">contents2</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">+"</span><span class="p">,</span> <span class="n">w</span><span class="p">)</span>
    <span class="c1"># Collect the cleaned jokes</span>
    <span class="n">db</span><span class="o">=</span><span class="p">[]</span>
    <span class="k">for</span> <span class="n">joke</span> <span class="ow">in</span> <span class="n">contents2</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">joke</span><span class="o">==</span><span class="s2">""</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">joke</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s2">"^\s*\d{1,2}[.、]"</span><span class="p">,</span> <span class="s2">""</span><span class="p">,</span> <span class="n">joke</span><span class="p">)</span> <span class="c1"># strip the leading item number only</span>
        <span class="n">db</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">joke</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">db</span>
<span class="c1"># test</span>
<span class="n">getJoke</span><span class="p">(</span> <span class="n">rows</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])[</span><span class="mi">0</span><span class="p">:</span><span class="mi">3</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt">Out[2]:</div>
<div class="jp-RenderedText jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/plain">
<pre>['大学时不喜欢读书,跟上铺的哥们说,我最羡慕那些徒步旅行的驴友,想去那里就去那,真想来一次说走就走的徒步旅行,只可惜自己没钱买装备。上铺那哥们表示,他可以友情赞助我装备。然后,送给我一个碗……',
'前两天跟我们组印度同事说认识他之后对印度有所改观,他大惊失色说“千万别,印度人里我是凤毛麟角,所有其他人都是傻哔。”',
'当代人的肥胖问题需要得到重视了。据调查,地铁上每三次为孕妇让座的行为中,就有一次对方不是孕妇。']</pre>
</div>
</div>
</div>
</div>
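<div class="jp-Cell-inputWrapper"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>The numbering-removal step is easy to test on its own, without any network access. The sketch below assumes an anchored pattern, so that only a number at the start of a line is removed and digits inside the joke survive. The helper name <code>strip_numbering</code> and the sample lines are hypothetical.</p>

```python
import re

# Anchored pattern: optional leading whitespace, one or two digits, then "." or "、".
leading_number = re.compile(r"^\s*\d{1,2}[.、]")


def strip_numbering(line):
    # Remove only a number at the start of the line; digits later in the text are kept.
    return leading_number.sub("", line)


# Hypothetical sample lines, not taken from the site.
samples = ["1、first joke", "12.second joke costs 3.5 yuan", "no number here"]
cleaned = [strip_numbering(s) for s in samples]
print(cleaned)
```

<p>Without the <code>^</code> anchor, the same substitution would also delete the <code>3.</code> in <code>3.5</code>, which is why the anchored form is used in this sketch.</p>
</div>
</div>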
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [3]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">getJoke2</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">webpage</span><span class="o">=</span><span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">"User-Agent"</span><span class="p">:</span> <span class="n">user_agent</span><span class="p">})</span>
    <span class="n">webpage</span><span class="o">.</span><span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span>
    <span class="n">p</span><span class="o">=</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">webpage</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s1">'lxml'</span><span class="p">)</span>
    <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">"table"</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">p</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">[]</span>
    <span class="n">contents</span> <span class="o">=</span><span class="n">p</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s2">"p"</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">contents</span><span class="p">:</span> <span class="c1"># find_all returns an empty list, never None</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"broken page:"</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span>
        <span class="k">return</span> <span class="p">[]</span>
    <span class="n">db</span><span class="o">=</span><span class="p">[]</span>
    <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">contents</span><span class="p">:</span>
        <span class="n">text2</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s2">"\s+"</span><span class="p">,</span><span class="s2">""</span><span class="p">,</span> <span class="n">text</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
        <span class="c1"># Keep only paragraphs that are long enough and start with an item number</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text2</span><span class="p">)</span><span class="o">&lt;</span><span class="mi">10</span> <span class="ow">or</span> <span class="n">re</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s2">"[0-9]"</span><span class="p">,</span><span class="n">text2</span><span class="p">)</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">joke</span> <span class="o">=</span><span class="n">text2</span>
        <span class="n">joke</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s2">"^\d{1,2}[.、]"</span><span class="p">,</span> <span class="s2">""</span><span class="p">,</span> <span class="n">joke</span><span class="p">)</span> <span class="c1"># strip the leading item number only</span>
        <span class="n">db</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">joke</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">db</span>
<span class="n">getJoke2</span><span class="p">(</span> <span class="n">rows</span><span class="p">[</span><span class="mi">6</span><span class="p">][</span><span class="mi">1</span><span class="p">])[</span><span class="mi">0</span><span class="p">:</span><span class="mi">3</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt">Out[3]:</div>
<div class="jp-RenderedText jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/plain">
<pre>['某人求领导办事儿,拿出五张美女照片任其选一张:您中意哪个?领导看了看没说话,那人急了:您表个态呀!领导坏笑:你说穷人见到地上有五张百元大钞会捡哪张?',
'楼主在外地,准备偷偷回去给老婆一个惊喜,就没告诉老婆自己买机票飞回去,到家已经半夜,开了房门借着昏暗的床头灯,发现老婆抱着一个男的在睡觉。我看了看,默默的走过去,扒下他的裤子,然后给他换了个尿片。',
'一女生养了一只兔子,那个兔子这几天一直不吃萝卜,女生担心兔子生病,询问医生,医生回答:“兔子从来不吃有腥味的东西。”']</pre>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">