Parser Engine Driven by Template

Install

Fresh pre-release:

pip install git+https://github.com/Danceiny/parser_engine

TODO

Support extracting multiple items from an HTML node list using a single template
Define an item's extraction rules in the Item class

TemplateAnnotation

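Parameter conventions:

name: the business name, i.e. the name class variable of the Spider class that uses this annotation.

For the other parameters, see the comments in decorator.py.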

Html response

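A simple example.

Goal: scrape the step titles from the page "抖音用户关键字搜索抓包数据分析脚本使用指南" (a usage guide for a script that analyzes packet captures of Douyin user keyword searches). Each step is an h3 tag, the step title is stored in the tag's id attribute, and a prefix of the form 1- must be stripped. The corresponding configuration file is: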

    {
      "name": "demo",
      "fields": [
        {
          "dom_id": null,
          "_css": null,
          "xpath": null,
          "tags": [
            "h3"
          ],
          "classes": [],
          "attributes": null,
          "position": null,
          "key": "步骤",
          "value_type": null,
          "regexp": "[\\d]{1,2}-(\\w+)",
          "attr_name": "id"
        }
      ]
    }

Output:

    {'步骤': ['准备工作', '找到电脑的ip地址和端口', '确保手机与电脑建立连接', '抖音搜索关键词', '抓包数据导出', '提取用户信息', '推荐在线转换工具', 'python脚本导出']}

If only the second step is needed, set the position parameter in the JSON configuration to 2 to get the following output:

    {'步骤': ['找到电脑的ip地址和端口']}
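To make the rule above concrete, here is a minimal sketch of how the tags, attr_name, regexp, and position fields could combine. It is not PE's actual implementation: the AttrCollector class, the extract_field function, and the sample HTML are hypothetical, only the standard library is used, and position is assumed to be 1-based, as the "second step" example suggests.

```python
import re
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect one attribute's value from every occurrence of a given tag."""

    def __init__(self, tag, attr_name):
        super().__init__()
        self.tag = tag
        self.attr_name = attr_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            value = dict(attrs).get(self.attr_name)
            if value is not None:
                self.values.append(value)


def extract_field(html, field):
    """Apply a single field rule (hypothetical helper, not part of PE)."""
    collector = AttrCollector(field["tags"][0], field["attr_name"])
    collector.feed(html)
    values = collector.values
    if field.get("regexp"):
        # Keep only the first capture group of each matching value.
        pattern = re.compile(field["regexp"])
        matches = (pattern.search(v) for v in values)
        values = [m.group(1) for m in matches if m]
    position = field.get("position")
    if position is not None:
        # Assumption: position is 1-based.
        values = values[position - 1:position]
    return {field["key"]: values}


# Hypothetical sample page, shaped like the one described above.
SAMPLE_HTML = """
<h3 id="1-准备工作">...</h3>
<h3 id="2-找到电脑的ip地址和端口">...</h3>
<p id="3-ignored">not an h3, so it is skipped</p>
"""

field = {
    "tags": ["h3"],
    "attr_name": "id",
    "regexp": r"[\d]{1,2}-(\w+)",
    "key": "步骤",
    "position": None,
}
print(extract_field(SAMPLE_HTML, field))
```

Passing dict(field, position=2) to extract_field returns only the second title, mirroring the position example above.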

JSON text response

    {
      "name": "json-api-demo",
      "fields": [
        {
          "key": "poi_id",
          "json_path": "$.pois[:1].id"
        },
        {
          "key": "地名",
          "json_path": "$.data.name",
          "value_type": "singleton"
        },
        {
          "key": "下级",
          "json_path": "$.data.children[*].name"
        }
      ]
    }

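The json_path field fully follows the JSONPath syntax, and expressions can be debugged with an online JSONPath tester. Because JSONPath evaluation always returns a list, some fields that are known to hold exactly one value need special handling. For example, to get the name field of the area returned by the API http://172.31.1.4:30815/api/dict/area/0?childrenDepth=1, set value_type to singleton and PE will convert the list result into a single value.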
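The effect of value_type can be sketched as follows. This is an illustration under stated assumptions, not PE's code: the apply_value_type helper and the response body are hypothetical, the JSONPath results are stood in for by plain Python lists rather than produced by a JSONPath library, and returning None for an empty match list is an assumption.

```python
import json

# Hypothetical response body; the real API's payload is not shown in this README.
RESPONSE_TEXT = (
    '{"data": {"name": "China", '
    '"children": [{"name": "Beijing"}, {"name": "Shanghai"}]}}'
)


def apply_value_type(matches, value_type=None):
    """Unwrap a list of matches when the field is declared a singleton."""
    if value_type == "singleton":
        # Assumption: an empty match list becomes None.
        return matches[0] if matches else None
    return matches


data = json.loads(RESPONSE_TEXT)

# Stand-ins for what a JSONPath evaluator would return (always a list).
name_matches = [data["data"]["name"]]                          # $.data.name
child_matches = [c["name"] for c in data["data"]["children"]]  # $.data.children[*].name

item = {
    "地名": apply_value_type(name_matches, "singleton"),
    "下级": apply_value_type(child_matches),
}
print(item)
```

Without the singleton setting, the 地名 field would be the one-element list ['China'] instead of the string 'China'.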

For concrete usage, see:

demo_spider
gaode_spider
