Merge remote-tracking branch 'upstream/master' into 53
Gallaecio committed Nov 11, 2020
2 parents 96c77af + d78167c commit 96bf6b3
Showing 26 changed files with 22,369 additions and 113 deletions.
14 changes: 12 additions & 2 deletions HISTORY.rst
@@ -2,6 +2,16 @@
History
=======

v0.10.0 (2020-09-01)
--------------------

* support open graph arrays via ``with_og_array=True`` (PR #138)
* support "expanded" Open Graph metadata based on og:type (PR #140)
* parse JSON with JS comments for json-ld (PR #137)
* preserve order for duplicated properties for RDFa (PR #139)
* improve microdata parser performance with large number of items (PR #148)
* spelling fixes (PR #145)

v0.9.0 (2020-04-20)
-------------------

@@ -80,7 +90,7 @@ v0.5.0 (2018-06-08)
html nodes.
* ``base_url`` substitutes ``url`` in ``MicroformatExtractor``, ``JsonLdExtractor``,
``OpenGraphExtractor``, ``RDFaExtractor`` and ``MicrodataExtractor``
* individual extractors accpet ``base_url`` instead of ``url``, unused keyword
* individual extractors accept ``base_url`` instead of ``url``, unused keyword
arguments are removed.
* In ``w3microdata.extract_items`` ``items_seen`` and ``url`` are no longer
class variables but are passed as arguments.
@@ -152,7 +162,7 @@ v0.2.0 (2016-09-26)
-------------------

* Web service response content-type set to 'application/json'
* Web service Python 3 compatiblity
* Web service Python 3 compatibility
* Code coverage reports
* Fix extraction of ``<object>`` "data" URL with microdata
* Handle textContent mixed with ``<script>`` and ``<style>`` tags
108 changes: 91 additions & 17 deletions README.rst
@@ -24,6 +24,7 @@ Currently, *extruct* supports:
- `Microformat`_ via `mf2py`_
- `Facebook's Open Graph`_
- (experimental) `RDFa`_ via `rdflib`_
- `Dublin Core Metadata (DC-HTML-2003)`_

.. _W3C's HTML Microdata: http://www.w3.org/TR/microdata/
.. _embedded JSON-LD: http://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents
@@ -32,6 +33,7 @@ Currently, *extruct* supports:
.. _Microformat: http://microformats.org/wiki/Main_Page
.. _mf2py: https://github.com/microformats/mf2py
.. _Facebook's Open Graph: http://ogp.me/
.. _Dublin Core Metadata (DC-HTML-2003): https://www.dublincore.org/specifications/dublin-core/dcq-html/2003-11-30/

The microdata algorithm is a revisit of `this Scrapinghub blog post`_ showing how to use EXSLT extensions.

@@ -71,7 +73,17 @@ First fetch the HTML using python-requests and then feed the response body to ``
>>> data = extruct.extract(r.text, base_url=base_url)
>>>
>>> pp.pprint(data)
{ 'json-ld': [ { '@context': 'https://schema.org',
{ 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
'content': 'What is Open Graph Protocol '
'and why you need it? Learn to '
'implement Open Graph Protocol '
'for Facebook on your website. '
'Open Graph Protocol Meta Tags.',
'name': 'description'}],
'namespaces': {},
'terms': []}],

'json-ld': [ { '@context': 'https://schema.org',
'@id': '#organization',
'@type': 'Organization',
'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
@@ -163,7 +175,7 @@ First fetch the HTML using python-requests and then feed the response body to ``

Select syntaxes
+++++++++++++++
It is possible to select which syntaxes to extract by passing a list with the desired ones to extract. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa'. If no list is passed all syntaxes will be extracted and returned::
It is possible to select which syntaxes to extract by passing a list with the desired ones to extract. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed all syntaxes will be extracted and returned::

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
@@ -207,9 +219,9 @@ It is possible to select which syntaxes to extract by passing a list with the de

Uniform
+++++++
Another option is to uniform the output of microformat, opengraph, microdata and json-ld syntaxes to the following structure: ::
Another option is to uniform the output of microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following structure: ::

{'@context': 'http://example.com',
{'@context': 'http://example.com',
'@type': 'example_type',
/* All other the properties in keys here */
}
@@ -584,6 +596,80 @@ Microformat extraction
}
}]

DublinCore extraction
++++++++++++++++++++++++++++++
::

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
...
...
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
... content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML &lt;meta&gt; elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... <meta name="DCTERMS.modified" content="2001-07-18" />'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
'content': 'Expressing Dublin Core\n'
'in HTML/XHTML meta and link elements',
'lang': 'en',
'name': 'DC.title'},
{ 'URI': 'http://purl.org/dc/elements/1.1/creator',
'content': 'Andy Powell, UKOLN, University of Bath',
'name': 'DC.creator'},
{ 'URI': 'http://purl.org/dc/elements/1.1/identifier',
'content': 'http://dublincore.org/documents/dcq-html/',
'name': 'DC.identifier',
'scheme': 'DCTERMS.URI'},
{ 'URI': 'http://purl.org/dc/elements/1.1/format',
'content': 'text/html',
'name': 'DC.format',
'scheme': 'DCTERMS.IMT'},
{ 'URI': 'http://purl.org/dc/elements/1.1/type',
'content': 'Text',
'name': 'DC.type',
'scheme': 'DCTERMS.DCMIType'}],
'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
'DCTERMS': 'http://purl.org/dc/terms/'},
'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
'content': '2003-11-01',
'name': 'DCTERMS.issued',
'scheme': 'DCTERMS.W3CDTF'},
{ 'URI': 'http://purl.org/dc/terms/abstract',
'content': 'This document describes how\n'
'qualified Dublin Core metadata can be encoded\n'
'in HTML/XHTML <meta> elements',
'name': 'DCTERMS.abstract'},
{ 'URI': 'http://purl.org/dc/terms/modified',
'content': '2001-07-18',
'name': 'DC.Date.modified'},
{ 'URI': 'http://purl.org/dc/terms/modified',
'content': '2001-07-18',
'name': 'DCTERMS.modified'},
{ 'URI': 'http://purl.org/dc/terms/replaces',
'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
'hreflang': 'en',
'rel': 'DCTERMS.replaces'}]}]
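
The output above keeps ``<meta>`` entries (with a ``content`` key) and ``<link>`` entries (with an ``href`` key) side by side, and the same URI can appear more than once (``DC.Date.modified`` and ``DCTERMS.modified`` both resolve to ``http://purl.org/dc/terms/modified``). A minimal sketch of how a consumer might index that structure by URI follows. The helper name ``index_by_uri`` is hypothetical and not part of extruct, and the input is assumed to have exactly the shape printed above:

```python
from collections import defaultdict


def index_by_uri(records):
    """Map each Dublin Core URI to the list of values recorded for it.

    Hypothetical helper, not part of extruct. ``records`` is assumed to be
    a list of dicts with 'elements' and 'terms' lists, as printed above.
    """
    index = defaultdict(list)
    for record in records:
        entries = record.get('elements', []) + record.get('terms', [])
        for entry in entries:
            # <meta> entries carry 'content'; <link> entries carry 'href'.
            value = entry.get('content', entry.get('href'))
            index[entry['URI']].append(value)
    return dict(index)


# A trimmed copy of the extractor output shown above.
data = [{
    'elements': [
        {'URI': 'http://purl.org/dc/elements/1.1/creator',
         'content': 'Andy Powell, UKOLN, University of Bath',
         'name': 'DC.creator'},
    ],
    'namespaces': {'DC': 'http://purl.org/dc/elements/1.1/',
                   'DCTERMS': 'http://purl.org/dc/terms/'},
    'terms': [
        {'URI': 'http://purl.org/dc/terms/modified',
         'content': '2001-07-18',
         'name': 'DC.Date.modified'},
        {'URI': 'http://purl.org/dc/terms/modified',
         'content': '2001-07-18',
         'name': 'DCTERMS.modified'},
        {'URI': 'http://purl.org/dc/terms/replaces',
         'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
         'hreflang': 'en',
         'rel': 'DCTERMS.replaces'},
    ],
}]

by_uri = index_by_uri(data)
```

Values are kept in lists rather than collapsed, so duplicated properties remain visible to the caller; whether duplicates should be merged is left to the consumer.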



Command Line Tool
-----------------
@@ -622,7 +708,7 @@ those, you can pass their individual names collected in a list through 'syntaxes
For example, this command extracts only Microdata and JSON-LD metadata from
"http://example.com"::

extruct "http://example.com" --syntaxes microdata json-ld
extruct "http://example.com" --syntaxes microdata json-ld

NB syntaxes names passed must correspond to these: microdata, json-ld, rdfa, opengraph, microformat

@@ -649,15 +735,3 @@ Use tox_ to run tests with different Python versions::


.. _tox: https://testrun.org/tox/latest/


Versioning
----------

Use bumpversion_ to conveniently change project version::

bumpversion patch # 0.0.0 -> 0.0.1
bumpversion minor # 0.0.1 -> 0.1.0
bumpversion major # 0.1.0 -> 1.0.0

.. _bumpversion: https://pypi.python.org/pypi/bumpversion
2 changes: 1 addition & 1 deletion extruct/VERSION
@@ -1 +1 @@
0.9.0
0.10.0
20 changes: 18 additions & 2 deletions extruct/_extruct.py
@@ -6,11 +6,12 @@
from extruct.w3cmicrodata import MicrodataExtractor
from extruct.opengraph import OpenGraphExtractor
from extruct.microformat import MicroformatExtractor
from extruct.uniform import _umicrodata_microformat, _uopengraph
from extruct.dublincore import DublinCoreExtractor
from extruct.uniform import _umicrodata_microformat, _uopengraph, _udublincore
from extruct.utils import parse_xmldom_html

logger = logging.getLogger(__name__)
SYNTAXES = ['microdata', 'opengraph', 'json-ld', 'microformat', 'rdfa']
SYNTAXES = ['microdata', 'opengraph', 'json-ld', 'microformat', 'rdfa', 'dublincore']


def extract(htmlstring,
@@ -96,6 +97,11 @@ def extract(htmlstring,
('rdfa', RDFaExtractor().extract_items,
tree,
))
if 'dublincore' in syntaxes:
processors.append(
('dublincore', DublinCoreExtractor().extract_items,
tree,
))
output = {}
for syntax, extract, document in processors:
try:
@@ -132,10 +138,20 @@ def extract(htmlstring,
output['opengraph'],
None,
))
if 'dublincore' in syntaxes:
uniform_processors.append(
('dublincore',
_udublincore,
output['dublincore'],
None,
))

for syntax, uniform, raw, schema_context in uniform_processors:
try:
if syntax == 'opengraph':
output[syntax] = uniform(raw, with_og_array=with_og_array)
elif syntax == 'dublincore':
output[syntax] = uniform(raw)
else:
output[syntax] = uniform(raw, schema_context)
except Exception as e:
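
Review note: the loop in this hunk now has one branch per syntax whose uniform function takes a non-default signature (``opengraph`` takes ``with_og_array``, ``dublincore`` takes only the raw items). One possible alternative, sketched below under stated assumptions, is to keep the per-syntax call arguments in a table. The names ``apply_uniform`` and ``call_overrides`` and the stand-in uniform functions are hypothetical. The body of the ``except`` clause is not visible in this diff, so logging the error and keeping the raw output is an assumption of the sketch, not the project's behaviour:

```python
import logging

logger = logging.getLogger(__name__)


def apply_uniform(uniform_processors, output, call_overrides):
    """Run each uniform function, using per-syntax keyword overrides.

    Hypothetical refactoring sketch, not code from extruct.
    ``call_overrides`` maps a syntax name to the keyword arguments its
    uniform function expects instead of the default ``schema_context``.
    """
    for syntax, uniform, raw, schema_context in uniform_processors:
        try:
            if syntax in call_overrides:
                output[syntax] = uniform(raw, **call_overrides[syntax])
            else:
                output[syntax] = uniform(raw, schema_context)
        except Exception:
            # Assumption: log the failure and leave the raw output in place.
            logger.exception('Failed to uniform extracted data for %s', syntax)
    return output


# Stand-in uniform functions that mirror the three call shapes in the hunk.
def uniform_og(raw, with_og_array=False):
    return [{'@type': 'og', 'array': with_og_array, 'n': len(raw)}]


def uniform_dc(raw):
    return [{'@type': 'dc', 'n': len(raw)}]


def uniform_generic(raw, schema_context):
    return [{'@context': schema_context, 'n': len(raw)}]


def uniform_broken(raw, schema_context):
    raise ValueError('boom')


output = {'opengraph': [1, 2], 'dublincore': [1],
          'microdata': [1, 2, 3], 'microformat': []}
processors = [
    ('opengraph', uniform_og, output['opengraph'], None),
    ('dublincore', uniform_dc, output['dublincore'], None),
    ('microdata', uniform_generic, output['microdata'], 'http://schema.org'),
    ('microformat', uniform_broken, output['microformat'], None),
]
overrides = {'opengraph': {'with_og_array': True}, 'dublincore': {}}
result = apply_uniform(processors, output, overrides)
```

With this shape, adding a further syntax with a special signature would mean adding one entry to ``overrides`` rather than another ``elif`` branch; whether that trade-off is worth it for two special cases is a judgment call.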