Extending styles parsing and RegEx search #52

prabhkaran · 2018-05-14T09:16:10Z

Extended two features:

Added regex search and toggle for full span match in RegexMatch class.
Added styles parsing from the style class and appended to existing html styles attr.

lukehsiao · 2018-05-14T22:36:03Z

Working on getting travis tests fixed: HazyResearch/numbskull#54.

lukehsiao · 2018-05-14T23:57:09Z

fonduer/matchers.py

@@ -228,11 +228,18 @@ def init(self):
        self.attrib = self.opts.get('attrib', WORDS)
        self.sep = self.opts.get('sep', " ")

+        # Extending the RegexMatch to handle search(instead of only match) 
+        # and adding a toggle for full span match.
+        # Default values are set to False and True for search flag and full 


Please get rid of these trailing whitespaces.

Make sure the code still passes our code style check: make check

lukehsiao · 2018-05-16T05:45:41Z

fonduer/matchers.py

+        # Default values are set to False and True for search flag and full
+        # span matching flag respectively.
+        self.search = self.opts.get('search', False)
+        self.full_match = self.opts.get('full_match', True)


Is it necessary to have both of these flags? It seems like these should never both be true. Only one or the other would be true at one time, if I understand correctly.

I would prefer just having self.search.

self.full_match toggles whether $ is appended to the regex.

E.g.:

    phrase = 'Invoice#:2387621387'
    r1 = re.compile(r'Invoice')
    r2 = re.compile(r'(Invoice)$')
    r1.search(phrase)  # returns <_sre.SRE_Match object; span=(0, 7), match='Invoice'>
    r2.search(phrase)  # returns None

This happens because $ only matches at the end of the string, whereas the expression may occur anywhere in the span, not just at its end.

I see. Then this looks good to me, thanks!
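The interaction of the two flags discussed in this thread can be sketched in isolation. This is a minimal stand-alone illustration, not the Fonduer implementation: `compile_pattern` and `span_matches` are hypothetical helpers, and only the `search` and `full_match` option names and their defaults are taken from the diff.

```python
import re


def compile_pattern(rgx, full_match=True):
    """Hypothetical helper: optionally anchor rgx at the end of the span."""
    if full_match and not rgx.endswith('$'):
        # Group first so that $ applies to every branch of an alternation.
        rgx = '(' + rgx + ')$'
    return re.compile(rgx)


def span_matches(text, rgx, search=False, full_match=True):
    """Hypothetical helper: True if rgx matches text under the given flags."""
    pattern = compile_pattern(rgx, full_match)
    # search scans the whole string; match only tries position 0.
    matcher = pattern.search if search else pattern.match
    return matcher(text) is not None


phrase = 'Invoice#:2387621387'

# Defaults (match + full span): 'Invoice' is not the whole span.
print(span_matches(phrase, r'Invoice'))
# Dropping the $ anchor lets a prefix of the span match.
print(span_matches(phrase, r'Invoice', full_match=False))
# search + full span: the pattern must end where the span ends.
print(span_matches(phrase, r'\d+', search=True))
# search without full span: the pattern may sit anywhere in the span.
print(span_matches(phrase, r'#', search=True, full_match=False))
```

The four combinations behave differently, which is why the two flags are not redundant.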

fonduer/matchers.py

        # Compile regex matcher
        # NOTE: Enforce full span matching by ensuring that regex ends with $.
        # Group self.rgx first so that $ applies to all components of an 'OR'
        # expression. (e.g., we want r'(a|b)$' instead of r'a|b$')
-        self.rgx = self.rgx if self.rgx.endswith('$') else (
+        self.rgx = self.rgx if self.rgx.endswith('$') or not self.full_match else (
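The grouping mentioned in the NOTE comment matters because alternation binds more loosely than `$`. A small sketch using only the standard `re` module (the `is_match` helper is hypothetical and exists only for this illustration):

```python
import re

# Without grouping, the pattern parses as (a) | (b$):
# the $ anchor only constrains the second branch.
ungrouped = re.compile(r'a|b$')

# With grouping, $ constrains both branches.
grouped = re.compile(r'(a|b)$')


def is_match(pattern, text):
    """Hypothetical helper: True if pattern matches at the start of text."""
    return pattern.match(text) is not None


print(is_match(ungrouped, 'ab'))  # the 'a' branch matches without reaching the end
print(is_match(grouped, 'ab'))    # neither branch spans the full string
print(is_match(grouped, 'b'))     # a single-branch full-span match
```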


lukehsiao · 2018-05-16T05:52:48Z

fonduer/matchers.py

-        return True if self.r.match(
-            c.get_attrib_span(self.attrib,
-                              sep=self.sep)) is not None else False
+        if not self.search:


Minor nit: it reads a little more smoothly with the affirmative case first:

    if self.search:
        return True if self.r.search(
            c.get_attrib_span(self.attrib, sep=self.sep)) is not None else False
    else:
        return True if self.r.match(
            c.get_attrib_span(self.attrib, sep=self.sep)) is not None else False

lukehsiao · 2018-05-16T06:27:20Z

fonduer/parser/parser.py

+                                                    else:
+                                                        parts['html_attrs'] = 'style=' + r.search(styles.text).group(3)\
+                                                            .replace('\r', '').replace('\n', '').replace('\t', '')
+                                                break


It's not super obvious to me what the use case for this code is. It looks to me like if an element has style attributes, then you're extending the element's styles with the first style class in the head of the document that you find. Is that correct? Could you explain why you needed this code?

Also, could you add a test case for this code? This would be added to test_parser.py. For example, although <style> elements are supposed to be defined in head, we would want to make sure things don't break in the case of messy HTML, where the tag may not conform to the standards, such as if the style is defined in the body.

The easiest way to do this would be to add a new, simplified HTML document into tests/data/html/ (you can create this new html directory). Then just add a case similar to:

    def test_parse_style(caplog):
        """Test style tag parsing."""
        caplog.set_level(logging.INFO)
        logger = logging.getLogger(__name__)
        session = Meta.init('postgres://localhost:5432/' + ATTRIBUTE).Session()

        max_docs = 1
        docs_path = 'tests/data/html/[your new document].html'

        # Preprocessor for the Docs
        preprocessor = HTMLPreprocessor(docs_path, max_docs=max_docs)

        # Grab the document, text tuple from the preprocessor
        doc, text = next(preprocessor.generate())
        logger.info(" Text: {}".format(text))

        # Create an OmniParserUDF
        omni_udf = OmniParserUDF(
            True,  # structural
            [],  # blacklist, empty so that style is not blacklisted
            ["span", "br"],  # flatten
            '',  # flatten delim
            True,  # lingual
            True,  # strip
            [],  # replace
            True,  # tabular
            True,  # visual
            pdf_path,  # pdf path
            Spacy())  # lingual parser

        # Grab the phrases parsed by the OmniParser
        phrases = list(omni_udf.parse_structure(doc, text))

        # Add your assertions
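To make the messy-HTML concern concrete, here is a rough sketch of collecting `<style>` text wherever the tag appears, in head or body. It uses only the standard-library `html.parser` module rather than Fonduer's parser, and `StyleCollector` and `collect_styles` are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Hypothetical helper: gather the text of every <style> element."""

    def __init__(self):
        super().__init__()
        self._in_style = False
        self.styles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'style':
            self._in_style = True
            self.styles.append('')

    def handle_endtag(self, tag):
        if tag == 'style':
            self._in_style = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so accumulate it.
        if self._in_style:
            self.styles[-1] += data


def collect_styles(html):
    """Return the stripped text of each <style> element, in document order."""
    collector = StyleCollector()
    collector.feed(html)
    collector.close()
    return [style.strip() for style in collector.styles]


# A non-conforming document: the style element sits in the body.
messy = ("<html><head></head><body>"
         "<style>.s2 { color: red; }</style>"
         "<p class='s2'>x</p></body></html>")
print(collect_styles(messy))
```

A test document along these lines would exercise the case where the style tag is not in the head, as well as the case where no style tag exists at all.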

If an HTML element such as <elm class='.s2'> has a class attribute (here .s2), this code finds that class, s2, in <style> and appends the styles defined for it to the style attribute in parts['html_attrs'], or assigns them if the element has no style attribute yet.

Will add test case and check in

I see, makes sense. Thanks!
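The behaviour described in this thread can be approximated with a short stand-alone sketch. It is not the code from the PR: `class_declarations` and `merge_style` are hypothetical helpers, the regular expression is a simplification that only handles plain `.name { ... }` rules, and whitespace is collapsed rather than removed character by character as in the diff.

```python
import re


def class_declarations(style_text, class_name):
    """Return the declarations of .class_name from CSS text, or None."""
    pattern = re.compile(r'\.' + re.escape(class_name) + r'\s*\{([^}]*)\}')
    found = pattern.search(style_text)
    if found is None:
        return None
    # Collapse newlines and tabs so the result fits in one attribute value.
    return ' '.join(found.group(1).split())


def merge_style(existing_style, style_text, class_name):
    """Append the class's declarations to an existing style, or assign them."""
    declarations = class_declarations(style_text, class_name)
    if declarations is None:
        return existing_style
    if existing_style:
        return existing_style.rstrip('; ') + '; ' + declarations
    return declarations


css = ".s1 { color: red; }\n.s2 {\n\tfont-weight: bold;\n\tfont-size: 12pt;\n}"

print(merge_style('color: blue', css, 's2'))  # element already has a style
print(merge_style(None, css, 's2'))           # element has no style yet
```

Returning the existing style unchanged when the class is not found mirrors the later fix in this PR for documents without a style tag.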

lukehsiao

Looks like a great PR. Just waiting for the tests and it should be ready to merge.

lukehsiao · 2018-05-17T19:14:06Z

fonduer/parser/parser.py

+                                                    else:
+                                                        parts['html_attrs'] = 'style=' + r.search(styles.text).group(3)\
+                                                            .replace('\r', '').replace('\n', '').replace('\t', '')
+                                                break


I see, makes sense. Thanks!

lukehsiao · 2018-05-17T19:15:01Z

fonduer/matchers.py

+        # Default values are set to False and True for search flag and full
+        # span matching flag respectively.
+        self.search = self.opts.get('search', False)
+        self.full_match = self.opts.get('full_match', True)


I see. Then this looks good to me, thanks!

Just waiting for the tests to be added.

prabhkaran · 2018-05-23T16:31:14Z

@lukehsiao added test case. Please review

lukehsiao

Looks great. Thanks!

PK added 2 commits May 14, 2018 14:25

Adding regex search function to the RegexMatch and adding toggle flag…

54ac34d

… for full span match.

Extending style parsing for an element from inline class in head>styl…

b74f343

…e tag.

senwu requested a review from lukehsiao May 14, 2018 20:53

lukehsiao added the enhancement New feature or request label May 14, 2018

lukehsiao added this to the v0.1.8 milestone May 14, 2018

lukehsiao reviewed May 14, 2018

Prabh added 2 commits May 15, 2018 16:36

Removed trailing whitespaces.

7638aa0

Fixed nonetype issue if style tag is not there in head

37187c2

lukehsiao suggested changes May 16, 2018

Prabh added 2 commits May 17, 2018 11:06

changed to affirmative condition first

41fda5c

style fix

a6f2a8e

lukehsiao reviewed May 17, 2018

lukehsiao previously approved these changes May 17, 2018

lukehsiao and others added 5 commits May 18, 2018 11:23

Merge branch 'master' into master

4531f3a

extending current styles if exists

2107c62

add a test case

7e2977b

Merge branch 'master' of https://github.com/Prabh06/fonduer

446440f

reverted to original path for test_spacy_integration

d435623

lukehsiao approved these changes May 23, 2018

lukehsiao merged commit 3b20d74 into HazyResearch:master May 23, 2018
