-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extending styles parsing and RegEx search #52
Conversation
Working on getting travis tests fixed: HazyResearch/numbskull#54. |
fonduer/matchers.py
Outdated
@@ -228,11 +228,18 @@ def init(self): | |||
self.attrib = self.opts.get('attrib', WORDS) | |||
self.sep = self.opts.get('sep', " ") | |||
|
|||
# Extending the RegexMatch to handle search(instead of only match) | |||
# and adding a toggle for full span match. | |||
# Default values are set to False and True for search flag and full |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please get rid of these trailing whitespaces.
Make sure the code still passes our code style check: make check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
# Default values are set to False and True for search flag and full | ||
# span matching flag respectively. | ||
self.search = self.opts.get('search', False) | ||
self.full_match = self.opts.get('full_match', True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to have both of these flags? It seems like these should never both be true. Only one or the other would be true at one time, if I understand correctly.
I would prefer just having self.search
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.full_match
is to toggle appending $
to regex
Eg:
phrase = 'Invoice#:2387621387'
r1 = re.compile(r'Invoice')
r2 = re.compile(r'(Invoice)$')
r1.search(phrase) # returns <_sre.SRE_Match object; span=(0, 7), match='Invoice'>
r2.search(phrase) # returns None
This is happening because $ matches the end of the string but the expression can be part of the span not at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Then this looks good to me, thanks!
# Compile regex matcher | ||
# NOTE: Enforce full span matching by ensuring that regex ends with $. | ||
# Group self.rgx first so that $ applies to all components of an 'OR' | ||
# expression. (e.g., we want r'(a|b)$' instead of r'a|b$') | ||
self.rgx = self.rgx if self.rgx.endswith('$') else ( | ||
self.rgx = self.rgx if self.rgx.endswith('$') or not self.full_match else ( |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
fonduer/matchers.py
Outdated
return True if self.r.match( | ||
c.get_attrib_span(self.attrib, | ||
sep=self.sep)) is not None else False | ||
if not self.search: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit, it feels a little smoother to read to use the affirmative case first
if self.search:
return True if self.r.search(
c.get_attrib_span(self.attrib, sep=self.sep)) is not None else False
else:
return True if self.r.match(
c.get_attrib_span(self.attrib, sep=self.sep)) is not None else False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done.
fonduer/parser/parser.py
Outdated
else: | ||
parts['html_attrs'] = 'style=' + r.search(styles.text).group(3)\ | ||
.replace('\r', '').replace('\n', '').replace('\t', '') | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not super obvious to me what the use case for this code is. It looks to me like if an element has style attributes, then your extending the element's styles with the first style class in the head of the document that you find. Is that correct? Could you explain why you needed this code?
Also, could you add a test case for this code? This would be added to test_parser.py. For example, although <style>
elements are supposed to be defined in head, we would want to make sure things don't break in the case of messy HTML, where the tag may not conform to the standards, such as if the style is defined in the body.
The easiest way to do this would be to add a new, simplified HTML document into tests/data/html/
(you can create this new html directory). Then just add a case similar to:
def test_parse_style(caplog):
"""Test style tag parsing."""
caplog.set_level(logging.INFO)
logger = logging.getLogger(__name__)
session = Meta.init('postgres://localhost:5432/' + ATTRIBUTE).Session()
max_docs = 1
docs_path = 'tests/data/html/[your new document].html'
# Preprocessor for the Docs
preprocessor = HTMLPreprocessor(docs_path, max_docs=max_docs)
# Grab the document, text tuple from the preprocessor
doc, text = next(preprocessor.generate())
logger.info(" Text: {}".format(text))
# Create an OmniParserUDF
omni_udf = OmniParserUDF(
True, # structural
[], # blacklist, empty so that style is not blacklisted
["span", "br"], # flatten
'', # flatten delim
True, # lingual
True, # strip
[], # replace
True, # tabular
True, # visual
pdf_path, # pdf path
Spacy()) # lingual parser
# Grab the phrases parsed by the OmniParser
phrases = list(omni_udf.parse_structure(doc, text))
# Add your assertions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an html element <elm class='.s2'>
has a class attribute here .s2
then this code will find that class s2
in <style>
and append/assign the styles
defined in the s2
class to style attribute in parts['html_attrs']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add test case and check in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a great PR. Just waiting for the tests and it should be ready to merge.
fonduer/parser/parser.py
Outdated
else: | ||
parts['html_attrs'] = 'style=' + r.search(styles.text).group(3)\ | ||
.replace('\r', '').replace('\n', '').replace('\t', '') | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense. Thanks!
# Default values are set to False and True for search flag and full | ||
# span matching flag respectively. | ||
self.search = self.opts.get('search', False) | ||
self.full_match = self.opts.get('full_match', True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Then this looks good to me, thanks!
@lukehsiao added test case. Please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks!
…h#52) matplotlib can be installed in three ways: pip, apt, build from source. (See https://matplotlib.org/users/installing.html) The current Dockerfile does both pip and apt, which is redundant. This patch stops the apt one.
Extended two features:
Added regex search and toggle for full span match in RegexMatch class.
Added styles parsing from the style class and appended to existing html styles attr.