Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extending styles parsing and RegEx search #52

Merged
merged 11 commits into from
May 23, 2018

Conversation

prabhkaran
Copy link

Extended two features:

  • Added regex search and toggle for full span match in RegexMatch class.

  • Added styles parsing from the style class and appended to existing html styles attr.

@senwu senwu requested a review from lukehsiao May 14, 2018 20:53
@lukehsiao lukehsiao added the enhancement New feature or request label May 14, 2018
@lukehsiao lukehsiao added this to the v0.1.8 milestone May 14, 2018
@lukehsiao
Copy link
Contributor

Working on getting travis tests fixed: HazyResearch/numbskull#54.

@@ -228,11 +228,18 @@ def init(self):
self.attrib = self.opts.get('attrib', WORDS)
self.sep = self.opts.get('sep', " ")

# Extending the RegexMatch to handle search(instead of only match)
# and adding a toggle for full span match.
# Default values are set to False and True for search flag and full
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please get rid of these trailing whitespaces.

Make sure the code still passes our code style check: make check

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# Default values are set to False and True for search flag and full
# span matching flag respectively.
self.search = self.opts.get('search', False)
self.full_match = self.opts.get('full_match', True)
Copy link
Contributor

@lukehsiao lukehsiao May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to have both of these flags? It seems like these should never both be true. Only one or the other would be true at one time, if I understand correctly.

I would prefer just having self.search.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.full_match is to toggle appending $ to regex

Eg:

phrase = 'Invoice#:2387621387'
r1 = re.compile(r'Invoice')
r2 = re.compile(r'(Invoice)$')

r1.search(phrase) # returns <_sre.SRE_Match object; span=(0, 7), match='Invoice'>
r2.search(phrase) # returns None

This is happening because $ matches the end of the string but the expression can be part of the span not at the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then this looks good to me, thanks!

# Compile regex matcher
# NOTE: Enforce full span matching by ensuring that regex ends with $.
# Group self.rgx first so that $ applies to all components of an 'OR'
# expression. (e.g., we want r'(a|b)$' instead of r'a|b$')
self.rgx = self.rgx if self.rgx.endswith('$') else (
self.rgx = self.rgx if self.rgx.endswith('$') or not self.full_match else (

This comment was marked as resolved.

return True if self.r.match(
c.get_attrib_span(self.attrib,
sep=self.sep)) is not None else False
if not self.search:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, it feels a little smoother to read to use the affirmative case first

if self.search:
    return True if self.r.search(
        c.get_attrib_span(self.attrib, sep=self.sep)) is not None else False
else:
    return True if self.r.match(
        c.get_attrib_span(self.attrib, sep=self.sep)) is not None else False

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done.

else:
parts['html_attrs'] = 'style=' + r.search(styles.text).group(3)\
.replace('\r', '').replace('\n', '').replace('\t', '')
break
Copy link
Contributor

@lukehsiao lukehsiao May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super obvious to me what the use case for this code is. It looks to me like if an element has style attributes, then your extending the element's styles with the first style class in the head of the document that you find. Is that correct? Could you explain why you needed this code?

Also, could you add a test case for this code? This would be added to test_parser.py. For example, although <style> elements are supposed to be defined in head, we would want to make sure things don't break in the case of messy HTML, where the tag may not conform to the standards, such as if the style is defined in the body.

The easiest way to do this would be to add a new, simplified HTML document into tests/data/html/ (you can create this new html directory). Then just add a case similar to:

def test_parse_style(caplog):
    """Test style tag parsing."""
    caplog.set_level(logging.INFO)
    logger = logging.getLogger(__name__)
    session = Meta.init('postgres://localhost:5432/' + ATTRIBUTE).Session()

    max_docs = 1
    docs_path = 'tests/data/html/[your new document].html'

    # Preprocessor for the Docs
    preprocessor = HTMLPreprocessor(docs_path, max_docs=max_docs)

    # Grab the document, text tuple from the preprocessor
    doc, text = next(preprocessor.generate())
    logger.info("    Text: {}".format(text))

    # Create an OmniParserUDF
    omni_udf = OmniParserUDF(
        True,           # structural
        [],             # blacklist, empty so that style is not blacklisted
        ["span", "br"], # flatten
        '',             # flatten delim
        True,           # lingual
        True,           # strip
        [],             # replace
        True,           # tabular
        True,           # visual
        pdf_path,       # pdf path
        Spacy())        # lingual parser

    # Grab the phrases parsed by the OmniParser
    phrases = list(omni_udf.parse_structure(doc, text))

    # Add your assertions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an html element <elm class='.s2'> has a class attribute here .s2 then this code will find that class s2 in <style> and append/assign the styles defined in the s2 class to style attribute in parts['html_attrs']

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add test case and check in

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense. Thanks!

Copy link
Contributor

@lukehsiao lukehsiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great PR. Just waiting for the tests and it should be ready to merge.

else:
parts['html_attrs'] = 'style=' + r.search(styles.text).group(3)\
.replace('\r', '').replace('\n', '').replace('\t', '')
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense. Thanks!

# Default values are set to False and True for search flag and full
# span matching flag respectively.
self.search = self.opts.get('search', False)
self.full_match = self.opts.get('full_match', True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then this looks good to me, thanks!

lukehsiao
lukehsiao previously approved these changes May 17, 2018
@lukehsiao lukehsiao dismissed their stale review May 17, 2018 19:18

Just waiting for the tests to be added.

@prabhkaran
Copy link
Author

@lukehsiao added test case. Please review

Copy link
Contributor

@lukehsiao lukehsiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks!

@lukehsiao lukehsiao merged commit 3b20d74 into HazyResearch:master May 23, 2018
stackoverflowed pushed a commit to stackoverflowed/multimodal that referenced this pull request Dec 4, 2021
…h#52)

matplotlib can be installed in three ways: pip, apt, build from source.
(See https://matplotlib.org/users/installing.html)
The current Dockerfile does both pip and apt, which is redundant.
This patch stops the apt one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants