Strip HTML & XML tags from RSS feed input #1670

Merged
45 commits
ab7eaf4
First pass at a truncate strings method
dagardner-nv Apr 23, 2024
54e7681
Merge branch 'branch-24.06' of github.com:nv-morpheus/Morpheus into d…
dagardner-nv Apr 23, 2024
2f14478
Tests for new truncate_string_cols_by_bytes method and the _cudf_need…
dagardner-nv Apr 23, 2024
83ad8ad
Log a warning when we truncate a column
dagardner-nv Apr 23, 2024
f91d9ac
Refactor such that each column has it's own max length
dagardner-nv Apr 24, 2024
10a68f9
Expand tests
dagardner-nv Apr 24, 2024
c80fc27
Add in truncating of long string fields.
dagardner-nv Apr 24, 2024
4b771e8
Use DataFrameType alias
dagardner-nv Apr 24, 2024
3b6408e
Cleanup
dagardner-nv Apr 24, 2024
9c5435c
Exclude string type from max_length checking
dagardner-nv Apr 24, 2024
dbe34dc
Ensure truncate_long_strings parameter is set in configs and passed a…
dagardner-nv Apr 24, 2024
9acea9e
Add docstring for truncate_long_strings
dagardner-nv Apr 24, 2024
078cd19
Add docstring for warn_on_truncate
dagardner-nv Apr 24, 2024
00f8170
Merge branch 'branch-24.06' into david-truncate-milvus-1650
dagardner-nv Apr 24, 2024
2b72313
Add type-alias for Series type
dagardner-nv Apr 25, 2024
f26ce68
Refactor cudf_string_cols_exceed_max_bytes and truncate_string_cols_b…
dagardner-nv Apr 25, 2024
4a6de1b
Refactor to call cudf_string_cols_exceed_max_bytes prior to convertin…
dagardner-nv Apr 25, 2024
e121b33
Merge branch 'branch-24.06' of github.com:nv-morpheus/Morpheus into d…
dagardner-nv Apr 25, 2024
5f3804f
Test WIP [no ci]
dagardner-nv Apr 25, 2024
701e1f9
Fix bug where truncate was always enabled for cudf dataframes
dagardner-nv Apr 25, 2024
38d3125
Finish up tests
dagardner-nv Apr 25, 2024
a5f05e1
Remove parametarization based on num_rows, not important for this tes…
dagardner-nv Apr 25, 2024
35dc567
Remove old param
dagardner-nv Apr 25, 2024
a6cb131
Lint fixes
dagardner-nv Apr 25, 2024
148d398
Remove stray print method
dagardner-nv Apr 25, 2024
d352930
Don't hard-code the name of the probabilities tensor, don't assume it…
dagardner-nv Apr 25, 2024
bc1e3c2
Re-work hard-coded probs->embeddings copy that used to exist in infer…
dagardner-nv Apr 25, 2024
078780f
Lint fix
dagardner-nv Apr 25, 2024
f3d0334
Re-enable C++ mode support
dagardner-nv Apr 25, 2024
abeb811
Remove the two issues this PR should resolve
dagardner-nv Apr 25, 2024
3b43a8a
Optionally strip HTML & XML tags from feed content
dagardner-nv Apr 26, 2024
d0d6744
Add strip_markup field
dagardner-nv Apr 26, 2024
ca7a4ff
Fix docstring, fix fstring for exception, add strip_markup arg
dagardner-nv Apr 26, 2024
eb48fa6
Add strip_markup arg
dagardner-nv Apr 26, 2024
2f1e4fe
Add strip_markup arg
dagardner-nv Apr 26, 2024
7286c02
Add strip_markup argument
dagardner-nv Apr 26, 2024
a7ca7ea
Add strip_markup arg to RSSSourceStage constructor
dagardner-nv Apr 26, 2024
8277d22
Parameratize some tests
dagardner-nv Apr 26, 2024
853c611
Add simple test for tag stripping
dagardner-nv Apr 26, 2024
458eb10
We currently receive beautifulsoup4 as a transitive dep, however we d…
dagardner-nv Apr 29, 2024
a6ab881
Add test for html stripping functionality
dagardner-nv Apr 29, 2024
46bbbae
Rename summary_col var
dagardner-nv Apr 29, 2024
5f41fa3
Update deps
dagardner-nv Apr 29, 2024
c720639
Merge branch 'branch-24.06' into david-vdb_upload-strip-tags-1666
mdemoret-nv May 1, 2024
0fc298c
Fix spelling mistake, 'markup' not 'makrup'
dagardner-nv May 1, 2024
1 change: 1 addition & 0 deletions conda/environments/all_cuda-121_arch-x86_64.yaml
@@ -13,6 +13,7 @@ dependencies:
- appdirs
- arxiv=1.4
- automake
- beautifulsoup4
- benchmark=1.8.3
- boost-cpp=1.84
- boto3
1 change: 1 addition & 0 deletions conda/environments/dev_cuda-121_arch-x86_64.yaml
@@ -11,6 +11,7 @@ channels:
dependencies:
- appdirs
- automake
- beautifulsoup4
- benchmark=1.8.3
- boost-cpp=1.84
- breathe=4.35.0
1 change: 1 addition & 0 deletions conda/environments/examples_cuda-121_arch-x86_64.yaml
@@ -12,6 +12,7 @@ dependencies:
- anyio>=3.7
- appdirs
- arxiv=1.4
- beautifulsoup4
- boto3
- click >=8
- cuml=24.02.*
1 change: 1 addition & 0 deletions conda/environments/runtime_cuda-121_arch-x86_64.yaml
@@ -10,6 +10,7 @@ channels:
- pytorch
dependencies:
- appdirs
- beautifulsoup4
- click >=8
- datacompy=0.10
- dill=0.3.7
1 change: 1 addition & 0 deletions dependencies.yaml
@@ -249,6 +249,7 @@ dependencies:
- &dill dill=0.3.7
- &scikit-learn scikit-learn=1.3.2
- appdirs
- beautifulsoup4
- datacompy=0.10
- elasticsearch==8.9.0
- feedparser=6.0.10
3 changes: 3 additions & 0 deletions examples/llm/vdb_upload/module/rss_source_pipe.py
@@ -49,6 +49,7 @@ class RSSSourcePipeSchema(BaseModel):
request_timeout_sec: float = 2.0
run_indefinitely: bool = True
stop_after_rec: int = 0
strip_markup: bool = True
vdb_resource_name: str
web_scraper_config: Optional[Dict[Any, Any]] = None

@@ -98,6 +99,7 @@ def _rss_source_pipe(builder: mrc.Builder):
- **request_timeout_sec**: Timeout in seconds for RSS feed requests.
- **run_indefinitely**: Boolean to indicate continuous running.
- **stop_after**: Number of records to process before stopping (0 for indefinite).
- **strip_markup**: When True, strip HTML & XML markup from feed content.
- **web_scraper_config**: Configuration for the web scraper module.
- **chunk_overlap**: Overlap size for chunks in web scraping.
- **chunk_size**: Size of content chunks for processing.
@@ -131,6 +133,7 @@ def _rss_source_pipe(builder: mrc.Builder):
"request_timeout_sec": validated_config.request_timeout_sec,
"interval_sec": validated_config.interval_sec,
"stop_after_rec": validated_config.stop_after_rec,
"strip_markup": validated_config.strip_markup,
}
rss_source_loader = RSSSourceLoaderFactory.get_instance("rss_source", {"rss_source": rss_source_config})

1 change: 1 addition & 0 deletions examples/llm/vdb_upload/vdb_config.yaml
@@ -76,6 +76,7 @@ vdb_pipeline:
request_timeout_sec: 2.0
run_indefinitely: true
stop_after_rec: 0
strip_markup: true
web_scraper_config:
chunk_overlap: 51
chunk_size: 512
1 change: 1 addition & 0 deletions examples/llm/vdb_upload/vdb_utils.py
@@ -142,6 +142,7 @@ def _build_default_rss_source(enable_cache,
"interval_sec": interval_secs,
"request_timeout_sec": rss_request_timeout_sec,
"run_indefinitely": run_indefinitely,
"strip_markup": True,
"vdb_resource_name": vector_db_resource_name,
"web_scraper_config": {
"chunk_size": content_chunking_size,
54 changes: 53 additions & 1 deletion morpheus/controllers/rss_controller.py
@@ -70,16 +70,26 @@ class RSSController:
Cooldown interval in seconds if there is a failure in fetching or parsing the feed.
request_timeout : float, optional, default = 2.0
Request timeout in secs to fetch the feed.
strip_markup : bool, optional, default = False
When true, strip HTML & XML markup from the content, summary and title fields.
"""

# Fields which may contain HTML or XML content
MARKUP_FIELDS = (
"content",
"summary",
"title",
)

def __init__(self,
feed_input: str | list[str],
batch_size: int = 128,
run_indefinitely: bool = None,
enable_cache: bool = False,
cache_dir: str = "./.cache/http",
cooldown_interval: int = 600,
- request_timeout: float = 2.0):
+ request_timeout: float = 2.0,
+ strip_markup: bool = False):
if IMPORT_EXCEPTION is not None:
raise ImportError(IMPORT_ERROR_MESSAGE) from IMPORT_EXCEPTION

@@ -92,6 +102,7 @@ def __init__(self,
self._previous_entries = set() # Stores the IDs of previous entries to prevent the processing of duplicates.
self._cooldown_interval = cooldown_interval
self._request_timeout = request_timeout
self._strip_markup = strip_markup

# Validate feed_input
for f in self._feed_input:
@@ -236,6 +247,44 @@ def _try_parse_feed(self, url: str) -> "feedparser.FeedParserDict":

return feed

@staticmethod
def _strip_markup_from_field(field: str, mime_type: str) -> str:
if mime_type.endswith("xml"):
parser = "xml"
else:
parser = "html.parser"

try:
soup = BeautifulSoup(field, features=parser)
return soup.get_text()
except Exception as ex:
logger.error("Failed to strip tags from field: %s: %s", field, ex)
return field

def _strip_markup_from_fields(self, entry: "feedparser.FeedParserDict"):
"""
Strip HTML & XML tags from the content, summary and title fields.

Per a note in the feedparser documentation, even if a field is advertised as plain text it may still contain HTML:
https://feedparser.readthedocs.io/en/latest/html-sanitization.html
"""
for field in self.MARKUP_FIELDS:
field_value = entry.get(field)
if field_value is not None:
if isinstance(field_value, list):
for field_item in field_value:
mime_type = field_item.get("type", "text/plain")
field_item["value"] = self._strip_markup_from_field(field_item["value"], mime_type)
field_item["type"] = "text/plain"
else:
detail_field_name = f"{field}_detail"
detail_field: dict = entry.get(detail_field_name, {})
mime_type = detail_field.get("type", "text/plain")

entry[field] = self._strip_markup_from_field(field_value, mime_type)
detail_field["type"] = "text/plain"
entry[detail_field_name] = detail_field

def parse_feeds(self):
"""
Parse the RSS feed using the feedparser library.
@@ -291,6 +340,9 @@ def fetch_dataframes(self):
entry_id = entry.get('id')
current_entries.add(entry_id)
if entry_id not in self._previous_entries:
if self._strip_markup:
self._strip_markup_from_fields(entry)

entry_accumulator.append(entry)

if self._batch_size > 0 and len(entry_accumulator) >= self._batch_size:
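For reference, a minimal standalone sketch of the parser selection and tag stripping that _strip_markup_from_field performs above (the helper name and sample markup below are illustrative, not part of this PR):

import logging

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)

def strip_markup(field: str, mime_type: str = "text/plain") -> str:
    # Pick the XML parser for XML mime types (features="xml" requires lxml to
    # be installed); otherwise fall back to Python's built-in HTML parser.
    # Per the feedparser docs, even "text/plain" fields may still contain HTML.
    parser = "xml" if mime_type.endswith("xml") else "html.parser"
    try:
        return BeautifulSoup(field, features=parser).get_text()
    except Exception as ex:
        logger.error("Failed to strip tags from field: %s: %s", field, ex)
        return field

print(strip_markup("<p>NVIDIA <b>Morpheus</b> 24.06</p>"))  # NVIDIA Morpheus 24.06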
33 changes: 15 additions & 18 deletions morpheus/modules/input/rss_source.py
@@ -32,30 +32,26 @@
@register_module("rss_source", "morpheus")
def _rss_source(builder: mrc.Builder):
"""
- A module for applying simple DataFrame schema transform policies.
-
- This module reads the configuration to determine how to set data types for columns, select, or rename them in the
- dataframe.
+ A module for loading RSS feed items into a DataFrame.

Parameters
----------
builder : mrc.Builder
The Morpheus pipeline builder object.

- Notes
- -------------
- The configuration should be passed to the module through the `module_config` attribute of the builder. It should
- contain a dictionary where each key is a column name, and the value is another dictionary with keys 'dtype' for
- data type, 'op_type' for operation type ('select' or 'rename'), and optionally 'from' for the original column
- name (if the column is to be renamed).
-
Example Configuration
---------------------
{
- "summary": {"dtype": "str", "op_type": "select"},
- "title": {"dtype": "str", "op_type": "select"},
- "content": {"from": "page_content", "dtype": "str", "op_type": "rename"},
- "source": {"from": "link", "dtype": "str", "op_type": "rename"}
+ "batch_size": 32,
+ "cache_dir": "./.cache/http",
+ "cooldown_interval_sec": 600,
+ "enable_cache": True,
+ "feed_input": ["https://nvidianews.nvidia.com/releases.xml"],
+ "interval_sec": 600,
+ "request_timeout_sec": 2.0,
+ "run_indefinitely": True,
+ "stop_after_rec": 0,
+ "strip_markup": True,
}
"""

@@ -77,7 +73,8 @@ def _rss_source(builder: mrc.Builder):
enable_cache=validated_config.enable_cache,
cache_dir=validated_config.cache_dir,
cooldown_interval=validated_config.cooldown_interval_sec,
- request_timeout=validated_config.request_timeout_sec)
+ request_timeout=validated_config.request_timeout_sec,
+ strip_markup=validated_config.strip_markup)

stop_requested = False

@@ -108,9 +105,9 @@ def fetch_feeds() -> MessageMeta:

except Exception as exc:
if not controller.run_indefinitely:
logger.error("Failed either in the process of fetching or processing entries: %d.", exc)
logger.error("Failed either in the process of fetching or processing entries: %s.", exc)
raise
logger.error("Failed either in the process of fetching or processing entries: %d.", exc)
logger.error("Failed either in the process of fetching or processing entries: %s.", exc)

if not controller.run_indefinitely:
stop_requested = True
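The %d to %s change above is worth calling out: the logging module formats its message lazily, and an exception object is not a number, so the old format spec produced a logging error instead of the intended message. A small sketch (message text taken from the diff, the raised error is illustrative):

import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger(__name__)

try:
    raise ValueError("malformed feed")
except Exception as exc:
    # %s formats the exception via str(); %d would raise a TypeError inside
    # the logging machinery and emit "--- Logging error ---" instead.
    logger.error("Failed either in the process of fetching or processing entries: %s.", exc)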
1 change: 1 addition & 0 deletions morpheus/modules/schemas/rss_source_schema.py
@@ -31,6 +31,7 @@ class RSSSourceSchema(BaseModel):
request_timeout_sec: float = 2.0
interval_sec: int = 600
stop_after_rec: int = 0
strip_markup: bool = True

class Config:
extra = "forbid"
8 changes: 6 additions & 2 deletions morpheus/stages/input/rss_source_stage.py
@@ -52,6 +52,8 @@ class RSSSourceStage(PreallocatorMixin, SingleOutputSource):
Cooldown interval in seconds if there is a failure in fetching or parsing the feed.
request_timeout : float, optional, default = 2.0
Request timeout in secs to fetch the feed.
strip_markup : bool, optional, default = False
When true, strip HTML & XML markup from the content, summary and title fields.
"""

def __init__(self,
@@ -64,7 +66,8 @@ def __init__(self,
enable_cache: bool = False,
cache_dir: str = "./.cache/http",
cooldown_interval: int = 600,
- request_timeout: float = 2.0):
+ request_timeout: float = 2.0,
+ strip_markup: bool = False):
super().__init__(c)
self._stop_requested = False

@@ -87,7 +90,8 @@ def __init__(self,
"enable_cache": enable_cache,
"cache_dir": cache_dir,
"cooldown_interval_sec": cooldown_interval,
"request_timeout_sec": request_timeout
"request_timeout_sec": request_timeout,
"strip_markup": strip_markup
}
}

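Putting it together, a minimal sketch of enabling markup stripping from a pipeline (feed URL illustrative; assumes the standard Morpheus LinearPipeline API, with stage parameters as shown in the constructor above):

from morpheus.config import Config
from morpheus.pipeline import LinearPipeline
from morpheus.stages.input.rss_source_stage import RSSSourceStage

config = Config()
pipeline = LinearPipeline(config)

# strip_markup defaults to False on the stage; opt in explicitly.
pipeline.set_source(
    RSSSourceStage(config,
                   feed_input=["https://nvidianews.nvidia.com/releases.xml"],
                   strip_markup=True))
pipeline.run()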