
Added framework to test HTML reports #2283

Merged
merged 21 commits into from
Feb 26, 2025
Conversation

shubham-yb
Contributor

@shubham-yb shubham-yb commented Feb 3, 2025

Describe the changes in this pull request

Added the Python script to compare 2 HTML reports.
Added the expected files for PG and Oracle Assessment tests.

Note:
The HTML diffs might not always be easy to read. The goal is simply to surface the fact that something in the HTML report has changed, so that the developer is aware of it.
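The comparison idea described above can be sketched with the standard library alone. This is a hedged, minimal illustration, not the PR's actual script (which parses the reports with BeautifulSoup); `normalize` and `diff_reports` here are hypothetical stand-ins for the script's helpers:

```python
import difflib
import re

def normalize(text):
    """Collapse runs of whitespace so formatting-only changes
    do not show up as diffs (hypothetical helper)."""
    return re.sub(r"\s+", " ", text).strip()

def diff_reports(expected_html, actual_html):
    """Return unified-diff lines between two normalized HTML strings."""
    expected = [normalize(l) for l in expected_html.splitlines() if normalize(l)]
    actual = [normalize(l) for l in actual_html.splitlines() if normalize(l)]
    return list(difflib.unified_diff(
        expected, actual, fromfile="expected", tofile="actual", lineterm=""))
```

Identical inputs produce an empty diff; any textual change produces `+`/`-` lines pointing the developer at the affected region.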

Describe if there are any user-facing changes

How was this pull request tested?

Does your PR have changes that can cause upgrade issues?

| Component | Breaking changes? |
| --- | --- |
| MetaDB | Yes/No |
| Name registry json | Yes/No |
| Data File Descriptor Json | Yes/No |
| Export Snapshot Status Json | Yes/No |
| Import Data State | Yes/No |
| Export Status Json | Yes/No |
| Data .sql files of tables | Yes/No |
| Export and import data queue | Yes/No |
| Schema Dump | Yes/No |
| AssessmentDB | Yes/No |
| Sizing DB | Yes/No |
| Migration Assessment Report Json | Yes/No |
| Callhome Json | Yes/No |
| YugabyteD Tables | Yes/No |
| TargetDB Metadata Tables | Yes/No |

Sample output when the reports are matching:

[screenshot]

Sample output if the migration complexity changes:

[screenshot]

Sample output if there are mismatches in tags:

[screenshot]

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Collaborator

@sanyamsinghal sanyamsinghal left a comment


There are a lot of special cases in the extract/parse HTML functions here. Can you please add more descriptive comments in those functions explaining both the "what" and the "why"?

Collaborator

@sanyamsinghal sanyamsinghal left a comment


Looks good mostly, one important comment regarding unique set of tags.

Comment on lines 97 to 109
```python
data = {
    "title": normalize_text(soup.title.string) if soup.title and soup.title.string else "No Title",
    "headings": extract_and_normalize_texts(soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])),
    "paragraphs": extract_paragraphs(soup),
    "tables": sort_table_data(soup.find_all("table")),
    "links": {
        k: v for k, v in sorted(
            {normalize_text(a.get("href") or ""): normalize_text(a.text) for a in soup.find_all("a")}.items()
        )
    },
    "spans": extract_and_normalize_texts(soup.find_all("span")),
    "divs": extract_divs(soup)
}
```
Collaborator


I believe we are checking all of these tags.
What's missing: if there is some info in the HTML (now, or later in the future) inside some tag other than these, it would be skipped.

Would comparing the unique set of tags in both HTML reports fix that?

Contributor Author


Added a function compare_html_tags() which takes care of this. Added a screenshot in the description as well.

Collaborator


Great @shubham-yb. A small nit: instead of "File1" or "File2" in the output, let's use "expected file" or "actual file"?

@sanyamsinghal
Collaborator

@shubham-yb A few guidelines for merging the PR:

  1. Always Squash and Merge.
  2. Make sure you edit the final commit message to make it precise and concise. (All the commits of the branch don't need to appear in the final commit message.)

Collaborator

@sanyamsinghal sanyamsinghal left a comment


LGTM, just a few final comments.

Good job on adding the framework and thanks for improving the code further.

Comment on lines 137 to 158
```python
def compare_html_tags(html_data1, html_data2):
    """Compare the unique tags in the two HTML reports."""

    def get_unique_tags(html_content):
        """Extracts all unique tag names from the given HTML content."""
        soup = BeautifulSoup(html_content, 'html.parser')
        return {tag.name for tag in soup.find_all()}

    tags1 = get_unique_tags(html_data1)
    tags2 = get_unique_tags(html_data2)

    missing_tags_in_file1 = tags2 - tags1  # Tags in file2 but missing in file1
    missing_tags_in_file2 = tags1 - tags2  # Tags in file1 but missing in file2

    differences = {}

    if missing_tags_in_file1:
        differences["missing_tags_in_file1"] = "\n".join(missing_tags_in_file1)
    if missing_tags_in_file2:
        differences["missing_tags_in_file2"] = "\n".join(missing_tags_in_file2)

    return differences

def compare_html_reports(file1, file2):
    """Compares two HTML reports and prints structured differences."""
    with open(file1, "r", encoding="utf-8") as f1, open(file2, "r", encoding="utf-8") as f2:
        html_data1 = extract_html_data(f1.read())
        html_data2 = extract_html_data(f2.read())
```
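The set arithmetic used in `compare_html_tags` works as follows (a toy illustration, not code from the PR):

```python
# Toy illustration of the set differences computed above.
tags1 = {"div", "p", "span"}    # tags found in file1 (expected)
tags2 = {"div", "p", "table"}   # tags found in file2 (actual)

missing_tags_in_file1 = tags2 - tags1  # present only in file2
missing_tags_in_file2 = tags1 - tags2  # present only in file1
```

Here `missing_tags_in_file1` is `{"table"}` and `missing_tags_in_file2` is `{"span"}`; tags common to both reports never appear in the diff.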
Collaborator


One high-level comment: what we (the consumers of the script) are interested in knowing is the difference between the expected and the actual report.

Assuming here: file1 = expected, file2 = actual.
Wording the output from the perspective of file2 (the actual report) will be easier to read.

Can we display the output like:

Missing tags in actual report:
...

Extra/new tags in actual report:
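The suggested wording could look something like this (a hypothetical sketch; `format_tag_differences` is not a function from this PR):

```python
def format_tag_differences(expected_tags, actual_tags):
    """Word tag differences from the perspective of the actual report."""
    lines = []
    missing = expected_tags - actual_tags   # expected, but absent from actual
    extra = actual_tags - expected_tags     # present only in actual
    if missing:
        lines.append("Missing tags in actual report: " + ", ".join(sorted(missing)))
    if extra:
        lines.append("Extra/new tags in actual report: " + ", ".join(sorted(extra)))
    return lines
```

For example, `format_tag_differences({"div", "p"}, {"div", "span"})` reports `p` as missing and `span` as extra, from the actual report's point of view.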

@@ -129,44 +133,85 @@ def generate_diff_list(list1, list2, section_name, file1_path, file2_path):
```python
def dict_to_list(dict_data):
    """Convert dictionary to list of formatted strings."""
    return [f"{k} -> {v}" for k, v in dict_data.items()]

def compare_html_tags(html_data1, html_data2):
```
Collaborator


Comparing unique tags is one check.

How about one enhancement: also match the counts for each tag (the number of times each tag appears)?
I guess it should be a small change; the library might provide that.
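Counting tag occurrences is indeed a small change. With BeautifulSoup one could do `Counter(tag.name for tag in soup.find_all())`; a standard-library-only sketch of the same idea (the `TagCounter` class is hypothetical, not from this PR):

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts how many times each tag opens in an HTML document."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def tag_counts(html_content):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html_content)
    return parser.counts
```

Comparing two such `Counter` objects (e.g. with subtraction) would flag cases where a tag exists in both reports but appears a different number of times.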

@shubham-yb shubham-yb merged commit 1cd7f89 into main Feb 26, 2025
66 checks passed
@shubham-yb shubham-yb deleted the shubham/html-comparison branch February 26, 2025 15:50
3 participants