[REF] Compute_plausible_gaps, Efficiency, Stability #243

Merged Oct 31, 2024 (2 commits)


@bosd (Collaborator) commented Oct 29, 2024

  1. **Use of `get` Method**: When retrieving the best alignment, we use `self._textline_to_alignments.get(most_aligned_tl)` instead of direct indexing. If `most_aligned_tl` is not in the dictionary, `get` returns `None` rather than raising a `KeyError`.

  2. **Early Exit Conditions**: We explicitly check whether `best_alignment` is `None` after retrieving it, so we do not proceed with calculations when the alignment data is missing.

  3. **Sorting and Gap Calculation**: I retained the logic to sort the text lines and calculate gaps. This part of the code is straightforward and unlikely to lead to an infinite loop as long as the input lists are correctly managed.

  4. **Returning `None` for Insufficient Data**: The checks on the lengths of the text line lists ensure that we only proceed if there are enough lines to compute meaningful gaps. Otherwise we return `None` and skip further computation.

  5. **List Comprehensions for Gap Calculation**: The horizontal and vertical gaps are computed with list comprehensions, which are more concise and Pythonic and make the code cleaner.
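
For illustration, the five points above can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `Box`, `Alignment`, `plausible_gaps`, the `x0`/`y0` attributes, and the two-line minimum are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Hypothetical text line: only the coordinates the example needs."""
    x0: float
    y0: float


@dataclass
class Alignment:
    """Hypothetical alignment record holding the aligned text lines."""
    h_textlines: list = field(default_factory=list)
    v_textlines: list = field(default_factory=list)


def plausible_gaps(textline_to_alignments, most_aligned_tl):
    # Points 1 and 2: .get() yields None for a missing key instead of
    # raising KeyError, and we exit early in that case.
    best_alignment = textline_to_alignments.get(most_aligned_tl)
    if best_alignment is None:
        return None

    # Point 3: sort the text lines by position.
    h_sorted = sorted(best_alignment.h_textlines, key=lambda tl: tl.x0)
    v_sorted = sorted(best_alignment.v_textlines, key=lambda tl: tl.y0)

    # Point 4: a gap needs at least two lines (assumed threshold).
    if len(h_sorted) < 2 or len(v_sorted) < 2:
        return None

    # Point 5: gaps between consecutive lines via list comprehensions.
    h_gaps = [b.x0 - a.x0 for a, b in zip(h_sorted, h_sorted[1:])]
    v_gaps = [b.y0 - a.y0 for a, b in zip(v_sorted, v_sorted[1:])]
    return h_gaps, v_gaps
```

Callers have to handle the `None` return, since it now covers both a missing key and too few text lines.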

@bosd added the performance and refactoring labels Oct 29, 2024
@bosd force-pushed the ref-netw-comp-plaus-gaps branch 2 times, most recently from 5af9091 to 628a94d, October 31, 2024 20:49
bosd added 2 commits October 31, 2024 21:56
1. **Sorting without Reverse**: When sorting the textlines, we sort them in ascending order directly. This avoids the need to reverse the sorted list later, which can save some computational overhead.

2. **Array Creation for Gaps**: Instead of creating lists and then converting them, we directly create `numpy` arrays to store gaps. This allows us to utilize `numpy`'s efficient operations for subsequent calculations.

3. **Early Exits**: The checks for the lengths of `ref_h_textlines` and `ref_v_textlines` provide early exits if not enough textlines are available, preventing unnecessary calculations.

4. **Percentile Calculation**: The percentile calculation remains unchanged, but we ensure that we are working with `numpy` arrays for performance.
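
A rough sketch of how these four points might fit together with `numpy`. The function name `gap_percentiles`, the plain coordinate inputs, and the default percentile of 75 are assumptions made for the example, not values taken from this PR.

```python
import numpy as np


def gap_percentiles(h_positions, v_positions, percentile=75):
    """Return the requested percentile of horizontal and vertical gaps.

    `percentile=75` is an arbitrary placeholder for this sketch.
    """
    # Early exit: fewer than two positions means no gap to measure.
    if len(h_positions) < 2 or len(v_positions) < 2:
        return None

    # Sort ascending once, so no reversal is needed afterwards, and
    # build the gaps directly as numpy arrays with np.diff.
    h_gaps = np.diff(np.sort(np.asarray(h_positions, dtype=float)))
    v_gaps = np.diff(np.sort(np.asarray(v_positions, dtype=float)))

    # The percentile is computed on the numpy arrays.
    return (
        float(np.percentile(h_gaps, percentile)),
        float(np.percentile(v_gaps, percentile)),
    )
```

Because the input is sorted ascending, every gap from `np.diff` is non-negative, so the percentile is taken over distances rather than signed differences.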
@bosd force-pushed the ref-netw-comp-plaus-gaps branch from 628a94d to 5d48841, October 31, 2024 20:56
@bosd merged commit ad1babd into py-pdf:main Oct 31, 2024
14 checks passed
@bosd deleted the ref-netw-comp-plaus-gaps branch October 31, 2024 21:04