vectorize layout elements #384

Merged
merged 18 commits into from
Sep 22, 2024

@badGarnet badGarnet commented Sep 9, 2024

This PR adds two new vectorized page-level dataclasses:

  • TextRegions: replaces a list[TextRegion] and stores coordinates and texts as numpy arrays for more efficient memory operations when the number of items is large
  • LayoutElements: replaces a list[LayoutElement] and stores data in numpy arrays in the same way

In addition, this PR refactors yolox model inference to use these two classes internally while keeping the list data structure available for backward compatibility (e.g., passing into a PageLayout object).
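
To illustrate the idea behind the two classes, here is a minimal sketch of a list of per-item objects versus one page-level container backed by numpy arrays. The names `Region`, `RegionArray`, `from_list`, `as_list`, and `areas` are hypothetical and chosen for illustration only; they are not the actual `TextRegions` or `LayoutElements` API from this PR.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Region:
    """Hypothetical per-item record (the list-of-objects layout)."""

    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""


@dataclass
class RegionArray:
    """Hypothetical page-level container: one (N, 4) array instead of N objects."""

    coords: np.ndarray  # shape (N, 4), columns are x1, y1, x2, y2
    texts: np.ndarray  # shape (N,), dtype=object

    @classmethod
    def from_list(cls, regions):
        # Pack all boxes into a single contiguous float array.
        coords = np.array(
            [[r.x1, r.y1, r.x2, r.y2] for r in regions], dtype=float
        ).reshape(-1, 4)
        texts = np.array([r.text for r in regions], dtype=object)
        return cls(coords, texts)

    def as_list(self):
        # Backward-compatible view: rebuild the per-item objects on demand.
        return [
            Region(*map(float, box), text=text)
            for box, text in zip(self.coords, self.texts)
        ]

    def areas(self):
        # One vectorized expression replaces a Python loop over objects.
        widths = self.coords[:, 2] - self.coords[:, 0]
        heights = self.coords[:, 3] - self.coords[:, 1]
        return widths * heights

    def __len__(self):
        return self.coords.shape[0]
```

Keeping a conversion back to a list, as `as_list` does here, mirrors the backward-compatibility goal described above: code that expects a list of objects keeps working while internal computation runs on arrays.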

test

Compare memory usage and runtime on a PDF image using:

from unstructured_inference.inference.layout import process_file_with_model


def main():
    f = "/Users/yaoyou/Downloads/002489.pdf"
    layout = process_file_with_model(f, model_name="yolox")
    # replace elements_array with elements when using main branch
    print(f"found {len(layout.pages[0].elements_array)} elements")


if __name__ == "__main__":
    main()

Peak memory is lower on this branch (the exact amount depends on the number of layout elements detected) and processing is slightly faster, since this PR skips generating a list of LayoutElement objects from the numpy array output of the yolox model.
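
The description does not say how peak memory and runtime were measured. One possible way to reproduce the comparison is sketched below, using the standard-library `tracemalloc` and `time` modules. The helper name `measure` is hypothetical. Note that `tracemalloc` only counts allocations reported through Python's memory hooks, so memory allocated natively by a model runtime may not appear in the result.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak
```

With this helper, the script above could be run on each branch as `layout, seconds, peak = measure(process_file_with_model, f, model_name="yolox")` and the two sets of numbers compared.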

@badGarnet badGarnet marked this pull request as ready for review September 16, 2024 15:35
@badGarnet badGarnet enabled auto-merge (squash) September 19, 2024 18:02
@badGarnet badGarnet merged commit 9504649 into main Sep 22, 2024
5 checks passed
@badGarnet badGarnet deleted the feat/improve-iou-speed branch September 22, 2024 22:01