Handle lack of support for docx/pptx/xlsx for media description #2260

Merged
merged 7 commits into from
Jan 9, 2025

Conversation

pamelafox
Collaborator

Purpose

Fixes #2242

Unfortunately, figure extraction doesn't yet work for Office documents (DOCX, PPTX, XLSX). This PR makes the flow fall back to standard ingestion for them. We could instead have it error out entirely, as it does now, but with a more helpful message. I'm not sure which developers would prefer.

Still working on the unit test

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial
which includes deployment, settings and usage instructions. If text or screenshot need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[X] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works
  • I ran python -m pytest --cov to verify 100% coverage of added lines
  • I ran python -m mypy to check for type errors
  • I either used the pre-commit hooks or ran ruff and black manually on my code.

@pamelafox pamelafox marked this pull request as draft January 9, 2025 00:06
@pamelafox pamelafox marked this pull request as ready for review January 9, 2025 17:47
@pamelafox pamelafox changed the title from "WIP: Handle lack of support for docx/pptx/xlsx for media description" to "Handle lack of support for docx/pptx/xlsx for media description" Jan 9, 2025
except HttpResponseError as e:
    content.seek(0)
    if e.error and e.error.code == "InvalidArgument":
        logger.warning(
Collaborator

warning or error?

Collaborator Author

Just tested with error, that looks good. I think error makes sense given that I do really want people to notice it!
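
For readers skimming this thread, here is a rough sketch of the fallback pattern under discussion. It is not the code from this PR: `StubHttpResponseError` is a hypothetical stand-in for `azure.core.exceptions.HttpResponseError` so the snippet runs without the Azure SDK, and `parse_with_fallback`, `figures_parser_rejecting_office`, and `standard_parser` are made-up names for illustration. Re-raising other error codes is also an assumption, since the reviewed snippet is cut off before that point.

```python
import io
import logging
from typing import BinaryIO, Callable, Optional

logger = logging.getLogger("ingestion_sketch")


class StubErrorDetail:
    def __init__(self, code: str):
        self.code = code


class StubHttpResponseError(Exception):
    """Hypothetical stand-in for azure.core.exceptions.HttpResponseError."""

    def __init__(self, code: Optional[str] = None):
        super().__init__(code or "unknown error")
        # Mirrors the `e.error.code` access shown in the reviewed snippet.
        self.error = StubErrorDetail(code) if code else None


def parse_with_fallback(
    content: BinaryIO,
    parse_with_figures: Callable[[BinaryIO], str],
    parse_standard: Callable[[BinaryIO], str],
) -> str:
    """Try figure-aware parsing first; fall back to standard parsing if rejected."""
    try:
        return parse_with_figures(content)
    except StubHttpResponseError as e:
        # The failed attempt may have consumed part of the stream, so rewind it.
        content.seek(0)
        if e.error and e.error.code == "InvalidArgument":
            # Logged at error level so the skipped figures are hard to miss.
            logger.error(
                "Media description is not supported for this file type; "
                "falling back to standard ingestion."
            )
            return parse_standard(content)
        raise  # Assumption: any other failure is not swallowed.


def figures_parser_rejecting_office(stream: BinaryIO) -> str:
    # Hypothetical parser that reads a few bytes, then rejects the format.
    stream.read(4)
    raise StubHttpResponseError("InvalidArgument")


def standard_parser(stream: BinaryIO) -> str:
    # Hypothetical standard parser: returns the whole stream as text.
    return stream.read().decode("utf-8")


result = parse_with_fallback(
    io.BytesIO(b"docx bytes"), figures_parser_rejecting_office, standard_parser
)
print(result)
```

The `content.seek(0)` call matters here: without it, the standard parser would start from wherever the failed attempt stopped reading.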

@@ -175,6 +174,9 @@ If you have already run `azd up`, you will need to run `azd provision` to create
If you have already indexed your documents and want to re-index them with the media descriptions,
first [remove the existing documents](./data_ingestion.md#removing-documents) and then [re-ingest the data](./data_ingestion.md#indexing-additional-documents).

⚠️ This feature does not yet support DOCX, PPTX, or XLSX formats. If you have figures in those formats, they will be ignored.
Collaborator

warning looks good!

Collaborator

@mattgotteiner mattgotteiner left a comment

Thanks!

@pamelafox pamelafox merged commit a967edf into Azure-Samples:main Jan 9, 2025
15 checks passed
Development

Successfully merging this pull request may close these issues.

Error in ingestion of .docx file type due ocrHighResolution when using Content Understanding.
2 participants