Handle lack of support for docx/pptx/xlsx for media description #2260

pamelafox · 2025-01-09T00:06:34Z

Purpose

Unfortunately, figure extraction doesn't yet work for Office documents (DOCX, PPTX, XLSX). This PR makes the flow fall back to standard ingestion for them. We could instead have it error out entirely, as it does now, but with a more helpful message. I'm not sure which behavior developers would prefer.

Still working on the unit test

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial, check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[X] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

The current tests all pass (python -m pytest).
I added tests that prove my fix is effective or that my feature works
I ran python -m pytest --cov to verify 100% coverage of added lines
I ran python -m mypy to check for type errors
I either used the pre-commit hooks or ran ruff and black manually on my code.

mattgotteiner · 2025-01-09T18:00:27Z

app/backend/prepdocslib/pdfparser.py

+                except HttpResponseError as e:
+                    content.seek(0)
+                    if e.error and e.error.code == "InvalidArgument":
+                        logger.warning(


warning or error?

Just tested with error, and that looks good. I think error makes sense, since I really do want people to notice it!
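
For readers following this thread, here is a minimal sketch of the fallback pattern under discussion. It is not the code in pdfparser.py: `FakeHttpResponseError` and `FakeError` are hypothetical stand-ins for `azure.core.exceptions.HttpResponseError` and its `error` attribute so the example runs without the Azure SDK, and `parse_with_fallback`, `analyze_with_figures`, and `analyze_standard` are illustrative names. Re-raising any other error code is also an assumption, since the diff excerpt above doesn't show that branch.

```python
import io
import logging
from dataclasses import dataclass
from typing import BinaryIO, Callable, Optional

logger = logging.getLogger("ingestion_sketch")


@dataclass
class FakeError:
    """Hypothetical stand-in for the error payload carried by the SDK exception."""
    code: Optional[str] = None


class FakeHttpResponseError(Exception):
    """Hypothetical stand-in for azure.core.exceptions.HttpResponseError."""

    def __init__(self, message: str, error: Optional[FakeError] = None):
        super().__init__(message)
        self.error = error


def parse_with_fallback(
    content: BinaryIO,
    analyze_with_figures: Callable[[BinaryIO], str],
    analyze_standard: Callable[[BinaryIO], str],
) -> str:
    """Try figure-aware analysis first, then fall back to standard ingestion."""
    try:
        return analyze_with_figures(content)
    except FakeHttpResponseError as e:
        # The first attempt may have consumed the stream, so rewind it.
        content.seek(0)
        if e.error and e.error.code == "InvalidArgument":
            # Logged at error level so the skipped figures get noticed.
            logger.error(
                "Figure extraction is not supported for this file; "
                "falling back to standard ingestion."
            )
            return analyze_standard(content)
        # Assumption: any other failure is not ours to handle here.
        raise
```

A caller would pass the open file stream plus the two analysis callables. The rewind matters because the standard path would otherwise start reading from wherever the failed attempt stopped.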

mattgotteiner · 2025-01-09T18:00:41Z

docs/deploy_features.md

@@ -175,6 +174,9 @@ If you have already run `azd up`, you will need to run `azd provision` to create
 If you have already indexed your documents and want to re-index them with the media descriptions,
 first [remove the existing documents](./data_ingestion.md#removing-documents) and then [re-ingest the data](./data_ingestion.md#indexing-additional-documents).

+⚠️ This feature does not yet support DOCX, PPTX, or XLSX formats. If you have figures in those formats, they will be ignored.


warning looks good!
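
To find out ahead of time which files the docs warning applies to, a pre-ingestion check along these lines could help. This is only a sketch: `figures_will_be_ignored` and `UNSUPPORTED_FIGURE_EXTENSIONS` are hypothetical names that are not part of this PR, and the extension list simply mirrors the three formats named in the warning.

```python
from pathlib import Path

# Formats the docs warning lists as unsupported for media description.
UNSUPPORTED_FIGURE_EXTENSIONS = {".docx", ".pptx", ".xlsx"}


def figures_will_be_ignored(filename: str) -> bool:
    """Return True if figures in this file would be skipped during ingestion."""
    return Path(filename).suffix.lower() in UNSUPPORTED_FIGURE_EXTENSIONS


if __name__ == "__main__":
    for name in ["report.pdf", "slides.PPTX", "budget.xlsx"]:
        status = "figures ignored" if figures_will_be_ignored(name) else "figures described"
        print(f"{name}: {status}")
```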

mattgotteiner

Thanks!

pamelafox added 2 commits January 7, 2025 12:47

Configure Azure Developer Pipeline

6fac970

Address lack of support for office formats

100367c

pamelafox marked this pull request as draft January 9, 2025 00:06

pamelafox and others added 4 commits January 8, 2025 16:17

Test attempt

2c52253

Fixed test with error

9b6ff66

Mypy types

c3aeeab

Merge branch 'main' into contentunderdocx

c3f156e

pamelafox marked this pull request as ready for review January 9, 2025 17:47

pamelafox changed the title ~~WIP: Handle lack of support for docx/pptx/xlsx for media description~~ Handle lack of support for docx/pptx/xlsx for media description Jan 9, 2025

pamelafox requested a review from mattgotteiner January 9, 2025 17:48

pamelafox mentioned this pull request Jan 9, 2025

Error in ingestion of .docx file type due ocrHighResolution when using Content Understanding. #2242

Closed

mattgotteiner reviewed Jan 9, 2025


mattgotteiner approved these changes Jan 9, 2025


Change warning to error

13d658b

pamelafox merged commit a967edf into Azure-Samples:main Jan 9, 2025
15 checks passed
