GitHub - SWHL/ExtractOfficeContent: Extract content (including text, tables, and images) from Office files (Word, Excel, PPT).

Extract Office Content

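Known Issues

Extracting PPT:
- Text boxes that contain formulas are lost when extracting PPT content.
- Extracted tables lose part of their formatting.
- Tables in a PPT are extracted to corresponding Excel files; is there a better way?
Extracting Word:
- Table positions cannot be matched one-to-one with their positions in the original document.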
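
On the open question of PPT tables, one option (a sketch only, not something the project currently does) is to emit each table inline as Markdown text instead of a separate Excel file. The helper below is hypothetical and assumes the cell text has already been read into a list of rows, with the first row treated as the header.

```python
from typing import List, Sequence


def rows_to_markdown(rows: Sequence[Sequence[str]]) -> str:
    """Render rows of cell text as a Markdown table (first row is the header).

    Hypothetical helper for illustration; not part of extract_office_content.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: Sequence[str]) -> str:
        # Pad short rows, escape pipes, and flatten line breaks inside cells.
        cells = list(row) + [""] * (width - len(row))
        cells = [c.replace("|", "\\|").replace("\n", " ").strip() for c in cells]
        return "| " + " | ".join(cells) + " |"

    lines: List[str] = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in rows[1:])
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown([["Name", "Score"], ["Alice", "90"], ["Bob", "85"]]))
```

Markdown cannot express merged cells or styling, so this trades fidelity for keeping the table next to the surrounding slide text.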

Use

Install extract_office_content
```
$ pip install extract_office_content
```

Run from the CLI.

Extract the content of all Office files.

```
$ extract_office_content -h
usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path

positional arguments:
  file_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_office_content tests/test_files
```

Extract Word.

```
$ extract_word -h
usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path

positional arguments:
  word_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_word tests/test_files/word_example.docx
```

Extract PPT.

```
$ extract_ppt -h
usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path

positional arguments:
  ppt_path

optional arguments:
  -h, --help            show this help message and exit
  -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_ppt tests/test_files/ppt_example.pptx
```

Extract Excel.

```
$ extract_excel -h
usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]
                     excel_path

positional arguments:
  excel_path

optional arguments:
  -h, --help            show this help message and exit
  -f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
  -o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_excel tests/test_files/excel_example.xlsx
```

Run from a Python script.

Extract All.

```python
from pathlib import Path

from extract_office_content import ExtractOfficeContent

extracter = ExtractOfficeContent()

# Collect every entry in the test directory.
file_list = list(Path('tests/test_files').iterdir())

# Extract and print the content of each file in turn.
for file_path in file_list:
    res = extracter(file_path)
    print(res)
```
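
If the input directory may also hold non-Office files or subdirectories, it can help to filter the listing before handing paths to the extractor. The sketch below is a hypothetical helper; the suffix set is an assumption and may not match what the library actually accepts.

```python
from pathlib import Path
from typing import List

# Assumed set of suffixes; the library may accept more or fewer formats.
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def list_office_files(root: str) -> List[Path]:
    """Return the Office files directly under root, sorted by path."""
    return sorted(
        p for p in Path(root).iterdir()
        if p.is_file() and p.suffix.lower() in OFFICE_SUFFIXES
    )
```

The returned list can replace `file_list` in the loop above.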

Extract Word.

```python
from extract_office_content import ExtractWord

word_extract = ExtractWord()
word_path = 'tests/test_files/word_example.docx'

# Extract from a file path.
text = word_extract(word_path, "outputs/word")

# Or extract from the raw bytes of the file.
with open(word_path, 'rb') as f:
    word_content = f.read()
text = word_extract(word_content, "outputs/word")
print(text)
```

Extract PPT.

```python
from pathlib import Path

from extract_office_content import ExtractPPT

ppt_extracter = ExtractPPT()

ppt_path = 'tests/test_files/ppt_example.pptx'
save_dir = 'outputs'

# Save images under outputs/<file name without extension>.
save_img_dir = Path(save_dir) / Path(ppt_path).stem
res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))

# Or extract from the raw bytes of the file.
with open(ppt_path, 'rb') as f:
    ppt_content = f.read()
res = ppt_extracter(ppt_content, save_img_dir=str(save_img_dir))
print(res)
```

Extract Excel.

```python
from extract_office_content import ExtractExcel

excel_extract = ExtractExcel()

excel_path = 'tests/test_files/excel_with_image.xlsx'

# Extract from a file path, rendering tables as Markdown.
res = excel_extract(excel_path, out_format='markdown', save_img_dir='1')

# Or extract from the raw bytes of the file.
with open(excel_path, 'rb') as f:
    excel_content = f.read()
res = excel_extract(excel_content, out_format='markdown', save_img_dir='1')
print(res)
```
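
The examples above pass either a path or raw bytes. When the data arrives in some other form, such as an upload stream, one way to keep calling code uniform is to normalize everything to bytes first. The `to_bytes` helper below is hypothetical and not part of extract_office_content; it is a minimal sketch that assumes any non-path, non-bytes input is a binary file-like object.

```python
import io
from pathlib import Path
from typing import BinaryIO, Union


def to_bytes(source: Union[str, Path, bytes, bytearray, BinaryIO]) -> bytes:
    """Return the raw bytes behind a path, a bytes value, or a binary stream."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    # Assumed to be a binary file-like object opened for reading.
    return source.read()


if __name__ == "__main__":
    stream = io.BytesIO(b"example payload")
    print(len(to_bytes(stream)))
```

The resulting bytes can then be passed to an extractor in the same way as `excel_content` above.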

Changelog

2023-07-02 v0.0.6 update:
- The Word extraction interface now returns a List, consistent with the other extractors.
2023-06-17 v0.0.4 update:
- Added support for file-like object input.
