Skip to content

Latest commit

 

History

History
156 lines (127 loc) · 5.29 KB

README.md

File metadata and controls

156 lines (127 loc) · 5.29 KB

Extract Office Content

PyPI SemVer2.0

目前已知问题

  • 提取PPT:
    • 提取ppt中的内容时,会丢失带有公式的文本框
    • 提取的表格格式不全
    • PPT中的表格会提取为对应的excel文件,是否有更好的方式?
  • 提取Word:
    • 表格位置不能与原文中一一对应

Use

  1. Installextract_office_content
    $ pip install extract_office_content
  2. Run by CLI.
    • Extract All office file's content.
      $ extract_office_content -h
      usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path
      
      positional arguments:
      file_path
      
      optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
      
      $ extract_office_content tests/test_files
    • Extract Word.
      $ extract_word -h
      usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path
      
      positional arguments:
      word_path
      
      optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
      
      $ extract_word tests/test_files/word_example.docx
    • Extract PPT.
      $ extract_ppt -h
      usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path
      
      positional arguments:
      ppt_path
      
      optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
      
      $ extract_ppt tests/test_files/ppt_example.pptx
    • Extract Excel.
      $ extract_excel -h
      usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]
                          excel_path
      
      positional arguments:
      excel_path
      
      optional arguments:
      -h, --help            show this help message and exit
      -f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
      -o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
      
      $ extract_excel tests/test_files/excel_example.xlsx
  3. Run by python script.
    • Extract All.
      from pathlib import Path
      
      from extract_office_content import ExtractOfficeContent
      
      extracter = ExtractOfficeContent()
      file_list = list(Path('tests/test_files').iterdir())
      
      for file_path in file_list:
          res = extracter(file_path)
          print(res)
    • Extract Word.
      from extract_office_content import ExtractWord
      
      word_extract = ExtractWord()
      word_path = 'tests/test_files/word_example.docx'
      text = word_extract(word_path, "outputs/word")
      
      # or bytes
      with open(word_path, 'rb') as f:
          word_content = f.read()
      text = word_extract(word_content, "outputs/word")
      print(text)
    • Extract PPT.
      from pathlib import Path
      
      from extract_office_content import ExtractPPT
      
      ppt_extracter = ExtractPPT()
      
      ppt_path = 'tests/test_files/ppt_example.pptx'
      save_dir = 'outputs'
      save_img_dir = Path(save_dir) / Path(ppt_path).stem
      res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))
      
      # or bytes
      with open(ppt_path, 'rb') as f:
          ppt_content = f.read()
      res = ppt_extracter(ppt_content, save_img_dir=str(save_img_dir))
      print(res)
    • Extract Excel.
      from extract_office_content import ExtractExcel
      
      excel_extract = ExtractExcel()
      
      excel_path = 'tests/test_files/excel_with_image.xlsx'
      res  = excel_extract(excel_path, out_format='markdown', save_img_dir='1')
      
      # or bytes
      with open(excel_path, 'rb') as f:
          excel_content = f.read()
      res  = excel_extract(excel_content, out_format='markdown', save_img_dir='1')
      print(res)

更新日志

  • 2023-07-02 v0.0.6 update:
    • 统一提取word接口返回值为List,与其他统一
  • 2023-06-17 v0.0.4 update:
    • 支持file-like object输入

Reference