feat: Support AsciiDoc and Markdown input format #168

Merged 36 commits on Oct 23, 2024
Commits
1203353
updated the base-model and added the asciidoc_backend
PeterStaar-IBM Oct 17, 2024
c1d9241
updated the asciidoc backend
PeterStaar-IBM Oct 18, 2024
63d3704
Ensure all models work only on valid pages (#158)
cau-git Oct 18, 2024
77fa1db
ci: run ci also on forks (#160)
dolfim-ibm Oct 18, 2024
eb154a1
fix: fix legacy doc ref (#162)
vagenas Oct 18, 2024
b6c0610
docs: typo fix (#155)
fadkeabhi Oct 18, 2024
006cfb4
feat: add coverage_threshold to skip OCR for small images (#161)
dolfim-ibm Oct 18, 2024
c60c402
chore: bump version to 2.1.0 [skip ci]
github-actions[bot] Oct 18, 2024
1138cae
adding tests for asciidocs
PeterStaar-IBM Oct 18, 2024
5016dae
first working asciidoc parser
PeterStaar-IBM Oct 18, 2024
70b2ae3
reformatted the code
PeterStaar-IBM Oct 18, 2024
e60c525
fixed the mypy
PeterStaar-IBM Oct 19, 2024
c23d049
adding test_02.asciidoc
PeterStaar-IBM Oct 21, 2024
5986213
Drafting Markdown backend via Marko library
Oct 17, 2024
1df89f7
work in progress on MD backend
Oct 18, 2024
534b220
md_backend produces docling document with headers, paragraphs, lists
Oct 18, 2024
bef429f
Improvements in md parsing
Oct 18, 2024
fa2f8cf
Detecting and assembling tables in markdown in temporary buffers
Oct 21, 2024
ba9beb6
Added initial docling table support to md_backend
Oct 21, 2024
dae3664
Cleaned code, improved logging for MD
Oct 21, 2024
1456a36
Fixes MyPy requirements, and rest of pre-commit
Oct 21, 2024
8c60dfa
Fixed example run_md, added origin info to md_backend
Oct 21, 2024
1c0a766
working on asciidocs, struggling with ImageRef
PeterStaar-IBM Oct 22, 2024
b04f14e
able to parse the captions and image uri's
PeterStaar-IBM Oct 22, 2024
bb3db07
fixed the mypy
PeterStaar-IBM Oct 22, 2024
0bbd50f
Merge branch 'dev/add-asciidocs-backend' of github.com:DS4SD/docling …
cau-git Oct 22, 2024
789b29b
Merge ASCIIDoc and Markdown backends in, fixes
cau-git Oct 22, 2024
b1a2af6
Update all backends with proper filename in DocumentOrigin
cau-git Oct 22, 2024
578e30e
Update to docling-core v2.1.0
cau-git Oct 22, 2024
47a4d31
Fixes for MD Backend, to avoid duplicated text inserts into docling doc
Oct 22, 2024
4fb803f
Fix styling
cau-git Oct 22, 2024
186d71a
Added support for code blocks and fenced code in MD
Oct 22, 2024
e8229fd
cleaned prints
Oct 22, 2024
0f81ffd
Added proper processing of in-line textual elements for MD backend
Oct 23, 2024
82126e3
Fixed issues with duplicated paragraphs and incorrect lists in pptx
Oct 23, 2024
76d9041
Fixed issue with group ordering in pptx backend, added debug log int…
Oct 23, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -94,5 +94,5 @@ If you use Docling in your projects, please consider citing the following:

## License

-The Docling codebase is under MIT license.
+The Docling codebase is under MIT license.
For individual model usage, please refer to the model licenses found in the original packages.
1 change: 1 addition & 0 deletions docling/backend/abstract_backend.py
@@ -13,6 +13,7 @@
class AbstractDocumentBackend(ABC):
    @abstractmethod
    def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
        self.file = in_doc.file
        self.path_or_stream = path_or_stream
        self.document_hash = in_doc.document_hash
        self.input_format = in_doc.format
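
To illustrate how a concrete backend might rely on the attributes set in this constructor, here is a minimal, self-contained sketch. The `InputDocument` dataclass below is a hypothetical stand-in that carries only the three attributes the constructor reads, and `ToyTextBackend` and its `origin_filename` method are invented for illustration; neither is taken from this PR or from the Docling API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from io import BytesIO
from pathlib import Path
from typing import Union


@dataclass
class InputDocument:
    """Hypothetical stand-in: only the attributes the constructor reads."""
    file: Path
    document_hash: str
    format: str


class AbstractDocumentBackend(ABC):
    @abstractmethod
    def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
        # Mirrors the constructor shown in the diff above.
        self.file = in_doc.file
        self.path_or_stream = path_or_stream
        self.document_hash = in_doc.document_hash
        self.input_format = in_doc.format


class ToyTextBackend(AbstractDocumentBackend):
    """Hypothetical backend used only to exercise the base constructor."""

    def __init__(self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]):
        # Overriding __init__ satisfies the abstract method; the base
        # implementation is still called to populate the shared attributes.
        super().__init__(in_doc, path_or_stream)

    def origin_filename(self) -> str:
        # Assumed helper: the file name a backend could record as the origin.
        return self.file.name


doc = InputDocument(file=Path("docs/sample.md"), document_hash="abc123", format="md")
backend = ToyTextBackend(doc, BytesIO(b"# Title\n"))
print(backend.origin_filename())
```

Because `self.file` is stored on the base class, every subclass can reach the original file name without re-deriving it from `path_or_stream`, which may be an in-memory stream with no name of its own.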