fix: removing code block in MarkdownConverter #3960

Merged 14 commits on Jan 27, 2023
2 changes: 1 addition & 1 deletion haystack/nodes/file_converter/markdown.py
@@ -93,7 +93,7 @@ def convert(
metadata, markdown_text = frontmatter.parse(f.read())

# md -> html -> text since BeautifulSoup can extract text cleanly
-html = markdown(markdown_text)
+html = markdown(markdown_text, extensions=["fenced_code"])

# remove code snippets
if remove_code_snippets:
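
Note on the change above: a minimal sketch of why the `fenced_code` extension matters here, assuming the Python-Markdown package (`pip install markdown`) is the `markdown` function used by the converter. The `SAMPLE` string is a hypothetical stand-in that mirrors the lines added to `sample.md` in this PR; the variable names are illustrative only.

```python
from markdown import markdown  # third-party: Python-Markdown

# Hypothetical input mirroring the fenced block added to sample.md.
SAMPLE = "```bash\npip install farm-haystack\n```\n## What to build with Haystack\n"

# Default parser: the triple-backtick fence is not treated as a block-level
# code block, so no <pre> element is produced.
html_plain = markdown(SAMPLE)

# With the extension enabled, the fence is rendered as a <pre><code> block,
# which gives later code-removal logic a clear element to target.
html_fenced = markdown(SAMPLE, extensions=["fenced_code"])

print(html_plain)
print(html_fenced)
```

Comparing the two printed outputs shows the difference the one-line change makes; how the converter then strips the block depends on the `remove_code_snippets` branch, which is not shown in this hunk.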
9 changes: 8 additions & 1 deletion test/nodes/test_file_converter.py
@@ -143,7 +143,8 @@ def test_docx_converter():
def test_markdown_converter():
converter = MarkdownConverter()
document = converter.convert(file_path=SAMPLES_PATH / "markdown" / "sample.md")[0]
-assert document.content.startswith("What to build with Haystack")
+assert document.content.startswith("\nWhat to build with Haystack")
+assert "# git clone https://github.com/deepset-ai/haystack.git" not in document.content


def test_markdown_converter_headline_extraction():
@@ -178,6 +179,12 @@ def test_markdown_converter_frontmatter_to_meta():
assert document.meta["date"] == "1.1.2023"


+def test_markdown_converter_remove_code_snippets():
+converter = MarkdownConverter(remove_code_snippets=False)
+document = converter.convert(file_path=SAMPLES_PATH / "markdown" / "sample.md")[0]
+assert document.content.startswith("pip install farm-haystack")


def test_azure_converter():
# Check if Form Recognizer endpoint and credential key in environment variables
if "AZURE_FORMRECOGNIZER_ENDPOINT" in os.environ and "AZURE_FORMRECOGNIZER_KEY" in os.environ:
3 changes: 3 additions & 0 deletions test/samples/markdown/sample.md
@@ -2,6 +2,9 @@
type: intro
date: 1.1.2023
---
+```bash
+pip install farm-haystack
+```
## What to build with Haystack

- **Ask questions in natural language** and find granular answers in your own documents.