Readme Content Verification #3

Merged
merged 1 commit into from
Feb 26, 2019
46 changes: 36 additions & 10 deletions packages/python-packages/doc-warden/README.md
@@ -5,13 +5,12 @@ Every CI build owned by the Azure-SDK team also needs to verify that the documen
Features:

* Enforces Readme Standards
- [x] Readmes present
- [ ] Readmes have appropriate contents
- [ ] Files issues for failed standards checks
- [ ] Exit code > 0 for issues discovered
* Generates report for included observed packages
- Readmes present - *completed*
- Readmes have appropriate contents - *completed*
- Files issues for failed standards checks - *pending*
* Generates report for included observed packages - *pending*

This package is under development, and as such Python version compatibility has not been finalized at this time.
This package is tested on Python 2.7 -> 3.8.

## Prerequisites
This package is intended to be run as part of a pipeline within Azure DevOps. As such, [Python](https://www.python.org/downloads/) must be installed before you attempt to install or use `Doc-Warden`. While `pip` comes pre-installed on most modern Python installs, if `pip` is an unrecognized command when attempting to install `warden`, run the following command **after** your Python installation is complete.
@@ -42,8 +41,14 @@ Example usage:

* Azure DevOps is a bit finicky about registering a console entry point, hence the `sudo` on the installation step. `sudo` is only required on DevOps machines.
* The `.docsettings` file is assumed to be at the root of the repository.
* To provide a different path (like `azure-sdk-for-java` does...), use:
* `ward scan -d $(Build.SourcesDirectory) -c $(Build.SourcesDirectory)/eng/.docsettings.yml`

To provide a different path (like `azure-sdk-for-java` does...), use:

```

/:> ward scan -d $(Build.SourcesDirectory) -c $(Build.SourcesDirectory)/eng/.docsettings.yml

```

##### Parameter Options

@@ -102,14 +107,23 @@ A package is indicated by:

* The presence of a `package.json` file

### Enforcing Readme Content

`doc-warden` can check discovered readme files to ensure that a set of configured sections is present. It checks each pattern in `required_readme_sections` against every header in a target readme. If every pattern matches at least one header, the readme passes content verification.

Other Notes:
* A `section` title is any markdown or RST that results in an `<h1>` through `<h6>` HTML tag.
* `warden` verifies the content of any `readme.rst` or `readme.md` file found outside the `omitted_paths` in the targeted repo.
* The case of the readme file name is ignored.

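The rule above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: `find_missing_patterns` is a hypothetical helper, and the header strings are made-up sample data standing in for the headers extracted from a readme.

```python
import re

def find_missing_patterns(headers, patterns):
    """Return the patterns that match none of the given header strings."""
    return [
        pattern for pattern in patterns
        if not any(re.search(pattern, header) for header in headers)
    ]

# Sample header text (assumed), as it might be extracted from a readme.
headers = ["Microsoft Azure SDK for Java", "Getting Started", "Contributing"]
patterns = [
    "(Client Library for Azure .*|Microsoft Azure SDK for .*)",
    "Getting Started",
    "Troubleshooting",  # hypothetical extra requirement, not met by the headers
]

missing = find_missing_patterns(headers, patterns)
print(missing)  # ['Troubleshooting']
```

A readme passes only when the returned list is empty.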
#### Control, the `.docsettings.yml` File, and You

Special cases often need to be configured, so each repo needs a central location to override conventional settings. To that end, a new `.docsettings.yml` file will be added to each repo.

```
<repo-root>
│ README.md
│ .docsettings.yml
│ .docsettings.yml
└───.azure-pipelines
│ │ <build def>
@@ -126,6 +140,9 @@ omitted_paths:
- archive/*
language: java
root_check_enabled: True
required_readme_sections:
- "(Client Library for Azure .*|Microsoft Azure SDK for .*)"
- Getting Started
```

The above configuration tells `warden`...
@@ -136,6 +153,15 @@ The above configuration tells `warden`...

Possible values for `language` right now are `['net', 'java', 'js', 'python']`. More than one target language is not currently supported.

##### `required_readme_sections` Configuration
This section instructs `warden` to verify that each discovered readme has at least one matching section title for every provided `section` pattern. Regular expressions are fully supported.

The two items listed in the example `.docsettings` file will:
- Match a header that satisfies a simple regular expression
- Match a header containing the text "Getting Started"

Note that a regex must be surrounded by quotation marks wherever it would otherwise break YAML parsing of the configuration file.
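Judging by `match_regex_set` in this change, patterns are applied with `re.search`, so a plain pattern matches anywhere inside a header. The sketch below, using made-up header strings, shows the difference between a plain pattern and one anchored with `^` and `$`. The anchored form is a suggestion for when an exact title is wanted, not something the change itself configures.

```python
import re

# Sample header text (assumed for illustration).
headers = ["Getting Started with Doc-Warden", "Getting Started"]

loose = "Getting Started"      # matches any header containing the text
strict = "^Getting Started$"   # matches only a header with exactly this title

loose_hits = [h for h in headers if re.search(loose, h)]
strict_hits = [h for h in headers if re.search(strict, h)]

print(loose_hits)   # both headers
print(strict_hits)  # only the exact title
```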

## Provide Feedback

If you encounter any bugs or have suggestions, please file an issue [here](<https://github.com/Azure/azure-sdk/issues>) and assign to `scbedd`.
If you encounter any bugs or have suggestions, please file an issue [here](https://github.com/Azure/azure-sdk-tools/issues) and assign to `scbedd`.
11 changes: 9 additions & 2 deletions packages/python-packages/doc-warden/setup.py
@@ -1,3 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from setuptools import setup, find_packages
import setuptools

@@ -25,7 +28,7 @@
description=DESCRIPTION,
long_description=long_description,
long_description_content_type='text/markdown',
url='https://github.com/Azure/azure-sdk-tools/packages/python-packages/',
url='https://github.com/Azure/azure-sdk-tools/',
author='Microsoft Corporation',
author_email='[email protected]',

@@ -45,7 +48,11 @@
],
packages=find_packages(),
install_requires = [
'pyyaml',
'pyyaml', # docsettings file parse
'markdown2', # parsing markdown to html
'docutils', # parsing rst to html
'pygments', # docutils uses pygments for parsing rst to html
'beautifulsoup4', # parsing of generated html
'pathlib'
],
entry_points = {
@@ -1,3 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from __future__ import print_function
import argparse
import yaml
@@ -66,6 +69,11 @@ def __init__(self):
        except:
            self.omitted_paths = []

        try:
            self.required_readme_sections = doc['required_readme_sections'] or []
        except:
            self.required_readme_sections = []

        try:
            self.scan_language = args.scan_language or doc['language']
        except:
@@ -88,5 +96,6 @@ def dump(self):
            'omitted_paths': self.omitted_paths,
            'scan_language': self.scan_language,
            'root_check_enabled': self.root_check_enabled,
            'verbose_output': self.verbose_output
            'verbose_output': self.verbose_output,
            'required_readme_sections': self.required_readme_sections
        }
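
The fallback behaviour added in the hunk above can be illustrated in isolation. This is a sketch under assumptions: `read_required_sections` is a hypothetical helper, `doc` stands in for the parsed `.docsettings.yml` content, and the exception types are narrowed here for clarity where the change itself uses a bare `except`.

```python
def read_required_sections(doc):
    """Return the configured section patterns, or [] when the key is absent or empty."""
    try:
        # `or []` covers a key that is present but empty (parsed as None).
        return doc['required_readme_sections'] or []
    except (KeyError, TypeError):
        # KeyError: key missing; TypeError: the settings document itself is None.
        return []

configured = read_required_sections({'required_readme_sections': ['Getting Started']})
absent = read_required_sections({'language': 'java'})
empty = read_required_sections({'required_readme_sections': None})
no_document = read_required_sections(None)
```

In every case the caller receives a list, so later code can iterate over the patterns without checking for `None`.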
35 changes: 14 additions & 21 deletions packages/python-packages/doc-warden/warden/__init__.py
@@ -1,28 +1,21 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from .version import VERSION
from .enforce_readme_presence import *
from .WardenConfiguration import WardenConfiguration

from .enforce_readme_presence import find_missing_readmes
from .enforce_readme_content import verify_readme_content
from .WardenConfiguration import WardenConfiguration
from .warden_common import walk_directory_for_pattern, get_omitted_files
from .cmd_entry import console_entry_point

__all__ = ['WardenConfiguration',
'DEFAULT_LOCATION',
'return_true',
'unrecognized_option',
__all__ = [
'WardenConfiguration',
'find_missing_readmes',
'verify_readme_content',
'console_entry_point',
'scan_repo',
'results',
'check_package_readmes',
'check_python_readmes',
'check_js_readmes',
'check_net_readmes',
'is_net_csproj_package',
'check_java_readmes',
'is_java_pom_package_pom',
'check_repo_root',
'find_alongside_file',
'get_file_sets',
'get_omitted_files',
'walk_directory_for_pattern',
'check_match',
'parse_pom']
'get_omitted_files',
]

__version__ = VERSION
27 changes: 27 additions & 0 deletions packages/python-packages/doc-warden/warden/cmd_entry.py
@@ -0,0 +1,27 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from __future__ import print_function

from .enforce_readme_presence import find_missing_readmes
from .enforce_readme_content import verify_readme_content
from .WardenConfiguration import WardenConfiguration

# CONFIGURATION. ENTRY POINT. EXECUTION.
def console_entry_point():
    cfg = WardenConfiguration()
    print(cfg.dump())

    command_selector = {
        'scan': scan,
    }

    if cfg.command in command_selector:
        command_selector.get(cfg.command)(cfg)
    else:
        print('Unrecognized command invocation {}.'.format(cfg.command))
        exit(1)

def scan(config):
    find_missing_readmes(config)
    verify_readme_content(config)
102 changes: 102 additions & 0 deletions packages/python-packages/doc-warden/warden/enforce_readme_content.py
@@ -0,0 +1,102 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from __future__ import print_function

import os
import markdown2
import bs4
import re
from .warden_common import check_match, walk_directory_for_pattern, get_omitted_files
from docutils import core
from docutils.writers.html4css1 import Writer,HTMLTranslator

# fnmatch is case insensitive by default, just look for readme rst and md
README_PATTERNS = ['*/readme.md', '*/readme.rst']

# entry point
def verify_readme_content(config):
    all_readmes = walk_directory_for_pattern(config.target_directory, README_PATTERNS)
    omitted_readmes = get_omitted_files(config)
    targeted_readmes = [readme for readme in all_readmes if readme not in omitted_readmes]

    readme_results = []

    for readme in targeted_readmes:
        ext = os.path.splitext(readme)[1]
        if ext == '.rst':
            readme_results.append(verify_rst_readme(readme, config))
        else:
            readme_results.append(verify_md_readme(readme, config))

    results([readme_tuple for readme_tuple in readme_results if readme_tuple[1]], config)

# output results
def results(readmes_with_issues, config):
    if len(readmes_with_issues):
        print('{} readmes have missing required sections.'.format(len(readmes_with_issues)))
        for readme_tuple in readmes_with_issues:
            print(readme_tuple[0].replace(os.path.normpath(config.target_directory), '') + ' is missing headers with pattern(s):')
            for missing_pattern in readme_tuple[1]:
                print(' * {0}'.format(missing_pattern))
        exit(1)

# parse rst to html, check for presence of appropriate sections
def verify_rst_readme(readme, config):
    with open(readme, 'r') as f:
        readme_content = f.read()
    html_readme_content = rst_to_html(readme_content)
    html_soup = bs4.BeautifulSoup(html_readme_content, "html.parser")

    missed_patterns = find_missed_sections(html_soup, config.required_readme_sections)

    return (readme, missed_patterns)

# parse md to html, check for presence of appropriate sections
def verify_md_readme(readme, config):
    with open(readme, 'r') as f:
        readme_content = f.read()
    html_readme_content = markdown2.markdown(readme_content)
    html_soup = bs4.BeautifulSoup(html_readme_content, "html.parser")

    missed_patterns = find_missed_sections(html_soup, config.required_readme_sections)

    return (readme, missed_patterns)

# within the entire readme, are there any missing sections that are expected?
def find_missed_sections(html_soup, patterns):
    headers = html_soup.find_all(re.compile('^h[1-6]$'))
    missed_patterns = []
    observed_patterns = []

    for header in headers:
        observed_patterns.extend(match_regex_set(header, patterns))

    return list(set(patterns) - set(observed_patterns))

# checks a header tag (soup) against a set of configured patterns
def match_regex_set(header, patterns):
    matching_patterns = []
    for pattern in patterns:
        result = re.search(pattern, header.get_text())
        if result:
            matching_patterns.append(pattern)
    return matching_patterns

# boilerplate for translating RST
class HTMLFragmentTranslator(HTMLTranslator):
    def __init__(self, document):
        HTMLTranslator.__init__(self, document)
        self.head_prefix = ['','','','','']
        self.body_prefix = []
        self.body_suffix = []
        self.stylesheet = []
    def astext(self):
        return ''.join(self.body)

html_fragment_writer = Writer()
html_fragment_writer.translator_class = HTMLFragmentTranslator

# utilize boilerplate
def rst_to_html(input_rst):
    return core.publish_string(input_rst, writer = html_fragment_writer)
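
The header-extraction step in `find_missed_sections` can be tried out without the third-party dependencies. The sketch below swaps BeautifulSoup for the standard library's `html.parser` (Python 3 only). `HeaderCollector` and `collect_headers` are hypothetical names, and the sample HTML is made up. It illustrates the idea of collecting the text of every `<h1>` through `<h6>` tag and is not the code merged in this change.

```python
import re
from html.parser import HTMLParser

HEADER_TAG = re.compile(r'^h[1-6]$')

class HeaderCollector(HTMLParser):
    """Collects the text content of every <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._current = None  # text pieces of the header being read, if any

    def handle_starttag(self, tag, attrs):
        if HEADER_TAG.match(tag):
            self._current = []

    def handle_data(self, data):
        # Text inside nested tags (for example <em>) is still part of the header.
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if HEADER_TAG.match(tag) and self._current is not None:
            self.headers.append(''.join(self._current).strip())
            self._current = None

def collect_headers(html):
    collector = HeaderCollector()
    collector.feed(html)
    collector.close()
    return collector.headers

sample = "<h1>Microsoft Azure SDK for Java</h1><p>intro</p><h2>Getting <em>Started</em></h2>"
headers = collect_headers(sample)
print(headers)  # ['Microsoft Azure SDK for Java', 'Getting Started']
```

The resulting strings are what each configured pattern would then be searched against.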