Commit

Merge pull request #3 from scbedd/readme-content-verification
Readme Content Verification
scbedd authored Feb 26, 2019
2 parents e0a515b + 19b1712 commit a2ff91a
Showing 9 changed files with 246 additions and 100 deletions.
46 changes: 36 additions & 10 deletions packages/python-packages/doc-warden/README.md
@@ -5,13 +5,12 @@ Every CI build owned by the Azure-SDK team also needs to verify that the documen
Features:

* Enforces Readme Standards
- [x] Readmes present
- [ ] Readmes have appropriate contents
- [ ] Files issues for failed standards checks
- [ ] Exit code > 0 for issues discovered
* Generates report for included observed packages
- Readmes present - *completed*
- Readmes have appropriate contents - *completed*
- Files issues for failed standards checks - *pending*
* Generates report for included observed packages - *pending*

This package is under development, and as such Python version compatibility has not been finalized at this time.
This package is tested on Python 2.7 -> 3.8.

## Prerequisites
This package is intended to be run as part of a pipeline within Azure DevOps. As such, [Python](https://www.python.org/downloads/) must be installed prior to attempting to install or use `Doc-Warden`. While `pip` comes pre-installed on most modern Python installs, if `pip` is an unrecognized command when attempting to install `warden`, run the following command **after** your Python installation is complete.
@@ -42,8 +41,14 @@ Example usage:

* Azure DevOps is a bit finicky about registering a console entry point, hence the `sudo` on the installation only. `sudo` is required only on DevOps machines.
* The assumption is that the `.docsettings` file is placed at the root of the repository.
* To provide a different path (like `azure-sdk-for-java` does...), use:
* `ward scan -d $(Build.SourcesDirectory) -c $(Build.SourcesDirectory)/eng/.docsettings.yml`

To provide a different path (like `azure-sdk-for-java` does...), use:

```
/:> ward scan -d $(Build.SourcesDirectory) -c $(Build.SourcesDirectory)/eng/.docsettings.yml
```

##### Parameter Options

@@ -102,14 +107,23 @@ A package is indicated by:

* The presence of a `package.json` file

### Enforcing Readme Content

`doc-warden` has the ability to check discovered readme files to ensure that a set of configured sections is present. How does it work? `doc-warden` will check each pattern present within `required_readme_sections` against all headers present within a target readme. If all the patterns match at least one header, the readme will pass content verification.

Other Notes:
* A `section` title is any Markdown or RST construct that renders as an `<h1>` through `<h6>` HTML tag.
* `warden` verifies the content of every `readme.rst` or `readme.md` file found outside the `omitted_paths` in the targeted repo.
* The case of the readme file name is ignored.
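
The matching rule described above can be sketched in a few lines. This is a minimal illustration, not `warden`'s actual implementation: the function name `missing_sections` and the header lists are hypothetical, and the headers are assumed to have already been extracted from the rendered readme as plain strings.

```python
import re

def missing_sections(headers, patterns):
    """Return the configured patterns that no header text matches (hypothetical helper)."""
    observed = set()
    for header in headers:
        for pattern in patterns:
            if re.search(pattern, header):
                observed.add(pattern)
    # Keep the configured order for the patterns that were never observed.
    return [pattern for pattern in patterns if pattern not in observed]

required_readme_sections = [
    "(Client Library for Azure .*|Microsoft Azure SDK for .*)",
    "Getting Started",
]

# Hypothetical header texts, as they might be extracted from two readmes.
passing_headers = ["Microsoft Azure SDK for Java", "Getting Started", "Contributing"]
failing_headers = ["Microsoft Azure SDK for Java", "Contributing"]

print(missing_sections(passing_headers, required_readme_sections))
print(missing_sections(failing_headers, required_readme_sections))
```

A readme passes only when the returned list is empty; otherwise the list names the patterns that still need a matching header.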

#### Control, the `.docsettings.yml` File, and You

Special cases often need to be configured. It seems logical that there needs to be a central location (per repo) to override conventional settings. To that end, a new `.docsettings.yml` file will be added to each repo.

```
<repo-root>
│ README.md
│ .docsettings.yml
│ .docsettings.yml
└───.azure-pipelines
│ │ <build def>
@@ -126,6 +140,9 @@ omitted_paths:
- archive/*
language: java
root_check_enabled: True
required_readme_sections:
- "(Client Library for Azure .*|Microsoft Azure SDK for .*)"
- Getting Started
```

The above configuration tells `warden`...
@@ -136,6 +153,15 @@ The above configuration tells `warden`...

Possible values for `language` right now are `['net', 'java', 'js', 'python']`. More than one target language is not currently supported.

##### `required_readme_sections` Configuration
This section instructs `warden` to verify that each discovered readme has at least one section title matching every provided `section` pattern. Regex is fully supported.

The two items listed from the example `.docsettings` file will:
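- Match any header that satisfies the regular expression `(Client Library for Azure .*|Microsoft Azure SDK for .*)`
- Match any header containing the text "Getting Started" (patterns are applied with `re.search`, so anchor the pattern as `^Getting Started$` to require an exact title)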

Note that a pattern must be surrounded by quotation marks whenever its regex characters would otherwise break `yml` parsing of the configuration file.
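
To illustrate how the two example patterns behave, the sketch below evaluates them with Python's `re.search`, which is how `enforce_readme_content.py` in this commit applies patterns. The helper name `matches` and the header strings are hypothetical, chosen only for illustration.

```python
import re

def matches(pattern, header):
    """Report whether a pattern is found anywhere in a header (hypothetical helper)."""
    return bool(re.search(pattern, header))

title_pattern = "(Client Library for Azure .*|Microsoft Azure SDK for .*)"
section_pattern = "Getting Started"

# Either alternative of the first pattern satisfies the requirement.
print(matches(title_pattern, "Client Library for Azure Storage"))
print(matches(title_pattern, "Microsoft Azure SDK for Python"))
print(matches(title_pattern, "Azure Storage"))

# re.search looks anywhere in the header, so a longer title also matches.
print(matches(section_pattern, "Getting Started with Storage"))

# Anchoring the pattern restricts it to the exact title.
print(matches("^Getting Started$", "Getting Started with Storage"))
print(matches("^Getting Started$", "Getting Started"))
```

If an exact title is the goal, the anchored form is the safer choice in `required_readme_sections`.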

## Provide Feedback

If you encounter any bugs or have suggestions, please file an issue [here](<https://github.com/Azure/azure-sdk/issues>) and assign to `scbedd`.
If you encounter any bugs or have suggestions, please file an issue [here](https://github.com/Azure/azure-sdk-tools/issues) and assign to `scbedd`.
11 changes: 9 additions & 2 deletions packages/python-packages/doc-warden/setup.py
@@ -1,3 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from setuptools import setup, find_packages
import setuptools

@@ -25,7 +28,7 @@
description=DESCRIPTION,
long_description=long_description,
long_description_content_type='text/markdown',
url='https://github.com/Azure/azure-sdk-tools/packages/python-packages/',
url='https://github.com/Azure/azure-sdk-tools/',
author='Microsoft Corporation',
author_email='[email protected]',

@@ -45,7 +48,11 @@
],
packages=find_packages(),
install_requires = [
'pyyaml',
'pyyaml', # docsettings file parse
'markdown2', # parsing markdown to html
'docutils', # parsing rst to html
'pygments', # docutils uses pygments for parsing rst to html
'beautifulsoup4', # parsing of generated html
'pathlib'
],
entry_points = {
@@ -1,3 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from __future__ import print_function
import argparse
import yaml
@@ -66,6 +69,11 @@ def __init__(self):
        except:
            self.omitted_paths = []

        try:
            self.required_readme_sections = doc['required_readme_sections'] or []
        except:
            self.required_readme_sections = []

        try:
            self.scan_language = args.scan_language or doc['language']
        except:
@@ -88,5 +96,6 @@ def dump(self):
'omitted_paths': self.omitted_paths,
'scan_language': self.scan_language,
'root_check_enabled': self.root_check_enabled,
'verbose_output': self.verbose_output
'verbose_output': self.verbose_output,
'required_readme_sections': self.required_readme_sections
}
35 changes: 14 additions & 21 deletions packages/python-packages/doc-warden/warden/__init__.py
@@ -1,28 +1,21 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from .version import VERSION
from .enforce_readme_presence import *
from .WardenConfiguration import WardenConfiguration

from .enforce_readme_presence import find_missing_readmes
from .enforce_readme_content import verify_readme_content
from .WardenConfiguration import WardenConfiguration
from .warden_common import walk_directory_for_pattern, get_omitted_files
from .cmd_entry import console_entry_point

__all__ = ['WardenConfiguration',
'DEFAULT_LOCATION',
'return_true',
'unrecognized_option',
__all__ = [
'WardenConfiguration',
'find_missing_readmes',
'verify_readme_content',
'console_entry_point',
'scan_repo',
'results',
'check_package_readmes',
'check_python_readmes',
'check_js_readmes',
'check_net_readmes',
'is_net_csproj_package',
'check_java_readmes',
'is_java_pom_package_pom',
'check_repo_root',
'find_alongside_file',
'get_file_sets',
'get_omitted_files',
'walk_directory_for_pattern',
'check_match',
'parse_pom']
'get_omitted_files',
]

__version__ = VERSION
27 changes: 27 additions & 0 deletions packages/python-packages/doc-warden/warden/cmd_entry.py
@@ -0,0 +1,27 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from __future__ import print_function

from .enforce_readme_presence import find_missing_readmes
from .enforce_readme_content import verify_readme_content
from .WardenConfiguration import WardenConfiguration

# CONFIGURATION. ENTRY POINT. EXECUTION.
def console_entry_point():
    cfg = WardenConfiguration()
    print(cfg.dump())

    command_selector = {
        'scan': scan,
    }

    if cfg.command in command_selector:
        command_selector.get(cfg.command)(cfg)
    else:
        print('Unrecognized command invocation {}.'.format(cfg.command))
        exit(1)

def scan(config):
    find_missing_readmes(config)
    verify_readme_content(config)
102 changes: 102 additions & 0 deletions packages/python-packages/doc-warden/warden/enforce_readme_content.py
@@ -0,0 +1,102 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

from __future__ import print_function

import os
import markdown2
import bs4
import re
from .warden_common import check_match, walk_directory_for_pattern, get_omitted_files
from docutils import core
from docutils.writers.html4css1 import Writer,HTMLTranslator

# fnmatch is case insensitive by default, just look for readme rst and md
README_PATTERNS = ['*/readme.md', '*/readme.rst']

# entry point
def verify_readme_content(config):
    all_readmes = walk_directory_for_pattern(config.target_directory, README_PATTERNS)
    omitted_readmes = get_omitted_files(config)
    targeted_readmes = [readme for readme in all_readmes if readme not in omitted_readmes]

    readme_results = []

    for readme in targeted_readmes:
        ext = os.path.splitext(readme)[1]
        if ext == '.rst':
            readme_results.append(verify_rst_readme(readme, config))
        else:
            readme_results.append(verify_md_readme(readme, config))

    results([readme_tuple for readme_tuple in readme_results if readme_tuple[1]], config)

# output results
def results(readmes_with_issues, config):
    if len(readmes_with_issues):
        print('{} readmes have missing required sections.'.format(len(readmes_with_issues)))
        for readme_tuple in readmes_with_issues:
            print(readme_tuple[0].replace(os.path.normpath(config.target_directory), '') + ' is missing headers with pattern(s):')
            for missing_pattern in readme_tuple[1]:
                print(' * {0}'.format(missing_pattern))
        exit(1)

# parse rst to html, check for presence of appropriate sections
def verify_rst_readme(readme, config):
    with open(readme, 'r') as f:
        readme_content = f.read()
    html_readme_content = rst_to_html(readme_content)
    html_soup = bs4.BeautifulSoup(html_readme_content, "html.parser")

    missed_patterns = find_missed_sections(html_soup, config.required_readme_sections)

    return (readme, missed_patterns)

# parse md to html, check for presence of appropriate sections
def verify_md_readme(readme, config):
    with open(readme, 'r') as f:
        readme_content = f.read()
    html_readme_content = markdown2.markdown(readme_content)
    html_soup = bs4.BeautifulSoup(html_readme_content, "html.parser")

    missed_patterns = find_missed_sections(html_soup, config.required_readme_sections)

    return (readme, missed_patterns)

# within the entire readme, are there any missing sections that are expected?
def find_missed_sections(html_soup, patterns):
    headers = html_soup.find_all(re.compile('^h[1-6]$'))
    missed_patterns = []
    observed_patterns = []

    for header in headers:
        observed_patterns.extend(match_regex_set(header, patterns))

    return list(set(patterns) - set(observed_patterns))

# checks a header tag (soup) against a set of configured patterns
def match_regex_set(header, patterns):
    matching_patterns = []
    for pattern in patterns:
        result = re.search(pattern, header.get_text())
        if result:
            matching_patterns.append(pattern)
    return matching_patterns

# boilerplate for translating RST
class HTMLFragmentTranslator(HTMLTranslator):
    def __init__(self, document):
        HTMLTranslator.__init__(self, document)
        self.head_prefix = ['','','','','']
        self.body_prefix = []
        self.body_suffix = []
        self.stylesheet = []
    def astext(self):
        return ''.join(self.body)

html_fragment_writer = Writer()
html_fragment_writer.translator_class = HTMLFragmentTranslator

# utilize boilerplate
def rst_to_html(input_rst):
    return core.publish_string(input_rst, writer = html_fragment_writer)