Merlin 0.4.0 #129

Merged
merged 36 commits into from
Apr 9, 2020

Conversation

stooit
Contributor

@stooit stooit commented Apr 9, 2020

No description provided.

steveworley and others added 30 commits September 4, 2019 13:28
Ensure that media types are added to output filenames.
… migrate generate command (#79)

* Set crawl and migrate limits using an optional runtime flag

* Documentation for -l flag for Crawler

* Documentation for -l limit flag on migrate generate command

* Refactor "-l/--limit" to "-n/--number"

* Revert "Refactor "-l/--limit" to "-n/--number""

This reverts commit 8b03d72.

* Make runWeb/RunXml consistent.
Remove requirement for urls/url_file in XML parsing.

* Tweak docs.
Resolve issue where errors were only reported once.
* Adds support for inline links via linkit.

* Improved test coverage for linkit plugin.
* Multiple selectors for config map

Introduces multiple selectors and multiple target fields per mapping, allowing mappings to be reused. Work towards #40.

* phpcs

* Remove debug print

* Multiple selectors documentation

* Tests for various multiple selector and field modes
* Add crawler test.

* Add group testing.

- Adds a teardown step that removes output files, reducing the chance of false positives.
- Adds group testing.
- Adds additional pages to the test server.

* Use sys_get_temp_dir().

* Remove cleanup to debug CI

* Add sleep and start the PHP server from the right directory.

* Strip the trailing slash from the domain, if present, for the cache store dir

* WIP but functional spider-crawler cache

Work towards #74. The spider-crawler now caches results for much faster reruns. Note that this introduces a breaking change in the Fetcher cache, which now stores and expects JSON data rather than just the HTML content. This allows extra information to be stored alongside the content, which the spider-crawler cache needs.

* A missing ingredient

* Resolved URL count.

* Added no-cache runtime flag.

* Code standards cleanup.

* cache_enabled flag docs

* Fixes cache write before grouping logic

* Detect urls with duplicate content in crawler-spider

* crawler duplicate flag docs

* Remove newline

* Refactoring to allow common php server base for tests

* Crawler tests using common php server class

* phpdoc fixes

* Cache update to allow recursive removal of a domain's cached content

* Crawler tests based on common php server including caching and duplicate options

* clear output dirs on runs

* Stop server on start to reset if running

* Move sleep, increase

* Re-add HTML pages
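
To make the cache change above concrete, here is a minimal sketch of a JSON cache entry that stores metadata next to the page content, and of spotting URLs with duplicate content by comparing a content hash. It is written in Python for illustration only and is not Merlin's PHP implementation: the field names (`url`, `contents`, `hash`), the file naming scheme, and the helper names are all assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def cache_dir_for(domain: str, root: Path) -> Path:
    """Return the cache directory for a domain, ignoring a trailing slash."""
    trimmed = domain.rstrip("/")
    # Fall back to the raw value when no scheme is present (e.g. "example.com").
    host = urlparse(trimmed).netloc or trimmed
    return root / host


def write_entry(root: Path, domain: str, url: str, html: str) -> Path:
    """Store the page as JSON so extra information can sit next to the content."""
    directory = cache_dir_for(domain, root)
    directory.mkdir(parents=True, exist_ok=True)
    entry = {
        "url": url,
        "contents": html,
        # Hypothetical field: a content hash used to spot duplicate pages.
        "hash": hashlib.sha256(html.encode("utf-8")).hexdigest(),
    }
    # Hypothetical naming scheme: one file per URL, keyed by the URL's hash.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json"
    path = directory / name
    path.write_text(json.dumps(entry), encoding="utf-8")
    return path


def find_duplicates(root: Path, domain: str) -> dict[str, list[str]]:
    """Group cached URLs by content hash; keep groups with more than one URL."""
    groups: dict[str, list[str]] = {}
    for path in sorted(cache_dir_for(domain, root).glob("*.json")):
        entry = json.loads(path.read_text(encoding="utf-8"))
        groups.setdefault(entry["hash"], []).append(entry["url"])
    return {h: sorted(urls) for h, urls in groups.items() if len(urls) > 1}


if __name__ == "__main__":
    demo_root = Path(tempfile.mkdtemp())
    write_entry(demo_root, "https://example.com/", "https://example.com/a", "<p>same</p>")
    write_entry(demo_root, "https://example.com", "https://example.com/b", "<p>same</p>")
    print(find_duplicates(demo_root, "https://example.com"))
```

Because an entry is a JSON object rather than a bare HTML string, a reader of the old cache format would fail on these files, which is the kind of breaking change the commit message warns about.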
* Support starting points for crawler
* Documentation for starting URLs
Provide tests for the urls option in crawler config [#83]
Expose a number of crawler configuration options so they can be controlled via the crawler configuration yaml file.

- Adds timeout
- Adds connect_timeout
- Adds verify
- Adds cookies
Update crawler configuration options.
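
As a rough illustration of how the four options listed above might be layered over defaults once the crawler YAML has been parsed, consider the sketch below. Only the option names (`timeout`, `connect_timeout`, `verify`, `cookies`) come from the commit message. The default values, the `options` key, and the `resolve_options` helper are assumptions made for this example, and the code is Python rather than Merlin's PHP.

```python
from typing import Any

# Option names come from the commit message; the default values are
# assumptions for this sketch, not Merlin's actual defaults.
DEFAULT_OPTIONS: dict[str, Any] = {
    "timeout": 30,
    "connect_timeout": 10,
    "verify": True,
    "cookies": False,
}


def resolve_options(config: dict[str, Any]) -> dict[str, Any]:
    """Merge options from a parsed config over the defaults.

    `config` stands in for the parsed crawler YAML; the "options" key
    is a hypothetical location for these settings.
    """
    supplied = config.get("options", {})
    unknown = set(supplied) - set(DEFAULT_OPTIONS)
    if unknown:
        # Fail early so a misspelled key is not silently ignored.
        raise ValueError(f"Unknown crawler option(s): {sorted(unknown)}")
    return {**DEFAULT_OPTIONS, **supplied}
```

Rejecting unknown keys is a design choice of this sketch, not documented Merlin behaviour; it keeps typos in a config file from being silently dropped.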
* Ensure filenames are decoded.

* Fix phpcs.
* Use correct urls key for crawled urls.

* Fix tests.
* Exclude external media assets by default.
* Added entity_type requirement on spider.
stooit and others added 6 commits November 7, 2019 13:59
#116)

* Functionality and tests for providing urls_file as an array
* Documentation for multiple urls_files
…e changes (#120)

* Implementation of saving redirect source and destination results for crawler and fetcher
* Change from collection to arrays for queues #115
* Check for UTF-8 encoding issues in HTML content before JSON-encoding it for the cache
* Added support for custom boolean yes/no values.
* Provides cache purge for a single URL or domain from the migrate command line
* Change default fetcher to multicurl
* Update tests to account for redirects
* Rejigged redirect support. Raw headers and redirect info are now also cached, so the redirect reports can be built from cache for crawler and fetcher.
* Slight change in status code behaviour
* Migrate reporting: works towards #109 and improves #93
* Split duplicate log by entity_type.
* Added better xpath support for ordered type.
* Added html template for report
* Support for port in host name
* Reporting and Cache Tools documentation
* Split exclusion/inclusion lists by crawl and url output.
* Allow processors to alter UUID value.
* Alter output of include/exclude warnings.
* Improved options for reporting media: can use a list of files to merge or wildcard patterns.
* Updated inflateMappings to remove case of selector array, now handled by TypeBase.  Fixes #110.
* Adds new url_options to strip script tags or other patterns from raw HTML using regex
* Ignore case when building menu uuid.
* Moves crawler redirect detection outside of cache enabled
* Prepare merlin for the wild.
- Adds some niceness to the Composer file.
- Updates the namespace to Merlin.
- Renames the migrate executable to merlin.
* Update README.md
* Add licenses for dependencies.
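
The `url_options` entry in the list above, which strips script tags or other patterns from raw HTML using regex, can be pictured with the minimal Python sketch below. The pattern and the `strip_patterns` helper are assumptions for illustration and do not reflect Merlin's actual option names or PHP code. Regex-based stripping is also only an approximation of real HTML parsing.

```python
import re

# Assumed pattern: matches a <script> element and its body, across lines
# and regardless of tag case.
SCRIPT_TAG = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_patterns(html: str, patterns: list[re.Pattern[str]]) -> str:
    """Remove every match of each pattern from the raw HTML, in order."""
    for pattern in patterns:
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    raw = '<p>Hi</p><script type="text/javascript">alert(1);</script><p>Bye</p>'
    print(strip_patterns(raw, [SCRIPT_TAG]))
```

Passing a list of patterns mirrors the "script tags or other patterns" wording: extra expressions, for example one matching HTML comments, can be appended without changing the helper.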
@stooit stooit merged commit 1fe152d into master Apr 9, 2020
4 participants