Merlin 0.4.0 #129
Merged
Conversation
Ensure that media types are added to output filenames.
… migrate generate command (#79)
* Set crawl and migrate limits using an optional runtime flag.
* Documentation for -l flag for Crawler.
* Documentation for -l limit flag on migrate generate command.
* Refactor "-l/--limit" to "-n/--number".
* Revert "Refactor "-l/--limit" to "-n/--number"". This reverts commit 8b03d72.
* Make runWeb/RunXml consistent. Remove requirement for urls/url_file in XML parsing.
* Tweak docs.
Resolve issue where errors were only reported once.
* Adds support for inline links via linkit.
* Improved test coverage for the linkit plugin.
* Multiple selectors for config map: introduces multiple selectors and multiple target fields per mapping, work towards #40, allowing you to re-use mappings (see the sketch below).
* phpcs.
* Remove debug print.
* Multiple selectors documentation.
* Tests for various multiple selector and field modes.
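The changelog does not include a config excerpt, so the following is only a minimal sketch of what a multi-selector, multi-field mapping could look like in a Merlin YAML config. The key names (fields, field, selector, type) are assumptions for illustration and are not confirmed against Merlin's actual schema.

```yaml
# Hypothetical sketch only; key names are illustrative, not Merlin's confirmed schema.
fields:
  -
    field:                    # multiple target fields handled by one mapping
      - field_summary
      - field_body
    selector:                 # multiple selectors, letting one mapping be re-used
      - ".article__summary"
      - ".article__body"
    type: long_text
```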
* Add crawler test.
* Add group testing:
  - Adds tear down to remove output files, removing the chance of false positives.
  - Adds group testing.
  - Adds additional pages to the test server.
* Use sys_get_temp_dir().
* Remove cleanup to debug CI.
* Add sleep and run the PHP server from the right directory.
* Strip trailing slash, if it exists, from the domain for the cache store dir.
* WIP but functional spider-crawler cache, work towards #74. The spider-crawler now caches results for much faster reruns. Note this introduces breaking changes in the Fetcher cache, which now stores and expects JSON data rather than just the HTML content; this allows extra info to be stored along with the content and is needed for the spider-crawler cache.
* A missing ingredient.
* Resolved URL count.
* Added no-cache runtime flag.
* Code standards cleanup.
* cache_enabled flag docs (see the sketch below).
* Fixes cache write before grouping logic.
* Detect URLs with duplicate content in the crawler-spider.
* Crawler duplicate flag docs.
* Remove newline.
* Refactoring to allow a common PHP server base for tests.
* Crawler tests using the common PHP server class.
* phpdoc fixes.
* Cache update to allow recursive removal of a domain's cached content.
* Crawler tests based on the common PHP server, including caching and duplicate options.
* Clear output dirs on runs.
* Stop server on start to reset if running.
* Move sleep, increase it.
* Re-add HTML pages.
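Several of these items touch the new caching and duplicate-detection behaviour. A minimal sketch of how the related options might appear in a crawler config follows; cache_enabled is named in the changelog above, while the nesting under options and the detect_duplicates key name are assumptions for illustration only.

```yaml
# Sketch only. cache_enabled is named in the changelog; the nesting under
# "options" and the detect_duplicates key name are assumptions.
domain: https://www.example.com
options:
  cache_enabled: true       # re-use cached fetch results for faster reruns
  detect_duplicates: true   # assumed name for the crawler duplicate-content flag
```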
* Support starting points for crawler.
* Documentation for starting URLs.
Provide tests for the urls option in crawler config [#83]
Expose a number of crawler configuration options so they can be controlled via the crawler configuration YAML file:
- Adds timeout
- Adds connect_timeout
- Adds verify
- Adds cookies
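A minimal sketch of how these options and the urls starting points could sit in the crawler YAML; the option names come from the changelog, but the exact nesting and the example values are assumptions rather than a confirmed Merlin config.

```yaml
# Sketch only: option names are from the changelog; nesting and values are assumptions.
domain: https://www.example.com
urls:                     # starting points for the crawl
  - /
  - /news
options:
  timeout: 10             # request timeout (assumed to be seconds)
  connect_timeout: 5
  verify: false           # TLS certificate verification
  cookies: true
```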
Update crawler configuration options.
* Ensure filenames are decoded.
* Fix phpcs.
* Use correct urls key for crawled URLs.
* Fix tests.
Use Guzzle to resolve relative URLs.
* Exclude external media assets by default.
* Added entity_type requirement on spider.
#116)
* Functionality and tests for providing urls_file as an array.
* Documentation for multiple urls_files.
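The changelog does not show the config shape, but a plausible sketch of urls_file accepting an array of files is below; the file paths and surrounding structure are illustrative assumptions.

```yaml
# Sketch only: urls_file as an array of files (paths and nesting are assumptions).
urls_file:
  - /tmp/crawled-urls-news.yml
  - /tmp/crawled-urls-articles.yml
```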
…e changes (#120)
* Implementation of saving redirect source and destination results for crawler and fetcher.
* Change from collection to arrays for queues (#115).
* Check for UTF-8 encoding issues in HTML content prior to JSON encoding for the cache.
* Added support for custom boolean yes/no values.
* Provides cache purge for a single URL or domain from the migrate command line.
* Change default fetcher to multicurl.
* Update tests to account for redirects.
* Rejigged redirect support: raw headers and redirect info are now also cached, so the redirect reports can be built from cache for crawler and fetcher.
* Slight change in status code behaviour.
* Migrate reporting, work towards #109; improves #93.
* Split duplicate log by entity_type.
* Added better xpath support for the ordered type.
* Added HTML template for the report.
* Support for port in host name.
* Reporting and Cache Tools documentation.
* Split exclusion/inclusion lists by crawl and URL output.
* Allow processors to alter the UUID value.
* Alter output of include/exclude warnings.
* Improved options for reporting media: can use a list of files to merge or wildcard patterns.
* Updated inflateMappings to remove the case of a selector array, now handled by TypeBase. Fixes #110.
* Adds new url_options to strip script tags or other patterns from raw HTML using regex (see the sketch below).
* Ignore case when building menu UUID.
* Moves crawler redirect detection outside of cache enabled.
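The url_options stripping mentioned above is regex-based. Here is a minimal sketch under the assumption that patterns are listed under a key inside url_options; the strip_patterns name is hypothetical, not Merlin's confirmed key.

```yaml
# Sketch only: url_options is named in the changelog, but strip_patterns and
# the pattern format are assumptions for illustration.
url_options:
  strip_patterns:
    - '<script[\s\S]*?</script>'   # drop script tags from raw HTML before parsing
```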
* Prepare merlin for the wild:
  - Adds some niceness to the Composer file.
  - Updates the namespace to Merlin.
  - Renames the migrate executable to merlin.
* Update README.md.
* Add licenses for dependencies.