CSVW-based validation! #142

Merged 35 commits on Sep 23, 2015
Changes from 29 commits
Commits
69405fa
rudimentary CSVW metadata support
Aug 29, 2015
7ee6e80
auto generating features from csvw tests
Aug 29, 2015
079c81b
loading all CSVW tests
Aug 29, 2015
7756ab3
match against all column titles
Aug 29, 2015
aaa2ff0
handle missing fields
Aug 29, 2015
2b49e86
try to turn off Travis CI for now
Aug 29, 2015
241a98a
more support for CSVW
Aug 29, 2015
440260e
support remote schemas
Aug 30, 2015
420e803
fix args for normal schema validation
Aug 30, 2015
f0ea272
default for source_url
Aug 30, 2015
90963dd
refactoring to support property value checking
Aug 30, 2015
4194e99
working through tests
Aug 30, 2015
c05a9c9
more progress against tests
Aug 30, 2015
3a67f2b
further work
Aug 31, 2015
8c5af64
more progress
Aug 31, 2015
45c7e72
fixes to support other tests
Aug 31, 2015
97a8a09
a few fixes, addition of number format matcher
Aug 31, 2015
097ad25
initial support for date formats
Sep 1, 2015
2854c4d
datatype tests passing
Sep 1, 2015
a828a7e
majority of support for foreign keys
Sep 1, 2015
0dbfcaf
all tests passing!
Sep 2, 2015
db2c7b0
now all the tests are passing for good reason!
Sep 2, 2015
5506f41
tidying of docs etc
Sep 2, 2015
74926f7
more README
Sep 2, 2015
05b0b16
fixes to work across Ruby versions
Sep 2, 2015
93d294f
remove odd line of double quotes!
Sep 2, 2015
1e3f32a
update for new tests
Sep 3, 2015
12efa87
duh, not useful to have entry[name] given as name of each test
Sep 3, 2015
412fe85
better command-line test runs
Sep 3, 2015
3740bcb
Modularise Csvw classes
pezholio Sep 9, 2015
173a78e
Fix class level method definition
pezholio Sep 9, 2015
1741498
Move namespaced tests
pezholio Sep 10, 2015
afb9ad3
Tweak instructions
pezholio Sep 10, 2015
b263c7c
Put schema warnings back
pezholio Sep 10, 2015
bacfc8e
Remove explicit return
pezholio Sep 23, 2015
8 changes: 7 additions & 1 deletion .gitignore
@@ -19,4 +19,10 @@ coverage/
/.rspec

.idea
.DS_Store
.DS_Store
features/csvw_validation_tests.feature
features/fixtures/csvw

bin/run-csvw-tests

features/csvw_json_transformation_tests.feature
65 changes: 61 additions & 4 deletions README.md
@@ -154,9 +154,9 @@ There are also information messages available:
## Schema Validation

The library supports validating data against a schema. A schema configuration can be provided as a Hash or parsed from JSON. The structure currently
follows JSON Table Schema with some extensions.
follows JSON Table Schema with some extensions and rudimentary [CSV on the Web Metadata](http://www.w3.org/TR/tabular-metadata/).

An example schema file is:
An example JSON Table Schema schema file is:

{
"fields": [
@@ -178,11 +178,41 @@ An example schema file is:
]
}

Parsing and validating with a schema:
An equivalent CSV on the Web Metadata file is:

schema = Csvlint::Schema.load_from_json_table(uri)
{
"@context": "http://www.w3.org/ns/csvw",
"url": "http://example.com/example1.csv",
"tableSchema": {
"columns": [
{
"name": "id",
"required": true
},
{
"name": "price",
"required": true,
"datatype": { "base": "string", "minLength": 1 }
},
{
"name": "postcode",
"required": true
}
]
}
}

Parsing and validating with a schema (of either kind):

schema = Csvlint::Schema.load_from_json(uri)
validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, schema )
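Since a single loader is described as accepting a schema "of either kind", something has to tell the two formats apart. The following is a minimal sketch of one way that could be done, assuming the presence of the `@context` key marks CSV on the Web metadata and a top-level `fields` key marks JSON Table Schema. The `schema_kind` helper is hypothetical and is not part of the gem's API; the gem's own detection logic may differ.

```ruby
require "json"

# Hypothetical helper (not part of csvlint): guess which schema format a
# JSON document uses by looking at its top-level keys.
def schema_kind(json_text)
  doc = JSON.parse(json_text)
  return :unknown unless doc.is_a?(Hash)

  if doc.key?("@context")
    # CSV on the Web metadata documents declare a JSON-LD context.
    :csvw_metadata
  elsif doc.key?("fields")
    # JSON Table Schema documents list their columns under "fields".
    :json_table_schema
  else
    :unknown
  end
end

csvw = '{"@context": "http://www.w3.org/ns/csvw", "tableSchema": {"columns": []}}'
jts  = '{"fields": [{"name": "id"}]}'

puts schema_kind(csvw)
puts schema_kind(jts)
```

Malformed JSON raises `JSON::ParserError` from `JSON.parse`, which is the same exception the updated `bin/csvlint` rescues when loading a schema.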

### CSV on the Web Validation Support

This gem passes all the validation tests in the [official CSV on the Web test suite](http://w3c.github.io/csvw/tests/) (though there might still be errors or parts of the [CSV on the Web standard](http://www.w3.org/TR/tabular-metadata/) that aren't tested by that test suite).

### JSON Table Schema Support

Supported constraints:

* `required` -- there must be a value for this field in every row
@@ -248,3 +278,30 @@ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, nil, opt
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create new Pull Request

### Testing

The codebase includes both rspec and cucumber tests, which can be run together using:

$ rake

or separately:

$ rake spec
$ rake features

When the cucumber tests are first run, a script will create tests based on the latest version of the [CSV on the Web test suite](http://w3c.github.io/csvw/tests/), including creating a local cache of the test files. This requires an internet connection and some patience. Following that download, the tests will run locally; there's also a batch script:

$ bin/run-csvw-tests

which will run the tests from the command line.

If you need to refresh the CSV on the Web tests:

$ rm bin/run-csvw-tests
$ rm features/csvw_validation_tests.feature
$ rm -r features/fixtures/csvw

and then run the cucumber tests again or:

$ ruby features/support/load_tests.rb
93 changes: 67 additions & 26 deletions bin/csvlint
@@ -16,8 +16,8 @@ opts.on("-d", "--dump-errors", "Pretty print error and warning objects.") do |d|
options[:dump] = d
end

opts.on("-s", "--schema-file FILENAME", "Schema file") do |s|
options[:schema_file] = s
opts.on("-s", "--schema FILENAME", "Schema file") do |s|
options[:schema] = s
end

opts.on_tail("-h", "--help",
@@ -35,14 +35,15 @@ rescue OptionParser::InvalidOption => e
end

def print_error(index, error, dump, color)

location = ""
location += error.row.to_s if error.row
location += "#{error.row ? "," : ""}#{error.column.to_s}" if error.column
if error.row || error.column
location = "#{error.row ? "Row" : "Column"}: #{location}"
end
output_string = "#{index+1}. #{error.type}. #{location}"
output_string = "#{index+1}. #{error.type}"
output_string += ". #{location}" unless location.empty?
output_string += ". #{error.content}" if error.content

if $stdout.tty?
puts output_string.colorize(color)
@@ -56,6 +57,30 @@ def print_error(index, error, dump, color)

end

def validate_csv(source, schema, dump)
validator = Csvlint::Validator.new( source, nil, schema )

if $stdout.tty?
puts "#{source.path || source || "CSV"} is #{validator.valid? ? "VALID".green : "INVALID".red}"
else
puts "#{source.path || source || "CSV"} is #{validator.valid? ? "VALID" : "INVALID"}"
end

if validator.errors.size > 0
validator.errors.each_with_index do |error, i|
print_error(i, error, dump, :red)
end
end

if validator.warnings.size > 0
validator.warnings.each_with_index do |error, i|
print_error(i, error, dump, :yellow)
end
end

return validator.valid?
end

if ARGV.length == 0 && !$stdin.tty?
source = StringIO.new(ARGF.read)
else
@@ -69,42 +94,58 @@ else
exit 1
end
end
else
elsif !options[:schema]
puts "No CSV data to validate."
puts opts
exit 1
end
end

schema = nil
if options[:schema_file]
if options[:schema]
begin
schemafile = File.read( options[:schema_file] )
schema = Csvlint::Schema.load_from_json(options[:schema])
rescue JSON::ParserError => e
output_string = "invalid metadata: malformed JSON"
if $stdout.tty?
puts output_string.colorize(:red)
else
puts output_string
end
exit 1
rescue Csvlint::CsvwMetadataError => e
output_string = "invalid metadata: #{e.message}#{" at " + e.path if e.path}"
if $stdout.tty?
puts output_string.colorize(:red)
else
puts output_string
end
exit 1
rescue Errno::ENOENT
puts "#{options[:schema_file]} not found"
puts "#{options[:schema]} not found"
exit 1
end
schema = Csvlint::Schema.from_json_table(nil, JSON.parse(schemafile))
end

validator = Csvlint::Validator.new( source, nil, schema )

if $stdout.tty?
puts "#{ARGV[0] || "CSV"} is #{validator.valid? ? "VALID".green : "INVALID".red}"
else
puts "#{ARGV[0] || "CSV"} is #{validator.valid? ? "VALID" : "INVALID"}"
end

if validator.errors.size > 0
validator.errors.each_with_index do |error, i|
print_error(i, error, options[:dump], :red)
valid = true
if source.nil?
unless schema.instance_of? Csvlint::CsvwTableGroup
puts "No CSV data to validate."
puts opts
exit 1
end
end

if validator.warnings.size > 0
validator.warnings.each_with_index do |error, i|
print_error(i, error, options[:dump], :yellow)
schema.tables.keys.each do |source|
begin
source = source.sub("file:","")
source = File.new( source )
rescue Errno::ENOENT
puts "#{source} not found"
exit 1
end unless source =~ /^http(s)?/
valid &= validate_csv(source, schema, options[:dump])
end
else
valid = validate_csv(source, schema, options[:dump])
end

exit 1 unless validator.valid?
exit 1 unless valid
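The reworked `print_error` in the diff above only appends the location and content segments when they exist, so errors without a row or column no longer end in a dangling ". ". The sketch below isolates that string-building logic so it can be run on its own. `ErrorMessage` is a hypothetical stand-in struct carrying just the four fields the CLI reads, and `format_error` is an illustrative name; neither is the gem's own class or method.

```ruby
# Hypothetical stand-in for the gem's error objects: only the fields
# that the CLI formatting logic reads.
ErrorMessage = Struct.new(:type, :row, :column, :content)

# Build the one-line summary the same way the new print_error does,
# but return the string instead of printing it.
def format_error(index, error)
  location = ""
  location += error.row.to_s if error.row
  location += "#{error.row ? "," : ""}#{error.column}" if error.column
  if error.row || error.column
    location = "#{error.row ? "Row" : "Column"}: #{location}"
  end

  output = "#{index + 1}. #{error.type}"
  output += ". #{location}" unless location.empty?
  output += ". #{error.content}" if error.content
  output
end

puts format_error(0, ErrorMessage.new(:ragged_rows, 3, nil, nil))
# prints "1. ragged_rows. Row: 3"
puts format_error(1, ErrorMessage.new(:min_length, 2, 2, "5"))
# prints "2. min_length. Row: 2,2. 5"
puts format_error(2, ErrorMessage.new(:invalid_regex, nil, nil, nil))
# prints "3. invalid_regex"
```

Returning the string rather than calling `puts` directly keeps the formatting testable without capturing standard output; the colorizing and TTY check can then stay in the caller.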
2 changes: 2 additions & 0 deletions csvlint.gemspec
@@ -23,6 +23,8 @@ Gem::Specification.new do |spec|
spec.add_dependency "open_uri_redirections"
spec.add_dependency "activesupport"
spec.add_dependency "addressable"
spec.add_dependency "escape_utils"
spec.add_dependency "uri_template"

spec.add_development_dependency "bundler", "~> 1.3"
spec.add_development_dependency "rake"
127 changes: 127 additions & 0 deletions features/csvw_schema_validation.feature
@@ -0,0 +1,127 @@
Feature: CSVW Schema Validation

Scenario: Valid CSV
Given I have a CSV with the following content:
"""
"Bob","1234","[email protected]"
"Alice","5","[email protected]"
"""
And it is stored at the url "http://example.com/example1.csv"
And I have metadata with the following content:
"""
{
"@context": "http://www.w3.org/ns/csvw",
"url": "http://example.com/example1.csv",
"dialect": { "header": false },
"tableSchema": {
"columns": [
{ "name": "Name", "required": true },
{ "name": "Id", "required": true, "datatype": { "base": "string", "minLength": 1 } },
{ "name": "Email", "required": true }
]
}
}
"""
When I ask if there are errors
Then there should be 0 error

Scenario: Schema invalid CSV
Given I have a CSV with the following content:
"""
"Bob","1234","[email protected]"
"Alice","5","[email protected]"
"""
And it is stored at the url "http://example.com/example1.csv"
And I have metadata with the following content:
"""
{
"@context": "http://www.w3.org/ns/csvw",
"url": "http://example.com/example1.csv",
"dialect": { "header": false },
"tableSchema": {
"columns": [
{ "name": "Name", "required": true },
{ "name": "Id", "required": true, "datatype": { "base": "string", "minLength": 3 } },
{ "name": "Email", "required": true }
]
}
}
"""
When I ask if there are errors
Then there should be 1 error

Scenario: CSV with incorrect header
Given I have a CSV with the following content:
"""
"name","id","contact"
"Bob","1234","[email protected]"
"Alice","5","[email protected]"
"""
And it is stored at the url "http://example.com/example1.csv"
And I have metadata with the following content:
"""
{
"@context": "http://www.w3.org/ns/csvw",
"url": "http://example.com/example1.csv",
"tableSchema": {
"columns": [
{ "titles": "name", "required": true },
{ "titles": "id", "required": true, "datatype": { "base": "string", "minLength": 1 } },
{ "titles": "email", "required": true }
]
}
}
"""
When I ask if there are errors
Then there should be 1 error

Scenario: Schema with valid regex
Given I have a CSV with the following content:
"""
"firstname","id","email"
"Bob","1234","[email protected]"
"Alice","5","[email protected]"
"""
And it is stored at the url "http://example.com/example1.csv"
And I have metadata with the following content:
"""
{
"@context": "http://www.w3.org/ns/csvw",
"url": "http://example.com/example1.csv",
"tableSchema": {
"columns": [
{ "titles": "firstname", "required": true, "datatype": { "base": "string", "format": "^[A-Za-z0-9_]*$" } },
{ "titles": "id", "required": true, "datatype": { "base": "string", "minLength": 1 } },
{ "titles": "email", "required": true }
]
}
}
"""
When I ask if there are warnings
Then there should be 0 warnings

Scenario: Schema with invalid regex
Given I have a CSV with the following content:
"""
"firstname","id","email"
"Bob","1234","[email protected]"
"Alice","5","[email protected]"
"""
And it is stored at the url "http://example.com/example1.csv"
And I have metadata with the following content:
"""
{
"@context": "http://www.w3.org/ns/csvw",
"url": "http://example.com/example1.csv",
"tableSchema": {
"columns": [
{ "titles": "firstname", "required": true, "datatype": { "base": "string", "format": "((" } },
{ "titles": "id", "required": true, "datatype": { "base": "string", "minLength": 1 } },
{ "titles": "email", "required": true }
]
}
}
"""
When I ask if there are warnings
Then there should be 1 warnings
And that warning should have the type "invalid_regex"
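The last two scenarios hinge on whether a column's `format` string compiles as a regular expression: a valid pattern yields no warnings, while `((` yields one `invalid_regex` warning. A minimal sketch of that check follows, assuming Ruby's own `Regexp` engine is used to compile the pattern. The helper names `format_warnings` and `matches_format?` are hypothetical, and the hash-shaped warning is an illustration rather than the gem's warning object.

```ruby
# Hypothetical helper: return a list of warnings for a "format" pattern.
# An uncompilable pattern produces a single :invalid_regex warning.
def format_warnings(pattern)
  Regexp.new(pattern)
  []
rescue RegexpError => e
  [{ type: :invalid_regex, content: e.message }]
end

# Hypothetical helper: test a cell value against a pattern, requiring the
# whole value to match (an assumption about the intended semantics).
def matches_format?(value, pattern)
  Regexp.new("\\A(?:#{pattern})\\z").match?(value)
end

valid_pattern = "^[A-Za-z0-9_]*$"

p format_warnings(valid_pattern)          # no warnings expected
p format_warnings("((").map { |w| w[:type] }
p matches_format?("Bob", valid_pattern)
p matches_format?("Bob Smith", valid_pattern)
```

Compiling the pattern once when the schema is loaded, rather than per cell, means a bad pattern is reported a single time, which matches the "1 warnings" expectation in the scenario.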