Merge pull request #142 from theodi/feature-csvw-validation
CSVW-based validation!
pezholio committed Sep 23, 2015
2 parents 0d98001 + bacfc8e commit 50047ff
Showing 29 changed files with 3,261 additions and 122 deletions.
8 changes: 7 additions & 1 deletion .gitignore
@@ -19,4 +19,10 @@ coverage/
/.rspec

.idea
.DS_Store
.DS_Store
features/csvw_validation_tests.feature
features/fixtures/csvw

bin/run-csvw-tests

features/csvw_json_transformation_tests.feature
129 changes: 93 additions & 36 deletions README.md
@@ -31,13 +31,13 @@ You can either use this gem within your own Ruby code, or as a standolone comman
After installing the gem, you can validate a CSV on the command line like so:

csvlint myfile.csv

You will then see the validation result, together with any warnings or errors e.g.

```
myfile.csv is INVALID
1. blank_rows. Row: 3
1. title_row.
1. title_row.
2. inconsistent_values. Column: 14
```

@@ -50,40 +50,40 @@ You can also optionally pass a schema file like so:
Currently the gem supports retrieving a CSV accessible from a URL, File, or an IO-style object (e.g. StringIO)

require 'csvlint'

validator = Csvlint::Validator.new( "http://example.org/data.csv" )
validator = Csvlint::Validator.new( File.new("/path/to/my/data.csv" ))
validator = Csvlint::Validator.new( StringIO.new( my_data_in_a_string ) )

When validating from a URL the range of errors and warnings is wider as the library will also check HTTP headers for
When validating from a URL the range of errors and warnings is wider as the library will also check HTTP headers for
best practices
#invoke the validation

#invoke the validation
validator.validate

#check validation status
validator.valid?

#access array of errors, each is a Csvlint::ErrorMessage object
validator.errors

#access array of warnings
validator.warnings

#access array of information messages
validator.info_messages

#get some information about the CSV file that was validated
validator.encoding
validator.content_type
validator.extension

#retrieve HTTP headers from request
validator.headers

## Controlling CSV Parsing

The validator supports configuration of the [CSV Dialect](http://dataprotocols.org/csv-dialect/) used in a data file. This is specified by
The validator supports configuration of the [CSV Dialect](http://dataprotocols.org/csv-dialect/) used in a data file. This is specified by
passing a dialect hash to the constructor:

dialect = {
@@ -94,17 +94,17 @@ passing a dialect hash to the constructor:

The options should be a Hash that conforms to the [CSV Dialect](http://dataprotocols.org/csv-dialect/) JSON structure.

While these options configure the parser to correctly process the file, the validator will still raise errors or warnings for CSV
While these options configure the parser to correctly process the file, the validator will still raise errors or warnings for CSV
structure that it considers to be invalid, e.g. a missing header or different delimiters.

Note that the parser will also check for a `header` parameter on the `Content-Type` header returned when fetching a remote CSV file. As
Note that the parser will also check for a `header` parameter on the `Content-Type` header returned when fetching a remote CSV file. As
specified in [RFC 4180](http://www.ietf.org/rfc/rfc4180.txt) the values for this can be `present` and `absent`, e.g:

Content-Type: text/csv; header=present
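
The `header` parameter check described above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the gem's implementation: `header_param` is a hypothetical helper name, and it simply splits the `Content-Type` value on semicolons and accepts only `present` or `absent`.

```ruby
# Hypothetical helper (not part of csvlint): extracts the `header` parameter
# from a Content-Type value such as "text/csv; header=present".
def header_param(content_type)
  return nil if content_type.nil?

  params = {}
  # Parameters follow the media type, separated by semicolons.
  content_type.split(";").drop(1).each do |pair|
    key, value = pair.strip.split("=", 2)
    next if key.nil? || value.nil?

    # Strip optional quotes and normalise case before comparing.
    params[key.downcase] = value.delete('"').downcase
  end

  value = params["header"]
  %w[present absent].include?(value) ? value.to_sym : nil
end

header_param("text/csv; header=present") # => :present
header_param("text/csv; charset=utf-8")  # => nil
```

Returning `nil` for a missing or unrecognised value leaves the caller free to fall back to the dialect settings.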

## Error Reporting

The validator provides feedback on a validation result using instances of `Csvlint::ErrorMessage`. Errors are divided into errors, warnings and information
The validator provides feedback on a validation result using instances of `Csvlint::ErrorMessage`. Errors are divided into errors, warnings and information
messages. A validation attempt is successful if there are no errors.

Messages provide context including:
@@ -122,7 +122,7 @@ The following types of error can be reported:
* `:wrong_content_type` -- content type is not `text/csv`
* `:ragged_rows` -- row has a different number of columns (than the first row in the file)
* `:blank_rows` -- completely empty row, e.g. blank line or a line where all column values are empty
* `:invalid_encoding` -- encoding error when parsing row, e.g. because of invalid characters
* `:invalid_encoding` -- encoding error when parsing row, e.g. because of invalid characters
* `:not_found` -- HTTP 404 error when retrieving the data
* `:stray_quote` -- missing or stray quote
* `:unclosed_quote` -- unclosed quoted field
@@ -153,36 +153,66 @@ There are also information messages available:

## Schema Validation

The library supports validating data against a schema. A schema configuration can be provided as a Hash or parsed from JSON. The structure currently
follows JSON Table Schema with some extensions.
The library supports validating data against a schema. A schema configuration can be provided as a Hash or parsed from JSON. The structure currently
follows JSON Table Schema with some extensions and rudimentary [CSV on the Web Metadata](http://www.w3.org/TR/tabular-metadata/).

An example schema file is:
An example JSON Table Schema file is:

{
"fields": [
{
"name": "id",
"constraints": { "required": true }
{
"name": "id",
"constraints": { "required": true }
},
{
"name": "price",
"constraints": { "required": true, "minLength": 1 }
{
"name": "price",
"constraints": { "required": true, "minLength": 1 }
},
{
"name": "postcode",
"constraints": {
"required": true,
"pattern": "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}"
}
{
"name": "postcode",
"constraints": {
"required": true,
"pattern": "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}"
}
}
]
}

Parsing and validating with a schema:
An equivalent CSV on the Web Metadata file is:

{
"@context": "http://www.w3.org/ns/csvw",
"url": "http://example.com/example1.csv",
"tableSchema": {
"columns": [
{
"name": "id",
"required": true
},
{
"name": "price",
"required": true,
"datatype": { "base": "string", "minLength": 1 }
},
{
"name": "postcode",
"required": true
}
]
}
}

schema = Csvlint::Schema.load_from_json_table(uri)
Parsing and validating with a schema (of either kind):

schema = Csvlint::Schema.load_from_json(uri)
validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, schema )
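
To make the constraints in the example schema concrete, here is a small stdlib-only sketch that applies `required`, `minLength` and `pattern` to a single row. It is an illustration, not how csvlint validates: `check_row` is a hypothetical helper, the `:min_length` and `:pattern` labels are chosen for this example, and anchoring the pattern to the whole value is an assumption.

```ruby
require "json"

# The JSON Table Schema example from the README, parsed into a Hash.
schema = JSON.parse(<<~SCHEMA)
  {
    "fields": [
      { "name": "id", "constraints": { "required": true } },
      { "name": "price", "constraints": { "required": true, "minLength": 1 } },
      { "name": "postcode", "constraints": { "required": true, "pattern": "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}" } }
    ]
  }
SCHEMA

# Hypothetical checker: returns [field_name, problem] pairs for one row (a Hash).
def check_row(schema, row)
  schema["fields"].flat_map do |field|
    name = field["name"]
    constraints = field.fetch("constraints", {})
    value = row[name].to_s
    problems = []
    problems << :missing_value if constraints["required"] && value.empty?
    unless value.empty?
      min = constraints["minLength"]
      problems << :min_length if min && value.length < min
      pattern = constraints["pattern"]
      # Assumption: the whole value has to match, so the pattern is anchored.
      problems << :pattern if pattern && value !~ /\A(?:#{pattern})\z/
    end
    problems.map { |problem| [name, problem] }
  end
end

good = check_row(schema, "id" => "1", "price" => "9.99", "postcode" => "EC2A 4JE")
bad  = check_row(schema, "id" => "",  "price" => "9.99", "postcode" => "12345")
# good is empty; bad reports the empty id and the non-matching postcode.
```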

### CSV on the Web Validation Support

This gem passes all the validation tests in the [official CSV on the Web test suite](http://w3c.github.io/csvw/tests/) (though there might still be errors or parts of the [CSV on the Web standard](http://www.w3.org/TR/tabular-metadata/) that aren't tested by that test suite).

### JSON Table Schema Support

Supported constraints:

* `required` -- there must be a value for this field in every row
@@ -192,7 +222,7 @@ Supported constraints:
* `pattern` -- values must match the provided regular expression
* `type` -- specifies an XML Schema data type. Values of the column must be a valid value for that type
* `minimum` -- specify a minimum range for values, the value will be parsed as specified by `type`
* `maximum` -- specify a maximum range for values, the value will be parsed as specified by `type`
* `maximum` -- specify a maximum range for values, the value will be parsed as specified by `type`
* `datePattern` -- specify a `strftime` compatible date pattern to be used when parsing date values and min/max constraints

Supported data types (this is still a work in progress):
@@ -214,7 +244,7 @@ Supported data types (this is still a work in progress):
* Time -- `http://www.w3.org/2001/XMLSchema#time`

Use of an unknown data type will result in the column failing to validate.

Schema validation provides some additional types of error and warning messages:

* `:missing_value` (error) -- a column marked as `required` in the schema has no value
@@ -248,3 +278,30 @@ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, nil, opt
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create new Pull Request

### Testing

The codebase includes both rspec and cucumber tests, which can be run together using:

$ rake

or separately:

$ rake spec
$ rake features

When the cucumber tests are first run, a script will create tests based on the latest version of the [CSV on the Web test suite](http://w3c.github.io/csvw/tests/), including creating a local cache of the test files. This requires an internet connection and some patience. Following that download, the tests will run locally; there's also a batch script:

$ bin/run-csvw-tests

which will run the tests from the command line.

If you need to refresh the CSV on the Web tests:

$ rm bin/run-csvw-tests
$ rm features/csvw_validation_tests.feature
$ rm -r features/fixtures/csvw

and then run the cucumber tests again or:

$ ruby features/support/load_tests.rb
95 changes: 68 additions & 27 deletions bin/csvlint
@@ -16,8 +16,8 @@ opts.on("-d", "--dump-errors", "Pretty print error and warning objects.") do |d|
options[:dump] = d
end

opts.on("-s", "--schema-file FILENAME", "Schema file") do |s|
options[:schema_file] = s
opts.on("-s", "--schema FILENAME", "Schema file") do |s|
options[:schema] = s
end
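
A quick way to see the effect of the renamed option is to run the same two `opts.on` declarations against a sample argument list. This is a sketch using Ruby's standard `OptionParser`; the `parse_args` wrapper and the file names are hypothetical, and only the two options visible in this hunk are included.

```ruby
require "optparse"

# Hypothetical wrapper around the option declarations shown in the hunk above.
def parse_args(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on("-d", "--dump-errors", "Pretty print error and warning objects.") do |d|
      options[:dump] = d
    end
    opts.on("-s", "--schema FILENAME", "Schema file") do |s|
      options[:schema] = s
    end
  end
  # parse returns the non-option arguments, e.g. the CSV path.
  remaining = parser.parse(argv)
  [options, remaining]
end

options, rest = parse_args(["--schema", "schema.json", "data.csv"])
# options[:schema] holds "schema.json"; rest holds ["data.csv"]
```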

opts.on_tail("-h", "--help",
@@ -35,14 +35,15 @@ rescue OptionParser::InvalidOption => e
end

def print_error(index, error, dump, color)

location = ""
location += error.row.to_s if error.row
location += "#{error.row ? "," : ""}#{error.column.to_s}" if error.column
if error.row || error.column
location = "#{error.row ? "Row" : "Column"}: #{location}"
end
output_string = "#{index+1}. #{error.type}. #{location}"
output_string = "#{index+1}. #{error.type}"
output_string += ". #{location}" unless location.empty?
output_string += ". #{error.content}" if error.content

if $stdout.tty?
puts output_string.colorize(color)
@@ -56,55 +57,95 @@ def print_error(index, error, dump, color)

end
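
The string-building part of `print_error` can be exercised on its own. The sketch below copies that logic into a hypothetical `format_error` method and uses a `Struct` named `FakeError` as a stand-in for `Csvlint::ErrorMessage`, assuming only the `type`, `row`, `column` and `content` readers used in the diff. It shows that an error without a location no longer ends in a dangling `". "`.

```ruby
# Stand-in for Csvlint::ErrorMessage with only the fields the formatter reads.
FakeError = Struct.new(:type, :row, :column, :content)

# Hypothetical extraction of the formatting logic, without printing or colour.
def format_error(index, error)
  location = ""
  location += error.row.to_s if error.row
  location += "#{error.row ? "," : ""}#{error.column}" if error.column
  location = "#{error.row ? "Row" : "Column"}: #{location}" if error.row || error.column

  output = "#{index + 1}. #{error.type}"
  output += ". #{location}" unless location.empty?
  output += ". #{error.content}" if error.content
  output
end

format_error(0, FakeError.new(:blank_rows, 3, nil, nil))           # => "1. blank_rows. Row: 3"
format_error(1, FakeError.new(:title_row, nil, nil, nil))          # => "2. title_row"
format_error(2, FakeError.new(:inconsistent_values, nil, 14, nil)) # => "3. inconsistent_values. Column: 14"
```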

def validate_csv(source, schema, dump)
validator = Csvlint::Validator.new( source, nil, schema )

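# String sources (URLs) and StringIO have no #path, so guard the call
name = source.respond_to?(:path) ? source.path : (source.is_a?(String) ? source : "CSV")
if $stdout.tty?
puts "#{name} is #{validator.valid? ? "VALID".green : "INVALID".red}"
else
puts "#{name} is #{validator.valid? ? "VALID" : "INVALID"}"
end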

if validator.errors.size > 0
validator.errors.each_with_index do |error, i|
print_error(i, error, dump, :red)
end
end

if validator.warnings.size > 0
validator.warnings.each_with_index do |error, i|
print_error(i, error, dump, :yellow)
end
end

return validator.valid?
end

if ARGV.length == 0 && !$stdin.tty?
source = StringIO.new(ARGF.read)
else
if ARGV[0]
source = ARGV[0]
unless source =~ /^http(s)?/
begin
source = File.new( source ) unless source =~ /^http(s)?/
source = File.new( source ) unless source =~ /^http(s)?/
rescue Errno::ENOENT
puts "#{source} not found"
exit 1
end
end
else
elsif !options[:schema]
puts "No CSV data to validate."
puts opts
exit 1
end
end

schema = nil
if options[:schema_file]
if options[:schema]
begin
schemafile = File.read( options[:schema_file] )
schema = Csvlint::Schema.load_from_json(options[:schema])
rescue JSON::ParserError => e
output_string = "invalid metadata: malformed JSON"
if $stdout.tty?
puts output_string.colorize(:red)
else
puts output_string
end
exit 1
rescue Csvlint::Csvw::MetadataError => e
output_string = "invalid metadata: #{e.message}#{" at " + e.path if e.path}"
if $stdout.tty?
puts output_string.colorize(:red)
else
puts output_string
end
exit 1
rescue Errno::ENOENT
puts "#{options[:schema_file]} not found"
puts "#{options[:schema]} not found"
exit 1
end
schema = Csvlint::Schema.from_json_table(nil, JSON.parse(schemafile))
end

validator = Csvlint::Validator.new( source, nil, schema )

if $stdout.tty?
puts "#{ARGV[0] || "CSV"} is #{validator.valid? ? "VALID".green : "INVALID".red}"
else
puts "#{ARGV[0] || "CSV"} is #{validator.valid? ? "VALID" : "INVALID"}"
end

if validator.errors.size > 0
validator.errors.each_with_index do |error, i|
print_error(i, error, options[:dump], :red)
valid = true
if source.nil?
unless schema.instance_of? Csvlint::Csvw::TableGroup
puts "No CSV data to validate."
puts opts
exit 1
end
end

if validator.warnings.size > 0
validator.warnings.each_with_index do |error, i|
print_error(i, error, options[:dump], :yellow)
schema.tables.keys.each do |source|
begin
source = source.sub("file:","")
source = File.new( source )
rescue Errno::ENOENT
puts "#{source} not found"
exit 1
end unless source =~ /^http(s)?/
valid &= validate_csv(source, schema, options[:dump])
end
else
valid = validate_csv(source, schema, options[:dump])
end

exit 1 unless validator.valid?
exit 1 unless valid
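
The `valid &= validate_csv(...)` accumulation is worth a note: `&=` always evaluates its right-hand side, so every table is still validated and reported after the first failure, while the final status remains false. A minimal sketch, with a hypothetical lambda and made-up table names standing in for `validate_csv` and the schema's tables:

```ruby
# Hypothetical per-table outcomes standing in for validate_csv results.
tables = { "a.csv" => true, "b.csv" => false, "c.csv" => true }

calls = []
validate = lambda do |name, outcome|
  calls << name # record that this table was actually validated
  outcome
end

valid = true
tables.each do |name, outcome|
  # `valid = valid & ...` evaluates the call even when valid is already false.
  valid &= validate.call(name, outcome)
end

exit_status = valid ? 0 : 1
# All three tables were visited, and the overall status is a failure.
```

Using `&&=` instead would skip the call once `valid` is false, which would hide errors in later tables.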