Stop auto-promoting column-types #30

cjprybol · 2017-03-10T19:00:25Z

I was hitting issues during the rebase and thought it would be smarter to just branch from master, dump the edits in, and push to overwrite the old branch. Looks like that was a bad idea and the prior PR was automatically closed.

To summarize some of the features discussed earlier and still present in this PR:

Automatic promotion of columns to nullable arrays no longer occurs
all constructors will try and recycle inputs as a byproduct of consolidating the constructors
new functions nullify!, nullify, denullify!, and denullify have been added for converting DataTables back and forth from nullable and nullfree states.

#24

nalimilan

Still need to review the rest of the tests, but this should cover the most substantive parts.

nalimilan · 2017-03-10T20:53:18Z

src/abstractdatatable/abstractdatatable.jl

+    if length(uniqueheaders) == 0
+        return DataTable()
+    elseif length(unique(map(length, uniqueheaders))) > 1
+        throw(ArgumentError("not all DataTables have the same number of columns. Resolve column(s): $(setdiff(union(allheaders...), intersect(allheaders...)))"))


"Resolve" might not be completely clear. Maybe "unmatched" (sorry, not a native speaker)?

Also it would still be interesting to print the number of columns of each table, since that's what the message says.

nalimilan · 2017-03-10T20:53:33Z

src/abstractdatatable/abstractdatatable.jl

+    elseif length(unique(map(length, uniqueheaders))) > 1
+        throw(ArgumentError("not all DataTables have the same number of columns. Resolve column(s): $(setdiff(union(allheaders...), intersect(allheaders...)))"))
+    elseif length(uniqueheaders) > 1
+        throw(ArgumentError("Column names do not match. Use `rename!` or `names!` to adjust columns names. Resolve column(s): $(setdiff(union(allheaders...), intersect(allheaders...)))"))


I think errors should start with a lowercase letter.

nalimilan · 2017-03-10T20:54:51Z

src/abstractdatatable/abstractdatatable.jl

+ Int64
+```
+
+See also [`denullify`](@ref) & [`nullify!`](@ref).


Better use "and" than "&".

nalimilan · 2017-03-10T21:04:35Z

src/abstractdatatable/abstractdatatable.jl

+function denullify!(dt::AbstractDataTable)
+    for i in 1:size(dt,2)
+        if !anynull(dt[i])
+            dt[i] = dropnull(dt[i])


Use dropnull!? That will avoid a copy for NullableArray, and it won't make a difference for other arrays. Since there a no nulls, the original arrays won't be modified.

nalimilan · 2017-03-10T21:09:26Z

src/abstractdatatable/abstractdatatable.jl

+    denullify!(dt::AbstractDataTable)
+
+Convert columns with a `Nullable` element type without any null values
+to a non-`Nullable` equivalent array type. The table `dt` is modified in place.


Mention that new columns may alias the old ones, even when they were converted. Same for denullify.

Should denullify switch to deepcopy? I'm not sure I like this alias behavior for denullify.

julia> using DataTables julia> dt = DataTable(A = 1:3, B = NullableArray(1:3)) 3×2 DataTables.DataTable │ Row │ A │ B │ ├─────┼───┼───┤ │ 1 │ 1 │ 1 │ │ 2 │ 2 │ 2 │ │ 3 │ 3 │ 3 │ julia> ddt = denullify(dt) 3×2 DataTables.DataTable │ Row │ A │ B │ ├─────┼───┼───┤ │ 1 │ 1 │ 1 │ │ 2 │ 2 │ 2 │ │ 3 │ 3 │ 3 │ julia> dt[:A] === ddt[:A] true julia> ddt[:A] = 1 1 julia> ddt 3×2 DataTables.DataTable │ Row │ A │ B │ ├─────┼───┼───┤ │ 1 │ 1 │ 1 │ │ 2 │ 1 │ 2 │ │ 3 │ 1 │ 3 │ julia> dt 3×2 DataTables.DataTable │ Row │ A │ B │ ├─────┼───┼───┤ │ 1 │ 1 │ 1 │ │ 2 │ 1 │ 2 │ │ 3 │ 1 │ 3 │

and nullify...

julia> dt = DataTable(A = 1:3, B = NullableArray(1:3)) 3×2 DataTables.DataTable │ Row │ A │ B │ ├─────┼───┼───┤ │ 1 │ 1 │ 1 │ │ 2 │ 2 │ 2 │ │ 3 │ 3 │ 3 │ julia> ndt = nullify(dt) 3×2 DataTables.DataTable │ Row │ A │ B │ ├─────┼───┼───┤ │ 1 │ 1 │ 1 │ │ 2 │ 2 │ 2 │ │ 3 │ 3 │ 3 │ julia> dt[:B] === ndt[:B] true julia> ndt[:B] = 3 3 julia> dt 3×2 DataTables.DataTable │ Row │ A │ B │ ├─────┼───┼───┤ │ 1 │ 1 │ 3 │ │ 2 │ 2 │ 3 │ │ 3 │ 3 │ 3 │ julia> ndt 3×2 DataTables.DataTable │ Row │ A │ B │ ├─────┼───┼───┤ │ 1 │ 1 │ 3 │ │ 2 │ 2 │ 3 │ │ 3 │ 3 │ 3 │

I'm not sure, we would need concrete use cases to decide. In both cases people can easily make a copy manually.

I've changed these back to using copy as you suggested and added a note for nullify and denullify that if users want fully alias-free copies, they should use nullify!(deepcopy(dt)) and denullify!(deepcopy(dt)). Hopefully if anyone hits this issue and doesn't know why columns are changed across multiple tables, they'll open the help docstrings and see the note.

nalimilan · 2017-03-10T22:10:33Z

test/constructors.jl

-                                x2 = [1.0, 1.0, 1.0]))
-    @test isequal(dt, DataTable(x1 = [0.0, 0.0, 0.0],
-                                x2 = [1.0, 1.0, 1.0],
+    @test isequal(dt, DataTable(x1 = NullableCategoricalVector([0.0, 0.0, 0.0]),


NullableVector rather.

nalimilan · 2017-03-10T22:11:33Z

test/constructors.jl


    dt = DataTable([Int, Float64], [:x1, :x2], 2)
    @test size(dt) == (2, 2)
-    @test eltypes(dt) == [Nullable{Int}, Nullable{Float64}]
-
-    @test isequal(dt, DataTable([Int, Float64], 2))


This constructor still seems legitimate?

The constructor is legitimate but the test is asking if two datatables with uninitialized columns contain the same values.

by the same logic these two vectors should be equal

julia> Vector{Float64}(10) 10-element Array{Float64,1}: 2.29759e-314 2.29759e-314 2.29759e-314 2.29759e-314 2.29759e-314 2.29759e-314 2.29759e-314 2.29759e-314 2.29759e-314 2.29759e-314 julia> Vector{Float64}(10) 10-element Array{Float64,1}: 2.29514e-314 9.88131e-324 2.25711e-314 2.29432e-314 0.0 2.29469e-314 2.29469e-314 2.29514e-314 2.29469e-314 5.45361e-312

OK. So use Nullable eltypes, which should compare as isequal even with nulls values.

nalimilan · 2017-03-10T22:12:17Z

test/constructors.jl

+    @test DataTable(a=1, b=1:2) == DataTable(a=[1,1], b=[1,2])
+
+    @testset "associative" begin
+        dt = DataTable(Dict(k => v for (k,v) in zip([:A, :B], [1:3, 4:6])))


Just Dict(A => 1:3, B => 4:6).

nalimilan · 2017-03-10T22:12:45Z

test/constructors.jl

+
+    @testset "associative" begin
+        dt = DataTable(Dict(k => v for (k,v) in zip([:A, :B], [1:3, 4:6])))
+        @test dt == DataTable(A = 1:3, B = 4:6)


Also test the type of the columns (would maybe make sense elsewhere too).

nalimilan · 2017-03-10T22:14:49Z

src/datatable/datatable.jl

+                if isa(columns[i], Range)
+                    columns[i] = collect(columns[i])
+                end
+                repeats = div(maxlen, length(columns[i]))


I don't like this recycling behavior which feels too magical. JuliaData/DataFrames.jl#882 only recycles single-element arrays, and that feels safer.

The constructors seem like they are already pretty forgiving given that recycling is attempted at all. It would be an interesting duality of stances: forgiving and friendly during the construction of a DataTable but enforce strict behavior rules during use. We've taken a pretty clear stance on the latter by deciding no to allowing dt[row, :col] = Nullabe() to autopromote to a NullableArray and accept the assignment. I think there's precedent for the recycling behavior too. I'd like some additional perspectives here.

I thought of this mainly because I saw it in the tests

DataTables.jl/src/groupeddatatable/grouping.jl

Line 65 in 5f611bf

dt = DataTable(a = repeat([1, 2, 3, 4], outer=[2]),

DataTables.jl/src/abstractdatatable/reshape.jl

Line 64 in 20c71d6

d1 = DataTable(a = repeat([1:3;], inner = [4]),

DataTables.jl/src/subdatatable/subdatatable.jl

Line 74 in 20c71d6

dt = DataTable(a = repeat([1, 2, 3, 4], outer=[2]),

DataTables.jl/test/grouping.jl

Line 6 in 5f611bf

dt = DataTable(a = repeat([1, 2, 3, 4], outer=[2]),

DataTables.jl/test/data.jl

Line 146 in 20c71d6

d1 = DataTable(a = repeat([1:3;], inner = [4]),

DataTables.jl/src/abstractdatatable/join.jl

Line 293 in 20c71d6

cols = Any[[repeat(c, inner=r2) for c in columns(dt1)];

Of course, the shortcoming here is that we decide for the user which is the default, inner or outer repeat. But at least one of these DataTable constructor behaviors will have a smaller declaration

The constructors seem like they are already pretty forgiving given that recycling is attempted at all. It would be an interesting duality of stances: forgiving and friendly during the construction of a DataTable but enforce strict behavior rules during use.

What do you mean? Constructors do not support any recycling currently. I'm just suggesting we extend to constructors the current behavior of setindex!, i.e. recycle scalars (and maybe on-element arrays, but that's up for discussion).

The tests/examples are not really representative of work with real data, where you more rarely need to generate vectors by recycling.

cjprybol · 2017-03-14T06:15:21Z

New, potentially disruptive (but probably not?) change I have added to this. stack/unstack & meltdt/unstackdt were all performing lots of Array type conversions, the first to NullableCategoricalArrays/NullableArrays and the later to the internal AbstractVector types of RepeatedArray/StackedArray. Removing the automatic array type promotion of stack/unstack wasn't too bad, but removing it from meltdt/unstackdt is essentially impossible without a rewrite of how those work (and the custom backend array types). RepeatedArrays and StackedArrays don't support all indexing types (Bool/BitArray were the two I cared about) and also, they are both another case of auto promotion of Array type that we're trying to get rid of. They don't appear to be commonly used (and are definitely not well tested), otherwise I'm sure someone would have caught these basic usability issues already.

cjprybol · 2017-03-14T06:17:29Z

docs/src/man/reshaping_and_pivoting.md

-
-This saves memory. To create the view, several AbstractVectors are defined:
-
-`:variable` column -- `EachRepeatedVector`  


There aren't any EachRepeatedVectors left in the code so this is probably overdue for removal

nalimilan · 2017-03-15T09:39:12Z

I think that behavior is fine. That's the only way to avoid unnecessary copies. People can just make a copy if needed.

nalimilan · 2017-03-15T09:40:39Z

Probably not worth it in practice, but who knows...

nalimilan · 2017-03-15T09:42:40Z

Ah, too bad. I'd rather keep the old form then. Anyway iterating over keyword arguments should be quite cheap.

nalimilan · 2017-03-15T09:48:51Z

Ah, that's really annoying. This could be a sufficient motivation to add temporary hacks to support concatenation up to 3 or 4 arguments as in JuliaStats/NullableArrays.jl#187.

nalimilan

Sorry for the delay!

nalimilan · 2017-03-15T08:45:41Z

src/abstractdatatable/reshape.jl

@@ -76,9 +76,9 @@ function stack(dt::AbstractDataTable, measure_vars::Vector{Int},
    cnames = names(dt)[id_vars]
    insert!(cnames, 1, value_name)
    insert!(cnames, 1, variable_name)
-    DataTable(Any[Compat.repeat(_names(dt)[measure_vars], inner=nrow(dt)),   # variable
+    DataTable(Any[repeat(_names(dt)[measure_vars], inner=nrow(dt)),   # variable
                  vcat([dt[c] for c in measure_vars]...),                    # value


Should align this comment too.

nalimilan · 2017-03-15T09:00:49Z

src/abstractdatatable/join.jl

    NullableArray(eltype(T), dims)

-similar_nullable{T,R}(dv::CategoricalArray{T,R}, dims::@compat(Union{Int, Tuple{Vararg{Int}}})) =
+similar_nullable{T,R}(dv::CategoricalArray{T,R}, dims::Union{Int, Tuple{Vararg{Int}}}) =
    NullableCategoricalArray(T, dims)

 similar_nullable(dt::AbstractDataTable, dims::Int) =


Is this definition still used somewhere?

no, only the Array ones are still used. I'll take this one out.

nalimilan · 2017-03-15T09:33:32Z

src/abstractdatatable/io.jl

@@ -45,7 +45,7 @@ function printtable(io::IO,
            if !isnull(dt[j][i])
                if ! (etypes[j] <: Real)
                    print(io, quotemark)
-                    x = isa(dt[i, j], Nullable) ? _unsafe_get(dt[i, j]) : dt[i, j]
+                    x = isa(dt[i, j], Nullable) ? NullableArrays.unsafe_get(dt[i, j]) : dt[i, j]


I'd rather not rely on NullableArrays internals for this. Also in Julia 0.6 `unsafe_get exists in Julia Base, so better not make it look like a NullableArrays-specific function.

Ah, I was working on 0.5 when I wrote this and couldn't find the Base.unsafe_get. Will use @compat(unsafeget)

clarify error messages, fix spacing, address case of length(::Symbol) and length(::String) not giving the desired output (Vector length), and unify constructors in accepting AbstractVector{Symbol} for colnames

…nto cjp/retaintype

ararslan · 2017-03-18T19:08:40Z

src/abstractdatatable/join.jl

-similar_nullable{T}(dv::AbstractVector{T}, dims::Union{Int, Tuple{Vararg{Int}}}) =
-    NullableVector{T}(dims)
+similar_nullable{T}(dv::AbstractArray{T}, dims::Union{Int, Tuple{Vararg{Int}}}) =
+    NullableArray(T, dims)


Do NullableArrays not support the proper type parameter convention? I thought that was just DataArrays.

No, they do. This is reverting back to the current master syntax. Now that I've added tests for CategoricalArrays in unstack (or maybe just ran a Pkg.update?) I caught a new deprecation warning.

julia> DataTables.similar_nullable(CategoricalArray([1, 1, 1]), 2) WARNING: NullableCategoricalArray{T}(::Type{T}, dims::Int...; ordered=false) is deprecated, use NullableCategoricalArray{T}(dims, ordered=ordered) instead. Stacktrace: [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64 [2] #NullableCategoricalArray#137(::Bool, ::Type{T} where T, ::Type{Int64}, ::Int64, ::Vararg{Int64,N} where N) at ./deprecated.jl:51 [3] CategoricalArrays.NullableCategoricalArray(::Type{Int64}, ::Int64) at ./deprecated.jl:51 [4] similar_nullable(::CategoricalArrays.CategoricalArray{Int64,1,UInt32}, ::Int64) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/join.jl:12 [5] eval(::Module, ::Any) at ./boot.jl:235 [6] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66 [7] macro expansion at ./REPL.jl:97 [inlined] [8] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73 while loading no file, in expression starting on line 0 2-element CategoricalArrays.NullableCategoricalArray{Int64,1,UInt32}: #NULL #NULL julia> DataTables.similar_nullable(NullableCategoricalArray([1, 1, 1]), 2) WARNING: NullableCategoricalArray{T}(::Type{T}, dims::Int...; ordered=false) is deprecated, use NullableCategoricalArray{T}(dims, ordered=ordered) instead. Stacktrace: [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64 [2] #NullableCategoricalArray#137(::Bool, ::Type{T} where T, ::Type{Int64}, ::Int64, ::Vararg{Int64,N} where N) at ./deprecated.jl:51 [3] CategoricalArrays.NullableCategoricalArray(::Type{Int64}, ::Int64) at ./deprecated.jl:51 [4] similar_nullable(::CategoricalArrays.NullableCategoricalArray{Int64,1,UInt32}, ::Int64) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/join.jl:15 [5] eval(::Module, ::Any) at ./boot.jl:235 [6] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66 [7] macro expansion at ./REPL.jl:97 [inlined] [8] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73 while loading no file, in expression starting on line 0 2-element CategoricalArrays.NullableCategoricalArray{Int64,1,UInt32}: #NULL #NULL

I changed all of these to the new syntax but @nalimilan asked if I would revert and save the changes for another PR, too many changes at once.

cjprybol · 2017-03-18T19:46:05Z

src/datatable/datatable.jl

-hcat!(dt::DataTable, x::NullableVector) = hcat!(dt, DataTable(Any[x]))
-hcat!(dt::DataTable, x::Vector) = hcat!(dt, DataTable(Any[NullableArray(x)]))
-hcat!(dt::DataTable, x) = hcat!(dt, DataTable(Any[NullableArray([x])]))
+hcat!(dt::DataTable, x) = hcat!(dt, DataTable(Any[x]))


with the scalar recycling in this PR these can all be consolidated into a single function

cjprybol · 2017-03-18T20:03:56Z

src/abstractdatatable/abstractdatatable.jl

+        unionunique = union(uniqueheaders...)
+        coldiff = setdiff(unionunique, intersect(uniqueheaders...))
+        if !isempty(coldiff)
+            # if any datatables are a full superset of names, skip them


for each unique set of column names I'm throwing the error to tell which columns are missing from each of the sets. If any of the inputs to vcat have all of the column names then we can't show which are missing, so they're dropped from the error output

cjprybol · 2017-03-18T20:35:23Z

~~CategoricalArrays issue has been filed NullableCategoricalArrays constructor and levels ordering CategoricalArrays.jl#62~~ fixed
vcat errors are extensively tested, both for error type and error message
joins are fully type stabilized
unstack is type stabilized and order stabilized except for when columns are NullableCategoricalArrays, but that will fix itself when NullableCategoricalArrays constructor and levels ordering CategoricalArrays.jl#62 is fixed (and we'll know it's fixed because some of the unstack tests here will fail, and it will be a rare moment when we are excited for failures!)
everything else unrelated has been reverted (except for incorrect spacing and compat removals, which I see no reason to revert).
I don't see any more issues when reviewing the diff by eye.
Let's ignore the coverage drop? I've added many tests here and caught dozens of behavioral bugs that are now corrected. We're also still up 2% on DataFrames and I'll increase coverage a bit more with Remove unused functions and increase test coverage #31.
constructors are all consolidated in behavior and return the correct column type, and recycle properly.
right join errors are patched

pending no major issues during reviews, I think this is ready to merge.

cjprybol · 2017-03-18T20:41:15Z

src/deprecated.jl

@@ -18,5 +18,5 @@ import Base: keys, values, insert!

 @deprecate sub(dt::AbstractDataTable, rows) view(dt, rows)

-@deprecate stackdf stackdt
-@deprecate meltdf meltdt
+@deprecate stackdf stack


This could be considered an unrelated change but given the issues with the stackdt/meltdt maybe we should be steering people towards the non-view versions? I'm happy to revert this if others would like it reverted

I'd revert it for now, as it's unrelated. We can discuss steering people toward stack separately.

cjprybol · 2017-03-18T20:49:16Z

docs/src/man/reshaping_and_pivoting.md

@@ -61,13 +61,13 @@ d = stackdt(iris)

 This saves memory. To create the view, several AbstractVectors are defined:

-`:variable` column -- `EachRepeatedVector`  


Actually, these probably are important. 2 spaces == newline. I forgot this is markdown interpreter dependant

cjprybol · 2017-03-20T06:12:17Z

test/cat.jl

+        dt2 = DataTable(A = 1:3)
+        # right missing 1 column
+        err = @test_throws ArgumentError vcat(dt1, dt2)
+        @test err.value.msg == "column(s) B are missing from argument(s) 2"


I really like that this is possible (catching the error output in a variable to test the string). Thanks for the suggestion here

cjprybol · 2017-03-20T06:14:30Z

src/abstractdatatable/reshape.jl

    valuecol = dt[value]
    keycol = NullableCategoricalArray(dt[colkey])
+    levels!(keycol, unique(dt[colkey]))


JuliaData/CategoricalArrays.jl#63

cjprybol · 2017-03-20T06:14:42Z

src/abstractdatatable/reshape.jl

    groupidxs = [g.idx[g.starts[i]:g.ends[i]] for i in 1:length(g.starts)]
    rowkey = zeros(Int, size(dt, 1))
    for i in 1:length(groupidxs)
        rowkey[groupidxs[i]] = i
    end
    keycol = NullableCategoricalArray(dt[colkey])
+    levels!(keycol, unique(dt[colkey]))


JuliaData/CategoricalArrays.jl#63

and adjust tests accordingly

nalimilan

Sorry for the delay, couldn't find the time to review the new changes.

After discussing this with others, I think we should adopt the following strategy to merge this:

I hadn't really thought about the stackdt/meltdt issue. I turns out it's not OK to remove these. The fact that nobody noticed they were semi-broken doesn't mean they aren't useful, since DataFrames master hasn't been used widely before forking to DataTables. So these need to be restored and adapted like you did for stack and melt.
The changes to tests implied by the removal of automatic promotion to NullableArray are somewhat problematic. Indeed, that means by default data tables won't be using NullableArray, so we won't test that they are still supported by all functions. Ideally, we would test both Array and NullableArray, and their combination, but for now to ensure we don't regress (now or in the future) I'm afraid all tests will have to be converted to use NullableArray columns, except of course for specific cases where it's justified.

In order to make this less messy, could you refactor this PR in (at least) two parts? The first part can contain most changes to the code (but not to tests), except that the promotion to NullableArray would still happen. That way, tests do not need to be adapted too much. The second part can then change the default, and adapt tests to use NullableArray columns. Since that's going to touch a lot of code, it's better to have it separate so that more fundamental changes do not get lost in the noise.

Thanks!

nalimilan · 2017-03-25T14:53:42Z

src/abstractdatatable/abstractdatatable.jl

+                matchingloci = find(h -> u == h, allheaders)
+                headerdiff = filter(x -> !in(x, u), coldiff)
+                headerdiff = length(headerdiff) > 1 ?
+                                join(string.(headerdiff[1:end-1]), ", ") * " and " * string(headerdiff[end]) :


I'd rather align everything on length, without further indentation. Anyway if you keep the indentation, it should be four spaces. Same below.

Anyway, no need for this length check nor to handle the last element manually: as I said, just use join's last argument.

nalimilan · 2017-03-25T14:58:16Z

test/cat.jl

+        # multiple missing 1 column
+        err = @test_throws ArgumentError vcat(dt1, dt2, dt2, dt2, dt2, dt2)
+        @test err.value.msg == "column(s) B are missing from argument(s) 2, 3, 4, 5 and 6"
+        # argument missing >1columns


Missing space.

nalimilan · 2017-03-25T15:00:46Z

src/abstractdatatable/abstractdatatable.jl

+                                    string(matchingloci[end])
+                estrings[i] = "column(s) $headerdiff are missing from argument(s) $matchingloci"
+            end
+            throw(ArgumentError(join(estrings, ", and ")))


Also use last here for "and".

nalimilan · 2017-03-25T18:40:08Z

test/io.jl

@@ -49,6 +49,6 @@ module TestIO
                   F = NullableArray(fill(Nullable(), 3)),
                   G = fill(Nullable(), 3))

-    answer = Sys.WORD_SIZE == 64 ? 0xd4b5a035796ad770 : 0x1950ccd7
+    answer = Sys.WORD_SIZE == 64 ? 0x937e94e70d642cce : 0x2b8864d8


Actually while this trick saves some space, I'd prefer copying the full output (using triple quotes). That way the logs show the actual and expected output on failure, which makes it clearer what doesn't work.

nalimilan · 2017-03-25T18:42:58Z

test/constructors.jl

+    @testset "null conversions" begin
+        dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
+        nullfree = Any[Array{Int,1},Array{Int,1},Array{Int,1}]
+        nullified = convert(Vector{Any}, fill(NullableArray{Int,1}, 3))


Why this conversion? Why use a different approach from the line above? Also I don't think you need the Any above, and Vector{Int} is shorter.

nalimilan · 2017-03-26T14:59:33Z

test/data.jl

@@ -271,13 +272,6 @@ module TestData
        v1 = randn(10)
    )

-    dt2 = DataTable(


Unrelated, please keep this for now.

nalimilan · 2017-03-26T15:00:50Z

test/conversions.jl

@@ -35,8 +35,6 @@ module TestConversions
    @test isa(ai, Matrix{Int})
    @test ai == convert(Matrix{Int}, dt)

-    dt[1,1] = Nullable()


Is this tested somewhere else? If not, should be kept (after adapting of course).

nalimilan · 2017-03-26T15:01:28Z

test/constructors.jl

+    @testset "associative" begin
+        dt = DataTable(Dict(:A => 1:3, :B => 4:6))
+        @test dt == DataTable(A = 1:3, B = 4:6)
+        @test all(e -> e <: Int, eltypes(dt))


Just eltypes(dt) == [Int, Int].

nalimilan · 2017-03-26T15:02:40Z

test/constructors.jl

-
-    @test isequal(dt, DataTable([Int, Float64], 2))
+    @test isequal(dt, DataTable([Nullable{Int}, Nullable{Float64}], 2))
+    @test all(isnull, (dt[:x1], dt[:x2]))


Just @test isnull(dt[:x1]); @test isnull(dt[:x2]) (on a new line). That also makes the tests more precise about what's wrong if they fail.

nalimilan · 2017-03-26T15:05:23Z

test/constructors.jl

@@ -18,8 +18,6 @@ module TestConstructors

    @test isequal(dt, DataTable(Any[NullableCategoricalVector(zeros(3)),
                                    NullableCategoricalVector(ones(3))]))
-    @test isequal(dt, DataTable(x1 = [0.0, 0.0, 0.0],


This test still applies: it checks that comparisons work across types. You can convert it to NullableArray like below.

cjprybol · 2017-03-27T18:03:18Z

That sounds like a great plan, @nalimilan, I can definitely do that!

I think the logical units present in this PR that are necessary to stop autopromotion or otherwise assist with the transition away from autopromotion are:

converters - nullify, nullify!, denullify, denullify!
join - add similar_nullable function for NullableCategoricalArrays, type-stabilize joins, always return fully nullable datatables for outer joins, and patch right outer join bug adjust outer join behavior (types and right outer join bug) #44
vcat - update vcat promotion rules, remove assumptions on how to concatenate columns that are out of order or only present in some of the tables in favor of stricter behavior that mirrors Base.vcat with detailed error messages. (use the proposed @generated function to handle the cases where Base.vcat fails when mixing categorical arrays and non categorical arrays) update vcat to mimic Base.vcat and enhance promotion rules of mixed column type #45
io - minor updates to abstractdatatable/io.jl (indentation and nullable handling) convert tabs to spaces and use unsafe_get for efficiency in printtable #42
hcat - condense that group of hcat functions to the single function needed and fix that one untested hcat that has paramters dt, dt2 but calls dt1, dt2 consolidate internal hcat! function and remove autopromotion #43 & Remove unused functions and increase test coverage #31
reshape w/ copies - type stabilize stack/unstack Stack should use similar_nullable, not NullableArray #54
reshape w/ views - update indexing in stackdt/meltdt to handle all expected indexing strategies (and explore whether reusing existing options like subarrays/arrayviews would be helpful for this or not. It might be easier to just add boolean and bitarray indexing than to swap out the backend)
reshape bonus? - choose either melt or unstack as the name we want and deprecate the other?
DataTable constructor - and last but not least, consolidate data tables constructors, remove autopromotion of values, perform recycling, and extend tests to cover nullable arrays, null free arrays, and mixtures of the two Consolidate DataTable constructors and remove autopromotion #53

I agree with all of your specific code comments/reviews as well, so I'll incorporate those where appropriate. Thanks for the feedback, and please don't feel the need to apologize for any delays! I'm sure you're busy enough before you start volunteering your time to maintain and develop this package, and DataTables isn't even the only package you maintain and develop. @nalimilan and @ararslan both deserve 🥇🥇s, as do the maintainers of many other packages and the language itself.

I'll leave this PR open as a reference, and now that we're going to refactor this, I'll update #31 to not be dependant on this PR and try and get those changes merged soon too.

add changes

8db2821

cjprybol mentioned this pull request Mar 10, 2017

Remove unused functions and increase test coverage #31

Merged

nalimilan reviewed Mar 10, 2017

View reviewed changes

cjprybol added 7 commits March 12, 2017 22:36

make vcat error more informative

4a939fe

add docstring for vcat

f5a53a1

incorporate edits suggested during review

2c95f13

_unsafe_get -> NullableArrays.unsafe_get

412ceaa

Merge branch 'master' into cjp/rebaseretaintype

f142df5

fix new tests from master

cc95658

remove RepeatedVector, StackedVector, unstackdt, meltdt

06dc914

cjprybol commented Mar 14, 2017

View reviewed changes

DataFrames doensn't reshape 2d Arrays -> Vectors so don't do it here

c4e218e

cjprybol changed the title ~~Use whatever column-type you want 2.0~~ Stop auto-promoting column-types Mar 14, 2017

minor cleanup

e954226

nalimilan reviewed Mar 15, 2017

View reviewed changes

cjprybol added 10 commits March 15, 2017 14:02

change (de)nullify back to copy and cleanup docstrings

ed8a515

NullableArrays.unsafe_get -> compat(unsafe_get)

1636a0c

default to NullableArray for joins that may introduce missing data

91233d3

align comments

7462612

lots of edits

9b65533

clarify error messages, fix spacing, address case of length(::Symbol) and length(::String) not giving the desired output (Vector length), and unify constructors in accepting AbstractVector{Symbol} for colnames

tests and no need for compat

b643ff8

spacing mistakes

4c68452

throw errors on 1-d matrices and change confusing variable name

7310681

add back check to differentiate scalars from AbstractArrays

de280ba

save work

88b20ca

cjprybol added 5 commits March 18, 2017 11:46

update vcat testing and change similar_nullable constructor call

e0f7982

:Merge branch 'cjp/retaintype' of github.com:cjprybol/DataTables.jl i…

d65385e

…nto cjp/retaintype

remove old error message from docstring

b0c29b4

and change docstring to doctest

95a6f31

change similar_nullable back and fix unrelated copy paste space removal

7df712f

ararslan reviewed Mar 18, 2017

View reviewed changes

add missing rightperm reordering and properly unify hcat! functions

27da644

cjprybol commented Mar 18, 2017

View reviewed changes

cjprybol added 2 commits March 18, 2017 12:47

accidental spacing changes

5fa8fa0

forgot one spacing change

a1d58f9

cjprybol commented Mar 18, 2017

View reviewed changes

change deprecations

db87443

cjprybol commented Mar 18, 2017

View reviewed changes

add back extra spaces

9c66a1e

cjprybol mentioned this pull request Mar 20, 2017

Fix constructors and conversion between categorical arrays JuliaData/CategoricalArrays.jl#63

Merged

cjprybol commented Mar 20, 2017

View reviewed changes

cjprybol and others added 3 commits March 20, 2017 11:28

bump catarrays version, remove manual resetting of levels in unstack

887346b

and adjust tests accordingly

Merge branch 'master' into cjp/retaintype

00c08cc

only use "and" when joining the last estring

020c88e

nalimilan reviewed Mar 26, 2017

View reviewed changes

This was referenced Mar 31, 2017

adjust outer join behavior (types and right outer join bug) #44

Merged

consolidate internal hcat! function and remove autopromotion #43

Merged

cjprybol mentioned this pull request Apr 12, 2017

duplicate functionality in merge! and hcat! #52

Open

quinnj closed this May 18, 2017

cjprybol deleted the cjp/retaintype branch July 19, 2017 23:27


		This saves memory. To create the view, several AbstractVectors are defined:

		`:variable` column -- `EachRepeatedVector`

		@@ -61,13 +61,13 @@ d = stackdt(iris)

		This saves memory. To create the view, several AbstractVectors are defined:

		`:variable` column -- `EachRepeatedVector`

Stop auto-promoting column-types #30

Stop auto-promoting column-types #30

Conversation

cjprybol commented Mar 10, 2017 • edited Loading

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjprybol commented Mar 14, 2017 • edited Loading

Choose a reason for hiding this comment

nalimilan commented Mar 15, 2017 via email

nalimilan commented Mar 15, 2017 via email

nalimilan commented Mar 15, 2017 via email

nalimilan commented Mar 15, 2017 via email

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjprybol commented Mar 18, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjprybol commented Mar 27, 2017 • edited Loading

cjprybol commented Mar 10, 2017 •

edited

Loading

cjprybol commented Mar 14, 2017 •

edited

Loading

cjprybol commented Mar 18, 2017 •

edited

Loading

cjprybol commented Mar 27, 2017 •

edited

Loading