adds table transforms #45

manikyabard · 2021-06-08T09:38:32Z

Adds TabularItem for holding table row values and some transformations for it.

ToucheSir · 2021-06-18T16:48:15Z

src/rowtransforms.jl

+	end
+	for col in tfm.catcols
+		if ismissing(x[col])
+			Setfield.@set! x[col] = "missing"


Is there a better sentinel value we can use than the literal string "missing"? Maybe missing, nothing, or a symbol :missing?

Yeah we discussed during the call how we can use missing here. I'll add a review that includes that discussion.

ToucheSir · 2021-06-18T16:50:53Z

src/rowtransforms.jl

+
+function DataAugmentation.apply(tfm::NormalizeRow, item::TabularItem; randstate=nothing)
+	x = (; zip(item.columns, [data for data in item.data])...)
+	for col in tfm.normcols


Instead of iterating the columns twice and having setfield repeatedly construct a namedtuple, perhaps look into a helper function that does the normalization given the tfm, column name and value? The function could check if the column is in normcols and transform it with normstats if it is.

Yeah do we even need Setfield anymore if we're standardizing on a NamedTuple? We could just build the transformed data as a vector or something then construct a NamedTuple at the end.

Yeah we should be able to build it at the end.
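A minimal sketch of the "build it at the end" idea, assuming the row is a plain NamedTuple and the statistics live in a Dict keyed by column name. `normalize_row`, `stats`, and `cols` are hypothetical names for illustration, not the PR's API.

```julia
# Hypothetical helper: normalize the selected columns in a single pass and
# construct the NamedTuple once at the end, so Setfield is not needed.
function normalize_row(row::NamedTuple, stats::Dict, cols)
    names = keys(row)
    vals = map(names) do col
        val = row[col]
        if col in cols
            colmean, colstd = stats[col]
            val = (val - colmean) / colstd
        end
        val
    end
    return NamedTuple{names}(vals)
end

row = (age = 30.0, city = "Pune", income = 50.0)
stats = Dict(:age => (20.0, 5.0), :income => (40.0, 10.0))
normalized = normalize_row(row, stats, (:age, :income))
```

Columns that are not listed in `cols` pass through unchanged, so only one iteration over the row is required.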

ToucheSir · 2021-06-18T16:51:17Z

src/rowtransforms.jl

+
+function DataAugmentation.apply(tfm::Categorify, item::TabularItem; randstate=nothing)
+	x = (; zip(item.columns, [data for data in item.data])...)
+	for col in tfm.categorycols


Same comment here about double iteration of the columns. I suppose it applies to FillMissing as well :)

Did you mean something like this?

function tfmrowvals(tfm::NormalizeRow, col, val)
    if col in tfm.cols
        colmean, colstd = tfm.dict[col]
        val = (val - colmean)/colstd
    end
    (col, val)
end

function apply(tfm::NormalizeRow, item; randstate=nothing)
    TabularItem(
        (; tfmrowvals.(
            [tfm for _ in 1:length(item.columns)],
            item.columns,
            [val for val in item.data])...
        ),
        item.columns
    )
end

If this is better than the current implementation, we could even have a single apply that works on a Union of all the transforms, with a different tfmrowvals method for each.

Probably still need to dispatch on each type separately in order to know which tfmrowvals function to call, right?

Yeah there will probably be 3 tfmrowvals methods.

function apply(tfm::NormalizeRow, item; randstate=nothing)
    x = NamedTuple(Iterators.map(item.cols, item.data) do col, val
        if col in tfm.cols
            colmean, colstd = tfm.dict[col]
            val = (val - colmean)/colstd
        end
        (col, val)
    end)
end

darsnack

Also, I think you are using TAB for indents. Could you convert to using 4 spaces?

darsnack · 2021-06-18T16:52:12Z

src/rowtransforms.jl

+function DataAugmentation.apply(tfm::FillMissing, item::TabularItem; randstate=nothing)
+	x = (; zip(item.columns, [data for data in item.data])...)
+	for col in tfm.contcols
+		if ismissing(x[col])
+			Setfield.@set! x[col] = tfm.fmvals[col]
+		end
+	end
+	for col in tfm.catcols
+		if ismissing(x[col])
+			Setfield.@set! x[col] = "missing"
+		end
+	end
+	TabularItem(x, item.columns)
+end


Suggested change

-function DataAugmentation.apply(tfm::FillMissing, item::TabularItem; randstate=nothing)
-    x = (; zip(item.columns, [data for data in item.data])...)
-    for col in tfm.contcols
-        if ismissing(x[col])
-            Setfield.@set! x[col] = tfm.fmvals[col]
-        end
-    end
-    for col in tfm.catcols
-        if ismissing(x[col])
-            Setfield.@set! x[col] = "missing"
-        end
-    end
-    TabularItem(x, item.columns)
-end
+function DataAugmentation.apply(tfm::FillMissing, item::TabularItem; randstate=nothing)
+    x = (; zip(item.columns, [data for data in item.data])...)
+    for col in tfm.contcols
+        if ismissing(x[col])
+            Setfield.@set! x[col] = tfm.fmvals[col]
+        end
+    end
+    TabularItem(x, item.columns)
+end

Unless we want to allow the categorical missings to be filled too
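For illustration, a sketch of a fill step that only touches the listed columns and leaves every other `missing` alone, assuming fill values are stored in a Dict keyed by column name. `fill_missing` and its arguments are hypothetical names, not the PR's implementation.

```julia
# Hypothetical helper: replace `missing` only in the columns listed in `cols`.
function fill_missing(row::NamedTuple, fmvals::Dict, cols)
    names = keys(row)
    vals = map(names) do col
        val = row[col]
        (col in cols && ismissing(val)) ? fmvals[col] : val
    end
    return NamedTuple{names}(vals)
end

row = (age = missing, city = missing, income = 50.0)
filled = fill_missing(row, Dict(:age => 31.5), (:age,))
```

Here `city` stays `missing`, which leaves the decision about categorical columns to a later transform.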

darsnack · 2021-06-18T16:57:24Z

src/rowtransforms.jl

+		if ismissing(x[col])
+			Setfield.@set! x[col] = "missing"
+		end
+		Setfield.@set! x[col] = tfm.pooldict[col].invindex[x[col]]


Suggested change

-if ismissing(x[col])
-    Setfield.@set! x[col] = "missing"
-end
-Setfield.@set! x[col] = tfm.pooldict[col].invindex[x[col]]
+if ismissing(x[col])
+    Setfield.@set! x[col] = 0
+else
+    Setfield.@set! x[col] = findfirst(tfm.categories .== col)
+end

Wouldn't this give the same value for a column?

Also seeing that the Embedding layers won't work with 0 indexing, we should probably try to avoid it.

Made a typo, it should be x[col].

Then let's make missing == 1 and do + 1 for the other columns
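A small sketch of that indexing scheme, assuming the known classes of a column are given as a vector. `encode_category` is a hypothetical name, and throwing on an unknown class is an assumption of this sketch rather than something decided in the thread.

```julia
# Hypothetical encoder: index 1 is reserved for `missing`,
# so the known classes start at index 2.
function encode_category(val, classes::AbstractVector)
    ismissing(val) && return 1
    idx = findfirst(isequal(val), classes)
    # Assumption: unseen classes are treated as an error.
    idx === nothing && throw(ArgumentError("unknown class: $val"))
    return idx + 1
end

classes = ["a", "b", "c"]
codes = [encode_category(v, classes) for v in ["b", missing, "a"]]
```

Every code is at least 1, so the result can be used directly with 1-based lookups such as an embedding table.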

darsnack · 2021-06-18T16:57:46Z

src/rowtransforms.jl

+function getcategorypools(catdict, catcols)
+	pooldict = Dict()
+	for col in catcols
+		catarray = CategoricalArrays.categorical(catdict[col])
+        CategoricalArrays.levels!(catarray, ["missing", CategoricalArrays.levels(catarray)...])
+        pooldict[col] = catarray.pool
+	end
+	pooldict
+end


Suggested change

-function getcategorypools(catdict, catcols)
-    pooldict = Dict()
-    for col in catcols
-        catarray = CategoricalArrays.categorical(catdict[col])
-        CategoricalArrays.levels!(catarray, ["missing", CategoricalArrays.levels(catarray)...])
-        pooldict[col] = catarray.pool
-    end
-    pooldict
-end

darsnack · 2021-06-18T17:00:04Z

src/rowtransforms.jl

+struct Categorify <: DataAugmentation.Transform
+	pooldict
+	categorycols
+end


Suggested change

-struct Categorify <: DataAugmentation.Transform
-    pooldict
-    categorycols
-end
+struct Categorify{T, S} <: DataAugmentation.Transform
+    categories::T
+    categorycols::S
+end

Two changes: add type parameters for the fields, and swap pooldict for categories, which is just a vector of the categories. I don't think we need the complexity of categorical arrays when the mapping is just the index in a list of categories passed by the user.

To reduce the complexity, we could just use the catdict used in getcategorypool directly for Categorify. A vector of vectors (or a NamedTuple) with the classes for each categorical column could work as well.

I'm not sure this will work if categories is just a vector of categorical column names. We have to replace the class in each categorical column with an integer, and for that we need to know all the classes present in that column.

Yeah, sorry it should be a NamedTuple/Dict.

darsnack · 2021-06-18T17:00:29Z

src/rowtransforms.jl

+struct NormalizeRow <: DataAugmentation.Transform
+	normstats
+	normcols
+end


Suggested change

-struct NormalizeRow <: DataAugmentation.Transform
-    normstats
-    normcols
-end
+struct NormalizeRow{T, S} <: DataAugmentation.Transform
+    normstats::T
+    normcols::S
+end
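To illustrate what the type parameters buy, here is a self-contained comparison. The `Transform` abstract type is a local stand-in for `DataAugmentation.Transform`, and both struct names are hypothetical.

```julia
abstract type Transform end  # stand-in for DataAugmentation.Transform

# Untyped fields are stored as `Any`.
struct UntypedNormalizeRow <: Transform
    normstats
    normcols
end

# Parametric fields keep the concrete types of whatever is passed in.
struct TypedNormalizeRow{T, S} <: Transform
    normstats::T
    normcols::S
end

stats = Dict(:col1 => (0.0, 1.0))
cols = (:col1,)
untyped = UntypedNormalizeRow(stats, cols)
typed = TypedNormalizeRow(stats, cols)  # T and S are inferred by the default constructor

fieldtype(typeof(untyped), :normstats)  # Any
fieldtype(typeof(typed), :normstats)    # Dict{Symbol, Tuple{Float64, Float64}}
```

With concrete field types the compiler can specialize code that reads `normstats`, whereas `Any` fields generally force dynamic dispatch on every access.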

darsnack · 2021-06-18T17:00:56Z

src/rowtransforms.jl

+struct FillMissing <: DataAugmentation.Transform
+	fmvals
+	contcols
+	catcols
+end


Suggested change

-struct FillMissing <: DataAugmentation.Transform
-    fmvals
-    contcols
-    catcols
-end
+struct FillMissing{T, S} <: DataAugmentation.Transform
+    fmvals::T
+    contcols::S
+end

darsnack · 2021-06-18T17:06:10Z

src/rowtransforms.jl

+
+function DataAugmentation.apply(tfm::NormalizeRow, item::TabularItem; randstate=nothing)
+	x = (; zip(item.columns, [data for data in item.data])...)
+	for col in tfm.normcols


Yeah do we even need Setfield anymore if we're standardizing on a NamedTuple? We could just build the transformed data as a vector or something then construct a NamedTuple at the end.

darsnack

I also wonder if we should have a transform that maps to Flux.OneHotArray instead of just categorical indices.

darsnack · 2021-06-23T12:45:39Z

src/rowtransforms.jl

+    x = [val for val in item.data]
+    for col in tfm.categorycols
+        idx = findfirst(col .== item.columns)
+        x[idx] = ismissing(x[idx]) ? 1 : findfirst(skipmissing(x[idx] .== tfm.catdict[col])) + 1


Suggested change

-x[idx] = ismissing(x[idx]) ? 1 : findfirst(skipmissing(x[idx] .== tfm.catdict[col])) + 1
+x[idx] = ismissing(x[idx]) ? 1 : findfirst(x[idx] .== tfm.catdict[col]) + 1

No need for skipmissing here, right? x[idx] is a value and tfm.catdict[col] is a vector of categorical values (which doesn't contain missing). findfirst is just assigning an index based on which symbol in tfm.catdict[col] matches x[idx].

Initially I was thinking that if someone creates catdict using unique or something similar, missing could end up in this vector and an error could be thrown. But yeah, it might be better to just remove it.

Probably better to map(v -> filter!(!ismissing, v), values(catdict)) when constructing the transform. We could throw a warning when that happens too.
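One way this could look, as a sketch: drop `missing` from each class list when the transform is constructed and warn when that happens. `clean_catdict` is a hypothetical helper. Unlike the `filter!` one-liner above, it copies rather than mutating the caller's vectors, which is a design choice of this sketch.

```julia
# Hypothetical constructor helper: remove `missing` from every class list,
# warning once per affected column and leaving the input untouched.
function clean_catdict(catdict::Dict)
    cleaned = Dict{Symbol, Vector}()
    for (col, vals) in catdict
        if any(ismissing, vals)
            @warn "Dropping `missing` from the classes of column $col"
        end
        cleaned[col] = collect(skipmissing(vals))
    end
    return cleaned
end

catdict = Dict(:color => ["red", missing, "blue"], :size => ["S", "M"])
cleaned = clean_catdict(catdict)
```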

darsnack · 2021-06-23T12:46:28Z

src/rowtransforms.jl

+FillMissing(fmvals::T, fmcols::S) where {T, S} = FillMissing{T, S}(fmvals, fmcols)
+
+function DataAugmentation.apply(tfm::FillMissing, item::TabularItem; randstate=nothing)
+    x = [val for val in item.data]


Does collect(item.data) not work?

We should be able to use that.
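A quick check of the equivalence, assuming `item.data` is a NamedTuple (the row below is made up for illustration):

```julia
# Stand-in for `item.data`: one table row stored as a NamedTuple.
data = (col1 = 1.5, col2 = "a", col3 = missing)

via_comprehension = [val for val in data]
via_collect = collect(data)

# `isequal` is used because `==` would return `missing` for the last entry.
same = isequal(via_collect, via_comprehension)
```

Both forms iterate the values of the NamedTuple in column order, so `collect` is a drop-in replacement here.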

darsnack

This is looking really clean now; nice job!

darsnack · 2021-06-28T20:59:10Z

src/rowtransforms.jl

+    cols::S
+    function Categorify{T, S}(dict::T, cols::S) where {T, S}
+        for (col, vals) in dict
+            dict[col] = append!([], [missing], collect(skipmissing(Set(vals))))


I think here you want to do SortedSet from DataStructures.jl. And you don't need skipmissing first, cause pushing missing onto a set that already contains it is a no-op (AbstractSets can't contain duplicates). Since it is sorted, missing will always map to the same index too (addressing @ToucheSir's concern from the call).

darsnack · 2021-06-28T21:00:21Z

src/rowtransforms.jl

+            val = (val - colmean)/colstd
+        end
+        (col, val)
+    end)


Suggested change

-    end)
+    end)
+    return TabularItem(x, item.columns)

And all the other transforms too

darsnack · 2021-06-28T21:01:07Z

src/rowtransforms.jl

+function apply(tfm::Categorify, item; randstate=nothing)
+    x = NamedTuple(Iterators.map(item.columns, item.data) do col, val
+        if col in tfm.cols
+            val = ismissing(val) ? 1 : findfirst(val .== tfm.dict[col]) + 1


This can just be findfirst if we use SortedSet.

Would findfirst work when the input function involves comparing with missing?

Yes, because tfm.dict[col] always contains missing, and missing is treated as any other element in the set.

Ah I see what you mean. An equality comparison with missing is missing.

I think as a result of this, the whole storing missing in the tfm.dict[col] is not going to work. We'll have to revert to the old filtering way + the conditional shown here.

In that case, there's no need to store missing in the dict values either then right? The conditional is required either way.

Yeah that's what I meant. Filter the missing out of the dict, and don't add it if it isn't there.

Alright, I have updated the constructor to use skipmissing and collect for the values containing missing.
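The behavior discussed in this thread can be reproduced in a few lines. This is a sketch with a made-up `classes` vector, and `encode` is a hypothetical name that mirrors the conditional from the diff above.

```julia
classes = ["a", "b", "c"]  # class list with `missing` already filtered out

# Equality with `missing` propagates: the result is `missing`, not `false`.
ismissing(missing == "a")      # true
ismissing(missing == missing)  # true

# `isequal` treats `missing` as an ordinary value and always returns a Bool.
isequal(missing, missing)      # true

# Hence the explicit branch: `missing` maps to 1, known classes are shifted by one.
encode(val) = ismissing(val) ? 1 : findfirst(val .== classes) + 1

encoded = encode.(["c", missing, "a"])
```

Because the comparison `val .== classes` is only evaluated for non-missing values against a class list without `missing`, it always yields plain Bools for `findfirst`.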

darsnack · 2021-06-28T21:02:00Z

src/rowtransforms.jl

-    TabularItem(x, item.columns)
+Categorify(dict::T, cols::S) where {T, S} = Categorify{T, S}(dict, cols)
+
+function apply(tfm::NormalizeRow, item; randstate=nothing)


Is randstate an artifact from Python? Or is it part of the DataAugmentation interface? What is its role here?

Yeah, I can't see which of these transforms requires an RNG.

Yeah, even though randstate isn't required for the tabular transformations, I put it there because it was a part of the transformation interface. I think internally for compositions, apply is called along with randstate args so everything might not work without it.

Yeah, it's because of how the dispatch is set up.
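To make the dispatch point concrete, here is a toy model of a composition that forwards `randstate` to its children. All types and functions below are hypothetical stand-ins and do not reproduce DataAugmentation's actual code.

```julia
abstract type Transform end       # stand-in for DataAugmentation.Transform

struct AddOne <: Transform end    # hypothetical deterministic transform

struct Sequence{T} <: Transform   # hypothetical composition of transforms
    transforms::T
end

# Deterministic transforms accept `randstate` but ignore it.
apply(tfm::AddOne, item; randstate = nothing) = item + 1

# The composition forwards `randstate` to every child, so each child's
# `apply` has to accept the keyword even when it has no use for it.
function apply(seq::Sequence, item; randstate = nothing)
    for tfm in seq.transforms
        item = apply(tfm, item; randstate = randstate)
    end
    return item
end

result = apply(Sequence((AddOne(), AddOne())), 1; randstate = 42)
```

If `apply(::AddOne, item)` were defined without the keyword, the forwarded call would fail with a `MethodError` for the unsupported keyword argument.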

darsnack

Just fix the tab issue

lorenzoh · 2021-07-09T15:05:24Z

Still needs some tests

manikyabard · 2021-07-17T09:22:18Z

The tests should be fixed now.

ToucheSir

Just some nits on the tests. Is there a way to trigger a CI run on the latest commit?

ToucheSir · 2021-07-17T15:44:56Z

test/rowtransforms.jl

+    normdict = Dict(:col1 => (col1_mean, col1_std), :col3 => (col3_mean, col3_std))
+
+    tfm = NormalizeRow(normdict, cols_to_normalize)
+    # @test_nowarn apply(tfm, item)


Note to delete dangling comment before submission

darsnack · 2021-07-17T18:51:31Z

@lorenzoh will have to trigger it for "first-time contributors."

manikyabard added 2 commits June 6, 2021 15:09

added tabular transforms

2fa8f76

updated transforms

c1767c5

manikyabard changed the title ~~Manikyabard/tabulartfms~~ adds table transforms Jun 8, 2021

manikyabard mentioned this pull request Jun 14, 2021

FastAI.jl tabular development GSoC tracking FluxML/FluxML-Community-Call-Minutes#34

Closed

5 tasks

updated tabular transforms

2e0b87c

ToucheSir reviewed Jun 18, 2021

darsnack requested changes Jun 18, 2021

lorenzoh reviewed Jun 21, 2021

manikyabard and others added 3 commits June 22, 2021 18:36

Update src/items/table.jl

7a57684

Co-authored-by: lorenzoh

updated transformations

3f8db5b

Merge branch 'manikyabard/tabulartfms' of https://github.com/manikyab…

bfaec10

…ard/DataAugmentation.jl into manikyabard/tabulartfms

darsnack requested changes Jun 23, 2021

manikyabard and others added 2 commits June 23, 2021 18:51

remove redundant constructor methods

90c3bae

Co-authored-by: Kyle Daruwalla

updated tabular transforms

eef71ec

darsnack requested changes Jun 28, 2021

manikyabard added 2 commits July 1, 2021 00:05

updated Categorify to use SortedSet

cd42f1d

change Categorify constructor to remove missing values

38f35a0

darsnack reviewed Jul 2, 2021

manikyabard and others added 2 commits July 3, 2021 00:15

Minor changes in Categorify

a033d1d

Co-authored-by: Kyle Daruwalla

remove unused dependencies

5a40475

manikyabard marked this pull request as ready for review July 2, 2021 18:50

darsnack reviewed Jul 2, 2021

darsnack approved these changes Jul 2, 2021

Update src/items/table.jl

0de742b

Co-authored-by: Kyle Daruwalla

darsnack approved these changes Jul 2, 2021

added docstrings for tabular transforms

128cf97

manikyabard mentioned this pull request Jul 9, 2021

add blog about working with tabular data using FastAI.jl FluxML/fluxml.github.io#94

Open

manikyabard added 2 commits July 11, 2021 22:31

added row transformation testcases

be5dbff

fixed test

e90dc4b

ToucheSir requested changes Jul 17, 2021

manikyabard and others added 2 commits July 18, 2021 02:00

updated tests

1a22791

Co-authored-by: Brian Chen

made tests consistent

a03bc72

lorenzoh merged commit d4d4687 into FluxML:master Aug 10, 2021
