Add unique(AbstractArray, dim)
Efficiently finds the unique columns, rows, etc. of an array. The
algorithm first hashes each row, then finds the unique hashes, and
finally checks that the hashes don't collide. It is roughly O(n) in
the number of elements in the matrix.

This is my first time using Cartesian. Without it, this code is
presently about 10% faster for finding unique rows of a matrix, but
the overhead is probably worth it for the generality.
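
The three steps described above (hash each row, group rows by hash, verify element-wise so collisions are not mistaken for duplicates) can be sketched for the two-dimensional case as follows. This is only an illustration under stated assumptions: it uses current Julia syntax (`UInt`, `Matrix`) rather than the 2014 syntax of the diff, `unique_rows_sketch` is a hypothetical name that does not exist in Base, and collisions are resolved by keeping a small bucket of candidate rows per hash rather than by the iterative re-check loop the commit itself uses.

```julia
# Sketch only: unique rows of a matrix via hash, group, verify.
# `unique_rows_sketch` is a hypothetical helper, not part of Base.
function unique_rows_sketch(A::Matrix)
    nrows, ncols = size(A)

    # Step 1: one hash per row, folding in each element's hash.
    hashes = zeros(UInt, nrows)
    for j in 1:ncols, i in 1:nrows
        hashes[i] = hash(A[i, j], hashes[i])
    end

    # Steps 2 and 3: group rows by hash, then compare element-wise so that
    # two different rows that happen to share a hash are both kept.
    buckets = Dict{UInt,Vector{Int}}()   # hash => rows already kept for it
    keep = Int[]
    for i in 1:nrows
        reps = get!(() -> Int[], buckets, hashes[i])
        if !any(r -> A[r, :] == A[i, :], reps)
            push!(reps, i)
            push!(keep, i)
        end
    end
    return A[keep, :]
end

A = [1 2; 3 4; 1 2; 3 4; 5 6]
unique_rows_sketch(A)   # 3×2 matrix with rows [1 2], [3 4], [5 6]
```

Each element is hashed once and each row is compared only against rows with the same hash, which is where the roughly linear cost in the number of elements comes from when collisions are rare.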
simonster committed Feb 14, 2014
1 parent d15ce97 commit db3b28d
Showing 2 changed files with 93 additions and 0 deletions.
67 changes: 67 additions & 0 deletions base/multidimensional.jl
@@ -410,3 +410,70 @@ for (V, PT, BT) in [((:N,), BitArray, BitArray), ((:T,:N), Array, StridedArray)]
return P
end
end

## unique across dim

immutable Prehashed
    hash::Uint
end
hash(x::Prehashed) = x.hash

@ngenerate N typeof(A) function unique{T,N}(A::AbstractArray{T,N}, dim::Int)
    1 <= dim <= N || return copy(A)
    hashes = zeros(Uint, size(A, dim))

    # Compute hash for each row
    j = 0
    @nloops N i A d->(if d == dim; j = i_d; end) begin
        @inbounds hashes[j] = bitmix(hashes[j], hash((@nref N A i)))
    end

    # Collect index of first row for each hash
    uniquerow = Array(Int, size(A, dim))
    firstrow = Dict{Prehashed,Int}()
    for j = 1:size(A, dim)
        uniquerow[j] = get!(firstrow, Prehashed(hashes[j]), j)
    end
    uniquerows = collect(values(firstrow))

    # Check for collisions
    collided = falses(size(A, dim))
    @inbounds begin
        @nloops N i A d->(if d == dim; j = i_d; end) begin
            if (@nref N A d->ifelse(d == dim, uniquerow[j], i_d)) != (@nref N A i)

timholy Feb 14, 2014

Member

It's possible this can be sped up by doing more of your indexing in the pre-expression. For two dimensions, this inner loop generates code like this:

if A[ifelse(dim==1,uniquerow[j],i_1), ifelse(dim==2,uniquerow[j],i_2)] != A[i_1, i_2]

Despite appearances, this might be fast because branch prediction should be 100% effective (dim isn't changing). But, if you want that 10% back (relative to non-cartesian) you may want to evaluate something like this:

local k
@nloops N i A d->(if d == dim; k = i_d; j_d = uniquerow[k]; else j_d = i_d; end) begin
    if (@nref N A j) != (@nref N A i)
        collided[k] = true
    end
end

which moves one of your if statements out of the inner loop.

simonster Feb 14, 2014

Author Member

Thanks Tim. That is indeed a bit faster.

                collided[j] = true
            end
        end
    end

    if any(collided)
        nowcollided = BitArray(size(A, dim))
        while any(collided)
            # Collect index of first row for each collided hash
            empty!(firstrow)
            for j = 1:size(A, dim)
                collided[j] || continue
                uniquerow[j] = get!(firstrow, Prehashed(hashes[j]), j)
            end
            for v in values(firstrow)
                push!(uniquerows, v)
            end

            # Check for collisions
            fill!(nowcollided, false)
            @nloops N i A d->begin
                if d == dim
                    j = i_d
                    (!collided[j] || uniquerow[j] == j) && continue
                end
            end begin
                if (@nref N A d->ifelse(d == dim, uniquerow[j], i_d)) != (@nref N A i)

timholy Feb 14, 2014

Member

Same situation here, if the suggestion above gives you a speedup.

                    nowcollided[j] = true
                end
            end
            (collided, nowcollided) = (nowcollided, collided)
        end
    end

    @nref N A d->d == dim ? sort!(uniquerows) : (1:size(A, d))
end
26 changes: 26 additions & 0 deletions test/arrayops.jl
@@ -331,6 +331,32 @@ for i = tensors
@test isequal(i,permutedims(ipermutedims(i,perm),perm))
end

## unique across dim ##

# All rows and columns unique
A = ones(10, 10)
A[diagind(A)] = shuffle!([1:10])
@test unique(A, 1) == A
@test unique(A, 2) == A

# 10 repeats of each row
B = A[shuffle!(repmat(1:10, 10)), :]
C = unique(B, 1)
@test sortrows(C) == sortrows(A)
@test unique(B, 2) == B
@test unique(B.', 2).' == C

# Along third dimension
D = cat(3, B, B)
@test unique(D, 1) == cat(3, C, C)
@test unique(D, 3) == cat(3, B)

# With hash collisions
immutable HashCollision
    x::Float64
end
Base.hash(::HashCollision) = uint(0)
@test map(x->x.x, unique(map(HashCollision, B), 1)) == C

## reduce ##


3 comments on commit db3b28d

@StefanKarpinski
Member

I was kind of surprised to discover this method and wonder how it can possibly work. What happens if the rows/columns don't have the same number of unique elements? This method needs documentation at least to explain what it does.

@KristofferC
Member

If I understand correctly, this function compares the whole column / row against other columns/rows.

For example:

julia> a = [ones(5) ones(5) zeros(5)]
5x3 Array{Float64,2}:
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0

julia> unique(a, 1)
1x3 Array{Float64,2}:
 1.0  1.0  0.0

julia> unique(a, 2)
5x2 Array{Float64,2}:
 1.0  0.0
 1.0  0.0
 1.0  0.0
 1.0  0.0
 1.0  0.0

Where does the number of unique elements in the row / column become a problem?

@StefanKarpinski
Member

Oh, that was totally unclear to me – that makes sense. I thought it was doing the unique operation in each column/row.
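
The two readings discussed in this thread can be put side by side in a minimal sketch. It assumes current Julia, where the method from this commit is spelled `unique(a, dims=1)` instead of `unique(a, 1)` and where `eachcol` is available; the matrix `a` is made up for illustration.

```julia
a = [1.0 1.0 0.0;
     1.0 2.0 0.0;
     1.0 1.0 0.0]

# Whole-slice comparison: rows 1 and 3 are identical, so only one is kept.
whole_rows = unique(a, dims=1)   # 2×3: [1.0 1.0 0.0; 1.0 2.0 0.0]

# Per-column unique: each column is deduplicated on its own. The results
# have different lengths, so they cannot be assembled into a matrix.
per_column = [unique(col) for col in eachcol(a)]   # [[1.0], [1.0, 2.0], [0.0]]
```

The per-column reading is the one that runs into the "different number of unique elements" problem; comparing whole rows or columns always yields a rectangular result.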
