Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safer serialization of ::SubArray{T,N,A<:Array} #14625

Merged
merged 1 commit into from
Jan 17, 2016

Conversation

timholy
Copy link
Member

@timholy timholy commented Jan 10, 2016

When dealing with SubArrays-of-Arrays, we currently make serialization more efficient by "trimming" any unused data. However, we were not careful to reconstruct the exact type, and when the type was listed as a type-parameter in another type this can lead to problems.

Demo of failure on current master:

module ArrayWrappers

immutable ArrayWrapper{T,N,A<:AbstractArray} <: AbstractArray{T,N}
    data::A
end
Base.size(A::ArrayWrapper) = size(A.data)
Base.size(A::ArrayWrapper, d) = size(A.data, d)
Base.getindex(A::ArrayWrapper, i::Real...) = getindex(A.data, i...)

end

let A = rand(3,4)
    for B in (sub(A, :, 2:4), slice(A, 2, 1:3))
        C = ArrayWrappers.ArrayWrapper{Float64,2,typeof(B)}(B)
        io = IOBuffer()
        serialize(io, C)
        seek(io, 0)
        Cd = deserialize(io)
        @show size(Cd) size(C)
    end
end

Results:

julia> include("/tmp/sertest.jl")
size(Cd) = (561160,3)    # oh crap
size(C) = (3,3)
size(Cd) = (3,)
size(C) = (3,)

This PR reconstructs the type exactly, making the trimming strategy safe. (And look ma, no generated functions!)

Note this only affects SubArrays-of-Arrays; as before, we still serialize the entire parent array for any other kind of SubArray.

When dealing with SubArrays-of-Arrays, we currently make serialization more efficient by "trimming" any unused data. However, we were not careful to reconstruct the exact type, and when the type was listed as a type-parameter in another type this can lead to problems. This reconstructs the type exactly, making the trimming strategy safe.

Note this only works for SubArrays-of-Arrays; we still serialize the entire parent array for any other kind of SubArray.
@timholy
Copy link
Member Author

timholy commented Jan 10, 2016

One potential issue for backporting: currently SubArray -> serializer -> deserializer -> Array, with this PR SubArray -> serializer -> deserializer -> SubArray. While that's presumably something that will make @tkelman happy, changing the output type may not be the best thing to do in the middle of a stable release.

@tkelman
Copy link
Contributor

tkelman commented Jan 10, 2016

I could see that interacting with parallel code in unfortunate ways. That code might be currently sort-of working but making copies rather than subarrays? Would have to think about a test case.

@simonster
Copy link
Member

If you have an array and a SubArray of that array and serialize/deserialize both, I assume changing one no longer changes the other? While I can imagine cases where you don't want to serialize the whole array (e.g. splitting up a big array between workers), I'm not sure it's kosher for serialization to change semantics in this way...

@timholy
Copy link
Member Author

timholy commented Jan 10, 2016

@simonster, yes, serialization breaks the coupling. It's been this way for ages; this PR doesn't change anything there. Formerly, not only would it break the coupling, but upon deserialization you'd get an Array rather than a SubArray, so the type wasn't preserved. With this PR, at least the deserialized type is the same as what you fed in.

We could alternatively consider dropping this optimization altogether and just serialize the whole parent array. Naturally that copies all the data in the parent. If what you're serializing is, e.g., a single image slice from an hour-long movie, you'll have to serialize the whole movie, and that's a pretty expensive operation. But now that we have SharedArrays, there are good ways around that problem. So, rather than trying to improve this method, it's a valid option to just delete it altogether.

@timholy
Copy link
Member Author

timholy commented Jan 10, 2016

Link to the current code:

julia/base/serialize.jl

Lines 193 to 201 in a6770b0

function serialize{T,N,A<:Array}(s::SerializationState, a::SubArray{T,N,A})
if !isbits(T) || stride(a,1)!=1
return serialize(s, copy(a))
end
writetag(s.io, ARRAY_TAG)
serialize(s, T)
serialize(s, size(a))
serialize_array_data(s.io, a)
end
(it's easy to see that you get an Array out when deserializing).

@simonster
Copy link
Member

I think the current behavior is too surprising and we shouldn't try to trim the SubArrays, at least not by default. It would be nice to have some way to say that SubArrays should be trimmed when serialized (since e.g. SharedArrays won't work with worker processes on separate machines) but I'm not sure what that would look like.

@timholy
Copy link
Member Author

timholy commented Jan 11, 2016

Are you saying that because, secretly, you want some way of preserving the link between the parent and the subarray when both get serialized? If we're not going to be happy until that's true, then I suspect there's little point changing one without the other, and we should just wait until we come up with a way to do that.

If we don't think we can preserve the link, then is it really worth serializing the whole parent? Keep in mind that in the future, most subarrays will be created as B = A[:, :, 1854]. Currently with that syntax, serializing B does not serialize A.

@timholy
Copy link
Member Author

timholy commented Jan 17, 2016

Just ran into this again. I'd like to get this bug fixed, and I confess I'm not exactly sure I understand where people sit on this issue. So to clarify:

  • is this an improvement over current behavior?
  • if the answer to the above is "yes," is this therefore mergeable independent of other improvements we might make? (Note the test failures seem independent of this PR.)
  • or, should we just delete that specialized method, which will cause the entire parent array to be serialized along with the subarray wrapper?
  • or, should we wait to do anything until we come up with a way to preserve the relationship between a subarray and its parent?

I'd be a bit unhappy about the last one, because we currently have a bug that needs fixing and the last choice seems unlikely to be a quick fix.

@simonster
Copy link
Member

This is definitely an improvement over the current behavior, so I think we should merge it.

@timholy
Copy link
Member Author

timholy commented Jan 17, 2016

OK, I'll merge this and file a separate issue to act as a reminder to reevaluate behavior.

timholy added a commit that referenced this pull request Jan 17, 2016
Safer serialization of ::SubArray{T,N,A<:Array}
@timholy timholy merged commit 8801efa into master Jan 17, 2016
@timholy timholy deleted the teh/serialize_subarrays branch January 17, 2016 23:45
@timholy
Copy link
Member Author

timholy commented Apr 21, 2016

I don't know whether there will be another 0.4 release, but I just fixed a local bug by backporting this.

@tkelman
Copy link
Contributor

tkelman commented Apr 27, 2016

Why wouldn't there be? Could you describe the bug a bit? I was hesitant to backport this because it's a bit of a behavior change.

@timholy
Copy link
Member Author

timholy commented Apr 27, 2016

Fair enough. Let's take off the label, then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants