Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vec: add uninitialized(uint) and resize(&mut self, uint, T). #19104

Closed
wants to merge 2 commits into from

Conversation

nodakai
Copy link
Contributor

@nodakai nodakai commented Nov 19, 2014

Add two more convenience methods to Vec.

Please review my English in the rustdoc comments!

@rust-highfive
Copy link
Collaborator

warning Warning warning

  • These commits modify unsafe code. Please review it carefully!

if l < n {
self.grow(n - l, value)
} else {
unsafe { self.set_len(n) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The destructors need to be called here. It should just be using truncate.

@nodakai
Copy link
Contributor Author

nodakai commented Nov 21, 2014

  1. resize() now calls grow() or truncate() and is safe for any types.
  2. My new constructor Vec::initialized() is now restricted to <T: Int>. Still UB but a tamed kind of UB.

Incidentally, modified truncate() a little according to my sense of beauty.

@@ -383,6 +408,29 @@ impl<T: Clone> Vec<T> {
}
}

impl<T: Int> Vec<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not T: Copy at this point? Are you just trying to use Int here as a synonym for "meaningless bits"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int is the most general trait which includes u8 (expected to be the most frequent use case for this ctor) but is known not to imply Drop (we should forbid allocating a Vec of RAII objects such as Vec<Vec> with this ctor.) According to the document, Int is

A built-in signed or unsigned integer.

What problem do you expect with Int? Do you think it is too specific?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really think there's a point in having a bound here. It's still very unsafe either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way uninitialized memory works in LLVM is that each access can return an arbitrary value. For example, using an uninitialized integer for indexing may pass the bounds check and then become a different value for the pointer offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned Vec contains what we call an "uninitialized" raw memory chunk, but from LLVM's point of view, it should just be "some data returned by C function called je_mallocx()" and should not be different from, say, "some data returned by C function called getaddrinfo(3)." If you think it's legitimate to read each struct addrinfo returned by getaddrinfo(3) under LLVM, then it should also be legitimate to read void* buffer returned by je_mallocx(). Reading such a buffer should surely be called UB, but it will never lead to deletion of your home directory; it just yields unpredictable byte sequence.

Hit the [ir] button to see how Vec and getaddrinfo are translated to LLVM IR:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aware that Rust is not currently communicating that the memory is uninitialized in a build without --disable-jemalloc. LLVM special cases certain libc symbols like malloc, and it does infer this information from the symbol name along with doing stuff like dead store elimination of dynamic allocations. The fact that it isn't communicated to LLVM for other allocator symbols is a performance bug. It is not a guarantee that you're allowed to leverage. Note that the reference manual currently states that reading uninitialized memory is UB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--disable-jemalloc

Good point. You're right, reading data from uninitialized elements of a Vec under --disable-jemalloc is what LLVM define as UB.

Still, I don't think this is "more unsafe" than existing unsafe functions Vec::from_raw_parts() and set_len() which are already so flexible that anything can happen from their incorrect usage. So the discussion should focus on "will adding one more rewards us?"

My point is, allocating a buffer to pass its slice to Reader's read_*() methods and alike is a frequent use case and the library should support it with options with varying degrees of safety. Note that the length of a vector has to be set properly beforehand. Then my Vec::uninitialized() constructor should be preferred over calling bare set_len(). From a safety point of view, Vec::from_elem() is the best, but "extra" cost of clearing a buffer is sometimes undesirable. A benchmark with a 16 GB buffer on 3 GHz Xeon machine:

fn main() {
    const GB: uint = 1 << 30;
    const N: uint = 16 * GB;

    let mut v = if 1 < std::os::args().len() {
        let mut v = Vec::<u8>::with_capacity(N);
        unsafe { v.set_len(N); }
        v
    } else {
        Vec::<u8>::from_elem(N, 0)
    };

    println!("{}", std::io::stdin().read_at_least(N, v.as_mut_slice()));
}
$ time ./large_vec < /dev/zero
Ok(17179869184)
./large_vec < /dev/zero  4.99s user 11.52s system 99% cpu 16.529 total
$ time ./large_vec < /dev/zero x
Ok(17179869184)
./large_vec x < /dev/zero  0.00s user 9.14s system 99% cpu 9.157 total

Also remember, if someone makes a mistake of reading past the end of the region written by read_*(), it's just a bug regardless whether or not the rest of the buffer was zero-cleard beforehand in a well-defined way. "Safety" and "well-defined behavior" don't substantially help here. We shouldn't always need to pay the extra cost of clearing a buffer for the doubtful benefit from "safety" and "well-defined behavior".

This is a convenience wrapper on top of `grow()` and `truncate()`.

Signed-off-by: NODA, Kai <[email protected]>
@nodakai
Copy link
Contributor Author

nodakai commented Nov 22, 2014

I'd like to drop Vec::uninitialized() from my proposal because it didn't gain much approval. Thanks @thestinger and @gankro for your comments.

Incidentally, here's my motivation for adding the resize() method: there are numerous variable-length message formats on a byte stream whose messages have the form of [fixed-length header][varable-length payload]. We can use resize() to parse such messages like this:

let mut buf: Vec<u8> = vec![];
loop {
    buf.resize(HDR_LEN, 0);
    try!( stream.read_at_least(HDR_LEN, buf[mut]) );
    let msg_len = (... either bit operations on buf or transmute into C struct ...);
    buf.resize(msg_len, 0);
    try!( stream.read_at_least(msg_len, buf[mut]) );
    (... cook the data in buf ...)
}

Well, try!( stream.read_at_least(HDR_LEN, buf[mut]) ); part isn't very nice, but I hope you get the idea.

@Gankra
Copy link
Contributor

Gankra commented Nov 22, 2014

@nodakai I'm drafting up collections reform part 2. I could add this to the docket for discussion, if that works for you?

@nodakai
Copy link
Contributor Author

nodakai commented Nov 22, 2014

@gankro Yes please, that seems to be the only way for this proposal to make it into the merge. I'll close this PR later.

@alexcrichton
Copy link
Member

Closing due to #19104 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants