SINGLE_FILE_READERS/write_next_chunk race condition? #91

jbielick opened this issue Feb 19, 2019 · 0 comments
Hello maintainer(s)!

Thanks so much for your work and curation. I've recently resurfaced from some deep-sea debugging and have a question about my understanding of the Operations::Upload code: specifically, whether there's a race condition when uploading a single (small) file.

Background

Net::SFTP 2.2.1
Net::SSH 4.2.0
Ruby 2.4.1
protocol version 6

An application I work on regularly uploads small CSV files to an SFTP server, on the order of several dozen per day.

Expected behavior

# nearly identical to the code from our application:
Net::SFTP.start(host, user, port: port, password: password) do |sftp|
  sftp.upload!(local_path, remote_path)
end

Under normal conditions, with a working server, the code above uploads the file without a socket error.

Actual Behavior

The file is uploaded, but we intermittently receive this error:

IOError: closed stream
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/ruby_compat.rb:20:in `select'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/ruby_compat.rb:20:in `io_select'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/transport/packet_stream.rb:75:in `available_for_read?'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/transport/packet_stream.rb:90:in `next_packet'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/transport/session.rb:193:in `block in poll_message'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/transport/session.rb:188:in `loop'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/transport/session.rb:188:in `poll_message'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:544:in `dispatch_incoming_packets'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:246:in `ev_preprocess'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/event_loop.rb:99:in `each'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/event_loop.rb:99:in `ev_preprocess'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/event_loop.rb:27:in `process'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:225:in `process'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:178:in `block in loop'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:178:in `loop'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:178:in `loop'
... 15 levels...
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/event_loop.rb:27:in `process'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:225:in `process'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:178:in `block in loop'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:178:in `loop'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/connection/session.rb:178:in `loop'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:802:in `loop'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:787:in `connect!'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/net-sftp-2.1.2/lib/net/sftp.rb:32:in `start'
	from (irb):401
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/railties-4.2.9/lib/rails/commands/console.rb:110:in `start'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/railties-4.2.9/lib/rails/commands/console.rb:9:in `start'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/railties-4.2.9/lib/rails/commands/commands_tasks.rb:68:in `console'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/railties-4.2.9/lib/rails/commands/commands_tasks.rb:39:in `run_command!'
	from /u/apps/bighatads/production/shared/vendor/bundle/ruby/2.4.0/gems/railties-4.2.9/lib/rails/commands.rb:17:in `<top (required)>'
	from bin/rails:5:in `require'
	from bin/rails:5:in `<main>'

Possible Cause

When the size of the file being uploaded is smaller than the read_size, Net::SFTP appears to close the connection before the upload (the written bytes) is complete and acknowledged. When the read_size is less than the file size, the upload always seems to succeed, regardless of file size.

filesize: 10_713, read_size: 32_000 - FAIL
filesize: 10_767, read_size: 32_000 - SUCCESS
filesize: 10_670, read_size: 32_000 - SUCCESS

filesize: 10_713, read_size: 10_713 - FAIL
filesize: 10_713, read_size: 10_714 - FAIL
filesize: 10_713, read_size: 10_712 - SUCCESS

The conclusion isn't straightforward, but it seems like a race condition correlated with file size.
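
For reference, the read_size in these trials was forced through the options hash. A minimal sketch (read_size is the same option the quoted write_next_chunk consults below; local_path points at the 10_713-byte file):

Net::SFTP.start(host, user, port: port, password: password) do |sftp|
  # the file is 10_713 bytes; a read_size at or above that triggers the
  # failure, while read_size: 10_712 completes cleanly
  sftp.upload!(local_path, remote_path, read_size: 32_000) # => IOError: closed stream
end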

When trying to trace the cause I found this:

write_next_chunk is the method that says "read a chunk of size x; if there's no data left, close the connection." Is that correct?

      def write_next_chunk(file)
        if file.io.nil?
          process_next_entry
        else
          @active += 1
          offset = file.io.pos
          data = file.io.read(options[:read_size] || DEFAULT_READ_SIZE)
          if data.nil?
            update_progress(:close, file)
            request = sftp.close(file.handle, &method(:on_close))

The issue I'm seeing arises right when the upload starts and the remote file has just been opened (see on_open):

      def on_open(response)
        @active -= 1
        file = response.request[:file]
        raise StatusException.new(response, "open #{file.remote}") unless response.ok?

        file.handle = response[:handle]

        @uploads << file
        # we start writing chunks here. This will get called recursively
        write_next_chunk(file)

        # recursive? is only true when uploading a directory
        if !recursive?
          # n.times write_next_chunk(same file as above)!
          (options[:requests] || SINGLE_FILE_READERS).to_i.times { write_next_chunk(file) }
        end
      end

We really only need write_next_chunk to be called once; the event loop will call it again after each chunk is sent over the socket, until the file is completely read. The next few lines under if !recursive? are troubling. At that point write_next_chunk has already been called once. Say the file is smaller than the chunk size (read_size): that first call reads the entire file contents, leaving the IO at EOF.

After this has occurred, requests.to_i.times { write_next_chunk(file) } is executed, the default for SINGLE_FILE_READERS being 2:

        if !recursive?
          # n.times write_next_chunk(same file as above)!
          (options[:requests] || SINGLE_FILE_READERS).to_i.times { write_next_chunk(file) }
        end

Thus, write_next_chunk is called two more times. Since the second and any later call to write_next_chunk sees a file handle already at the EOF position, its read(bytes) returns nil.

It will enter the if data.nil? block:

          data = file.io.read(options[:read_size] || DEFAULT_READ_SIZE)
          if data.nil?
            update_progress(:close, file)
            request = sftp.close(file.handle, &method(:on_close))

and the session/connection will be closed.

I believe the issue I'm seeing is that the second call to write_next_chunk occurs synchronously after the first, and the socket is closed before the server's response to the write is received. The result is an uploaded file and a "closed stream" error. The exact error varies with timing: sometimes it is connection closed by remote host, and sometimes connection reset by peer recvfrom(2).
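
If that reading is correct, one possible guard (a rough, untested sketch on my part, using the names from the quoted source; the eof? check is my addition) would be to stop queueing extra readers once the first call has already drained the IO:

        if !recursive?
          (options[:requests] || SINGLE_FILE_READERS).to_i.times do
            # once the local IO is drained, on_write will perform the final
            # nil read and issue the close after the last write is acknowledged
            break if file.io.nil? || file.io.eof?
            write_next_chunk(file)
          end
        end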

Am I reading this correctly? Would you consider a patch for this? We can currently work around it by always ensuring the chunk size is smaller than the file size. Another workaround is passing requests: 0, since we don't need multiple writers. But this is a very core part of the library, and I wouldn't expect to need these workarounds, so I thought it was worth bringing up.
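
For anyone else hitting this in the meantime, the requests: 0 workaround looks like this (same sketch as the snippet under Expected behavior):

Net::SFTP.start(host, user, port: port, password: password) do |sftp|
  # with no extra readers, write_next_chunk is only re-invoked from the
  # on_write callback, so the close is sent after the final write is acked
  sftp.upload!(local_path, remote_path, requests: 0)
end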

How should I proceed?
