task txg_sync:4579 blocked for more than 120 seconds #7946

Closed
jumbi77 opened this issue Sep 22, 2018 · 7 comments

Comments

jumbi77 (Contributor) commented Sep 22, 2018

System information

Type                  Version/Name
Distribution Name     Proxmox
Distribution Version  5.2-2
Linux Kernel          Linux proxmox 4.15.18-3-pve
Architecture          x64
ZFS Version           0.7.9-pve1~bpo9
SPL Version           0.7.9-pve1~bpo9

Describe the problem you're observing

I have a zpool with a single drive (ata-ST3000DM001) and some datasets. One dataset is used for downloading large ISO and tar files. The download speed is around 25-30 MB/s; because I set copies=2, I assume the HDD has to handle about 60 MB/s of writes, which should not be too much data for the drive. After the write load (downloading) begins, I almost always get call traces within 1-3 minutes. If I manually limit the download speed to around 11 MB/s, or set copies=1, no call traces appear. So what is causing the error? Is the HDD too slow (which I don't expect), or is it ZFS related? I can reproduce the call trace and can log some other metrics if you want.
Thanks
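
To make this easier to reproduce, the doubled write load and the drive's actual utilization can be watched while the download runs; a minimal sketch using standard ZFS and sysstat tooling (pool/dataset names are taken from this report, the device name is an assumption):

# confirm the dataset duplicates every block (copies=2)
zfs get copies mypool/ds1

# per-vdev throughput while the download runs; with copies=2 the
# write bandwidth column should show roughly twice the download rate
zpool iostat -v mypool 5

# device-level view (sysstat); %util close to 100 means the single
# HDD is saturated -- sda is an assumed device name
iostat -x sda 5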

Include any warnings/errors/backtraces from the system logs

DMESG output:

[53893.145953] INFO: task txg_sync:4579 blocked for more than 120 seconds.
[53893.145975]       Tainted: P           O     4.15.18-3-pve #1
[53893.145987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53893.146005] txg_sync        D    0  4579      2 0x80000000
[53893.146006] Call Trace:
[53893.146013]  __schedule+0x3e0/0x870
[53893.146015]  schedule+0x36/0x80
[53893.146017]  io_schedule+0x16/0x40
[53893.146023]  cv_wait_common+0xb2/0x140 [spl]
[53893.146025]  ? wait_woken+0x80/0x80
[53893.146027]  __cv_wait_io+0x18/0x20 [spl]
[53893.146062]  zio_wait+0x103/0x1b0 [zfs]
[53893.146081]  dsl_pool_sync+0xb8/0x430 [zfs]
[53893.146103]  spa_sync+0x42d/0xd50 [zfs]
[53893.146125]  txg_sync_thread+0x2d4/0x4a0 [zfs]
[53893.146147]  ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[53893.146149]  thread_generic_wrapper+0x74/0x90 [spl]
[53893.146151]  kthread+0x105/0x140
[53893.146153]  ? __thread_exit+0x20/0x20 [spl]
[53893.146155]  ? kthread_create_worker_on_cpu+0x70/0x70
[53893.146157]  ret_from_fork+0x35/0x40
[53893.146273] INFO: task java:2481 blocked for more than 120 seconds.
[53893.146288]       Tainted: P           O     4.15.18-3-pve #1
[53893.146300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53893.146317] java            D    0  2481  24602 0x00000120
[53893.146318] Call Trace:
[53893.146320]  __schedule+0x3e0/0x870
[53893.146322]  schedule+0x36/0x80
[53893.146323]  io_schedule+0x16/0x40
[53893.146326]  cv_wait_common+0xb2/0x140 [spl]
[53893.146327]  ? wait_woken+0x80/0x80
[53893.146330]  __cv_wait_io+0x18/0x20 [spl]
[53893.146352]  zio_wait+0x103/0x1b0 [zfs]
[53893.146374]  zil_commit.part.14+0x4df/0x8b0 [zfs]
[53893.146395]  zil_commit+0x17/0x20 [zfs]
[53893.146416]  zfs_fsync+0x77/0xf0 [zfs]
[53893.146437]  zpl_fsync+0x68/0xa0 [zfs]
[53893.146439]  vfs_fsync_range+0x51/0xb0
[53893.146440]  do_fsync+0x3d/0x70
[53893.146441]  SyS_fsync+0x10/0x20
[53893.146443]  do_syscall_64+0x73/0x130
[53893.146444]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[53893.146446] RIP: 0033:0x7fd25cbfbad0
[53893.146446] RSP: 002b:00007fd238277340 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[53893.146447] RAX: ffffffffffffffda RBX: 000000000000006e RCX: 00007fd25cbfbad0
[53893.146448] RDX: 0000000000000000 RSI: 00007fd2382773f8 RDI: 000000000000006e
[53893.146449] RBP: 00007fd238277370 R08: 0000000671ba4f48 R09: 00000006000f4c78
[53893.146449] R10: 0000000000003658 R11: 0000000000000293 R12: 0000000000000001
[53893.146450] R13: 00007fd2592b63e8 R14: 00007fd2382773f8 R15: 00007fd188042000
[54255.649091] INFO: task txg_sync:4579 blocked for more than 120 seconds.
[54255.649143]       Tainted: P           O     4.15.18-3-pve #1
[54255.649220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54255.649295] txg_sync        D    0  4579      2 0x80000000
[54255.649297] Call Trace:
[54255.649305]  __schedule+0x3e0/0x870
[54255.649307]  schedule+0x36/0x80
[54255.649309]  io_schedule+0x16/0x40
[54255.649317]  cv_wait_common+0xb2/0x140 [spl]
[54255.649321]  ? wait_woken+0x80/0x80
[54255.649327]  __cv_wait_io+0x18/0x20 [spl]
[54255.649373]  zio_wait+0x103/0x1b0 [zfs]
[54255.649405]  dsl_pool_sync+0xb8/0x430 [zfs]
[54255.649439]  spa_sync+0x42d/0xd50 [zfs]
[54255.649475]  txg_sync_thread+0x2d4/0x4a0 [zfs]
[54255.649509]  ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[54255.649514]  thread_generic_wrapper+0x74/0x90 [spl]
[54255.649517]  kthread+0x105/0x140
[54255.649521]  ? __thread_exit+0x20/0x20 [spl]
[54255.649523]  ? kthread_create_worker_on_cpu+0x70/0x70
[54255.649525]  ret_from_fork+0x35/0x40
[54255.649627] INFO: task java:22601 blocked for more than 120 seconds.
[54255.649652]       Tainted: P           O     4.15.18-3-pve #1
[54255.649672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54255.649697] java            D    0 22601  24602 0x00000120
[54255.649699] Call Trace:
[54255.649702]  __schedule+0x3e0/0x870
[54255.649704]  schedule+0x36/0x80
[54255.649705]  io_schedule+0x16/0x40
[54255.649710]  cv_wait_common+0xb2/0x140 [spl]
[54255.649712]  ? wait_woken+0x80/0x80
[54255.649716]  __cv_wait_io+0x18/0x20 [spl]
[54255.649742]  zio_wait+0x103/0x1b0 [zfs]
[54255.649765]  zil_commit.part.14+0x4df/0x8b0 [zfs]
[54255.649787]  zil_commit+0x17/0x20 [zfs]
[54255.649808]  zfs_fsync+0x77/0xf0 [zfs]
[54255.649830]  zpl_fsync+0x68/0xa0 [zfs]
[54255.649832]  vfs_fsync_range+0x51/0xb0
[54255.649833]  do_fsync+0x3d/0x70
[54255.649834]  SyS_fsync+0x10/0x20
[54255.649837]  do_syscall_64+0x73/0x130
[54255.649838]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[54255.649839] RIP: 0033:0x7fd25cbfbad0
[54255.649840] RSP: 002b:00007fd1c7afb340 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[54255.649841] RAX: ffffffffffffffda RBX: 000000000000006e RCX: 00007fd25cbfbad0
[54255.649842] RDX: 0000000000000000 RSI: 00007fd1c7afb3f8 RDI: 000000000000006e
[54255.649842] RBP: 00007fd1c7afb370 R08: 00000005fffce188 R09: 00000005c98a36c8
[54255.649843] R10: 00000000000036de R11: 0000000000000293 R12: 0000000000000001
[54255.649843] R13: 00007fd2592b63e8 R14: 00007fd1c7afb3f8 R15: 00007fd18800c800
[54376.483434] INFO: task txg_sync:4579 blocked for more than 120 seconds.
[54376.483455]       Tainted: P           O     4.15.18-3-pve #1
[54376.483468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54376.483485] txg_sync        D    0  4579      2 0x80000000
[54376.483487] Call Trace:
[54376.483493]  __schedule+0x3e0/0x870
[54376.483495]  schedule+0x36/0x80
[54376.483497]  io_schedule+0x16/0x40
[54376.483503]  cv_wait_common+0xb2/0x140 [spl]
[54376.483505]  ? wait_woken+0x80/0x80
[54376.483508]  __cv_wait_io+0x18/0x20 [spl]
[54376.483540]  zio_wait+0x103/0x1b0 [zfs]
[54376.483559]  dsl_pool_sync+0xb8/0x430 [zfs]
[54376.483581]  spa_sync+0x42d/0xd50 [zfs]
[54376.483603]  txg_sync_thread+0x2d4/0x4a0 [zfs]
[54376.483625]  ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[54376.483628]  thread_generic_wrapper+0x74/0x90 [spl]
[54376.483630]  kthread+0x105/0x140
[54376.483632]  ? __thread_exit+0x20/0x20 [spl]
[54376.483634]  ? kthread_create_worker_on_cpu+0x70/0x70
[54376.483635]  ret_from_fork+0x35/0x40
[54376.483740] INFO: task java:22601 blocked for more than 120 seconds.
[54376.483753]       Tainted: P           O     4.15.18-3-pve #1
[54376.483766] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54376.483781] java            D    0 22601  24602 0x00000124
[54376.483783] Call Trace:
[54376.483784]  __schedule+0x3e0/0x870
[54376.483786]  schedule+0x36/0x80
[54376.483787]  io_schedule+0x16/0x40
[54376.483790]  cv_wait_common+0xb2/0x140 [spl]
[54376.483791]  ? wait_woken+0x80/0x80
[54376.483794]  __cv_wait_io+0x18/0x20 [spl]
[54376.483834]  zio_wait+0x103/0x1b0 [zfs]
[54376.483864]  zil_commit.part.14+0x4df/0x8b0 [zfs]
[54376.483920]  zil_commit+0x17/0x20 [zfs]
[54376.483964]  zfs_fsync+0x77/0xf0 [zfs]
[54376.484006]  zpl_fsync+0x68/0xa0 [zfs]
[54376.484010]  vfs_fsync_range+0x51/0xb0
[54376.484012]  do_fsync+0x3d/0x70
[54376.484018]  SyS_fsync+0x10/0x20
[54376.484020]  do_syscall_64+0x73/0x130
[54376.484022]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[54376.484024] RIP: 0033:0x7fd25cbfbad0
[54376.484025] RSP: 002b:00007fd1c7afb340 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[54376.484026] RAX: ffffffffffffffda RBX: 000000000000006e RCX: 00007fd25cbfbad0
[54376.484027] RDX: 0000000000000000 RSI: 00007fd1c7afb3f8 RDI: 000000000000006e
[54376.484028] RBP: 00007fd1c7afb370 R08: 00000005fffce188 R09: 00000005c98a36c8
[54376.484029] R10: 00000000000036de R11: 0000000000000293 R12: 0000000000000001
[54376.484030] R13: 00007fd2592b63e8 R14: 00007fd1c7afb3f8 R15: 00007fd18800c800
[54625.040569] vmbr1: port 13(tap1250i0) entered disabled state
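
As the log itself points out, the 120-second threshold is just a kernel setting; a minimal sketch of inspecting or adjusting it (the 300 s value is illustrative, not a recommendation):

# current threshold in seconds; 0 disables the warning entirely
cat /proc/sys/kernel/hung_task_timeout_secs

# raise the threshold for the running kernel (resets at reboot)
sysctl -w kernel.hung_task_timeout_secs=300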

Output of zfs get all mypool/ds1:

mypool/ds1  type                   filesystem                 -
mypool/ds1  creation               Mon Feb 12 21:54 2018      -
mypool/ds1  used                   1.97T                      -
mypool/ds1  available              672G                       -
mypool/ds1  referenced             945G                       -
mypool/ds1  compressratio          1.01x                      -
mypool/ds1  mounted                yes                        -
mypool/ds1  quota                  none                       default
mypool/ds1  reservation            none                       default
mypool/ds1  recordsize             1M                         local
mypool/ds1  mountpoint             /mypool/ds1                default
mypool/ds1  sharenfs               off                        default
mypool/ds1  checksum               fletcher4                  local
mypool/ds1  compression            lz4                        local
mypool/ds1  atime                  on                         local
mypool/ds1  devices                on                         default
mypool/ds1  exec                   on                         default
mypool/ds1  setuid                 on                         default
mypool/ds1  readonly               off                        default
mypool/ds1  zoned                  off                        default
mypool/ds1  snapdir                hidden                     default
mypool/ds1  aclinherit             passthrough                local
mypool/ds1  createtxg              28                         -
mypool/ds1  canmount               on                         default
mypool/ds1  xattr                  sa                         local
mypool/ds1  copies                 2                          local
mypool/ds1  version                5                          -
mypool/ds1  utf8only               on                         -
mypool/ds1  normalization          formD                      -
mypool/ds1  casesensitivity        sensitive                  -
mypool/ds1  vscan                  off                        default
mypool/ds1  nbmand                 off                        default
mypool/ds1  sharesmb               off                        default
mypool/ds1  refquota               none                       default
mypool/ds1  refreservation         none                       default
mypool/ds1  guid                   1095692504750449694        -
mypool/ds1  primarycache           all                        local
mypool/ds1  secondarycache         all                        local
mypool/ds1  usedbysnapshots        1.05T                      -
mypool/ds1  usedbydataset          945G                       -
mypool/ds1  usedbychildren         0B                         -
mypool/ds1  usedbyrefreservation   0B                         -
mypool/ds1  logbias                latency                    default
mypool/ds1  dedup                  off                        default
mypool/ds1  mlslabel               none                       default
mypool/ds1  sync                   standard                   default
mypool/ds1  dnodesize              legacy                     default
mypool/ds1  refcompressratio       1.01x                      -
mypool/ds1  written                9.30G                      -
mypool/ds1  logicalused            1.33T                      -
mypool/ds1  logicalreferenced      678G                       -
mypool/ds1  volmode                default                    default
mypool/ds1  filesystem_limit       none                       default
mypool/ds1  snapshot_limit         none                       default
mypool/ds1  filesystem_count       none                       default
mypool/ds1  snapshot_count         none                       default
mypool/ds1  snapdev                hidden                     default
mypool/ds1  acltype                posixacl                   local
mypool/ds1  context                none                       default
mypool/ds1  fscontext              none                       default
mypool/ds1  defcontext             none                       default
mypool/ds1  rootcontext            none                       default
mypool/ds1  relatime               off                        default
mypool/ds1  redundant_metadata     all                        local
mypool/ds1  overlay                off                        default
mypool/ds1  com.sun:auto-snapshot  false                      inherited from mypool
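
If the doubled write stream is the trigger, the property can be reverted per dataset; note that copies only affects data written after the change, so previously written blocks keep their second copy. A sketch, using the dataset from this report:

# new writes go back to a single copy; existing blocks are unaffected
zfs set copies=1 mypool/ds1
zfs get copies mypool/ds1
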
jumbi77 (Contributor, Author) commented Sep 23, 2018

@kpande So I can safely ignore this message even though it's a call trace?

jumbi77 (Contributor, Author) commented Sep 23, 2018

Thanks for the quick answer. Just to make sure I understand: when the HDD is too slow to handle the load, does the data still get written to disk completely (it just takes longer than usual), or is the call trace evidence that not all data was written completely/correctly? Thanks in advance.

behlendorf (Contributor) commented

@jumbi77 everything will always be written safely. The 'blocked for more than 120 seconds' warning indicates that a kernel thread was blocked for longer than expected. The warning is advisory and useful for debugging/troubleshooting; it is often caused by a slow disk or an oversubscribed I/O subsystem.
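
One way to test the slow-disk theory is to watch per-vdev latency while the load runs; a sketch, assuming a ZFS release whose zpool iostat supports the latency options (0.7.x does):

# average per-vdev wait times; large total_wait/asyncq_wait values
# point at a saturated disk rather than a ZFS bug
zpool iostat -l mypool 5

# full request-latency histograms for the same pool
zpool iostat -w mypool 5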

bunder2015 (Contributor) commented

Just to clarify: I had considered writing an article on this, but I don't have much to add beyond noting that failmode=wait combined with the sudden loss of pool disks can also cause this same issue.
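
For completeness, the pool property mentioned above can be inspected and changed; a minimal sketch (pool name taken from this report):

# failmode=wait (the default) blocks I/O until a lost device returns;
# failmode=continue returns EIO for new writes instead
zpool get failmode mypool
zpool set failmode=continue mypool

# quick check for pools with missing or faulted devices
zpool status -x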
