This Linux kernel change "zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature" is included in the Linux 4.18 release. This change is authored by Minchan Kim <minchan [at] kernel.org> on Fri Aug 10 17:23:10 2018 -0700. The commit for this change in Linux stable tree is 4f7a7be (patch).
zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature If zram supports writeback feature, it's no longer a BD_CAP_SYNCHRONOUS_IO device beause zram does asynchronous IO operations for incompressible pages. Do not pretend to be synchronous IO device. It makes the system very sluggish due to waiting for IO completion from upper layers. Furthermore, it causes a user-after-free problem because swap thinks the opearion is done when the IO functions returns so it can free the page (e.g., lock_page_or_retry and goto out_release in do_swap_page) but in fact, IO is asynchronous so the driver could access a just freed page afterward. This patch fixes the problem. BUG: Bad page state in process qemu-system-x86 pfn:3dfab21 page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1 flags: 0x17fffc000000008(uptodate) raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set bad because of flags: 0x8(uptodate) CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G B 4.18.0-rc5+ #1 Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017 Call Trace: dump_stack+0x5c/0x7b bad_page+0xba/0x120 get_page_from_freelist+0x1016/0x1250 __alloc_pages_nodemask+0xfa/0x250 alloc_pages_vma+0x7c/0x1c0 do_swap_page+0x347/0x920 __handle_mm_fault+0x7b4/0x1110 handle_mm_fault+0xfc/0x1f0 __get_user_pages+0x12f/0x690 get_user_pages_unlocked+0x148/0x1f0 __gfn_to_pfn_memslot+0xff/0x3c0 [kvm] try_async_pf+0x87/0x230 [kvm] tdp_page_fault+0x132/0x290 [kvm] kvm_mmu_page_fault+0x74/0x570 [kvm] kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm] kvm_vcpu_ioctl+0x388/0x5d0 [kvm] do_vfs_ioctl+0xa2/0x630 ksys_ioctl+0x70/0x80 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x55/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: https://lore.kernel.org/lkml/[email protected]/ Link: http://lkml.kernel.org/r/[email protected] [[email protected]: fix changelog, add comment] Link: https://lore.kernel.org/lkml/[email protected]/ Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: coding-style fixes] Signed-off-by: Minchan Kim <min[email protected]> Reported-by: Tino Lehnig <[email protected]> Tested-by: Tino Lehnig <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Jens Axboe <[email protected]> Cc: <[email protected]> [4.15+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
There are 15 lines of Linux source code added/deleted in this change. Code changes to Linux kernel are as follows.
drivers/block/zram/zram_drv.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 7436b2d..a390c6d 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -298,7 +298,8 @@ static void reset_bdev(struct zram *zram) zram->backing_dev = NULL; zram->old_block_size = 0; zram->bdev = NULL; - + zram->disk->queue->backing_dev_info->capabilities |= + BDI_CAP_SYNCHRONOUS_IO; kvfree(zram->bitmap); zram->bitmap = NULL; } @@ -400,6 +401,18 @@ static ssize_t backing_dev_store(struct device *dev, zram->backing_dev = backing_dev; zram->bitmap = bitmap; zram->nr_pages = nr_pages; + /* + * With writeback feature, zram does asynchronous IO so it's no longer + * synchronous device so let's remove synchronous io flag. Othewise, + * upper layer(e.g., swap) could wait IO completion rather than + * (submit and return), which will cause system sluggish. + * Furthermore, when the IO function returns(e.g., swap_readpage), + * upper layer expects IO was done so it could deallocate the page + * freely but in fact, IO is going on so finally could cause + * use-after-free when the IO is really done. + */ + zram->disk->queue->backing_dev_info->capabilities &= + ~BDI_CAP_SYNCHRONOUS_IO; up_write(&zram->init_lock); pr_info("setup backing device %s\n", file_name);