From 6f9524e9e118929f1de02840dffe858f99685aea Mon Sep 17 00:00:00 2001 From: Lukas Czerner Date: Mon, 21 Feb 2011 20:16:21 -0500 Subject: ext4: update ext4 documentation Add documentation for mount options and ioctls to Documentation/filesystems/ext4.txt, which has not been updated for some time. Also add documentation for ext4 sysfs tunables to the Documentation/ABI/testing/sysfs-fs-ext4 file, and fix a few typographical errors in that file. https://bugzilla.kernel.org/show_bug.cgi?id=9423 Signed-off-by: Lukas Czerner Signed-off-by: "Theodore Ts'o" --- Documentation/ABI/testing/sysfs-fs-ext4 | 13 +- Documentation/filesystems/ext4.txt | 207 +++++++++++++++++++++++++++++++- 2 files changed, 216 insertions(+), 4 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-fs-ext4 b/Documentation/ABI/testing/sysfs-fs-ext4 index 5fb709997d9..f22ac0872ae 100644 --- a/Documentation/ABI/testing/sysfs-fs-ext4 +++ b/Documentation/ABI/testing/sysfs-fs-ext4 @@ -48,7 +48,7 @@ Description: will have its blocks allocated out of its own unique preallocation pool. -What: /sys/fs/ext4//inode_readahead +What: /sys/fs/ext4//inode_readahead_blks Date: March 2008 Contact: "Theodore Ts'o" Description: @@ -85,7 +85,14 @@ Date: June 2008 Contact: "Theodore Ts'o" Description: Tuning parameter which (if non-zero) controls the goal - inode used by the inode allocator in p0reference to - all other allocation hueristics. This is intended for + inode used by the inode allocator in preference to + all other allocation heuristics. This is intended for debugging use only, and should be 0 on production systems. + +What: /sys/fs/ext4//max_writeback_mb_bump +Date: September 2009 +Contact: "Theodore Ts'o" +Description: + The maximum number of megabytes the writeback code will + try to write out before move on to another inode.
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 6ab9442d7ee..6b050464a90 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -367,12 +367,47 @@ init_itable=n The lazy itable init code will wait n times the minimizes the impact on the systme performance while file system's inode table is being initialized. -discard Controls whether ext4 should issue discard/TRIM +discard Controls whether ext4 should issue discard/TRIM nodiscard(*) commands to the underlying block device when blocks are freed. This is useful for SSD devices and sparse/thinly-provisioned LUNs, but it is off by default until sufficient testing has been done. +nouid32 Disables 32-bit UIDs and GIDs. This is for + interoperability with older kernels which only + store and expect 16-bit values. + +resize Allows to resize filesystem to the end of the last + existing block group, further resize has to be done + with resize2fs either online, or offline. It can be + used only with conjunction with remount. + +block_validity This options allows to enables/disables the in-kernel +noblock_validity facility for tracking filesystem metadata blocks + within internal data structures. This allows multi- + block allocator and other routines to quickly locate + extents which might overlap with filesystem metadata + blocks. This option is intended for debugging + purposes and since it negatively affects the + performance, it is off by default. + +dioread_lock Controls whether or not ext4 should use the DIO read +dioread_nolock locking. If the dioread_nolock option is specified + ext4 will allocate uninitialized extent before buffer + write and convert the extent to initialized after IO + completes. This approach allows ext4 code to avoid + using inode mutex, which improves scalability on high + speed storages. However this does not work with nobh + option and the mount will fail. 
Nor does it work with + data journaling and dioread_nolock option will be + ignored with kernel warning. Note that dioread_nolock + code path is only used for extent-based files. + Because of the restrictions this options comprises + it is off by default (e.g. dioread_lock). + +i_version Enable 64-bit inode version support. This option is + off by default. + Data Mode ========= There are 3 different data modes: @@ -400,6 +435,176 @@ needs to be read from and written to disk at the same time where it outperforms all others modes. Currently ext4 does not have delayed allocation support if this data journalling mode is selected. +/proc entries +============= + +Information about mounted ext4 file systems can be found in +/proc/fs/ext4. Each mounted filesystem will have a directory in +/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or +/proc/fs/ext4/dm-0). The files in each per-device directory are shown +in table below. + +Files in /proc/fs/ext4/ +.............................................................................. + File Content + mb_groups details of multiblock allocator buddy cache of free blocks +.............................................................................. + +/sys entries +============ + +Information about mounted ext4 file systems can be found in +/sys/fs/ext4. Each mounted filesystem will have a directory in +/sys/fs/ext4 based on its device name (i.e., /sys/fs/ext4/hdc or +/sys/fs/ext4/dm-0). The files in each per-device directory are shown +in table below. + +Files in /sys/fs/ext4/ +(see also Documentation/ABI/testing/sysfs-fs-ext4) +.............................................................................. + File Content + + delayed_allocation_blocks This file is read-only and shows the number of + blocks that are dirty in the page cache, but + which do not have their location in the + filesystem allocated yet. 
+ + inode_goal Tuning parameter which (if non-zero) controls + the goal inode used by the inode allocator in + preference to all other allocation heuristics. + This is intended for debugging use only, and + should be 0 on production systems. + + inode_readahead_blks Tuning parameter which controls the maximum + number of inode table blocks that ext4's inode + table readahead algorithm will pre-read into + the buffer cache + + lifetime_write_kbytes This file is read-only and shows the number of + kilobytes of data that have been written to this + filesystem since it was created. + + max_writeback_mb_bump The maximum number of megabytes the writeback + code will try to write out before move on to + another inode. + + mb_group_prealloc The multiblock allocator will round up allocation + requests to a multiple of this tuning parameter if + the stripe size is not set in the ext4 superblock + + mb_max_to_scan The maximum number of extents the multiblock + allocator will search to find the best extent + + mb_min_to_scan The minimum number of extents the multiblock + allocator will search to find the best extent + + mb_order2_req Tuning parameter which controls the minimum size + for requests (as a power of 2) where the buddy + cache is used + + mb_stats Controls whether the multiblock allocator should + collect statistics, which are shown during the + unmount. 1 means to collect statistics, 0 means + not to collect statistics + + mb_stream_req Files which have fewer blocks than this tunable + parameter will have their blocks allocated out + of a block group specific preallocation pool, so + that small files are packed closely together. + Each large file will have its blocks allocated + out of its own unique preallocation pool. + + session_write_kbytes This file is read-only and shows the number of + kilobytes of data that have been written to this + filesystem since it was mounted. +.............................................................................. 
+ +Ioctls +====== + +There is some Ext4 specific functionality which can be accessed by applications +through the system call interfaces. The list of all Ext4 specific ioctls are +shown in the table below. + +Table of Ext4 specific ioctls +.............................................................................. + Ioctl Description + EXT4_IOC_GETFLAGS Get additional attributes associated with inode. + The ioctl argument is an integer bitfield, with + bit values described in ext4.h. This ioctl is an + alias for FS_IOC_GETFLAGS. + + EXT4_IOC_SETFLAGS Set additional attributes associated with inode. + The ioctl argument is an integer bitfield, with + bit values described in ext4.h. This ioctl is an + alias for FS_IOC_SETFLAGS. + + EXT4_IOC_GETVERSION + EXT4_IOC_GETVERSION_OLD + Get the inode i_generation number stored for + each inode. The i_generation number is normally + changed only when new inode is created and it is + particularly useful for network filesystems. The + '_OLD' version of this ioctl is an alias for + FS_IOC_GETVERSION. + + EXT4_IOC_SETVERSION + EXT4_IOC_SETVERSION_OLD + Set the inode i_generation number stored for + each inode. The '_OLD' version of this ioctl + is an alias for FS_IOC_SETVERSION. + + EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize + mount option. It allows to resize filesystem + to the end of the last existing block group, + further resize has to be done with resize2fs, + either online, or offline. The argument points + to the unsigned logn number representing the + filesystem new block count. + + EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one + this ioctl is pointing to) to the donor_fd (the + one specified in move_extent structure passed + as an argument to this ioctl). Then, exchange + inode metadata between orig_fd and donor_fd. 
+ This is especially useful for online + defragmentation, because the allocator has the + opportunity to allocate moved blocks better, + ideally into one contiguous extent. + + EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or + new group descriptor block. The new group + descriptor is described by ext4_new_group_input + structure, which is passed as an argument to + this ioctl. This is especially useful in + conjunction with EXT4_IOC_GROUP_EXTEND, + which allows online resize of the filesystem + to the end of the last existing block group. + Those two ioctls combined is used in userspace + online resize tool (e.g. resize2fs). + + EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself. + It converts (migrates) ext3 indirect block mapped + inode to ext4 extent mapped inode by walking + through indirect block mapping of the original + inode and converting contiguous block ranges + into ext4 extents of the temporary inode. Then, + inodes are swapped. This ioctl might help, when + migrating from ext3 to ext4 filesystem, however + suggestion is to create fresh ext4 filesystem + and copy data from the backup. Note, that + filesystem has to support extents for this ioctl + to work. + + EXT4_IOC_ALLOC_DA_BLKS Force all of the delay allocated blocks to be + allocated to preserve application-expected ext3 + behaviour. Note that this will also start + triggering a write of the data blocks, but this + behaviour may change in the future as it is + not necessary and has been done this way only + for sake of simplicity. +.............................................................................. + References ========== -- cgit v1.2.3 From da488945f4bf4096f4ab6091938469bd8822cfec Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Mon, 21 Feb 2011 20:39:58 -0500 Subject: ext4: fix compile warnings with EXT4FS_DEBUG enabled Compiling 2.6.38-rc1 with EXT4FS_DEBUG turned on produces the following compile warnings. This patch fixes them.
CC fs/ext4/hash.o CC fs/ext4/resize.o fs/ext4/resize.c: In function 'setup_new_group_blocks': fs/ext4/resize.c:233:2: warning: format '%#04llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' fs/ext4/resize.c:251:2: warning: format '%#04llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' CC fs/ext4/extents.o CC fs/ext4/ext4_jbd2.o CC fs/ext4/migrate.o Reported-by: Akira Fujita Signed-off-by: "Theodore Ts'o" --- fs/ext4/resize.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 3ecc6e45d2f..66fec4ee76f 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -230,7 +230,7 @@ static int setup_new_group_blocks(struct super_block *sb, } /* Zero out all of the reserved backup group descriptor table blocks */ - ext4_debug("clear inode table blocks %#04llx -> %#04llx\n", + ext4_debug("clear inode table blocks %#04llx -> %#04lx\n", block, sbi->s_itb_per_group); err = sb_issue_zeroout(sb, gdblocks + start + 1, reserved_gdb, GFP_NOFS); @@ -248,7 +248,7 @@ static int setup_new_group_blocks(struct super_block *sb, /* Zero out all of the inode table blocks */ block = input->inode_table; - ext4_debug("clear inode table blocks %#04llx -> %#04llx\n", + ext4_debug("clear inode table blocks %#04llx -> %#04lx\n", block, sbi->s_itb_per_group); err = sb_issue_zeroout(sb, block, sbi->s_itb_per_group, GFP_NOFS); if (err) -- cgit v1.2.3 From 7dc576158d7e5cdff3349f78598fdb4080536342 Mon Sep 17 00:00:00 2001 From: Peter Huewe Date: Mon, 21 Feb 2011 21:01:42 -0500 Subject: ext4: Fix sparse warning: Using plain integer as NULL pointer This patch fixes the warning "Using plain integer as NULL pointer", generated by sparse, by replacing the offending 0s with NULL. 
Signed-off-by: Peter Huewe Signed-off-by: "Theodore Ts'o" --- fs/ext4/extents.c | 8 ++++---- fs/ext4/ialloc.c | 2 +- fs/ext4/inode.c | 18 +++++++++--------- fs/ext4/migrate.c | 10 +++++----- fs/ext4/page-io.c | 4 ++-- fs/ext4/super.c | 2 +- fs/ext4/xattr.c | 2 +- 7 files changed, 23 insertions(+), 23 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index ccce8a7e94e..d16f6b5a140 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1034,7 +1034,7 @@ cleanup: for (i = 0; i < depth; i++) { if (!ablocks[i]) continue; - ext4_free_blocks(handle, inode, 0, ablocks[i], 1, + ext4_free_blocks(handle, inode, NULL, ablocks[i], 1, EXT4_FREE_BLOCKS_METADATA); } } @@ -2059,7 +2059,7 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode, if (err) return err; ext_debug("index is empty, remove it, free block %llu\n", leaf); - ext4_free_blocks(handle, inode, 0, leaf, 1, + ext4_free_blocks(handle, inode, NULL, leaf, 1, EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET); return err; } @@ -2156,7 +2156,7 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode, num = le32_to_cpu(ex->ee_block) + ee_len - from; start = ext4_ext_pblock(ex) + ee_len - num; ext_debug("free last %u blocks starting %llu\n", num, start); - ext4_free_blocks(handle, inode, 0, start, num, flags); + ext4_free_blocks(handle, inode, NULL, start, num, flags); } else if (from == le32_to_cpu(ex->ee_block) && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) { printk(KERN_INFO "strange request: removal %u-%u from %u:%u\n", @@ -3485,7 +3485,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, /* not a good idea to call discard here directly, * but otherwise we'd need to call it every free() */ ext4_discard_preallocations(inode); - ext4_free_blocks(handle, inode, 0, ext4_ext_pblock(&newex), + ext4_free_blocks(handle, inode, NULL, ext4_ext_pblock(&newex), ext4_ext_get_actual_len(&newex), 0); goto out2; } diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 
eb9097aec6f..2fd3b0e4178 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -649,7 +649,7 @@ static int find_group_other(struct super_block *sb, struct inode *parent, *group = parent_group + flex_size; if (*group > ngroups) *group = 0; - return find_group_orlov(sb, parent, group, mode, 0); + return find_group_orlov(sb, parent, group, mode, NULL); } /* diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9f7f9e49914..c6c6b7fcb45 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -720,7 +720,7 @@ allocated: return ret; failed_out: for (i = 0; i < index; i++) - ext4_free_blocks(handle, inode, 0, new_blocks[i], 1, 0); + ext4_free_blocks(handle, inode, NULL, new_blocks[i], 1, 0); return ret; } @@ -823,20 +823,20 @@ static int ext4_alloc_branch(handle_t *handle, struct inode *inode, return err; failed: /* Allocation failed, free what we already allocated */ - ext4_free_blocks(handle, inode, 0, new_blocks[0], 1, 0); + ext4_free_blocks(handle, inode, NULL, new_blocks[0], 1, 0); for (i = 1; i <= n ; i++) { /* * branch[i].bh is newly allocated, so there is no * need to revoke the block, which is why we don't * need to set EXT4_FREE_BLOCKS_METADATA. 
*/ - ext4_free_blocks(handle, inode, 0, new_blocks[i], 1, + ext4_free_blocks(handle, inode, NULL, new_blocks[i], 1, EXT4_FREE_BLOCKS_FORGET); } for (i = n+1; i < indirect_blks; i++) - ext4_free_blocks(handle, inode, 0, new_blocks[i], 1, 0); + ext4_free_blocks(handle, inode, NULL, new_blocks[i], 1, 0); - ext4_free_blocks(handle, inode, 0, new_blocks[i], num, 0); + ext4_free_blocks(handle, inode, NULL, new_blocks[i], num, 0); return err; } @@ -924,7 +924,7 @@ err_out: ext4_free_blocks(handle, inode, where[i].bh, 0, 1, EXT4_FREE_BLOCKS_FORGET); } - ext4_free_blocks(handle, inode, 0, le32_to_cpu(where[num].key), + ext4_free_blocks(handle, inode, NULL, le32_to_cpu(where[num].key), blks, 0); return err; @@ -4228,7 +4228,7 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode, for (p = first; p < last; p++) *p = 0; - ext4_free_blocks(handle, inode, 0, block_to_free, count, flags); + ext4_free_blocks(handle, inode, NULL, block_to_free, count, flags); return 0; } @@ -4416,7 +4416,7 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode, * transaction where the data blocks are * actually freed. 
*/ - ext4_free_blocks(handle, inode, 0, nr, 1, + ext4_free_blocks(handle, inode, NULL, nr, 1, EXT4_FREE_BLOCKS_METADATA| EXT4_FREE_BLOCKS_FORGET); @@ -4875,7 +4875,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) return inode; ei = EXT4_I(inode); - iloc.bh = 0; + iloc.bh = NULL; ret = __ext4_get_inode_loc(inode, &iloc, 0); if (ret < 0) diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index b0a126f23c2..d1bafa57f48 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -263,7 +263,7 @@ static int free_dind_blocks(handle_t *handle, for (i = 0; i < max_entries; i++) { if (tmp_idata[i]) { extend_credit_for_blkdel(handle, inode); - ext4_free_blocks(handle, inode, 0, + ext4_free_blocks(handle, inode, NULL, le32_to_cpu(tmp_idata[i]), 1, EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET); @@ -271,7 +271,7 @@ static int free_dind_blocks(handle_t *handle, } put_bh(bh); extend_credit_for_blkdel(handle, inode); - ext4_free_blocks(handle, inode, 0, le32_to_cpu(i_data), 1, + ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1, EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET); return 0; @@ -302,7 +302,7 @@ static int free_tind_blocks(handle_t *handle, } put_bh(bh); extend_credit_for_blkdel(handle, inode); - ext4_free_blocks(handle, inode, 0, le32_to_cpu(i_data), 1, + ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1, EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET); return 0; @@ -315,7 +315,7 @@ static int free_ind_block(handle_t *handle, struct inode *inode, __le32 *i_data) /* ei->i_data[EXT4_IND_BLOCK] */ if (i_data[0]) { extend_credit_for_blkdel(handle, inode); - ext4_free_blocks(handle, inode, 0, + ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data[0]), 1, EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET); @@ -428,7 +428,7 @@ static int free_ext_idx(handle_t *handle, struct inode *inode, } put_bh(bh); extend_credit_for_blkdel(handle, inode); - ext4_free_blocks(handle, inode, 0, block, 1, + 
ext4_free_blocks(handle, inode, NULL, block, 1, EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET); return retval; } diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 955cc309142..68d92a8f71d 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -279,9 +279,9 @@ void ext4_io_submit(struct ext4_io_submit *io) BUG_ON(bio_flagged(io->io_bio, BIO_EOPNOTSUPP)); bio_put(io->io_bio); } - io->io_bio = 0; + io->io_bio = NULL; io->io_op = 0; - io->io_end = 0; + io->io_end = NULL; } static int io_submit_init(struct ext4_io_submit *io, diff --git a/fs/ext4/super.c b/fs/ext4/super.c index f6a318f836b..ef83457fd4e 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1451,7 +1451,7 @@ static int parse_options(char *options, struct super_block *sb, * Initialize args struct so we know whether arg was * found; some options take optional arguments. */ - args[0].to = args[0].from = 0; + args[0].to = args[0].from = NULL; token = match_token(p, tokens, args); switch (token) { case Opt_bsd_df: diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index fc32176eee3..f4c03af05d6 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -833,7 +833,7 @@ inserted: new_bh = sb_getblk(sb, block); if (!new_bh) { getblk_failed: - ext4_free_blocks(handle, inode, 0, block, 1, + ext4_free_blocks(handle, inode, NULL, block, 1, EXT4_FREE_BLOCKS_METADATA); error = -EIO; goto cleanup; -- cgit v1.2.3 From 5dbd571d875d73e087c1eeb3d840cfc653a97422 Mon Sep 17 00:00:00 2001 From: "Alexander V. Lukyanov" Date: Mon, 21 Feb 2011 21:33:21 -0500 Subject: ext4: allow inode_readahead_blks=0 (linux-2.6.37) I cannot disable inode-read-ahead feature of ext4 (on 2.6.37): # echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks bash: echo: write error: Invalid argument On a server with lots of small files and random access this read-ahead makes performance worse, and I'd like to disable it. I work around this problem by using value of 1, but it still reads an extra block. 
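For reference, the validation rule this report asks for (zero, or a power of two no larger than 1 << 30) can be sketched in userspace C. This is only a model under stated assumptions, not the kernel code: `readahead_blks_valid` is a hypothetical name, and `is_power_of_2` is reimplemented locally to mirror the kernel helper of the same name, which returns false for zero.

```c
#include <stdbool.h>

/* Local stand-in for the kernel helper; note that it is false for zero. */
static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical validator: zero disables inode readahead, any other
 * value must be a power of two and must not exceed 1 << 30.
 */
static bool readahead_blks_valid(unsigned long t)
{
	if (t > (1UL << 30))
		return false;
	return t == 0 || is_power_of_2(t);
}
```

With this rule, writing 0 is accepted while values such as 3 are still rejected.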
This patch fixes the problem by checking for zero explicitly. Signed-off-by: Alexander V. Lukyanov Signed-off-by: "Theodore Ts'o" --- fs/ext4/super.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index ef83457fd4e..a1ac24b6a75 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1771,7 +1771,7 @@ set_qf_format: return 0; if (option < 0 || option > (1 << 30)) return 0; - if (!is_power_of_2(option)) { + if (option && !is_power_of_2(option)) { ext4_msg(sb, KERN_ERR, "EXT4-fs: inode_readahead_blks" " must be a power of 2"); @@ -2412,7 +2412,7 @@ static ssize_t inode_readahead_blks_store(struct ext4_attr *a, if (parse_strtoul(buf, 0x40000000, &t)) return -EINVAL; - if (!is_power_of_2(t)) + if (t && !is_power_of_2(t)) return -EINVAL; sbi->s_inode_readahead_blks = t; -- cgit v1.2.3 From 0b75a840120b1e647e32342e9cc46631410088d5 Mon Sep 17 00:00:00 2001 From: Lukas Czerner Date: Wed, 23 Feb 2011 12:22:49 -0500 Subject: ext4: mark file-local functions and variables as static Signed-off-by: Lukas Czerner Signed-off-by: "Theodore Ts'o" --- fs/ext4/mballoc.c | 3 ++- fs/ext4/super.c | 6 +++--- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index d1fe09aea73..ae4d7f5edbb 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4753,7 +4753,8 @@ static int ext4_trim_extent(struct super_block *sb, int start, int count, * bitmap. Then issue a TRIM command on this extent and free the extent in * the group buddy bitmap. This is done until whole group is scanned. 
*/ -ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b, +static ext4_grpblk_t +ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b, ext4_grpblk_t start, ext4_grpblk_t max, ext4_grpblk_t minblocks) { void *bitmap; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index a1ac24b6a75..1539cf55978 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -54,9 +54,9 @@ static struct proc_dir_entry *ext4_proc_root; static struct kset *ext4_kset; -struct ext4_lazy_init *ext4_li_info; -struct mutex ext4_li_mtx; -struct ext4_features *ext4_feat; +static struct ext4_lazy_init *ext4_li_info; +static struct mutex ext4_li_mtx; +static struct ext4_features *ext4_feat; static int ext4_load_journal(struct super_block *, struct ext4_super_block *, unsigned long journal_devnum); -- cgit v1.2.3 From 4143179218960a70d821a425e3c23ce44aa93dee Mon Sep 17 00:00:00 2001 From: Lukas Czerner Date: Wed, 23 Feb 2011 12:42:32 -0500 Subject: ext4: check if device support discard in FITRIM ioctl For a device that does not support discard, the FITRIM ioctl returns -EOPNOTSUPP when blkdev_issue_discard() returns this error code, which is how the user is informed that the device does not support discard. If there are no suitable free extents to be trimmed, then FITRIM will return success even though the device does not support discard, which could confuse the user. So check explicitly if the device supports discard and return an error code at the beginning of the FITRIM ioctl processing. 
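From userspace, the behaviour described above would surface as an errno value from the FITRIM ioctl. The following is a minimal sketch, assuming a Linux system with <linux/fs.h>; `trim_filesystem` is a hypothetical helper name, not part of any existing tool. It returns 0 on success or a negative errno value, so a caller can test for -EOPNOTSUPP directly.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the filesystem containing `path` to discard
 * its free extents of at least `minlen` bytes. On success, the number
 * of bytes trimmed (as reported back in range.len) is stored in
 * *trimmed if that pointer is non-NULL.
 */
static int trim_filesystem(const char *path, uint64_t minlen, uint64_t *trimmed)
{
	struct fstrim_range range = {
		.start = 0,
		.len = UINT64_MAX,	/* cover the whole filesystem */
		.minlen = minlen,
	};
	int ret = 0;
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	if (ioctl(fd, FITRIM, &range) < 0)
		ret = -errno;	/* e.g. -EOPNOTSUPP or -EPERM */
	else if (trimmed)
		*trimmed = range.len;

	close(fd);
	return ret;
}
```

A caller would typically treat -EOPNOTSUPP as "this device cannot discard" and skip further trim attempts on that filesystem.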
Signed-off-by: Lukas Czerner Signed-off-by: "Theodore Ts'o" --- fs/ext4/ioctl.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index eb3bc2fe647..25ba7c79d28 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -334,12 +334,16 @@ mext_out: case FITRIM: { struct super_block *sb = inode->i_sb; + struct request_queue *q = bdev_get_queue(sb->s_bdev); struct fstrim_range range; int ret = 0; if (!capable(CAP_SYS_ADMIN)) return -EPERM; + if (!blk_queue_discard(q)) + return -EOPNOTSUPP; + if (copy_from_user(&range, (struct fstrim_range *)arg, sizeof(range))) return -EFAULT; -- cgit v1.2.3 From 5c2ed62fd447e2c696e222dcf71d1322bbbc58d4 Mon Sep 17 00:00:00 2001 From: Lukas Czerner Date: Wed, 23 Feb 2011 17:49:51 -0500 Subject: ext4: Adjust minlen with discard_granularity in the FITRIM ioctl Discard granularity tells us the minimum size of extent that can be discarded by the device. If the user supplies a minimum extent that should be discarded (range.minlen) which is smaller than the discard granularity, increase minlen to the discard granularity, since there's no point submitting trim requests that the device will reject anyway. 
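The adjustment amounts to taking the larger of the two values. A minimal userspace sketch, with `effective_minlen` as a hypothetical name and both quantities assumed to be in bytes; it keeps minlen as a 64-bit value, whereas the patch itself compares it as unsigned int.

```c
#include <stdint.h>

/*
 * Hypothetical model: raise a too-small minimum extent length to the
 * device's discard granularity; larger requests pass through unchanged.
 */
static uint64_t effective_minlen(uint64_t minlen, uint32_t discard_granularity)
{
	return minlen > discard_granularity ? minlen : discard_granularity;
}
```

For example, a request of 512 bytes against a 4096-byte granularity is raised to 4096, while a request of 8192 bytes is left alone.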
Signed-off-by: Lukas Czerner Signed-off-by: "Theodore Ts'o" --- fs/ext4/ioctl.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 25ba7c79d28..c052c9f0f3a 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -348,6 +348,8 @@ mext_out: sizeof(range))) return -EFAULT; + range.minlen = max((unsigned int)range.minlen, + q->limits.discard_granularity); ret = ext4_trim_fs(sb, &range); if (ret < 0) return ret; -- cgit v1.2.3 From ea6633369458992241599c9d9ebadffaeddec164 Mon Sep 17 00:00:00 2001 From: Eric Sandeen Date: Wed, 23 Feb 2011 17:51:51 -0500 Subject: ext4: enable acls and user_xattr by default There's no good reason to require the extra step of providing a mount option for acl or user_xattr once the feature is configured on; no other filesystem that I know of requires this. Userspace patches have set these options in default mount options, and this patch makes them default in the kernel. At some point we can start to deprecate the options, perhaps. For now I've removed default mount option checks in show_options() to be explicit about what's set, since it's changing the default, but I'm open to alternatives if desired. 
Signed-off-by: Eric Sandeen Signed-off-by: "Theodore Ts'o" --- fs/ext4/super.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 1539cf55978..a665d2fb70c 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -997,13 +997,10 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs) if (test_opt(sb, OLDALLOC)) seq_puts(seq, ",oldalloc"); #ifdef CONFIG_EXT4_FS_XATTR - if (test_opt(sb, XATTR_USER) && - !(def_mount_opts & EXT4_DEFM_XATTR_USER)) + if (test_opt(sb, XATTR_USER)) seq_puts(seq, ",user_xattr"); - if (!test_opt(sb, XATTR_USER) && - (def_mount_opts & EXT4_DEFM_XATTR_USER)) { + if (!test_opt(sb, XATTR_USER)) seq_puts(seq, ",nouser_xattr"); - } #endif #ifdef CONFIG_EXT4_FS_POSIX_ACL if (test_opt(sb, POSIX_ACL) && !(def_mount_opts & EXT4_DEFM_ACL)) @@ -3095,13 +3092,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) } if (def_mount_opts & EXT4_DEFM_UID16) set_opt(sb, NO_UID32); + /* xattr user namespace & acls are now defaulted on */ #ifdef CONFIG_EXT4_FS_XATTR - if (def_mount_opts & EXT4_DEFM_XATTR_USER) - set_opt(sb, XATTR_USER); + set_opt(sb, XATTR_USER); #endif #ifdef CONFIG_EXT4_FS_POSIX_ACL - if (def_mount_opts & EXT4_DEFM_ACL) - set_opt(sb, POSIX_ACL); + set_opt(sb, POSIX_ACL); #endif if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_DATA) set_opt(sb, JOURNAL_DATA); -- cgit v1.2.3 From 84b775a354f640736176b5d966408fc5d5da6665 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 24 Feb 2011 12:51:59 -0500 Subject: ext4: code cleanup in mb_find_buddy() The current code calculates max regardless of whether order is zero, which is unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits + 3)" only when order == 0.
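As a worked illustration of that expression: assuming bd_blkbits is the base-2 logarithm of the block size in bytes, adding 3 converts bytes to bits, so the result is the number of bits in one block-sized bitmap. The helper name `order0_max` below is hypothetical and exists only for this sketch.

```c
#include <assert.h>

/*
 * Hypothetical model of the order-0 case: one bit per block in a
 * bitmap that occupies a single filesystem block.
 */
static int order0_max(unsigned int blkbits)
{
	assert(blkbits + 3 < 31);	/* keep the shift within int range */
	return 1 << (blkbits + 3);
}
```

Under that assumption, a 4 KiB block size (blkbits of 12) gives 32768 bits, and a 1 KiB block size (blkbits of 10) gives 8192.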
Signed-off-by: Coly Li Cc: Alex Tomas Cc: Theodore Tso --- fs/ext4/mballoc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index ae4d7f5edbb..1791dd4207d 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -432,9 +432,10 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) } /* at order 0 we see each particular block */ - *max = 1 << (e4b->bd_blkbits + 3); - if (order == 0) + if (order == 0) { + *max = 1 << (e4b->bd_blkbits + 3); return EXT4_MB_BITMAP(e4b); + } bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order]; *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order]; -- cgit v1.2.3 From 235772da3e2adb1f4d71f27ec5475093dd38b2ac Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 24 Feb 2011 13:24:18 -0500 Subject: ext4: remove unncessary call mb_find_buddy() in debugging code In __mb_check_buddy(), look at the code below: 591 fstart = -1; 592 buddy = mb_find_buddy(e4b, 0, &max); 593 for (i = 0; i < max; i++) { 594 if (!mb_test_bit(i, buddy)) { 595 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free); 596 if (fstart == -1) { 597 fragments++; 598 fstart = i; 599 } 600 continue; 601 } 602 fstart = -1; 603 /* check used bits only */ 604 for (j = 0; j < e4b->bd_blkbits + 1; j++) { 605 buddy2 = mb_find_buddy(e4b, j, &max2); 606 k = i >> j; 607 MB_CHECK_ASSERT(k < max2); 608 MB_CHECK_ASSERT(mb_test_bit(k, buddy2)); 609 } 610 } 611 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info)); 612 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments); 613 614 grp = ext4_get_group_info(sb, e4b->bd_group); 615 buddy = mb_find_buddy(e4b, 0, &max); On line 592, buddy is fetched by mb_find_buddy() with order 0. Between line 593 and line 615, buddy is not changed, so there is no need to fetch it again from mb_find_buddy() with order 0. We can safely remove the second mb_find_buddy() call on line 615.
Signed-off-by: Coly Li Cc: Alex Tomas Cc: Theodore Tso --- fs/ext4/mballoc.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 1791dd4207d..7de0e282443 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -617,7 +617,6 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file, MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments); grp = ext4_get_group_info(sb, e4b->bd_group); - buddy = mb_find_buddy(e4b, 0, &max); list_for_each(cur, &grp->bb_prealloc_list) { ext4_group_t groupnr; struct ext4_prealloc_space *pa; -- cgit v1.2.3 From 7c786059293335412f99732c6f4c2a886eab25c2 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 24 Feb 2011 13:24:25 -0500 Subject: mballoc: add comments to ext4_mb_mark_free_simple() This patch adds comments to ext4_mb_mark_free_simple to make it more understandable. Signed-off-by: Coly Li Cc: Alex Tomas Cc: Theodore Tso --- fs/ext4/mballoc.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 7de0e282443..b5235c8a2e7 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -635,7 +635,12 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file, #define mb_check_buddy(e4b) #endif -/* FIXME!! need more doc */ +/* + * Divide blocks started from @first with length @len into + * smaller chunks with power of 2 blocks. + * Clear the bits in bitmap which the blocks of the chunk(s) covered, + * then increase bb_counters[] for corresponded chunk size. 
+ */ static void ext4_mb_mark_free_simple(struct super_block *sb, void *buddy, ext4_grpblk_t first, ext4_grpblk_t len, struct ext4_group_info *grp) -- cgit v1.2.3 From 58696f3ab2b23fd6519189875fafdb5d1281eb54 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 24 Feb 2011 14:10:00 -0500 Subject: ext4: clarify description of ac_g_ex in struct ext4_allocation_context Signed-off-by: Coly Li Cc: Alex Tomas Cc: Theodore Tso --- fs/ext4/mballoc.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index b619322c76f..22bd4d7f289 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -169,7 +169,7 @@ struct ext4_allocation_context { /* original request */ struct ext4_free_extent ac_o_ex; - /* goal request (after normalization) */ + /* goal request (normalized ac_o_ex) */ struct ext4_free_extent ac_g_ex; /* the best found extent */ -- cgit v1.2.3 From 5a54b2f199fdf19533f96c3e285b70c6729e1e4a Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 24 Feb 2011 14:10:05 -0500 Subject: ext4: mballoc: don't replace the current preallocation group unnecessarily In ext4_mb_check_group_pa(), the current preallocation space is replaced with a new preallocation space when the two have the same distance from the goal block. This doesn't actually gain us anything, so change things so that the function only switches to the new preallocation group if its distance from the goal block is strictly smaller than the current preallocation group's distance from the goal block.
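The tie-breaking rule described above can be sketched as a standalone C function. The type and function names below (pa_sketch, block_distance, pick_closer) are hypothetical stand-ins, not the kernel's ext4_prealloc_space or ext4_mb_check_group_pa(), and reference counting is left out entirely:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a preallocation descriptor; only the
 * starting block matters for this sketch. */
struct pa_sketch {
    uint64_t pstart;
};

/* Absolute distance between two block numbers, without signed overflow. */
static uint64_t block_distance(uint64_t a, uint64_t b)
{
    return a > b ? a - b : b - a;
}

/* Keep the current choice unless the candidate is strictly closer to
 * the goal block. On a tie the current choice wins, so no swap happens. */
static const struct pa_sketch *pick_closer(uint64_t goal,
                                           const struct pa_sketch *cur,
                                           const struct pa_sketch *cand)
{
    assert(cand != NULL);
    if (cur == NULL)
        return cand;
    if (block_distance(goal, cur->pstart) <= block_distance(goal, cand->pstart))
        return cur;
    return cand;
}
```

With a goal of 110, candidates starting at 100 and 120 are equally distant, so the one already held is kept; a candidate starting at 105 would replace it.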
Signed-off-by: Coly Li Signed-off-by: "Theodore Ts'o" --- fs/ext4/mballoc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index b5235c8a2e7..66bee7274d6 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3213,7 +3213,7 @@ ext4_mb_check_group_pa(ext4_fsblk_t goal_block, cur_distance = abs(goal_block - cpa->pa_pstart); new_distance = abs(goal_block - pa->pa_pstart); - if (cur_distance < new_distance) + if (cur_distance <= new_distance) return cpa; /* drop the previous reference */ -- cgit v1.2.3 From e0fd9b90765f604374c42de8ac59d6584afce264 Mon Sep 17 00:00:00 2001 From: Curt Wohlgemuth Date: Sat, 26 Feb 2011 12:25:52 -0500 Subject: ext4: mark multi-page IO complete on mapping failure In mpage_da_map_and_submit(), if we have a delayed block allocation failure from ext4_map_blocks(), we need to mark the IO as complete, by setting mpd->io_done = 1; Otherwise, we could end up submitting the pages in an outer loop; since they are unlocked on mapping failure in ext4_da_block_invalidatepages(), this will cause a bug check in mpage_da_submit_io(). I tested this by injecting failures into ext4_map_blocks(). Without this patch, a simple fsstress run will bug check; with the patch, it works fine.
Signed-off-by: Curt Wohlgemuth Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c6c6b7fcb45..fd369dbce6a 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2314,6 +2314,9 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd) /* invalidate all the pages */ ext4_da_block_invalidatepages(mpd, next, mpd->b_size >> mpd->inode->i_blkbits); + + /* Mark this page range as having been completed */ + mpd->io_done = 1; return; } BUG_ON(blks == 0); -- cgit v1.2.3 From c7f5938adce6727b9d17785f289c1146bd88d678 Mon Sep 17 00:00:00 2001 From: Curt Wohlgemuth Date: Sat, 26 Feb 2011 12:27:52 -0500 Subject: ext4: fix ext4_da_block_invalidatepages() to handle page range properly If ext4_da_block_invalidatepages() is called because of a failure from ext4_map_blocks() in mpage_da_map_and_submit(), it's supposed to clean up -- including unlock -- all the pages in the mpd structure. But these values may not match up, even on a system in which block size == page size: mpd->b_blocknr != mpd->first_page mpd->b_size != (mpd->next_page - mpd->first_page) ext4_da_block_invalidatepages() has been using b_blocknr and b_size; this patch changes it to use first_page and next_page. Tested: I injected a small number (5%) of failures in ext4_map_blocks() in the case that the flags contain EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this kernel. Without this patch, I got hung tasks every time. With this patch, I see no hangs in many runs of fsstress. 
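For reference, the block-to-page conversion that the old code relied on can be sketched in user-space C. The struct and function names here are illustrative only, not kernel identifiers. The point of the sketch is that a page range derived from a starting block and a block count describes just those blocks, which need not be the same set of pages as first_page through next_page - 1:

```c
#include <assert.h>

/* Inclusive range of page indexes (illustrative type, not from the kernel). */
struct page_range {
    unsigned long first;
    unsigned long last;
};

/* Convert a run of blk_cnt logical blocks starting at 'logical' into the
 * inclusive range of page indexes that contain them. page_shift and
 * blkbits are log2 of the page size and block size respectively. */
static struct page_range blocks_to_pages(unsigned long logical,
                                         unsigned long blk_cnt,
                                         unsigned int page_shift,
                                         unsigned int blkbits)
{
    struct page_range r;
    unsigned int shift;

    assert(blk_cnt > 0);
    assert(page_shift >= blkbits);

    shift = page_shift - blkbits;
    r.first = logical >> shift;
    r.last = (logical + blk_cnt - 1) >> shift;
    return r;
}
```

With 4 KiB pages and 1 KiB blocks, blocks 5 through 10 map to pages 1 through 2. Taking the range directly from first_page and next_page, as the patch does, avoids depending on this conversion at all.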
Signed-off-by: Curt Wohlgemuth Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index fd369dbce6a..e878c3a7aaf 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2165,8 +2165,7 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd, return ret; } -static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd, - sector_t logical, long blk_cnt) +static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd) { int nr_pages, i; pgoff_t index, end; @@ -2174,9 +2173,8 @@ static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd, struct inode *inode = mpd->inode; struct address_space *mapping = inode->i_mapping; - index = logical >> (PAGE_CACHE_SHIFT - inode->i_blkbits); - end = (logical + blk_cnt - 1) >> - (PAGE_CACHE_SHIFT - inode->i_blkbits); + index = mpd->first_page; + end = mpd->next_page - 1; while (index <= end) { nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE); if (nr_pages == 0) @@ -2312,8 +2310,7 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd) ext4_print_free_blocks(mpd->inode); } /* invalidate all the pages */ - ext4_da_block_invalidatepages(mpd, next, - mpd->b_size >> mpd->inode->i_blkbits); + ext4_da_block_invalidatepages(mpd); /* Mark this page range as having been completed */ mpd->io_done = 1; -- cgit v1.2.3 From 6fd7a46781999c32f423025767e43b349b967d57 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 26 Feb 2011 13:53:09 -0500 Subject: ext4: enable mblk_io_submit by default Now that we've fixed the file corruption bug in commit d50bdd5aa55, it's time to enable mblk_io_submit by default. 
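Turning an option on by default usually inverts how it is reported: only the non-default state needs to be printed. A minimal user-space sketch of that idea follows; show_mblk_option is a hypothetical helper, not a kernel function, and a plain buffer stands in for the seq_file:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write the option text for a default-on flag into buf. Nothing is
 * emitted when the flag is in its default (enabled) state; the negated
 * name is emitted only when the flag has been cleared. */
static void show_mblk_option(char *buf, size_t len, bool mblk_io_submit)
{
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    if (!mblk_io_submit)
        snprintf(buf, len, ",nomblk_io_submit");
}
```

This keeps the reported option string short for the common case and still makes an explicit opt-out visible.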
Signed-off-by: "Theodore Ts'o" --- fs/ext4/super.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index a665d2fb70c..33c398785e5 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1038,8 +1038,8 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs) !(def_mount_opts & EXT4_DEFM_NODELALLOC)) seq_puts(seq, ",nodelalloc"); - if (test_opt(sb, MBLK_IO_SUBMIT)) - seq_puts(seq, ",mblk_io_submit"); + if (!test_opt(sb, MBLK_IO_SUBMIT)) + seq_puts(seq, ",nomblk_io_submit"); if (sbi->s_stripe) seq_printf(seq, ",stripe=%lu", sbi->s_stripe); /* @@ -3099,6 +3099,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) #ifdef CONFIG_EXT4_FS_POSIX_ACL set_opt(sb, POSIX_ACL); #endif + set_opt(sb, MBLK_IO_SUBMIT); if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_DATA) set_opt(sb, JOURNAL_DATA); else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_ORDERED) -- cgit v1.2.3 From 8eb9e5ce211de1b98bc84e93258b7db0860a103c Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 26 Feb 2011 14:07:31 -0500 Subject: ext4: fold __mpage_da_writepage() into write_cache_pages_da() Fold the __mpage_da_writepage() function into write_cache_pages_da(). This will give us opportunities to clean up and simplify the resulting code. 
Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 206 +++++++++++++++++++++++++------------------------------- 1 file changed, 91 insertions(+), 115 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e878c3a7aaf..fcd08ca0643 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2437,102 +2437,6 @@ static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh) return (buffer_delay(bh) || buffer_unwritten(bh)) && buffer_dirty(bh); } -/* - * __mpage_da_writepage - finds extent of pages and blocks - * - * @page: page to consider - * @wbc: not used, we just follow rules - * @data: context - * - * The function finds extents of pages and scan them for all blocks. - */ -static int __mpage_da_writepage(struct page *page, - struct writeback_control *wbc, - struct mpage_da_data *mpd) -{ - struct inode *inode = mpd->inode; - struct buffer_head *bh, *head; - sector_t logical; - - /* - * Can we merge this page to current extent? - */ - if (mpd->next_page != page->index) { - /* - * Nope, we can't. So, we map non-allocated blocks - * and start IO on them - */ - if (mpd->next_page != mpd->first_page) { - mpage_da_map_and_submit(mpd); - /* - * skip rest of the page in the page_vec - */ - redirty_page_for_writepage(wbc, page); - unlock_page(page); - return MPAGE_DA_EXTENT_TAIL; - } - - /* - * Start next extent of pages ... - */ - mpd->first_page = page->index; - - /* - * ... 
and blocks - */ - mpd->b_size = 0; - mpd->b_state = 0; - mpd->b_blocknr = 0; - } - - mpd->next_page = page->index + 1; - logical = (sector_t) page->index << - (PAGE_CACHE_SHIFT - inode->i_blkbits); - - if (!page_has_buffers(page)) { - mpage_add_bh_to_extent(mpd, logical, PAGE_CACHE_SIZE, - (1 << BH_Dirty) | (1 << BH_Uptodate)); - if (mpd->io_done) - return MPAGE_DA_EXTENT_TAIL; - } else { - /* - * Page with regular buffer heads, just add all dirty ones - */ - head = page_buffers(page); - bh = head; - do { - BUG_ON(buffer_locked(bh)); - /* - * We need to try to allocate - * unmapped blocks in the same page. - * Otherwise we won't make progress - * with the page in ext4_writepage - */ - if (ext4_bh_delay_or_unwritten(NULL, bh)) { - mpage_add_bh_to_extent(mpd, logical, - bh->b_size, - bh->b_state); - if (mpd->io_done) - return MPAGE_DA_EXTENT_TAIL; - } else if (buffer_dirty(bh) && (buffer_mapped(bh))) { - /* - * mapped dirty buffer. We need to update - * the b_state because we look at - * b_state in mpage_da_map_blocks. We don't - * update b_size because if we find an - * unmapped buffer_head later we need to - * use the b_state flag of that buffer_head. - */ - if (mpd->b_size == 0) - mpd->b_state = bh->b_state & BH_FLAGS; - } - logical++; - } while ((bh = bh->b_this_page) != head); - } - - return 0; -} - /* * This is a special get_blocks_t callback which is used by * ext4_da_write_begin(). It will either return mapped block or @@ -2811,18 +2715,17 @@ static int ext4_da_writepages_trans_blocks(struct inode *inode) /* * write_cache_pages_da - walk the list of dirty pages of the given - * address space and call the callback function (which usually writes - * the pages). - * - * This is a forked version of write_cache_pages(). Differences: - * Range cyclic is ignored. - * no_nrwrite_index_update is always presumed true + * address space and accumulate pages that need writing, and call + * mpage_da_map_and_submit to map the pages and then write them. 
*/ static int write_cache_pages_da(struct address_space *mapping, struct writeback_control *wbc, struct mpage_da_data *mpd, pgoff_t *done_index) { + struct inode *inode = mpd->inode; + struct buffer_head *bh, *head; + sector_t logical; int ret = 0; int done = 0; struct pagevec pvec; @@ -2899,17 +2802,90 @@ continue_unlock: if (!clear_page_dirty_for_io(page)) goto continue_unlock; - ret = __mpage_da_writepage(page, wbc, mpd); - if (unlikely(ret)) { - if (ret == AOP_WRITEPAGE_ACTIVATE) { + /* BEGIN __mpage_da_writepage */ + + /* + * Can we merge this page to current extent? + */ + if (mpd->next_page != page->index) { + /* + * Nope, we can't. So, we map + * non-allocated blocks and start IO + * on them + */ + if (mpd->next_page != mpd->first_page) { + mpage_da_map_and_submit(mpd); + /* + * skip rest of the page in the page_vec + */ + redirty_page_for_writepage(wbc, page); unlock_page(page); - ret = 0; - } else { - done = 1; - break; + ret = MPAGE_DA_EXTENT_TAIL; + goto out; } + + /* + * Start next extent of pages and blocks + */ + mpd->first_page = page->index; + mpd->b_size = 0; + mpd->b_state = 0; + mpd->b_blocknr = 0; + } + + mpd->next_page = page->index + 1; + logical = (sector_t) page->index << + (PAGE_CACHE_SHIFT - inode->i_blkbits); + + if (!page_has_buffers(page)) { + mpage_add_bh_to_extent(mpd, logical, PAGE_CACHE_SIZE, + (1 << BH_Dirty) | (1 << BH_Uptodate)); + if (mpd->io_done) { + ret = MPAGE_DA_EXTENT_TAIL; + goto out; + } + } else { + /* + * Page with regular buffer heads, just add all dirty ones + */ + head = page_buffers(page); + bh = head; + do { + BUG_ON(buffer_locked(bh)); + /* + * We need to try to allocate + * unmapped blocks in the same page. 
+ * Otherwise we won't make progress + * with the page in ext4_writepage + */ + if (ext4_bh_delay_or_unwritten(NULL, bh)) { + mpage_add_bh_to_extent(mpd, logical, + bh->b_size, + bh->b_state); + if (mpd->io_done) { + ret = MPAGE_DA_EXTENT_TAIL; + goto out; + } + } else if (buffer_dirty(bh) && (buffer_mapped(bh))) { + /* + * mapped dirty buffer. We need to update + * the b_state because we look at + * b_state in mpage_da_map_blocks. We don't + * update b_size because if we find an + * unmapped buffer_head later we need to + * use the b_state flag of that buffer_head. + */ + if (mpd->b_size == 0) + mpd->b_state = bh->b_state & BH_FLAGS; + } + logical++; + } while ((bh = bh->b_this_page) != head); } + ret = 0; + + /* END __mpage_da_writepage */ + if (nr_to_write > 0) { nr_to_write--; if (nr_to_write == 0 && @@ -2933,6 +2909,10 @@ continue_unlock: cond_resched(); } return ret; +out: + pagevec_release(&pvec); + cond_resched(); + return ret; } @@ -3059,13 +3039,9 @@ retry: } /* - * Now call __mpage_da_writepage to find the next + * Now call write_cache_pages_da() to find the next * contiguous region of logical blocks that need - * blocks to be allocated by ext4. We don't actually - * submit the blocks for I/O here, even though - * write_cache_pages thinks it will, and will set the - * pages as clean for write before calling - * __mpage_da_writepage(). + * blocks to be allocated by ext4 and submit them. */ mpd.b_size = 0; mpd.b_state = 0; -- cgit v1.2.3 From 4f01b02c8c4e4111bd1adbcafb5741e8e991f5fd Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 26 Feb 2011 14:07:37 -0500 Subject: ext4: simple cleanups to write_cache_pages_da() Eliminate duplicate code, unneeded variables, etc., to make it easier to understand the code. No behavioral changes were made in this patch. 
Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 115 +++++++++++++++++++++++--------------------------------- 1 file changed, 48 insertions(+), 67 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index fcd08ca0643..1e718e87f46 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2723,17 +2723,14 @@ static int write_cache_pages_da(struct address_space *mapping, struct mpage_da_data *mpd, pgoff_t *done_index) { - struct inode *inode = mpd->inode; - struct buffer_head *bh, *head; - sector_t logical; - int ret = 0; - int done = 0; - struct pagevec pvec; - unsigned nr_pages; - pgoff_t index; - pgoff_t end; /* Inclusive */ - long nr_to_write = wbc->nr_to_write; - int tag; + struct buffer_head *bh, *head; + struct inode *inode = mpd->inode; + struct pagevec pvec; + unsigned int nr_pages; + sector_t logical; + pgoff_t index, end; + long nr_to_write = wbc->nr_to_write; + int i, tag, ret = 0; pagevec_init(&pvec, 0); index = wbc->range_start >> PAGE_CACHE_SHIFT; @@ -2745,13 +2742,11 @@ static int write_cache_pages_da(struct address_space *mapping, tag = PAGECACHE_TAG_DIRTY; *done_index = index; - while (!done && (index <= end)) { - int i; - + while (index <= end) { nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag, min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); if (nr_pages == 0) - break; + return 0; for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; @@ -2763,47 +2758,37 @@ static int write_cache_pages_da(struct address_space *mapping, * mapping. However, page->index will not change * because we have a reference on the page. */ - if (page->index > end) { - done = 1; - break; - } + if (page->index > end) + goto out; *done_index = page->index + 1; lock_page(page); /* - * Page truncated or invalidated. 
We can freely skip it - * then, even for data integrity operations: the page - * has disappeared concurrently, so there could be no - * real expectation of this data interity operation - * even if there is now a new, dirty page at the same - * pagecache address. + * If the page is no longer dirty, or its + * mapping no longer corresponds to inode we + * are writing (which means it has been + * truncated or invalidated), or the page is + * already under writeback and we are not + * doing a data integrity writeback, skip the page */ - if (unlikely(page->mapping != mapping)) { -continue_unlock: + if (!PageDirty(page) || + (PageWriteback(page) && + (wbc->sync_mode == WB_SYNC_NONE)) || + unlikely(page->mapping != mapping)) { + continue_unlock: unlock_page(page); continue; } - if (!PageDirty(page)) { - /* someone wrote it for us */ - goto continue_unlock; - } - - if (PageWriteback(page)) { - if (wbc->sync_mode != WB_SYNC_NONE) - wait_on_page_writeback(page); - else - goto continue_unlock; - } + if (PageWriteback(page)) + wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); if (!clear_page_dirty_for_io(page)) goto continue_unlock; - /* BEGIN __mpage_da_writepage */ - /* * Can we merge this page to current extent? 
*/ @@ -2820,8 +2805,7 @@ continue_unlock: */ redirty_page_for_writepage(wbc, page); unlock_page(page); - ret = MPAGE_DA_EXTENT_TAIL; - goto out; + goto ret_extent_tail; } /* @@ -2838,15 +2822,15 @@ continue_unlock: (PAGE_CACHE_SHIFT - inode->i_blkbits); if (!page_has_buffers(page)) { - mpage_add_bh_to_extent(mpd, logical, PAGE_CACHE_SIZE, + mpage_add_bh_to_extent(mpd, logical, + PAGE_CACHE_SIZE, (1 << BH_Dirty) | (1 << BH_Uptodate)); - if (mpd->io_done) { - ret = MPAGE_DA_EXTENT_TAIL; - goto out; - } + if (mpd->io_done) + goto ret_extent_tail; } else { /* - * Page with regular buffer heads, just add all dirty ones + * Page with regular buffer heads, + * just add all dirty ones */ head = page_buffers(page); bh = head; @@ -2862,18 +2846,19 @@ continue_unlock: mpage_add_bh_to_extent(mpd, logical, bh->b_size, bh->b_state); - if (mpd->io_done) { - ret = MPAGE_DA_EXTENT_TAIL; - goto out; - } + if (mpd->io_done) + goto ret_extent_tail; } else if (buffer_dirty(bh) && (buffer_mapped(bh))) { /* - * mapped dirty buffer. We need to update - * the b_state because we look at - * b_state in mpage_da_map_blocks. We don't - * update b_size because if we find an - * unmapped buffer_head later we need to - * use the b_state flag of that buffer_head. + * mapped dirty buffer. We need + * to update the b_state + * because we look at b_state + * in mpage_da_map_blocks. We + * don't update b_size because + * if we find an unmapped + * buffer_head later we need to + * use the b_state flag of that + * buffer_head. */ if (mpd->b_size == 0) mpd->b_state = bh->b_state & BH_FLAGS; @@ -2882,14 +2867,10 @@ continue_unlock: } while ((bh = bh->b_this_page) != head); } - ret = 0; - - /* END __mpage_da_writepage */ - if (nr_to_write > 0) { nr_to_write--; if (nr_to_write == 0 && - wbc->sync_mode == WB_SYNC_NONE) { + wbc->sync_mode == WB_SYNC_NONE) /* * We stop writing back only if we are * not doing integrity sync. 
In case of @@ -2900,15 +2881,15 @@ continue_unlock: * pages, but have not synced all of the * old dirty pages. */ - done = 1; - break; - } + goto out; } } pagevec_release(&pvec); cond_resched(); } - return ret; + return 0; +ret_extent_tail: + ret = MPAGE_DA_EXTENT_TAIL; out: pagevec_release(&pvec); cond_resched(); -- cgit v1.2.3 From 9749895644a817cfd28a535bc3ae60e4267bdc50 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 26 Feb 2011 14:08:01 -0500 Subject: ext4: clear the dirty bit for a page in writeback at the last minute Move when we call clear_page_dirty_for_io() to just before we actually write the page. This simplifies the code somewhat, and avoids marking pages as clean and then needing to remark them as dirty later. Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 28 +++++++++++----------------- 1 file changed, 11 insertions(+), 17 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 1e718e87f46..ae6e2f43d87 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2060,7 +2060,7 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd, if (nr_pages == 0) break; for (i = 0; i < nr_pages; i++) { - int commit_write = 0, redirty_page = 0; + int commit_write = 0, skip_page = 0; struct page *page = pvec.pages[i]; index = page->index; @@ -2086,14 +2086,12 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd, * If the page does not have buffers (for * whatever reason), try to create them using * __block_write_begin. If this fails, - * redirty the page and move on. + * skip the page and move on. 
*/ if (!page_has_buffers(page)) { if (__block_write_begin(page, 0, len, noalloc_get_block_write)) { - redirty_page: - redirty_page_for_writepage(mpd->wbc, - page); + skip_page: unlock_page(page); continue; } @@ -2104,7 +2102,7 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd, block_start = 0; do { if (!bh) - goto redirty_page; + goto skip_page; if (map && (cur_logical >= map->m_lblk) && (cur_logical <= (map->m_lblk + (map->m_len - 1)))) { @@ -2120,22 +2118,23 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd, clear_buffer_unwritten(bh); } - /* redirty page if block allocation undone */ + /* skip page if block allocation undone */ if (buffer_delay(bh) || buffer_unwritten(bh)) - redirty_page = 1; + skip_page = 1; bh = bh->b_this_page; block_start += bh->b_size; cur_logical++; pblock++; } while (bh != page_bufs); - if (redirty_page) - goto redirty_page; + if (skip_page) + goto skip_page; if (commit_write) /* mark the buffer_heads as dirty & uptodate */ block_commit_write(page, 0, len); + clear_page_dirty_for_io(page); /* * Delalloc doesn't support data journalling, * but eventually maybe we'll lift this @@ -2277,9 +2276,8 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd) err = blks; /* * If get block returns EAGAIN or ENOSPC and there - * appears to be free blocks we will call - * ext4_writepage() for all of the pages which will - * just redirty the pages. + * appears to be free blocks we will just let + * mpage_da_submit_io() unlock all of the pages. 
*/ if (err == -EAGAIN) goto submit_io; @@ -2777,7 +2775,6 @@ static int write_cache_pages_da(struct address_space *mapping, (PageWriteback(page) && (wbc->sync_mode == WB_SYNC_NONE)) || unlikely(page->mapping != mapping)) { - continue_unlock: unlock_page(page); continue; } @@ -2786,8 +2783,6 @@ static int write_cache_pages_da(struct address_space *mapping, wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); - if (!clear_page_dirty_for_io(page)) - goto continue_unlock; /* * Can we merge this page to current extent? @@ -2803,7 +2798,6 @@ static int write_cache_pages_da(struct address_space *mapping, /* * skip rest of the page in the page_vec */ - redirty_page_for_writepage(wbc, page); unlock_page(page); goto ret_extent_tail; } -- cgit v1.2.3 From ee6ecbcc5d73672217fdea420d182ecb0cdf310c Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 26 Feb 2011 14:08:11 -0500 Subject: ext4: remove page_skipped hackery in ext4_da_writepages() Because the ext4 page writeback codepath had been prematurely calling clear_page_dirty_for_io(), if it turned out that a particular page couldn't be written out during a particular pass of write_cache_pages_da(), the page would have to get redirtied by calling redirty_page_for_writepage(). Not only was this wasted work, but redirty_page_for_writepage() would increment wbc->pages_skipped to signal to writeback_sb_inodes() that buffers were locked, and that it should skip this inode until later. Since this signal was incorrect in ext4's case --- which was caused by ext4's historically incorrect use of write_cache_pages() --- ext4_da_writepages() saved and restored wbc->pages_skipped to avoid confusing writeback_sb_inodes(). Now that we've fixed ext4 to call clear_page_dirty_for_io() right before initiating the page I/O, we can nuke the pages_skipped save/restore hackery, and breathe a sigh of relief.
Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index ae6e2f43d87..617c9cbba18 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2900,7 +2900,6 @@ static int ext4_da_writepages(struct address_space *mapping, struct mpage_da_data mpd; struct inode *inode = mapping->host; int pages_written = 0; - long pages_skipped; unsigned int max_pages; int range_cyclic, cycled = 1, io_done = 0; int needed_blocks, ret = 0; @@ -2986,8 +2985,6 @@ static int ext4_da_writepages(struct address_space *mapping, mpd.wbc = wbc; mpd.inode = mapping->host; - pages_skipped = wbc->pages_skipped; - retry: if (wbc->sync_mode == WB_SYNC_ALL) tag_pages_for_writeback(mapping, index, end); @@ -3047,7 +3044,6 @@ retry: * and try again */ jbd2_journal_force_commit_nested(sbi->s_journal); - wbc->pages_skipped = pages_skipped; ret = 0; } else if (ret == MPAGE_DA_EXTENT_TAIL) { /* @@ -3055,7 +3051,6 @@ retry: * rest of the pages */ pages_written += mpd.pages_written; - wbc->pages_skipped = pages_skipped; ret = 0; io_done = 1; } else if (wbc->nr_to_write) @@ -3073,11 +3068,6 @@ retry: wbc->range_end = mapping->writeback_index - 1; goto retry; } - if (pages_skipped != wbc->pages_skipped) - ext4_msg(inode->i_sb, KERN_CRIT, - "This should not happen leaving %s " - "with nr_to_write = %ld ret = %d", - __func__, wbc->nr_to_write, ret); /* Update index */ wbc->range_cyclic = range_cyclic; -- cgit v1.2.3 From 78aaced3408141bb7c836f2db0ca435790399da5 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 26 Feb 2011 14:09:14 -0500 Subject: ext4: don't lock the next page in write_cache_pages if not needed If we have accumulated a contiguous region of memory to be written out, and the next page cannot be added to this region, don't bother locking (and then unlocking) that page before writing out the memory.
In the unlikely event that the next page was being written back by some other CPU, we can also skip waiting for that page to finish writeback. Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 27 ++++++++++----------------- 1 file changed, 10 insertions(+), 17 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 617c9cbba18..c2e6af33823 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2761,6 +2761,16 @@ static int write_cache_pages_da(struct address_space *mapping, *done_index = page->index + 1; + /* + * If we can't merge this page, and we have + * accumulated an contiguous region, write it + */ + if ((mpd->next_page != page->index) && + (mpd->next_page != mpd->first_page)) { + mpage_da_map_and_submit(mpd); + goto ret_extent_tail; + } + lock_page(page); /* @@ -2784,24 +2794,7 @@ static int write_cache_pages_da(struct address_space *mapping, BUG_ON(PageWriteback(page)); - /* - * Can we merge this page to current extent? - */ if (mpd->next_page != page->index) { - /* - * Nope, we can't. So, we map - * non-allocated blocks and start IO - * on them - */ - if (mpd->next_page != mpd->first_page) { - mpage_da_map_and_submit(mpd); - /* - * skip rest of the page in the page_vec - */ - unlock_page(page); - goto ret_extent_tail; - } - /* * Start next extent of pages and blocks */ -- cgit v1.2.3 From 168fc0223c0e944957b1f31d88c2334fc904baf1 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 26 Feb 2011 14:09:20 -0500 Subject: ext4: move setup of the mpd structure to write_cache_pages_da() Move the initialization of all of the fields of the mpd structure to write_cache_pages_da(). This simplifies the code considerably.
Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 29 +++++++---------------------- 1 file changed, 7 insertions(+), 22 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c2e6af33823..dcc2287433b 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2714,7 +2714,8 @@ static int ext4_da_writepages_trans_blocks(struct inode *inode) /* * write_cache_pages_da - walk the list of dirty pages of the given * address space and accumulate pages that need writing, and call - * mpage_da_map_and_submit to map the pages and then write them. + * mpage_da_map_and_submit to map a single contiguous memory region + * and then write them. */ static int write_cache_pages_da(struct address_space *mapping, struct writeback_control *wbc, @@ -2722,7 +2723,7 @@ static int write_cache_pages_da(struct address_space *mapping, pgoff_t *done_index) { struct buffer_head *bh, *head; - struct inode *inode = mpd->inode; + struct inode *inode = mapping->host; struct pagevec pvec; unsigned int nr_pages; sector_t logical; @@ -2730,6 +2731,9 @@ static int write_cache_pages_da(struct address_space *mapping, long nr_to_write = wbc->nr_to_write; int i, tag, ret = 0; + memset(mpd, 0, sizeof(struct mpage_da_data)); + mpd->wbc = wbc; + mpd->inode = inode; pagevec_init(&pvec, 0); index = wbc->range_start >> PAGE_CACHE_SHIFT; end = wbc->range_end >> PAGE_CACHE_SHIFT; @@ -2794,16 +2798,8 @@ static int write_cache_pages_da(struct address_space *mapping, BUG_ON(PageWriteback(page)); - if (mpd->next_page != page->index) { - /* - * Start next extent of pages and blocks - */ + if (mpd->next_page != page->index) mpd->first_page = page->index; - mpd->b_size = 0; - mpd->b_state = 0; - mpd->b_blocknr = 0; - } - mpd->next_page = page->index + 1; logical = (sector_t) page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits); @@ -2975,9 +2971,6 @@ static int ext4_da_writepages(struct address_space *mapping, wbc->nr_to_write = desired_nr_to_write; } - mpd.wbc = wbc; - mpd.inode = mapping->host; - retry: if 
(wbc->sync_mode == WB_SYNC_ALL) tag_pages_for_writeback(mapping, index, end); @@ -3008,14 +3001,6 @@ retry: * contiguous region of logical blocks that need * blocks to be allocated by ext4 and submit them. */ - mpd.b_size = 0; - mpd.b_state = 0; - mpd.b_blocknr = 0; - mpd.first_page = 0; - mpd.next_page = 0; - mpd.io_done = 0; - mpd.pages_written = 0; - mpd.retval = 0; ret = write_cache_pages_da(mapping, wbc, &mpd, &done_index); /* * If we have a contiguous extent of pages and we -- cgit v1.2.3 From a54aa76108619e5d8290b49081c2aaaeff5be9a2 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sun, 27 Feb 2011 16:43:24 -0500 Subject: ext4: don't leave PageWriteback set after memory failure In ext4_bio_write_page(), if the memory allocation for the struct ext4_io_page fails, it returns with the page's PageWriteback flag set. This will end up causing the page to be skipped by writeback in WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or umount) the writeback daemon will get stuck forever on the wait_on_page_writeback() function in write_cache_pages_da(). Or, if journalling is enabled and the file gets deleted, the journal thread can get stuck in the call to filemap_fdatawait() made by journal_finish_inode_data_buffers(). Another place where things can get hung up is in truncate_inode_pages(), called out of ext4_evict_inode(). Fix this by not setting PageWriteback until after we have successfully allocated the struct ext4_io_page.
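The ordering fix can be illustrated outside the kernel. This is a minimal sketch with hypothetical types (page_sketch, io_page_sketch) and a simulate_oom flag standing in for a failed kmem_cache_alloc(); it is not the ext4 implementation, only the "allocate first, mark state afterwards" pattern the message describes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct page flags and struct ext4_io_page. */
struct page_sketch {
    bool writeback;
    bool error;
};

struct io_page_sketch {
    struct page_sketch *page;
    int count;
};

/* Allocate the tracking structure first. The writeback flag is set only
 * after the allocation has succeeded, so an allocation failure returns
 * with the page state untouched. */
static struct io_page_sketch *begin_write(struct page_sketch *page,
                                          bool simulate_oom)
{
    struct io_page_sketch *io_page;

    assert(page != NULL && !page->writeback);

    io_page = simulate_oom ? NULL : malloc(sizeof(*io_page));
    if (!io_page)
        return NULL; /* page->writeback was never set */

    io_page->page = page;
    io_page->count = 1;
    page->writeback = true;
    page->error = false;
    return io_page;
}
```

Because nothing is marked before the allocation, the failure path needs no cleanup of the page flags, and later waiters never see a writeback flag that no I/O completion will clear.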
Signed-off-by: "Theodore Ts'o" --- fs/ext4/page-io.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 68d92a8f71d..d5c391ffad7 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -381,8 +381,6 @@ int ext4_bio_write_page(struct ext4_io_submit *io, BUG_ON(!PageLocked(page)); BUG_ON(PageWriteback(page)); - set_page_writeback(page); - ClearPageError(page); io_page = kmem_cache_alloc(io_page_cachep, GFP_NOFS); if (!io_page) { @@ -393,6 +391,8 @@ int ext4_bio_write_page(struct ext4_io_submit *io, io_page->p_page = page; atomic_set(&io_page->p_count, 1); get_page(page); + set_page_writeback(page); + ClearPageError(page); for (bh = head = page_buffers(page), block_start = 0; bh != head || !block_start; -- cgit v1.2.3 From 4dd89fc6251a6bda2c18e71e7d266e983806579d Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sun, 27 Feb 2011 17:23:47 -0500 Subject: ext4: suppress verbose debugging information if malloc-debug is off If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due to disk being full, a verbose debugging message is printed, even if the malloc-debug switch has not been enabled. Suppress the debugging message so that nothing is printed unless malloc-debug has been turned on. 
Signed-off-by: "Theodore Ts'o" --- fs/ext4/mballoc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 66bee7274d6..2f6f0dd08fc 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3912,7 +3912,8 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac) struct super_block *sb = ac->ac_sb; ext4_group_t ngroups, i; - if (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED) + if (!mb_enable_debug || + (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)) return; printk(KERN_ERR "EXT4-fs: Can't allocate:" -- cgit v1.2.3 From 6d9c85eb700bd3ac59e63bb9de463dea1aca084c Mon Sep 17 00:00:00 2001 From: Yongqiang Yang Date: Sun, 27 Feb 2011 17:25:47 -0500 Subject: ext4: make FIEMAP and delayed allocation play well together Fix the FIEMAP ioctl so that it returns all of the page ranges which are still subject to delayed allocation. We were missing some cases if the file was sparse. Reported by Chris Mason : >We've had reports on btrfs that cp is giving us files full of zeros >instead of actually copying them. It was tracked down to a bug with >the btrfs fiemap implementation where it was returning holes for >delalloc ranges. > >Newer versions of cp are trusting fiemap to tell it where the holes >are, which does seem like a pretty neat trick. > >I decided to give xfs and ext4 a shot with a few tests cases too, xfs >passed with all the ones btrfs was getting wrong, and ext4 got the basic >delalloc case right. >$ mkfs.ext4 /dev/xxx >$ mount /dev/xxx /mnt >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 >$ fiemap-test foo >ext: 0 logical: [ 0.. 255] phys: 0.. 255 >flags: 0x007 tot: 256 > >Horray! But once we throw a hole in, things go bad: >$ mkfs.ext4 /dev/xxx >$ mount /dev/xxx /mnt >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1 >$ fiemap-test foo >< no output > > >We've got a delalloc extent after the hole and ext4 fiemap didn't find >it. 
If I run sync to kick the delalloc out: >$sync >$ fiemap-test foo >ext: 0 logical: [ 256.. 511] phys: 34048.. 34303 >flags: 0x001 tot: 256 > >fiemap-test is sitting in my /usr/local/bin, and I have no idea how it >got there. It's full of pretty comments so I know it isn't mine, but >you can grab it here: > >http://oss.oracle.com/~mason/fiemap-test.c > >xfsqa has a fiemap program too. After Fix, test results are as follows: ext: 0 logical: [ 256.. 511] phys: 0.. 255 flags: 0x007 tot: 256 ext: 0 logical: [ 256.. 511] phys: 33280.. 33535 flags: 0x001 tot: 256 $ mkfs.ext4 /dev/xxx $ mount /dev/xxx /mnt $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1 $ sync $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=3 $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=5 $ fiemap-test foo ext: 0 logical: [ 256.. 511] phys: 33280.. 33535 flags: 0x000 tot: 256 ext: 1 logical: [ 768.. 1023] phys: 0.. 255 flags: 0x006 tot: 256 ext: 2 logical: [ 1280.. 1535] phys: 0.. 255 flags: 0x007 tot: 256 Tested-by: Eric Sandeen Reviewed-by: Andreas Dilger Signed-off-by: Yongqiang Yang Signed-off-by: "Theodore Ts'o" --- fs/ext4/extents.c | 187 ++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 148 insertions(+), 39 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index d16f6b5a140..9ea1bc64ca6 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -3775,6 +3775,7 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset, } return ret > 0 ? ret2 : ret; } + /* * Callback function called for each extent to gather FIEMAP information. 
*/ @@ -3782,38 +3783,162 @@ static int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_path *path, struct ext4_ext_cache *newex, struct ext4_extent *ex, void *data) { - struct fiemap_extent_info *fieinfo = data; - unsigned char blksize_bits = inode->i_sb->s_blocksize_bits; __u64 logical; __u64 physical; __u64 length; + loff_t size; __u32 flags = 0; - int error; + int ret = 0; + struct fiemap_extent_info *fieinfo = data; + unsigned char blksize_bits; - logical = (__u64)newex->ec_block << blksize_bits; + blksize_bits = inode->i_sb->s_blocksize_bits; + logical = (__u64)newex->ec_block << blksize_bits; if (newex->ec_start == 0) { - pgoff_t offset; - struct page *page; + /* + * No extent in extent-tree contains block @newex->ec_start, + * then the block may stay in 1)a hole or 2)delayed-extent. + * + * Holes or delayed-extents are processed as follows. + * 1. lookup dirty pages with specified range in pagecache. + * If no page is got, then there is no delayed-extent and + * return with EXT_CONTINUE. + * 2. find the 1st mapped buffer, + * 3. check if the mapped buffer is both in the request range + * and a delayed buffer. If not, there is no delayed-extent, + * then return. + * 4. a delayed-extent is found, the extent will be collected. + */ + ext4_lblk_t end = 0; + pgoff_t last_offset; + pgoff_t offset; + pgoff_t index; + struct page **pages = NULL; struct buffer_head *bh = NULL; + struct buffer_head *head = NULL; + unsigned int nr_pages = PAGE_SIZE / sizeof(struct page *); + + pages = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (pages == NULL) + return -ENOMEM; offset = logical >> PAGE_SHIFT; - page = find_get_page(inode->i_mapping, offset); - if (!page || !page_has_buffers(page)) - return EXT_CONTINUE; +repeat: + last_offset = offset; + head = NULL; + ret = find_get_pages_tag(inode->i_mapping, &offset, + PAGECACHE_TAG_DIRTY, nr_pages, pages); + + if (!(flags & FIEMAP_EXTENT_DELALLOC)) { + /* First time, try to find a mapped buffer. 
*/ + if (ret == 0) { +out: + for (index = 0; index < ret; index++) + page_cache_release(pages[index]); + /* just a hole. */ + kfree(pages); + return EXT_CONTINUE; + } - bh = page_buffers(page); + /* Try to find the 1st mapped buffer. */ + end = ((__u64)pages[0]->index << PAGE_SHIFT) >> + blksize_bits; + if (!page_has_buffers(pages[0])) + goto out; + head = page_buffers(pages[0]); + if (!head) + goto out; - if (!bh) - return EXT_CONTINUE; + bh = head; + do { + if (buffer_mapped(bh)) { + /* get the 1st mapped buffer. */ + if (end > newex->ec_block + + newex->ec_len) + /* The buffer is out of + * the request range. + */ + goto out; + goto found_mapped_buffer; + } + bh = bh->b_this_page; + end++; + } while (bh != head); - if (buffer_delay(bh)) { - flags |= FIEMAP_EXTENT_DELALLOC; - page_cache_release(page); + /* No mapped buffer found. */ + goto out; } else { - page_cache_release(page); - return EXT_CONTINUE; + /*Find contiguous delayed buffers. */ + if (ret > 0 && pages[0]->index == last_offset) + head = page_buffers(pages[0]); + bh = head; + } + +found_mapped_buffer: + if (bh != NULL && buffer_delay(bh)) { + /* 1st or contiguous delayed buffer found. */ + if (!(flags & FIEMAP_EXTENT_DELALLOC)) { + /* + * 1st delayed buffer found, record + * the start of extent. + */ + flags |= FIEMAP_EXTENT_DELALLOC; + newex->ec_block = end; + logical = (__u64)end << blksize_bits; + } + /* Find contiguous delayed buffers. */ + do { + if (!buffer_delay(bh)) + goto found_delayed_extent; + bh = bh->b_this_page; + end++; + } while (bh != head); + + for (index = 1; index < ret; index++) { + if (!page_has_buffers(pages[index])) { + bh = NULL; + break; + } + head = page_buffers(pages[index]); + if (!head) { + bh = NULL; + break; + } + if (pages[index]->index != + pages[0]->index + index) { + /* Blocks are not contiguous. */ + bh = NULL; + break; + } + bh = head; + do { + if (!buffer_delay(bh)) + /* Delayed-extent ends. 
*/ + goto found_delayed_extent; + bh = bh->b_this_page; + end++; + } while (bh != head); + } + } else if (!(flags & FIEMAP_EXTENT_DELALLOC)) + /* a hole found. */ + goto out; + +found_delayed_extent: + newex->ec_len = min(end - newex->ec_block, + (ext4_lblk_t)EXT_INIT_MAX_LEN); + if (ret == nr_pages && bh != NULL && + newex->ec_len < EXT_INIT_MAX_LEN && + buffer_delay(bh)) { + /* Have not collected an extent and continue. */ + for (index = 0; index < ret; index++) + page_cache_release(pages[index]); + goto repeat; } + + for (index = 0; index < ret; index++) + page_cache_release(pages[index]); + kfree(pages); } physical = (__u64)newex->ec_start << blksize_bits; @@ -3822,32 +3947,16 @@ static int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_path *path, if (ex && ext4_ext_is_uninitialized(ex)) flags |= FIEMAP_EXTENT_UNWRITTEN; - /* - * If this extent reaches EXT_MAX_BLOCK, it must be last. - * - * Or if ext4_ext_next_allocated_block is EXT_MAX_BLOCK, - * this also indicates no more allocated blocks. 
- * - * XXX this might miss a single-block extent at EXT_MAX_BLOCK - */ - if (ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK || - newex->ec_block + newex->ec_len - 1 == EXT_MAX_BLOCK) { - loff_t size = i_size_read(inode); - loff_t bs = EXT4_BLOCK_SIZE(inode->i_sb); - + size = i_size_read(inode); + if (logical + length >= size) flags |= FIEMAP_EXTENT_LAST; - if ((flags & FIEMAP_EXTENT_DELALLOC) && - logical+length > size) - length = (size - logical + bs - 1) & ~(bs-1); - } - error = fiemap_fill_next_extent(fieinfo, logical, physical, + ret = fiemap_fill_next_extent(fieinfo, logical, physical, length, flags); - if (error < 0) - return error; - if (error == 1) + if (ret < 0) + return ret; + if (ret == 1) return EXT_BREAK; - return EXT_CONTINUE; } -- cgit v1.2.3 From 32a9bb57d7c1fd04ae0f72b8f671501f000a0e9f Mon Sep 17 00:00:00 2001 From: Manish Katiyar Date: Sun, 27 Feb 2011 20:42:06 -0500 Subject: ext4: fix missing iput of root inode for some mount error paths This assures that the root inode is not leaked, and that sb->s_root is NULL, which will prevent generic_shutdown_super() from doing extra work, including call sync_filesystem, which ultimately results in ext4_sync_fs() getting called with an uninitialized struct super, which is the cause of the crash noted in Kernel Bugzilla #26752. 
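The cleanup pattern described here (one release point on the failure label, with the pointer reset to NULL when lookup itself failed) can be sketched in userspace C. This is a simplified model under stated assumptions: struct toy_inode, struct toy_sb, toy_iput() and toy_fill_super() are hypothetical names, and the model assumes, as with the kernel's iput(), that releasing a NULL inode is a no-op.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct inode and struct super_block. */
struct toy_inode {
	int refcount;
	int is_dir;
};

struct toy_sb {
	struct toy_inode *s_root;
};

/* Drop one reference; a NULL inode is silently ignored. */
static void toy_iput(struct toy_inode *inode)
{
	if (inode)
		inode->refcount--;
}

/*
 * candidate == NULL models a failed root inode lookup.
 * dentry_alloc_ok == 0 models a failed root dentry allocation.
 * Every failure jumps to one label that drops the reference and
 * clears sb->s_root, so no path can leak the inode.
 */
static int toy_fill_super(struct toy_sb *sb, struct toy_inode *candidate,
			  int dentry_alloc_ok)
{
	struct toy_inode *root = NULL;

	if (!candidate)
		goto failed_mount;      /* root stays NULL */

	root = candidate;
	root->refcount++;               /* lookup took a reference */

	if (!root->is_dir)
		goto failed_mount;      /* corrupt root inode */
	if (!dentry_alloc_ok)
		goto failed_mount;      /* root dentry allocation failed */

	sb->s_root = root;
	return 0;

failed_mount:
	toy_iput(root);
	sb->s_root = NULL;
	return -1;
}
```

Because the release lives on the label, adding a new early-exit check in such a function does not require remembering a matching put at the call site.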
https://bugzilla.kernel.org/show_bug.cgi?id=26752 Signed-off-by: Manish Katiyar Signed-off-by: "Theodore Ts'o" --- fs/ext4/super.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 33c398785e5..bd6e86aa82a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3521,17 +3521,16 @@ no_journal: if (IS_ERR(root)) { ext4_msg(sb, KERN_ERR, "get root inode failed"); ret = PTR_ERR(root); + root = NULL; goto failed_mount4; } if (!S_ISDIR(root->i_mode) || !root->i_blocks || !root->i_size) { - iput(root); ext4_msg(sb, KERN_ERR, "corrupt root inode, run e2fsck"); goto failed_mount4; } sb->s_root = d_alloc_root(root); if (!sb->s_root) { ext4_msg(sb, KERN_ERR, "get root dentry failed"); - iput(root); ret = -ENOMEM; goto failed_mount4; } @@ -3647,6 +3646,8 @@ cantfind_ext4: goto failed_mount; failed_mount4: + iput(root); + sb->s_root = NULL; ext4_msg(sb, KERN_ERR, "mount failed"); destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq); failed_mount_wq: -- cgit v1.2.3 From 8e8eaabefee3ff645b9551ee32c6c54c7d80ad19 Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Sun, 27 Feb 2011 23:32:12 -0500 Subject: ext4: use the nblocks arg to ext4_truncate_restart_trans() nblocks is passed into ext4_truncate_restart_trans() from ext4_ext_truncate_extend_restart() with a value different from the default blocks_for_truncate(), but is being ignored. The two other calls to ext4_truncate_restart_trans() already pass the default value, which is then being recalculated inside the function. Fix the problem by using the passed argument. 
Signed-off-by: Amir Goldstein --- fs/ext4/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index dcc2287433b..67e7a3caf9e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -173,7 +173,7 @@ int ext4_truncate_restart_trans(handle_t *handle, struct inode *inode, BUG_ON(EXT4_JOURNAL(inode) == NULL); jbd_debug(2, "restarting handle %p\n", handle); up_write(&EXT4_I(inode)->i_data_sem); - ret = ext4_journal_restart(handle, blocks_for_truncate(inode)); + ret = ext4_journal_restart(handle, nblocks); down_write(&EXT4_I(inode)->i_data_sem); ext4_discard_preallocations(inode); -- cgit v1.2.3 From d39195c33bb1b5fdcb0f416e8a0b34bfdb07a027 Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Mon, 28 Feb 2011 00:53:45 -0500 Subject: ext4: skip orphan cleanup if fs has unknown ROCOMPAT features Orphan cleanup is currently executed even if the file system has some number of unknown ROCOMPAT features, which deletes inodes and frees blocks, which could be very bad for some RO_COMPAT features, especially the SNAPSHOT feature. This patch skips the orphan cleanup if it contains readonly compatible features not known by this ext4 implementation, which would prevent the fs from being mounted (or remounted) readwrite. 
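The check can be illustrated with a small userspace C sketch. The mask value TOY_RO_COMPAT_KNOWN and the helpers toy_feature_set_ok_rw() and toy_orphan_cleanup() are hypothetical; the real test is performed by ext4_feature_set_ok() against the feature bits this ext4 implementation supports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask of read-only-compatible features this code knows. */
#define TOY_RO_COMPAT_KNOWN 0x0007u

/* A read-write mount is allowed only if no unknown RO_COMPAT bit is set. */
static bool toy_feature_set_ok_rw(uint32_t ro_compat)
{
	return (ro_compat & ~TOY_RO_COMPAT_KNOWN) == 0;
}

/*
 * Returns the number of orphan inodes processed. If the feature set
 * would not allow a read-write mount, nothing is touched, because
 * deleting inodes or freeing blocks could damage data owned by a
 * feature this implementation does not understand.
 */
static unsigned int toy_orphan_cleanup(uint32_t ro_compat,
				       unsigned int orphans)
{
	if (!toy_feature_set_ok_rw(ro_compat))
		return 0;   /* skip cleanup entirely */
	return orphans;
}
```

With the assumed mask, a filesystem carrying bit 0x0080 is left alone, while one carrying only bits inside 0x0007 has all of its orphans processed.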
Signed-off-by: Amir Goldstein Signed-off-by: "Theodore Ts'o" --- fs/ext4/super.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index bd6e86aa82a..9eaec22aa08 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -75,6 +75,7 @@ static void ext4_write_super(struct super_block *sb); static int ext4_freeze(struct super_block *sb); static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data); +static int ext4_feature_set_ok(struct super_block *sb, int readonly); static void ext4_destroy_lazyinit_thread(void); static void ext4_unregister_li_request(struct super_block *sb); static void ext4_clear_request_list(void); @@ -2117,6 +2118,13 @@ static void ext4_orphan_cleanup(struct super_block *sb, return; } + /* Check if feature set would not allow a r/w mount */ + if (!ext4_feature_set_ok(sb, 0)) { + ext4_msg(sb, KERN_INFO, "Skipping orphan cleanup due to " + "unknown ROCOMPAT features"); + return; + } + if (EXT4_SB(sb)->s_mount_state & EXT4_ERROR_FS) { if (es->s_last_orphan) jbd_debug(1, "Errors on filesystem, " -- cgit v1.2.3 From b616844310a6c8a4ab405d3436bbb6e53cfd852f Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Mon, 28 Feb 2011 13:12:38 -0500 Subject: ext4: optimize ext4_bio_write_page() when no extent conversion is needed If no extent conversion is required, wake up any processes waiting for the page's writeback to be complete and free the ext4_io_end structure directly in ext4_end_bio() instead of dropping it on the linked list (which requires taking a spinlock to queue and dequeue the io_end structure), and waiting for the workqueue to do this work. This removes an extra scheduling delay before process waiting for an fsync() to complete gets woken up, and it also reduces the CPU overhead for a random write workload. 
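The fast path can be sketched in userspace C. This is only a model: struct toy_io_end, TOY_IO_END_UNWRITTEN and toy_end_bio() are hypothetical names, and a plain counter stands in for the spinlock-protected per-inode list and the workqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag: the I/O covered an unwritten extent. */
#define TOY_IO_END_UNWRITTEN 0x1u

struct toy_io_end {
	unsigned int flag;
	bool freed;   /* released directly in the completion handler */
	bool queued;  /* deferred to the conversion worker */
};

/*
 * Completion handler model. When no extent conversion is needed the
 * io_end is released immediately; only unwritten extents pay for the
 * queueing step (modelled here by bumping *queue_len).
 */
static void toy_end_bio(struct toy_io_end *io_end, unsigned int *queue_len)
{
	if (!(io_end->flag & TOY_IO_END_UNWRITTEN)) {
		io_end->freed = true;
		return;
	}
	io_end->queued = true;
	(*queue_len)++;
}
```

In this model an ordinary overwrite never touches the queue, which is the source of the latency and CPU savings the commit message describes.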
Signed-off-by: "Theodore Ts'o" --- fs/ext4/page-io.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index d5c391ffad7..0cfd03e19d7 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -259,6 +259,11 @@ static void ext4_end_bio(struct bio *bio, int error) bi_sector >> (inode->i_blkbits - 9)); } + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) { + ext4_free_io_end(io_end); + return; + } + /* Add the io_end to per-inode completed io list*/ spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags); list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list); -- cgit v1.2.3 From 198868f35de99e7197829314076e5465c37e4cc5 Mon Sep 17 00:00:00 2001 From: Mingming Cao Date: Sat, 5 Mar 2011 11:52:45 -0500 Subject: ext4: Use single thread to perform DIO unwritten convertion While running ext4 tests on a multi-core machine, we found that there are per-CPU ext4-dio-unwritten threads processing the conversion of unwritten extents to written for IOs completed from the async direct IO path. One thread per filesystem is enough; we don't need per-CPU threads to work on conversion.
Signed-off-by: Mingming Cao --- fs/ext4/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 9eaec22aa08..b357c2700d7 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3514,7 +3514,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) percpu_counter_set(&sbi->s_dirtyblocks_counter, 0); no_journal: - EXT4_SB(sb)->dio_unwritten_wq = create_workqueue("ext4-dio-unwritten"); + EXT4_SB(sb)->dio_unwritten_wq = create_singlethread_workqueue("ext4-dio-unwritten"); if (!EXT4_SB(sb)->dio_unwritten_wq) { printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n"); goto failed_mount_wq; -- cgit v1.2.3 From 688f869ce3bdc892daa993534dc6df18c95df931 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Wed, 16 Mar 2011 17:16:31 -0400 Subject: ext4: Initialize fsync transaction ids in ext4_new_inode() When allocating a new inode, we need to make sure i_sync_tid and i_datasync_tid are initialized. Otherwise, one or both of these two values could be left initialized to zero, which could potentially result in BUG_ON in jbd2_journal_commit_transaction. (This could happen by having journal->commit_request getting set to zero, which could wake up the kjournald process even though there is no running transaction, which then causes a BUG_ON via the J_ASSERT(j_running_transaction != NULL) statement.)
Signed-off-by: "Theodore Ts'o" --- fs/ext4/ialloc.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 2fd3b0e4178..a679a482c98 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -1054,6 +1054,11 @@ got: } } + if (ext4_handle_valid(handle)) { + ei->i_sync_tid = handle->h_transaction->t_tid; + ei->i_datasync_tid = handle->h_transaction->t_tid; + } + err = ext4_mark_inode_dirty(handle, inode); if (err) { ext4_std_error(sb, err); -- cgit v1.2.3 From c2cc7028e41c76e44b6e247c4b495c7523b23c87 Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Sun, 20 Mar 2011 20:08:48 -0400 Subject: jbd2: add the b_cow_tid field to journal_head struct The b_cow_tid field will be used by the ext4 snapshots code to store the transaction id when the buffer was last cowed. Merging this patch to mainline will allow users to test ext4 snapshots as a standalone module, without the need to patch and install a development kernel. On 64bit machines this field fills in a padding "hole" and does not increase the size of the struct. On a 32bit machine this patch increases the size of the struct from 60 to 64 bytes. Signed-off-by: Amir Goldstein Signed-off-by: "Theodore Ts'o" --- include/linux/journal-head.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h index 525aac3c97d..44e95d0a721 100644 --- a/include/linux/journal-head.h +++ b/include/linux/journal-head.h @@ -40,6 +40,13 @@ struct journal_head { */ unsigned b_modified; + /* + * This field tracks the last transaction id in which this buffer + * has been cowed + * [jbd_lock_bh_state()] + */ + unsigned b_cow_tid; + /* * Copy of the buffer data frozen for writing to the log.
* [jbd_lock_bh_state()] -- cgit v1.2.3 From 93737456d68ddcb86232f669b83da673dd12e351 Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Sun, 20 Mar 2011 21:13:43 -0400 Subject: jbd2: add COW fields to struct jbd2_journal_handle Add fields needed for the copy-on-write ext4 development work. The h_cowing flag is used by ext4 snapshots code to mark the task in COWING state. The h_XXX_credits fields are used to track buffer credits usage (accounted by COW and non-COW operations). The h_cow_XXX fields are used as per task debugging counters. Merging this commit into mainline will allow users to test ext4 snapshots as a standalone module, without the need to patch and install a development kernel. Signed-off-by: Amir Goldstein Signed-off-by: "Theodore Ts'o" --- include/linux/jbd2.h | 28 +++++++++++++++++++++++++--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 27e79c27ba0..a32dcaec04e 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -432,13 +432,35 @@ struct jbd2_journal_handle int h_err; /* Flags [no locking] */ - unsigned int h_sync: 1; /* sync-on-close */ - unsigned int h_jdata: 1; /* force data journaling */ - unsigned int h_aborted: 1; /* fatal error on handle */ + unsigned int h_sync:1; /* sync-on-close */ + unsigned int h_jdata:1; /* force data journaling */ + unsigned int h_aborted:1; /* fatal error on handle */ + unsigned int h_cowing:1; /* COWing block to snapshot */ + + /* Number of buffers requested by user: + * (before adding the COW credits factor) */ + unsigned int h_base_credits:14; + + /* Number of buffers the user is allowed to dirty: + * (counts only buffers dirtied when !h_cowing) */ + unsigned int h_user_credits:14; + #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map h_lockdep_map; #endif + +#ifdef CONFIG_JBD2_DEBUG + /* COW debugging counters: */ + unsigned int h_cow_moved; /* blocks moved to snapshot */ + unsigned int h_cow_copied; /* blocks copied to snapshot */ + 
unsigned int h_cow_ok_jh; /* blocks already COWed during current + transaction */ + unsigned int h_cow_ok_bitmap; /* blocks not set in COW bitmap */ + unsigned int h_cow_ok_mapped;/* blocks already mapped in snapshot */ + unsigned int h_cow_bitmaps; /* COW bitmaps created */ + unsigned int h_cow_excluded; /* blocks set in exclude bitmap */ +#endif }; -- cgit v1.2.3 From ef6078930263bfcdcfe4dddb2cd85254b4cf4f5c Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Sun, 20 Mar 2011 21:18:44 -0400 Subject: ext4: handle errors in ext4_rename Checking return code from ext4_journal_get_write_access() is important with snapshots, because this function invokes COW, so may return new errors, such as ENOSPC. We move the call to ext4_journal_get_write_access earlier in the function, to simplify error handling in the case that this function returns an error. Signed-off-by: Amir Goldstein Signed-off-by: "Theodore Ts'o" --- fs/ext4/namei.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 5485390d32c..ad87584aa8d 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2409,6 +2409,10 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry, if (!new_inode && new_dir != old_dir && EXT4_DIR_LINK_MAX(new_dir)) goto end_rename; + BUFFER_TRACE(dir_bh, "get_write_access"); + retval = ext4_journal_get_write_access(handle, dir_bh); + if (retval) + goto end_rename; } if (!new_bh) { retval = ext4_add_entry(handle, new_dentry, old_inode); @@ -2416,7 +2420,9 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry, goto end_rename; } else { BUFFER_TRACE(new_bh, "get write access"); - ext4_journal_get_write_access(handle, new_bh); + retval = ext4_journal_get_write_access(handle, new_bh); + if (retval) + goto end_rename; new_de->inode = cpu_to_le32(old_inode->i_ino); if (EXT4_HAS_INCOMPAT_FEATURE(new_dir->i_sb, EXT4_FEATURE_INCOMPAT_FILETYPE)) @@ -2477,8 +2483,6 @@ static int
ext4_rename(struct inode *old_dir, struct dentry *old_dentry, old_dir->i_ctime = old_dir->i_mtime = ext4_current_time(old_dir); ext4_update_dx_flag(old_dir); if (dir_bh) { - BUFFER_TRACE(dir_bh, "get_write_access"); - ext4_journal_get_write_access(handle, dir_bh); PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) = cpu_to_le32(new_dir->i_ino); BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata"); -- cgit v1.2.3 From 537a03103c67c4688b1e8e6671ad119aec5e2efb Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Sun, 20 Mar 2011 22:57:02 -0400 Subject: ext4: unify the ext4_handle_release_buffer() api There are two wrapper functions which do exactly the same thing: ext4_journal_release_buffer(), and ext4_handle_release_buffer(). In addition, ext4_xattr_block_set() calls jbd2_journal_release_buffer() directly. Unify all of the code to use ext4_handle_release_buffer(), and get rid of ext4_journal_release_buffer(). Signed-off-by: Amir Goldstein Signed-off-by: "Theodore Ts'o" --- fs/ext4/ext4_jbd2.h | 7 ------- fs/ext4/resize.c | 8 ++++---- fs/ext4/xattr.c | 2 +- 3 files changed, 5 insertions(+), 12 deletions(-) diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h index d8b992e658c..e25e99bf7ee 100644 --- a/fs/ext4/ext4_jbd2.h +++ b/fs/ext4/ext4_jbd2.h @@ -202,13 +202,6 @@ static inline int ext4_handle_has_enough_credits(handle_t *handle, int needed) return 1; } -static inline void ext4_journal_release_buffer(handle_t *handle, - struct buffer_head *bh) -{ - if (ext4_handle_valid(handle)) - jbd2_journal_release_buffer(handle, bh); -} - static inline handle_t *ext4_journal_start(struct inode *inode, int nblocks) { return ext4_journal_start_sb(inode->i_sb, nblocks); diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 66fec4ee76f..80bbc9c60c2 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -499,12 +499,12 @@ static int add_new_gdb(handle_t *handle, struct inode *inode, return err; exit_inode: - /* ext4_journal_release_buffer(handle, iloc.bh); */ + /* 
ext4_handle_release_buffer(handle, iloc.bh); */ brelse(iloc.bh); exit_dindj: - /* ext4_journal_release_buffer(handle, dind); */ + /* ext4_handle_release_buffer(handle, dind); */ exit_sbh: - /* ext4_journal_release_buffer(handle, EXT4_SB(sb)->s_sbh); */ + /* ext4_handle_release_buffer(handle, EXT4_SB(sb)->s_sbh); */ exit_dind: brelse(dind); exit_bh: @@ -586,7 +586,7 @@ static int reserve_backup_gdb(handle_t *handle, struct inode *inode, /* int j; for (j = 0; j < i; j++) - ext4_journal_release_buffer(handle, primary[j]); + ext4_handle_release_buffer(handle, primary[j]); */ goto exit_bh; } diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index f4c03af05d6..b545ca1c459 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -735,7 +735,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode, int offset = (char *)s->here - bs->bh->b_data; unlock_buffer(bs->bh); - jbd2_journal_release_buffer(handle, bs->bh); + ext4_handle_release_buffer(handle, bs->bh); if (ce) { mb_cache_entry_release(ce); ce = NULL; -- cgit v1.2.3 From d67d1218344009970ba0deb7eb15a3984518ddd0 Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Sun, 20 Mar 2011 22:59:02 -0400 Subject: ext4: handle errors in ext4_clear_blocks() Checking return code from ext4_journal_get_write_access() is important with snapshots, because this function invokes COW, so may return new errors, such as ENOSPC. ext4_clear_blocks() now returns < 0 for fatal errors, in which case, ext4_free_data() is aborted. Signed-off-by: Amir Goldstein Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 46 ++++++++++++++++++++++++++-------------------- 1 file changed, 26 insertions(+), 20 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 67e7a3caf9e..fc8c0ce8431 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4096,6 +4096,9 @@ no_top: * * We release `count' blocks on disk, but (last - first) may be greater * than `count' because there can be holes in there. 
+ * + * Return 0 on success, 1 on invalid block range + * and < 0 on fatal error. */ static int ext4_clear_blocks(handle_t *handle, struct inode *inode, struct buffer_head *bh, @@ -4122,25 +4125,21 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode, if (bh) { BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_metadata(handle, inode, bh); - if (unlikely(err)) { - ext4_std_error(inode->i_sb, err); - return 1; - } + if (unlikely(err)) + goto out_err; } err = ext4_mark_inode_dirty(handle, inode); - if (unlikely(err)) { - ext4_std_error(inode->i_sb, err); - return 1; - } + if (unlikely(err)) + goto out_err; err = ext4_truncate_restart_trans(handle, inode, blocks_for_truncate(inode)); - if (unlikely(err)) { - ext4_std_error(inode->i_sb, err); - return 1; - } + if (unlikely(err)) + goto out_err; if (bh) { BUFFER_TRACE(bh, "retaking write access"); - ext4_journal_get_write_access(handle, bh); + err = ext4_journal_get_write_access(handle, bh); + if (unlikely(err)) + goto out_err; } } @@ -4149,6 +4148,9 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode, ext4_free_blocks(handle, inode, NULL, block_to_free, count, flags); return 0; +out_err: + ext4_std_error(inode->i_sb, err); + return err; } /** @@ -4182,7 +4184,7 @@ static void ext4_free_data(handle_t *handle, struct inode *inode, ext4_fsblk_t nr; /* Current block # */ __le32 *p; /* Pointer into inode/ind for current block */ - int err; + int err = 0; if (this_bh) { /* For indirect block */ BUFFER_TRACE(this_bh, "get_write_access"); @@ -4204,9 +4206,10 @@ static void ext4_free_data(handle_t *handle, struct inode *inode, } else if (nr == block_to_free + count) { count++; } else { - if (ext4_clear_blocks(handle, inode, this_bh, - block_to_free, count, - block_to_free_p, p)) + err = ext4_clear_blocks(handle, inode, this_bh, + block_to_free, count, + block_to_free_p, p); + if (err) break; block_to_free = nr; block_to_free_p = p; @@ -4215,9 +4218,12 @@ static void 
ext4_free_data(handle_t *handle, struct inode *inode, } } - if (count > 0) - ext4_clear_blocks(handle, inode, this_bh, block_to_free, - count, block_to_free_p, p); + if (!err && count > 0) + err = ext4_clear_blocks(handle, inode, this_bh, block_to_free, + count, block_to_free_p, p); + if (err < 0) + /* fatal error */ + return; if (this_bh) { BUFFER_TRACE(this_bh, "call ext4_handle_dirty_metadata"); -- cgit v1.2.3 From a56e69c28ad0782a99f3f196e93d57ba5a7e2324 Mon Sep 17 00:00:00 2001 From: Tao Ma Date: Sun, 20 Mar 2011 23:16:58 -0400 Subject: ext4: add FITRIM to compat_ioctl. FITRIM isn't handled in compat_ioctl, so a 32 bit program can't use it on a 64 bit platform. Add it to compat_ioctl. Signed-off-by: Tao Ma Signed-off-by: "Theodore Ts'o" --- fs/ext4/ioctl.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index c052c9f0f3a..bb424de9953 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -427,6 +427,7 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) return err; } case EXT4_IOC_MOVE_EXT: + case FITRIM: break; default: return -ENOIOCTLCMD; -- cgit v1.2.3 From 21149d611ecd0faf60f4ef94aa2bf8ed872f92bf Mon Sep 17 00:00:00 2001 From: Robin Dong Date: Mon, 21 Mar 2011 20:39:22 -0400 Subject: ext4: add missing space in printk's in __ext4_grp_locked_error() While performance-testing an ext4 filesystem, we observed a warning like this: EXT4-fs error (device sda7): ext4_mb_generate_buddy:718: group 259825901 blocks in bitmap, 26057 in gd Instead, it should be "group 2598, 25901 blocks in bitmap, 26057 in gd" Reviewed-by: Coly Li Cc: Tao Ma Signed-off-by: Robin Dong Signed-off-by: "Theodore Ts'o" --- fs/ext4/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index b357c2700d7..ccfa6865ea5 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -595,7 +595,7 @@ __acquires(bitlock) vaf.fmt = fmt; vaf.va = &args; - printk(KERN_CRIT "EXT4-fs
error (device %s): %s:%d: group %u", + printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: group %u, ", sb->s_id, function, line, grp); if (ino) printk(KERN_CONT "inode %lu: ", ino); -- cgit v1.2.3 From 4596fe07679ff0fae904515691ea747467614871 Mon Sep 17 00:00:00 2001 From: Eric Sandeen Date: Mon, 21 Mar 2011 21:25:13 -0400 Subject: ext4: don't kfree uninitialized s_group_info members We can call kfree on uninitialized members of the s_group_info array on the error path. We can avoid this by kzalloc'ing the array. This doesn't entirely solve the oops on mount if we fail down this path; failed_mount4: frees the sbi, for one, which gets referenced later in the failed mount paths - I haven't worked that out yet. https://bugzilla.kernel.org/show_bug.cgi?id=30872 Reported-by: Eugene A. Shatokhin Signed-off-by: Eric Sandeen Signed-off-by: "Theodore Ts'o" --- fs/ext4/mballoc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 2f6f0dd08fc..cdc84953f1d 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2386,7 +2386,7 @@ static int ext4_mb_init_backend(struct super_block *sb) /* An 8TB filesystem with 64-bit pointers requires a 4096 byte * kmalloc. A 128kb malloc should suffice for a 256TB filesystem. * So a two level scheme suffices for now. */ - sbi->s_group_info = kmalloc(array_size, GFP_KERNEL); + sbi->s_group_info = kzalloc(array_size, GFP_KERNEL); if (sbi->s_group_info == NULL) { printk(KERN_ERR "EXT4-fs: can't allocate buddy meta group\n"); return -ENOMEM; -- cgit v1.2.3 From 0562e0bad483d10e9651fbb8f21dc3d0bad57374 Mon Sep 17 00:00:00 2001 From: Jiaying Zhang Date: Mon, 21 Mar 2011 21:38:05 -0400 Subject: ext4: add more tracepoints and use dev_t in the trace buffer - Add more ext4 tracepoints. - Change ext4 tracepoints to use dev_t field with MAJOR/MINOR macros so that we can save 4 bytes in the ring buffer on some platforms.
- Add sync_mode to ext4_da_writepages, ext4_da_write_pages, and ext4_da_writepages_result tracepoints. Also remove for_reclaim field from ext4_da_writepages since it is usually not very useful. Signed-off-by: Jiaying Zhang Signed-off-by: "Theodore Ts'o" --- fs/ext4/balloc.c | 3 + fs/ext4/extents.c | 13 +- fs/ext4/fsync.c | 14 +- fs/ext4/ialloc.c | 1 + fs/ext4/inode.c | 24 +- fs/ext4/namei.c | 3 + include/trace/events/ext4.h | 775 +++++++++++++++++++++++++++++++++----------- include/trace/events/jbd2.h | 78 ++--- 8 files changed, 658 insertions(+), 253 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index adf96b82278..97b970e7dd1 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -21,6 +21,8 @@ #include "ext4_jbd2.h" #include "mballoc.h" +#include <trace/events/ext4.h> + /* * balloc.c contains the blocks allocation and deallocation routines */ @@ -342,6 +344,7 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group) * We do it here so the bitmap uptodate bit * get set with buffer lock held.
*/ + trace_ext4_read_block_bitmap_load(sb, block_group); set_bitmap_uptodate(bh); if (bh_submit_read(bh) < 0) { put_bh(bh); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 9ea1bc64ca6..f46f6e3c02d 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -44,6 +44,8 @@ #include "ext4_jbd2.h" #include "ext4_extents.h" +#include <trace/events/ext4.h> + static int ext4_ext_truncate_extend_restart(handle_t *handle, struct inode *inode, int needed) @@ -664,6 +666,8 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block, if (unlikely(!bh)) goto err; if (!bh_uptodate_or_lock(bh)) { + trace_ext4_ext_load_extent(inode, block, + path[ppos].p_block); if (bh_submit_read(bh) < 0) { put_bh(bh); goto err; @@ -3297,7 +3301,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, struct ext4_ext_path *path = NULL; struct ext4_extent_header *eh; struct ext4_extent newex, *ex; - ext4_fsblk_t newblock; + ext4_fsblk_t newblock = 0; int err = 0, depth, ret; unsigned int allocated = 0; struct ext4_allocation_request ar; @@ -3305,6 +3309,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, ext_debug("blocks %u/%u requested for inode %lu\n", map->m_lblk, map->m_len, inode->i_ino); + trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags); /* check in cache */ if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) { @@ -3525,6 +3530,8 @@ out2: ext4_ext_drop_refs(path); kfree(path); } + trace_ext4_ext_map_blocks_exit(inode, map->m_lblk, + newblock, map->m_len, err ? err : allocated); return err ?
err : allocated; } @@ -3658,6 +3665,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) return -EOPNOTSUPP; + trace_ext4_fallocate_enter(inode, offset, len, mode); map.m_lblk = offset >> blkbits; /* * We can't just convert len to max_blocks because @@ -3673,6 +3681,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) ret = inode_newsize_ok(inode, (len + offset)); if (ret) { mutex_unlock(&inode->i_mutex); + trace_ext4_fallocate_exit(inode, offset, max_blocks, ret); return ret; } retry: @@ -3717,6 +3726,8 @@ retry: goto retry; } mutex_unlock(&inode->i_mutex); + trace_ext4_fallocate_exit(inode, offset, max_blocks, + ret > 0 ? ret2 : ret); return ret > 0 ? ret2 : ret; } diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 7829b287822..7f74019d6d7 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -164,20 +164,20 @@ int ext4_sync_file(struct file *file, int datasync) J_ASSERT(ext4_journal_current_handle() == NULL); - trace_ext4_sync_file(file, datasync); + trace_ext4_sync_file_enter(file, datasync); if (inode->i_sb->s_flags & MS_RDONLY) return 0; ret = ext4_flush_completed_IO(inode); if (ret < 0) - return ret; + goto out; if (!journal) { ret = generic_file_fsync(file, datasync); if (!ret && !list_empty(&inode->i_dentry)) ext4_sync_parent(inode); - return ret; + goto out; } /* @@ -194,8 +194,10 @@ int ext4_sync_file(struct file *file, int datasync) * (they were dirtied by commit). But that's OK - the blocks are * safe in-journal, which is all fsync() needs to ensure. */ - if (ext4_should_journal_data(inode)) - return ext4_force_commit(inode->i_sb); + if (ext4_should_journal_data(inode)) { + ret = ext4_force_commit(inode->i_sb); + goto out; + } commit_tid = datasync ? 
ei->i_datasync_tid : ei->i_sync_tid; if (jbd2_log_start_commit(journal, commit_tid)) { @@ -215,5 +217,7 @@ int ext4_sync_file(struct file *file, int datasync) ret = jbd2_log_wait_commit(journal, commit_tid); } else if (journal->j_flags & JBD2_BARRIER) blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL); + out: + trace_ext4_sync_file_exit(inode, ret); return ret; } diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index a679a482c98..254e6b98b5b 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -152,6 +152,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group) * We do it here so the bitmap uptodate bit * get set with buffer lock held. */ + trace_ext4_load_inode_bitmap(sb, block_group); set_bitmap_uptodate(bh); if (bh_submit_read(bh) < 0) { put_bh(bh); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index fc8c0ce8431..f44307a2113 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -973,6 +973,7 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode, int count = 0; ext4_fsblk_t first_block = 0; + trace_ext4_ind_map_blocks_enter(inode, map->m_lblk, map->m_len, flags); J_ASSERT(!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))); J_ASSERT(handle != NULL || (flags & EXT4_GET_BLOCKS_CREATE) == 0); depth = ext4_block_to_path(inode, map->m_lblk, offsets, @@ -1058,6 +1059,8 @@ cleanup: partial--; } out: + trace_ext4_ind_map_blocks_exit(inode, map->m_lblk, + map->m_pblk, map->m_len, err); return err; } @@ -3379,6 +3382,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block) static int ext4_readpage(struct file *file, struct page *page) { + trace_ext4_readpage(page); return mpage_readpage(page, ext4_get_block); } @@ -3413,6 +3417,8 @@ static void ext4_invalidatepage(struct page *page, unsigned long offset) { journal_t *journal = EXT4_JOURNAL(page->mapping->host); + trace_ext4_invalidatepage(page, offset); + /* * free any io_end structure allocated for buffers to be discarded */ @@ -3434,6 +3440,8 @@ 
static int ext4_releasepage(struct page *page, gfp_t wait) { journal_t *journal = EXT4_JOURNAL(page->mapping->host); + trace_ext4_releasepage(page); + WARN_ON(PageChecked(page)); if (!page_has_buffers(page)) return 0; @@ -3792,11 +3800,16 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb, { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; + ssize_t ret; + trace_ext4_direct_IO_enter(inode, offset, iov_length(iov, nr_segs), rw); if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) - return ext4_ext_direct_IO(rw, iocb, iov, offset, nr_segs); - - return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs); + ret = ext4_ext_direct_IO(rw, iocb, iov, offset, nr_segs); + else + ret = ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs); + trace_ext4_direct_IO_exit(inode, offset, + iov_length(iov, nr_segs), rw, ret); + return ret; } /* @@ -4425,6 +4438,8 @@ void ext4_truncate(struct inode *inode) ext4_lblk_t last_block; unsigned blocksize = inode->i_sb->s_blocksize; + trace_ext4_truncate_enter(inode); + if (!ext4_can_truncate(inode)) return; @@ -4435,6 +4450,7 @@ void ext4_truncate(struct inode *inode) if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) { ext4_ext_truncate(inode); + trace_ext4_truncate_exit(inode); return; } @@ -4564,6 +4580,7 @@ out_stop: ext4_orphan_del(handle, inode); ext4_journal_stop(handle); + trace_ext4_truncate_exit(inode); } /* @@ -4695,6 +4712,7 @@ make_io: * has in-inode xattrs, or we don't have this inode in memory. * Read the block from disk. */ + trace_ext4_load_inode(inode); get_bh(bh); bh->b_end_io = end_buffer_read_sync; submit_bh(READ_META, bh); diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index ad87584aa8d..f9f83878843 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -40,6 +40,7 @@ #include "xattr.h" #include "acl.h" +#include <trace/events/ext4.h> /* * define how far ahead to read directories while searching them.
*/ @@ -2183,6 +2184,7 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry) struct ext4_dir_entry_2 *de; handle_t *handle; + trace_ext4_unlink_enter(dir, dentry); /* Initialize quotas before so that eventual writes go * in separate transaction */ dquot_initialize(dir); @@ -2228,6 +2230,7 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry) end_unlink: ext4_journal_stop(handle); brelse(bh); + trace_ext4_unlink_exit(dentry, retval); return retval; } diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index e5e345fb2a5..e09592d2f91 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -21,8 +21,7 @@ TRACE_EVENT(ext4_free_inode, TP_ARGS(inode), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( umode_t, mode ) __field( uid_t, uid ) @@ -31,8 +30,7 @@ TRACE_EVENT(ext4_free_inode, ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->mode = inode->i_mode; __entry->uid = inode->i_uid; @@ -41,9 +39,9 @@ TRACE_EVENT(ext4_free_inode, ), TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu", - __entry->dev_major, __entry->dev_minor, - (unsigned long) __entry->ino, __entry->mode, - __entry->uid, __entry->gid, + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->mode, __entry->uid, __entry->gid, (unsigned long long) __entry->blocks) ); @@ -53,21 +51,19 @@ TRACE_EVENT(ext4_request_inode, TP_ARGS(dir, mode), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, dir ) __field( umode_t, mode ) ), TP_fast_assign( - __entry->dev_major = MAJOR(dir->i_sb->s_dev); - __entry->dev_minor = MINOR(dir->i_sb->s_dev); + __entry->dev = dir->i_sb->s_dev; __entry->dir = dir->i_ino; __entry->mode = mode; ), 
TP_printk("dev %d,%d dir %lu mode 0%o", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->dir, __entry->mode) ); @@ -77,23 +73,21 @@ TRACE_EVENT(ext4_allocate_inode, TP_ARGS(inode, dir, mode), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( ino_t, dir ) __field( umode_t, mode ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->dir = dir->i_ino; __entry->mode = mode; ), TP_printk("dev %d,%d ino %lu dir %lu mode 0%o", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, (unsigned long) __entry->dir, __entry->mode) ); @@ -104,21 +98,19 @@ TRACE_EVENT(ext4_evict_inode, TP_ARGS(inode), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( int, nlink ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->nlink = inode->i_nlink; ), TP_printk("dev %d,%d ino %lu nlink %d", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->nlink) ); @@ -128,21 +120,19 @@ TRACE_EVENT(ext4_drop_inode, TP_ARGS(inode, drop), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( int, drop ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->drop = drop; ), TP_printk("dev %d,%d ino %lu drop %d", - __entry->dev_major, __entry->dev_minor, + 
MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->drop) ); @@ -152,21 +142,19 @@ TRACE_EVENT(ext4_mark_inode_dirty, TP_ARGS(inode, IP), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field(unsigned long, ip ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->ip = IP; ), TP_printk("dev %d,%d ino %lu caller %pF", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, (void *)__entry->ip) ); @@ -176,21 +164,19 @@ TRACE_EVENT(ext4_begin_ordered_truncate, TP_ARGS(inode, new_size), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( loff_t, new_size ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->new_size = new_size; ), TP_printk("dev %d,%d ino %lu new_size %lld", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, (long long) __entry->new_size) ); @@ -203,8 +189,7 @@ DECLARE_EVENT_CLASS(ext4__write_begin, TP_ARGS(inode, pos, len, flags), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( loff_t, pos ) __field( unsigned int, len ) @@ -212,8 +197,7 @@ DECLARE_EVENT_CLASS(ext4__write_begin, ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->pos = pos; __entry->len = len; @@ -221,7 +205,7 @@ DECLARE_EVENT_CLASS(ext4__write_begin, ), TP_printk("dev %d,%d ino %lu pos 
%llu len %u flags %u", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->pos, __entry->len, __entry->flags) ); @@ -249,8 +233,7 @@ DECLARE_EVENT_CLASS(ext4__write_end, TP_ARGS(inode, pos, len, copied), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( loff_t, pos ) __field( unsigned int, len ) @@ -258,8 +241,7 @@ DECLARE_EVENT_CLASS(ext4__write_end, ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->pos = pos; __entry->len = len; @@ -267,9 +249,9 @@ DECLARE_EVENT_CLASS(ext4__write_end, ), TP_printk("dev %d,%d ino %lu pos %llu len %u copied %u", - __entry->dev_major, __entry->dev_minor, - (unsigned long) __entry->ino, __entry->pos, - __entry->len, __entry->copied) + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->pos, __entry->len, __entry->copied) ); DEFINE_EVENT(ext4__write_end, ext4_ordered_write_end, @@ -310,22 +292,20 @@ TRACE_EVENT(ext4_writepage, TP_ARGS(inode, page), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( pgoff_t, index ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->index = page->index; ), TP_printk("dev %d,%d ino %lu page_index %lu", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->index) ); @@ -335,43 +315,39 @@ TRACE_EVENT(ext4_da_writepages, TP_ARGS(inode, wbc), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( long, nr_to_write ) __field( 
long, pages_skipped ) __field( loff_t, range_start ) __field( loff_t, range_end ) + __field( int, sync_mode ) __field( char, for_kupdate ) - __field( char, for_reclaim ) __field( char, range_cyclic ) __field( pgoff_t, writeback_index ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->nr_to_write = wbc->nr_to_write; __entry->pages_skipped = wbc->pages_skipped; __entry->range_start = wbc->range_start; __entry->range_end = wbc->range_end; + __entry->sync_mode = wbc->sync_mode; __entry->for_kupdate = wbc->for_kupdate; - __entry->for_reclaim = wbc->for_reclaim; __entry->range_cyclic = wbc->range_cyclic; __entry->writeback_index = inode->i_mapping->writeback_index; ), TP_printk("dev %d,%d ino %lu nr_to_write %ld pages_skipped %ld " - "range_start %llu range_end %llu " - "for_kupdate %d for_reclaim %d " - "range_cyclic %d writeback_index %lu", - __entry->dev_major, __entry->dev_minor, + "range_start %llu range_end %llu sync_mode %d " + "for_kupdate %d range_cyclic %d writeback_index %lu", + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->nr_to_write, __entry->pages_skipped, __entry->range_start, - __entry->range_end, - __entry->for_kupdate, __entry->for_reclaim, - __entry->range_cyclic, + __entry->range_end, __entry->sync_mode, + __entry->for_kupdate, __entry->range_cyclic, (unsigned long) __entry->writeback_index) ); @@ -381,8 +357,7 @@ TRACE_EVENT(ext4_da_write_pages, TP_ARGS(inode, mpd), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( __u64, b_blocknr ) __field( __u32, b_size ) @@ -390,11 +365,11 @@ TRACE_EVENT(ext4_da_write_pages, __field( unsigned long, first_page ) __field( int, io_done ) __field( int, pages_written ) + __field( int, sync_mode ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev);
- __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->b_blocknr = mpd->b_blocknr; __entry->b_size = mpd->b_size; @@ -402,14 +377,18 @@ TRACE_EVENT(ext4_da_write_pages, __entry->first_page = mpd->first_page; __entry->io_done = mpd->io_done; __entry->pages_written = mpd->pages_written; + __entry->sync_mode = mpd->wbc->sync_mode; ), - TP_printk("dev %d,%d ino %lu b_blocknr %llu b_size %u b_state 0x%04x first_page %lu io_done %d pages_written %d", - __entry->dev_major, __entry->dev_minor, + TP_printk("dev %d,%d ino %lu b_blocknr %llu b_size %u b_state 0x%04x " + "first_page %lu io_done %d pages_written %d sync_mode %d", + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->b_blocknr, __entry->b_size, __entry->b_state, __entry->first_page, - __entry->io_done, __entry->pages_written) + __entry->io_done, __entry->pages_written, + __entry->sync_mode + ) ); TRACE_EVENT(ext4_da_writepages_result, @@ -419,35 +398,100 @@ TRACE_EVENT(ext4_da_writepages_result, TP_ARGS(inode, wbc, ret, pages_written), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( int, ret ) __field( int, pages_written ) __field( long, pages_skipped ) + __field( int, sync_mode ) __field( char, more_io ) __field( pgoff_t, writeback_index ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->ret = ret; __entry->pages_written = pages_written; __entry->pages_skipped = wbc->pages_skipped; + __entry->sync_mode = wbc->sync_mode; __entry->more_io = wbc->more_io; __entry->writeback_index = inode->i_mapping->writeback_index; ), - TP_printk("dev %d,%d ino %lu ret %d pages_written %d pages_skipped %ld more_io %d writeback_index %lu", - __entry->dev_major, __entry->dev_minor, + TP_printk("dev %d,%d ino %lu 
ret %d pages_written %d pages_skipped %ld " + " more_io %d sync_mode %d writeback_index %lu", + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->ret, __entry->pages_written, __entry->pages_skipped, - __entry->more_io, + __entry->more_io, __entry->sync_mode, (unsigned long) __entry->writeback_index) ); +DECLARE_EVENT_CLASS(ext4__page_op, + TP_PROTO(struct page *page), + + TP_ARGS(page), + + TP_STRUCT__entry( + __field( pgoff_t, index ) + __field( ino_t, ino ) + __field( dev_t, dev ) + + ), + + TP_fast_assign( + __entry->index = page->index; + __entry->ino = page->mapping->host->i_ino; + __entry->dev = page->mapping->host->i_sb->s_dev; + ), + + TP_printk("dev %d,%d ino %lu page_index %lu", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->index) +); + +DEFINE_EVENT(ext4__page_op, ext4_readpage, + + TP_PROTO(struct page *page), + + TP_ARGS(page) +); + +DEFINE_EVENT(ext4__page_op, ext4_releasepage, + + TP_PROTO(struct page *page), + + TP_ARGS(page) +); + +TRACE_EVENT(ext4_invalidatepage, + TP_PROTO(struct page *page, unsigned long offset), + + TP_ARGS(page, offset), + + TP_STRUCT__entry( + __field( pgoff_t, index ) + __field( unsigned long, offset ) + __field( ino_t, ino ) + __field( dev_t, dev ) + + ), + + TP_fast_assign( + __entry->index = page->index; + __entry->offset = offset; + __entry->ino = page->mapping->host->i_ino; + __entry->dev = page->mapping->host->i_sb->s_dev; + ), + + TP_printk("dev %d,%d ino %lu page_index %lu offset %lu", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->index, __entry->offset) +); + TRACE_EVENT(ext4_discard_blocks, TP_PROTO(struct super_block *sb, unsigned long long blk, unsigned long long count), @@ -455,22 +499,20 @@ TRACE_EVENT(ext4_discard_blocks, TP_ARGS(sb, blk, count), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( __u64, blk ) __field( __u64, count ) ), 
TP_fast_assign( - __entry->dev_major = MAJOR(sb->s_dev); - __entry->dev_minor = MINOR(sb->s_dev); + __entry->dev = sb->s_dev; __entry->blk = blk; __entry->count = count; ), TP_printk("dev %d,%d blk %llu count %llu", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->blk, __entry->count) ); @@ -481,8 +523,7 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa, TP_ARGS(ac, pa), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( __u64, pa_pstart ) __field( __u32, pa_len ) @@ -491,8 +532,7 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa, ), TP_fast_assign( - __entry->dev_major = MAJOR(ac->ac_sb->s_dev); - __entry->dev_minor = MINOR(ac->ac_sb->s_dev); + __entry->dev = ac->ac_sb->s_dev; __entry->ino = ac->ac_inode->i_ino; __entry->pa_pstart = pa->pa_pstart; __entry->pa_len = pa->pa_len; @@ -500,9 +540,9 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa, ), TP_printk("dev %d,%d ino %lu pstart %llu len %u lstart %llu", - __entry->dev_major, __entry->dev_minor, - (unsigned long) __entry->ino, __entry->pa_pstart, - __entry->pa_len, __entry->pa_lstart) + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->pa_pstart, __entry->pa_len, __entry->pa_lstart) ); DEFINE_EVENT(ext4__mb_new_pa, ext4_mb_new_inode_pa, @@ -530,8 +570,7 @@ TRACE_EVENT(ext4_mb_release_inode_pa, TP_ARGS(sb, inode, pa, block, count), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( __u64, block ) __field( __u32, count ) @@ -539,16 +578,16 @@ TRACE_EVENT(ext4_mb_release_inode_pa, ), TP_fast_assign( - __entry->dev_major = MAJOR(sb->s_dev); - __entry->dev_minor = MINOR(sb->s_dev); + __entry->dev = sb->s_dev; __entry->ino = inode->i_ino; __entry->block = block; __entry->count = count; ), TP_printk("dev %d,%d ino %lu block %llu count %u", - __entry->dev_major, __entry->dev_minor, - (unsigned long) __entry->ino, 
__entry->block, __entry->count) + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->block, __entry->count) ); TRACE_EVENT(ext4_mb_release_group_pa, @@ -558,22 +597,20 @@ TRACE_EVENT(ext4_mb_release_group_pa, TP_ARGS(sb, pa), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( __u64, pa_pstart ) __field( __u32, pa_len ) ), TP_fast_assign( - __entry->dev_major = MAJOR(sb->s_dev); - __entry->dev_minor = MINOR(sb->s_dev); + __entry->dev = sb->s_dev; __entry->pa_pstart = pa->pa_pstart; __entry->pa_len = pa->pa_len; ), TP_printk("dev %d,%d pstart %llu len %u", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->pa_pstart, __entry->pa_len) ); @@ -583,20 +620,18 @@ TRACE_EVENT(ext4_discard_preallocations, TP_ARGS(inode), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; ), TP_printk("dev %d,%d ino %lu", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino) ); @@ -606,20 +641,19 @@ TRACE_EVENT(ext4_mb_discard_preallocations, TP_ARGS(sb, needed), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( int, needed ) ), TP_fast_assign( - __entry->dev_major = MAJOR(sb->s_dev); - __entry->dev_minor = MINOR(sb->s_dev); + __entry->dev = sb->s_dev; __entry->needed = needed; ), TP_printk("dev %d,%d needed %d", - __entry->dev_major, __entry->dev_minor, __entry->needed) + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->needed) ); TRACE_EVENT(ext4_request_blocks, @@ -628,8 +662,7 @@ TRACE_EVENT(ext4_request_blocks, TP_ARGS(ar), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, 
dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( unsigned int, flags ) __field( unsigned int, len ) @@ -642,8 +675,7 @@ TRACE_EVENT(ext4_request_blocks, ), TP_fast_assign( - __entry->dev_major = MAJOR(ar->inode->i_sb->s_dev); - __entry->dev_minor = MINOR(ar->inode->i_sb->s_dev); + __entry->dev = ar->inode->i_sb->s_dev; __entry->ino = ar->inode->i_ino; __entry->flags = ar->flags; __entry->len = ar->len; @@ -655,8 +687,9 @@ TRACE_EVENT(ext4_request_blocks, __entry->pright = ar->pright; ), - TP_printk("dev %d,%d ino %lu flags %u len %u lblk %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu ", - __entry->dev_major, __entry->dev_minor, + TP_printk("dev %d,%d ino %lu flags %u len %u lblk %llu goal %llu " + "lleft %llu lright %llu pleft %llu pright %llu ", + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->flags, __entry->len, (unsigned long long) __entry->logical, @@ -673,8 +706,7 @@ TRACE_EVENT(ext4_allocate_blocks, TP_ARGS(ar, block), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( __u64, block ) __field( unsigned int, flags ) @@ -688,8 +720,7 @@ TRACE_EVENT(ext4_allocate_blocks, ), TP_fast_assign( - __entry->dev_major = MAJOR(ar->inode->i_sb->s_dev); - __entry->dev_minor = MINOR(ar->inode->i_sb->s_dev); + __entry->dev = ar->inode->i_sb->s_dev; __entry->ino = ar->inode->i_ino; __entry->block = block; __entry->flags = ar->flags; @@ -702,10 +733,11 @@ TRACE_EVENT(ext4_allocate_blocks, __entry->pright = ar->pright; ), - TP_printk("dev %d,%d ino %lu flags %u len %u block %llu lblk %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu ", - __entry->dev_major, __entry->dev_minor, - (unsigned long) __entry->ino, __entry->flags, - __entry->len, __entry->block, + TP_printk("dev %d,%d ino %lu flags %u len %u block %llu lblk %llu " + "goal %llu lleft %llu lright %llu pleft %llu pright %llu", + MAJOR(__entry->dev), 
MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->flags, __entry->len, __entry->block, (unsigned long long) __entry->logical, (unsigned long long) __entry->goal, (unsigned long long) __entry->lleft, @@ -721,8 +753,7 @@ TRACE_EVENT(ext4_free_blocks, TP_ARGS(inode, block, count, flags), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( umode_t, mode ) __field( __u64, block ) @@ -731,8 +762,7 @@ TRACE_EVENT(ext4_free_blocks, ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->mode = inode->i_mode; __entry->block = block; @@ -741,20 +771,19 @@ TRACE_EVENT(ext4_free_blocks, ), TP_printk("dev %d,%d ino %lu mode 0%o block %llu count %lu flags %d", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->mode, __entry->block, __entry->count, __entry->flags) ); -TRACE_EVENT(ext4_sync_file, +TRACE_EVENT(ext4_sync_file_enter, TP_PROTO(struct file *file, int datasync), TP_ARGS(file, datasync), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( ino_t, parent ) __field( int, datasync ) @@ -763,39 +792,60 @@ TRACE_EVENT(ext4_sync_file, TP_fast_assign( struct dentry *dentry = file->f_path.dentry; - __entry->dev_major = MAJOR(dentry->d_inode->i_sb->s_dev); - __entry->dev_minor = MINOR(dentry->d_inode->i_sb->s_dev); + __entry->dev = dentry->d_inode->i_sb->s_dev; __entry->ino = dentry->d_inode->i_ino; __entry->datasync = datasync; __entry->parent = dentry->d_parent->d_inode->i_ino; ), TP_printk("dev %d,%d ino %ld parent %ld datasync %d ", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, (unsigned long) __entry->parent, __entry->datasync) ); 
+TRACE_EVENT(ext4_sync_file_exit, + TP_PROTO(struct inode *inode, int ret), + + TP_ARGS(inode, ret), + + TP_STRUCT__entry( + __field( int, ret ) + __field( ino_t, ino ) + __field( dev_t, dev ) + ), + + TP_fast_assign( + __entry->ret = ret; + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + ), + + TP_printk("dev %d,%d ino %ld ret %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->ret) +); + TRACE_EVENT(ext4_sync_fs, TP_PROTO(struct super_block *sb, int wait), TP_ARGS(sb, wait), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( int, wait ) ), TP_fast_assign( - __entry->dev_major = MAJOR(sb->s_dev); - __entry->dev_minor = MINOR(sb->s_dev); + __entry->dev = sb->s_dev; __entry->wait = wait; ), - TP_printk("dev %d,%d wait %d", __entry->dev_major, - __entry->dev_minor, __entry->wait) + TP_printk("dev %d,%d wait %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->wait) ); TRACE_EVENT(ext4_alloc_da_blocks, @@ -804,23 +854,21 @@ TRACE_EVENT(ext4_alloc_da_blocks, TP_ARGS(inode), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( unsigned int, data_blocks ) __field( unsigned int, meta_blocks ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->data_blocks = EXT4_I(inode)->i_reserved_data_blocks; __entry->meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks; ), TP_printk("dev %d,%d ino %lu data_blocks %u meta_blocks %u", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->data_blocks, __entry->meta_blocks) ); @@ -831,8 +879,7 @@ TRACE_EVENT(ext4_mballoc_alloc, TP_ARGS(ac), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev 
) __field( ino_t, ino ) __field( __u16, found ) __field( __u16, groups ) @@ -855,8 +902,7 @@ TRACE_EVENT(ext4_mballoc_alloc, ), TP_fast_assign( - __entry->dev_major = MAJOR(ac->ac_inode->i_sb->s_dev); - __entry->dev_minor = MINOR(ac->ac_inode->i_sb->s_dev); + __entry->dev = ac->ac_inode->i_sb->s_dev; __entry->ino = ac->ac_inode->i_ino; __entry->found = ac->ac_found; __entry->flags = ac->ac_flags; @@ -881,7 +927,7 @@ TRACE_EVENT(ext4_mballoc_alloc, TP_printk("dev %d,%d inode %lu orig %u/%d/%u@%u goal %u/%d/%u@%u " "result %u/%d/%u@%u blks %u grps %u cr %u flags 0x%04x " "tail %u broken %u", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->orig_group, __entry->orig_start, __entry->orig_len, __entry->orig_logical, @@ -900,8 +946,7 @@ TRACE_EVENT(ext4_mballoc_prealloc, TP_ARGS(ac), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( __u32, orig_logical ) __field( int, orig_start ) @@ -914,8 +959,7 @@ TRACE_EVENT(ext4_mballoc_prealloc, ), TP_fast_assign( - __entry->dev_major = MAJOR(ac->ac_inode->i_sb->s_dev); - __entry->dev_minor = MINOR(ac->ac_inode->i_sb->s_dev); + __entry->dev = ac->ac_inode->i_sb->s_dev; __entry->ino = ac->ac_inode->i_ino; __entry->orig_logical = ac->ac_o_ex.fe_logical; __entry->orig_start = ac->ac_o_ex.fe_start; @@ -928,7 +972,7 @@ TRACE_EVENT(ext4_mballoc_prealloc, ), TP_printk("dev %d,%d inode %lu orig %u/%d/%u@%u result %u/%d/%u@%u", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->orig_group, __entry->orig_start, __entry->orig_len, __entry->orig_logical, @@ -946,8 +990,7 @@ DECLARE_EVENT_CLASS(ext4__mballoc, TP_ARGS(sb, inode, group, start, len), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( int, result_start ) __field( __u32, 
result_group ) @@ -955,8 +998,7 @@ DECLARE_EVENT_CLASS(ext4__mballoc, ), TP_fast_assign( - __entry->dev_major = MAJOR(sb->s_dev); - __entry->dev_minor = MINOR(sb->s_dev); + __entry->dev = sb->s_dev; __entry->ino = inode ? inode->i_ino : 0; __entry->result_start = start; __entry->result_group = group; @@ -964,7 +1006,7 @@ DECLARE_EVENT_CLASS(ext4__mballoc, ), TP_printk("dev %d,%d inode %lu extent %u/%d/%u ", - __entry->dev_major, __entry->dev_minor, + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->result_group, __entry->result_start, __entry->result_len) @@ -998,8 +1040,7 @@ TRACE_EVENT(ext4_forget, TP_ARGS(inode, is_metadata, block), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( umode_t, mode ) __field( int, is_metadata ) @@ -1007,8 +1048,7 @@ TRACE_EVENT(ext4_forget, ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->mode = inode->i_mode; __entry->is_metadata = is_metadata; @@ -1016,9 +1056,9 @@ TRACE_EVENT(ext4_forget, ), TP_printk("dev %d,%d ino %lu mode 0%o is_metadata %d block %llu", - __entry->dev_major, __entry->dev_minor, - (unsigned long) __entry->ino, __entry->mode, - __entry->is_metadata, __entry->block) + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->mode, __entry->is_metadata, __entry->block) ); TRACE_EVENT(ext4_da_update_reserve_space, @@ -1027,8 +1067,7 @@ TRACE_EVENT(ext4_da_update_reserve_space, TP_ARGS(inode, used_blocks), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( umode_t, mode ) __field( __u64, i_blocks ) @@ -1039,8 +1078,7 @@ TRACE_EVENT(ext4_da_update_reserve_space, ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = 
MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->mode = inode->i_mode; __entry->i_blocks = inode->i_blocks; @@ -1050,10 +1088,12 @@ TRACE_EVENT(ext4_da_update_reserve_space, __entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks; ), - TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d reserved_data_blocks %d reserved_meta_blocks %d allocated_meta_blocks %d", - __entry->dev_major, __entry->dev_minor, - (unsigned long) __entry->ino, __entry->mode, - (unsigned long long) __entry->i_blocks, + TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d " + "reserved_data_blocks %d reserved_meta_blocks %d " + "allocated_meta_blocks %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->mode, (unsigned long long) __entry->i_blocks, __entry->used_blocks, __entry->reserved_data_blocks, __entry->reserved_meta_blocks, __entry->allocated_meta_blocks) ); @@ -1064,8 +1104,7 @@ TRACE_EVENT(ext4_da_reserve_space, TP_ARGS(inode, md_needed), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( umode_t, mode ) __field( __u64, i_blocks ) @@ -1075,8 +1114,7 @@ TRACE_EVENT(ext4_da_reserve_space, ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->mode = inode->i_mode; __entry->i_blocks = inode->i_blocks; @@ -1085,8 +1123,9 @@ TRACE_EVENT(ext4_da_reserve_space, __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks; ), - TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu md_needed %d reserved_data_blocks %d reserved_meta_blocks %d", - __entry->dev_major, __entry->dev_minor, + TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu md_needed %d " + "reserved_data_blocks %d reserved_meta_blocks %d", + MAJOR(__entry->dev), 
MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->mode, (unsigned long long) __entry->i_blocks, __entry->md_needed, __entry->reserved_data_blocks, @@ -1099,8 +1138,7 @@ TRACE_EVENT(ext4_da_release_space, TP_ARGS(inode, freed_blocks), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) __field( umode_t, mode ) __field( __u64, i_blocks ) @@ -1111,8 +1149,7 @@ TRACE_EVENT(ext4_da_release_space, ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; __entry->mode = inode->i_mode; __entry->i_blocks = inode->i_blocks; @@ -1122,8 +1159,10 @@ TRACE_EVENT(ext4_da_release_space, __entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks; ), - TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu freed_blocks %d reserved_data_blocks %d reserved_meta_blocks %d allocated_meta_blocks %d", - __entry->dev_major, __entry->dev_minor, + TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu freed_blocks %d " + "reserved_data_blocks %d reserved_meta_blocks %d " + "allocated_meta_blocks %d", + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, __entry->mode, (unsigned long long) __entry->i_blocks, __entry->freed_blocks, __entry->reserved_data_blocks, @@ -1136,20 +1175,19 @@ DECLARE_EVENT_CLASS(ext4__bitmap_load, TP_ARGS(sb, group), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( __u32, group ) ), TP_fast_assign( - __entry->dev_major = MAJOR(sb->s_dev); - __entry->dev_minor = MINOR(sb->s_dev); + __entry->dev = sb->s_dev; __entry->group = group; ), TP_printk("dev %d,%d group %u", - __entry->dev_major, __entry->dev_minor, __entry->group) + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->group) ); DEFINE_EVENT(ext4__bitmap_load, ext4_mb_bitmap_load, @@ -1166,6 +1204,349 @@ 
DEFINE_EVENT(ext4__bitmap_load, ext4_mb_buddy_bitmap_load, TP_ARGS(sb, group) ); +DEFINE_EVENT(ext4__bitmap_load, ext4_read_block_bitmap_load, + + TP_PROTO(struct super_block *sb, unsigned long group), + + TP_ARGS(sb, group) +); + +DEFINE_EVENT(ext4__bitmap_load, ext4_load_inode_bitmap, + + TP_PROTO(struct super_block *sb, unsigned long group), + + TP_ARGS(sb, group) +); + +TRACE_EVENT(ext4_direct_IO_enter, + TP_PROTO(struct inode *inode, loff_t offset, unsigned long len, int rw), + + TP_ARGS(inode, offset, len, rw), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + __field( loff_t, pos ) + __field( unsigned long, len ) + __field( int, rw ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + __entry->pos = offset; + __entry->len = len; + __entry->rw = rw; + ), + + TP_printk("dev %d,%d ino %lu pos %llu len %lu rw %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + (unsigned long long) __entry->pos, __entry->len, __entry->rw) +); + +TRACE_EVENT(ext4_direct_IO_exit, + TP_PROTO(struct inode *inode, loff_t offset, unsigned long len, int rw, int ret), + + TP_ARGS(inode, offset, len, rw, ret), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + __field( loff_t, pos ) + __field( unsigned long, len ) + __field( int, rw ) + __field( int, ret ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + __entry->pos = offset; + __entry->len = len; + __entry->rw = rw; + __entry->ret = ret; + ), + + TP_printk("dev %d,%d ino %lu pos %llu len %lu rw %d ret %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + (unsigned long long) __entry->pos, __entry->len, + __entry->rw, __entry->ret) +); + +TRACE_EVENT(ext4_fallocate_enter, + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), + + TP_ARGS(inode, offset, len, mode), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + 
__field( loff_t, pos ) + __field( loff_t, len ) + __field( int, mode ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + __entry->pos = offset; + __entry->len = len; + __entry->mode = mode; + ), + + TP_printk("dev %d,%d ino %ld pos %llu len %llu mode %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + (unsigned long long) __entry->pos, + (unsigned long long) __entry->len, __entry->mode) +); + +TRACE_EVENT(ext4_fallocate_exit, + TP_PROTO(struct inode *inode, loff_t offset, unsigned int max_blocks, int ret), + + TP_ARGS(inode, offset, max_blocks, ret), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + __field( loff_t, pos ) + __field( unsigned, blocks ) + __field( int, ret ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + __entry->pos = offset; + __entry->blocks = max_blocks; + __entry->ret = ret; + ), + + TP_printk("dev %d,%d ino %ld pos %llu blocks %d ret %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + (unsigned long long) __entry->pos, __entry->blocks, + __entry->ret) +); + +TRACE_EVENT(ext4_unlink_enter, + TP_PROTO(struct inode *parent, struct dentry *dentry), + + TP_ARGS(parent, dentry), + + TP_STRUCT__entry( + __field( ino_t, parent ) + __field( ino_t, ino ) + __field( loff_t, size ) + __field( dev_t, dev ) + ), + + TP_fast_assign( + __entry->parent = parent->i_ino; + __entry->ino = dentry->d_inode->i_ino; + __entry->size = dentry->d_inode->i_size; + __entry->dev = dentry->d_inode->i_sb->s_dev; + ), + + TP_printk("dev %d,%d ino %ld size %lld parent %ld", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, __entry->size, + (unsigned long) __entry->parent) +); + +TRACE_EVENT(ext4_unlink_exit, + TP_PROTO(struct dentry *dentry, int ret), + + TP_ARGS(dentry, ret), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + __field( int, ret ) + ), + + 
TP_fast_assign( + __entry->ino = dentry->d_inode->i_ino; + __entry->dev = dentry->d_inode->i_sb->s_dev; + __entry->ret = ret; + ), + + TP_printk("dev %d,%d ino %ld ret %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + __entry->ret) +); + +DECLARE_EVENT_CLASS(ext4__truncate, + TP_PROTO(struct inode *inode), + + TP_ARGS(inode), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + __field( blkcnt_t, blocks ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + __entry->blocks = inode->i_blocks; + ), + + TP_printk("dev %d,%d ino %lu blocks %lu", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, (unsigned long) __entry->blocks) +); + +DEFINE_EVENT(ext4__truncate, ext4_truncate_enter, + + TP_PROTO(struct inode *inode), + + TP_ARGS(inode) +); + +DEFINE_EVENT(ext4__truncate, ext4_truncate_exit, + + TP_PROTO(struct inode *inode), + + TP_ARGS(inode) +); + +DECLARE_EVENT_CLASS(ext4__map_blocks_enter, + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, + unsigned len, unsigned flags), + + TP_ARGS(inode, lblk, len, flags), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + __field( ext4_lblk_t, lblk ) + __field( unsigned, len ) + __field( unsigned, flags ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + __entry->lblk = lblk; + __entry->len = len; + __entry->flags = flags; + ), + + TP_printk("dev %d,%d ino %lu lblk %u len %u flags %u", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + (unsigned) __entry->lblk, __entry->len, __entry->flags) +); + +DEFINE_EVENT(ext4__map_blocks_enter, ext4_ext_map_blocks_enter, + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, + unsigned len, unsigned flags), + + TP_ARGS(inode, lblk, len, flags) +); + +DEFINE_EVENT(ext4__map_blocks_enter, ext4_ind_map_blocks_enter, + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, + unsigned len, 
unsigned flags), + + TP_ARGS(inode, lblk, len, flags) +); + +DECLARE_EVENT_CLASS(ext4__map_blocks_exit, + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, + ext4_fsblk_t pblk, unsigned len, int ret), + + TP_ARGS(inode, lblk, pblk, len, ret), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + __field( ext4_lblk_t, lblk ) + __field( ext4_fsblk_t, pblk ) + __field( unsigned, len ) + __field( int, ret ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + __entry->lblk = lblk; + __entry->pblk = pblk; + __entry->len = len; + __entry->ret = ret; + ), + + TP_printk("dev %d,%d ino %lu lblk %u pblk %llu len %u ret %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + (unsigned) __entry->lblk, (unsigned long long) __entry->pblk, + __entry->len, __entry->ret) +); + +DEFINE_EVENT(ext4__map_blocks_exit, ext4_ext_map_blocks_exit, + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, + ext4_fsblk_t pblk, unsigned len, int ret), + + TP_ARGS(inode, lblk, pblk, len, ret) +); + +DEFINE_EVENT(ext4__map_blocks_exit, ext4_ind_map_blocks_exit, + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, + ext4_fsblk_t pblk, unsigned len, int ret), + + TP_ARGS(inode, lblk, pblk, len, ret) +); + +TRACE_EVENT(ext4_ext_load_extent, + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_fsblk_t pblk), + + TP_ARGS(inode, lblk, pblk), + + TP_STRUCT__entry( + __field( ino_t, ino ) + __field( dev_t, dev ) + __field( ext4_lblk_t, lblk ) + __field( ext4_fsblk_t, pblk ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + __entry->lblk = lblk; + __entry->pblk = pblk; + ), + + TP_printk("dev %d,%d ino %lu lblk %u pblk %llu", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + (unsigned) __entry->lblk, (unsigned long long) __entry->pblk) +); + +TRACE_EVENT(ext4_load_inode, + TP_PROTO(struct inode *inode), + + TP_ARGS(inode), + + TP_STRUCT__entry( + __field( 
ino_t, ino ) + __field( dev_t, dev ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->dev = inode->i_sb->s_dev; + ), + + TP_printk("dev %d,%d ino %ld", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino) +); + #endif /* _TRACE_EXT4_H */ /* This part must be outside protection */ diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h index 7447ea9305b..bf16545cc97 100644 --- a/include/trace/events/jbd2.h +++ b/include/trace/events/jbd2.h @@ -17,19 +17,17 @@ TRACE_EVENT(jbd2_checkpoint, TP_ARGS(journal, result), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( int, result ) ), TP_fast_assign( - __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev); - __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev); + __entry->dev = journal->j_fs_dev->bd_dev; __entry->result = result; ), - TP_printk("dev %d,%d result %d", - __entry->dev_major, __entry->dev_minor, __entry->result) + TP_printk("dev %s result %d", + jbd2_dev_to_name(__entry->dev), __entry->result) ); DECLARE_EVENT_CLASS(jbd2_commit, @@ -39,22 +37,20 @@ DECLARE_EVENT_CLASS(jbd2_commit, TP_ARGS(journal, commit_transaction), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( char, sync_commit ) __field( int, transaction ) ), TP_fast_assign( - __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev); - __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev); + __entry->dev = journal->j_fs_dev->bd_dev; __entry->sync_commit = commit_transaction->t_synchronous_commit; __entry->transaction = commit_transaction->t_tid; ), - TP_printk("dev %d,%d transaction %d sync %d", - __entry->dev_major, __entry->dev_minor, - __entry->transaction, __entry->sync_commit) + TP_printk("dev %s transaction %d sync %d", + jbd2_dev_to_name(__entry->dev), __entry->transaction, + __entry->sync_commit) ); DEFINE_EVENT(jbd2_commit, jbd2_start_commit, @@ -91,24 +87,22 @@ 
TRACE_EVENT(jbd2_end_commit, TP_ARGS(journal, commit_transaction), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( char, sync_commit ) __field( int, transaction ) __field( int, head ) ), TP_fast_assign( - __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev); - __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev); + __entry->dev = journal->j_fs_dev->bd_dev; __entry->sync_commit = commit_transaction->t_synchronous_commit; __entry->transaction = commit_transaction->t_tid; __entry->head = journal->j_tail_sequence; ), - TP_printk("dev %d,%d transaction %d sync %d head %d", - __entry->dev_major, __entry->dev_minor, - __entry->transaction, __entry->sync_commit, __entry->head) + TP_printk("dev %s transaction %d sync %d head %d", + jbd2_dev_to_name(__entry->dev), __entry->transaction, + __entry->sync_commit, __entry->head) ); TRACE_EVENT(jbd2_submit_inode_data, @@ -117,20 +111,17 @@ TRACE_EVENT(jbd2_submit_inode_data, TP_ARGS(inode), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( ino_t, ino ) ), TP_fast_assign( - __entry->dev_major = MAJOR(inode->i_sb->s_dev); - __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; ), - TP_printk("dev %d,%d ino %lu", - __entry->dev_major, __entry->dev_minor, - (unsigned long) __entry->ino) + TP_printk("dev %s ino %lu", + jbd2_dev_to_name(__entry->dev), (unsigned long) __entry->ino) ); TRACE_EVENT(jbd2_run_stats, @@ -140,8 +131,7 @@ TRACE_EVENT(jbd2_run_stats, TP_ARGS(dev, tid, stats), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( unsigned long, tid ) __field( unsigned long, wait ) __field( unsigned long, running ) @@ -154,8 +144,7 @@ TRACE_EVENT(jbd2_run_stats, ), TP_fast_assign( - __entry->dev_major = MAJOR(dev); - __entry->dev_minor = MINOR(dev); + __entry->dev = dev; __entry->tid = tid; __entry->wait 
= stats->rs_wait; __entry->running = stats->rs_running; @@ -167,9 +156,9 @@ TRACE_EVENT(jbd2_run_stats, __entry->blocks_logged = stats->rs_blocks_logged; ), - TP_printk("dev %d,%d tid %lu wait %u running %u locked %u flushing %u " + TP_printk("dev %s tid %lu wait %u running %u locked %u flushing %u " "logging %u handle_count %u blocks %u blocks_logged %u", - __entry->dev_major, __entry->dev_minor, __entry->tid, + jbd2_dev_to_name(__entry->dev), __entry->tid, jiffies_to_msecs(__entry->wait), jiffies_to_msecs(__entry->running), jiffies_to_msecs(__entry->locked), @@ -186,8 +175,7 @@ TRACE_EVENT(jbd2_checkpoint_stats, TP_ARGS(dev, tid, stats), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( unsigned long, tid ) __field( unsigned long, chp_time ) __field( __u32, forced_to_close ) @@ -196,8 +184,7 @@ TRACE_EVENT(jbd2_checkpoint_stats, ), TP_fast_assign( - __entry->dev_major = MAJOR(dev); - __entry->dev_minor = MINOR(dev); + __entry->dev = dev; __entry->tid = tid; __entry->chp_time = stats->cs_chp_time; __entry->forced_to_close= stats->cs_forced_to_close; @@ -205,9 +192,9 @@ TRACE_EVENT(jbd2_checkpoint_stats, __entry->dropped = stats->cs_dropped; ), - TP_printk("dev %d,%d tid %lu chp_time %u forced_to_close %u " + TP_printk("dev %s tid %lu chp_time %u forced_to_close %u " "written %u dropped %u", - __entry->dev_major, __entry->dev_minor, __entry->tid, + jbd2_dev_to_name(__entry->dev), __entry->tid, jiffies_to_msecs(__entry->chp_time), __entry->forced_to_close, __entry->written, __entry->dropped) ); @@ -220,8 +207,7 @@ TRACE_EVENT(jbd2_cleanup_journal_tail, TP_ARGS(journal, first_tid, block_nr, freed), TP_STRUCT__entry( - __field( int, dev_major ) - __field( int, dev_minor ) + __field( dev_t, dev ) __field( tid_t, tail_sequence ) __field( tid_t, first_tid ) __field(unsigned long, block_nr ) @@ -229,18 +215,16 @@ TRACE_EVENT(jbd2_cleanup_journal_tail, ), TP_fast_assign( - __entry->dev_major = 
MAJOR(journal->j_fs_dev->bd_dev); - __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev); + __entry->dev = journal->j_fs_dev->bd_dev; __entry->tail_sequence = journal->j_tail_sequence; __entry->first_tid = first_tid; __entry->block_nr = block_nr; __entry->freed = freed; ), - TP_printk("dev %d,%d from %u to %u offset %lu freed %lu", - __entry->dev_major, __entry->dev_minor, - __entry->tail_sequence, __entry->first_tid, - __entry->block_nr, __entry->freed) + TP_printk("dev %s from %u to %u offset %lu freed %lu", + jbd2_dev_to_name(__entry->dev), __entry->tail_sequence, + __entry->first_tid, __entry->block_nr, __entry->freed) ); #endif /* _TRACE_JBD2_H */ -- cgit v1.2.3 From 6de9843dab3f2a1d4d66d80aa9e5782f80977d20 Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Wed, 23 Mar 2011 14:05:03 -0400 Subject: ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep() The map_bh() call will have already set the buffer_head to mapped. Signed-off-by: Feng Tang Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f44307a2113..dec10e2115e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2502,7 +2502,6 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock, * for partial write. */ set_buffer_new(bh); - set_buffer_mapped(bh); } return 0; } -- cgit v1.2.3 From 65922cb5ced76ba7182e955d4aada96f93446b1a Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 23 Mar 2011 14:08:27 -0400 Subject: ext4: unused variables cleanup in fs/ext4/extents.c ext4 extents cleanup: . remove unused `*ex' from check_eofblocks_fl . 
remove unused `*eh' from ext4_ext_map_blocks

Signed-off-by: Sergey Senozhatsky
Signed-off-by: "Theodore Ts'o"
---
 fs/ext4/extents.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f46f6e3c02d..1763d1ab9ea 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3112,14 +3112,13 @@ static int check_eofblocks_fl(handle_t *handle, struct inode *inode,
 {
 	int i, depth;
 	struct ext4_extent_header *eh;
-	struct ext4_extent *ex, *last_ex;
+	struct ext4_extent *last_ex;
 
 	if (!ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))
 		return 0;
 
 	depth = ext_depth(inode);
 	eh = path[depth].p_hdr;
-	ex = path[depth].p_ext;
 
 	if (unlikely(!eh->eh_entries)) {
 		EXT4_ERROR_INODE(inode, "eh->eh_entries == 0 and "
@@ -3299,7 +3298,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			struct ext4_map_blocks *map, int flags)
 {
 	struct ext4_ext_path *path = NULL;
-	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t newblock = 0;
 	int err = 0, depth, ret;
@@ -3357,7 +3355,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		err = -EIO;
 		goto out2;
 	}
-	eh = path[depth].p_hdr;
 	ex = path[depth].p_ext;
 
 	if (ex) {
-- 
cgit v1.2.3


From 0ba0851714beebb800992e5105a79dc3a4c504b0 Mon Sep 17 00:00:00 2001
From: Tao Ma
Date: Wed, 23 Mar 2011 15:48:11 -0400
Subject: ext4: fix a BUG in mb_mark_used during trim.

In a bs=4096 volume, if we call FITRIM with the following parameter as
fstrim_range(start = 102400, len = 134144000, minlen = 10240), we will
trigger this BUG_ON:
	BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));

Mar 4 00:55:52 boyu-tm kernel: ------------[ cut here ]------------
Mar 4 00:55:52 boyu-tm kernel: kernel BUG at fs/ext4/mballoc.c:1506!
Mar 4 01:21:09 boyu-tm kernel: Code: d4 00 00 00 00 49 89 fe 8b 56 0c 44 8b 7e 04 89 55 c4 48 8b 4f 28 89 d6 44 01 fe 48 63 d6 48 8b 41 18 48 c1 e0 03 48 39 c2 76 04 <0f> 0b eb fe 48 8b 55 b0 8b 47 34 3b 42 08 74 04 0f 0b eb fe 48
Mar 4 01:21:09 boyu-tm kernel: RIP [] mb_mark_used+0x47/0x26c [ext4]
Mar 4 01:21:09 boyu-tm kernel: RSP
Mar 4 01:21:09 boyu-tm kernel: ---[ end trace 9f461696f6a9dcf2 ]---

Fix this bug by doing the accounting correctly.

Cc: Lukas Czerner
Signed-off-by: Tao Ma
Signed-off-by: "Theodore Ts'o"
---
 fs/ext4/mballoc.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index cdc84953f1d..a5837a837a8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4870,10 +4870,15 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 			break;
 		}
 
-		if (len >= EXT4_BLOCKS_PER_GROUP(sb))
-			len -= (EXT4_BLOCKS_PER_GROUP(sb) - first_block);
-		else
+		/*
+		 * For all the groups except the last one, last block will
+		 * always be EXT4_BLOCKS_PER_GROUP(sb), so we only need to
+		 * change it for the last group in which case start +
+		 * len < EXT4_BLOCKS_PER_GROUP(sb).
+		 */
+		if (first_block + len < EXT4_BLOCKS_PER_GROUP(sb))
 			last_block = first_block + len;
+		len -= last_block - first_block;
 
 		if (e4b.bd_info->bb_free >= minlen) {
 			cnt = ext4_trim_all_free(sb, &e4b, first_block,
-- 
cgit v1.2.3
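
Note: as a rough illustration of the accounting change in the mballoc.c patch above, the standalone C sketch below replays the per-group arithmetic for the request quoted in the commit message. At bs=4096, start = 102400 bytes is block offset 25 and len = 134144000 bytes is 32750 blocks; 32768 blocks per group is assumed here, matching s_blocksize << 3 in the BUG_ON. The names split_trim_fixed(), first_span_end_old() and struct trim_span are hypothetical and exist only for this example. The real loop in ext4_trim_fs() walks group numbers and calls ext4_trim_all_free(); this sketch models only the first_block/last_block/len bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one per-group trim request. */
struct trim_span {
	uint64_t first_block;	/* first block offset examined in the group */
	uint64_t last_block;	/* end offset passed on for the group */
};

/*
 * Old accounting (the lines removed by the patch), reduced to the end
 * offset it would compute for the first group only.
 */
uint64_t first_span_end_old(uint64_t first_block, uint64_t len,
			    uint64_t blocks_per_group)
{
	/* last_block was only lowered when len < blocks_per_group ... */
	if (len >= blocks_per_group)
		return blocks_per_group;
	/* ... but first_block + len can still exceed the group size. */
	return first_block + len;
}

/*
 * New accounting (the lines added by the patch). Splits a request of
 * len blocks, starting at offset first_block in the first group, into
 * per-group spans. Returns the number of spans written to out.
 * Simplification: loops until len is used up instead of iterating
 * over first_group..last_group as the kernel does.
 */
int split_trim_fixed(uint64_t first_block, uint64_t len,
		     uint64_t blocks_per_group,
		     struct trim_span *out, int max_spans)
{
	uint64_t last_block = blocks_per_group;
	int n = 0;

	assert(blocks_per_group > 0 && first_block < blocks_per_group);

	while (len > 0 && n < max_spans) {
		/* Only the final group ends before the group boundary. */
		if (first_block + len < blocks_per_group)
			last_block = first_block + len;
		len -= last_block - first_block;

		out[n].first_block = first_block;
		out[n].last_block = last_block;
		n++;

		first_block = 0;	/* later groups start at offset 0 */
	}
	return n;
}
```

With these inputs the old accounting yields an end offset of 32775 for the first group, which lies past the 32768-block group and is plausibly how the BUG_ON in mb_mark_used() was reached. The fixed accounting yields the spans 25..32768 and 0..7, which together cover exactly 32750 blocks.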