AgeCommit messageAuthorFilesLines
2013-02-14Revert "xen/PVonHVM: fix compile warning in init_hvm_pv_info"Konrad Rzeszutek Wilk1-1/+1
This reverts commit a7be94ac8d69c037d08f0fd94b45a593f1d45176. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-13xen: remove redundant NULL check before unregister_and_remove_pcpu().Cyril Roelandt1-2/+1
unregister_and_remove_pcpu on a NULL pointer is a no-op, so the NULL check in sync_pcpu can be removed. Signed-off-by: Cyril Roelandt <tipecaml@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-13x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.Jan Beulich1-7/+7
This fixes CVE-2013-0228 / XSA-42

While working on CVE-2013-0190, Drew Jones found that an unprivileged guest user in a 32-bit PV guest can crash the guest with a panic like this:

-------------
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/vbd-51712/block/xvda/dev
Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 1250, comm: r Not tainted 2.6.32-356.el6.i686 #1
EIP: 0061:[<c0407462>] EFLAGS: 00010086 CPU: 0
EIP is at xen_iret+0x12/0x2b
EAX: eb8d0000 EBX: 00000001 ECX: 08049860 EDX: 00000010
ESI: 00000000 EDI: 003d0f00 EBP: b77f8388 ESP: eb8d1fe0
 DS: 0000 ES: 007b FS: 0000 GS: 00e0 SS: 0069
Process r (pid: 1250, ti=eb8d0000 task=c2953550 task.ti=eb8d0000)
Stack:
 00000000 0027f416 00000073 00000206 b77f8364 0000007b 00000000 00000000
Call Trace:
Code: c3 8b 44 24 18 81 4c 24 38 00 02 00 00 8d 64 24 30 e9 03 00 00 00 8d 76 00 f7 44 24 08 00 00 02 80 75 33 50 b8 00 e0 ff ff 21 e0 <8b> 40 10 8b 04 85 a0 f6 ab c0 8b 80 0c b0 b3 c0 f6 44 24 0d 02
EIP: [<c0407462>] xen_iret+0x12/0x2b SS:ESP 0069:eb8d1fe0
general protection fault: 0000 [#2]
---[ end trace ab0d29a492dcd330 ]---
Kernel panic - not syncing: Fatal exception
Pid: 1250, comm: r Tainted: G D --------------- 2.6.32-356.el6.i686 #1
Call Trace:
 [<c08476df>] ? panic+0x6e/0x122
 [<c084b63c>] ? oops_end+0xbc/0xd0
 [<c084b260>] ? do_general_protection+0x0/0x210
 [<c084a9b7>] ? error_code+0x73/
-------------

Petr says: "I've analysed the bug and I think that xen_iret() cannot cope with mangled DS, in this case zeroed out (null selector/descriptor) by either xen_failsafe_callback() or RESTORE_REGS because the corresponding LDT entry was invalidated by the reproducer."

Jan took a look at the preliminary patch and came up with a fix that solves this problem:

"This code gets called after all registers other than those handled by IRET got already restored, hence a null selector in %ds or a non-null one that got loaded from a code or read-only data descriptor would cause a kernel mode fault (with the potential of crashing the kernel as a whole, if panic_on_oops is set)."

The way to fix this is to realize that we can only rely on the registers that IRET restores. The two that are guaranteed are %cs and %ss, as they are always fixed GDT selectors. They are also inaccessible from user mode, so they cannot be altered. This is the approach taken in this patch.

An alternative suggested by Jan would be to rely on the subtle realization that %ebp- or %esp-relative references use the %ss segment. In that case we could switch from using %eax to %ebp and would not need the %ss overrides. That would also require one extra instruction to compensate for the one place where the register is used as a scaled index. However, Andrew pointed out that this is too subtle, and if further work were to be done in this code path it could escape folks' attention and lead to accidents.

Reviewed-by: Petr Matousek <pmatouse@redhat.com>
Reported-by: Petr Matousek <pmatouse@redhat.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-06xen: fix error handling path if xen_allocate_irq_dynamic failsWei Liu1-2/+2
It is possible that the call to xen_allocate_irq_dynamic() returns negative number other than -1. Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
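The difference between testing for -1 and testing for any negative value can be sketched in a small user-space model. This is only an illustration: `allocate_irq_dynamic`, `bind_buggy`, `bind_fixed` and the IRQ number 42 are hypothetical stand-ins, not the kernel functions themselves.

```python
import errno


def allocate_irq_dynamic(fail_with=None):
    """Hypothetical stand-in for an allocator that returns an IRQ number on
    success or a negative errno-style value on failure."""
    if fail_with is not None:
        return -fail_with
    return 42  # arbitrary IRQ number, for illustration only


def bind_buggy(fail_with=None):
    irq = allocate_irq_dynamic(fail_with)
    if irq == -1:        # only one specific failure value is caught
        return None
    return irq           # other negative values leak through as an "IRQ"


def bind_fixed(fail_with=None):
    irq = allocate_irq_dynamic(fail_with)
    if irq < 0:          # every negative return is treated as an error
        return None
    return irq
```

With a failure such as `errno.ENOMEM`, the first variant hands a negative number back to its caller as though it were a valid IRQ, while the second takes the error path.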
2013-02-06xen-pciback: rate limit error messages from xen_pcibk_enable_msi{,x}()Jan Beulich1-7/+7
... as being guest triggerable (e.g. by invoking XEN_PCI_OP_enable_msi{,x} on a device not being MSI/MSI-X capable). This is CVE-2013-0231 / XSA-43. Also make the two messages uniform in both their wording and severity. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-01-16xen: Fix stack corruption in xen_failsafe_callback for 32bit PVOPS guests.Andrew Cooper1-1/+0
This fixes CVE-2013-0190 / XSA-40

There has been an error on the xen_failsafe_callback path for failed iret, which causes the stack pointer to be wrong when entering the iret_exc error path. This can result in the kernel crashing.

In the classic kernel case, the relevant code looked a little like:

        popl %eax      # Error code from hypervisor
        jz 5f
        addl $16,%esp
        jmp iret_exc   # Hypervisor said iret fault
5:      addl $16,%esp
                       # Hypervisor said segment selector fault

Here, there are two identical addls on either side of a branch. This appears to have been optimised by hoisting the addl above the jz and converting it to an lea, which leaves the flags register unaffected.

In the PVOPS case, the code looks like:

        popl_cfi %eax         # Error from the hypervisor
        lea 16(%esp),%esp     # Add $16 before choosing fault path
        CFI_ADJUST_CFA_OFFSET -16
        jz 5f
        addl $16,%esp         # Incorrectly adjust %esp again
        jmp iret_exc

It is possible for unprivileged userspace applications to cause this behaviour, for example by loading an LDT code selector, then changing the code selector to be not-present. At this point, there is a race condition where it is possible for the hypervisor to return to userspace from an interrupt, fault on its own iret, and inject a failsafe_callback into the kernel.

This bug has been present since the introduction of Xen PVOPS support in commit 5ead97c84 (xen: Core Xen implementation), in 2.6.23.

Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
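The stack-pointer arithmetic of the two code paths described above can be traced with a small model. It is a sketch under assumptions: the function names are hypothetical, the 4-byte pop reflects a 32-bit `popl`, and `pvops_fixed_path` assumes the fix simply drops the second `addl` (consistent with the one-line removal in the diffstat).

```python
FRAME_BYTES = 16  # bytes discarded by the "addl $16" / "lea 16(%esp)" step


def classic_path(esp, iret_fault):
    esp += 4                  # popl %eax
    if iret_fault:
        esp += FRAME_BYTES    # addl $16,%esp ; jmp iret_exc
    else:
        esp += FRAME_BYTES    # 5: addl $16,%esp
    return esp


def pvops_buggy_path(esp, iret_fault):
    esp += 4                  # popl_cfi %eax
    esp += FRAME_BYTES        # lea 16(%esp),%esp (hoisted above the branch)
    if iret_fault:
        esp += FRAME_BYTES    # stray addl $16,%esp adjusts a second time
    return esp


def pvops_fixed_path(esp, iret_fault):
    esp += 4                  # popl_cfi %eax
    esp += FRAME_BYTES        # lea 16(%esp),%esp
    return esp                # both branches continue with the same %esp
```

In this model the buggy path leaves `%esp` 16 bytes too high whenever the iret-fault branch is taken, which is the corrupted state `iret_exc` then runs with.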
2013-01-15Revert "xen/smp: Fix CPU online/offline bug triggering a BUG: scheduling while atomic."Konrad Rzeszutek Wilk1-7/+0
This reverts commit 41bd956de3dfdc3a43708fe2e0c8096c69064a1e. The fix is incorrect and not appropriate for the latest kernels. In fact it _causes_ the BUG: scheduling while atomic while doing vCPU hotplug. Suggested-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-01-15xen/gntdev: remove erronous use of copy_to_userDaniel De Graaf1-10/+3
Since there is now a mapping of granted pages in kernel address space in both PV and HVM, use it for UNMAP_NOTIFY_CLEAR_BYTE instead of accessing memory via copy_to_user and triggering sleep-in-atomic warnings. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-01-15xen/gntdev: correctly unmap unlinked maps in mmu notifierDaniel De Graaf1-29/+63
If gntdev_ioctl_unmap_grant_ref is called on a range before unmapping it, the entry is removed from priv->maps and the later call to mn_invl_range_start won't find it to do the unmapping. Fix this by creating another list of freeable maps that the mmu notifier can search and use to unmap grants. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-01-15xen/gntdev: fix unsafe vma accessDaniel De Graaf1-5/+24
In gntdev_ioctl_get_offset_for_vaddr, we need to hold mmap_sem while calling find_vma() to avoid potentially having the result freed out from under us. Similarly, the MMU notifier functions need to synchronize with gntdev_vma_close to avoid map->vma being freed during their iteration. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-01-15xen/privcmd: Fix mmap batch ioctl.Andres Lagar-Cavilla1-36/+47
1. If any individual mapping error happens, the V1 case will mark *all* operations as failed. Fixed. 2. The err_array was allocated with kcalloc, resulting in potentially O(n) page allocations. Refactor code to not use this array. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-01-15Merge tag 'v3.7' into stable/for-linus-3.8Konrad Rzeszutek Wilk851-5507/+9301
Linux 3.7

* tag 'v3.7': (833 commits)
  Linux 3.7
  Input: matrix-keymap - provide proper module license
  Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage
  ipv4: ip_check_defrag must not modify skb before unsharing
  Revert "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended"
  inet_diag: validate port comparison byte code to prevent unsafe reads
  inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
  inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
  inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
  mm: vmscan: fix inappropriate zone congestion clearing
  vfs: fix O_DIRECT read past end of block device
  net: gro: fix possible panic in skb_gro_receive()
  tcp: bug fix Fast Open client retransmission
  tmpfs: fix shared mempolicy leak
  mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones
  mm: compaction: validate pfn range passed to isolate_freepages_block
  mmc: sh-mmcif: avoid oops on spurious interrupts (second try)
  Revert misapplied "mmc: sh-mmcif: avoid oops on spurious interrupts"
  mmc: sdhci-s3c: fix missing clock for gpio card-detect
  lib/Makefile: Fix oid_registry build dependency
  ...

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Conflicts:
  arch/arm/xen/enlighten.c
  drivers/xen/Makefile

[We need to have the v3.7 base as the 'for-3.8' was based off v3.7-rc3 and there are some patches in v3.7-rc6 that we need to have in our branch]
2013-01-15Xen: properly bound buffer access when parsing cpu/*/availabilityJan Beulich1-2/+2
At the same time reduce the local buffers to 16 bytes each. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-01-15xen/grant-table: correctly initialize grant table version 1Matt Wilson1-19/+29
Commit 85ff6acb075a484780b3d763fdf41596d8fc0970 (xen/granttable: Grant tables V2 implementation) changed the GREFS_PER_GRANT_FRAME macro from a constant to a conditional expression. The expression depends on grant_table_version being appropriately set. Unfortunately, at init time grant_table_version will be 0. The GREFS_PER_GRANT_FRAME conditional expression checks for "grant_table_version == 1", and therefore returns the number of grant references per frame for v2.

This causes gnttab_init() to allocate fewer pages for gnttab_list, as a frame can hold only half as many v2 entries as v1 entries. After gnttab_resume() is called, grant_table_version is appropriately set. nr_init_grefs will then be miscalculated and gnttab_free_count will hold a value larger than the actual number of free gref entries.

If a guest is heavily utilizing improperly initialized v1 grant tables, memory corruption can occur. One common manifestation is corruption of the vmalloc list, resulting in a poisoned pointer dereference when accessing /proc/meminfo or /proc/vmallocinfo:

[ 40.770064] BUG: unable to handle kernel paging request at 0000200200001407
[ 40.770083] IP: [<ffffffff811a6fb0>] get_vmalloc_info+0x70/0x110
[ 40.770102] PGD 0
[ 40.770107] Oops: 0000 [#1] SMP
[ 40.770114] CPU 10

This patch introduces a static variable, grefs_per_grant_frame, to cache the calculated value. gnttab_init() now calls gnttab_request_version() early so that grant_table_version and grefs_per_grant_frame can be appropriately set. A few BUG_ON()s have been added to prevent this type of bug from reoccurring in the future.
Signed-off-by: Matt Wilson <msw@amazon.com> Reviewed-and-Tested-by: Steven Noonan <snoonan@amazon.com> Acked-by: Ian Campbell <Ian.Campbell@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Annie Li <annie.li@oracle.com> Cc: xen-devel@lists.xen.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org # v3.3 and newer Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
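The miscalculation is easy to reproduce with plain arithmetic. The sketch below assumes a 4096-byte page, an 8-byte v1 grant entry and a 16-byte v2 grant entry; those sizes and the function name are assumptions made for illustration, chosen to match the "half as many v2 entries" relationship described in the commit message.

```python
PAGE_SIZE = 4096       # assumed page size in bytes
V1_ENTRY_BYTES = 8     # assumed size of a v1 grant entry
V2_ENTRY_BYTES = 16    # assumed size of a v2 grant entry


def grefs_per_grant_frame(grant_table_version):
    """Model of the conditional macro: only version 1 selects the v1 size,
    so an unset version (0) silently falls through to the v2 size."""
    if grant_table_version == 1:
        return PAGE_SIZE // V1_ENTRY_BYTES
    return PAGE_SIZE // V2_ENTRY_BYTES


at_init = grefs_per_grant_frame(0)        # version not yet negotiated
after_resume = grefs_per_grant_frame(1)   # version set to 1 later on
```

Under these assumptions, the value seen at init time is half of what a v1 table actually holds per frame, so list storage sized from the first value is too small for free counts computed from the second.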
2013-01-15x86/xen : Fix the wrong check in pcibackYang Zhang1-1/+1
Fix the wrong check in pciback. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-01-11xen/privcmd: Relax access control in privcmd_ioctl_mmapTamas Lengyel1-6/+0
In the privcmd Linux driver two checks in the functions privcmd_ioctl_mmap and privcmd_ioctl_mmap_batch are not needed as they are trying to enforce hypervisor-level access control. They should be removed as they break secondary control domains when performing dom0 disaggregation. Xen itself provides adequate security controls around these hypercalls and these checks prevent those controls from functioning as intended. Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Cc: Daniel De Graaf <dgdegra@tycho.nsa.gov> [v1: Fixed up the patch and commit description] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-17xen/vcpu: Fix vcpu restore path.Wei Liu1-3/+4
The runstate of vcpu should be restored for all possible cpus, as well as the vcpu info placement. Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-17xen: Add EVTCHNOP_reset in Xen interface header files.Wei Liu1-0/+13
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-17xen/smp: Use smp_store_boot_cpu_info() to store cpu info for BSP during boot time.Konrad Rzeszutek Wilk1-1/+1
Git commit 30106c174311b8cfaaa3186c7f6f9c36c62d17da ("x86, hotplug: Support functions for CPU0 online/offline") alters what the call to smp_store_cpu_info() does. For the BSP we should use smp_store_boot_cpu_info(), and for secondary CPUs the old variant, smp_store_cpu_info(), should be used. This fixes the regression introduced by said commit. Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-10Linux 3.7v3.7Linus Torvalds1-1/+1
2012-12-10Input: matrix-keymap - provide proper module licenseFlorian Fainelli1-0/+3
The matrix-keymap module is currently lacking a proper module license, add one so we don't have this module tainting the entire kernel. This issue has been present since commit 1932811f426f ("Input: matrix-keymap - uninline and prepare for device tree support") Signed-off-by: Florian Fainelli <florian@openwrt.org> CC: stable@vger.kernel.org # v3.5+ Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2-42/+131
Pull networking fixes from David Miller: 1) Netlink socket dumping had several missing verifications and checks. In particular, address comparisons in the request byte code interpreter could access past the end of the address in the inet_request_sock. Also, address family and address prefix lengths were not validated properly at all. This means arbitrary applications can read past the end of certain kernel data structures. Fixes from Neal Cardwell. 2) ip_check_defrag() operates in contexts where we're in the process of, or about to, input the packet into the real protocols (specifically macvlan and AF_PACKET snooping). Unfortunately, it does a pskb_may_pull() which can modify the backing packet data which is not legal if the SKB is shared. It very much can be shared in this context. Deal with the possibility that the SKB is segmented by using skb_copy_bits(). Fix from Johannes Berg based upon a report by Eric Leblond. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: ipv4: ip_check_defrag must not modify skb before unsharing inet_diag: validate port comparison byte code to prevent unsafe reads inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run() inet_diag: validate byte code to prevent oops in inet_diag_bc_run() inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
2012-12-10Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damageLinus Torvalds5-13/+20
This reverts commits a50915394f1fc02c2861d3b7ce7014788aa5066e and d7c3b937bdf45f0b844400b7bf6fd3ed50bac604. This is a revert of a revert of a revert. In addition, it reverts the even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the original commits in linux-next. It turns out that the original patch really was bogus, and that the original revert was the correct thing to do after all. We thought we had fixed the problem, and then reverted the revert, but the problem really is fundamental: waking up kswapd simply isn't the right thing to do, and direct reclaim sometimes simply _is_ the right thing to do. When certain allocations fail, we simply should try some direct reclaim, and if that fails, fail the allocation. That's the right thing to do for THP allocations, which can easily fail, and the GPU allocations want to do that too. So starting kswapd is sometimes simply wrong, and removing the flag that said "don't start kswapd" was a mistake. Let's hope we never revisit this mistake again - and certainly not this many times ;) Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-10ipv4: ip_check_defrag must not modify skb before unsharingJohannes Berg1-10/+9
ip_check_defrag() might be called from af_packet within the RX path where shared SKBs are used, so it must not modify the input SKB before it has unshared it for defragmentation. Use skb_copy_bits() to get the IP header and only pull in everything later. The same is true for the other caller in macvlan as it is called from dev->rx_handler which can also get a shared SKB. Reported-by: Eric Leblond <eric@regit.org> Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-10Revert "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended"Linus Torvalds1-27/+10
This reverts commit 782fd30406ecb9d9b082816abe0c6008fc72a7b0. We are going to reinstate the __GFP_NO_KSWAPD flag that has been removed, the removal reverted, and then removed again, making this commit a pointless fixup for a problem that was caused by the removal of the __GFP_NO_KSWAPD flag. The thing is, we really don't want to wake up kswapd for THP allocations (because they fail quite commonly under any kind of memory pressure, including when there is tons of memory free), and these patches were just trying to fix up the underlying bug: the original removal of __GFP_NO_KSWAPD in commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD") was simply bogus. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-09inet_diag: validate port comparison byte code to prevent unsafe readsNeal Cardwell1-7/+24
Add logic to verify that a port comparison byte code operation actually has the second inet_diag_bc_op from which we read the port for such operations. Previously the code blindly referenced op[1] without first checking whether a second inet_diag_bc_op struct could fit there. So a malicious user could make the kernel read 4 bytes beyond the end of the bytecode array by claiming to have a whole port comparison byte code (2 inet_diag_bc_op structs) when in fact the bytecode was not long enough to hold both. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
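The added check amounts to a length test before the second struct is touched. The following is a minimal user-space sketch, assuming each `inet_diag_bc_op` occupies 4 bytes (consistent with the "4 bytes beyond the end" figure above); `port_op_fits` is a hypothetical helper name, not a kernel function.

```python
OP_SIZE = 4  # assumed size in bytes of one inet_diag_bc_op struct


def port_op_fits(bytecode_len, offset):
    """Return True when a port comparison starting at `offset` has room for
    both of its structs (the op itself plus the op carrying the port)."""
    return offset >= 0 and bytecode_len - offset >= 2 * OP_SIZE
```

A validator built this way would reject bytecode whose final port comparison is cut short, rather than letting the interpreter read `op[1]` past the end of the array.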
2012-12-09inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()Neal Cardwell1-11/+17
Add logic to check the address family of the user-supplied conditional and the address family of the connection entry. We now do not do prefix matching of addresses from different address families (AF_INET vs AF_INET6), except for the previously existing support for having an IPv4 prefix match an IPv4-mapped IPv6 address (which this commit maintains as-is). This change is needed for two reasons: (1) The addresses are different lengths, so comparing a 128-bit IPv6 prefix match condition to a 32-bit IPv4 connection address can cause us to unwittingly walk off the end of the IPv4 address and read garbage or oops. (2) The IPv4 and IPv6 address spaces are semantically distinct, so a simple bit-wise comparison of the prefixes is not meaningful, and would lead to bogus results (except for the IPv4-mapped IPv6 case, which this commit maintains). Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
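The matching rule described above can be modelled in user space as follows. This is a sketch, not the kernel code: `bits_match` and `prefix_match` are hypothetical names, addresses are raw network-order bytes, and the condition address is assumed to have the length implied by its family.

```python
import socket

V4_MAPPED_PREFIX = bytes(10) + b"\xff\xff"  # leading 96 bits of ::ffff:a.b.c.d


def bits_match(a, b, nbits):
    """Compare the first `nbits` bits of two byte strings."""
    whole, rest = divmod(nbits, 8)
    if a[:whole] != b[:whole]:
        return False
    if rest:
        mask = (0xFF << (8 - rest)) & 0xFF
        return (a[whole] & mask) == (b[whole] & mask)
    return True


def prefix_match(cond_family, cond_addr, prefix_len, entry_family, entry_addr):
    if cond_family == entry_family:
        if prefix_len > 8 * len(entry_addr):
            return False  # never compare past the end of the address
        return bits_match(cond_addr, entry_addr, prefix_len)
    # The one cross-family case that is kept: an IPv4 condition matched
    # against an IPv4-mapped IPv6 connection address.
    if (cond_family == socket.AF_INET and entry_family == socket.AF_INET6
            and entry_addr[:12] == V4_MAPPED_PREFIX):
        if prefix_len > 32:
            return False
        return bits_match(cond_addr, entry_addr[12:], prefix_len)
    return False  # different families are never compared bit-wise
```

An IPv6 condition whose leading bytes happen to equal an IPv4 address is therefore not a match for an IPv4 connection, and no comparison runs beyond the shorter address.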
2012-12-09inet_diag: validate byte code to prevent oops in inet_diag_bc_run()Neal Cardwell1-3/+45
Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND operations. Previously we did not validate the inet_diag_hostcond, address family, address length, and prefix length. So a malicious user could make the kernel read beyond the end of the bytecode array by claiming to have a whole inet_diag_hostcond when the bytecode was not long enough to contain a whole inet_diag_hostcond of the given address family. Or they could make the kernel read up to about 27 bytes beyond the end of a connection address by passing a prefix length that exceeded the length of addresses of the given family. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-09inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV stateNeal Cardwell1-14/+39
Fix inet_diag to be aware of the fact that AF_INET6 TCP connections instantiated for IPv4 traffic and in the SYN-RECV state were actually created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This means that for such connections inet6_rsk(req) returns a pointer to a random spot in memory up to roughly 64KB beyond the end of the request_sock. With this bug, for a server using AF_INET6 TCP sockets and serving IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to inet_diag_fill_req() causing an oops or the export to user space of 16 bytes of kernel memory as a garbage IPv6 address, depending on where the garbage inet6_rsk(req) pointed. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-08mm: vmscan: fix inappropriate zone congestion clearingJohannes Weiner1-3/+0
commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones") removed zone watermark checks from the compaction code in kswapd but left in the zone congestion clearing, which now happens unconditionally on higher order reclaim. This messes up the reclaim throttling logic for zones with dirty/writeback pages, where zones should only lose their congestion status when their watermarks have been restored. Remove the clearing from the zone compaction section entirely. The preliminary zone check and the reclaim loop in kswapd will clear it if the zone is considered balanced. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-08vfs: fix O_DIRECT read past end of block deviceLinus Torvalds1-1/+17
The direct-IO write path already had the i_size checks in mm/filemap.c, but it turns out the read path did not, and removing the block size checks in fs/block_dev.c (commit bbec0270bdd8: "blkdev_max_block: make private to fs/buffer.c") removed the magic "shrink IO to past the end of the device" code there. Fix it by truncating the IO to the size of the block device, like the write path already does. NOTE! I suspect the write path would be *much* better off doing it this way in fs/block_dev.c, rather than hidden deep in mm/filemap.c. The mm/filemap.c code is extremely hard to follow, and has various conditionals on the target being a block device (ie the flag passed in to 'generic_write_checks()', along with a conditional update of the inode timestamp etc). It is also quite possible that we should treat this whole block device size as a "s_maxbytes" issue, and try to make the logic even more generic. However, in the meantime this is the fairly minimal targeted fix. Noted by Milan Broz thanks to a regression test for the cryptsetup reencrypt tool. Reported-and-tested-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
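Truncating the IO to the device size is a simple clamp. The sketch below models it in user space; `clamp_read` is a hypothetical helper, with `pos`, `count` and `device_size` all in bytes.

```python
def clamp_read(pos, count, device_size):
    """Return how many bytes a read of `count` bytes at `pos` may transfer
    without running past the end of a device of `device_size` bytes."""
    if pos >= device_size:
        return 0
    return min(count, device_size - pos)
```

A read that starts inside the device but would extend beyond it is shortened, and a read that starts at or past the end transfers nothing.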
2012-12-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds6-9/+24
Pull networking fixes from David Miller: "Two stragglers: 1) The new code that adds new flushing semantics to GRO can cause SKB pointer list corruption, manage the lists differently to avoid the OOPS. Fix from Eric Dumazet. 2) When TCP fast open does a retransmit of data in a SYN-ACK or similar, we update retransmit state that we shouldn't triggering a WARN_ON later. Fix from Yuchung Cheng." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: gro: fix possible panic in skb_gro_receive() tcp: bug fix Fast Open client retransmission
2012-12-07net: gro: fix possible panic in skb_gro_receive()Eric Dumazet3-3/+8
commit 2e71a6f8084e (net: gro: selective flush of packets) added a bug for skbs using frag_list. This part of the GRO stack is rarely used, as it needs skbs that do not use a page fragment for their skb->head. Most drivers do use a page fragment, but some of them use GFP_KERNEL allocations for the initial fill of their RX ring buffer. napi_gro_flush() overwrites skb->prev, which was used for these skbs to point to the last skb in frag_list. Fix this by using a separate field in struct napi_gro_cb to point to the last fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-07tcp: bug fix Fast Open client retransmissionYuchung Cheng3-6/+16
If SYN-ACK partially acks SYN-data, the client retransmits the remaining data by tcp_retransmit_skb(). This increments lost recovery state variables like tp->retrans_out in Open state. If loss recovery happens before the retransmission is acked, it triggers the WARN_ON check in tcp_fastretrans_alert(). For example: the client sends SYN-data, gets a SYN-ACK acking only the ISN, retransmits data, sends another 4 data packets and gets 3 dupacks. Since the retransmission is not caused by network drop, it should not update the recovery state variables. Further, the server may return a smaller MSS than the cached MSS used for SYN-data, so the retransmission needs a loop. Otherwise some data will not be retransmitted until timeout or other loss recovery events. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-07Merge tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmcLinus Torvalds2-6/+9
Pull MMC fixes from Chris Ball: "Two small regression fixes: - sdhci-s3c: Fix runtime PM regression against 3.7-rc1 - sh-mmcif: Fix oops against 3.6" * tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: mmc: sh-mmcif: avoid oops on spurious interrupts (second try) Revert misapplied "mmc: sh-mmcif: avoid oops on spurious interrupts" mmc: sdhci-s3c: fix missing clock for gpio card-detect
2012-12-06tmpfs: fix shared mempolicy leakMel Gorman3-48/+16
This fixes a regression in 3.7-rc, which has since gone into stable. Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went on expecting alloc_page_vma() to drop the refcount it had acquired. This deserves a rework: but for now fix the leak in shmem_alloc_page(). Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use the same refcounting there as in shmem_alloc_page(), delete its onstack mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() - those were invented to let swapin_readahead() make an unknown number of calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e, alloc_pages_vma() has kept refcount in balance, so now no problem. Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zonesJohannes Weiner1-16/+0
When a zone meets its high watermark and is compactable in case of higher order allocations, it contributes to the percentage of the node's memory that is considered balanced. This requirement, that a node be only partially balanced, came about when kswapd was desperately trying to balance tiny zones when all bigger zones in the node had plenty of free memory. Arguably, the same should apply to compaction: if a significant part of the node is balanced enough to run compaction, do not get hung up on that tiny zone that might never get in shape. When the compaction logic in kswapd is reached, we know that at least 25% of the node's memory is balanced properly for compaction (see zone_balanced and pgdat_balanced). Remove the individual zone checks that restart the kswapd cycle. Otherwise, we may observe more endless looping in kswapd where the compaction code loops back to reclaim because of a single zone and reclaim does nothing because the node is considered balanced overall. See for example https://bugzilla.redhat.com/show_bug.cgi?id=866988 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-and-tested-by: Thorsten Leemhuis <fedora@leemhuis.info> Reported-by: Jiri Slaby <jslaby@suse.cz> Tested-by: John Ellson <john.ellson@comcast.net> Tested-by: Zdenek Kabelac <zkabelac@redhat.com> Tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06mm: compaction: validate pfn range passed to isolate_freepages_blockMel Gorman1-1/+9
Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration") added a check for pfn_valid() when isolating pages for migration as the scanner does not necessarily start pageblock-aligned. Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near where it left off"), the free scanner has the same problem. This patch makes sure that the pfn range passed to isolate_freepages_block() is within the same block so that pfn_valid() checks are unnecessary. In answer to Henrik's wondering why others have not reported this: reproducing this requires a large enough hole with the right alignment to have compaction walk into a PFN range with no memmap. Size and alignment depend on the memory model - 4M for FLATMEM and 128M for SPARSEMEM on x86. It needs a "lucky" machine. Reported-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
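Keeping each scanned range inside one pageblock comes down to rounding the end of the range up to the next block boundary and capping it at the overall limit. The sketch below is a user-space model under assumptions: `PAGEBLOCK_NR_PAGES = 512` is an illustrative value only, and `align_up` and `block_bounded_ranges` are hypothetical helper names.

```python
PAGEBLOCK_NR_PAGES = 512  # illustrative pageblock size in pages (assumption)


def align_up(value, alignment):
    """Round `value` up to the next multiple of `alignment`."""
    return (value + alignment - 1) // alignment * alignment


def block_bounded_ranges(start_pfn, end_pfn):
    """Split [start_pfn, end_pfn) into chunks that never cross a pageblock
    boundary, even when start_pfn is not pageblock-aligned."""
    ranges = []
    pfn = start_pfn
    while pfn < end_pfn:
        chunk_end = min(align_up(pfn + 1, PAGEBLOCK_NR_PAGES), end_pfn)
        ranges.append((pfn, chunk_end))
        pfn = chunk_end
    return ranges
```

Aligning `pfn + 1` rather than `pfn` means an already-aligned start still advances by a full block, while an unaligned start is cut off at the first boundary it would otherwise cross.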
2012-12-06mmc: sh-mmcif: avoid oops on spurious interrupts (second try)Guennadi Liakhovetski1-2/+2
On some systems, e.g., kzm9g, MMCIF interfaces can produce spurious interrupts without any active request. To prevent the Oops that results in such cases, don't dereference the mmc request pointer until we have made sure that we are indeed processing such a request. Reported-by: Tetsuyuki Kobayashi <koba@kmckk.co.jp> Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Tested-by: Tetsuyuki Kobayashi <koba@kmckk.co.jp> Cc: stable@vger.kernel.org Signed-off-by: Chris Ball <cjb@laptop.org>
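The general pattern, checking for an active request before touching it, might look like the sketch below. All types and names here (struct request, struct host, handle_irq) are hypothetical and simplified; this is not the sh-mmcif driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request and host state. */
struct request { int error; };
struct host { struct request *mrq; }; /* NULL when no request is active */

enum irq_result { IRQ_IGNORED, IRQ_DONE };

/*
 * Interrupt handler sketch: a spurious interrupt arrives with no active
 * request, so the request pointer is tested before any dereference.
 */
enum irq_result handle_irq(struct host *host, int error)
{
    struct request *mrq;

    assert(host != NULL);

    mrq = host->mrq;
    if (mrq == NULL)
        return IRQ_IGNORED; /* spurious: nothing to complete */

    mrq->error = error;     /* safe: a request is really in flight */
    host->mrq = NULL;       /* request is finished */
    return IRQ_DONE;
}
```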
2012-12-06Revert misapplied "mmc: sh-mmcif: avoid oops on spurious interrupts"Chris Ball1-4/+0
This reverts commit 8464dd52d3198dd05, which was a misapplied debugging version of the patch, not the final patch itself. Signed-off-by: Chris Ball <cjb@laptop.org> Cc: stable@vger.kernel.org
2012-12-06mmc: sdhci-s3c: fix missing clock for gpio card-detectHeiko Stübner1-0/+7
2abeb5c5ded2 ("Add clk_(enable/disable) in runtime suspend/resume") added the capability to stop the clocks when the device is runtime suspended, but forgot to handle the case of the card-detect using an external gpio. Therefore in the case that runtime-pm is enabled, start the io-clock when a card is inserted and stop it again once it is removed. Signed-off-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Chris Ball <cjb@laptop.org>
2012-12-06Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linusLinus Torvalds5-20/+24
Pull MIPS fixes from Ralf Baechle: "These are the fixes for the N32 syscall bugs found by Al, an extraneous break that broke detection for R3000 and R3081 processors, an endless loop processing signals for kernel task (x86 received the same fix a while ago) and a fix for transparent huge page which took ages to track down because it was so hard to come up with a workable test case." * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: MIPS: Fix endless loop when processing signals for kernel tasks MIPS: R3000/R3081: Fix CPU detection. MIPS: N32: Fix signalfd4 syscall entry point MIPS: N32: Fix preadv(2) and pwritev(2) entry points. MIPS: Avoid mcheck by flushing page range in huge_ptep_set_access_flags()
2012-12-06Merge branch 'more-fixes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull build fix from Rusty Russell: "Tim Gardner <tim.gardner@canonical.com> writes: > It is $(obj)/oid_registry.o that is dependent on $(obj)/oid_registry_data.c. > The object file cannot be built until $(obj)/oid_registry_data.c has been > generated. > > A periodic and hard to reproduce parallel build failure is due to > this incorrect lib/Makefile dependency. The compile error is completely > disingenuous. > > GEN lib/oid_registry_data.c > Compiling 49 OIDs > CC lib/oid_registry.o > gcc: error: lib/oid_registry.c: No such file or directory > gcc: fatal error: no input files > compilation terminated. > make[3]: *** [lib/oid_registry.o] Error 4 I can't reproduce it either. It's completely weird; nothing ever removes lib/oid_registry.c, so either gcc is giving the wrong message or it's a weird fs with a very odd race. But your version is definitely more correct than the previous one, so..." * 'more-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: lib/Makefile: Fix oid_registry build dependency
2012-12-06Merge branch 'fixes' of ↵Linus Torvalds2-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module signing fixes from Rusty Russell: "David gave me these a month ago, during my git workflow churn :(" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: ASN.1: Fix an indefinite length skip error MODSIGN: Don't use enum-type bitfields in module signature info block
2012-12-06Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull watchdog fix from Thomas Gleixner: "Trivial CPU hotplug regression fix for the watchdog code" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: watchdog: Fix CPU hotplug regression
2012-12-06lib/Makefile: Fix oid_registry build dependencyTim Gardner1-1/+1
It is $(obj)/oid_registry.o that is dependent on $(obj)/oid_registry_data.c. The object file cannot be built until $(obj)/oid_registry_data.c has been generated. A periodic and hard to reproduce parallel build failure is due to this incorrect lib/Makefile dependency. The compile error is completely disingenuous. GEN lib/oid_registry_data.c Compiling 49 OIDs CC lib/oid_registry.o gcc: error: lib/oid_registry.c: No such file or directory gcc: fatal error: no input files compilation terminated. make[3]: *** [lib/oid_registry.o] Error 4 Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Howells <dhowells@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-12-05MIPS: Fix endless loop when processing signals for kernel tasksDmitry Adamushko1-1/+6
The problem occurs [1] when a kernel-mode task returns from a system call with a pending signal. A real-life scenario is a child of 'khelper' returning from a failed kernel_execve() in ____call_usermodehelper() [ kernel/kmod.c ]. kernel_execve() fails due to a pending SIGKILL, which is the result of "kill -9 -1" (at least, busybox's init does it upon reboot). The loop is as follows: * syscall_exit_work: - work_pending: // start_of_the_loop - work_notifysig: - do_notify_resume() - do_signal() - if (!user_mode(regs)) return; - resume_userspace // TIF_SIGPENDING is still set - work_pending // so we call work_pending => goto // start_of_the_loop More information can be found in another LKML thread: http://www.serverphorums.com/read.php?12,457826 [1] The problem was also reproduced on !CONFIG_VM86 x86, and the following fix was accepted. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=29a2e2836ff9ea65a603c89df217f4198973a74f Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/3571/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-12-05MIPS: R3000/R3081: Fix CPU detection.Ralf Baechle1-1/+0
Broken since e05ea74fc56f347f872ef9946d27c53e8bf20864 (lmo) resp. cea7e2dfdef53fe55f359d00da562a268be06fd2 (kernel.org) [MIPS: Sort out CPU type to name translation.] These CPUs are no longer very popular, to say the least ... Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Reported-by: Murphy McCauley <murphy.mccauley@gmail.com>
2012-12-05MIPS: N32: Fix signalfd4 syscall entry pointRalf Baechle1-1/+1
This needs to use the compat entry point or it's going to fail on big endian systems. Noticed by Al Viro. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-12-05vfs: clear to the end of the buffer on partial buffer readsDan Carpenter1-1/+1
READ is zero so the "rw & READ" test is always false. The intended test was "((rw & RW_MASK) == READ)". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
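The difference between the two tests can be seen in isolation. The sketch below is standalone userspace C, not kernel code; the READ, WRITE and RW_MASK values are assumptions chosen to match the description above (READ is zero, the direction lives in the lowest bit), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration: the direction is the lowest bit. */
#define READ    0
#define WRITE   1
#define RW_MASK 1

/* Buggy form: READ is 0, so the bitwise AND is 0 for every rw value. */
bool is_read_buggy(int rw)
{
    return (rw & READ) != 0;
}

/* Intended form: mask out the direction bit, then compare with READ. */
bool is_read(int rw)
{
    return (rw & RW_MASK) == READ;
}

/* Small self-check showing where the two forms disagree. */
void rw_demo(void)
{
    assert(!is_read_buggy(READ));  /* the buggy test never detects a read */
    assert(is_read(READ));
    assert(!is_read(WRITE));
    assert(is_read(READ | 0x10));  /* other flag bits do not matter */
}
```

Testing a flag whose value is zero with a bitwise AND can never succeed, which is why the comparison has to be made against the masked value instead.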