The way fuse calls truncate_pagecache() from fuse_change_attributes()
is completely wrong. Without i_mutex held, we can never be sure that
'oldsize' and 'attr->size' are still valid by the time
truncate_pagecache(inode, oldsize, attr->size) is executed. In fact, as soon
as we release fc->lock in the middle of fuse_change_attributes(), we
completely lose control of the actions that may happen to the given inode
until we reach truncate_pagecache. The list of potentially dangerous actions
includes mmap-ed reads and writes, ftruncate(2) and write(2) extending the
file size. The typical outcome of calling truncate_pagecache() with outdated
arguments is data corruption from the user's point of view. This is (in some
sense) acceptable in cases when the issue is triggered by a change of the
file on the server (i.e. externally wrt fuse operation), but it is absolutely
intolerable in scenarios when a single fuse client modifies a file without
any external intervention. A real-life case I discovered with fsx-linux
looked like this:
1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
2. fuse_dentry_revalidate() is called asynchronously. It sends FUSE_LOOKUP
to the server synchronously, then calls fuse_change_attributes(). The
latter updates i_size and releases fc->lock, but before it compares oldsize
vs attr->size ...
3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
updating attributes and i_size, but now oldsize is equal to
outarg.attr.size because i_size has just been updated (step 2). Hence,
fuse_do_setattr() returns w/o calling truncate_pagecache().
4. As soon as ftruncate(2) completes, the user extends file size by
write(2) making a hole in the middle of file, then reads data from the hole
either by read(2) or mmap-ed read. The user expects to get zero data from
the hole, but gets stale data because truncate_pagecache() is not executed
yet.
The scenario above illustrates one side of the problem: not truncating the
page cache even though we should. Another side corresponds to truncating
page cache too late, when the state of the inode has changed significantly.
Theoretically, the following is possible:
1. As in the previous scenario fuse_dentry_revalidate() discovered that
i_size changed (due to our own fuse_do_setattr()) and is going to call
truncate_pagecache() for some 'new_size' it believes valid right now. But
by the time that particular truncate_pagecache() is called ...
2. fuse_do_setattr() returns (either having called truncate_pagecache() or
not -- it doesn't matter).
3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
4. mmap-ed write makes a page in the extended region dirty.
The result will be the loss of the data the user wrote in the fourth step.
The patch is a hotfix resolving the issue in a simplistic way: let's skip
the dangerous i_size update and truncate_pagecache if an operation changing
the file size is in progress. This simplistic approach looks correct for the
cases w/o external changes. To handle external changes properly, more
sophisticated and intrusive techniques (e.g. an NFS-like one) would be
required. I'd like to postpone that until the issue is well discussed on the
mailing list(s).
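A minimal sketch of the idea, reusing the fuse_inode::state field mentioned
in the backport note below (the exact names are illustrative):

  /* fuse_do_setattr(): mark the inode while a size-changing op is in flight */
  set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
  /* ... send FUSE_SETATTR and wait for the reply ... */
  clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);

  /* fuse_change_attributes(): skip the dangerous truncate meanwhile */
  if (S_ISREG(inode->i_mode) && oldsize != attr->size &&
      !test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state))
          truncate_pagecache(inode, oldsize, attr->size);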
Changed in v2:
- improved patch description to cover both sides of the issue.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
[bwh: Backported to 3.2: add the fuse_inode::state field which we didn't have]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Userspace can add names containing a slash character to the directory
listing. Don't allow this as it could cause all sorts of trouble.
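A hedged sketch of such a check in the dirent parser (placement assumed):

  if (memchr(dirent->name, '/', dirent->namelen) != NULL)
          return -EIO;    /* reject directory entries containing a slash */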
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
[bwh: Backported to 3.2: drop changes to parse_dirplusfile() which we don't have]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
But such error messages are a consequence of a file system issue that takes
place earlier. Fortunately, Jerome Poulin <jeromepoulin@gmail.com>
and Anton Eliasson <devel@antoneliasson.se> reported another issue not
so long ago. These reports describe an issue with the segctor
thread's crash:
BUG: unable to handle kernel paging request at 0000000000004c83
IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]
These two issues have one root cause. That cause can raise a third issue
too: the segctor thread hanging while eating 100% CPU.
REPRODUCING PATH:
One possible way of reproducing the issue was described by
Jerome Poulin <jeromepoulin@gmail.com>:
1. init S to get to single user mode.
2. sysrq+E to make sure only my shell is running
3. start network-manager to get my wifi connection up
4. login as root and launch "screen"
5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
6. lscp | xz -9e > lscp.txt.xz
7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
9. start a screen and launch strace -f -o find-cat.log -t find
/mnt/nilfs -type f -exec cat {} > /dev/null \;
10. start a screen and launch strace -f -o apt-get.log -t apt-get update
11. launch the last command again as it did not crash the first time
12. apt-get crashes
13. ps aux > ps-aux-crashed.log
14. sysrq+W
15. sysrq+E wait for everything to terminate
16. sysrq+SUSB
A simpler way of reproducing the issue is to start a kernel compilation
and "apt-get update" in parallel.
REPRODUCIBILITY:
The issue does not reproduce reliably [60% - 80%]. It is very important to
have a proper environment for reproducing the issue. The critical
conditions for successful reproduction:
(1) A big file should be modified by means of mmap().
(2) From time to time during processing, this file should have more dirty
blocks than several segments in size (for example, two or three).
(3) There should be intensive background file-modification activity
in another thread.
INVESTIGATION:
First of all, it is possible to see that the reason for the crash is an
invalid page address:
Moreover, the value of b_page (0x1a82) is 6786. This value looks like a
segment number, and the b_blocknr and b_size values look like block numbers.
So the buffer_head's pointer points to an improper address value.
Detailed investigation of the issue revealed the following picture:
Usually, for every segment we collect dirty files in a list. Then, dirty
blocks are gathered for every dirty file, prepared for write and
submitted by means of a nilfs_segbuf_submit_bh() call. Finally, the
complete-write phase takes place after nilfs_end_bio_write() is called on
the block layer. Buffers/pages are marked as not dirty in the final phase,
and processed files are removed from the list of dirty files.
It is possible to see that we had three prepare-write and submit_bio
phases before the segbuf-wait and complete-write phase. Moreover, segments
compete with each other for dirty blocks, because on every iteration
of segment processing dirty buffer_heads are added to several lists of
payload_buffers:
The next pointer is the same but the prev pointer has changed. It means
that the buffer_head has a next pointer from one list but a prev pointer
from another. Such a modification can be made several times and can
finally result in various issues: (1) segctor hanging, (2) segctor
crashing, (3) file system metadata corruption.
FIX:
This patch adds:
(1) setting of the BH_Async_Write flag in nilfs_segctor_prepare_write()
for every processed dirty block;
(2) checking of BH_Async_Write flag in
nilfs_lookup_dirty_data_buffers() and
nilfs_lookup_dirty_node_buffers();
(3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().
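A minimal sketch of the flag usage, assuming the standard BUFFER_FNS()
helpers for BH_Async_Write:

  /* (1) prepare_write: claim the block for this segment */
  set_buffer_async_write(bh);

  /* (2) dirty-buffer lookup: skip blocks already claimed by an
     in-flight segment, so two payload lists never share a bh */
  if (buffer_async_write(bh))
          continue;

  /* (3) complete_write and the abort/forget/clear paths: drop the claim */
  clear_buffer_async_write(bh);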
Reported-by: Jerome Poulin <jeromepoulin@gmail.com>
Reported-by: Anton Eliasson <devel@antoneliasson.se>
Cc: Paul Fertser <fercerpav@gmail.com>
Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
Cc: Juan Barry Manuel Canham <Linux@riotingpacifist.net>
Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
Cc: Elmer Zhang <freeboy6716@gmail.com>
Cc: Kenneth Langga <klangga@gmail.com>
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: nilfs_clear_dirty_page() has not been separated from nilfs_clear_dirty_pages()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the event name is specified with all 3 components, the last one
overwrites the previous one while the name is being composed within the
parse_events_add_cache function.
Fix this by properly adjusting the string index.
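A hedged sketch of the intended composition loop (identifier names are
assumptions):

  n = snprintf(name, MAX_NAME_LEN, "%s", type);
  for (i = 0; (i < 2) && op_result[i]; i++)
          /* advance the index so later components append, not overwrite */
          n += snprintf(name + n, MAX_NAME_LEN - n, "-%s", op_result[i]);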
Reported-by: Joel Uckelman <joel@lightboxtechnologies.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Uckelman <joel@lightboxtechnologies.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LPU-Reference: 20120905175133.GA18352@krava.brq.redhat.com
[ committer note: Remove the newline fix, done already in 42e1fb7 ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Vinson Lee <vlee@twopensource.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When online_pages() is called to add new memory to an empty zone, it
rebuilds all zone lists by calling build_all_zonelists(). But there's a
bug which prevents the new zone from being added to other nodes' zone lists.
online_pages()
{
        build_all_zonelists()
        .....
        node_set_state(zone_to_nid(zone), N_HIGH_MEMORY)
}
Here the node of the zone is put into N_HIGH_MEMORY state after calling
build_all_zonelists(), but build_all_zonelists() only adds zones from
nodes in N_HIGH_MEMORY state to the fallback zone lists.
So memory in the new zone will never be used by other nodes, and it may
cause strange behavior when the system is under memory pressure. So put the
node into N_HIGH_MEMORY state before calling build_all_zonelists().
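Sketched fix (the build_all_zonelists() signature varies across kernel
versions):

  /* set the node state first, so build_all_zonelists() sees the node */
  node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
  build_all_zonelists(NULL);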
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Qiang Huang <h.huangqiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
task->cgroups is a RCU pointer pointing to struct css_set. A task
switches to a different css_set on cgroup migration but a css_set
doesn't change once created and its pointers to cgroup_subsys_states
aren't RCU protected.
task_subsys_state[_check]() is the macro to acquire css given a task
and subsys_id pair. It RCU-dereferences task->cgroups->subsys[] not
task->cgroups, so the RCU pointer task->cgroups ends up being
dereferenced without read_barrier_depends() after it. It's broken.
Fix it by introducing task_css_set[_check]() which does
RCU-dereference on task->cgroups. task_subsys_state[_check]() is
reimplemented to directly dereference ->subsys[] of the css_set
returned from task_css_set[_check]().
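A sketch of the resulting accessors (simplified, without the lockdep checks
of the _check variants):

  static inline struct css_set *task_css_set(struct task_struct *task)
  {
          return rcu_dereference(task->cgroups);
  }

  static inline struct cgroup_subsys_state *
  task_subsys_state(struct task_struct *task, int subsys_id)
  {
          /* css_set contents are immutable, a plain dereference is fine */
          return task_css_set(task)->subsys[subsys_id];
  }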
This removes some of the sparse RCU warnings in cgroup.
v2: Fixed an unbalanced parenthesis; also, there's no need to use
rcu_dereference_raw() when !CONFIG_PROVE_RCU. Both spotted by Li.
While PROC_CN_MCAST_LISTEN/IGNORE is entirely advisory, it was possible
for an unprivileged user to turn off notifications for all listeners by
sending PROC_CN_MCAST_IGNORE. Instead, require the same privileges as
required for a multicast bind.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Acked-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Qiang Huang <h.huangqiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When determining the page size we could use to map with the IOMMU, the
page size should also be aligned with the hva, not just the gfn. The
gfn may not reflect the real alignment within the hugetlbfs file.
Most of the time, this works fine. However, if the hugetlbfs file is
backed by non-contiguous huge pages, a multi-huge page memslot starts at
an unaligned offset within the hugetlbfs file, and the gfn is aligned
with respect to the huge page size, kvm_host_page_size() will return the
huge page size and we will use that to map with the IOMMU.
When we later unpin that same memslot, the IOMMU returns the unmap size
as the huge page size, and we happily unpin that many pfns in
monotonically increasing order, not realizing we are spanning
non-contiguous huge pages and partially unpin the wrong huge page.
Ensure the IOMMU mapping page size is aligned with the hva corresponding
to the gfn, which does reflect the alignment within the hugetlbfs file.
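A hedged sketch of the alignment loops (helper names as in kvm, treated as
assumptions):

  page_size = kvm_host_page_size(kvm, gfn);
  /* make sure both the gfn and the hva are aligned to the mapping size */
  while ((gfn << PAGE_SHIFT) & (page_size - 1))
          page_size >>= 1;
  while (__gfn_to_hva_memslot(slot, gfn) & (page_size - 1))
          page_size >>= 1;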
Newer kernels (linux-next with the transparent huge page patches)
use rrbm if the feature is announced via feature bit 66.
RRBM will cause intercepts, so KVM does not handle it right now,
causing an illegal instruction in the guest.
The easy solution is to disable the feature bit for the guest.
Any uaccess between guest_enter and guest_exit could trigger a page fault;
the page fault handler would handle it as a guest fault and translate a
user address as a guest address.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[hq: Backported to 3.4: adjust context]
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cgroup core has a bug which violates a basic rule about event
notifications - when a new entity needs to be added, you add that to
the notification list first and then make the new entity conform to
the current state. If done in the reverse order, an event happening
in between will be lost.
cgroup_subsys->fork() is invoked way before the new task is added to
the css_set. Currently, cgroup_freezer is the only user of ->fork()
and uses it to make new tasks conform to the current state of the
freezer. If FROZEN state is requested while fork is in progress
between cgroup_fork_callbacks() and cgroup_post_fork(), the child
could escape freezing - the cgroup isn't frozen when ->fork() is
called and the freezer couldn't see the new task on the css_set.
This patch moves cgroup_subsys->fork() invocation to
cgroup_post_fork() after the new task is added to the css_set.
cgroup_fork_callbacks() is removed.
Because now a task may be migrated during cgroup_subsys->fork(),
freezer_fork() is updated so that it adheres to the usual RCU locking
and the rather pointless comment on why locking can be different there
is removed (if it doesn't make anything simpler, why even bother?).
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
[hq: Backported to 3.4:
 - Adjust context
 - Iterate over first CGROUP_BUILTIN_SUBSYS_COUNT elements of subsys]
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kswapd does not in all places have the same criteria for a balanced
zone. Zones are only being reclaimed when their high watermark is
breached, but compaction checks loop over the zonelist again when the
zone does not meet the low watermark plus two times the size of the
allocation. This gets kswapd stuck in an endless loop over a small
zone, like the DMA zone, where the high watermark is smaller than the
compaction requirement.
Add a function, zone_balanced(), that checks the watermark, and, for
higher order allocations, if compaction has enough free memory. Then
use it uniformly to check for balanced zones.
This makes sure that when the compaction watermark is not met, at least
reclaim happens and progress is made - or the zone is declared
unreclaimable at some point and skipped entirely.
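A sketch of such a helper, close to (but not necessarily identical with)
the era's actual code:

  static bool zone_balanced(struct zone *zone, int order,
                            unsigned long balance_gap, int classzone_idx)
  {
          /* the watermark test callers already relied on */
          if (!zone_watermark_ok_safe(zone, order,
                                      high_wmark_pages(zone) + balance_gap,
                                      classzone_idx, 0))
                  return false;

          /* for high-order allocations, compaction needs enough free memory */
          if (COMPACTION_BUILD && order && !compaction_suitable(zone, order))
                  return false;

          return true;
  }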
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: George Spelvin <linux@horizon.com>
Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reported-by: Tomas Racek <tracek@redhat.com>
Tested-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[hq: Backported to 3.4: adjust context]
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
An invalid ioctl will never be valid, irrespective of whether multipath
has active paths or not. So for invalid ioctls we do not have to wait
for multipath to activate any paths, but can rather return an error
code immediately. This fix resolves numerous instances of:
udevd[]: worker [] unexpectedly returned with status 0x0100
that have been seen during testing.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It appears that in the DMA40 driver the DMA tasklet will very
often dereference memory for a descriptor just freed from the
DMA40 slab. Nothing happens, because no other part of the driver
has yet had a chance to claim this memory, but it's really
nasty to dereference freed memory, so let's check the flag
before the descriptor is freed and store it in a bool variable.
Currently last dqput() can race with dquot_scan_active() causing it to
call callback for an already deactivated dquot. The race is as follows:
CPU1                                            CPU2
dqput()
  spin_lock(&dq_list_lock);
  if (atomic_read(&dquot->dq_count) > 1) {
    - not taken
  if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
    spin_unlock(&dq_list_lock);
    ->release_dquot(dquot);
      if (atomic_read(&dquot->dq_count) > 1)
        - not taken
                                                dquot_scan_active()
                                                  spin_lock(&dq_list_lock);
                                                  if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
                                                    - not taken
                                                  atomic_inc(&dquot->dq_count);
                                                  spin_unlock(&dq_list_lock);
      - proceeds to release dquot
                                                  ret = fn(dquot, priv);
                                                    - called for inactive dquot
Fix the problem by making sure a possibly running ->release_dquot() is
finished by the time we call the callback, and that new calls to it will
notice the reference dquot_scan_active() has taken and bail out.
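One way to achieve that, sketched under the assumption that
->release_dquot() runs under the dquot's dq_lock (so wait_on_dquot() can
synchronize with it):

  /* dquot_scan_active(), after taking the reference: */
  atomic_inc(&dquot->dq_count);
  spin_unlock(&dq_list_lock);
  wait_on_dquot(dquot);          /* let a running ->release_dquot() finish */
  if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags))  /* recheck before calling */
          ret = fn(dquot, priv);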
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When writing policy via /sys/fs/selinux/policy I wrote the type and class
of filename trans rules in CPU endian instead of little endian. On
x86_64 this works just fine, but it means that on big endian arch's like
ppc64 and s390 userspace reads the policy and converts it from
le32_to_cpu. So the values are all screwed up. Write the values in le
format, as it should have been from the start.
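A hedged sketch of the corrected write-out (field names are assumptions):

  /* write fixed-width fields in little-endian, not CPU, byte order */
  buf[0] = cpu_to_le32(ft->stype);
  buf[1] = cpu_to_le32(ft->ttype);
  buf[2] = cpu_to_le32(ft->tclass);
  rc = put_entry(buf, sizeof(u32), 3, fp);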
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Drew Richardson reported that he could make the kernel go *boom* when hotplugging
while having perf events active.
It turned out that when you have a group event, the code in
__perf_event_exit_context() fails to remove the group siblings from
the context.
We then proceed with destroying and freeing the event, and when you
re-plug the CPU and try and add another event to that CPU, things go
*boom* because you've still got dead entries there.
When a kworker should die, the kworker is notified through the WORKER_DIE
flag instead of kthread_should_stop(). This, IIRC, is primarily to
keep the test synchronized inside worker_pool lock. WORKER_DIE is
first set while holding pool->lock, the lock is dropped and
kthread_stop() is called.
Unfortunately, this means that there's a slight chance that the target
kworker may see WORKER_DIE before kthread_stop() finishes and exits
and frees the target task before or during kthread_stop().
Fix it by pinning the target task before setting WORKER_DIE and
putting it after kthread_stop() is done.
tj: Improved patch description and comment. Moved pinning above
WORKER_DIE to better signify what it's protecting.
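A sketch of the resulting shutdown sequence (simplified):

  struct task_struct *task = worker->task;

  get_task_struct(task);           /* pin before WORKER_DIE is visible */

  spin_lock_irq(&pool->lock);
  worker->flags |= WORKER_DIE;
  spin_unlock_irq(&pool->lock);

  kthread_stop(task);
  put_task_struct(task);           /* safe: we held our own reference */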
The following patch adds an entry for the PID of a Cressi Leonardo
diving computer interface to kernel 3.13.0.
It is detected as FT232RL.
Works with subsurface.
acpi_processor_set_throttling() uses set_cpus_allowed_ptr() to make
sure that the (struct acpi_processor)->acpi_processor_set_throttling()
callback will run on the right CPU. However, the function may be
called from a worker thread already bound to a different CPU in which
case that won't work.
Make acpi_processor_set_throttling() use work_on_cpu() as appropriate
instead of abusing set_cpus_allowed_ptr().
Reported-and-tested-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
[rjw: Changelog]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some devices have duplicate entries in their brightness levels table, e.g.
on my Dell Latitude E6430 the table looks like this:
[ 3.686060] acpi backlight index 0, val 80
[ 3.686095] acpi backlight index 1, val 50
[ 3.686122] acpi backlight index 2, val 5
[ 3.686147] acpi backlight index 3, val 5
[ 3.686172] acpi backlight index 4, val 5
[ 3.686197] acpi backlight index 5, val 5
[ 3.686223] acpi backlight index 6, val 5
[ 3.686248] acpi backlight index 7, val 5
[ 3.686273] acpi backlight index 8, val 6
[ 3.686332] acpi backlight index 9, val 7
[ 3.686356] acpi backlight index 10, val 8
[ 3.686380] acpi backlight index 11, val 9
etc.
Notice that brightness values 0-5 are all mapped to 5. This means that
if userspace writes any value between 0 and 5 to the brightness sysfs attribute
and then reads it, it will always return 0, which is somewhat unexpected.
This is a problem for e.g. gnome-settings-daemon, which uses read-modify-write
logic when the user presses the brightness up or down keys. This is done
this way to take brightness changes from other sources into account.
On this specific laptop what happens once the brightness has been set to 0,
is that gsd reads 0, adds 5, writes 5, and on the next brightness up key press
again reads 0, so things get stuck at the lowest brightness setting.
Filtering out the duplicate table entries makes any write to brightness
read back as the written value, as one would expect, fixing this.
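A minimal dedup sketch, assuming a levels[] array sorted as in the table
above (names illustrative):

  unsigned int i, count = 0;

  for (i = 0; i < nlevels; i++) {
          /* keep an entry only if it differs from the previous kept one */
          if (count == 0 || levels[count - 1] != levels[i])
                  levels[count++] = levels[i];
  }
  nlevels = count;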
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The reference count changes done by pci_get_device can be a little
misleading when the usage diverges from the most common scheme. The
reference count of the device passed as the last parameter is always
decreased, even if the function returns no new device. So if we are
going to try alternative device IDs, we must manually increment the
device reference count before each retry. If we don't, we end up
decreasing the reference count, and after a few modprobe/rmmod cycles
the PCI devices will vanish.
In other words and as Alan put it: without this fix the EDAC code
corrupts the PCI device list.
This fixes kernel bug #50491:
https://bugzilla.kernel.org/show_bug.cgi?id=50491
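A hedged sketch of the retry pattern (the device IDs are placeholders):

  /* pci_get_device() always drops the reference on its third argument,
     so take an extra reference before retrying with an alternative ID */
  pci_dev_get(pdev);
  pdev = pci_get_device(PCI_VENDOR_ID_INTEL, alternative_id, pdev);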
It's a bit odd to see a newer device showing mod15write; however, the
reported behavior is highly consistent and other factors which could
contribute seem to have been verified well enough. Also, both
sata_sil itself and the drive are fairly outdated at this point making
the risk of this change fairly low. It is possible, probably likely,
that other drive models in the same family have the same problem;
however, for now, let's just add the specific model which was tested.
Without the patch the kernel generates the following error.
ata11.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata11.15: Port Multiplier vendor mismatch '0x197b' != '0x123'
ata11.15: PMP revalidation failed (errno=-19)
ata11.15: failed to recover PMP after 5 tries, giving up
This patch helps to bypass this error and the device becomes
functional.
Vince "Super Tester" Weaver reported a new round of syscall fuzzing (Trinity) failures,
with perf WARN_ON()s triggering. He also provided traces of the failures.
We try and add the {BP,cycles,br_insn} group (fd[3], fd[4], fd[15]).
These events are 0:cycles and 4:br_insn, the BP event isn't x86_pmu so
that's not visible.
group_sched_in()
pmu->start_txn() /* nop - BP pmu */
event_sched_in()
event->pmu->add()
But seeing the below state on x86_pmu_enable(), they must have failed,
because the 0 and 4 events aren't there anymore.
Looking at group_sched_in(), since the BP is the leader, its
event_sched_in() must have succeeded, for otherwise we would not have
seen the sibling adds.
But since neither 0 nor 4 is in the below state, their event_sched_in()
must have failed; but I don't see why, since the complete state 0,0,1:p,4
fits perfectly fine on a core2.
However, since we try and schedule 4 it means the 0 event must have
succeeded! Therefore the 4 event must have failed, its failure will
have put group_sched_in() into the fail path, which will call:
event_sched_out()
event->pmu->del()
on 0 and the BP event.
Now x86_pmu_del() will reduce n_events; but it will not reduce n_added;
giving what we see below:
So the problem is that x86_pmu_del(), when called from a
group_sched_in() that fails (for whatever reason), and without x86_pmu
TXN support (because the leader is !x86_pmu), will corrupt the n_added
state.
Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/r/20140221150312.GF3104@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In copy_oldmem_page, the current check using max_pfn and min_low_pfn to
decide whether the page is backed or not is not valid when the memory
layout is not contiguous.
This happens when running as a QEMU/KVM guest, where RTAS is mapped higher
in the memory. In that case max_pfn points to the end of RTAS, and a hole
between the end of the kdump kernel and RTAS is not backed by PTEs. As a
consequence, the kdump kernel crashes in copy_oldmem_page when directly
accessing the pages in that hole.
This fix relies on the memblock service memblock_is_region_memory to
check whether the page being read is part of the directly accessible memory.
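A hedged sketch of that check; copy_oldmem_vaddr() is a hypothetical
helper here, and the mapping primitive may differ by architecture:

  paddr = pfn << PAGE_SHIFT;
  if (memblock_is_region_memory(paddr, csize)) {
          /* backed by the direct mapping */
          csize = copy_oldmem_vaddr(__va(paddr), buf, csize, offset, userbuf);
  } else {
          /* hole: create an explicit mapping first */
          vaddr = ioremap_cache(paddr, PAGE_SIZE);
          csize = copy_oldmem_vaddr(vaddr, buf, csize, offset, userbuf);
          iounmap(vaddr);
  }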
When a send failure occurs due to the socket being out of buffer space,
we call xs_nospace() in order to have the RPC task wait until the
socket has drained enough to make it worth while trying again.
The current patch fixes a race in which the socket is drained before
we get round to setting up the machinery in xs_nospace(), and which
is reported to cause hangs.
Link: http://lkml.kernel.org/r/20140210170315.33dfc621@notabene.brown
Fixes: a9a6b52ee1ba (SUNRPC: Don't start the retransmission timer...)
Reported-by: Neil Brown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a typo in the Limiter2 Release Rate control, a wrong enum for
Limiter1 is assigned. It must point to Limiter2.
Spotted by a compile warning:
In file included from sound/soc/codecs/sta32x.c:34:0:
sound/soc/codecs/sta32x.c:223:29: warning: ‘sta32x_limiter2_release_rate_enum’ defined but not used [-Wunused-variable]
static SOC_ENUM_SINGLE_DECL(sta32x_limiter2_release_rate_enum,
^
include/sound/soc.h:275:18: note: in definition of macro ‘SOC_ENUM_DOUBLE_DECL’
struct soc_enum name = SOC_ENUM_DOUBLE(xreg, xshift_l, xshift_r, \
^
sound/soc/codecs/sta32x.c:223:8: note: in expansion of macro ‘SOC_ENUM_SINGLE_DECL’
static SOC_ENUM_SINGLE_DECL(sta32x_limiter2_release_rate_enum,
^
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Mark Brown <broonie@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the driver tries to access Function Unit 10, the KEF X300A
speakers' firmware apparently locks up, making even PCM streaming
impossible. Work around this by ignoring this FU.
[ use zero netdev_feature mask to avoid backport of
netif_skb_dev_features function ]
Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.
Given:
Host <mtu1500> R1 <mtu1200> R2
Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.
In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.
When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.
This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.
For ipv6, we send out pkt too big error for gso if the individual
segments are too big.
For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.
Eric Dumazet suggested simply shrinking the mss via ->gso_size to avoid
software segmentation.
However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there. I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.
Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.
This uses software segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small. It's not perfect,
but given the lack of bug reports wrt. GRO forwarding being broken, this is
a rare case anyway. Also, it's not like this could not be improved later
once the dust settles.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ no skb_gso_seglen helper in 3.4, leave tbf alone ]
This moves part of Eric Dumazet's skb_gso_seglen helper from the tbf sched
to the skbuff core so it may be reused by the upcoming ip forwarding path
patch.
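A simplified sketch of such a helper (a complete version would also
account for the transport header length):

  static unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
  {
          unsigned int hdr_len = skb_transport_header(skb) -
                                 skb_network_header(skb);

          return hdr_len + skb_shinfo(skb)->gso_size;
  }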
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
SCTP's sctp_connectx() abi breaks for 64bit kernels compiled with 32bit
emulation (e.g. ia32 emulation or x86_x32). Due to internal usage of
'struct sctp_getaddrs_old' which includes a struct sockaddr pointer,
sizeof(param) check will always fail in kernel as the structure in
64bit kernel space is 4 bytes larger than for user binaries compiled
in 32bit mode. Thus, applications making use of sctp_connectx() won't
be able to run under such circumstances.
Introduce a compat interface in the kernel to deal with such
situations by using a 'struct compat_sctp_getaddrs_old' structure
where user data is copied into it, and then successively transformed
into a 'struct sctp_getaddrs_old' structure with the help of
compat_ptr(). That fixes the sctp_connectx() abi without any changes
needed in user space, and lets the SCTP test suite pass when compiled
in 32bit and run on 64bit kernels.
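A sketch of the compat structure and conversion (layout per the
description above):

  struct compat_sctp_getaddrs_old {
          sctp_assoc_t    assoc_id;
          s32             addr_num;
          compat_uptr_t   addrs;          /* 32-bit user pointer */
  };

  struct compat_sctp_getaddrs_old param32;

  if (copy_from_user(&param32, optval, sizeof(param32)))
          return -EFAULT;
  param.assoc_id = param32.assoc_id;
  param.addr_num = param32.addr_num;
  param.addrs    = compat_ptr(param32.addrs);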
Fixes: f9c67811ebc0 ("sctp: Fix regression introduced by new sctp_connectx api")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch removes a generic hard_header_len check from the usbnet
module that is causing dropped packets under certain circumstances
for devices that send rx packets that cross urb boundaries.
One example is the AX88772B, which occasionally sends rx packets that
cross urb boundaries, where the remaining partial packet is sent with
no hardware header. When the buffer with a partial packet has fewer
octets than the value of hard_header_len, the buffer is discarded by
the usbnet module.
With AX88772B this can be reproduced by using ping with a packet
size between 1965-1976.
The bug has been reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=29082
This patch introduces the following changes:
- Removes the generic hard_header_len check in the rx_complete
function in the usbnet module.
- Introduces a ETH_HLEN check for skbs that are not cloned from
within a rx_fixup callback.
- For safety a hard_header_len check is added to each rx_fixup
callback function that could be affected by this change.
These extra checks could possibly be removed by someone
who has the hardware to test.
- Removes a call to dev_kfree_skb_any() and instead utilizes the
dev->done list to queue skbs for cleanup.
The changes place full responsibility on the rx_fixup callback
functions that clone skbs to only pass valid skbs to the
usbnet_skb_return function.
Signed-off-by: Emil Goode <emilgoode@gmail.com>
Reported-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
aggregator_identifier is used to assign unique aggregator identifiers
to aggregators of a bond during device enslaving.
aggregator_identifier is currently a global variable that is zeroed in
bond_3ad_initialize().
This sequence will lead to duplicate aggregator identifiers for eth1 and eth3:
create bond0
change bond0 mode to 802.3ad
enslave eth0 to bond0 //eth0 gets agg id 1
enslave eth1 to bond0 //eth1 gets agg id 2
create bond1
change bond1 mode to 802.3ad
enslave eth2 to bond1 //aggregator_identifier is reset to 0
//eth2 gets agg id 1
enslave eth3 to bond0 //eth3 gets agg id 2
Fix this by making aggregator_identifier private to the bond.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Quoting David Vrabel -
"5780 cards cannot have jumbo frames and TSO enabled together. When
jumbo frames are enabled by setting the MTU, the TSO feature must be
cleared. This is done indirectly by calling netdev_update_features()
which will call tg3_fix_features() to actually clear the flags.
netdev_update_features() will also trigger a new netlink message for the
feature change event which will result in a call to tg3_get_stats64()
which deadlocks on the tg3 lock."
tg3_set_mtu() does not need to be under the tg3 lock since converting
the flags to use set_bit(). Move it out to after tg3_netif_stop().
Reported-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ip rules with iif/oif references do not update (detach/attach) across
interface renames.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Willem de Bruijn <willemb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Chris Davis <chrismd@google.com>
CC: Carlo Contavalli <ccontavalli@google.com>
Google-Bug-Id: 12936021
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1. wpa_supplicant send a start_scan request to the nl80211 driver
2. mac80211 module call rtl_op_config with IEEE80211_CONF_CHANGE_IDLE
3. rtl_ips_nic_on is called which disable local irqs
4. rtl92c_phy_set_rf_power_state() is called
5. rtl_ps_enable_nic() is called; hw_init() is executed and then the interrupts on the device are enabled
A good solution would be to refactor the code to avoid calling rtl92ce_hw_init() with irqs disabled,
but a quick and dirty solution that has proven to work is
to re-enable the irqs during rtl92ce_hw_init().
I think that this is safe, since the device interrupt will only be enabled after the init function succeeds.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Acked-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes a regression caused by commit a16dad77634 "MIPS: Fix
potencial corruption". That commit fixes one corruption scenario at the
cost of adding another one, which actually started to cause crashes
on the Yeeloong laptop when the rtl8187 driver is used.
For correct DMA read operations on machines without DMA coherence, the
kernel has to invalidate the cache, so that it is refilled later with the
new data the device wrote to memory, when that data is needed. We can only
invalidate a full cache line. Hence, when a cache line includes both a DMA
buffer and some other data (written to the cache but not yet to main
memory), the other data cannot reach memory due to the invalidation. That
happens in rtl8187, where struct rtl8187_priv fields are located just
before and after small buffers that are passed to the USB layer and DMA
is performed on them.
To fix the problem, we align the buffers and reserve space after them to
make them match the cache line.
This patch does not resolve all possible MIPS problems entirely; for
that we would have to ensure that we always map cache-aligned buffers for
DMA, which can be complex or even impossible. But the patch fixes the
visible and reproducible regression, and it seems other possible
corruptions do not happen in practice, since the Yeeloong laptop works
stably without the rtl8187 driver.
Reported-by: Petr Pisar <petr.pisar@atlas.cz>
Bisected-by: Tom Li <biergaizi2009@gmail.com>
Reported-and-tested-by: Tom Li <biergaizi2009@gmail.com>
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
Acked-by: Larry Finger <Larry.Finger@lwfinger.next>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's possible for userland to pass down an iovec via writev() that has a
bogus user pointer in it. If that happens and we're doing an uncached
write, then we can end up getting less bytes than we expect from the
call to iov_iter_copy_from_user. This is CVE-2014-0069
cifs_iovec_write isn't set up to handle that situation however. It'll
blindly keep chugging through the page array and not filling those pages
with anything useful. Worse yet, we'll later end up with a negative
number in wdata->tailsz, which will confuse the sending routines and
cause an oops at the very least.
Fix this by having the copy phase of cifs_iovec_write stop copying data
in this situation and send the last write as a short one. At the same
time, we want to avoid sending a zero-length write to the server, so
break out of the loop and set rc to -EFAULT if that happens. This also
allows us to handle the case where no address in the iovec is valid.
[Note: Marking this for stable on v3.4+ kernels, but kernels as old as
v2.6.38 may have a similar problem and may need similar fix]
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The avr32 cross compiler does not define '__linux__' internally, which
causes an issue with allmodconfig.
The related error:
CC [M] fs/coda/psdev.o
In file included from include/linux/coda.h:64,
from fs/coda/psdev.c:45:
include/uapi/linux/coda.h:221: error: expected specifier-qualifier-list before 'u_quad_t'
The related toolchain version (downloaded only, not re-compiled):
In file included from arch/avr32/boards/mimc200/fram.c:13:
include/linux/miscdevice.h:51: error: field 'list' has incomplete type
include/linux/miscdevice.h:55: error: expected specifier-qualifier-list before 'mode_t'
arch/avr32/boards/mimc200/fram.c:42: error: 'THIS_MODULE' undeclared here (not in a function)
During __v{6,7}_setup, we invalidate the TLBs since we are about to
enable the MMU on return to head.S. Unfortunately, without a subsequent
dsb instruction, the invalidation is not guaranteed to have completed by
the time we write to the sctlr, potentially exposing us to junk/stale
translations cached in the TLB.
This patch reworks the init functions so that the dsb used to ensure
completion of cache/predictor maintenance is also used to ensure
completion of the TLB invalidation.
Reported-by: Albin Tonnerre <Albin.Tonnerre@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The set_flexbg_block_bitmap() function assumed that the number of
blocks in a blockgroup was sb->blocksize * 8, which is normally true,
but not always! Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block
bitmap corruption after an online resize.
If an ext4 file system is created by some tool other than mke2fs
(perhaps by someone who has a pathological fear of the GPL) that
doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags,
and that file system is then mounted read-only, don't try to modify
the s_flags field. Otherwise, if dm_verity is in use, the superblock
will change, causing a dm_verity failure.
In allmodconfig builds for sparc and any other arch which does
not set CONFIG_SPARSE_IRQ, the following will be seen at modpost:
CC [M] lib/cpu-notifier-error-inject.o
CC [M] lib/pm-notifier-error-inject.o
ERROR: "irq_to_desc" [drivers/gpio/gpio-mcp23s08.ko] undefined!
make[2]: *** [__modpost] Error 1
This happens because commit 3911ff30f5 ("genirq: export
handle_edge_irq() and irq_to_desc()") added one export for it, but
there were actually two instances of it, in an if/else clause for
CONFIG_SPARSE_IRQ. Add the second one.
Each sub-buffer (buffer page) has a full 64 bit timestamp. The events on
that page use a 27 bit delta against that timestamp in order to save on
bits written to the ring buffer. If the time between events is larger than
what the 27 bits can hold, a "time extend" event is added to hold the
entire 64 bit timestamp again and the events after that hold a delta from
that timestamp.
As a "time extend" is always paired with an event, it is logical to just
allocate the event with the time extend, to make things a bit more efficient.
Unfortunately, when the pairing code was written, it removed the "delta = 0"
from the first commit on a page, causing the events on the page to be
slightly skewed.
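Conceptually, the restored behavior is (a sketch, not the actual code):

  /* the first event on a sub-buffer carries no delta: the page header
     already stores the full 64-bit timestamp */
  if (!tail)
          delta = 0;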
Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix NULL pointer dereference of "chip->pdata" if platform_data was not
supplied to the driver.
The driver during probe stored the pointer to the platform_data:
chip->pdata = client->dev.platform_data;
Later it was dereferenced in max17040_get_online() and
max17040_get_status().
If platform_data was not supplied, a NULL pointer exception would
happen.
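A hedged sketch of the guarded accessor (the callback names follow the
driver's platform data and are treated as assumptions):

  static int max17040_get_online(struct i2c_client *client)
  {
          struct max17040_chip *chip = i2c_get_clientdata(client);

          if (chip->pdata && chip->pdata->battery_online)
                  return chip->pdata->battery_online();
          return 1;       /* assume present when no callback is supplied */
  }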
When compiling for the IA-64 ski emulator, HZ is set to 32 because the
emulation is slow and we don't want to waste too many cycles processing
timers. Alpha also has an option to set HZ to 32.
This causes integer underflow in
kernel/time/jiffies.c:
kernel/time/jiffies.c:66:2: warning: large integer implicitly truncated to unsigned type [-Woverflow]
.mult = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */
^
This patch reduces the JIFFIES_SHIFT value to avoid the overflow.
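As a worked example: at HZ=32, NSEC_PER_JIFFY is 10^9/32 = 31,250,000.
With the old JIFFIES_SHIFT of 8 the .mult value would be
31,250,000 << 8 = 8,000,000,000, which exceeds the 32-bit maximum of
4,294,967,295; a shift of 6, for instance, yields 2,000,000,000 and fits.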
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:
register_cpu_notifier(&foobar_cpu_notifier);
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
put_online_cpus();
A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.
So reorganize the code in raid5 this way to fix the deadlock with callback
registration.
Cc: linux-raid@vger.kernel.org
Fixes: 36d1c6476be51101778882897b315bd928c8c7b5
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If kvm_io_bus_register_dev() fails then it returns success but it should
return an error code.
I also did a little cleanup like removing an impossible NULL test.
Fixes: 2b3c246a682c ('KVM: Make coalesced mmio use a device per zone')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When mkfs issues a full device discard and the device only
supports discards of a smallish size, we can loop in
blkdev_issue_discard() for a long time. If preempt isn't enabled,
this can turn into a softlock situation and the kernel will
start complaining.
Add an explicit cond_resched() at the end of the loop to avoid
that.
Commit afe2dab4f6 ("USB: add hex/bcd detection to usb modalias generation")
changed the routine that generates alias ranges. Before that change, only
digits 0-9 were supported; the commit tried to fix the case when the range
includes higher values than 0x9.
Unfortunately, the commit didn't fix the case when the range includes both
0x9 and 0xA, meaning that the final range must look like [x-9A-y] where
x <= 0x9 and y >= 0xA -- instead the [x-9A-x] range was produced.
Modprobe doesn't complain as it sees no difference between no-match and
bad-pattern results of fnmatch().
Fix this simple bug to fix the aliases.
Also change the hardcoded beginning of the range to uppercase, as all the
other letters are also uppercase in the device version numbers.
Fortunately, this affects only the dvb-usb-dib0700 module, AFAIK.
Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
People sometimes create their own custom-configured kernels and forget
to enable CONFIG_SCSI_MULTI_LUN. This causes problems when they plug
in a USB storage device (such as a card reader) with more than one
LUN.
Fortunately, we can tell fairly easily when a storage device claims to
have more than one LUN. When that happens, this patch asks the SCSI
layer to probe all the LUNs automatically, regardless of the config
setting.
The patch also updates the Kconfig help text for usb-storage,
explaining that CONFIG_SCSI_MULTI_LUN may be necessary.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Thomas Raschbacher <lordvan@lordvan.com>
CC: Matthew Dharm <mdharm-usb@one-eyed-alien.net>
CC: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Cypress ATACB unusual-devs entry for the Super Top SATA bridge
causes problems. Although it was originally reported only for
bcdDevice = 0x160, its range was much larger. This resulted in a bug
report for bcdDevice 0x220, so the range was capped at 0x219. Now
Milan reports errors with bcdDevice 0x150.
Therefore this patch restricts the range to just 0x160.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-and-tested-by: Milan Svoboda <milan.svoboda@centrum.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3GPP TS 07.10 states in section 5.4.6.3.7:
"The length byte contains the value 2 or 3 ... depending on the break
signal." The break byte is optional and, if it is sent, the length is
3. In fact, the driver was not able to work with modems that send this
break byte in their modem status control message. If the modem only
sends the break byte when it is really set, then weird things might
happen.
The code for decoding the modem status to the internal linux
presentation in gsm_process_modem already has a big comment about
this 2- or 3-byte length thing, and it is already able to decode the
brk, but the code calling the gsm_process_modem function in
gsm_control_modem does not encode it and hand it over the right way.
This patch fixes this.
Without this fix, if the modem sends the brk byte in its modem status
control message, the driver will hang when opening a muxed channel.
Signed-off-by: Lars Poeschel <poeschel@lemonage.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If an NFS client attempts to get a lock (using NLM) and the lock is
not available, the server will remember the request and when the lock
becomes available it will send a GRANT request to the client to
provide the lock.
If the client already held an adjacent lock, the GRANT callback will
report the union of the existing and new locks, which can confuse the
client.
This happens because __posix_lock_file (called by vfs_lock_file)
updates the passed-in file_lock structure when adjacent or
over-lapping locks are found.
To avoid this problem we take a copy of the two fields that can
be changed (fl_start and fl_end) before the call and restore them
afterwards.
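A sketch of that save/restore (local variable names are illustrative):

  loff_t fl_start = lock->fl.fl_start;
  loff_t fl_end   = lock->fl.fl_end;

  error = vfs_lock_file(file->f_file, cmd, &lock->fl, NULL);

  /* undo any coalescing __posix_lock_file() applied to the request */
  lock->fl.fl_start = fl_start;
  lock->fl.fl_end   = fl_end;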
An alternative would be to allocate a 'struct file_lock', initialise it,
use locks_copy_lock() to take a copy, then locks_release_private()
after the vfs_lock_file() call. But that is a lot more work.
Reported-by: Olaf Kirch <okir@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
--
v1 had a couple of issues (large on-stack struct and didn't really work properly). This version is much better tested.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
bind_get() checks the device number it is called with. It uses
MAX_RAW_MINORS for the upper bound. But MAX_RAW_MINORS is set at compile
time while the actual number of raw devices can be set at runtime. This
means the test can either be too strict or too lenient. And if the test
ends up being too lenient bind_get() might try to access memory beyond
what was allocated for "raw_devices".
So check against the runtime value (max_raw_minors) in this function.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It causes a NULL pointer dereference with drivers using the generic
spi_transfer_one_message(), which always calls
spi_finalize_current_message(), which zeroes master->cur_msg.
Drivers implementing transfer_one_message() themselves must always call
spi_finalize_current_message(), even if the transfer failed:
* @transfer_one_message: the subsystem calls the driver to transfer a single
* message while queuing transfers that arrive in the meantime. When the
* driver is finished with this message, it must call
* spi_finalize_current_message() so the subsystem can issue the next
* transfer
Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org>
Signed-off-by: Mark Brown <broonie@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The kernel currently crashes with a low-address-protection exception
if a user space process executes an instruction that tries to use the
linkage stack. Set the base-ASTE origin and the subspace-ASTE origin
of the dispatchable-unit-control-table to point to a dummy ASTE.
Set up control register 15 to point to an empty linkage stack with no
room left.
A user space process with a linkage stack instruction will still crash
but with a different exception which is correctly translated to a
segmentation fault instead of a kernel oops.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The memory detection function find_memory_chunks() issues tprot to
find valid memory chunks. In case of CMM it can happen that pages are
marked as unstable via set_page_unstable() in arch_free_page().
If z/VM has released those pages, tprot returns -EFAULT and indicates
a memory hole.
So fix this by switching off CMM in the case of kdump or zfcpdump.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The "new" fragmentation code (since my rewrite almost 5 years ago)
erroneously sets skb->len rather than using skb_trim() to adjust
the length of the first fragment after copying out all the others.
This leaves the skb tail pointer pointing to after where the data
originally ended, and thus causes the encryption MIC to be written
at that point, rather than where it belongs: immediately after the
data.
The impact of this is that if software encryption is done, then
a) encryption doesn't work for the first fragment, the connection
becomes unusable as the first fragment will never be properly
verified at the receiver, the MIC is practically guaranteed to
be wrong
b) we leak up to 8 bytes of plaintext (!) of the packet out into
the air
This is only mitigated by the fact that many devices are capable
of doing encryption in hardware, in which case this can't happen
as the tail pointer is irrelevant in that case. Additionally,
fragmentation is not used very frequently and would normally have
to be configured manually.
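Conceptually, the fix replaces the direct length assignment with a trim,
so the tail pointer moves along with skb->len (hdrlen/fraglen are
illustrative):

  /* instead of: skb->len = hdrlen + fraglen; */
  skb_trim(skb, hdrlen + fraglen);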
Recently due to a spike in connections per second memcached on 3
separate boxes triggered the OOM killer from accept. At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1. The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.
I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.
There are always pathologies where order > 0 allocations can fail when
there are copious amounts of free memory available. Using the pigeon
hole principle it is easy to show that it requires 1 page more than 50%
of the pages being free to guarantee an order 1 (8KiB) allocation will
succeed, 1 page more than 75% of the pages being free to guarantee an
order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
the pages being free to guarantee an order 3 allocate will succeed.
A server churning memory with a lot of small requests and replies, like
memcached, is a common case that, if anything, will skew the odds
against large pages being available.
Therefore, let's not give external applications a practical way to kill
Linux server applications, and specify __GFP_NORETRY for the kmalloc in
alloc_fdmem. Unless I am misreading the code, by the time we reach
should_alloc_retry() in __alloc_pages_slowpath() (where __GFP_NORETRY
becomes significant), we have already tried everything reasonable to
allocate a page and the only thing left to do is wait. So not waiting
and falling back to vmalloc() immediately seems like the reasonable
thing to do, even if there weren't a chance of triggering the OOM
killer.
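A sketch of the resulting helper (close to, though not necessarily identical to, the final code):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static void *alloc_fdmem(size_t size)
    {
            /*
             * Try a physically contiguous allocation exactly once,
             * without retrying and without waking the OOM killer;
             * on failure fall straight back to vmalloc().
             */
            void *data = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);

            if (data)
                    return data;
            return vmalloc(size);
    }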
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Cong Wang <cwang@twopensource.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Backend drivers shouldn't transition to CLOSED unless the frontend is
CLOSED. If a backend transitions to CLOSED too soon, then the
frontend may not see the CLOSING state and will not properly shut down.
So, treat an unexpected backend CLOSED state the same as CLOSING.
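In a frontend's otherend-changed handler this amounts to something like the following sketch, where frontend_closing() is a hypothetical stand-in for the driver's shutdown path:

    case XenbusStateClosed:
            /* unexpected CLOSED: the backend skipped CLOSING, so run
             * the same shutdown path anyway */
            /* fall through */
    case XenbusStateClosing:
            frontend_closing(dev);          /* hypothetical helper */
            break;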
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
...and ensure that we tear down the nfs_commit_data cache too when
unloading the module.
Cc: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[bwh: Backported to 3.2: drop the nfs_cdata_cachep cleanup; it doesn't exist] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The intent behind this behavior was to return "pK-error" in cases where
the %pK format specifier was used in interrupt context, because the
CAP_SYSLOG check wouldn't be meaningful. Clearly this should only apply
when kptr_restrict is actually enabled though.
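Sketched against vsprintf's pointer() handler (string() being its internal formatting helper), the corrected condition looks roughly like:

    case 'K':
            /*
             * Only short-circuit to "pK-error" in interrupt context
             * when kptr_restrict is actually enabled; with
             * kptr_restrict == 0, %pK should behave like %p.
             */
            if (kptr_restrict &&
                (in_irq() || in_serving_softirq() || in_nmi()))
                    return string(buf, end, "pK-error", spec);
            break;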
Reported-by: Stevie Trujillo <stevie.trujillo@gmail.com> Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The block layer will allocate a spinlock for the queue if the driver does
not provide one in blk_init_queue().
The reason to use the internal spinlock is that blk_cleanup_queue() will
switch to use the internal spinlock in the cleanup code path.
if (q->queue_lock != &q->__queue_lock)
q->queue_lock = &q->__queue_lock;
However, a process in D state might have taken the driver-provided
spinlock; when it wakes up, it will release the block layer's internal
spinlock instead:
=====================================
[ BUG: bad unlock balance detected! ]
3.4.0-rc7+ #238 Not tainted
-------------------------------------
fio/3587 is trying to release lock (&(&q->__queue_lock)->rlock) at:
[<ffffffff813274d2>] blk_queue_bio+0x2a2/0x380
but there are no more locks to release!
other info that might help us debug this:
1 lock held by fio/3587:
#0: (&(&vblk->lock)->rlock){......}, at:
[<ffffffff8132661a>] get_request_wait+0x19a/0x250
Other drivers use the block layer provided spinlock as well, e.g. SCSI.
Switching to the block layer provided spinlock saves a bit of memory and
does not increase lock contention. Performance tests show no real
difference before and after this patch.
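The driver-side change is essentially a one-liner, sketched here:

    /* before: hand the block layer the driver's own lock */
    q = blk_init_queue(do_virtblk_request, &vblk->lock);

    /* after: pass NULL so blk_init_queue() uses its internal
     * q->__queue_lock, which blk_cleanup_queue() switches to anyway */
    q = blk_init_queue(do_virtblk_request, NULL);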
Changes in v2: Improve commit log as Michael suggested.
Cc: virtualization@lists.linux-foundation.org Cc: kvm@vger.kernel.org Signed-off-by: Asias He <asias@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The touchpad on the Acer Aspire One D250 will report out of range values
in the extreme lower portion of the touchpad. These appear as abrupt
changes in the values reported by the hardware from very low values to
very high values, which can cause unexpected vertical jumps in the
position of the mouse pointer.
What seems to be happening is that the value is wrapping to a two's
complement negative value of higher resolution than the 13-bit value
reported by the hardware, with the high-order bits being truncated. This
patch adds handling for these values by converting them to the
appropriate negative values.
The only tricky part about this is deciding when to treat a number as
negative. It stands to reason that if out of range values can be
reported on the low end then it could also happen on the high end, so
not all out of range values should be treated as negative. The approach
taken here is to split the difference between the maximum legitimate
value for the axis and the maximum possible value that the hardware can
report, treating values greater than this number as negative and all
other values as positive. This can be tweaked later if hardware is found
that operates outside of these parameters.
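A sketch of the conversion, assuming 13-bit hardware values and XMAX as the highest legitimate X coordinate (the macro names are illustrative):

    #define ABS_POS_BITS    13
    /* midpoint between the largest legitimate value and the largest
     * reportable value; anything above is taken as a wrapped negative */
    #define X_MAX_POSITIVE  (((1 << ABS_POS_BITS) + XMAX) / 2)

    if (hw->x > X_MAX_POSITIVE)
            hw->x -= 1 << ABS_POS_BITS;     /* undo the truncated wrap */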
BugLink: http://bugs.launchpad.net/bugs/1001251 Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Reviewed-by: Daniel Kurtz <djkurtz@chromium.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1. Do not allocate memory for buffers from emergency pools, unless
absolutely required. Do not warn about and do not retry non-essential
failed allocations (see the sketch after this list).
2. Do not check the amount of free pages left on every single page
write, but wait until one map is completely populated and then check.
3. Set maximum number of pages for read buffering consistently, instead
of inadvertently depending on the size of the sector type.
4. Fix copyright line, which I missed when I submitted the hibernation
threading patch.
5. Dispense with bit shifting arithmetic to improve readability.
6. Really recalculate the number of pages required to be free after all
allocations have been done.
7. Fix calculation of pages required for read buffering. Only count in
pages that do not belong to high memory.
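For item 1, the allocation-flag change can be sketched as follows (the exact call sites and flags in the patch may differ):

    /* best-effort buffer page: don't dip into emergency pools, don't
     * retry, don't warn -- a NULL return is handled by the caller */
    page = (void *)get_zeroed_page(GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
    if (!page)
            break;  /* non-essential buffer: just stop growing the pool */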
Signed-off-by: Bojan Smojver <bojan@rexursive.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kvm_set_irq() has an internal buffer of three irq routing entries, allowing
a GSI to be connected to three IRQ chips or one MSI. However setup_routing_entry()
does not properly enforce this, allowing three irqchip routes followed by
an MSI route to overflow the buffer.
Fix by ensuring that an MSI entry is added to an empty list.
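The enforcement can be sketched as an extra check in setup_routing_entry() (names abbreviated from the kvm routing code):

    /*
     * An MSI entry may only be installed on an otherwise empty GSI
     * list; three irqchip routes followed by an MSI route would
     * otherwise overflow kvm_set_irq()'s three-entry buffer.
     */
    if (ue->type == KVM_IRQ_ROUTING_MSI && !hlist_empty(&rt->map[ue->gsi]))
            return -EINVAL;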
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch re-adds the ability to optionally run in buffered FILEIO mode
(e.g. without O_DSYNC) for device backends in order to once again use the
Linux buffered cache as a write-back storage mechanism.
This logic was originally dropped with mainline v3.5-rc commit:
target/file: Use O_DSYNC by default for FILEIO backends
The difference with this patch is that fd_create_virtdevice() now
forces the explicit setting of emulate_write_cache=1 when buffered FILEIO
operation has been enabled.
(v2: Switch to FDBD_HAS_BUFFERED_IO_WCE + add more detailed
comment as requested by hch)
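The forcing in fd_create_virtdevice() looks roughly like this (field path as in the 3.x target code; treat it as a sketch):

    /* buffered FILEIO is a volatile page-cache write-back store, so
     * the device must advertise a write cache (WCE=1) to initiators */
    if (flags & FDBD_HAS_BUFFERED_IO_WCE)
            dev->se_sub_dev->se_dev_attrib.emulate_write_cache = 1;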
Reported-by: Ferry <iscsitmp@bananateam.nl> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Convert to use O_DSYNC for all cases at FILEIO backend creation time to
avoid the extra syncing of pure timestamp updates with legacy O_SYNC during
default operation, as recommended by hch. Continue to do this independently of
the Write Cache Enable (WCE) bit, as WCE=0 is currently the default for all backend
devices and is enabled by the user on a per-device basis via attrib/emulate_write_cache.
This patch drops the now unnecessary fd_buffered_io= token usage that was
originally signalling when to explicitly disable O_SYNC at backend creation
time for buffered I/O operation. This can end up being dangerous for a number
of reasons during physical node failure, so go ahead and drop this option
for now when O_DSYNC is used as the default.
Also allow explicit FUA WRITEs -> vfs_fsync_range() call to function in
fd_execute_cmd() independently of WCE bit setting.
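The backend's creation-time open flags then boil down to this sketch:

    /* O_DSYNC syncs the data (and the metadata needed to reach it) on
     * every write, but skips the pure timestamp updates that legacy
     * O_SYNC would also flush */
    flags = O_RDWR | O_CREAT | O_LARGEFILE | O_DSYNC;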
Reported-by: Christoph Hellwig <hch@lst.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[bwh: Backported to 3.2:
- We have fd_do_task() and not fd_execute_cmd()
- Various fields are in struct se_task rather than struct se_cmd
- fd_create_virtdevice() flags initialisation hasn't been cleaned up] Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
qib_user_sdma_queue_pkts() gets called with mmap_sem held for
writing. Except for get_user_pages() deep down in
qib_user_sdma_pin_pages() we don't seem to need mmap_sem at all. Even
more interestingly the function qib_user_sdma_queue_pkts() (and also
qib_user_sdma_coalesce() called somewhat later) call copy_from_user()
which can hit a page fault and we deadlock on trying to get mmap_sem
when handling that fault.
So just make qib_user_sdma_pin_pages() use get_user_pages_fast() and
leave the mmap_sem locking to the mm code.
This deadlock has actually been observed in the wild when the node
is under memory pressure.
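The shape of the change, sketched with the 3.x-era APIs:

    /* before: the caller wraps everything in mmap_sem, including
     * copy_from_user() calls whose fault path takes mmap_sem too */
    down_write(&current->mm->mmap_sem);
    ret = get_user_pages(current, current->mm, addr, npages,
                         0, 1, pages, NULL);
    up_write(&current->mm->mmap_sem);

    /* after: no mmap_sem in the driver at all */
    ret = get_user_pages_fast(addr, npages, 0, pages);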
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Roland Dreier <roland@purestorage.com>
[Backported to 3.4: (Thanks to Ben Hutchings)
- Adjust context
- Adjust indentation and nr_pages argument in qib_user_sdma_pin_pages()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations"): while that fixed the busy case, it regressed the
mostly idle case.
Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.
So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
While investigating why the load-balancer behaved oddly I found that the
rq->cpu_load[] tables were completely screwy. A bit more digging
revealed that the updates that got through were missing ticks, followed
by a catch-up of 2 ticks.
The catch-up assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle, etc.), which means that
especially the higher indices were significantly lower than they ought
to be.
The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.
This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.
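The nohz-exit aging can be sketched like this, with decay_load_missed() as the per-index decay helper and an illustrative wrapper name:

    static void age_cpu_load_on_nohz_exit(struct rq *this_rq,
                                          unsigned long missed_updates)
    {
            int i;

            /* index 0 is the instantaneous load; only the averaged
             * indices need decaying for the ticks we slept through */
            for (i = 1; i < CPU_LOAD_IDX_MAX; i++)
                    this_rq->cpu_load[i] =
                            decay_load_missed(this_rq->cpu_load[i],
                                              missed_updates, i);
    }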
Doing some different tests, I discovered that function graph tracing, when
filtered via the set_ftrace_filter and set_ftrace_notrace files, does
not always honor them if another function ftrace_ops is registered
to trace functions.
The reason is that function graph just happens to trace all functions
that the function tracer enables. When there was only one user of
function tracing, the function graph tracer did not need to worry about
being called by functions that it did not want to trace. But now that there
are other users, this becomes a problem.
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
# cat trace
[..]
1) + 20.825 us | }
1) + 21.651 us | }
1) + 30.924 us | } /* SyS_ioctl */
1) | do_page_fault() {
1) | __do_page_fault() {
1) 0.274 us | down_read_trylock();
1) 0.098 us | find_vma();
1) | handle_mm_fault() {
1) | _raw_spin_lock() {
1) 0.102 us | preempt_count_add();
1) 0.097 us | do_raw_spin_lock();
1) 2.173 us | }
1) | do_wp_page() {
1) 0.079 us | vm_normal_page();
1) 0.086 us | reuse_swap_page();
1) 0.076 us | page_move_anon_rmap();
1) | unlock_page() {
1) 0.082 us | page_waitqueue();
1) 0.086 us | __wake_up_bit();
1) 1.801 us | }
1) 0.075 us | ptep_set_access_flags();
1) | _raw_spin_unlock() {
1) 0.098 us | do_raw_spin_unlock();
1) 0.105 us | preempt_count_sub();
1) 1.884 us | }
1) 9.149 us | }
1) + 13.083 us | }
1) 0.146 us | up_read();
When the stack tracer was enabled, it enabled all functions to be traced,
which the function graph tracer then also traced. This is a side effect
that should not occur.
To fix this, a test is added when the function tracing is changed, as well as
when the graph tracer is enabled, to see if anything other than the ftrace
global_ops function tracer is enabled. If so, then the graph tracer calls a
test trampoline that compares the function being traced against the
filters defined by the global_ops.
As an optimization, if there's no other function tracers registered, or if
the only registered function tracers also use the global ops, the function
graph infrastructure will call the registered function graph callback directly
and not go through the test trampoline.
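The test trampoline can be sketched as:

    /* forward to the real graph callback only when global_ops'
     * filters would have traced this function (hedged sketch) */
    static int ftrace_graph_entry_test(struct ftrace_graph_ent *trace)
    {
            if (!ftrace_ops_test(&global_ops, trace->func))
                    return 0;
            return __ftrace_graph_entry(trace);
    }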
Fixes: d2d45c7a03a2 "tracing: Have stack_tracer use a separate list of functions" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The synchronization needed after ftrace_ops are unregistered must happen
after the callback is disabled from being called by functions.
The current location happens after the function is removed from the
internal lists, but not after the function callbacks were disabled, leaving
the callbacks susceptible to being called after they are freed.
This affects perf and any external users of function tracing (LTTng and
SystemTap).
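The synchronization primitive itself is small; roughly:

    /* an empty work function: once it has run on every CPU, no CPU
     * can still be inside a callback that was visible beforehand */
    static void ftrace_sync(struct work_struct *work)
    {
    }

    /* after the callback has been disabled: */
    schedule_on_each_cpu(ftrace_sync);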
Fixes: cdbe61bfe704 "ftrace: Allow dynamically allocated function tracers" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Partial commit backported to 3.4. The ftrace_sync() code added by this
is required for other fixes that 3.4 needs. ]
ftrace_trace_function is a variable that holds what function will be called
directly by the assembly code (mcount). If just a single function is
registered and it handles recursion itself, then the assembly will call that
function directly without any helper function. It also passes in the
ftrace_op that was registered with the callback. The ftrace_op to send is
stored in the function_trace_op variable.
The ftrace_trace_function and function_trace_op need to be coordinated such
that the callback won't be called with the wrong ftrace_op; otherwise
bad things can happen if it expected a different op. Luckily, there's no
callback that doesn't use the helper functions that requires this. But
there soon will be, and this needs to be fixed.
Use a set_function_trace_op variable to store the ftrace_op that
function_trace_op should be set to when it is safe to do so (during the
update function within the breakpoint or stop_machine calls). If dynamic
ftrace is not being used (static tracing), then we have to do a bit more
synchronization when the ftrace_trace_function is set, as that takes effect
immediately (as opposed to dynamic ftrace doing it with the modification of
the trampoline).
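The resulting ordering can be sketched as:

    /* stash the op the callback should see, but don't publish it yet */
    set_function_trace_op = ops;
    smp_wmb();

    /* the trampoline / ftrace_trace_function update runs here, under
     * the breakpoint or stop_machine machinery */

    /* only now is it safe for callbacks to pick up the new op */
    function_trace_op = set_function_trace_op;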
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.
The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).
To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.
The patch introduces a new dm_kobject_holder structure; its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.
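The holder keeps both objects together so the built-in code can reach the completion through container_of(); a sketch:

    struct dm_kobject_holder {
            struct kobject kobj;
            struct completion completion;
    };

    /* in dm-builtin.c, compiled in whenever BLK_DEV_DM is selected */
    void dm_kobject_release(struct kobject *kobj)
    {
            complete(&container_of(kobj, struct dm_kobject_holder,
                                   kobj)->completion);
    }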
On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
Itanium, pageblock_order is a variable with default value of 0. It's set
to the right value by set_pageblock_order() in function
free_area_init_core().
But pageblock_order may be used by sparse_init() before free_area_init_core()
is called, along this path:
sparse_init()
->sparse_early_usemaps_alloc_node()
->usemap_size()
->SECTION_BLOCKFLAGS_BITS
->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
The uninitialized pageblock_order will cause memory wastage because
usemap_size() returns a much bigger value than is really needed.
For example, on an Itanium platform,
sparse_init() pageblock_order=0 usemap_size=24576
free_area_init_core() before pageblock_order=0, usemap_size=24576
free_area_init_core() after pageblock_order=12, usemap_size=8
That means 24K memory has been wasted for each section, so fix it by calling
set_pageblock_order() from sparse_init().
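The fix boils down to this sketch:

    void __init sparse_init(void)
    {
            /* finalize pageblock_order before usemap_size() consults
             * SECTION_BLOCKFLAGS_BITS, which depends on it */
            set_pageblock_order();

            /* existing usemap/memmap allocation follows unchanged */
    }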
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <liuj97@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Keping Chen <chenkeping@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4: adjust context] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This has always been broken: one version takes an unsigned int and the
other version takes no arguments. This bug was hidden because one
version of set_pageblock_order() was a macro which doesn't evaluate its
argument.
Simplify it all and remove pageblock_default_order() altogether.
Especially vesafb likes to map everything as uc- (yikes), and if that
mapping still hangs around while we try to map the gtt as wc, the
kernel will downgrade our request to uc-, resulting in abysmal
performance.
Unfortunately we can't do this as early as radeon does (i.e. as the
first thing we do when initializing the hw) because our fb/mmio space
region moves around on a per-gen basis. So I've had to move it below
the gtt initialization, but that seems to work, too. The important
thing is that we do this before we set up the gtt wc mapping.
Now an altogether different question is why people compile their
kernels with vesafb enabled, but I guess making things just work isn't
bad per se ...
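A sketch of the kick-out step (the aperture base and size are illustrative; 'primary' says whether vesafb owns the boot console):

    static int i915_kick_out_firmware_fb(struct drm_device *dev, bool primary)
    {
            struct apertures_struct *ap = alloc_apertures(1);

            if (!ap)
                    return -ENOMEM;

            ap->ranges[0].base = dev->agp->base;    /* illustrative */
            ap->ranges[0].size = 256 * 1024 * 1024; /* illustrative */

            /* evict vesafb's uc- mapping before we map the GTT as wc */
            remove_conflicting_framebuffers(ap, "inteldrmfb", primary);
            kfree(ap);
            return 0;
    }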
v2:
- s/radeondrmfb/inteldrmfb/
- fix up error handling
v3: Kill #ifdef X86, this is Intel after all. Noticed by Ben Widawsky.
v4: Jani Nikula complained about the pointless bool primary
initialization.
v5: Don't oops if we can't allocate, noticed by Chris Wilson.
v6: Resolve conflicts with agp rework and fixup whitespace.
Backport to the 3.5 -fixes queue requested by Dave Airlie: due to grub
using vesa on Fedora, the initrd seems to load vesafb before loading
the real kms driver. So tons more people actually experience a
dead-slow gpu. Hence also the Cc: stable.
Reported-and-tested-by: "Kilarski, Bernard R" <bernard.r.kilarski@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Dave Airlie <airlied@redhat.com>
[lizf: Backported to 3.4: adjust context] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, when we set the group inode free count, we don't hold a
proper group lock, so multiple threads may decrease the inode free
count at the same time. And e2fsck will complain something like:
Free inodes count wrong for group #1 (1, counted=0).
Fix? no
Free inodes count wrong for group #2 (3, counted=0).
Fix? no
Directories count wrong for group #2 (780, counted=779).
Fix? no
Free inodes count wrong for group #3 (2272, counted=2273).
Fix? no
So this patch tries to protect it with ext4_lock_group().
By the way, this was found by xfstests test case 269 on a volume mkfsed
with the parameters
"-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr";
with this patch applied I have run it 100 times and the e2fsck error
doesn't show up again.
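The protected update then looks roughly like:

    /* serialize the per-group counters under the group lock */
    ext4_lock_group(sb, group);
    ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
    if (S_ISDIR(mode))
            ext4_used_dirs_set(sb, gdp, ext4_used_dirs_count(sb, gdp) + 1);
    ext4_unlock_group(sb, group);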
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>