Load size was limited to 64MB, this was legacy limitation due to vmalloc()
which was removed a while ago.
Signed-off-by: zhanglin <zhang.lin16@zte.com.cn>
[PM: removed comments in the description about 'real world use cases'] Signed-off-by: Paul Moore <paul@paul-moore.com>
Merge tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A bunch of fixes that accumulated in recent weeks, mostly material for
stable.
Summary:
- fix for regression from 5.3 that prevents to use balance convert
with single profile
- qgroup fixes: rescan race, accounting leak with multiple writers,
potential leak after io failure recovery
- fix for use after free in relocation (reported by KASAN)
- other error handling fixups"
* tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
btrfs: Fix a regression which we can't convert to SINGLE profile
btrfs: relocation: fix use-after-free on dead relocation roots
Btrfs: fix race setting up and completing qgroup rescan workers
Btrfs: fix missing error return if writeback for extent buffer never started
btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer
Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
Merge tag 'trace-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few more tracing fixes:
- Fix a buffer overflow by checking nr_args correctly in probes
- Fix a warning that is reported by clang
- Fix a possible memory leak in error path of filter processing
- Fix the selftest that checks for failures, but wasn't failing
- Minor clean up on call site output of a memory trace event"
* tag 'trace-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
selftests/ftrace: Fix same probe error test
mm, tracing: Print symbol name for call_site in trace events
tracing: Have error path in predicate_parse() free its allocated memory
tracing: Fix clang -Wint-in-bool-context warnings in IF_ASSIGN macro
tracing/probe: Fix to check the difference of nr_args before adding probe
Merge tag 'mmc-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull more MMC updates from Ulf Hansson:
"A couple more updates/fixes for MMC:
- sdhci-pci: Add Genesys Logic GL975x support
- sdhci-tegra: Recover loss in throughput for DMA
- sdhci-of-esdhc: Fix DMA bug"
* tag 'mmc-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: host: sdhci-pci: Add Genesys Logic GL975x support
mmc: tegra: Implement ->set_dma_mask()
mmc: sdhci: Let drivers define their DMA mask
mmc: sdhci-of-esdhc: set DMA snooping based on DMA coherence
mmc: sdhci: improve ADMA error reporting
csky: Move static keyword to the front of declaration
Move the static keyword to the front of declaration of
csky_pmu_of_device_ids, and resolve the following compiler
warning that can be seen when building with warnings
enabled (W=1):
arch/csky/kernel/perf_event.c:1340:1: warning:
‘static’ is not at beginning of declaration [-Wold-style-declaration]
Signed-off-by: Krzysztof Wilczynski <kw@linux.com> Signed-off-by: Guo Ren <guoren@kernel.org>
Merge tag 'char-misc-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull Documentation/process update from Greg KH:
"Here are two small Documentation/process/embargoed-hardware-issues.rst
file updates that missed my previous char/misc pull request.
The first one adds an Intel representative for the process, and the
second one cleans up the text a bit more when it comes to how the
disclosure rules work, as it was a bit confusing to some companies"
* tag 'char-misc-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
Documentation/process: Clarify disclosure rules
Documentation/process: Volunteer as the ambassador for Intel
Merge tag '5.4-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull more cifs updates from Steve French:
"Fixes from the recent SMB3 Test events and Storage Developer
Conference (held the last two weeks).
Here are nine smb3 patches including an important patch for debugging
traces with wireshark, with three patches marked for stable.
Additional fixes from last week to better handle some newly discovered
reparse points, and a fix the create/mkdir path for setting the mode
more atomically (in SMB3 Create security descriptor context), and one
for path name processing are still being tested so are not included
here"
* tag '5.4-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix oplock handling for SMB 2.1+ protocols
smb3: missing ACL related flags
smb3: pass mode bits into create calls
smb3: Add missing reparse tags
CIFS: fix max ea value size
fs/cifs/sess.c: Remove set but not used variable 'capabilities'
fs/cifs/smb2pdu.c: Make SMB2_notify_init static
smb3: fix leak in "open on server" perf counter
smb3: allow decryption keys to be dumped by admin for debugging
Mao Han [Wed, 25 Sep 2019 09:23:02 +0000 (17:23 +0800)]
csky: Fixup csky_pmu.max_period assignment
The csky_pmu.max_period has type u64, and BIT() can only return
32 bits unsigned long on C-SKY. The initialization for max_period
will be incorrect when count_width is bigger than 32.
Use BIT_ULL()
Signed-off-by: Mao Han <han_mao@c-sky.com> Signed-off-by: Guo Ren <ren_guo@c-sky.com>
This is admittedly partly "for discussion". We need to have a way
forward for the boot time deadlocks where user space ends up waiting for
more entropy, but no entropy is forthcoming because the system is
entirely idle just waiting for something to happen.
While this was triggered by what is arguably a user space bug with
GDM/gnome-session asking for secure randomness during early boot, when
they didn't even need any such truly secure thing, the issue ends up
being that our "getrandom()" interface is prone to that kind of
confusion, because people don't think very hard about whether they want
to block for sufficient amounts of entropy.
The approach here-in is to decide to not just passively wait for entropy
to happen, but to start actively collecting it if it is missing. This
is not necessarily always possible, but if the architecture has a CPU
cycle counter, there is a fair amount of noise in the exact timings of
reasonably complex loads.
We may end up tweaking the load and the entropy estimates, but this
should be at least a reasonable starting point.
As part of this, we also revert the revert of the ext4 IO pattern
improvement that ended up triggering the reported lack of external
entropy.
* getrandom() active entropy waiting:
Revert "Revert "ext4: make __ext4_get_inode_loc plug""
random: try to actively add entropy rather than passively wait for it
Instead of waiting forever for entropy that may just not happen, we now
try to actively generate entropy when required, and are thus hopefully
avoiding the problem that caused the nice ext4 IO pattern fix to be
reverted.
So revert the revert.
Cc: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
random: try to actively add entropy rather than passively wait for it
For 5.3 we had to revert a nice ext4 IO pattern improvement, because it
caused a bootup regression due to lack of entropy at bootup together
with arguably broken user space that was asking for secure random
numbers when it really didn't need to.
See commit 72dbcf721566 (Revert "ext4: make __ext4_get_inode_loc plug").
This aims to solve the issue by actively generating entropy noise using
the CPU cycle counter when waiting for the random number generator to
initialize. This only works when you have a high-frequency time stamp
counter available, but that's the case on all modern x86 CPU's, and on
most other modern CPU's too.
What we do is to generate jitter entropy from the CPU cycle counter
under a somewhat complex load: calling the scheduler while also
guaranteeing a certain amount of timing noise by also triggering a
timer.
I'm sure we can tweak this, and that people will want to look at other
alternatives, but there's been a number of papers written on jitter
entropy, and this should really be fairly conservative by crediting one
bit of entropy for every timer-induced jump in the cycle counter. Not
because the timer itself would be all that unpredictable, but because
the interaction between the timer and the loop is going to be.
Even if (and perhaps particularly if) the timer actually happens on
another CPU, the cacheline interaction between the loop that reads the
cycle counter and the timer itself firing is going to add perturbations
to the cycle counter values that get mixed into the entropy pool.
As Thomas pointed out, with a modern out-of-order CPU, even quite simple
loops show a fair amount of hard-to-predict timing variability even in
the absense of external interrupts. But this tries to take that further
by actually having a fairly complex interaction.
This is not going to solve the entropy issue for architectures that have
no CPU cycle counter, but it's not clear how (and if) that is solvable,
and the hardware in question is largely starting to be irrelevant. And
by doing this we can at least avoid some of the even more contentious
approaches (like making the entropy waiting time out in order to avoid
the possibly unbounded waiting).
Cc: Ahmed Darwish <darwish.07@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Nicholas Mc Guire <hofrat@opentech.at> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Olof Johansson [Sun, 29 Sep 2019 18:19:25 +0000 (11:19 -0700)]
Merge tag 'fixes-5.4-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes
Fixes for omap variants
Few fixes for ti-sysc interconnect target module driver for no-idle
quirks that caused nfsroot to fail on some dra7 boards.
And let's fixes to get LCD working again for logicpd board that got
broken a while back with removal of panel-dpi driver. We need to now
use generic CONFIG_DRM_PANEL_SIMPLE instead.
* tag 'fixes-5.4-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
bus: ti-sysc: Remove unpaired sysc_clkdm_deny_idle()
ARM: dts: logicpd-som-lv: Fix i2c2 and i2c3 Pin mux
ARM: dts: am3517-evm: Fix missing video
ARM: dts: logicpd-torpedo-baseboard: Fix missing video
ARM: omap2plus_defconfig: Fix missing video
bus: ti-sysc: Fix handling of invalid clocks
bus: ti-sysc: Fix clock handling for no-idle quirks
Olof Johansson [Sun, 29 Sep 2019 18:19:18 +0000 (11:19 -0700)]
Merge tag 'scmi-fixes-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes
ARM SCMI fixes for v5.4
Couple of fixes: one in scmi reset driver initialising missed scmi handle
and an other in scmi reset API implementation fixing the assignment of
reset state
* tag 'scmi-fixes-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
reset: reset-scmi: add missing handle initialisation
firmware: arm_scmi: reset: fix reset_state assignment in scmi_domain_reset
Merge tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
More libnvdimm updates from Dan Williams:
- Complete the reworks to interoperate with powerpc dynamic huge page
sizes
- Fix a crash due to missed accounting for the powerpc 'struct
page'-memmap mapping granularity
- Fix badblock initialization for volatile (DRAM emulated) pmem ranges
- Stop triggering request_key() notifications to userspace when
NVDIMM-security is disabled / not present
- Miscellaneous small fixups
* tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm/region: Enable MAP_SYNC for volatile regions
libnvdimm: prevent nvdimm from requesting key when security is disabled
libnvdimm/region: Initialize bad block for volatile namespaces
libnvdimm/nfit_test: Fix acpi_handle redefinition
libnvdimm/altmap: Track namespace boundaries in altmap
libnvdimm: Fix endian conversion issues
libnvdimm/dax: Pick the right alignment default when creating dax devices
powerpc/book3s64: Export has_transparent_hugepage() related functions.
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
Pull thermal SoC updates from Eduardo Valentin:
"This is a really small pull in the midst of a lot of pending patches.
We are in the middle of restructuring how we are maintaining the
thermal subsystem, as per discussion in our last LPC. For now, I am
sending just some changes that were pending in my tree. Looking
forward to get a more streamlined process in the next merge window"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal:
thermal: db8500: Rewrite to be a pure OF sensor
thermal: db8500: Use dev helper variable
thermal: db8500: Finalize device tree conversion
thermal: thermal_mmio: remove some dead code
Merge branch 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull more i2c updates from Wolfram Sang:
- make Lenovo Yoga C630 boot now that the dependencies are merged
- restore BlockProcessCall for i801, accidently removed in this merge
window
- a bugfix for the riic driver
- an improvement to the slave-eeprom driver which should have been in
the first pull request but sadly got lost in the process
* 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: slave-eeprom: Add read only mode
i2c: i801: Bring back Block Process Call support for certain platforms
i2c: riic: Clear NACK in tend isr
i2c: qcom-geni: Disable DMA processing on the Lenovo Yoga C630
Merge tag 'iommu-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
"A couple of fixes for the AMD IOMMU driver have piled up:
- Some fixes for the reworked IO page-table which caused memory leaks
or did not allow to downgrade mappings under some conditions.
- Locking fixes to fix a couple of possible races around accessing
'struct protection_domain'. The races got introduced when the
dma-ops path became lock-less in the fast-path"
* tag 'iommu-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Lock code paths traversing protection_domain->dev_list
iommu/amd: Lock dev_data in attach/detach code paths
iommu/amd: Check for busy devices earlier in attach_device()
iommu/amd: Take domain->lock for complete attach/detach path
iommu/amd: Remove amd_iommu_devtable_lock
iommu/amd: Remove domain->updated
iommu/amd: Wait for completion of IOTLB flush in attach_device
iommu/amd: Unmap all L7 PTEs when downgrading page-sizes
iommu/amd: Introduce first_pte_l7() helper
iommu/amd: Fix downgrading default page-sizes in alloc_pte()
iommu/amd: Fix pages leak in free_pagetable()
Thomas Gleixner [Wed, 25 Sep 2019 08:29:49 +0000 (10:29 +0200)]
Documentation/process: Clarify disclosure rules
The role of the contact list provided by the disclosing party and how it
affects the disclosure process and the ability to include experts into
the development process is not really well explained.
Neither is it entirely clear when the disclosing party will be informed
about the fact that a developer who is not covered by an employer NDA needs
to be brought in and disclosed.
Explain the role of the contact list and the information policy along with
an eventual conflict resolution better.
1) Sanity check URB networking device parameters to avoid divide by
zero, from Oliver Neukum.
2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
don't work properly. Longer term this needs a better fix tho. From
Vijay Khemka.
3) Small fixes to selftests (use ping when ping6 is not present, etc.)
from David Ahern.
4) Bring back rt_uses_gateway member of struct rtable, it's semantics
were not well understood and trying to remove it broke things. From
David Ahern.
5) Move usbnet snaity checking, ignore endpoints with invalid
wMaxPacketSize. From Bjørn Mork.
6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.
7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
Alex Vesker, and Yevgeny Kliteynik
8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.
9) Fix crash when removing sch_cbs entry while offloading is enabled,
from Vinicius Costa Gomes.
10) Signedness bug fixes, generally in looking at the result given by
of_get_phy_mode() and friends. From Dan Crapenter.
11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.
12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.
13) Fix quantization code in tcp_bbr, from Kevin Yang.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
net: tap: clean up an indentation issue
nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
sk_buff: drop all skb extensions on free and skb scrubbing
tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
Documentation: Clarify trap's description
mlxsw: spectrum: Clear VLAN filters during port initialization
net: ena: clean up indentation issue
NFC: st95hf: clean up indentation issue
net: phy: micrel: add Asym Pause workaround for KSZ9021
net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
ptp: correctly disable flags on old ioctls
lib: dimlib: fix help text typos
net: dsa: microchip: Always set regmap stride to 1
nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
...
Merge branch 'hugepage-fallbacks' (hugepatch patches from David Rientjes)
Merge hugepage allocation updates from David Rientjes:
"We (mostly Linus, Andrea, and myself) have been discussing offlist how
to implement a sane default allocation strategy for hugepages on NUMA
platforms.
With these reverts in place, the page allocator will happily allocate
a remote hugepage immediately rather than try to make a local hugepage
available. This incurs a substantial performance degradation when
memory compaction would have otherwise made a local hugepage
available.
This series reverts those reverts and attempts to propose a more sane
default allocation strategy specifically for hugepages. Andrea
acknowledges this is likely to fix the swap storms that he originally
reported that resulted in the patches that removed __GFP_THISNODE from
hugepage allocations.
The immediate goal is to return 5.3 to the behavior the kernel has
implemented over the past several years so that remote hugepages are
not immediately allocated when local hugepages could have been made
available because the increased access latency is untenable.
The next goal is to introduce a sane default allocation strategy for
hugepages allocations in general regardless of the configuration of
the system so that we prevent thrashing of local memory when
compaction is unlikely to succeed and can prefer remote hugepages over
remote native pages when the local node is low on memory."
Note on timing: this reverts the hugepage VM behavior changes that got
introduced fairly late in the 5.3 cycle, and that fixed a huge
performance regression for certain loads that had been around since
4.18.
Andrea had this note:
"The regression of 4.18 was that it was taking hours to start a VM
where 3.10 was only taking a few seconds, I reported all the details
on lkml when it was finally tracked down in August 2018.
__GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
workload degrade like in the "current upstream" above. And it still
would have been that bad as above until 5.3-rc5"
where the bad behavior ends up happening as you fill up a local node,
and without that change, you'd get into the nasty swap storm behavior
due to compaction working overtime to make room for more memory on the
nodes.
As a result 5.3 got the two performance fix reverts in rc5.
However, David Rientjes then noted that those performance fixes in turn
regressed performance for other loads - although not quite to the same
degree. He suggested reverting the reverts and instead replacing them
with two small changes to how hugepage allocations are done (patch
descriptions rephrased by me):
- "avoid expensive reclaim when compaction may not succeed": just admit
that the allocation failed when you're trying to allocate a huge-page
and compaction wasn't successful.
- "allow hugepage fallback to remote nodes when madvised": when that
node-local huge-page allocation failed, retry without forcing the
local node.
but by then I judged it too late to replace the fixes for a 5.3 release.
So 5.3 was released with behavior that harked back to the pre-4.18 logic.
But now we're in the merge window for 5.4, and we can see if this
alternate model fixes not just the horrendous swap storm behavior, but
also restores the performance regression that the late reverts caused.
Fingers crossed.
* emailed patches from David Rientjes <rientjes@google.com>:
mm, page_alloc: allow hugepage fallback to remote nodes when madvised
mm, page_alloc: avoid expensive reclaim when compaction may not succeed
Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
Revert "Revert "mm, thp: restore node-local hugepage allocations""
tracing: Fix clang -Wint-in-bool-context warnings in IF_ASSIGN macro
After r372664 in clang, the IF_ASSIGN macro causes a couple hundred
warnings along the lines of:
kernel/trace/trace_output.c:1331:2: warning: converting the enum
constant to a boolean [-Wint-in-bool-context]
kernel/trace/trace.h:409:3: note: expanded from macro
'trace_assign_type'
IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry,
^
kernel/trace/trace.h:371:14: note: expanded from macro 'IF_ASSIGN'
WARN_ON(id && (entry)->type != id); \
^
264 warnings generated.
This warning can catch issues with constructs like:
if (state == A || B)
where the developer really meant:
if (state == A || state == B)
This is currently the only occurrence of the warning in the kernel
tree across defconfig, allyesconfig, allmodconfig for arm32, arm64,
and x86_64. Add the implicit '!= 0' to the WARN_ON statement to fix
the warnings and find potential issues in the future.
tracing/probe: Fix to check the difference of nr_args before adding probe
Steven reported that a test triggered:
==================================================================
BUG: KASAN: slab-out-of-bounds in trace_kprobe_create+0xa9e/0xe40
Read of size 8 at addr ffff8880c4f25a48 by task ftracetest/4798
Fix to check the difference of nr_args before adding probe
on existing probes. This also may set the error log index
bigger than the number of command parameters. In that case
it sets the error position is next to the last parameter.
Link: http://lkml.kernel.org/r/156966474783.3478.13217501608215769150.stgit@devnote2 Fixes: ca89bc071d5e ("tracing/kprobe: Add multi-probe per event support") Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
David Rientjes [Wed, 4 Sep 2019 19:54:25 +0000 (12:54 -0700)]
mm, page_alloc: allow hugepage fallback to remote nodes when madvised
For systems configured to always try hard to allocate transparent
hugepages (thp defrag setting of "always") or for memory that has been
explicitly madvised to MADV_HUGEPAGE, it is often better to fallback to
remote memory to allocate the hugepage if the local allocation fails
first.
The point is to allow the initial call to __alloc_pages_node() to attempt
to defragment local memory to make a hugepage available, if possible,
rather than immediately fallback to remote memory. Local hugepages will
always have a better access latency than remote (huge)pages, so an attempt
to make a hugepage available locally is always preferred.
If memory compaction cannot be successful locally, however, it is likely
better to fallback to remote memory. This could take on two forms: either
allow immediate fallback to remote memory or do per-zone watermark checks.
It would be possible to fallback only when per-zone watermarks fail for
order-0 memory, since that would require local reclaim for all subsequent
faults so remote huge allocation is likely better than thrashing the local
zone for large workloads.
In this case, it is assumed that because the system is configured to try
hard to allocate hugepages or the vma is advised to explicitly want to try
hard for hugepages that remote allocation is better when local allocation
and memory compaction have both failed.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 4 Sep 2019 19:54:22 +0000 (12:54 -0700)]
mm, page_alloc: avoid expensive reclaim when compaction may not succeed
Memory compaction has a couple significant drawbacks as the allocation
order increases, specifically:
- isolate_freepages() is responsible for finding free pages to use as
migration targets and is implemented as a linear scan of memory
starting at the end of a zone,
- failing order-0 watermark checks in memory compaction does not account
for how far below the watermarks the zone actually is: to enable
migration, there must be *some* free memory available. Per the above,
watermarks are not always suffficient if isolate_freepages() cannot
find the free memory but it could require hundreds of MBs of reclaim to
even reach this threshold (read: potentially very expensive reclaim with
no indication compaction can be successful), and
- if compaction at this order has failed recently so that it does not even
run as a result of deferred compaction, looping through reclaim can often
be pointless.
For hugepage allocations, these are quite substantial drawbacks because
these are very high order allocations (order-9 on x86) and falling back to
doing reclaim can potentially be *very* expensive without any indication
that compaction would even be successful.
Reclaim itself is unlikely to free entire pageblocks and certainly no
reliance should be put on it to do so in isolation (recall lumpy reclaim).
This means we should avoid reclaim and simply fail hugepage allocation if
compaction is deferred.
It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory. If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation rather than thrashing the zone.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit a8282608c88e ("Revert "mm, thp: restore node-local hugepage
allocations"") is reverted in this series, it is better to restore the
previous 5.2 behavior between the thp allocation and the page allocator
rather than to attempt any consolidation or cleanup for a policy that is
now reverted. It's less risky during an rc cycle and subsequent patches
in this series further modify the same policy that the pre-5.3 behavior
implements.
Consolidation and cleanup can be done subsequent to a sane default page
allocation strategy, so this patch reverts a cleanup done on a strategy
that is now reverted and thus is the least risky option.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit references the original intended semantic for MADV_HUGEPAGE
which has subsequently taken on three unique purposes:
- enables or disables thp for a range of memory depending on the system's
config (is thp "enabled" set to "always" or "madvise"),
- determines the synchronous compaction behavior for thp allocations at
fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
and
- reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
clear previous hugepage advice).
These are the three purposes that currently exist in 5.2 and over the
past several years that userspace has been written around. Adding a
NUMA locality preference adds a fourth dimension to an already conflated
advice mode.
Based on the semantic that MADV_HUGEPAGE has provided over the past
several years, there exist workloads that use the tunable based on these
principles: specifically that the allocation should attempt to
defragment a local node before falling back. It is agreed that remote
hugepages typically (but not always) have a better access latency than
remote native pages, although on Naples this is at parity for
intersocket.
The revert commit that this patch reverts allows hugepage allocation to
immediately allocate remotely when local memory is fragmented. This is
contrary to the semantic of MADV_HUGEPAGE over the past several years:
that is, memory compaction should be attempted locally before falling
back.
The performance degradation of remote hugepages over local hugepages on
Rome, for example, is 53.5% increased access latency. For this reason,
the goal is to revert back to the 5.2 and previous behavior that would
attempt local defragmentation before falling back. With the patch that
is reverted by this patch, we see performance degradations at the tail
because the allocator happily allocates the remote hugepage rather than
even attempting to make a local hugepage available.
zone_reclaim_mode is not a solution to this problem since it does not
only impact hugepage allocations but rather changes the memory
allocation strategy for *all* page allocations.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"An assortment of fixes that were either missed by me, or didn't arrive
quite in time for the first v5.4 pull.
- Most notable is a fix for an issue with tlbie (broadcast TLB
invalidation) on Power9, when using the Radix MMU. The tlbie can
race with an mtpid (move to PID register, essentially MMU context
switch) on another thread of the core, which can cause stores to
continue to go to a page after it's unmapped.
- A fix in our KVM code to add a missing barrier, the lack of which
has been observed to cause missed IPIs and subsequently stuck CPUs
in the host.
- A change to the way we initialise PCR (Processor Compatibility
Register) to make it forward compatible with future CPUs.
- On some older PowerVM systems our H_BLOCK_REMOVE support could
oops, fix it to detect such systems and fallback to the old
invalidation method.
- A fix for an oops seen on some machines when using KASAN on 32-bit.
- A handful of other minor fixes, and two new selftests.
Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy,
Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael
Roth, Oliver O'Halloran"
* tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/eeh: Fix eeh eeh_debugfs_break_device() with SRIOV devices
powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error
powerpc/nvdimm: Use HCALL error as the return value
selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue
powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag
powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions
powerpc/pseries: Call H_BLOCK_REMOVE when supported
powerpc/pseries: Read TLB Block Invalidate Characteristics
KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
powerpc/mm: Fix an Oops in kasan_mmu_init()
powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY
powerpc/64s: Set reserved PCR bits
powerpc: Fix definition of PCR bits to work with old binutils
powerpc/book3s64/radix: Remove WARN_ON in destroy_context()
powerpc/tm: Add tm-poison test
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Apply a number of membarrier related fixes and cleanups, which fixes
a use-after-free race in the membarrier code
- Introduce proper RCU protection for tasks on the runqueue - to get
rid of the subtle task_rcu_dereference() interface that was easy to
get wrong
- Misc fixes, but also an EAS speedup
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Avoid redundant EAS calculation
sched/core: Remove double update_max_interval() call on CPU startup
sched/core: Fix preempt_schedule() interrupt return comment
sched/fair: Fix -Wunused-but-set-variable warnings
sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
sched/membarrier: Skip IPIs when mm->mm_users == 1
selftests, sched/membarrier: Add multi-threaded test
sched/membarrier: Fix p->mm->membarrier_state racy load
sched/membarrier: Call sync_core only before usermode for same mm
sched/membarrier: Remove redundant check
sched/membarrier: Fix private expedited registration check
tasks, sched/core: RCUify the assignment of rq->curr
tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
tasks: Add a count of task RCU users
sched/core: Convert vcpu_is_preempted() from macro to an inline function
sched/fair: Remove unused cfs_rq_clock_task() function
Jarkko Nikula [Fri, 27 Sep 2019 11:09:11 +0000 (14:09 +0300)]
i2c: i801: Bring back Block Process Call support for certain platforms
Commit b84398d6d7f9 ("i2c: i801: Use iTCO version 6 in Cannon Lake PCH
and beyond") looks like to drop by accident Block Write-Block Read Process
Call support for Intel Sunrisepoint, Lewisburg, Denverton and Kaby Lake.
That support was added for above and newer platforms by the commit 315cd67c9453 ("i2c: i801: Add Block Write-Block Read Process Call
support") so bring it back for above platforms.
Fixes: b84398d6d7f9 ("i2c: i801: Use iTCO version 6 in Cannon Lake PCH and beyond") Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Chris Brandt [Thu, 26 Sep 2019 12:19:09 +0000 (07:19 -0500)]
i2c: riic: Clear NACK in tend isr
The NACKF flag should be cleared in INTRIICNAKI interrupt processing as
description in HW manual.
This issue shows up quickly when PREEMPT_RT is applied and a device is
probed that is not plugged in (like a touchscreen controller). The result
is endless interrupts that halt system boot.
Fixes: 310c18a41450 ("i2c: riic: add driver") Cc: stable@vger.kernel.org Reported-by: Chien Nguyen <chien.nguyen.eb@rvc.renesas.com> Signed-off-by: Chris Brandt <chris.brandt@renesas.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Lee Jones [Thu, 5 Sep 2019 19:24:12 +0000 (20:24 +0100)]
i2c: qcom-geni: Disable DMA processing on the Lenovo Yoga C630
We have a production-level laptop (Lenovo Yoga C630) which is exhibiting
a rather horrific bug. When I2C HID devices are being scanned for at
boot-time the QCom Geni based I2C (Serial Engine) attempts to use DMA.
When it does, the laptop reboots and the user never sees the OS.
Attempts are being made to debug the reason for the spontaneous reboot.
No luck so far, hence the requirement for this hot-fix. This workaround
will be removed once we have a viable fix.
Signed-off-by: Lee Jones <lee.jones@linaro.org> Tested-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull kernel lockdown mode from James Morris:
"This is the latest iteration of the kernel lockdown patchset, from
Matthew Garrett, David Howells and others.
From the original description:
This patchset introduces an optional kernel lockdown feature,
intended to strengthen the boundary between UID 0 and the kernel.
When enabled, various pieces of kernel functionality are restricted.
Applications that rely on low-level access to either hardware or the
kernel may cease working as a result - therefore this should not be
enabled without appropriate evaluation beforehand.
The majority of mainstream distributions have been carrying variants
of this patchset for many years now, so there's value in providing a
doesn't meet every distribution requirement, but gets us much closer
to not requiring external patches.
There are two major changes since this was last proposed for mainline:
- Separating lockdown from EFI secure boot. Background discussion is
covered here: https://lwn.net/Articles/751061/
- Implementation as an LSM, with a default stackable lockdown LSM
module. This allows the lockdown feature to be policy-driven,
rather than encoding an implicit policy within the mechanism.
The new locked_down LSM hook is provided to allow LSMs to make a
policy decision around whether kernel functionality that would allow
tampering with or examining the runtime state of the kernel should be
permitted.
The included lockdown LSM provides an implementation with a simple
policy intended for general purpose use. This policy provides a coarse
level of granularity, controllable via the kernel command line:
lockdown={integrity|confidentiality}
Enable the kernel lockdown feature. If set to integrity, kernel features
that allow userland to modify the running kernel are disabled. If set to
confidentiality, kernel features that allow userland to extract
confidential information from the kernel are also disabled.
This may also be controlled via /sys/kernel/security/lockdown and
overriden by kernel configuration.
New or existing LSMs may implement finer-grained controls of the
lockdown features. Refer to the lockdown_reason documentation in
include/linux/security.h for details.
The lockdown feature has had signficant design feedback and review
across many subsystems. This code has been in linux-next for some
weeks, with a few fixes applied along the way.
Stephen Rothwell noted that commit 9d1f8be5cf42 ("bpf: Restrict bpf
when kernel lockdown is in confidentiality mode") is missing a
Signed-off-by from its author. Matthew responded that he is providing
this under category (c) of the DCO"
* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
kexec: Fix file verification on S390
security: constify some arrays in lockdown LSM
lockdown: Print current->comm in restriction messages
efi: Restrict efivar_ssdt_load when the kernel is locked down
tracefs: Restrict tracefs when the kernel is locked down
debugfs: Restrict debugfs when the kernel is locked down
kexec: Allow kexec_file() with appropriate IMA policy when locked down
lockdown: Lock down perf when in confidentiality mode
bpf: Restrict bpf when kernel lockdown is in confidentiality mode
lockdown: Lock down tracing and perf kprobes when in confidentiality mode
lockdown: Lock down /proc/kcore
x86/mmiotrace: Lock down the testmmiotrace module
lockdown: Lock down module params that specify hardware parameters (eg. ioport)
lockdown: Lock down TIOCSSERIAL
lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
acpi: Disable ACPI table override if the kernel is locked down
acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
ACPI: Limit access to custom_method when the kernel is locked down
x86/msr: Restrict MSR access when the kernel is locked down
x86: Lock down IO port access when the kernel is locked down
...
The traversing of this list requires protection_domain->lock to be taken
to avoid nasty races with attach/detach code. Make sure the lock is held
on all code-paths traversing this list.
iommu/amd: Check for busy devices earlier in attach_device()
Check early in attach_device whether the device is already attached to a
domain. This also simplifies the code path so that __attach_device() can
be removed.
Fixes: 92d420ec028d ("iommu/amd: Relax locking in dma_ops path") Reviewed-by: Filippo Sironi <sironi@amazon.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
iommu/amd: Take domain->lock for complete attach/detach path
The code-paths before __attach_device() and __detach_device() are called
also access and modify domain state, so take the domain lock there too.
This allows to get rid of the __detach_device() function.
Fixes: 92d420ec028d ("iommu/amd: Relax locking in dma_ops path") Reviewed-by: Filippo Sironi <sironi@amazon.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
The lock is not necessary because the device table does not
contain shared state that needs protection. Locking is only
needed on an individual entry basis, and that needs to
happen on the iommu_dev_data level.
Fixes: 92d420ec028d ("iommu/amd: Relax locking in dma_ops path") Reviewed-by: Filippo Sironi <sironi@amazon.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
This struct member was used to track whether a domain
change requires updates to the device-table and IOMMU cache
flushes. The problem is, that access to this field is racy
since locking in the common mapping code-paths has been
eliminated.
Move the updated field to the stack to get rid of all
potential races and remove the field from the struct.
Fixes: 92d420ec028d ("iommu/amd: Relax locking in dma_ops path") Reviewed-by: Filippo Sironi <sironi@amazon.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"The major feature in this time is IMA support for measuring and
appraising appended file signatures. In addition are a couple of bug
fixes and code cleanup to use struct_size().
In addition to the PE/COFF and IMA xattr signatures, the kexec kernel
image may be signed with an appended signature, using the same
scripts/sign-file tool that is used to sign kernel modules.
Similarly, the initramfs may contain an appended signature.
This contained a lot of refactoring of the existing appended signature
verification code, so that IMA could retain the existing framework of
calculating the file hash once, storing it in the IMA measurement list
and extending the TPM, verifying the file's integrity based on a file
hash or signature (eg. xattrs), and adding an audit record containing
the file hash, all based on policy. (The IMA support for appended
signatures patch set was posted and reviewed 11 times.)
The support for appended signature paves the way for adding other
signature verification methods, such as fs-verity, based on a single
system-wide policy. The file hash used for verifying the signature and
the signature, itself, can be included in the IMA measurement list"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: ima_api: Use struct_size() in kzalloc()
ima: use struct_size() in kzalloc()
sefltest/ima: support appended signatures (modsig)
ima: Fix use after free in ima_read_modsig()
MODSIGN: make new include file self contained
ima: fix freeing ongoing ahash_request
ima: always return negative code for error
ima: Store the measurement again when appraising a modsig
ima: Define ima-modsig template
ima: Collect modsig
ima: Implement support for module-style appended signatures
ima: Factor xattr_verify() out of ima_appraise_measurement()
ima: Add modsig appraise_type option for module-style appended signatures
integrity: Select CONFIG_KEYS instead of depending on it
PKCS#7: Introduce pkcs7_get_digest()
PKCS#7: Refactor verify_pkcs7_signature()
MODSIGN: Export module signature definitions
ima: initialize the "template" field with the default template
Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Add a new knfsd file cache, so that we don't have to open and close
on each (NFSv2/v3) READ or WRITE. This can speed up read and write
in some cases. It also replaces our readahead cache.
- Prevent silent data loss on write errors, by treating write errors
like server reboots for the purposes of write caching, thus forcing
clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving, so
that NFSv4.1 mounts are less likely to hang when a server already
has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
limited only by the backend filesystem and the maximum RPC size.
- Allow the server to enforce use of the correct kerberos credentials
when a client reclaims state after a reboot.
And some miscellaneous smaller bugfixes and cleanup"
* tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
sunrpc: clean up indentation issue
nfsd: fix nfs read eof detection
nfsd: Make nfsd_reset_boot_verifier_locked static
nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
nfsd: handle drc over-allocation gracefully.
nfsd: add support for upcall version 2
nfsd: add a "GetVersion" upcall for nfsdcld
nfsd: Reset the boot verifier on all write I/O errors
nfsd: Don't garbage collect files that might contain write errors
nfsd: Support the server resetting the boot verifier
nfsd: nfsd_file cache entries should be per net namespace
nfsd: eliminate an unnecessary acl size limit
Deprecate nfsd fault injection
nfsd: remove duplicated include from filecache.c
nfsd: Fix the documentation for svcxdr_tmpalloc()
nfsd: Fix up some unused variable warnings
nfsd: close cached files prior to a REMOVE or RENAME that would replace target
nfsd: rip out the raparms cache
nfsd: have nfsd_test_lock use the nfsd_file cache
nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
...
Merge tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse virtio-fs support from Miklos Szeredi:
"Virtio-fs allows exporting directory trees on the host and mounting
them in guest(s).
This isn't actually a new filesystem, but a glue layer between the
fuse filesystem and a virtio based back-end.
It's similar in functionality to the existing virtio-9p solution, but
significantly faster in benchmarks and has better POSIX compliance.
Further permformance improvements can be achieved by sharing the page
cache between host and guest, allowing for faster I/O and reduced
memory use.
Kata Containers have been including the out-of-tree virtio-fs (with
the shared page cache patches as well) since version 1.7 as an
experimental feature. They have been active in development and plan to
switch from virtio-9p to virtio-fs as their default solution. There
has been interest from other sources as well.
The userspace infrastructure is slated to be merged into qemu once the
kernel part hits mainline.
This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi"
* tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
virtio-fs: add virtiofs filesystem
virtio-fs: add Documentation/filesystems/virtiofs.rst
fuse: reserve values for mapping protocol
Merge tag '9p-for-5.4' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Some of the usual small fixes and cleanup.
Small fixes all around:
- avoid overlayfs copy-up for PRIVATE mmaps
- KUMSAN uninitialized warning for transport error
- one syzbot memory leak fix in 9p cache
- internal API cleanup for v9fs_fill_super"
* tag '9p-for-5.4' of git://github.com/martinetd/linux:
9p/vfs_super.c: Remove unused parameter data in v9fs_fill_super
9p/cache.c: Fix memory leak in v9fs_cache_session_get_cookie
9p: Transport error uninitialized
9p: avoid attaching writeback_fid on mmap with type PRIVATE
Merge tag 'riscv/for-v5.4-rc1-b' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull more RISC-V updates from Paul Walmsley:
"Some additional RISC-V updates.
This includes one significant fix:
- Prevent interrupts from being unconditionally re-enabled during
exception handling if they were disabled in the context in which
the exception occurred
Also a few other fixes:
- Fix a build error when sparse memory support is manually enabled
- Prevent CPUs beyond CONFIG_NR_CPUS from being enabled in early boot
And a few minor improvements:
- DT improvements: in the FU540 SoC DT files, improve U-Boot
compatibility by adding an "ethernet0" alias, drop an unnecessary
property from the DT files, and add support for the PWM device
- KVM preparation: add a KVM-related macro for future RISC-V KVM
support, and export some symbols required to build KVM support as
modules
- defconfig additions: build more drivers by default for QEMU
configurations"
* tag 'riscv/for-v5.4-rc1-b' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Avoid interrupts being erroneously enabled in handle_exception()
riscv: dts: sifive: Drop "clock-frequency" property of cpu nodes
riscv: dts: sifive: Add ethernet0 to the aliases node
RISC-V: Export kernel symbols for kvm
KVM: RISC-V: Add KVM_REG_RISCV for ONE_REG interface
arch/riscv: disable excess harts before picking main boot hart
RISC-V: Enable VIRTIO drivers in RV64 and RV32 defconfig
RISC-V: Fix building error when CONFIG_SPARSEMEM_MANUAL=y
riscv: dts: Add DT support for SiFive FU540 PWM driver
Merge tag 'pwm/for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
Pull pwm updates from Thierry Reding:
"Besides one new driver being added for the PWM controller found in
various Spreadtrum SoCs, this series of changes brings a slew of,
mostly minor, fixes and cleanups for existing drivers, as well as some
enhancements to the core code.
Lastly, Uwe is added to the PWM subsystem entry of the MAINTAINERS
file, making official his role as a reviewer"
* tag 'pwm/for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (34 commits)
MAINTAINERS: Add myself as reviewer for the PWM subsystem
MAINTAINERS: Add patchwork link for PWM entry
MAINTAINERS: Add a selection of PWM related keywords to the PWM entry
pwm: mediatek: Add MT7629 compatible string
dt-bindings: pwm: Update bindings for MT7629 SoC
pwm: mediatek: Update license and switch to SPDX tag
pwm: mediatek: Use pwm_mediatek as common prefix
pwm: mediatek: Allocate the clks array dynamically
pwm: mediatek: Remove the has_clks field
pwm: mediatek: Drop the check for of_device_get_match_data()
pwm: atmel: Consolidate driver data initialization
pwm: atmel: Remove unneeded check for match data
pwm: atmel: Remove platform_device_id and use only dt bindings
pwm: stm32-lp: Add check in case requested period cannot be achieved
pwm: Ensure pwm_apply_state() doesn't modify the state argument
pwm: fsl-ftm: Don't update the state for the caller of pwm_apply_state()
pwm: sun4i: Don't update the state for the caller of pwm_apply_state()
pwm: rockchip: Don't update the state for the caller of pwm_apply_state()
pwm: Let pwm_get_state() return the last implemented state
pwm: Introduce local struct pwm_chip in pwm_apply_state()
...
Merge tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block
Pull more io_uring updates from Jens Axboe:
"Just two things in here:
- Improvement to the io_uring CQ ring wakeup for batched IO (me)
- Fix wrong comparison in poll handling (yangerkun)
I realize the first one is a little late in the game, but it felt
pointless to hold it off until the next release. Went through various
testing and reviews with Pavel and peterz"
* tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block:
io_uring: make CQ ring wakeups be more efficient
io_uring: compare cached_cq_tail with cq.head in_io_uring_poll
Merge tag 'for-linus-2019-09-27' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes/changes to round off this merge window. This contains:
- Small series making some functional tweaks to blk-iocost (Tejun)
- Elevator switch locking fix (Ming)
- Kill redundant call in blk-wbt (Yufen)
- Fix flush timeout handling (Yufen)"
* tag 'for-linus-2019-09-27' of git://git.kernel.dk/linux-block:
block: fix null pointer dereference in blk_mq_rq_timed_out()
rq-qos: get rid of redundant wbt_update_limits()
iocost: bump up default latency targets for hard disks
iocost: improve nr_lagging handling
iocost: better trace vrate changes
block: don't release queue's sysfs lock during switching elevator
blk-mq: move lockdep_assert_held() into elevator_exit
nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
In nfp_abm_u32_knode_replace if the allocation for match fails it should
go to the error handling instead of returning. Updated other gotos to
have correct errno returned, too.
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Chuang [Wed, 11 Sep 2019 07:23:44 +0000 (15:23 +0800)]
mmc: host: sdhci-pci: Add Genesys Logic GL975x support
Add support for the GL9750 and GL9755 chipsets.
Enable v4 mode and wait 5ms after set 1.8V signal enable for GL9750/
GL9755. Fix the value of SDHCI_MAX_CURRENT register and use the vendor
tuning flow for GL9750.
Co-developed-by: Michael K Johnson <johnsonm@danlj.org> Signed-off-by: Michael K Johnson <johnsonm@danlj.org> Signed-off-by: Ben Chuang <ben.chuang@genesyslogic.com.tw> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Eric Dumazet [Thu, 26 Sep 2019 22:42:51 +0000 (15:42 -0700)]
tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
Yuchung Cheng and Marek Majkowski independently reported a weird
behavior of TCP_USER_TIMEOUT option when used at connect() time.
When the TCP_USER_TIMEOUT is reached, tcp_write_timeout()
believes the flow should live, and the following condition
in tcp_clamp_rto_to_user_timeout() programs one jiffie timers :
remaining = icsk->icsk_user_timeout - elapsed;
if (remaining <= 0)
return 1; /* user timeout has passed; fire ASAP */
This silly situation ends when the max syn rtx count is reached.
This patch makes sure we honor both TCP_SYNCNT and TCP_USER_TIMEOUT,
avoiding these spurious SYN packets.
Fixes: b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Reported-by: Marek Majkowski <marek@cloudflare.com> Cc: Jon Maxwell <jmaxwell37@gmail.com> Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2 Acked-by: Jon Maxwell <jmaxwell37@gmail.com> Tested-by: Marek Majkowski <marek@cloudflare.com> Signed-off-by: Marek Majkowski <marek@cloudflare.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
sk_buff: drop all skb extensions on free and skb scrubbing
Now that we have a 3rd extension, add a new helper that drops the
extension space and use it when we need to scrub an sk_buff.
At this time, scrubbing clears secpath and bridge netfilter data, but
retains the tc skb extension, after this patch all three get cleared.
NAPI reuse/free assumes we can only have a secpath attached to skb, but
it seems better to clear all extensions there as well.
v2: add unlikely hint (Eric Dumazet)
Fixes: 95a7233c452a ("net: openvswitch: Set OvS recirc_id from tc chain index") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
There was a bug in the previous logic that attempted to ensure gain cycling
gets inflight above BDP even for small BDPs. This code correctly raised and
lowered target inflight values during the gain cycle. And this code
correctly ensured that cwnd was raised when probing bandwidth. However, it
did not correspondingly ensure that cwnd was *not* raised in this way when
*not* probing for bandwidth. The result was that small-BDP flows that were
always cwnd-bound could go for many cycles with a fixed cwnd, and not probe
or yield bandwidth at all. This meant that multiple small-BDP flows could
fail to converge in their bandwidth allocations.
Fixes: 3c346b233c68 ("tcp_bbr: fix bw probing to raise in-flight data for very small BDPs") Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Sep 2019 18:33:19 +0000 (20:33 +0200)]
Merge branch 'mlxsw-Various-fixes'
Ido Schimmel says:
====================
mlxsw: Various fixes
This patchset includes two small fixes for the mlxsw driver and one
patch which clarifies recently introduced devlink-trap documentation.
Patch #1 clears the port's VLAN filters during port initialization. This
ensures that the drop reason reported to the user is consistent. The
problem is explained in detail in the commit message.
Patch #2 clarifies the description of one of the traps exposed via
devlink-trap.
Patch #3 from Danielle forbids the installation of a tc filter with
multiple mirror actions since this is not supported by the device. The
failure is communicated to the user via extack.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw: spectrum: Clear VLAN filters during port initialization
When a port is created, its VLAN filters are not cleared by the
firmware. This causes tagged packets to be later dropped by the ingress
STP filters, which default to DISCARD state.
The above did not matter much until commit b5ce611fd96e ("mlxsw:
spectrum: Add devlink-trap support") where we exposed the drop reason to
users.
Without this patch, the drop reason users will see is not consistent. If
a port is enslaved to a VLAN-aware bridge and a packet with an invalid
VLAN tries to ingress the bridge, it will be dropped due to ingress STP
filter. If the VLAN is later enabled and then disabled, the packet will
be dropped by the ingress VLAN filter despite the above being a
seemingly NOP operation.
Fix this by clearing all the VLAN filters during port initialization.
Adjust the test accordingly.
Fixes: b5ce611fd96e ("mlxsw: spectrum: Add devlink-trap support") Reported-by: Alex Kushnarov <alexanderk@mellanox.com> Tested-by: Alex Kushnarov <alexanderk@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The SDHCI controller on Tegra186 supports 40-bit addressing, which is
usually enough to address all of system memory. However, if the SDHCI
controller is behind an IOMMU, the address space can go beyond. This
happens on Tegra186 and later where the ARM SMMU has an input address
space of 48 bits. If the DMA API is backed by this ARM SMMU, the top-
down IOVA allocator will cause IOV addresses to be returned that the
SDHCI controller cannot access.
Unfortunately, prior to the introduction of the ->set_dma_mask() host
operation, the SDHCI core would set either a 64-bit DMA mask if the
controller claimed to support 64-bit addressing, or a 32-bit DMA mask
otherwise.
Since the full 64 bits cannot be addressed on Tegra, this had to be
worked around in commit 68481a7e1c84 ("mmc: tegra: Mark 64 bit dma
broken on Tegra186") by setting the SDHCI_QUIRK2_BROKEN_64_BIT_DMA
quirk, which effectively restricts the DMA mask to 32 bits.
One disadvantage of this is that dma_map_*() APIs will now try to use
the swiotlb to bounce DMA to addresses beyond of the controller's DMA
mask. This in turn caused degraded performance and can lead to
situations where the swiotlb buffer is exhausted, which in turn leads
to DMA transfers to fail.
With the recent introduction of the ->set_dma_mask() host operation,
this can now be properly fixed. For each generation of Tegra, the exact
supported DMA mask can be configured. This kills two birds with one
stone: it avoids the use of bounce buffers because system memory never
exceeds the addressable memory range of the SDHCI controllers on these
devices, and at the same time when an IOMMU is involved, it prevents
IOV addresses from being allocated beyond the addressible range of the
controllers.
Since the DMA mask is now properly handled, the 64-bit DMA quirk can be
removed.
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
[treding@nvidia.com: provide more background in commit message] Tested-by: Nicolin Chen <nicoleotsuka@gmail.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Cc: stable@vger.kernel.org # v4.15 + Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Russell King [Sun, 22 Sep 2019 10:26:58 +0000 (11:26 +0100)]
mmc: sdhci-of-esdhc: set DMA snooping based on DMA coherence
We must not unconditionally set the DMA snoop bit; if the DMA API is
assuming that the device is not DMA coherent, and the device snoops the
CPU caches, the device can see stale cache lines brought in by
speculative prefetch.
This leads to the device seeing stale data, potentially resulting in
corrupted data transfers. Commonly, this results in a descriptor fetch
error such as:
but can lead to other errors, and potentially direct the SDHCI
controller to read/write data to other memory locations (e.g. if a valid
descriptor is visible to the device in a stale cache line.)
Fix this by ensuring that the DMA snoop bit corresponds with the
behaviour of the DMA API. Since the driver currently only supports DT,
use of_dma_is_coherent(). Note that device_get_dma_attr() can not be
used as that risks re-introducing this bug if/when the driver is
converted to ACPI.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Russell King [Sun, 22 Sep 2019 10:26:53 +0000 (11:26 +0100)]
mmc: sdhci: improve ADMA error reporting
ADMA errors are potentially data corrupting events; although we print
the register state, we do not usefully print the ADMA descriptors.
Worse than that, we print them by referencing their virtual address
which is meaningless when the register state gives us the DMA address
of the failing descriptor.
Print the ADMA descriptors giving their DMA addresses rather than their
virtual addresses, and print them using SDHCI_DUMP() rather than DBG().
We also do not show the correct value of the interrupt status register;
the register dump shows the current value, after we have cleared the
pending interrupts we are going to service. What is more useful is to
print the interrupts that _were_ pending at the time the ADMA error was
encountered. Fix that too.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Hans Andersson [Thu, 26 Sep 2019 07:54:37 +0000 (09:54 +0200)]
net: phy: micrel: add Asym Pause workaround for KSZ9021
The Micrel KSZ9031 PHY may fail to establish a link when the Asymmetric
Pause capability is set. This issue is described in a Silicon Errata
(DS80000691D or DS80000692D), which advises to always disable the
capability.
Micrel KSZ9021 has no errata, but has the same issue with Asymmetric Pause.
This patch apply the same workaround as the one for KSZ9031.
Fixes: 3aed3e2a143c ("net: phy: micrel: add Asym Pause workaround") Signed-off-by: Hans Andersson <hans.andersson@cellavision.se> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Thu, 26 Sep 2019 02:28:19 +0000 (19:28 -0700)]
ptp: correctly disable flags on old ioctls
Commit 415606588c61 ("PTP: introduce new versions of IOCTLs",
2019-09-13) introduced new versions of the PTP ioctls which actually
validate that the flags are acceptable values.
As part of this, it cleared the flags value using a bitwise
and+negation, in an attempt to prevent the old ioctl from accidentally
enabling new features.
This is incorrect for a couple of reasons. First, it results in
accidentally preventing previously working flags on the request ioctl.
By clearing the "valid" flags, we now no longer allow setting the
enable, rising edge, or falling edge flags.
Second, if we add new additional flags in the future, they must not be
set by the old ioctl. (Since the flag wasn't checked before, we could
potentially break userspace programs which sent garbage flag data.
The correct way to resolve this is to check for and clear all but the
originally valid flags.
Create defines indicating which flags are correctly checked and
interpreted by the original ioctls. Use these to clear any bits which
will not be correctly interpreted by the original ioctls.
In the future, new flags must be added to the VALID_FLAGS macros, but
*not* to the V1_VALID_FLAGS macros. In this way, new features may be
exposed over the v2 ioctls, but without breaking previous userspace
which happened to not clear the flags value properly. The old ioctl will
continue to behave the same way, while the new ioctl gains the benefit
of using the flags fields.
Cc: Richard Cochran <richardcochran@gmail.com> Cc: Felipe Balbi <felipe.balbi@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Christopher Hall <christopher.s.hall@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Marek Vasut [Wed, 25 Sep 2019 22:08:42 +0000 (00:08 +0200)]
net: dsa: microchip: Always set regmap stride to 1
The regmap stride is set to 1 for regmap describing 8bit registers already.
However, for 16/32/64bit registers, the stride is 2/4/8 respectively. This
is not correct, as the switch protocol supports unaligned register reads
and writes and the KSZ87xx even uses such unaligned register accesses to
read e.g. MIB counter.
This patch fixes MIB counter access on KSZ87xx.
Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: George McCollister <george.mccollister@gmail.com> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Cc: Woojung Huh <woojung.huh@microchip.com> Fixes: 46558d601cb6 ("net: dsa: microchip: Initial SPI regmap support") Fixes: 255b59ad0db2 ("net: dsa: microchip: Factor out regmap config generation into common header") Reviewed-by: George McCollister <george.mccollister@gmail.com> Tested-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'linux-watchdog-5.4-rc1' of git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- addition of AST2600, i.MX7ULP and F81803 watchdog support
- removal of the w90x900 and ks8695 drivers
- ziirave_wdt improvements
- small fixes and improvements
* tag 'linux-watchdog-5.4-rc1' of git://www.linux-watchdog.org/linux-watchdog: (51 commits)
watchdog: f71808e_wdt: Add F81803 support
watchdog: qcom: remove unnecessary variable from private storage
watchdog: qcom: support pre-timeout when the bark irq is available
watchdog: imx_sc: this patch just fixes whitespaces
watchdog: apseed: Add access_cs0 option for alt-boot
watchdog: aspeed: add support for dual boot
watchdog: orion_wdt: use timer1 as a pretimeout
watchdog: Add i.MX7ULP watchdog support
dt-bindings: watchdog: Add i.MX7ULP bindings
dt-bindings: watchdog: sun4i: Add the watchdog clock
dt-bindings: watchdog: sun4i: Add the watchdog interrupts
dt-bindings: watchdog: Convert Allwinner watchdog to a schema
dt-bindings: watchdog: Add YAML schemas for the generic watchdog bindings
watchdog: aspeed: Add support for AST2600
dt-bindings: watchdog: Add ast2600 compatible
watchdog: ziirave_wdt: Update checked I2C functionality mask
watchdog: ziirave_wdt: Drop ziirave_firm_write_block_data()
watchdog: ziirave_wdt: Fix DOWNLOAD_START payload
watchdog: ziirave_wdt: Drop status polling code
watchdog: ziirave_wdt: Fix RESET_PROCESSOR payload
...
The following patchset contains Netfilter fixes for net:
1) Add NFT_CHAIN_POLICY_UNSET to replace hardcoded -1 to
specify that the chain policy is unset. The chain policy
field is actually defined as an 8-bit unsigned integer.
2) Remove always true condition reported by smatch in
chain policy check.
3) Fix element lookup on dynamic sets, from Florian Westphal.
4) Use __u8 in ebtables uapi header, from Masahiro Yamada.
5) Bogus EBUSY when removing flowtable after chain flush,
from Laura Garcia Liebana.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'drm-next-2019-09-27' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Fixes built up over the past 1.5 weeks or so, it's two weeks of
amdgpu, some core cleanups and some panfrost fixes. I also finally
figured out why my desktop was slow to do a bunch of stuff (someone
gave it an IPv6 address which can't reach anything!).
core:
- Some cleanups and fixes in the self-refresh helpers
- Some cleanups and fixes in the atomic helpers
amdgpu:
- Fix a 64 bit divide
- Prevent a memory leak in a failure case in dc
- Load proper gfx firmware on navi14 variants
- Add more navi12 and navi14 PCI ids
- Misc fixes for renoir
- Fix bandwidth issues with multiple displays on vega20
- Support for Dali
- Fix a possible oops with KFD on hawaii
- Fix for backlight level after resume on some APUs
- Other misc fixes
panfrost:
- Multiple panfrost fixes for regulator support and page fault
handling"
* tag 'drm-next-2019-09-27' of git://anongit.freedesktop.org/drm/drm: (34 commits)
drm/amd/display: prevent memory leak
drm/amdgpu/gfx10: add support for wks firmware loading
drm/amdgpu/display: include slab.h in dcn21_resource.c
drm/amdgpu/display: fix 64 bit divide
drm/panfrost: Prevent race when handling page fault
drm/panfrost: Remove NULL checks for regulator
drm/panfrost: Fix regulator_get_optional() misuse
drm: Measure Self Refresh Entry/Exit times to avoid thrashing
drm: Fix kerneldoc and remove unused struct member in self_refresh helper
drm/atomic: Rename crtc_state->pageflip_flags to async_flip
drm/atomic: Reject FLIP_ASYNC unconditionally
drm/atomic: Take the atomic toys away from X
drm/amdgpu: flag navi12 and 14 as experimental for 5.4
drm/kms: Duct-tape for mode object lifetime checks
drm/amdgpu: add navi12 pci id
drm/amdgpu: add navi14 PCI ID for work station SKU
drm/amdkfd: Swap trap temporary registers in gfx10 trap handler
drm/amd/powerplay: implement sysfs for getting dpm clock
drm/amd/display: Restore backlight brightness after system resume
drm/amd/display: Implement voltage limitation for dali
...
nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
In nfp_flower_spawn_vnic_reprs in the loop if initialization or the
allocations fail memory is leaked. Appropriate releases are added.
Fixes: b94524529741 ("nfp: flower: add per repr private data for LAG offload") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
In nfp_flower_spawn_phy_reprs, in the for loop over eth_tbl if any of
intermediate allocations or initializations fail memory is leaked.
requiered releases are added.
Fixes: b94524529741 ("nfp: flower: add per repr private data for LAG offload") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Blakey [Wed, 25 Sep 2019 15:02:35 +0000 (18:02 +0300)]
net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
This a new feature, it is preferred that it defaults to N.
We will probe the feature support from userspace before actually using it.
Fixes: 95a7233c452a ('net: openvswitch: Set OvS recirc_id from tc chain index') Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 25 Sep 2019 14:53:19 +0000 (07:53 -0700)]
vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
A user reported that vrf create fails when IPv6 is disabled at boot using
'ipv6.disable=1':
https://bugzilla.kernel.org/show_bug.cgi?id=204903
The failure is adding fib rules at create time. Add RTNL_FAMILY_IP6MR to
the check in vrf_fib_rule if ipv6_mod_enabled is disabled.
Fixes: e4a38c0c4b27 ("ipv6: add vrf table handling code for ipv6 mcast") Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Patrick Ruddy <pruddy@vyatta.att-mail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'ntb-5.4' of git://github.com/jonmason/ntb
Pull NTB updates from Jon Mason:
"A few bugfixes and support for new AMD NTB hardware"
* tag 'ntb-5.4' of git://github.com/jonmason/ntb:
NTB: fix IDT Kconfig typos/spellos
ntb_hw_amd: Add memory window support for new AMD hardware
ntb_hw_amd: Add a new NTB PCI device ID
NTB: ntb_transport: remove redundant assignment to rc
ntb_hw_switchtec: make ntb_mw_set_trans() work when addr == 0
ntb: point to right memory window index
for (( i = 0; i < 3; i++)); do
# Create 3 64M holes for latter fallocate to fail
truncate -s 192m $mnt/subv/file
xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
sync
# it's supposed to fail, and each failure will leak at least 64M
# data space
xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
rm $mnt/subv/file
sync
done
# Shouldn't fail after we removed the file
xfs_io -f -c "falloc 0 64m" $mnt/subv/file
[CAUSE]
Btrfs qgroup data reserve code allow multiple reservations to happen on
a single extent_changeset:
E.g:
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
Btrfs qgroup code has its internal tracking to make sure we don't
double-reserve in above example.
The only pattern utilizing this feature is in the main while loop of
btrfs_fallocate() function.
However btrfs_qgroup_reserve_data()'s error handling has a bug in that
on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
flag but doesn't free previously reserved bytes.
This bug has a two fold effect:
- Clearing EXTENT_QGROUP_RESERVED ranges
This is the correct behavior, but it prevents
btrfs_qgroup_check_reserved_leak() to catch the leakage as the
detector is purely EXTENT_QGROUP_RESERVED flag based.
- Leak the previously reserved data bytes.
The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
the last one fails, leaking space reserved in the previous ones.
[FIX]
Also free previously reserved data bytes when btrfs_qgroup_reserve_data
fails.
btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
[BUG]
Under the following case with qgroup enabled, if some error happened
after we have reserved delalloc space, then in error handling path, we
could cause qgroup data space leakage:
From btrfs_truncate_block() in inode.c:
ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
block_start, blocksize);
if (ret)
goto out;
[CAUSE]
In the above case, btrfs_delalloc_reserve_space() will call
btrfs_qgroup_reserve_data() and mark the io_tree range with
EXTENT_QGROUP_RESERVED flag.
In the error handling path, we have the following call stack:
btrfs_delalloc_release_space()
|- btrfs_free_reserved_data_space()
|- btrsf_qgroup_free_data()
|- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
|- qgroup_free_reserved_data(reserved=@reserved)
|- clear_record_extent_bits();
|- freed += changeset.bytes_changed;
However due to a completion bug, qgroup_free_reserved_data() will clear
EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
than the correct BTRFS_I(inode)->io_tree.
Since io_failure_tree is never marked with that flag,
btrfs_qgroup_free_data() will not free any data reserved space at all,
causing a leakage.
This type of error handling can only be triggered by errors outside of
qgroup code. So EDQUOT error from qgroup can't trigger it.
[FIX]
Fix the wrong target io_tree.
Reported-by: Josef Bacik <josef@toxicpanda.com> Fixes: bc42bda22345 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The bug is caused by the race between timeout handle and completion for
flush request.
When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.
However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.
We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: stable@vger.kernel.org # v4.18+ Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue Signed-off-by: Jens Axboe <axboe@kernel.dk>
David S. Miller [Fri, 27 Sep 2019 10:13:55 +0000 (12:13 +0200)]
Merge branch 'qdisc-destroy'
Vlad Buslov says:
====================
Fix Qdisc destroy issues caused by adding fine-grained locking to filter API
TC filter API unlocking introduced several new fine-grained locks. The
change caused sleeping-while-atomic BUGs in several Qdiscs that call cls
APIs which need to obtain new mutex while holding sch tree spinlock. This
series fixes affected Qdiscs by ensuring that cls API that became sleeping
is only called outside of sch tree lock critical section.
====================
Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
Recent changes that removed rtnl dependency from rules update path of tc
also made tcf_block_put() function sleeping. This function is called from
ops->destroy() of several Qdisc implementations, which in turn is called by
qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
which results sleeping-while-atomic BUG.
Steps to reproduce for sfb:
tc qdisc add dev ens1f0 handle 1: root sfb
tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10
tc qdisc change dev ens1f0 root handle 1: sfb
In sfb_change() function use qdisc_purge_queue() instead of
qdisc_tree_flush_backlog() to properly reset old child Qdisc and save
pointer to it into local temporary variable. Put reference to Qdisc after
sch tree lock is released in order not to call potentially sleeping cls API
in atomic section. This is safe to do because Qdisc has already been reset
by qdisc_purge_queue() inside sch tree lock critical section.
Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: sched: multiq: don't call qdisc_put() while holding tree lock
Recent changes that removed rtnl dependency from rules update path of tc
also made tcf_block_put() function sleeping. This function is called from
ops->destroy() of several Qdisc implementations, which in turn is called by
qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
which results sleeping-while-atomic BUG.
Rearrange locking in multiq_tune() in following ways:
- In loop that removes Qdiscs from disabled queues, call
qdisc_purge_queue() instead of qdisc_tree_flush_backlog() on Qdisc that
is being destroyed. Save the Qdisc in temporary allocated array and call
qdisc_put() on each element of the array after sch tree lock is released.
This is safe to do because Qdiscs have already been reset by
qdisc_purge_queue() inside sch tree lock critical section.
- Do the same change for second loop that initializes Qdiscs for newly
enabled queues in multiq_tune() function. Since sch tree lock is obtained
and released on each iteration of this loop, just call qdisc_put()
directly outside of critical section. Don't verify that old Qdisc is not
noop_qdisc before releasing reference to it because such check is already
performed by qdisc_put*() functions.
Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: sched: sch_htb: don't call qdisc_put() while holding tree lock
Recent changes that removed rtnl dependency from rules update path of tc
also made tcf_block_put() function sleeping. This function is called from
ops->destroy() of several Qdisc implementations, which in turn is called by
qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
which results sleeping-while-atomic BUG.
Steps to reproduce for htb:
tc qdisc add dev ens1f0 root handle 1: htb default 12
tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps
tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10
tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps
In htb_change_class() function save parent->leaf.q to local temporary
variable and put reference to it after sch tree lock is released in order
not to call potentially sleeping cls API in atomic section. This is safe to
do because Qdisc has already been reset by qdisc_purge_queue() inside sch
tree lock critical section.
Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In rds_bind(), laddr_check is called without checking if it is NULL or
not. And rs_transport should be reset if rds_add_bound() fails.
Fixes: c5c1a030a7db ("net/rds: An rds_sock is added too early to the hash table") Reported-by: syzbot+fae39afd2101a17ec624@syzkaller.appspotmail.com Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>