This is CVE-2017-2583. On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.
The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.
Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.
return_unused_surplus_pages() decrements the global reservation count,
and frees any unused surplus pages that were backing the reservation.
Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in
return_unused_surplus_pages()") added a call to cond_resched_lock in the
loop freeing the pages.
As a result, the hugetlb_lock could be dropped, and someone else could
use the pages that will be freed in subsequent iterations of the loop.
This could result in inconsistent global hugetlb page state, application
api failures (such as mmap) failures or application crashes.
When dropping the lock in return_unused_surplus_pages, make sure that
the global reservation count (resv_huge_pages) remains sufficiently
large to prevent someone else from claiming pages about to be freed.
Analyzed by Paul Cassella.
Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()") Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Paul Cassella <cassella@cray.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?
The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB. This is not the right way for
fsdlm.
The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.
The following diagram briefly illustrates how this crash happens:
RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)
RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()
/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/
if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
return; * to invalid the LVB here.
*/
The 2nd round:
Node 1 Node2
RSB1(become master from recovery)
ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */
The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.
Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Nothing in this minimal script seems to require bash. We often run these
tests on embedded devices where the only shell available is the busybox
ash. Use sh instead.
As a part of memory initialisation the architecture passes an array to
free_area_init_nodes() which specifies the max PFN of each memory zone.
This array is not necessarily monotonic (due to unused zones) so this
array is parsed to build monotonic lists of the min and max PFN for each
zone. ZONE_MOVABLE is special cased here as its limits are managed by
the mm subsystem rather than the architecture. Unfortunately, this
special casing is broken when ZONE_MOVABLE is the not the last zone in
the zone list. The core of the issue is:
if (i == ZONE_MOVABLE)
continue;
arch_zone_lowest_possible_pfn[i] =
arch_zone_highest_possible_pfn[i-1];
As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will
be set to zero. This patch fixes this bug by adding explicitly tracking
where the next zone should start rather than relying on the contents
arch_zone_highest_possible_pfn[].
Thie is low priority. To get bitten by this you need to enable a zone
that appears after ZONE_MOVABLE in the zone_type enum. As far as I can
tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled,
so I can't see this affecting too many people.
I only noticed this because I've been fiddling with ZONE_DEVICE on
powerpc and 4.6 broke my test kernel. This bug, in conjunction with the
changes in Taku Izumi's kernelcore=mirror patch (d91749c1dda71) and
powerpc being the odd architecture which initialises max_zone_pfn[] to
~0ul instead of 0 caused all of system memory to be placed into
ZONE_DEVICE at boot, followed a panic since device memory cannot be used
for kernel allocations. I've already submitted a patch to fix the
powerpc specific bits, but I figured this should be fixed too.
Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
I am getting the following warning when I build kernel 4.9-git on my
PowerBook G4 with a 32-bit PPC processor:
AS arch/powerpc/kernel/misc_32.o
arch/powerpc/kernel/misc_32.S:299:7: warning: "CONFIG_FSL_BOOKE" is not defined [-Wundef]
This problem is evident after commit 989cea5c14be ("kbuild: prevent
lib-ksyms.o rebuilds"); however, this change in kbuild only exposes an
error that has been in the code since 2005 when this source file was
created. That was with commit 9994a33865f4 ("powerpc: Introduce
entry_{32,64}.S, misc_{32,64}.S, systbl.S").
The offending line does not make a lot of sense. This error does not
seem to cause any errors in the executable, thus I am not recommending
that it be applied to any stable versions.
Thanks to Nicholas Piggin for suggesting this solution.
Fixes: 9994a33865f4 ("powerpc: Introduce entry_{32,64}.S, misc_{32,64}.S, systbl.S") Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The GRO fast path caches the frag0 address. This address becomes
invalid if frag0 is modified by pskb_may_pull or its variants.
So whenever that happens we must disable the frag0 optimization.
This is usually done through the combination of gro_header_hard
and gro_header_slow, however, the IPv6 extension header path did
the pulling directly and would continue to use the GRO fast path
incorrectly.
This patch fixes it by disabling the fast path when we enter the
IPv6 extension header path.
Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address") Reported-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
On 32bit arches, (skb->end - skb->data) is not 'unsigned int',
so we shall use min_t() instead of min() to avoid a compiler error.
Fixes: 1272ce87fa01 ("gro: Enter slow-path if there is no tailroom") Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The GRO path has a fast-path where we avoid calling pskb_may_pull
and pskb_expand by directly accessing frag0. However, this should
only be done if we have enough tailroom in the skb as otherwise
we'll have to expand it later anyway.
This patch adds the check by capping frag0_len with the skb tailroom.
Fixes: cb18978cbf45 ("gro: Open-code final pskb_may_pull") Reported-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When a system receives a Query, it does not respond immediately.
Instead, it delays its response by a random amount of time, bounded
by the Max Resp Time value derived from the Max Resp Code in the
received Query message. A system may receive a variety of Queries on
different interfaces and of different kinds (e.g., General Queries,
Group-Specific Queries, and Group-and-Source-Specific Queries), each
of which may require its own delayed response.
Before scheduling a response to a Query, the system must first
consider previously scheduled pending responses and in many cases
schedule a combined response. Therefore, the system must be able to
maintain the following state:
o A timer per interface for scheduling responses to General Queries.
o A per-group and interface timer for scheduling responses to Group-
Specific and Group-and-Source-Specific Queries.
o A per-group and interface list of sources to be reported in the
response to a Group-and-Source-Specific Query.
When a new Query with the Router-Alert option arrives on an
interface, provided the system has state to report, a delay for a
response is randomly selected in the range (0, [Max Resp Time]) where
Max Resp Time is derived from Max Resp Code in the received Query
message. The following rules are then used to determine if a Report
needs to be scheduled and the type of Report to schedule. The rules
are considered in order and only the first matching rule is applied.
1. If there is a pending response to a previous General Query
scheduled sooner than the selected delay, no additional response
needs to be scheduled.
2. If the received Query is a General Query, the interface timer is
used to schedule a response to the General Query after the
selected delay. Any previously pending response to a General
Query is canceled.
--8<--
Currently the timer is rearmed with new random expiration time for
every incoming query regardless of possibly already pending report.
Which is not aligned with the above RFE.
It also might happen that higher rate of incoming queries can
postpone the report after the expiration time of the first query
causing group membership loss.
Now the per interface general query timer is rearmed only
when there is no pending report already scheduled on that interface or
the newly selected expiration time is before the already pending
scheduled report.
Signed-off-by: Michal Tesar <mtesar@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Final nlmsg_len field update must reflect inserted net_dm_drop_point
data.
This patch depends on previous patch:
"drop_monitor: add missing call to genlmsg_end"
Signed-off-by: Reiter Wolfgang <wr0112358@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Update nlmsg_len field with genlmsg_end to enable userspace processing
using nlmsg_next helper. Also adds error handling.
Signed-off-by: Reiter Wolfgang <wr0112358@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
There is currently a small window during which the network device registered by
stmmac can be made visible, yet all resources, including and clock and MDIO bus
have not had a chance to be set up, this can lead to the following error to
occur:
[ 473.919358] stmmaceth 0000:01:00.0 (unnamed net_device) (uninitialized):
stmmac_dvr_probe: warning: cannot get CSR clock
[ 473.919382] stmmaceth 0000:01:00.0: no reset control found
[ 473.919412] stmmac - user ID: 0x10, Synopsys ID: 0x42
[ 473.919429] stmmaceth 0000:01:00.0: DMA HW capability register supported
[ 473.919436] stmmaceth 0000:01:00.0: RX Checksum Offload Engine supported
[ 473.919443] stmmaceth 0000:01:00.0: TX Checksum insertion supported
[ 473.919451] stmmaceth 0000:01:00.0 (unnamed net_device) (uninitialized):
Enable RX Mitigation via HW Watchdog Timer
[ 473.921395] libphy: PHY stmmac-1:00 not found
[ 473.921417] stmmaceth 0000:01:00.0 eth0: Could not attach to PHY
[ 473.921427] stmmaceth 0000:01:00.0 eth0: stmmac_open: Cannot attach to
PHY (error: -19)
[ 473.959710] libphy: stmmac: probed
[ 473.959724] stmmaceth 0000:01:00.0 eth0: PHY ID 01410cc2 at 0 IRQ POLL
(stmmac-1:00) active
[ 473.959728] stmmaceth 0000:01:00.0 eth0: PHY ID 01410cc2 at 1 IRQ POLL
(stmmac-1:01)
[ 473.959731] stmmaceth 0000:01:00.0 eth0: PHY ID 01410cc2 at 2 IRQ POLL
(stmmac-1:02)
[ 473.959734] stmmaceth 0000:01:00.0 eth0: PHY ID 01410cc2 at 3 IRQ POLL
(stmmac-1:03)
Fix this by making sure that register_netdev() is the last thing being done,
which guarantees that the clock and the MDIO bus are available.
Fixes: 4bfcbd7abce2 ("stmmac: Move the mdio_register/_unregister in probe/remove") Reported-by: Kweh, Hock Leong <hock.leong.kweh@intel.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Shahar reported a soft lockup in tc_classify(), where we run into an
endless loop when walking the classifier chain due to tp->next == tp
which is a state we should never run into. The issue only seems to
trigger under load in the tc control path.
What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.
This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
after we loaded and found the requested action, we need to redo the
whole request so we don't race against others. While we had to unlock
rtnl in that time, thread B's request was processed next on that CPU.
Thread B added a new tp instance successfully to the classifier chain.
When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
and destroying its tp instance which never got linked, we goto replay
and redo A's request.
This time when walking the classifier chain in tc_ctl_tfilter() for
checking for existing tp instances we had a priority match and found
the tp instance that was created and linked by thread B. Now calling
again into tp->ops->change() with that tp was successful and returned
without error.
tp_created was never cleared in the second round, thus kernel thinks
that we need to link it into the classifier chain (once again). tp and
*back point to the same object due to the match we had earlier on. Thus
for thread B's already public tp, we reset tp->next to tp itself and
link it into the chain, which eventually causes the mentioned endless
loop in tc_classify() once a packet hits the data path.
Fix is to clear tp_created at the beginning of each request, also when
we replay it. On the paths that can cause -EAGAIN we already destroy
the original tp instance we had and on replay we really need to start
from scratch. It seems that this issue was first introduced in commit 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
and avoid kernel panic when we use cls_cgroup").
Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup") Reported-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If we can't allocate the resources in gigaset_initdriver() then we
should return -ENOMEM instead of zero.
Fixes: 2869b23e4b95 ("[PATCH] drivers/isdn/gigaset: new M101 driver (v2)") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Hyper-V (and Azure) support using NVGRE which requires some extra space
for encapsulation headers. Because of this the largest allowed TSO
packet is reduced.
For older releases, hard code a fixed reduced value. For next release,
there is a better solution which uses result of host offload
negotiation.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
ep->mult is supposed to be set to Isochronous and
Interrupt Endapoint's multiplier value. This value
is computed from different places depending on the
link speed.
If we're dealing with HighSpeed, then it's part of
bits [12:11] of wMaxPacketSize. This case wasn't
taken into consideration before.
While at that, also make sure the ep->mult defaults
to one so drivers can use it unconditionally and
assume they'll never multiply ep->maxpacket to zero.
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When a disfunctional timer, e.g. dummy timer, is installed, the tick core
tries to setup the broadcast timer.
If no broadcast device is installed, the kernel crashes with a NULL pointer
dereference in tick_broadcast_setup_oneshot() because the function has no
sanity check.
Reported-by: Mason <slash.tmp@free.fr> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Richard Cochran <rcochran@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org>, Cc: Sebastian Frias <sf84@laposte.net> Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com> Cc: Robin Murphy <robin.murphy@arm.com> Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr Signed-off-by: Jiri Slaby <jslaby@suse.cz>
cpmac_start_xmit() used the max() macro on skb->len (an unsigned int)
and ETH_ZLEN (a signed int literal). This led to the following compiler
warning:
In file included from include/linux/list.h:8:0,
from include/linux/module.h:9,
from drivers/net/ethernet/ti/cpmac.c:19:
drivers/net/ethernet/ti/cpmac.c: In function 'cpmac_start_xmit':
include/linux/kernel.h:748:17: warning: comparison of distinct pointer
types lacks a cast
(void) (&_max1 == &_max2); \
^
drivers/net/ethernet/ti/cpmac.c:560:8: note: in expansion of macro 'max'
len = max(skb->len, ETH_ZLEN);
^
On top of this, it assigned the result of the max() macro to a signed
integer whilst all further uses of it result in it being cast to varying
widths of unsigned integer.
Fix this up by using max_t to ensure the comparison is performed as
unsigned integers, and for consistency change the type of the len
variable to unsigned int.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The current_user_ns() macro currently returns &init_user_ns when user
namespaces are disabled, and that causes several warnings when building
with gcc-6.0 in code that compares the result of the macro to
&init_user_ns itself:
fs/xfs/xfs_ioctl.c: In function 'xfs_ioctl_setattr_check_projid':
fs/xfs/xfs_ioctl.c:1249:22: error: self-comparison always evaluates to true [-Werror=tautological-compare]
if (current_user_ns() == &init_user_ns)
This is a legitimate warning in principle, but here it isn't really
helpful, so I'm reprasing the definition in a way that shuts up the
warning. Apparently gcc only warns when comparing identical literals,
but it can figure out that the result of an inline function can be
identical to a constant expression in order to optimize a condition yet
not warn about the fact that the condition is known at compile time.
This is exactly what we want here, and it looks reasonable because we
generally prefer inline functions over macros anyway.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: David Howells <dhowells@redhat.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This iscsit_tpg_add_portal_group() function is only called from
lio_target_tiqn_addtpg(). Both functions free the "tpg" pointer on
error so it's a double free bug. The memory is allocated in the caller
so it should be freed in the caller and not here.
Fixes: e48354ce078c ("iscsi-target: Add iSCSI fabric support for target v4.1") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Disseldorp <ddiss@suse.de>
[ bvanassche: Added "Fix" at start of patch title ] Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
gcc-7 notices that the condition in mvs_94xx_command_active looks
suspicious:
drivers/scsi/mvsas/mv_94xx.c: In function 'mvs_94xx_command_active':
drivers/scsi/mvsas/mv_94xx.c:671:15: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
This was introduced when the mv_printk() statement got added, and leads
to the condition being ignored. This is probably harmless.
Changing '&&' to '&' makes the code look reasonable, as we check the
command bit before setting and printing it.
Fixes: a4632aae8b66 ("[SCSI] mvsas: Add new macros and functions") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The generic command buffer entry is 128 bits (16 bytes), so the offset
of tail and head pointer should be 16 bytes aligned and increased with
0x10 per command.
When cmd buf is full, head = (tail + 0x10) % CMD_BUFFER_SIZE.
So when left space of cmd buf should be able to store only two
command, we should be issued one COMPLETE_WAIT additionally to wait
all older commands completed. Then the left space should be increased
after IOMMU fetching from cmd buf.
So left check value should be left <= 0x20 (two commands).
Fix bug https://bugzilla.kernel.org/show_bug.cgi?id=188561. Function
wm831x_clkout_is_prepared() returns "true" when it fails to read
CLOCK_CONTROL_1. "true" means the device is already prepared. So
return "true" on the read failure seems improper.
Signed-off-by: Pan Bian <bianpan2016@163.com> Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Fixes: f05259a6ffa4 ("clk: wm831x: Add initial WM831x clock driver") Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Fix overflows seen when writing into fan speed limit attributes.
Also fix crash due to division by zero, seen when certain very
large values (such as 2147483648, or 0x80000000) are written
into fan speed limit attributes.
Fixes: 594fbe713bf60 ("Add support for GMT G762/G763 PWM fan controllers") Cc: Arnaud Ebalard <arno@natisbad.org> Reviewed-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If CONFIG_ETRAX_AXISFLASHMAP is not configured, the flash rescue image
object file is empty. With recent versions of binutils, this results
in the following build error.
cris-linux-objcopy: error:
the input file 'arch/cris/boot/rescue/rescue.o' has no sections
This is seen, for example, when trying to build cris:allnoconfig
with recently generated toolchains.
Since it does not make sense to build a flash rescue image if there is
no flash, only build it if CONFIG_ETRAX_AXISFLASHMAP is enabled.
commit 0416e494ce7d ("usb: dwc3: ep0: correct cache
sync issue in case of ep0_bounced") introduced a bug
where we would leak DMA resources which would cause
us to starve the system of them resulting in failing
DMA transfers.
Fix the bug by making sure that we always unmap EP0
requests since those are *always* mapped.
Fixes: 0416e494ce7d ("usb: dwc3: ep0: correct cache
sync issue in case of ep0_bounced") Tested-by: Tomasz Medrek <tomaszx.medrek@intel.com> Reported-by: Janusz Dziedzic <januszx.dziedzic@linux.intel.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The oversampling ratio is controlled using the oversampling pins,
OS [2:0] with OS2 being the MSB control bit, and OS0 the LSB control
bit.
The gpio connected to the OS2 pin is not being set correctly, only OS0
and OS1 pins are being set. Fix the typo to allow proper control of the
oversampling pins.
Signed-off-by: Eva Rachel Retuya <eraretuya@gmail.com> Fixes: b9618c0 ("staging: IIO: ADC: New driver for AD7606/AD7606-6/AD7606-4") Acked-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Bind to the interface, but do not register any ports, after having
downloaded the firmware. The device will still disconnect and
re-enumerate, but this way we avoid an error messages from being logged
as part of the process:
io_ti: probe of 1-1.3:1.0 failed with error -5
Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Make sure to drop the references taken by of_parse_phandle() and
bus_find_device() before returning from am335x_get_phy_control().
Note that there is no guarantee that the devres-managed struct
phy_control will be valid for the lifetime of the sibling phy device
regardless of this change.
Fixes: 3bb869c8b3f1 ("usb: phy: Add AM335x PHY driver") Acked-by: Bin Liu <b-liu@ti.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Function klsi_105_open() calls usb_control_msg() (to "enable read") and
checks its return value. When the return value is unexpected, it only
assigns the error code to the return variable retval, but does not
terminate the exception path. This patch fixes the bug by inserting
"goto err_generic_close;" when the call to usb_control_msg() fails.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Pan Bian <bianpan2016@163.com>
[johan: rebase on prerequisite fix and amend commit message] Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The function returns -EINVAL even if it builds the stream properly.
The bogus error code sneaked in during the code refactoring, but it
wasn't noticed until now since the returned error code itself is
ignored in anyway. Kill it here, but there is no behavior change by
this patch, obviously.
drivers/usb/core/hub.c:107: warning: ‘hub_usb3_port_prepare_disable’ declared inline after being called
drivers/usb/core/hub.c:107: warning: previous declaration of ‘hub_usb3_port_prepare_disable’ was here
To fix this, move hub_port_disable() after
hub_usb3_port_prepare_disable(), and adjust forward declarations.
A static usb-serial-driver structure that is used to initialise the
interrupt URB was modified during probe depending on the currently
probed device type, something which could break a parallel probe of a
device of a different type.
Fix this up by overriding the default completion callback for MCS7715
devices in attach() instead. We may want to use two usb-serial driver
instances for the two types later.
Fixes: fb088e335d78 ("USB: serial: add support for serial port on the moschip 7715") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Do not submit the interrupt URB until after the parport has been
successfully registered to avoid another use-after-free in the
completion handler when accessing the freed parport private data in case
of a racing completion.
Fixes: b69578df7e98 ("USB: usbserial: mos7720: add support for parallel port on moschip 7715") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The interrupt URB was submitted on probe but never stopped on probe
errors. This can lead to use-after-free issues in the completion
handler when accessing the freed usb-serial struct:
Unable to handle kernel paging request at virtual address 6b6b6be7
...
[<bf052e70>] (mos7715_interrupt_callback [mos7720]) from [<c052a894>] (__usb_hcd_giveback_urb+0x80/0x140)
[<c052a894>] (__usb_hcd_giveback_urb) from [<c052a9a4>] (usb_hcd_giveback_urb+0x50/0x138)
[<c052a9a4>] (usb_hcd_giveback_urb) from [<c0550684>] (musb_giveback+0xc8/0x1cc)
Fixes: b69578df7e98 ("USB: usbserial: mos7720: add support for parallel port on moschip 7715") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
In case a device is left in "boot-mode" we must not register any port
devices in order to avoid a NULL-pointer dereference on open due to
missing endpoints. This could be used by a malicious device to trigger
an OOPS:
Unable to handle kernel NULL pointer dereference at virtual address 00000030
...
[<bf0caa84>] (edge_open [io_ti]) from [<bf0b0118>] (serial_port_activate+0x68/0x98 [usbserial])
[<bf0b0118>] (serial_port_activate [usbserial]) from [<c0470ca4>] (tty_port_open+0x9c/0xe8)
[<c0470ca4>] (tty_port_open) from [<bf0b0da0>] (serial_open+0x48/0x6c [usbserial])
[<bf0b0da0>] (serial_open [usbserial]) from [<c0469178>] (tty_open+0xcc/0x5cc)
Check for the expected endpoints in attach() and fail loudly if not
present.
Note that failing to do this appears to be benign since da280e348866
("USB: keyspan_pda: clean up write-urb busy handling") which prevents a
NULL-pointer dereference in write() by never marking a non-existent
write-urb as free.
The write URB was being killed using the synchronous interface while
holding a spin lock in close().
Simply drop the lock and busy-flag update, something which would have
been taken care of by the completion handler if the URB was in flight.
Fixes: f7a33e608d9a ("USB: serial: add quatech2 usb to serial driver") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If a device is unplugged and replugged during Sx system suspend
some Intel xHC hosts will overwrite the CAS (Cold attach status) flag
and no device connection is noticed in resume.
A device in this state can be identified in resume if its link state
is in polling or compliance mode, and the current connect status is 0.
A device in this state needs to be warm reset.
Intel 100/c230 series PCH specification update Doc #332692-006 Errata #8
Observed on Cherryview and Apollolake as they go into compliance mode
if LFPS times out during polling, and re-plugged devices are not
discovered at resume.
By convention (according to doc) if function does not provide
get_alt() callback composite framework should assume that it has only
altsetting 0 and should respond with error if host tries to set
other one.
After commit dd4dff8b035f ("USB: composite: Fix bug: should test
set_alt function pointer before use it")
we started checking set_alt() callback instead of get_alt().
This check is useless as we check if set_alt() is set inside
usb_add_function() and fail if it's NULL.
Let's fix this check and move comment about why we check the get
method instead of set a little bit closer to prevent future false
fixes.
Fixes: dd4dff8b035f ("USB: composite: Fix bug: should test set_alt function pointer before use it") Signed-off-by: Krzysztof Opasiak <k.opasiak@samsung.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The stop_activity() routine in dummy-hcd is supposed to unlink all
active requests for every endpoint, among other things. But it
doesn't handle ep0. As a result, fuzz testing can generate a WARNING
like the following:
This patch fixes the problem by iterating over all the endpoints in
the driver's ep array instead of iterating over the gadget's ep_list,
which explicitly leaves out ep0.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When checking a new device's descriptors, the USB core does not check
for duplicate endpoint addresses. This can cause a problem when the
sysfs files for those endpoints are created; trying to create multiple
files with the same name will provoke a WARNING:
WARNING: CPU: 2 PID: 865 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x8a/0xa0
sysfs: cannot create duplicate filename
'/devices/platform/dummy_hcd.0/usb2/2-1/2-1:64.0/ep_05'
Kernel panic - not syncing: panic_on_warn set ...
Andrey Konovalov's fuzz testing of gadgetfs showed that we should
improve the driver's checks for valid configuration descriptors passed
in by the user. In particular, the driver needs to verify that the
wTotalLength value in the descriptor is not too short (smaller
than USB_DT_CONFIG_SIZE). And the check for whether wTotalLength is
too large has to be changed, because the driver assumes there is
always enough room remaining in the buffer to hold a device descriptor
(at least USB_DT_DEVICE_SIZE bytes).
This patch adds the additional check and fixes the existing check. It
may do a little more than strictly necessary, but one extra check
won't hurt.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> CC: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The cause of the bug is subtle. The dev_config() routine gets called
twice by the fuzzer. The first time, the user data contains both a
full-speed configuration descriptor and a high-speed config
descriptor, causing dev->hs_config to be set. But it also contains an
invalid device descriptor, so the buffer containing the descriptors is
deallocated and dev_config() returns an error.
The second time dev_config() is called, the user data contains only a
full-speed config descriptor. But dev->hs_config still has the stale
pointer remaining from the first call, causing the routine to think
that there is a valid high-speed config. Later on, when the driver
dereferences the stale pointer to copy that descriptor, we get a
use-after-free access.
The fix is simple: Clear dev->hs_config if the passed-in data does not
contain a high-speed config descriptor.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Indeed, there is a comment saying that the value of len is restricted
to a 16-bit integer, but the code doesn't actually do this.
This patch fixes the warning. It replaces the comment with a
computation that forces the amount of data copied from the user in
ep0_write() to be no larger than the wLength size for the control
transfer, which is a 16-bit quantity.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Andrey Konovalov reported that we were not properly checking the upper
limit before of a device configuration size before calling
memdup_user(), which could cause some problems.
So set the upper limit to PAGE_SIZE * 4, which should be good enough for
all devices.
Similarly to the aemif clock - this screws up the linked list of clock
children. Create a separate clock for mdio inheriting the rate from
emac_clk.
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
[nsekhar@ti.com: add a comment over mdio_clk to explaing its existence +
commit headline updates] Signed-off-by: Sekhar Nori <nsekhar@ti.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Although the old quirk table showed ASUS X71SL with ALC663 codec being
compatible with asus-mode3 fixup, the bugzilla reporter explained that
asus-model8 fits better for the dual headphone controls. So be it.
ASUS ROG Ranger VIII with ALC1150 codec requires the extra GPIO pin to
up for the front panel. Just use the existing fixup for setting up
the GPIO pins.
The device controller is the same but it has different PCI ID. Add this new
ID to the driver's list of supported IDs.
Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Matt Fleming reported seeing crashes when enabling and disabling
function profiling which uses function graph tracer. Later Namhyung Kim
hit a similar issue and he found that the issue was due to the jmp to
ftrace_stub in ftrace_graph_call was only two bytes, and when it was
changed to jump to the tracing code, it overwrote the ftrace_stub that
was after it.
Masami Hiramatsu bisected this down to a binutils change:
This patch adds -mshared option to x86 ELF assembler. By default,
assembler will optimize out non-PLT relocations against defined non-weak
global branch targets with default visibility. The -mshared option tells
the assembler to generate code which may go into a shared library
where all non-weak global branch targets with default visibility can
be preempted. The resulting code is slightly bigger. This option
only affects the handling of branch instructions.
Declaring ftrace_stub as a weak call prevents gas from using two byte
jumps to it, which would be converted to a jump to the function graph
code.
Link: http://lkml.kernel.org/r/20160516230035.1dbae571@gandalf.local.home Reported-by: Matt Fleming <matt@codeblueprint.co.uk> Reported-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Matt Fleming <matt@codeblueprint.co.uk> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Both damn things interpret userland pointers embedded into the payload;
worse, they are actually traversing those. Leaving aside the bad
API design, this is very much _not_ safe to call with KERNEL_DS.
Bail out early if that happens.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Currently it is impossible to edit the value of a config symbol with a
prompt longer than (terminal width - 2) characters. dialog_inputbox()
calculates a negative x-offset for the input window and newwin() fails
as this is invalid. It also doesn't check for this failure, so it
busy-loops calling wgetch(NULL) which immediately returns -1.
The additions in the offset calculations also don't match the intended
size of the window.
Limit the window size and calculate the offset similarly to
show_scroll_win().
Fixes: 692d97c380c6 ("kconfig: new configuration interface (nconfig)") Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
PowerPC's "cmp" instruction has four operands. Normally people write
"cmpw" or "cmpd" for the second cmp operand 0 or 1. But, frequently
people forget, and write "cmp" with just three operands.
With older binutils this is silently accepted as if this was "cmpw",
while often "cmpd" is wanted. With newer binutils GAS will complain
about this for 64-bit code. For 32-bit code it still silently assumes
"cmpw" is what is meant.
In this instance the code comes directly from ISA v2.07, including the
cmp, but cmpd is correct. Backport to stable so that new toolchains can
build old kernels.
Fixes: 948cf67c4726 ("powerpc: Add NAP mode support on Power7 in HV mode") Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
GCC 5 generates different code for this bootwrapper null check that
causes the PS3 to hang very early in its bootup. This check is of
limited value, so just get rid of it.
What matters when deciding if we should make a page uptodate is
not how much we _wanted_ to copy, but how much we actually have
copied. As it is, on architectures that do not zero tail on
short copy we can leave uninitialized data in page marked uptodate.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
After sending an authorizer (ceph_x_authorize_a + ceph_x_authorize_b),
the client gets back a ceph_x_authorize_reply, which it is supposed to
verify to ensure the authenticity and protect against replay attacks.
The code for doing this is there (ceph_x_verify_authorizer_reply(),
ceph_auth_verify_authorizer_reply() + plumbing), but it is never
invoked by the the messenger.
AFAICT this goes back to 2009, when ceph authentication protocols
support was added to the kernel client in 4e7a5dcd1bba ("ceph:
negotiate authentication protocol; implement AUTH_NONE protocol").
The second param of ceph_connection_operations::verify_authorizer_reply
is unused all the way down. Pass 0 to facilitate backporting, and kill
it in the next commit.
One some systems, the firmware does not allow certain PCI devices to be put
in deep D-states. This can cause problems for wakeup signalling, if the
device does not support PME# in the deepest allowed suspend state. For
example, Pierre reports that on his system, ACPI does not permit his xHCI
host controller to go into D3 during runtime suspend -- but D3 is the only
state in which the controller can generate PME# signals. As a result, the
controller goes into runtime suspend but never wakes up, so it doesn't work
properly. USB devices plugged into the controller are never detected.
If the device relies on PME# for wakeup signals but is not capable of
generating PME# in the target state, the PCI core should accurately report
that it cannot do wakeup from runtime suspend. This patch modifies the
pci_dev_run_wake() routine to add this check.
Reported-by: Pierre de Villemereuil <flyos@mailoo.org> Tested-by: Pierre de Villemereuil <flyos@mailoo.org> Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> CC: Lukas Wunner <lukas@wunner.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The array ib_mad_mgmt_class_table.method_table has MAX_MGMT_CLASS
(80) elements. Hence compare the array index with that value instead
of with IB_MGMT_MAX_METHODS (128). This patch avoids that Coverity
reports the following:
Overrunning array class->method_table of 80 8-byte elements at element index 127 (byte offset 1016) using index convert_mgmt_class(mad_hdr->mgmt_class) (which evaluates to 127).
Fixes: commit b7ab0b19a85f ("IB/mad: Verify mgmt class in received MADs") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Sean Hefty <sean.hefty@intel.com> Reviewed-by: Hal Rosenstock <hal@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
With new binutils, gcc may get smart with its optimization and change a jmp
from a 5 byte jump to a 2 byte one even though it was jumping to a global
function. But that global function existed within a 2 byte radius, and gcc
was able to optimize it. Unfortunately, that jump was also being modified
when function graph tracing begins. Since ftrace expected that jump to be 5
bytes, but it was only two, it overwrote code after the jump, causing a
crash.
This was fixed for x86_64 with commit 8329e818f149, with the same subject as
this commit, but nothing was done for x86_32.
Fixes: d61f82d06672 ("ftrace: use dynamic patching for updating mcount calls") Reported-by: Colin Ian King <colin.king@canonical.com> Tested-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When L2 exits to L0 due to "exception or NMI", software exceptions
(#BP and #OF) for which L1 has requested an intercept should be
handled by L1 rather than L0. Previously, only hardware exceptions
were forwarded to L1.
Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Current implementation employ 16bit counter of active stripes in lower
bits of bio->bi_phys_segments. If request is big enough to overflow
this counter bio will be completed and freed too early.
Fortunately this not happens in default configuration because several
other limits prevent that: stripe_cache_size * nr_disks effectively
limits count of active stripes. And small max_sectors_kb at lower
disks prevent that during normal read/write operations.
Overflow easily happens in discard if it's enabled by module parameter
"devices_handle_discard_safely" and stripe_cache_size is set big enough.
This patch limits requests size with 256Mb - 8Kb to prevent overflows.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Shaohua Li <shli@kernel.org> Cc: Neil Brown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The buffer for iucv_message_receive() needs to be below 2 GB. In
__iucv_message_receive(), the buffer address is casted to an u32, which
would result in either memory corruption or an addressing exception when
using addresses >= 2 GB.
Fix this by using GFP_DMA for the buffer allocation.
A race between scanning and fc_remote_port_delete() may result in a
permanent stop if the device gets blocked before scsi_sysfs_add_sdev()
and unblocked after. The reason is that blocking a device sets both the
SDEV_BLOCKED state and the QUEUE_FLAG_STOPPED. However,
scsi_sysfs_add_sdev() unconditionally sets SDEV_RUNNING which causes the
device to be ignored by scsi_target_unblock() and thus never have its
QUEUE_FLAG_STOPPED cleared leading to a device which is apparently
running but has a stopped queue.
We actually have two places where SDEV_RUNNING is set: once in
scsi_add_lun() which respects the blocked flag and once in
scsi_sysfs_add_sdev() which doesn't. Since the second set is entirely
spurious, simply remove it to fix the problem.
Reported-by: Zengxi Chen <chenzengxi@huawei.com> Signed-off-by: Wei Fang <fangwei1@huawei.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
It is unavoidable that zfcp_scsi_queuecommand() has to finish requests
with DID_IMM_RETRY (like fc_remote_port_chkready()) during the time
window when zfcp detected an unavailable rport but
fc_remote_port_delete(), which is asynchronous via
zfcp_scsi_schedule_rport_block(), has not yet blocked the rport.
However, for the case when the rport becomes available again, we should
prevent unblocking the rport too early. In contrast to other FCP LLDDs,
zfcp has to open each LUN with the FCP channel hardware before it can
send I/O to a LUN. So if a port already has LUNs attached and we
unblock the rport just after port recovery, recoveries of LUNs behind
this port can still be pending which in turn force
zfcp_scsi_queuecommand() to unnecessarily finish requests with
DID_IMM_RETRY.
This also opens a time window with unblocked rport (until the followup
LUN reopen recovery has finished). If a scsi_cmnd timeout occurs during
this time window fc_timed_out() cannot work as desired and such command
would indeed time out and trigger scsi_eh. This prevents a clean and
timely path failover. This should not happen if the path issue can be
recovered on FC transport layer such as path issues involving RSCNs.
Fix this by only calling zfcp_scsi_schedule_rport_register(), to
asynchronously trigger fc_remote_port_add(), after all LUN recoveries as
children of the rport have finished and no new recoveries of equal or
higher order were triggered meanwhile. Finished intentionally includes
any recovery result no matter if successful or failed (still unblock
rport so other successful LUNs work). For simplicity, we check after
each finished LUN recovery if there is another LUN recovery pending on
the same port and then do nothing. We handle the special case of a
successful recovery of a port without LUN children the same way without
changing this case's semantics.
For debugging we introduce 2 new trace records written if the rport
unblock attempt was aborted due to still unfinished or freshly triggered
recovery. The records are only written above the default trace level.
Benjamin noticed the important special case of new recovery that can be
triggered between having given up the erp_lock and before calling
zfcp_erp_action_cleanup() within zfcp_erp_strategy(). We must avoid the
following sequence:
ERP thread rport_work other context
------------------------- -------------- --------------------------------
port is unblocked, rport still blocked,
due to pending/running ERP action,
so ((port->status & ...UNBLOCK) != 0)
and (port->rport == NULL)
unlock ERP
zfcp_erp_action_cleanup()
case ZFCP_ERP_ACTION_REOPEN_LUN:
zfcp_erp_try_rport_unblock()
((status & ...UNBLOCK) != 0) [OLD!]
zfcp_erp_port_reopen()
lock ERP
zfcp_erp_port_block()
port->status clear ...UNBLOCK
unlock ERP
zfcp_scsi_schedule_rport_block()
port->rport_task = RPORT_DEL
queue_work(rport_work)
zfcp_scsi_rport_work()
(port->rport_task != RPORT_ADD)
port->rport_task = RPORT_NONE
zfcp_scsi_rport_block()
if (!port->rport) return
zfcp_scsi_schedule_rport_register()
port->rport_task = RPORT_ADD
queue_work(rport_work)
zfcp_scsi_rport_work()
(port->rport_task == RPORT_ADD)
port->rport_task = RPORT_NONE
zfcp_scsi_rport_register()
(port->rport == NULL)
rport = fc_remote_port_add()
port->rport = rport;
Now the rport was erroneously unblocked while the zfcp_port is blocked.
This is another situation we want to avoid due to scsi_eh
potential. This state would at least remain until the new recovery from
the other context finished successfully, or potentially forever if it
failed. In order to close this race, we take the erp_lock inside
zfcp_erp_try_rport_unblock() when checking the status of zfcp_port or
LUN. With that, the possible corresponding rport state sequences would
be: (unblock[ERP thread],block[other context]) if the ERP thread gets
erp_lock first and still sees ((port->status & ...UNBLOCK) != 0),
(block[other context],NOP[ERP thread]) if the ERP thread gets erp_lock
after the other context has already cleard ...UNBLOCK from port->status.
Since checking fields of struct erp_action is unsafe because they could
have been overwritten (re-used for new recovery) meanwhile, we only
check status of zfcp_port and LUN since these are only changed under
erp_lock elsewhere. Regarding the check of the proper status flags (port
or port_forced are similar to the shown adapter recovery):
Hence, we should check for both UNBLOCK and ERP_INUSE because they are
interleaved. Also we need to explicitly check ERP_FAILED for the link
down case which currently does not clear the UNBLOCK flag in
zfcp_fsf_link_down_info_eval().
Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com> Fixes: 8830271c4819 ("[SCSI] zfcp: Dont fail SCSI commands when transitioning to blocked fc_rport") Fixes: a2fa0aede07c ("[SCSI] zfcp: Block FC transport rports early on errors") Fixes: 5f852be9e11d ("[SCSI] zfcp: Fix deadlock between zfcp ERP and SCSI") Fixes: 338151e06608 ("[SCSI] zfcp: make use of fc_remote_port_delete when target port is unavailable") Fixes: 3859f6a248cb ("[PATCH] zfcp: add rports to enable scsi_add_device to work again") Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Since quite a while, Linux issues enough SCSI commands per scsi_device
which successfully return with FCP_RESID_UNDER, FSF_FCP_RSP_AVAILABLE,
and SAM_STAT_GOOD. This floods the HBA trace area and we cannot see
other and important HBA trace records long enough.
Therefore, do not trace HBA response errors for pure benign residual
under counts at the default trace level.
This excludes benign residual under count combined with other validity
bits set in FCP_RSP_IU, such as FCP_SNS_LEN_VAL. For all those other
cases, we still do want to see both the HBA record and the corresponding
SCSI record by default.
Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com> Fixes: a54ca0f62f95 ("[SCSI] zfcp: Redesign of the debug tracing for HBA records.") Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>