SDIO cards may need clock to send the card interrupt to the host.
On a cherrytrail tablet with a RTL8723BS wifi chip, without this patch
pinging the tablet results in:
PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data.
64 bytes from 192.168.1.14: icmp_seq=1 ttl=64 time=78.6 ms
64 bytes from 192.168.1.14: icmp_seq=2 ttl=64 time=1760 ms
64 bytes from 192.168.1.14: icmp_seq=3 ttl=64 time=753 ms
64 bytes from 192.168.1.14: icmp_seq=4 ttl=64 time=3.88 ms
64 bytes from 192.168.1.14: icmp_seq=5 ttl=64 time=795 ms
64 bytes from 192.168.1.14: icmp_seq=6 ttl=64 time=1841 ms
64 bytes from 192.168.1.14: icmp_seq=7 ttl=64 time=810 ms
64 bytes from 192.168.1.14: icmp_seq=8 ttl=64 time=1860 ms
64 bytes from 192.168.1.14: icmp_seq=9 ttl=64 time=812 ms
64 bytes from 192.168.1.14: icmp_seq=10 ttl=64 time=48.6 ms
Where as with this patch I get:
PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data.
64 bytes from 192.168.1.14: icmp_seq=1 ttl=64 time=3.96 ms
64 bytes from 192.168.1.14: icmp_seq=2 ttl=64 time=1.97 ms
64 bytes from 192.168.1.14: icmp_seq=3 ttl=64 time=17.2 ms
64 bytes from 192.168.1.14: icmp_seq=4 ttl=64 time=2.46 ms
64 bytes from 192.168.1.14: icmp_seq=5 ttl=64 time=2.83 ms
64 bytes from 192.168.1.14: icmp_seq=6 ttl=64 time=1.40 ms
64 bytes from 192.168.1.14: icmp_seq=7 ttl=64 time=2.10 ms
64 bytes from 192.168.1.14: icmp_seq=8 ttl=64 time=1.40 ms
64 bytes from 192.168.1.14: icmp_seq=9 ttl=64 time=2.04 ms
64 bytes from 192.168.1.14: icmp_seq=10 ttl=64 time=1.40 ms
Cc: Dong Aisheng <b29396@freescale.com> Cc: Ian W MORRISON <ianwmorrison@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Acked-by: Dong Aisheng <aisheng.dong@nxp.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A previous commit (below) adds a check for already probed interfaces to
Wacom's matching heuristic. Unfortunately this causes the Bamboo Pen
(CTL-460) to match itself to its 'ghost' touch interface. After
subsequent changes to the driver this match to the ghost causes the
kernel to crash. This patch avoids calling wacom_add_shared_data()
for the BAMBOO_PEN's ghost touch interface.
Fixes: 41372d5d40e7 ("HID: wacom: Augment 'oVid' and 'oPid' with heuristics for HID_GENERIC") Signed-off-by: Aaron Armstrong Skomra <aaron.skomra@wacom.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In 'skl_tplg_set_module_init_data()', a pointer to 'params' member of
'struct skl_algo_data' is calculated, then casted to (u32 *) and assigned
to a member of configuration data. The configuration data is passed to the
other functions and used to process intel IPC. In this processing, the
value of member is used to get message data, however this can bring invalid
memory access in 'skl_set_module_params()' as a result of calculation of
a pointer for actual message data.
On this Dell AIO machine, the lineout jack does not work.
We found the pin 0x1a is assigned to lineout on this machine, and in
the past, we applied ALC298_FIXUP_DELL1_MIC_NO_PRESENCE to fix the
heaset-set mic problem for this machine, this fixup will redefine
the pin 0x1a to headphone-mic, as a result the lineout doesn't
work anymore.
After consulting with Dell, they told us this machine doesn't support
microphone via headset jack, so we add a new fixup which only defines
the pin 0x18 as the headset-mic.
[rearranged the fixup insertion position by tiwai in order to make the
merge with other branches easier -- tiwai]
Fixes: 59ec4b57bcae ("ALSA: hda - Fix headset mic detection problem for two dell machines") Signed-off-by: Hui Wang <hui.wang@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a new event is queued while processing to resize the FIFO in
snd_seq_fifo_clear(), it may lead to a use-after-free, as the old pool
that is being queued gets removed. For avoiding this race, we need to
close the pool to be deleted and sync its usage before actually
deleting it.
The host bridge memory window resource is inserted into the iomem_resource
tree and cannot be deallocated until the host bridge itself is removed.
Previously, the window was on the stack, which meant the iomem_resource
entry pointed into the stack and was corrupted as soon as the probe
function returned, which caused memory corruption and errors like this:
pcie_iproc_bcma bcma0:8: resource collision: [mem 0x40000000-0x47ffffff] conflicts with PCIe MEM space [mem 0x40000000-0x47ffffff]
Move the memory window resource from the stack into struct iproc_pcie so
its lifetime matches that of the host bridge.
Callers of scsi_dh_activate(), e.g. dm-mpath, assume that this function
either returns an error code or calls the completion function. Make
alua_activate() call the completion function even if scsi_device_get()
fails.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: commit 03197b61c5ec ("scsi_dh_alua: Use workqueue for RTPG") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The total ata xfer length may not be calculated properly, in that we do
not use the proper method to get an sg element dma length.
According to the code comment, sg_dma_len() should be used after
dma_map_sg() is called.
This issue was found by turning on the SMMUv3 in front of the hisi_sas
controller in hip07. Multiple sg elements were being combined into a
single element, but the original first element length was being use as
the total xfer length.
Fixes: ff2aeb1eb64c8a4770a6 ("libata: convert to chained sg") Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The user can control the size of the next command passed along, but the
value passed to the ioctl isn't checked against the usable max command
size.
Signed-off-by: Peter Chang <dpf@google.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a reflink operation causes the bmap code to allocate a btree block
we're currently doing single-AG allocations due to having ->firstblock
set and then try any higher AG due a little reflink quirk we've put in
when adding the reflink code. But given that we do not have a minleft
reservation of any kind in this AG we can still not have any space in
the same or higher AG even if the file system has enough free space.
To fix this use a XFS_ALLOCTYPE_FIRST_AG allocation in this fall back
path instead.
[And yes, we need to redo this properly instead of piling hacks over
hacks. I'm working on that, but it's not going to be a small series.
In the meantime this fixes the customer reported issue]
Also add a warning for failing allocations to make it easier to debug.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit fa7f138 ("xfs: clear delalloc and cache on buffered write
failure") fixed one regression in the iomap error handling code and
exposed another. The fundamental problem is that if a buffered write
is a rewrite of preexisting delalloc blocks and the write fails, the
failure handling code can punch out preexisting blocks with valid
file data.
This was reproduced directly by sub-block writes in the LTP
kernel/syscalls/write/write03 test. A first 100 byte write allocates
a single block in a file. A subsequent 100 byte write fails and
punches out the block, including the data successfully written by
the previous write.
To address this problem, update the ->iomap_begin() handler to
distinguish newly allocated delalloc blocks from preexisting
delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
->iomap_end() handler to decide when a failed or short write should
punch out delalloc blocks.
This introduces the subtle requirement that ->iomap_begin() should
never combine newly allocated delalloc blocks with existing blocks
in the resulting iomap descriptor. This can occur when a new
delalloc reservation merges with a neighboring extent that is part
of the current write, for example. Therefore, drop the
post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
just return the record inserted into the fork. This ensures only new
blocks are returned and thus that preexisting delalloc blocks are
always handled as "found" blocks and not punched out on a failed
rewrite.
Reported-by: Xiong Zhou <xzhou@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When block size is larger than inode cluster size, the call to
XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
would have set xfs_sb->sb_inoalignmt to 0. Hence in
xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to
-1 instead of 0. However, xfs_mount->m_sinoalign would get correctly
intialized to 0 because for every positive value of xfs_mount->m_dalign,
the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to
false.
Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having
-1 as the value because blks_per_cluster variable would have the value 1
and hence we would never have a need to use xfs_mount->m_inoalign_mask
to compute the inode chunk's agbno and offset within the chunk.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are two different cases of buffered I/O errors:
- first we can have an already shutdown fs. In that case we should skip
any on-disk operations and just clean up the appen transaction if
present and destroy the ioend
- a real I/O error. In that case we should cleanup any lingering COW
blocks. This gets skipped in the current code and is fixed by this
patch.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We only want to reclaim preallocations from our periodic work item.
Currently this is archived by looking for a dirty inode, but that check
is rather fragile. Instead add a flag to xfs_reflink_cancel_cow_* so
that the caller can ask for just cancelling unwritten extents in the COW
fork.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in commit message] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In various places we currently assert that xfs_bmap_btalloc allocates
from the same as the firstblock value passed in, unless it's either
NULLAGNO or the dop_low flag is set. But the reflink code does not
fully follow this convention as it passes in firstblock purely as
a hint for the allocator without actually having previous allocations
in the transaction, and without having a minleft check on the current
AG, leading to the assert firing on a very full and heavily used
file system. As even the reflink code only allocates from equal or
higher AGs for now we can simply the check to always allow for equal
or higher AGs.
Note that we need to eventually split the two meanings of the firstblock
value. At that point we can also allow the reflink code to allocate
from any AG instead of limiting it in any way.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When block size is larger than inode cluster size, the call to
XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
would have set xfs_sb->sb_inoalignmt to 0. This causes
xfs_ialloc_cluster_alignment() to return 0. Due to this
args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned
equivalent of -1 assigned to it. This later causes alloc_len in
xfs_alloc_space_available() to have a value of 0. In such a scenario
when args.total is also 0, the assert statement "ASSERT(args->maxlen >
0);" fails.
This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in
xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb().
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The block reservation for the transaction allocated in
xfs_shift_file_space() is an artifact of the original collapse range
support. It exists to handle the case where a collapse range occurs,
the initial extent is left shifted into a location that forms a
contiguous boundary with the previous extent and thus the extents
are merged. This code was subsequently refactored and reused for
insert range (right shift) support.
If an insert range occurs under low free space conditions, the
extent at the starting offset is split before the first shift
transaction is allocated. If the block reservation fails, this
leaves separate, but contiguous extents around in the inode. While
not a fatal problem, this is unexpected and will flag a warning on
subsequent insert range operations on the inode. This problem has
been reproduce intermittently by generic/270 running against a
ramdisk device.
Since right shift does not create new extent boundaries in the
inode, a block reservation for extent merge is unnecessary. Update
xfs_shift_file_space() to conditionally reserve fs blocks for left
shift transactions only. This avoids the warning reproduced by
generic/270.
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Certain workoads that punch holes into speculative preallocation can
cause delalloc indirect reservation splits when the delalloc extent is
split in two. If further splits occur, an already short-handed extent
can be split into two in a manner that leaves zero indirect blocks for
one of the two new extents. This occurs because the shortage is large
enough that the xfs_bmap_split_indlen() algorithm completely drains the
requested indlen of one of the extents before it honors the existing
reservation.
This ultimately results in a warning from xfs_bmap_del_extent(). This
has been observed during file copies of large, sparse files using 'cp
--sparse=always.'
To avoid this problem, update xfs_bmap_split_indlen() to explicitly
apply the reservation shortage fairly between both extents. This smooths
out the overall indlen shortage and defers the situation where we end up
with a delalloc extent with zero indlen reservation to extreme
circumstances.
Reported-by: Patrick Dung <mpatdung@gmail.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a delalloc extent is created, it can be merged with pre-existing,
contiguous, delalloc extents. When this occurs,
xfs_bmap_add_extent_hole_delay() merges the extents along with the
associated indirect block reservations. The expectation here is that the
combined worst case indlen reservation is always less than or equal to
the indlen reservation for the individual extents.
This is not always the case, however, as existing extents can less than
the expected indlen reservation if the extent was previously split due
to a hole punch. If a new extent merges with such an extent, the total
indlen requirement may be larger than the sum of the indlen reservations
held by both extents.
xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
reservation is always available and assigns it to the merged extent
without consideration for the indlen held by the pre-existing extent. As
a result, the subsequent xfs_mod_fdblocks() call can attempt an
unintentional allocation rather than a free (indicated by an ASSERT()
failure). Further, if the allocation happens to fail in this context,
the failure goes unhandled and creates a filesystem wide block
accounting inconsistency.
Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
indlen reservation assigned to the merged extent to the sum of the
indlen reservations held by each of the individual extents.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We don't just need the structure to track busy extents which can be
avoided with a synchronous transaction, but also to keep track of
pending discard.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If pag cannot be allocated, the current error exit path will trip
a null pointer deference error when calling xfs_buf_hash_destroy
with a null pag. Fix this by adding a new error exit labels and
jumping to those accordingly, avoiding the hash destroy and
unnecessary kmem_free on pag.
For any given iteration through the loop, any of the above which
succeed must be unwound for /this/ pag, and then all prior
initialized pags must be unwound.
Addresses-Coverity-Id: 1397628 ("Dereference after null check")
Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We're changing both metadata and data, so we need to update the
timestamps for clone operations. Dedupe on the other hand does
not change file data, and only changes invisible metadata so the
timestamps should not be updated.
This follows existing btrfs behavior.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: remove redundant is_dedupe test] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We currently fall back from direct to buffered writes if we detect a
remaining shared extent in the iomap_begin callback. But by the time
iomap_begin is called for the potentially unaligned end block we might
have already written most of the data to disk, which we'd now write
again using buffered I/O. To avoid this reject all writes to reflinked
files before starting I/O so that we are guaranteed to only write the
data once.
The alternative would be to unshare the unaligned start and/or end block
before doing the I/O. I think that's doable, and will actually be
required to support reflinks on DAX file system. But it will take a
little more time and I'd rather get rid of the double write ASAP.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After successful IO or permanent error, b_first_retry_time also
needs to be cleared, else the invalid first retry time will be
used by the next retry check.
Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Christoph Hellwig pointed out that there's a potentially nasty race when
performing simultaneous nearby directio cow writes:
"Thread 1 writes a range from B to c
" B --------- C
p
"a little later thread 2 writes from A to B
" A --------- B
p
[editor's note: the 'p' denote cowextsize boundaries, which I added to
make this more clear]
"but the code preallocates beyond B into the range where thread
"1 has just written, but ->end_io hasn't been called yet.
"But once ->end_io is called thread 2 has already allocated
"up to the extent size hint into the write range of thread 1,
"so the end_io handler will splice the unintialized blocks from
"that preallocation back into the file right after B."
We can avoid this race by ensuring that thread 1 cannot accidentally
remap the blocks that thread 2 allocated (as part of speculative
preallocation) as part of t2's write preparation in t1's end_io handler.
The way we make this happen is by taking advantage of the unwritten
extent flag as an intermediate step.
Recall that when we begin the process of writing data to shared blocks,
we create a delayed allocation extent in the CoW fork:
When a thread prepares to CoW some dirty data out to disk, it will now
convert the delalloc reservation into an /unwritten/ allocated extent in
the cow fork. The da conversion code tries to opportunistically
allocate as much of a (speculatively prealloc'd) extent as possible, so
we may end up allocating a larger extent than we're actually writing
out:
If the write succeeds, the end_cow function will now scan the relevant
range of the CoW fork for real extents and remap only the real extents
into the data fork:
This ensures that we never obliterate valid data fork extents with
unwritten blocks from the CoW fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the data fork, we only allow extents to perform the following state
transitions:
delay -> real <-> unwritten
There's no way to move directly from a delalloc reservation to an
/unwritten/ allocated extent. However, for the CoW fork we want to be
able to do the following to each extent:
delalloc -> unwritten -> written -> remapped to data fork
This will help us to avoid a race in the speculative CoW preallocation
code between a first thread that is allocating a CoW extent and a second
thread that is remapping part of a file after a write. In order to do
this, however, we need two things: first, we have to be able to
transition from da to unwritten, and second the function that converts
between real and unwritten has to be made aware of the cow fork. Do
both of those things.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Perform basic sanity checking of the directory free block header
fields so that we avoid hanging the system on invalid data.
(Granted that just means that now we shutdown on directory write,
but that seems better than hanging...)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
no such thing as a zero-level bmbt (for that we have extents format),
so if we see this, send back an error code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Don't let anybody load an obviously bad btree pointer. Since the values
come from disk, we must return an error, not just ASSERT.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we open a directory, we try to readahead block 0 of the directory
on the assumption that we're going to need it soon. If the bmbt is
corrupt, the directory will never be usable and the readahead fails
immediately, so we might as well prevent the directory from being opened
at all. This prevents a subsequent read or modify operation from
hitting it and taking the fs offline.
NOTE: We're only checking for early failures in the block mapping, not
the readahead directory block itself.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We use di_format and if_flags to decide whether we're grabbing the ilock
in btree mode (btree extents not loaded) or shared mode (anything else),
but the state of those fields can be changed by other threads that are
also trying to load the btree extents -- IFEXTENTS gets set before the
_bmap_read_extents call and cleared if it fails.
We don't actually need to have IFEXTENTS set until after the bmbt
records are successfully loaded and validated, which will fix the race
between multiple threads trying to read the same directory. The next
patch strengthens directory bmbt validation by refusing to open the
directory if reading the bmbt to start directory readahead fails.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's possible for post-eof blocks to end up being used for direct I/O
writes. dio write performs an upfront unwritten extent allocation, sends
the dio and then updates the inode size (if necessary) on write
completion. If a file release occurs while a file extending dio write is
in flight, it is possible to mistake the post-eof blocks for speculative
preallocation and incorrectly truncate them from the inode. This means
that the resulting dio write completion can discover a hole and allocate
new blocks rather than perform unwritten extent conversion.
This requires a strange mix of I/O and is thus not likely to reproduce
in real world workloads. It is intermittently reproduced by generic/299.
The error manifests as an assert failure due to transaction overrun
because the aforementioned write completion transaction has only
reserved enough blocks for btree operations:
The root cause is that xfs_free_eofblocks() uses i_size to truncate
post-eof blocks from the inode, but async, file extending direct writes
do not update i_size until write completion, long after inode locks are
dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
to the incorrect size.
Update xfs_free_eofblocks() to serialize against dio similar to how
extending writes are serialized against i_size updates before post-eof
block zeroing. Specifically, wait on dio while under the iolock. This
ensures that dio write completions have updated i_size before post-eof
blocks are processed.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The xfs_eofblocks.eof_scan_owner field is an internal field to
facilitate invoking eofb scans from the kernel while under the iolock.
This is necessary because the eofb scan acquires the iolock of each
inode. Synchronous scans are invoked on certain buffered write failures
while under iolock. In such cases, the scan owner indicates that the
context for the scan already owns the particular iolock and prevents a
double lock deadlock.
eofblocks scans while under iolock are still livelock prone in the event
of multiple parallel scans, however. If multiple buffered writes to
different inodes fail and invoke eofblocks scans at the same time, each
scan avoids a deadlock with its own inode by virtue of the
eof_scan_owner field, but will never be able to acquire the iolock of
the inode from the parallel scan. Because the low free space scans are
invoked with SYNC_WAIT, the scan will not return until it has processed
every tagged inode and thus both scans will spin indefinitely on the
iolock being held across the opposite scan. This problem can be
reproduced reliably by generic/224 on systems with higher cpu counts
(x16).
To avoid this problem, simplify the semantics of eofblocks scans to
never invoke a scan while under iolock. This means that the buffered
write context must drop the iolock before the scan. It must reacquire
the lock before the write retry and also repeat the initial write
checks, as the original state might no longer be valid once the iolock
was dropped.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
different contexts where the lock may or may not be held. The
need_iolock parameter exists for this reason, to indicate whether
xfs_free_eofblocks() must acquire the iolock itself before it can
proceed.
This is ugly and confusing. Simplify the semantics of
xfs_free_eofblocks() to require the caller to acquire the iolock
appropriately and kill the need_iolock parameter. While here, the mp
param can be removed as well as the xfs_mount is accessible from the
xfs_inode structure. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The nested_ept_enabled flag introduced in commit 7ca29de2136 was not
computed correctly. We are interested only in L1's EPT state, not the
the combined L0+L1 value.
In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must
be false to make sure that PDPSTRs are loaded based on CR3 as usual,
because the special case described in 26.3.2.4 Loading Page-Directory-
Pointer-Table Entries does not apply.
Fixes: 7ca29de21362 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT") Cc: qemu-stable@nongnu.org Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The DSPS glue calls del_timer_sync() in its musb_platform_disable()
implementation, which requires the caller to not hold a lock. But
musb_remove() calls musb_platform_disable() will musb->lock held. This
could causes spinlock deadlock.
So change musb_remove() to call musb_platform_disable() without holds
musb->lock. This doesn't impact the musb_platform_disable implementation
in other glue drivers.
root@am335x-evm:~# modprobe -r musb-dsps
[ 126.134879] musb-hdrc musb-hdrc.1: remove, state 1
[ 126.140465] usb usb2: USB disconnect, device number 1
[ 126.146178] usb 2-1: USB disconnect, device number 2
[ 126.416985] musb-hdrc musb-hdrc.1: USB bus 2 deregistered
[ 126.423943]
[ 126.425525] ======================================================
[ 126.431997] [ INFO: possible circular locking dependency detected ]
[ 126.438564] 4.11.0-rc1-00003-g1557f13bca04-dirty #77 Not tainted
[ 126.444852] -------------------------------------------------------
[ 126.451414] modprobe/778 is trying to acquire lock:
[ 126.456523] (((&glue->timer))){+.-...}, at: [<c01b8788>] del_timer_sync+0x0/0xd0
[ 126.464403]
[ 126.464403] but task is already holding lock:
[ 126.470511] (&(&musb->lock)->rlock){-.-...}, at: [<bf30b7f8>] musb_remove+0x50/0x1
30 [musb_hdrc]
[ 126.479965]
[ 126.479965] which lock already depends on the new lock.
[ 126.479965]
[ 126.488531]
[ 126.488531] the existing dependency chain (in reverse order) is:
[ 126.496368]
[ 126.496368] -> #1 (&(&musb->lock)->rlock){-.-...}:
[ 126.502968] otg_timer+0x80/0xec [musb_dsps]
[ 126.507990] call_timer_fn+0xb4/0x390
[ 126.512372] expire_timers+0xf0/0x1fc
[ 126.516754] run_timer_softirq+0x80/0x178
[ 126.521511] __do_softirq+0xc4/0x554
[ 126.525802] irq_exit+0xe8/0x158
[ 126.529735] __handle_domain_irq+0x58/0xb8
[ 126.534583] __irq_usr+0x54/0x80
[ 126.538507]
[ 126.538507] -> #0 (((&glue->timer))){+.-...}:
[ 126.544636] del_timer_sync+0x40/0xd0
[ 126.549066] musb_remove+0x6c/0x130 [musb_hdrc]
[ 126.554370] platform_drv_remove+0x24/0x3c
[ 126.559206] device_release_driver_internal+0x14c/0x1e0
[ 126.565225] bus_remove_device+0xd8/0x108
[ 126.569970] device_del+0x1e4/0x308
[ 126.574170] platform_device_del+0x24/0x8c
[ 126.579006] platform_device_unregister+0xc/0x20
[ 126.584394] dsps_remove+0x14/0x30 [musb_dsps]
[ 126.589595] platform_drv_remove+0x24/0x3c
[ 126.594432] device_release_driver_internal+0x14c/0x1e0
[ 126.600450] driver_detach+0x38/0x6c
[ 126.604740] bus_remove_driver+0x4c/0xa0
[ 126.609407] SyS_delete_module+0x11c/0x1e4
[ 126.614252] __sys_trace_return+0x0/0x10
Fixes: ea2f35c01d5ea ("usb: musb: Fix sleeping function called from invalid context for hdrc glue") Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Bin Liu <b-liu@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
... we don't reschedule a task under certain circumstances:
Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.
The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.
The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks") Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET
to fill TXSTATUS, a well-defined default value is used, based on the
task's current value.
Suggested-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET
to fill all the registers, the thread's old registers are preserved.
Signed-off-by: Dave Martin <Dave.Martin@arm.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
regs_set() and regs_get() are vulnerable to an off-by-1 buffer overrun
if CONFIG_CPU_H8S is set, since this adds an extra entry to
register_offset[] but not to user_regs_struct.
So, iterate over user_regs_struct based on its actual size, not based on
the length of register_offset[].
Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
gpr_set won't work correctly and can never have been tested, and the
correct behaviour is not clear due to the endianness-dependent task
layout.
So, just remove it. The core code will now return -EOPNOTSUPPORT when
trying to set NT_PRSTATUS on this architecture until/unless a correct
implementation is supplied.
Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clearing the status bit on irq_unmask will discard any pending interrupt
that did arrive after the irq_ack, i.e. while the IRQ handler function
was executing.
Fixes: f365be092572 ("pinctrl: Add Qualcomm TLMM driver") Cc: Stephen Boyd <sboyd@codeaurora.org> Reported-by: Timur Tabi <timur@codeaurora.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When init_vqs runs, virtio_balloon.stats is either uninitialized or
contains stale values. The host updates its state with garbage data
because it has no way of knowing that this is just a marker buffer
used for signaling.
This patch updates the stats before pushing the initial buffer.
Alternative fixes:
* Push an empty buffer in init_vqs. Not easily done with the current
virtio implementation and violates the spec "Driver MUST supply the
same subset of statistics in all buffers submitted to the statsq".
* Push a buffer with invalid tags in init_vqs. Violates the same
spec clause, plus "invalid tag" is not really defined.
Note: the spec says:
When using the legacy interface, the device SHOULD ignore all values in
the first buffer in the statsq supplied by the driver after device
initialization. Note: Historically, drivers supplied an uninitialized
buffer in the first buffer.
Unfortunately QEMU does not seem to implement the recommendation
even for the legacy interface.
Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We should hide and forbid VPID in L1 if it is disabled on L0. However, nested VPID
enable bit is set unconditionally during setup nested vmx exec controls though VPID
is not exposed through nested VMX capablity. This patch fixes it by don't set nested
VPID enable bit if it is disabled on L0.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Fixes: 5c614b3583e (KVM: nVMX: nested VPID emulation) Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues. To ensure we are correctly ensuring that the two ESN
structures are the same size compare both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.
When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer. However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call. There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents. We do
not at this point check that the replay_window is within the allocated
memory. This leads to out-of-bounds reads and writes triggered by
netlink packets. This leads to memory corruption and the potential for
priviledge escalation.
We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn. It however does not check the replay_window
remains within that buffer. Add validation of the contained
replay_window.
Dmitry reports following splat:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 0 PID: 13059 Comm: syz-executor1 Not tainted 4.10.0-rc7-next-20170207 #1
[..]
spin_lock_bh include/linux/spinlock.h:304 [inline]
xfrm_policy_flush+0x32/0x470 net/xfrm/xfrm_policy.c:963
xfrm_policy_fini+0xbf/0x560 net/xfrm/xfrm_policy.c:3041
xfrm_net_init+0x79f/0x9e0 net/xfrm/xfrm_policy.c:3091
ops_init+0x10a/0x530 net/core/net_namespace.c:115
setup_net+0x2ed/0x690 net/core/net_namespace.c:291
copy_net_ns+0x26c/0x530 net/core/net_namespace.c:396
create_new_namespaces+0x409/0x860 kernel/nsproxy.c:106
unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:205
SYSC_unshare kernel/fork.c:2281 [inline]
Problem is that when we get error during xfrm_net_init we will call
xfrm_policy_fini which will acquire xfrm_policy_lock before it was
initialized. Just move it around so locks get set up first.
one can immediatelly see an UBSAN warning:
UBSAN: Undefined behaviour in crypto/algif_hash.c:187:7
variable length array bound value 0 <= 0
CPU: 0 PID: 15949 Comm: syz-executor Tainted: G E 4.4.30-0-default #1
...
Call Trace:
...
[<ffffffff81d598fd>] ? __ubsan_handle_vla_bound_not_positive+0x13d/0x188
[<ffffffff81d597c0>] ? __ubsan_handle_out_of_bounds+0x1bc/0x1bc
[<ffffffffa0e2204d>] ? hash_accept+0x5bd/0x7d0 [algif_hash]
[<ffffffffa0e2293f>] ? hash_accept_nokey+0x3f/0x51 [algif_hash]
[<ffffffffa0e206b0>] ? hash_accept_parent_nokey+0x4a0/0x4a0 [algif_hash]
[<ffffffff8235c42b>] ? SyS_accept+0x2b/0x40
It is a correct warning, as hash state is propagated to accept as zero,
but creating a zero-length variable array is not allowed in C.
Fix this as proposed by Herbert -- do "?: 1" on that site. No sizeof or
similar happens in the code there, so we just allocate one byte even
though we do not use the array.
fbcon can deal with vc_hi_font_mask (the upper 256 chars) and adjust
the vc attrs dynamically when vc_hi_font_mask is changed at
fbcon_init(). When the vc_hi_font_mask is set, it remaps the attrs in
the existing console buffer with one bit shift up (for 9 bits), while
it remaps with one bit shift down (for 8 bits) when the value is
cleared. It works fine as long as the font gets updated after fbcon
was initialized.
However, we hit a bizarre problem when the console is switched to
another fb driver (typically from vesafb or efifb to drmfb). At
switching to the new fb driver, we temporarily rebind the console to
the dummy console, then rebind to the new driver. During the
switching, we leave the modified attrs as is. Thus, the new fbcon
takes over the old buffer as if it were to contain 8 bits chars
(although the attrs are still shifted for 9 bits), and effectively
this results in the yellow color texts instead of the original white
color, as found in the bugzilla entry below.
An easy fix for this is to re-adjust the attrs before leaving the
fbcon at con_deinit callback. Since the code to adjust the attrs is
already present in the current fbcon code, in this patch, we simply
factor out the relevant code, and call it from fbcon_deinit().
When writing the generic nonblocking commit code I assumed that
through clever lifetime management I can assure that the completion
(stored in drm_crtc_commit) only gets freed after it is completed. And
that worked.
I also wanted to make nonblocking helpers resilient against driver
bugs, by having timeouts everywhere. And that worked too.
Unfortunately taking boths things together results in oopses :( Well,
at least sometimes: What seems to happen is that the drm event hangs
around forever stuck in limbo land. The nonblocking helpers eventually
time out, move on and release it. Now the bug I tested all this
against is drivers that just entirely fail to deliver the vblank
events like they should, and in those cases the event is simply
leaked. But what seems to happen, at least sometimes, on i915 is that
the event is set up correctly, but somohow the vblank fails to fire in
time. Which means the event isn't leaked, it's still there waiting for
eventually a vblank to fire. That tends to happen when re-enabling the
pipe, and then the trap springs and the kernel oopses.
The correct fix here is simply to refcount the crtc commit to make
sure that the event sticks around even for drivers which only
sometimes fail to deliver vblanks for some arbitrary reasons. Since
crtc commits are already refcounted that's easy to do.
References: https://bugs.freedesktop.org/show_bug.cgi?id=96781 Cc: Jim Rees <rees@umich.edu> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161221102331.31033-1-daniel.vetter@ffwll.ch Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Revert the main part of commit: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
That commit introduced reading the pci device's msi message data to see
if a pirq was previously configured for the device's msi/msix, and re-use
that pirq. At the time, that was the correct behavior. However, a
later change to Qemu caused it to call into the Xen hypervisor to unmap
all pirqs for a pci device, when the pci device disables its MSI/MSIX
vectors; specifically the Qemu commit: c976437c7dba9c7444fb41df45468968aaa326ad
("qemu-xen: free all the pirqs for msi/msix when driver unload")
Once Qemu added this pirq unmapping, it was no longer correct for the
kernel to re-use the pirq number cached in the pci device msi message
data. All Qemu releases since 2.1.0 contain the patch that unmaps the
pirqs when the pci device disables its MSI/MSIX vectors.
This bug is causing failures to initialize multiple NVMe controllers
under Xen, because the NVMe driver sets up a single MSIX vector for
each controller (concurrently), and then after using that to talk to
the controller for some configuration data, it disables the single MSIX
vector and re-configures all the MSIX vectors it needs. So the MSIX
setup code tries to re-use the cached pirq from the first vector
for each controller, but the hypervisor has already given away that
pirq to another controller, and its initialization fails.
This is discussed in more detail at:
https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit <f2e767bb5d6e> ("mpt3sas: Force request partial completion
alignment") was not considering the case of commands not operating on
logical block size units (e.g. REQ_OP_ZONE_REPORT and its 64B aligned
partial replies). In this case, forcing alignment of resid to the device
logical block size can break the command result, e.g. in the case of
REQ_OP_ZONE_REPORT, the exact number of zone reported by the device.
Move the partial completion alignement check of mpt3sas to a generic
implementation in sd_done(). The check is added within the default
section of the initial req_op() switch case so that the report and reset
zone commands are ignored. In addition, as sd_done() is not called for
passthrough requests, resid corrections are not done as intended by the
initial mpt3sas patch.
Fixes: f2e767bb5d6e ("mpt3sas: Force request partial completion alignment") Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With a device dax alignment of 4KB or 2MB, I get sigbus when running
the attached fio job file for the current kernel (4.11.0-rc1+). If
I specify an alignment of 1GB, it works.
I turned on debug output, and saw that it was failing in the huge
fault code.
fio config for reproduce:
[global]
ioengine=dev-dax
direct=0
filename=/dev/dax0.0
bs=2m
[write]
rw=write
[read]
stonewall
rw=read
The driver fails to fallback when taking a fault that is larger than
the device alignment, or handling a larger fault when a smaller
mapping is already established. While we could support larger
mappings for a device with a smaller alignment, that change is
too large for the immediate fix. The simplest change is to force
fallback until the fault size matches the alignment.
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap") Cc: <stable@vger.kernel.org> Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs. Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.
Commit 15520111500c ("mmc: core: Further fix thread wake-up") allowed a
queue to release the host with is_waiting_last_req set to true. A queue
waiting to claim the host will not reset it, which can result in the
queue getting stuck in a loop.
Fixes: 15520111500c ("mmc: core: Further fix thread wake-up") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we close a channel that has been rescinded, we will leak memory since
vmbus_teardown_gpadl() returns an error. Fix this so that we can properly
cleanup the memory allocated to the ring buffers.
Fixes: ccb61f8a99e6 ("Drivers: hv: vmbus: Fix a rescind handling bug") Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Cc: Dexuan Cui <decui@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Output 'activation' may fail for the reasons of the output driver,
for example, if msc's buffer is not allocated. We forget, however,
to drop the module reference in this case. So each attempt at
activation in this case leaks a reference, preventing the module
from ever unloading.
This patch adds the missing module_put() in the activation error
path.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In journal_init_common(), if we failed to allocate the j_wbuf array, or
if we failed to create the buffer_head for the journal superblock, we
leaked the memory allocated for the revocation tables. Fix this.
Fixes: f0c9fd5458bacf7b12a9a579a727dc740cbe047e Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With posix timers having become optional, we get a build error with
the cpts time sync option of the CPSW driver:
drivers/net/ethernet/ti/cpts.c: In function 'cpts_find_ts':
drivers/net/ethernet/ti/cpts.c:291:23: error: implicit declaration of function 'ptp_classify_raw';did you mean 'ptp_classifier_init'? [-Werror=implicit-function-declaration]
This adds a hard dependency on PTP_CLOCK to avoid the problem, as
building it without PTP support makes no sense anyway.
Fixes: baa73d9e478f ("posix-timers: Make them configurable") Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When iterating busy requests in timeout handler,
if the STARTED flag of one request isn't set, that means
the request is being processed in block layer or driver, and
isn't submitted to hardware yet.
In current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This way is
wrong, and can cause rq corruption or double allocation[1][2],
when doing I/O and removing&resetting NVMe device at the sametime.
This patch fixes several issues reported by Yi Zhang.
Reported-by: Yi Zhang <yizhan@redhat.com> Tested-by: Yi Zhang <yizhan@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The net_cls controller controls the classid field of each socket which
is associated with the cgroup. Because the classid is per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by 3b13758f51de ("cgroups: Allow dynamically changing net_classid").
While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is born by the ones
initiating the operations. However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css. This is an overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.
On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds. Even if the migration traget doesn't have many
to get scanned, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.
Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse. Before bfc2cf6f61fc ("cgroup: call subsys->*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the ->css_attach callback even for controllers which don't
see actual migration to a different css.
As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
->css_attach callback for those. The net_cls ->css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong to as the controller isn't used. This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.
The worst symptom is already fixed in upstream by bfc2cf6f61fc
("cgroup: call subsys->*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.
This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated. This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup. As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().
Reported-by: David Goode <dgoode@fb.com> Fixes: 3b13758f51de ("cgroups: Allow dynamically changing net_classid") Cc: Nina Schiff <ninasc@fb.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On CPU online the cpufreq core restores the previous governor (or
the previous "policy" setting for ->setpolicy drivers), but it does
not restore the min/max limits at the same time, which is confusing,
inconsistent and real pain for users who set the limits and then
suspend/resume the system (using full suspend), in which case the
limits are reset on all CPUs except for the boot one.
Fix this by making cpufreq_online() restore the limits when an inactive
policy is brought online.
The commit log and patch are inspired from Rafael's earlier work.
Reported-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If kernel image extends across alignment boundary, existing
code increases the KASLR offset by size of kernel image. The
offset is masked after resizing. There are cases, where after
masking, we may still have kernel image extending across
boundary. This eventually results in only 2MB block getting
mapped while creating the page tables. This results in data aborts
while accessing unmapped regions during second relocation (with
kaslr offset) in __primary_switch. To fix this problem, round up the
kernel image size, by swapper block size, before adding it for
correction.
For example consider below case, where kernel image still crosses
1GB alignment boundary, after masking the offset, which is fixed
by rounding up kernel image size.
SWAPPER_TABLE_SHIFT = 30
Swapper using section maps with section size 2MB.
CONFIG_PGTABLE_LEVELS = 3
VA_BITS = 39
On some DDR controllers, compatible with the sama5d3 one,
the sequence to enter/exit/re-enter the self-refresh mode adds
more constrains than what is currently written in the at91_idle
driver. An actual access to the DDR chip is needed between exit
and re-enter of this mode which is somehow difficult to implement.
This sequence can completely hang the SoC. It is particularly
experienced on parts which embed a L2 cache if the code run
between IDLE calls fits in it...
Moreover, as the intention is to enter and exit pretty rapidly
from IDLE, the power-down mode is a good candidate.
So now we use power-down instead of self-refresh. As we can
simplify the code for sama5d3 compatible DDR controllers,
we instantiate a new sama5d3_ddr_standby() function.
Signed-off-by: Nicolas Ferre <nicolas.ferre@microchip.com> Fixes: 017b5522d5e3 ("ARM: at91: Add new binding for sama5d3-ddramc") Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit cab43282682e ("ARM: at91/dt: sama5d2: Use new
compatible for ohci node")
It depends from commit 7150bc9b4d43 ("usb: ohci-at91: Forcibly suspend
ports while USB suspend") which was reverted and implemented
differently. With the new implementation, the compatible string must
remain the same.
The compatible string introduced by this commit has been used in the
default SAMA5D2 dtsi starting from Linux 4.8. As it has never been
working correctly in an official release, removing it should not be
breaking the stability rules.
Fixes: cab43282682e ("ARM: at91/dt: sama5d2: Use new compatible for ohci node") Signed-off-by: Romain Izard <romain.izard.pro@gmail.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For some unknown reasons, in some cases, FLPD cache invalidation doesn't
work properly with SYSMMU v5 controllers found in Exynos5433 SoCs. This
can be observed by a firmware crash during initialization phase of MFC
video decoder available in the mentioned SoCs when IOMMU support is
enabled. To workaround this issue perform a full TLB/FLPD invalidation
in case of replacing any first level page descriptors in case of SYSMMU v5.
Fixes: 740a01eee9ada ("iommu/exynos: Add support for v5 SYSMMU") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Documentation specifies that SYSMMU should be in blocked state while
performing TLB/FLPD cache invalidation, so add needed calls to
sysmmu_block/unblock.
This was broken in commit cd979883b9ed ("xen/acpi-processor:
fix enabling interrupts on syscore_resume"). do_suspend (from
xen/manage.c) and thus xen_resume_notifier never get called on
the initial-domain at resume (it is if running as guest.)
The rationale for the breaking change was that upload_pm_data()
potentially does blocking work in syscore_resume(). This patch
addresses the original issue by scheduling upload_pm_data() to
execute in workqueue context.
Cc: Stanislaw Gruszka <sgruszka@redhat.com> Based-on-patch-by: Konrad Wilk <konrad.wilk@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The intent of the original warning is make sure that the mdev vendor
driver has removed any group notifiers at the point where the group
is closed by the user. Theoretically this would be through an
orderly shutdown where any devices are release prior to the group
release. We can't always count on an orderly shutdown, the user can
close the group before the notifier can be removed or the user task
might be killed. We'd like to add this sanity test when the group is
idle and the only references are from the devices within the group
themselves, but we don't have a good way to do that. Instead check
both when the group itself is removed and when the group is opened.
A bit later than we'd prefer, but better than the current over
aggressive approach.
Fixes: ccd46dbae77d ("vfio: support notifier chain in vfio_group") Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Cc: Jike Song <jike.song@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Filesystem encryption ostensibly supported revoking a keyring key that
had been used to "unlock" encrypted files, causing those files to become
"locked" again. This was, however, buggy for several reasons, the most
severe of which was that when key revocation happened to be detected for
an inode, its fscrypt_info was immediately freed, even while other
threads could be using it for encryption or decryption concurrently.
This could be exploited to crash the kernel or worse.
This patch fixes the use-after-free by removing the code which detects
the keyring key having been revoked, invalidated, or expired. Instead,
an encrypted inode that is "unlocked" now simply remains unlocked until
it is evicted from memory. Note that this is no worse than the case for
block device-level encryption, e.g. dm-crypt, and it still remains
possible for a privileged user to evict unused pages, inodes, and
dentries by running 'sync; echo 3 > /proc/sys/vm/drop_caches', or by
simply unmounting the filesystem. In fact, one of those actions was
already needed anyway for key revocation to work even somewhat sanely.
This change is not expected to break any applications.
In the future I'd like to implement a real API for fscrypt key
revocation that interacts sanely with ongoing filesystem operations ---
waiting for existing operations to complete and blocking new operations,
and invalidating and sanitizing key material and plaintext from the VFS
caches. But this is a hard problem, and for now this bug must be fixed.
This bug affected almost all versions of ext4, f2fs, and ubifs
encryption, and it was potentially reachable in any kernel configured
with encryption support (CONFIG_EXT4_ENCRYPTION=y,
CONFIG_EXT4_FS_ENCRYPTION=y, CONFIG_F2FS_FS_ENCRYPTION=y, or
CONFIG_UBIFS_FS_ENCRYPTION=y). Note that older kernels did not use the
shared fs/crypto/ code, but due to the potential security implications
of this bug, it may still be worthwhile to backport this fix to them.
Fixes: b7236e21d55f ("ext4 crypto: reorganize how we store keys in the inode") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The CCP driver generally uses a round-robin approach when
assigning operations to available CCPs. For the DMA engine,
however, the DMA mappings of the SGs are associated with a
specific CCP. When an IOMMU is enabled, the IOMMU is
programmed based on this specific device.
If the DMA operations are not performed by that specific
CCP then addressing errors and I/O page faults will occur.
Update the CCP driver to allow a specific CCP device to be
requested for an operation and use this in the DMA engine
support.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the 'commit ebee76f7fa46 ("ath10k: allow setting coverage class")',
it inherits the design and the address offset from ath9k, but the address
is not applicable to QCA6174, which leads to a random crash while doing the
resume() operation, since the set_coverage_class.ops will be called from
ieee80211_reconfig() when resume() (if the wow is not configured).
Fix the incorrect address offset here to avoid the random crash.
Verified on QCA6174/hw3.0 with firmware WLAN.RM.4.4-00022-QCARMSWPZ-2.
kvalo: this also seems to fix a regression with firmware restart.
Fixes: ebee76f7fa46 ("ath10k: allow setting coverage class") Signed-off-by: Ryan Hsu <ryanhsu@qca.qualcomm.com> Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When PCIe FLR support was added, much of the remove/release code for
PCIe was migrated to ->down_dev(), but ->down_dev() is never called for
device removal. Let's refactor the cleanup to be done in both cases.
Also, drop the comments above mwifiex_cleanup_pcie(), because they were
clearly wrong, and it's better to have clear and obvious code than to
detail the code steps in comments anyway.
Fixes: 4c5dae59d2e9 ("mwifiex: add PCIe function level reset support") Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The MP style clocks support an mux with pre-dividers. While the driver
correctly accounted for them in the .determine_rate callback, it did
not in the .recalc_rate and .set_rate callbacks.
This means when calculating the factors in the .set_rate callback, they
would be off by a factor of the active pre-divider. Same goes for
reading back the clock rate after it is set.
After commit e9afc746299d ("hwrng: geode - Use linux/io.h instead of
asm/io.h") the geode-rng driver uses devres with pci_dev->dev to keep
track of resources, but does not actually register a PCI driver. This
results in the following issues:
1. The driver leaks memory because the driver does not attach to a
device. The driver only uses the PCI device as a reference. devm_*()
functions will release resources on driver detach, which the geode-rng
driver will never do. As a result,
2. The driver cannot be reloaded because there is always a use of the
ioport and region after the first load of the driver.
Revert the changes made by e9afc746299d ("hwrng: geode - Use linux/io.h
instead of asm/io.h").
After commit 31b2a73c9c5f ("hwrng: amd - Migrate to managed API"), the
amd-rng driver uses devres with pci_dev->dev to keep track of resources,
but does not actually register a PCI driver. This results in the
following issues:
1. The message
WARNING: CPU: 2 PID: 621 at drivers/base/dd.c:349 driver_probe_device+0x38c
is output when the i2c_amd756 driver loads and attempts to register a PCI
driver. The PCI & device subsystems assume that no resources have been
registered for the device, and the WARN_ON() triggers since amd-rng has
already do so.
2. The driver leaks memory because the driver does not attach to a
device. The driver only uses the PCI device as a reference. devm_*()
functions will release resources on driver detach, which the amd-rng
driver will never do. As a result,
3. The driver cannot be reloaded because there is always a use of the
ioport and region after the first load of the driver.
Revert the changes made by 31b2a73c9c5f ("hwrng: amd - Migrate to managed
API").
Disabling interrupts for even a millisecond can cause problems for some
devices. That can happen when Intel host controllers wait for the present
state to propagate.
The spin lock is not necessary here. Anything that is racing with changes
to the I/O state is already broken. The mmc core already provides
synchronization via "claiming" the host.
Although the spin lock probably should be removed from the code paths that
lead to this point, such a patch would touch too much code to be suitable
for stable trees. Consequently, for this patch, just drop the spin lock
while waiting.
Disabling interrupts for even a millisecond can cause problems for some
devices. That can happen when sdhci changes clock frequency because it
waits for the clock to become stable under a spin lock.
The spin lock is not necessary here. Anything that is racing with changes
to the I/O state is already broken. The mmc core already provides
synchronization via "claiming" the host.
Although the spin lock probably should be removed from the code paths that
lead to this point, such a patch would touch too much code to be suitable
for stable trees. Consequently, for this patch, just drop the spin lock
while waiting.
sdhci_arasan_get_timeout_clock() divides the frequency it has with (1 <<
(13 + divisor)).
However, the divisor is not some Arasan-specific value, but instead is
just the Data Timeout Counter Value from the SDHCI Timeout Control
Register.
Applying it here like this is wrong as the sdhci driver already takes
that value into account when calculating timeouts, and in fact it *sets*
that register value based on how long a timeout is wanted.
Additionally, sdhci core interprets the .get_timeout_clock callback
return value as if it were read from hardware registers, i.e. the unit
should be kHz or MHz depending on SDHCI_TIMEOUT_CLK_UNIT capability bit.
This bit is set at least on the tested Zynq-7000 SoC.
With the tested hardware (SDHCI_TIMEOUT_CLK_UNIT set) this results in
too high a timeout clock rate being reported, causing the core to use
longer-than-needed timeouts. Additionally, on a partitioned MMC
(therefore having erase_group_def bit set) mmc_calc_max_discard()
disables discard support as it looks like controller does not support
the long timeouts needed for that.
Do not apply the extra divisor and return the timeout clock in the
expected unit.
Tested with a Zynq-7000 SoC and a partitioned Toshiba THGBMAG5A1JBAWR
eMMC card.
The SDHCI controller in the SAMA5D2 chip requires a valid voltage set
in the power control register, otherwise commands will fail with a
timeout error.
When using the regulator framework to specify the regulator used by the
mmc device, the voltage is not configured, and it is not possible to use
the connected device.
Implement a custom 'set_power' function for this specific hardware, that
configures the voltage in the register in all cases.