--- /dev/null
+From 3a8b36f378060d20062a0918e99fae39ff077bf0 Mon Sep 17 00:00:00 2001
+From: Filipe Manana <fdmanana@suse.com>
+Date: Sun, 1 Mar 2015 20:36:00 +0000
+Subject: Btrfs: fix data loss in the fast fsync path
+
+From: Filipe Manana <fdmanana@suse.com>
+
+commit 3a8b36f378060d20062a0918e99fae39ff077bf0 upstream.
+
+When using the fast file fsync code path we can miss the fact that new
+writes happened since the last file fsync and therefore return without
+waiting for the IO to finish and write the new extents to the fsync log.
+
+Here's an example scenario where the fsync will miss the fact that new
+file data exists that wasn't yet durably persisted:
+
+1. fs_info->last_trans_committed == N - 1 and current transaction is
+ transaction N (fs_info->generation == N);
+
+2. do a buffered write;
+
+3. fsync our inode, this clears our inode's full sync flag, starts
+ an ordered extent and waits for it to complete - when it completes
+ at btrfs_finish_ordered_io(), the inode's last_trans is set to the
+ value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
+ btrfs_set_inode_last_trans);
+
+4. transaction N is committed, so fs_info->last_trans_committed is now
+ set to the value N and fs_info->generation remains with the value N;
+
+5. do another buffered write, when this happens btrfs_file_write_iter
+ sets our inode's last_trans to the value N + 1 (that is
+ fs_info->generation + 1 == N + 1);
+
+6. transaction N + 1 is started and fs_info->generation now has the
+ value N + 1;
+
+7. transaction N + 1 is committed, so fs_info->last_trans_committed
+ is set to the value N + 1;
+
+8. fsync our inode - because it doesn't have the full sync flag set,
+ we only start the ordered extent, we don't wait for it to complete
+ (only in a later phase) therefore its last_trans field has the
+ value N + 1 set previously by btrfs_file_write_iter(), and so we
+ have:
+
+ inode->last_trans <= fs_info->last_trans_committed
+ (N + 1) (N + 1)
+
+ Which made us not log the last buffered write and exit the fsync
+ handler immediately, returning success (0) to user space and resulting
+ in data loss after a crash.
+
+This can actually be triggered deterministically and the following excerpt
+from a testcase I made for xfstests triggers the issue. It moves a dummy
+file across directories and then fsyncs the old parent directory - this
+is just to trigger a transaction commit, so moving files around isn't
+directly related to the issue but it was chosen because running 'sync' for
+example does more than just committing the current transaction, as it
+flushes/waits for all file data to be persisted. The issue can also happen
+at random periods, since the transaction kthread periodically commits the
+current transaction (about every 30 seconds by default).
+The body of the test is:
+
+ _scratch_mkfs >> $seqres.full 2>&1
+ _init_flakey
+ _mount_flakey
+
+ # Create our main test file 'foo', the one we check for data loss.
+ # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
+ # bit from its flags (btrfs inode specific flags).
+ $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
+ -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
+
+ # Now create one other file and 2 directories. We will move this second file
+ # from one directory to the other later because it forces btrfs to commit its
+ # currently open transaction if we fsync the old parent directory. This is
+ # necessary to trigger the data loss bug that affected btrfs.
+ mkdir $SCRATCH_MNT/testdir_1
+ touch $SCRATCH_MNT/testdir_1/bar
+ mkdir $SCRATCH_MNT/testdir_2
+
+ # Make sure everything is durably persisted.
+ sync
+
+ # Write more 8Kb of data to our file.
+ $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
+
+ # Move our 'bar' file into a new directory.
+ mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
+
+ # Fsync our first directory. Because it had a file moved into some other
+ # directory, this made btrfs commit the currently open transaction. This is
+ # a condition necessary to trigger the data loss bug.
+ $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
+
+ # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
+ # data we wrote previously to be persisted and available if a crash happens.
+ # This did not happen with btrfs, because of the transaction commit that
+ # happened when we fsynced the parent directory.
+ $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
+
+ # Simulate a crash/power loss.
+ _load_flakey_table $FLAKEY_DROP_WRITES
+ _unmount_flakey
+
+ _load_flakey_table $FLAKEY_ALLOW_WRITES
+ _mount_flakey
+
+ # Now check that all the data we wrote before is available.
+ echo "File content after log replay:"
+ od -t x1 $SCRATCH_MNT/foo
+
+ status=0
+ exit
+
+The expected golden output for the test, which is what we get with this
+fix applied (or when running against ext3/4 and xfs), is:
+
+ wrote 8192/8192 bytes at offset 0
+ XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+ wrote 8192/8192 bytes at offset 8192
+ XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+ File content after log replay:
+ 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+ *
+ 0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
+ *
+ 0040000
+
+Without this fix applied, the output shows the test file does not have
+the second 8Kb extent that we successfully fsynced:
+
+ wrote 8192/8192 bytes at offset 0
+ XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+ wrote 8192/8192 bytes at offset 8192
+ XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+ File content after log replay:
+ 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+ *
+ 0020000
+
+So fix this by skipping the fsync only if we're doing a full sync and
+if the inode's last_trans is <= fs_info->last_trans_committed, or if
+the inode is already in the log. Also remove setting the inode's
+last_trans in btrfs_file_write_iter since it's useless/unreliable.
+
+Also, because btrfs_file_write_iter no longer sets inode->last_trans to
+fs_info->generation + 1, don't set last_trans to 0 when we bail out, and
+don't bail out when last_trans is 0. Otherwise something as simple as the
+following example wouldn't log the second write on the last fsync:
+
+ 1. write to file
+
+ 2. fsync file
+
+ 3. fsync file
+ |--> btrfs_inode_in_log() returns true and last_trans is set to 0
+
+ 4. write to file
+ |--> btrfs_file_write_iter() no longer sets last_trans, so it
+ remained with a value of 0
+ 5. fsync
+ |--> inode->last_trans == 0, so it bails out without logging the
+ second write
+
+A test case for xfstests will be sent soon.
+
+Signed-off-by: Filipe Manana <fdmanana@suse.com>
+Signed-off-by: Chris Mason <clm@fb.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/btrfs/file.c | 56 ++++++++++++++++++++++++++++----------------------------
+ 1 file changed, 28 insertions(+), 28 deletions(-)
+
+--- a/fs/btrfs/file.c
++++ b/fs/btrfs/file.c
+@@ -1774,22 +1774,10 @@ static ssize_t btrfs_file_aio_write(stru
+ mutex_unlock(&inode->i_mutex);
+
+ /*
+- * we want to make sure fsync finds this change
+- * but we haven't joined a transaction running right now.
+- *
+- * Later on, someone is sure to update the inode and get the
+- * real transid recorded.
+- *
+- * We set last_trans now to the fs_info generation + 1,
+- * this will either be one more than the running transaction
+- * or the generation used for the next transaction if there isn't
+- * one running right now.
+- *
+ * We also have to set last_sub_trans to the current log transid,
+ * otherwise subsequent syncs to a file that's been synced in this
+ * transaction will appear to have already occured.
+ */
+- BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
+ BTRFS_I(inode)->last_sub_trans = root->log_transid;
+ if (num_written > 0) {
+ err = generic_write_sync(file, pos, num_written);
+@@ -1892,25 +1880,37 @@ int btrfs_sync_file(struct file *file, l
+ atomic_inc(&root->log_batch);
+
+ /*
+- * check the transaction that last modified this inode
+- * and see if its already been committed
+- */
+- if (!BTRFS_I(inode)->last_trans) {
+- mutex_unlock(&inode->i_mutex);
+- goto out;
+- }
+-
+- /*
+- * if the last transaction that changed this file was before
+- * the current transaction, we can bail out now without any
+- * syncing
++ * If the last transaction that changed this file was before the current
++ * transaction and we have the full sync flag set in our inode, we can
++ * bail out now without any syncing.
++ *
++ * Note that we can't bail out if the full sync flag isn't set. This is
++ * because when the full sync flag is set we start all ordered extents
++ * and wait for them to fully complete - when they complete they update
++ * the inode's last_trans field through:
++ *
++ * btrfs_finish_ordered_io() ->
++ * btrfs_update_inode_fallback() ->
++ * btrfs_update_inode() ->
++ * btrfs_set_inode_last_trans()
++ *
++ * So we are sure that last_trans is up to date and can do this check to
++ * bail out safely. For the fast path, when the full sync flag is not
++ * set in our inode, we can not do it because we start only our ordered
++ * extents and don't wait for them to complete (that is when
++ * btrfs_finish_ordered_io runs), so here at this point their last_trans
++ * value might be less than or equals to fs_info->last_trans_committed,
++ * and setting a speculative last_trans for an inode when a buffered
++ * write is made (such as fs_info->generation + 1 for example) would not
++ * be reliable since after setting the value and before fsync is called
++ * any number of transactions can start and commit (transaction kthread
++ * commits the current transaction periodically), and a transaction
++ * commit does not start nor waits for ordered extents to complete.
+ */
+ smp_mb();
+ if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
+- BTRFS_I(inode)->last_trans <=
+- root->fs_info->last_trans_committed) {
+- BTRFS_I(inode)->last_trans = 0;
+-
++ (full_sync && BTRFS_I(inode)->last_trans <=
++ root->fs_info->last_trans_committed)) {
+ /*
+ * We'v had everything committed since the last time we were
+ * modified so clear this flag in case it was set for whatever
--- /dev/null
+From b3cffac04eca9af46e1e23560a8ee22b1bd36d43 Mon Sep 17 00:00:00 2001
+From: James Hogan <james.hogan@imgtec.com>
+Date: Tue, 24 Feb 2015 11:46:20 +0000
+Subject: KVM: MIPS: Fix trace event to save PC directly
+
+From: James Hogan <james.hogan@imgtec.com>
+
+commit b3cffac04eca9af46e1e23560a8ee22b1bd36d43 upstream.
+
+Currently the guest exit trace event saves the VCPU pointer to the
+structure, and the guest PC is retrieved by dereferencing it when the
+event is printed rather than directly from the trace record. This isn't
+safe as the printing may occur long afterwards, after the PC has changed
+and potentially after the VCPU has been freed. Usually this results in
+the same (wrong) PC being printed for multiple trace events. It also
+isn't portable as userland has no way to access the VCPU data structure
+when interpreting the trace record itself.
+
+Let's save the actual PC in the structure so that the correct value is
+accessible later.
+
+Fixes: 669e846e6c4e ("KVM/MIPS32: MIPS arch specific APIs for KVM")
+Signed-off-by: James Hogan <james.hogan@imgtec.com>
+Cc: Paolo Bonzini <pbonzini@redhat.com>
+Cc: Ralf Baechle <ralf@linux-mips.org>
+Cc: Marcelo Tosatti <mtosatti@redhat.com>
+Cc: Gleb Natapov <gleb@kernel.org>
+Cc: Steven Rostedt <rostedt@goodmis.org>
+Cc: Ingo Molnar <mingo@redhat.com>
+Cc: linux-mips@linux-mips.org
+Cc: kvm@vger.kernel.org
+Acked-by: Steven Rostedt <rostedt@goodmis.org>
+Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ arch/mips/kvm/trace.h | 6 +++---
+ 1 file changed, 3 insertions(+), 3 deletions(-)
+
+--- a/arch/mips/kvm/trace.h
++++ b/arch/mips/kvm/trace.h
+@@ -26,18 +26,18 @@ TRACE_EVENT(kvm_exit,
+ TP_PROTO(struct kvm_vcpu *vcpu, unsigned int reason),
+ TP_ARGS(vcpu, reason),
+ TP_STRUCT__entry(
+- __field(struct kvm_vcpu *, vcpu)
++ __field(unsigned long, pc)
+ __field(unsigned int, reason)
+ ),
+
+ TP_fast_assign(
+- __entry->vcpu = vcpu;
++ __entry->pc = vcpu->arch.pc;
+ __entry->reason = reason;
+ ),
+
+ TP_printk("[%s]PC: 0x%08lx",
+ kvm_mips_exit_types_str[__entry->reason],
+- __entry->vcpu->arch.pc)
++ __entry->pc)
+ );
+
+ #endif /* _TRACE_KVM_H */