Jeff Peeler [Tue, 18 Jan 2011 20:13:52 +0000 (20:13 +0000)]
Convert device state callbacks to ao2 objects to fix a deadlock in chan_sip.
Lock scenario presented here:
Thread 1
holds ast_rdlock_contexts &conlock
holds handle_statechange hints
holds handle_statechange hint
waiting for cb_extensionstate
Locked Here: chan_sip.c line 7428 (find_call)
Thread 2
holds handle_request_do &netlock
holds find_call sip_pvt_ptr
waiting for ast_rdlock_contexts &conlock
Locked Here: pbx.c line 9911 (ast_rdlock_contexts)
Chan_sip has an established locking order of locking the sip_pvt and then
getting the context lock. So, as stated by the summary, the operations in
thread 2 have been modified to no longer require the context lock.
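The general pattern behind keeping callbacks in their own reference-counted (ao2-style) container can be sketched as follows: the container has its own lock, so adding a callback does not need the context lock, and callbacks are invoked from a referenced snapshot with no container lock held. This is a minimal single-threaded C sketch under those assumptions. `struct state_cb`, `register_cb()`, `notify_state_change()` and the `registry_locked` flag are hypothetical stand-ins, not the ao2 API or the actual pbx.c change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for ao2-managed state callbacks (not the real API). */
typedef void (*state_cb_fn)(const char *exten, int state, void *data);

struct state_cb {
    int refcount;    /* ao2-style reference count */
    state_cb_fn fn;
    void *data;
};

#define MAX_CBS 8

static struct state_cb *registry[MAX_CBS];
static size_t registry_len;
static bool registry_locked;  /* models "the callback container lock is held" */

/* Only the container's own lock is needed here, not the context lock. */
static void register_cb(struct state_cb *cb)
{
    assert(registry_len < MAX_CBS);
    registry_locked = true;
    registry[registry_len++] = cb;
    registry_locked = false;
}

static void notify_state_change(const char *exten, int state)
{
    struct state_cb *snapshot[MAX_CBS];
    size_t n;

    /* Hold the container lock only long enough to reference each entry. */
    registry_locked = true;
    for (n = 0; n < registry_len; n++) {
        snapshot[n] = registry[n];
        snapshot[n]->refcount++;
    }
    registry_locked = false;

    /* Invoke callbacks with no container lock held, so a callback that
     * locks something else (such as a sip_pvt) cannot invert lock order. */
    for (size_t i = 0; i < n; i++) {
        snapshot[i]->fn(exten, state, snapshot[i]->data);
        snapshot[i]->refcount--;  /* drop the reference taken above */
    }
}

/* Example callback that records whether it ran under the container lock. */
static int cb_calls;
static bool cb_saw_lock;

static void record_cb(const char *exten, int state, void *data)
{
    (void)exten;
    (void)state;
    (void)data;
    cb_calls++;
    if (registry_locked)
        cb_saw_lock = true;
}
```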
(closes issue #18310)
Reported by: one47
Patches:
statecbs_ao2.mk2.patch uploaded by one47 (license 23),
modified by me
Issue #17999
1) A calls B. B answers.
2) B dials *2 using DTMF (the features.conf code for attended transfer).
3) A hears MOH. B dials C's number.
4) C is ringing. A hears MOH.
5) B hangs up. A still hears MOH. C is ringing.
6) A hangs up. C keeps ringing until "atxfernoanswertimeout" expires.
For v1.4, C will ring forever until C answers the dead line. (Issue #17096)
Problem: When A and B hang up, C is still ringing.
Issue #18395
SIP call limit of B is 1
1. A calls B. B answers.
2. B dials *2 (atxfer) to call C.
3. B hangs up. C is ringing.
4. The wait for C to answer times out.
5. The recall to B fails because B has reached its call limit.
Because B has reached its call limit, it cannot do anything until the
transfer it started completes.
Issue #17273
Same scenario as issue 18395 but party B is an FXS port. Party B cannot
do anything until the transfer it started completes. If B goes back off
hook before C answers, B hears ringback instead of the expected dialtone.
**********
Note for the issue #17273 and #18395 fix:
DTMF attended transfer works within the channel bridge. Unfortunately,
when either party A or B in the channel bridge hangs up, that channel is
not completely hung up until the transfer completes. This is a real
problem depending upon the channel technology involved.
For chan_dahdi, the channel is crippled until the hangup is complete.
Either the channel is not usable (analog) or the protocol disconnect
messages are held up (PRI/BRI/SS7) and the media is not released.
For chan_sip, a call limit of one is going to block that endpoint from any
further calls until the hangup is complete.
For party A this is a minor problem. The party A channel will only be in
this condition while party B is dialing and when party B and C are
conferring. The conversation between party B and C is expected to be a
short one. Party B is either asking a question of party C or announcing
party A. Also, party A does not have much incentive to hang up at this
point.
For party B this can be a major problem during a blonde transfer. (A
blonde transfer is our term for an attended transfer that is converted
into a blind transfer. :)) Party B could be the operator. When party B
hangs up, he assumes that he is out of the original call entirely. The
party B channel will be in this condition while party C is ringing, while
attempting to recall party B, and while waiting between call attempts.
WARNING:
The ATXFER_NULL_TECH conditional is a hack to fix the problem. It will
replace the party B channel technology with a NULL channel driver to
complete hanging up the party B channel technology. The consequences of
this code are that the 'h' extension will not be able to access any channel
technology specific information like SIP statistics for the call.
ATXFER_NULL_TECH is not defined by default.
**********
Only offer codecs both sides support for directmedia
When using directmedia, Asterisk needs to limit the codecs offered to just
the ones that both sides recognize, otherwise they may end up sending audio
that the other side doesn't understand.
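Conceptually, the directmedia offer is the intersection of the two capability sets. A minimal C sketch, assuming capabilities are represented as a bitmask of codec flags; the `CODEC_*` values and `directmedia_offer()` are hypothetical, not chan_sip's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical codec flags; the real format definitions differ. */
#define CODEC_ULAW (UINT64_C(1) << 0)
#define CODEC_ALAW (UINT64_C(1) << 1)
#define CODEC_GSM  (UINT64_C(1) << 2)
#define CODEC_G729 (UINT64_C(1) << 3)

/* With directmedia, audio flows endpoint to endpoint without Asterisk
 * transcoding it, so only codecs both endpoints support may be offered. */
static uint64_t directmedia_offer(uint64_t caller_caps, uint64_t callee_caps)
{
    return caller_caps & callee_caps;
}
```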
Fix CPU spike when pressing DTMF after agent login.
The problem here is that DTMF was being continuously deferred and requeued,
since ast_safe_sleep is called in a loop. There are several other places in the
code that sleep and then loop in a similar fashion. Because of this, I
opted to no longer defer DTMF, which will not affect the original fix:
Paul Belanger [Sun, 9 Jan 2011 21:38:24 +0000 (21:38 +0000)]
SOUND_CACHE_DIR now defaults to empty
Sound files included in the Asterisk tarball were being ignored and
re-downloaded. Users wanting to cache the files can still override the setting
using the --with-sounds-cache option.
This only skips authentication on retransmissions that are already
authenticated. A similar method is already used for INVITEs. This
is the kind of thing we end up having to do when we don't have a
transaction layer...
Remove changes to Via processing that were not supposed to go into the last commit.
........
r299220 | mnicholson | 2010-12-20 15:21:39 -0600 (Mon, 20 Dec 2010) | 4 lines
Let Asterisk find better backtrace information with libbfd.
The menuselect option BETTER_BACKTRACES, if enabled, will use libbfd to search
for better symbol information within both the Asterisk binary, as well as
loaded modules, to assist when using inline backtraces to track down problems.
........
Fix improper hangup when doing an attended transfer to queue.
Had to indicate ringing in wait_for_answer so the attended transfer code would
not try to hang up the local channel it created, which would kill the call.
Fix reference and container leaks when running 'astobj2 test'.
We need to make sure that ao2_iterator_destroy is called once for each time that
ao2_iterator_init is called. Also make sure to unref a newly allocated object
that we've linked into a container.
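Both rules can be illustrated with a toy reference-counted container. This is a hypothetical sketch, not the astobj2 API: `obj_alloc()`, `container_link()`, `iter_init()` and `iter_destroy()` only mimic the idea that linking takes its own reference on the object, and that every iterator holds a reference on its container until it is destroyed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct obj {
    int refcount;
};

struct container {
    int refcount;
    struct obj *items[4];
    size_t count;
};

struct iter {
    struct container *c;
    size_t pos;
};

static struct obj *obj_alloc(void)
{
    struct obj *o = malloc(sizeof(*o));
    if (o)
        o->refcount = 1;          /* the caller owns this reference */
    return o;
}

static void obj_unref(struct obj *o)
{
    if (--o->refcount == 0)
        free(o);
}

/* Linking takes its own reference on behalf of the container. */
static void container_link(struct container *c, struct obj *o)
{
    assert(c->count < 4);
    o->refcount++;
    c->items[c->count++] = o;
}

/* An iterator keeps the container alive until it is destroyed. */
static void iter_init(struct iter *it, struct container *c)
{
    c->refcount++;
    it->c = c;
    it->pos = 0;
}

static void iter_destroy(struct iter *it)
{
    it->c->refcount--;
    it->c = NULL;
}
```

After linking a freshly allocated object, the caller still holds the allocation reference and must drop it, or the object can never be freed.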
........
Outgoing PRI/BRI calls cannot do DTMF triggered transfers.
Outgoing PRI/BRI calls cannot do DTMF triggered transfers if a PROCEEDING
message is not received. The debug output shows that the DTMF begin event
is seen, but the DTMF end event is missing. When the DTMF begin happens,
the call is muted so we now have one way audio (until a DTMF end event is
somehow seen).
* Set the proceeding flag when the PRI_EVENT_ANSWER event is
received.
* Absorb the DTMF begin and DTMF end events if we are overlap dialing
and have not seen a PROCEEDING message.
* Added a debug message when absorbing a DTMF event.
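The absorption rule can be sketched as a predicate. Absorbing both events together means a DTMF begin is never passed up without its matching end, which is what left the call muted. This is a hypothetical C sketch; `struct pri_call_state`, `call_answered()` and `absorb_dtmf()` are illustrative names, not the actual channel driver code.

```c
#include <assert.h>
#include <stdbool.h>

enum frame_kind { FRAME_VOICE, FRAME_DTMF_BEGIN, FRAME_DTMF_END };

/* Hypothetical per-call state for an outgoing PRI/BRI call. */
struct pri_call_state {
    bool overlap_dialing;
    bool proceeding;   /* set on PROCEEDING, and now also on answer */
};

static void call_answered(struct pri_call_state *st)
{
    st->proceeding = true;
}

/* Returns true if the frame should be dropped. DTMF begin and end are
 * treated alike so that a begin is never delivered without its end. */
static bool absorb_dtmf(const struct pri_call_state *st, enum frame_kind kind)
{
    if (kind != FRAME_DTMF_BEGIN && kind != FRAME_DTMF_END)
        return false;
    return st->overlap_dialing && !st->proceeding;
}
```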
If a REGISTER request with a Call-ID matching an existing transaction was received,
it was possible for the REGISTER request to overwrite the initreq of the
private structure. This information is used to generate messages for other
responses in the transaction. This patch ignores REGISTER requests that match
non-REGISTER transactions.
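The check amounts to comparing the method of the incoming request with the method of the request that created the matched dialog. A hypothetical C sketch; `enum sip_method`, `struct dialog` and `should_ignore_request()` are illustrative, not chan_sip's structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sip_method { METHOD_INVITE, METHOD_REGISTER, METHOD_OPTIONS, METHOD_SUBSCRIBE };

/* Hypothetical dialog matched by Call-ID. */
struct dialog {
    enum sip_method initreq_method;   /* method of the request that created it */
};

/* A REGISTER that matches a non-REGISTER dialog is ignored, so it can never
 * overwrite that dialog's initial request. */
static bool should_ignore_request(const struct dialog *matched, enum sip_method incoming)
{
    return matched != NULL
        && incoming == METHOD_REGISTER
        && matched->initreq_method != METHOD_REGISTER;
}
```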
An attempt was made to restore some previous behavior, but I mistakenly did
not realize that the previous behavior was incorrect. This fixes DTMF not
being detected, since DTMF shouldn't cause the SSRC to change.
Don't create a Local channel if the target extension does not exist.
(closes issue #18126)
Reported by: junky
Patches:
followme.diff uploaded by junky (license 177)
(partially restructured by me to avoid a possible memory leak)
........
Improve handling of REGISTER requests with multiple contact headers.
The changes here attempt to more strictly follow RFC 3261 section 10.3.
Basically, a 400 Bad Request response will now be returned if:
- multiple Contact headers are present and one of them is set to expire all bindings ("*")
- a wildcard Contact is specified without an Expires header, or the Expires
header is not set to zero.
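The two rejection rules can be expressed as a small validation function. This is a hypothetical sketch, not chan_sip's parser: it assumes Contact headers have already been parsed into `struct contact` entries, and it represents a missing Expires header as `EXPIRES_ABSENT` (-1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parsed Contact header. */
struct contact {
    bool is_wildcard;   /* Contact: * */
};

#define EXPIRES_ABSENT (-1L)

/* Returns 0 if the REGISTER may be processed, or 400 if it must be rejected
 * under the wildcard rules of RFC 3261 section 10.3. */
static int check_register_contacts(const struct contact *contacts, size_t n, long expires)
{
    bool wildcard = false;

    for (size_t i = 0; i < n; i++) {
        if (contacts[i].is_wildcard)
            wildcard = true;
    }
    if (!wildcard)
        return 0;
    if (n > 1)
        return 400;      /* "*" mixed with other Contact headers */
    if (expires != 0)
        return 400;      /* Expires header absent or not zero */
    return 0;
}
```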
When the adaptive jitter buffer is enabled in sip.conf, the first frame placed
in the jitter buffer fails with something like:
jb_warning_output: Resyncing the jb. last_delay 0, this delay -215886466,
threshold 1000, new offset 215886466
This happens because the offset is not initialized before calling jb_put(). This
patch modifies jb_put_first_adaptive() to set the offset to the frame's
timestamp.
Russell Bryant [Thu, 2 Dec 2010 13:16:47 +0000 (13:16 +0000)]
Merged revisions 297228 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4
........
r297228 | russell | 2010-12-02 07:16:15 -0600 (Thu, 02 Dec 2010) | 6 lines
Add "DAHDI" to a couple of app_meetme error messages.
This is in response to some questions on IRC. To the user, there was nothing
that made it obvious that this error had anything to do with DAHDI not being
loaded.
........
Fix not stopping MOH when transferred local channel queue member is answered.
The problem here is only present when local channels are used with the MOH
passthru option as well as no optimization (/nm). I will describe the slightly
bizarre scenario that was used to test, where phones B and C are queue members:
Phone A dials into a queue with two members using local channels and the above
options. Phone B answers. Phone A blind transfers phone B into the same queue.
Phone A hangs up. Phone C answers, but phone B didn't stop playing MOH.
In this scenario, the unhold frame that should have gotten to phone B never
arrived due to the masquerade from the blind transfer. This is usually fine
since app_queue manages the starting and stopping of MOH. However, with the
passthrough option enabled, when app_queue attempts to stop MOH it tries to do
so on the local channel rather than the real channel. The easiest solution
was to just make sure to send an unhold frame during the transfer since it
wouldn't make sense to have MOH playing after a transfer anyway. This only
modifies SIP transfers, but the other transfers did not seem to be a problem.
If DTMF based transfers were a problem it might be okay to add ast_moh_stop
to finishup, but I didn't want to have to add that unless required.
Russell Bryant [Wed, 24 Nov 2010 23:28:19 +0000 (23:28 +0000)]
Merged revisions 296213 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4
........
r296213 | russell | 2010-11-24 17:26:43 -0600 (Wed, 24 Nov 2010) | 6 lines
Make Asterisk less crashy.
Since we might not put a new translation path on the channel, go ahead and
set it to NULL right after destroying the old one to ensure we don't try
to free an invalid translation path later on.
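The pattern is to clear the pointer immediately after freeing it, so that if no new path gets installed, later cleanup sees NULL rather than a dangling pointer. A hypothetical C sketch; `struct chan`, `struct trans_path`, `build_path()` and `update_read_path()` are illustrative, not channel.c's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct trans_path {
    int steps;
};

struct chan {
    struct trans_path *readtrans;
};

static struct trans_path *build_path(void)
{
    struct trans_path *p = malloc(sizeof(*p));
    if (p)
        p->steps = 1;
    return p;
}

static void update_read_path(struct chan *c, bool needed)
{
    if (c->readtrans) {
        free(c->readtrans);
        /* A new path might not be installed below, so clear the pointer
         * now rather than leaving it pointing at freed memory. */
        c->readtrans = NULL;
    }
    if (needed)
        c->readtrans = build_path();
}
```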
........
One-way audio to SIP phone from FXS port after FXS port gets a CallWaiting pip.
The FXS connected phone has to have CW/CID support to fail, as it will
send back a DTMF 'A' or 'D' when it's ready to receive CallerID. A normal
phone with no CID never fails. Also the SIP phone does not hear MOH when
the CW call is answered.
The DTMF end frame is suppressed when the phone acknowledges the CW signal
for CID. The problem is the DTMF begin frame needs to be suppressed as
well. The DTMF begin frame is causing SIP to start sending the DTMF RTP
frames. Since the DTMF end frame is suppressed, SIP will not stop sending
those DTMF RTP packets.
* Suppress the DTMF begin and end frames when the channel driver is
looking for DTMF digits.
* Fixed a couple issues caused by not cleaning up the CID spill if you
answer the CW call while it is sending the CID spill.
* Fixed not sending CW/CID spill to the phone when the call is natively
bridged. (Fixed by not using native bridge if CW/CID is possible.)
* Suppress received audio when sending CW/CID spills. The other parties
involved do not need to hear the CW/CID spills and may be confused if the
CW call is for them.
* v1.4 does not have the main problem fixed by suppressing the DTMF start
frames. The other three items fixed are relevant.
* If you really must restore native bridging between analog ports, you
need to disable CW/CID either by configuring chan_dahdi.conf
callwaitingcallerid=no or dialing *70 before dialing the number to
temporarily disable CW.
........
Russell Bryant [Wed, 24 Nov 2010 20:23:11 +0000 (20:23 +0000)]
Merged revisions 296082 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4
........
r296082 | russell | 2010-11-24 14:22:32 -0600 (Wed, 24 Nov 2010) | 12 lines
Fix false reporting of an error by set_format().
In the case that the native format was able to be changed to match the
new requested format, the code proceeded to attempt to build a translation
path, anyway. The result would be NULL, since no translation path is
necessary, which resulted in this function thinking an error had occurred.
This case is now specifically caught, and no attempt is made to build a
translation path.
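The distinction is that a NULL path means "nothing to do" when the native format can already match, and means failure only when translation was actually needed. A hypothetical C sketch; `fmt_t`, `build_path()` and `set_format_sketch()` are simplified stand-ins for the channel.c and translator code, with formats modeled as bit flags.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int fmt_t;   /* hypothetical: one bit per format */

struct trans_path {
    fmt_t from, to;
};

struct chan {
    fmt_t native;        /* current native format */
    fmt_t native_caps;   /* formats the driver can switch to natively */
    struct trans_path *path;
};

/* Hypothetical translator lookup: NULL if from == to (nothing to build)
 * or if `to` is not reachable (a real failure). */
static struct trans_path *build_path(fmt_t to, fmt_t from, fmt_t translatable)
{
    struct trans_path *p;

    if (to == from || !(translatable & to))
        return NULL;
    p = malloc(sizeof(*p));
    if (p) {
        p->from = from;
        p->to = to;
    }
    return p;
}

/* Returns 0 on success, -1 on failure. */
static int set_format_sketch(struct chan *c, fmt_t requested, fmt_t translatable)
{
    free(c->path);
    c->path = NULL;

    /* Native switch possible: no translation path needed, not an error. */
    if (c->native_caps & requested) {
        c->native = requested;
        return 0;
    }

    /* Translation needed: NULL now really does mean failure. */
    c->path = build_path(requested, c->native, translatable);
    return c->path ? 0 : -1;
}
```

With failure reported this way, a caller such as the bridging code can end the call instead of passing along frames that cannot be translated.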
Thanks to our automated tests and bamboo.asterisk.org for catching this problem
and making a whole lot of noise when things started failing. :-)
........
Russell Bryant [Wed, 24 Nov 2010 17:03:16 +0000 (17:03 +0000)]
Merged revisions 296000 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4
........
r296000 | russell | 2010-11-24 10:48:39 -0600 (Wed, 24 Nov 2010) | 38 lines
Handle failures building translation paths more effectively.
The problem scenario occurred on a heavily loaded system that was using the
codec_dahdi module and exceeded the hardware transcoding capacity. The failure
mode at that point was not good. The report came in to us as an Asterisk
lock-up. The "core show locks" shows a ton of threads locked up (but no
obvious deadlock). Upon deeper investigation, when the system is in this
state, the CPU was maxed out. The CPU was being consumed by the Asterisk
logger spewing messages on every audio frame for calls set up after transcoder
capacity was reached.
The purpose of this patch is to make Asterisk handle failures to create a
translation path in a more graceful manner. If we can't translate, then the
call just needs to be dropped, as it's not going to work. These are the
changes:
1) In set_format() of channel.c (which is called by set_read_format() and
set_write_format()), it was ignoring if ast_translator_build_path() failed and
returned NULL. It now pays attention to that case and returns a result
reflecting failure. With this change in place, the bridging code will
immediately detect a failure and end the bridge instead of proceeding to try to
bridge frames that can't be translated and making channel drivers freak out by
sending them frames in a format they weren't expecting.
2) In ast_indicate_data() of channel.c, failure of ast_playtones_start() was
ignored. It is now reflected in the return value of the function. This didn't
turn out to have any effect on the bug, but seemed like a good change to leave
in.
3) In app_dial, when a call is sent to only a single endpoint, Dial() will
attempt to do some early audio bridging of its own. It uses
make_compatible() to do this. However, it ignored failure from
make_compatible(). So, even with fix #1, if there was early audio going
through app_dial, there would still be a period of invalid frames passing
through. After detecting failure here, Dial() exits.
The channel redirect function (CLI or AMI) hangs up the call instead of redirecting the call.
To recreate the problem:
1) Party A calls Party B
2) Invoke CLI "channel redirect" command to redirect channel call leg
associated with A.
3) All associated channels are hung up.
Note that if the CLI command is done on the channel call leg associated
with B, it works.
This regression was a result of the fix for issue #16946
(https://reviewboard.asterisk.org/r/740/).
The regression affects all features that use an async goto to execute the
dialplan because of an external event: Channel redirect, AMI redirect, SIP
REFER, and FAX detection.
The struct ast_channel._softhangup code is a mess. The variable is used
for several purposes that do not necessarily result in the call being hung
up. I have added doxygen comments to describe how the various _softhangup
bits are used. I have corrected all the places where the variable was
tested in a non-bit oriented manner.
The primary fix is the new AST_CONTROL_END_OF_Q frame. It acts as a weak
hangup request so the soft hangup requests that do not normally result in
a hangup do not hang up.
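The difference between testing `_softhangup` as a whole and testing individual bits can be sketched as follows. The flag names and values here are hypothetical, chosen only to show why an async goto request (used by redirects) must not be mistaken for a real hangup request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical soft hangup bits; not Asterisk's actual definitions. */
enum {
    SOFTHANGUP_DEV       = 1u << 0,   /* device requested hangup */
    SOFTHANGUP_ASYNCGOTO = 1u << 1,   /* break out to run an async goto */
    SOFTHANGUP_EXPLICIT  = 1u << 2,   /* explicit hangup request */
};

/* Not bit oriented: any set bit, including an async goto, looks fatal. */
static bool should_hangup_wrong(unsigned int softhangup)
{
    return softhangup != 0;
}

/* Bit oriented: an async goto alone does not end the call. */
static bool should_hangup(unsigned int softhangup)
{
    return (softhangup & ~(unsigned int)SOFTHANGUP_ASYNCGOTO) != 0;
}
```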
Russell Bryant [Sat, 20 Nov 2010 00:45:51 +0000 (00:45 +0000)]
Fix cache of device state changes for multiple servers.
This patch addresses a regression where device states across multiple servers
were not being processed correctly. The code works to determine
the overall state by looking at the last known state of a device on each
server. However, there was a regression due to some invasive rewrites of how
the cache works that led to the cache only storing the last device state change
for a device, regardless of which server it was on.
The code is set up to cache device state change events by ensuring that each
event in the cache has a unique device name + entity ID (server ID). The code
that was responsible for comparing raw information elements (which the EID is)
always returned a match due to a memcmp() with a length of 0.
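Because `memcmp()` with a length of 0 compares nothing and returns 0, any two elements compare equal. A minimal C illustration; `struct raw_ie` and both comparison functions are hypothetical, and a 6-byte entity ID is assumed for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical raw information element, e.g. an entity ID (EID). */
struct raw_ie {
    const unsigned char *data;
    size_t len;
};

/* Buggy: a zero length means every pair of EIDs "matches". */
static bool raw_ie_equal_buggy(const struct raw_ie *a, const struct raw_ie *b)
{
    return memcmp(a->data, b->data, 0) == 0;
}

/* Fixed: compare the actual payload length and bytes. */
static bool raw_ie_equal(const struct raw_ie *a, const struct raw_ie *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```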
There isn't much code to fix the actual bug. This patch also introduces a new
CLI command that was very useful for debugging this problem. The command
allows you to dump the contents of the event cache.
(closes issue #18284)
Reported by: klaus3000
Patches:
issue18284.rev1.txt uploaded by russell (license 2)
Tested by: russell, klaus3000
This is not a perfect solution as headers that are joined via commas are not
detected. This is a parsing issue that to fix "correctly" would necessitate
a new SIP parser.
Fix regression causing abort in voicemail after opening a mailbox with no messages.
To be safer, some error handling code was changed to respect more
error conditions including the potential memory allocation failure for deleted
and heard message tracking introduced in 293004. However, last_message_index
returns -1 for zero messages (perhaps as expected) and was triggering the
stricter error checking. Because last_message_index is only called directly
in one place, just return 0 from open_mailbox (for file based storage) when no
messages are detected unless a real error has occurred.
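The key point is that `-1` from `last_message_index` is a valid "empty mailbox" answer rather than an error. A hypothetical sketch; `open_mailbox_sketch()` and its parameters are illustrative only, with an allocation failure of the deleted/heard tracking standing in for a real error.

```c
#include <assert.h>
#include <stdbool.h>

/* last_index: highest message index, or -1 when the mailbox is empty.
 * tracking_ok: false if allocating the deleted/heard tracking failed.
 * Returns 0 on success (including an empty mailbox), -1 on a real error. */
static int open_mailbox_sketch(int last_index, bool tracking_ok, int *msg_count)
{
    if (!tracking_ok)
        return -1;               /* a genuine error */
    *msg_count = last_index + 1; /* -1 (no messages) becomes a count of 0 */
    return 0;
}
```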
Fix problem with qualify option packets for realtime peers never stopping.
The option packets not only never stopped, but if a realtime peer was not in
the peer list multiple options dialogs could accumulate over time. This
scenario has the potential to progress to the point of saturating a link just
from options packets. The fix was to ensure that the poke scheduler checks to
see if a peer is in the peer list before continuing to poke. The reason a peer
must be in the peer list to be able to properly manage an options dialog is
because otherwise the call pointer is lost when the peer is regenerated from
the database, which is how existing qualify dialogs are detected.
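The fix boils down to a membership check at the top of the scheduled poke. A hypothetical C sketch: `peer_list`, `peer_in_list()` and `poke_peer_cb()` are illustrative names, and it assumes the scheduler convention that a callback returning nonzero is rescheduled while returning 0 stops it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct peer {
    const char *name;
    int pokes_sent;
};

#define MAX_PEERS 8

static struct peer *peer_list[MAX_PEERS];
static size_t peer_count;

static bool peer_in_list(const struct peer *p)
{
    for (size_t i = 0; i < peer_count; i++) {
        if (peer_list[i] == p)
            return true;
    }
    return false;
}

/* Scheduled qualify callback: nonzero reschedules, 0 stops. A realtime peer
 * that is not in the peer list cannot track its OPTIONS dialog, so stop. */
static int poke_peer_cb(void *data)
{
    struct peer *p = data;

    if (!peer_in_list(p))
        return 0;
    p->pokes_sent++;   /* stands in for sending an OPTIONS request */
    return 1;
}
```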
Copied from some notes from the original author (Russell):
Deadlock scenario:
Thread 1: device state change thread
Holds - rdlock on contexts
Holds - hints lock
Waiting on channels container lock
Thread 2: SIP monitor thread
Holds the "iflock"
Holds a sip_pvt lock
Holds channel container lock
Waiting for a channel lock
Thread 3: A channel thread (chan_local in this case)
Holds 2 channel locks acquired within app_dial
Holds a 3rd channel lock it got inside of chan_local
Holds a local_pvt lock
Waiting on a rdlock of the contexts lock
A bunch of other threads waiting on a wrlock of the contexts lock
To address this deadlock, some locking order rules must be put in place and
enforced. Existing relevant rules:
1) channel lock before a pvt lock
2) contexts lock before hints lock
3) channels container before a channel
What's missing is some enforcement of the order when more than two locks are
involved. To fix this problem, I put in some code that ensures that (at least in the
code paths involved in this bug) the locks in (3) come before the locks in (2).
To change the operation of thread 1 to comply, I converted the storage of hints
to an astobj2 container. This allows processing of hints without holding the
hints container lock. So, in the code path that led to thread 1's state, it no
longer holds either the contexts or hints lock while it attempts to lock the
channels container.
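One way to make such an ordering checkable is to give each lock class a rank and refuse any acquisition that goes backwards. This is a hypothetical C sketch of that idea, not code from Asterisk; the ranks encode rules (2) and (3) plus the new rule that the channels container and channel locks come before the contexts and hints locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ranks: lower ranks must be acquired first. */
enum lock_rank {
    RANK_CHANNELS_CONTAINER = 1,
    RANK_CHANNEL = 2,
    RANK_CONTEXTS = 3,
    RANK_HINTS = 4,
};

#define MAX_HELD 16

static enum lock_rank held[MAX_HELD];
static size_t n_held;

/* Returns false, without recording the lock, if taking it would go against
 * the order. Equal ranks are allowed here (e.g. two channel locks), since
 * same-class locks need separate deadlock avoidance. */
static bool acquire(enum lock_rank r)
{
    if (n_held > 0 && held[n_held - 1] > r)
        return false;
    if (n_held == MAX_HELD)
        return false;
    held[n_held++] = r;
    return true;
}

static void release_all(void)
{
    n_held = 0;
}
```

Thread 1's old path (contexts, then hints, then the channels container) is exactly the sequence this check rejects.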
Jeff Peeler [Mon, 8 Nov 2010 21:58:13 +0000 (21:58 +0000)]
Fix playback failure when using IAX with the timerfd module.
To fix this issue the alert pipe will now be used when the timerfd module is
in use. There appeared to be a race that was not solved by adding locking to the
timerfd module, although that locking needed to be there anyway. The race was between the timer
being put in non-continuous mode in ast_read on the channel thread and the IAX
frame scheduler queuing a frame which would enable continuous mode before the
non-continuous mode event was read. This race for now is simply avoided.
Modify our handling of 491 responses to drop any pending reinvite retry scheduler entries if we get a new 491.
This prevents a scheduler entry from leaking if we receive a 491 response when one is pending. If a scheduler entry leaks, the pvt it is associated with may get destroyed before the scheduler entry fires, and then memory corruption and crashes can occur when the scheduled reinvite attempts to access and modify the memory of the destroyed pvt.
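The pattern is "delete any pending entry before adding a new one", so a pvt never has more than one outstanding reinvite retry. A hypothetical C sketch with a toy scheduler; `sched_add()`, `sched_del()`, `struct pvt` and `handle_491()` are illustrative, not the Asterisk scheduler API.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SCHED 16

static bool sched_active[MAX_SCHED];

/* Toy scheduler: returns an entry id, or -1 if full. */
static int sched_add(void)
{
    for (int i = 0; i < MAX_SCHED; i++) {
        if (!sched_active[i]) {
            sched_active[i] = true;
            return i;
        }
    }
    return -1;
}

static void sched_del(int id)
{
    if (id >= 0 && id < MAX_SCHED)
        sched_active[id] = false;
}

static int sched_pending(void)
{
    int n = 0;
    for (int i = 0; i < MAX_SCHED; i++)
        n += sched_active[i];
    return n;
}

struct pvt {
    int reinvite_retry_id;   /* -1 when nothing is scheduled */
};

/* On each 491, drop any pending retry before scheduling a new one, so no
 * entry is left behind to fire after the pvt is destroyed. */
static void handle_491(struct pvt *p)
{
    sched_del(p->reinvite_retry_id);
    p->reinvite_retry_id = sched_add();
}
```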
codecs/codec_dahdi: Prevent "choppy" audio when receiving unexpected frame sizes.
dahdi-linux 2.4.0 (specifically commit 9034) added the capability for
the wctc4xxp to return more than a single packet of data in response to
a read. However, when decoding packets, codec_dahdi was still assuming
that the default number of samples was in each read.
In other words, each packet your provider sent you, regardless of size,
would result in 20 ms of decoded data (30 ms if decoding G723). If your
provider was sending 60 ms packets then codec_dahdi would end up
stripping 40 ms of data from each transcoded frame resulting in "choppy"
audio.
This would only affect systems where G729 packets are arriving in sizes
greater than 20 ms or G723 packets arriving in sizes greater than 30 ms.
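The arithmetic is easy to check. At 8 kHz there are 8 samples per millisecond, so a 60 ms packet decodes to 480 samples, while a fixed 20 ms assumption accounts for only 160, discarding 320 samples (40 ms). A minimal C sketch of counting samples from the bytes actually read rather than assuming a fixed amount per packet; `samples_in_read()`, `ms_lost()` and the `bytes_per_sample` parameter are hypothetical, not codec_dahdi's code.

```c
#include <assert.h>
#include <stddef.h>

#define SAMPLES_PER_MS 8   /* 8 kHz narrowband audio */

/* Count samples from what the read actually returned. bytes_per_sample is
 * 2 for 16-bit signed linear or 1 for an 8-bit companded format. */
static size_t samples_in_read(size_t bytes_read, size_t bytes_per_sample)
{
    return bytes_read / bytes_per_sample;
}

/* Milliseconds of audio lost if only default_ms worth of samples were kept. */
static size_t ms_lost(size_t samples, size_t default_ms)
{
    size_t kept = default_ms * SAMPLES_PER_MS;
    return samples > kept ? (samples - kept) / SAMPLES_PER_MS : 0;
}
```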
Party A in an analog 3-way call would continue to hear ringback after party C answers.
All parties are analog FXS ports.
1) A calls B.
2) A flash hooks to call C.
3) A flash hooks to bring C into 3-way call before C answers. (A and B hear ringback)
4) C answers
5) A continues to hear ringback during the 3-way call. (All parties can hear each other.)
* Fixed use of wrong variable in dahdi_bridge() that stopped ringback on
the wrong subchannel.
* Made several debug messages have more information.
A similar issue happens if B and C are SIP channels. B continues to hear
ringback. For some reason this only affects v1.8 and trunk.
* Don't start ringback on the real and 3-way subchannels when creating the
3-way conference. Removing this code is benign on v1.6.2 and earlier.
........
Make warning message have more useful information in it.
Change "Unable to get index, and nullok is not asserted" to "Unable to get
index for '<channel-name>' on channel <number> (<function>(), line
<number>)".
........
"!00" evaluated as false, which is incorrect. Fixing.
Reported (though the reporter did not understand he was reporting a bug) on the asterisk-users list:
http://lists.digium.com/pipermail/asterisk-users/2010-October/255505.html
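One plausible way such a bug arises is testing the operand against the literal string "0" instead of its numeric value, which makes "00" count as true and therefore "!00" as false. The following is a hypothetical C sketch of a numeric-aware check, not the actual expression parser code; treating an empty string as false is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* String comparison: "00" is not "0", so it is treated as true. */
static bool is_false_literal(const char *s)
{
    return *s == '\0' || strcmp(s, "0") == 0;
}

/* Numeric-aware: any string that parses completely as zero is false. */
static bool is_false_numeric(const char *s)
{
    char *end;
    long v;

    if (*s == '\0')
        return true;
    errno = 0;
    v = strtol(s, &end, 10);
    if (*end == '\0' && errno == 0)
        return v == 0;   /* "0", "00", "-0" are all zero */
    return false;        /* non-numeric strings are true */
}

static int logical_not(const char *s)
{
    return is_false_numeric(s) ? 1 : 0;
}
```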
........