When unloading the app_queue module the members in each queue are
destroyed and as part of this they are removed from the pending
members container. Unfortunately a crash would occur as the container
was destroyed before the members were removed.
This change tweaks ordering so the container destruction occurs
after the members are destroyed.
Kevin Harwell [Thu, 21 Apr 2016 19:23:21 +0000 (14:23 -0500)]
app_queue: queue members can receive multiple calls
It was possible for a queue member that is a member of at least 2 or more
queues to receive mulitiple calls at the same time. This happened because
of a race between when a member was being rung and when the device state
notified the other queue(s) member object of the state change.
This patch makes it so when a queue member is being rung it gets added to
a global pool of queue members. If that same member is tried again, e.g.
from another queue, and it is found to already exist in the pending member
container then it will not ring that member.
George Joseph [Fri, 22 Apr 2016 22:53:23 +0000 (16:53 -0600)]
res_agi: Prevent run_agi from eating frames it shouldn't
The run_agi function is eating control frames when it shouldn't be. This is
causing issues when an AGI is run from CONNECTED_LINE_SEND_SUB in a blond
transfer.
Alice calls Bob. Bob attended transfers to Charlie but hangs up before Charlie
answers.
Alice gets the COLP UPDATE indicating Charlie but Charlie never gets an UPDATE
and is left thinking he's connected to Bob.
In this case, when CONNECTED_LINE_SEND_SUB runs on Alice's channel and it calls
an AGI, the extra eaten frames prevent CONNECTED_LINE_SEND_SUB from running on
Charlie's channel.
The fix was to accumulate deferrable frames in the "forever" loop instead of
dropping them, and re-queue them just before running the actual agi command
or exiting.
Richard Mudgett [Fri, 15 Apr 2016 19:36:59 +0000 (14:36 -0500)]
res_stasis: Handle re-enter stasis bridge with swap channel.
We lose the fact that there is a swap channel if there is one. We
currently wind up rejoining the stasis bridge as a normal join after the
swap channel has already been kicked from the bridge.
This patch preserves the swap channel so the AMI/ARI events can note that
the channel joining the bridge is swapping with another channel. Another
benefit to swaqpping in one operation is if there are any channels that
get lonely (MOH, bridge playback, and bridge record channels). The lonely
channels won't leave before the joining channel has a chance to come back
in under stasis if the swap channel is the only reason the lonely channels
are staying in the bridge.
ASTERISK-25947 #close
Reported by: Richard Mudgett
Richard Mudgett [Tue, 19 Apr 2016 21:58:32 +0000 (16:58 -0500)]
bridge: Hold off more than one imparting channel at a time.
An earlier patch blocked the ast_bridge_impart() call until the channel
either entered the target bridge or it failed. Unfortuantely, if the
target bridge is stasis and the imprted channel is not a stasis channel,
stasis bounces the channel out of the bridge to come back into the bridge
as a proper stasis channel. When the channel is bounced out, that
released the block on ast_bridge_impart() to continue. If the impart was
a result of a transfer, then it became a race to see if the swap channel
would get hung up before the imparted channel could come back into the
stasis bridge. If the imparted channel won then everything is fine. If
the swap channel gets hung up first then the transfer will fail because
the swap channel is leaving the bridge.
* Allow a chain of ast_bridge_impart()'s to happen before any are
unblocked to prevent the race condition described above. When the channel
finally joins the bridge or completely fails to join the bridge then the
ast_bridge_impart() instances are unblocked.
Kevin Harwell [Wed, 8 Jul 2015 19:56:10 +0000 (14:56 -0500)]
bridge.c: Fixed race condition during attended transfer
During an attended transfer a thread is started that handles imparting the
bridge channel. From the start of the thread to when the bridge channel is
ready exists a gap that can potentially cause problems (for instance, the
channel being swapped is hung up before the replacement channel enters the
bridge thus stopping the transfer). This patch adds a condition that waits
for the impart thread to get to a point of acceptable readiness before
allowing the initiating thread to continue.
Kevin Harwell [Mon, 22 Jun 2015 20:11:18 +0000 (15:11 -0500)]
bridge.c: Hangup attended transfer target if bridged
After completing an attended transfer the transfer target channel was not being
hung up after leaving the bridge. Added an explicit softhangup to hangup said
channel, but only if it was previously bridged.
Kevin Harwell [Tue, 7 Apr 2015 16:40:44 +0000 (16:40 +0000)]
bridge.c: Hangup attended transfer target after it has been swapped out
After completing an attended transfer the transfer target channel (the one that
gets swapped out) was not being hung up after leaving the bridge. This resulted
in a channel possibly being left around. Added an explicit softhangup for the
channel in question after the transfer is successfully completed in order to
make sure the channel is hung up.
ASTERISK-24782 #close
Reported by: John Bigelow
Review: https://reviewboard.asterisk.org/r/4575/
stasis transfer: fix stasis bridge push race part two
When swapping a Local channel in place of one already
in a bridge (to complete a bridge attended transfer),
the channel that was swapped out can actually be hung
up before the stasis bridge push callback executes on
the independant transfer thread. This results in the
stasis app loop dropping out and removing the control
that has the the app name which the local replacement
channel needs so it can re-enter stasis.
To avoid this race condition a new push_peek callback
has been added, and called from the ast_bridge_impart
thread before it launches the independant thread that
will complete the transfer. Now the stasis push_peek
callback can copy the stasis app name before the swap
channel can hang up.
stasis transfer: fix a race condition on stasis bridge push
After a bridge transfer completes where a local replacement
channel is used, a stasis transfer message with the details
of the transfer is sent. This is processed by stasis which
then sets the stasis app name and replaced channel snapshot
on the replacement channel.
However, since a separate thread was already started to run
stasis on the new replacement channel, a race was on to see
if the message processing would be completed before the app
name was needed, otherwise the channel would be hung up.
This change moves the calls used to set the stasis app name
and the replace snapshot to the bridge_stasis_push function
callback from the bridge transfer logic, allowing the steps
to be completed earlier and more deterministically, and the
race elimianted.
NOTE: the swap channel parameter to bridge_stasis_push (and
thus all bridge push callbacks) must always be present when
performing a swap with another channel.
ASTERISK-24649 #close
Reported by: John Bigelow
Review: https://reviewboard.asterisk.org/r/4341/
Richard Mudgett [Thu, 22 Jan 2015 19:24:28 +0000 (19:24 +0000)]
Bridge core: Pass a ref with the swap channel when joining a bridge.
When code imparts a channel into a bridge to swap with another channel, a
ref needs to be held on the swap channel to ensure that it cannot
dissapear before finding it in the bridge.
* The ast_bridge_join() swap channel parameter now always steals a ref for
the swap channel. This is the only change to the bridge framework's
public API semantics.
* bridge_channel_internal_join() now requires the bridge_channel->swap
channel to pass in a ref.
George Joseph [Tue, 19 Apr 2016 22:52:15 +0000 (16:52 -0600)]
res_pjsip_callerid: Clear out display name if id->name is not valid
When create_new_id_hdr creates a new RPID or PAI header, it starts by cloning
the From header, then it overwrites the display name and uri from the channel's
connected.id. If the connected.id.name wasn't valid, create_new_id_hdr was
leaving the display name from the From header in the new RPID or PAI header.
On an attended transfer where the originator had a caller id number set but not
a display name, the re-INVITE to the final transferee had the number of the
originator but the display name of the transferer.
Added a check to clear out the display name in the new header if
connected.id.name was invalid.
Mark Michelson [Mon, 18 Apr 2016 17:12:37 +0000 (12:12 -0500)]
PJSIP: Remove PJSIP parsing functions from uri length validation.
The PJSIP parsing functions provide a nice concise way to check the
length of a hostname in a SIP URI. The problem is that in order to use
those parsing functions, it's required to use them from a thread that
has registered with PJLib.
On startup, when parsing AOR configuration, the permanent URI handler
may not be run from a PJLib-registered thread. Specifically, this could
happen when Asterisk was started in daemon mode rather than
console-mode. If PJProject were compiled with assertions enabled, then
this would cause Asterisk to crash on startup.
The solution presented here is to do our own parsing of the contact URI
in order to ensure that the hostname in the URI is not too long. The
parsing does not attempt to perform a full SIP URI parse/validation,
since the hostname in the URI is what is important.
Mark Michelson [Mon, 18 Apr 2016 22:00:42 +0000 (17:00 -0500)]
res_pjsip_registrar: Fix bad memory-ness with user_agent.
Recent changes to the PJSIP registrar resulted in tests failing due to
missing AOR_CONTACT_ADDED test events. The reason for this was that the
user_agent string had junk values in it, resulting in being unable to
generate the event.
I'm going to be honest here, I have no idea why this was happening. Here
are the steps needed for the user_agent variable to get messed up:
* REGISTER is received
* First contact in the REGISTER results in a contact being removed
* Second contact in the REGISTER results in a contact being added
* The contact, AOR, expiration, and user agent all have to be passed as
format parameters to the creation of a string. Any subset of those
parameters would not be enough to cause the problem.
Looking into what was happening, the thing that struck me as odd was
that the user_agent variable was meant to be set to the value of the
User-Agent SIP header in the incoming REGISTER. However, when removing a
contact, the user_agent variable would be set (via ast_strdupa inside a
loop) to the stored contact's user_agent. This means that the
user_agent's value would be incorrect when attempting to process further
contacts in the incoming REGISTER.
The fix here is to use a different variable for the stored user agent
when removing a contact. Correcting the behavior to be correct also
means the memory usage is less weird, and the issue no longer occurs.
res_pjsip_transport_management: Allow unload to occur.
At shutdown it is possible for modules to be unloaded that wouldn't
normally be unloaded. This allows the environment to be cleaned up.
The res_pjsip_transport_management module did not have the unload
logic in it to clean itself up causing the res_pjsip module to not
get unloaded. As a result the res_pjsip monitor thread kept going
processing traffic and timers when it shouldn't.
Mark Michelson [Thu, 14 Apr 2016 18:49:35 +0000 (13:49 -0500)]
transport management: Register thread with PJProject.
The scheduler thread that kills idle TCP connections was not registering
with PJProject properly and causing assertions if PJProject was built in
debug mode.
This change registers the thread with PJProject the first time that the
scheduler callback executes.
"Idle" here means that someone connects to us and does not send a SIP
request. PJProject will not automatically time out such connections, so
it's up to Asterisk to do it instead.
When we receive an incoming TCP connection, we will start a timer
(equivalent to transaction timer D) waiting to receive an incoming
request. If we do not receive a request in that timeframe, then we will
shut down the TCP connection.
Some SIP devices indicate hold/unhold using deferred SDP reinvites. In
other words, they provide no SDP in the reinvite.
A typical transaction that starts hold might look something like this:
* Device sends reinvite with no SDP
* Asterisk sends 200 OK with SDP indicating sendrecv on streams.
* Device sends ACK with SDP indicating sendonly on streams.
At this point, PJMedia's SDP negotiator saves Asterisk's local state as
being recvonly.
Now, when the device attempts to unhold, it again uses a deferred SDP
reinvite, so we end up doing the following:
* Device sends reinvite with no SDP
* Asterisk sends 200 OK with SDP indicating recvonly on streams
* Device sends ACK with SDP indicating sendonly on streams
The problem here is that Asterisk offered recvonly, and by RFC 3264's
rules, if an offer is recvonly, the answer has to be sendonly. The
result is that the device is not taken off hold.
What is supposed to happen is that Asterisk should indicate sendrecv in
the 200 OK that it sends. This way, the device has the freedom to
indicate sendrecv if it wants the stream taken off hold, or it can
continue to respond with sendonly if the purpose of the reinvite was
something else (like a session timer refresher).
The fix here is to alter the SDP negotiator's state when we receive a
reinvite with no SDP. If the negotiator's state is currently in the
recvonly or inactive state, then we alter our local state to be
sendrecv. This way, we allow the device to indicate the stream state as
desired.
ASTERISK-25854 #close
Reported by Robert McGilvray
Richard Mudgett [Mon, 28 Mar 2016 23:10:40 +0000 (18:10 -0500)]
res_stasis: Fix crash on a hanging up channel.
* Give the struct stasis_app_control ao2 object a ref to the channel held
in the object. Now the channel will still be around if a thread needs to
post a stasis message instead of crash because the topic was destroyed.
* Moved stopping any lingering silence generator out of the struct
stasis_app_control destructor and made it a part of exiting the Stasis
application. Who knows which thread the destructor will be called under
so it cannot affect the channel's silence generator. Not only was the
channel unprotected when the silence generator was stopped, stasis may no
longer even control the channel.
Joshua Colp [Mon, 15 Feb 2016 18:52:22 +0000 (14:52 -0400)]
res_pjsip_pubsub: Move where the subscription is stored to after initialized.
A problem arose when testing the AMI subscription listing actions where it
was possible for a subscription that had not been fully initialized to be
listed. This was problematic as the underlying listing code would crash.
This change makes it so the subscription tree is fully set up before it is
added to the list of subscriptions. This ensures that when the listing actions
get the subscription it is valid.
Mark Michelson [Thu, 4 Feb 2016 22:17:55 +0000 (16:17 -0600)]
Check for OpenSSL defines before trying to use them.
The SSL_OP_NO_TLSv1_1 and SSL_OP_NO_TLSv1_2 defines did not exist prior
to OpenSSL version 1.0.1. A recent commit attempts to, by default, set
these options, which can cause problems on systems with older OpenSSL
installations.
This commit adds a configure script check for those defines and will not
attempt to make use of those if they do not exist. We will print a
warning urging the user to upgrade their OpenSSL installation if those
defines are not present.
Mark Michelson [Thu, 4 Feb 2016 17:39:10 +0000 (11:39 -0600)]
res_stasis_device_state: Fix refcounting error.
Device state subscription lifetimes were governed by when the
subscription was established and unsubscribed from. However, it is
possible that at the time of unsubscription, there could be device state
events still in flight. When those device state events occur, the device
state callback could attempt to dereference a freed pointer. Crash.
This change ensures that the lifetime of the device state subscription
does not end until the underlying stasis subscription has confirmed that
its final message has been sent.
Joshua Colp [Wed, 3 Feb 2016 18:05:52 +0000 (14:05 -0400)]
AST-2016-001 http: Provide greater control of TLS and set modern defaults.
This change exposes the configuration of various aspects of the TLS
support and sets the default to the modern standards.
The TLS cipher is now set to the best values according to the
Mozilla OpSec team, different TLS versions can now be disabled, and
the cipher order can be forced to be that of the server instead of
the client.
Richard Mudgett [Mon, 7 Dec 2015 18:46:53 +0000 (12:46 -0600)]
AST-2016-003 udptl.c: Fix uninitialized values.
Sending UDPTL packets to Asterisk with the right amount of missing
sequence numbers and enough redundant 0-length IFP packets, can make
Asterisk crash.
Setting the sip.conf timert1 value to a value higher than 1245 can cause
an integer overflow and result in large retransmit timeout times. These
large timeout times hold system file descriptors hostage and can cause the
system to run out of file descriptors.
NOTE: The default sip.conf timert1 value is 500 which does not expose the
vulnerability.
* The overflow is now detected and the previous timeout time is
calculated.
ASTERISK-25397 #close
Reported by: Alexander Traud
Joshua Colp [Mon, 25 Jan 2016 15:35:21 +0000 (11:35 -0400)]
config: Allow options to register when documentation is unavailable.
The config options framework is strict in that configuration options must
be documented unless XML documentation support is not available. In
practice this is useful as it ensures documentation exists however in
off-nominal cases this can cause strange problems.
If it is expected that a config option has a non-zero or non-empty
default value but the config option documentation is unavailable
this reasonable expectation will not be met. This can cause obscure
crashes and weirdness depending on how the code handles it.
This change tweaks the behavior to ensure that the config option
is still allowed to register, apply default values, and be set when
devmode is not enabled. If devmode is enabled then the option can
NOT be set.
This also does not remove the initial documentation error message that
is output on load when registering the configuration option.
Mark Michelson [Mon, 25 Jan 2016 22:51:25 +0000 (16:51 -0600)]
res_pjsip_pubsub: Prevent crash from AMI command on freed subscription.
A test recently uncovered that running an ill-timed AMI command to show
inbound subscriptions could cause a crash since Asterisk will try to
operate on a freed subscription.
The fix for this is to remove the subscription tree from the list of
subscriptions at the time that we are sending our final NOTIFY request
out. This way, as the subscription is in the process of dying, it is
inaccessible from AMI.
Kevin Harwell [Thu, 14 Jan 2016 20:42:57 +0000 (14:42 -0600)]
bridge_basic: don't cache xferfailsound during an attended transfer
The xferfailsound was read from the channel at the beginning of the transfer,
and that value is "cached" for the duration of the transfer. Therefore, changing
the xferfailsound on the channel using the FEATURE() dialplan function does
nothing once the transfer is under way.
This makes it so the transfer code instead gets the xferfailsound configuration
options from the channel when it is actually going to be used.
This patch also fixes a potential memory leak of the props object as well as
making sure the condition variable gets initialized before being destroyed.
Joshua Colp [Mon, 28 Dec 2015 20:02:19 +0000 (16:02 -0400)]
test_time: Provide a timeout when waiting.
The test_timezone_watch unit test is written to expect a
condition to be signaled when the inotify daemon thread runs.
There exists a small window where the test_timezone_watch
thread can signal the inotify daemon thread while it is not
reading on the underlying file descriptor. If this occurs
the test_timezone_watch thread will wait indefinitely for a
signal that will never arrive.
This change adds a timeout to the condition so it will return
regardless after a period of time.
Joshua Colp [Tue, 12 Jan 2016 17:14:29 +0000 (13:14 -0400)]
app: Queue hangup if channel is hung up during sub or macro execution.
This issue was exposed when executing a connected line subroutine.
When connected or redirected subroutines or macros are executed it is
expected that the underlying applications and logic invoked are fast
and do not consume frames. In practice this constraint is not enforced
and if not adhered to will cause channels to continue when they shouldn't.
This is because each caller of the connected or redirected logic does not
check whether the channel has been hung up on return. As a result the
the hung up channel continues.
This change makes it so when the API to execute a subroutine or
macro is invoked the channel is checked to determine if it has hung up.
If it has then a hangup is queued again so the caller will see it
and stop.
Kevin Harwell [Fri, 8 Jan 2016 21:22:05 +0000 (15:22 -0600)]
pbx: Deadlock between contexts container and context_merge locks
Recent changes (ASTERISK-25394 commit 2bd27d12223fe33b58c453965ed5c6ed3af7c4f5)
introduced the possibility of a deadlock. Due to the mentioned modifications
ast_change_hints now needs to keep both merge/delete and state callbacks from
occurring while it executes. Unfortunately, sometimes ast_change_hints can be
called with the contexts container locked. When this happens it's possible for
another thread to grab the context_merge_lock before the thread calling into
ast_change_hints does and then try to obtain the contexts container lock. This
of course causes a deadlock between the two threads. The thread calling into
ast_change_hints waits for the other thread to release context_merge_lock and
the other thread is waiting on that one to release the contexts container lock.
Unfortunately, there is not a great way to fix this problem. When hints change,
the subsequent state callbacks cannot run at the same time as a merge/delete,
nor when the usual state callbacks do. This patch alleviates the problem by
having those particular callbacks (the ones run after a hint change) occur in a
serialized task. By moving the context_merge_lock to a task it can now safely be
attempted or held without a deadlock occurring.
ASTERISK-25640 #close
Reported by: Krzysztof Trempala
Mark Michelson [Thu, 7 Jan 2016 21:37:36 +0000 (15:37 -0600)]
PJSIP: Prevent deadlock due to dialog/transaction lock inversion.
A deadlock was observed where the monitor thread was stuck, therefore
resulting in no incoming SIP traffic being processed.
The problem occurred when two 200 OK responses arrived in response to a
terminating NOTIFY request sent from Asterisk. The first 200 OK was
dispatched to a threadpool worker, who locked the corresponding
transaction. The second 200 OK arrived, resulting in the monitor thread
locking the dialog. At this point, the two threads are at odds, because
the monitor thread attempts to lock the transaction, and the threadpool
thread loops attempting to try to lock the dialog.
In this case, the fix is to not have the monitor thread attempt to hold
both the dialog and transaction locks at the same time. Instead, we
release the dialog lock before attempting to lock the transaction.
There have also been some debug messages added to the process in an
attempt to make it more clear what is going on in the process.
Jonathan Rose [Thu, 10 Dec 2015 17:44:03 +0000 (11:44 -0600)]
chan_sip: Add TCP/TLS keepalive to TCP/TLS server
Adds the TCP Keep Alive option to TCP and TLS server sockets. Previously
this option was only being set on session sockets.
http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/
According to the link above, the SO_KEEPALIVE option is useful for knowing
when a TCP connected endpoint has severed communication without indicating
it or has become unreachable for some reason. Without this patch, keep
alive is not set on the socket listening for incoming TCP sessions and
in Komatsu's report this resulted in the thread listening for TCP becoming
stuck in a waiting state.
When a caller calls a FAX number and then hangs up right after the call is
answered then the T.38 re-INVITE automatic reject timer may still be
running after the channel goes away.
* Added session NULL channel checks on the code paths that get executed by
t38_automatic_reject() to prevent a crash when the T.38 re-INVITE
automatic reject timer expires.
Jonathan Rose [Tue, 1 Dec 2015 22:11:07 +0000 (16:11 -0600)]
Unset BRIDGEPEER when leaving a bridge
Currently if a channel is transferred out of a bridge, the BRIDGEPEER
variable (also BRIDGEPVTCALLID) remain set even once the channel is
out of the bridge. This patch removes these variables when leaving
the bridge.
Richard Mudgett [Mon, 30 Nov 2015 22:42:47 +0000 (16:42 -0600)]
sched.c: Make not return a sched id of 0.
According to the API doxygen a sched ID of 0 is valid. Unfortunately, 0
was never returned historically and several users incorrectly coded usage
of the returned sched ID assuming that 0 was invalid.
Richard Mudgett [Tue, 24 Nov 2015 18:44:53 +0000 (12:44 -0600)]
Audit improper usage of scheduler exposed by 5c713fdf18f.
channels/chan_iax2.c:
* Initialize struct chan_iax2_pvt scheduler ids earlier because of
iax2_destroy_helper().
channels/chan_sip.c:
channels/sip/config_parser.c:
* Fix initialization of scheduler id struct members. Some off nominal
paths had 0 as a scheduler id to be destroyed when it was never started.
chan_skinny.c:
* Fix some scheduler id comparisons that excluded the valid 0 id.
channel.c:
* Fix channel initialization of the video stream scheduler id.
pbx_dundi.c:
* Fix channel initialization of the packet retransmission scheduler id.
Richard Mudgett [Mon, 23 Nov 2015 20:27:27 +0000 (14:27 -0600)]
res_sorcery_realtime.c: Fix crash from NULL sorcery object type.
If the sorcery object type is not found a NULL is returned.
Unfortunately, sorcery_realtime_filter_objectset() will crash after
complaining about not finding the object type and saying to expect errors.
* Use ao2_cleanup() instead of ao2_ref() to prevent the crash.
Richard Mudgett [Tue, 5 May 2015 23:17:54 +0000 (18:17 -0500)]
features: Fix crash when transferee hangs up during DTMF attended transfer.
A crash happens with this sequence of steps:
1) Party A is connected to party B.
2) Party B starts a DTMF attended transfer.
3) Party A hangs up while party B is dialing party C.
When party A hangs up the bridge that party A and party B are in is
dissolved and party B is kicked out of the bridge. When party B finishes
dialing party C he attempts to move to the new bridge with party C. Since
party B is no longer in a bridge the attempted move dereferences a NULL
bridge_channel pointer and crashes.
* Made the hold(), unhold(), ringing(), and the bridge_move() functions
tolerant of the channel not being in a bridge. The assertion that party B
is always in a bridge is not true if the bridged peer of party B hangs up
and dissolves the bridge. Being tolerant of not being in a bridge allows
the peer hangup stimulus to be processed by the FSM.
* Made the bridge_move() function return void since where the return value
for a failed move was checked generated a FSM coding ERROR message for a
normal off-nominal condition.
* Eliminated most uses of RAII_VAR in bridge_basic.c.
Mark Michelson [Fri, 13 Nov 2015 20:03:35 +0000 (14:03 -0600)]
Confbridge: Add a user timeout option
This option adds the ability to specify a timeout, in seconds, for a
participant in a ConfBridge. When the user's timeout has been reached,
the user is ejected from the conference with the CONFBRIDGE_RESULT
channel variable set to "TIMEOUT".
The rationale for this change is that there have been times where we
have seen channels get "stuck" in ConfBridge because a network issue
results in a SIP BYE not being received by Asterisk. While these
channels can be hung up manually via CLI/AMI/ARI, adding some sort of
automatic cleanup of the channels is a nice feature to have.
Mark Michelson [Fri, 13 Nov 2015 20:19:35 +0000 (14:19 -0600)]
Taskprocessors: Increase high-water mark
In practical tests, we have seen certain taskprocessors, specifically
Stasis subscription taskprocessors, cross the recently-added high-water
mark and emit a warning. This high-water mark warning is only intended
to be emitted when things have tanked on the system and things are
heading south quickly. In the practical tests, the Stasis taskprocessors
sometimes had a max depth of 180 tasks in them, and Asterisk wasn't in
any danger at all.
As such, this ups the high-water mark to 500 tasks instead. It also
redefines the SIP threadpool request denial number to be a multiple of
the taskprocessor high-water mark.
Mark Michelson [Thu, 12 Nov 2015 17:17:51 +0000 (11:17 -0600)]
res_pjsip distributor: Don't send 503 response to responses.
When the SIP threadpool is backed up with tasks, we send 503 responses
to ensure that we don't try to overload ourselves. The problem is that
we were not insuring that we were not trying to send a 503 to an
incoming SIP response.
This change makes it so that we only send the 503 on incoming requests.
Steve Davies [Wed, 11 Nov 2015 10:16:22 +0000 (10:16 +0000)]
Further fixes to improper usage of scheduler
When ASTERISK-25449 was closed, a number of scheduler issues mentioned in
the comments were missed. These have since beed raised in ASTERISK-25476
and elsewhere.
This patch attempts to collect all of the scheduler issues discovered so
far and address them sensibly.
Mark Michelson [Wed, 11 Nov 2015 23:11:53 +0000 (17:11 -0600)]
res_pjsip: Deny requests when threadpool queue is backed up.
We have observed situations where the SIP threadpool may become
deadlocked. However, because incoming traffic is still arriving, the SIP
threadpool's queue can continue to grow, eventually running the system
out of memory.
This change makes it so that incoming traffic gets rejected with a 503
response if the queue is backed up too much.
Joshua Colp [Wed, 11 Nov 2015 17:04:08 +0000 (13:04 -0400)]
threadpool: Handle worker thread transitioning to dead when going active.
This change adds handling of dead worker threads when moving them
to be active. When this happens the worker thread is removed from
both the active and idle threads container. If no threads are able
to be moved to active then the pool grows as configured.
A unit test has also been added which thrashes the idle timeout
and thread activation to exploit any race conditions between the
two.
Jonathan Rose [Tue, 3 Nov 2015 22:19:43 +0000 (16:19 -0600)]
taskprocessor: Add high water mark warnings
If a taskprocessor's queue grows large, this can indicate that there
may be a problem with tasks not leaving the processor or else that
the number of available task processors for a given type of task is
too low. This patch makes it so that if a taskprocessor's task queue
grows above 100 queued tasks that it will emit a warning message.
Warning messages are emitted only once per task processor.
Joshua Colp [Tue, 23 Jun 2015 16:21:41 +0000 (13:21 -0300)]
app_dial: Hold reference to calling channel formats when dialing outbound.
Currently when requesting a channel the native formats of the
calling channel are provided to the core for usage when dialing
the outbound channel. This occurs without holding the channel lock
or keeping a reference to the formats. This is problematic as
the channel driver may end up changing the formats during this time.
In the case of chan_sip this happens when an SDP negotiation
completes.
This change makes it so app_dial keeps a reference to the native
formats of the calling channel which guarantees that they will
remain valid for the period of time needed.
Matt Jordan [Wed, 4 Nov 2015 20:31:28 +0000 (14:31 -0600)]
main/dial: Protect access to the format_cap structure of the requesting channel
When a dial attempt is made that involves a requesting channel, we previously
were not:
a) Protecting access to the native format capabilities structure on the
requesting channel. That is inherently unsafe.
b) Reference bumping the lifetime of the format capabilities structure.
In both cases, something else could sneak in, blow away the format
capabilities, and we'd be holding onto an invalid format_cap structure. When
the newly created channel attempts to construct its format capabilities, things
go poorly.
This patch:
a) Ensures that we get a reference to the native format capabilities while
the requesting channel is locked
b) Holds a reference to the native format capabilities during the creation
of the new channel.
Mark Michelson [Mon, 2 Nov 2015 23:19:21 +0000 (17:19 -0600)]
res_pjsip: Set threadpool max size default to 50.
During a stress test of subscriptions, a huge blast of
subscription-related traffic resulted in the threadpool expanding to a
ridiculous number of threads. The balooning of threads resulted in an
increase of memory, which led to a crash due to being out of memory.
An easy fix for the particular test was to limit the size of the
threadpool, thus reining in the amount of memory that would be used. It
was decided that there really is no downside to having a non-infinite
default value for the maximum size of the threadpool, so this change
introduces 50 threads as the maximum threadpool size for the SIP
threadpool.