Joshua Colp [Wed, 3 Feb 2016 18:05:52 +0000 (14:05 -0400)]
AST-2016-001 http: Provide greater control of TLS and set modern defaults.
This change exposes the configuration of various aspects of the TLS
support and sets the default to the modern standards.
The TLS cipher is now set to the best values according to the
Mozilla OpSec team, different TLS versions can now be disabled, and
the cipher order can be forced to be that of the server instead of
the client.
Richard Mudgett [Mon, 7 Dec 2015 18:46:53 +0000 (12:46 -0600)]
AST-2016-003 udptl.c: Fix uninitialized values.
Sending UDPTL packets to Asterisk with the right amount of missing
sequence numbers and enough redundant 0-length IFP packets, can make
Asterisk crash.
Setting the sip.conf timert1 value to a value higher than 1245 can cause
an integer overflow and result in large retransmit timeout times. These
large timeout times hold system file descriptors hostage and can cause the
system to run out of file descriptors.
NOTE: The default sip.conf timert1 value is 500 which does not expose the
vulnerability.
* The overflow is now detected and the previous timeout time is
calculated.
ASTERISK-25397 #close
Reported by: Alexander Traud
Joshua Colp [Mon, 25 Jan 2016 15:35:21 +0000 (11:35 -0400)]
config: Allow options to register when documentation is unavailable.
The config options framework is strict in that configuration options must
be documented unless XML documentation support is not available. In
practice this is useful as it ensures documentation exists however in
off-nominal cases this can cause strange problems.
If it is expected that a config option has a non-zero or non-empty
default value but the config option documentation is unavailable
this reasonable expectation will not be met. This can cause obscure
crashes and weirdness depending on how the code handles it.
This change tweaks the behavior to ensure that the config option
is still allowed to register, apply default values, and be set when
devmode is not enabled. If devmode is enabled then the option can
NOT be set.
This also does not remove the initial documentation error message that
is output on load when registering the configuration option.
Mark Michelson [Mon, 25 Jan 2016 22:51:25 +0000 (16:51 -0600)]
res_pjsip_pubsub: Prevent crash from AMI command on freed subscription.
A test recently uncovered that running an ill-timed AMI command to show
inbound subscriptions could cause a crash since Asterisk will try to
operate on a freed subscription.
The fix for this is to remove the subscription tree from the list of
subscriptions at the time that we are sending our final NOTIFY request
out. This way, as the subscription is in the process of dying, it is
inaccessible from AMI.
Kevin Harwell [Thu, 14 Jan 2016 20:42:57 +0000 (14:42 -0600)]
bridge_basic: don't cache xferfailsound during an attended transfer
The xferfailsound was read from the channel at the beginning of the transfer,
and that value is "cached" for the duration of the transfer. Therefore, changing
the xferfailsound on the channel using the FEATURE() dialplan function does
nothing once the transfer is under way.
This makes it so the transfer code instead gets the xferfailsound configuration
options from the channel when it is actually going to be used.
This patch also fixes a potential memory leak of the props object as well as
making sure the condition variable gets initialized before being destroyed.
Joshua Colp [Mon, 28 Dec 2015 20:02:19 +0000 (16:02 -0400)]
test_time: Provide a timeout when waiting.
The test_timezone_watch unit test is written to expect a
condition to be signaled when the inotify daemon thread runs.
There exists a small window where the test_timezone_watch
thread can signal the inotify daemon thread while it is not
reading on the underlying file descriptor. If this occurs
the test_timezone_watch thread will wait indefinitely for a
signal that will never arrive.
This change adds a timeout to the condition so it will return
regardless after a period of time.
Joshua Colp [Tue, 12 Jan 2016 17:14:29 +0000 (13:14 -0400)]
app: Queue hangup if channel is hung up during sub or macro execution.
This issue was exposed when executing a connected line subroutine.
When connected or redirected subroutines or macros are executed it is
expected that the underlying applications and logic invoked are fast
and do not consume frames. In practice this constraint is not enforced
and if not adhered to will cause channels to continue when they shouldn't.
This is because each caller of the connected or redirected logic does not
check whether the channel has been hung up on return. As a result the
the hung up channel continues.
This change makes it so when the API to execute a subroutine or
macro is invoked the channel is checked to determine if it has hung up.
If it has then a hangup is queued again so the caller will see it
and stop.
Kevin Harwell [Fri, 8 Jan 2016 21:22:05 +0000 (15:22 -0600)]
pbx: Deadlock between contexts container and context_merge locks
Recent changes (ASTERISK-25394 commit 2bd27d12223fe33b58c453965ed5c6ed3af7c4f5)
introduced the possibility of a deadlock. Due to the mentioned modifications
ast_change_hints now needs to keep both merge/delete and state callbacks from
occurring while it executes. Unfortunately, sometimes ast_change_hints can be
called with the contexts container locked. When this happens it's possible for
another thread to grab the context_merge_lock before the thread calling into
ast_change_hints does and then try to obtain the contexts container lock. This
of course causes a deadlock between the two threads. The thread calling into
ast_change_hints waits for the other thread to release context_merge_lock and
the other thread is waiting on that one to release the contexts container lock.
Unfortunately, there is not a great way to fix this problem. When hints change,
the subsequent state callbacks cannot run at the same time as a merge/delete,
nor when the usual state callbacks do. This patch alleviates the problem by
having those particular callbacks (the ones run after a hint change) occur in a
serialized task. By moving the context_merge_lock to a task it can now safely be
attempted or held without a deadlock occurring.
ASTERISK-25640 #close
Reported by: Krzysztof Trempala
Mark Michelson [Thu, 7 Jan 2016 21:37:36 +0000 (15:37 -0600)]
PJSIP: Prevent deadlock due to dialog/transaction lock inversion.
A deadlock was observed where the monitor thread was stuck, therefore
resulting in no incoming SIP traffic being processed.
The problem occurred when two 200 OK responses arrived in response to a
terminating NOTIFY request sent from Asterisk. The first 200 OK was
dispatched to a threadpool worker, who locked the corresponding
transaction. The second 200 OK arrived, resulting in the monitor thread
locking the dialog. At this point, the two threads are at odds, because
the monitor thread attempts to lock the transaction, and the threadpool
thread loops attempting to try to lock the dialog.
In this case, the fix is to not have the monitor thread attempt to hold
both the dialog and transaction locks at the same time. Instead, we
release the dialog lock before attempting to lock the transaction.
There have also been some debug messages added to the process in an
attempt to make it more clear what is going on in the process.
Jonathan Rose [Thu, 10 Dec 2015 17:44:03 +0000 (11:44 -0600)]
chan_sip: Add TCP/TLS keepalive to TCP/TLS server
Adds the TCP Keep Alive option to TCP and TLS server sockets. Previously
this option was only being set on session sockets.
http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/
According to the link above, the SO_KEEPALIVE option is useful for knowing
when a TCP connected endpoint has severed communication without indicating
it or has become unreachable for some reason. Without this patch, keep
alive is not set on the socket listening for incoming TCP sessions and
in Komatsu's report this resulted in the thread listening for TCP becoming
stuck in a waiting state.
When a caller calls a FAX number and then hangs up right after the call is
answered then the T.38 re-INVITE automatic reject timer may still be
running after the channel goes away.
* Added session NULL channel checks on the code paths that get executed by
t38_automatic_reject() to prevent a crash when the T.38 re-INVITE
automatic reject timer expires.
Jonathan Rose [Tue, 1 Dec 2015 22:11:07 +0000 (16:11 -0600)]
Unset BRIDGEPEER when leaving a bridge
Currently if a channel is transferred out of a bridge, the BRIDGEPEER
variable (also BRIDGEPVTCALLID) remain set even once the channel is
out of the bridge. This patch removes these variables when leaving
the bridge.
Richard Mudgett [Mon, 30 Nov 2015 22:42:47 +0000 (16:42 -0600)]
sched.c: Make not return a sched id of 0.
According to the API doxygen a sched ID of 0 is valid. Unfortunately, 0
was never returned historically and several users incorrectly coded usage
of the returned sched ID assuming that 0 was invalid.
Richard Mudgett [Tue, 24 Nov 2015 18:44:53 +0000 (12:44 -0600)]
Audit improper usage of scheduler exposed by 5c713fdf18f.
channels/chan_iax2.c:
* Initialize struct chan_iax2_pvt scheduler ids earlier because of
iax2_destroy_helper().
channels/chan_sip.c:
channels/sip/config_parser.c:
* Fix initialization of scheduler id struct members. Some off nominal
paths had 0 as a scheduler id to be destroyed when it was never started.
chan_skinny.c:
* Fix some scheduler id comparisons that excluded the valid 0 id.
channel.c:
* Fix channel initialization of the video stream scheduler id.
pbx_dundi.c:
* Fix channel initialization of the packet retransmission scheduler id.
Richard Mudgett [Mon, 23 Nov 2015 20:27:27 +0000 (14:27 -0600)]
res_sorcery_realtime.c: Fix crash from NULL sorcery object type.
If the sorcery object type is not found a NULL is returned.
Unfortunately, sorcery_realtime_filter_objectset() will crash after
complaining about not finding the object type and saying to expect errors.
* Use ao2_cleanup() instead of ao2_ref() to prevent the crash.
Richard Mudgett [Tue, 5 May 2015 23:17:54 +0000 (18:17 -0500)]
features: Fix crash when transferee hangs up during DTMF attended transfer.
A crash happens with this sequence of steps:
1) Party A is connected to party B.
2) Party B starts a DTMF attended transfer.
3) Party A hangs up while party B is dialing party C.
When party A hangs up the bridge that party A and party B are in is
dissolved and party B is kicked out of the bridge. When party B finishes
dialing party C he attempts to move to the new bridge with party C. Since
party B is no longer in a bridge the attempted move dereferences a NULL
bridge_channel pointer and crashes.
* Made the hold(), unhold(), ringing(), and the bridge_move() functions
tolerant of the channel not being in a bridge. The assertion that party B
is always in a bridge is not true if the bridged peer of party B hangs up
and dissolves the bridge. Being tolerant of not being in a bridge allows
the peer hangup stimulus to be processed by the FSM.
* Made the bridge_move() function return void since where the return value
for a failed move was checked generated a FSM coding ERROR message for a
normal off-nominal condition.
* Eliminated most uses of RAII_VAR in bridge_basic.c.
Mark Michelson [Fri, 13 Nov 2015 20:03:35 +0000 (14:03 -0600)]
Confbridge: Add a user timeout option
This option adds the ability to specify a timeout, in seconds, for a
participant in a ConfBridge. When the user's timeout has been reached,
the user is ejected from the conference with the CONFBRIDGE_RESULT
channel variable set to "TIMEOUT".
The rationale for this change is that there have been times where we
have seen channels get "stuck" in ConfBridge because a network issue
results in a SIP BYE not being received by Asterisk. While these
channels can be hung up manually via CLI/AMI/ARI, adding some sort of
automatic cleanup of the channels is a nice feature to have.
Mark Michelson [Fri, 13 Nov 2015 20:19:35 +0000 (14:19 -0600)]
Taskprocessors: Increase high-water mark
In practical tests, we have seen certain taskprocessors, specifically
Stasis subscription taskprocessors, cross the recently-added high-water
mark and emit a warning. This high-water mark warning is only intended
to be emitted when things have tanked on the system and things are
heading south quickly. In the practical tests, the Stasis taskprocessors
sometimes had a max depth of 180 tasks in them, and Asterisk wasn't in
any danger at all.
As such, this ups the high-water mark to 500 tasks instead. It also
redefines the SIP threadpool request denial number to be a multiple of
the taskprocessor high-water mark.
Mark Michelson [Thu, 12 Nov 2015 17:17:51 +0000 (11:17 -0600)]
res_pjsip distributor: Don't send 503 response to responses.
When the SIP threadpool is backed up with tasks, we send 503 responses
to ensure that we don't try to overload ourselves. The problem is that
we were not insuring that we were not trying to send a 503 to an
incoming SIP response.
This change makes it so that we only send the 503 on incoming requests.
Steve Davies [Wed, 11 Nov 2015 10:16:22 +0000 (10:16 +0000)]
Further fixes to improper usage of scheduler
When ASTERISK-25449 was closed, a number of scheduler issues mentioned in
the comments were missed. These have since beed raised in ASTERISK-25476
and elsewhere.
This patch attempts to collect all of the scheduler issues discovered so
far and address them sensibly.
Mark Michelson [Wed, 11 Nov 2015 23:11:53 +0000 (17:11 -0600)]
res_pjsip: Deny requests when threadpool queue is backed up.
We have observed situations where the SIP threadpool may become
deadlocked. However, because incoming traffic is still arriving, the SIP
threadpool's queue can continue to grow, eventually running the system
out of memory.
This change makes it so that incoming traffic gets rejected with a 503
response if the queue is backed up too much.
Joshua Colp [Wed, 11 Nov 2015 17:04:08 +0000 (13:04 -0400)]
threadpool: Handle worker thread transitioning to dead when going active.
This change adds handling of dead worker threads when moving them
to be active. When this happens the worker thread is removed from
both the active and idle threads container. If no threads are able
to be moved to active then the pool grows as configured.
A unit test has also been added which thrashes the idle timeout
and thread activation to exploit any race conditions between the
two.
Jonathan Rose [Tue, 3 Nov 2015 22:19:43 +0000 (16:19 -0600)]
taskprocessor: Add high water mark warnings
If a taskprocessor's queue grows large, this can indicate that there
may be a problem with tasks not leaving the processor or else that
the number of available task processors for a given type of task is
too low. This patch makes it so that if a taskprocessor's task queue
grows above 100 queued tasks that it will emit a warning message.
Warning messages are emitted only once per task processor.
Joshua Colp [Tue, 23 Jun 2015 16:21:41 +0000 (13:21 -0300)]
app_dial: Hold reference to calling channel formats when dialing outbound.
Currently when requesting a channel the native formats of the
calling channel are provided to the core for usage when dialing
the outbound channel. This occurs without holding the channel lock
or keeping a reference to the formats. This is problematic as
the channel driver may end up changing the formats during this time.
In the case of chan_sip this happens when an SDP negotiation
completes.
This change makes it so app_dial keeps a reference to the native
formats of the calling channel which guarantees that they will
remain valid for the period of time needed.
Matt Jordan [Wed, 4 Nov 2015 20:31:28 +0000 (14:31 -0600)]
main/dial: Protect access to the format_cap structure of the requesting channel
When a dial attempt is made that involves a requesting channel, we previously
were not:
a) Protecting access to the native format capabilities structure on the
requesting channel. That is inherently unsafe.
b) Reference bumping the lifetime of the format capabilities structure.
In both cases, something else could sneak in, blow away the format
capabilities, and we'd be holding onto an invalid format_cap structure. When
the newly created channel attempts to construct its format capabilities, things
go poorly.
This patch:
a) Ensures that we get a reference to the native format capabilities while
the requesting channel is locked
b) Holds a reference to the native format capabilities during the creation
of the new channel.
Mark Michelson [Mon, 2 Nov 2015 23:19:21 +0000 (17:19 -0600)]
res_pjsip: Set threadpool max size default to 50.
During a stress test of subscriptions, a huge blast of
subscription-related traffic resulted in the threadpool expanding to a
ridiculous number of threads. The balooning of threads resulted in an
increase of memory, which led to a crash due to being out of memory.
An easy fix for the particular test was to limit the size of the
threadpool, thus reining in the amount of memory that would be used. It
was decided that there really is no downside to having a non-infinite
default value for the maximum size of the threadpool, so this change
introduces 50 threads as the maximum threadpool size for the SIP
threadpool.
Kevin Harwell [Wed, 21 Oct 2015 17:35:08 +0000 (12:35 -0500)]
res_pjsip_outbound_registration: registration stops due to fatal 4xx response
During outbound registration it is possible to receive a fatal (any permanent/
non-temporary 4xx, 5xx, 6xx) response from the registrar that is simply due
to a problem with the registrar itself. Upon receiving the failure response
Asterisk terminates outbound registration for the given endpoint.
This patch adds an option, 'fatal_retry_interval', that when set continues
outbound registration at the given interval up to 'max_retries' upon receiving
a fatal response.
Mark Michelson [Thu, 22 Oct 2015 22:07:55 +0000 (17:07 -0500)]
format_cap: Detect vector allocation failures.
A crash was seen on a system that ran out of memory due to Asterisk not
checking for vector allocation failures in format_cap.c. With this
change, if either of the AST_VECTOR_INIT calls fail, we will return a
value indicating failure.
Mark Michelson [Fri, 2 Oct 2015 20:32:09 +0000 (15:32 -0500)]
res_pjsip_pubsub: Prevent sending NOTIFY on destroyed dialog.
A certain situation can result in our attempting to send a NOTIFY on a
destroyed dialog. Say we attempt to send a NOTIFY to a subscriber, but
that subscriber has dropped off the network. We end up retransmitting
that NOTIFY until the appropriate SIP timer says to destroy the NOTIFY
transaction. When the pjsip evsub code is told that the transaction has
been terminated, it responds in kind by alerting us that the
subscription has been terminated, destroying the subscription, and then
removing its reference to the dialog, thus destroying the dialog.
The problem is that when we get told that the subscription is being
terminated, we detect that we have not sent a terminating NOTIFY
request, so we queue up such a NOTIFY to be sent out. By the time that
queued NOTIFY gets sent, the dialog has been destroyed, so attempting to
send that NOTIFY can result in a crash.
The fix being introduced here is actually a reintroduction of something
the pubsub code used to employ. We hold a reference to the dialog and
wait to decrement our reference to the dialog until our subscription
tree object is destroyed. This way, we can send messages on the dialog
even if the PJSIP evsub code wants to terminate earlier than we would
like.
In doing this, some NULL checks for subscription tree dialogs have been
removed since NULL dialogs are no longer actually possible.
Mark Michelson [Tue, 29 Sep 2015 19:53:22 +0000 (14:53 -0500)]
res_pjsip_pubsub: Ensure dialog lock balance.
When sending a NOTIFY, we lock the dialog and then unlock the dialog
when finished. A recent change made it so that the subscription tree's
dialog pointer will be set NULL when sending the final NOTIFY request
out. This means that when we attempt to unlock the dialog, we pass a
NULL pointer to pjsip_dlg_dec_lock(). The result is that the dialog
remains locked after we think we have unlocked it. When a response to
the NOTIFY arrives, the monitor thread attempts to lock the dialog, but
it cannot because we never released the dialog lock. This results in
Asterisk being unable to process incoming SIP traffic any longer.
The fix in this patch is to use a local pointer to save off the pointer
value of the subscription tree's dialog when locking and unlocking the
dialog. This way, if the subscription tree's dialog pointer is NULLed
out, the local pointer will still have point to the proper place and the
dialog lock will be unlocked as we expect.
Mark Michelson [Mon, 28 Sep 2015 21:36:25 +0000 (16:36 -0500)]
res_pjsip_pubsub: Prevent crashes on final NOTIFY.
The SIP dialog is removed from the subscription tree when the final
NOTIFY is sent. However, after the final NOTIFY is sent, the persistence
update function still attempts to access the cseq from the dialog,
resulting in a crash.
This fix removes the subscription persistence at the same time that the
dialog is removed from the subscription tree. This way, there is no
attempt to update persistence when the subscription is being destroyed.
Mark Michelson [Thu, 17 Sep 2015 22:28:30 +0000 (17:28 -0500)]
res_pjsip_pubsub: Remove serializer when sending final NOTIFY.
There have been crashes seen where a taskprocessor's listener is NULL
unexpectedly.
Looking at backtraces, the problem was specifically seen in PJSIP
serializers.
Subscriptions make the mistake of removing a serializer from a dialog
during subscription tree destruction. Since subscription trees are
reference-counted, guaranteeing the circumstances behind the destruction
are not possible. This makes it so that the dialog serializer can be
removed while not holding the dialog lock. This makes it possible for
the distributor to get a pointer to the dialog serializer and have that
serializer get freed out from under it.
The fix for this is to remove the serializer from a subscription dialog
when sending the final NOTIFY. This guarantees that the serializer is
removed with the dialog lock held. By doing this, we guarantee that if
the distributor gains access to the dialog's serializer, it will not be
possible for the serializer to get freed by another thread.
Mark Michelson [Wed, 2 Sep 2015 14:14:19 +0000 (09:14 -0500)]
res_pjsip_pubsub: Fix crash on destruction of empty subscription tree.
If an old persistent subscription is recreated but then immediately
destroyed because it is out of date, the subscription tree will have no
leaf subscriptions on it. This was resulting in a crash when attempting
to destroy the subscription tree.
Mark Michelson [Tue, 1 Sep 2015 20:47:19 +0000 (15:47 -0500)]
res_pjsip_pubsub: Solidify lifetime and ownership of objects.
There have been crashes and general instability seen in the pubsub code,
so this patch introduces three changes to increase the stability.
First, the ownership model for subscriptions has been modified. Due to
RLS, subscriptions are stored in memory as a tree structure. Prior to my
patch, the PJSIP subscription was the owner of the subscription tree.
When the PJSIP subscription told us that it was terminating, we started
destroying the subscription tree along with all of the individual leaf
subscriptions that belong to the tree. The problem with this model is
that the two actors in play here, the PJSIP subscription and the
individual leaf subscriptions, need to have joint ownership of the
subscription tree. So now, the PJSIP subscription and the individual
leaf subscriptions each have a reference to the subscription tree. This
way, we will not actually free memory until no players are left that
care. The PJSIP subscription is a bigger stakeholder, in that if the
PJSIP subscription's reference to the subscription tree is removed, the
subscription tree instructs the leaf subscriptions to shut down and drop
their references to the subscription tree when possible. The individual
leaf subscriptions, upon being told to shut down, can drop their stasis
subscriptions or whatever they use to learn of new state, and then drop
their reference to the subscription tree once they are ready to die.
Second, the lifetime of a PJSIP subscription's reference to our
subscription tree has been altered. As I learned from doing a deep dive,
the PJSIP evsub code can tell Asterisk multiple times that the
subscription has been terminated, and not all of these times
are especially helpful. I have altered the message flow that we use for
SIP subscriptions such that we will always drop the PJSIP subscription's
reference to the subscription tree when we send the NOTIFY that
terminates a SIP subscription. This also means that we will now queue
NOTIFY requests to be sent after responding to incoming SUBSCRIBEs so
that we can have predictable state changes from the PJSIP evsub code.
Third, the synchronization of operations has been improved. PJSIP can
call into our code from a serializer thread (e.g. upon receiving an
incoming request) or from the monitor thread (e.g. when a subscription
times out). Because of this, there is the possibility of competing
threads stepping on each other. PJSIP attempts to do some
synchronization on its own by always keeping the dialog lock held when
it calls into us. However, since we end up pushing tasks into the
serializer, the result was that serialized operations were not grabbing
the dialog lock and could, as a result, step on something that was being
attempted by a different thread. Now we ensure that serialized
operations grab the dialog lock, then check for extenuating
circumstances, then proceed with their operation if they can.
Mark Michelson [Mon, 20 Apr 2015 19:30:47 +0000 (14:30 -0500)]
res_pjsip_pubsub: Set the endpoint on SUBSCRIBE dialogs.
When SUBSCRIBE dialogs were established, we never associated
the endpoint that created the subscription with the dialog
we end up creating. In most cases, this ended up not causing
any problems.
The actual bug that was observed was that when a device that
was behind NAT established a subscription with Asterisk, Asterisk
would end up sending in-dialog NOTIFY requests to the device's
private IP addres instead of the public address of the NAT router.
When Asterisk receives the initial SUBSCRIBE from the device,
res_pjsip_nat rewrites the contact to the public address on which the
SUBSCRIBE was received. This allows for the dialog to have its target
address set to the proper public address. Asterisk then would send a 200
OK response to the SUBSCRIBE, then a NOTIFY with the initial
subscription state. The device would then send a 200 OK response to
Asterisk's NOTIFY.
Here's where things went wrong. When the 200 OK arrived, res_pjsip_nat
did not rewrite the address in the Contact header. Then, when the PJSIP
dialog layer processed the 200 OK, PJSIP would perform a comparison
between the IP address in the Contact header and its saved target
address for the dialog. Since they differed, PJSIP would update the
target dialog address to be the address in the Contact header. From this
point, if Asterisk needed to send a NOTIFY to the device, the result was
that the NOTIFY would be sent to the private address that the device
placed in the Contact header.
The reason why res_pjsip_nat did not rewrite the address when it
received the 200 OK response was that it could not associate the
incoming response with a configured endpoint. This is because on a
response, the only way to associate the response to an endpoint is by
finding the dialog that the response is associated with and then finding
the endpoint that is associated with that dialog. We do not perform
endpoint lookups on responses. res_pjsip_pubsub skipped the step of
associating the endpoint with the dialog we created, so res_pjsip_nat
could not find the associated endpoint and therefore couldn't rewrite
the contact.
This commit message is like 50x longer than the actual fix.
Joshua Colp [Wed, 21 Oct 2015 16:44:17 +0000 (13:44 -0300)]
res_pjsip: Move URI validation to use time.
In a realtime based system with a limited number of threadpool threads
it is possible for a deadlock to occur. This happens when permanent
endpoint state is updated, which will cause database queries to be done.
These queries may result in URI validation being done which is done
synchronously using a PJSIP thread. If all PJSIP threads are in use
processing traffic they themselves may be blocked waiting to get the
permanent endpoint container lock when identifying an endpoint.
This change moves URI validation to occur at use time instead of
configuration time. While this comes at a cost of not seeing a problem
until you use it it does solve the underlying deadlock problem.
Corey Farrell [Thu, 26 Mar 2015 22:19:21 +0000 (22:19 +0000)]
Replace most uses of ast_register_atexit with ast_register_cleanup.
Since 'core stop now' and 'core restart now' do not stop modules,
it is unsafe for most of the core to run cleanups. Originally all
cleanups used ast_register_atexit, and were only changed when it
was shown to be unsafe. ast_register_atexit is now used only when
absolutely required to prevent corruption and close child processes.
Exceptions that need to use ast_register_atexit:
* CDR: Flush records.
* res_musiconhold: Kill external applications.
* AstDB: Close the DB.
* canary_exit: Kill canary process.
Matt Jordan [Tue, 20 Oct 2015 00:59:50 +0000 (19:59 -0500)]
contrib/scripts/autosupport: Update for Asterisk 13
This patch adds some minor tweaks for autosupport to update it for Asterisk 13.
This includes:
* Finally removing most references to Zaptel
* Adding support for some additional 'core' commands, and fixing nomenclature
that generally hasn't been used for some time
* Adding some PJSIP/SIP commands to gather endpoints/peers and active channels
Richard Mudgett [Fri, 5 Jun 2015 20:37:33 +0000 (15:37 -0500)]
res_pjsip: Need to use the same serializer for a pjproject SIP transaction.
All send/receive processing for a SIP transaction needs to be done under
the same threadpool serializer to prevent reentrancy problems inside
pjproject and res_pjsip.
* Add threadpool API call to get the current serializer associated with
the worker thread.
* Pick a serializer from a pool of default serializers if the caller of
res_pjsip.c:ast_sip_push_task() does not provide one.
This is a simple way to ensure that all outgoing SIP request messages are
processed under a serializer. Otherwise, any place where a pushed task is
done that would result in an outgoing out-of-dialog request would need to
be modified to supply a serializer. Serializers from the default
serializer pool are picked in a round robin sequence for simplicity.
A side effect is that the default serializer pool will limit the growth of
the thread pool from random tasks. This is not necessarily a bad thing.
* Made pjsip_distributor.c save the thread's serializer name on the
outgoing request tdata struct so the response can be processed under the
same serializer.
NOTE: session_inv_on_state_changed() is disassociating the dialog from the
session when the invite dialog becomes PJSIP_INV_STATE_DISCONNECTED.
Unfortunately this is a tad too soon because our BYE request transaction
has not completed yet.
Richard Mudgett [Fri, 2 Oct 2015 22:05:16 +0000 (17:05 -0500)]
Fix issue with AST_THREADSTORAGE_RAW when DEBUG_THREADLOCALS is enabled.
When DEBUG_THREADLOCALS is enabled it causes the threadlocal cleanup to be
called as a function. This causes a compile error with raw threadstorage as
it uses NULL for cleanup. This fix uses a macro that provides NULL when
DEBUG_THREADLOCALS is disabled, and replaces the call to "c_cleanup(data);"
with "{};" when DEBUG_THREADLOCALS is enabled.
Cherry-pick from v13 with additional definitions of
AST_THREADSTORAGE_RAW(), ast_threadstorage_get_ptr() and
ast_threadstorage_set_ptr() from
commit d01706ce1ee518118456d5673f529204bdac73bb.
Richard Mudgett [Mon, 12 Oct 2015 16:20:29 +0000 (11:20 -0500)]
config.c: Fix potential memory corruption after [section](+).
The memory corruption could happen if the [section](+) is the last section
in the file with trailing comments. In this case process_text_line() has
left *last_cat is set to newcat and newcat is destroyed.
Richard Mudgett [Mon, 12 Oct 2015 16:21:55 +0000 (11:21 -0500)]
config.c: Fix #include after [section](+).
An #include right after a [section](+) would associate any variable
assignments before a new section in the #include with the wrong section.
* Fix section association by setting the current section to the appended
section.
* Fix '+' and '!' section flag interaction corner case depending upon
which flag came first. If the '!' came first then it would be ignored.
If the '!' came after then it would affect the appended section. The '!'
will now no longer be ignored.
Matt Jordan [Wed, 7 Oct 2015 01:43:58 +0000 (20:43 -0500)]
res/res_rtp_asterisk: Fix assignment after ao2 decrement
When we decide we will no longer schedule an RTCP write, we remove the
reference to the RTP instance, then assign -1 to the stored scheduler ID
in case something else comes along and wants to see if anything is scheduled.
That scheduler ID is on the RTP instance. After 60a9172d7ef2 was merged to
fix the regression introduced by 3cf0f29310, this improper assignment on a
potentially destroyed object started getting tripped on the build agents.
Frankly, this should have been crashing a lot more often earlier. I can only
assume that the timing was changed just enough by both changes to start
actually hitting this problem.
As it is, simply moving the assignment prior to the ao2 deference is sufficient
to keep the RTP instance from being referenced when it is very, truly,
aboslutely dead.
(Note that it is still good practice to assign -1 to the scheduler ID when we
know we won't be scheduling it again, as the ao2 deref *may* not always destroy
the ao2 object.)
Richard Mudgett [Mon, 5 Oct 2015 21:53:44 +0000 (16:53 -0500)]
chan_pjsip: Fix crash on reINVITE before initial INVITE completes.
Apparently some endpoints attempt to send a reINVITE before completing the
initial INVITE transaction. In this case PJSIP responds appropriately to
the reINVITE with a 491 INVITE request pending. Unfortunately chan_pjsip
is using the initial INVITE transaction state to determine if an INVITE is
the initial INVITE or a reINVITE. Since the initial INVITE transaction
has not been confirmed yet chan_pjsip thinks the reINVITE is an initial
INVITE and starts another PBX thread on the channel. The extra PBX thread
ensures that hilarity ensues.
* Fix checks for a reINVITE on incoming requests to look for the presence
of a to-tag instead of the initial INVITE transaction state.
* Made caller_id_incoming_request() determine what to do if there is a
channel on the session or not. After a channel is created it is too late
to just store the new party id on the session because the session's party
id has already been copied to the channel's caller id.
Matt Jordan [Tue, 6 Oct 2015 02:34:41 +0000 (21:34 -0500)]
Fix improper usage of scheduler exposed by 5c713fdf18f
When 5c713fdf18f was merged, it allowed for scheduled items to have an ID of
'0' returned. While this was valid per the documentation for the API, it was
apparently never returned previously. As a result, several users of the
scheduler API viewed the result as being invalid, causing them to reschedule
already scheduled items or otherwise fail in interesting ways.
This patch corrects the users such that they view '0' as valid, and a returned
ID of -1 as being invalid.
Note that the failing HEP RTCP tests now pass with this patch. These tests
failed due to a duplicate scheduling of the RTCP transmissions.
Richard Mudgett [Wed, 30 Sep 2015 22:28:19 +0000 (17:28 -0500)]
res_sorcery_memory_cache.c: Fix deadlock with scheduler.
A deadlock can happen when a sorcery object is being expired from the
memory cache when at the same time another object is being placed into the
memory cache. There are a couple other variations on this theme that
could cause the deadlock. Basically if an object is being expired from
the sorcery memory cache at the same time as another thread tries to
update the next object expiration timer the deadlock can happen.
* Add a deadlock avoidance loop in expire_objects_from_cache() to check if
someone is trying to remove the scheduler callback from the scheduler.
Richard Mudgett [Thu, 1 Oct 2015 19:27:34 +0000 (14:27 -0500)]
res_sorcery_memory_cache.c: Shutdown in a less crash potential order.
Basically you should shutdown in the opposite order of how you setup since
later setup pieces likely depend on earlier setup pieces. e.g.,
Registering your external API with the rest of the system should be the
last thing setup and the first thing unregistered during shutdown.
Matthew Jordan [Thu, 9 Apr 2015 15:42:16 +0000 (15:42 +0000)]
res/res_pjsip_dlg_options: Add a module to handle in-dialog OPTIONS requests
This patch adds a new session supplement that handles in-dialog OPTIONS
requests. Said OPTIONS requests are sent a 200 OK, as an endpoint lookup
for the OPTIONS request would already have been done by the time the
session supplement receives the inbound request.
Richard Mudgett [Thu, 24 Sep 2015 19:56:24 +0000 (14:56 -0500)]
app_queue.c: Force COLP update if outgoing channel name changed.
* When a call is answered and the outgoing channel name has changed then
force a connected line update because the channel is no longer the same.
The channel was masqueraded into by another channel. This is usually
because of a call pickup.
Note: Forwarded calls are handled in a controlled manner so the original
channel name is replaced with the forwarded channel.
Richard Mudgett [Thu, 24 Sep 2015 17:59:08 +0000 (12:59 -0500)]
app_dial.c: Force COLP update if outgoing channel name changed.
* When a call is answered and the outgoing channel name has changed then
force a connected line update because the channel is no longer the same.
The channel was masqueraded into by another channel. This is usually
because of a call pickup.
Note: Forwarded calls are handled in a controlled manner so the original
channel name is replaced with the forwarded channel.
Mark Michelson [Thu, 24 Sep 2015 19:49:46 +0000 (14:49 -0500)]
Do not swallow frames on channels leaving bridges.
When leaving a bridge, indications on a channel could be swallowed by
the internal indication logic because it appears that the channel is on
its way to be hung up anyway. One such situation where this is
detrimental is when channels on hold are redirected out of a bridge. The
AST_CONTROL_UNHOLD indication from the bridging code is swallowed,
leaving the channel in question to still appear to be on hold.
The fix here is to modify the logic inside ast_indicate_data() to not
drop the indication if the channel is simply leaving a bridge. This way,
channels on hold redirected out of a bridge revert to their expected "in
use" state after the redirection.
Richard Mudgett [Tue, 22 Sep 2015 22:08:49 +0000 (17:08 -0500)]
app_page.c: Fix crash when forwarding with a predial handler.
Page uses the async method of dialing with the dial API. When a call gets
forwarded there is no calling channel available. If the predial handler
was set then the calling channel could not be put into auto-service
for the forwarded call because it doesn't exist. A crash is the result.
* Moved the callee predial parameter string processing to before the
string is passed to the dial API rather than having the dial API do it.
There are a few benefits do doing this. The first is the predial
parameter string processing doesn't need to be done for each channel
called by the dial API. The second is in async mode and the forwarded
channel is to have the predial handler executed on it then the
non-existent calling channel does not need to be present to process the
predial parameter string.
* Don't start auto-service on a non-existent calling channel to execute
the predial handler when the dial API is in async mode and forwarding a
call.
Mark Michelson [Thu, 18 Jun 2015 18:16:29 +0000 (13:16 -0500)]
Resolve race conditions involving Stasis bridges.
This resolves two observed race conditions.
First, a bit of background on what the Stasis application does:
1a Creates a stasis_app_control structure. This structure is linked into
a global container and can be looked up using a channel's unique ID.
2a Puts the channel in an event loop. The event loop can exit either
because the stasis_app_control structure has been marked done, or
because of some other factor, such as a hangup. In the event loop, the
stasis_app_control determines if any specific ARI commands need to be
run on the channel and will run them from this thread.
3a Checks if the channel is bridged. If the channel is bridged, then
ast_bridge_depart() is called since channels that are added to Stasis
bridges are always imparted as departable.
4a Unlink the stasis_app_control from the container.
When an ARI command is received by Asterisk, the following occurs
1b A thread is spawned to handle the HTTP request
2b The stasis_app_control(s) that corresponds to the channel(s) in the
request is/are retrieved. If the stasis_app_control cannot be
retrieved, then it is assumed that the channel in question has exited
the Stasis app or perhaps was never in Stasis in the first place.
3b A command is queued onto the stasis_app_control, and the channel's
event loop thread is signaled to run the command.
4b While most ARI commands do nothing further, some, such as adding or
removing channels from a bridge, will block until the command they
issued has been completed by the channel's event loop.
The first race condition that is solved by this patch involves a crash
that can occur due to faulty detection of the channel's bridged status
in step 3a. What can happen is that in step 2a, the event loop may run
the ast_bridge_impart() function to asynchronously place the channel
into a bridge, then immediately exit the event loop because the channel
has hung up. In step 3a, we would detect that the channel was not
bridged and would not call ast_bridge_depart(). The reason that the
channel did not appear to be bridged was that the depart_thread that is
spawned by ast_bridge_impart() had not yet started. That is the thread
where the channel is marked as being bridged. Since we did not call
ast_bridge_depart(), the Stasis application would exit, and then the
channel would be destroyed Then the depart_thread would start up and
try to manipulate the destroyed channel, causing a crash.
The fix for this is to switch from using ast_channel_is_bridged() to
checking the NULLity of ast_channel_internal_bridge_channel() to
determine if ast_bridge_depart() needs to be called. The channel's
internal bridge_channel is set when ast_bridge_impart() is called and
is NULLed by the call to ast_bridge_depart(). If the channel's internal
bridge_channel is non-NULL, then the channel must have been imparted
into the bridge and needs to be departed, even if the actual bridging
operation has not yet started. By departing the channel when necessary,
the thread that is running the Stasis application will block until the
bridge gives the okay that the depart_thread has exited.
The second race condition that is solved by this patch involves a leak
of HTTP handler threads. The problem was that step 2b would successfully
retrieve a stasis_app_control structure. Then step 2a would exit the
channel from the event loop due to a hangup. Steps 3a and 4a would
execute, and then finally steps 3b and 4b would. The problem is that at
step 4b, when attempting to add a channel to a bridge, the thread would
block forever since the channel would never execute the queued command
since it was finished with the event loop. This meant that the HTTP
handling thread would be leaked, along with any references that thread
may have owned (in my case, I was seeing bridges leaked).
The fix for this is to hone in better on when the channel has exited the
event loop. The stasis_app_control structure has an is_done field that
is now set at each point where the channel may exit the event loop. If
step 2b retrieves a valid stasis_app_control structure but the control
is marked as done, then the attempted operation exits immediately since
there will be nothing to service the attempted command.
pbx: Update device and presence state when changing a hint extension.
When changing a hint extension without removing the hint first the
device state and presence state is not updated. This causes the state
of the hint to be that of the previous extension and not the current
one. This state is kept until a state change occurs as a result of
something (presence state change, device state change).
This change updates the hint with the current device and presence
state of the new extension when it is changed. Any state callbacks
which may have been added before the hint extension is changed are
also informed of the new device and presence state if either have
changed.
Mark Michelson [Wed, 16 Sep 2015 22:36:32 +0000 (17:36 -0500)]
res_pjsip_pubsub: Eliminate race during initial NOTIFY.
There is a slim chance of a race condition occurring where two threads
can both attempt to manipulate the same area.
Thread A can be handling an incoming initial SUBSCRIBE request. Thread A
lets the specific subscription handler know that the subscription has
been established.
At this point, Thread B may detect a state change on the subscribed
resource and queue up a notification task on Thread C, the subscription
serializer thread.
Now Thread A attempts to generate the initial NOTIFY request to send to
the subscriber at the same time that Thread C attempts to generate a
state change NOTIFY request to send to the subscriber.
The result is that Threads A and C can step on the same memory area,
resulting in a crash. The crash has been observed as happening when
attempting to allocate more space to hold the body for the NOTIFY.
The solution presented here is to queue the subscription establishment
and initial NOTIFY generation onto the subscription serializer thread
(Thread C in the above scenario). This way, there is no way that a state
change notification can occur before the initial NOTIFY is sent, and if
there is a quick succession of NOTIFYs, we can guarantee that the two
NOTIFY requests will be sent in succession.
Mark Michelson [Thu, 10 Sep 2015 22:19:26 +0000 (17:19 -0500)]
scheduler: Use queue for allocating sched IDs.
It has been observed that on long-running busy systems, a scheduler
context can eventually hit INT_MAX for its assigned IDs and end up
overflowing into a very low negative number. When this occurs, this can
result in odd behaviors, because a negative return is interpreted by
callers as being a failure. However, the item actually was successfully
scheduled. The result may be that a freed item remains in the scheduler,
resulting in a crash at some point in the future.
The scheduler can overflow because every time that an item is added to
the scheduler, a counter is bumped and that counter's current value is
assigned as the new item's ID.
This patch introduces a new method for assigning scheduler IDs. Instead
of assigning from a counter, a queue of available IDs is maintained.
When assigning a new ID, an ID is pulled from the queue. When a
scheduler item is released, its ID is pushed back onto the queue. This
way, IDs may be reused when they become available, and the growth of ID
numbers is directly related to concurrent activity within a scheduler
context rather than the uptime of the system.
Mark Michelson [Thu, 10 Sep 2015 14:49:45 +0000 (09:49 -0500)]
res_pjsip: Copy default_from_user to avoid crash.
The default_from_user retrieval function was pulling the
default_from_user from the global configuration struct in an unsafe way.
If using a database as a backend configuration store, the global
configuration struct is short-lived, so grabbing a pointer from it
results in referencing freed memory.
The fix here is to copy the default_from_user value out of the global
configuration struct.
Thanks go to John Hardin for discovering this problem and proposing the
patch on which this fix is based.
George Joseph [Thu, 23 Apr 2015 14:16:45 +0000 (08:16 -0600)]
res_pjsip: Validate that contact uris start with sip: or sips:
Currently we use pjsip_parse_hdr to validate contact uris but it
appears that it allows uris without a scheme if there's a port
supplied. I.E myexample.com will fail but myexample.com:5060 will
pass even though it has no scheme. This causes SEGVs later on
whenever the uri is used.
To prevent this, permanent_contact_validate has been updated to check
that the scheme is either 'sip' or 'sips'.
2 uses of possibly-null endpoint have also been fixed in
create_out_of_dialog_request.
Jonathan Rose [Thu, 3 Sep 2015 19:07:35 +0000 (14:07 -0500)]
ParkAndAnnounce: Add variable inheritance
In Asterisk 11, the announcer channel would receive channel variables
from the channel being parked by means of normal channel inheritance.
This functionality was lost during the big res_parking project in
Asterisk 12. This patch restores that functionality.
Joshua Colp [Sat, 29 Aug 2015 15:36:35 +0000 (12:36 -0300)]
taskprocessor: Fix race condition between unreferencing and finding.
When unreferencing a taskprocessor its reference count is checked
to determine if it should be unlinked from the taskprocessors
container and its listener shut down. In between the time when the
reference count is checked and unlinking it is possible for
another thread to jump in, find it, and get a reference to it. If
the thread then uses the taskprocessor it may find that it is not
in the state it expects.
This change locks the taskprocessors container during almost the
entire unreference operation to ensure that any other thread which
may attempt to find the taskprocessor has to wait.