Author |
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 26 2007 at 7:02pm | IP Logged
|
|
|
Ok, we have learned a bunch new insight.
First off, it is absolutely the TerminateCall API method which locks up. And, for whatever reason, it always locks up on the incoming leg. No matter which leg of the conversation caused the incoming leg to hang up, that is the line that gets stuck when calling terminatecall on.
The other thing learned, is what always accompanies this is a call with no audio. The weird thing about the no audio, is that the client was receiving rtp from the lanscape media engine, but was not sending any audio to lanscape - must be a negotiation failure, I have not looked at the ports assigned to the SDP yet, just noted that is always involves a call where neither side hears any audio.
Also, when using the method above you posted, the last state of the line 0 (for example, incoming leg that locked), is "_OnHook" even though the last message we got in regards to line 0 is: "SipIncomingCallConnected". So, seems are two views of this line differ - which is probably why there is really no rtp data.
Anyway, when the line gets this way, calling TerminateCall on it locks the thread. We were able to reproduce this very quick after much headache by simply:
1) Send an invite to media engine with some destination (gets line 0)
2) Media Engine sends invite to destination (gets line 9)
3) when destination picks up, take line 0 off hook, then bridge the audio of the two calls
4) destination & originator talk for only a few seconds
5) destination hangs up
Do this real quick (but only ever one conversation at a time is needed) by repeating steps 1 - 5. After around 3 to 6 calls, the last call will have no audio.
Once there is a call with no audio, if the destination hangs up, media engine reports SipOnHook for line 9. We then hang up line 0, which locks on terminatecall... Or, if the originator hangs up, and this is odd (but what I have mentioned above with the multiple BYEs), lanscape will not answer originator's bye request, nor will it inform us of any further line 0 sip messages. Since line 0 cannot end conversation, line 9 (destination) will have probably hung up, which is when we try to hang up line 0 but again, get stuck doing so.
As I get more information, I will attempt to provide more detailed steps, and possibly even a program which reproduces this issue, which is easier now that I can reproduce this issue at will ;-)
But, I'm hoping that just letting you know the sequence of events can really help you guys. Basically, last line state seems to be "_OnHook", but we only ever got SipIncomingCallConneccted. Also, no RTP data for connected call. Additionally, all further attempts to send Sip Messages to this line fail - BYE's are ignored - not even a transaction does not exist message or anything is returned. TerminateCall called on the line fails, and locks the thread.
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 26 2007 at 7:05pm | IP Logged
|
|
|
I should put a step 6 in there:
5) destination hangs up
6) call TerminateCall on incoming line (0).
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 26 2007 at 7:21pm | IP Logged
|
|
|
Just a little bit more info, the more the better.
We just performed the test with instead of destination hanging up over and over, the originator does. So, everything the same, except instead of destination initiating conversation's ending, the incoming leg always hangs up first. Now, this took slightly longer to reproduce, but maybe just a fluke (took around 12 calls).
But, the same symptoms, last call has no RTP, and gets stuck trying to hang up line 0 (incoming) - in this case, once there was no audio, I hung up with line 9 (destination), but all the other times originator hung up first.
But, one little thing also noted, looking at at UDP capture, Lanscape sent a BYE to line 0, line 0 then sent 200 OK to lanscape media engine. But, the terminate call API method never returned... Which is also strange, because we were under the impression that TerminateCall would do the call asynchronously when passed false. Or, this means terminatecall scheduled the BYE to be sent out, but for some reason or another was not able to return immediately.
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 26 2007 at 7:31pm | IP Logged
|
|
|
Hmm, nevermind about the last paragraph. Got a little confused from the IP's (having the client and media engine on the same system). I never saw the TerminateCall API send a BYE, it might have, but wireshark does not show internal network messages.
But the rest still applies, does not really matter who is doing the hang up, just that the incoming line gets stuck immediately after no audio.
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 26 2007 at 7:53pm | IP Logged
|
|
|
One last thing for the night, perhaps related, perhaps not. Either, we are doing something wrong, or something has changed... maybe leading to this issue, not sure yet.
#define NUMBER_OF_DEFAULT_CODECS 1
MEDIA_FORMAT_AUDIO DEFAULT_LINE_CODECS[NUMBER_OF_DEFAULT_CODECS] = {Media_Format_uLaw8k};
SetAudioMediaFormats(mSipEngine, phoneLine, NUMBER_OF_DEFAULT_CODECS, DEFAULT_LINE_CODECS);
However, no matter how many codecs we set there, what order, or anything (something that used to work before), codec's sent out on the invite are always:
a=rtpmap:18 G729/8000/1
a=rtpmap:97 ILBC/8000/1
a=rtpmap:0 PCMU/8000/1
a=rtpmap:8 PCMA/8000/1
a=rtpmap:124 PCM/8000/1
a=rtpmap:101 telephone-event/8000/1
a=sendrecv
a=fmtp:97 mode=30
a=ptime:20
a=fmtp:101 0-15
Perhaps we are setting codec's wrong? Not sure how, this worked fine in previous release (one that is working), but is now always using that order of codec's. (729, ILBC, PCMU, PCMA, PCM) Maybe a coincidence, or maybe not, but that is the order that we had the default codecs setup before in source - is that the normal default? If so, why can we not override it. Or is this something that is only happening locally.. hmm.
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: December 27 2007 at 6:53am | IP Logged
|
|
|
juice,
good information. we are reviewing...
Support
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: December 27 2007 at 10:13am | IP Logged
|
|
|
Juice,
Audio formats:
If a VOIP app does not change the default codec usage for a phone line, the default SDP in an INVITE request defaults to uLaw and will look something like:
Code:
v=0
o=333 14142125 14142125 IN IP4 192.168.1.2
s=LanScape
c=IN IP4 192.168.1.2
t=0 0
m=audio 20004 RTP/AVP 0 101
a=rtpmap:0 PCMU/8000/1
a=rtpmap:101 telephone-event/8000/1
a=sendrecv
a=ptime:20
a=fmtp:101 0-15
|
|
|
Using a utility soft phone here based on the VOIP Media Engine, we tested all different combinations of codecs. Single useage and multiple useage. All looks OK. The SetAudioMediaFormats() API procedure is easy to use and self explanatory. If that API proc is not working, then we really have a problem. When we get you updated media engine images, you are completely rebuilding your product binary images, right?
If you are getting 5 specified codecs in your INVITE requests, your app must be setting those somehow.
Could all of these issues we face be associated with some kind of memory corruption on your applications part? This codec issue you report is a bit too weird. :(
IMPORTANT:
Yes, yes, yes, please put together a console sample app that experiences the TerminateCall() deadlock. If its as repeatable as you suggest, then it will get fixed quickly. If you cannot do this, we frankly are just guessing at this end and burning tons of time.
At this point we just want to resolve this one way or another.
Support
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 27 2007 at 1:00pm | IP Logged
|
|
|
In regards to the audio formats, forget what I said, it was late, and I missed a spot where we were changing the audio format for outgoing invites. That is actually working correctly (now that I found where we were setting those).
Yes, of course, always a complete rebuild when libraries change - that is just good practice.
Possible some kind of memory corruption - but, looks completely like a stuck mutex lock inside lanscape - we are currently trying to strip down functionality to locate what causes this, as this app initiates a lot of Lanscape stuff, as well as does some SDP modifications (to support 101 telephony events better - another thread sometime :)). We are systematically removing all functionality from our product - no DTMF handling, no RTP handling (only audio callback), no Sip Message Modifications, removing unneeded threads (only two threads left now, and issue still remains). When we isolate it, trust us that we will provide a sample app.
The one thing that remains consistent is what I have stated over and over again:
* The incoming leg is the line that gets deadlocked when trying to call TerminateCall on it (async).
* The last call state reported for the line when we detect the thread as deadlocked is "_OnHook"
* The last sip message that arrived for the incoming line is SipIncomingCallConnected
* There is no SipInCall for the line
* There is no audio (probably because it was never SipInCall of course)
* No More Messages ever come from that line - no matter what the other client sends, that Call Dialog is broken internally in lanscape
* TerminateCall on that line will cause deadlock
* Other incoming messages (new dialogs) get answered by lanscape, it sends out 100/180 to incoming invites
We will post any new info, and when we have a test app that demonstrates this. But, the information we provided should provide you with a clue of what is happening inside your engine at the time. However you are queuing up Sip Messages inside your engine, the particular thread/mutex handling the one leg, does not want to release a lock.
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: December 27 2007 at 2:28pm | IP Logged
|
|
|
Juice,
Thanks for the update. We know what it is like to put in long hours. Glad to know that codec negotiation in INVITE requests are again OK. That just did not sound right. Usually when many disjoint problems are reported by a customer all at once that do not make sense, we start to consider process heap corruption on the part of the VOIP app. When that occurs, many things start blowing up.
We are continuing to work on this and are testing and analyzing media engine code. Its got to be something very stupid. Your test app will surely allow us to nail this one.
Keep posting with whatever you continue to uncover. All of a sudden, the solution will pop out.
Thanks,
Support
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 27 2007 at 5:02pm | IP Logged
|
|
|
Ok, I think this is now starting to border a little bit on your statement "got to be something very stupid" :-)
Is there some chance that the recent versions of Lanscape media engine are somehow redirecting input, or doing something different in terms of Standard Out/Err?
I ask, because I finally found a way to make the TerminateCall API method block. And, a way to make it not block. When it blocks it consistantly happens around 2-6 calls. When it does not block, this is after a dozen+ calls. But, only in real app, have not been able to reproduce it yet in sample app. I've performed this test dozens of times now, and results are still the same...
We have a logging class which logs items to file. However, when we enable echoing, it will echo the message to the console. Currently, it was using std:cout. I also tried with std::cerr. Now, when echoing is enabled (never thought it could be possible problem), the call to cout does not block. However, it seems to randomly cause the internalness of Lanscape to block (on the certain incoming call). When replaced with something different, like printf call, it does not cause any conflicts with Lanscape. Does at all seem possible? The strings are fine (using std::strings, not a real chance of any kind of buffer problems), computer memory is low as is CPU. Nothing going on besides the one call, but this seems to trigger it.
Now, I could go and remove the cout, but that doesn't all sit right with me. There has to be some kind of logical reason for this. Note, the following code in our logging causes the eventual blocked thread:
void Log::logMessage(const std::string &message, LogLevelPriority priority)
{
std::cout << "Something" << std::endl;
return;
}
Where, I physically logged the string "Something" (to make sure it wasn't a memory issue with the message object. Normally, the logging code is a little longer of course, to log to file, but this was just to isolate.
Now, std::cout should be entirely thread safe in terms of blocking/crashes of course... the only issue would be text within other text if called at the same time. But, it is not blocking or crashing, which is why I suspect perhaps the landscape engine (while not logging anything to iostreams / stdout, etc) may be redirecting I/O, which may trigger a hold on some kind of internal mutex?
Just a thought, but this really puzzles me.
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 27 2007 at 7:53pm | IP Logged
|
|
|
I think I found the final piece of the puzzle. It looks like the threads coming from IVRCALLBACKPROC is the primary culprit... why the standard out/iostream stuff exaggerates the issue I do not know.
We are using boost::threads. Anyway, we are using a boost::mutex::scope_lock method to acquire a lock in the IVR transmit thread of this certain conversation (where we than mix audio between the incoming <-> outgoing)... Basically, we use the mutex to make sure both lines are still attached to the current conversation.
Now, in previous Lanscape build, this is no issue. but, in these recent builds, this looks to have become an issue. I don't exactly know the implementation of scope_lock, except that it is meant to not handle recursive calls into the function. Which is ok, since we are not dealing with any recursive calls into this function (or any of the places where we use that type). But, in this particular case, it seems that there is some event which causes this lock to mess up lanscape threads. The lock itself is fine, because if we were to try to double lock this mutex, we'd never get out of it - but since the other thread handling the other line's events still acquires this lock, our mutex works ok.
Anyway, this must be related to the internal threading model changes you made in these releases which is simply incompatible at this spot. Any ideas why? A work around, not so sure how good, it seems if we use a try_scoped_mutex lock, we can acquire the lock successfully.
But, we use the other scope_lock method in other places, and have not had any issues like this. We will continue investigating, hopefully this work around will prevent further problems... but, the oddness of the issue means we are probably just covering up the real issue by using a different type of lock.
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: December 28 2007 at 7:54am | IP Logged
|
|
|
Juice,
You are posting good information. Lets keep going until we absolutely locate the cause of this problem. Just so you are aware, all of the information you are posting is going straight to our dev guys for evaluation. The dev guys are also formulating the answers to all of your questions which we are posting for them to this forum. We wanted to tell you this so you can be assured that you are not talking to many different people about the same problem. We are taking all of your posted information very seriously.
OK….
You >>>
Is there some chance that the recent versions of Lanscape media engine are somehow redirecting input, or doing something different in terms of Standard Out/Err?
<<< Support
No. The media engine does not perform any standard I/O redirection.
The only recorded change since v5.12.4.8 that has any thing to do with std I/O is this:
v5.12.4.9 allows SIP logs to be captured for individual phone lines. At this time, the SIP logging capability uses stdio for log file writing. The max number of open stdio files for a process normally defaults to 512. When the media egnine initializes, it calls the _setmaxstdio() sdtio library proc to increase this limit to 2048. Additional process memory overhead for doing this is approximately 12k bytes. Eventually the media enigne will remove all stdio file write operations and use Windows file handle operations.
You >>>
Anyway, this must be related to the internal threading model changes you made in these releases which is simply incompatible at this spot. Any ideas why?
<<< Support
The threading changes we spoke about were introduced in v5.12.4.10. The internal thread model for the most part remains intact from the original v5.12.4.8 product. The threading changes were done such that they would not impact existing VOIP applications. That was the goal anyway.
The recent threading changes primarily dealt with removing serialization restrictions of API commands from multiple application threads and how these requests were internally handled and executed. Also handling of incoming SIP requests were modified to also remove unecesary serialization due to critical sections. The internal thread changes were introduced to increase in/out call processing performance and “should not” interfere with existing VOIP application thread handling/logic. That being said, obviously there is something that has caused this condition to occur.
Other Info:
If the IVRCALLBACKPROC for a phone line gets blocked indefinitely, this is bad. The IVRCALLBACKPROC is executed in the thread context of the RTP receiver for the phone line. If this occurs, the phone line will be useless forever and this would explain the “no RTP” you have identified. If the call on the phone line gets terminated, RTP receiver logic will not be able to terminate normally and it may be possible that the internal thread logic for the phone line’s call thread will hang. We have to verify this yet but this sounds very possible. Its not a good thing if the IVRCALLBACKPROC blocks indefinitely.
We would really like to see your apps threading model and how it handles mutual exclusion. We will have to continue to rely on your input in order to get further on this. Unfortunately we are not experts using boost::mutex::scope_lock.
It makes no sense to us that standard out/err is causing a problem by itself – but we have been burned before.
We are facing what seems to be a typical multithreading issue. Hmm… we need to think.
Support
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 28 2007 at 11:51am | IP Logged
|
|
|
Ok, I think I found a reproducible console application. At least, I finally got a small sample to demonstrate this problem.
The issue is odd, most definitely. It looks, like maybe there is some limit to amount of mutex/critical section locks windows can have. At least, I am no expert, but it appears to be a combination of items.
* First, in the logger, we have a mutex lock guarding concurrent access to the writing to the log file.
* Second, when console echo is turned on, we use cout. Cout will create a temporary lock (which may explain why it was looking like a cause)
* Third, we have a mutex lock when we get SipOnHook (which we send to a class method which terminates the other leg of the call), which locks the same conversation mutex I mentioned in the last post.
* Fourth, we use the same mutex (mentioned above) when we receive the IVR audio data. We do this to make sure we do not try to send audio to the other leg who just got disconnected.
* Fifth, I am sure you use some internal mutex's when you called us from the IVR callback...
So, it looks like a combination of things created this issue. But, this issue never identified itself in previous release. Is it possible you added a few more critical section / mutex / sephomore / etc protections in these latest releases?
For the console app, I determined that during the call to SipOnHook, there must be some logging calls happening, and the other locks going on. Anyway, I'll clean it up, and get it sent to you as soon as possible. We use VC8 SP1, so I can send you precompiled boost libraries to make the demo self contained. If you don't have VC8 SP1, building boost is not that difficult, but you could probably replace the calls to the boost mutex with the Window's equivalent.
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: December 28 2007 at 1:05pm | IP Logged
|
|
|
You >>>
Is it possible you added a few more critical section / mutex / sephomore / etc protections in these latest releases?
<<< Support
You bet. The internals of the media engine do not use mutexes but critical section objects. Cirtical sections were change or moved or removed to improve processing throughtput.
You >>>
We use VC8 SP1, so I can send you precompiled boost libraries to make the demo self contained.
<<< Support
VC8 SP1 (aka VS2005 SP1) is good.
Precompiled Boots is good.
Your support FTP account is ready for uploading. If you want, upload the boost source you used to create the boost lib. We may want to trace through that too. As a matter of fact, you do not have an FTP disk quota so upload all the source code you used to build the test app.
If you can, give us a brief description on how to use your test app and how we go about duplicating the deadlock. It will be good to get this issue solved and move on :)
Support
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 28 2007 at 1:36pm | IP Logged
|
|
|
The test app is prepared. It will be uploaded as soon as possible. Prebuilt binaries of boost are included. Befire building, simply alter some configuration settings in the Engine.h file (defines) with current system ip/port and a gateway ip/port. Phone lines are set to 10. The destination gateway should be some other system (don't know why, but when everything is on the same system, seems to happen less frequently).
Then, place a call to the system, it will route the destination To: address to the destination gateway IP/port you configured. Answer the other system, let RTP/audio flow for a couple seconds, then hang up destination side. Repeat this is fairly quick succession about half a dozen times (or more if needed). After some quick time, the last incoming leg will be answered, and it's last received sip message will be SipIncomingCallConnected - that is when you will see the deadlock for that line (hopefully, if the problem follows to your system).
Unfortunately, we do not have the source we used to build boost with. We used the 1.33.1 release (was the latest at the time). Though, it can be grabbed from here: http://sourceforge.net/project/showfiles.php?group_id=7586
The first step in building boost is to grab the boost source, and the boost-jam source. bjam is used to build boost libraries. Building bjam is simple (make sure you run the MSVC command line batch file to setup your compiler paths of course - or invoke the VS Command Prompt shortcut to get to the shell): http://www.boost.org/doc/html/jam/building.html
simply use the build.bat file.
Building boost, simply add bjam to the path.. Or, copy it to the boost source build directory. In the boost directory, you'll need to invoke the following:
bjam --toolset=msvc --with-boost-thread stage
--with-boost-thread is simply to limit building of boost to that library. If that fails, you should run "bjam --show-libraries" as I'm not positive if boost-thread is the actual name of the thread lib, but you really don't want to build all of boost, since it has quite a lot of libraries. More complete doc can be found here: http://www.boost.org/more/getting_started/windows.html (applies to 1.33.1 build also). After which, you just need to set up paths to point to build location/headers/libraries.
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: December 28 2007 at 1:42pm | IP Logged
|
|
|
Got it.
What are you using as a call generator????
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 28 2007 at 2:13pm | IP Logged
|
|
|
The current scenario is:
xlite client calling lanscape server app which calls Grandstream as destination. Xlite and Lanscape server are on the same system (Win XP Pro SP2, 2GHz, 2GBs Ram).
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: December 28 2007 at 5:52pm | IP Logged
|
|
|
Sorry for the delay, zip file is there now.
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: January 02 2008 at 11:18am | IP Logged
|
|
|
I hope everyone had a good New Year!
Just checking up on any status updates. We are hoping the source/demo provided has shown the issue to your developers, and if not we will try to provide more information and/or alter the demo so that it does show up. If the problem/lock has shown up, that is wonderful (not wonderful that it happens, but wonderful that you can finally see the issue ;).
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: January 03 2008 at 7:02am | IP Logged
|
|
|
Juice,
We are back full time now from the holiday. We are reviewing your test app now. We will repost with further info as soon as we have it.
Support
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: January 04 2008 at 9:44am | IP Logged
|
|
|
Hi juice,
Ok. We have gone through your application code and see what is going on. It’s a typical multithreading deadlock issue. This is not a bug in the media engine. In older versions of the media engine, you would not have seen this issue due to the old threading model.
The good news is:
* Overall, your test application is using the media engine properly.
* The usage of an application mutex object is causing the deadlock problem.
* We hammer tested your application code all night with modest code changes and it runs great.
* Your test app has good call processing performance in calls per minute.
* No other issues were detected while testing (memory or handle leaks, etc).
We have placed into your support FTP account the “Updated source code.zip” file image. Download that and take a look at its contents.
We have included in the above archive an engineering report that summarizes our findings. See the “Source code summary.pdf” file image.
Here is the summary:
------------------------------------------------------------ ------------------------
LanScape Engineering Report
Customer Test Application
Date: 1-3-08
This file contains engineering notes recorded while debugging a customer app thought to exhibit a VOIP Media Engine deadlock situation. Here are the basic findings.
1)
Using "catch all" exception handling/recovery.
Note:
This information has nothing to do with a deadlock but we wanted to convey our thoughts to the customer.
Going through the customer’s test application code we noticed the use of "catch all" exception handling syntax such as:
Code:
try
{
// do some stuff....
//
}
catch(...)
{
}
|
|
|
Though not a cause of the reported problem, the customer should be made aware that using catch(...) leads to more difficult bug identifications in general.
Code like this makes locating application bugs extremely difficult. For completed details why the above construct should be avoided at all costs, see the book:
Debugging Applications for Microsoft .NET and Microsoft Windows
Section: Structured Exception Handling vs. C++ Exception Handling
ISBN:0735615365
2)
Retrieving incoming call information.
Note:
This information has nothing to do with a deadlock but we wanted to convey our thoughts to the customer.
It was noticed that the source code called the GetIncomingCallInfo() API procedure in a deferred manner. It was being called such that if an incoming call was connected and quickly disconnected, the application logic would use an old SIP_INCOMING_CALL_INFO structure that would contain invalid pointer values. This will cause exceptions while executing. The customer is instructed to review the source code for the USE_CALL_INFO_FIXES macro to understand how we changed the source code to make the application more robust.
3)
Removed 1 second sleep from the Call::setRoutingInformation() procedure.
Note:
This information has nothing to do with a deadlock but we wanted to convey our thoughts to the customer.
We removed the 1 second sleep in the Call::setRoutingInformation() proc. We did this so we could hammer test a greater number of calls per minute. For code changes, see the REMOVE_SETROUTING_INFORMATION_SLEEP macro in the source code.
4)
Deadlock situation associated with TerminateCall() API procedure and the receive IVR callback procedure.
After building and configuring a test setup, a deadlock situation was detected in the supplied customer application.
This issue is not associated with the media engine. The deadlock is due to how application software is performing critical section operations using mutex objects across API threads and the receive IVR callback procedure that is registered with the media engine.
The problem is a basic multithreading issue and is not an officially recognized bug in the media engine product. The customer will be given modified source code to solve this deadlock issue. The customer can search the source code for the USE_DEADLOCK_FIXES macro to see what we have modified.
Here is the basic problem that leads to the application deadlock:
The deadlock situation is due to 3 threads of execution.
Thread 1:
An application thread that calls a mutex object and then later calls the TerminateCall() Media Engine API procedure.
Thread 2:
The internal call thread for the phone line. This thread takes API commands and executes them in the context of the call thread.
Thread 3:
The RTP receiver thread. This thread handles all received RTP payloads and calls the user registered receive IVR callback procedure in the user’s application.
Deadlock Scenario:
The application executes the Engine::threadEventHandler() proc that processes the SipOnHook phone line event. This is thread 1.
The above procedure then calls owner->sessionDisconnected() proc in the SipOnHook logic. The sessionDisconnected() proc that is executed is an instance of CallSession:: sessionDisconnected().
The CallSession:: sessionDisconnected() locks a mutex (mTestMutex) and then proceeds to eventually call the media engine TerminateCall() API procedure.
This causes internal thread 2 (the internal call thread) to execute for the phone line. The internal call thread executes the terminate call command from the API and starts internal call termination.
Part of internal call termination is to shut down RTP receiver activity for the phone line. RTP receiver logic executes in its own thread context (thread 3).
Because RTP payloads may be received by the media engine phone line any time after the application locks the mutex in thread 1, the RTP receive IVR callback procedure the user has registered will block forever because it also wants to take ownership of the application mutex (mTestMutex). This is an application flaw. Mutual exclusion and critical sections should be limited to small sections of multithreaded code such as mutual exclusion of Boolean flags or other sets of status variables. Using a critical section or mutex for a large extended period of time across different thread contexts generally result in deadlock situations.
The usage of the application mutex (mTestMutex) is problematic because the internal call thread (thread 2) will not continue until it has acknowledgment from the RTP receiver logic (thread 3) that RTP reception has been terminated.
Because the receive IVR callback executing in the RTP receiver thread (thread 3) is blocked forever by the application mutex, the internal call thread (thread 2) will never detect RTP receiver termination and will wait forever. This is turn will cause the applications API thread (thread 1) that called TerminateCall() API to block indefinitely thus causing the deadlock.
In past versions of the media engine, this deadlock situation did not occur for asynchronous or timed synchronous TerminateCall() processing. The internal threading model has been changed for performance and call state logic reasons. The same code executing under an older version of the media engine would have succeeded.
LanScape will note this deadlock behavior in the VOIP Media Engine Software Developer’s Reference in an effort to warn other developers.
Deadlock Solution:
Application software should use critical section or mutex objects to lock short spans of code. A good idea is to lock access to Boolean flags or status variables. We have modified the application source code to show an example of this. Search the source code for the USE_DEADLOCK_FIXES macro to view the changes.
Final Testing:
After applying the changes as per the USE_DEADLOCK_FIXES macro, we ran the customers test application overnight using their version of the media engine (v5.12.7.21) and the current version we have here (v5.12.7.24).
The test application was configured to use 128 phone lines and was executed on the same host machine as a 16 line LanScape call tester application (call generator). The call tester application was also executing on another host machine to act as the call termination point using 64 lines.
The test scenario (and the customer’s test application) ran overnight for approximately 12 hours and successfully processed roughly 200,000 calls without errors or deadlocks. It was also noted that memory usage and windows resource usage (handles, threads, etc) were all normal with no leaks.
------------------------------------------------------------ ------------------------
We can get you a final update of the media engine any time.
Thanks,
Support
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: January 04 2008 at 11:32am | IP Logged
|
|
|
In regards to:
Deadlock Solution:
Application software should use critical section or mutex objects to lock short spans of code. A good idea is to lock access to Boolean flags or status variables. We have modified the application source code to show an example of this. Search the source code for the USE_DEADLOCK_FIXES macro to view the changes.
If you notice, we only do use mutex's in short spans of code. And, to be frank, using mutex's to guard access to the current class instance is a perfectly correct practice. However, we will take a look at your implementation.
In regards to point 1, yes, we are aware of the consequences, simply a last resort catch in this instance (be aware, this is a from the ground up quick test/demo app). Thanks for pointing it out anyway.
To Point 2, thanks, we will look at that.
For point 3, that was simply to act like a real world application having to make database queries, account for delays, and provide a close approximation of the real app just with a slight longer delay - note, we were not trying to maximize call flow, just reproduce the issue :-)
In regards to deadlock scenario, we will examine the modifications and post back comments. Good that there will be a warning in the future docs pointing out what lanscape threads lock on other threads, so future issues can be avoided.
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: January 07 2008 at 12:29pm | IP Logged
|
|
|
Hi again
Looking at your changes for handling incoming call data looks like a good compromise of safety, since you are probably guarding that data in that method only. So, thanks for pointing it out :)
To the deadlock scenario, if it is a simple deadlock, replacing boost::mutex with boost::try_mutex and boost::mutex::scoped_lock with boost::try_mutex::try_scope_lock should avoid the scenario of deadlocking without sacrificing some thread safety.
As, it seems that there is still a possible unseen potential of dangerous use of pointers when using only the boolean LocBridgeSamples. As, once the boolean is set in the transmit audio function, the call could be ended and pointer leaving the transmit handles possibly pointing to invalid objects (though, the danger of that relies in how lanscape will treat invalid TX handle values).
With the try_mutex, there should be no need for a secondary conditional, as the pointer objects themselves are the conditionals. And, when a mutex is already locked, the try_mutex will/should fail gracefully and not cause a dead lock scenario from recursive locks. Also, there is a recursive lock in boost, but I think a try_lock will suffice in this situation, though I will double check which one would be better in this situation.
However, if you still feel a conditional variable would be better, please feel free to post additional insight.
In any case, thanks for your help with this in finding a difficult to spot issue and getting this resolved with nice feedback into our usage of the media engine :-) I think our issue is resolved now. We will implement changes/fixes to better deal with the new Lanscape locking design and will update this thread when we complete our findings. Hopefully, this will be the last complaint for this release.
Thanks again.
|
Back to Top |
|
|
juice Vetran
Joined: December 05 2006 Location: United States Posts: 139
|
Posted: January 07 2008 at 1:35pm | IP Logged
|
|
|
It looks like try_mutex would be the best bet for this instance. The reasoning:
* Since it is our thread, that calls into lanscape (TermianteCall), and our terminate call thread is apparently waiting to place an shutdown event into the Interal Lanscape command thread, which is waiting to access the internal RTP thread of Lanscape, which is in turn waiting to access the SessionAudio class member - which is locked - that is causing the issue.
* So, it is not a recursive call into our sessionaudio class (i.e. our call into TerminateCall is not directly trying to call the RTP method). So, using recursive locking stratgey would probably offer no benefit
* However, using the try_mutex, from our threaded event handler, we simply use a scope lock, and from the RTP callback sessionaudio, we use a try scope lock... And, if ever that mutex is locked, it would raise a local error, and then we could immediately return.
|
Back to Top |
|
|
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
|
Posted: January 07 2008 at 3:54pm | IP Logged
|
|
|
Perfect.
|
Back to Top |
|
|
|
|