
LanScape VOIP Media Engine™ - Technical Support
Topic: PlaybackBufferingDefault
hermes
Junior
Joined: October 27 2006
Posts: 64
Posted: May 05 2008 at 10:10am

Hello,
we are developing a voice recognition system over SIP using your VoIP Media Engine.
When I test it on our LAN it works, but when I test it from outside the LAN our voice recognition system doesn't recognize anything. Voice systems are sensitive to many factors (delay, jitter, ...), but we have checked that the delay is very small and the jitter is no more than 1.2 ms.
We have been testing the Media Engine with different parameters, and when we set PlaybackBufferingDefault and PlaybackBufferingDuringSounds to 20 it works fine. I don't fully understand the purpose of these parameters. Could you explain them in detail, please? Aren't they the same as the jitter buffer?
I don't know whether you know much about voice recognition, but that doesn't matter. I only want to know what these parameters do and how they differ from jitter compensation.
Thank you once more.
 
support
Administrator
Joined: January 26 2005
Location: United States
Posts: 1666
Posted: May 05 2008 at 12:02pm

Hi hermes,

About buffering depths:
The two values you ask about control the audio buffering depth of any sounds the media engine plays back on local multimedia hardware. They do not affect anything else, and they are not related in any way to the jitter compensation settings for received phone line audio.

The media engine can be configured to use local multimedia hardware. If local multimedia hardware is enabled when the media engine is started, the media engine will send internal telephony generated sounds (phone ringing, dial tone, etc) and received phone call audio from one or more phone lines to the multimedia hardware via a final digital mixer that sits directly in front of the multimedia hardware. It is good to think of phone line audio and internal telephony sounds as individual signal sources that route to the final internal “playback mixer”. The output of the playback mixer then “drives” the local multimedia hardware.

Before actual playback on local multimedia hardware, all the signal sources are digitally mixed to create a single media stream that will be sent to the multimedia hardware using a common rate/format.

PlaybackBufferingDefault and PlaybackBufferingDuringSounds simply determine the buffering depth applied at the playback mixer output before sample blocks are sent to local multimedia hardware. Normally both are set to the same value (2 for double buffering) unless there is some reason otherwise (broken audio, etc.). Historically, the Software Developer's Reference also lists PlaybackBufferingDefault=2 and PlaybackBufferingDuringSounds=4 as good settings for these parameters.

PlaybackBufferingDefault is the buffering depth used to playback all received phone line audio.

PlaybackBufferingDuringSounds is the buffering depth used to playback audio when internal telephony sounds are being generated.

The internal playback mixer will switch between PlaybackBufferingDefault and PlaybackBufferingDuringSounds buffering depths depending on what type of audio is being mixed. If both received phone line audio and internally generated telephony sounds are mixed, then the playback mixer will always use PlaybackBufferingDuringSounds buffering depth until internal telephony sounds are finished and the playback mixer can then simply mix received phone line audio again for playback.

Historically the reason these two playback buffering parameters were separated was because of sound quality when running on older operating systems (like Windows 9x stuff). Now, with today’s multimedia hardware and selecting an OS of Windows 2000+SP4 or higher, these two parameters can be set to the same value.

Regarding speech recognition accuracy:
It is good that you have a feel for your RTP jitter and delays. Do you also have a feel for dropped or missing RTP packets received from the WAN?

If you are experiencing proper speech recognition in your private network and then not obtaining the same behavior from the WAN (internet), the first thing to look for is UDP RTP packet sequence issues or missing RTP packets. Too many dropped or out of sequence RTP packets will kill a good speech recognition deployment.

For speech recognition, we assume you are receiving your phone line audio from each phone line's receive IVR channels. This is the preferred way to obtain voice audio from phone calls to feed a speech engine. However, you may instead want to receive the RAW RTP media samples directly from the phone lines and use those to drive your speech engine. If you use the RAW RTP payload data received from the phone lines, your app can detect missing RTP packets and RTP packets received out of order (based on the RTP header packet sequence number). The drawback of this alternate method is that your app has to handle format/rate conversion from the phone call audio format/rate to your speech engine format/rate using the sample block conversion support in the media engine. Not the worst thing in the world, but not as easy as just using the automatic media handling features of the media engine.

About phone line receive IVR channels:
The receive IVR channels for the phone lines take whatever RTP packets are received, perform required format/rate conversion and send them to the app’s Rx IVR callback handler(s). If there are “too many” RTP packet sequence errors or missing RTP packets, then this may affect your ability to accurately perform speech recognition of the phone line audio. If you use the RAW RTP API method, then at least your app could perform some kind of buffering and then a re-sequencing of RTP data before sending the phone line audio samples to the speech recognition engine.

One thing we may want to do in the future is allow the app (your application) to enable a receive IVR buffering depth and automatic packet re-sequencing for Rx IVR channels. This should not be too big of an issue to implement and would most likely improve speech recognition performance without forcing the app to implement its own elaborate methods using the RAW RTP interface.

Further testing – speech recognition:
We need to get a measure for missing RTP packets and out of sequence RTP packets coming from your WAN. A simple way to do this is to enable the reception of RAW RTP packets in your app. See the EnableRawRtpPacketAccess() API procedure in the developer’s reference on how to do this. Your app can then monitor missing and out of sequence RTP packet statistics and report on any missing and out of order RTP packets in log files using the RTP header sequence number as your trigger.

Question:
For our own curiosity, what speech recognition engine have you selected for your development?

Repost with any other info.

Thanks hermes,


Support


 
hermes
Junior
Joined: October 27 2006
Posts: 64
Posted: May 05 2008 at 4:07pm

Hello,
thanks for your response. These days we are developing a test application to get an objective measure of the recognition engine.
Last week I checked the RTP stream and I didn't detect any packet sequence issues or missing packets.
We can't use the IVR channel because our recognition engine only accepts an audio stream from a sound card, so we use a virtual audio card and interconnect its 'audio out' to its 'audio in' virtually. We configure the VoIP engine and the recognition engine with this 'sound card' and it works fine.
Our recognition engine is a custom development, but we don't have access to its source code, so we can't change it to integrate with an IVR channel.

Thank you very much.
 
support
Administrator
Joined: January 26 2005
Location: United States
Posts: 1666
Posted: May 05 2008 at 4:38pm

hermes,

OK. Your last post adds important information we were not aware of.

You may also be affected by the way the media engine is internally re-sampling the audio streams in an effort to minimize voice path latencies.

There are two undocumented APIs:

Code:

TELEPHONY_RETURN_VALUE VOIP_API GetMixerResampleState(
            SIPHANDLE hStateMachine,
            BOOL *pEnableState
            );

TELEPHONY_RETURN_VALUE VOIP_API SetMixerResampleState(
            SIPHANDLE hStateMachine,
            BOOL EnableState
            );



These are extern "C" procs.

Call the SetMixerResampleState() API with the EnableState parameter set to FALSE (zero). This will turn OFF internal audio re-sampling, which may improve your speech recognition accuracy. It may be worth a try.


Support


 
