hermes Junior
Joined: October 27 2006 Posts: 64
Posted: May 05 2008 at 10:10am | IP Logged
Hello,
we are developing a voice recognition system over SIP with your VoIP Media Engine.
It works fine when I test it inside our LAN, but when I test it from outside the LAN our voice recognition system doesn't recognize anything. Voice systems are sensitive to many factors (delay, jitter, ...), but we have checked that the delay is very small and the jitter is no more than 1.2 ms.
We have been testing the Media Engine with different parameters, and when we set PlaybackBufferingDefault and PlaybackBufferingDuringSounds to 20 it works fine. I don't fully understand the purpose of these parameters. Could you explain them in detail, please? Aren't they the same as the jitter buffer?
I don't know whether you are familiar with voice recognition, but that doesn't matter. I only want to know what these parameters do and how they differ from jitter buffering.
Thank you once again.
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
Posted: May 05 2008 at 12:02pm | IP Logged
Hi hermes,
About buffering depths:
The two values you ask about control the audio buffering depth of any sounds that the media engine plays back on local multimedia hardware. They do not affect anything else, and they are not in any way related to the received phone line audio jitter compensation settings.
The media engine can be configured to use local multimedia hardware. If local multimedia hardware is enabled when the media engine is started, the media engine will send internal telephony generated sounds (phone ringing, dial tone, etc) and received phone call audio from one or more phone lines to the multimedia hardware via a final digital mixer that sits directly in front of the multimedia hardware. It is good to think of phone line audio and internal telephony sounds as individual signal sources that route to the final internal “playback mixer”. The output of the playback mixer then “drives” the local multimedia hardware.
Before actual playback on local multimedia hardware, all the signal sources are digitally mixed to create a single media stream that will be sent to the multimedia hardware using a common rate/format.
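To make the "playback mixer" idea above concrete, here is a minimal sketch of our own (not the media engine's actual code) of digitally mixing two 16-bit PCM signal sources, such as phone line audio plus an internal telephony sound, into one stream, saturating rather than wrapping on overflow:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: mix two 16-bit PCM sources into one playback
 * buffer, clamping the sum to the int16_t range so loud overlapping
 * sources saturate instead of wrapping around. */
static void mix_pcm16(const int16_t *src_a, const int16_t *src_b,
                      int16_t *dst, size_t samples)
{
    for (size_t i = 0; i < samples; i++) {
        int32_t sum = (int32_t)src_a[i] + (int32_t)src_b[i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        dst[i] = (int16_t)sum;
    }
}
```

The output of such a mixing step is what would then be handed, in buffered blocks, to the local multimedia hardware.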
PlaybackBufferingDefault and PlaybackBufferingDuringSounds simply determine the buffering depth applied at the playback mixer output before sample blocks are sent to local multimedia hardware. Normally these two values are set to the same value (2 for double buffering) unless there is some reason otherwise (broken audio, etc.). Historically, the Software Developer's Reference has also suggested PlaybackBufferingDefault=2 and PlaybackBufferingDuringSounds=4 as good settings for these parameters.
PlaybackBufferingDefault is the buffering depth used to playback all received phone line audio.
PlaybackBufferingDuringSounds is the buffering depth used to playback audio when internal telephony sounds are being generated.
The internal playback mixer will switch between PlaybackBufferingDefault and PlaybackBufferingDuringSounds buffering depths depending on what type of audio is being mixed. If both received phone line audio and internally generated telephony sounds are mixed, then the playback mixer will always use PlaybackBufferingDuringSounds buffering depth until internal telephony sounds are finished and the playback mixer can then simply mix received phone line audio again for playback.
Historically the reason these two playback buffering parameters were separated was because of sound quality when running on older operating systems (like Windows 9x stuff). Now, with today’s multimedia hardware and selecting an OS of Windows 2000+SP4 or higher, these two parameters can be set to the same value.
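To put rough numbers on these depths (our own back-of-the-envelope figures, assuming for illustration a fixed mixer block duration of 20 ms, which is not a documented value): the added playback latency is simply depth times block duration.

```c
/* Hypothetical arithmetic, assuming a fixed mixer block duration:
 * added playback latency = buffering depth * block duration. */
static unsigned playback_latency_ms(unsigned depth, unsigned block_ms)
{
    return depth * block_ms;
}
```

Under that assumption, double buffering (depth 2) adds about 40 ms, while a depth of 20 as in the original post adds about 400 ms, which may simply be masking network irregularities at the cost of latency.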
Regarding speech recognition accuracy:
It is good that you have a feel for your RTP jitter and delays. Do you also have a feel for dropped or missing RTP packets received from the WAN?
If you are experiencing proper speech recognition in your private network and then not obtaining the same behavior from the WAN (internet), the first thing to look for is UDP RTP packet sequence issues or missing RTP packets. Too many dropped or out of sequence RTP packets will kill a good speech recognition deployment.
For speech recognition, we assume you are receiving your phone line audio from each phone line's receive IVR channel. This is the preferred way to obtain voice audio from phone calls to feed a speech engine. Alternatively, you may want to receive the RAW RTP media samples directly from the phone lines and use those to drive your speech engine. If you use the RAW RTP payload data received from the phone lines, your app can detect missing RTP packets and RTP packets received out of order (based on the RTP header packet sequence number). The drawback of this alternate method is that your app has to handle conversion from the phone call audio format/rate to your speech engine's format/rate using the sample block conversion support in the media engine. Not the worst thing in the world, but not as easy as just using the automatic media handling features of the media engine.
About phone line receive IVR channels:
The receive IVR channels for the phone lines take whatever RTP packets are received, perform required format/rate conversion and send them to the app’s Rx IVR callback handler(s). If there are “too many” RTP packet sequence errors or missing RTP packets, then this may affect your ability to accurately perform speech recognition of the phone line audio. If you use the RAW RTP API method, then at least your app could perform some kind of buffering and then a re-sequencing of RTP data before sending the phone line audio samples to the speech recognition engine.
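As a sketch of the kind of re-sequencing an app could do with the RAW RTP method (our own minimal illustration of the idea, not media engine code; the window size is an arbitrary assumption): hold a small window of packets slotted by RTP sequence number, and release them in order.

```c
#include <stdint.h>
#include <string.h>

#define REORDER_WINDOW 8  /* assumed window size, for illustration */

/* Hypothetical re-sequencing buffer: packets are slotted by their
 * 16-bit RTP sequence number and released strictly in order. */
typedef struct {
    uint16_t base_seq;                    /* next seq number to release */
    int      occupied[REORDER_WINDOW];
    uint16_t stored_seq[REORDER_WINDOW];
} reorder_buf;

static void reorder_init(reorder_buf *rb, uint16_t first_seq)
{
    memset(rb, 0, sizeof *rb);
    rb->base_seq = first_seq;
}

/* Accept a packet's sequence number into the window; returns 0 on
 * success, -1 if it falls outside the window (too late or too early). */
static int reorder_put(reorder_buf *rb, uint16_t seq)
{
    uint16_t offset = (uint16_t)(seq - rb->base_seq);  /* wraps mod 2^16 */
    if (offset >= REORDER_WINDOW)
        return -1;
    int slot = seq % REORDER_WINDOW;
    rb->occupied[slot] = 1;
    rb->stored_seq[slot] = seq;
    return 0;
}

/* Release the next in-order sequence number, or -1 if it has not
 * arrived yet (the caller would wait or declare the packet lost). */
static int reorder_get(reorder_buf *rb)
{
    int slot = rb->base_seq % REORDER_WINDOW;
    if (!rb->occupied[slot] || rb->stored_seq[slot] != rb->base_seq)
        return -1;
    rb->occupied[slot] = 0;
    return rb->base_seq++;
}
```

In a real app the slots would of course carry the packet payloads, not just the sequence numbers, and the samples would then be fed to the speech recognition engine in the released order.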
One thing we may want to do in the future is allow the app (your application) to enable a receive IVR buffering depth and automatic packet re-sequencing for Rx IVR channels. This should not be too big an issue to implement and would most likely improve speech recognition performance while not forcing the app to implement its own elaborate methods using the RAW RTP interface.
Further testing – speech recognition:
We need to get a measure for missing RTP packets and out of sequence RTP packets coming from your WAN. A simple way to do this is to enable the reception of RAW RTP packets in your app. See the EnableRawRtpPacketAccess() API procedure in the developer’s reference on how to do this. Your app can then monitor missing and out of sequence RTP packet statistics and report on any missing and out of order RTP packets in log files using the RTP header sequence number as your trigger.
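A sketch of the kind of counters the app could keep (the struct and names are our own illustration; only the 16-bit, increment-by-one RTP sequence number behavior it relies on comes from the RTP specification):

```c
#include <stdint.h>

/* Hypothetical per-stream counters for missing and out-of-order RTP
 * packets, driven by the 16-bit sequence number from each RTP header. */
typedef struct {
    uint16_t expected_seq;   /* next sequence number we expect */
    int      started;
    unsigned lost;           /* gaps: packets presumed missing */
    unsigned out_of_order;   /* packets that arrived late */
} rtp_seq_stats;

static void rtp_stats_update(rtp_seq_stats *s, uint16_t seq)
{
    if (!s->started) {
        s->started = 1;
        s->expected_seq = (uint16_t)(seq + 1);
        return;
    }
    int16_t delta = (int16_t)(seq - s->expected_seq);  /* wrap-safe */
    if (delta == 0) {
        s->expected_seq++;
    } else if (delta > 0) {
        s->lost += (unsigned)delta;   /* provisional: may be reordering */
        s->expected_seq = (uint16_t)(seq + 1);
    } else {
        s->out_of_order++;
        if (s->lost > 0)
            s->lost--;                /* a "lost" packet showed up late */
    }
}
```

Feeding every received sequence number through such a routine and periodically writing the counters to a log file would give the measure of WAN packet behavior described above.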
Question:
For our own curiosity, what speech recognition engine have you selected for your development?
Repost with any other info.
Thanks hermes,
Support
hermes Junior
Joined: October 27 2006 Posts: 64
Posted: May 05 2008 at 4:07pm | IP Logged
Hello,
thanks for your response. We are currently developing a test application so that we can get an objective measure of the recognition engine's accuracy.
Last week I checked the RTP stream and I didn't detect any packet sequence issues or missing packets.
We can't use the IVR channel because our recognition engine only accepts an audio stream from a sound card, so we use a virtual audio card and connect its 'audio out' to 'audio in' virtually. We configure the VoIP engine and the recognition engine with this 'sound card' and it works fine.
Our recognition engine is a custom development, but we don't have access to its source code, so we can't change it to integrate with an IVR channel.
Thank you very much.
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
Posted: May 05 2008 at 4:38pm | IP Logged
hermes,
OK. Your last post adds important information we were not aware of.
You may also be affected by the way the media engine is internally re-sampling the audio streams in an effort to minimize voice path latencies.
There are two undocumented APIs:
Code:
TELEPHONY_RETURN_VALUE VOIP_API GetMixerResampleState(
SIPHANDLE hStateMachine,
BOOL *pEnableState
);
TELEPHONY_RETURN_VALUE VOIP_API SetMixerResampleState(
SIPHANDLE hStateMachine,
BOOL EnableState
);
These are extern “C” procs.
Call the SetMixerResampleState() API with the EnableState parameter set to FALSE (zero). This will turn OFF internal audio re-sampling and may improve your speech recognition accuracy. It is worth a try.
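A sketch of the call pattern (with stubs of our own standing in for the real engine so the snippet is self-contained; the SUCCESS value is an assumption, and in your app the types, return codes, and handle come from the media engine's header and library):

```c
/* Stub declarations standing in for the media engine's undocumented
 * API, so this illustration compiles on its own. */
typedef int   TELEPHONY_RETURN_VALUE;
typedef void *SIPHANDLE;
typedef int   BOOL;
#define FALSE   0
#define SUCCESS 0  /* assumed success code, for illustration */

static BOOL g_stub_resample_state = 1;  /* stub: re-sampling starts on */

static TELEPHONY_RETURN_VALUE SetMixerResampleState(SIPHANDLE h, BOOL en)
{
    (void)h;
    g_stub_resample_state = en;
    return SUCCESS;
}

static TELEPHONY_RETURN_VALUE GetMixerResampleState(SIPHANDLE h, BOOL *p)
{
    (void)h;
    *p = g_stub_resample_state;
    return SUCCESS;
}

/* Disable internal mixer re-sampling, then read the state back.
 * Returns the resulting state (0 = off) or -1 on API failure. */
static int disable_mixer_resampling(SIPHANDLE hStateMachine)
{
    if (SetMixerResampleState(hStateMachine, FALSE) != SUCCESS)
        return -1;
    BOOL state;
    if (GetMixerResampleState(hStateMachine, &state) != SUCCESS)
        return -1;
    return state;
}
```

In your app, replace the stubs with the real `extern "C"` procedures and pass the SIPHANDLE of your started state machine.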
Support