hermes Junior
Joined: October 27 2006 Posts: 64
Posted: October 14 2008 at 2:54am
Hello,
I'm testing v5.12.8.14 on a Linux/Wine system, but I've hit a compilation error in my project. With v5.12.8.1 we called the undocumented API 'SetMixerResampleState' with the 'EnableState' parameter set to FALSE, because otherwise our speech recognition accuracy wasn't very good. This API isn't implemented in the new version. What can I do?
Thanks.
hermes Junior
Joined: October 27 2006 Posts: 64
Posted: October 14 2008 at 3:17am
I've just checked, and SetMixerResampleState is implemented in v5.12.8.14, but it now takes five parameters, whereas in v5.12.8.1 it took only two. Am I right? Could you tell me what the other parameters are?
Thanks again
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
Posted: October 14 2008 at 8:48am
Hi hermes,
Yes, you are correct. The signature of that API procedure changed in v5.12.8.3. This post discusses it:
SIP 487 Request Terminated:
Section: Posted: 19 May 2008 at 2:42pm
http://www.lanscapecorp.com/forum/forum_posts.asp?TID=477&KW=SetMixerResampleState
“Mixer resampling” (which also includes time scaling) is turned on by default in the media engine so you do not need to call this API proc to enable it.
If you are calling this API proc to disable internal digital mixer resampling and time scaling, please let us know. If you have to turn this capability off to get your desired results, we need to know, because all apps should be able to get great audio quality with the default settings.
v5.12.8.3 and greater API prototype:
Code:
TELEPHONY_RETURN_VALUE VOIP_API SetMixerResampleState(
    SIPHANDLE hStateMachine,
    BOOL MixerResampleEnableState,
    int MixerResampleBlockTrigger,
    BOOL MasterPlaybackResampleEnableState,
    int MasterPlaybackResampleBlockTrigger
);
Here is the description of the above parameters:
hStateMachine:
The media engine instance handle.
MixerResampleEnableState:
When nonzero, enables mixer resampling/time scaling for all internal digital mixer function blocks.
MixerResampleBlockTrigger:
Can be set to a value of 4 or greater. This value controls the maximum stream delay that may build up internally in any digitally mixed audio stream. It specifies the maximum latency as an integer number of 20 ms block times.
MasterPlaybackResampleEnableState:
When nonzero, enables mixer resampling/time scaling for the internal multimedia audio playback.
MasterPlaybackResampleBlockTrigger:
Can be set to a value of 4 or greater. This value controls the maximum stream delay that may build up internally in the playback digitally mixed audio stream. It specifies the maximum latency as an integer number of 20 ms block times.
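As a rough illustration of the parameter rules above, here is a minimal C sketch. It is not taken from the media engine headers: the typedefs, the `VOIP_API` definition and the stub body of `SetMixerResampleState` are stand-ins so the example compiles on its own, and `trigger_to_ms` and `configure_resampling` are hypothetical helper names. Only the five-parameter signature, the minimum trigger value of 4 and the 20 ms block time come from the description above.

```c
#include <assert.h>
#include <stdio.h>

/* ASSUMPTION: stand-in declarations so the sketch compiles on its own.
   A real application would include the media engine header instead. */
typedef void *SIPHANDLE;
typedef int BOOL;
typedef int TELEPHONY_RETURN_VALUE;
#define VOIP_API

#define BLOCK_MS 20           /* one sample block time, per the post */
#define MIN_BLOCK_TRIGGER 4   /* smallest trigger value the post allows */

/* Records the last playback trigger the stub received (demo only). */
static int g_last_playback_trigger = 0;

/* ASSUMPTION: stub with the v5.12.8.3+ signature. The real procedure
   lives in the media engine; its return codes are not shown in this thread. */
static TELEPHONY_RETURN_VALUE VOIP_API SetMixerResampleState(
    SIPHANDLE hStateMachine,
    BOOL MixerResampleEnableState,
    int MixerResampleBlockTrigger,
    BOOL MasterPlaybackResampleEnableState,
    int MasterPlaybackResampleBlockTrigger)
{
    (void)hStateMachine;
    (void)MixerResampleEnableState;
    (void)MixerResampleBlockTrigger;
    (void)MasterPlaybackResampleEnableState;
    g_last_playback_trigger = MasterPlaybackResampleBlockTrigger;
    return 0; /* placeholder value */
}

/* Hypothetical helper: maximum latency in ms for a given trigger value. */
static int trigger_to_ms(int blocks)
{
    assert(blocks >= 0);
    return blocks * BLOCK_MS;
}

/* Hypothetical wrapper: rejects trigger values below 4, then calls the API.
   Returns 0 if the call was made, -1 if the arguments were rejected. */
static int configure_resampling(SIPHANDLE h,
                                BOOL mixer_on, int mixer_blocks,
                                BOOL playback_on, int playback_blocks)
{
    if (mixer_blocks < MIN_BLOCK_TRIGGER || playback_blocks < MIN_BLOCK_TRIGGER) {
        fprintf(stderr, "block trigger must be %d or greater\n", MIN_BLOCK_TRIGGER);
        return -1;
    }
    (void)SetMixerResampleState(h, mixer_on, mixer_blocks,
                                playback_on, playback_blocks);
    return 0;
}
```

For example, `configure_resampling(handle, 1, 4, 1, 8)` would keep both resampling stages enabled and allow up to `trigger_to_ms(8)`, that is 160 ms, of delay to build up in the playback stream. Whether the trigger values are checked when the matching enable flag is zero is not stated above, so the wrapper simply validates them in every case.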
Support
Notes:
This post discusses VOIP Media Engine undocumented API procedures that are used for internal test purposes. Do not use these API procedures in your VOIP applications.
hermes Junior
Joined: October 27 2006 Posts: 64
Posted: October 19 2008 at 4:55pm
Hello,
I don't know why, but when we connect to our speech recognition system over WAN its accuracy isn't very good, whereas when we disable internal digital mixer resampling it works fine. Over LAN its accuracy is always very good.
We ran all tests under good network conditions: low jitter, no lost packets and no out-of-sequence packets.
If you want more information about this problem, please let me know.
Thanks again.
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
Posted: October 20 2008 at 11:37am
Hi hermes,
Yes, we want more info about this issue. If something is affecting your speech recognition accuracy over WAN, we would like to identify it and get it fixed. Chances are that others will eventually want to do something similar to what your VOIP deployment is doing. We really hate to put off a possible improvement, especially when there may be a very simple reason why WAN recognition is not good and the fix could be very simple.
If you can, please briefly describe your VOIP system and how you are using the media engine to perform speech recognition. Please explain it as if we are hearing this information for the first time. We will then discuss the issue in more detail with you.
There should be no reason for WAN speech recognition to be poor, unless there are huge packet losses or UDP packet reordering problems.
It would be good to identify what is really going on rather than change the default media engine resampling/time scaling of media signal paths.
Thanks much,
Support
hermes Junior
Joined: October 27 2006 Posts: 64
Posted: October 20 2008 at 4:38pm
We're too busy at the moment to plan proper tests for tracking down this problem, but I'll try to do it next week.
Thanks for your support.
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
Posted: October 21 2008 at 5:57am
hermes,
OK, sounds good. Any time is OK and we will try to also fit it into our schedules.
At the moment, there are two things we have discussed here that would ensure a higher level of speech recognition accuracy. We will have to get around to testing these assumptions via WAN sometime over the next few months if we can. No rush though…
Support
hermes Junior
Joined: October 27 2006 Posts: 64
Posted: December 11 2008 at 8:47am
Sorry for the delay with this post; we're very busy here. We have run a lot of tests to pin down our speech recognition accuracy problem.
This was our problem:
1. When a SIP client calls our SRS (Speech Recognition System) over LAN, accuracy is very good.
2. In some cases, when a SIP client calls our SRS over WAN, speech recognition accuracy isn't very good.
This is our solution:
1. If we turn off 'MasterPlaybackResampleEnableState', everything works fine.
2. If we turn on 'MasterPlaybackResampleEnableState' and set the MasterPlaybackResampleBlockTrigger parameter to 8 or higher, speech recognition accuracy is very good too.
I don't know if it's important, but note that we don't use your IVR callbacks for recognition. We use 'audio output' as 'input' for our SPS.
And these are our conclusions:
1. All tests were done over a network with almost no lost RTP packets (less than 0.4%). In all tests jitter buffering was enabled with a length of 150 ms. The iLBC codec was used for all calls.
2. It may be a coincidence, but in the cases where speech recognition doesn't work well (MasterPlaybackResampleEnableState enabled and MasterPlaybackResampleBlockTrigger less than 8), we measured jitter of around 15 ms; in the other cases jitter is less than 3 ms.
3. I've read that 'Time Scale Modification' is used in VoIP to speed up or slow down the signal in an adaptive jitter buffer. Is that correct? It may not be a problem for human hearing, but a speech recognition system is more sensitive to any change in the voice signal. Could that be the problem?
I hope I have explained this clearly.
Thank you very much.
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
Posted: December 11 2008 at 10:32am
Hi hermes,
Excellent information. I have to think a bit before I post a response.....
In the meantime, please post the version of the media engine you used during these speech recognition tests.
Also, please elaborate a little on what you mean by: We use 'audio output' as 'input' for our SPS.
Thanks,
Randal
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
Posted: December 11 2008 at 10:34am
... never mind about the media engine version info... I looked right at the title of your post and had a brain freeze. :)
support Administrator
Joined: January 26 2005 Location: United States Posts: 1666
Posted: December 19 2008 at 11:12am
Hi hermes,
Your explanation above and in the “architecture” email you sent to me was very clear. Good job. Without knowing the deep internals of the media engine signal processing paths, you have done a terrific job.
Normally for voice recognition applications, we recommend that the receive IVR channels of the phone lines be used directly. This way sample block data can be fed directly to the app’s speech engine input(s). From a signal standpoint, the only computation performed on the received RTP data is a codec conversion of the received media data to the Rx IVR sample block data rate/format requested by the app.
In your case, where the speech recognition data travels the entire signal path all the way to the local multimedia hardware, additional signal processing is performed. One processing stage ensures minimal signal latency, and another ensures that the 20 ms call time periods can be matched to the playback rates of the multimedia hardware. All of this is done to minimize playback latency and to make sure there are no audio “gaps” in the media being streamed to the multimedia hardware. It’s a tough problem, since different multimedia hardware has differing record and playback rates that do not exactly match the 20 ms “time slice” of received call data. Also, there is no guarantee that 20 ms of data from one SIP device exactly matches 20 ms worth of data from another SIP device in real time. These are complex issues to get right.
The internal methods we use to time scale signal paths work extremely well for voice audio (your case) and introduce no significant noise artifacts that would cause speech recognition problems. We employ another signal processing technique that enforces an overall maximum worst-case signal path latency. This, I suspect, combined with the behavior of our jitter buffering algorithm, may be introducing enough noise into the signal to cause your speech recognition engine to misbehave. The human ear, on the other hand, does not detect these possible artifacts.
From your experimental results, you have identified exactly the right solution. For the signal path latency computation not to be perturbed by RTP jitter buffer behavior, the rule is this: the signal path latency threshold (as set by the MasterPlaybackResampleBlockTrigger value) must be greater than the maximum jitter buffer time setting. Just to be on the safe side, I would set the MasterPlaybackResampleBlockTrigger value to at least 40 ms more than the maximum allowed jitter. That way, even if the receiving end of your VOIP call is heavily loaded, speech recognition will still function properly.
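The rule above can be turned into a small calculation. The sketch below is only an illustration: the function name `min_playback_trigger` and its millisecond arguments are hypothetical, while the 20 ms block time, the minimum trigger of 4 and the 40 ms safety margin come from this thread. It returns the smallest trigger value whose latency threshold is strictly greater than the jitter buffer length and at least the requested margin above it.

```c
#include <assert.h>

#define BLOCK_MS 20           /* each trigger increment is one 20 ms block */
#define MIN_BLOCK_TRIGGER 4   /* smallest trigger value the API accepts */

/* Hypothetical helper: smallest MasterPlaybackResampleBlockTrigger value
   whose threshold (blocks * 20 ms) is strictly greater than the jitter
   buffer length and at least margin_ms above it. */
static int min_playback_trigger(int jitter_buffer_ms, int margin_ms)
{
    assert(jitter_buffer_ms >= 0 && margin_ms >= 0);

    int target_ms = jitter_buffer_ms + margin_ms;
    int blocks = (target_ms + BLOCK_MS - 1) / BLOCK_MS;  /* round up */

    /* With a zero margin the rounded value can equal the jitter buffer
       length; the rule asks for "greater than", so add one block. */
    if (blocks * BLOCK_MS <= jitter_buffer_ms)
        blocks++;

    if (blocks < MIN_BLOCK_TRIGGER)
        blocks = MIN_BLOCK_TRIGGER;
    return blocks;
}
```

With the 150 ms jitter buffer used in the tests, `min_playback_trigger(150, 0)` gives 8 blocks (160 ms), which is consistent with recognition working once the trigger was set to 8 or higher. Adding the suggested margin, `min_playback_trigger(150, 40)` gives 10 blocks (200 ms).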
What I suspect is occurring is that at some point the media engine’s jitter buffer logic becomes active. For a short time, no signal is propagated downstream. When the jitter algorithm turns off, it sends sample block data downstream for further processing (i.e., enforcing the maximum tolerable signal latency and possibly time scaling the signal for playback). Next, the computations we use to minimize overall signal path latency become active, and there we have our possible noise issue. We could easily prove this by comparing the received raw signal content (as sampled at an Rx IVR channel) with the signal content being sent to the local multimedia hardware for playback. Both of these signal points can be recorded to disk by the media engine for later analysis.
If we compared the spectra of both signals, I suspect we would immediately see the difference, if what I say is true.
Note that this issue is not related to any specific codec used for a phone call.
Note also that jitter may not be the only thing that could trigger the signal path latency computation in the media engine. If other processes of higher class and thread priority execute before the media engine’s threads, or if the system is sufficiently loaded, then the media engine threads that process signal paths could be delayed. This would cause similar speech recognition behavior with your current software architecture. In this case, call the SetMixerResampleState() API procedure and set the MasterPlaybackResampleBlockTrigger parameter to a higher value. Each integer increment of this parameter represents one 20 ms sample block time. Make sure it is set to be greater than your maximum allowed jitter setting. If your system is heavily loaded, increase it until speech recognition is again at the accuracy you require.
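The "increase it until accuracy is acceptable" step could be automated along the following lines. This is a sketch under assumptions, not part of the media engine: `accuracy_probe_fn`, `tune_playback_trigger` and `demo_probe` are hypothetical names. In a real test the probe would apply the trigger with SetMixerResampleState(), run a recognition test set and return the measured accuracy. The demo probe here only simulates a system whose processing delay is a fixed number of milliseconds, and its 1.0/0.0 return values are synthetic, not measurements.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_MS 20  /* each trigger increment is one 20 ms block */

/* Hypothetical callback: applies the trigger, runs a recognition test and
   returns the measured accuracy in the range 0.0 to 1.0. */
typedef double (*accuracy_probe_fn)(int trigger_blocks, void *ctx);

/* Raise the trigger one block at a time, starting at start_blocks, until
   the probe reports at least target_accuracy. Returns the first trigger
   that meets the target, or -1 if none up to max_blocks does. */
static int tune_playback_trigger(int start_blocks, int max_blocks,
                                 double target_accuracy,
                                 accuracy_probe_fn probe, void *ctx)
{
    assert(probe != NULL);
    for (int blocks = start_blocks; blocks <= max_blocks; ++blocks) {
        if (probe(blocks, ctx) >= target_accuracy)
            return blocks;
    }
    return -1;
}

/* Simulated delay for the demo probe below (synthetic value). */
static int g_demo_delay_ms = 170;

/* Demo probe: pretends recognition is perfect once the latency threshold
   exceeds the simulated delay, and fails otherwise. Synthetic values only. */
static double demo_probe(int trigger_blocks, void *ctx)
{
    const int *delay_ms = (const int *)ctx;
    return (trigger_blocks * BLOCK_MS > *delay_ms) ? 1.0 : 0.0;
}
```

Calling `tune_playback_trigger(8, 16, 0.95, demo_probe, &g_demo_delay_ms)` walks up from 8 blocks and stops at the first value whose threshold exceeds the simulated 170 ms delay. Bounding the search with `max_blocks` matters because every extra block adds 20 ms of possible playback latency.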
Thanks,
Randal
Notes:
This post discusses VOIP Media Engine undocumented API procedures that are used for internal test purposes. Do not use these API procedures in your VOIP applications.