Author: Pete (Intermediate)
Joined: December 05 2006
Posts: 12
Posted: February 06 2007 at 2:27pm

Hello Lanscape support folks,

As a test, I've captured the RTP packet data for a phone call by modifying the RawRtpCallback() routine to write the pSampleBuffer to a file.  I then experimented with a few ways of verifying that this was good data.  One way was to use Cool Edit Pro to "play back" the data.  But the result is far from optimal and I'm trying to work out why.

The only two data formats in which I could even recognize my voice were uLaw and aLaw.  All permutations of the other formats were either not recognized or not converted well by Cool Edit Pro.  Even for uLaw/aLaw, the resulting playback was barely audible.
 
 Okay, so this opens up a bunch of questions for me:
 
 1) Is this "raw" data in a proper format so that audio programs like Cool Edit Pro can recognize it?  Or do I need to have some kind of header or reformat the data?
 
 2) My ultimate goal is to eventually pass this captured data to some postprocessor (speech to text etc.) so I need to make sure it's good data and in some kind of standard form.  Do you have any suggestions on a better way to do this?  I realize writing to a file is not optimal, I'm just using this approach as a proof of concept right now.
 
 3) Cool Edit Pro is overkill for what I want to do - are you folks familiar with other products that might make this task simpler?  I've checked out Microsoft Speech Recognition, Dragon Naturally Speaking, IBM Via Voice and a few others but they don't seem like a good match for what I'm trying to do.  Typically they want to grab the voice data directly from the sound card themselves although I think Naturally Speaking does have an SDK interface.  I've contacted them but their pre-sales contact was sorely lacking compared to you folks :-)
 
 Pete

support (Administrator)
Joined: January 26 2005
Location: United States
Posts: 1666
Posted: February 06 2007 at 4:14pm

Hi Pete,

 Item 1:
 Is this "raw" data in a proper format so that audio programs like Cool Edit Pro can recognize it?
 
 <<< Support
 Yes – with the exception of G729/G729A. I don’t think Cool Edit can handle G729. Come to think of it, I don’t think Cool Edit can natively handle iLBC either.
 
 In your RTP packet callback proc, your code will get an address that will point to a RAW_RTP_DATA structure.
 
Your app should take the pSampleBuffer address from that structure and write SampleBufferLengthInBytes bytes, starting at that address, to your sample file.

Note that by doing this, you will be writing transmitted or received 20 ms sample blocks to the file, in the native RTP format for the call. Do not simply write the entire contents of the RAW_RTP_DATA structure to your sample file. It will not sound good at all.
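
To make the "payload only" point concrete, here is a minimal sketch in Python. It is not the media engine's API: the RawRtpData class is a hypothetical stand-in that mirrors only the two fields named above (pSampleBuffer and SampleBufferLengthInBytes), and append_payload is an illustrative helper name.

```python
import io
from dataclasses import dataclass


@dataclass
class RawRtpData:
    """Hypothetical stand-in for RAW_RTP_DATA; only the two fields named in this thread."""
    sample_buffer: bytes                # stands in for pSampleBuffer
    sample_buffer_length_in_bytes: int  # stands in for SampleBufferLengthInBytes


def append_payload(out_file, packet):
    """Append only the valid audio bytes of one packet to an open binary file."""
    # Slice by the reported length; the buffer itself may be larger.
    payload = packet.sample_buffer[:packet.sample_buffer_length_in_bytes]
    out_file.write(payload)
    return len(payload)


# Usage: an in-memory file stands in for the real sample file.
capture = io.BytesIO()
written = append_payload(capture, RawRtpData(b"\xff" * 200, 160))
```

Only the first 160 bytes reach the file here, which is the behavior you want from the real callback: audio samples and nothing else.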
 
 If you only wanted to perform speech recognition of received call audio, then you should look at the receive IVR interface API procs. See the dev reference for more info. Also if you only want to perform speech recognition of local recorded audio, see the speech recognition API procs.
 
Also remember that when opening the “sample files” you created, you will have to tell Cool Edit to open the file using a raw data format like aLaw, uLaw, 8k PCM, 11k PCM, or 22k PCM.
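
If you would rather not pick the raw format by hand each time, one option is to convert the capture to a WAV file. The sketch below is only an illustration: it assumes the capture holds nothing but G.711 uLaw payload bytes at 8 kHz mono, and the function names and file paths are made up for the example. It decodes each uLaw byte to 16-bit linear PCM and writes the result with Python's standard wave module.

```python
import struct
import wave


def ulaw_to_pcm16(byte_value):
    """Decode one G.711 uLaw byte to a signed 16-bit linear sample."""
    u = ~byte_value & 0xFF                       # uLaw bytes are stored inverted
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    return (0x84 - magnitude) if (u & 0x80) else (magnitude - 0x84)


def raw_ulaw_to_wav(raw_path, wav_path, sample_rate=8000):
    """Convert a headerless uLaw capture to a mono 16-bit PCM WAV file."""
    with open(raw_path, "rb") as f:
        encoded = f.read()
    samples = [ulaw_to_pcm16(b) for b in encoded]
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)            # assumption: mono call audio
        w.setsampwidth(2)            # 16-bit linear PCM output
        w.setframerate(sample_rate)  # assumption: 8 kHz
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return len(samples)
```

Calling raw_ulaw_to_wav("capture.raw", "capture.wav") on such a capture should give a file that most audio tools open without being told the format. An aLaw capture needs a different decode step, which is not shown here.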
 
 
 Item 2:
 My ultimate goal is to eventually pass this captured data to some postprocessor (speech to text etc.) so I need to make sure it's good data and in some kind of standard form.
 
 <<< Support
 The data is in standard form so this should not be a problem.
 
The way we would do it would be to identify a speech recognition engine that we like and that performs well with 8k PCM, 8k aLaw, or 8k uLaw. If a speech recognition engine can be found that is good enough to perform with a “CELP” based codec like G729 or linear-predictive coding codecs like iLBC, that is a big plus.

We would want to identify a speech engine whose API we can pass sample block data to directly. The media engine works with sample blocks of 20 ms in length and generally these can be passed directly to speech recognition engines.
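
As a rough illustration of the block sizes involved: at 8 kHz, a 20 ms block is 160 samples, which is 160 bytes of aLaw/uLaw or 320 bytes of 16-bit PCM (the 16-bit width is an assumption). The Python sketch below splits captured data into 20 ms blocks and hands each one to a callable; feed is a hypothetical placeholder for whatever streaming call a given speech engine provides.

```python
def block_size_bytes(sample_rate_hz, bytes_per_sample, block_ms=20):
    """Size in bytes of one sample block of block_ms milliseconds."""
    samples = sample_rate_hz * block_ms // 1000
    return samples * bytes_per_sample


def iter_blocks(data, size):
    """Yield consecutive blocks of `size` bytes; the last one may be shorter."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def feed_engine(data, feed, sample_rate_hz=8000, bytes_per_sample=1):
    """Pass each 20 ms block to `feed` (hypothetical engine hook); return the block count."""
    size = block_size_bytes(sample_rate_hz, bytes_per_sample)
    count = 0
    for block in iter_blocks(data, size):
        feed(block)
        count += 1
    return count


# Usage: collect the blocks in a list instead of calling a real engine.
received = []
feed_engine(bytes(400), received.append)
```

In a live call you would not need the splitting step, since each callback already delivers one block; it is only useful when replaying a capture file.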
 
 
 Item 3:
 About speech recognition engines…
 
 <<< Support
 
 As far as a speech engine having an API, the simplest thing to do would be to use Microsoft’s SAPI (speech API). With their default speech recognition capability you should be able to do a proof of concept app no problem.  Granted we have not looked at Microsoft SAPI in the recent past but I’m sure it has only gotten better – not worse. The last time I checked you could stream media directly to the speech recognition API.
 
 
 Support
 