Prof. Yannick Estève has informed me that Prof. Paul Deleglise actually did a large portion of the work on the 4g tools. This is now corrected. Please accept my apology for this.
It has been a while since I last said something about the development of 3.6. It might be of general interest to know what we are up to these days.
If you have read this message board, you might know that there is a very deep refactoring going on in Sphinx 3.X. Part of the innovation in the architecture is in GMM computation, on which we have already published two papers. That work was completed during the 3.4 and 3.5 period of development. The other part is the refactoring of the search routine.
My observation is that search is generally a very different issue from GMM computation. For example, in real-life development of speech recognizers, one seldom sees more than one search programmer on a team. That is why it becomes a problem for sites that try to develop different types of search: it is simply difficult to do.
Not to my surprise, the CMU speech group has a great and long tradition of developing different types of search. Especially in the Sphinx 3.0-3.5 period of development, the different types of search became a curse for users who wanted to learn to use the recognizer. That is why in 3.6 the key theme of our research will be a unified framework for all these different types of search.
As far as I can tell, Mode 4 and Mode 5 will not be the only ways to do search in Sphinx. I have also completed another two modes recently: mode 2 and mode 3. At this point they are not very complete, so I don't want to reveal what they do. If you are interested, you could take a look at the source code. :-)
The existence of such a framework will hopefully encourage more people to work on speech recognition search algorithms. (If a new algorithm is eventually checked in to Sphinx, of course I will be happier.) With this, Sphinx 3.X and Sphinx 4 will together provide a research framework to the community, one in C and one in Java. This fulfills the vision that I and a lot of my predecessors on the Sphinx project have had.
Hopefully, this is the last time I report to you before the actual release. I have been sleeping less these days, and I hope I can finish the final touches on the different searches as soon as possible.
As a final note, LIUM has used Sphinx 3.3 in a French evaluation task. They developed their own quadgram versions of the tools, lm4g2dmp and s3_dag. They were also very nice to open source both tools on their web page. More importantly, they open sourced a set of speech segmentation and clustering tools. I have already added a link on the "Sphinx resource" web page.
Also allow me to say thank you to Prof. Paul Deleglise, Prof. Yannick Estève and Prof. Sylvain Meignier. They were very kind to open up the work in the first place.
I personally think that both the 4g LM tools and the speech clustering tools are great work. My hope is that someone could make these tools available in the canonical Sphinx framework as well; that would benefit even more users around the world. If you are interested, please kindly send us a mail.
It has been a while, and I haven't been lazy these days. There has been some interesting progress on 3.6.
A new architecture for the search has been implemented, and it could potentially allow multiple types of search to be used in 3.X just by changing some function pointers. It is probably the one major rewrite we have done since Ravi wrote the search.
Currently there are a couple of search modes. The old 3.5 search, or technically a time-switching tree copies search implementation, is now called mode TST or mode 4. As you know, mode TST has been the true workhorse for many real-life projects, so I decided not to remove it even though there is a need for a new search.
The new search, or technically a word-conditioned tree copies search implementation, is called mode WST or mode 5. The idea is very similar to RWTH's approach of using tree copies to keep track of different LM states without combining them. Though we have known this concept for a long time, in the past we lacked the know-how to implement it in practice. That was also partially because the search requires a huge amount of memory. If no tricks are applied, a correct version is nearly impossible to implement.
The fact that computation and resources have again doubled their power in the last 5 years gives us a better chance. After a whole month of struggle, a correct but unoptimized version of this search for a 2k task is finally completed. The rest, for me, will be much simpler to deal with. All I need to do is implement several tricks that either speed up the search or reduce its memory use.
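To illustrate the function-pointer idea mentioned above, here is a minimal sketch in Python, where plain callables stand in for C function pointers. All names (SearchOps, MODE_TABLE, decode and the two toy step functions) are hypothetical and are not Sphinx 3.X identifiers; only the mode numbers 4 (TST) and 5 (WST) come from the post. The toy "searches" do nothing useful; the point is only that the driver loop stays the same when the mode changes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchOps:
    """Bundle of callables playing the role of C function pointers (hypothetical)."""
    name: str
    init: Callable[[], dict]
    step: Callable[[dict, str], None]    # consume one frame
    finish: Callable[[dict], List[str]]  # return the hypothesis


def _init() -> dict:
    return {"frames": []}


def _step_tst(state: dict, frame: str) -> None:
    # Placeholder for the per-frame work of a time-switching tree search.
    state["frames"].append(("tst", frame))


def _step_wst(state: dict, frame: str) -> None:
    # Placeholder for the per-frame work of a word-conditioned tree search.
    state["frames"].append(("wst", frame))


def _finish(state: dict) -> List[str]:
    return [f"{tag}:{frame}" for tag, frame in state["frames"]]


# Mode numbers follow the post: 4 = TST, 5 = WST.
MODE_TABLE: Dict[int, SearchOps] = {
    4: SearchOps("TST", _init, _step_tst, _finish),
    5: SearchOps("WST", _init, _step_wst, _finish),
}


def decode(mode: int, frames: List[str]) -> List[str]:
    """The driver is identical for every mode; only the table entry changes."""
    ops = MODE_TABLE[mode]
    state = ops.init()
    for frame in frames:
        ops.step(state, frame)
    return ops.finish(state)
```

Under this kind of layout, adding a new search mode would mean registering one more table entry rather than touching the driver.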
Writing the new search for 3.6. This time it is very involved. Let's see how long it will take.
We will add some new GMM computation methods to Sphinx 3.5. This time, we will just create a 3.5.1 instead of a 3.6.
And...... sorry folks, I am pretty busy these days because I need to rewrite a significant portion of the search routine. So I probably can't help you much with your Sphinx problems.
Happy new year everyone.
Several interesting things happened last month:
1, We just released Sphinx 3.5 at SF.
2, We started a Twiki for Sphinx 3.
3, I will start to remove some of the links from this page because I want to focus on the Twiki page and the official cmusphinx page here. I will still maintain the presentation here.
Put up a **sample** of internal documentation for Sphinx3 (hidden). Worked on regression tests and refactoring. Busy.
Update on some progress on Sphinx 3 and also Sphinx 2: we spent most of last month on speed-up and adaptation. I also spent some of my time refactoring the source code of sphinx3 and SphinxTrain. One discovery is that the current sphinx2 codebase also largely overlaps with s3 and st. The team is still considering whether we should merge the 5 legacy codebases (s2, share, s3.0, s3.5, st) together. That would make the lives of the developers much easier.
Although every developer in the team is convinced that the s3.0/3.5/st/share merge is necessary and healthy, our opinions differ on whether putting s2 and s3/st together is a good idea. Obviously, the problem of code duplication would be greatly alleviated. However, it would cause a lot of conceptual integrity problems for users.
In the end, the most likely solution is to put all the code in one place but still allow different CVS aliases to check out different things.
So what will the final package look like? It contains both the recognizer and the trainer. It can be used in applications and in research. It can be fast and it can be slow. This is something pretty unique among open source speech recognition systems. I personally find the prospect extremely exciting. Also, if we work on it a little more, the current codebase is ready enough to help all generations of Sphinx recognizers (2, 3, 4) (and perhaps other recognizers) provide state-of-the-art models and libraries. This will truly benefit research, business and users.
Even though I am working on it, I don't know the details of the above yet. That is the good thing about using CVS: surprises are always there, and most of them are interesting. My guess is that the joint package will start to be used for distributing s3.6, s2.6 and SphinxTrain 1.0. I will keep you posted. :-)
During the last month, most of my time was spent in meetings, conferences and talking. We found that these can surprisingly help the projects.
It has been a while since the last update.
During the last month, the team and I spent most of our time preparing the Sphinx 3.5 release. It is now released internally at CMU. It has been tested on VC/Linux/OSX/cygwin. It should be pretty simple to port it to other POSIX platforms.
Some key features you can find in this release include the live-mode API and speaker adaptation. For speaker adaptation, we also changed SphinxTrain so that facilities for training adaptation matrices are there.
Another important change is that we incorporated 4 of the s3.0 tools into s3.X: align, allphone, dag and astar. This is an important step for the architecture of the code.
Do tell me what you think by sending me a mail. -Arthur :-)
Inserted logos into this page. After all, I am not an artist. So, if you are someone with artistic talent (well, in this case, if you just have some :-) ), please contact me.
BTW, if you understand Egyptian hieroglyphs and understand what I am doing, tell me whether I am doing it correctly.
Update of this web page. Highlight: I added a new Sphinx Open Projects page for developers. Check it out here.
I was ill for a couple of days last week and wasn't able to update this page for a while. I am sorry. A couple of interesting things have happened in these days. First of all, Sphinx 3.4 was officially released, but we soon found that there were still a couple of bugs in the released code (one of them an illegal memory write), so we quickly released s3.4.1 to replace it. You can find the code here.
The next piece of news is that we have started to have some adaptation capability in the sphinx3.5 tree-lexicon decoder. We were able to implement single-regression-class, transform-based adaptation in s3.5, and the code was checked in to SourceForge's CVS. We will probably extend it to multiple regression classes later on.
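As a rough illustration of what a single regression class means, here is a minimal sketch that assumes the transform is an affine map of the Gaussian means (mean' = A * mean + b), as in MLLR-style adaptation. The post does not say which exact form s3.5 uses, so this is an assumption, and the function names and the toy numbers are made up.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def adapt_mean(A: Matrix, b: Vector, mean: Vector) -> Vector:
    """Return A * mean + b for one Gaussian mean (assumed affine transform)."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]


def adapt_all_means(A: Matrix, b: Vector, means: List[Vector]) -> List[Vector]:
    """Single regression class: the same (A, b) is applied to every mean."""
    return [adapt_mean(A, b, m) for m in means]


# Toy 2-dimensional example with made-up numbers.
A = [[1.0, 0.0],
     [0.0, 2.0]]
b = [0.5, -1.0]
means = [[1.0, 1.0], [2.0, 3.0]]
adapted = adapt_all_means(A, b, means)
```

With multiple regression classes, each group of Gaussians would get its own (A, b) pair instead of sharing one.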
The next piece of news is surprising and exciting: we were able to compile and link all the s3.0 tools (including the flat-lexicon decoder and the aligner) using the s3.5 code base (which the tree-lexicon decoder uses). I say "compile and link" because I haven't tested them yet, but I see this as a big step. We have to thank Carl Quillen for this because he spent his valuable time making the changes for s3astar, s3align and s3allphone. This made me realize that the port is possible.
The last thing is an announcement of a software request. We think it is pretty suitable for programmers. Currently, using sphinx 2, 3 and 4 can be pretty hard to learn. We are looking for programmers who can help us write a portable administration interface (temporary name: SphinxAdmin). One example implementation is Thomas Harris's perl/tk wrapper (find it here). If you are interested in this project, please contact me.
Evandro and I put the latest release candidate of Sphinx 3.4 here.
Yitao also completed building a set of live-mode APIs, and we are putting it inside the CALO recorder. This will be the final test of the live-mode APIs before we release them in s3.5.
I just tested s3.4 on the WSJ 5k task. The error rate is 6.65%, compared to 6.5% for s3. You can probably say there is not much degradation.
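To put the two numbers side by side, here is a small worked calculation. It assumes both figures are error rates measured on the same test set; the variable and function names are only for illustration.

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


s3_error = 6.5     # % for s3, as quoted in the post
s34_error = 6.65   # % for s3.4, as quoted in the post

absolute = s34_error - s3_error
relative = relative_change(s3_error, s34_error)
print(f"absolute: {absolute:.2f} points, relative: {relative:.1%}")
# prints "absolute: 0.15 points, relative: 2.3%"
```

So the degradation is 0.15 points absolute, or about 2.3% relative.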
Some users submitted some very useful feedback for improving the live-mode recognizer. Thanks a lot. This kind of feedback is very useful for further enhancing the functionality of s3.4. We will release RC III in the next few days.
Evandro and I put the latest release candidate of Sphinx 3.4 here. I will start to change this web page a little to make it more of a personal web page. The major reason is that the official sphinx web page is doing its job pretty well, and I don't want to add yet another web page to confuse users.
I also changed this web page significantly to prepare for distributing information about our next release. We have already started some of the work.
I put a release candidate of sphinx 3.4 on this web page. Please help try it out; you can find testing instructions on the "Compilation Status" page.
Starting from now I will also join the team on SourceForge to develop sphinx. This will avoid problems with merging the code. I will also start to do more on SphinxTrain. You will see all of this on this web page.
I had some fun at ICASSP 2004, so I didn't do much last week. When I came back, I found that Yitao had completed a prototype of the live-mode APIs and Jahanzeb had incorporated phoneme lookahead into the live-mode recognizer. Nice! You can find all of this in the latest tgz.
For developers, Jahanzeb informed me that one may need to run autogen.sh twice before running configure.
We are also trying to fix the configuration process on Mac OSX.
Evandro has incorporated a new version of libtool in the code.
I was also able to build 3.4 using the Intel C++ compiler on a Pentium 4 machine. I haven't been able to test the performance yet. I will give you guys an update very soon.
Finally able to fix the desktop models problem. You can find the fix in the latest tarball. One more step before we can put it on SourceForge.
Evandro and I have put a lot of information about compilation in a file called ISSUES in the root directory of s3.4. The Intel compiler does not really beat loop unrolling on a P4 SMP. Of course, this conclusion is not final because I haven't run any performance tests yet.
Built a .dsp file for the batch-mode decoder. All executables can be compiled from program.dsw now. I also built an automatic suite for building the tgz file. From now on, the executables should have more stable sizes.
Incorporated Jahanzeb's phoneme look-ahead code into the main branch. Did significant rewriting. Also fixed some bugs; need to re-test that part. Also made some fixes to the argument files of sphinx-simple; haven't had a chance to test them on a Windows machine yet. Will do it later.
Added the HTML version of the outline of Sphinx's manual to the information page. Got comments on the outline, and it seems that I may need to simplify the structure a little. Will post the 3rd draft soon.
Added the outline of Sphinx's manual to the information page. I welcome comments from anyone. Just drop me a mail.
-Updated s3.4. 1) Front-end parameters can now be specified in the live-mode recognizer(s). 2) All the beam parameters are in the real domain, so changing the log base doesn't affect them. 3) Fixed the makefile for autoconf. 4)
-Added a couple of links to projects that build a better interface for Sphinx 2. Sphinx 3.4 needs a better interface.
-Someone also told me about problems in sphinx-test and sphinx-simple. There is an LM issue in the sphinx3.4 code that I need to resolve. I am working on this.
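On the point above about beam parameters being in the real domain: the sketch below shows why that makes them independent of the log base. It assumes scores are stored as integer logarithms in some base and that a beam is a probability ratio relative to the best score. The helper names and the base values 1.0001 and 1.0003 are arbitrary illustrations, not taken from the s3.4 source.

```python
import math


def to_log(prob: float, logbase: float) -> int:
    """Convert a real-domain probability (or ratio) to an integer log score."""
    return int(math.log(prob) / math.log(logbase))


def passes_beam(score: int, best: int, beam: float, logbase: float) -> bool:
    """Keep a hypothesis if its score is within `beam` (a real-domain ratio) of the best."""
    return score >= best + to_log(beam, logbase)


beam = 1e-60  # specified once, in the real domain
for logbase in (1.0001, 1.0003):
    best = 0  # log of ratio 1.0
    inside = to_log(1e-50, logbase)   # hypothesis 1e-50 times as likely as the best
    outside = to_log(1e-70, logbase)  # hypothesis 1e-70 times as likely as the best
    print(logbase, to_log(beam, logbase),
          passes_beam(inside, best, beam, logbase),
          passes_beam(outside, best, beam, logbase))
```

The integer threshold differs between the two bases, but the pruning decisions are the same, because the user-visible beam is always converted with whatever base is in use.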
The download page is updated. Some users told me that the gzip compression scheme may not work with some versions of WinZip. Therefore, I also put a zipped file there.
I am currently busy training on the ICSI database. I may be unable to touch the recognizer code for 1 to 2 weeks. Until then, have fun with the live-mode recognizer.
Several updates. The hard-wiring of the feature type is finally cleared out. I haven't done any live-mode test yet. However, I did ensure that the live-mode simulator works sanely. I will possibly just test it when I wake up tomorrow.
I also put an experimental version of s3 in a so-called "Sphinx extension" page. This version is a branch of s3 (not from s3.3) that uses a flat lexicon for search but includes frame skipping, GMM selection and Gaussian selection. Of course, we later decided to abandon that path of development. I am sharing that code so that people can play with it.
I am finally able to use the live-mode recognizers built in the s3.3 age. The current live-mode demo hard-wires the frame window size and FFT window size. It is also hard-wired to a particular feature type. My current fix only works for feature type -s3_1x39, and some front-end parameters are still hard-wired. Well, again, we are one step closer.
If you are interested in building sphinx 3.4 in Visual C++, the current setup includes the .dsp files for compiling all the live-mode recognizers and tools. (Well, the batch-mode tool compilation setup is coming.)
Jahanzeb has also completed his research on phoneme lookahead, and it is more or less time for me to incorporate it. I will keep you posted.
Removed some content from the info and resources pages; I feel it may have been a little biased and inaccurate. Hopefully, it doesn't hurt too much. :-| I am going to rewrite them so that the standpoint becomes more neutral. In general, I hope this will become a page that provides more information for developers of Sphinx.
Gathered and included a lot of information in the developer's info page and resources page. It still needs a lot of proofreading to be good enough to be public.
This page is created
I put a tarball of Sphinx 3.4 on this page. It has all the speed-up facilities Jahanzeb and I have worked on for the last three months. We have come to a stage where the backbone of Sphinx 3.4 is complete. Our plan is to optimize s3.3's GMM computation. The following features are included: 1) computation skipping, 2) CI-based senone selection, 3) VQ-based and SVQ-based Gaussian selection, 4) sub-vector quantization and 5) phoneme lookahead.
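As one illustration of the features listed above, here is a minimal sketch of what CI-based senone selection could look like. It assumes the usual idea: score the cheap context-independent (CI) senones first, then fully compute a context-dependent (CD) senone only if its parent CI senone scores within a beam of the best CI senone, and otherwise back off to the CI score. Whether s3.4 does exactly this is an assumption, and every name and number below is made up.

```python
from typing import Callable, Dict


def select_and_score(
    ci_scores: Dict[str, float],             # log scores of CI senones for this frame
    cd_to_ci: Dict[str, str],                # parent CI senone of each CD senone
    full_cd_score: Callable[[str], float],   # expensive GMM evaluation
    beam: float,                             # log-domain width, >= 0
) -> Dict[str, float]:
    best_ci = max(ci_scores.values())
    scores: Dict[str, float] = {}
    for cd, ci in cd_to_ci.items():
        if ci_scores[ci] >= best_ci - beam:
            scores[cd] = full_cd_score(cd)   # parent is promising: compute fully
        else:
            scores[cd] = ci_scores[ci]       # parent is poor: reuse the CI score
    return scores


# Toy frame: "AA" is a good match, "B" is not.
calls = []


def toy_full_score(cd: str) -> float:
    calls.append(cd)   # record which CD senones took the expensive path
    return -12.0


frame_scores = select_and_score(
    ci_scores={"AA": -10.0, "B": -50.0},
    cd_to_ci={"AA(b,t)": "AA", "B(aa,aa)": "B"},
    full_cd_score=toy_full_score,
    beam=20.0,
)
```

In this toy frame only one of the two CD senones is evaluated fully, which is where the saving would come from.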
The speed-up facilities are pretty stable. It runs at 1.3xRT on the Communicator task on a 2G machine. Many people have told me that it is not fast enough. I am fully aware of that. In the next month and a half I will spend most of my time further improving the speed. I'll make sure you are kept posted.
I also included all the code for using dynamic LMs and class-based LMs. I have run only one or two tests, so I won't say it is very stable. The current implementation assumes all LMs are initialized at the beginning of recognition. This significantly limits its flexibility because developers cannot change LMs at run time. Again, this is something I'll work on.