kaldi_writer.save() aborts after 3 hours #107
Comments
Thanks for the report. Which Python version and operating system are you using? Is there anything else you can tell us to help track down the problem?
Hi, this would be Python 3.7.3, Linux 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux. I am now trying the same on a sibling machine with 128 GB of memory. I followed the megs download instructions, but in such a way that the input corpus in
Did the problem occur when executing the scripts as they are?
Typically I never really execute the do-it-all scripts, as I tend to store source databases somewhat differently from what the do-it-all script writer had in mind. Also, the do-it-all scripts tend never to run in one go for me, because things go wrong and take a long time. So I have But of course I can re-run megs
Did you also run the waverize script before exporting to Kaldi?
No, I didn't. The instructions about
OK, so the consistency check is just to make sure we have the same data.
Sure, I understand that it is hard to reproduce the problem from the description, and the databases are quite large, but is there a way to make kaldi_writer.save() a bit more verbose? I can run waverize, I suppose at the cost of some extra disk space. But generally, I don't really know why it is so hard to produce the Kaldi files; the information is in the
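For what it's worth, one way to get some output without changing the library is to switch on Python's own diagnostics before calling save(). This is only a sketch using the standard library: enable_diagnostics and disable_diagnostics are hypothetical helper names, and whether the library emits any log messages at all is an assumption. A process killed from outside with SIGKILL leaves no traceback, but the periodic stack dumps would still show where it was last seen.

```python
import faulthandler
import logging
import sys


def enable_diagnostics(dump_interval_s=600, stream=None):
    """Enable debug logging, crash tracebacks and periodic stack dumps.

    `stream` must be a real file with a file descriptor; defaults to stderr.
    """
    stream = sys.stderr if stream is None else stream
    # Show DEBUG-level messages from libraries that use the logging module
    # (this call has no effect if logging is already configured).
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    # Print a Python traceback on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=stream, all_threads=True)
    # Additionally dump all thread stacks every `dump_interval_s` seconds,
    # so a long-running call shows where it is spending its time.
    faulthandler.dump_traceback_later(dump_interval_s, repeat=True, file=stream)


def disable_diagnostics():
    """Stop the periodic dumps and the fatal-signal handler."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()


# Intended use (not executed here):
#   enable_diagnostics(dump_interval_s=600)
#   ... call the long-running export ...
#   disable_diagnostics()
```

Redirecting stderr to a file when starting the script keeps the dumps for later inspection.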
The main problem is that all utterance durations are needed in the Kaldi files.
OK, just for the sake of trying, I re-did I understand you need the utterance durations, I think these are already computed in I think I'm going ahead now by making my own script that will process the relevant files in
You are right about utt2dur, but the durations are also needed for the segments file, since some utterances are only segments of the full audio file. Have you tried with waverize?
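To illustrate why the durations matter for both files, here is a minimal sketch, independent of the library's actual implementation. It assumes PCM WAV input that Python's wave module can read. write_kaldi_durations and its argument layout are hypothetical; the output follows the usual Kaldi line formats (segments: utterance id, recording id, start, end; utt2dur: utterance id, duration). An utterance with end=None is taken to span to the end of its recording, which is the case where the file duration has to be looked up.

```python
import wave
from pathlib import Path


def wav_duration(path):
    """Return the duration of a PCM WAV file in seconds, from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_kaldi_durations(recordings, utterances, out_dir):
    """Write Kaldi-style `segments` and `utt2dur` files.

    recordings: dict mapping recording id -> WAV path
    utterances: iterable of (utt_id, rec_id, start_s, end_s); end_s may be
                None, meaning "until the end of the recording"
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rec_durations = {}  # cache, so each audio file is opened at most once

    with open(out_dir / "segments", "w") as seg_f, \
            open(out_dir / "utt2dur", "w") as dur_f:
        # Kaldi expects these files sorted by utterance id.
        for utt_id, rec_id, start, end in sorted(utterances, key=lambda u: u[0]):
            if end is None:
                if rec_id not in rec_durations:
                    rec_durations[rec_id] = wav_duration(recordings[rec_id])
                end = rec_durations[rec_id]
            seg_f.write(f"{utt_id} {rec_id} {start:.3f} {end:.3f}\n")
            dur_f.write(f"{utt_id} {end - start:.3f}\n")
```

Only the WAV header is read to get the frame count, so the lookup cost is per file rather than per sample; other audio formats would need a different reader.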
When called from german-asr's local/prepare_data.py, the process consistently aborts after 3 hours, without any diagnostics. Only train/wav.scp is written, and that happens in the first few minutes. After that, the machine (64 GB of memory) is busy at roughly 20% CPU and 14% memory for three hours, after which it silently aborts.