In my previous post about language generation I mentioned that sample languages using characters beyond a-z weren't being processed properly. Investigation showed that two portions of the code had problems and needed to be altered. The first was that the encoding used for reading the sample files was stripping some characters; explicitly reading the sample files with the Windows-1252 encoding resolved this problem. The second was in generating the permutations used to search the sample for potential syllables. Only a-z were being used in creating the permutations, but the characters in the set ôõöøùúûüýþÿáâãäåæçèéêëìíîïðñÞ all needed to be considered as well, just to allow the Norse samples to be processed properly. Altering the code in those two portions markedly improved the generated words.
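To make the two fixes concrete, here is a minimal Python sketch of the idea. It is not the actual port's code: the function names `read_sample` and `candidate_syllables`, the `length` parameter, and the plain substring test are all assumptions made for illustration. I'm also using `itertools.product` so that doubled letters are included, which may or may not match how the real code builds its permutations.

```python
from itertools import product

# Extra letters listed in the post, added on top of a-z.
EXTRA_LETTERS = "ôõöøùúûüýþÿáâãäåæçèéêëìíîïðñÞ"
ALPHABET = "abcdefghijklmnopqrstuvwxyz" + EXTRA_LETTERS


def read_sample(path):
    """Read a sample file, decoding it explicitly as Windows-1252."""
    with open(path, encoding="cp1252") as handle:
        return handle.read()


def candidate_syllables(sample, length=2):
    """Return each letter sequence of the given length that occurs in the sample.

    Hypothetical helper: the fixed `length` and the simple substring check
    are assumptions, not the original langgen logic. Matching is
    case-sensitive, so the sample is searched exactly as read.
    """
    found = []
    for letters in product(ALPHABET, repeat=length):
        candidate = "".join(letters)
        if candidate in sample:
            found.append(candidate)
    return found
```

With only a-z in `ALPHABET`, a sequence such as "þæ" would never be tried at all, which is the second problem described above.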
Two further checks were added. If a generated word contains too few vowels, it is rejected and a new one is generated instead. If it contains more than five consecutive consonants, it is also rejected. An example of output generated from a Norse sample is below.
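As an aside on the two checks just described (this is a sketch of the rules, not the generated output), they could look roughly like this in Python. The limit of five consecutive consonants comes from the text above; the vowel set, the minimum vowel count of one, and the names `is_acceptable` and `generate_acceptable` are assumptions for illustration only.

```python
# Assumed vowel set: the post does not say which letters count as vowels.
VOWELS = set("aeiouyáâãäåæèéêëìíîïôõöøùúûüýÿ")
MIN_VOWELS = 1         # assumption: the post only says "insufficient vowels"
MAX_CONSONANT_RUN = 5  # from the post: more than five in a row is rejected


def is_acceptable(word):
    """Return True if the word passes both the vowel and consonant-run checks.

    Any character that is not in VOWELS is treated as a consonant.
    """
    vowel_count = 0
    run = 0
    longest_run = 0
    for ch in word.lower():
        if ch in VOWELS:
            vowel_count += 1
            run = 0
        else:
            run += 1
            longest_run = max(longest_run, run)
    return vowel_count >= MIN_VOWELS and longest_run <= MAX_CONSONANT_RUN


def generate_acceptable(make_word, max_attempts=1000):
    """Call make_word() until it yields an acceptable word.

    `make_word` stands in for the real generator; `max_attempts` is a
    safety limit added for this sketch.
    """
    for _ in range(max_attempts):
        word = make_word()
        if is_acceptable(word):
            return word
    raise RuntimeError("no acceptable word generated")
```

Rejecting and regenerating keeps the generator itself simple; the cap on attempts is only there so a badly skewed sample cannot loop forever.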
Next up is automatically dropping one character whenever the same letter appears three times in a row (e.g. if a word contains "rrr", it gets changed to "rr"). Once that's in place I will likely perform a brief cleanup and commit it to GitHub. (I also need to check the license on the original langgen. As this is a direct port, I need to release it under that license. I remember it was relatively permissive but don't recall if it was BSD, MIT, LGPL, etc.) There will be an additional post here when that is all done.
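For reference, the planned triple-letter cleanup mentioned above could be sketched in Python as a single regular expression. This is not yet in the code; the name `collapse_triples` is hypothetical, and shrinking runs longer than three down to two letters is my assumption, since only the "rrr" case is spelled out.

```python
import re


def collapse_triples(word):
    """Shrink any run of three or more identical characters to two.

    Assumption: longer runs (e.g. "rrrr") are also reduced to "rr".
    Matching is case-sensitive, so "Rrr" is left alone.
    """
    return re.sub(r"(.)\1{2,}", r"\1\1", word)
```

For example, `collapse_triples("orrri")` gives `"orri"`, while a word with only a doubled letter is returned unchanged.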