How does it work? There are two distinct aspects: analysis and generation. The first, analysis, involves the program reading a source sample and calculating statistics from it. The second, generation, uses those statistical results to produce new words.
The analysis aspect has three basic steps. First, it reads in a word list from a source language. Next, it produces the permutations of potential syllables from the letters of the alphabet (up to a limited length). Finally, it analyzes the word list to determine the frequency of each potential syllable, including how often it occurs at the start or end of a word. This provides the statistical data that is needed for generation.
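The analysis steps can be sketched roughly like this (a minimal illustration, not the actual langgen code; the function and parameter names here are my own):

```python
import itertools
from collections import Counter

def analyze(words, alphabet, max_len=3):
    """Count how often each candidate syllable appears in the word list,
    and how often it occurs at the start or end of a word."""
    # Step 2: enumerate every letter permutation up to max_len
    # as a candidate syllable.
    candidates = set()
    for n in range(1, max_len + 1):
        for combo in itertools.product(alphabet, repeat=n):
            candidates.add("".join(combo))

    # Step 3: tally overall, word-initial, and word-final frequencies.
    totals, starts, ends = Counter(), Counter(), Counter()
    for word in words:
        for syl in candidates:
            totals[syl] += word.count(syl)
            if word.startswith(syl):
                starts[syl] += 1
            if word.endswith(syl):
                ends[syl] += 1
    return totals, starts, ends
```

Note that the candidate set grows exponentially with `max_len`, which is presumably why the permutation length is limited.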
The generation aspect requires the statistical data from the analysis aspect. It has five basic steps. The first is to decide upon the number of syllables for the new word. The second is to make a probabilistic choice of a starting syllable. In the third step, if the number of syllables is greater than two, a loop makes probabilistic selections of additional syllables. Fourth, a probabilistic selection of an ending syllable is made. Finally, the selected syllables are combined in order to form a word.
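The five generation steps might look something like the following sketch, which feeds the frequency counters from the analysis step into weighted random choices (again my own illustration, not langgen's implementation):

```python
import random

def generate(starts, middles, ends, min_syl=2, max_syl=4):
    """Build a word from syllables chosen with probability
    proportional to their observed frequencies."""
    def pick(counter):
        syllables = list(counter.keys())
        weights = list(counter.values())
        return random.choices(syllables, weights=weights, k=1)[0]

    n = random.randint(min_syl, max_syl)  # step 1: syllable count
    parts = [pick(starts)]                # step 2: starting syllable
    for _ in range(n - 2):                # step 3: middle syllables, if any
        parts.append(pick(middles))
    parts.append(pick(ends))              # step 4: ending syllable
    return "".join(parts)                 # step 5: combine into a word
```

Using separate start and end distributions is what keeps the output word-shaped: a syllable that only ever appears word-finally in the source list will never be chosen as an opener.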
There are a number of additional features I plan to add in the future, either as part of this language generator or as something layered on top. Nevertheless, the port of langgen will serve as a good starting point. With a set of names from Celtic mythology, it produced the list below, which I felt was acceptable.
Testing it, I think it needs some additional work to handle characters with umlauts, cedillas, accents, and the like. Why do I think this? Because when I use a source word list from a language featuring such characters (e.g. Norse or Polish), I get less desirable output, like so:
So adding support for those types of characters is my next priority. At some point I'll get it onto GitHub and make it available.
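One likely culprit with such characters (an assumption on my part, not a diagnosis of langgen itself) is that the same accented letter can be encoded either as a single code point or as a base letter plus a combining mark, so syllable matching splits a character in two. Normalizing every source word up front is one possible fix:

```python
import unicodedata

def normalize_word(word):
    """Normalize to NFC so that e.g. 'ö' is one code point rather than
    'o' followed by a combining diaeresis; without this, substring-based
    syllable matching can split an accented character apart."""
    return unicodedata.normalize("NFC", word.lower())

# A decomposed "Björn" ('o' + U+0308) and a precomposed one
# compare equal after normalization:
decomposed = "Bjo\u0308rn"
precomposed = "Björn"
```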