Friday, September 2, 2016

Language generation

For a few different things I'm working on, I need (or at least want) automated generation of names/words that seem to be from a common language.  There are programs that will do this; for example, I found that Federico Tomassetti produced a language generator using Python.  However, Python is not the easiest language to integrate with some of the software I'm planning to use the language generator for.  As a result, I've nearly completed a port of Tomassetti's work to C#.

How does it work?  There are two distinct aspects, analysis and generation.  The first aspect, analysis, involves the program reading a source sample and calculating statistics from it.  The second aspect, generation, requires the statistical results from the analysis.

The analysis aspect has three basic steps.  First, it reads in a word list from a source language.  Next, it produces the potential syllables: every combination of letters from the alphabet, up to a limited length.  Finally, it analyzes the word list to determine the frequency of the potential syllables, including how often they occur as starts or ends of words.  This provides the statistical data that is needed for generation.
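To make the three analysis steps concrete, here is a minimal Python sketch.  It is not the code from Tomassetti's generator or from my port; the function names (candidate_syllables, analyze), the default maximum syllable length of 2, and the choice to build the alphabet from the word list itself are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def candidate_syllables(alphabet, max_len):
    """Every letter sequence over `alphabet` with length 1..max_len."""
    return [
        "".join(letters)
        for size in range(1, max_len + 1)
        for letters in product(alphabet, repeat=size)
    ]


def analyze(words, max_len=2):
    """Count how often each candidate occurs overall, at word starts, and at word ends.

    Assumption: the alphabet is simply the set of characters found in `words`.
    """
    alphabet = sorted({ch for word in words for ch in word})
    overall, starts, ends = Counter(), Counter(), Counter()
    for syllable in candidate_syllables(alphabet, max_len):
        for word in words:
            # Count every position where the candidate occurs, overlaps included.
            hits = sum(
                1
                for i in range(len(word) - len(syllable) + 1)
                if word.startswith(syllable, i)
            )
            if hits:
                overall[syllable] += hits
            if word.startswith(syllable):
                starts[syllable] += 1
            if word.endswith(syllable):
                ends[syllable] += 1
    return overall, starts, ends
```

Calling analyze(["belus", "belona"]) returns three counters; candidates that never occur in the word list are simply absent from them.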

The generation aspect requires the statistical data from the analysis aspect.  It has five basic steps.  The first is to decide upon the number of syllables for the new word.  The second step is to make a probabilistic choice of a starting syllable.  In the third step, if the number of syllables is greater than two, a loop runs that makes probabilistic selection(s) of additional syllable(s).  In the fourth step, a probabilistic selection of an ending syllable is made.  Finally, the selected syllables are combined in order to form a word.
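The five generation steps can be sketched in Python as well.  Again, this is only an illustration under assumptions of my own: the names weighted_choice and generate_word are hypothetical, the statistics are passed in as plain syllable-to-count mappings, and the syllable count is drawn uniformly between two bounds, which may not be how the original decides it.

```python
import random


def weighted_choice(counts, rng):
    """Pick one syllable with probability proportional to its count."""
    syllables = list(counts)
    weights = [counts[s] for s in syllables]
    return rng.choices(syllables, weights=weights, k=1)[0]


def generate_word(overall, starts, ends, rng=None, min_syllables=2, max_syllables=4):
    """Build one word from syllable statistics (mappings of syllable -> count)."""
    if rng is None:
        rng = random.Random()
    # Step 1: decide the number of syllables (assumed uniform between the bounds).
    count = rng.randint(min_syllables, max_syllables)
    # Step 2: probabilistic choice of a starting syllable.
    parts = [weighted_choice(starts, rng)]
    # Step 3: with more than two syllables, loop to pick the middle ones.
    for _ in range(count - 2):
        parts.append(weighted_choice(overall, rng))
    # Step 4: probabilistic choice of an ending syllable.
    parts.append(weighted_choice(ends, rng))
    # Step 5: combine the syllables in order.
    return "".join(parts)
```

Passing a seeded random.Random as rng makes the output repeatable, which is handy when comparing results between the Python original and a port.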

There are a number of additional features I plan to add in the future, either as part of this language generator or as something layered on top.  Nevertheless, the port of langgen will serve as a good starting point.  With a set of names from Celtic mythology it produced the list below, which I felt was acceptable.

suis
mata
aniois
brius
belona
belus
matiais
esveus
nerus
artirus
camviis
leio
aria
matiaus
caes
camvina
artios
caona
maos
canuus

Testing it, I think it needs some additional work to handle characters featuring umlauts, cedillas, accent symbols, etc.  Why do I think this?  Because when I use a source word list from a language featuring such characters (e.g. Norse, Polish), I get less desirable output, like so:

stdr
rarg
ragvall
inirrr
gudrndr
rgdr
gudrnn
guir
bjldrir
hieirn
raldr
arrr
ragarr
eirrra
frri
ridr
eildr
siglfrr
ragrr
rall

So adding support for those types of characters is my next priority.  At some point I'll get it into GitHub and make it available.
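I haven't confirmed the cause yet, but one possibility is that the alphabet is limited to unaccented a-z, so accented letters get dropped before the statistics are calculated.  Assuming that is the problem, a minimal Python sketch of one fix is to normalize each word to a composed Unicode form and build the alphabet from the letters actually present in the word list.  The names normalize_word and build_alphabet are hypothetical and are not part of langgen.

```python
import unicodedata


def normalize_word(word):
    """Lower-case a word and compose accents into single code points (NFC)."""
    return unicodedata.normalize("NFC", word.strip().lower())


def build_alphabet(words):
    """Collect every letter that occurs in the word list, accented ones included."""
    return sorted(
        {ch for word in words for ch in normalize_word(word) if ch.isalpha()}
    )
```

With NFC normalization, a "u" followed by a combining acute accent becomes the single letter "ú", so it is counted as one character when syllables are built.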

