Input data is passed to the Self-Organizing Map (SOM) data processor in the same way as to any other Davisor Chart data processor. All available data sources can be used with SOM, and there are also two special data sources designed for SOM use, namely normalize and memoryBuffer. Although the SOM processor can use any data source, the data itself must meet some requirements, listed below.
The SOM algorithm, the map initialization algorithms, and their implementations place a few requirements on the data.
To get good results, please also note the following recommendations.
The implemented SOM algorithm works with numeric data. If you have categorical variables, convert each category to a binary-valued variable. For example, a variable 'Operating System' with the values 'Linux', 'Windows', and 'Other' would be replaced with the variables 'OS_Linux', 'OS_Windows', and 'OS_Other', of which exactly one gets the value '1' and the others '0'. Free-text fields and documents need to be converted, for example to a vector space model, before they can be used with SOM. Ask Davisor Sales for more info and tools.
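The conversion of a categorical variable described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of Davisor Chart; the function name `one_hot` and the `cpu_count` column are hypothetical.

```python
def one_hot(rows, column, prefix):
    """Replace a categorical column with one 0/1 column per category."""
    # Collect every distinct category that occurs in the data.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy all other columns unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Exactly one of the new columns gets 1, the others get 0.
        for category in categories:
            new_row[f"{prefix}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical input rows: one numeric and one categorical variable.
rows = [
    {"cpu_count": 4, "Operating System": "Linux"},
    {"cpu_count": 2, "Operating System": "Windows"},
    {"cpu_count": 8, "Operating System": "Other"},
]
encoded = one_hot(rows, "Operating System", "OS")
```

After the conversion every row contains only numeric values, so it can be passed on to a numeric algorithm such as SOM.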
The SOM data source itself needs no special setup, but the data source nested inside it may need configuration.
The SOM data source can determine suitable configuration parameters from the input data, or they can be given explicitly. The parameters are given in nested 'init', 'train', 'label', and 'qerror' tags. The SOM data source syntax is:
<som source="String">
any data source
<init />
<train />
<label />
</som>
Here the 'source' attribute may refer to a SOM created earlier.
See Using Existing Maps.
The order of the nested elements is important, because each of them also
executes a function for the map. The recommended order is the one given above.
The input data source may be any of the Chart data sources or one of the special SOM data sources:
The map size, topology, neighborhood function, and the function used to create the initial values of the map are defined by the <init/> tag. Its syntax is:
<init width="int"
height="int"
topology="hexagon|rectangle"
neighborhood="bubble|gaussian"
initializer="linear|rand"
seed="long" />
The meanings of the attributes are:
A SOM can be trained in one or several phases. Typically, if the map is initialized using the linear initializer, one training phase is enough. When the random initializer is used, the map is usually trained twice: the first phase orders the map roughly using a large learning factor and a large initial radius, and the second phase does the finetuning. It is also possible to train an existing map with new or additional data, but this is not usually recommended. The syntax for the training parameters is:
<train rounds="int"
alpha="float"
radius="float"
alphaCalc="inverse_t|linear"
radiusCalc="inverse_t|linear" />
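The two values accepted by 'alphaCalc' and 'radiusCalc' name decay schedules for the learning factor and the neighborhood radius. The sketch below shows one common way such schedules are defined in SOM implementations. The exact formulas used by Davisor Chart are not stated here, so the constant `C = total / 100` in `inverse_t` and the decrease to zero in `linear` are assumptions.

```python
def linear(initial, t, total):
    """Decrease linearly from `initial` at step 0 to zero at step `total`."""
    return initial * (1.0 - t / total)


def inverse_t(initial, t, total):
    """Decrease as initial * C / (C + t); C = total / 100 is an assumed constant."""
    c = total / 100.0
    return initial * c / (c + t)


# Learning factor at a few steps of a hypothetical 1000-step training phase.
total_steps = 1000
for step in (0, 10, 500, 999):
    print(step, linear(0.5, step, total_steps), inverse_t(0.5, step, total_steps))
```

With these definitions the inverse-t schedule drops quickly at first and then flattens out, while the linear schedule decreases at a constant rate.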
If one train tag is given, the training is done once. If two tags are
given, the training is done twice: the first pass is the 'rough' training
and the second is the 'finetuning'. The default values of the train tag attributes depend on the training type. The meanings of the attributes are:
Within one training session the data is often used over and over again. If the amount of input data is not too large, training can be sped up by buffering the data in memory. However, the sheer amount of training data can rule out this approach. By default, data is always fetched from the original data source. To enable memory buffering, pipe the input data through the <memorybuffer> data source.
<som>
<memorybuffer>
any data source
</memorybuffer>
</som>
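The idea behind memory buffering can be illustrated with a small Python sketch: the rows are fetched from the original source once and every later pass reads them from memory. The class `MemoryBuffer` and the function `slow_source` are hypothetical names used only for this illustration; they do not describe how the <memorybuffer> data source is implemented.

```python
class MemoryBuffer:
    """Fetch rows from the wrapped source once and replay them from memory."""

    def __init__(self, fetch_rows):
        self._fetch_rows = fetch_rows  # callable returning an iterable of rows
        self._rows = None              # filled on the first pass

    def __iter__(self):
        if self._rows is None:
            # First pass: read everything from the original source.
            self._rows = list(self._fetch_rows())
        return iter(self._rows)


calls = {"count": 0}  # counts how often the original source is read


def slow_source():
    """Stand-in for an expensive data source such as a database query."""
    calls["count"] += 1
    return [[1.0, 2.0], [3.0, 4.0]]


buffered = MemoryBuffer(slow_source)
for _ in range(3):  # three training rounds over the same data
    rows = list(buffered)
```

The trade-off is the one described above: all rows must fit in memory at the same time.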
In map labeling (sometimes also called calibration), the winning node is searched for each data row, and the label of the data row is added to that node. Labeling can be used to
<label clearLabels="true|false" />
If clearLabels is set to 'true', all existing labels are removed. When labeling for the first time, there are no labels, so there is no need to clear them.
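The labeling step can be sketched as follows. The sketch assumes that the winning node is the node whose weight vector has the smallest Euclidean distance to the data row, which is the usual SOM convention; the names `winning_node`, `label_map`, and `clear_labels` are hypothetical and only mirror the behavior described above.

```python
def winning_node(codebook, vector):
    """Return the key of the node whose weights are closest to `vector`."""
    def squared_distance(weights):
        return sum((w - x) ** 2 for w, x in zip(weights, vector))
    return min(codebook, key=lambda node: squared_distance(codebook[node]))


def label_map(codebook, rows, labels, clear_labels=False):
    """Add each row's label to the label list of its winning node."""
    if clear_labels:
        labels.clear()  # drop labels left over from an earlier labeling
    for label, vector in rows:
        labels.setdefault(winning_node(codebook, vector), []).append(label)
    return labels


# A hypothetical 1 x 2 map and three labeled data rows.
codebook = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
rows = [("A", [0.1, 0.2]), ("B", [0.9, 0.8]), ("C", [0.2, 0.0])]
labels = label_map(codebook, rows, {})
```

Labeling the same map again without clearing would add every label a second time, which is why clearing matters only from the second labeling onwards.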
Examples of simple SOMs.
Example 1. A SOM created in the simplest way.
The simplest way to create a SOM is to wrap a data source inside a <som> element and select one of the SOM chartTypes. Here the colorized Sammon's mapping is used. The system determines the SOM parameters from the available data. Because there are only a few rows, the map size is 3 x 3, which is the smallest map size produced by the automatic parameter detection procedure.
If you render the SOM yourself, you will notice that the map coloring and the country positions change on every run. This is due to the random factor in the map initialization. However, when you update the SOM several times, you will see that the information in the map stays the same: Sweden lies between the USA and Finland, and Finland lies between Sweden and the UK. Note that the map colors count for more than distances measured with a ruler. For example, in the image below (produced with the same XML as the image above), Sweden belongs to the green group, which is located between the blue USA group and the brown group of Finland and the UK (where Finland is between Sweden and the UK).
The same result as in the image above!
In this first example the data is used without modification. This means that the variable "GNI per capita" has much more impact on the map ordering than the other variables. It is OK to create a SOM this way, especially if the variables share the same unit (e.g. Euros or Dollars) and the order of magnitude is important. However, when you examine the SOM, remember that the variables had different weights. If you want to create a SOM where each variable counts equally, use the <normalize> data source as shown in the next example.
Example 2. A SOM created from a normalized data source.
This is like the first example, but now the data columns are normalized, which naturally affects the results. Taking all the values into account, the Nordic countries, Finland and Sweden, are now quite similar to each other when compared with the USA and the UK.
Here too, the map varies a bit from run to run, but Finland and Sweden always form a group separate from the others. Sometimes these two countries lie in the same cell and sometimes in adjacent cells with the same color.
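The effect of column normalization can be illustrated with a short Python sketch. Which normalization the <normalize> data source applies is not stated here; the sketch assumes scaling each column to zero mean and unit standard deviation, and the input values are made up for the illustration.

```python
from statistics import mean, pstdev


def normalize_columns(rows):
    """Scale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # A constant column has zero deviation; divide by 1.0 to avoid a zero division.
    deviations = [pstdev(column) or 1.0 for column in columns]
    return [
        [(value - m) / d for value, m, d in zip(row, means, deviations)]
        for row in rows
    ]


# Hypothetical rows: a large-valued column next to a small-valued one.
rows = [[40000.0, 2.0], [30000.0, 4.0], [20000.0, 6.0]]
normalized = normalize_columns(rows)
```

After this step both columns span the same range, so neither dominates the distance calculation merely because of its unit.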
Example 3. A typical way to create a SOM using the linear initializer and one training phase.
The third and fourth examples show how the user can set the SOM parameters explicitly. If the linear initializer is used, one training phase is enough; otherwise it is good to run both the 'rough' and 'finetuning' phases.
In these examples the (pseudo) random number generator is initialized using the same seed, so the resulting SOM map will remain the same from run to run.
Example 4. A SOM created using the random initializer and two training phases.