Normalization, in this context, means putting data columns on a common scale so that each column contributes equally when distances between data vectors are compared (Euclidean distance). For example, consider a case where data vectors contain weight measurements of similar objects, but some columns hold weights in kilograms and others in grams. If normalization is omitted, values in grams will show larger numeric differences, and the gram columns will dominate the distance. If normalization is carried out, gram and kilogram columns have equal impact on the distance, regardless of the units used.
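The effect of mixed units can be illustrated with a small Python sketch. The two objects and their measurements below are hypothetical values chosen for illustration; they are not taken from any real data set.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical objects: (first weight in kilograms, second weight in grams)
obj_a = (1.0, 1000.0)
obj_b = (2.0, 1100.0)

kg_diff = abs(obj_a[0] - obj_b[0])  # 1 kg difference
g_diff = abs(obj_a[1] - obj_b[1])   # 100 g difference (only 0.1 kg)

dist = euclidean(obj_a, obj_b)

# Fraction of the squared distance that comes from the gram column
gram_share = g_diff ** 2 / dist ** 2
print(f"distance = {dist:.3f}, gram column share = {gram_share:.4f}")
```

Although the kilogram column holds the physically larger difference, nearly all of the distance comes from the gram column, simply because its numbers are larger.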
Normalization is done by calculating the average and standard deviation of each input data column. The column average is then subtracted from each data value, and the result is divided by the column's standard deviation.
In addition to the basic approach, individual columns can be weighted. For example, if some columns are more interesting than others, they can be given a scalar weight. The values in those columns are then multiplied by this weight.
Only data columns with numeric values are normalized; other data columns are passed through as is. Normalization of a numeric column can also be disabled by giving that column a negative weight.
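The rules above can be sketched in Python as follows. This is only an illustration, not the tool's actual implementation: the function name `normalize_columns`, the use of the population standard deviation, applying the weight after the division, and the handling of constant columns are all assumptions made for this sketch.

```python
import math

def normalize_columns(rows, weights=None):
    """Normalize the numeric columns of `rows` (a list of equal-length lists).

    `weights` maps a column index to a scalar weight (default 1.0).
    A negative weight leaves the column untouched.
    """
    weights = weights or {}
    out = [list(r) for r in rows]
    for c in range(len(rows[0])):
        col = [r[c] for r in rows]
        w = weights.get(c, 1.0)
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in col)
        # Non-numeric columns and negatively weighted columns pass through as is.
        if w < 0 or not numeric:
            continue
        mean = sum(col) / len(col)
        # Assumption: population standard deviation (divide by N).
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        for i, v in enumerate(col):
            # Assumption: a constant column (std == 0) is only centred to zero.
            z = (v - mean) / std if std > 0 else 0.0
            # Assumption: the weight is applied after normalization.
            out[i][c] = w * z
    return out

# Hypothetical input: kilograms, grams, and a text label column.
rows = [[1.0, 1000.0, "a"], [2.0, 2000.0, "b"], [3.0, 3000.0, "c"]]
result = normalize_columns(rows, weights={1: 2.0})
untouched = normalize_columns(rows, weights={0: -1.0})
```

In `result`, the kilogram and gram columns end up on the same scale before weighting, so the gram column is exactly twice the kilogram column because of its weight of 2. In `untouched`, the first column keeps its original values because its weight is negative.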
Input data is provided by adding a data source import configuration inside the <normalize> element:
<normalize>
<csv source="..."/>
</normalize>
By default, the output data contains as many columns as the input data. To select only some of the input data columns, define <normalizeColumn/> elements:
<normalize>
<normalizeColumn column="data1" />
<normalizeColumn column="data2" weight="2" />
<normalizeColumn column="labels" weight="-1" />
<csv source="...">
<csvcolumn column="1" id="data1" />
<csvcolumn column="2" id="data2" />
<csvcolumn column="3" id="data3" />
<csvcolumn column="4" id="labels" />
</csv>
</normalize>
Below is a visualization of how normalization modifies the data:
Figure: Original data before normalization.
Figure: Normalized data. Notice that the average has been removed from both columns and the data is divided columnwise by the standard deviation.