Outlier

An outlier is a data value that lies far from the typical population of the data. In other words, it is an extreme value. For normally distributed data, values that are 3 or more standard deviations from the mean are typically treated as outliers. Outliers are usually replaced either with values that are not extreme or with Null.


Note:

You can define outlier treatments for numeric columns only.

To define an Outlier transformation:

  1. In the Transform Type field, select the option Outlier.

  2. In the Outlier Type field, select any one of the following options:

    • Standard Deviation: This is the default outlier type. For this type, specify the number of standard deviations that defines an outlier in the following field:

      • Multiples of Sigma: The number of standard deviations from the mean beyond which a value is an outlier.
        The default is 3, that is, 3 standard deviations.
        With the default of 3, an outlier is a value less than mean - 3 * standard deviation or greater than mean + 3 * standard deviation.

    • Percent: Enables you to specify that outliers are the values in a bottom percentage and a top percentage of the data. By default, outliers are the values in the bottom 5 percent or in the top 5 percent. You can change the defaults by entering values in these fields:

      • Lower Percent Value

      • Upper Percent Value

    • Value: Enables you to specify a lower value and an upper value, so that outliers are the values less than the lower value or greater than the upper value.
      You can change these values, but the upper value must be greater than the lower value.

      • Lower Value: If statistics are available, then the default is -3 * standard deviation.
        If statistics are not available, then the default is 0.

      • Upper Value: If statistics are available, then the default is +3 * standard deviation.
        If statistics are not available, then the default is 1.

  3. In the Replace With field, select an option to specify how to replace outliers. The options are:

    • Null (Default)

    • Edge Value

      For example, if the mean of a column distribution is 10 and the standard deviation is 5, then with the default of 3 standard deviations the outliers are:

      • Values that are less than -5, that is, mean - 3 * standard deviation

      • Values that are greater than 25, that is, mean + 3 * standard deviation

      Suppose that the column contains the value -10, which is an outlier. You can replace -10 with Null or with -5, which is the edge value.

  4. After you are done, click OK.
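
The outlier types and replacement options described in the steps above can be sketched in Python. This is an illustrative sketch only, not the product's implementation. The function names (sigma_bounds, percent_bounds, replace_outliers) are hypothetical, the use of the population standard deviation is an assumption, and the percent cut points use a simple nearest-rank rule that may differ from the rule the product applies.

```python
import statistics
from typing import List, Optional, Sequence, Tuple


def sigma_bounds(values: Sequence[float], multiples: float = 3.0) -> Tuple[float, float]:
    """Standard Deviation type: mean -/+ multiples * standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation; the product may use the sample form.
    sd = statistics.pstdev(values)
    return mean - multiples * sd, mean + multiples * sd


def percent_bounds(values: Sequence[float],
                   lower_pct: float = 5.0,
                   upper_pct: float = 5.0) -> Tuple[float, float]:
    """Percent type: cut points that isolate the bottom and top percentages."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: nearest-rank cut points; values strictly outside them are outliers.
    lo_idx = int(n * lower_pct / 100)
    hi_idx = n - 1 - int(n * upper_pct / 100)
    return ordered[lo_idx], ordered[hi_idx]


def replace_outliers(values: Sequence[float],
                     lower: float,
                     upper: float,
                     replace_with: str = "null") -> List[Optional[float]]:
    """Replace values below lower or above upper with None or the edge value."""
    if upper <= lower:
        raise ValueError("the upper value must be greater than the lower value")
    result: List[Optional[float]] = []
    for v in values:
        if v < lower:
            result.append(lower if replace_with == "edge" else None)
        elif v > upper:
            result.append(upper if replace_with == "edge" else None)
        else:
            result.append(v)
    return result


# Bounds of -5 and 25, as in the Edge Value example above.
column = [-10, 0, 10, 30]
print(replace_outliers(column, -5, 25, "null"))  # [None, 0, 10, None]
print(replace_outliers(column, -5, 25, "edge"))  # [-5, 0, 10, 25]
```

In this sketch, the Value type needs no helper of its own: the lower and upper values are passed directly to replace_outliers, as in the usage lines at the end.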