If you are connected to Oracle Database 12c, the Text tab in the Edit Model Build dialog box enables you to specify text characteristics.
If you specify text characteristics on the Text tab, then you do not need to use Text nodes.
Note: If you are connected to Oracle Database 11g Release 2 (11.2) or earlier, then use Text nodes. The Text tab is not available in Oracle Database 11g Release 2 and earlier.
Text is available for any of the following data types: CHAR, VARCHAR2, BLOB, CLOB, NCHAR, or NVARCHAR2.
To examine or specify text characteristics for data mining, either double-click the build node or right-click the node and select Edit from the context menu. Click the Text tab.
The Text tab enables you to modify the following:
Categorical cutoff value: Sets the cutoff used to determine whether a column is treated as a Text or Categorical mining type. The cutoff value must be an integer from 10 to 4000, inclusive. The default value is 200.
Default Transform Type: Specifies the default transformation type for column-level text settings. The values are:
Token (Default): If Token is selected, the Default Settings are as follows:
Languages: Specifies the languages used in the documents. The default is English. To change this value, select an option from the drop-down list. You can select more than one language.
Stemming: By default, this option is not selected.
Not all languages support stemming. If the language selected is English, Dutch, French, German, Italian, or Spanish, then stemming is automatically enabled.
If Stemming is enabled, then stemmed words are returned for supported languages. Otherwise the original words are returned.
Stoplists: Specifies the stoplist to use. The default setting is to use the default stoplist. You can add stoplists or edit stoplists.
If you select more than one language and the selected stoplist is Default, then the default stop words for the selected languages are added to the default stoplist from the repository. No duplicate stop words are added.
Tokens: Specifies the maximum number of tokens across all documents. The default is 3000.
Theme: If Theme is selected, then the Default Settings are as follows:
Language: Specifies the languages used in the documents. The default is English. To change this value, select an option from the drop-down list. You can select more than one language.
Stoplists: Specifies the stoplist to use. The default setting is to use the default stoplist. You can add stoplists or edit stoplists.
If you select more than one language and the selected stoplist is Default, then the default stop words for the selected languages are added to the default stoplist from the repository. No duplicate stop words are added.
Themes: Specifies the maximum number of themes across all documents. The default is 3000.
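The constraints and defaults listed above can be summarized in a short Python sketch. This is only an illustration of the documented values, not an Oracle Data Miner API: the class `TextSettings`, its field names, and the single `max_features` field (standing in for both the Tokens and Themes limits) are hypothetical.

```python
from dataclasses import dataclass, field

# Languages that the text above lists as supporting stemming.
STEMMING_LANGUAGES = {"English", "Dutch", "French", "German", "Italian", "Spanish"}

# Documented range for the categorical cutoff value (inclusive).
CUTOFF_MIN = 10
CUTOFF_MAX = 4000


@dataclass
class TextSettings:
    """Hypothetical in-memory model of the Text tab defaults (not an Oracle API)."""

    categorical_cutoff: int = 200          # documented default
    transform_type: str = "Token"          # "Token" (default) or "Theme"
    languages: list = field(default_factory=lambda: ["English"])
    max_features: int = 3000               # maximum tokens or themes across all documents

    def __post_init__(self):
        # The cutoff must be an integer; bool is rejected explicitly
        # because Python treats it as a subclass of int.
        if isinstance(self.categorical_cutoff, bool) or not isinstance(
            self.categorical_cutoff, int
        ):
            raise TypeError("categorical cutoff must be an integer")
        if not CUTOFF_MIN <= self.categorical_cutoff <= CUTOFF_MAX:
            raise ValueError(
                f"categorical cutoff must be between {CUTOFF_MIN} and {CUTOFF_MAX}"
            )
        if self.transform_type not in ("Token", "Theme"):
            raise ValueError("transform type must be 'Token' or 'Theme'")

    def stemming_languages(self):
        """Return the selected languages for which stemming is supported."""
        return [lang for lang in self.languages if lang in STEMMING_LANGUAGES]
```

For example, `TextSettings(categorical_cutoff=9)` raises `ValueError`, while `TextSettings(languages=["English", "Japanese"]).stemming_languages()` returns only the languages from the supported list.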
Click Stoplists to open the Stoplist Editor. You can view, edit, and create stoplists.
You can use the same stoplist for all text columns.
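The rule for the Default stoplist with multiple languages (stop words for each selected language are added, and no duplicates are added) can be sketched as follows. The function name `merge_stoplists` and the sample stop words are hypothetical and chosen only for illustration. The sketch also assumes exact, case-sensitive matching, which the text above does not specify.

```python
def merge_stoplists(default_stoplist, language_stop_words, selected_languages):
    """Add each selected language's stop words to the default stoplist.

    Words already present are skipped, so the result has no duplicates.
    Matching is exact and case-sensitive (an assumption of this sketch).
    """
    merged = []
    seen = set()

    # Start from the default stoplist, preserving its order.
    for word in default_stoplist:
        if word not in seen:
            seen.add(word)
            merged.append(word)

    # Append the stop words of every selected language, skipping duplicates.
    for language in selected_languages:
        for word in language_stop_words.get(language, []):
            if word not in seen:
                seen.add(word)
                merged.append(word)

    return merged


# Illustrative data only; these are not the actual repository stoplists.
sample_stop_words = {
    "English": ["the", "a", "of"],
    "French": ["le", "la", "a"],
}

merged = merge_stoplists(["the", "a"], sample_stop_words, ["English", "French"])
print(merged)  # ['the', 'a', 'of', 'le', 'la']
```

Tracking seen words in a set keeps the duplicate check fast while the list preserves the order in which stop words were added.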