The Add/Edit Text Transform dialog box can be opened from the Edit Build Text Node dialog box. To open or edit a text transform, click
The default values for the transformation are illustrated in this graphic:

(Illustration: textxformedit.png)
Source Column: This is the name of the column to be transformed.
Transform Type: This is either Token (the default) or Theme.
Output Column: This is the name of the new column. The default name is the source column name with either TOK (for Token) or THM (for Theme) appended, depending on the transformation type. To specify the output column name, deselect Automatic and enter a name in the Output Column field.
In the Settings section, specify characteristics of the text and the transform:
Language: Select any one of the following options:
Single Language: By default, a single language is specified. English is the default language. You can select a different language.
Multiple Language: Select this option to specify multiple languages. For example, to specify Single Byte languages, such as Arabic, Turkish, Thai, and European languages, select them from the Single Byte list.
To specify Multibyte languages, such as Chinese (simplified or traditional), Japanese, or Korean, select them from the Multibyte list.
Stoplist: Oracle Text provides default stoplists for several single languages. If there is a default stoplist, it is selected. For several languages, the default is no stoplist. You can select any stoplist that was previously created for this attribute from the drop-down list. You can perform the following tasks:
Edit a Stoplist: To edit a stoplist, click the corresponding icon. The Stoplist Editor opens.
Add a Stoplist: To add a stoplist, click the corresponding icon. The Stoplist Editor opens.
Token: If you select Token, the defaults are:
Maximum number per document: 50 (default)
Maximum number across all documents: 3000 (default)
You can change these values. The per-document and across-all-documents cutoffs apply to rankings, not to an absolute count of tokens. If there are ties, you could have more than 3000 tokens across all documents.
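The difference between a rank cutoff and an absolute count can be sketched in a few lines of Python. This is only an illustration, not the Oracle Data Miner implementation: the function name top_ranked, the use of raw counts as the ranking score, and the sample tokens are all assumptions made for the example.

```python
from collections import Counter

def top_ranked(counts, max_rank):
    """Keep every item whose score reaches the score found at position max_rank.

    Items tied with the item at the cutoff position are all kept, so the
    result can contain more than max_rank items.
    (Hypothetical helper; not part of any Oracle API.)
    """
    if max_rank <= 0 or not counts:
        return {}
    # Scores in descending order; the cutoff is the score at position max_rank.
    scores = sorted(counts.values(), reverse=True)
    cutoff = scores[min(max_rank, len(scores)) - 1]
    return {item: c for item, c in counts.items() if c >= cutoff}

# Illustrative tokens: two appear twice, three appear once.
counts = Counter("data mining text data model text oracle".split())

kept_top3 = top_ranked(counts, 3)  # third place is a three-way tie
kept_top2 = top_ranked(counts, 2)  # no tie at the cutoff
print(len(kept_top3), sorted(kept_top2))
```

With a cutoff of 3, the three tokens tied for third place are all retained, so five tokens are kept. The same effect can push the total above 3000 across all documents.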
Theme: If you select Theme, the defaults are:
Maximum number per document: 50 (default)
Maximum number across all documents: 3000 (default)
You can change these values. The per-document and across-all-documents cutoffs apply to rankings, not to an absolute count of themes. If there are ties, you could have more than 3000 themes across all documents.
Theme includes a Theme Type specification. The default is Single. You can select Full.
Frequency: The default is Term Frequency. You can select Term Frequency IDF.
Note: Frequency is a sticky setting. If you change it, then the changed value becomes the default.
Term Frequency uses the term frequency in the document itself. It does not take collection information into account.
Term Frequency IDF is the traditional TF-IDF. It takes into account information from the document (Term Frequency) and collection-level information (IDF plus the terms to use if a maximum overall number of terms for the collection is set).
TF-IDF (Term Frequency–Inverse Document Frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the collection.
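As a rough illustration of how the two Frequency options differ, the following Python sketch computes one common textbook form of the weight: raw term count multiplied by log(N / document frequency). The exact formula used by Oracle Data Miner is not stated here, so this variant, the function name tf_idf, and the sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses weight = tf * log(N / df), a common textbook variant.
    (Assumed formula for illustration; not taken from Oracle documentation.)
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Illustrative, already tokenized documents.
docs = [
    ["oracle", "text", "mining", "text"],
    ["oracle", "data", "miner"],
    ["oracle", "stoplist"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this sketch, "oracle" appears in every document and receives a weight of zero. Under plain Term Frequency it would be weighted by its count alone. "text" appears twice in one document and nowhere else, so it receives the highest weight in that document.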