class static analytics.Tokenizer
Source: analyticsdoc.
Breaks text into tokens (i.e. words).
new Tokenizer([arg])
Tokenizer
Example
// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer({ type: "simple" });
Parameter
| Name | Type | Optional | Description |
|---|---|---|---|
| arg | | Yes | Construction arguments. If arg is not given it uses the |
Methods
getParagraphs(str) → Array of String
Breaks string into paragraphs.
Example
// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer();
// string you wish to break into paragraphs
var string = "Yes!\t No?\n Maybe...";
// break the text into paragraphs using getParagraphs
var tokens = tokenizer.getParagraphs(string);
// tokens is ["Yes", " No", " Maybe"]
Parameter
| Name | Type | Optional | Description |
|---|---|---|---|
| str | String | No | String given to break into paragraphs. |
Returns
Array of String. Array of the paragraphs found in str. The function starts a new paragraph wherever it detects one of the escape sequences '\n', '\r' or '\t'.
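The splitting rule described above can be approximated in plain JavaScript. This is only a sketch of the stated rule, not qminer's implementation: the helper name splitParagraphs is hypothetical, and unlike the example output above it does not strip punctuation from the paragraphs.

```javascript
// Hypothetical helper (not part of qminer): approximates the documented rule
// that '\n', '\r' and '\t' each start a new paragraph.
function splitParagraphs(str) {
    return str
        .split(/[\n\r\t]+/) // break at runs of the three escape characters
        .filter(function (p) { return p.length > 0; }); // drop empty pieces
}

var paragraphs = splitParagraphs("Yes!\t No?\n Maybe...");
console.log(paragraphs.length); // 3
```

Note that the space after each escape character stays at the start of the following paragraph, which matches the leading spaces in the example output above.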
getSentences(str) → Array of String
Breaks string into sentences.
Example
// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer();
// string you wish to break into sentences
var string = "C++? Alright. Let's do this!";
// break the text into sentences using getSentences
var tokens = tokenizer.getSentences(string);
// tokens is ["C++", " Alright", " Let's do this"]
Parameter
| Name | Type | Optional | Description |
|---|---|---|---|
| str | String | No | String given to break into sentences. |
Returns
Array of String. Array of the sentences found in str. The function ends a sentence at every full stop, exclamation mark, question mark or new-line character. Careful: the whitespace between sentences is not ignored, so it is kept at the start of the following sentence.
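To make the rule above concrete, here is a minimal plain-JavaScript sketch that splits at the same four delimiters. It is an approximation under that stated rule, not the library's code, and the helper name splitSentences is hypothetical.

```javascript
// Hypothetical helper (not part of qminer): splits at full stops,
// exclamation marks, question marks and new-line characters.
function splitSentences(str) {
    return str
        .split(/[.!?\n]+/) // break at runs of sentence-ending characters
        .filter(function (s) { return s.length > 0; }); // drop empty pieces
}

var sentences = splitSentences("C++? Alright. Let's do this!");
// sentences is ["C++", " Alright", " Let's do this"]
console.log(sentences.length); // 3
```

The second and third sentences keep their leading space, which is the behavior the "Careful" note warns about. Calling trim() on each element is one way to remove it if it is not wanted.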
getTokens(str) → Array of String
Tokenizes given string.
Example
// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer();
// string you wish to tokenize
var string = "What a beautiful day!";
// tokenize string using getTokens
var tokens = tokenizer.getTokens(string);
// tokens is ["What", "a", "beautiful", "day"]
Parameter
| Name | Type | Optional | Description |
|---|---|---|---|
| str | String | No | String given to tokenize. |
Returns
Array of String. Array of the tokens (words) found in str. Only words are kept; all punctuation is skipped. How contractions (e.g. don't) are tokenized depends on the tokenizer type. For example, type 'html' breaks a contraction into 2 tokens.
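The difference between keeping a contraction whole and breaking it into 2 tokens can be illustrated with a small plain-JavaScript sketch. Everything here is an assumption for illustration: the helper tokenizeWords and its keepContractions flag are hypothetical, the regular expressions only cover ASCII letters and digits, and the sketch does not reproduce how any particular qminer tokenizer type is implemented.

```javascript
// Hypothetical helper (not part of qminer): keeps only words and skips
// punctuation. The flag controls whether an apostrophe is treated as part
// of a word (contraction kept whole) or as a separator (contraction split).
function tokenizeWords(str, keepContractions) {
    var pattern = keepContractions ? /[A-Za-z0-9']+/g : /[A-Za-z0-9]+/g;
    return str.match(pattern) || []; // match() returns null when nothing matches
}

var whole = tokenizeWords("I don't know!", true);
// whole is ["I", "don't", "know"]
var split = tokenizeWords("I don't know!", false);
// split is ["I", "don", "t", "know"]
console.log(whole.length, split.length); // 3 4
```

For input without contractions, such as "What a beautiful day!", both modes give the same four tokens.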