analytics.Tokenizer
Source: analyticsdoc.
Breaks text into tokens (i.e. words).
new Tokenizer([arg])
Tokenizer
Example
// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer({ type: "simple" });
Parameter
Name | Type | Optional | Description |
---|---|---|---|
arg | | Yes | Construction arguments. If arg is not given it uses the |
Methods
getParagraphs(str) → Array of String
Breaks string into paragraphs.
Example
// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer();
// string you wish to tokenize
var string = "Yes!\t No?\n Maybe...";
// tokenize text using getParagraphs
var tokens = tokenizer.getParagraphs(string);
// output:
tokens = ["Yes", " No", " Maybe"];
Parameter
Name | Type | Optional | Description |
---|---|---|---|
str | String | | String given to break into paragraphs. |
Returns
Array of String
Array of paragraphs. The number of paragraphs equals the number of paragraphs in the input str. When the function detects one of the escape sequences '\n', '\r' or '\t', it starts a new paragraph.
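The paragraph rule above can be approximated in plain JavaScript. This is only a sketch of the splitting rule, not QMiner's implementation: the helper name splitParagraphsSketch is hypothetical, dropping empty pieces is an assumption, and unlike the documented example output it leaves punctuation in place.

```javascript
// Hypothetical helper, not part of QMiner: approximates the documented rule
// that '\n', '\r' and '\t' each start a new paragraph.
function splitParagraphsSketch(str) {
    // Split on runs of the three escape characters.
    return str.split(/[\n\r\t]+/).filter(function (part) {
        // Assumption: empty pieces are not reported as paragraphs.
        return part.length > 0;
    });
}

var paragraphSketch = splitParagraphsSketch("Yes!\t No?\n Maybe...");
console.log(paragraphSketch); // [ 'Yes!', ' No?', ' Maybe...' ]
```

The leading spaces survive because only the three escape characters act as separators in this sketch.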
getSentences(str) → Array of String
Breaks string into sentences.
Example
// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer();
// string you wish to tokenize
var string = "C++? Alright. Let's do this!";
// tokenize text using getSentences
var tokens = tokenizer.getSentences(string);
// output:
tokens = ["C++", " Alright", " Let's do this"];
Parameter
Name | Type | Optional | Description |
---|---|---|---|
str | String | | String given to break into sentences. |
Returns
Array of String
Array of sentences. The number of sentences equals the number of sentences in the input str. Where the function breaks sentences depends on where a full stop, exclamation mark, question mark or newline character appears. Careful: the space between sentences is not ignored, so it stays in the output (as in " Alright" in the example above).
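The sentence rule above can be approximated in plain JavaScript. This is a sketch under stated assumptions rather than QMiner's implementation: the helper name splitSentencesSketch is hypothetical, and treating a run of terminators as a single break is an assumption.

```javascript
// Hypothetical helper, not part of QMiner: approximates the documented rule
// that a full stop, exclamation mark, question mark or newline ends a sentence.
function splitSentencesSketch(str) {
    // Assumption: consecutive terminators ("...") count as one break.
    return str.split(/[.!?\n]+/).filter(function (part) {
        // Drop the empty piece left after a trailing terminator.
        return part.length > 0;
    });
}

var sentenceSketch = splitSentencesSketch("C++? Alright. Let's do this!");
console.log(sentenceSketch); // [ 'C++', ' Alright', " Let's do this" ]
```

The space after each terminator is kept at the start of the next sentence, which mirrors the warning about spaces not being ignored.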
getTokens(str) → Array of String
Tokenizes given string.
Example
// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer();
// string you wish to tokenize
var string = "What a beautiful day!";
// tokenize string using getTokens
var tokens = tokenizer.getTokens(string);
// output:
tokens = ["What", "a", "beautiful", "day"];
Parameter
Name | Type | Optional | Description |
---|---|---|---|
str | String | | String given to tokenize. |
Returns
Array of String
Array of tokens. The number of tokens equals the number of words in the input str. Only words are kept; all punctuation is skipped. How contractions (e.g. don't) are tokenized depends on the tokenizer type. For example, type 'html' breaks contractions into 2 tokens.
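The effect of contraction handling on the token count can be illustrated in plain JavaScript. This is a sketch, not QMiner's implementation: the helper tokenizeSketch and its splitContractions flag are hypothetical stand-ins for the behavioural difference between tokenizer types, and the patterns only cover ASCII letters and digits.

```javascript
// Hypothetical helper, not part of QMiner: keeps words and skips punctuation.
// splitContractions is an illustrative flag, not a QMiner option.
function tokenizeSketch(str, splitContractions) {
    var pattern = splitContractions
        ? /[A-Za-z0-9]+/g                      // apostrophe acts as a separator
        : /[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*/g;   // apostrophe stays inside a word
    // match() returns null when nothing matches, so fall back to an empty array.
    return str.match(pattern) || [];
}

console.log(tokenizeSketch("What a beautiful day!", false)); // [ 'What', 'a', 'beautiful', 'day' ]
console.log(tokenizeSketch("don't", false));                 // [ "don't" ]
console.log(tokenizeSketch("don't", true));                  // [ 'don', 't' ]
```

With the flag set, a contraction yields two tokens, which is the behaviour the description attributes to type 'html'.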