new Tokenizer([arg])

Tokenizer

Example

// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object
var tokenizer = new analytics.Tokenizer({ type: "simple" });

Parameter

Name: arg

Type: module:analytics~tokenizerParam

Optional: Yes

Description: Construction arguments. If arg is omitted, the 'unicode' tokenizer type is used.

Methods

getParagraphs(str) → Array of String

Breaks string into paragraphs.

Example

// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object with the default settings
var tokenizer = new analytics.Tokenizer();
// string you wish to break into paragraphs
var string = "Yes!\t No?\n Maybe...";
// break the string into paragraphs using getParagraphs
var paragraphs = tokenizer.getParagraphs(string);
// paragraphs is ["Yes", " No", " Maybe"]

Parameter

Name: str

Type: String

Optional: No

Description: The string to break into paragraphs.

Returns

Array of String. Array of the paragraphs found in str. A new paragraph starts wherever the function detects one of the escape sequences '\n', '\r' or '\t'.
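The splitting rule described above can be approximated in plain JavaScript, without qminer, to make the behaviour easier to reason about. This is only an illustrative sketch: splitParagraphs is a hypothetical helper, not part of the qminer API, and the removal of trailing punctuation is an assumption inferred from the example output rather than documented behaviour.

```javascript
// Hypothetical helper, NOT part of qminer: approximates the documented
// paragraph-splitting rule for illustration only.
function splitParagraphs(str) {
    return str
        // start a new paragraph at every run of '\n', '\r' or '\t'
        .split(/[\n\r\t]+/)
        // assumption (inferred from the example output): trailing '.', '!', '?' are dropped
        .map(function (p) { return p.replace(/[.!?]+$/, ""); })
        // discard pieces that are empty after stripping
        .filter(function (p) { return p.length > 0; });
}

var paragraphs = splitParagraphs("Yes!\t No?\n Maybe...");
console.log(paragraphs); // [ 'Yes', ' No', ' Maybe' ]
```

Note that the leading space of " No" and " Maybe" survives, matching the example for getParagraphs.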

getSentences(str) → Array of String

Breaks string into sentences.

Example

// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object with the default settings
var tokenizer = new analytics.Tokenizer();
// string you wish to break into sentences
var string = "C++? Alright. Let's do this!";
// break the string into sentences using getSentences
var sentences = tokenizer.getSentences(string);
// sentences is ["C++", " Alright", " Let's do this"]

Parameter

Name: str

Type: String

Optional: No

Description: The string to break into sentences.

Returns

Array of String. Array of the sentences found in str. A sentence ends at a full stop, exclamation mark, question mark or newline character. Careful: whitespace that follows a sentence break is not ignored, so it is kept at the start of the next sentence.
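The sentence-breaking rule described above can be mimicked in plain JavaScript. The sketch below is an approximation for illustration, not qminer's implementation; splitSentences is a hypothetical name, and treating a run of consecutive delimiters as a single break is an assumption.

```javascript
// Hypothetical helper, NOT part of qminer: splits on the delimiters named in
// the description (full stop, exclamation mark, question mark, newline).
function splitSentences(str) {
    return str
        // assumption: consecutive delimiters such as "..." count as one break
        .split(/[.!?\n]+/)
        // the split leaves an empty string after a final delimiter; drop it
        .filter(function (s) { return s.length > 0; });
}

var sentences = splitSentences("C++? Alright. Let's do this!");
console.log(sentences); // [ 'C++', ' Alright', " Let's do this" ]
```

The space after each delimiter stays attached to the following sentence. A caller who does not want it could trim each element, for example with sentences.map(function (s) { return s.trim(); }).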

getTokens(str) → Array of String

Tokenizes given string.

Example

// import analytics module
var analytics = require('qminer').analytics;
// construct Tokenizer object with the default settings
var tokenizer = new analytics.Tokenizer();
// string you wish to tokenize
var string = "What a beautiful day!";
// tokenize the string using getTokens
var tokens = tokenizer.getTokens(string);
// tokens is ["What", "a", "beautiful", "day"]

Parameter

Name: str

Type: String

Optional: No

Description: The string to tokenize.

Returns

Array of String. Array of tokens, one for each word in str. Only words are kept; all punctuation is skipped. How contractions (e.g. don't) are tokenized depends on the tokenizer type. Example: type 'html' breaks a contraction into 2 tokens.
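To illustrate why contraction handling can differ between tokenizer types, here is a plain-JavaScript sketch of two word-matching rules. Both functions are hypothetical and are not qminer tokenizer types; they only show how a choice about apostrophes changes the token count.

```javascript
// Hypothetical helpers, NOT part of qminer.

// Rule 1: a token is a run of Unicode letters or digits; everything else,
// including apostrophes, separates tokens.
function wordTokens(str) {
    return str.match(/[\p{L}\p{N}]+/gu) || [];
}

// Rule 2: same as rule 1, but an apostrophe between letter/digit runs is
// kept inside the token, so contractions stay whole.
function wordTokensKeepApostrophes(str) {
    return str.match(/[\p{L}\p{N}]+(?:'[\p{L}\p{N}]+)*/gu) || [];
}

console.log(wordTokens("What a beautiful day!"));        // [ 'What', 'a', 'beautiful', 'day' ]
console.log(wordTokens("I don't know"));                 // [ 'I', 'don', 't', 'know' ]
console.log(wordTokensKeepApostrophes("I don't know"));  // [ 'I', "don't", 'know' ]
```

Under the first rule "don't" becomes 2 tokens, which resembles what the description reports for type 'html'. Which rule each actual qminer type follows is not stated here, so check the output of getTokens for the type you construct.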