We initially treated this competition as a classification problem rather than a regression one; regression, however, turned out to be a better fit for us. Our reasoning was based on how the tweets were graded: each tweet is labelled by several human graders, and each grader has some reliability weight (we are given no information about the graders themselves). So we replicated each tweet in the training dataset n_class times, once per category, each time labelled with a different class, and used the real-valued confidences as sample weights during classification. An example is sketched below.
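A minimal sketch of that expansion on one made-up row (the column names s1..s3 are hypothetical stand-ins for the real confidence columns):
import pandas as p
# one toy tweet whose grader confidences over three classes sum to 1
df = p.DataFrame({'tweet': ['storm coming tonight'],
                  's1': [0.2], 's2': [0.7], 's3': [0.1]})
texts, labels, weights = [], [], []
for _, r in df.iterrows():
    for cls in ['s1', 's2', 's3']:
        texts.append(r['tweet'])   # replicate the tweet once per class...
        labels.append(cls)         # ...labelled as that class...
        weights.append(r[cls])     # ...weighted by the graders' confidence
# `weights` is later passed as sample_weight to the classifier's fit()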
What we learned from others (Kaggle forum):
%load_ext autoreload
%autoreload 2
from weather import *          # project helpers: cv_loop, predictThis, submission, plotClasses, ...
from variableNames import *
import pandas as p
import numpy as np

train_file = 'train.csv'
test_file = 'test.csv'
# read files into pandas DataFrames
train = p.read_csv(train_file)
t2 = p.read_csv(test_file)
# preprocessing for prediction: emoticons mapped to happy/sad tokens, stop words removed;
# state and location are appended to the tweet text
for row in train.index:
    train.loc[row, 'tweet'] = pre.preprocess_pipeline(
        ' '.join([train['tweet'][row], train['state'][row], str(train['location'][row])]),
        return_as_str=True, do_remove_stopwords=True, do_emoticons=True)
for row in t2.index:
    t2.loc[row, 'tweet'] = pre.preprocess_pipeline(
        ' '.join([t2['tweet'][row], str(t2['state'][row]), str(t2['location'][row])]),
        return_as_str=True, do_remove_stopwords=True, do_emoticons=True)
import matplotlib.pyplot as plt
y = np.array(train.iloc[:, 4:])
ys = y[:, :5]    # columns 4:9  -- sentiment labels
yw = y[:, 5:9]   # columns 9:13 -- "when" labels
yk = y[:, 9:]    # columns 13:  -- "kind" labels
plotClasses(ys)  # class distributions for each label group
plotClasses(yw)
plotClasses(yk)
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred, y_true = cv_loop(train, t2, model)
predRidge = pred.copy()
Our best CV score so far is 0.153, from Ridge Regression (with predictions clipped via clip(0, 1)).
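Since every target is a confidence in [0, 1], truncating out-of-range Ridge outputs can only reduce the squared error; the clipping itself is a one-liner (toy values below):
import numpy as np
pred = np.array([[1.07, -0.02, 0.55]])  # raw Ridge outputs (unbounded)
pred = np.clip(pred, 0.0, 1.0)          # -> [[1.0, 0.0, 0.55]]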
model = MultinomialNB()
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
predNB = pred.copy()
Multinomial NB on filtered data keeping only each tweet's maximum-confidence class, without sample weights, scored 0.166.
The same filtered Multinomial NB with sample weights scored 0.169.
Multinomial NB with sample weights for Sentiment & When, combined with Ridge Regression for Kind, scored 0.163.
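A minimal sketch of that filtering, reusing the ys confidence matrix built earlier (MultinomialNB.fit accepts a sample_weight argument):
hard = ys.argmax(axis=1)              # index of each tweet's maximum-confidence class
w = ys[np.arange(ys.shape[0]), hard]  # that confidence doubles as the sample weight
# MultinomialNB().fit(X_all, hard, sample_weight=w)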
predEnsemble = (predNB + predRidge) / 2.0
rmse = np.sqrt(np.sum(np.array(predEnsemble - y_true)**2) / (y_true.shape[0] * float(y_true.shape[1])))
rmse
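For reference, the quantity computed above is the root mean squared error over all n tweets and m label columns, which is also the competition metric:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n\,m}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\hat{y}_{ij} - y_{ij}\right)^{2}}$$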
Ensembling (averaging) MultinomialNB and Ridge Regression scored 0.155 in CV.
model = linear_model.Ridge(alpha=3.0, normalize=True)
pRidge = predictThis(model, train, t2)
model = MultinomialNB()
pNB = predictThis(model, train, t2, is_nominal=True)
pEnsemble = (pNB + pRidge) / 2.0
submission(pEnsemble, filename="pEnsemble.csv")
We submitted our MultinomialNB & Ridge Regression ensemble prediction, but the leaderboard score was no better.
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred, y_true = cv_loop(train, t2, model, is_LSA=True)
Dimensionality reduction with truncated SVD (a.k.a. LSA): reducing the 20K tf-idf features to 300 and to 1K components both gave a CV score of ~0.18.
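A minimal sketch of that reduction, assuming a tf-idf matrix like the X_all built later in this notebook; scikit-learn's TruncatedSVD works directly on sparse matrices:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=300, random_state=0)  # 20K tf-idf features -> 300 LSA components
X_lsa = svd.fit_transform(X_all)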
model = SGDClassifier(loss="modified_huber")
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
The SGDClassifier estimator implements regularized linear models trained with stochastic gradient descent (SGD).
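One reason we tried the log and modified_huber losses: they are the two SGDClassifier losses that expose predict_proba, so the outputs can be averaged with the regression predictions. A self-contained toy sketch:
import numpy as np
from sklearn.linear_model import SGDClassifier
X = np.abs(np.random.RandomState(0).randn(6, 4))    # toy non-negative features
y = np.array([0, 1, 2, 0, 1, 2])                    # toy labels over 3 classes
clf = SGDClassifier(loss="modified_huber").fit(X, y)
proba = clf.predict_proba(X)                        # shape (6, 3): one probability column per class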
model = SGDClassifier(loss="log")
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
model = SGDClassifier(loss="log")
pSGDlog = predictThis(model,train,t2,is_nominal=True)
model = SGDClassifier(loss="log",class_weight="auto")
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
I thought that giving more weight to the less-represented classes might improve the score, but it didn't.
model = SGDClassifier(loss="log",penalty="l1",n_iter=1000)
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
model = SGDClassifier(loss="log",penalty="l1",fit_intercept=False)
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
I considered the data imbalanced and tried dropping the intercept (fit_intercept=False), but that didn't help either.
model = SGDClassifier(loss="log",penalty="elasticnet")
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
tfidf = TfidfVectorizer(strip_accents='unicode', analyzer='word', smooth_idf=True,
                        sublinear_tf=True, max_df=0.5, min_df=5,
                        ngram_range=(1, 2), use_idf=True)
X_train, X_test, y_train, y_true = cross_validation.train_test_split(train['tweet'], ys, test_size=0.20, random_state=0)
tfidf.fit(X_train)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)
X_train
tfidf.fit(train['tweet'])
X_all = tfidf.transform(train['tweet'])
X_all
train_file = 'train.csv'
test_file = 'test.csv'
# re-read the files into pandas DataFrames
train = p.read_csv(train_file)
t2 = p.read_csv(test_file)
# append state & location to the tweet text (no other preprocessing this time)
for row in train.index:
    train.loc[row, 'tweet'] = ' '.join([train['tweet'][row], train['state'][row], str(train['location'][row])])
tfidf = TfidfVectorizer(strip_accents='unicode', analyzer='word', smooth_idf=True,
                        sublinear_tf=True, max_df=0.5, min_df=5,
                        ngram_range=(1, 2), use_idf=True)
tfidf.fit(train['tweet'])
X_all = tfidf.transform(train['tweet'])
X_all
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred, y_true = cv_loop(train, t2, model)
This result was obtained with max_features set to 20K and no stemming.
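Stemming was left out of our pipeline; if we wanted to try it, it could be folded into the text preparation, e.g. with NLTK's SnowballStemmer (a hypothetical sketch, not what we ran):
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
def stem_text(text):
    # reduce every token to its stem before tf-idf vectorization
    return ' '.join(stemmer.stem(tok) for tok in text.split())
# train['tweet'] = train['tweet'].apply(stem_text)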
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred, y_true = cv_loop(train, t2, model, max_features=1000)
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred, y_true = cv_loop(train, t2, model, max_features=10000)