Partly Sunny with a Chance of Hashtags (organized by CrowdFlower on Kaggle)

GMU Fall 2013, CS 780, Data Mining for Multimedia Data (Dr. Jessica Lin), by Talha and Venkat

Our Approach

We initially treated this competition as a classification problem rather than a regression; regression, however, turned out to be a better fit for us. Our reasoning was based on the information given about grading: each tweet was labeled by several human graders, and each grader has some reliability weight (we are given no information about the graders themselves). So we replicated each tweet in the training dataset n_class times (once for each category), each replica labeled with a different class. An example is available here. We used the real-valued label confidences as sample weights during classification.
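
To make the replication concrete, here is a minimal, hypothetical sketch (the helper name and column handling are ours, not the competition code): each tweet is duplicated once per label column, tagged with that class index, and the graders' real-valued confidence becomes the sample weight.

import pandas as pd

def replicate_for_classification(df, label_cols, text_col='tweet'):
    # Hypothetical helper: one replica per class, confidence as weight.
    rows = []
    for _, r in df.iterrows():
        for k, col in enumerate(label_cols):
            if r[col] > 0:  # skip classes no grader chose
                rows.append({'tweet': r[text_col],
                             'label': k,         # class index as nominal target
                             'weight': r[col]})  # grader confidence as weight
    return pd.DataFrame(rows)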

What we learned from others (Kaggle forum):

In [1]:
%load_ext autoreload
%autoreload 2
from weather import *        # our helpers: cv_loop, predictThis, submission, plotClasses, ...
from variableNames import *  # label-column name definitions
In [2]:
train_file='train.csv'
test_file='test.csv'
#read files into pandas DataFrames
train = p.read_csv(train_file)
t2 = p.read_csv(test_file)
#preprocessing for prediction: emoticons included as happy or sad, and stop words removed
for row in train.index:
    train['tweet'][row] = pre.preprocess_pipeline(' '.join([train['tweet'][row],train['state'][row],str(train['location'][row])]),
                                                  return_as_str=True, do_remove_stopwords=True,do_emoticons=True)
for row in t2.index:
    t2['tweet'][row] = pre.preprocess_pipeline(' '.join([t2['tweet'][row],str(t2['state'][row]),str(t2['location'][row])]),
                                               return_as_str=True, do_remove_stopwords=True,do_emoticons=True)
In [4]:
import matplotlib.pyplot as plt
y = np.array(train.ix[:,4:])
ys = y[:,:5]   # columns 4:9, sentiment labels
yw = y[:,5:9]  # columns 9:13, "when" labels
yk = y[:,9:]   # columns 13:, "kind" labels
plotClasses(ys)
plotClasses(yw)
plotClasses(yk)
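
plotClasses comes from our weather helpers; a minimal sketch of the kind of plot it produces (this implementation is an assumption, not the actual helper):

def plot_classes_sketch(y):
    # y: (n_samples, n_labels) array of real-valued label confidences;
    # show the mean confidence per label as a bar chart.
    means = y.mean(axis=0)
    plt.bar(np.arange(len(means)), means)
    plt.xlabel('label index')
    plt.ylabel('mean confidence')
    plt.show()
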
In [29]:
model = linear_model.Ridge(alpha=3.0, normalize=True)
In [6]:
pred,y_true = cv_loop(train, t2, model)
predRidge = pred.copy()
Train error: 0.153310841843

0.153 is our best CV score so far, obtained with Ridge Regression (predictions clipped to [0, 1]).
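
cv_loop lives in our weather module; roughly, it does something like the following sketch (signature and internals here are assumptions): hold out part of the training set, vectorize, fit, predict, clip predictions to [0, 1], and report the competition's RMSE.

from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer

def cv_loop_sketch(train, model, y, test_size=0.2):
    # Hold out a validation split from the training tweets.
    X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(
        train['tweet'], y, test_size=test_size, random_state=0)
    tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
    X_tr = tfidf.fit_transform(X_tr)
    X_te = tfidf.transform(X_te)
    model.fit(X_tr, y_tr)
    pred = np.clip(model.predict(X_te), 0.0, 1.0)  # the clip(0,1) step
    print 'Train error:', np.sqrt(((pred - y_te) ** 2).mean())
    return pred, y_te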

In [7]:
model = MultinomialNB()
In [9]:
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
predNB = pred.copy()
Train error: 0.162996084532

Multinomial NB on filtered data keeping only each tweet's maximum-confidence class, without sample weights, resulted in 0.166.

Multinomial NB on the same filtered data with sample weights resulted in 0.169.

Multinomial NB with sample weights (for Sentiment & When) plus Ridge Regression for Kind resulted in 0.163.
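
A hedged sketch of the "max's class" filtering with sample weights (the helper name is ours; X is the tf-idf matrix for one label block):

def fit_nb_on_max_class(X, y_block):
    # y_block: (n_samples, n_classes) confidences for one block (e.g. sentiment).
    labels = y_block.argmax(axis=1)   # keep only each tweet's max class
    weights = y_block.max(axis=1)     # and its confidence as the sample weight
    nb = MultinomialNB()
    nb.fit(X, labels, sample_weight=weights)
    return nb.predict_proba(X)        # probabilities feed the RMSE metric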

In [12]:
predEnsemble = np.add(predNB, predRidge) / 2.0
In [16]:
rmse = np.sqrt(np.sum(np.array(predEnsemble-y_true)**2)/(y_true.shape[0]*float(y_true.shape[1])))
rmse
Out[16]:
0.15538461442161469

Ensembling (averaging) MultinomialNB and Ridge Regression scored 0.155 in CV.

In [30]:
model = linear_model.Ridge(alpha=3.0, normalize=True)
pRidge = predictThis(model,train,t2)
In [31]:
model = MultinomialNB()
pNB = predictThis(model,train,t2,is_nominal=True)
In [33]:
pEnsemble = np.add(pNB, pRidge) / 2.0
submission(pEnsemble,filename="pEnsemble.csv")

We submitted our MultinomialNB & Ridge Regression ensemble prediction, but the leaderboard score was no better.
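
submission() is our helper for writing the Kaggle file; it looks roughly like this sketch (the 24 label columns s1-s5, w1-w4, k1-k15 follow the competition format; the rest is an assumption):

def submission_sketch(pred, test_df, filename='submission.csv'):
    cols = (['s%d' % i for i in range(1, 6)] +    # 5 sentiment labels
            ['w%d' % i for i in range(1, 5)] +    # 4 "when" labels
            ['k%d' % i for i in range(1, 16)])    # 15 "kind" labels
    out = p.DataFrame(pred, columns=cols)
    out.insert(0, 'id', test_df['id'].values)
    out.to_csv(filename, index=False)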

In [40]:
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred,y_true = cv_loop(train, t2, model,is_LSA=True)
Train error: 0.181060399966

Dimensionality reduction using truncated SVD (a.k.a. LSA): reducing the ~20K tf-idf features to 300 and to 1K components both gave an RMSE of ~0.18 in CV.
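
Internally, the is_LSA path presumably does something like this (an assumption; X_train/X_test are the tf-idf matrices):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300)      # 1000 components scored about the same
X_train_lsa = svd.fit_transform(X_train)  # dense (n_samples, 300) matrix
X_test_lsa = svd.transform(X_test)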

In [45]:
model = SGDClassifier(loss="modified_huber")
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
Train error: 0.189550950066

The SGDClassifier estimator implements regularized linear models trained with stochastic gradient descent (SGD).
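
One reason for these loss choices: "modified_huber" and "log" are the SGDClassifier losses that expose predict_proba, and the RMSE metric needs probabilities rather than hard labels. Reusing the names from the NB sketch above:

sgd = SGDClassifier(loss="log")   # renamed 'log_loss' in newer scikit-learn
sgd.fit(X, labels, sample_weight=weights)
proba = sgd.predict_proba(X)      # (n_samples, n_classes) probabilities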

In [52]:
model = SGDClassifier(loss="log")
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
Train error: 0.175439455944

In [8]:
model = SGDClassifier(loss="log")
pSGDlog = predictThis(model,train,t2,is_nominal=True)
In [55]:
model = SGDClassifier(loss="log",class_weight="auto")
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
Train error: 0.199461085673

We thought that giving more weight to the less-represented classes might improve the score, but it didn't.

In [6]:
model = SGDClassifier(loss="log",penalty="l1",n_iter=1000)
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
Train error: 0.194566204339

In [58]:
model = SGDClassifier(loss="log",penalty="l1",fit_intercept=False)
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
Train error: 0.175865087643

We considered our data imbalanced and tried fit_intercept=False, but it didn't help either.

In [59]:
model = SGDClassifier(loss="log",penalty="elasticnet")
pred,y_true = cv_loop(train, t2, model,is_nominal=True)
Train error: 0.174285676501

In [6]:
tfidf = TfidfVectorizer(strip_accents='unicode', analyzer='word',
                        smooth_idf=True, sublinear_tf=True, max_df=0.5,
                        min_df=5, ngram_range=(1,2), use_idf=True)
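
A note on these settings: sublinear_tf=True replaces raw term counts with 1 + log(tf); max_df=0.5 drops terms appearing in more than half of the documents; min_df=5 keeps only terms seen in at least five documents; and ngram_range=(1,2) uses unigrams plus bigrams.
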
In [17]:
X_train, X_test, y_train, y_true = cross_validation.train_test_split(train['tweet'], ys, test_size=.20, random_state = 0)
In [8]:
y = np.array(train.ix[:,4:])
ys = y[:,:5]   # columns 4:9, sentiment labels
yw = y[:,5:9]  # columns 9:13, "when" labels
yk = y[:,9:]   # columns 13:, "kind" labels
In [10]:
tfidf.fit(X_train)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)
In [11]:
X_train
Out[11]:
<62356x23890 sparse matrix of type '<type 'numpy.float64'>'
	with 1065928 stored elements in Compressed Sparse Row format>
In [12]:
tfidf.fit(train['tweet'])
X_all = tfidf.transform(train['tweet'])
In [13]:
X_all
Out[13]:
<77946x29076 sparse matrix of type '<type 'numpy.float64'>'
	with 1355291 stored elements in Compressed Sparse Row format>
In [14]:
train_file='train.csv'
test_file='test.csv'
#read files into pandas
train = p.read_csv(train_file)
t2 = p.read_csv(test_file)
for row in train.index:
    train['tweet'][row]=' '.join([train['tweet'][row],train['state'][row],str(train['location'][row])])
In [15]:
tfidf = TfidfVectorizer(strip_accents='unicode', analyzer='word',
                        smooth_idf=True, sublinear_tf=True, max_df=0.5,
                        min_df=5, ngram_range=(1,2), use_idf=True)
tfidf.fit(train['tweet'])
X_all = tfidf.transform(train['tweet'])
In [16]:
X_all
Out[16]:
<77946x40097 sparse matrix of type '<type 'numpy.float64'>'
	with 2001704 stored elements in Compressed Sparse Row format>
In [18]:
tfidf.fit(X_train)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)
In [19]:
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred,y_true = cv_loop(train, t2, model)
Train error: 0.151103593167

This result was obtained with max_features set to 20K and no stemming.

In [20]:
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred,y_true = cv_loop(train, t2, model,max_features=1000)
Train error: 0.151010902315

In [22]:
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred,y_true = cv_loop(train, t2, model,max_features=1000)
Train error: 0.162763152741

In [23]:
model = linear_model.Ridge(alpha=3.0, normalize=True)
pred,y_true = cv_loop(train, t2, model,max_features=10000)
Train error: 0.151853010216
