March 14th, 2016

About Distributed Representations

  • Proven to be an important resource for NLP
    • In conjunction with DNNs, reduce need for handcrafted features
  • Usually learned using neural networks

  • Capture syntactic and semantic information about words
    • Information about morphology is normally ignored
    • Training typically discards character-level information that would be useful

POS/NER Tagging with DNNs

  • For tasks like POS tagging, intra-word info is useful
    • Especially true for morphologically rich languages
    • Previous DNN approaches have relied on handcrafted word-shape features (Collobert 2011)
  • DNN with char level representations of words AND word representations to perform tagging
    • Idea here builds off that Collobert 2011 model, but truly is "From Scratch"
    • The idea was actually inspired by a footnote in Collobert 2011

Network Overview

  • Create DNN with
    • "Window approach" - fixed window fully connected deep network (Collobert 2011)
    • Sentence Log Likelihood Cost - structured prediction using transition weights (Collobert 2011)
    • Concatenate word and character features - word2vec + char embeddings with convolution and pooling
  • Should work for all tagging applications, multiple languages
  • POS Paper
    • English (WSJ Section of PTB Corpus)
    • Portuguese (Mac-Morpho Corpus)
  • NER Paper
    • Portuguese (HAREM I Corpus)
    • Spanish (SPA CoNLL-2002 Corpus)

Refresher on Window Approach

  • Fixed context window, centered on the word to tag (see the sketch after this list)
    • Words first converted to vectors
  • Fully connected layer to hidden layer
  • Non-linearity
  • Fully connected output layer produces a score for each output class
    • 45 labels for English POS tagger, 22 for Portuguese
    • 10 for NER tagger (5 for "selective")
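
A minimal numpy sketch of the window-approach forward pass described above. The vocabulary size, embedding dimension, hidden size, and hard-tanh non-linearity are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

# Illustrative sizes, not the paper's hyperparameters.
vocab_size, word_dim = 1000, 50       # word embedding table
window = 5                            # fixed context window (2 words each side)
hidden_dim, n_tags = 300, 45          # 45 POS labels for English (WSJ/PTB)

rng = np.random.default_rng(0)
W_emb = rng.normal(scale=0.1, size=(vocab_size, word_dim))
W1 = rng.normal(scale=0.1, size=(window * word_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, n_tags))
b2 = np.zeros(n_tags)

def window_scores(word_ids):
    """Score every tag for the center word of a 5-word window."""
    vecs = W_emb[word_ids]            # 1. look up a vector for each word
    x = vecs.reshape(-1)              # 2. concatenate the window into one vector
    h = np.clip(x @ W1 + b1, -1, 1)   # 3. linear layer + hard-tanh non-linearity
    return h @ W2 + b2                # 4. one score per output class

print(window_scores(np.array([3, 17, 42, 7, 99])).shape)   # (45,)
```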

Window Approach (Collobert 2011)

How to incorporate characters?

  • Handcrafted features are limited and undesirable
  • Want to distill the most important character-level information into a fixed-size vector for each word
    • Feature extraction with Convolutional Nets
    • Tracks closely with Collobert 2011's Sentence Approach for global coherency!

Character Level Feature Vector for Words

  • Map a varying number of chars into a fixed-size representation for a single word (sketched after this list)
    • First, convert to continuous char representations
    • Next, convolution layer over chars in word
    • Fix the width using max-over-time pooling
  • Concatenate with the word vector
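
A rough numpy sketch of folding a word's characters into a fixed-size vector with a convolution over character windows followed by max-over-time pooling, as the bullets describe. The alphabet size, character window width, and dimensions are made up for illustration.

```python
import numpy as np

# Illustrative sizes, not the paper's hyperparameters.
alphabet_size, char_dim = 70, 10      # character embedding table
char_window = 3                        # characters per convolution window
char_feat_dim = 50                     # size of the char-level word vector

rng = np.random.default_rng(1)
C_emb = rng.normal(scale=0.1, size=(alphabet_size, char_dim))
W_conv = rng.normal(scale=0.1, size=(char_window * char_dim, char_feat_dim))
b_conv = np.zeros(char_feat_dim)

def char_word_vector(char_ids):
    """Map a word of any length to a fixed-size character-level vector."""
    # 1. continuous character representations (pad so every word yields >= 1 window)
    pad = char_window // 2
    ids = np.concatenate([[0] * pad, char_ids, [0] * pad])
    chars = C_emb[ids]                                   # (len + 2*pad, char_dim)
    # 2. convolution: the same linear map applied to every window of characters
    windows = np.stack([chars[i:i + char_window].reshape(-1)
                        for i in range(len(ids) - char_window + 1)])
    conv = windows @ W_conv + b_conv                     # (n_windows, char_feat_dim)
    # 3. max-over-time pooling collapses the variable-length dimension
    return conv.max(axis=0)                              # (char_feat_dim,)

print(char_word_vector(np.array([5, 12, 3])).shape)      # (50,)
print(char_word_vector(np.arange(1, 15)).shape)          # still (50,)
```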

Visual Representation of Character Level Embedding Components

Full Architecture

  • So we have two independent representations, character- and word-level, for a single word
  • For each window, apply the Window Approach (a combined sketch follows this list)
    • Concatenate the word vector and the char-level word vector
    • Linear layer
    • Non-linearity
    • Output layer produces scores for each tag
  • We still have to produce a structured prediction!
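
Combining the two sketches above, a hedged illustration of the joint word-plus-character representation inside one window; the dimensions are again illustrative.

```python
import numpy as np

word_dim, char_feat_dim = 50, 50
window, hidden_dim, n_tags = 5, 300, 45

rng = np.random.default_rng(2)
joint_dim = word_dim + char_feat_dim
W1 = rng.normal(scale=0.1, size=(window * joint_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, n_tags))
b2 = np.zeros(n_tags)

def tag_scores(word_vecs, char_vecs):
    """word_vecs, char_vecs: (window, dim) arrays for the words in one window."""
    # concatenate the word-level and character-level vectors per word,
    # then flatten the whole window into a single input vector
    joint = np.concatenate([word_vecs, char_vecs], axis=1)   # (window, joint_dim)
    x = joint.reshape(-1)
    h = np.clip(x @ W1 + b1, -1, 1)        # linear layer + non-linearity
    return h @ W2 + b2                     # per-tag scores s(x_n)

word_vecs = rng.normal(size=(window, word_dim))
char_vecs = rng.normal(size=(window, char_feat_dim))
print(tag_scores(word_vecs, char_vecs).shape)   # (45,)
```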

Full Architecture

Structured Prediction Problem

  • Could train to minimize word-level error and greedily select tags
  • But the max-scoring tag for each word is not necessarily coherent
    • Tags have dependencies, Window Approach not globally coherent
  • Uses Sentence Log Likelihood cost with transition scores between tags

\[S([w]_{1}^{N}, [t]_{1}^{N}, \theta) = \sum_{n=1}^{N}\left(A_{t_{n-1},t_n} + s(x_n)_{t_n}\right)\]

  • Viterbi determines the selected tag sequence \[[t^*]_{1}^{N} = \underset{[u]_{1}^{N} \in T^{N}}{\arg\max}\ S([w]_{1}^{N}, [u]_{1}^{N}, \theta)\]
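
A small sketch of the sentence score and the Viterbi decoding implied by the two formulas above. The transition matrix \(A\) and the per-word network scores are random placeholders, and the extra "start" row of \(A\) is an implementation convenience assumed here, not something the slides spell out.

```python
import numpy as np

rng = np.random.default_rng(3)
n_tags, sent_len = 5, 7
A = rng.normal(size=(n_tags + 1, n_tags))   # A[t_prev, t]; row n_tags = start transitions
emit = rng.normal(size=(sent_len, n_tags))  # s(x_n)_{t_n} from the network

def sentence_score(tags):
    """S([w], [t], theta) = sum_n (A[t_{n-1}, t_n] + s(x_n)_{t_n})."""
    score, prev = 0.0, n_tags                # index n_tags plays the "start" pseudo-tag
    for n, t in enumerate(tags):
        score += A[prev, t] + emit[n, t]
        prev = t
    return score

def viterbi():
    """Return the tag sequence maximizing the sentence score."""
    delta = A[n_tags] + emit[0]              # best score ending in each tag so far
    back = []
    for n in range(1, sent_len):
        cand = delta[:, None] + A[:n_tags] + emit[n]   # (prev_tag, tag)
        back.append(cand.argmax(axis=0))
        delta = cand.max(axis=0)
    tags = [int(delta.argmax())]
    for bp in reversed(back):                # follow back-pointers to recover the path
        tags.append(int(bp[tags[-1]]))
    return tags[::-1]

best = viterbi()
print(best, sentence_score(best))
```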

POS Training Details

  • Minimize negative log likelihood over training set as in Collobert 2011

\[\log p([t]_{1}^{N} | [w]_{1}^{N}, \theta) = S([w]_{1}^{N}, [t]_{1}^{N}, \theta) - \log\left( \sum_{\forall [u]_{1}^{N} \in T^{N}} e^{S([w]_{1}^{N}, [u]_{1}^{N}, \theta)}\right)\]

  • SGD is used to minimize the negative log-likelihood with respect to \(\theta\)

\[\theta \mapsto \sum_{([w]_{1}^{N}, [y]_{1}^{N}) \in D} -\log p([y]_{1}^{N}|[w]_{1}^{N},\theta)\]

  • Learning rate decay \(\lambda_t = \frac{\lambda}{t}\)
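
A sketch of the per-sentence negative log-likelihood, computed with a forward (logsumexp) recursion over all tag sequences, together with the \(\lambda_t = \frac{\lambda}{t}\) learning-rate decay. All numeric values are illustrative, not the paper's settings.

```python
import numpy as np

def neg_log_likelihood(A, emit, tags):
    """-log p([t] | [w], theta) = log Z - S([w], [t], theta).

    A    : (n_tags + 1, n_tags) transition scores, last row = start transitions
    emit : (sent_len, n_tags) network scores s(x_n)
    tags : gold tag sequence for the sentence
    """
    n_tags = emit.shape[1]

    def lse(x, axis):                          # numerically stable logsumexp
        m = x.max(axis=axis, keepdims=True)
        return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

    # gold-path score S([w], [t], theta)
    score, prev = 0.0, n_tags
    for n, t in enumerate(tags):
        score += A[prev, t] + emit[n, t]
        prev = t

    # log of the sum over all tag sequences (forward recursion)
    alpha = A[n_tags] + emit[0]
    for n in range(1, len(tags)):
        alpha = lse(alpha[:, None] + A[:n_tags] + emit[n], axis=0)
    return lse(alpha, axis=0) - score

rng = np.random.default_rng(4)
A, emit = rng.normal(size=(6, 5)), rng.normal(size=(7, 5))
print(neg_log_likelihood(A, emit, [0, 2, 1, 4, 3, 0, 2]))

# SGD over the training set with the decay lambda_t = lambda / t
base_lr = 0.0075                               # illustrative, not the paper's value
for t in range(1, 6):
    lr = base_lr / t                           # lambda_t = lambda / t
    # for each (sentence, gold_tags) in D: theta -= lr * grad(neg_log_likelihood)
```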

NER Training Details

  • Same setup as POS training (except for SPA CoNLL-2002)
  • Learning rate set to 0.005 "to avoid divergence"

POS Unsupervised Pre-Training Details

  • Word2vec skip-grams used for unsupervised pre-training of word vectors (sketch after this list)
    • Context window size 5
    • Words lower-cased
    • English: Wikipedia 12/2013 snapshot, min frequency 10 (\(|V|\) = 870,214)
    • Portuguese: Wikipedia, CETEN Folha Corpus, CETEMPublico corpus, min frequency 5 (\(|V|\) = 453,990)
  • Char embeddings not pre-trained
    • Not lower-cased!
  • Custom Portuguese tokenizer
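
A hedged sketch of the skip-gram pre-training step using gensim's Word2Vec (assuming a gensim 4.x API). The toy corpus and the embedding size are placeholders; only the window size, lower-casing, and minimum-frequency settings echo the bullets above.

```python
from gensim.models import Word2Vec

# Placeholder corpus: in the paper this is a Wikipedia snapshot (English) or
# Wikipedia + CETEN Folha + CETEMPublico (Portuguese), tokenized and lower-cased.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "animals"],
]
sentences = [[w.lower() for w in s] for s in sentences]

model = Word2Vec(
    sentences,
    sg=1,            # skip-gram (not CBOW)
    window=5,        # context window size 5
    min_count=1,     # paper: 10 for English, 5 for Portuguese; 1 here for the toy corpus
    vector_size=50,  # illustrative embedding size (gensim >= 4 argument name)
)
print(model.wv["cat"].shape)
```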

POS Experiment Details

  • Comparison against two other deep models and the state of the art (SoTA)
  • Deep Models: Words plus two shape parameters (Caps, Suffix)
    • caps outcomes: {all lower, first upper, all upper, contains upper, other}
    • suffix length 2 for English, 3 for Portuguese
  • SoTA uses many handcrafted features
    • Decision Trees + TBL (Portuguese) (dos Santos et al. 2012)
    • Structured Perceptron with Entropy Guided Feature Induction (Portuguese) (Fernandes, 2012)
    • Semi-Supervised Condensed Nearest Neighbor with SVM (Sogaard 2011)
    • DNN with handcrafted features (Collobert, 2011)
    • Cyclic Dependency Network with rich features (Toutanova 2003, Manning 2011)

POS Corpora Details

POS Results (Portuguese)

POS Results (English)

NER Results (Spanish)

  • CoNLL scorer

NER Results (Portuguese)

  • CoNLL Scorer
  • HAREM I scorer