2.2 Identifying Dialogue Act Types
When processing dialogue, it can be useful to think of
utterances as actions performed by the speaker. This
interpretation is most straightforward for performative statements
such as "I forgive you" or "I bet you can't climb that hill." But
greetings, questions, answers, assertions, and clarifications can all
be thought of as types of speech-based actions. Recognizing the
dialogue acts underlying the utterances in a dialogue can be an
important first step in understanding the conversation.
The NPS Chat Corpus, which was demonstrated in
1, consists of over 10,000 posts from
instant messaging sessions. These posts have all been labeled with
one of 15 dialogue act types, such as "Statement," "Emotion,"
"ynQuestion," and "Continuer." We can therefore use this data to
build a classifier that can identify the dialogue act types for new
instant messaging posts. The first step is to extract the basic
messaging data. We will call xml_posts() to get a data structure
representing the XML annotation for each post:
>>> import nltk
>>> posts = nltk.corpus.nps_chat.xml_posts()[:10000]
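Each item in this list can be treated as an XML element: the message itself is read from its text attribute, and the dialogue act label from its class attribute. As a minimal sketch of that access pattern, the following builds a stand-in element with Python's xml.etree.ElementTree. The sample markup and the variable names are hypothetical illustrations, not an excerpt from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for one annotated post (not copied from the corpus).
sample_xml = '<Post class="ynQuestion">is anyone here</Post>'
sample_post = ET.fromstring(sample_xml)

post_text = sample_post.text            # the raw message text
post_label = sample_post.get('class')   # the dialogue act type

print(post_text, '->', post_label)
```

The same two accessors, post.text and post.get('class'), are the ones used when building the feature sets for the classifier.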
Next, we'll define a simple feature extractor that checks what words
the post contains:
>>> def dialogue_act_features(post):
...     features = {}
...     for word in nltk.word_tokenize(post):
...         features['contains({})'.format(word.lower())] = True
...     return features
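To make the shape of these features concrete, here is a small sketch that applies the same logic to a made-up post. It assumes that plain whitespace splitting is an acceptable stand-in for nltk.word_tokenize so that it runs without NLTK's tokenizer models; the function name dialogue_act_features_sketch and the sample text are hypothetical.

```python
def dialogue_act_features_sketch(post_text):
    """Bag-of-words features: one 'contains(word)' key per distinct lowercased token."""
    features = {}
    # Assumption: str.split() stands in for nltk.word_tokenize(), so punctuation
    # would stay attached to words here, unlike with the NLTK tokenizer.
    for word in post_text.split():
        features['contains({})'.format(word.lower())] = True
    return features

sample_features = dialogue_act_features_sketch('Hello hello is anyone HERE')
print(sorted(sample_features))
```

Because the keys are lowercased and stored in a dictionary, repeated words collapse into a single entry: the features record which words are present, not how often they occur.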
Finally, we construct the training and testing data by applying the
feature extractor to each post (using post.get('class') to get
a post's dialogue act type), hold out the first 10% of the feature
sets for testing, and train a naive Bayes classifier on the rest:
>>> featuresets = [(dialogue_act_features(post.text), post.get('class'))
...                for post in posts]
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, test_set))
0.67
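An accuracy figure is easier to interpret next to a simple reference point. The sketch below repeats the 10% held-out split on a small invented list of (features, label) pairs and scores a majority-class baseline, which always predicts the most frequent training label. The toy data and the helper name majority_baseline_accuracy are assumptions made for illustration; nothing here reports a baseline for the chat corpus itself.

```python
from collections import Counter

# Invented labels; the feature dicts are left empty because the baseline ignores them.
toy_labels = (['Statement', 'ynQuestion'] + ['Statement'] * 10
              + ['Emotion'] * 4 + ['ynQuestion'] * 4)
toy_featuresets = [({}, label) for label in toy_labels]

# Same split as in the text: the first 10% is held out for testing.
size = int(len(toy_featuresets) * 0.1)
train_set, test_set = toy_featuresets[size:], toy_featuresets[:size]

def majority_baseline_accuracy(train, test):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(label for _, label in train).most_common(1)[0][0]
    correct = sum(1 for _, label in test if label == majority_label)
    return correct / len(test)

baseline = majority_baseline_accuracy(train_set, test_set)
print(baseline)
```

Running the same helper on the real train_set and test_set would show how much of the classifier's accuracy comes from the word features rather than from the label distribution alone.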