
Tackling challenges of neural purchase stage identification from imbalanced Twitter data

Published online by Cambridge University Press:  15 August 2019

Heike Adel*
Affiliation:
Institute for Natural Language Processing, University of Stuttgart, Pfaffenwaldring 5b, 70569 Stuttgart, Germany
Francine Chen
Affiliation:
FX Palo Alto Laboratory, 3174 Porter Dr, Palo Alto, CA 94304, USA
Yan-Ying Chen
Affiliation:
FX Palo Alto Laboratory, 3174 Porter Dr, Palo Alto, CA 94304, USA
*
*Corresponding author. Email: heike.adel@ims.uni-stuttgart.de

Abstract

Twitter and other social media platforms are often used for sharing interest in products. The identification of purchase decision stages, such as in the AIDA model (Awareness, Interest, Desire, and Action), can enable more personalized e-commerce services and a finer-grained targeting of advertisements than predicting purchase intent only. In this paper, we propose and analyze neural models for identifying the purchase stage of single tweets in a user’s tweet sequence. In particular, we identify three challenges of purchase stage identification: imbalanced label distribution with a high number of non-purchase-stage instances, limited amount of training data, and domain adaptation with little or no target domain data. Our experiments reveal that the imbalanced label distribution is the main challenge for our models. We address it with ranking loss and perform detailed investigations of the performance of our models on the different output classes. In order to improve the generalization of the models and augment the limited amount of training data, we examine the use of sentiment analysis as a complementary, secondary task in a multitask framework. For applying our models to tweets from another product domain, we consider two scenarios: for the first scenario without any labeled data in the target product domain, we show that learning domain-invariant representations with adversarial training is most promising, while for the second scenario with a small number of labeled target examples, fine-tuning the source model weights performs best. Finally, we conduct several analyses, including extracting attention weights and representative phrases for the different purchase stages. The results suggest that the model is learning features indicative of purchase stages and that the confusion errors are sensible.

Type
Article
Copyright
© Cambridge University Press 2019

1 Introduction

Social media platforms are increasingly becoming interwoven into people’s lives for communicating and sharing thoughts with others. These thoughts may include sharing their experience with products as well as their interest in new products (Morris, Teevan and Panovich Reference Morris, Teevan and Panovich2010). Analyzing social media posts to extract information, such as opinions, sentiment, or purchase interest, can be useful for e-commerce services, marketing, or customer relationship management.

Marketers and advertisers have employed the AIDA (Awareness/Attention, Interest, Desire, and Action) model (Lewis Reference Lewis1903; Dukesmith Reference Dukesmith1904; Russell Reference Russell1921) for decades to help move target customers from one purchase decision stage to the next, for example, from Interest to Desire. In this paper, we present our investigations in developing a method for automatically identifying purchase stages in tweets based on AIDA. For applying AIDA to tweets, we define “Action” as a completed buying action. Since all AIDA stages express a rather positive opinion, we extend them with a negative sentiment class: unhappiness of a user with a product. This can be an indication that a user needs help with a product or may soon purchase a new product. In contrast to the traditional use of AIDA for assessing an advertisement’s effect on users, we argue that knowledge of a user’s purchase stage can help to personalize the type of e-commerce service a user is offered. If a user is interested in a product, a manufacturer may offer the user information highlighting their product’s features. A store, on the other hand, may offer coupons for a product that a user has indicated a desire to buy. Manufacturers may also offer a list of stores which carry the product. Stores and manufacturers could send information about additional features to a user who has just bought their product. And manufacturers may wish to reach out with support to unhappy purchasers.

Traditionally, retailers have identified target customers by profiling based largely on demographic information and purchase records (Davenport, Dalle Mule and Lucker Reference Davenport, Dalle Mule and Lucker2011). For example, Lv et al. (Reference Lv, Yu, Tian and Wu2014) predicted which users of Sina Weibo (a microblogging platform) were interested in either real estate, healthy parenting, or sports; if so, they were considered target customers. Instead, we take advantage of the fact that social media platforms, such as Twitter, can provide a more direct indication of user interest than demographic information. In particular, we identify candidate target customers based on whether they tweet about a specific product category. For example, if the category is mobile device, a user who mentions a mobile device from a predefined lexicon of devices (e.g., “iPhone”) is considered a candidate target customer. Our model then identifies the purchase stages of a target customer (including an artificial class “no purchase stage”) based on their tweets. In this work, we focus on the tweets of a user in order to investigate how much information can be extracted from the textual content only.

As a result, our model is a pure natural language processing (NLP) model. Extracting information from unstructured tweet data has received a lot of attention in the NLP community recently. Examples include traditional NLP tasks, like syntactic processing (Owoputi et al. Reference Owoputi, O’Connor, Dyer, Gimpel, Schneider and Smith2013) or sentiment analysis on Twitter data (Nakov et al. Reference Nakov, Ritter, Rosenthal, Sebastiani and Stoyanov2016), and also business-related applications, such as stock market prediction (Bollen and Mao Reference Bollen and Mao2011) or user action prediction (Mahmud et al. Reference Mahmud, Fei, Xu, Pal and Zhou2016) from tweets. Processing tweets is inherently different from processing traditional text forms, such as news data. Tweets are short (with a character-based length restriction) and, thus, present information in a very dense way. Moreover, the variability of language is much higher than in other texts: abbreviations, misspellings, elongations, or emoticons are common and pose additional challenges to NLP models. In this paper, we address purchase stage prediction on Twitter data and show that it is challenging but possible to automatically identify purchase stages from tweets only. Note that the challenges we tackle in this paper have been identified for neural purchase stage prediction from Twitter data, but they are not specific to this task. In fact, many NLP tasks such as relation extraction or domain- or language-independent language processing face the same challenges. Therefore, our findings are applicable to other NLP tasks as well.

In an earlier work (Adel, Chen and Chen Reference Adel, Chen and Chen2017), we defined the purchase stage identification task at the tweet level: for each tweet of the tweet sequence of a particular user, we automatically determined whether the user expresses interest, desire, etc. We proposed to model tweets and tweet sequences using a hierarchical neural network consisting of convolutional and recurrent neural networks (CNNs and RNNs). Recently, CNNs and RNNs have proven effective for text processing, cf., Cho et al. (Reference Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio2014), Kalchbrenner, Grefenstette and Blunsom (Reference Kalchbrenner, Grefenstette and Blunsom2014), Kim (Reference Kim2014), Bahdanau, Cho and Bengio (Reference Bahdanau, Cho and Bengio2015), or Hermann et al. (Reference Hermann, Kociský, Grefenstette, Espeholt, Kay, Suleyman and Blunsom2015). We used CNNs for representing single tweets and modeled a tweet sequence with RNNs with gated recurrent units (GRUs) (see Figures 2 and 3). For comparison, we also experimented with other choices of neural and nonneural models.

The dataset we collected contains many more tweets expressing none of the purchase stages than tweets expressing one of the stages. An example tweet mentioning a product but not expressing a purchase stage is: “It is my mom’s birthday. Thank you, iPhone, for remembering.” Thus, one challenge our models have to cope with is class imbalance. In Adel et al. (Reference Adel, Chen and Chen2017), we showed that a ranking layer approach outperforms traditional class weights in solving this problem. In this paper, we investigate the different choices of neural model layers in the context of our highly imbalanced dataset in more detail. In addition, we present a variety of analyses in order to better assess the performance of the models and understand their behavior.

For training robust tweet representation and tweet sequence models, a large amount of labeled data is necessary. However, obtaining this data is expensive, especially when it comes to modeling not only one but also several different (product) domains. Therefore, our models have to be robust in the context of little training data. To achieve this, we investigate multitask learning in this paper. In particular, we share the hidden layers of the neural networks across different tasks. As a secondary task, we use sentiment prediction and exploit Twitter sentiment analysis data from a shared task at the International Workshop on Semantic Evaluation (SemEval) (Nakov et al. Reference Nakov, Ritter, Rosenthal, Sebastiani and Stoyanov2016).

Moreover, we investigate domain adaptation to tweets about a new product with very little training data. We compare four approaches: (i) applying the trained model to new product tweets, (ii) fine-tuning the model with new product tweets, (iii) training a new model with the same architecture only on the data of the new product domain, and (iv) learning domain-invariant representations with adversarial training. Product domain adaptation is challenging because many of the phrases the model learns during training are product specific. Our results show that it is possible to adapt the model to new domains with no or only a few samples from the new domain. Domain-invariant representations from adversarial training perform best when there is no labeled data from the target product domain available, while fine-tuning the model outperforms the other training methods when given a small number of labeled examples from the target domain.

Finally, we perform analyses to better understand what our models learn. In particular, we investigate the effect of n-gram size on our models, and we analyze which features the neural networks consider as important by extracting representative trigrams and attention weights for the different purchase stages.

To sum up, we investigate the performance of our neural models when facing three different challenges: high class imbalance, little training data, and domain adaptation. Our contributions are as follows:

  • Building on our previous work (Adel et al. Reference Adel, Chen and Chen2017) and the hierarchical neural model presented there, we investigate the performance of the model in the presence and absence of the dominant class (non-purchase stage N). Our results confirm that the model is able to differentiate the purchase stages reliably and that the class imbalance is the primary challenge.

  • We investigate the impact of multitask learning on the performance of our deep learning models in order to address the problem of little task-specific training data. We show that multitask learning improves the performance in six out of eight individual cases, and especially on the unhappiness class. The better overall results indicate that it helps the model to better generalize, most probably because the learned tweet representations are less likely to overfit to a particular dataset.

  • We show that our neural networks can be effectively adapted to a different product domain, even without labeled data in the target domain. We propose to use adversarial training for creating product domain-invariant tweet representations. Our results show that this approach is indeed most effective in both the absence and the presence of only a small amount of labeled target domain data (in our experiments 300 or fewer samples). If a medium or larger amount of labeled data is available (in our experiments 600 samples or more), fine-tuning an existing source model for a few epochs can be more effective.

  • Finally, we present detailed experimental results and perform analyses to better understand the results and the features learned by our model. In particular, we investigate the effect of n-gram sizes on our models, analyze attention weights on words, and extract representative phrases for the different purchase stages.

The remainder of this paper is organized as follows: Section 2 presents related work and Section 3 describes the task, dataset, and the most important challenges our models have to tackle. In Section 4, we explain the model with its different components, including the way we do multitask learning and product category adaptation. Sections 5 and 6 present our experiments and results. Finally, we show our analyses in Section 7. Section 8 concludes the paper.

2 Related work

Social media mining and purchase prediction. Social media mining, especially on Twitter, has become an active research area over the last several years. For example, there is work on predicting movie revenues (Asur and Huberman Reference Asur and Huberman2010) and stock prices (Bollen and Mao Reference Bollen and Mao2011; Kharratzadeh and Coates Reference Kharratzadeh and Coates2012) based on tweets. Other studies on social media classify purchase intentions (Ramanand, Bhavsar and Pedanekar Reference Ramanand, Bhavsar and Pedanekar2010; Hollerit, Kröll and Strohmaier Reference Hollerit, Kröll and Strohmaier2013; Gupta et al. Reference Gupta, Varshney, Jhamtani, Kedia and Karwa2014; Ding et al. Reference Ding, Liu, Duan and Nie2015; Vieira Reference Vieira2015; Lo, Frankowski and Leskovec Reference Lo, Frankowski and Leskovec2016) of users to target potential customers or to predict sales of products (Lassen, Madsen and Vatrapu Reference Lassen, Madsen and Vatrapu2014). Although Lassen et al. (Reference Lassen, Madsen and Vatrapu2014) had the AIDA model in mind when designing their features, they did not model purchase stages directly as we do in this paper. Gupta et al. (Reference Gupta, Varshney, Jhamtani, Kedia and Karwa2014) used social forum data from Quora and Yahoo! Answers to predict purchase intentions of products or services based on a large set of linguistic and statistical features. The study by Ramanand et al. (Reference Ramanand, Bhavsar and Pedanekar2010) detected “purchasing wishes” which would most probably be equivalent to our D (Desire) class. In contrast to all of these works, we predict a wider range of purchase stages in a more fine-grained, multiclass classification task.

Most previous works on purchase prediction use linguistic rules (Ramanand et al. Reference Ramanand, Bhavsar and Pedanekar2010) or traditional machine learning models with manually designed features, for example, regression models (Lassen et al. Reference Lassen, Madsen and Vatrapu2014; Lo et al. Reference Lo, Frankowski and Leskovec2016) or support vector machines (SVMs) (Hollerit et al. Reference Hollerit, Kröll and Strohmaier2013; Gupta et al. Reference Gupta, Varshney, Jhamtani, Kedia and Karwa2014; Mahmud et al. Reference Mahmud, Fei, Xu, Pal and Zhou2016). In contrast, we learn features automatically with neural networks and do not use any handcrafted rules or features based on time-consuming preprocessing, such as dependency parsing.

Lo et al. (Reference Lo, Frankowski and Leskovec2016) used features from e-commerce or content discovery platforms to predict buying intentions. In contrast, we use raw, unstructured text data as input; thus challenges like Twitter-specific language characteristics make the task more difficult. Our task is more similar to the work by Ding et al. (Reference Ding, Liu, Duan and Nie2015) who applied a CNN to the identification of consumption intention. Hollerit et al. (Reference Hollerit, Kröll and Strohmaier2013) trained different classifiers on the words and part-of-speech tags of tweets to detect buying and selling intent. Yan et al. (Reference Yan, Duan, Chen, Zhou, Zhou and Li2017) identified user intent by building a classifier trained on typical phrases for each intent collected via crowd-sourcing, and then used it in a dialogue system. Similar to our work, they used CNNs to represent phrases. However, they applied them to product category detection. The classes they distinguished are different from the AIDA-based purchase stages; instead, they are directly related to responses of the dialogue system. Furthermore, their data have different characteristics than Twitter data. We also investigate modeling tweet sequences rather than single phrases or tweets, but in contrast to all these works, we base our predictions on the context of a user’s earlier tweets, rather than performing predictions on each tweet in isolation.

Korpusik et al. (Reference Korpusik, Sakaki, Chen and Chen2016) also modeled tweet sequences using a long short-term memory (LSTM) network for predicting whether a user will purchase a product. Their work differs from ours in several important respects: we identify which of several AIDA-related purchase categories the user is in (they only considered the Bought-Action class and did not consider other categories, such as Interest or Desire); their tweet model was simply an average of word embeddings; predictions were per user, not per tweet; and their two classes (user will purchase or not) were relatively balanced.

Domain adaptation. The challenge of domain adaptation is well known and a large variety of methods exist to handle it. For neural networks, fine-tuning the model weights using data from the target domain has been shown to be effective, cf., Kombrink et al. (Reference Kombrink, Mikolov, Karafiát and Burget2011) or Chen et al. (Reference Chen, Tan, Liu, Lanchantin, Wan, Gales and Woodland2015). Ding et al. (Reference Ding, Liu, Duan and Nie2015) also investigated domain adaptation of CNNs for detecting user consumption intention. However, they fixed the weights of the convolutional layer and only trained an adaptation layer on the target domain. In contrast, we fine-tune the whole network on the target domain to be able to account for different input words and phrases in the target domain. Recently, adversarial domain adaptation has become popular (Ganin et al. Reference Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand and Lempitsky2016; Tzeng et al. Reference Tzeng, Hoffman, Darrell and Saenko2017). It uses a domain discriminator as an adversary to make the feature spaces of the different domains as similar as possible. This approach assumes that the target domain is known during training but does not need target domain data which is labeled for the specific task. This makes it more widely applicable than fine-tuning model weights. While most studies applied adversarial domain adaptation to classification tasks (e.g., to image classification (Ganin et al. Reference Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand and Lempitsky2016; Tzeng et al. Reference Tzeng, Hoffman, Darrell and Saenko2017) or sentiment analysis (Ganin et al. Reference Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand and Lempitsky2016; Chen et al. Reference Chen, Sun, Athiwaratkun, Weinberger and Cardie2018)), only a few works showed its effectiveness on sequence-tagging tasks as we do in this paper. An example is Gui et al. (Reference Gui, Zhang, Huang, Peng and Huang2017) who applied it for domain-independent part-of-speech tagging. In this paper, we discuss adversarial training in direct comparison with other domain adaptation possibilities, considering scenarios with and without labeled data from the target domain.

Multitask learning with sentiment analysis. Our approach for multitask learning is based on the techniques presented in Klerke, Goldberg and Søgaard (Reference Klerke, Goldberg and Søgaard2016). They shared hidden layers across several tasks, that is, the parameters were trained with data from different tasks. As an auxiliary task, we use sentiment analysis and leverage data from sentiment analysis on Twitter (Nakov et al. Reference Nakov, Ritter, Rosenthal, Sebastiani and Stoyanov2016). In contrast to purchase stage identification, the sentiment analysis task considers only classes on a positive–negative scale. Predicting purchase stages, on the other hand, aims at a more fine-grained distinction of opinions with a direct correlation to business aspects in mind. However, previous studies showed that sentiment can be a useful feature for purchase prediction (Asur and Huberman Reference Asur and Huberman2010; Gupta et al. Reference Gupta, Varshney, Jhamtani, Kedia and Karwa2014; Lassen et al. Reference Lassen, Madsen and Vatrapu2014; Mahmud et al. Reference Mahmud, Fei, Xu, Pal and Zhou2016). Therefore, we decided to use sentiment data for multitask learning in this work.

Probably most similar to our hierarchical model is the model proposed by Tang, Qin and Liu (Reference Tang, Qin and Liu2015) for sentiment prediction. Their model predicts a single sentiment from the sentences in a document. In contrast, we predict the purchase stage for each of a user’s tweets. While they used a recurrent model to combine sentences for a single prediction per document, we use a recurrent model to provide context for the prediction of the purchase stage of the current tweet.

3 Task and data

In this section, we describe the task of purchase stage identification, the dataset that we collected and annotated, and the challenges that make purchase stage classification from tweets difficult.

3.1 Purchase stage classification

We base our definition of purchase stages on the AIDA model (Lewis Reference Lewis1903; Dukesmith Reference Dukesmith1904; Russell Reference Russell1921) and model the following classes: Interest (I), Desire (D), and Action (A). We do not model awareness explicitly since we found that most examples which would have been labeled with awareness were advertisements rather than individual user comments; thus there were few true examples of tweets indicating awareness, perhaps a reflection that users are more likely to tweet about interest, desire, or a purchase than about awareness of a product.

Over the years, many variants of the AIDA model have been proposed to address different use cases, for example, salesmen and advertising. These variants include the AIDAS model which adds Satisfaction and the AISDALSLove model (Attention, Interest, Search, Desire, Action, Like/Dislike, Share, and Love/Hate), which adds several stages, including Like/Dislike (Wijaya Reference Wijaya2015). Motivated by this, we added a class with a negative sentiment: Unhappiness (U). This class includes any kind of unhappiness with a product, either before or after buying it, and has similarities to the satisfaction class in the AIDAS model and the Like/Dislike class in the AISDALSLove model. Although it is possible that a user expresses unhappiness and an AIDA stage simultaneously, this occurred in only 15 tweets out of over 100k total. Note that we have not included additional classes with positive sentiment towards a bought product, such as Like or Love since the motivation for our work is the improvement of e-commerce services by detecting users who need additional information about products or support with products.

Our models identify purchase stages for individual tweets in a given tweet sequence. Table 1 shows a sample tweet for each of our extended set of purchase stages, which we refer to as IDA + U.

Table 1 Sample purchase decision stage tweets.

3.2 Dataset creation

Data Collection. To the best of our knowledge, no publicly available dataset of tweets for the purpose of predicting AIDA purchase stages existed before our studies in Adel et al. (Reference Adel, Chen and Chen2017). The most similar work (Korpusik et al. Reference Korpusik, Sakaki, Chen and Chen2016) used a limited set of handcrafted regular expressions, for example, “want a X” or “my new X,” to identify only bought and want tweets.

To create a more representative set, we collected 98 mobile device model names by scraping websites for mobile phones, tablets, and smart watches available in 2016. The full product names and relatively distinct model names (e.g., “iPad” but not “one” as in “HTC One”) formed queries to the Twitter search API. We could have searched for more general phrases such as “phone” or “tablet” instead, but we observed that this would have led to overwhelmingly more general tweets not related to any purchase stage. Similarly, we did not want to add purchase-related patterns to the queries, as in Sakaki et al. (Reference Sakaki, Chen, Korpusik and Chen2016), to avoid biasing the dataset towards specific phrases and purchase events. Instead, our assumption was that a user who is interested in a specific product or has bought a specific product will likely also mention its brand name.

The search returned 156k tweets by 36,699 distinct Twitter users, which were filtered to remove spam using the URL features of Benevenuto et al. (Reference Benevenuto, Magno, Rodrigues and Almeida2010) and spam words, for example, “click here,” “win,” and “free.” After collecting and filtering the tweets, the timelines for each of the remaining 10,141 users were collected, and the users were filtered to remove spam users based on all of the tweets in their timeline. The timeline filtering allows for a richer set of user features as described by Benevenuto et al. (Reference Benevenuto, Magno, Rodrigues and Almeida2010), including the fraction of tweets containing URLs, the fraction of tweets with spam words, and the fraction of retweets, that is, tweets starting with “RT.” In addition, users who posted a tweet identical to one posted by at least five other users were also marked as spam users. From the 6378 Twitter users remaining after filtering, we randomly selected 3000 users for annotation as a trade-off between annotation costs and reasonable dataset size. Note that 3000 users corresponded to more than 100,000 tweets for annotation (see Table 4).
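For illustration, the timeline-level spam filtering can be sketched as a few thresholded statistics over a user's tweets. The following Python sketch is hypothetical: the Tweet fields, thresholds, and spam-word list are placeholders, not the exact features or values of Benevenuto et al. (Reference Benevenuto, Magno, Rodrigues and Almeida2010).

```python
from dataclasses import dataclass

SPAM_WORDS = {"click here", "win", "free"}  # illustrative subset


@dataclass
class Tweet:
    text: str
    has_url: bool
    is_retweet: bool  # tweets starting with "RT"


def is_spam_user(timeline, url_frac=0.8, spam_frac=0.5, rt_frac=0.9):
    """Flag a user as spam from timeline-level fractions (thresholds are
    placeholders; in practice they would be tuned on held-out data)."""
    n = len(timeline)
    if n == 0:
        return False
    urls = sum(t.has_url for t in timeline) / n
    spam = sum(any(w in t.text.lower() for w in SPAM_WORDS)
               for t in timeline) / n
    rts = sum(t.is_retweet for t in timeline) / n
    return urls > url_frac or spam > spam_frac or rts > rt_frac
```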

For our adaptation experiments in this paper, we additionally scraped http://www.motortrend.com/cars for the 2016 car models of each brand. Again, the full product names and relatively distinct model names were used as queries to the Twitter search API, and the same filtering and collection of tweets and their users was performed.

Annotation. Tweets containing at least one product mention were grouped by user and ordered chronologically to be presented to our annotators for labeling with the IDA + U purchase stages defined in Section 3.1. All tweets that do not express one of those stages were annotated with the artificial class N, so that our model is “open” to any type of class. Our annotators were two college students; they were given the definitions and shown examples of each of the IDA + U and N categories. Prior to labeling the data used in our experiments, the annotators practiced the full labeling process for a day, performing both individual labeling and discussion of disagreements. During this time, since the application of AIDA to tweets was new, a number of “gray” areas in defining a label were identified and resolved. Afterwards, they first labeled the tweets individually. Then the annotators were asked to discuss the labels of the tweets they did not agree on and to determine a final label.

Annotation challenges. We examined the disagreements between the annotators. Examples of tweets that both annotators initially mislabeled, which we consider the most difficult cases, are shown in Table 2. Note that several of the examples are subtle and may be considered as the N class without careful consideration.

Table 2 Examples of tweets that were difficult for both annotators to label.

The confusion matrix in Table 3 compares the final “ground truth” label the annotators agreed on against the erroneous label one of the annotators initially assigned to the post. Note that, aside from confusions with N, the confusions were mostly between “nearby” labels. That is, D was sometimes confused with I and A, but I and A were rarely confused. In addition, U was rarely confused with I, D, or A. There was a sizable number of errors where an annotator initially mislabeled an N as another class. But the majority of the annotator errors were due to one of the annotators not identifying an IDA + U post and labeling it as N instead. An example N tweet on which the annotators initially disagreed is as follows:

Table 3 Confusion matrix of single annotator errors on mobile device data.

Using the iPad means it’s harder to backchannel with @USER at #wacug.

The tweet is expressing unhappiness but not with the product of interest. Another example is as follows:

Figured out it’s a Microsoft system, so rebooted a couple of times and it was good.

This tweet is indicating problems with a product but the user does not seem to be unhappy about it. Similarly, there are IDA + U tweets which were initially labeled with N by one of the annotators, such as the following I tweet:

In a year where cameras on phones, have gotten so good, I wonder if HTC can finally pack in a decent shooter?

These errors are consistent with comments by the annotators that finding non-N posts among the many N posts (more than 90% of all tweets, see Table 4) was tedious. This feedback from our annotators demonstrates the importance of labeling tweets automatically with purchase stages, the task we tackle in this paper. To help our annotators, we decided to reduce the number of N tweets they needed to examine. This makes the labeling process faster since fewer tweets need to be read and evaluated by the annotators. Furthermore, we argue that it also makes the annotation task easier and less error-prone: by examining fewer tweets in total, the chances are reduced that an annotator misses a relevant tweet by accident. To achieve this, we trained a binary high-precision model to predict whether to label a tweet with a purchase stage or with N after our annotators had labeled about 70% of the mobile device tweets. We refer to this model as a “relevance classifier” in the remainder of the paper. To achieve high precision, the relevance classifier was biased towards predicting IDA + U and only predicted N if it was very confident. We used this classifier to prelabel the remaining unlabeled tweets and let the annotators focus on the potential IDA + U tweets only. In order to bias the model and reduce the number of false negative predictions (which would lead to falsely skipping a tweet expressing a purchase stage), we used a combination of a logistic regression (LR) classifier and an SVM which only output N if both classifiers agreed on it. Note that we did not use neural network classifiers in this step since it was not clear before our experiments whether they would be able to learn purchase-stage-indicative features from a rather small amount of labeled data. We expected LR classifiers and SVMs to be more robust in this setting. The two individual models of the relevance classifier were also skewed towards high precision by using class weights which were tuned with cross-validation. We informally evaluated the high-precision relevance classifier by training on 80% of the labeled tweets and testing on the remaining 20%. Only 1% of the IDA + U tweets were incorrectly removed while reducing the number of tweets to label by 62%.
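A minimal sketch of such a relevance classifier with scikit-learn (which we also use for our nonneural models, see Section 5.2); the feature setup and class weights below are illustrative stand-ins for the cross-validated values:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Weights skewed so that predicting N requires strong evidence
# (illustrative values; the actual weights were tuned with cross-validation).
WEIGHTS = {"N": 0.3, "IDAU": 1.0}


def train_relevance_classifier(texts, labels):
    """labels: 'N' vs. 'IDAU' (tweet expresses any purchase stage)."""
    vec = CountVectorizer(ngram_range=(1, 3))
    X = vec.fit_transform(texts)
    lr = LogisticRegression(class_weight=WEIGHTS, max_iter=1000).fit(X, labels)
    svm = LinearSVC(class_weight=WEIGHTS).fit(X, labels)
    return vec, lr, svm


def predict_relevance(vec, lr, svm, texts):
    X = vec.transform(texts)
    # Output N only if both classifiers agree on it; everything else is
    # kept as potentially relevant for the annotators.
    return ["N" if a == "N" and b == "N" else "IDAU"
            for a, b in zip(lr.predict(X), svm.predict(X))]
```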

Table 4 Statistics of the annotated data.

Annotation quality. The initial Cohen’s kappa between the annotators was 0.30 for all tweets. For tweets that both annotators labeled with any of IDA + U, Cohen’s kappa was 0.77. Given this agreement on the IDA + U classes and the second annotation round in which the annotators agreed on a final label for all ambiguous tweets, we can assume that the dataset has reasonably good annotation quality.

Tweet sequences. We temporally ordered the tweet sequence of each user. If the temporal distance between two tweets was more than 2 months, we split the sequence into two. We chose this distance heuristically based on a manual inspection of the time stamps and topics of tweets. Having a break of 2 months or more between two tweets indicated a change of topic in all inspected cases.
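A minimal sketch of this splitting step, assuming each tweet object exposes a created_at timestamp and using 60 days as a stand-in for the 2-month threshold:

```python
from datetime import timedelta


def split_sequences(tweets, max_gap=timedelta(days=60)):
    """Split a user's tweets into sequences wherever the gap between
    consecutive tweets exceeds max_gap (here roughly 2 months)."""
    tweets = sorted(tweets, key=lambda t: t.created_at)
    sequences, current = [], []
    for tweet in tweets:
        if current and tweet.created_at - current[-1].created_at > max_gap:
            sequences.append(current)  # close the sequence at the gap
            current = []
        current.append(tweet)
    if current:
        sequences.append(current)
    return sequences
```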

Statistics. Table 4 shows statistics of the annotated data, the number of users for whom we collected tweets, the number of collected tweets, the number of tweet sequences after splitting at 2-month boundaries (see above), and the label distribution (for both the mobile device product category used in our main experiments and the car product category used in our adaptation experiments).

3.3 Challenges

The collected dataset and the IDA + U identification task provide a variety of research challenges. One of them is the noise typical for Twitter data (abbreviations, acronyms, elongation of words, inconsistencies in spelling or tokenization, misspellings, and lack of grammatical structure). As described in Section 5.1, we use the publicly available scripts by Xu, Liang and Baldwin (Reference Xu, Liang and Baldwin2016) to apply Twitter-specific preprocessing steps to alleviate the noise. In addition, we experiment with word embeddings specifically trained on Twitter data using word2vec (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013) (see Section 4.1). In this subsection, we describe the three challenges we investigate in this paper.

Imbalanced data. The label distribution in Table 4 shows that the data are highly imbalanced. Many times users will talk about a product but are not necessarily interested in buying it. For instance, they might write about their experience with a product or just mention that someone else has bought a product. An example of this type of tweet is:

annoying cell phone talker on the bus is diagnosing her divorce. Loud enough to be heard over the ipod. I need louder music. Or a brick.

To cope with the imbalanced labels, we subsample the data and experiment with class weights and ranking loss (see Section 4.3). While subsampling and class weights are popular methods for dealing with imbalanced labels in the literature, cf., Chawla, Japkowicz and Kotcz (Reference Chawla, Japkowicz and Kotcz2004), we presented ranking loss as an alternative approach in our earlier work (Adel et al. Reference Adel, Chen and Chen2017). Our comparison of the model performance with and without the dominant (N) class shows that the data imbalance poses a very important challenge to our models.

Little training data. Table 5 in Section 5.1 shows that our models are trained on a rather small dataset. One reason for that is the supervised training which requires (manual) labels. To mitigate this problem, unsupervised or weakly supervised data can be added. In this work, we use word embeddings which have been pretrained on a large Twitter corpus and apply multitask training to learn more robust tweet representations from an additional dataset.

Table 5 Mobile device dataset statistics after preprocessing.

Product category dependence. Some of the terms and phrases in tweets expressing one of the IDA + U classes are independent of the product category, for example, “wanna buy.” However, others are dependent on the product category, for example, “fingerprint login” would apply to a mobile phone but not to a car. In order to identify purchase stages for a new product category, a new labeled dataset containing enough exemplars could be created for training. However, corpus creation of the size needed for training a deep learning system is time-consuming and expensive. We examine the alternative of creating a smaller labeled dataset for use in adapting a deep learning model to the new product category. Note that this follows the standard approach in domain adaptation: adapting a model learned on a larger source domain dataset to a smaller target domain dataset.

4 Model

In Adel et al. (Reference Adel, Chen and Chen2017), we proposed a hierarchical model for representing a sequence of tweets by a user and predicting whether a tweet expresses an IDA + U stage for a product. In this paper, we show additional experiments with this model and extend it to multitask learning and adaptation to a new product category. An overview of the hierarchical model is shown in Figure 1. Figure 2 shows how a representation for each tweet is created. Figure 3 depicts the purchase stage identification model which uses information from the current tweet as well as from earlier tweets to identify IDA + U stages.

Figure 1 Hierarchical model of tweets and a user’s tweet sequence.

Figure 2 (Color online) Tweet model: The words are represented by word embeddings and a CNN with multiple filters, possibly with different widths (blue: filter width = 1, red: filter width = 2, green: filter width = 3), is used to create a tweet representation.

Figure 3 Purchase stage model: A GRU models a user’s tweet sequence and a ranking layer determines the most probable outputs.

4.1 Creating tweet representations

To represent a tweet, we begin by representing each word by an embedding, skipping unknown words. We use publicly available embeddings trained with word2vec (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013) on Twitter data (Godin et al. Reference Godin, Vandersmissen, de Neve and van de Walle2015) and compare them to the public Google News embeddings and randomly initialized embeddings. To reduce the computation time, we limit the size of the vocabulary to the 100,000 most frequent words.

Convolutional neural network. To compute a tweet representation, we use CNNs on the sequence of word embeddings, as shown in Figure 2. Traditional methods like bag-of-words (BOWs) or averaging approaches have the disadvantage that they ignore order information of the word sequence. The two sentences “I bought a phone but no new case” and “I bought a new case but no phone,” for example, would get the same representations using these approaches. Deep learning methods like multilayer perceptrons can preserve sequence information but are very sensitive to word insertions. CNNs, on the other hand, are able to extract position-independent features. Since the CNN filters span several words in a sentence (depending on the filter width), they are also able to preserve order information within the filters. Moreover, CNNs have been used for many sentence classification tasks in the literature to create a sentence representation (Kalchbrenner et al. Reference Kalchbrenner, Grefenstette and Blunsom2014; Kim Reference Kim2014). Several CNN filters are slid over the embedded representation of words in a sentence, computing scores for each n-gram. Afterwards, pooling is applied to extract the most relevant scores and obtain a fixed-length sentence representation. In this work, we apply $k$-max pooling with $k=3$, following Kalchbrenner et al. (Reference Kalchbrenner, Grefenstette and Blunsom2014). In our main experiments, we use CNNs with a filter width of 3. However, in our analyses (see Section 7), we also investigate CNN models with multiple filter widths, for example, 1, 2, and 3 (as shown in Figure 2) to model unigrams, bigrams, and trigrams. After convolution and pooling, the results from the different filters are concatenated (Kim Reference Kim2014). In all settings, the height of the filters is the size of the word embeddings.
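The following PyTorch sketch illustrates this tweet encoder (our implementation used Theano; see Section 5.2). The embedding dimensionality and nonlinearity are placeholders, and, for simplicity, the k-max pooling below returns the top k activations sorted by value rather than preserving their sequence order as in Kalchbrenner et al. (Reference Kalchbrenner, Grefenstette and Blunsom2014):

```python
import torch
import torch.nn as nn


class TweetCNN(nn.Module):
    """CNN tweet encoder with a single filter width and k-max pooling."""

    def __init__(self, vocab_size=100_000, emb_dim=400, n_filters=300,
                 width=3, k=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=width)
        self.k = k

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        scores = torch.tanh(self.conv(x))         # one score per n-gram
        topk, _ = scores.topk(self.k, dim=2)      # k-max pooling per filter
        return topk.flatten(1)                    # (batch, n_filters * k)
```

Multiple filter widths, as in Figure 2, would correspond to several such convolution modules whose pooled outputs are concatenated.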

4.2 Modeling tweet sequences

To capture the context of prior tweets of a user, the inputs to our model are tweet sequences. The tweet representations described above are fed into a sequence model: an RNN. This enables the model to learn patterns across tweets, such as “a user expresses interest in a product before buying it but not vice versa.” In our dataset, about 27% of the users showed interest in a product and about 45% of the users expressed desire before mentioning that they bought the product.

Recurrent neural network. RNNs have been used in different NLP tasks, such as neural machine translation (Cho et al. Reference Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio2014; Bahdanau et al. Reference Bahdanau, Cho and Bengio2015) or question answering (Hermann et al. Reference Hermann, Kociský, Grefenstette, Espeholt, Kay, Suleyman and Blunsom2015). They are especially suited for sequence modeling since they are able to process sequences of different lengths and store information in their hidden layer which is updated in a recurrent way using the same weight matrix at every time step. For training with backpropagation, they are unfolded (backpropagation through time (Werbos Reference Werbos1990)). This results in the challenge of vanishing and exploding gradients and makes them hard to train (Pascanu, Mikolov and Bengio Reference Pascanu, Mikolov and Bengio2013). LSTM networks (Hochreiter and Schmidhuber Reference Hochreiter and Schmidhuber1997) use gates to overcome the vanishing gradient problem. We chose to use GRUs (Cho et al. Reference Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio2014) since they were shown to be as effective as LSTM networks but more efficient in training due to fewer parameters (Chung et al. Reference Chung, Gulcehre, Cho and Bengio2014). The GRUs generate a new hidden state $h^t$ at each time step t using the following update equations:

$$r = \sigma(W_r x + U_r h^{t-1}) \tag{1}$$
$$z = \sigma(W_z x + U_z h^{t-1}) \tag{2}$$
$$\tilde{h}^t = \tanh(Wx + U(r \odot h^{t-1})) \tag{3}$$
$$h^t = z \odot h^{t-1} + (1-z) \odot \tilde{h}^t \tag{4}$$

where x is the current input item, W and U are weight matrices which are learned during training, and r and z are the reset and update gates of the unit. To address exploding gradients, we use gradient clipping (Mikolov Reference Mikolov2012; Pascanu et al. Reference Pascanu, Mikolov and Bengio2013).
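For concreteness, one GRU update following Equations (1)–(4) can be written directly in NumPy (a sketch; bias terms are omitted, as in the equations above):

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def gru_step(x, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One GRU update: the weight matrices map the input x and previous
    hidden state h_prev to the gates and the candidate state."""
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate, Equation (1)
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate, Equation (2)
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))  # candidate, Equation (3)
    return z * h_prev + (1.0 - z) * h_tilde      # new state, Equation (4)
```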

Unidirectional versus Bidirectional GRU. Figure 3 depicts a unidirectional GRU as a tweet sequence model. To enrich the information available to the network, we also investigate bidirectional GRUs (Schuster and Paliwal Reference Schuster and Paliwal1997), which enable the network to use information from future tweets in addition to past tweets. While a bidirectional GRU might be more powerful than a unidirectional GRU, since it can take future information into account and has a larger number of parameters, its applications are more limited: it can only be applied if the full tweet sequence is given. For classifying tweets when they are posted, only unidirectional GRUs can be applied.

4.3 Output layer for imbalanced data

For each tweet, the model predicts a purchase stage. We propose to use a ranking layer approach for this and show that this is especially helpful for dealing with imbalanced data (see Section 3.3).

Given the output of the previous layer (e.g., the current hidden state of the tweet sequence model), we first apply a linear layer to map the output vector h to a vector s whose dimensionality equals the number of output classes. Thus, vector s contains one score for each output class.

$$s = V^T h \tag{5}$$

The matrix V which we use for the linear mapping is initialized randomly and learned during training. Afterwards, we calculate the ranking loss based on the score vector s during training or extract the class with the highest score during inference.

Ranking loss. We follow dos Santos, Xiang and Zhou (Reference Dos Santos, Xiang and Zhou2015) who introduced the following ranking loss function:

$$L = \log\big(1+\exp(\gamma(m^+ - s_{y^+}))\big) + \log\big(1+\exp(\gamma(m^- + s_{c^-}))\big) \tag{6}$$

where $s_{y^+}$ is the score for the correct class $y^+$ and $s_{c^-}$ is the score for the best competitive class $c^-$. Updating only two classes per training step is a common technique with ranking loss functions (Weston, Bengio and Usunier Reference Weston, Bengio and Usunier2011; Gao et al. Reference Gao, Pantel, Gamon, He and Deng2014; dos Santos et al. Reference Dos Santos, Xiang and Zhou2015). The intuition is to clearly separate the correct class from the other classes, represented by the best competitive one. For more details on this, we refer the reader to dos Santos et al. (Reference Dos Santos, Xiang and Zhou2015). The variables $m^+$ and $m^-$ are margins. By optimizing this function, the model learns to assign scores greater than $m^+$ for the correct class and scores smaller than $m^-$ for the incorrect classes. The scaling factor $\gamma$ can magnify the penalty for classification errors. Following the original implementation by dos Santos et al. (Reference Dos Santos, Xiang and Zhou2015), we set $m^+$ to 2.5 and $m^-$ to 0.5. However, we found it beneficial in our experiments to tune $\gamma$ on the development set. If $y^+=N$, only the second summand is evaluated. Moreover, $c^-$ is chosen only from the purchase-stage classes (IDA + U). This has the advantage that the model does not need to learn a pattern for the N class, which might not exist since that class conflates a lot of different patterns. At test time, N is only predicted if the scores for all other classes are negative. This is why the function can be applied to handle artificial classes (like our N class) for which it might not be possible to learn a specific pattern. Instead, the model learns to focus on the non-artificial classes. Therefore, we examine this loss function in the context of data which is imbalanced between IDA + U and N. In our experiments in Section 6.1, we compare it against a standard softmax output layer and against a softmax output layer with class weights.
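The loss and the corresponding inference rule can be sketched as follows (a NumPy version under the assumption that class index 0 denotes N; the γ value shown is a placeholder for the dev-tuned one):

```python
import numpy as np


def ranking_loss(scores, gold, gamma=2.0, m_pos=2.5, m_neg=0.5, n_index=0):
    """Ranking loss of Equation (6); scores holds one score per class,
    gold is the index of the correct class."""
    # c^- is the best competitive class among IDA+U only (never N).
    competitors = [i for i in range(len(scores))
                   if i != gold and i != n_index]
    s_neg = max(scores[i] for i in competitors)
    loss = np.log1p(np.exp(gamma * (m_neg + s_neg)))
    if gold != n_index:  # for y^+ = N, only the second summand is evaluated
        loss += np.log1p(np.exp(gamma * (m_pos - scores[gold])))
    return loss


def predict(scores, n_index=0):
    """Predict N only if the scores of all IDA+U classes are negative."""
    best = max((i for i in range(len(scores)) if i != n_index),
               key=lambda i: scores[i])
    return n_index if scores[best] < 0 else best
```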

4.4 Multitask learning

Multitask learning provides the possibility to train the network parameters more robustly by using additional data. In this paper, we jointly train our model on purchase stage identification and sentiment analysis using the Twitter-based SemEval2016 sentiment analysis shared task dataset (Nakov et al. Reference Nakov, Ritter, Rosenthal, Sebastiani and Stoyanov2016). Note that in contrast to early work on multitask learning (Caruana Reference Caruana1993), we use a different dataset for the secondary task, as is now commonly done (Collobert et al. Reference Collobert, Weston, Bottou, Karlen, Kavukcuoglu and Kuksa2011; Klerke et al. 2016; Ruder Reference Ruder2017). The sentiment analysis task assigns each tweet a sentiment (positive, negative, or neutral).

In particular, we share all parameters for computing tweet representations between the two tasks and then use task-individual output layers (softmax layer for sentiment analysis and tweet sequence layer plus ranking layer for purchase stage prediction). During training, we alternate between one batch of sentiment analysis and two batches of purchase stage prediction. This training schedule was chosen based on prior experiments in which we compared several alternatives with respect to their performance on the development set, such as alternating more frequently (after each batch of a task) or less frequently (after two or more batches of the different tasks).
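The schedule can be sketched as follows; train_purchase_batch and train_sentiment_batch are hypothetical wrappers around one update step of the respective task-specific output layer on top of the shared tweet-representation layers:

```python
import itertools


def multitask_epoch(model, purchase_batches, sentiment_batches):
    """One epoch alternating two purchase stage batches with one
    sentiment batch; the sentiment data is cycled if it runs out."""
    sent_iter = itertools.cycle(sentiment_batches)
    for i, batch in enumerate(purchase_batches):
        model.train_purchase_batch(batch)        # primary task update
        if i % 2 == 1:                           # after every 2nd batch,
            model.train_sentiment_batch(next(sent_iter))  # 1 sentiment batch
```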

Figure 4 illustrates which parts of the network are shared for multitask learning. Due to the two task-specific output layers, it is possible to train the network by alternating examples from the two tasks without the need of having a dataset with labels for both tasks.

Figure 4 Model for multitask learning. The black part (tweet representation) is shared between both tasks, the blue part (top left) is evaluated for sentiment data, and the red part (top right) is evaluated for purchase stage prediction instances.

Since the purchase stage prediction classes express different intensities of excitement and desire (positive sentiments) and the unhappiness class expresses a negative sentiment, we expect that the tweet representation layers of the two individual tasks share a lot of commonalities which can be exploited by multitask learning. Moreover, the input data for both tasks comes from the Twitter domain. Thus, providing the network with more examples of tweets can help to make the tweet representation layer more robust.

4.5 Product category adaptation

Models which are trained in an end-to-end fashion may especially suffer from specialization to a specific domain since their features highly depend on a single objective and the specific dataset they have been trained on. Therefore, we investigate the adaptability of our hierarchical model to tweets about another product category (cars). Thus, our source domain is mobile devices, and our target domain is cars. We compare four different approaches: (i) training a model on one product category (source) and applying it to another (target); (ii) training a model only on the target category for which we have little training data; (iii) training a model on the source product category and fine-tuning it on the target category (see Section 4.5.1); (iv) learning domain-invariant representations via adversarial training (Section 4.5.2).

4.5.1 Fine-tuning

For fine-tuning, we take an existing model (trained on the source domain) and update its weights for a small number of epochs on the limited amount of training data from the target domain. Thus, the model learns to handle examples from the target domain. It is not trained from scratch but can make use of its prior knowledge about the task and the source domain.
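In pseudocode terms, fine-tuning amounts to continuing training from the source weights for a few epochs; the interface and epoch count below are hypothetical placeholders, reusing the hypothetical train_purchase_batch wrapper from above:

```python
import copy


def fine_tune(source_model, target_batches, epochs=3):
    """Continue training a source-domain model on the small target set;
    all weights are updated, unlike adaptation-layer-only approaches."""
    model = copy.deepcopy(source_model)  # keep the source model intact
    for _ in range(epochs):              # only a few epochs, to avoid
        for batch in target_batches:     # overfitting the small target set
            model.train_purchase_batch(batch)
    return model
```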

4.5.2 Adversarial training

If there is no labeled data from the target domain available, it is neither possible to train a new model nor to fine-tune an existing model on the target domain. Instead, it might be beneficial to learn domain-invariant features. Adversarial domain adaptation is a mechanism to achieve this. As shown in Figure 5, we extend our model with a domain discriminator which attempts to classify the domain of the input given the same feature representation used for classifying the purchase stages. During training, the purchase stage classifier and domain discriminator are trained alternately in a multitask fashion. In contrast to the model in Figure 4, however, the feature extractor and the domain discriminator play a min–max game: the feature extractor f learns to fool the domain discriminator d while still providing meaningful representations to the purchase stage classifier y. This is shown in the following equations (more details are provided in Ganin et al. (Reference Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand and Lempitsky2016)):

$$L_d^i(\theta_f,\theta_d) = -\log p\big(D(x_i)\,|\,F(x_i)\big) \tag{7}$$
$$E(\theta_f,\theta_y,\theta_d) = \frac{1}{Y}\sum_{i=1}^{Y} L_y^i(\theta_f,\theta_y) - \lambda\left(\frac{1}{Z}\sum_{i=1}^{Z} L_d^i(\theta_f,\theta_d)\right) \tag{8}$$
$$(\hat{\theta}_f,\hat{\theta}_y) = \operatorname*{argmin}_{\theta_f,\theta_y} E(\theta_f,\theta_y,\hat{\theta}_d) \tag{9}$$
$$\hat{\theta}_d = \operatorname*{argmax}_{\theta_d} E(\hat{\theta}_f,\hat{\theta}_y,\theta_d) \tag{10}$$

Figure 5 Model for adversarial training. The shared feature extractor is depicted in black, the domain classifier in blue, and the purchase stage classifier in red. The green backwards lines illustrate the gradient flow. Note that there is only one feature extractor (tweet model) which is applied to each example tweet. Thus, the updates from the domain classifier and the purchase stage model affect the same tweet model.

Let F denote the function of the feature extractor and D the function of the domain discriminator. F computes a feature representation of the input (with a CNN in our model) and D computes scores for the different domains (with a softmax layer). Equation 7 describes $L_d^i$ , the loss of the domain discriminator for data sample i. We use cross entropy loss, the standard loss function for classification problems. It is based on the representation $F(x_i)$ computed by the feature extractor for input $x_i$ . $L_y$ is the loss function of the purchase stage classifier. In our model, we use a ranking loss based on the representation F(x) of the feature extractor and our tweet sequence model (see Equation 6).

The two loss functions are combined into a single objective function E in Equation (8). The weights of the three components (purchase stage classifier, domain discriminator, and shared feature extractor) are denoted by $\theta_y$, $\theta_d$, and $\theta_f$, respectively. The hyperparameter $\lambda$ controls the impact of the loss of the domain discriminator. It is tuned on the development set. Y denotes the number of labeled training examples (in the source domain), while Z is the number of unlabeled tweets from the source and target domain. Note that the domain discriminator is trained on example tweets from both domains (which are labeled with their domain but not with purchase-stage classes), while the purchase stage classifier sees only examples from the source domain if no labeled target domain data are available.
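This min–max game is commonly implemented with the gradient reversal layer of Ganin et al. (Reference Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand and Lempitsky2016): the identity in the forward pass and a negated, scaled gradient in the backward pass. A PyTorch sketch follows (our implementation used Theano; the module and function names here are illustrative):

```python
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity forward; backward negates the gradient and scales it by lam."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient for lam


def adversarial_loss(features, domain_head, domain_labels, stage_loss, lam=0.1):
    """Total loss in the spirit of Equation (8): minimizing it trains the
    domain head as a discriminator, while the reversed gradient pushes the
    shared feature extractor towards domain-invariant representations."""
    domain_logits = domain_head(GradReverse.apply(features, lam))
    domain_loss = F.cross_entropy(domain_logits, domain_labels)
    return stage_loss + domain_loss
```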

5 Experimental setting

In this section, we describe the data preprocessing, model optimization, and evaluation method.

5.1 Data preprocessing

For preprocessing the tweets, we apply the publicly available scripts from the University of Melbourne (Xu et al. Reference Xu, Liang and Baldwin2016) to address some of the idiosyncrasies of tweets. They apply twokenize (Owoputi et al. Reference Owoputi, O’Connor, Dyer, Gimpel, Schneider and Smith2013) for tokenization and perform some basic cleaning steps, such as replacing URLs with a special token or normalizing elongated words (e.g., $\text{“looooong”} \rightarrow \text{“long”}$).
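A rough regex-based stand-in for these cleaning steps (the actual pipeline applies the Melbourne scripts with twokenize; this only illustrates the kind of normalization performed):

```python
import re

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
ELONG_RE = re.compile(r"(\w)\1{2,}")  # 3+ repetitions of one character


def clean_tweet(text):
    text = URL_RE.sub("<url>", text)   # replace URLs with a special token
    text = USER_RE.sub("@USER", text)  # anonymize user mentions
    text = ELONG_RE.sub(r"\1", text)   # "looooong" -> "long"
    return text
```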

We split the cleaned data into training, development (dev), and test sets (80%, 10%, and 10%, respectively) and filtered them to reduce the effect of imbalanced labels. For this, we used the relevance classifier from the annotation process and skipped tweet sequences if every tweet of the sequence was classified as not relevant. Since the relevance classifier has been trained to produce as few false negative predictions as possible, the number of falsely skipped tweet sequences is negligibly small. In addition, we randomly subsampled the N tweets. Table 5 provides statistics for the resulting dataset.

In Table 6, we show the distribution of the number of distinct purchase stages (i.e., IDA + U) per tweet sequence in the dataset. The high frequency of 0 purchase stages indicates that for more than half of the tweet sequences, all tweets are labeled with N. The higher frequency of non-purchase stages was expected. However, the analysis also shows that a considerable number of tweet sequences include two or more distinct purchase stages. To further analyze the tweet sequences of our dataset, we count the pairwise transitions between purchase stages. Table 7 shows that, as expected, most transitions happen either within the same purchase stage (diagonals of the heat maps) or towards more advanced purchase stages (e.g., from I to A but not from A to I). An exception is the stage U, which can be source and target of a transition to an almost equal extent. Again, this was expected since it expresses a (negative) sentiment state which can either stem from recently buying a product or can result in the purchase of another product in the future. These observations motivate using a tweet sequence layer on top of the tweet representation layer (see Section 4.2).

Table 6 Number of distinct purchase stages per tweet sequence (in percent).

Table 7 Number of transitions between purchase stages.

5.2 Model optimization

We implement the neural networks with Theano (Theano Development Team 2016) and the nonneural classifiers with scikit-learn (Pedregosa et al. Reference Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay2011). For training the neural networks, we use stochastic gradient descent and shuffle the training data at the beginning of each epoch. We apply AdaDelta as the learning rate schedule (Zeiler Reference Zeiler2012) and use a batch size of three for our experiments. We used a comparatively small batch size for two reasons: first, our training dataset is rather small, and smaller batch sizes result in more updates during training. Second, we aggregate the loss of all tweets per tweet sequence; that is, with batches of three tweet sequences, we in fact average not just three values (one per tweet sequence) but many more (one for each tweet in the three sequences). For regularization, we add the L2 regularization term to the cost function with weight $l=0.00001$ and perform early stopping on the dev set. To avoid exploding gradients, we clip the gradients at a threshold of $t = 1.0$. The hyperparameters are optimized on dev, resulting in a CNN filter size of 3, a total number of 300 filters, and 100 hidden units for the GRU sequence model. Note that we use only one filter width for the CNN in our main experiments; in the analysis in Section 7.1, we also compare to multiple filter widths as shown in Figure 2.

5.3 Evaluation measure

Since the number of IDA + U tweets is much lower than the number of N tweets, the task of identifying the IDA + U tweets is similar to a retrieval task, where the number of relevant documents is small compared to the number of nonrelevant documents. However, here we simultaneously differentiate among the four IDA + U classes. To evaluate performance, we use macro $F_1$, the average of the $F_1$ scores equally weighted across the IDA + U categories. The $F_1$ score is the harmonic mean of precision P and recall R: $F_1 = \frac{2\cdot P\cdot R}{P+R}$.
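A minimal implementation of this measure, averaging per-class $F_1$ over the IDA + U categories only, could look as follows. N never enters the average directly but affects the scores through false positives and false negatives of the other classes:

```python
def macro_f1(gold, pred, classes=("I", "D", "A", "U")):
    """Macro F1 over the IDA + U classes, in percent."""
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return 100.0 * sum(f1s) / len(f1s)  # in percent, as in the tables
```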

6 Experimental results

We conduct three sets of experiments: models for purchase stage identification from tweets about mobile devices with a special focus on the impact of the dominant (N) class on the results, multitask experiments with sentiment analysis data, and adaptation to tweets about cars.

6.1 Purchase stage identification

In this section, we present an extended evaluation of our model from Adel et al. (Reference Adel, Chen and Chen2017) to show its general performance and validate our choices of the individual layers.

Tweet representation models. We first compare our convolutional layer to other ways of calculating tweet representations: n-gram BOW representations for nonneural models, averaging word embeddings, and GRUs with attention.

N-gram BOWs. For our nonneural models, that is, SVM and LR, we compute BOW vectors using the vocabulary given by the Twitter embeddings. In particular, we consider the combination of unigrams, bigrams, and trigrams of the input tweets. For both our nonneural and neural models, we only use features that can be derived directly from the input word sequence. The reason is that we want to compare and evaluate the power of different models without handcrafted feature engineering. Moreover, this has the advantage that we do not rely on expensive data preprocessing or any auxiliary models or tools.

Average over Word Embeddings. The most straightforward way of creating a tweet representation (i.e., a phrase embedding) is averaging the embeddings of the single words (cf., Mikolov et al. (Reference Mikolov, Chen, Corrado and Dean2013), Le and Mikolov (Reference Le and Mikolov2014), Lebret, Pinheiro and Collobert (Reference Lebret, Pinheiro and Collobert2015), or Korpusik et al. (Reference Korpusik, Sakaki, Chen and Chen2016)). The idea behind this is similar to the traditional BOW vectors which also accumulate all information from a phrase or a sentence.
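As a sketch, assuming emb maps words to numpy vectors, this baseline reduces to a few lines; skipping unknown words is one of several reasonable choices:

```python
import numpy as np

def average_embedding(tokens, emb, dim):
    """Tweet representation as the mean of its word embeddings;
    out-of-vocabulary words are skipped, dim is the embedding size."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```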

GRUs with Attention. An alternative option to CNNs for processing the words of a sentence is RNNs. In particular, we apply bidirectional GRUs (Cho et al. Reference Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio2014) to the word sequence of a tweet, following state-of-the-art work on sentence processing (e.g., Hermann et al. Reference Hermann, Kociský, Grefenstette, Espeholt, Kay, Suleyman and Blunsom2015; Yang et al. Reference Yang, Yang, Dyer, He, Smola and Hovy2016). In the bidirectional GRU, the final tweet representation is the concatenation of the last forward hidden state and the first backward hidden state, that is, the two hidden states that have seen all the words of the input tweet.

Attention allows the model to focus on the most relevant input words or representations. It has proven useful in many text processing tasks (e.g., Bahdanau et al. Reference Bahdanau, Cho and Bengio2015; Hermann et al. Reference Hermann, Kociský, Grefenstette, Espeholt, Kay, Suleyman and Blunsom2015; Adel and Schütze Reference Adel and Schütze2017). For each GRU hidden state $x_i$, we calculate the attention weight $\alpha_i$ with a softmax layer:

$$\begin{equation} \alpha_i = \frac{\exp(V^T x_i)}{\sum_j \exp(V^T x_j)} \end{equation}$$

where V is a parameter of the layer which is initialized randomly and learned during training. The final tweet representation is the weighted sum of all hidden states (instead of only the last hidden state as described above): $\sum_i \alpha_i \cdot x_i$.
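The following numpy sketch implements this attention pooling directly from the equation above:

```python
import numpy as np

def attention_pool(hidden_states, V):
    """Attention over GRU hidden states; hidden_states has shape (T, d),
    V has shape (d,). Returns the weighted sum sum_i alpha_i * x_i."""
    scores = hidden_states @ V                  # V^T x_i for every position i
    scores -= scores.max()                      # for numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()
    return alphas @ hidden_states
```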

Results. The results of the different tweet representation models are given in Table 8 and show that our proposed convolutional layer performs the best. Note that $F_1$ for random guessing is 4.17 and 4.02 for dev and test, respectively, which indicates the difficulty of the task.

Table 8 Macro $F_1$ scores (in percent) for different tweet representation models; rep = representation, emb = embeddings, att = attention; in all neural cases: GRU as tweet sequence model trained with cross entropy loss without class weights.

For all neural models, using Twitter embeddings outperforms Google News embeddings. We suspect two reasons for this: the domain mismatch between news and Twitter data, and a higher number of out-of-vocabulary words. With Google News embeddings, 61.4% (29.7%) of all types (tokens) are unknown; this decreases to 48.8% (4.2%) with Twitter embeddings. For our baseline neural network model with averaged embeddings, we also compare the pretrained embeddings to randomly initialized embeddings (row t2a in Table 8). As expected given our limited amount of training data, pretrained embeddings perform better than randomly initialized embeddings. This is in line with the results of other studies on various NLP tasks (e.g., Kim Reference Kim2014).

Tweet representations computed by GRUs (rows t3a–t3c) or CNNs (t4a–t4b) clearly outperform simply averaging word embeddings (t2a–t2c) and nonneural models with a BOW tweet representation (t1a, t1b). Averaging word embeddings discards all word order information and offers no way of focusing on task-relevant words. Attention clearly improves the results of the GRUs (t3b and t3c vs. t3a) since it increases the ability of the model to focus on intermediary hidden states and, thus, relevant input words. We also experimented with attention for CNNs but did not observe improvements, probably because the pooling mechanism of CNNs is already a powerful selection mechanism.

Dealing with imbalanced data. Next, we summarize our results from Adel et al. (Reference Adel, Chen and Chen2017) on different ways of dealing with imbalanced data (see Table 9). We compare our proposed ranking loss to traditionally used class weights in combination with cross entropy loss (softmax output layer). When using class weights, the error of the model is weighted (i.e., multiplied) by a misclassification cost $w > 1$ if the reference is one of the (non-artificial) IDA + U classes; for other misclassifications, the cost is left unweighted, that is, multiplied by $w = 1$. Thus, the model is penalized more for false negatives than for false positives. In combination with gradient descent, this means that the parameter updates after a false negative prediction are larger. The weight $w_i$ for class i is calculated from the inverse frequency $f_i$ of the class: $w_i = \frac{n}{c \cdot f_i}$ with n being the total number of samples and c the number of classes. The weights are normalized so that the weight for class N is 1.
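A short sketch of this weighting scheme (the example label counts are invented):

```python
from collections import Counter

def inverse_frequency_weights(labels, classes=("I", "D", "A", "U", "N")):
    """Class weights w_i = n / (c * f_i), normalized so that the weight
    of the majority class N equals 1."""
    freq = Counter(labels)
    n, c = len(labels), len(classes)
    weights = {cls: n / (c * freq[cls]) for cls in classes}
    norm = weights["N"]                 # normalize w.r.t. class N
    return {cls: w / norm for cls, w in weights.items()}

# Example: 90 N tweets and 10 purchase-stage tweets (made-up counts).
print(inverse_frequency_weights(["N"] * 90 + ["I"] * 4 + ["D"] * 3 + ["A"] * 2 + ["U"]))
```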

Table 9 Macro $F_1$ (in percent) for various methods of dealing with class imbalance; CE = cross entropy, SH = squared hinge; GRU as tweet sequence model for neural models.

Results. Ranking loss outperforms cross entropy loss with and without class weights for both the GRU and the CNN (i2c, i3c). For the nonneural models, adding class weights helps especially on the test set. Using CNNs for tweet representations (i3c) improves performance on both dev and test and leads to the best results. These results are in contrast to previous studies which found that SVMs outperformed neural networks on imbalanced data (Chawla et al. Reference Chawla, Japkowicz and Kotcz2004) and confirm that ranking is a good approach to cope with this challenge.
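The ranking loss itself is defined in Section 4 and not repeated here. Assuming it follows the pairwise ranking objective of Dos Santos et al. (Reference Dos Santos, Xiang and Zhou2015), which is cited in the references, a per-example sketch looks as follows; gamma and the margins are the defaults of that paper, not necessarily the values used in this work:

```python
import numpy as np

def ranking_loss(scores, gold, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise ranking loss for one example: push the gold-class score
    above the margin m_pos and the best competing score below -m_neg."""
    s_pos = scores[gold]
    s_neg = max(s for c, s in enumerate(scores) if c != gold)
    # log(1 + exp(x)) computed stably via logaddexp
    return (np.logaddexp(0.0, gamma * (m_pos - s_pos))
            + np.logaddexp(0.0, gamma * (m_neg + s_neg)))
```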

Tweet sequence models. Finally, we investigate the impact of the tweet sequence model on the network which has performed best so far: the CNN-based tweet representations in combination with the ranking loss. We compare unidirectional and bidirectional GRUs as tweet sequence models against no tweet sequence model, that is, identifying the class of each tweet on its own by applying a fully connected hidden layer on each tweet representation. Omitting the tweet sequence model makes our architecture comparable to approaches from related work, such as Ding et al. (Reference Ding, Liu, Duan and Nie2015) or Yan et al. (Reference Yan, Duan, Chen, Zhou, Zhou and Li2017).

Results. Table 10 provides the results for different tweet sequence models: using a unidirectional GRU as a tweet sequence model (s2) outperforms treating each tweet individually with a neural network (s1). This indicates that the information from previous tweets helps when classifying the purchase stage of the current tweet.

Table 10 Macro $F_1$ scores (in percent) for different tweet sequence models.

For the sequence models, a unidirectional GRU (s2), that is, classification without look-ahead, outperforms a bidirectional GRU (s3), which takes the future context into account. We assume that the reason is the limited amount of training data. Note that the decrease in performance from dev to test is larger for the bidirectional GRU than for the unidirectional GRU. A unidirectional GRU has fewer parameters than a bidirectional GRU and might thus be more robust against overfitting when training data are scarce.

6.2 Performance on imbalanced data

In order to investigate the impact of the imbalance of our dataset on the performance of the different models, we compare the results of the models on the whole dataset (IDA + UN, called “with N” in this section) with their results on only the purchase-stage classes (IDA + U, called “without N” in this section). In particular, we use the same models for both evaluation setups, trained on the whole training set including all labels, but only consider IDA + U references and predictions for the “without N” scores. Table 11 shows the results. The large performance difference between the two evaluation setups shows that classifying a tweet as a purchase stage (any of IDA + U) or not (N) is much more challenging for the model than correctly identifying the actual purchase stage. The good results on the purchase-stage classes (“without N”) confirm that our models are reasonably good at identifying the correct purchase stage given a tweet expressing one of the purchase stages. Comparing the “without N” scores of the nonneural models with class weights to the scores of the neural model with ranking shows that the neural models are clearly better at distinguishing the different purchase stages. However, they suffer more from the class imbalance than the nonneural models. As a result, the score of the GRU + att model “with N” is slightly worse than the score of the SVM, while it is better “without N.” Also, the CNN model outperforms the other models more clearly on the purchase-stage classes (“without N”) than on all classes (“with N”).

Table 11 Macro $F_1$ of models when testing on data with or without N class.

Table 12 shows the confusion matrix for the best neural model (CNN tweet representation with unidirectional GRU sequence model). In total, over 90% of the confusions involve the N class. Most of the remaining confusions involve neighboring labels, such as I and D or D and A. Similar to Table 11, this shows that the model has learned to distinguish the purchase stages and that class imbalance is the most important difficulty. This challenge remains even when first applying the relevance classifier from the annotation process, since it only filters out those N tweets for which it is very confident (i.e., “easy” decisions), while our models most probably struggle with the harder cases.

Table 12 Confusion matrix on the test set for the best model (CNN + unidirectional GRU + ranking).

Similarly, our annotators had more difficulty distinguishing N from the purchase stages, with a kappa (agreement) score of 0.30 over all tweets, but found the purchase stages relatively distinguishable, as evidenced by their higher kappa score of 0.77 when both annotators identified a tweet as being one of IDA + U (see Section 3.2). A possible reason for this is that the N class is artificially created and comprises many different patterns. This high diversity within the class may lead to difficult decisions, especially for gray area cases. In future work, we consider it worthwhile to look at the gray area cases in more detail. It might be interesting, for example, to cluster N tweets and identify subclasses of user-product relations.

6.3 Multitask learning

To address the challenge of a limited amount of training data, we investigate multitask learning by adding sentiment analysis as a second task during training. In addition to the macro $F_1$ score, we also report class-wise $F_1$ scores in Table 13 to show which classes are improved the most by multitask learning. Lines (m1a) and (m2a) show the results without multitask learning, and lines (m1b) and (m2b) provide the results with multitask learning. The results reveal that multitask learning improves the class-wise results in six out of eight cases, especially for the U class. This is reasonable since we use sentiment analysis data and U is the only class with a negative sentiment. For the model with CNN tweet representations, the macro $F_1$ score is also improved considerably. Since the tweet representations are trained on two different tasks in the multitask learning experiments, they may be more general and less likely to overfit to a particular task.
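Schematically, multitask learning shares the tweet representation layer between both tasks while each task has its own output layer. The following PyTorch sketch illustrates this idea; the paper's implementation is in Theano, and the vocabulary size, the three sentiment classes, and the omission of the GRU sequence layer and ranking loss are all simplifying assumptions here:

```python
import torch
import torch.nn as nn

class TweetEncoder(nn.Module):
    """Shared CNN tweet encoder: embedding lookup, 1D convolution,
    and max-over-time pooling (a sketch, not the paper's code)."""
    def __init__(self, vocab_size, emb_dim=100, n_filters=300, width=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=width, padding=1)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        return torch.relu(self.conv(x)).max(dim=2).values

encoder = TweetEncoder(vocab_size=50000)          # vocabulary size is illustrative
purchase_head = nn.Linear(300, 5)                 # I, D, A, U, N
sentiment_head = nn.Linear(300, 3)                # e.g., positive/negative/neutral
params = (list(encoder.parameters()) + list(purchase_head.parameters())
          + list(sentiment_head.parameters()))
optimizer = torch.optim.Adadelta(params)

def multitask_step(token_ids, labels, head):
    """One update: the task-specific head and the shared encoder are both
    trained, so gradients from either task shape the tweet representation."""
    loss = nn.functional.cross_entropy(head(encoder(token_ids)), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```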

Table 13 Class-wise and macro $F_1$ scores (in percent) on test data for different tweet representation models and multitask learning; bidirectional GRU with ranking layer was used as sequence model for main task.

6.4 Product category adaptation

We compare our proposed CNN + GRU model against a baseline BOW + SVM model to assess the effectiveness of each when predicting user purchase stages for a new product category. Four adaptation methods are compared: (i) none: examines how well the mobile device model performs when tested on the new product category of cars, (ii) new model: a new model is trained on the available car training data, (iii) fine-tuning: the mobile device model is retrained on the available car training data (see Section 4.5.1),Footnote d and (iv) adversarial: a domain discriminator forces the network to learn domain-invariant features (see Section 4.5.2).
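For the adversarial setup (iv), assuming the standard gradient reversal formulation of Ganin et al. (Reference Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand and Lempitsky2016), which is cited in the references, a PyTorch sketch of the reversal layer looks as follows:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass, flips the
    gradient sign on the backward pass so that the shared feature
    extractor is trained to fool the domain discriminator."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: domain_logits = domain_classifier(grad_reverse(tweet_rep)),
# purchase_logits = purchase_classifier(tweet_rep); both losses are summed
# and backpropagated into the shared tweet encoder.
```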

Table 14 provides statistics for the car dataset after preprocessing as described in Section 5.1. Note that the number of car training sequences is less than 25% of the number of mobile device training sequences. We split the tweet sequences almost evenly into a training and a test set to ensure that both datasets are as large as possible: if the training set were too small, our models would not be able to learn anything reasonable from it; if the test set were too small, we would not be able to draw any meaningful conclusions from the results. Nevertheless, the resulting training dataset is rather small. Therefore, we do not assign a fixed subset for development but tune the models with cross-validation. Only for the adversarial experiments and the analysis presented in Figure 6 do we use a fixed subset of the training set for early stopping. The test set was held out for the final evaluation in all experiments.

Table 14 Car dataset statistics after preprocessing to reduce imbalanced labels.

Table 15 shows the results for the baseline BOW + SVM model and our neural model with CNN tweet representations. We compare two settings: one in which labeled training data from the target domain are available and one without. In general, the neural models outperform the baseline SVM across all adaptation methods and both setups, indicating that neural domain adaptation is effective. The better performance of the bidirectional GRU compared to the unidirectional GRU could be explained by the larger capacity (more parameters) of the bidirectional model. The poor performance with no adaptation (first column) confirms that model performance depends on the product category and that domain adaptation is needed for handling differences between domains, such as vocabulary and word usage. The third through fifth columns show that it is beneficial to use data from the target domain if available: a new model trained on a limited amount of data from the new domain may perform better than a model trained for a different domain. Moreover, when target domain data are available, fine-tuning is the best adaptation method. The word embeddings and continuous feature representations of a neural network especially enable adaptation by fine-tuning with a limited amount of data from the new domain. If there is no labeled data from the target domain available, adversarial training is the best choice; it clearly outperforms applying the source model to the target domain.

Table 15 Macro $F_1$ comparing three methods for adapting a purchase stage prediction model to a new product category.

It is important to note that the CNN + GRU models have better performance than the baseline BOW + SVM model and further enable domain adaptation without any labeled training data from the target domain (with adversarial training). Thus, another advantage of the CNN + GRU model over the BOW + SVM model is more effective adaptation to new product categories.

Figure 6 analyzes the effect of adding labeled training data from the target domain on the performance of fine-tuning and adversarial training. As expected, the performance increases for both methods with more data from the target domain. The curve of the model with adversarial training is smoother, indicating that the model is more robust in general. Especially with no or little labeled target data, adversarial training leads to better performance. However, if more labeled target examples are available, fine-tuning a model to the new domain is a better choice than creating domain-invariant input representations.

Figure 6 Macro $F_1$ of domain adaptation w.r.t. the number of labeled target samples.

Since the label distribution in the car test set is different from the label distribution in the mobile test set (see Table 14 vs. Table 5), we show confusion matrices for the CNN + bidirectional GRU with and without adaptation in Table 16. Again, most confusions happen with the N class and the number of purchase-stage-class confusions stays small. However, domain adaptation leads to a slight increase of confusions with the D class due to the larger proportion of D in the car training data.

Table 16 Confusion matrices on the car test set for the CNN + bidir GRU model.

7 Analysis

In this section, we analyze the results of our model. First, we investigate the effect of different filter widths for the CNN as well as the choice of n-grams for the BOW features for the SVM. Then, complementary to Ribeiro, Singh and Guestrin (Reference Ribeiro, Singh and Guestrin2016) who describe how to “explain the prediction of any classifier,” we examine the type of words and phrases that our classifier considers important by: (i) extracting the most representative n-grams for each of our predicted classes and (ii) investigating the attention weights learned by the GRU + attention tweet representation model.

7.1 Effect of N-gram size

The input to the baseline SVM is n-grams. In this analysis, we compare the results of the SVM (with class weights) when using only unigrams, bigrams, or trigrams to combining all three n-gram sizes in the input BOW vector. Table 17 shows that the SVM performance decreases dramatically with increasing n-gram size, probably due to data sparseness. However, the combination of unigrams, bigrams, and trigrams results in the best SVM model.
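With scikit-learn, which we use for the nonneural classifiers, the combined n-gram setup can be expressed in a few lines. The toy tweets and labels below are invented for illustration, and class_weight="balanced" computes the same inverse-frequency scheme n/(c·f_i) described in Section 6.1 (without the extra normalization w.r.t. class N):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented toy tweets and labels, for illustration only.
tweets = ["i really need the new iphone",
          "just got my new phone",
          "my phone screen broke again"]
labels = ["D", "A", "U"]

# Combined unigram-to-trigram BOW features, the best SVM setup in Table 17.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)

clf = LinearSVC(class_weight="balanced").fit(X, labels)
```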

Table 17 Macro $F_1$ on test for different n-gram sizes for two models: (1) SVM + BOW and (2) CNN + unidirectional GRU with different filter sizes.

The filters of the CNN also span n-grams (the n-gram size is defined by the filter width). Thus, a similar analysis can be performed for the CNN tweet representation model. For this, we train the model with different filter sizes, tuning the number of filters and the number of hidden units for the GRU sequence model on dev. Table 17 shows that the CNN-based hierarchical model outperforms the corresponding SVM model on test data for all n-gram sizes except unigrams. Moreover, the CNN-based model appears to be much more robust to data sparseness at higher n-gram sizes than the SVM. This can be explained by the use of word embeddings, which enable generalization over word meanings, so that the sparseness of higher-order n-grams is reduced and the model can effectively make use of the additional information from larger n-grams. The hierarchical CNN-based model with unigrams, bigrams, and trigrams (i.e., convolutional filters of widths 1, 2, and 3) performs best on test data. Note that its performance is close to that of the CNN using trigrams alone (filters only of width 3), but the combination of n-gram sizes may provide some robustness.

7.2 Analysis of representative trigrams

To investigate the most representative features for our models, we extract the most frequent trigrams for each predicted IDA + U class from the mobile test data (see Table 18). Except for “an apple watch” and “an iphone NUM,” all the top trigrams are representative phrases for their specific purchase stage. For example, the trigrams used to predict Interest indicate questions or waiting for something. The trigrams for Desire show that the user “needs” or “cannot wait” to get something. The trigrams for predicting Action include the word “new” indicating that the user just bought something. Finally, Unhappiness is predicted based on trigrams indicating that a phone is broken. The low frequency of most of these trigrams indicates that there are many ways that tweets may signal a purchase stage. Informal examination of the top 50 trigrams of each class indicates a diverse set that could help in constructing queries for identifying additional tweets for each class.
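The trigram extraction underlying Table 18 can be sketched as follows, assuming tokenized test tweets and the corresponding model predictions:

```python
from collections import Counter

def top_trigrams(tweets, predictions, label, k=10):
    """Most frequent trigrams among test tweets predicted as `label`;
    tweets are lists of tokens, predictions the model outputs."""
    counts = Counter()
    for tokens, pred in zip(tweets, predictions):
        if pred == label:
            counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts.most_common(k)
```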

Table 18 Most frequent trigrams and occurrence count in test data for each predicted IDA + U class.

7.3 Analysis of attention weights

Figure 7 plots attention weights calculated by the GRU + attention model as heat maps. The sentences were randomly selected from the test set. The plots show that the network assigns the highest weights to the parts most relevant to the particular label (Interest: “anyone selling,” Desire: “still looking for,” Action: “finally got,” and Unhappiness: “sad about”).

Figure 7 Heat maps: exemplary attention weights for randomly selected examples from the IDA + U classes.

8 Conclusions and future work

We presented and tackled three different challenges of purchase stage identification from Twitter data: highly imbalanced data, little training data, and adaptation of models to another product category.

Our experiments indicate that tweets do signal purchase stages. Our neural network model achieved the best performance compared to several other neural and nonneural models of tweets and tweet sequences. For coping with imbalanced data in neural networks, we showed that a ranking layer approach outperformed class weights. Our confusion matrix analysis indicates that the large imbalance in the N class is the primary challenge, and our results on actual purchase-stage instances confirmed that the model is able to distinguish the purchase stages well from each other.

In order to improve the purchase stage identification model given our limited amount of training data, we performed multitask learning. We trained the network simultaneously on purchase stage identification and sentiment analysis, and noted better performance, especially for the class expressing unhappiness with a product.

To make the model applicable to another product domain, we applied domain adaptation with adversarial training and fine-tuning of weights. We observed that our model is indeed adaptable to other product categories mentioned in tweets. If there is labeled data from the target domain available, the most effective method for adaptation is fine-tuning model weights. In scenarios without labeled data from the target domain, adversarial training is most promising.

Finally, our analysis of trigrams and attention weights suggested that the word and phrase features automatically identified by our model are indicative of the purchase decision stage classes.

For future work, we are interested in further investigating the N class and potential user-product relation subclasses within it. Moreover, it is possible to extend the set of purchase-stage classes, for example, with a more general Review stage after the buying event. In a manual inspection of the data, we found that many tweets describing the Bought event could also be classified as providing a brief review of the product. Therefore, it might be interesting to treat the task as a multilabel problem in the future. The multitask setup can be extended to more auxiliary tasks. While sentiment analysis mainly helped the classification of unhappiness (U class), other auxiliary tasks might be useful for other classes. A promising task might be the classification of the sentence type (statement, question, or exclamation) since a user who is interested in a product might ask for additional information by posing a question, while a user tweeting that they bought something would more likely use a statement or an exclamation. Regarding domain adaptation, possible future research directions are a quantitative investigation of the necessary number of data items in the source versus the target domain as well as an extension to multiple domains, which could improve the learning of domain-independent features.

Footnotes

c This ratio follows the suggestion at https://cs230-stanford.github.io/train-dev-test-split.html

d For “fine-tuning” the BOW + SVM model, the model was trained on a combination of the mobile device data and the car data.

References

Adel, H., Chen, F. and Chen, Y. (2017). Ranking convolutional recurrent neural networks for purchase stage identification on imbalanced Twitter data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain: Association for Computational Linguistics, pp. 592–598.
Adel, H. and Schütze, H. (2017). Exploring different dimensions of attention for uncertainty detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain: Association for Computational Linguistics, pp. 22–34.
Asur, S. and Huberman, B.A. (2010). Predicting the future with social media. In 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2010, Main Conference Proceedings, Toronto, Canada: IEEE Computer Society, pp. 492–499.
Bahdanau, D., Cho, K. and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.
Benevenuto, F., Magno, G., Rodrigues, T. and Almeida, V. (2010). Detecting spammers on Twitter. In CEAS 2010 - Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, Redmond, WA, USA: CEAS Conference.
Bollen, J. and Mao, H. (2011). Twitter mood as a stock market predictor. Computer 44(10), 91–94.
Caruana, R.A. (1993). Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the 10th International Conference on Machine Learning, ICML, Amherst, MA, USA, pp. 41–48.
Chawla, N.V., Japkowicz, N. and Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter 6(1), 1–6.
Chen, X., Sun, Y., Athiwaratkun, B., Weinberger, K. and Cardie, C. (2018). Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics 6, 557–570.
Chen, X., Tan, T., Liu, X., Lanchantin, P., Wan, M., Gales, M.J.F. and Woodland, P.C. (2015). Recurrent neural network language model adaptation for multi-genre broadcast speech recognition. In INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, pp. 3511–3515.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, pp. 1724–1734.
Chung, J., Gulcehre, C., Cho, K. and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, Montreal, QC, Canada.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K. and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, 2493–2537.
Davenport, T.H., Dalle Mule, L. and Lucker, J. (2011). Know what your customers want before they do. Harvard Business Review 89, 84–92.
Ding, X., Liu, T., Duan, J. and Nie, J. (2015). Mining user consumption intention from social media using domain adaptive convolutional neural network. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, pp. 2389–2395.
Dos Santos, C., Xiang, B. and Zhou, B. (2015). Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China: Association for Computational Linguistics, pp. 626–634.
Dukesmith, F.H. (1904). Three natural fields of salesmanship. Salesmanship 2(1), 1–4.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M. and Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(1), 2096–2130.
Gao, J., Pantel, P., Gamon, M., He, X. and Deng, L. (2014). Modeling interestingness with deep neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, pp. 2–13.
Godin, F., Vandersmissen, B., de Neve, W. and van de Walle, R. (2015). Multimedia lab @ ACL WNUT NER shared task: Named entity recognition for Twitter microposts using distributed word representations. In Proceedings of the Workshop on Noisy User-generated Text, Beijing, China: Association for Computational Linguistics, pp. 146–153.
Gui, T., Zhang, Q., Huang, H., Peng, M. and Huang, X. (2017). Part-of-speech tagging for Twitter with adversarial neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark: Association for Computational Linguistics, pp. 2411–2420.
Gupta, V., Varshney, D., Jhamtani, H., Kedia, D. and Karwa, S. (2014). Identifying purchase intent from social posts. In Proceedings of the Eighth International Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, MI, USA.
Hermann, K.M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M. and Blunsom, P. (2015). Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, pp. 1693–1701.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8), 1735–1780.
Hollerit, B., Kröll, M. and Strohmaier, M. (2013). Towards linking buyers and sellers: Detecting commercial intent on Twitter. In 22nd International World Wide Web Conference, WWW 2013, Companion Volume, Rio de Janeiro, Brazil, pp. 629–632.
Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, MD, USA: Association for Computational Linguistics, pp. 655–665.
Kharratzadeh, M. and Coates, M. (2012). Weblog analysis for predicting correlations in stock price evolutions. In Proceedings of the Sixth International Conference on Weblogs and Social Media, Dublin, Ireland.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, pp. 1746–1751.
Klerke, S., Goldberg, Y. and Søgaard, A. (2016). Improving sentence compression by learning to predict gaze. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA: Association for Computational Linguistics, pp. 1528–1533.
Kombrink, S., Mikolov, T., Karafiát, M. and Burget, L. (2011). Recurrent neural network based language modeling in meeting recognition. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, pp. 2877–2880.
Korpusik, M., Sakaki, S., Chen, F. and Chen, Y. (2016). Recurrent neural networks for customer purchase prediction on Twitter. In Proceedings of the 3rd Workshop on New Trends in Content-Based Recommender Systems co-located with ACM Conference on Recommender Systems (RecSys 2016), Boston, MA, USA, pp. 47–50.
Lassen, N.B., Madsen, R. and Vatrapu, R. (2014). Predicting iPhone sales from iPhone tweets. In 18th IEEE International Enterprise Distributed Object Computing Conference, EDOC 2014, Ulm, Germany, pp. 81–90.
Le, Q.V. and Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, ICML, Beijing, China, pp. 1188–1196.
Lebret, R., Pinheiro, P. and Collobert, R. (2015). Phrase-based image captioning. In Proceedings of the 32nd International Conference on Machine Learning, ICML, Lille, France, pp. 2085–2094.
Lewis, E.S.E. (1903). Catch-line and argument. The Book-Keeper 15, 124–128.
Lo, C., Frankowski, D. and Leskovec, J. (2016). Understanding behaviors that lead to purchasing: A case study of Pinterest. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, pp. 531–540.
Lv, H., Yu, G., Tian, X. and Wu, G. (2014). Deep learning-based target customer position extraction on social network. In International Conference on Management Science & Engineering (ICMSE). IEEE, pp. 590–595.
Mahmud, J., Fei, G., Xu, A., Pal, A. and Zhou, M.X. (2016). Predicting attitude and actions of Twitter users. In Proceedings of the 21st International Conference on Intelligent User Interfaces, IUI 2016, Sonoma, CA, USA, pp. 2–6.
Mikolov, T. (2012). Statistical language models based on neural networks. PhD thesis, Brno University of Technology.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of Workshop at 1st International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA.
Morris, M.R., Teevan, J. and Panovich, K. (2010). What do people ask their social networks, and why? A survey study of status message Q&A behavior. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, CHI 2010, Atlanta, GA, USA, pp. 1739–1748.
Nakov, P., Ritter, A., Rosenthal, S., Sebastiani, F. and Stoyanov, V. (2016). SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA: Association for Computational Linguistics, pp. 1–18.
Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N. and Smith, N.A. (2013). Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA: Association for Computational Linguistics, pp. 380–390.
Pascanu, R., Mikolov, T. and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, Atlanta, GA, USA: JMLR.org.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
Ramanand, J., Bhavsar, K. and Pedanekar, N. (2010). Wishful thinking - finding suggestions and “buy” wishes from product reviews. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Los Angeles, CA, USA: Association for Computational Linguistics, pp. 54–61.
Ribeiro, M.T., Singh, S. and Guestrin, C. (2016). Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA: ACM, pp. 1135–1144.
Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
Russell, C. (1921). How to write a sales-making letter. Printers’ Ink 115, 49–56.
Sakaki, S., Chen, F., Korpusik, M. and Chen, Y. (2016). Corpus for customer purchase behavior prediction in social media. In Language Resources and Evaluation Conference, Portorož, Slovenia, pp. 2976–2980.
Schuster, M. and Paliwal, K.K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11), 2673–2681.
Tang, D., Qin, B. and Liu, T. (2015). Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal: Association for Computational Linguistics, pp. 1422–1432.
Theano Development Team (2016). Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.
Tzeng, E., Hoffman, J., Darrell, T. and Saenko, K. (2017). Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition, Honolulu, HI, USA.
Vieira, A. (2015). Predicting online user behaviour using deep learning algorithms. arXiv preprint arXiv:1511.06247.
Werbos, P.J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE 78(10), 1550–1560.
Weston, J., Bengio, S. and Usunier, N. (2011). WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain: AAAI Press, pp. 2764–2770.
Wijaya, B.S. (2015). The development of hierarchy of effects model in advertising. International Research Journal of Business Studies 5(1).
Xu, S., Liang, H. and Baldwin, T. (2016). UNIMELB at SemEval-2016 tasks 4a and 4b: An ensemble of neural networks and a word2vec based model for sentiment classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA: Association for Computational Linguistics, pp. 183–189.
Yan, Z., Duan, N., Chen, P., Zhou, M., Zhou, J. and Li, Z. (2017). Building task-oriented dialogue systems for online shopping. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, pp. 4618–4626.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. and Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA: Association for Computational Linguistics, pp. 1480–1489.
Zeiler, M.D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.