A Knowledge Regularized Hierarchical Approach for Emotion Cause Analysis

Emotion cause analysis, which aims to identify the reasons behind emotions, is a key topic in sentiment analysis. A variety of neural network models have been proposed recently; however, these models mostly focus on the learning architecture with local textual information, ignoring the discourse and prior knowledge that play crucial roles in human text comprehension. In this paper, we propose a new method to extract emotion causes with a hierarchical neural model and knowledge-based regularizations, which aims to incorporate discourse context information and restrain the parameters by sentiment lexicon and common knowledge. The experimental results demonstrate that our proposed method achieves state-of-the-art performance on two public datasets in different languages (Chinese and English), outperforming a number of competitive baselines by at least 2.08% in F-measure.


Introduction
Sentiment analysis has gained increasing popularity in recent years due to many useful applications (Pang and Lee, 2007). The goal of sentiment analysis is to classify the sentiment polarity of a given text as positive, negative, neutral, or more fine-grained classes (Kim, 2014; Li et al., 2015; Yang et al., 2016; Qian et al., 2017; Li et al., 2018b; Yu et al., 2018). Most of these studies assume that emotion expressions are already observed and try to identify the emotion categories from text. However, in practice, such as in product comments or political reviews, we may care more about the reason why the customers or critics hold the emotion than about a simple category label, because the products or services can then be improved according to the emotion causes provided by users. Emotion cause analysis (ECA) aims to identify the reasons behind a certain emotion expression in an event text, for example:
Ex.1 When the children saw the gifts I prepared carefully, (c_{-2}) | they cheered happily and hugged me. (c_{-1}) | I was full of happiness. (c_0)
Here, Ex.1 shows a document with three clauses marked as (c_{-2}), (c_{-1}), and (c_0). The goal of ECA is to determine which clause contains the emotion cause (e.g., (c_{-1})) for an emotion word (e.g., happiness in (c_0)).
* Corresponding author: xuruifeng@hit.edu.cn
Previous approaches for emotion cause analysis mostly depend on rule-based methods and machine learning algorithms (Ghazi et al., 2015; Gui et al., 2016; Xu et al., 2019). Most of them rely heavily on complicated linguistic rules or feature engineering, which is time-consuming and labor-intensive. Recent studies have focused on solving the task using neural models (Gui et al., 2017; Li et al., 2018a; Chen et al., 2018; Li et al., 2019) with well-designed attention mechanisms based on local text. Despite the effectiveness of neural models, previous studies have two shortcomings. First, they usually consider each clause individually, ignoring the discourse context information that can affect the semantic expression among different clauses of a document. Second, prior knowledge such as sentiment lexicons and relative position information, which can provide crucial emotion cause cues, has not been fully exploited in neural models.
To alleviate these limitations, we propose a regularized hierarchical neural network (RHNN) for emotion cause analysis, which combines discourse context information and knowledge-based regularizations. Our model builds on the following intuitions. Firstly, documents exhibit discourse structure, which may carry valuable information about emotion cause cues. We employ a hierarchical learning structure to capture the mutual impacts of semantic expression among the discourse context and thereby produce better clause representations. Secondly, in emotion events, emotion causes usually express a certain sentiment polarity through sentiment words. For example, in the emotion cause of Ex.1, the sentiment words cheered and happily express a positive polarity and also play a crucial role in provoking the emotion happiness. Therefore, capturing these sentiment words can enhance the causal connection between learned features and predictions. We approach this issue by designing a regularizer that incorporates linguistic knowledge (e.g., a sentiment lexicon) to enlarge the margin between the attention weights of sentiment words and non-sentiment words. Besides, writers often place important points at particular positions in a text. For emotion events, emotion causes frequently occur at positions very close to the emotion word. Ex.1 illustrates this behavior: the emotion cause clause c_{-1} adjoins the emotion word happiness. To benefit from this phenomenon, we introduce a regularizer biased by relative position information to supervise the representation learning of text and further to revise the predictive position distribution of emotion causes relative to the emotion word (in brief, the predictive distribution).
To sum up, our contributions include:
• We propose a novel discourse-aware learning structure with knowledge-based regularizations for emotion cause analysis.
• We empirically evaluate the proposed model on two public datasets in different languages (Chinese and English) and show statistically significant improvements compared to the state-of-the-art methods.
• To make the mechanism of our model clear, we also compare the performance of different combinations by ablation experiments. Extensive analysis on both datasets confirms the feasibility of incorporating discourse information and restraining the parameters by sentiment lexicon and common knowledge.

Our Framework
In this section, we first give the task definition. Then, we describe the proposed regularized hierarchical neural network (RHNN). The two auxiliary regularizers of RHNN will be introduced in the next section.

Task Definition
The formal definition of emotion cause analysis is given in (Gui et al., 2016). Formally, a document d = {c_1, c_2, ..., c_n} consisting of n clauses contains an emotion word e and at least one emotion cause clause corresponding to this emotion word. Each clause c_i = {w_{i1}, w_{i2}, ..., w_{ik}} consists of k words and is labeled with an emotion cause-oriented label in {0, 1}. We regard ECA as a binary classification task and aim to identify which clause contains the emotion cause.

Hierarchical Attention Network
Documents exhibit discourse structure which can serve as useful information for clause representation generation. One simple but effective approach is to adopt a hierarchical attention network to simulate this structure. Our hierarchical attention network consists of several parts: a word encoder, a word attention layer, a clause attention layer and a clause encoder. The details of each component will be described in the following paragraphs.
Word Encoder Gated Recurrent Unit (GRU) has been widely adopted for text processing (Cho et al., 2014). In this work, we first map each word into a low-dimensional embedding space by Word2Vec (Mikolov et al., 2013) and then feed the whole document into a GRU-based word encoder to extract word sequence features. To summarize information from both directions, we use a bidirectional GRU that makes two parallel passes, one forward and one backward, over each clause. Here x_{it} is the embedding vector for the word w_{it} in clause c_i at time step t, and k is the length of clause c_i. We then concatenate the hidden states of the two passes as the representation of each word.
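Written out under the standard bidirectional GRU formulation, the two passes and the concatenation would read as follows. This is a sketch under that assumption rather than the authors' exact equations; the symbols \overrightarrow{h}_{it}, \overleftarrow{h}_{it} and h_{it} are introduced here for illustration, and ⊕ denotes concatenation.

```latex
\begin{aligned}
\overrightarrow{h}_{it} &= \overrightarrow{\mathrm{GRU}}(x_{it}), \quad t \in [1, k], \\
\overleftarrow{h}_{it}  &= \overleftarrow{\mathrm{GRU}}(x_{it}),  \quad t \in [k, 1], \\
h_{it} &= \overrightarrow{h}_{it} \oplus \overleftarrow{h}_{it}.
\end{aligned}
```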

Word Attention
We introduce an attention mechanism to extract the words that are important to the meaning of the clause and aggregate the representations of these informative words to construct the clause vector. Specifically, w_m is the parameter for computing attention signals and ⊕ is the concatenation operation. The embedding of the emotion word e is denoted by e_E. α_{it} is the emotion-specific attention signal showing the importance of word w_{it}, and o_{c_i} is the sum of the word representations weighted by these attention signals.
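The emotion-specific word attention described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the scoring form tanh(w_m · (h_it ⊕ e_E)) followed by a softmax is an assumption, and the function names `softmax` and `word_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_attention(hidden_states, emotion_embedding, w_m):
    """Return (alpha, clause_vector) for one clause.

    hidden_states: list of k word vectors h_it (lists of floats)
    emotion_embedding: vector e_E of the emotion word
    w_m: parameter vector with len(h_it) + len(e_E) entries
    The scoring function below is an assumed form, chosen for illustration.
    """
    scores = []
    for h in hidden_states:
        concat = h + emotion_embedding  # list concatenation plays the role of the (+) operator
        scores.append(math.tanh(sum(w * x for w, x in zip(w_m, concat))))
    alpha = softmax(scores)  # attention signal alpha_it for each word
    dim = len(hidden_states[0])
    # Clause vector o_ci: attention-weighted sum of the word representations.
    clause_vector = [
        sum(a * h[j] for a, h in zip(alpha, hidden_states)) for j in range(dim)
    ]
    return alpha, clause_vector
```

The weights always sum to one, so the clause vector stays on the same scale as the word representations regardless of clause length.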
Clause Attention Intuitively, different clauses of a document are differently informative and should be assigned different importance. To address this, we design a clause attention mechanism to indicate the importance of each clause. Seen differently, the attention signals can be regarded as "prior" information that biases the clause encoder toward content that is more important for extracting the emotion cause.
In detail, we adopt a one-layer MLP to obtain the attention signals. Position information plays an important role in capturing the relative distance of a clause to the emotion word. Thus, we concatenate the clause vector o_{c_i} and its position embedding as the feature from which the attention signal is computed. Here w_v is a parameter vector, l_i is the randomly initialized position embedding, which is kept unchanged during training, and α_i is the weight of clause c_i. The clauses with different importance (e.g., o_{c_i}) are then fed into a clause encoder.
Clause Encoder Just as the meaning of a word is determined by its context, the semantic expression of a clause is usually affected by its discourse context. Based on this observation, we introduce a clause encoder to model the latent semantic relations among different clauses. Analogously, we also append the relative position information to enhance the relations between the clause and its position. Here n is the number of clauses in a document. As in the word encoder, the two directional hidden states are concatenated to form the representation o_i of clause c_i.

Model Training
The emotion-specific representation o_i, together with its position embedding l_i, is used as the final feature for emotion cause prediction, and the model is trained by minimizing the cross entropy. Here W_m is a parameter matrix, and y_i and ŷ_i are the target class distribution and the predictive class distribution, respectively.
• Position Regularizer (PR.): Relative position is a critical emotion cause indicator: in general the closer a clause is to the emotion word, the higher emotion cause probability it should be assigned. We approach this issue by introducing a proxy distribution and an auxiliary cross entropy function.
Formally, we decompose the joint loss of emotion cause detection into an original cross entropy loss, a sentiment regularization loss, and a position regularization loss. In the revised training objective, L_ce is the original cross entropy loss (§2.3), λ_1 and λ_2 are hyper-parameters, L_sr and L_pr are the two auxiliary regularization losses (§3.1 and §3.2), and θ is the parameter set.
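One natural reading of this decomposition is a plain weighted sum of the three losses. The pairing of λ_1 with L_sr and λ_2 with L_pr, and the absence of any further term, are assumptions made here for illustration:

```latex
\mathcal{L}(\theta) = \mathcal{L}_{ce} + \lambda_1 \, \mathcal{L}_{sr} + \lambda_2 \, \mathcal{L}_{pr}
```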

Sentiment Regularizer
The weak causal connection between learned features and predictions is a major issue in emotion cause analysis. Even though sentiment words are key to generating clause representations, most existing models do not focus on sentiment words, or place less emphasis on them, when producing clause representations. In other words, attention is distracted by irrelevant words with less sentiment polarity. To address this issue and benefit from linguistic resources, we explicitly encourage a larger margin between the attention weights of sentiment words and non-sentiment words using a sentiment lexicon. For sentiment words, the average attention weight is computed over the words of clause c_i that appear in the lexicon, where w_{it} is the t-th word of clause c_i, k is the length of c_i, l_s is the number of sentiment words in c_i, α_{it} is the attention weight calculated in §2.2, and S is the sentiment lexicon. The average attention weight of non-sentiment words is computed correspondingly. Our training objective is to lead the model to pay more attention to sentiment words. Thus, the regularization term is defined in terms of these two averages and a margin hyper-parameter m.
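A minimal sketch of such a margin-based regularizer for a single clause is given below. The hinge form max(0, m - (avg_sent - avg_non)) is an assumption consistent with the description of a margin between the two average attention weights, not a confirmed reproduction of the paper's equation; the function name and the handling of clauses with no sentiment words are likewise hypothetical.

```python
def sentiment_regularizer(words, alpha, lexicon, margin=0.15):
    """Margin loss between average attention on sentiment and non-sentiment words.

    words: the k words w_it of one clause
    alpha: their attention weights alpha_it
    lexicon: the sentiment lexicon S (a set of words)
    margin: the hyper-parameter m
    The hinge form is an assumed instantiation of the margin idea.
    """
    sent = [a for w, a in zip(words, alpha) if w in lexicon]
    non_sent = [a for w, a in zip(words, alpha) if w not in lexicon]
    if not sent or not non_sent:
        return 0.0  # assumed: no margin can be formed, so no penalty
    avg_sent = sum(sent) / len(sent)         # average over the l_s sentiment words
    avg_non = sum(non_sent) / len(non_sent)  # average over the k - l_s other words
    return max(0.0, margin - (avg_sent - avg_non))
```

For the cause clause of Ex.1 with the lexicon {cheered, happily}, attention weights that already favor the two sentiment words by more than the margin incur no penalty, whereas uniform attention is penalized by the full margin.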

Position Regularizer
Empirically, emotion causes usually occur at positions very close to the emotion word. However, the predictive distribution may favor clauses that are distant from and irrelevant to the emotion word. The goal of this regularizer is to narrow the difference between the predictive distribution and the true position distribution of emotion causes relative to the emotion word (in brief, the true distribution). Since we cannot obtain the true distribution, we assume that it satisfies the following conditions: (1) it is a normalized function with values within [0, 1]; (2) it is symmetric about a certain value. Based on these conditions, we employ a function of r_i, n, and b, where r_i is the relative distance of clause c_i to the emotion word, n is the number of clauses in a document, and b is the left and right boundary that limits the scope of the emotion cause. We then use q = (q_1, q_2, ..., q_n) as the proxy distribution to simulate the true distribution. From Section 2.3, we have the predictive class distribution ŷ_i of clause c_i, from which the probability of the emotion cause being at position i can be calculated. Similarly, we obtain the predictive distribution p of emotion causes relative to the emotion word. The goal is to constrain the difference between p and q, so we use cross entropy to measure the difference. Note that the two introduced regularizers work like L_1 and L_2 terms: they do not introduce any new parameters and only influence the training of the standard model parameters. The hyper-parameters λ_1 and λ_2 guide the model to achieve the best trade-off among the three types of losses.
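The cross entropy between a proxy distribution q and a predictive distribution p can be sketched as follows. The triangular shape used for q is purely hypothetical: it is one function that is normalized, symmetric, and zero outside the boundary b, and it stands in for the paper's proxy function. Normalizing the per-clause cause probabilities by their sum to obtain p is also an assumption.

```python
import math


def proxy_distribution(rel_distances, b):
    """Hypothetical proxy q: triangular weights inside [-b, b], zero outside, normalized."""
    raw = [1.0 - abs(r) / (b + 1) if abs(r) <= b else 0.0 for r in rel_distances]
    total = sum(raw)
    return [x / total for x in raw]


def predictive_distribution(cause_probs):
    """Normalize per-clause cause probabilities into a distribution p over positions."""
    total = sum(cause_probs)
    return [p / total for p in cause_probs]


def position_regularizer(cause_probs, rel_distances, b, eps=1e-12):
    """Cross entropy H(q, p) = -sum_i q_i * log(p_i) between proxy and prediction."""
    q = proxy_distribution(rel_distances, b)
    p = predictive_distribution(cause_probs)
    return -sum(qi * math.log(pi + eps) for qi, pi in zip(q, p))
```

With this sketch, a model that puts most of its probability on clauses next to the emotion word receives a smaller penalty than one that favors a distant clause, which is the behavior the regularizer is meant to encourage.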

Datasets
We select two public datasets in different languages to evaluate the proposed model: the Chinese dataset (Gui et al., 2016), collected from SINA city news, and the English dataset (Gao et al., 2017), collected from English novels. Each document in both datasets has only one emotion word and one or more emotion causes. It has been ensured that the emotion and the causes are relevant. The documents are manually segmented into clauses for emotion cause analysis. The details of the two datasets are summarized in Table 1.

Implementation Details
For the Chinese dataset, there is no training/test split, so we randomly divide the documents into training, development, and test sets in a ratio of 8:1:1 and segment the clauses with Jieba. For the English dataset, we randomly select 10% of the original training set as the development set, and lowercase and lemmatize all tokens with NLTK. We evaluate our method 25 times with different splits and then perform a one-sample t-test on the experimental results, following (Gui et al., 2017). Precision (P), recall (R), and F-measure (F) are used to measure performance in this task. The sentiment lexicon adopted for the Chinese dataset consists of two parts. The first part is selected from the HowNet (Dong et al., 2006) sentiment analysis lexicon set, and the second part comes from NTUSD (Ku et al., 2006). The combination of the two parts serves as the Chinese sentiment lexicon of this research. The English sentiment lexicon comes from MPQA (Wilson et al., 2005), and we only select the words with high sentiment polarity, because they are less sensitive to contextual information and usually express sentiment polarities consistent with their prior polarity. For both sentiment lexica, we filter out the words that do not occur in the datasets. Ultimately, 2022 and 1348 sentiment words are selected for the Chinese and English datasets, respectively.
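The three metrics can be computed from clause-level labels as sketched below. The sketch assumes the usual clause-level definitions for this task (correctly predicted cause clauses divided by predicted cause clauses for P, and by annotated cause clauses for R); the function name is hypothetical.

```python
def precision_recall_f(gold, predicted):
    """Clause-level P, R, F for emotion cause detection.

    gold, predicted: per-clause 0/1 labels, with 1 marking an emotion cause clause.
    """
    correct = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    proposed = sum(predicted)   # clauses the model labels as causes
    annotated = sum(gold)       # clauses annotated as causes
    precision = correct / proposed if proposed else 0.0
    recall = correct / annotated if annotated else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```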
Online learning is performed with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001. The number of layers in the Bi-GRU is set to 2, and a dropout rate of 0.5 is used to avoid overfitting. The word vectors are pre-trained by word2vec (Mikolov et al., 2013) and kept unchanged during training. We perform grid search over the margin m ({0.10, 0.15, 0.20}), the boundary b ({2, 3, 4}), the dimensionality of the Bi-GRU ({64, 128}), and λ_1 and λ_2 (both {0.25, 0.5, 0.75}). For each corpus, the combination of these hyper-parameters with the highest F-measure on the development set is selected.
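The search space above contains 3 × 3 × 2 × 3 × 3 = 162 combinations. A minimal sketch of the selection loop follows; the dictionary keys and the `evaluate_on_dev` callback, which is expected to train a model with a given configuration and return its development-set F-measure, are hypothetical names introduced for illustration.

```python
from itertools import product

# Candidate values as listed in the text; the key names are illustrative.
GRID = {
    "m": [0.10, 0.15, 0.20],
    "b": [2, 3, 4],
    "hidden_dim": [64, 128],
    "lambda_1": [0.25, 0.5, 0.75],
    "lambda_2": [0.25, 0.5, 0.75],
}


def grid_search(evaluate_on_dev, grid=GRID):
    """Return the configuration with the highest development-set F-measure."""
    names = list(grid)
    best_config, best_f = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        f_measure = evaluate_on_dev(config)  # user-supplied: train, then score on dev
        if f_measure > best_f:
            best_config, best_f = config, f_measure
    return best_config, best_f
```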

Experiments
In this section, we compare our RHNN model with the following groups of methods:
• Rule-based and commonsense-based methods: Rule-based method (RB) is a traditional rule-based method. Commonsense-based method (CB) is a knowledge-based method proposed by Russo et al. (2011). It uses the Chinese Emotion Cognition Lexicon (Xu et al., 2013) as commonsense knowledge.
• Machine learning methods: SVM is an SVM classifier trained on unigram, bigram, and trigram features. Word2vec is an SVM classifier trained on word representations pre-trained by Word2vec (Mikolov et al., 2013). Multi-kernel represents a document by a syntactic structure and utilizes a modified convolution kernel method to determine which clause contains the emotion cause (Gui et al., 2016). LambdaMART extracts emotion causes using learning-to-rank methods based on emotion-independent and emotion-dependent features (Xu et al., 2019).
• Deep learning methods: CNN is a convolutional neural network for sentence classification (Kim, 2014). ConvMS-Memnet considers emotion cause analysis as a reading comprehension task and designs a multiple-slot deep memory network to model context information (Gui et al., 2017). CANN uses a co-attention neural network to identify emotion causes (Li et al., 2018a). HCS, proposed by Yu et al. (2019), uses a multiple-level hierarchical network to detect emotion causes. MANN is the current state-of-the-art method, employing a multi-attention-based model for emotion cause extraction (Li et al., 2019). RHNN is our proposed model.
Table 2: Experimental results on the Chinese dataset; [...] (Gui et al., 2017) and the rest are reprinted from the corresponding publications (p < 0.001).

Main Results
The experimental results on the two datasets are shown in Table 2 and Table 3, respectively. RB yields high precision but low recall, whereas CB shows the opposite pattern. A possible reason is that these linguistic-based methods depend on cue words to identify the emotion cause, and different rules or common sense may cover different cue words. For the machine learning methods, SVM and Word2vec have similar performance on the Chinese dataset, but SVM outperforms Word2vec on the English dataset. The main reason is that polysemy is more pronounced in English expressions. Multi-kernel performs better by capturing context information through a syntactic tree. LambdaMART, which is based on a ranking strategy and global emotion features, performs best among the feature-based methods. However, both Multi-kernel and LambdaMART rely on expensive human-based features and lack extensibility to different datasets. Compared with CNN, ConvMS-Memnet models the context of each word and obtains better performance on both datasets. The co-attention-based CANN captures the mutual relations between the emotion clause and each candidate clause, and achieves results comparable to the hierarchical HCS. MANN considers the interaction between the emotion clause and candidate clauses by designing a multi-attention mechanism and achieves the best performance among the baselines. The proposed RHNN model further improves the performance on both datasets, as shown in the tables. The improvement is significant, with a p-value less than 0.001 in a one-sample t-test. Specifically, RHNN improves the F-measure by 3.06% over LambdaMART, which shows that, by restraining the parameters with knowledge-based regularizations, RHNN identifies emotion cause cues better than feature engineering does. RHNN also outperforms the current best-performing method, MANN, by 2.08% on the Chinese dataset and 6.47% on the English dataset in F-measure. Furthermore, on the English dataset, our proposed model has balanced performance in precision and recall. The reason is that RHNN can capture more emotion cue information (e.g., sentiment words), which helps the model extract emotion causes more precisely.
Table 3: Experimental results on the English dataset. We follow the results implemented in (Li et al., 2019), the only available results on this dataset (p < 0.001).
Table 5: The F-measure on the sub-dataset (Sub.), which only selects the clauses that contain sentiment words, and on the all-dataset (All.), which uses the whole dataset.

Detailed Analysis
Ablations of RHNN Model The proposed RHNN model consists of three components: the hierarchical structure (H.), the sentiment regularizer (SR.), and the position regularizer (PR.). We conduct ablation experiments to reveal the effect of each component. As illustrated in Table 4, all models with a proposed component consistently improve upon the Base model, verifying the effectiveness of the proposed approach. Compared with the H. and PR. models, the SR. model improves the performance most. The main reason is that 55.24% and 66.49% of emotion causes contain sentiment words on the Chinese and English datasets, respectively, so enforcing the model to pay more attention to sentiment words can enhance the causal connection between learned features and predictions.
On the Chinese dataset, RHNN achieves the best performance, with a 7.62% improvement in F-measure over the baseline. However, on the English dataset, the HSR model performs better than the RHNN model, which may be caused by overlap between the components. Besides, the F-measure on the English dataset is consistently about 20% lower than that on the Chinese dataset. One possible explanation is that English expressions contain more clause structures, which are difficult for the model to capture without a discourse tree.

Effect of Sentiment Regularizer
To gain more insight into our proposed model, we conduct further experiments to examine the effectiveness of the sentiment regularizer on subsets of the two datasets. The experimental results in Table 5 show that: 1) The RHNN and HSR models achieve the best performance on the Chinese and English subsets, respectively; similar observations hold on the whole datasets. 2) With the sentiment regularizer, the performance on both subsets is higher than that on the whole datasets. This is consistent with our intuition, because the sentiment regularizer helps pick up words with sentiment polarity, and these words are important causal indicators in clauses. Moreover, every clause in the subsets contains sentiment words, resulting in better performance on the subsets. 3) The performance improvement on the English subset is remarkably higher than that on the Chinese subset; one possibility is that there are more explicit emotion terms in English expressions than in Chinese.
Effect of Position Regularizer From Eq. (18), we see that the value of b limits the scope of the emotion cause. In this section, we further investigate the effect of different scope limits. For simplicity and efficiency, we only apply the PR model in this experiment. The results are shown in Fig. 2, where the percentages denote the coverage of emotion causes in the datasets. For instance, 72.77% means that 72.77% of emotion causes adjoin the emotion word in the English dataset. From Fig. 2, we can see that the performance trends on the two datasets are similar: the performance first improves as the scope limit expands, but decreases once the scope limit exceeds a certain value (2 or 3). The initial improvement may be because a larger scope limit gives higher coverage of emotion causes in the datasets. When the scope limit is too large, however, the model is forced to allocate higher probability to clauses that are distant from and irrelevant to the emotion word, which leads to performance degradation.
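The coverage figures mentioned above can be computed from the relative distances of the annotated cause clauses. The sketch below assumes that a cause counts as covered when |r| <= b; the function name is hypothetical, and the distances in the usage note are made-up toy values, not statistics from either dataset.

```python
def cause_coverage(cause_distances, b):
    """Fraction of annotated cause clauses whose relative distance r satisfies |r| <= b."""
    if not cause_distances:
        return 0.0
    inside = sum(1 for r in cause_distances if abs(r) <= b)
    return inside / len(cause_distances)
```

For the toy list [-1, -1, 0, -2, 1, -3, -1, 2], the coverage grows from 0.625 at b = 1 to 0.875 at b = 2 and 1.0 at b = 3, which illustrates why widening the scope first helps and then mostly adds distant, rarely relevant clauses.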

Case Study
Essentially, the sentiment regularizer (SR.) aims to enlarge the margin between the attention weights of sentiment words and non-sentiment words. The question is whether the model then focuses on the words with sentiment polarity. We randomly choose one example (Ex.2) from the Chinese dataset to visualize its attention distribution and compare the results with and without the sentiment regularizer.
Ex.2 A fellow came to store asking XiaoMei's situation, (c_{-2}) | DaXiong complained XiaoMei not good. (c_{-1}) | She heard it then felt both sad and angry. (c_0)
In this example, the cause of the emotion word sad is in (c_{-1}). The visualization results are shown in Fig. 3. We can observe that the model without the sentiment regularizer mostly focuses on non-sentiment words such as XiaoMei, she, and heard it, which are inessential to provoking the emotion. However, when we add the SR. to the model, we can see an obvious shift of weights in the attention distribution. More clearly, the model captures the sentiment words complained and not good, which are crucial to identifying the emotion cause. This shows that our model with the sentiment regularizer is more effective in extracting the most important keywords relating to the emotion cause. Better results are also obtained using the sentiment regularizer, which is consistent with what we observed in Table 4.
Table 6
No. | Emotion cause events | RHNN results
1 | I was immediately *ashamed* of myself for my vanity, for having assumed that he wanted me to stay with him forever. I'm sorry, that was a little arrogant | for having assumed that he wanted me to stay with him forever
2 | He hopes of being *admitted* to a sight of the young ladies, of whose beauty he had heard much. But he saw only the father. | of whose beauty he had heard much
3 | I didn't know where in the hell you was, said Ennis, four years, I about *give up* on you. | (nothing extracted)

Error Analysis
Finally, we perform error analysis to understand what types of errors the proposed model makes, focusing on three cases from the English dataset. The results are listed in Table 6, where the first column gives the content of the emotion cause events and the second column gives the emotion causes identified by RHNN. In Table 6, the emotion causes appear in bold and the emotion word is marked between asterisks (*).
From Table 6, we can see that two clauses contain the emotion cause in event 1, but our model detects only one of them. In event 2, our model makes a wrong prediction. One possible reason is that our model is prone to treating clauses that contain sentiment words as emotion causes. RHNN extracts nothing from event 3, possibly because the long distance between the emotion word and the emotion cause clause makes the causal relation difficult to capture. Our proposed model is capable of obtaining rich emotion cause cues with knowledge-based regularizations. However, it also introduces some noise into emotion cause analysis.

Related Work
Emotion classification is an important fundamental aspect of sentiment analysis. Going one step further, emotion cause analysis (ECA), which aims to discover the reasons behind emotions, can guide practical improvements, e.g., improving the quality of products or services according to the emotion causes in comments provided by users. In this section, we describe the related work on emotion cause analysis. Early work gave the first formal definition of emotion cause analysis and manually constructed a dataset from the Academia Sinica Balanced Chinese Corpus. Based on this corpus, two sets of features built on six groups of linguistic cues were designed to detect emotion causes. Support vector machines (SVMs) and conditional random fields (CRFs) were investigated to detect cause or non-cause text with extended rule-based features (Gui et al., 2014; Ghazi et al., 2015). Other than rule-based methods, Russo et al. (2011) proposed a crowdsourcing method to construct a commonsense knowledge base for emotion cause extraction in Italian newspaper articles, but it is challenging to extend the commonsense knowledge base automatically. Recently, Gui et al. (2016) proposed a multi-kernel based method to identify the emotion cause from a manually annotated emotion cause corpus. Xu et al. (2019) proposed a learning-based method that re-ranks candidate emotion cause clauses using a number of emotion-dependent and emotion-independent features. However, these methods depend heavily on expensive human-based features and are difficult to apply in real-world applications.
Inspired by the success of neural network methods, deep neural models and attention mechanisms have been widely used in emotion cause analysis. Gui et al. (2017) proposed a novel deep neural network that regards emotion cause analysis as a question-answering task, introducing a convolution-based memory network to store the context information. Li et al. (2018a) considered the context around the emotion word, instead of only the emotion word, as a query to model the mutual impacts between each candidate clause and the emotion clause. Cheng et al. (2017) constructed a corpus based on Chinese microblogs and proposed to detect emotion causes using multiple-user structures. Besides, Yu et al. (2019) proposed a multiple-level hierarchical network-based clause selection strategy. Li et al. (2019) proposed a multi-attention-based neural model to capture the mutual influences between the emotion clause and each candidate clause, and then generate the representations for the two clauses separately. This method achieves the current best performance. However, the existing approaches usually focus on local textual information, ignoring the discourse structure (Zubiaga et al., 2018) and prior knowledge such as sentiment lexicons (Qian et al., 2017) and relative position information, which can provide important emotion cues for the emotion cause analysis task.

Conclusion and Future Work
In this paper, we propose a regularized hierarchical neural network (RHNN) for emotion cause analysis. The proposed model aggregates discourse context information through a hierarchical learning structure and restrains the parameters with knowledge-based regularizations. We evaluate the proposed model on two public datasets in different languages. The experimental results demonstrate that our proposed method achieves state-of-the-art performance on both datasets, and extensive analysis confirms the feasibility of incorporating the discourse context and knowledge-based regularizations.
To preserve the simplicity of the proposed model, we do not model a document as a tree structure. In the future, we will explore how to incorporate discourse parse trees or discourse relations into the emotion cause analysis task to further improve the performance.