Multiple distributed representation construction
A sentence might contain more than one event, but most events are contained within a single sentence. To reduce the complexity of the model, we divided the corpus into sentences and built feature representations from sentence-level information. Inspired by the language model, we drew the basic picture of trigger candidates from their context words and added detail through their POS. Due to the deep relevance between triggers and entities (participant candidates), we utilized the distance between them to describe the possibility that a candidate is a trigger. Moreover, we added the sentence topic probability to discriminate between sentences.
As mentioned before, we adopted the context words around candidate triggers to predict the trigger type. Given a sentence \( \mathrm{S}=\dots {\mathrm{w}}_{\mathrm{i}-2}{\mathrm{w}}_{\mathrm{i}-1}{\mathrm{T}}_{\mathrm{i}}{\mathrm{w}}_{\mathrm{i}+1}{\mathrm{w}}_{\mathrm{i}+2}\dots \), where \( {\mathrm{T}}_{\mathrm{i}} \) denotes the trigger to be predicted, and assuming the window size \( {\mathrm{d}}_{\mathrm{win}} \) is 2, the words \( {\mathrm{w}}_{\mathrm{i}-2} \), \( {\mathrm{w}}_{\mathrm{i}-1} \), \( {\mathrm{w}}_{\mathrm{i}+1} \), and \( {\mathrm{w}}_{\mathrm{i}+2} \) are the context words we adopted.
First, we looked up the words around the trigger to be predicted in a dictionary built from the training set, according to \( {\mathrm{d}}_{\mathrm{win}} \); then we initialized these words’ vectors from the dependency-based word embedding table. If the window extends beyond the edge of the sentence, or a word cannot be found in the word embedding table, we initialized its embedding randomly. Finally, we concatenated the vectors as follows:
$$ {\uppsi}_{\mathrm{con}}\left({\mathrm{T}}_{\mathrm{i}}\right)=\left[{\left\langle \mathrm{W}\right\rangle}_{\left[{\mathrm{w}}_{\mathrm{i}-{\mathrm{d}}_{\mathrm{win}}}\right]}\cdot \cdots \cdot {\left\langle \mathrm{W}\right\rangle}_{\left[{\mathrm{w}}_{\mathrm{i}-1}\right]}\cdot {\left\langle \mathrm{W}\right\rangle}_{\left[{\mathrm{T}}_{\mathrm{i}}\right]}\cdot {\left\langle \mathrm{W}\right\rangle}_{\left[{\mathrm{w}}_{\mathrm{i}+1}\right]}\cdot \cdots \cdot {\left\langle \mathrm{W}\right\rangle}_{\left[{\mathrm{w}}_{\mathrm{i}+{\mathrm{d}}_{\mathrm{win}}}\right]}\right] $$
(2)
Where \( \mathrm{W}\in {\mathcal{R}}^{\dim \times \left|\mathcal{D}\right|} \) represents the dependency-based word embedding table, dim is the dimensionality of the word embeddings trained in our previous work, and \( \left|\mathcal{D}\right| \) is the size of the dictionary.
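A minimal sketch of this lookup-and-concatenate step follows, assuming the embedding table is stored as a NumPy array with one column per dictionary word; the names `W`, `word2idx`, and `dwin` are illustrative, not the paper’s own identifiers:

```python
import numpy as np

def context_feature(words, i, W, word2idx, dwin=2):
    """Concatenate the dependency-based embeddings of the candidate trigger
    and its dwin left/right context words (Eq. 2)."""
    dim = W.shape[0]                                  # W has shape (dim, |D|)
    vectors = []
    for j in range(i - dwin, i + dwin + 1):
        if 0 <= j < len(words) and words[j] in word2idx:
            vectors.append(W[:, word2idx[words[j]]])  # column lookup <W>_[w_j]
        else:
            # window past the sentence edge, or OOV word: random initialization
            vectors.append(np.random.uniform(-0.25, 0.25, dim))
    return np.concatenate(vectors)                    # length (2 * dwin + 1) * dim
```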
Ambiguity is a problem that cannot be neglected in word classification tasks: a word can express various meanings depending on its specific context. Disambiguation therefore plays an important role in word classification tasks, including trigger identification.
Here we added a topic feature to represent the sentence topic information of each trigger in the sentence. We used the Latent Dirichlet Allocation (LDA) [13] tool provided by Gensim, a Python library, to acquire the topic distribution of every word in a sentence (the probability of a word belonging to each topic), and then multiplied the topic probabilities of the words in one sentence to obtain an approximate sentence topic distribution [14]. The sentence topic feature is computed by the following equation:
$$ {\uppsi}_{\mathrm{top}}\left(\mathrm{S}\right)={\prod}_{\mathrm{j}=\mathrm{i}-{\mathrm{d}}_{\mathrm{w}\mathrm{in}}}^{\mathrm{i}+{\mathrm{d}}_{\mathrm{w}\mathrm{in}}}{\left\langle {\mathrm{W}}_{\mathrm{top}}\right\rangle}_{\left[{\mathrm{w}}_{\mathrm{j}}\right]} $$
(3)
Where \( \left\langle {\mathrm{W}}_{\mathrm{top}}\right\rangle \in {\mathcal{R}}^{{\mathrm{d}}_{\mathrm{top}}\times \left|\mathcal{D}\right|} \) is the word topic distribution table, \( {\mathrm{d}}_{\mathrm{top}} \) represents the total number of topics, and \( \left|\mathcal{D}\right| \) denotes the size of the dictionary.
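The following is a minimal sketch of this step, assuming a pre-trained Gensim `LdaModel` and `Dictionary` (the variable names are hypothetical); the small smoothing floor is our addition, to keep the element-wise product from collapsing to zero when a word carries no mass on some topic:

```python
import numpy as np
from gensim.corpora import Dictionary   # assumed built over the training corpus
from gensim.models import LdaModel      # assumed trained elsewhere

def sentence_topic_feature(window_words, dictionary, lda, num_topics):
    """Approximate the sentence topic distribution by multiplying the
    per-word topic distributions element-wise (Eq. 3)."""
    feature = np.ones(num_topics)
    for word in window_words:
        if word not in dictionary.token2id:
            continue                                   # skip out-of-vocabulary words
        word_topics = np.zeros(num_topics)
        for topic_id, prob in lda.get_term_topics(
                dictionary.token2id[word], minimum_probability=0.0):
            word_topics[topic_id] = prob
        # smoothing floor (our assumption) so one zero entry does not wipe the product
        feature *= np.maximum(word_topics, 1e-8)
    return feature
```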
Triggers are mostly verbs, which makes POS a valuable feature for trigger identification. The distributed representation of POS is computed as follows:
$$ {\uppsi}_{\mathrm{pos}}\left({\mathrm{T}}_{\mathrm{i}}\right)={\left\langle {\mathrm{W}}_{\mathrm{pos}}\right\rangle}_{\left[{\mathrm{pos}}_{{\mathrm{T}}_{\mathrm{i}}}\right]} $$
(4)
Where \( \left\langle {\mathrm{W}}_{\mathrm{pos}}\right\rangle \in {\mathcal{R}}^{{\mathrm{d}}_{\mathrm{pos}}\times \left|{\mathcal{D}}_{\mathrm{pos}}\right|} \) represents the POS vector table, \( \left|{\mathcal{D}}_{\mathrm{pos}}\right| \) is the number of different POS types in the training set, and \( {\mathrm{d}}_{\mathrm{pos}} \) is the dimensionality of the POS vector. \( {\mathrm{W}}_{\mathrm{pos}} \) is initialized at random and updated during training.
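In code, this feature is a single table lookup; a sketch with hypothetical names (`pos2idx`, `W_pos`), where in the full model the table would be updated by backpropagation rather than stay fixed:

```python
import numpy as np

d_pos = 50                                           # assumed POS vector dimensionality
pos2idx = {"NN": 0, "VB": 1, "VBD": 2, "JJ": 3}      # POS types seen in the training set
W_pos = np.random.uniform(-0.25, 0.25, (d_pos, len(pos2idx)))  # random initialization

def pos_feature(pos_tag):
    """Look up the column of W_pos indexed by the candidate trigger's POS (Eq. 4)."""
    return W_pos[:, pos2idx[pos_tag]]
```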
Triggers have a deep relevance with entities: the longer the distance between a word and the entities, the lower the possibility that the word is a trigger. Based on our previous work [15], we adopted the distance between the trigger and the entities in the dependency tree, computed as follows:
$$ {\uppsi}_{\mathrm{dis}}\left({\mathrm{T}}_{\mathrm{i}},{\mathrm{E}}_{\mathrm{j}}\right)={\left\langle {\mathrm{W}}_{\mathrm{dis}}\right\rangle}_{\left[\mathrm{dis}\left({\mathrm{T}}_{\mathrm{i}},{\mathrm{E}}_{\mathrm{j}}\right)\right]} $$
(5)
Where dis(Ti, Ej) represents the distance from the candidate trigger to the closest entity, and \( \left\langle {\mathrm{W}}_{\mathrm{d}\mathrm{is}}\right\rangle \in {\mathcal{R}}^{{\mathrm{d}}_{\mathrm{d}\mathrm{is}}\times \left|{\mathcal{D}}_{\mathrm{d}\mathrm{is}}\right|} \) is the distance table. \( \left|{\mathcal{D}}_{\mathrm{dis}}\right| \) is the number of possible distance values and \( {\mathrm{d}}_{\mathrm{dis}} \) is the dimensionality of the distance vector. \( {\mathrm{W}}_{\mathrm{dis}} \) is initialized at random and updated during training.
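A sketch of how dis(Ti, Ej) could be computed, assuming the dependency tree is given as an undirected adjacency map over token indices (the names `adj` and `entity_idxs` are our assumptions); the subsequent vector lookup in \( \left\langle {\mathrm{W}}_{\mathrm{dis}}\right\rangle \) then mirrors the POS lookup above:

```python
from collections import deque

def dep_distance(adj, trigger_idx, entity_idxs):
    """Breadth-first search for the shortest path in the dependency tree from
    the candidate trigger to the closest entity: dis(T_i, E_j) in Eq. 5."""
    visited, queue = {trigger_idx}, deque([(trigger_idx, 0)])
    while queue:
        node, depth = queue.popleft()
        if node in entity_idxs:
            return depth
        for neighbor in adj.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return -1  # no entity reachable, i.e. the sentence contains no entity
```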
Finally, the concatenation of these four feature representations formed the distributed representation of a candidate trigger, which served as the input layer fed to the deep learning model.
$$ \uppsi \left({\mathrm{T}}_{\mathrm{i}}\right)={\uppsi}_{\mathrm{con}}\left({\mathrm{T}}_{\mathrm{i}}\right)\bullet {\uppsi}_{\mathrm{top}}\left(\mathrm{S}\right)\bullet {\uppsi}_{\mathrm{pos}}\left({\mathrm{T}}_{\mathrm{i}}\right)\bullet {\uppsi}_{\mathrm{dis}}\left({\mathrm{T}}_{\mathrm{i}},{\mathrm{E}}_{\mathrm{j}}\right) $$
(6)
Where ∙ denotes concatenation.
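Putting the pieces together, the input layer is a single vector of length \( \left(2{\mathrm{d}}_{\mathrm{win}}+1\right)\cdot \dim +{\mathrm{d}}_{\mathrm{top}}+{\mathrm{d}}_{\mathrm{pos}}+{\mathrm{d}}_{\mathrm{dis}} \); a sketch using the hypothetical helpers above:

```python
import numpy as np

def trigger_representation(psi_con, psi_top, psi_pos, psi_dis):
    """Concatenate the four feature vectors (Eq. 6) into the input layer
    fed to the deep learning model."""
    return np.concatenate([psi_con, psi_top, psi_pos, psi_dis])
```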