Carlos Martin, David Corney, Ayse Göker
IDEAS Research Institute
Robert Gordon University, Aberdeen, UK
B. Named Entity Recognition
Increasingly, journalists and news organizations are using social media to discover news stories and supporting information through interactive information retrieval. The rate and diversity of posted messages present major challenges however. New topics emerge continually, making it even harder for the user to find what is newsworthy. We describe a novel technique to detect clusters of bursty phrases that appear in multiple messages so as to identify emerging topics in a useful fashion. We demonstrate our methods by analysing tweets drawn from the worlds of politics (US primaries and presidential elections) and sport (the FA Cup Final). To evaluate our methods, we created a user-centred 'ground truth' based on mainstream media accounts of the events. We show that our methods successfully detect a range of different topics for each event and can retrieve headlines that represent each topic for the user.
The growth of social networking sites, such as Twitter, Facebook and Reddit, is well documented. Every day, a huge variety of information on many different topics is shared by a huge variety of people. Given the real-time, global nature of these sites, they are used by many people as a primary source of news content . Increasingly, such sites are also used by journalists, partly to find breaking news but also to find eye-witnesses and user-generated content such as photos and videos, to enhance their stories. Our goal is to produce a practical tool to help journalists and news readers to find newsworthy topics from message streams. The fact of an election is not necessarily newsworthy; the name of the winner is.
We describe a system that identifies phrases that show a sudden increase in frequency (a 'burst') and then cluster them to get final topics. Such bursts are typically responses to real-world events. In this way, the news consumer can avoid being overwhelmed by redundant messages, even if the initial stream is formed of many messages from many sources. The emphasis is on the temporal nature of message streams as we bring to the surface groups of messages that contain suddenly-popular phrases. A first version of this approach was recently described , where it compared favourably to several alternatives and benchmarks. Here we update and expand that work, examining the effect of different types of n-grams and retrieving different numbers of topics.
As our ultimate goal is to produce tools to help journalists and other news consumers, we have evaluated our system with a clear focus on the user and an emphasis on finding 'whole stories' rather than just isolated keywords. The output of our system is therefore a series of linked phrases, accompanied by one or more representative tweets.
2 RELATED WORK
Many of the individual techniques that we use have been used before, but not together and not in a user-centred way. No other study that we are aware of focuses on the needs of journalist and news-reading users, and with an emphasis on recency to augment traditional collection statistics such as tf-idf. It is our view that the domain must be tackled with a combination of such methods. We build on the idea of the importance of time and the concept of 'sessions' now common in query-log analysis , but adapt it to the context of emerging news from Twitter.
Newman  discusses the central use of social media by news professionals, such as hosting live blogs of ongoing events. He also describes the growth of collaborative, networked journalism, where news professionals draw together a wide range of images, videos and text from social networks and provide a curation service. Broadcasters and newspapers can also use social media to increase brand loyalty across a fragmented media marketplace.
Petrovic et al.  focus on the task of first-story detection (FSD). They use a locality sensitive hashing technique on 160 million Twitter posts to place tweets into buckets in order to find the nearest neighbour and hence detect new events and track them. Their evaluation used newswire sources rather than tweets, with a Twitter-based evaluation limited to calculating the average precision of their system, with human annotators to label the output as being about an event or not. This contrasts with our goal which is to measure the topic-level recall, i.e. how many newsworthy stories the system retrieved. Also, while they process all available tweets, we assume that an initial filtering has taken place, such as when using hashtags or following certain accounts with the aim of filtering out a large number of non-newsworthy tweets at the outset. In this way we can meet the journalists' needs of rapidly finding information about breaking news stories.
Benhardus  uses standard collection statistics such as tf-idf, unigrams and bigrams to detect trending topics. Two data collections are used: one using the Twitter API and the other using the Edinburgh Twitter corpus, which contains 97 million tweets. This was used as a baseline with some natural language processing used (part of speech tagging). The research focused on general trending topics (typically finding personalities and for new hashtags) rather than focusing the needs of journalistic users and news readers.
The n-gram method has been used in a number of related studies. Lin et al.  train language models (LMs) on a set of pre-defined hashtags using smoothing techniques in conjunction with unigram and bigram LMs, using 2.66 billion tweets. The research focused on varying parameters in the model using a history retention technique. No natural language techniques were used.
Shamma et al.  focus on 'peaky topics' (topics that show highly localized, momentary interest) by using unigrams only. Their method obtains peak terms for a given time slot when compared to the whole corpus rather than over a given time-frame. The use of the whole corpus favours batch-mode processing and is less suitable for real-time and user-centred analysis. Becker et al.  considers temporal issues by focusing on the online detection of real world events and distinguishing them from non-events (e.g. conversations between posters). Clustering and classification algorithms were used but methods such as n-grams and NLP were not.
Word order is often essential to indicate meaning. For example, 'dog bites man' is not news, but 'man bites dog' is news. A bag-of-words approach cannot distinguish these cases. Partly for this reason, n-grams (a sequence of up to n words) are popular in natural language processing systems [10, 11].
Therefore, in addition to single keywords, we also use bigrams, trigrams and n-grams generally. We then count and index the frequency of n-grams rather than just single-word terms (i.e. unigrams). Throughout this work, by n-gram we refer to sequences of up to n consecutive terms. For example, when indexing 3-grams, we include 2-grams and 1-grams. We form n-grams after tokenization and stop-word removal by replacing all stop-words with a dummy token. We do not use stemming, part-of-speech tagging or other such processes in this work.
Proper nouns representing people, places and organizations are often semantically vital parts of a message. Therefore we used the 3-class Stanford Named Entity Recognizer  and increased the importance of these words by a constant 'boost' factor. We adopted the methods of Phuvipadawat and Murata , where the authors highlight the importance of such words for grouping results. As in their work, we use a factor boost=1.5 when the n-gram contains a named entity and boost=1 otherwise. In our experiments, important topics will often include the names of politicians, sports players, locations, TV networks and so on.
Term frequency-inverse document frequency, or tf-idf, has been one of the best known statistics for indexing documents for many years. We are not interested in indexing documents however, but in finding novel trends so we want to find terms (n-grams; see Section 3 A) that appear in one time period more than others. Also, we do not want to be distracted by insignificant chatter, such as celebrity gossip or jokes, as these are not newsworthy for most journalists (although we could easily focus on such topics by using a different initial filter). We wish to identify words and phrases (and the messages that contain them) that are both new and significant. Conceptually, we therefore define newsworthiness as the combination of novelty and significance. Rather than attempting a formal definition of significance, by significant we mean that a story is the kind of story that would typically be included in a newspaper, news website or broadcast news bulletin. This is distinct from `popular' messages that would also include jokes, celebrity messages etc. We can help to maximise significance by filtering tweets either by keywords (as in this work) or by following a carefully chosen list of accounts. We can then maximise novelty by finding bursts of high-frequency words and phrases (using df-idf t, defined below).
We find emerging topics by finding terms that become suddenly more common. We use this to rank clusters of terms and then select the most 'bursty' terms and assume that they represent emerging topics and breaking news. High frequency n-grams are likely to represent coherent semantic units and phrases. The use of n-grams is appropriate for Twitter since many tweets are quote other messages.
We measure burstiness using the 'temporal document frequency-inverse document frequency', or df-idf t. For each term at each point in time we compare the term frequencies from the current time slot with those of the preceding time slot, where each slot lasts x minutes. For each term, we calculate the document frequency for a period of time using df ti, defined as the number of messages in a period of time i that contain the term t.
In addition, those terms containing any entity are boosted (Section 3 B). This produces a list of terms which can be ranked by their df-idf t scores. Note that we add one to term counts to avoid problems with dividing by zero or taking the log of zero.
Having found bursts of potentially newsworthy n-grams, we then group together n-grams that tend to appear in the same tweets, because a single term is often not very informative but a group can define a story. We use a hierarchical clustering approach based on the distance between n-grams or clusters of them. We initially assign every term (n-gram) to its own singleton cluster, then follow a standard 'group average' hierarchical clustering algorithm  to iteratively find and merge the closest pair of clusters. We repeat this until no two clusters share more than half their terms, at which point we assume that each cluster represents a distinct topic. We define the similarity between two terms as the fraction of messages in the same time slot that contain both of them, so it is likely that the term clusters whose similarities are high represent the same topic. Further details about this algorithm were recently published .
Finally, the clusters are ranked according to the highest df-idf t score of the n-grams contained by the cluster. This ranking criterion is based on the assumption that each cluster score should be associated with the most representative n-gram in the cluster. Each of these clusters defines a topic as a list of n-grams, which we also illustrate with a representative tweet in the user interface.
4 EVALUATION METHODS
To make the evaluation useful, we must define a realistic test problem. We used three Twitter data sets focused on popular real-world events, which were generated by filtering Twitter using relevant keywords and hashtags. We compare the topics that our algorithm finds with an independently-defined ground truth. To establish this ground truth for our evaluations, we used mainstream media (MSM) reports of the events: the 'Super Tuesday' primaries, part of the presidential nomination race of the US Republican Party; the 2012 FA Cup final, the climax to the English football season; and the 2012 US presidential election, an event of global significance. In each case, we reviewed the published MSM accounts of the events and chose a set of stories that were significant, time-specific, and represented on Twitter. For example, we ignored general reviews of the state of US politics (not time-specific), and quotes from members of the public (not significant events). This use of MSM sources helps to ensure that our ground truth topics are newsworthy (by definition) and that the evaluation is goal-focussed (i.e. will help journalists produce such stories). Further details of the data sets have recently been published , and the data is available (http://www.socialsensor.eu/results/datasets/72-twitter-tdt-dataset).
Fig. 1. Twitter activity during events (tweets per minute). For the FA Cup, the peaks correspond to start and end of the match and the goals. For the two political collections, the peaks correspond to the main result announcements.
We ran the topic detection algorithm on each time slot for which at least one topic is contained in the ground truth. In total, we had 8 one-hour slots for Super Tuesday, 13 one-minute slots for the FA Cup final and 26 ten-minute slots for the US elections. We used only the data from that slot as input to the methods due to the real-time scenario we are addressing: the topic detection system has to behave as a monitor by providing the user with the current trending topics.
The topics detected by the system were compared to the ground truth (comprising 22 topics for Super Tuesday; 13 for the FA Cup final; and 64 for the US elections) using three metrics:
We previously compared our method with other state-of-art techniques, such as Latent Dirichlet Allocation (LDA) . Our approach performed well on all the datasets, obtaining the best results in most of them.
Figure  compares the performance of our approach using different types of n-grams across the three data sets, showing topic recall as a function of the number of topics returned (as users may want to see different numbers of top topics in their displays). To compare the methods independently of the number of topics, we calculate the normalised area under the curves which we define as:
where tr i is the topic recall at top i and m=10 in our experiments. Table 1 shows the nAUC scores and their average, weighted by number of ground truth topics per data set. According to these results, the use of n-grams (phrases) produce better results than keywords (unigrams).
There is a meaningful difference among the results using keywords (unigrams) and n-grams. The 3- and 4-gram cases get similarly good results for all the datasets. For efficiency, we use 3-grams for our other experiments here and in the previously mentioned work ; adding 4-grams produces at best a marginal improvement at the cost of a substantially larger index.
Table 2 shows the keyword precision and recall scores for the best n-grams of each of data set, when the top m=5 results are returned.
Comparing the results between the three collections reveals one striking difference: the topic recall is far higher for football (over 90%) than for politics (around 50%). This is likely to reflect the different nature of conversations about the events. Topics within a live sports event tend to be transient: fans care (or at least tweet) little about what happened five minutes ago; what matters is what is happening 'now'. In politics, conversations and comments tend to spread over hours (or even days) rather than minutes. This means that sports-related topics tend to occur over a much narrower window, with less overlapping chatter. In politics, several different topics are likely to be discussed at the same time, making this type of trend detection much harder. Looking back at the distribution of the tweets over time (Figure 1), we can see clear spikes in the FA Cup graph, each corresponding to a major event (kick-off, goals, half-time, full-time etc.). No such clarity is in the politics graphs, which instead is best viewed as many overlapping trends.
This difference is reflected in the way that major news stories often emerge: an initial single, focussed story emerges but is later replaced with several potentially overlapping sub-stories covering different aspects of the story. Our results suggest that a dynamic approach may be required for newsworthy topic detection, finding an initial clear burst and subsequently seeking more subtle and overlapping topics.
Lastly, n-grams seem to work properly in our topic detection approach as they improve the quality of the topics detected compared to using single keywords.