ABSTRACT

Mass media sources, specifically the news media, have traditionally informed us of
daily events. In modern times, social media services such as Twitter provide an
enormous amount of user-generated data, which have great potential to contain
informative news-related content. For these resources to be useful, we must
find a way to filter noise and only capture the content that, based on its
similarity to the news media, is considered valuable. However, even after noise
is removed, information overload may still exist in the remaining data, so it
is useful to prioritize that data for consumption. To achieve prioritization,
information must be ranked in order of estimated importance considering three
factors. First, the temporal prevalence of a particular topic in the news media
is a factor of importance, and can be considered the media focus (MF) of a
topic. Second, the temporal prevalence of the topic in social media indicates
its user attention (UA). Last, the interaction between the social media users
who mention this topic indicates the strength of the community discussing it,
and can be regarded as the user interaction (UI) toward the topic. We propose
an unsupervised framework which identifies news topics prevalent in both social
media and the news media, and then ranks them by relevance using their degrees
of MF, UA, and UI.

INTRODUCTION

The mining of valuable information from online sources has become a prominent
research area in information technology in recent years. Historically,
knowledge that apprises the general public of daily events has been provided by
mass media sources, specifically the news media. Many of these news media
sources have either abandoned their hard-copy publications and moved to the
World Wide Web, or now produce both hard-copy and Internet versions
simultaneously. These news media sources are considered reliable because they
are published by professional journalists, who are held accountable for their
content. On the other hand, the Internet, being a free and open forum for
information exchange, has recently seen a fascinating phenomenon known as
social media. In social media, regular, nonjournalist users are able to publish
unverified content and express their interest in certain events. Microblogs
have become one of the most popular social media outlets. One microblogging
service in particular, Twitter, is used by millions of people around the world,
providing enormous amounts of user-generated data.

For social media data to be of any use for topic identification, we must find a way
to filter uninformative information and capture only information which, based
on its content similarity to the news media, may be considered useful or
valuable. Unfortunately, even after the removal of unimportant content, there
is still information overload in the remaining news-related data, which must be
prioritized for consumption. To assist in the prioritization of news
information, news must be ranked in order of estimated importance. The temporal
prevalence of a particular topic in the news media indicates that it is widely
covered by news media sources, making it an important factor when estimating
topical relevance. This factor may be referred to as the MF of the topic. The
temporal prevalence of the topic in social media, specifically in Twitter,
indicates that users are interested in the topic and can provide a basis for
the estimation of its popularity. This factor is regarded as the UA of the
topic. Likewise, the number of users discussing a topic and the interaction
between them also gives insight into topical importance, referred to as the UI.
By combining these three factors, we gain insight into topical importance and
are then able to rank the news topics accordingly.

Additionally, news topics that perhaps were not perceived as popular by the
mass media could be uncovered from social media and given more coverage and
priority. For instance, a particular story that has been discontinued by news
providers could be revived and continued if it is still a popular topic among
social networks. This information, in turn, can be filtered to discover how
particular topics are discussed in different geographic locations, which can
serve as feedback for businesses and governments.

A straightforward approach for identifying topics from different social and
news media sources is the application of topic modeling. This approach,
however, misses the temporal component of prevalent topic detection; that is,
it does not take into account how topics change with time. Furthermore, topic
modeling and other topic detection techniques do not rank topics according to
their popularity by taking into account their prevalence in both news media and
social media.

The effectiveness of our system is validated by extensive controlled and
uncontrolled experiments. To achieve its goal, the system uses keywords from
news media sources (for a specified period of time) to identify the overlap
with social media from that same period. We then build a graph whose nodes
represent these
keywords and whose edges depict their co-occurrences in social media. The graph
is then clustered to clearly identify distinct topics. After obtaining
well-separated topic clusters (TCs), the factors that signify their importance
are calculated: MF, UA, and UI. Finally, the topics are ranked by an overall
measure that combines these three factors.
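The steps described above can be illustrated with a minimal Python sketch. It
is not the system's actual implementation: the function names, the edge-weight
threshold, the use of connected components as a stand-in for the clustering
step, and the equal-weight sum of MF, UA, and UI are all assumptions made for
illustration, and the three factor values are taken as given inputs rather than
computed.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_graph(news_keywords, tweets):
    """Count how often pairs of news keywords co-occur in the same tweet.

    news_keywords: iterable of keyword strings taken from news articles.
    tweets: list of token lists, one list per tweet.
    Returns a Counter mapping (keyword_a, keyword_b) -> co-occurrence count.
    """
    keywords = set(news_keywords)
    edges = Counter()
    for tweet in tweets:
        present = sorted(keywords & set(tweet))  # overlap with news keywords
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


def cluster_topics(edges, min_weight=2):
    """Assumed stand-in for clustering: connected components over strong edges."""
    adjacency = defaultdict(set)
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def rank_topics(factors, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Rank topics by a weighted sum of (MF, UA, UI); the weighting is assumed.

    factors: dict mapping topic label -> (mf, ua, ui), each scaled to [0, 1].
    Returns a list of (label, score) pairs, highest score first.
    """
    w_mf, w_ua, w_ui = weights
    scored = {
        label: w_mf * mf + w_ua * ua + w_ui * ui
        for label, (mf, ua, ui) in factors.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

With real data, the keyword lists would come from news articles and tweets
collected over the same period, and the clustering step would be replaced by
whatever graph clustering method the system actually uses.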

 

LITERATURE SURVEY

1. Title: Topic Detection by Clustering Keywords

Authors: C. Wartena and R. Brussee

Abstract:

We consider topic detection without any prior knowledge of category
structure or possible categories. Keywords are extracted and clustered based on
different similarity measures using the induced k-bisecting clustering
algorithm. Evaluation on Wikipedia articles shows that clusters of keywords
correlate strongly with the Wikipedia categories of the articles. In addition,
we find that a distance measure based on the Jensen-Shannon divergence of
probability distributions outperforms the cosine similarity. In particular, a
newly proposed term distribution taking co-occurrence of terms into account
gives best results.
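To make the two measures compared in this abstract concrete, the following
sketch computes the Jensen-Shannon divergence and the cosine similarity for two
term distributions. It only illustrates the standard definitions; the function
names are hypothetical, and it does not reproduce the term distributions or the
clustering procedure used by the authors.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 add 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two probability distributions, in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm
```

Note that the divergence is a distance (0 for identical distributions), whereas
cosine similarity is a similarity (1 for vectors pointing in the same
direction), so one of them has to be inverted before the two can be compared.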

 

2. Title: Emerging topic detection on Twitter based on temporal and social terms
evaluation

Authors: M. Cataldi, L. Di Caro, and C. Schifanella

Abstract:

Twitter is a user-generated content system that allows its
users to share short text messages, called tweets, for a variety of purposes,
including daily conversations, URLs sharing and information news. Considering
its world-wide distributed network of users of any age and social condition, it
represents a low level news flashes portal that, in its impressive short
response time, has the principal advantage.

In this paper we recognize this primary role of Twitter and
we propose a novel topic detection technique that permits to retrieve in
real-time the most emergent topics expressed by the community. First, we
extract the contents (set of terms) of the tweets and model the term life cycle
according to a novel aging theory intended to mine the emerging ones. A term
can be defined as emerging if it frequently occurs in the specified time
interval and it was relatively rare in the past. Moreover, considering that the
importance of a content also depends on its source, we analyze the social
relationships in the network with the well-known PageRank algorithm in order
to determine the authority of the users. Finally, we leverage a navigable topic
graph which connects the emerging terms with other semantically related
keywords, allowing the detection of the emerging topics, under user-specified
time constraints. We provide different case studies which show the validity of
the proposed approach.
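The definition of an emerging term given in this abstract (frequent in the
current interval, relatively rare in the past) can be sketched as follows. This
is a simplified illustration only: it does not implement the authors' aging
theory or the PageRank-based user authority, and the function name, thresholds,
and smoothing constant are assumptions.

```python
from collections import Counter


def emerging_terms(current_docs, past_windows, min_freq=0.2, min_ratio=3.0,
                   smoothing=0.01):
    """Return terms that are frequent now and were comparatively rare before.

    current_docs: list of token lists for the current time interval.
    past_windows: list of earlier intervals, each a list of token lists.
    Returns a dict mapping each emerging term to its current/past ratio.
    """
    def document_frequency(docs):
        # Fraction of documents in which each term appears at least once.
        counts = Counter()
        for doc in docs:
            counts.update(set(doc))
        total = max(len(docs), 1)
        return {term: count / total for term, count in counts.items()}

    current = document_frequency(current_docs)
    past = [document_frequency(window) for window in past_windows]

    result = {}
    for term, freq in current.items():
        past_avg = sum(p.get(term, 0.0) for p in past) / max(len(past), 1)
        ratio = freq / (past_avg + smoothing)  # smoothing avoids division by zero
        if freq >= min_freq and ratio >= min_ratio:
            result[term] = ratio
    return result
```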

 

3. Title: Comparing Twitter and traditional media using topic models, in
Advances in Information Retrieval

Authors: W. X. Zhao et al.

Abstract:

Twitter as a new form of social
media can potentially contain much useful information, but content analysis on
Twitter has not been well studied. In particular, it is not clear whether as an
information source Twitter can be simply regarded as a faster news feed that
covers mostly the same information as traditional news media. In this paper we
empirically compare the content of Twitter with a traditional news medium, New
York Times, using unsupervised topic modeling. We use a Twitter-LDA model to
discover topics from a representative sample of the entire Twitter. We then use
text mining techniques to compare these Twitter topics with topics from New
York Times, taking into consideration topic categories and types. We also study
the relation between the proportions of opinionated tweets and retweets and
topic categories and types. Our comparisons show interesting and useful
findings for downstream IR or DM applications.

 

 

EXISTING SYSTEM

Media Focus Estimation: To estimate the MF of a TC, the news articles that are
related to the TC are first selected. This presents a problem similar to the
selection of tweets when calculating UA. The weighted nodes of the TC are used
to accurately select the articles that are genuinely related to its inherent
topic. The only difference is that, instead of comparing node combinations
with tweet content, they are compared to the top k keywords selected from each
article.

Hashtags are of great interest to us because of their potential to hold the
topical focus of a tweet. However, hashtags usually contain several words
joined together, which must be segmented in order to be useful. This problem
occurred in our existing work. The segmented terms are then tagged as
“hashtag.” To eliminate terms that are not relevant, only terms tagged as
hashtag, noun, adjective, or verb are selected. The terms are then lemmatized
and added to set T, which represents all unique terms that appear in tweets
from dates d1 to d2.
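The hashtag segmentation and term selection steps can be sketched as below. The
text does not state which segmentation method is used, so the dictionary-based
dynamic program, the function names, and the vocabulary passed in are
assumptions for illustration. Part-of-speech tagging and lemmatization are left
out; the sketch assumes the terms arrive already tagged.

```python
# Tags kept during term selection, as listed in the text.
KEPT_TAGS = {"hashtag", "noun", "adjective", "verb"}


def segment_hashtag(hashtag, vocabulary):
    """Split a hashtag into the fewest words found in `vocabulary`.

    Returns a list of lowercase words, or None if no segmentation exists.
    """
    text = hashtag.lstrip("#").lower()
    length = len(text)
    # best[i] holds the shortest segmentation of text[:i], or None.
    best = [None] * (length + 1)
    best[0] = []
    for end in range(1, length + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[length]


def select_terms(tagged_terms):
    """Keep only terms whose tag is hashtag, noun, adjective, or verb.

    tagged_terms: iterable of (term, tag) pairs.
    Returns the set of unique selected terms (lemmatization not applied here).
    """
    return {term for term, tag in tagged_terms if tag in KEPT_TAGS}
```

Preferring the segmentation with the fewest words is one simple heuristic; a
frequency-based scoring of candidate words would be a reasonable alternative.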

DISADVANTAGES

- Hard to find a way to filter news from noisy social media data.

- High computational demand to prioritize the remaining content.