The Clear Cloud - Home
Sentiment analysis of social media data using Big Data Processing Techniques
NOV 24, 2016 22:56 PM


With the extensive growth in the use of online social media, an enormous amount of data is now available that reflects users’ preferences regarding products, services provided by various organizations, and political issues. Microblogs and forums also let internet users express their opinions. Because mobile devices can access the network easily from anywhere, social media is becoming more and more popular. The number of people using social media increases day by day, as they can share their personal feelings at any time, and reviews are created on a large scale. Opinions and reviews are posted online every minute, and potential users rely on the reviews, opinions and feedback given by others when deciding whether to purchase an item; organizations that provide services rely on them when developing software. Analyzing these reviews, opinions and feedback is therefore of utmost importance. Evaluating them is not as easy as it appears, and it requires sentiment analysis, which greatly helps in understanding customer behavior. The biggest challenge is processing social data, which is in unstructured or semi-structured form. Earlier technologies fail to process data in this form effectively, so there is a need for a highly optimized, scalable and efficient technology to process the abundant data being produced at a high rate. The Hadoop framework analyzes unstructured and semi-structured data effectively.
As Hadoop is increasingly used to process huge data sets in various fields, maintaining its overall performance becomes essential. This is made possible by open source tools such as Spark, Hive, Flume, Oozie, Zookeeper and Sqoop, which work with Hadoop and make it even more powerful.


Sentiment is defined as an expression or opinion by an author about an object or one of its aspects. Analyzing, investigating and extracting users’ opinions, sentiments and preferences from subjective text is known as sentiment analysis. The main focus of sentiment analysis is parsing the text. In simple terms, sentiment analysis can be defined as detecting the polarity of the text, which can be positive, negative or neutral. It is also referred to as opinion mining, as it derives the opinion of the user. Opinions vary from user to user, and sentiment analysis greatly helps in understanding users’ perspectives. A sentiment can be expressed in one of two ways:

Direct opinion: As the name suggests, the opinion about an object is given directly, and it may be either positive or negative. For example, “The video clarity of the cellphone is poor” expresses a direct opinion.

Comparison opinion: This is a comparative statement that compares two similar objects. The statement “The picture quality of camera-x is better than that of camera-y” is one example of a comparative opinion.
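
Polarity detection, as described above, can be illustrated with a minimal lexicon-based sketch. The word lists here are tiny, hypothetical examples chosen for illustration, not a real sentiment lexicon, and the scoring rule (count of positive words minus count of negative words) is only one simple approach among many.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration);
# real systems use much larger, curated word lists.
POSITIVE = {"good", "great", "better", "excellent", "clear"}
NEGATIVE = {"poor", "bad", "worse", "terrible", "blurry"}


def polarity(text: str) -> str:
    """Return 'positive', 'negative' or 'neutral' for a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Score = number of positive words minus number of negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("The video clarity of the cellphone is poor"))
```

Note that such a sketch labels the comparative example only as positive overall; it does not work out which of the two cameras the opinion favors, which is why comparative opinions usually need extra handling.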

Sentiment analysis is performed at three different levels:

  • Sentiment analysis at sentence level identifies whether the given sentence is subjective or objective. Analysis at sentence level assumes that the sentence contains only one opinion.
  • Sentiment analysis at document level classifies the opinion about a particular entity. It assumes that the entire document contains opinions about a single object from a single opinion holder.
  • Sentiment analysis at feature level extracts the features of a particular object from the reviews and determines whether the stated opinion is positive or negative. The extracted features are then grouped and a summary report is produced.
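
The feature-level case can be sketched as follows. This is a simplified illustration under stated assumptions: the feature names and opinion words are hypothetical, and an opinion word is attributed to every feature mentioned in the same sentence, which a real system would refine.

```python
import re
from collections import defaultdict

# Hypothetical feature names and opinion words, assumed for illustration.
FEATURES = {"battery", "camera", "screen"}
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"poor", "bad", "terrible"}


def feature_summary(reviews):
    """Group opinions by product feature and count positive/negative mentions."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for review in reviews:
        # Treat each sentence separately so an opinion word is only
        # linked to features named in the same sentence.
        for sentence in re.split(r"[.!?]", review.lower()):
            words = set(re.findall(r"[a-z]+", sentence))
            for feature in FEATURES & words:
                if words & POSITIVE:
                    summary[feature]["positive"] += 1
                if words & NEGATIVE:
                    summary[feature]["negative"] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}


reviews = [
    "The camera is great. The battery is poor.",
    "The screen is excellent. The battery is bad.",
]
print(feature_summary(reviews))
```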

Architecture components of Big Data Ecosystem

With the explosive growth of data on the Internet and the improvement of corpora, sentiment analysis systems need big data processing techniques to complete their tasks. The term Big Data is characterized by three V’s (Volume, Variety, and Velocity). Volume is the amount of data to be processed. Variety refers to the different types of data, such as structured, semi-structured and unstructured, extracted from various sources. Velocity is the speed at which data is generated on the internet. For processing large data sets in parallel across a cluster of nodes, Apache came up with an open source framework known as Hadoop.

The major components of Hadoop are the Hadoop Distributed File System (HDFS) and the MapReduce programming model. Hadoop is accessible because it runs on clusters of commodity machines or on cloud computing services. It is robust: although it is intended to run on commodity hardware, it handles failures efficiently. It is scalable, since nodes can be added to the cluster to process huge data sets in parallel. It is simple, in that a user can quickly write parallel code. Data is distributed to every node, so operations run in parallel across the cluster, and Hadoop overcomes hardware failure by keeping multiple copies of the data. The modules of the Hadoop ecosystem are as follows:

1. Hadoop common utilities

Hadoop modules require operating system level and file system level abstractions, which are provided by Java libraries and utilities. Hadoop common utilities supply the Java files and scripts needed to start and run Hadoop.

2. Hadoop Distributed file system (HDFS)

Hadoop provides its own file system, the Hadoop Distributed File System (HDFS), for storing huge data sets. It is based on the Google File System (GFS) and is highly fault-tolerant. HDFS follows a master/slave architecture: the master node (NameNode) manages the file system, and the slave nodes (DataNodes) store the actual data. A file in an HDFS namespace is divided into several segments, which are stored on DataNodes. The NameNode keeps track of the mapping of these segments to DataNodes, and the DataNodes perform the read and write operations.
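
The division of a file into segments and the NameNode’s mapping of segments to DataNodes can be sketched as below. This is a toy model under stated assumptions: the round-robin placement and the node names are hypothetical, while real HDFS uses large blocks (128 MB by default in recent versions) and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide a file's contents into fixed-size segments (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, datanodes: list[str], replication: int = 3) -> dict:
    """Toy NameNode mapping: block index -> DataNodes that hold a replica.

    Round-robin placement is an assumption made for illustration only.
    """
    replication = min(replication, len(datanodes))
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)
mapping = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=2)
print(mapping)
```

Because each block lives on more than one DataNode, losing a single node in this model still leaves a copy of every block available, which is the idea behind HDFS fault tolerance.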

3. MapReduce

MapReduce is the distributed data processing framework of Apache Hadoop. It makes it possible to write applications effectively and to process huge data sets in parallel. The MapReduce paradigm has two different tasks:

  • The Map Task: The Map task takes the input data and breaks it down into tuples, each of which is a key/value pair.
  • The Reduce Task: The input to the Reduce task is the output of the Map task. The tuples produced by the Map task are combined into a smaller set of tuples. The Reduce task always follows the Map task.
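
The two tasks can be sketched in plain Python with a word count, the usual introductory example. This is a single-process illustration of the paradigm, not Hadoop’s API; the function names are assumptions, and the explicit grouping step stands in for the shuffle that Hadoop performs between the Map and Reduce tasks.

```python
from collections import defaultdict


def map_task(line):
    """Map: break one input line into (word, 1) key/value pairs."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Group all values by key, as happens between the Map and Reduce tasks."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine the values for one key into a single tuple."""
    return key, sum(values)


def run_job(lines):
    pairs = [pair for line in lines for pair in map_task(line)]
    return dict(reduce_task(key, values) for key, values in shuffle(pairs).items())


print(run_job(["good phone", "good camera poor battery"]))
```

In a real cluster the map calls run on different nodes over different input splits, and each reduce call receives the values for its keys from all of them.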

The MapReduce component of the Hadoop framework schedules and monitors the tasks and re-executes any failed task. The MapReduce framework has a single master JobTracker and one slave TaskTracker per cluster node. The JobTracker directs the TaskTrackers to execute tasks; it also manages resources and tracks their distribution, consumption and availability. The TaskTrackers, in turn, report status information to the JobTracker.

4. Hadoop YARN framework

YARN provides the computational resources required for application execution. It enables dynamic resource utilization on the Hadoop framework, as users can run various Hadoop applications without having to worry about increasing workloads. YARN has a Resource Manager and Node Managers for scheduling jobs and managing the resources of the cluster. The Resource Manager is the master. It performs resource scheduling by knowing where the slaves are located and how many resources they have. The Node Manager is the slave of the infrastructure. When it starts, it announces itself to the Resource Manager, and it then periodically sends a heartbeat to the Resource Manager.
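
The heartbeat and scheduling relationship can be sketched as follows. This is a toy model, not YARN’s actual interface: the class, its method names, the timeout value and the "most free memory" rule are all assumptions for illustration, and YARN’s real schedulers are considerably more elaborate.

```python
class ResourceManager:
    """Toy master that tracks Node Managers through their heartbeats."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout  # seconds without a heartbeat before a node is ignored
        self.nodes = {}         # node id -> {"free_mb": int, "last_seen": float}

    def heartbeat(self, node_id, free_mb, now):
        """Called by a Node Manager when it starts and periodically afterwards."""
        self.nodes[node_id] = {"free_mb": free_mb, "last_seen": now}

    def live_nodes(self, now):
        return [n for n, s in self.nodes.items() if now - s["last_seen"] <= self.timeout]

    def allocate(self, required_mb, now):
        """Pick the live node with the most free memory that fits the request."""
        candidates = [n for n in self.live_nodes(now)
                      if self.nodes[n]["free_mb"] >= required_mb]
        if not candidates:
            return None
        chosen = max(candidates, key=lambda n: self.nodes[n]["free_mb"])
        self.nodes[chosen]["free_mb"] -= required_mb
        return chosen


rm = ResourceManager(timeout=10.0)
rm.heartbeat("nm1", free_mb=4096, now=0.0)
rm.heartbeat("nm2", free_mb=8192, now=0.0)
print(rm.allocate(2048, now=5.0))
```

Time is passed in explicitly as `now` so that the behavior is easy to follow: a node that stops sending heartbeats simply drops out of `live_nodes` and receives no further work.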

Hadoop proves to be a reliable framework, and it processes huge data sets in a fault-tolerant manner, which makes it efficient. To help Hadoop work well, various open source technologies such as Spark, Flume, Hive, Mahout, Sqoop, Oozie and Zookeeper have been developed on top of it. Collectively called the Hadoop ecosystem, they improve the overall performance of Hadoop.

Figure: Hadoop Ecosystem

Data Access Components of Hadoop Ecosystem

The data access components of Hadoop are Apache Pig and Hive. They are used for analyzing large data sets without the low-level work of writing MapReduce programs. Apache Pig is a platform for analyzing large data sets. Pig’s infrastructure layer consists of a compiler that produces sequences of MapReduce programs, and its language layer currently consists of a textual language called Pig Latin. Hive is a data warehouse system for Hadoop, used for querying, summarizing and analyzing large data sets stored in HDFS. It provides an SQL-like interface, and data stored in HDFS is queried with HiveQL.
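
To give a feel for the SQL-like style of querying that Hive offers, the sketch below runs a summarizing query with Python’s built-in `sqlite3` module. SQLite is used purely as a stand-in here, and the `tweets` table and its columns are hypothetical; in Hive, a similar `GROUP BY` query would be written in HiveQL and executed over data stored in HDFS.

```python
import sqlite3


def sentiment_counts(records):
    """Count records per sentiment label with a SQL GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout, assumed for illustration.
    conn.execute("CREATE TABLE tweets (author TEXT, sentiment TEXT)")
    conn.executemany("INSERT INTO tweets VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT sentiment, COUNT(*) FROM tweets "
        "GROUP BY sentiment ORDER BY sentiment"
    ).fetchall()
    conn.close()
    return rows


print(sentiment_counts([
    ("alice", "positive"),
    ("bob", "negative"),
    ("carol", "positive"),
]))
```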

Data Integration Components of Hadoop Ecosystem

The data integration components of the Hadoop ecosystem are Flume and Sqoop. Flume is used for populating Hadoop with data; it is responsible for the collection, aggregation and movement of data. Sqoop, along with REST and ODBC interfaces, provides connectivity for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop. Users can specify the target location inside Hadoop and instruct Sqoop to move data from a relational database to that target.

Data Storage Component of Hadoop Ecosystem

HBase is the data storage component of the Hadoop ecosystem, used to handle huge data sets with billions of rows and columns. It easily combines data sources with different structures and schemas, and it is a non-relational database. HBase provides transactional capabilities to Hadoop and allows users to perform insert, update and delete operations.
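
The way a non-relational store accommodates rows with different structures can be sketched with a small in-memory table. This is a toy model, not the HBase client API: the class and method names are assumptions, and the `family:qualifier` column names merely imitate the column naming style used in HBase.

```python
class ToyTable:
    """In-memory sketch of a sparse table: row key -> {column: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        """Insert a new cell or update an existing one."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Return a copy of all cells stored for a row (empty if none)."""
        return dict(self.rows.get(row_key, {}))

    def delete(self, row_key, column):
        """Remove one cell; missing rows or columns are ignored."""
        self.rows.get(row_key, {}).pop(column, None)


table = ToyTable()
table.put("tweet1", "content:text", "good phone")
table.put("tweet1", "meta:lang", "en")
table.put("review7", "content:rating", 4)  # a row with entirely different columns
table.put("tweet1", "meta:lang", "en-IN")  # update
table.delete("tweet1", "content:text")     # delete
print(table.get("tweet1"), table.get("review7"))
```

Each row stores only the columns it actually has, so records from differently structured sources can sit side by side in the same table.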


Monitoring, Management and Orchestration Components of Hadoop Ecosystem

Oozie handles workflow processing and job management. It lets users define a series of jobs written in multiple languages, such as MapReduce, Pig and Hive, and link them to one another. ZooKeeper provides coordination and synchronization services for distributed systems. It allows developers to focus on core application logic without worrying about the distributed nature of the application. It is also described as a Hadoop admin tool for managing large sets of hosts.
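
The idea of linking jobs to one another can be sketched as a dependency graph whose jobs run in dependency order. This is an illustration only: the job names are hypothetical, and real Oozie workflows are defined in XML rather than in Python. The sketch uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "flume_ingest": set(),
    "pig_clean": {"flume_ingest"},
    "hive_aggregate": {"pig_clean"},
    "export_report": {"hive_aggregate"},
}


def execution_order(jobs):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(jobs).static_order())


print(execution_order(workflow))
```

If the dependencies formed a cycle, `static_order()` would raise `graphlib.CycleError`, which reflects the requirement that a workflow of linked jobs must not depend on itself.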

The Hadoop ecosystem also includes Mahout and Ambari, which support big data analytics to meet business requirements.


Ambari, a Hadoop component, provides a web user interface and an API for Hadoop management. It offers a step-by-step wizard for installing Hadoop ecosystem services. It is equipped with central management to start, stop and reconfigure Hadoop services, and it provides metrics collection and an alert framework that can monitor the status of the Hadoop cluster.


Mahout is a Hadoop component for machine learning. It provides implementations of various machine learning algorithms, such as classification and clustering. Mahout helps in making suggestions based on user behavior, clustering items into their respective groups, classifying items based on that categorization, and supporting the implementation of group mining.


Many researchers have illustrated the use of open source technologies for sentiment analysis. Social media data can be analyzed with the Hadoop ecosystem. Apache Flume is used to collect the data from social media, and it effectively handles both aggregating the collected data and moving it to HDFS. The combination of Spark RDDs and Hadoop provides significant computational capabilities in a fault-tolerant cluster setup built from low-priced commodity hardware. This provides an information analytics layer on top of Hadoop that embraces the MapReduce paradigm and the resilience of Spark RDDs, along with an advanced statistical analysis layer through R, with design-time and run-time optimizations of the open source stack. While designing such a system, the researchers faced the challenge of constructing the polarity lexicon, as it was the main component in analyzing and classifying sentiments. A few researchers used Hive for storage and HiveQL for the analysis of Twitter data.

Other authors worked on trend analysis of e-commerce data using the Hadoop ecosystem. The Hadoop parallel processing framework is used to process large data sets with MapReduce programming, and Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying and analysis. The authors ran an hourly job on the server to fetch the latest log files and store a copy in a MySQL database; for processing by Hadoop, they used Sqoop to load the data into Hive.

Researchers have also made a comparative study of Big Data technologies such as Apache Pig, Hive, Sqoop, HBase, Zookeeper and Flume, integrated with Hadoop to increase its efficiency and performance. The experiments in that study show that as the number of nodes increases, the CPU time spent on MapReduce also increases, but Hive query execution time decreases, which is an advantage. They concluded that all the technologies implemented on top of Hadoop improve the performance of the basic Hadoop MapReduce framework.


With the increasing dependence on social media data, the information obtained from the web in the form of feedback and comments has gained much attention in the field of sentiment analysis. The major challenge that comes with this extensive growth in the use of social media is processing and analyzing the huge data sets produced as a result. By implementing the MapReduce paradigm, the Hadoop framework proves to be reliable, as it processes huge data sets in a fault-tolerant manner. Technologies such as Apache Pig, Hive, Sqoop, HBase, Zookeeper and Flume can be implemented on top of Hadoop in order to improve its efficiency and performance. The combination of Spark RDDs and Hadoop also improves in-memory computation capabilities.


[1] Devendra K Tayal and Sumit Kumar Yadav, “Fast retrieval approach of sentiment analysis using bloom filter hadoop.” Computational Techniques in Information and Communication Technologies (ICCTICT) (2016): 14-18.

[2] Aditya Bhardwaj, Vanraj, Ankit Kumar, Yogendra Narayan, Pawan Kumar, “Big Data Emerging Technologies: A CaseStudy with Analyzing Twitter Data using Apache Hive.” Recent Advances in Engineering and Computational Sciences (RAECS) (2015): 1-6.

[3] Fatima Zohra ENNAJI, Abdelaziz EL FAZZIKI, Mohamed SADGAL, Djamal BENSLIMANE, “Social Intelligence Framework: Extracting and Analyzing Opinions for Social CRM.” Computer Systems and Applications (AICCSA) (2015): 1-7.


[5] Rama Satish K. V., N. P. Kavya, “Trend Analysis of E-Commerce Data using Hadoop Ecosystem.” International Journal of Computer Applications (0975 – 8887), Volume 147, No. 6, August 2016.

[6] Sangeeta, “Twitter Data Analysis Using FLUME & HIVE on Hadoop FrameWork.” International Journal of Recent Advances in Engineering & Technology (IJRAET), V-4 I-2.


Author details:

1. Anisha P Rodrigues, Assistant Professor, Department of Computer Science and Engineering, NMAMIT, Nitte – 574110

  • Areas of Interest: Sentimental analysis, Big data analytics, Internet of Things, Image processing

2. Dr. Niranjan N Chiplunkar, Principal and Professor, Department of Computer Science and Engineering, NMAMIT, Nitte – 574110

Dr. Niranjan has written one textbook, “CAD for VLSI”, published by NMAMIT Publications in 2007, and another, “VLSI CAD”, published by PHI Learning, New Delhi, in 2011. Prof. Chiplunkar received the Bharatiya Vidya Bhavan National Award for the “Best Engineering College Principal – 2014” from the Indian Society for Technical Education. He also received the “Excellent Achievement Award” from the Centre for International Cooperation on Computerization, Govt. of Japan, in March 2002.

  • Areas of Interest: CAD for VLSI, Embedded Systems, Computer Networks & Security.
