Abstract – A huge amount of data (data measured in exabytes or zettabytes) is called Big Data. Quantifying such a large amount of data and storing it electronically is not easy. The Hadoop system is used to handle these large data sets. To collect big data according to a request, a MapReduce program is used. To achieve greater performance, big data requires proper scheduling. Scheduling techniques are used to assign jobs to available resources so as to minimize starvation and maximize resource utilization. Performance can be increased by implementing deadline constraints on jobs. The goal of this research is to study and analyze various scheduling algorithms for better
Index Terms – Big Data, MapReduce, Hadoop, Job Scheduling Algorithms.
The term big data [1] has become very popular in the Information Technology sector. Big data refers to a broad range of datasets that are hard to manage with conventional applications. Big data can be applied in finance and business, banking, online and onsite purchasing, healthcare, astronomy, oceanography, engineering, and many other fields. These datasets are very complex and are growing exponentially day by day. As data increases in volume, variety and velocity, processing it becomes more complex. Correlating, linking, matching and transforming such big data is a complex process. Big data, being a developing field, has many research problems and challenges to address. The
major research problems in big data are the following: 1) handling data volume, 2) analysis of big data, 3) privacy of data, 4) storage of huge amounts of data, 5) data visualization, 6) job scheduling in big data, 7) fault tolerance.
1) Handling data volume [1], [2]: The large amount of data coming from different fields of science, such as biology, astronomy and meteorology, makes processing very difficult for scientists.
2) Analysis of big data: Big data is difficult to analyze because of the heterogeneity and incompleteness of the data. Data can come in different formats, varieties and structures [3].
3) Privacy of data in the context of big data [3]: There is public fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is both a technical and a sociological problem.
4) Storage of huge amounts of data [1], [3]: This is the problem of how to recognize and efficiently store important information extracted from unstructured data.
5) Data visualization [1]: Data processing techniques should be efficient enough to enable real-time visualization.
6) Job scheduling in big data [4]: This problem focuses on the efficient scheduling of jobs in a distributed environment.
7) Fault tolerance [5]: This is another issue in the Hadoop framework. In Hadoop, the NameNode is a single point of failure. Block replication is one of the fault tolerance techniques used by Hadoop. Fault tolerance techniques must be efficient enough to handle failures in a distributed environment.
MapReduce [6] provides an ideal framework for
processing of such large datasets by using parallel and distributed programming. MapReduce operations depend on two functions, Map and Reduce. Both functions are written according to the needs of the user. The Map function takes an input pair and generates a set of intermediate key/value pairs. The MapReduce library then collects all the intermediate values that are associated with the same intermediate key and passes them to the Reduce function for further processing. The Reduce function receives an intermediate key together with its set of values and merges those values to form a smaller set of values. Figure 1 shows the overall MapReduce process.
Figure 1: Overall MapReduce Word Count Process.
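The word count process can be illustrated with a minimal single-machine sketch. It only imitates the Map, grouping and Reduce steps in memory and is not the Hadoop API; the function names map_fn, shuffle, reduce_fn and word_count are assumptions chosen for this illustration.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all intermediate values that share the same intermediate key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: merge the values of one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run Map over every line, group by key, then Reduce each group."""
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())

counts = word_count(["big data needs scheduling", "big jobs need big clusters"])
```

In a real cluster the Map calls run in parallel on different nodes and the grouping step moves data across the network; here everything runs sequentially in one process.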
Scheduling decisions are taken by the master node, called the Job Tracker, and the tasks are executed by the worker nodes, called Task Trackers. A Hadoop cluster includes a single master node and multiple slave nodes. Figure 2 shows the Hadoop architecture. The single master node hosts a Job Tracker, Task Tracker, Name Node and Data Node.
The function of the Job Tracker is to manage the Task Trackers and track resource availability. The Job Tracker is the node that controls the job execution process. It assigns MapReduce tasks to particular nodes in the cluster. Clients submit jobs to the Job Tracker. When the work is completed, the Job Tracker updates its status. Client applications can ask the Job Tracker for information.
The Task Tracker follows the orders of the Job Tracker and periodically updates the Job Tracker with its status. Task Trackers run tasks and send reports to the Job Tracker, which keeps a complete record of each job. Every Task Tracker is configured with a set of slots, which indicates the number of tasks that it can accept.
The name node maintains two tables that map blocks to their locations. Whenever a data node suffers disk corruption of a particular block, the first table is updated, and whenever a data node is detected to be dead because of a network failure or a node failure, both tables are updated. The tables are updated only on the failure of nodes. The update does not depend on any neighboring blocks or block locations to identify its destination. Each block is separated with its job nodes and respective
The node that stores the data in the Hadoop system is known as the data node. All data nodes send a heartbeat message to the name node every three seconds to indicate that they are alive. If the name node does not receive a heartbeat from a particular data node for ten minutes, it considers that data node to be dead or out of service and assigns some other data node to the process. The data nodes also periodically update the name node with their block information.
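The heartbeat rule described above can be sketched as follows. This is an illustrative model only, not HDFS code: the class NameNodeMonitor and its methods are hypothetical, and time is passed in explicitly as seconds so the example is deterministic.

```python
HEARTBEAT_INTERVAL_S = 3      # data nodes report every three seconds
DEAD_TIMEOUT_S = 10 * 60      # ten minutes of silence marks a node as dead

class NameNodeMonitor:
    """Hypothetical helper that tracks the last heartbeat of each data node."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node_id, now):
        """Record that node_id reported at time `now` (in seconds)."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > DEAD_TIMEOUT_S)

monitor = NameNodeMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
for t in range(3, 700, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("dn1", now=t)   # dn1 keeps reporting, dn2 goes silent
```

With this trace, asking for dead_nodes at 700 seconds reports only dn2, because dn1 was last heard from one second earlier.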
JOB SCHEDULING IN BIGDATA
The default scheduling algorithm is based on FIFO, where jobs are executed in the order of their submission. Later, the ability to set the priority of a job was added. Facebook and Yahoo contributed significant work in developing schedulers, namely the Fair Scheduler [8] and the Capacity Scheduler [9] respectively, which were subsequently released to the Hadoop community. This section describes various job scheduling algorithms in big data.
A. Default FIFO Scheduling
The default Hadoop scheduler operates using a FIFO queue. After a job is divided into independent tasks, they are loaded into the queue and allotted to free slots as these become available on Task Tracker nodes. Although there is support for assigning priorities to jobs, this is not turned on by default. Typically each job would use the complete cluster, so jobs had to wait for their turn. Even though a shared cluster offers great potential for offering large resources to many users, the problem of sharing resources fairly between users requires a better scheduler. Production jobs bet in a
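The FIFO behaviour described in this subsection can be sketched in a few lines. The function fifo_assign and the job names are assumptions made for illustration; priorities are ignored, matching the default configuration.

```python
from collections import deque

def fifo_assign(jobs, free_slots):
    """Assign queued tasks to free slots strictly in job submission order.

    jobs: list of (job_name, task_count) pairs in submission order.
    Returns one job name per slot that was filled.
    """
    queue = deque()
    for name, task_count in jobs:
        queue.extend([name] * task_count)   # tasks enter the queue job by job
    assigned = []
    while queue and len(assigned) < free_slots:
        assigned.append(queue.popleft())
    return assigned

schedule = fifo_assign([("long_job", 4), ("short_job", 1)], free_slots=4)
```

Here all four slots go to long_job, so short_job has to wait even though it needs only one slot, which is the weakness the later schedulers try to address.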
B. Fair Scheduling
The Fair Scheduler [8] was developed at Facebook to manage access to their Hadoop cluster and subsequently released to the Hadoop community. The Fair Scheduler aims to give each user a fair share of the cluster capacity over time. Users may assign jobs to pools, with every pool allocated a guaranteed minimum number of Map and Reduce slots. Free slots in idle pools may be allocated to other pools, while excess capacity within a pool is shared among jobs. The Fair Scheduler supports preemption, so if a pool has not received its fair share for a certain period of time, the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity. In addition, administrators may enforce priority settings on certain pools. Tasks are therefore scheduled in an interleaved fashion, based on their priority within their pool, and on the cluster capacity and usage of their pool. As jobs have their tasks assigned to Task Tracker slots for computation, the scheduler tracks the deficit between the amount of compute time actually used and the ideal fair allocation for that job. Over time, this has the effect of ensuring that jobs receive roughly equal amounts of resources. Shorter jobs are assigned enough resources to finish quickly. At the same time, longer jobs are guaranteed not to be starved of resources.
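The deficit idea can be sketched with a toy simulation of a single slot. This is a simplified model under stated assumptions, not the Fair Scheduler implementation: every task takes one time step, the fair share is an equal split among runnable jobs, and the function name fair_schedule is hypothetical.

```python
def fair_schedule(jobs, steps):
    """Simulate one slot shared by several jobs using a time deficit.

    jobs: dict mapping job name to its number of tasks (one step per task).
    Returns the order in which jobs were given the slot.
    """
    remaining = dict(jobs)
    deficit = {name: 0.0 for name in jobs}
    order = []
    for _ in range(steps):
        runnable = [n for n in remaining if remaining[n] > 0]
        if not runnable:
            break
        share = 1.0 / len(runnable)
        for n in runnable:
            deficit[n] += share          # time the job should have received
        chosen = max(runnable, key=lambda n: deficit[n])
        deficit[chosen] -= 1.0           # time the job actually received
        remaining[chosen] -= 1
        order.append(chosen)
    return order

order = fair_schedule({"long_job": 4, "short_job": 1}, steps=5)
```

In this run short_job gets the slot at the second step instead of waiting behind all four tasks of long_job, while long_job still completes all of its tasks.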
C. Capacity Scheduling
The Capacity Scheduler [10], initially developed at Yahoo, addresses a usage scenario where the number of users is large and there is a need to ensure a fair allocation of computation resources between users.
The Capacity Scheduler allocates jobs, based on the submitting user, to queues with configurable numbers of Map and Reduce slots. Queues that contain jobs are given their configured capacity, while free capacity in a queue is shared among other queues. Within a queue, scheduling operates on a modified priority queue basis with specific user limits, with priorities adjusted based on the time a job was submitted and on the priority setting allocated to that user and class of job. When a Task Tracker slot becomes free, the queue with the lowest load is chosen, from which the oldest remaining job is chosen. A task is then scheduled from that job. This has the effect of enforcing cluster capacity sharing among users, rather than among jobs, as was the case in the Fair Scheduler.
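The slot-selection rule described above (lowest-loaded queue first, then the oldest job in it) can be sketched as follows. The data layout and the function schedule_free_slot are hypothetical, and load is taken here to mean running tasks divided by configured capacity, which is an assumption of this sketch; user limits and priorities are left out.

```python
def schedule_free_slot(queues):
    """Pick the (queue, job) that should receive a newly freed slot.

    queues: list of dicts with "name", "capacity", "running" and "jobs";
    each job is a dict with "name", "submitted" and "pending" task count.
    """
    candidates = [q for q in queues
                  if any(j["pending"] > 0 for j in q["jobs"])]
    if not candidates:
        return None
    # Lowest load first; load is assumed to be running tasks / capacity.
    queue = min(candidates, key=lambda q: q["running"] / q["capacity"])
    # Within the queue, take the oldest job that still has pending tasks.
    job = min((j for j in queue["jobs"] if j["pending"] > 0),
              key=lambda j: j["submitted"])
    job["pending"] -= 1
    queue["running"] += 1
    return queue["name"], job["name"]

queues = [
    {"name": "research", "capacity": 4, "running": 3,
     "jobs": [{"name": "r1", "submitted": 1, "pending": 2}]},
    {"name": "production", "capacity": 8, "running": 2,
     "jobs": [{"name": "p2", "submitted": 7, "pending": 1},
              {"name": "p1", "submitted": 5, "pending": 3}]},
]
first = schedule_free_slot(queues)
```

The production queue is chosen because its load (2 of 8) is lower than that of research (3 of 4), and p1 is chosen within it because it was submitted before p2.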
D. Dynamic Proportional Scheduling
As claimed by Sandholm and Lai [12], Dynamic Proportional scheduling provides more job sharing and prioritization, which results in an increased share of cluster resources and more differentiation in the service levels of different jobs. This algorithm improves response time for multi-user Hadoop environments.
E. Resource-Aware Adaptive Scheduling (RAS)
To increase resource utilization among machines while monitoring the completion time of processes, RAS was proposed by Polo et al. [13] for MapReduce with
Zhao et al. [14] provide a task scheduling algorithm based on resource attribute selection (RAS). It determines the resources of an execution node by sending it a group of test tasks before a task is scheduled, and then chooses the optimal node to execute the task according to the task's resource needs and the fit between the resource node and the task, using historical task information if available.
F. MapReduce task scheduling with deadline constraints (MTSD) algorithm
According to Tang et al. [15], the scheduling algorithm sets two deadlines: map-deadline and reduce-deadline. The reduce-deadline is simply the user's job deadline. Pop et al. [16] present a classical approach to periodic task scheduling by considering a scheduling system with different queues for periodic and aperiodic tasks, with the deadline as the main constraint, and develop a method to estimate the number of resources required to schedule a group of aperiodic tasks, considering both execution and data transfer costs. Based on a mathematical model, and by using different simulation scenarios, the authors proved the following statements: (1) multiple sources of independent aperiodic tasks can be approximated by a single one; (2) when the number of estimated resources exceeds a data center's capacity, migrating tasks between different regional centers is the appropriate solution with respect to the global deadline; and (3) in a heterogeneous data center, a higher number of resources is needed for the same request with respect to the deadline constraints. For MapReduce, Wang and Li [17] detailed task scheduling for distributed data centers on heterogeneous networks through adaptive heartbeats, job deadlines and data locality. Job deadlines are divided according to the maximum data volume of tasks. With these constraints, task scheduling is formulated as an assignment problem in each heartbeat, in which adaptive heartbeats are calculated from the processing times of tasks, jobs are sequenced in terms of the divided deadlines, and tasks are scheduled by the Hungarian algorithm. On the basis of data transfer and processing times, the most appropriate data center for all mapped jobs is determined in the reduce phase.
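To make the assignment-problem formulation concrete, the sketch below finds the cheapest one-to-one mapping of tasks to nodes. It uses a brute-force search over permutations as a stand-in for the Hungarian algorithm: both return an optimal assignment, but the brute-force version is only practical for a handful of tasks. The cost matrix and the function best_assignment are made up for illustration.

```python
from itertools import permutations

def best_assignment(cost):
    """Return the cheapest one-to-one assignment of tasks to nodes.

    cost[t][n] is the estimated cost of running task t on node n
    (a square matrix). Brute force is used instead of the Hungarian
    algorithm, so this only scales to very small inputs.
    """
    size = len(cost)
    best = min(permutations(range(size)),
               key=lambda perm: sum(cost[t][perm[t]] for t in range(size)))
    total = sum(cost[t][best[t]] for t in range(size))
    return list(best), total

# Hypothetical costs (for example, transfer plus processing time).
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
assignment, total = best_assignment(cost)
```

For this matrix the optimal assignment places task 0 on node 1, task 1 on node 0 and task 2 on node 2, for a total cost of 5.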
G. Delay Scheduling
The objective of delay scheduling is to address the conflict between locality and fairness. When a node requests a task, if the head-of-line job cannot launch a local task, the scheduler skips that job and looks at later jobs. If a job has been skipped for long enough, it is allowed to launch non-local tasks, to avoid starvation. Delay scheduling temporarily relaxes fairness to achieve higher locality by allowing jobs to wait for scheduling on a node with local data. Song et al. [18] offer a game-theory-based technique to solve scheduling problems by separating a Hadoop scheduling problem into two levels: job level and task level. For job-level scheduling, a bid model is used to guarantee fairness and reduce the average waiting time. For the task level, the scheduling problem is transformed into an assignment problem and the Hungarian method is used to optimize it. Wan et al. [19] provide a multi-job scheduling algorithm in MapReduce, based on game theory, that deals with the competition for resources between many jobs.
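The skip-and-wait rule of delay scheduling can be sketched as below. The job records, the max_skips threshold and the function delay_schedule are assumptions for illustration; the sketch counts skipped scheduling opportunities rather than elapsed time and assumes every job always has tasks left to launch.

```python
def delay_schedule(jobs, node, max_skips):
    """Pick a job for a node that has a free slot.

    jobs: list ordered by fairness (head-of-line job first); each job is a
    dict with "name", "local_nodes" (nodes holding its input) and "skips".
    Returns (job name, "local" or "non-local"), or None if all jobs wait.
    """
    for job in jobs:
        if node in job["local_nodes"]:
            job["skips"] = 0
            return job["name"], "local"
        if job["skips"] >= max_skips:
            return job["name"], "non-local"   # waited long enough
        job["skips"] += 1                      # skip and look at later jobs
    return None

jobs = [
    {"name": "A", "local_nodes": {"n1"}, "skips": 0},
    {"name": "B", "local_nodes": {"n2"}, "skips": 0},
]
# Node n2 asks for work three times in a row.
decisions = [delay_schedule(jobs, "n2", max_skips=2) for _ in range(3)]
```

Job A is skipped twice in favour of local tasks from job B, and on the third request it is allowed to run a non-local task on n2.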
H. Multi Objective Scheduling
et al. [20] describe a scheduling algorithm named MOMTH that considers objective functions related to resources and users at the same time, with constraints such as deadline and budget.
The execution model assumes that all MapReduce jobs are independent, that there are no node failures before or during the scheduling computation, and that the scheduling decision is taken solely on the basis of present knowledge. Bian et al. present a scheduling strategy. According to this strategy, the cluster detects the speed of the current nodes and creates backups of the intermediate MapReduce data on a high-performance cache server, since the data created by a node could go wrong shortly. Hence the cluster can quickly resume execution at the previous level if several nodes go wrong; the reduce nodes then read the Map output from the cache server, or from both the cache and the node, and keeps its
I. Hybrid Multistage Heuristic Scheduling (HMHS)
et al. [21] elaborate a heuristic scheduling algorithm named HMHS that attempts to solve the scheduling problem by splitting it into two sub-problems: sequencing and dispatching. For sequencing, they use a heuristic based on Pri (the modified Johnson's algorithm). For dispatching, they propose two heuristics: Min-Min and Dynamic Min-Min.
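The general Min-Min dispatching heuristic can be sketched as follows. This is the textbook form of Min-Min, not necessarily the exact variant used in HMHS; the function min_min and the execution-time table etc are hypothetical.

```python
def min_min(etc):
    """Dispatch tasks to machines with the Min-Min heuristic.

    etc: dict mapping each task to a list of expected execution times,
    one entry per machine. Returns the dispatch order and the makespan.
    """
    num_machines = len(next(iter(etc.values())))
    ready = [0] * num_machines        # time at which each machine is free
    unscheduled = set(etc)
    plan = []
    while unscheduled:
        # Among all (task, machine) pairs, take the smallest completion
        # time: that is the task whose best completion time is minimal.
        finish, task, machine = min(
            (ready[m] + etc[t][m], t, m)
            for t in unscheduled for m in range(num_machines)
        )
        ready[machine] = finish
        unscheduled.remove(task)
        plan.append((task, machine))
    return plan, max(ready)

etc = {"t1": [4, 6], "t2": [3, 5], "t3": [8, 2]}
plan, makespan = min_min(etc)
```

For this table the heuristic dispatches t3 to machine 1 first, then t2 and t1 to machine 0, giving a makespan of 7.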
TABLE I: COMPARISON OF VARIOUS JOB SCHEDULING ALGORITHMS IN BIGDATA

Default FIFO Scheduling [22]
Key idea: Schedules jobs based on their priorities in first-in first-out order.
Advantages:
1. Cost of the entire cluster scheduling process is low.
2. Simple to implement and efficient.
Disadvantages:
1. Designed only for single type of
2. Low performance when running multiple types of jobs.
3. Poor response times for short jobs compared to large jobs.

Fair Scheduling [8]
Key idea: Distributes compute resources equally among the users/jobs in the system.
Advantages:
1. Less complex.
2. Works well when both small and
3. Can provide fast response times for small jobs mixed with larger jobs.
Disadvantages:
1. Does not consider the job weight of

Capacity Scheduling
Key idea: Maximizes resource utilization and throughput in a multi-tenant cluster environment.
Advantages:
1. Ensures guaranteed access with the potential to reuse unused capacity and prioritize jobs within queues over
Disadvantages:
1. The most complex among the three schedulers.

Dynamic Proportional Scheduling [12]
Key idea: Planned for data-intensive workloads; tries to maintain data locality during job execution.
Advantages:
1. It is fast and flexible.
2. It improves response time for multi-user Hadoop environments.
Disadvantages:
If the system eventually crashes, all unfinished low-priority processes are lost.

Resource-Aware Adaptive Scheduling
Key idea: Dynamic Free Slot Advertisement. Free
Advantages: It improves job performance.
Disadvantages: Only takes action on appropriate slow

MapReduce task scheduling with deadline constraints (MTSD) [15]
Key idea: Achieves nearly full overlap via the novel idea of including reduce in the overlap.
Advantages:
1. Reduces computation time.
2. Improves performance for the important class of shuffle-heavy MapReduce jobs.
Disadvantages: Works well with small clusters only.

Delay Scheduling
Key idea: Addresses the conflict between locality and fairness.
Advantages:
1. Simplicity of scheduling.

Multi Objective Scheduling [20]
Key idea: The execution model assumes that all MapReduce jobs are independent, that there are no node failures before or during the scheduling computation, and that the scheduling decision is taken only based on
Advantages: Keeps performance high.
Disadvantages: Execution time is too large.

Hybrid Multistage Heuristic Scheduling
Key idea: Uses Johnson's algorithm together with the Min-Min and Dynamic Min-Min algorithms.
Advantages: Achieves not only a high data locality rate but also high cluster utilization.
Disadvantages: It does not ensure reliability.
This paper provides a classification of Hadoop schedulers based on different parameters such as time, priority and resources. It discusses how various task scheduling algorithms help in achieving better results in a Hadoop cluster. Furthermore, this paper also discusses the advantages and disadvantages of various task scheduling algorithms. The comparison shows that each scheduling algorithm has some advantages and disadvantages. So, all algorithms are important in job
This paper gives an overall idea of different job scheduling algorithms in big data and compares most of the properties of various task scheduling algorithms. Individual scheduling techniques used to improve data locality, efficiency, makespan, fairness and performance are elaborated and discussed. However, scheduling remains an open area for researchers to