
Journal of High Speed Networks 23 (2017) 49–57
DOI 10.3233/JHS-170556
IOS Press
Performance analysis of clustering algorithm
under two kinds of big data architecture
Beibei Li a, Bo Liu a, Weiwei Lin b,∗ and Ying Zhang a
a School of Computer, South China Normal University, Guangzhou, China
E-mails: 1538339980@qq.com, liugubin@126.com, 2283240387@qq.com
b School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
E-mail: linww@scut.edu.cn
Abstract. To compare the performance of a clustering algorithm on two data processing architectures, this paper first presents implementations of the k-means clustering algorithm on two big data architectures. We then analyze the theoretical performance differences of the k-means algorithm on the two architectures from a mathematical point of view. The theoretical analysis shows that the Spark architecture is superior to Hadoop in terms of average execution time and I/O time. Finally, a text data set of user behaviors on a social networking site is used to conduct algorithm experiments. The results show that, for the k-means algorithm, Spark requires significantly less execution time and I/O time than MapReduce. The theoretical analysis and the implementation techniques for big data algorithms presented in this paper provide a useful reference for the application of big data technology.
Keywords: Hadoop, MapReduce, Spark, clustering algorithm, big data, k-means
1. Introduction
With the coming of the Internet+ era, massive data is being produced in all aspects of social life. How to dig out the enormous value hidden in it has become a focus of the community and has even risen to the level of national strategy. In March 2012, the Obama administration announced a plan to invest $200 million in a "Big Data Research and Development Initiative", following the "information superhighway" plan [15] announced in 1993. Data from McKinsey's 2012 big data report showed that the big data industry brought $300 billion in annual revenue to the US health care system and €250 billion in annual revenue to European public administration, raised retailers' margins by up to 60%, and cut product development costs in manufacturing by 50%. Gartner, however, predicted that by 2015 more than 85% of Fortune 500 companies would fail to gain an advantage in the big data competition [11]. The market research firm IDC predicted that the big data technology and services market would grow from $3.2 billion in 2010 to $16.9 billion in 2015, an annual growth rate of 40% [9]. These statistics show that big data is widely applied and of great value. In terms of the concept and state of research of big data, the core force driving its development is big data processing technology: whether the enormous scientific and economic value hidden in massive data can be extracted depends on it. Big data technology has therefore become a research hot spot [13]. Traditional data processing models are limited in memory and processing capability and cannot meet actual demands. With the development of science and technology, parallel processing mechanisms such as MPI, PVM and MapReduce have come into wide use in recent years. However, as research on machine learning deepens, a large number of applications require iterative algorithm processing, and traditional data processing architectures handle such applications unsatisfactorily.
*Corresponding author. E-mail: linww@scut.edu.cn.
0926-6801/17/$35.00 © 2017 – IOS Press and the authors. All rights reserved
Spark, an open-source general-purpose parallel cloud computing platform developed by the UC Berkeley AMP Lab, meets these needs [24]. Spark is the latest parallel distributed computing framework in the big data technology chain and is based mainly on in-memory computing. Issues related to memory computing have received support from the National Natural Science Foundation, and related research has begun; Spark is also backed by many companies in the market, such as Alibaba, Baidu and NetEase. Researchers are particularly concerned with the performance of data processing platforms. At home and abroad, most research has focused on the differences between MapReduce [14] and Spark, on integrating memory computing with data mining algorithms on Spark [20], and on improving clustering algorithms [19] in combination with the Spark platform. The decision tree study on the two architectures in [24] shows that Spark is more suitable for iterative algorithms, but it does not examine the performance differences of the two architectures in depth. The performance differences of the two architectures have also been studied in combination with the k-means algorithm [21]. The latest research on the performance differences of the two architectures analyzes results only through experiments; theoretical analyses from a mathematical point of view are rare.
In this paper we first present the implementations of the k-means clustering algorithm on MapReduce and Spark. We then analyze the theoretical performance differences of the two architectures from a mathematical point of view. Finally, we use experiments to verify the validity of the theoretical analysis.
2. Two implementations of k-means algorithm
2.1. Overview of k-means algorithm
K-means is a distance-based, unsupervised clustering algorithm. It has been widely used in science, industry, business and other fields [17]. Its cluster similarity criterion is the distance between data objects: data in the same cluster are similar, while data in different clusters differ. The clustering criterion function is the sum-of-squared-deviations function, defined as
$$G_c = \sum_{j=1}^{c} \sum_{k=1}^{n_j} \left\| x_k^{(j)} - m_j \right\|^2$$
For each data object $x_i$, the following function determines the class to which $x_i$ belongs:
$$c_i = \arg\min_j \|x_i - m_j\|^2$$
where $m_j$ is the center of cluster $j$. The new center of cluster $j$ is computed as
$$m_{\mathrm{new},j} = \frac{\sum_{i=1}^{n} w_{ij} x_i}{\sum_{i=1}^{n} w_{ij}}$$
where $x_i$ is a data object and $w_{ij}$ indicates whether $x_i$ belongs to class $j$: $w_{ij} = 1$ if it does, and $w_{ij} = 0$ otherwise.
The k-means algorithm [30] is implemented as follows:
Input: data set D, number of clusters k
Output: the k clusters
Select k data objects from D as initial centers;
Repeat
    For each data object x_i in D
        compute the distance from x_i to each cluster center;
        assign x_i to the nearest cluster;
    End For
    compute the mean of the data objects in each cluster and use it as the new cluster center;
Until the cluster centers no longer change [17,30].
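To make the pseudocode concrete, the following is a minimal serial sketch in Python with NumPy. It is illustrative only: the toy data set, the value of k and the stopping rule are our choices, not the paper's.

import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=k, replace=False)]  # initial centers
    for _ in range(max_iter):
        # assignment step: c_i = argmin_j |x_i - m_j|^2
        dists = ((D[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([D[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # centers no longer change
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
D = rng.normal(size=(300, 2))                  # toy 2-D data
labels, centers = kmeans(D, k=3)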
From this implementation we can see that the algorithm becomes inefficient when it must iterate many times over massive data, and it then cannot meet the needs of practical applications. Parallel implementations of k-means solve this problem [17]. The following subsections introduce the parallel implementations of k-means on MapReduce and Spark.
2.2. Parallel implementation of k-means based on MapReduce
As discussed in Section 2.1, the key to parallelizing the algorithm is that different samples can be assigned independently to their nearest clusters. The Map and Reduce operations are the same in each iteration of the parallel k-means algorithm [2]. First, we randomly select k samples as centers and store them in HDFS files as a global variable. Each iteration then includes three parts:
Map Function [25]: <key, value> pairs are input by default. The 'key' is the offset of the current sample relative to the start of the input file; the 'value' is a string consisting of the coordinate values of each dimension of the current sample. We first parse the per-dimension coordinate values of the current sample from the value and calculate the distance from the data object to each of the k cluster centers. We obtain the index of the nearest cluster and output <key1, value1>, where key1 is the index of the nearest cluster and value1 is a string consisting of the per-dimension coordinate values of the current sample.
Combine Function: <key, V> pairs are input. The 'key' is a cluster index; 'V' is a linked list of strings, each consisting of the per-dimension coordinate values of a sample whose cluster index is key. We first obtain the coordinate values of each sample from the list, then add the corresponding values dimension by dimension and record the total number of samples in the list. We output <key1, value1>, where key1 is the cluster index and value1 is a string consisting of the sample count and the per-dimension coordinate sums.
Reduce Function: <key, V> pairs are input. We first obtain the intermediate results, then compute the new cluster centers from them and update the HDFS files. The next iteration then begins, and the process repeats until the results converge. The implementation process is shown in Fig. 1.
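As a sketch of how these three functions map onto code, below are hypothetical Hadoop Streaming scripts in Python; the paper does not give its implementation, and the file name centers.txt (standing in for the HDFS global variable) and the comma-separated record format are our assumptions. The mapper emits <cluster index, coordinates> pairs and the reducer averages each cluster's points into a new center; a combiner would follow the same pattern as the reducer.

# mapper.py -- hypothetical Streaming version of the Map function.
# Assumes centers.txt holds one comma-separated center per line and each
# input record is a comma-separated coordinate string.
import sys

def load_centers(path="centers.txt"):
    with open(path) as f:
        return [[float(v) for v in line.split(",")] for line in f if line.strip()]

def nearest(point, centers):
    # squared Euclidean distance to every center; index of the smallest wins
    d = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return d.index(min(d))

centers = load_centers()
for line in sys.stdin:                 # the byte-offset key is implicit
    coords = line.strip()
    if not coords:
        continue
    point = [float(v) for v in coords.split(",")]
    print(f"{nearest(point, centers)}\t{coords}")   # emit <key1, value1>

# reducer.py -- averages each cluster's points into a new center.
# Streaming delivers the mapper output sorted by key, so the points of one
# cluster arrive contiguously and groupby suffices.
import sys
from itertools import groupby

def parse(line):
    key, _, coords = line.rstrip("\n").partition("\t")
    return key, [float(v) for v in coords.split(",")]

pairs = (parse(l) for l in sys.stdin if l.strip())
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    points = [coords for _, coords in group]
    center = [sum(dim) / len(points) for dim in zip(*points)]
    print(key + "\t" + ",".join(str(c) for c in center))

Such scripts would be launched once per iteration (e.g. via the hadoop-streaming jar with -mapper and -reducer options), with the driver rewriting the centers file between jobs; this per-iteration job launch and disk round trip is exactly the overhead analyzed in Section 3.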
2.3. Parallel implementation of k-means based on Spark
The implementation of the k-means algorithm on Spark consists of two parts [26]: partitioning the data points into clusters, and computing the cluster centers over multiple iterations until the results converge. The implementation is mainly realized by the Driver, Mapper, Combiner and Reducer classes [29].
Driver: the underlying driver class that initializes the program and processes the data set through the related functions.
Fig. 1. K-means algorithm implementation based on MapReduce.
Fig. 2. K-means algorithm implementation based on Spark.
Mapper: a class that determines the initial cluster centers and partitions the initial data set. It calculates the distance from each data object in the RDD to the current cluster centers, merges each object into the nearest class, and finally re-elects the new cluster centers. The intermediate results produced by an iteration are transformed into a new data-object RDD [4].
Combiner: a class that combines the intermediate RDD data sets. Because the Map phase produces a large number of intermediate RDD results, combining them reduces traffic and avoids network congestion on the Spark platform.
Reducer: a class that reduces the local results produced by the Combiner into global results. It judges whether the cluster centers have converged according to a threshold on their movement [6]. The implementation process is shown in Fig. 2.
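These roles collapse naturally onto Spark's RDD API. The condensed PySpark sketch below is again not the authors' code; the HDFS path, k and the threshold eps are illustrative. It caches the data RDD so every iteration reuses it in memory, which is precisely the property the analysis in Section 3 builds on, and reduceByKey performs the Combiner's local aggregation within each partition before the global Reduce.

from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="kmeans-sketch")

# Driver: load the data set and cache the RDD so iterations reuse it in memory
data = (sc.textFile("hdfs:///data/points.csv")
          .map(lambda line: np.array([float(v) for v in line.split(",")]))
          .cache())

k, eps = 3, 1e-4
centers = data.takeSample(False, k)      # initial centers: k random points

def closest(p, cs):
    # index of the nearest center by squared Euclidean distance
    return min(range(len(cs)), key=lambda j: float(np.sum((p - cs[j]) ** 2)))

while True:
    # Mapper: assign every point to its nearest center
    assigned = data.map(lambda p: (closest(p, centers), (p, 1)))
    # Combiner/Reducer: sum coordinates and counts per cluster,
    # locally per partition first, then globally
    sums = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    new_centers = sums.mapValues(lambda s: s[0] / s[1]).collectAsMap()
    # convergence test against the threshold eps
    shift = max(float(np.sum((centers[j] - c) ** 2)) for j, c in new_centers.items())
    centers = [new_centers.get(j, centers[j]) for j in range(k)]
    if shift < eps:
        break

print(centers)
sc.stop()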
3. Theoretical analysis of algorithm performance on two architectures
As discussed in Section 2, the k-means implementations on both architectures are based on Map and Reduce. The main reason for the performance difference between the two architectures is that Spark [16] computes on in-memory RDDs [4] and does not need to interact with the disk, whereas Hadoop computes on external storage and must interact with the disk. We now analyze the theoretical performance of the two architectures in terms of execution time, one of the standard measures of platform performance.
Algorithm execution time consists of computing time, communication time and system execution time. The computing-time complexity of the two architectures is similar. Communication time comprises communication volume and the communication mechanism; both Hadoop and Spark are based on the RPC mechanism, so that time difference can be ignored. In terms of communication volume, Hadoop cannot reuse the data set across iterations, while Spark supports a data set caching policy; whether the data set is reused directly affects the number of iterations, and we merge this difference into the execution time. Execution time includes Map, Reduce and I/O operation time. Therefore, the difference in time consumption between the two architectures lies mainly in the system execution time. The specific analysis is as follows:
In the first iteration, both architectures read data from HDFS. Hadoop's start-up and heartbeat overheads are negligible relative to the total time. The second and subsequent iterations follow the same scheme; the main difference is I/O time. To simplify the analysis of the performance differences, we assume that the cluster is homogeneous, that the job is evenly distributed over all nodes, and that no node fails during execution. We need the following auxiliary definitions [10]:
Definition 1. Assume k-means requires (k + 1) iterations on both architectures, and let Hadoop and Spark each perform one full iteration. The I/O time of a complete MapReduce pass is Th on Hadoop and Ts on Spark. The Map time on Hadoop and Spark is t1 and T1, respectively; the Reduce time is t2 and T2, respectively.
Definition 2. We define the main parameters of a MapReduce pass: input data set S, intermediate output data set S1, and final output data set S2. For data of size x, each Map task on Hadoop runs in time f(x) and each Reduce task in g(x); on Spark, each Map task runs in F(x) and each Reduce task in G(x). All four are directly proportional to x, with coefficients α, β, γ, μ, respectively.
Definition 3. Let M and R be the maximum numbers of Map and Reduce tasks available in the MapReduce computing system. During execution, the data is divided into X Map tasks and the system starts Y Reduce tasks. Data is read from HDFS at rate vi and written back to disk at rate vo; data is read from memory at rate Vi and written back to memory at rate Vo. The network transmission rate is vn. On Hadoop, the Map initialization overhead is C1 and the Reduce initialization overhead is C2; the corresponding overheads on Spark are C3 and C4. A cluster has N nodes with p CPU cores each, so R = Np and M = 2Np.
Hadoop Map time: this stage reads data from HDFS, executes the Map computation, and writes the intermediate Map results back to disk. Each Map task's input size is S/X, so the time consumed in this stage is
$$t_1 = \frac{S}{X v_i} + f\!\left(\frac{S}{X}\right) + \frac{S_1}{X v_o} + C_1 \quad (1)$$
Hadoop Reduce time: this stage takes as input the intermediate results output by Map, sorts them, executes the Reduce computation, and outputs the results. Each Reduce task's input size is S1/Y, so the time consumed is
$$t_2 = \frac{S_1}{Y v_i} + g\!\left(\frac{S_1}{Y}\right) + \frac{S_2}{Y v_o} + C_2 \quad (2)$$
So the I/O time of one full MapReduce pass is
$$T_h = \frac{S}{X v_i} + \frac{S_1}{X v_o} + \frac{S_1}{Y v_i} + \frac{S_2}{Y v_o} \quad (3)$$
Similarly, the time consumed in each stage on Spark is
$$T_1 = \frac{S}{X V_i} + F\!\left(\frac{S}{X}\right) + \frac{S_1}{X V_o} + C_3 \quad (4)$$
$$T_2 = \frac{S_1}{Y V_i} + G\!\left(\frac{S_1}{Y}\right) + \frac{S_2}{Y V_o} + C_4 \quad (5)$$
$$T_s = \frac{S}{X V_i} + \frac{S_1}{X V_o} + \frac{S_1}{Y V_i} + \frac{S_2}{Y V_o} \quad (6)$$
For both Hadoop and Spark, we assume each Map task transmits S1/(XY) of data to each Reduce task. The network transmission time of one iterative calculation is then
$$t_n = \frac{S_1}{XY\, v_n} \quad (7)$$
The derivation above ignores task scheduling, but scheduling is inevitable because practical data sets are large. The numbers of scheduling rounds for Map and Reduce are
$$\lambda_m = \left\lceil \frac{X}{M} \right\rceil \quad (8)$$
$$\lambda_r = \left\lceil \frac{Y}{R} \right\rceil \quad (9)$$
When $t_1 \geq M t_n$, Reduce does not need to wait for data transmission. In practical applications, the times to complete one iterative calculation on Hadoop and Spark are t and t', respectively:
$$t = \lambda_m t_1 + M t_n + (\lambda_r - 1) X t_n + \lambda_r t_2 \quad (10)$$
$$t' = \lambda_m T_1 + M t_n + (\lambda_r - 1) X t_n + \lambda_r T_2 \quad (11)$$
The times required for (k + 1) iterations on the two architectures are, respectively,
$$T_{\mathrm{hadoop}} = k t = k \lambda_m t_1 + k M t_n + k (\lambda_r - 1) X t_n + k \lambda_r t_2 \quad (12)$$
$$T_{\mathrm{spark}} = k t' = k \lambda_m T_1 + k M t_n + k (\lambda_r - 1) X t_n + k \lambda_r T_2 \quad (13)$$
The I/O times of (k + 1) iterations are, respectively,
$$T'_h = k T_h; \qquad T'_s = k T_s \quad (14)$$
Therefore, the performance difference between the two architectures can be illustrated by execution time and I/O consumption. To describe the problem more intuitively, we assign concrete values to the parameters. Following common practice we set Y = 1.75 · N · p, which gives λr = 2. For convenience of calculation and to limit workload imbalance, we assume S = S1 = S2, vi = vo = 100 Mb/s, Vi = Vo = 10 Gb/s and vn = 1 Gb/s, which yields Th/Ts = 100.
Keeping vi, vo, Vi, Vo and vn as above, let the remaining parameters be M = 12, R = 6, N = 3, p = 2, C1 = C3 = 0.3 s, C2 = C4 = 0.2 s, α = 0.8 s/MB, β = 0.9 s/MB, γ = 0.1 s/MB and μ = 0.2 s/MB; then Thadoop/Tspark = 18. If instead C1 = 3 s, C2 = 2 s, C3 = 0.2 s, C4 = 0.1 s, α = 1.8 s/MB, β = 2 s/MB, γ = 0.04 s/MB and μ = 0.05 s/MB, then Thadoop/Tspark = 41.
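For a concrete feel of the arithmetic, the short Python sketch below (not the authors' code) evaluates Eqs. (3) and (6) under the stated assumptions. Because every term of Th and Ts is inversely proportional to a transfer rate, the placeholder values chosen for S, X and Y cancel and the ratio reduces to (10 Gb/s)/(100 Mb/s) = 100; the execution-time ratios of 18 and 41, by contrast, also depend on S and X, which the paper does not state.

def io_time(S, S1, S2, X, Y, r_read, r_write):
    # Eqs. (3)/(6): I/O time of one full MapReduce pass
    return (S / (X * r_read) + S1 / (X * r_write)
            + S1 / (Y * r_read) + S2 / (Y * r_write))

S = S1 = S2 = 1.0    # arbitrary; cancels in the ratio
X, Y = 12, 10.5      # placeholder task counts (Y = 1.75 * N * p)
v = 100e6            # Hadoop disk rate: 100 Mb/s
V = 10e9             # Spark memory rate: 10 Gb/s

print(io_time(S, S1, S2, X, Y, v, v) / io_time(S, S1, S2, X, Y, V, V))  # 100.0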
These results show that the per-stage overheads and execution rates strongly influence architecture performance. In practical applications the data volume may reach the terabyte level or more, making the difference in execution time between the two architectures much more pronounced; bandwidth may also become a bottleneck for both. Finally, the results show that Hadoop takes longer than Spark in both I/O consumption and total time. Measuring the performance of the two architectures by execution time, we conclude that Spark is superior to Hadoop.
4. Experiments and results
There are many clustering algorithms. Their implementation steps differ with the underlying algorithmic ideas, so their clustering results differ, and the effect of a clustering algorithm varies across practical applications. For this reason the performance difference between the two architectures cannot be illustrated by clustering quality. Instead, to illustrate the performance differences between the two architectures, this paper analyzes the clustering algorithm implementations on the two architectures from a mathematical point of view, and the experiments use a text data set intended for testing clustering algorithms to compare the two architectures while varying the number of iterations.
4.1. Experimental environment
In the experiment we used one server and three virtual hosts created with VMware Workstation. We used CDH5 as the Hadoop and Spark platform and CentOS 6.5 x64 as the node operating system, with Hadoop 2.5.0-cdh5.3.2 as the Hadoop benchmark, Spark 1.2.0 as the Spark benchmark [22], and JDK 1.7 for Java.
4.2. Experimental data
The experimental data is a text data set of user behaviors on a social networking site [1]. All data is in .csv format and packaged in multiple tar.gz files.
User information format: [user id]\t[user text], for example: 369319 zzzop. User relationship network format: [user id]\t[crawled page count]\t[friend count]\t[friend id list]\t[fans count]\t[fans list], for example: 1.2.3..htm 1 14215 6 hamas jkaneko caol_ila manwomanfilm public_design_center Kaminogoya 4 hamas lawmn shamroy tkwshnsk.
4.3. Results and analysis
The experiment used a standard plain-text data set for testing the k-means algorithm. By varying the number of iterations and comparing the average execution time and I/O time of the two architectures, we illustrate their performance differences. Fig. 3 shows the average execution time of k-means on the two architectures. As the figure shows, the processing time of MapReduce increases with the number of iterations, while the processing time of Spark remains relatively stable. For the same number of iterations, MapReduce takes longer than Spark, with an average execution time about 50 times that of Spark. This conclusion is consistent with the theoretical analysis; the deviation stems from the experimental environment and the different parameter values used in the theoretical analysis.
Figure 4 shows the I/O time of k-means on the two architectures. As the figure shows, the ratio of MapReduce I/O time to Spark I/O time grows with the number of iterations. For the same number of iterations, the I/O time of MapReduce exceeds that of Spark; the iterative processing time is thus dominated by I/O, and the I/O time of MapReduce is about 60 times that of Spark. This conclusion is also consistent with the theoretical analysis; the deviation stems from the experimental environment, the different parameter values in the theoretical analysis, and the relatively small experimental data set.
In summary, the experimental results show that the execution time and I/O time of Spark are significantly lower than those of MapReduce, so Spark outperforms MapReduce in terms of time consumption. Moreover, the experimental results are consistent with the theoretical results of Section 3, which verifies the validity of the theoretical analysis.
Fig. 3. The average execution time comparison under two kinds of architecture.
Fig. 4. The I/O time comparison under two kinds of architecture.
5. Conclusions and future work
In this paper we have introduced the implementation steps of the k-means algorithm and its implementations on MapReduce and Spark. We then focused on the theoretical performance differences of the two architectures for the clustering algorithm from a mathematical point of view. Finally, the experiments showed that as the number of iterations grows, the execution time of MapReduce increases significantly while that of Spark changes little. That is to say, the performance of Spark is superior to that of MapReduce.
In future work, we plan to further analyze the performance differences of the two architectures with respect to scalability. MapReduce computes on external storage [7,12] while Spark computes in memory, so memory consumption during data processing [18,23] can also affect architecture performance [3,27,28]. Memory optimization is thus one of the most important directions for future research [5,8].
Acknowledgements
We sincerely thank the anonymous reviewers for their helpful comments and suggestions. This work is partially supported by the National Natural Science Foundation of China (Grant No. 61402183), Guangdong Natural Science Foundation (Grant No. S2012030006242), Guangdong Provincial Scientific and Technological Projects (Grant Nos. 2016A010101007, 2016B090918021, 2014B010117001, 2014A010103022, 2014A010103008, 2013B010202001 and 2013B010401021), Guangzhou Civic Science and Technology Project (Grant Nos. 201607010048 and 201604010040) and the Fundamental Research Funds for the Central Universities, SCUT (No. 2015ZZ0098).
References
[1] http://www.datatang.com/data/list.
[2] Apache Hadoop, available at: http://hadoop.apache.org/.
[3] Apache Spark documentation, 2014, available at: https://spark.apache.org/documentation.html.
[4] Apache Spark Research, 2014, available at: https://spark.apache.org/research.html.
[5] H. Byun, A reliable data delivery scheme with delay constraints for wireless sensor networks, Journal of High Speed Networks 21(3)
(2015), 195–203. doi:10.3233/JHS-150520.
[6] G. Feng and Y. Ma, A distributed frequent itemset mining algorithm based on Spark, in: Computer Supported Cooperative Work in Design (CSCWD), 2015 IEEE 19th International Conference on, 6–8 May 2015, 2015, pp. 271–275.
[7] L. Feng, Research and implementation of memory optimization in cluster computer engine Spark, Tsinghua University, 2013.
[8] U. Fiore, F. Palmieri, A. Castiglione and A. De Santis, A cluster-based data-centric model for network-aware task scheduling in distributed
systems, International Journal of Parallel Programming 42(5) (2014), 755–775. doi:10.1007/s10766-013-0289-y.
[9] J. Gantz and D. Reinsel, 2011 digital universe study: Extracting value from chaos, available at: http://www.b-eye-network.com/blogs/devlin/archives/2011/071.
[10] Y. Gao, W. Zhou and J. Han, An evaluation model on key technologies of large-scale graph data processing, Journal of Computer Research
and Development 51(1) (2014), 1–16. doi:10.2190/EC.51.1.a.
[11] F. Gu, X. Liu and C. Zuo, Study on carriers’ mobile Internet development strategy in the context of big data, Designing Techniques of
Posts and Telecommunications 8 (2012), 21–24.
[12] C. Guo, B. Liu and W. Lin, Research on performance of big data computing and query processing based on Impala, Application Research
of Computer 32(5) (2015), 1331–1334.
[13] Y. Huang, A study on the analysis of the research hotspots and development trends of big data overseas, Journal of Intelligence 33(6)
(2014), 99–104.
[14] Y. Hui and S. Wu, Sequence-growth: A scalable and effective frequent itemset mining algorithm for big data based on MapReduce
framework, in: Big Data (Big Data Congress) 2015 IEEE International Congress on, June 27 2015–July 2 2015, IEEE, 2015, pp. 393–
400.
[15] G. Li, Scientific value of big data research, Communications of the China Computer Federation 8(9) (2012), 8–15.
[16] W. Li, Research on spark for big data processing, Modern Computer 3 (2015), 55–60.
[17] Y. Liang, Research on parallelization of data mining algorithm based on distribute platforms Spark and YARN, Sun Yat-sen University,
2014.
[18] W. Lin, An improved data placement strategy for hadoop, Journal of South China University of Technology (Natural Science Edition)
40(1) (2012), 153–158.
[19] R. Qiu, The parallel design and application of the CURE algorithm based on Spark platform, South China University of Technology,
2014.
[20] S. Rathee, M. Kaul and A. Kashyap, R-apriori: An efficient apriori based algorithm on Spark, in: PIKM ’15 Proceedings of the 8th
Workshop on Ph.D. Workshop in Information and Knowledge Management, ACM, New York, NY, USA, 2015, pp. 27–34.
[21] G. Satish and A. Rohan, Comparing Apache Spark and Map Reduce with performance analysis using K-means, International Journal of
Computer Applications 113(1) (2015), 8–11.
[22] Scala, available at, http://www.scala-lang.org.
[23] X. Tu, B. Liu and W. Lin, Survey of big data, Application Research of Computers 31(6) (2014), 1613–1623.
[24] H. Wang, B. Wu, S. Yang and B. Yang, Research of decision tree on YARN using MapReduce and Spark, in: World Congress in Computer
Science, Computer Engineering, and Applied Computing, 2014, available at: http://www.world-academy-of-science.org/.
[25] X. Wang, Clustering in the cloud: Clustering algorithms to Hadoop Map/Reduce framework, Department of Computer Science, Texas State University, 2010.
[26] Z. Yang, The research of recommendation system based on Spark platform, University of Science and Technology of China, 2015.
[27] J. Zhang, T. Yang and C. Zhao, Load balancing and data aggregation tree routing algorithm in wireless sensor networks, Journal of High
Speed Networks 21(2) (2015), 121–129. doi:10.3233/JHS-150515.
[28] C. Zhao, C. Xia and C. Jia, Research and analysis on spatial adaptive strategy of End-hopping system, Journal of High Speed Networks
21(2) (2015), 95–106. doi:10.3233/JHS-150514.
[29] W. Zhao, H. Ma and Y. Fu, Research on parallel k-means algorithm design based on Hadoop platform, Computer Science 38(10) (2011),
166–168.
[30] T. Zhou, J. Zhang and C. Luo, Realization of K-means clustering algorithm based on Hadoop, Computer Technology and Development
23(7) (2013), 17–21.