Journal of High Speed Networks 23 (2017) 49–57 49

DOI 10.3233/JHS-170556

IOS Press

Performance analysis of clustering algorithm

under two kinds of big data architecture

Beibei Li a, Bo Liu a, Weiwei Lin b,∗ and Ying Zhang a

a School of Computer, South China Normal University, Guangzhou, China

E-mails: 1538339980@qq.com, liugubin@126.com, 2283240387@qq.com

b School of Computer Science and Engineering, South China University of Technology, Guangzhou, China

E-mail: linww@scut.edu.cn

Abstract. To compare the performance of clustering algorithms on two data processing architectures, this paper first presents implementations of the k-means clustering algorithm on two big data architectures. We then focus on the theoretical performance differences of k-means on the two architectures from a mathematical point of view. The theoretical analysis shows that the Spark architecture is superior to Hadoop in both average execution time and I/O time. Finally, a text data set of users' behaviors on a social networking site is used for algorithm experiments. The results show that, for k-means, the execution time and I/O time of Spark are significantly lower than those of MapReduce. The theoretical analysis and the implementation techniques for big data algorithms presented in this paper provide a good reference for the application of big data technology.

Keywords: Hadoop, MapReduce, Spark, clustering algorithm, big data, k-means

1. Introduction

With the coming of the Internet+ era, massive data is being produced in all aspects of social life. How to dig out the enormous value hidden in it has become a focus of the community and has even risen to the level of national strategy. In March 2012, the Obama administration announced a plan to invest $200 million in a "Big Data Research and Development Initiative", following another major technological development, the "information superhighway" plan [15] announced in 1993. Data from McKinsey's 2012 big data report showed that the big data industry had brought $300 billion in annual revenue to the US health care system and €250 billion in annual revenue to European public administration, offered a potential 60% profit increase to the retail industry, and had cut product development costs in manufacturing by 50%. Gartner, however, predicted that by 2015 more than 85% of Fortune 500 companies would lose their strengths in the big data competition [11]. The market research firm IDC predicted that the big data technology and services market would grow from $3.2 billion in 2010 to $16.9 billion in 2015, an annual growth rate of about 40% [9]. These statistics make it easy to see that big data is widely applied and of great value. In terms of the concept and research status of big data, the core force promoting its development is big data processing technology: whether we can dig out the enormous scientific and economic value hidden in massive data depends on how it is processed. Big data technology has therefore become a research hot spot [13]. Traditional data processing models, limited in memory and processing capability, cannot meet actual demands. With the development of science and technology, parallel processing mechanisms such as MPI, PVM and MapReduce have been widely used in recent years. However, as research on machine learning deepens, a large number of applications require iterative algorithm processing. The result of such an application on a traditional data processing

*Corresponding author. E-mail: linww@scut.edu.cn.

0926-6801/17/$35.00 © 2017 – IOS Press and the authors. All rights reserved


architecture is not satisfactory. Spark, an open-source general-purpose parallel cloud computing platform developed by the UC Berkeley AMP Lab, meets these needs [24]. Spark is the latest parallel distributed computing framework in the big data technology chain, based mainly on in-memory computing. Issues related to memory computing have received support from the National Natural Science Foundation, and related research has started; memory computing is also backed by many companies in the market, such as Alibaba, Baidu and NetEase. Researchers are particularly concerned with the performance of data processing platforms. Most research, both domestic and international, has focused on the differences between MapReduce [14] and Spark, on integrating memory computing and data mining algorithms on Spark [20], and on improving clustering algorithms [19] in combination with the Spark platform. The decision tree study on the two architectures in [24] shows that Spark is more suitable for iterative algorithms, but does not examine the performance differences of the two architectures in depth. A comparison of the two architectures using the k-means algorithm has also been reported [21]. The latest research on the performance differences of the two architectures analyzes only experimental results; studies based on mathematical, theoretical analysis are rare.

In this paper we first give the implementations of the k-means clustering algorithm on MapReduce and Spark. Then we focus on the theoretical performance differences of the two architectures from a mathematical point of view. Finally, we use experiments to verify the validity of the theoretical analysis.

2. Two implementations of k-means algorithm

2.1. Overview of k-means algorithm

K-means is a distance-based, unsupervised clustering algorithm that has been widely used in science, industry, business and elsewhere [17]. Its cluster similarity criterion is the distance between data objects: data in the same cluster are similar, and data in different clusters are dissimilar. The clustering criterion is the sum-of-squared-deviations function, defined as:

G_c = Σ_{j=1}^{c} Σ_{k=1}^{n_j} ‖x_k^{(j)} − m_j‖²

For each data object x_i, the cluster it belongs to is computed as:

c_i = argmin_j ‖x_i − m_j‖²

where m_j is the center of cluster j.

The new center of cluster j is computed as:

m_j^{new} = (Σ_{i=1}^{n} w_{ij} x_i) / (Σ_{i=1}^{n} w_{ij})

where x_i is a data object and w_{ij} indicates whether x_i belongs to cluster j: w_{ij} = 1 if it does, and w_{ij} = 0 otherwise.

The k-means algorithm [30] is as follows:

Input: data set D, the number of clusters k
Output: the k clusters
Select k data objects from D as initial centers
Repeat
    For each data object x_i in D
        compute the distance from x_i to each center
        assign x_i to the nearest cluster
    End For
    Recompute each center as the mean of the data objects in its cluster
Until the cluster centers no longer change [17,30]
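To make the steps above concrete, the following is a minimal serial sketch in Python (our own illustration; the function and variable names are not taken from the cited implementations):

```python
def assign(points, centers):
    """Assignment step: c_i = argmin_j |x_i - m_j|^2 for every data object."""
    def nearest(x):
        return min(range(len(centers)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
    return [nearest(x) for x in points]

def update(points, labels, centers):
    """Update step: each new center m_j is the mean of the objects assigned to j."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [x for x, label in zip(points, labels) if label == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)  # keep an empty cluster's center unchanged
    return new_centers

def kmeans(points, centers):
    """Repeat assign/update until the centers no longer change."""
    while True:
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            return centers, labels
        centers = new_centers
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` converges after one update to the centers (0.0, 0.5) and (10.0, 10.5).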

From this implementation it is clear that the algorithm becomes inefficient when it must run many iterations over massive data, and then it cannot meet the needs of practical applications. A parallel implementation of k-means solves this problem [17]. The following subsections introduce the parallel implementations of k-means on MapReduce and Spark.

2.2. Parallel implementation of k-means based on MapReduce

As discussed in Section 2.1, the key to parallelizing the algorithm is that different samples can be assigned to their nearest clusters independently. The Map and Reduce operations are the same in every iteration of the parallel k-means algorithm [2]. First, we randomly select k samples as centers and store them in HDFS files as a global variable. Each iteration then consists of three parts:

Map Function [25]: the input is the default <key, value> pair. The key is the offset of the current sample relative to the start of the input file; the value is a string consisting of the coordinate values of the current sample in each dimension. We first parse the coordinate values of the current sample from the value, then calculate the distance from the data object to each of the k clustering centers. We obtain the index of the nearest cluster and output <key1, value1>, where key1 is the index of the nearest cluster and value1 is a string consisting of the coordinate values of the current sample.

Combine Function: the input is <key, V>, where key is a cluster index and V is the list of strings holding the coordinate values of the samples whose cluster index is key. We first extract the coordinate values of each sample from the list, then add the values dimension by dimension and record the total number of samples in the list. The output is <key1, value1>, where key1 is the cluster index and value1 is a string consisting of the sample count and the per-dimension coordinate sums.

Reduce Function: the input is the set of <key1, value1> pairs produced by the combiners. We first collect these intermediate results, then compute the new clustering centers through the related operations and update the HDFS files. The next iteration then continues until the results converge. The implementation process is shown in Fig. 1.
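The three stages described above can be sketched in plain Python as a local simulation of the data flow (this is not actual Hadoop code; all names are our own illustrations):

```python
def map_phase(split, centers):
    """Map: for each sample in an input split, emit <nearest cluster index, sample>."""
    pairs = []
    for x in split:
        j = min(range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
        pairs.append((j, x))
    return pairs

def combine_phase(pairs):
    """Combine: per cluster index, pre-aggregate coordinate sums and sample count."""
    partial = {}
    for j, x in pairs:
        s, n = partial.get(j, ((0.0,) * len(x), 0))
        partial[j] = (tuple(a + b for a, b in zip(s, x)), n + 1)
    return partial

def reduce_phase(partials):
    """Reduce: merge the combiners' partial sums and output the new centers."""
    total = {}
    for partial in partials:
        for j, (s, n) in partial.items():
            s0, n0 = total.get(j, ((0.0,) * len(s), 0))
            total[j] = (tuple(a + b for a, b in zip(s0, s)), n0 + n)
    return {j: tuple(c / n for c in s) for j, (s, n) in total.items()}
```

With two input splits `[(0, 0), (10, 10)]` and `[(0, 2), (10, 12)]` and centers `[(0, 0), (10, 10)]`, the pipeline yields the new centers `{0: (0.0, 1.0), 1: (10.0, 11.0)}`.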

2.3. Parallel implementation of k-means based on Spark

The implementation of the k-means algorithm on Spark includes two parts [26]: dividing the data into clusters, and computing the clustering centers through multiple iterations until the results converge. The implementation is mainly realized by the Driver, Mapper, Combiner and Reducer classes [29].

Driver: the underlying driver class of the program; it initializes the job and processes the data set through the related functions.

Fig. 1. K-means algorithm implementation based on MapReduce.


Fig. 2. K-means algorithm implementation based on Spark.

Mapper: the class that determines the initial clustering centers and divides the initial data set. It calculates the distance from each data object in the RDD to the clustering centers, merges the object into the nearest class, and then recomputes the new clustering centers. The intermediate results generated by an iteration are transformed into a new data object RDD [4].

Combiner: the class that combines the intermediate RDD data sets. Because the Map process produces a large number of intermediate RDD results, combining them reduces traffic and avoids network communication congestion on the Spark platform.

Reducer: the class that reduces the local results produced by the Combiner into the global result. It judges the convergence of the clustering centers against a threshold [6]. The implementation process is shown in Fig. 2.
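The essential difference from the MapReduce version is that the data set stays cached in memory and is reused by every iteration. A plain-Python sketch of the driver loop (mimicking a map followed by a reduceByKey-style aggregation; the threshold name and structure are our own assumptions, not the paper's code):

```python
def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))

def spark_style_kmeans(points, centers, threshold=1e-9):
    """Driver loop: 'points' plays the role of a cached RDD that every iteration
    reuses; convergence is judged against a center-shift threshold (the Reducer's
    role in the description above)."""
    while True:
        # map: emit <cluster index, (coordinates, 1)>
        pairs = [(nearest(x, centers), (x, 1)) for x in points]
        # reduceByKey: merge (coordinate sum, count) per cluster index
        agg = {}
        for j, (x, n) in pairs:
            s, c = agg.get(j, ((0.0,) * len(x), 0))
            agg[j] = (tuple(a + b for a, b in zip(s, x)), c + n)
        new_centers = list(centers)  # empty clusters keep their old center
        for j, (s, c) in agg.items():
            new_centers[j] = tuple(v / c for v in s)
        # largest squared shift of any center in this iteration
        shift = max(sum((a - b) ** 2 for a, b in zip(m0, m1))
                    for m0, m1 in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold:
            return centers
```

For example, `spark_style_kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` returns `[(0.0, 1.0), (10.0, 11.0)]`.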

3. Theoretical analysis of algorithm performance on two architectures

As discussed in Section 2, the k-means implementations on both architectures are based on Map and Reduce. The main reason for the performance difference between the two architectures is that Spark [16] computes on in-memory RDDs [4] and does not need to interact with the disk, while Hadoop computes on external storage and must interact with the disk. We now analyze the theoretical performance of the two architectures in terms of execution time, one of the standard metrics of platform performance.

Algorithm execution time consists of computing time, communication time and system execution time. The computing-time complexity of the two architectures is similar. Communication time depends on the communication volume and the communication mechanism; both Hadoop and Spark are based on the RPC mechanism, so this time difference can be ignored. In terms of communication volume, Hadoop cannot reuse the data set across iterations, while Spark supports a data set cache policy; whether the data set is reused directly affects the cost of the iterations, and we merge this difference into the execution time. Execution time includes Map, Reduce and I/O operation time. Therefore, the difference in time consumption between the two architectures is mainly the system execution time. The specific analysis is as follows:

In the first iteration both architectures read data from HDFS. The start-up and heartbeat overheads of Hadoop are negligible relative to the total time. The second and subsequent iterations follow the same processing scheme; the main difference is I/O time. To analyze the performance differences conveniently, we assume that the cluster is homogeneous, that jobs are evenly distributed to all nodes, and that no node fails during execution. We introduce the following auxiliary definitions [10]:


Definition 1. Assume that k-means processing requires (k + 1) iterations on both architectures, and let Hadoop and Spark each perform a full iteration. The I/O time of a complete MapReduce pass is Th on Hadoop and Ts on Spark. The Map time is t1 on Hadoop and T1 on Spark; the Reduce time is t2 on Hadoop and T2 on Spark.

Definition 2. The main parameters of a MapReduce pass are: input data set S, intermediate output data set S1, and final output data set S2. When the data size is x, the running time of each Map on Hadoop is f(x) and of each Reduce on Hadoop is g(x); on Spark they are F(x) and G(x), respectively. All four are directly proportional to x, with proportionality coefficients α, β, γ and μ, respectively.

Definition 3. The maximum numbers of Maps and Reduces available in the MapReduce computing system are M and R, respectively. During execution, the number of Maps is X and the number of Reduces started by the system is Y. The rate of reading data from HDFS is v_i and the rate of writing back to disk is v_o; the rate of reading from memory is V_i and the rate of writing back to memory is V_o; the network transmission rate is v_n. The Map and Reduce initialization overheads on Hadoop are C1 and C2, and on Spark C3 and C4, respectively. The number of nodes in the cluster is N and the number of CPU cores per node is p, so we can conclude that R = Np and M = 2Np.
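As a quick numeric check of these relations (using the concrete values N = 3, p = 2 adopted in the examples later in this section, and the rule of thumb Y = 1.75·N·p):

```python
import math

N, p = 3, 2                  # cluster nodes and CPU cores per node
R = N * p                    # maximum number of concurrent Reduces
M = 2 * N * p                # maximum number of concurrent Maps
Y = 1.75 * N * p             # number of Reduces started, Y = 1.75*N*p
lambda_r = math.ceil(Y / R)  # Reduce scheduling rounds, ceil(Y/R)
print(R, M, lambda_r)        # -> 6 12 2
```

These are exactly the values R = 6, M = 12 and λ_r = 2 used in the numeric examples below.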

Hadoop Map Time: the process includes reading data from HDFS, executing the Map computation, and writing the intermediate Map results back to disk. Each Map's input data size is S/X. The time consumed in this process is:

t1 = S/(X·v_i) + f(S/X) + S1/(X·v_o) + C1    (1)

Hadoop Reduce Time: it takes as input the sorted intermediate results output by Map, executes the Reduce computation, and outputs the results. Each Reduce's input data size is S1/Y. The time consumed in this process is:

t2 = S1/(Y·v_i) + g(S1/Y) + S2/(Y·v_o) + C2    (2)

So we can conclude that the I/O time of finishing a full MapReduce process is:

Th = S/(X·v_i) + S1/(X·v_o) + S1/(Y·v_i) + S2/(Y·v_o)    (3)

Similarly, we can conclude the time consumption of each stage on Spark is:

T1 = S/(X·V_i) + F(S/X) + S1/(X·V_o) + C3    (4)

T2 = S1/(Y·V_i) + G(S1/Y) + S2/(Y·V_o) + C4    (5)

Ts = S/(X·V_i) + S1/(X·V_o) + S1/(Y·V_i) + S2/(Y·V_o)    (6)

For both Hadoop and Spark, we assume that the data size each Map transmits to each Reduce is S1/(XY). So the network transmission time of one iteration is:

t_n = S1/(X·Y·v_n)    (7)


The derivation above ignores task scheduling. In practice scheduling is inevitable, because the data set is too large. The numbers of scheduling rounds of the Map and Reduce phases are:

λ_m = ⌈X/M⌉    (8)

λ_r = ⌈Y/R⌉    (9)

When t1 ≥ M·t_n, there is no need to wait for the Reduce execution. In practical applications, the times to complete one iteration on Hadoop and Spark are t and t′, respectively:

t = λ_m·t1 + M·t_n + (λ_r − 1)·X·t_n + λ_r·t2    (10)

t′ = λ_m·T1 + M·t_n + (λ_r − 1)·X·t_n + λ_r·T2    (11)

The time required for (k + 1) iterations on the two architectures is respectively:

T_hadoop = k·t = k·λ_m·t1 + k·M·t_n + k·(λ_r − 1)·X·t_n + k·λ_r·t2    (12)

T_spark = k·t′ = k·λ_m·T1 + k·M·t_n + k·(λ_r − 1)·X·t_n + k·λ_r·T2    (13)

The I/O time of (k + 1) iterations is respectively:

T′_h = k·Th;  T′_s = k·Ts    (14)

Therefore, the performance differences between the two architectures can be illustrated by execution time and I/O consumption. To describe the problem more intuitively, we assign specific values to the parameters. Following common practice, we set Y = 1.75·N·p, which gives λ_r = 2. To simplify the calculation and reduce workload imbalance, we assume S = S1 = S2, v_i = v_o = 100 Mb/s, V_i = V_o = 10 Gb/s and v_n = 1 Gb/s, so we obtain Th/Ts = 100.
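This ratio can be checked directly from Eqs. (3) and (6): every term of Th and Ts differs only in the rate (disk at 100 Mb/s versus memory at 10 Gb/s = 10 000 Mb/s), so the ratio is 10 000/100 = 100 regardless of S, X and Y. A quick sketch (the concrete size and Map/Reduce counts below are arbitrary illustrations):

```python
def io_time(S, S1, S2, X, Y, read, write):
    """One full pass, Eqs. (3)/(6): S/(X*read) + S1/(X*write) + S1/(Y*read) + S2/(Y*write)."""
    return S / (X * read) + S1 / (X * write) + S1 / (Y * read) + S2 / (Y * write)

S = 1000.0     # Mb, arbitrary; S = S1 = S2
X, Y = 12, 10  # arbitrary Map/Reduce counts; the ratio does not depend on them
Th = io_time(S, S, S, X, Y, read=100.0, write=100.0)      # Hadoop: disk rates
Ts = io_time(S, S, S, X, Y, read=10000.0, write=10000.0)  # Spark: memory rates
print(round(Th / Ts))  # -> 100
```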

Keeping v_i, v_o, V_i, V_o and v_n the same, let the other parameters be M = 12, R = 6, N = 3, p = 2, C1 = C3 = 0.3 s, C2 = C4 = 0.2 s, α = 0.8 s/MB, β = 0.9 s/MB, γ = 0.1 s/MB and μ = 0.2 s/MB; then T_hadoop/T_spark = 18. If C1 = 3 s, C2 = 2 s, C3 = 0.2 s, C4 = 0.1 s, α = 1.8 s/MB, β = 2 s/MB, γ = 0.04 s/MB and μ = 0.05 s/MB, then T_hadoop/T_spark = 41.

These results show that the per-stage overheads and execution rates have a great influence on architecture performance. In practical applications the data volume may reach the terabyte level or more, making the difference in execution time between the two architectures even more obvious; bandwidth may also become a bottleneck for both architectures. In summary, Hadoop takes longer than Spark in both I/O consumption and total time. Measuring the performance of the two architectures by execution time, we conclude that Spark is superior to Hadoop.

4. Experiments and results

Many clustering algorithms exist. Their implementation steps differ because their underlying ideas differ, so their clustering results differ as well, and the effectiveness of clustering algorithms varies across practical applications. For this reason, clustering quality cannot be used to illustrate the performance differences between the two architectures. Instead, the experiments use a text data set intended for testing clustering algorithms, and compare the performance of the two architectures by varying the number of iterations.


4.1. Experimental environment

In the experiment we used one server and three virtual hosts created with VMware Workstation. We used CDH5 as the Hadoop and Spark platform and CentOS 6.5 x64 as the node operating system, with Hadoop 2.5.0-cdh5.3.2, Spark 1.2.0 [22] and JDK 1.7.

4.2. Experimental data

The experimental data is a text data set of users' behaviors on a social networking site [1]. All data is in .csv file format, packaged in multiple tar.gz files.

User information format: [user id]\t[user text], for example: 369319 zzzop. User relationship network format: [user id]\t[crawled page count]\t[friend count]\t[friend id list]\t[fans count]\t[fans list], for example: 1.2.3..htm 1 14215 6 hamas jkaneko caol_ila manwomanfilm public_design_center Kaminogoya 4 hamas lawmn shamroy tkwshnsk.

4.3. Results and analysis

The experiment used a standard plain-text data set for testing the k-means algorithm. By changing the number of iterations and comparing the average execution time and I/O time of the two architectures, we can illustrate their performance differences. Fig. 3 shows the average execution time of the k-means algorithm on the two architectures. As can be seen from Fig. 3, the processing time of MapReduce grows with the number of iterations, while the processing time of Spark remains relatively stable. For the same number of iterations, the processing time of MapReduce is longer than that of Spark; the average execution time of MapReduce is about 50 times that of Spark. This conclusion is consistent with the theoretical analysis; the experimental environment and the particular parameter values used in the theoretical analysis account for the deviation.

Figure 4 shows the I/O time of the k-means algorithm on the two architectures. As can be seen from Fig. 4, the I/O time ratio of MapReduce to Spark increases with the number of iterations. For the same number of iterations, the I/O time of MapReduce is longer than that of Spark; iterative processing time is thus dominated by I/O time, and the I/O time of MapReduce is about 60 times that of Spark. This conclusion is consistent with the theoretical analysis; the experimental environment, the particular parameter values used in the theoretical analysis, and the relatively small experimental data set account for the deviation.

In a word, the experimental results show that the execution time and I/O time of Spark are significantly lower than those of MapReduce, so Spark outperforms MapReduce in terms of time consumption. Moreover, the experimental results are consistent with the theoretical analysis of Section 3, which verifies the validity of the theoretical analysis.

Fig. 3. The average execution time comparison under two kinds of architecture.


Fig. 4. The I/O time comparison under two kinds of architecture.

5. Conclusions and future work

In this paper we have introduced the implementation steps of the k-means algorithm and its implementations on MapReduce and Spark. We then focused on the theoretical performance differences of the two architectures for a clustering algorithm from a mathematical point of view. Finally, we showed that as the number of iterations increases, the execution time of MapReduce grows significantly while that of Spark changes little. That is to say, the performance of Spark is superior to that of MapReduce.

In future work, we plan to further analyze the scalability differences of the two architectures. Since MapReduce computes on external storage [7,12] and Spark computes in memory, memory consumption during data processing [18,23] also affects architecture performance [3,27,28]; memory optimization is thus one of the most important directions for future research [5,8].

Acknowledgements

We express our sincere appreciation to the anonymous reviewers for their helpful comments and suggestions. This work is partially supported by the National Natural Science Foundation of China (Grant No. 61402183), Guangdong Natural Science Foundation (Grant No. S2012030006242), Guangdong Provincial Scientific and Technological Projects (Grant Nos. 2016A010101007, 2016B090918021, 2014B010117001, 2014A010103022, 2014A010103008, 2013B010202001 and 2013B010401021), Guangzhou Civic Science and Technology Project (Grant Nos. 201607010048 and 201604010040) and the Fundamental Research Funds for the Central Universities, SCUT (No. 2015ZZ0098).

References

[1] http://www.datatang.com/data/list.

[2] Apache Hadoop, available at: http://hadoop.apache.org/.

[3] Apache Spark documentation, 2014, available at: https://spark.apache.org/documentation.html.

[4] Apache Spark Research, 2014, available at: https://spark.apache.org/research.html.

[5] H. Byun, A reliable data delivery scheme with delay constraints for wireless sensor networks, Journal of High Speed Networks 21(3)

(2015), 195–203. doi:10.3233/JHS-150520.

[6] G. Feng and Y. Ma, A distributed frequent itemset mining algorithm based on Spark, in: Computer Supported Cooperative Work in Design (CSCWD), 2015 IEEE 19th International Conference on, 6–8 May 2015, 2015, pp. 271–275.

[7] L. Feng, Research and implementation of memory optimization in cluster computer engine Spark, Tsinghua University, 2013.


[8] U. Fiore, F. Palmieri, A. Castiglione and A. De Santis, A cluster-based data-centric model for network-aware task scheduling in distributed

systems, International Journal of Parallel Programming 42(5) (2014), 755–775. doi:10.1007/s10766-013-0289-y.

[9] J. Gantz and D. Reinsel, 2011 digital universe study: Extracting value from chaos, available at: http://www.b-eye-network.com/blogs/

devlin/archives/2011/071.

[10] Y. Gao, W. Zhou and J. Han, An evaluation model on key technologies of large-scale graph data processing, Journal of Computer Research

and Development 51(1) (2014), 1–16. doi:10.2190/EC.51.1.a.

[11] F. Gu, X. Liu and C. Zuo, Study on carriers’ mobile Internet development strategy in the context of big data, Designing Techniques of

Posts and Telecommunications 8 (2012), 21–24.

[12] C. Guo, B. Liu and W. Lin, Research on performance of big data computing and query processing based on Impala, Application Research

of Computer 32(5) (2015), 1331–1334.

[13] Y. Huang, A study on the analysis of the research hotspots and development trends of big data overseas, Journal of Intelligence 33(6)

(2014), 99–104.

[14] Y. Hui and S. Wu, Sequence-growth: A scalable and effective frequent itemset mining algorithm for big data based on MapReduce framework, in: Big Data (BigData Congress), 2015 IEEE International Congress on, June 27–July 2, 2015, IEEE, 2015, pp. 393–400.

[15] G. Li, Scientific value of big data research, Communications of the China Computer Federation 8(9) (2012), 8–15.

[16] W. Li, Research on spark for big data processing, Modern Computer 3 (2015), 55–60.

[17] Y. Liang, Research on parallelization of data mining algorithm based on distribute platforms Spark and YARN, Sun Yat-sen University,

2014.

[18] W. Lin, An improved data placement strategy for hadoop, Journal of South China University of Technology (Natural Science Edition)

40(1) (2012), 153–158.

[19] R. Qiu, The parallel design and application of the CURE algorithm based on Spark platform, South China University of Technology,

2014.

[20] S. Rathee, M. Kaul and A. Kashyap, R-apriori: An efficient apriori based algorithm on Spark, in: PIKM ’15 Proceedings of the 8th

Workshop on Ph.D. Workshop in Information and Knowledge Management, ACM, New York, NY, USA, 2015, pp. 27–34.

[21] G. Satish and A. Rohan, Comparing Apache Spark and Map Reduce with performance analysis using K-means, International Journal of

Computer Applications 113(1) (2015), 8–11.

[22] Scala, available at, http://www.scala-lang.org.

[23] X. Tu, B. Liu and W. Lin, Survey of big data, Application Research of Computers 31(6) (2014), 1613–1623.

[24] H. Wang, B. Wu, S. Yang and B. Yang, Research of decision tree on YARN using MapReduce and Spark, in: World Congress in Computer

Science, Computer Engineering, and Applied Computing, 2014, available at: http://www.world-academy-of-science.org/.

[25] X. Wang, Clustering in the cloud: Clustering algorithms to Hadoop Map/Reduce framework, Department of Computer Science, Texas State University, 2010.

[26] Z. Yang, The research of recommendation system based on Spark platform, University of Science and Technology of China, 2015.

[27] J. Zhang, T. Yang and C. Zhao, Load balancing and data aggregation tree routing algorithm in wireless sensor networks, Journal of High

Speed Networks 21(2) (2015), 121–129. doi:10.3233/JHS-150515.

[28] C. Zhao, C. Xia and C. Jia, Research and analysis on spatial adaptive strategy of End-hopping system, Journal of High Speed Networks

21(2) (2015), 95–106. doi:10.3233/JHS-150514.

[29] W. Zhao, H. Ma and Y. Fu, Research on parallel k-means algorithm design based on Hadoop platform, Computer Science 38(10) (2011),

166–168.

[30] T. Zhou, J. Zhang and C. Luo, Realization of K-means clustering algorithm based on Hadoop, Computer Technology and Development

23(7) (2013), 17–21.

Journal of High Speed Networks 23 (2017) 59–66 59

DOI 10.3233/JHS-170557

IOS Press

A simulation system based on ONE and

SUMO simulators: Performance evaluation

of different vehicular DTN routing protocols

Miralda Cuka a, Ilir Shinko a, Evjola Spaho a, Tetsuya Oda b, Makoto Ikeda b and Leonard Barolli b,∗

a Faculty of Information Technologies, Polytechnic University of Tirana, Bul. “Dëshmorët e Kombit”,

“Mother Theresa” Square, Nr. 4, Tirana, Albania

b Department of Information and Communication Engineering, Fukuoka Institute of Technology (FIT), Japan

Abstract. Advances in next generation networks and IoT provide a promising opportunity to resolve the challenges caused by increasing transportation issues. In this paper, we investigate the performance of different routing protocols in a Vehicular Delay Tolerant Network (VDTN) crossroad scenario. The mobility patterns of vehicles are generated by means of SUMO (Simulation of Urban MObility), and ONE (Opportunistic Network Environment) is used as the communication protocol simulator. We use Packet Delivery Ratio (PDR), Relay Delivery Ratio (RDR), hop count and delay as evaluation metrics. We compared the performance of six protocols; the simulation results show that for PDR the Epidemic protocol performs better than the other protocols, for RDR and hop count the Direct Delivery protocol performs best, and for delay the Epidemic protocol again performs best.

Keywords: Vehicular Networks, VDTN, ONE, SUMO, Opportunistic Network, IoT

1. Introduction

Modern vehicles are increasingly equipped with a large number of sensors, actuators, and communication devices (mobile devices, GPS devices, and embedded computers) [14]. In particular, many vehicles possess powerful sensing, networking, communication, and data processing capabilities, and can communicate with other vehicles or exchange information with the external environment over various protocols, including HTTP, TCP/IP, SMTP, WAP, and the Next Generation Telematics Protocol (NGTP) [17]. As a result, many innovative telematics services [33], such as remote engine disabling for security and remote diagnosis, have been developed to enhance drivers' safety, convenience, and enjoyment.

Advances in cloud computing, the Internet of Things (IoT) [1] and Opportunistic Networks [15] have provided a promising opportunity to further address increasing transportation issues, such as heavy traffic, congestion, and vehicle safety. In the past few years, researchers have proposed several models that use cloud computing to implement intelligent transportation systems (ITSs). For example, a new vehicular cloud architecture called ITS-Cloud was proposed to improve vehicle-to-vehicle communication and road safety [4].

Vehicular Delay Tolerant Networks (VDTNs) are a special type of DTN. They can be utilized to guarantee road safety, avoid potential accidents and enable new forms of inter-vehicle communication, so they will be an important part of future Intelligent Transport Systems (ITS).

Due to the high cost of deploying and implementing VDTN systems in a real environment, most research is concentrated on simulations.

*Corresponding author: Leonard Barolli, Department of Information and Communication Engineering, Fukuoka Institute of Technology

(FIT), 3-30-1 Wajiro-Higashi, Higashi-Ku, Fukuoka 811-0295, Japan. E-mail: barolli@fit.ac.jp.

