UEL-CN-7031

Summative assessment: Final Project 100%

Submission instructions
• Cover sheet to be attached to the front of the assignment when submitted
• Question paper to be attached to assignment when submitted
• All pages to be numbered sequentially

Module code: UEL-CN-7031
Module title: Big Data Analytics
Assignment title: Big Data Analytics: Coursework
Assignment number: 1
Weighting: 100%
Submission date: Week 12
Additional information:

UEL-CN-7031 – Big Data Analytics

This coursework (CRWK) must be attempted as individual work. It is divided into two sections: (1) Big Data analytics on a real case study and (2) presentation. The overall mark for the CRWK comes from two main activities, as follows:
1- Big Data Analytics report (around 5,000 words, with a tolerance of ± 10%) (60%)
2- Presentation (40%)

Marking Scheme

Topic                            Total mark   Remarks (breakdown of marks for each sub-task)
Big Data Analytics using HIVE    30           (10) Providing big data queries using HIVE.
                                              (10) Using built-in (Date, Math, Conditional, and String) functions in HIVE.
                                              (10) Visualizing the results of queries in graphical representations and being able to interpret them.
Big Data Analytics using Spark   50           (15) Analyzing the dataset through statistical analysis methods.
                                              (35) Designing single- and multi-class classifiers, and evaluating and visualizing their accuracy/performance.
Individual assessment            10           (10) (1) Find alternative solutions for high-level languages and analytics approaches (use references), and express findings from big data analytics with the relevant theories.
Documentation                    10           (10) Write a scientific report.
Total:                           100

Good Luck!

Big Data Analytics using Hadoop and Spark

Tasks:

(1) Understanding Dataset: UNSW-NB15
The raw network packets of the UNSW-NB15¹ dataset were created by the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The Tcpdump tool was used to capture 100 GB of the raw traffic (e.g., Pcap files). This dataset has nine types of attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools were used, and twelve algorithms were developed, to generate a total of 49 features with the class label.
a) The features are described here.
b) The number of attacks and their sub-categories is described here.
c) In this coursework, we use the full set of 10 million records stored in the CSV file (download). The total size is about 600 MB, which is big enough to employ big data methodologies for analytics. As big data specialists, we would first like to read and understand its features, and then apply modelling techniques. If you want to see a few records of this dataset, you can import it into Hadoop HDFS and then write a Hive query that prints the first 5-10 records for your understanding.

(2) Big Data Query & Analysis by Apache Hive [30 marks]
This task uses Apache Hive to convert big raw data into useful information for end users. To do so, first understand the dataset carefully. Then write at least 4 Hive queries (refer to the marking scheme). Apply appropriate visualization tools to present your findings numerically and graphically. Briefly interpret your findings.
Finally, include screenshots of your outcomes (e.g., tables and plots) together with the scripts/queries in the report.
Tip: The mark for this section depends on the complexity of your HIVE queries; for instance, a simple select query will not earn full marks.

¹ Source: https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/

(3) Advanced Analytics using PySpark [50 marks]
In this section, you will conduct advanced analytics using PySpark.

3.1. Analyze and Interpret Big Data (15 marks)
We need to learn and understand the data through at least 4 analytical methods (descriptive statistics, correlation, hypothesis testing, density estimation, etc.). You need to present your work numerically and graphically. Apply tooltip text, legends, titles, X-Y labels, etc. as appropriate to help end users gain insights.

3.2. Design and Build a Classifier (35 marks)
a) Design and build a binary classifier over the dataset. Explain your algorithm and its configuration. Present your findings in both numerical and graphical representations. Evaluate the performance of the model and verify its accuracy and effectiveness. [15 marks]
b) Apply a multi-class classifier to classify data into ten classes (categories): one normal and nine attacks (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms). Briefly explain your model with supporting statements on its parameters, accuracy and effectiveness. [20 marks]
Tip: you can use this link (https://spark.apache.org/docs/2.2.0/ml-classification-regression.html) for more information on modelling.

(4) Individual Assessment [10 marks]
Discuss (1) what alternative technologies are available for tasks 2 and 3 and how they differ (use academic references), and (2) what surprisingly new thinking was evoked and/or neglected at your end.
Tip: add the individual assessment of each member in the same report.

(5) Documentation [10 marks]
Document all your work. Your final report must follow the 5 sections detailed in the “format of final submission” section below. Your work must demonstrate appropriate understanding of academic writing and integrity.

FORMAT OF FINAL SUBMISSION
You need to prepare one single file in PDF format as your coursework, with the following sections:
1. Use ONLY one Cover Page
2. Table of Contents
3. Report of the tasks (it needs sub-sections for a few tasks, accordingly)
4. References (if any)

SUBMISSION
Submit a single PDF to Turnitin in Moodle by the end of Week 12.

PLAGIARISM
The University defines an assessment offence as any action(s) or behaviour likely to confer an unfair advantage in assessment, whether by advantaging the alleged offender or disadvantaging (deliberately or unconsciously) another or others. A number of examples are set out in the Regulations and these include:
“D.5.7.1 (e) the submission of material (written, visual or oral), originally produced by another person or persons, without due acknowledgement, so that the work could be assumed the student’s own. For the purposes of these Regulations, this includes incorporation of significant extracts or elements taken from the work of (an) other(s), without acknowledgement or reference, and the submission of work produced in collaboration for an assignment based on the assessment of individual work. (Such offences are typically described as plagiarism and collusion.)”
The University’s Assessment Offences Regulations can be found on our web site. Information about plagiarism can also be found in the programme’s handbook.

FEEDBACK TO STUDENTS
Feedback is central to learning and is provided to students to develop their knowledge, understanding and skills, and to help promote learning and facilitate improvement.
• Feedback will be provided as soon as possible after the student has completed the assessment task.
• Feedback will be in relation to the learning outcomes and assessment criteria.
As the feedback (including marks) is provided before Award & Field Board, marks are:
• provisional
• available for External Examiner scrutiny
• subject to change and approval by the Assessment Board

Assessment Criteria:

Criteria                                            Given Mark
Demonstrate/interpret the HIVE analysis/queries     10
Understand Hadoop and Spark engines                 5
Demonstrate/interpret the PySpark analysis/coding   15
Ability to answer questions                         10
Overall mark                                        40
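
Task 3.2 a) asks for the binary classifier's accuracy and effectiveness to be evaluated. As a minimal illustration of what the usual figures mean, the sketch below computes accuracy, precision, recall and F1 in plain Python from two short lists of labels. It is not PySpark code and not a prescribed method: the function name `binary_metrics`, the sample labels, and the convention that 1 means attack and 0 means normal are all assumptions made for this example.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels.

    Hypothetical helper for illustration; `positive` is the label treated
    as the class of interest (assumed here to be 1 = attack).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Tally the four cells of the confusion matrix.
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1

    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn

    # Guard every ratio against a zero denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1,
    }


if __name__ == "__main__":
    # Made-up labels, not taken from UNSW-NB15.
    actual = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    for name, value in binary_metrics(actual, predicted).items():
        print(f"{name}: {value}")
```

Reporting precision and recall alongside accuracy matters when the classes are imbalanced, because a model that predicts only the majority class can still score a high accuracy. On the full dataset the same quantities would normally be obtained from the evaluation utilities of the chosen Spark library rather than computed by hand.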