Unit Information: SIT772 Database and Information Retrieval
Trimester: 2021 T3
Assessment 2: Information Retrieval Problem Solving Task

This document supplies the detailed information on assessment tasks for this unit.

Key information
• Due: Wednesday, 06 Oct 2021, 23:59 (AEST)
• Weighting: 30%
• Submit: Through CloudDeakin

Learning Outcomes
This assessment assesses the following Unit Learning Outcome (ULO) and related Graduate Learning Outcome (GLO):

Unit Learning Outcome (ULO): ULO 5: Demonstrate data retrieval skills in the context of a data processing system.
Graduate Learning Outcome (GLO): GLO 1: Discipline-specific knowledge and capabilities

Purpose
This task evaluates the student’s technical skills in the management of unstructured data, with potential usage in real applications. It supports students’ understanding of the techniques related to unstructured data management and data processing.

Instructions and Submission Guide
This is an individual assessment task. Students are required to submit ONE written report.
• Read these instructions and the following questions.
• Name the report studentID_givenname_A2.pdf, e.g., 123456_Kevin_A2.pdf.
• Submit the report via the CloudDeakin assessment portal. Submitting to the wrong venue or submitting the wrong file may incur a penalty.

Question 1: (6: 4+2)
Try to find a query of the form [Query-term-1, Query-term-2] (without quotes) that, when run on Google, produces at least one result that contains only one of the two terms. That is, try to find an example where Google does not interpret a two-term query as a conjunction. (If you have difficulty finding an appropriate query, try one that produces very few hits, say, fewer than 20.)
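To make the term "conjunction" concrete, the following is a minimal sketch of strict Boolean AND versus Boolean OR over a tiny inverted index. The three documents, the query and all function names are hypothetical and chosen only for illustration; the sketch says nothing about how Google actually processes queries.

```python
# Toy collection: document id -> text (hypothetical data, for illustration only).
docs = {
    1: "cheap houses for sale in geelong",
    2: "houses in melbourne",
    3: "geelong football results",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, terms):
    """Strict conjunction: documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Disjunction: documents that contain at least one query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.union(*postings) if postings else set()


def terms_present(docs, doc_id, terms):
    """Count how many of the query terms occur in the given document."""
    words = set(docs[doc_id].split())
    return sum(1 for t in terms if t in words)


index = build_index(docs)
query = ["houses", "geelong"]

print(sorted(boolean_and(index, query)))  # [1]
print(sorted(boolean_or(index, query)))   # [1, 2, 3]
print(terms_present(docs, 2, query))      # 1
```

Under a strict conjunction only document 1 would be returned, so a result such as document 2 or 3, which contains just one of the two terms, could only appear if the query were not treated as a pure AND.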
(i) Take a screenshot of the first page of Google results (or more if you want to) and mark each result with 2 (both terms occur on the page), 1 (one term occurs on the page) or 0 (neither term occurs on the page).
(ii) Based on this evidence, does Google interpret all queries as a Boolean conjunction? Explain.

Question 2: (16: 8+4+4)
Recall and Precision are two important evaluation metrics that we use to analyze a set of unranked results. Both metrics compare the set of documents retrieved for a given query with the set of documents that are relevant to the user’s need.

A) Compute Recall, Precision and [email protected] for the following retrieval results against queries Q1, Q2 and Q3.

Q1
Relevant documents: 1, 14, 17, 23, 24, 33, 54, 55, 59, 74, 101, 103
Retrieved documents: 2, 5, 7, 23, 33, 50, 55, 59, 77, 98, 99, 101, 103, 110, 120

Q2
Relevant documents: 14, 19, 25, 27, 30, 39, 42, 63, 769, 790, 1563
Retrieved documents: 14, 21, 25, 26, 27, 38, 42, 63, 569, 769, 790, 1565, 1589

Q3
Relevant documents: 8, 11, 32, 54, 67, 69, 78, 79, 91, 99, 111, 122
Retrieved documents: 11, 13, 17, 19, 21, 32, 77, 79, 99, 102, 111, 122

Q4
Relevant documents: 4, 26, 38, 63, 569, 769, 790, 1565, 1589
Retrieved documents: 14, 21, 25, 26, 27, 38, 63, 88, 769, 790

B) Recall and Precision are often discussed together because they capture complementary information. If precision is important, we do not want to see any non-relevant documents; that is, whatever is retrieved should be relevant. If recall is important, we want to see all the relevant documents, even if that requires sifting through some non-relevant ones. Provide and justify two information-seeking tasks where precision may be considerably more important than recall. Similarly, provide and justify two information-seeking tasks where recall may be more important than precision. [Don’t forget to justify your choices: the justification will be graded, not the particular choices.]

C) The trade-off between Recall and Precision may be user-specific, i.e., some users may be more interested in precision than in recall, and vice versa.
How might a search engine try to guess, without asking, whether a user cares more about precision than recall, or vice versa? Think of the different ways users interact with a search engine, and be creative!

Question 3: (6: 3×3)
(a) Suppose we have three collections C1, C2 and C3 that contain 500, 15,000 and 300,000 documents respectively. All documents in C1 are then added to C2 and to C3. Which collection is likely to have more new terms added to its vocabulary (C1, C2 or C3), and why? [Heaps’ Law]
(b) Calculate the tf-idf for the documents below.
a. D1: Sweets Potatoes are Sweet
b. D2: Sweet Oranges are sour and Sweet
c. D3: I have sweet Apple, Sweet Orange, Sweet Potatoes

Question 4: (10: 5×2)
Doc-id  house  for  sale  in  Geelong  Melbourne
1391132222242191931516213192013219412201411313

➔ (houses OR for OR sale OR in OR Geelong OR Melbourne)
➔ (houses AND for AND sale AND in AND Geelong OR Melbourne)

Suppose these queries are issued to a search engine that uses the ranked Boolean retrieval model. Assume, for simplicity, that there are only four documents in the collection (with document ids 1-4). The table above gives the number of times each query term occurs in each document. Answer the following questions.

(i) Compute the document scores and the ranking associated with the query (houses OR for OR sale OR in OR Geelong OR Melbourne).
(ii) How is the ranking produced probably sub-optimal, and why does this happen?
(iii) Compute the document scores and the ranking associated with the query (houses AND for AND sale AND in AND Geelong OR Melbourne).
(iv) How is the ranking produced probably sub-optimal, and why does this happen?
(v) How would you extend the Boolean retrieval model to handle AND NOT constraints (e.g., houses AND NOT Geelong)? Your proposed solution should give a higher score to documents that contain fewer occurrences of the term to the right of the AND NOT (e.g., Geelong). Please be as mathematical as possible.
In other words, saying “I would reduce the score for documents that contain the word to the right of AND NOT” is too vague.
(vi) Using the index, what Boolean retrieval model scores would your proposed scoring method give to documents 1-4 for the query “houses AND NOT Geelong”?

Question 5: (12: 4×3)
Doc1: A book is considered a good book that makes the reader feels better.
Doc2: I love reading good books to feel better.
Doc3: One can feel better after reading Tom’s recent book.
Query-1: I love books that are good
Query-2: reading good books make you feel better
Stop Word Dictionary = [is, can, after, a, to, I, the, about, that]

i. Explain the similarity scores of both Query-1 and Query-2 using TF-IDF.
ii. How would the result change if TF-IDF is used instead of TF for the query?
iii. Which do you prefer for the query, TF or TF-IDF? (Support your claim using F-score.)

Assessment feedback
General feedback to the class will be provided via the CloudDeakin Discussion Forum. The formal assessment feedback will be released together with the marks in CloudDeakin.

Extension requests
Requests for extensions should be made to the Unit/Campus Chairs 3 days before the assessment due date.

Special consideration
You may be eligible for special consideration if circumstances beyond your control prevent you from undertaking or completing an assessment task at the scheduled time.
See the following link for advice on the application process:
http://www.deakin.edu.au/students/studying/assessment-and-results/special-consideration

Assessment feedback
Detailed written feedback and results will be provided within two weeks of submission.

Referencing
You must correctly use Harvard referencing in this assessment. See the Deakin referencing guide.

Academic integrity, plagiarism and collusion
Plagiarism and collusion constitute extremely serious breaches of academic integrity.
They are forms of cheating, and severe penalties are associated with them, including cancellation of marks for a specific assignment or for a specific unit, or even exclusion from the course. If you are ever in doubt about how to properly use and cite a source of information, refer to the referencing site above.

Plagiarism occurs when a student passes off as the student’s own work, or copies without acknowledgement as to its authorship, the work of any other person or resubmits their own work from a previous assessment task.

Collusion occurs when a student obtains the agreement of another person for a fraudulent purpose, with the intent of obtaining an advantage in submitting an assignment or other work.

Work submitted may be reproduced and/or communicated by the university for the purpose of assuring academic integrity of submissions: https://www.deakin.edu.au/students/study-support/referencing/academic-integrity