
Data Mining Project 1: Text Mining

The primary purpose of this assignment is to get familiar with text preprocessing, feature extraction, feature selection, classification, and clustering. You can work individually or in a two-person team.

Setup:
• You will use Python for this project and need the NLTK toolkit and the scikit-learn library. Python 3 is recommended. To install them, install pip3 first, then run the following command:
    pip3 install -U scikit-learn nltk
• Download the mini 20 newsgroups dataset. The documents are distributed into 20 directories, corresponding to the 20 newsgroups. Each file in a directory corresponds to one document.
• For plotting curves, you may consider matplotlib, as shown in this example. More examples can be found on the sklearn or other websites.
    pip3 install -U matplotlib

Part 1: Text preprocessing and feature extraction

In this task, you will develop the program feature-extract.py, which has the following usage.
    python feature-extract.py directory_of_newsgroups_data feature_definition_file class_definitio
where
• input: directory_of_newsgroups_data is the directory of the unzipped newsgroups data,
• output: feature_definition_file contains (term, feature_id) pairs,
• output: class_definition_file contains (class_name, class_id) pairs,
• output: training_data_file has a specific format, which will be described later.

feature-extract.py contains the following two components, 1.1 and 1.2.

1.1 Document preprocessing. For each document, the preprocessing steps include
• split a document into a list of tokens and lowercase the tokens.
• remove the stopwords, using this list of stopwords.
• stemming. Use a stemmer in NLTK.

When you parse the documents, you may look only at the subject and body. The subject line is indicated by the keyword “Subject:”, while the number of lines in the body is indicated by “Lines: xx”; the body is the last xx lines of the document. Some files may be missing the “Lines: xx” line or have other exceptions. Please manually add the “Lines: xx” line for these few files and handle the exceptions appropriately. If you believe other fields of the documents might also be useful, you are free to include them.
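The parsing and preprocessing steps of 1.1 can be sketched roughly as follows. This is only an illustration under several assumptions, not the required implementation: the function names (extract_subject_and_body, preprocess, porter_stem), the regular-expression tokenizer, and the tiny STOPWORDS set are hypothetical stand-ins (the assignment's own stopword list should be used instead), and the fallback for a missing “Lines: xx” header (take everything after the first blank line) is just one possible way to handle that exception.

```python
import re

# Hypothetical stand-in for the assignment's stopword list (not reproduced here).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Assumption: a token is a run of ASCII letters; other tokenizers are equally valid.
TOKEN_RE = re.compile(r"[a-z]+")


def extract_subject_and_body(raw):
    """Return (subject, body) based on the 'Subject:' and 'Lines: xx' headers."""
    lines = raw.splitlines()
    subject = ""
    n_body = None
    header_end = len(lines)
    # Only scan the header block, which is assumed to end at the first blank line.
    for i, line in enumerate(lines):
        if not line.strip():
            header_end = i
            break
        if line.startswith("Subject:"):
            subject = line[len("Subject:"):].strip()
        elif line.startswith("Lines:"):
            value = line[len("Lines:"):].strip()
            if value.isdigit():
                n_body = int(value)
    if n_body is None:
        # Assumed fallback when 'Lines: xx' is missing or malformed.
        body_lines = lines[header_end + 1:]
    elif n_body == 0:
        body_lines = []
    else:
        # The body is the last xx lines of the document.
        body_lines = lines[-n_body:]
    return subject, "\n".join(body_lines)


def preprocess(text, stopwords=STOPWORDS, stem=None):
    """Tokenize and lowercase, drop stopwords, then optionally stem each token."""
    tokens = TOKEN_RE.findall(text.lower())
    tokens = [t for t in tokens if t not in stopwords]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens


def porter_stem():
    """Return the stem function of NLTK's PorterStemmer (requires nltk)."""
    from nltk.stem import PorterStemmer
    return PorterStemmer().stem
```

A document could then be processed with subject, body = extract_subject_and_body(raw_text) followed by preprocess(subject + " " + body, stem=porter_stem()). Passing the stemmer in as an argument keeps the tokenizing and stopword logic usable on its own and makes it easy to try a different NLTK stemmer.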
