Topic - Data Mining Report
Course:- Other Subject
Length: word count:5000
Reference No.:- EM132384028

1. Problem Description -

In this assignment, you will perform predictive analytics. You are given a CSV data file (data2019.student.csv) which contains a total of 1100 samples. The first 1000 samples have already been categorised into two classes. You are asked to predict the class labels of the last 100 samples associated with IDs from 1001 to 1100. You are given the following information:

The attribute Class indicates the class label. For each of the first 1000 samples, the class label is either 0 or 1. For each of the last 100 samples, the class label is missing. You are asked to predict these missing class labels.

There are exactly 50 samples from each class in the last 100 samples to be predicted.

Attributes are either categorical or numeric. Note that some attributes may appear numeric. You will need to decide whether to treat them as numeric or categorical and justify your action.

The data is known to contain imperfections:

  • There are missing/corrupted entries in the data set.
  • There are duplicates, both instances and attributes.
  • There are irrelevant attributes that do not contain any information useful for the classification task.
  • The labelled data is imbalanced: there is a considerable difference between the number of samples from each class.

Note that the attribute names and their values have been obfuscated. Any pre-processing and analytical steps to the data need to be based entirely on the values of the attributes.

Attempt the following:

  • Data Preparation: In this phase, you will need to study the data and address the issues present in the data. At the end of this phase, you will need to obtain a processed version of the original data ready for classification, and suitably divide the data into two subsets: a training set and a test set.
  • Data Classification: In this phase, you will perform analytical processing of the training data, build suitable predictive models, test and validate the models, select the models that you believe are the most suitable for the given data, and then predict the missing labels.
  • Report: You will need to write a complete report documenting the steps taken, from data preparation to classification. In addition, you should comment on or explain your choice/decision at every step. For example, if an attribute has missing entries, you have to describe the strategy you took to address them, and why you employed that particular strategy based on your observation of the data. Importantly, the report must also include your prediction of the missing labels.

You may choose either of the following approaches to complete the assignment:

  • Programming Approach: If you choose the programming approach, it is expected that you will use the data mining software and the programming environment provided in this unit to complete the assignment. Your Python/R programs will be tested using the virtual machines provided. If you plan to use any extra tools/packages, you must obtain written approval from the Unit Coordinator. This is to ensure fairness among students.
  • Non-Programming Approach: If you choose the non-programming approach, i.e. using only the Weka GUI, it is expected that you will need to submit a separate document myweka.pdf detailing how you use Weka to accomplish the tasks. See Subsection 2.4 for further detail.

2. The Tasks -

2.1 Data Preparation

In this first task, you will examine the data attributes and identify issues present in the data. For each issue that you identify, decide on and perform the necessary action to address it. Finally, you will need to suitably split the data into two sets: one for training and one for testing, the latter containing the 100 samples with missing class labels. The two sets must also be submitted electronically with your report. They must be presented in Weka ARFF format. Your marks for this task will depend on how well you identify the issues and address them. Use the following list as a general guide for this task:

Irrelevant attributes: this data set is known to have irrelevant attributes.

  • Describe what you think irrelevant attributes are.
  • For each attribute, carefully examine it and decide whether it is irrelevant. If so, give a brief explanation and remove the attribute.
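
As a first screen for irrelevant attributes, one option is to flag attributes that take a single value everywhere or a different value in every instance. The sketch below is a minimal illustration only: the helper `flag_uninformative` and the attribute names (`ID`, `att1`, ...) are hypothetical, since the real names are obfuscated. A continuous numeric attribute may also be all distinct, so "all distinct" marks a candidate for manual inspection, not a verdict.

```python
def flag_uninformative(rows, columns):
    """Flag columns that are constant or take a different value in every row."""
    flagged = {}
    for col in columns:
        values = [row[col] for row in rows]
        distinct = len(set(values))
        if distinct <= 1:
            flagged[col] = "constant"       # carries no information
        elif distinct == len(values):
            flagged[col] = "all distinct"   # ID-like; inspect manually
    return flagged


# Toy rows with hypothetical attribute names.
rows = [
    {"ID": "1", "att1": "A", "att2": "x", "att3": "0.5"},
    {"ID": "2", "att1": "A", "att2": "y", "att3": "0.7"},
    {"ID": "3", "att1": "A", "att2": "x", "att3": "0.5"},
]
flags = flag_uninformative(rows, ["ID", "att1", "att2", "att3"])
```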

Missing entries

  • Which attributes/instances have missing entries?
  • For those attributes/instances, how many missing entries are present?
  • For each attribute/instance with missing entries, make a suitable decision, justify it, and proceed.
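
The first two questions can be answered by a simple count. The following is a minimal sketch, assuming missing entries appear as empty fields or as one of a few marker tokens; the set `MISSING_TOKENS` and the toy CSV content are assumptions and should be adapted after inspecting the real file.

```python
import csv
import io

# Assumed markers for a missing entry; adjust after inspecting the data.
MISSING_TOKENS = {"", "?", "NA", "NaN"}


def count_missing(rows, columns):
    """Count missing entries per column in a list of dict rows."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if (row[col] or "").strip() in MISSING_TOKENS:
                counts[col] += 1
    return counts


# Toy CSV text standing in for the real data file.
sample_csv = "ID,att1,att2\n1,3.2,A\n2,,B\n3,1.5,?\n4,,A\n"
reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)
```

The same loop, run per row instead of per column, gives the number of missing entries per instance.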


Duplicates:

  • Are there any duplicates (instances/attributes) in the original data?
  • For each attribute/instance with duplicates, make a suitable decision, justify it, and proceed.
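
Duplicate instances and duplicate attributes can both be detected by comparing tuples of values. This is a sketch under the assumption that an instance counts as a duplicate when all of its feature values match an earlier instance (the ID is excluded from the comparison); the helper names and toy attribute names are hypothetical.

```python
def duplicate_rows(rows, key_columns):
    """Return indices of rows whose key-column values repeat an earlier row."""
    seen = set()
    dupes = []
    for i, row in enumerate(rows):
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes


def duplicate_columns(rows, columns):
    """Return (kept, duplicate) pairs of columns with identical values in every row."""
    first_seen = {}
    pairs = []
    for col in columns:
        signature = tuple(row[col] for row in rows)
        if signature in first_seen:
            pairs.append((first_seen[signature], col))
        else:
            first_seen[signature] = col
    return pairs


rows = [
    {"ID": "1", "att1": "A", "att2": "5", "att3": "A"},
    {"ID": "2", "att1": "B", "att2": "7", "att3": "B"},
    {"ID": "3", "att1": "A", "att2": "5", "att3": "A"},
]
features = ["att1", "att2", "att3"]             # ID left out on purpose
row_dupes = duplicate_rows(rows, features)
col_dupes = duplicate_columns(rows, features)
```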

Data type:

  • For each attribute, carefully examine the default data type (e.g. Numeric, Nominal, Binary, String, etc.) that Weka assigns when it loads the original CSV file.
  • If the data type of an attribute is not suitable, give a brief explanation and convert the attribute to a more suitable data type. Provide detailed information of the conversion.
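
One possible heuristic for attributes that only appear numeric is to look at how many distinct integer codes they take. The sketch below assumes that a column of whole numbers with at most `max_levels` distinct values is more likely a set of category codes than a measurement; the threshold of 10 and the helper name `suggest_type` are assumptions, and the suggestion still needs to be justified by inspecting the attribute.

```python
def suggest_type(values, max_levels=10):
    """Suggest 'numeric' or 'nominal' for a column of raw string values."""
    present = [v for v in values if v not in ("", "?")]
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        return "nominal"      # at least one token is not a number
    all_integers = all(n.is_integer() for n in numbers)
    if all_integers and len(set(numbers)) <= max_levels:
        return "nominal"      # few integer codes: likely category labels
    return "numeric"


coded = suggest_type(["1", "2", "3", "1", "2"])
measured = suggest_type(["0.5", "1.7", "2.25"])
text = suggest_type(["A", "B", "A"])
```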

Scaling and standardisation:

  • For each numeric attribute, decide if any pre-processing (e.g. scaling, standardisation) is required. Give a brief explanation why it is needed (this should be discussed in relation to the subsequent classification task).
  • Feature engineering: you may also come up with attributes derived from existing attributes. If this is the case, give an explanation of the new attributes that you have created.
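
Scaling matters most for distance-based classifiers such as k-NN, where an attribute with a large range would otherwise dominate the distance. A minimal sketch of z-score standardisation follows, assuming the mean and standard deviation are estimated on the training data only and then reused for the test data; the function names are hypothetical.

```python
import statistics


def fit_standardiser(train_values):
    """Return (mean, std) estimated from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std


def apply_standardiser(values, mean, std):
    """Map each value to its z-score; a constant attribute maps to 0.0."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


train = [2.0, 4.0, 6.0]
test = [4.0, 8.0]
mean, std = fit_standardiser(train)
train_z = apply_standardiser(train, mean, std)
test_z = apply_standardiser(test, mean, std)   # reuses training statistics
```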

Feature/Attribute selection: if applicable, clearly indicate which attributes you decide to remove in addition to those (obviously) irrelevant attributes that you have identified above and give a brief explanation why.

Data instances: if you decide to make changes to the data instances with class labels (this may include selecting only a subset of the data, removing instances, randomizing or reordering instances, or synthetically injecting new data instances to the training data, etc.), provide an explanation.

Data imbalance: the data set is known to have more samples from one class than the other. If you employ any strategy to address the data imbalance issue, describe it thoroughly.
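
Random oversampling of the minority class is one of several possible strategies (others include undersampling the majority class or using cost-sensitive classifiers). The sketch below is an illustration only, with a hypothetical helper `oversample`; it duplicates randomly chosen minority instances until every class is as frequent as the largest one. To avoid optimistic estimates, it would normally be applied to the training portion only, never to validation or test data.

```python
import random
from collections import Counter


def oversample(instances, labels, seed=0):
    """Duplicate random minority instances until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(instances), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(instances, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


X = [[1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1]
balanced_x, balanced_y = oversample(X, y)
balanced_counts = Counter(balanced_y)
```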

Others: describe other data-preparation steps not mentioned above.

Training, Validation, and Test Sets: suitably divide the prepared data into training, validation and test sets. These sets must be in ARFF format and submitted together with the electronic version of your report. See the Submission section for further information.
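
For the programming approach, an ARFF file can be written as plain text: a `@relation` line, one `@attribute` line per attribute, then `@data` followed by comma-separated rows, with `?` marking a missing value. The sketch below uses a hypothetical helper `to_arff` and toy attribute names; it does not quote names or values that contain spaces or commas, which real data may require.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, "numeric") or (name, [nominal levels]).
    rows: list of value lists; None is written as the ARFF missing marker '?'.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [("att1", "numeric"), ("att2", ["A", "B"]), ("Class", ["0", "1"])]
rows = [[3.2, "A", "1"], [1.5, "B", None]]   # second row: class label unknown
arff_text = to_arff("train", attributes, rows)
```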

2.2 Data Classification

For this task, you will demonstrate convincingly how you select a suitable classification scheme to learn the predictive model from training data and use that model to predict the missing labels. You will also need to estimate the prediction accuracy on the actual test data. Finally, you will need to provide your prediction as a table in the report and a CSV file to be submitted electronically. You will need to demonstrate the following:

Classifier selection: you will need to select at least three (3) classifiers: k-NN, Naive Bayes, and Decision Trees (J48). Other classifiers, including meta classifiers, are also encouraged. Every classifier typically has parameters to tune. If you change the default parameters to achieve higher cross-validation performance, clearly indicate what the parameters mean, and what values you have selected.
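
To illustrate what a tunable parameter means, here is a minimal from-scratch sketch of k-NN with Euclidean distance and majority voting. It is not Weka's implementation; the function name and toy points are hypothetical. The parameter `k` is the number of nearest training instances that vote on the label.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    neighbours = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
train_y = [0, 0, 0, 1, 1]
near_minority_k1 = knn_predict(train_x, train_y, (5.5, 5.0), k=1)
near_minority_k5 = knn_predict(train_x, train_y, (5.5, 5.0), k=5)
```

In this toy set the same query is labelled 1 with k=1 but 0 with k=5, because the majority class outvotes the two nearby minority points. This is one reason the choice of k interacts with class imbalance.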

Cross validation: you will need to address the following

  • How to evaluate the effectiveness of a classifier on the given data?
  • How to address the issue of class imbalance in the training data?
  • What is your choice of validation/cross-validation?
  • For each classifier that you've selected, what is the validation/cross-validation performance? Give an interpretation of the confusion matrix.
  • For each classifier that you've selected, what is the estimated classification accuracy on the actual test data?
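
The questions above can be made concrete with stratified folds and a confusion matrix. The sketch below is an illustration under stated assumptions: `stratified_folds` (a hypothetical helper) deals the indices of each class round-robin into k folds so that every fold keeps roughly the original class ratio, and class 1 is treated as the positive class. The toy labels are made up to show why accuracy alone can mislead on imbalanced data.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def confusion_matrix(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarise(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F-measure from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, k=4)

# Toy predictions: every majority instance is right, half the minority is missed.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
counts = confusion_matrix(actual, predicted)
metrics = summarise(*counts)
```

Here accuracy is 0.9 although only half of the minority class is recovered (recall 0.5), which is why a second metric such as the F-measure is worth reporting.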

Classifier comparison:

  • Compare the classification performance of the different classifiers. You need to select at least two (2) evaluation metrics, for example F-measure and classification accuracy, when comparing them. Your comparison must take into account the variation between different runs due to cross-validation.
  • Based on the comparison, select the best two (2) classification schemes for final prediction. Note that the two classification schemes can be the same type of classifier with two different parameter settings. Clearly indicate the final choice of parameters if they are not the default values.
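
One way to account for run-to-run variation is to report the mean and standard deviation of each metric over the cross-validation folds. The sketch below uses placeholder numbers that are purely illustrative and are not results from the assignment data; the scheme names are hypothetical.

```python
import statistics


def summarise_runs(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.fmean(scores), statistics.stdev(scores)


# Placeholder per-fold F-measures, NOT real results.
fold_f1 = {
    "scheme A": [0.70, 0.74, 0.68, 0.72, 0.71],
    "scheme B": [0.60, 0.80, 0.65, 0.75, 0.70],
}
summary = {name: summarise_runs(scores) for name, scores in fold_f1.items()}
ranking = sorted(summary, key=lambda name: summary[name][0], reverse=True)
```

When two means differ by less than the spread across folds, as in this toy case, the ranking alone is weak evidence, and the more stable scheme may be the safer choice.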


Prediction:

  • Use the best two classification schemes that you have identified in the previous step to predict the missing class labels of the last 100 samples in the original data set.
  • Provide your prediction in the report as a table: the first column is the sample ID, and the second and third columns are the class labels predicted by the first and second scheme respectively.
  • Produce a CSV file with the name predict.csv that contains your prediction in a similar format: the first column is the sample ID, and the second and third columns are the predicted class labels. This file must be submitted electronically with the electronic copy of the report.
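
A minimal sketch of producing the three-column prediction file with Python's `csv` module follows. The header names `ID`, `Predict1` and `Predict2` are assumptions, since the brief does not specify a header row; the example writes to an in-memory buffer, and in practice the handle would come from `open("predict.csv", "w", newline="")`.

```python
import csv
import io


def write_predictions(handle, ids, preds_first, preds_second):
    """Write one row per sample: ID, label from scheme 1, label from scheme 2."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ID", "Predict1", "Predict2"])   # assumed header names
    for row in zip(ids, preds_first, preds_second):
        writer.writerow(row)


buffer = io.StringIO()
write_predictions(buffer, [1001, 1002], [0, 1], [1, 1])
csv_text = buffer.getvalue()
```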

IMPORTANT: Please ensure that your prediction is correctly formatted as required. You must also indicate clearly in the report your estimated prediction accuracy. This should be based on the validation study.

2.3 Report

You will also need to submit a written report. It should serve the following objectives:

  • It demonstrates your understanding of the problem and the steps you have taken to solve the tasks.
  • It contains information necessary for marking your work.
  • Page limit: your report must not exceed 20 pages. Pages beyond 20 will be ignored!

Structure of the report (what you should include):

Cover page

Summary: briefly list the major findings (data preparation and classification) and the lessons you've learned.

Methodology: address the requirements described above for

  • Data preparation
  • Data classification

Prediction: produce a table that describes the best two prediction results.

References: list any relevant work that you refer to.

Appendices: important things not mentioned above.

Visual illustration to support your analysis which may include: tables, figures, plots, diagrams, and screenshots.

Attachment:- Assignment Files - Data Mining Report.rar
