Prepare heritage data for classification learning

Reference no: EM131231115

Assignment 1:

1. Using heritage data (release 1) in SQL

a. Find support for all single itemsets

b. List all itemsets with 2 elements and support of at least 0.2

c. List all itemsets with 3 elements and support at least 0.2

2. In Weka

a. Load heritage data (release 1)

b. Apply at least two association rule generation algorithms and compare results

c. Apply FPTree algorithm with at least two measures of rule metrics
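The support calculations in question 1 can be sketched outside SQL as follows. This is a minimal illustration, assuming each member's record has already been reduced to a set of items; the item labels "A", "B", "C" and the five toy transactions are invented for the example and are not taken from the heritage data.

```python
from collections import Counter
from itertools import combinations


def itemset_support(transactions, size):
    """Return {itemset: support} for every itemset of the given size.

    Support is the fraction of transactions that contain the itemset.
    """
    counts = Counter()
    for items in transactions:
        # sorted() gives each itemset one canonical tuple representation
        for combo in combinations(sorted(set(items)), size):
            counts[combo] += 1
    n = len(transactions)
    return {combo: count / n for combo, count in counts.items()}


def frequent_itemsets(transactions, size, min_support):
    """Keep only the itemsets whose support is at least min_support."""
    supports = itemset_support(transactions, size)
    return {k: v for k, v in supports.items() if v >= min_support}


# Hypothetical toy data: one set of item labels per member.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

singles = itemset_support(transactions, 1)         # part (a)
pairs = frequent_itemsets(transactions, 2, 0.2)    # part (b)
triples = frequent_itemsets(transactions, 3, 0.2)  # part (c)
```

In SQL the same idea is usually expressed as a self-join of the item table on the member key, grouped by the item combination and divided by the total number of members.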

Assignment 2:

1. In SQL/Weka:

a. Prepare heritage data for classification learning

b. Load heritage data release 3 (preprocessed to binary representation, including demographics and output attribute(s))

c. Perform exploratory analysis

d. Create at least three classification models for predicting hospitalization based on Year 1 data.

e. Which model performs the best on year 2 data?

f. Create regression model for predicting hospitalization days.

g. What is the difference between regression and classification models?

h. Present your results in the form of a short report that includes screenshots, tables, and any needed description.

Assignment 3:

Classification Part 2

1. Using heritage release 3 data prepared last assignment

a. Include drug information into data

b. Include laboratory information into data

c. Import newly created data into Weka and run classification algorithms

d. Does inclusion of the information improve predictions?

There are many ways to complete question 4, so you need to make different decisions.

Try not to overcomplicate the problem.

2. In Weka using heritage 3 dataset

a. Apply the k-means algorithm for k = 2, 3, 5, 10

b. Apply EM algorithm. What is the optimal number of clusters obtained by EM?

c. Compare the created clusters to classification based on hospitalization in year 2.

Assignment 4:

Using the data table shown below.

a. Calculate distance between all points in 1-norm, 2-norm and infinity-norm. Show dissimilarity matrix.

b. Is there any need to preprocess the data to be more suitable for clustering? If so, describe the operations and show the resulting data table.

c. Apply k-means clustering algorithm with k=2.

ID  Age  BMI  Gender  Total Cholesterol
1   30   24   M       180
2   70   19   M       190
3   65   26   M       220
4   40   32   F       260
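The three parts of this question can be sketched for the four-row table above. The preprocessing shown is one possible choice, not the required answer: Gender is encoded as 0/1 and every column is min-max scaled to [0, 1] so that Total Cholesterol does not dominate the distances. The k-means routine is a plain Lloyd iteration, and seeding it with the first two points is an assumption made to keep the run deterministic.

```python
# Rows of the table: ID -> (Age, BMI, Gender, Total Cholesterol)
rows = {
    1: (30, 24, "M", 180),
    2: (70, 19, "M", 190),
    3: (65, 26, "M", 220),
    4: (40, 32, "F", 260),
}


def encode(row):
    """Turn one row into a numeric vector; Gender becomes 0 (M) or 1 (F)."""
    age, bmi, gender, chol = row
    return [float(age), float(bmi), 1.0 if gender == "F" else 0.0, float(chol)]


def min_max_scale(vectors):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(vec, lo, hi)]
        for vec in vectors
    ]


def distance(a, b, norm):
    """Distance between two vectors; norm is 1, 2, or "inf"."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if norm == 1:
        return sum(diffs)
    if norm == 2:
        return sum(d * d for d in diffs) ** 0.5
    if norm == "inf":
        return max(diffs)
    raise ValueError("norm must be 1, 2 or 'inf'")


def dissimilarity_matrix(vectors, norm):
    """Pairwise distance matrix (symmetric, zero diagonal)."""
    return [[distance(a, b, norm) for b in vectors] for a in vectors]


def kmeans(vectors, initial_centroids, max_iter=100):
    """Lloyd's algorithm with Euclidean distance; returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: distance(v, centroids[k], 2))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


raw = [encode(rows[i]) for i in sorted(rows)]
scaled = min_max_scale(raw)

raw_l1 = dissimilarity_matrix(raw, 1)
raw_l2 = dissimilarity_matrix(raw, 2)
raw_linf = dissimilarity_matrix(raw, "inf")
scaled_l2 = dissimilarity_matrix(scaled, 2)

# Assumption: seed the two clusters with the first two points.
labels, centroids = kmeans(scaled, scaled[:2])
```

Comparing `raw_l2` with `scaled_l2` shows how much the unscaled Age and Total Cholesterol columns drive the result. The cluster assignment depends on both the preprocessing and the starting centroids, so other reasonable choices may give a different split.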

Assignment 5:

Text Mining

1. Write regular expressions to:

a. Detect zip codes in text

b. Find last names of all patients whose first name is John (note that regular expressions may have some false positives/false negatives).
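One possible pair of patterns for question 1 is sketched below. It assumes US-style zip codes (five digits with an optional four-digit extension) and assumes the last name is the capitalized word immediately after "John"; the sample note is invented. Both patterns are deliberately simple, so any five-digit number is reported as a zip code (false positive) and a name written with a middle initial, such as "John A. Smith", yields the initial instead of the last name (false negative).

```python
import re

# Five digits, optionally followed by a hyphen and four more digits.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# "John", whitespace, then a capitalized word (letters, apostrophes, hyphens).
JOHN_RE = re.compile(r"\bJohn\s+([A-Z][A-Za-z'-]+)")


def find_zip_codes(text):
    """Return every substring that looks like a zip code."""
    return ZIP_RE.findall(text)


def find_john_last_names(text):
    """Return the word following each occurrence of the first name John."""
    return JOHN_RE.findall(text)


# Hypothetical clinical note used only for illustration.
note = (
    "Patient John Smith, 12 Main St, Springfield, VA 22150-1234, "
    "was seen together with John O'Neil of Reston 20190."
)

zips = find_zip_codes(note)
last_names = find_john_last_names(note)
```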

2. List the challenges in automatically retrieving ICD-9 codes from clinical notes. Search the literature to find relevant published work. Also include your own observations and comments.

3. Using the SMS data

a. Split data into training (80%) and testing (20%) sets

b. Build naïve Bayes classifier for detecting spam based on bag of words

i. List all words in the documents

ii. Count occurrences in spam and ham

iii. Assign likelihoods P(word|spam) and P(word|ham) for all words

iv. Convert test data into a list of words. For each message you need two columns: message id and word

v. Classify test data. This can be done by a series of joins with the data prepared in (iii).

vi. Evaluate your model (accuracy, precision, recall)
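Steps (a) and (i) through (vi) can be sketched end to end as follows. This is an illustration under stated assumptions, not the required SQL solution: the tokenizer, the add-one (Laplace) smoothing, and the tiny hand-written messages are all choices made for the example, and the sketch assumes the training set contains at least one message of each class.

```python
import math
import random
import re
from collections import Counter


def split(messages, train_frac=0.8, seed=0):
    """(a) Shuffle with a fixed seed, then cut into training and testing sets."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def tokenize(text):
    """(i, iv) Lower-case the message and list its words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """(ii, iii) Count words per class and derive P(word|class).

    Add-one smoothing is an assumption of this sketch; it keeps a word seen
    in only one class from forcing the other class's probability to zero.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    model = {"vocab": vocab, "prior": {}, "likelihood": {}}
    for label in ("spam", "ham"):
        model["prior"][label] = doc_counts[label] / len(messages)
        total = sum(word_counts[label].values())
        model["likelihood"][label] = {
            w: (word_counts[label][w] + 1) / (total + len(vocab)) for w in vocab
        }
    return model


def classify(model, text):
    """(v) Pick the class with the larger log posterior; unseen words are skipped."""
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(model["prior"][label])
        for w in tokenize(text):
            if w in model["vocab"]:
                score += math.log(model["likelihood"][label][w])
        scores[label] = score
    return max(scores, key=scores.get)


def evaluate(model, messages):
    """(vi) Accuracy, precision and recall, treating spam as the positive class."""
    tp = fp = fn = tn = 0
    for label, text in messages:
        pred = classify(model, text)
        if pred == "spam" and label == "spam":
            tp += 1
        elif pred == "spam":
            fp += 1
        elif label == "spam":
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(messages)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical hand-written messages standing in for the SMS data.
train_set = [
    ("spam", "win a free prize now"),
    ("spam", "free cash call now"),
    ("ham", "are we meeting for lunch today"),
    ("ham", "call me when you are home"),
]
test_set = [("spam", "free prize call now"), ("ham", "lunch at home today")]

model = train(train_set)
accuracy, precision, recall = evaluate(model, test_set)
```

With the real SMS file, `split` would produce the 80/20 partition first. In SQL, the dictionary lookups inside `classify` correspond to joining the (message id, word) table from step (iv) with the likelihood table from step (iii) and summing the log-likelihoods per message.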



