Detecting Spam Email

Reference no: EM133300734

Detecting Spam Email (from the UCI Machine Learning Repository):

A team at Hewlett Packard collected data on a large number of email-messages from their postmaster and personal email for the purpose of finding a classifier that can separate spam email-messages from non-spam (AKA "ham") ones. The spam concept is diverse: it includes advertisements for products or websites, "make money fast" schemes, chain letters, pornography, and so on. The definition used here is "unsolicited commercial e-mail". The file Spambase.xls contains information on 4601 email-messages, of which 1813 are tagged "spam". There are 57 predictors, most of which are the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the email. A few predictors are related to the number and length of capitalized words.

1. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and non-spam emails by comparing the spam-class average and non-spam-class average. Which are the 11 predictors that appear to vary the most between spam and non-spam emails? From these 11, which words/signs occur more often in spam?
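The class-average comparison in question 1 can be sketched as below. This is a minimal illustration, not the expected solution: the rows and their values are made up, and the label column name `Spam` and the helper `rank_predictors` are hypothetical. The real data would be read from Spambase.xls.

```python
from statistics import mean

def rank_predictors(rows, label_key="Spam", top_n=11):
    """Rank predictors by the absolute gap between spam and non-spam averages."""
    predictors = [k for k in rows[0] if k != label_key]
    ranked = []
    for p in predictors:
        spam_avg = mean(r[p] for r in rows if r[label_key] == 1)
        ham_avg = mean(r[p] for r in rows if r[label_key] == 0)
        ranked.append((p, spam_avg, ham_avg, spam_avg - ham_avg))
    # Largest absolute difference first
    ranked.sort(key=lambda t: abs(t[3]), reverse=True)
    return ranked[:top_n]

# Toy rows with made-up values (NOT taken from Spambase.xls)
rows = [
    {"free": 0.8, "george": 0.0, "!": 0.6, "Spam": 1},
    {"free": 0.6, "george": 0.0, "!": 0.4, "Spam": 1},
    {"free": 0.1, "george": 1.5, "!": 0.1, "Spam": 0},
    {"free": 0.1, "george": 0.5, "!": 0.1, "Spam": 0},
]

for name, spam_avg, ham_avg, diff in rank_predictors(rows, top_n=2):
    direction = "more often in spam" if diff > 0 else "more often in non-spam"
    print(f"{name}: spam={spam_avg:.2f}, non-spam={ham_avg:.2f} ({direction})")
```

The sign of the difference answers the second part of the question: a positive difference marks a word or symbol that occurs more often in spam. Raw differences favor predictors measured on larger scales, so comparing standardized differences or ratios may be a reasonable alternative.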

2. Partition the data into training and validation sets, then perform a discriminant analysis on the training data using only the 11 predictors.
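One way to sketch question 2 is shown below using NumPy. Several things here are assumptions: the 60/40 split (the question does not specify one), the synthetic two-predictor data standing in for the 11 selected predictors, and the helper names. The classification functions use the standard linear discriminant form with a pooled covariance matrix, and each constant includes the log of the training-set class proportion.

```python
import numpy as np

def partition(n_rows, train_fraction=0.6, seed=1):
    """Randomly split row indices into training and validation sets."""
    order = np.random.default_rng(seed).permutation(n_rows)
    cut = int(train_fraction * n_rows)
    return order[:cut], order[cut:]

def fit_classification_functions(X, y):
    """Return {class: (constant, coefficients)} for linear discriminant analysis."""
    classes = [int(c) for c in np.unique(y)]
    n, p = X.shape
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # Pooled within-class covariance matrix
    pooled = np.zeros((p, p))
    for c in classes:
        centered = X[y == c] - means[c]
        pooled += centered.T @ centered
    pooled /= n - len(classes)
    inverse = np.linalg.inv(pooled)
    functions = {}
    for c in classes:
        coefficients = inverse @ means[c]
        # Constant includes log of the class proportion in the training data
        constant = -0.5 * means[c] @ coefficients + np.log(np.mean(y == c))
        functions[c] = (constant, coefficients)
    return functions

def classify(functions, X):
    """Assign each row to the class with the highest classification score."""
    classes = list(functions)
    scores = np.column_stack(
        [functions[c][0] + X @ functions[c][1] for c in classes]
    )
    return np.array(classes)[scores.argmax(axis=1)]

# Synthetic stand-in for the selected predictors (NOT the Spambase data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(3.0, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

train, valid = partition(len(y))
functions = fit_classification_functions(X[train], y[train])
accuracy = np.mean(classify(functions, X[valid]) == y[valid])
print(f"validation accuracy: {accuracy:.3f}")
```

Fitting on the training rows only and scoring the validation rows keeps the evaluation in question 3 honest.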

3. If we are interested mainly in detecting spam messages, is this model useful? Use the confusion matrix, lift chart, and decile chart for the validation set for the evaluation.
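The numbers behind a confusion matrix and a decile chart can be sketched as follows. The actual labels and scores are made up for illustration, the 0.5 cutoff is an assumption, and the helper names are hypothetical. Five bins are used instead of ten only because the toy sample has ten records.

```python
def confusion_matrix(actual, predicted):
    """Counts keyed by (actual, predicted) for a 0/1 outcome (1 = spam)."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

def decile_lift(actual, scores, bins=10):
    """Spam rate in each score-ranked bin divided by the overall spam rate."""
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda t: t[0], reverse=True)]
    n = len(ranked)
    overall_rate = sum(ranked) / n
    lifts = []
    for i in range(bins):
        chunk = ranked[i * n // bins:(i + 1) * n // bins]
        lifts.append((sum(chunk) / len(chunk)) / overall_rate)
    return lifts

# Made-up validation results: estimated spam probabilities and true labels
actual = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed cutoff of 0.5

cm = confusion_matrix(actual, predicted)
sensitivity = cm[(1, 1)] / (cm[(1, 1)] + cm[(1, 0)])  # share of spam detected
lifts = decile_lift(actual, scores, bins=5)
print(cm, sensitivity, lifts)
```

When spam detection is the main interest, sensitivity and the lift in the top bins are the figures to watch: a lift well above 1 in the first bins means the model ranks spam ahead of non-spam.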

4. In the sample, almost 40% of the email-messages were tagged as spam. However, suppose that the actual proportion of spam messages in these email accounts is 10%. Compute the constants of the classification functions to account for this information.
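A common way to account for prior probabilities in discriminant analysis is to add the natural log of each class's prior to the constant of that class's classification function. The sketch below assumes this approach. The starting constants are placeholders rather than fitted values, and `fitted_priors` is a hypothetical parameter for the case where the software has already folded some priors into the constants.

```python
import math

def adjust_constants_for_priors(constants, new_priors, fitted_priors=None):
    """Shift each classification-function constant by ln(prior).

    If the constants already contain ln(fitted_priors), that term is
    removed first so the priors are not counted twice.
    """
    adjusted = {}
    for cls, const in constants.items():
        if fitted_priors is not None:
            const -= math.log(fitted_priors[cls])
        adjusted[cls] = const + math.log(new_priors[cls])
    return adjusted

# Placeholder constants (NOT fitted values from the Spambase data)
constants = {"spam": -4.0, "nonspam": -2.0}
new_priors = {"spam": 0.10, "nonspam": 0.90}

adjusted = adjust_constants_for_priors(constants, new_priors)
print(adjusted)
```

Because ln(0.10) is far more negative than ln(0.90), the spam function is penalized more, so a message needs stronger evidence to be classified as spam.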

5. A spam filter based on your model is used, so that only messages classified as non-spam are delivered, while messages classified as spam are quarantined. In this case, misclassifying a non-spam email (as spam) has much more serious consequences. Suppose that the cost of quarantining a non-spam email is 20 times that of not detecting a spam message. Compute the constants of the classification functions to account for these costs (assume that the proportion of spam is reflected correctly by the sample proportion).
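Misclassification costs are often handled the same way as priors: add ln(prior × cost) to each class's constant, where the cost is that of misclassifying a member of that class. The sketch below assumes this approach and assumes the starting constants contain no prior term. The constants are placeholders; the class counts (1813 spam out of 4601) and the 20-to-1 cost ratio come from the problem statement.

```python
import math

def adjust_constants_for_costs(constants, priors, misclassification_costs):
    """Add ln(prior * cost of misclassifying that class) to each constant."""
    return {
        cls: const + math.log(priors[cls] * misclassification_costs[cls])
        for cls, const in constants.items()
    }

# Placeholder constants (NOT fitted values), assumed to exclude any prior term
constants = {"spam": -4.0, "nonspam": -2.0}

# Sample proportions from the problem statement
n_total, n_spam = 4601, 1813
priors = {"spam": n_spam / n_total, "nonspam": (n_total - n_spam) / n_total}

# Quarantining a non-spam message costs 20 times as much as missing a spam message
costs = {"spam": 1.0, "nonspam": 20.0}

adjusted = adjust_constants_for_costs(constants, priors, costs)
print(adjusted)
```

The large cost attached to non-spam raises the non-spam constant, which makes the filter more reluctant to quarantine a message.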






All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd