Explanation of your algorithm and pseudocode

Reference no: EM131142323

I need help with my project for big data

It is up to us to define the specific design and to limit or expand the scope, but the project should involve substantial design, analysis, programming, and validation. I have listed the topic and technology that should be used for the project. You can make changes and consider other attributes for the system. Let me know the price and the process for moving forward.

Topic: Recommendation or clustering system for IMDB movie dataset.

Technology: Spark will be used for solving the problem.

Data: https://www.imdb.com/interfaces

The movie recommendation system will consider the following attributes of each movie.

1) Title

2) Genre

3) Actors

4) Actresses

5) Directors

6) Rating

Hypothesis: We wish to build a movie recommendation system that suggests movies a user might be interested in, based on his or her tastes, interests, and connections to other people. A report is also needed.
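One possible starting point for the algorithm is content-based similarity over the listed attributes. The sketch below is only an illustration under stated assumptions: the movie records, the `genre:`/`actor:`/`director:` feature tags, and the helper names `jaccard` and `recommend` are all hypothetical, and the code runs on a small in-memory dictionary rather than on Spark.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy records. Real titles, genres, cast, and directors
# would be loaded from the IMDb data files.
MOVIES: Dict[str, Set[str]] = {
    "Movie A": {"genre:Drama", "actor:P1", "director:D1"},
    "Movie B": {"genre:Drama", "actor:P1", "director:D2"},
    "Movie C": {"genre:Comedy", "actor:P2", "director:D3"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(liked: List[str], movies: Dict[str, Set[str]],
              top_n: int = 2) -> List[Tuple[str, float]]:
    """Rank movies the user has not liked yet by similarity to their profile."""
    # The user profile is the union of the features of every liked movie.
    profile: Set[str] = set().union(*(movies[title] for title in liked))
    scored = [(title, jaccard(profile, features))
              for title, features in movies.items() if title not in liked]
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


print(recommend(["Movie A"], MOVIES, top_n=1))
```

Rating could be folded in as a weight or a filter, and the same pairwise scoring could be distributed with Spark once the design is settled; both are left open here.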

Parts of the report:

I. Design:

Design document should contain your proposed design of the solution.

  • Summary of problem definition - Focus on explaining what you want to do in the project, any assumptions, and limitations.
  • Description of input data - In the design document, you have to provide a summary of your data - for example, data format, attributes, and metadata. Please do not include data inside the report.
  • Explanation of your algorithm and pseudocode
  • Explanation of your Big Data strategy - Which Big Data strategy did you use and how does it make sense in the overall picture. If you used multiple technologies, list the project phases where you used each.
  • Data flow diagram for your application - An example of a DFD for MapReduce is here: https://creately.com/diagram/example/h21wfdxq2/MapReduce Similarly, you can create a data flow diagram for your machine learning/data analysis strategy.
  • Details of how your application handles bad or missing data, and whether your strategy is robust, i.e., whether it can recover from errors. Similarly, if you use machine learning, how do you handle overfitting?
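As an illustration of the bad-or-missing-data point, a row-level cleaning step might look like the sketch below. It assumes a tab-separated ratings file with three columns (id, rating, vote count) and assumes that `\N` marks a missing value, which should be checked against the actual data files. The function name `clean_rating_row` and the 0 to 10 rating range are also assumptions.

```python
from typing import Dict, Optional

MISSING = r"\N"  # assumption: marker used for missing values in the TSV files


def clean_rating_row(line: str) -> Optional[Dict[str, object]]:
    """Parse 'id<TAB>rating<TAB>votes'; return None for unusable rows."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None  # wrong number of columns
    movie_id, rating_text, votes_text = fields
    if not movie_id or MISSING in fields:
        return None  # missing identifier or missing value
    try:
        rating = float(rating_text)
        votes = int(votes_text)
    except ValueError:
        return None  # non-numeric rating or vote count
    if not 0.0 <= rating <= 10.0 or votes < 0:
        return None  # out-of-range values (this also rejects NaN)
    return {"id": movie_id, "rating": rating, "votes": votes}


rows = ["tt0000001\t5.7\t1900\n", "tt0000002\t\\N\t10\n", "garbage"]
cleaned = [r for r in map(clean_rating_row, rows) if r is not None]
print(cleaned)
```

Returning `None` instead of raising lets one bad record be skipped and counted without stopping the whole job, which is one simple way to make the pipeline recover from errors.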

II. Analysis of Results:

 In this section, you will present your final results and analyze them. Following are certain key points:

 • Summarize your results well. This could be in the form of tables, graphs, plots, or other visualization tools.

 • Validate your results.

For example, you can compute the accuracy of your model on the test dataset, or use cross-validation on the training dataset, or you can show that there is a correlation between positive review and star rating, or that there is a correlation between positive sentiment and stock price. Try to come up with numerical results.

 • If your results are below expectations, explain the probable reasons.
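For the validation step, one way to obtain a numerical result is to hold out part of the ratings and report the root-mean-square error (RMSE) of predicted ratings on that held-out part. The sketch below assumes a simple random hold-out split; the helper names, the 20% test fraction, and the fixed seed are illustrative choices, not requirements of the assignment.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_test_split(records: Sequence, test_fraction: float = 0.2,
                     seed: int = 42) -> Tuple[List, List]:
    """Shuffle a copy of the records and split off a held-out test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between true and predicted ratings."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


train, test = train_test_split(range(10))
print(len(train), len(test))
print(rmse([3.0, 5.0], [3.0, 3.0]))
```

Comparing the error on the training part with the error on the held-out part is also a basic check for overfitting: a much lower training error suggests the model does not generalize.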

III. Conclusion

Following are key points:

 • Explain how using Big Data helped you with this project and how it helped you arrive at a better, faster, or more efficient solution.

  • Describe what you learned in this project.
  • Describe how your technique/strategy can be improved.

PROJECT SOURCE CODE - Source Code and Sample data files:

Some of the coding requirements are as follows:

  • Please include a README file indicating which language and technology you used and how to compile your code.
  • You need to use HDFS for at least some part of the project.

- This could mean that you use HDFS for data extraction, pre-processing, or actual classification or clustering. The key is you have to use HDFS somewhere.

  • Your code should be well documented.
  • Ideally, you should create a UNIX script such that the entire workflow - data extraction, parsing, pre-processing, analysis, MapReduce, machine learning task - can be run using that script. The script can accept parameters from the command line.

Please attach a sample of your data. This should not be the entire dataset, but just the top few lines, so the TA can run your code. About 1000 lines/records should be fine.
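The sample file and the command-line parameters mentioned above could be handled by a small helper such as the sketch below. It assumes the data file is line-oriented text with one header row; the argument names and the helpers `sample_lines` and `parse_args` are hypothetical, and reading from HDFS is not shown.

```python
import argparse
from itertools import islice
from typing import Iterable, List, Sequence


def sample_lines(lines: Iterable[str], records: int = 1000) -> List[str]:
    """Return the header line plus the first `records` data lines."""
    return list(islice(lines, records + 1))


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    """Parse the (hypothetical) command-line parameters of the sampler."""
    parser = argparse.ArgumentParser(description="Write a small data sample.")
    parser.add_argument("input_path", help="full data file")
    parser.add_argument("output_path", help="where to write the sample")
    parser.add_argument("--records", type=int, default=1000,
                        help="number of data records to keep")
    return parser.parse_args(argv)


def write_sample(input_path: str, output_path: str, records: int) -> None:
    """Copy the header and the first `records` lines to a new file."""
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.writelines(sample_lines(src, records))


args = parse_args(["full.tsv", "sample.tsv", "--records", "5"])
print(args.records)
```

A wrapper script could call `write_sample(args.input_path, args.output_path, args.records)` as its first step and pass the same paths on to the later stages.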

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd