Explanation of your algorithm and pseudocode

Reference no: EM131142323

I need help with my project for big data

It is up to us to define the specific design and to limit or expand the scope, but the project should involve substantial design, analysis, programming, and validation. I have listed the topic and technology that should be used for the project. You can make changes and consider other attributes for the system. Let me know the price and the process to move forward.

Topic: Recommendation or clustering system for IMDB movie dataset.

Technology: Spark will be used for solving the problem.

Data: https://www.imdb.com/interfaces

The movie recommendation system will consider the following attributes from the movie database.

1) Title

2) Genre

3) Actors

4) Actresses

5) Directors

6) Rating

Hypothesis: We wish to build a movie recommendation system that suggests movies a user might be interested in, based on the user's tastes, interests, and connections to other people. A report is also needed.
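As a rough illustration of the idea, the sketch below scores movies by how many attribute values (genre, cast, director) they share with a movie the user liked, using Jaccard similarity. It is plain Python on a tiny in-memory catalog, not a Spark job, and the titles, the `genre:`/`actor:`/`director:` token scheme, and the function names are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked: str, catalog: Dict[str, Set[str]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k titles most similar to `liked`, best match first."""
    target = catalog[liked]
    scored = [
        (title, jaccard(target, features))
        for title, features in catalog.items()
        if title != liked
    ]
    # Sort by descending similarity, then by title so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical toy catalog: title -> set of attribute tokens.
catalog = {
    "Movie A": {"genre:Drama", "actor:P1", "director:D1"},
    "Movie B": {"genre:Drama", "actor:P1", "director:D2"},
    "Movie C": {"genre:Comedy", "actor:P2", "director:D3"},
    "Movie D": {"genre:Drama", "actor:P3", "director:D1"},
}

print(recommend("Movie A", catalog))
```

On a full dataset the pairwise comparison would need to be distributed, for example with Spark joins on shared attribute values, which is where the chosen technology comes in.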

Parts of the report:

I. Design:

Design document should contain your proposed design of the solution.

  • Summary of problem definition - Focus on explaining what you want to do in the project, any assumptions, and limitations.
  • Description of input data - In the design document, you have to provide a summary of your data - for example, data format, attributes, and metadata. Please do not include data inside the report.
  • Explanation of your algorithm and pseudocode
  • Explanation of your Big Data strategy - Which Big Data strategy did you use and how does it make sense in the overall picture. If you used multiple technologies, list the project phases where you used each.
  • Create a data flow diagram (DFD) for your application - An example of a DFD for MapReduce is here: https://creately.com/diagram/example/h21wfdxq2/MapReduce Similarly, you can create a data flow diagram for your machine learning/data analysis strategy.
  • Details of how your application handles bad or missing data, and whether your strategy is robust, i.e. whether it can recover from errors. Similarly, if you use machine learning, explain how you handle overfitting.
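For the bad or missing data item, one possible approach is to validate each record as it is parsed and count what gets dropped, so a malformed row never aborts the run. The sketch below assumes tab-separated records and `\N` as the missing-value marker; the column order in `FIELDS` and the 0 to 10 rating range are assumptions for illustration, and the actual layout of the data used may differ.

```python
from typing import Dict, Iterable, List, Optional, Tuple

MISSING = r"\N"  # assumption: marker used for a missing field
FIELDS = ["title", "genre", "director", "rating"]  # hypothetical column order


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one tab-separated record; return None if it cannot be used."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return None  # malformed row: wrong number of columns
    row: Dict[str, object] = dict(zip(FIELDS, parts))
    if row["title"] in ("", MISSING):
        return None  # a record without a title cannot be recommended
    for key in ("genre", "director"):
        if row[key] == MISSING:
            row[key] = None  # keep the record, mark the attribute as unknown
    try:
        rating: Optional[float] = float(parts[3])
    except ValueError:
        rating = None  # missing or non-numeric rating
    # Keep only ratings inside the assumed 0-10 range.
    row["rating"] = rating if rating is not None and 0.0 <= rating <= 10.0 else None
    return row


def clean(lines: Iterable[str]) -> Tuple[List[Dict[str, object]], int]:
    """Return the usable records and the number of rejected lines."""
    good: List[Dict[str, object]] = []
    bad = 0
    for line in lines:
        row = parse_line(line)
        if row is None:
            bad += 1
        else:
            good.append(row)
    return good, bad
```

Reporting the rejected-line count alongside the results gives the report a concrete number for how much of the input was unusable.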

II. Analysis of Results:

 In this section, you will present your final results and analyze them. The following are key points:

 • Summarize your results well. This could be in the form of tables, graphs, plots, or other visualization tools.

 • Validate your results.

For example, you can compute the accuracy of your model on the test dataset, or use cross-validation on the training dataset, or you can show that there is a correlation between positive review and star rating, or that there is a correlation between positive sentiment and stock price. Try to come up with numerical results.

 • If your results are below expectation, explain the probable reasons.
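To make the validation point concrete, here is a minimal sketch of k-fold cross-validation that reports a numerical result (RMSE) per fold. The "model" is only a baseline that predicts the mean rating of the training folds; it stands in for whatever recommender is actually built, and the function names and the index-modulo fold assignment are illustrative assumptions.

```python
import math
from typing import List, Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def k_fold_rmse(ratings: Sequence[float], k: int) -> List[float]:
    """RMSE per fold for a baseline that predicts the training-fold mean.

    Assumes 2 <= k <= len(ratings) so that every fold has test and training data.
    """
    scores = []
    for fold in range(k):
        # Assign item i to fold i % k; the held-out fold is the test set.
        test = [r for i, r in enumerate(ratings) if i % k == fold]
        train = [r for i, r in enumerate(ratings) if i % k != fold]
        mean = sum(train) / len(train)
        scores.append(rmse(test, [mean] * len(test)))
    return scores


scores = k_fold_rmse([6.0, 8.0, 6.0, 8.0], k=2)
print(scores, sum(scores) / len(scores))
```

A real model is worth reporting only if its cross-validated error is clearly lower than this kind of baseline.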

III. Conclusion

The following are key points:

 • Explain how using Big Data helped you with this project, in particular how it helped you arrive at a better, faster, or more efficient solution.

  • Describe what you learned in this project.
  • Describe how your technique/strategy can be improved

PROJECT SOURCE CODE - Source Code and Sample data files:

Some of the coding requirements are as follows:

  • Please include a README file indicating which language and technology you used and how to compile your code.
  • You need to use HDFS for at least some part of the project.

- This could mean that you use HDFS for data extraction, pre-processing, or actual classification or clustering. The key is you have to use HDFS somewhere.

  • Your code should be well documented.
  • Ideally, you should create a UNIX script such that the entire workflow - data extraction, parsing, pre-processing, analysis, MapReduce, machine learning task - can be run using that script. The script can accept parameters from the command line.
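The single-entry-point idea can be sketched as a small command-line driver that runs the workflow stages in order and accepts parameters. The version below is written in Python rather than as a shell script; the stage names, the `--input` and `--start-at` options, and the `plan` function are hypothetical, and a real driver would call the actual code for each stage instead of only listing it.

```python
import argparse
from typing import List, Optional

# Hypothetical stage names, in the order the workflow is described above.
STAGES = ["extract", "parse", "preprocess", "analyze", "mapreduce", "ml"]


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for the workflow driver."""
    parser = argparse.ArgumentParser(description="Run the project workflow end to end.")
    parser.add_argument("--input", required=True, help="path to the input data, e.g. on HDFS")
    parser.add_argument("--start-at", choices=STAGES, default=STAGES[0],
                        help="first stage to run; earlier stages are skipped")
    return parser


def plan(argv: Optional[List[str]] = None) -> List[str]:
    """Return the stages that would run for the given command-line arguments."""
    args = build_parser().parse_args(argv)
    return STAGES[STAGES.index(args.start_at):]


# Example: resume from pre-processing on a hypothetical input path.
for stage in plan(["--input", "hdfs:///data/imdb", "--start-at", "preprocess"]):
    print(f"running stage: {stage}")  # placeholder for the real stage call
```

Being able to start at a later stage means a failed run can be resumed without repeating extraction and parsing.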

Please attach a sample of your data. This should not be the entire dataset, but just the top few lines, so the TA can run your code. About 1000 lines/records should be fine.





All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd