Reference no: EM133933475
Case: Predicting Parole Violators
Details
In this assignment, you will need to make use of the following data
Solve the Predicting Parole Violators problem on pages 402-405 of the textbook. Please turn in a PDF file and .ipynb file of your answers.
Here are scans of the book (for those who don't have the book) --
For b parts' last question, please report accuracy, recall, F1 score, AUC respectively. The threshold for calculation is 0.5
Part I: Basic Data Systems Concepts
For this part:
Please justify your answer and provide enough details to show your understanding on the concepts.
The writing needs to be coherent and readable. Please use fewer than 200 words; otherwise, your answer will not be considered (Canvas has a word counter at the bottom right of your answer box).
"If I had more time, I would have written a shorter letter." -- Blaise Pascal, Mathematician and philosopher
Question 1
Is Snowflake Data Platform a data lake house or a data warehouse?
Question 2
You are working for a plant shop and you want to suggest to the owner that it is time to move to a Cloud Data Platform like Snowflake from spreadsheets. What would you say?
Question 3
Your company is having trouble maintaining the ETL pipeline. What would have gone wrong? Under what scenarios would you recommend your company to try ELT instead?
Question 4
What are the differences among relational model, document data model, and vector data model?
Instructions on query problems
You need to write out the queries that output desired results to the asked questions. You do not need to paste or report any results. Queries will be enough.
Part II: SQL on restaurant reviews
Question 5
Calculate the average rating given by each customer and also show the individual rating that each customer gives to each restaurant.
Output customer id, customer name, restaurant id, restaurant name, individual rating, average rating renamed as avg_stars_by_customer.
Question 6
Calculate the highest rating given to each restaurant and also show the individual rating each customer gives to each restaurant.
Output customer id, customer name, restaurant id, restaurant name, individual rating, biggest rating renamed as highest_rating_by_restaurant. Get top-notch online assignment help.
Order by restaurant id.
Use window function.
Part III: SQL on TPC-H
Question 7
What is the total number of distinct parts offered by each supplier? The query should return the name of the supplier and the total number of parts.
Question 8
What is the cost of the most expensive part by any supplier? The query should return only the price of that most expensive part. No need to return the name.
Question 9
What is the cost of the most expensive part for each supplier? The query should return the name of the supplier and the cost of the most expensive part but you do not need to return the name of that part.
Question 10
What is the cost of the TOP TWO expensive parts for each supplier? The query should return the name of the supplier and the cost of the top two expensive parts but you do not need to return the name of those parts.
Question 11
What is the total number of customers per nation? The query should return the name of the nation and the number of unique customers.
Question 12
What is number of parts shipped between 10 October, 1996 and 10 November, 1996 for each supplier? The query should return the name of the supplier and the number of parts.
-- hint: use 'between' for date function
Part IV: Nested Semi-Structured Data
You will be querying information from a nested JSON data file about books.
Question 13
Use database LIBRARY_CARD_CATALOG you created in Day 3 class.
Day 3 file (semi-structure data).docx in attachments or zip folder ..use for below questions as well
Question 14
Create a table for your JSON data:
Name the table with NESTED_INGEST_JSON, under the LIBRARY_CARD_CATALOG database and PUBLIC schema; Name the VARIANT data type with RAW_NESTED_BOOK.
Question 15
What does the VARIANT data type do here?
Question 16
You created a stage named like_a_window_into_an_s3_bucket in the Day 3 class.
Now you will write command to copy the data from like_a_window_into_an_s3_bucket/json_book_author_nested.json to the table you created in Problem 14.
Please specify the file_format as JSON_FILE_FORMAT, which you created in the Day 3 class ( Day 3 file (semi-structure data).docx in attachments or zip folder) .
Question 17
Write a query to look at the entire JSON file you loaded into NESTED_INGEST_JSON. This is the books entity.
Question 18
Write a query to get all the values for the following key: year_published. Get top-notch online assignment help.
Question 19
Write a query to get the authors entity, which is nested in the books entity.
Question 20
Write a query to return the first name of the first author in each book.
Question 21
Flatten the authors entity with LATERAL+FLATTEN.
Question 22
Flatten the authors entity with TABLE + FLATTEN and only output each author's first name.
Question 23
Output each author's first name without the double quotes.
Question 24
Create a view named AUTHORS_NORMALIZED that has first_name as the first column and last_name as the second column. Rename the first column as FIRST_NM and rename the second column as SECOND_NM. Make sure that the values do not have double quotes around them.
Question 25
Extra 1 Point Problem: work on zero to snowflake tutorial on semi-structured data:
Click on Canvas->Modules->Day 1->Zero to Snowflake; link:
Follow the instructions in Modules 10-12 from Simple Data Pipeline to semi-structured data ;
Question 1: After running the two queries in "Step 1 - Querying VARIANT Data", what do you see in the result panel the ingredients for
sugar cone? (it should be the second records in your Results view)
Question 2: Explain in your own words what the example query in "Step 2 - Parsing Arrays with FLATTEN" does. Decipher the syntax including but limited to "::", ":", "LATERAL FLATTEN" etc.