Reference no: EM133996910
BIG DATA ANALYTICS FOR BUSINESS
Question 1
A global logistics company, "RapidRoute Logistics," relies heavily on Big Data to optimise delivery routes and manage real-time inventory. It collects enormous volumes of data from vehicle sensors (IoT), warehouse management systems, and customer feedback. However, its predictive analytics often prove inaccurate, leading to expensive delivery delays and erroneous stock forecasts. The analytics team suspects that poor Data Quality is the primary cause.
Identify and critically analyse four specific dimensions of Data Quality that are most likely compromised by the Big Data context of RapidRoute Logistics. For each dimension, propose an integrated solution and demonstrate how addressing it will directly improve the accuracy of predictive models and operational efficiency.
Question 2
A leading global supermarket chain, "OmniFresh Grocers," is struggling to optimise its customer experience across its complete omnichannel presence (physical stores, mobile application, and website). It possesses a wealth of customer data, including structured transaction history, semi-structured app clickstreams, in-store sensor data from smart trolleys (IoT), and unstructured customer reviews on social media. Despite this data abundance, it faces difficulty in two key areas:
Enhancing Product Recommendations to be more relevant and real-time across different channels.
Understanding the complex Customer Purchase Journeys that transition between online browsing and in-store purchase, and vice versa.
Describe in detail two (2) distinct applications of advanced data mining/analytical techniques (beyond simple regression or basic forecasting) that can be applied to OmniFresh Grocers' multifaceted dataset. For each application, clearly explain: (a) the specific technique(s) used, and (b) how the technique directly addresses one of the two business challenges listed above.
Question 3
Hierarchical clustering is a crucial unsupervised technique that reveals underlying structures in data. While both agglomerative (bottom-up) and divisive (top-down) methods achieve a nested structure, the choices made regarding distance, linkage, and splitting criteria profoundly influence the resulting hierarchy and its business interpretation.
Imagine a Big Data scenario where a FinTech company is clustering its global customer base based on two features: Average Monthly Transaction Volume and Geographical Latitude.
Explain the fundamental process (step-by-step) of an agglomerative hierarchical clustering algorithm.
Discuss how the choice between 'Single Linkage' and 'Complete Linkage' in the agglomerative method would yield fundamentally different cluster formations and dendrogram interpretations in this specific FinTech scenario.
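For illustration only, a minimal Python sketch of the agglomerative process under the two linkage criteria; the customer data below are synthetic stand-ins for the two FinTech features, and every value, scale, and cluster count is an assumption rather than part of the question's dataset.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for Average Monthly Transaction Volume and Geographical Latitude.
volume = np.concatenate([rng.normal(2000, 300, 50), rng.normal(15000, 2000, 50)])
latitude = np.concatenate([rng.normal(51, 2, 50), rng.normal(-33, 2, 50)])
X = StandardScaler().fit_transform(np.column_stack([volume, latitude]))  # scale before computing distances

# Agglomerative process: start with every customer as its own cluster, then
# repeatedly merge the two closest clusters until a single cluster remains.
Z_single = linkage(X, method="single")      # cluster distance = closest pair of points
Z_complete = linkage(X, method="complete")  # cluster distance = farthest pair of points

# Cutting each tree at four clusters shows how the linkage choice changes the groupings.
labels_single = fcluster(Z_single, t=4, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=4, criterion="maxclust")
print(np.bincount(labels_single)[1:], np.bincount(labels_complete)[1:])

Comparing the two label vectors (or plotting the dendrograms) makes the contrast concrete: single linkage tends to chain customers together along gradual changes in latitude or volume, while complete linkage favours compact, roughly equal-diameter groups.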
Question 4
A leading video streaming platform, "StreamVerse," is collecting massive amounts of user interaction data, including viewing history sequences (which shows were watched in what order), genres, and user demographics. It wishes to move beyond simple 'Users who watched X also watched Y' rules to discover more comprehensive and actionable associations using association rule mining.
Propose a detailed methodology for applying the Apriori algorithm to discover association rules from StreamVerse's data. Specify the necessary preprocessing step required to adapt sequential viewing data (time-series) for use with a non-sequential algorithm like Apriori.
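One possible shape such a methodology could take, sketched in Python with the mlxtend library; the example viewing sequences, titles, and thresholds are purely illustrative assumptions, not StreamVerse data.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical input: each row is one user's viewing history, in watch order.
sequences = [
    ["Show A", "Show B", "Show C"],
    ["Show B", "Show C"],
    ["Show A", "Show C", "Show D"],
]

# Preprocessing step: discard order and duplicates so each sequence becomes an
# unordered transaction ("basket" of shows), which is what Apriori expects.
baskets = [sorted(set(seq)) for seq in sequences]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)

frequent = apriori(onehot, min_support=0.3, use_colnames=True)  # thresholds are assumptions
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])

The basket conversion is the key adaptation: Apriori treats each user's history as a set, so the temporal order of viewing is deliberately thrown away at this stage.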
Explain the purpose and importance of the Lift metric in evaluating the discovered association rules. Imagine you find a rule X → Y with a high Support but a Lift value of 0.9. Critically interpret this result for StreamVerse.
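As a worked illustration (the numbers below are assumed, not given in the question): Lift(X → Y) = Confidence(X → Y) / Support(Y) = P(Y | X) / P(Y). If Support(Y) = 0.60 and Confidence(X → Y) = 0.54, then Lift = 0.54 / 0.60 = 0.9, meaning viewers of X are 10% less likely than the average user to watch Y; the rule's high support simply reflects that both titles are popular, not that X drives viewing of Y.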
Discuss two primary challenges or limitations StreamVerse would face when applying basic association rule mining to its sequential viewing data, and suggest a specific advanced technique to address each challenge.
Question 5
A major multinational bank, "GlobalTrust Financial," is launching a Big Data initiative to build a real-time Fraud Detection System using a variety of data sources, including structured transaction logs, semi-structured credit application forms, and unstructured customer interaction notes. Given the critical nature of fraud detection and strict regulatory requirements (compliance), Data Governance is paramount.
Select and describe three key dimensions of Data Quality (DQ) that would be most critical for the success of this specific Fraud Detection System. Justify your choices by explaining the potential negative impact of poor quality in each chosen dimension on the system's ability to detect fraud.
Explain how a robust Data Governance framework directly addresses the challenge of Data Variety (combining the different data types) faced by GlobalTrust Financial and discuss why this framework is essential for meeting financial regulatory compliance standards.
PART 2
Case Studies
Part 2 contains hyperlinks to files which you must download in order to answer the questions. To do this:
Log in to the student platform
Hold Ctrl and click the hyperlinks in the assessment paper to download the files.
Open the downloaded files on your device. OR
Download the datasets from the Past Paper section (Section 8.2) in the H11BD online learning platform.
Question 1
A school administration aims to use data analytics to identify the core drivers of student academic success and develop focused intervention strategies.
Required:
Given that the three subject scores (math_score, science_score, english_score) are highly likely to be strongly correlated with each other and the target variable (overall_score), discuss the potential problem this poses for a Multiple Linear Regression (MLR) model. Propose and justify two alternative regression-based modelling techniques that are specifically designed to handle or mitigate the effects of multicollinearity (e.g., in terms of coefficient stability and interpretability).
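For reference, the textbook objectives of two commonly proposed regularised alternatives, ridge and lasso regression, are shown below (standard notation, not taken from the dataset):

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|

The squared (L2) penalty shrinks correlated coefficients towards each other, stabilising them; the absolute (L1) penalty can set some coefficients exactly to zero, aiding interpretability.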
Develop an alternative regression model using R or Python to predict the overall_score. Your solution must include (see the illustrative sketch after this list):
Code for data loading, preprocessing (handling categorical variables), and splitting the data.
The application of standardisation (scaling) to the predictor variables.
The process for tuning the regularisation parameter (λ or α) using a technique such as cross-validation.
The final model evaluation using at least two appropriate metrics.
Present the code and the evaluation results (copy/paste or screenshot).
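A minimal Python sketch of the kind of workflow part (b) asks for; the file name ("student_data.csv"), the assumption that all non-target columns are predictors, and the choice of LassoCV are illustrative only and should be adapted to the actual dataset.

# Illustrative sketch only; adjust file and column names to the provided dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("student_data.csv")            # assumed file name
y = df["overall_score"]
X = df.drop(columns=["overall_score"])

cat_cols = X.select_dtypes(include="object").columns
num_cols = X.select_dtypes(exclude="object").columns

pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),                        # standardise predictors
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # encode categorical variables
])

# LassoCV tunes the regularisation strength (alpha/lambda) by internal cross-validation.
model = Pipeline([("prep", pre), ("lasso", LassoCV(cv=5, random_state=0))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Chosen alpha:", model.named_steps["lasso"].alpha_)
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R^2 :", r2_score(y_test, pred))

RidgeCV could be substituted for LassoCV with the same pipeline if the aim is to stabilise rather than zero out correlated coefficients.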
Based on the model developed in part (b), interpret the final non-zero coefficients (or the relative magnitude of the standardised coefficients).
Critically compare the coefficients produced by this regularised model with what you would typically expect from a standard MLR model in the presence of high multicollinearity. Discuss how the regularised model's coefficients offer a more stable basis for recommending interventions to the school.
Identify one Confounding Variable and one Mediating Variable that are likely unobserved (missing) in the dataset but are relevant to the students' performance.
Explain the conceptual difference between a confounding and a mediating variable, and discuss how the inclusion of the Mediating Variable (if available) could alter the interpretation of the direct effect of study_time on the overall_score.
Question 2
An insurance company, "PolicyWise," is highly successful in generating leads, but only a small percentage (historically less than 10%) of potential customers actually purchase a policy. It aims to use the 'insurance_data.csv' dataset to build a predictive model for 'Conversion_Status' that specifically focuses on identifying these rare positive cases.
The goal is not just to be accurate overall, but to maximise the chances of correctly identifying a genuinely converting customer while keeping false alarms manageable.
Required:
Explain the specific challenge that the historically low conversion rate poses when training a standard binary classification model (like Logistic Regression) on this dataset. Justify why Accuracy alone is an inadequate evaluation metric for this specific business problem, and propose two alternative, more appropriate classification metrics that PolicyWise should prioritise. Briefly explain what each proposed metric measures and why it is more relevant for identifying rare conversions.
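For intuition, a small worked example with assumed round numbers (not taken from insurance_data.csv): with 1,000 leads and a 10% conversion rate (100 converters, 900 non-converters), a model that predicts "no conversion" for every lead achieves Accuracy = 900 / 1,000 = 90% while its Recall on converters is 0 / 100 = 0; metrics focused on the positive class expose this failure immediately, whereas accuracy hides it.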
Develop a model using R or Python to predict the Conversion_Status. Your solution must include (see the illustrative sketch after this list):
Code for data loading, preprocessing (handling categorical variables), and splitting the data.
The implementation of one specific data or algorithmic technique to mitigate the issue of imbalanced data (e.g., Oversampling the minority class, using class weights).
Evaluation of the final model using the two metrics you proposed in part (a), along with the ROC AUC score.
Present the code and the evaluation results (copy/paste or screenshot).
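A minimal Python sketch of the workflow in part (b), using class weights as the imbalance-mitigation technique; the positive-class encoding, column handling, and library versions (recent scikit-learn) are assumptions, and oversampling (e.g. SMOTE) could be swapped in instead of class weights.

# Illustrative sketch only; adjust to insurance_data.csv's actual layout.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score

df = pd.read_csv("insurance_data.csv")
y = (df["Conversion_Status"] == 1).astype(int)   # assumes 1 marks a converted lead
X = df.drop(columns=["Conversion_Status"])

cat_cols = X.select_dtypes(include="object").columns
num_cols = X.select_dtypes(exclude="object").columns

pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# class_weight='balanced' re-weights the rare converters so the model does not
# simply learn to predict the majority "no conversion" class.
clf = Pipeline([
    ("prep", pre),
    ("logit", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]
print("Precision:", precision_score(y_test, pred))
print("Recall   :", recall_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))

# Rough feature-importance view for part (c): magnitude of standardised coefficients.
feature_names = clf.named_steps["prep"].get_feature_names_out()
coefs = pd.Series(clf.named_steps["logit"].coef_[0], index=feature_names)
print(coefs.abs().sort_values(ascending=False).head(5))

The final lines give one simple route to the feature ranking needed in part (c); a tree-based model's feature_importances_ attribute would be an equally acceptable alternative.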
Using the model developed in part (b), determine the top five most influential features (predictors) for conversion using the model's Feature Importance score. Based on the top three features, provide three distinct, actionable recommendations for PolicyWise's sales team to optimise their lead interaction strategy.
Critically discuss the importance of the ROC AUC score for PolicyWise's decision-making process. Explain what the AUC represents in practical terms, and why relying on the Confusion Matrix (which requires setting a fixed decision threshold) is riskier than using the AUC when deploying a system whose main goal is to prioritise scarce sales resources toward the best leads.