Reference no: EM133918144, Length: 1000 words
Artificial Intelligence and Machine Learning
Simulation and problem solving using advanced machine learning models
Assessment - Modelling and Simulation
Task
Using Orange Data Mining or Python, design and implement neural network-based machine learning models to solve real-world problems. Answer the accompanying questions with thoughtful analysis.
Assessment Description
Neural networks, especially their applications in deep learning, have gained significant popularity in recent years. Business Analytics professionals now have access to cloud-based hardware, enabling them to run deep learning models effortlessly via browsers and no-code platforms.
In this assessment, you will design, implement, and evaluate neural network-based models, such as image and text classification, using Orange Data Mining or Python. You will be provided with datasets to apply these models to real-world scenarios, assessing their strengths, limitations, and practical impact.
Assessment Instructions
Use Orange Data Mining or Python to implement neural network-based models to solve real-world problems. The datasets, parameters, and instructions are provided in the assessment sheet.
Based on software outputs, answer the accompanying questions in the assessment sheet.
Write a report of at most 1000 words that summarises your work and includes the answers to the questions from the assessment sheet. The report must be written using the Google Docs template shared by your lecturer.
1. Machine Learning Model Comparison
You are a data scientist tasked with solving a real-world healthcare challenge: building a machine learning system to predict breast cancer diagnoses (malignant or benign). Early detection is critical for improving patient outcomes, and your goal is to identify the most accurate predictive model using a provided dataset.
You will:
Explore the dataset.
Implement multiple machine learning algorithms.
Evaluate and compare model performance.
Select the best-performing model for deployment.
The dataset for this activity is available here: Breast Cancer Data. It contains diagnostic data for breast cancer cases, including various features extracted from cell nuclei in digitized images. There are 30 features.
Load the dataset, inspect it, and perform some descriptive analytics (summary statistics, correlation analysis, distributions analysis).
1.1 What are the percentages of benign and malignant diagnoses in the dataset? Provide a visual to justify your answer.
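For Python users, the class percentages asked for in 1.1 can be computed with a few lines of standard-library code. The sketch below is only an illustration: the label list `toy_labels` and the codes "B" and "M" are made-up stand-ins, not values from the provided dataset, and the function name `class_percentages` is hypothetical.

```python
from collections import Counter


def class_percentages(labels):
    """Return the share of each class label as a percentage of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up labels for illustration only ("B" = benign, "M" = malignant).
toy_labels = ["B", "B", "M", "B", "M", "B", "B", "M"]
percentages = class_percentages(toy_labels)
print(percentages)
```

The same helper applies unchanged to the daisy/non-daisy labels in 2.1 and the sentiment labels in 3.1; a bar or pie chart of the returned values would serve as the requested visual.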
1.2 What are the two most correlated features (respectively the two least correlated features) in the dataset? Provide a visual to justify your answer.
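One possible way to approach 1.2 in Python is to scan the upper triangle of the correlation matrix. The sketch below assumes that "most correlated" and "least correlated" are judged by the absolute value of the Pearson correlation, which the brief does not state explicitly. The columns `a`, `b`, and `c` are synthetic stand-ins, and `extreme_correlation_pairs` is a hypothetical helper name.

```python
import numpy as np


def extreme_correlation_pairs(X, names):
    """Return the feature pairs with the largest and smallest |Pearson r|."""
    corr = np.corrcoef(X, rowvar=False)            # columns are features
    rows, cols = np.triu_indices(len(names), k=1)  # each pair once, no diagonal
    strength = np.abs(corr[rows, cols])
    hi, lo = int(np.argmax(strength)), int(np.argmin(strength))
    most = (names[rows[hi]], names[cols[hi]], float(corr[rows[hi], cols[hi]]))
    least = (names[rows[lo]], names[cols[lo]], float(corr[rows[lo], cols[lo]]))
    return most, least


# Synthetic data: b is almost a multiple of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
most, least = extreme_correlation_pairs(np.column_stack([a, b, c]), ["a", "b", "c"])
print("most correlated:", most)
print("least correlated:", least)
```

A heatmap of the full correlation matrix, or scatter plots of the two selected pairs, would be a natural visual to accompany the result.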
1.3 Provide a visualization of the variables 'radius_worst' (feature) and 'diagnosis' (target variable) together. What can you say about the ability of the feature 'radius_worst' to predict the target variable 'diagnosis'? Justify your answer.
Perform Principal Component Analysis (PCA) on all the features.
1.4 What is the percentage of explained variance provided by the first 10 principal components? Provide a visual to justify your answer.
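The explained variance in 1.4 follows directly from the singular values of the centered data matrix. The sketch below is a minimal NumPy version under two assumptions that the brief does not confirm: features are standardized before PCA, and the random matrix `toy` merely stands in for the 30-feature dataset. The function name `cumulative_explained_variance` is hypothetical.

```python
import numpy as np


def cumulative_explained_variance(X, standardize=True):
    """Cumulative share of variance explained by successive principal components."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                 # PCA requires centered data
    if standardize:                        # assumption: unit-variance features
        X = X / X.std(axis=0, ddof=1)
    singular_values = np.linalg.svd(X, compute_uv=False)
    ratio = singular_values**2 / np.sum(singular_values**2)
    return np.cumsum(ratio)


# Random stand-in with the same number of features (30) as the real dataset.
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 30))
cumulative = cumulative_explained_variance(toy)
share_first_10 = 100.0 * cumulative[9]     # percentage for the first 10 components
print(f"First 10 components explain {share_first_10:.1f}% of the variance")
```

Plotting `cumulative` against the component index gives the usual cumulative explained variance curve, which can serve as the visual.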
From now on, you will only use the first 10 principal components of PCA as predictor variables to build different machine learning models. Perform stratified, replicable train-test splitting on the new dataset using the split ratio 80% training and 20% testing. Train the following machine learning classification models:
Logistic Regression: train a logistic regression model with no regularization.
Random Forest: train a random forest model with 500 trees, 5 features considered at each split, individual tree depth limited to 3, no splitting of subsets smaller than 5, and replicable training.
Neural Networks: train a neural network model with 2 hidden layers, 5 neurons per hidden layer, 'tanh' activation function, 'Adam' solver, no regularization, replicable training, and a maximum of 500 iterations.
1.5 Provide the predictive performance metrics table of the 3 machine learning models (logistic regression, random forest, neural networks) on the testing data for each category of the target variable and overall.
1.6 What is the best predictive model according to the F1 score and why? Provide its confusion matrix.
1.7 Suppose that the best predictive model predicts a breast cancer case in the testing dataset as a malignant diagnosis. What is the probability that the diagnosis is benign? Be sure to show all your workings; each step you take to reach your answer should be clearly presented.
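Questions 1.6 and 1.7 both reduce to arithmetic on the confusion matrix. The sketch below treats malignant as the positive class and uses entirely hypothetical counts (`tp`, `fp`, `fn`, `tn`), not results from any trained model. Under that convention, the probability that a case predicted malignant is actually benign is the share of false positives among all malignant predictions, that is, 1 minus the precision for the malignant class.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical test-set counts, with malignant as the positive class:
# tp = malignant predicted malignant, fp = benign predicted malignant,
# fn = malignant predicted benign,   tn = benign predicted benign.
tp, fp, fn, tn = 40, 2, 3, 69

precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# P(actually benign | predicted malignant) = fp / (tp + fp) = 1 - precision
p_benign_given_pred_malignant = fp / (tp + fp)

print(f"F1 (malignant) = {f1:.4f}")
print(f"P(benign | predicted malignant) = {p_benign_given_pred_malignant:.4f}")
```

With real results, the four counts would be read from the confusion matrix of the selected model on the testing data.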
2. Image Analytics
You are working as a machine learning engineer for a company specializing in automated plant identification systems. The company is developing a mobile application that helps users identify flowers in real time using image recognition. Your task is to build a predictive model that can accurately classify whether an image contains a daisy flower or not.
Accurate classification is essential for improving user experience and ensuring the reliability of the app's recommendations. The model you develop will serve as a prototype for future deployment in the app's backend system.
You will build a complete image classification workflow. The goal is to distinguish between daisy and non-daisy images using deep learning-based feature extraction and a custom neural network classifier.
The image dataset for this activity is available here: Flowers Data. It contains labeled examples of daisy and non-daisy images. Ensure the dataset is properly organized with clear labels for each class before importing it into Orange.
Use the dataset above to build an Orange predictive workflow that classifies whether an image shows a daisy flower or not (daisy vs non-daisy). Use a train-test split ratio of 75:25, with replicable, stratified sampling. Use SqueezeNet for feature extraction and a neural network for classification, with 2 hidden layers of 10 neurons each, ReLU activation, Adam optimizer, regularization strength of 0.0005, a maximum of 250 training steps, and replicable training.
2.1 What are the percentages of daisy and non-daisy flowers in the dataset? Provide a visual to justify your answer.
2.2 Provide the predictive performance metrics table on the testing data for each category of the target variable and overall.
2.3 In the testing set, how many daisy flowers have been wrongly classified as non-daisy? Provide a visual to justify your answer.
2.4 Provide a visual of the daisy flower images in the testing set that have been wrongly classified as non-daisy.
2.5 Classify the image available here: Image To Classify, using your Orange predictive workflow. What are the predicted probabilities of daisy and non-daisy? Provide a visual to justify your answer.
2.6 Did the predictive system correctly classify the previous image (in 2.5)? In either case, what could be the reason?
Suppose we use three principal components (PC1, PC2, PC3) of the extracted features as input features (x1, x2, x3) and perform daisy flower classification using a neural network that has no hidden layers (shown above).
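A neural network with no hidden layers connects the inputs directly to the output unit, so its forward pass is a weighted sum followed by an output activation. The sketch below assumes a single output unit with a sigmoid activation, which makes the network equivalent to logistic regression. The weights, bias, and input values are hypothetical, since the referenced figure is not reproduced here, and `daisy_probability` is an illustrative name.

```python
import math


def daisy_probability(x, weights, bias):
    """Forward pass of a no-hidden-layer network: sigmoid(bias + sum of w_i * x_i)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters and inputs (x1, x2, x3), for illustration only.
weights = (1.0, -1.0, 0.5)
bias = 0.0
x = (2.0, 1.0, 0.0)

# z = 0 + 1*2 + (-1)*1 + 0.5*0 = 1, so the output is sigmoid(1).
p_daisy = daisy_probability(x, weights, bias)
print(f"P(daisy) = {p_daisy:.4f}")
```

The predicted probability of non-daisy is then 1 minus this value, and a threshold of 0.5 on the output corresponds to the sign of the weighted sum.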
3. Text Analytics
You are a data analyst working for a social media monitoring firm that provides sentiment insights into brands, public figures, and organizations. Your current task is to analyze tweets to determine how people feel about specific entities, such as companies, politicians, or celebrities.
Understanding sentiment at the entity level is critical for reputation management, targeted marketing, and strategic decision-making. Your goal is to build a model that can accurately classify the sentiment expressed in a tweet about a given entity.
Using a labeled Twitter dataset, you will perform entity-level sentiment analysis. For each tweet and its associated entity, your task is to classify the sentiment as Positive, Negative, or Neutral. The dataset is available here: Twitter Data. The dataset has already been split into training and testing sets. Use 'twitter_training' as the training set and 'twitter_testing' as the testing set. Each record in each dataset contains: tweet id, entity, sentiment label, and tweet content.
Carry out the following steps on the English-language text data:
Preprocessing: Transformation (remove URLs, remove accents, lowercase, parse HTML).
Preprocessing: Tokenization using the Regexp algorithm, and normalization with Lemmagen Lemmatizer.
Preprocessing: Filtering stopwords and numbers.
Data Exploration: Visualize the target variable, Generate Word Cloud.
Embedding: Convert the cleaned text into numerical features using the fastText algorithm with mean aggregator.
Training: Train a neural network classification model using both the extracted features and the entity variable as predictor variables, with 2 hidden layers of 10 neurons each, tanh activation function, Adam optimizer, no regularization, a maximum of 500 training steps, and replicable training.
Evaluation: Evaluate the predictive performance of the trained model on the testing dataset.
3.1 What is the percentage of each category of the target variable in the training data? Provide a visual to justify your answer.
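The transformation, tokenization, and filtering steps above can be approximated in plain Python, as sketched below. This is not Orange's implementation: `html.unescape` only decodes HTML entities rather than fully parsing markup, the stopword set is a tiny illustrative list, and lemmatization and fastText embedding are left out because they require external models. The function name `preprocess` and the sample sentence are made up for illustration.

```python
import html
import re
import unicodedata

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TOKEN_PATTERN = re.compile(r"\w+")  # simple regexp tokenizer
STOPWORDS = {"the", "a", "an", "is", "and", "to", "of", "in", "this"}  # illustrative


def preprocess(text, stopwords=STOPWORDS):
    """Clean a tweet and return its tokens, minus stopwords and numbers."""
    text = URL_PATTERN.sub(" ", text)                      # remove URLs
    text = unicodedata.normalize("NFKD", text)             # split accents off letters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()                                    # lowercase
    text = html.unescape(text)                             # decode HTML entities
    tokens = TOKEN_PATTERN.findall(text)                   # regexp tokenization
    return [t for t in tokens if t not in stopwords and not t.isdigit()]


tokens = preprocess("The café is GREAT &amp; open 24 hours, see https://example.com/menu")
print(tokens)
```

Counting the tokens returned across all tweets, for example with `collections.Counter`, gives the word frequencies behind a word cloud.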
3.2 What is the most frequent word in the dataset? Provide a visual to justify your answer.
3.3 Report the predictive performance metrics on the testing dataset for each category of the target variable and overall.
3.4 In the testing dataset, how many positive tweets have been wrongly classified as Negative? Provide a visual to justify your answer.
3.5 Provide a visual of positive tweets in the testing set that have been wrongly classified as Neutral.