Congratulations! You made it to week 4 of the AI Developer Challenge! This week we will continue where we left off last week and train our regression model using the dataset we uploaded in week 3 and the Data Attribute Recommendation service (DAR).
Before we start, let me give you a brief intro to machine learning so that you know what kind of model we are going to train this week. Machine learning is a subfield of AI that focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions based on it. Traditionally, there are three main learning paradigms: supervised, unsupervised, and reinforcement learning. There are also different types of machine learning models, such as regression, classification, clustering, and association rules, each suited to a different kind of problem.
Regression models predict continuous values, such as a person’s salary based on their years of experience and education, or avocado prices based on region, type, and the other values in our dataset. For the regression model we will train this week, we are using a supervised approach because we provide labeled training data: the training dataset includes the avocado prices from past purchases, and the model learns from them to make future predictions.
There are different supervised regression algorithms that you can use, such as Linear Regression, Support Vector Regression, Decision Tree Regression, Gradient Boosting Regression, and many more. The Data Attribute Recommendation service uses a neural network for regression that seeks to minimize the mean squared error (MSE). If you want to learn more about the regression capabilities of DAR, check out this blog post.
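To make the MSE mentioned above more concrete, here is a minimal Python sketch that computes it by hand, together with two related error metrics (MAE and MAPE) that show up again when we look at the training results. The price values are made up for illustration and are not taken from the avocado dataset, and this is not how DAR computes its metrics internally; it only shows the standard textbook formulas.

```python
def mse(actual, predicted):
    """Mean Squared Error: average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean Absolute Error: average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Assumes no actual value is zero (the division would fail otherwise).
    """
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices, purely for illustration.
actual = [1.0, 2.0, 4.0]
predicted = [1.5, 2.0, 3.0]

print("MSE :", round(mse(actual, predicted), 4))
print("MAE :", round(mae(actual, predicted), 4))
print("MAPE:", round(mape(actual, predicted), 2), "%")
```

Note how the single largest error (1.0) contributes 1.0 to the MSE sum but the smaller one (0.5) contributes only 0.25: squaring makes large errors weigh more heavily than small ones.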
- By using the GET List Executables request, you can get all the Executable IDs. We will use the Executable ID for the regression model: “0e06e0eb-b0b2-41c6-b204-49ebc091d55c”
- Assign the Executable ID (“0e06e0eb-b0b2-41c6-b204-49ebc091d55c”) to the trainingExecutableIdRegression variable in your environment.
- Use the POST Create Training Configuration Regression request to create a training configuration for DAR. This tells DAR which algorithm, dataset, and dataset schema to use during training. The datasetSchemaArtifactIdRegression and the datasetArtifactIdRegression should have been assigned last week. If not, you can use the GET List Artifacts request to get the missing values.
- Now you need to create a training execution using the configuration from the previous step. To do so, use the POST Create Training Execution Regression request. If the variable trainingConfigurationIdRegression was not assigned automatically, you can assign it in the Environment yourself. It is the ID that was returned by the Create Training Configuration Regression request.
- The current status of your training execution is “UNKNOWN” and the target status is “COMPLETED”. The status might change to “PENDING” and should then change to “RUNNING” as soon as the necessary resources have been allocated and the training has started. You can check the status of your training by using the GET Get Execution Regression request.
- The output of the training will be a model artifact and the training status “COMPLETED”.
- Use the GET Get Metric Details request to check your model’s performance! You need to assign the training execution ID to the variable trainingExecutionId in your environment to get the results.
- MSE (Mean Squared Error) is a common metric for measuring the accuracy of a machine learning model in regression problems. It calculates the average squared difference between the predicted and actual values, which tells us about the average magnitude of error in the model. A smaller MSE is more favorable as it suggests lower error magnitudes.
- MAE (Mean Absolute Error) is another metric used for regression models. It calculates the average absolute difference between the predicted and the actual values, giving us another perspective on the error magnitude of the model. MAE is less sensitive to outliers than MSE because it doesn’t square the errors in the calculation. A smaller MAE is again more favorable, indicating less difference between predicted and actual values.
- MAPE (Mean Absolute Percentage Error) measures the average absolute percent difference between the predicted and actual values. It is usually used when you want to understand the error relative to the size of the actual values. It is expressed as a percentage, and smaller MAPE values indicate a better fit of the model to the data.
- A feature contribution score quantifies the importance of each variable in making a prediction and provides insights into which features are most influential in the model. If a feature has a high positive feature contribution score, it means that the feature significantly contributes to pushing the model’s output higher for a given prediction. Conversely, if a feature has a high negative score, it plays a significant role in pushing the model’s output lower. Knowing which features are most important can help improve your model by focusing on those variables during feature engineering.
- SUBMIT a screenshot of your model’s metrics like this:
Stay tuned for next week to deploy your model and run inferences to see your model in action!
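If you would rather not re-send the GET Get Execution Regression request by hand while the training runs, the status check can be scripted. The following is a minimal Python sketch of such a polling loop. It does not call DAR itself: `fetch_status` is a hypothetical callable that you would have to implement (for example, by sending the same request you use in your API client and returning the status string from the response), and the function name, the interval, and the retry limit are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_completion(
    fetch_status: Callable[[], str],
    interval_seconds: float = 30.0,
    max_checks: int = 120,
) -> str:
    """Poll until the training execution reports "COMPLETED".

    fetch_status is a hypothetical helper supplied by the caller; it should
    return the current status string, e.g. "UNKNOWN", "PENDING", "RUNNING",
    or "COMPLETED" (the statuses mentioned in this post).
    """
    status = "UNKNOWN"
    for _ in range(max_checks):
        status = fetch_status()
        print(f"Training status: {status}")
        if status == "COMPLETED":
            return status
        time.sleep(interval_seconds)
    raise TimeoutError(f"Training did not complete; last status was {status}")


if __name__ == "__main__":
    # Simulated status sequence standing in for real API responses.
    simulated = iter(["UNKNOWN", "PENDING", "RUNNING", "COMPLETED"])
    wait_for_completion(lambda: next(simulated), interval_seconds=0)
```

The sketch only treats “COMPLETED” as a stopping condition and otherwise gives up after a fixed number of checks. If your execution can end in other statuses, you would want to stop on those as well.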