Data Discovery using ChatGPT in SAP Analytics Cloud

Part-1 – Bringing ChatGPT to SAP Analytics Cloud using a Custom Widget

Part-2 – Data Discovery using ChatGPT in SAP Analytics Cloud

In my last blog post, “Bringing ChatGPT to SAP Analytics Cloud using a Custom Widget,” we walked through a step-by-step process of integrating ChatGPT with SAP Analytics Cloud using the OpenAI API. Now, let’s take this integration to the next level. The focus of this blog is extending that integration to unleash the power of large language models like ChatGPT for data discovery, using model data from SAP Analytics Cloud.

I have aimed to keep things simple and easy to follow, making this accessible even to beginners. We’ll achieve our goal in just 7 easy steps, so let’s dive right in.

Step 1: Prerequisites:

The first step is to complete all the steps from part-1 of the blog series. If you have already completed them, we are good to go.

Step 2: Creating a data model

The data used in this demonstration comes from Kaggle. Being publicly available, it makes it easy for you to follow along with the steps in this blog.

Data URL ->  80 Cereals | Kaggle

Download the data from the URL above and create an analytic model in SAP Analytics Cloud. Please ensure that it has the same dimensions and measures as shown in Figure 1 and Figure 2 below.

Note: Please take a moment to review the “About Dataset” section to enhance your understanding of the data and its dimensions. This will give you a clearer context of the data, enabling you to formulate complex queries to test the power of the LLM at the end of the development process.

Figure 1: The figure shows the card view of the model.

 

Figure 2: The figure shows the model structure.

 

Step 3: Adding table and script variable:

As you will notice, the layout is almost the same as in the part-1 blog, with two new additions, i.e., a table and a script variable, as shown in Figure 3 below.

 

Figure 3: The figure shows the widgets required to create the UI.

 

The table is added to utilize the Table API in the SAC Analytics Designer for accessing the underlying data. The table should be structured similarly to the one in Figure 4 below.

The script variable is named “contextData” and is of type string, as illustrated in Figure 5 below.

Figure 4: The figure shows the builder panel of the table.

Figure 5: The figure shows the structure of the script variable.

 

Step 4: Deciding the right format for data input

In this demonstration, I have opted to format the data input in CSV style, taking into account the following considerations:

Data type: Simple

Use case: Data Discovery

Estimation of token usage (Input): CSV appears more cost-effective than JSON, requiring fewer tokens per record for data input, as shown in Figure 6 below (a small side-by-side example follows the figure). The appropriate format may vary depending on the specific case. For example, if the use case requires using data aggregates as the data input rather than the entire data set, JSON is more suitable for the complex structure. You can also experiment with the tokenizer to find the right format for your use case by clicking here.

Figure 6: The figure shows the comparison of token usage per record between the CSV and JSON styles, respectively.
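To make the comparison concrete, here is one and the same record in both styles (the values are purely illustrative, approximated from the Kaggle dataset). The CSV line carries only the values, while JSON repeats every field name for every record, which is what drives up the token count:

CSV:
100% Bran,N,C,70,4,1,130,10,5,6,280,25,1,0.33,68.4

JSON:
{"Name_of_cereal": "100% Bran", "Manufacturer_of_cereal": "N", "manufacturer_type": "C", "calories": 70, "protein": 4, "fat": 1, "sodium": 130, "fiber": 10, "carbo": 5, "sugars": 6, "potass": 280, "vitamins": 25, "weight": 1, "cups": 0.33, "consumer_rating": 68.4}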

OpenAI charges for both input and output token usage. If you want to limit the output tokens, you can consider the following methods: 1) using prompts, 2) parameter tuning. In this blog, we will only use prompts (in Step 6) to limit output token usage.
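For reference, parameter tuning would happen inside the custom widget from part-1, where the request body for the completions endpoint is assembled. The sketch below only illustrates the idea; “max_tokens” and “temperature” are standard OpenAI completion parameters, while the surrounding variable names are assumptions about the widget’s internal code:

// sketch only: adjust where the custom widget builds the OpenAI request body
var requestBody = {
	"model": "text-davinci-002",   // the completion model carried over from part-1
	"prompt": prompt,              // the prompt passed into the widget's post() method
	"max_tokens": 60,              // hard upper limit on the number of output tokens
	"temperature": 0.2             // lower values make the answer less random
};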

 

Step 5: Add Script to “Application – onInitialization”:

In the previous step, we determined the appropriate format for the data input. In this step, we will write a script using the APIs provided in the Analytics Designer to structure the model data into CSV format. If you are following along, copy and paste the code below into the “Application – onInitialization” event.

 

//getting the members of the "name" dimension
var nameDim = Table_1.getDataSource().getMembers("name");

//getting the result set containing the measures shown in the table
var accData = Table_1.getDataSource().getResultSet();

//storing the column names as the first line of the CSV string
var stringCSV = "Name_of_cereal,Manufacturer_of_cereal,manufacturer_type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,weight,cups,consumer_rating" + "\n";

//Algorithm to create SAC data in CSV style format
for (var i = 1; i < nameDim.length; i++)
{
	var name = nameDim[i].id;

	stringCSV = stringCSV + name + ",";

	var flag = true;

	for (var j = 0; j < accData.length; j++)
	{
		if (accData[j]["name"].id === name)
		{
			//append the manufacturer and type dimensions only once per cereal
			if (flag === true)
			{
				stringCSV = stringCSV + accData[j]["mfr"].id + ",";
				stringCSV = stringCSV + accData[j]["type"].id + ",";
				flag = false;
			}
			if (accData[j][Alias.MeasureDimension].description === "calories")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "protein")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "fat")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "sodium")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "fiber")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "carbo")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "sugars")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "potass")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "vitamins")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "weight")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			if (accData[j][Alias.MeasureDimension].description === "cups")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue + ",";
			}
			//"rating" is the last column, so no trailing comma
			if (accData[j][Alias.MeasureDimension].description === "rating")
			{
				stringCSV = stringCSV + accData[j][Alias.MeasureDimension].rawValue;
			}
		}
	}
	//end the row for this cereal with a newline
	stringCSV = stringCSV + "\n";
}

//passing the CSV style data from the string variable to the script variable contextData
contextData = stringCSV;
//console.log(contextData);

 

Note: The shelf dimension has been left out of the data input because it doesn’t add any valuable information.
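For orientation, the first lines of contextData should then look roughly like this (the values below are illustrative and not copied from the model):

Name_of_cereal,Manufacturer_of_cereal,manufacturer_type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,weight,cups,consumer_rating
100% Bran,N,C,70,4,1,130,10,5,6,280,25,1,0.33,68.4
100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,1,1,34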

 

Step 6: Add Script to “Button_submit – onClick”:

Please remove the script from the part-1 blog and add the script below to the “Button_submit” onClick event. Replace SECRET_API_KEY with your secret API key.

 

//OpenAI endpoint and API key used by the custom widget from part-1
var endpoint = "completions";
var APIKey = "SECRET_API_KEY";

//reading the user's question and prepending the CSV context data to it
var userPrompt = InputField_prompt.getValue();
var prompt = "context data:" + contextData + "," + "Please answer the queries using the context data in less than 30 words based on prompt followed after this line:" + userPrompt;

//sending the prompt to ChatGPT via the custom widget and displaying the response
var response = chatGPTInSAC_1.post(APIKey, endpoint, prompt);
Text_response.applyText(response);
//console.log(["output", response]);

 

 

Figure 7: The figure shows the script for “onClick” event in button.

The prompt written in Figure 7 is deliberately straightforward, aiming to limit token usage in order to stay within the free usage limits. This approach allows both free and paid users to test the application outlined in this blog. If you are a paid user, you have the option to enhance the prompt with additional details, such as descriptions for each column. These additional details can give ChatGPT better context.
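As an illustration, such an enhanced prompt could prepend short column descriptions (paraphrased from the Kaggle “About Dataset” page) before the context data; the wording below is only a sketch, not the prompt used in this demonstration:

//optional: give ChatGPT brief column descriptions as additional context
var columnInfo = "Column descriptions: Manufacturer_of_cereal is the manufacturer code, manufacturer_type is C for cold or H for hot cereal, potass is potassium in milligrams, consumer_rating is the consumer rating of the cereal. ";
var prompt = columnInfo + "context data:" + contextData + "," + "Please answer the queries using the context data in less than 30 words based on prompt followed after this line:" + userPrompt;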

Please note: Embedding models like “text-embedding-ada-002” are the right choice for sharing large amounts of data, as they offer significant cost savings compared to “text-davinci-002”. However, for the sake of simplicity and to maintain continuity with the part-1 blog, I’ve opted to continue using the same model. You can learn more about OpenAI models by clicking here.

 

Step 7: Data Discovery:

In this step, we will prompt ChatGPT to see whether it can answer complex queries using the input data from this demonstration.

The query I used is “Name one cereal that is best for someone who wants to gain muscle and reduce fat. Additionally, take rating in consideration. Explain why you suggested it?”

Please notice that we have left it up to ChatGPT to figure out several things on its own, as shown in Figure 8 below, such as:

  1. Interpreting the query.
  2. Which nutrient is required to gain muscle?
  3. Which nutrients would have to be less to reduce fat?
  4. We vaguely mentioned “take rating in consideration,” and it figured out that the rating has to be high.
  5. Finding a cereal that fulfills all three criteria, i.e., high protein, low fat, and a good rating.

It did a good job of finding such a cereal, as shown in Figure 8.

 

 

Figure 8: The figure shows the output based on the user prompt.

 

Let’s go one step further and make a small change to the variable “var prompt” in the “Button_submit” onClick event from Step 6. Add “in the same language” to the prompt, as shown in Figure 9 below. Now, write the same query in German, i.e., “Nennen Sie ein Müsli, das am besten für jemanden geeignet ist, der Muskeln aufbauen und Fett abbauen möchte. Berücksichtigen Sie dabei auch die Bewertung. Erklären Sie, warum Sie es vorgeschlagen haben?”, as shown in Figure 10 below.
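For reference, the adjusted line could look roughly like this; the exact placement of the added phrase is shown in Figure 9:

//adding "in the same language" so the answer follows the language of the query
var prompt = "context data:" + contextData + "," + "Please answer the queries using the context data in less than 30 words in the same language based on prompt followed after this line:" + userPrompt;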

Figure 9: The figure shows the change made in prompt.

 

 

Figure 10: The figure shows the output in German based on the user prompt.

As shown in Figure 10, our analytics application can now handle queries in German and more than 50 other languages. This minor prompt modification has significantly increased the scalability of the analytic application.

This simple demonstration illustrates the powerful capabilities of a language model like ChatGPT and how it can be used for data discovery.

Note: The output is inconsistent at times. The next section briefly discusses some mitigation techniques to improve the performance of the model.

Limitations:

Before implementing such a solution in a business scenario, it’s crucial to consider various limitations, particularly regarding accuracy (often termed hallucination), as well as the consistency and randomness of outputs. Nevertheless, these limitations can be significantly mitigated by selecting appropriate models, tuning hyperparameters, employing prompt engineering techniques, and fine-tuning the model with business-specific data.

Conclusion:

In this blog post, we have leveraged the integration covered in part-1 of the blog series and created an analytic application that empowers users to harness the capabilities of ChatGPT for data discovery. Moreover, the application can be extended and customized to suit specific use cases such as search, recommendations, anomaly detection, and classification.

If you find this blog useful, please like this blog post and follow me for content related to SAP Analytics Cloud. If you have any questions or feedback, please leave a comment below.

 

References:

Bringing ChatGPT3 to SAP Analytics Cloud using a Custom Widget
