Details of the dataflow from SAP BW/HANA to Databricks can be read here.
Databricks to BW flow
How SAP DI picks up the data from Databricks:
On the return journey, data can be retracted using two techniques:
- Databricks informs SAP DI to pull the data once it is ready: This is the simpler method, where the SAP DI-provided API (/v1/runtime/graphs) is used to invoke a DI graph that pulls the data from the Databricks Delta table (see the sketch after this list). This becomes challenging if SAP DI sits behind a firewall, which means Databricks cannot reach it directly.
- SAP DI pulls the data from Databricks at regular intervals: SAP DI polls Databricks to check for the availability of new data. This is preferred when SAP DI is behind a firewall.
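For the first technique, the call from Databricks (for example, from a notebook step at the end of the export) could look roughly like the Python sketch below. The host, tenant, credentials, graph name and payload fields are placeholders/assumptions; only the /v1/runtime/graphs endpoint itself comes from the setup described above, and the exact request body and URL prefix may differ per SAP DI version.

```python
import os
import requests

# Hypothetical SAP DI landscape details -- replace with your own values.
DI_HOST = "https://my-di-instance.example.com"   # assumption: DI is reachable from Databricks
DI_TENANT = "default"
DI_USER = "di_user"
DI_PASSWORD = os.environ["DI_PASSWORD"]

def trigger_di_graph(graph_name: str, period: str) -> str:
    """Ask SAP DI to start the pull graph for one period (payload fields are assumptions)."""
    # The service prefix in front of /v1/runtime/graphs depends on your DI installation.
    url = f"{DI_HOST}/app/pipeline-modeler/service/v1/runtime/graphs"
    payload = {
        "src": graph_name,                      # graph to start, e.g. "com.acme.databricks_to_hana"
        "name": f"pull_{period}",               # run name, useful in the DI monitor
        "configurationSubstitutions": {"period": period},   # parameters consumed inside the graph
    }
    resp = requests.post(
        url,
        json=payload,
        auth=(f"{DI_TENANT}\\{DI_USER}", DI_PASSWORD),   # DI basic auth is typically tenant\user
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("handle", "")        # run handle returned by DI

if __name__ == "__main__":
    run_id = trigger_di_graph("com.acme.databricks_to_hana", period="2024011")
    print("Started DI graph run:", run_id)
```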
What if you need to load from Databricks in parallel:
If the requirement is to extract data from Databricks very quickly, in parallel, it is important to maintain status tables. These can be considered a custom checkpoint setup that tracks the status of the ongoing parallel pipelines.
Status tracking for parallel loads is required in the following scenarios:
- A single SAP DI extraction process pulls data from Databricks in parallel.
- Example: 5 graphs loading data for 5 periods/months in parallel. The status needs to be tracked to make sure the same period is not pulled again.
- A previous pull from SAP DI overlaps with a new SAP DI pull request.
- Here it is important that multiple pulls do not extract the same data more than once. For example, if the last pull brought data for periods 1-10, the next one should only pull periods 11-12. A status table is needed to track such scenarios (a minimal sketch of this checkpoint logic follows below).
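As an illustration of the overlap scenario, the checkpoint logic itself can be very small: compare the requested periods against what the status table already records. The sketch below is plain Python with the status rows mocked as a list of dictionaries; in the real setup these rows would be read from the Delta status table, and the column names LOAD_PERIOD/LOAD_STATUS are purely illustrative.

```python
# Hypothetical status rows as they might be read from a Delta status table.
status_rows = [{"LOAD_PERIOD": p, "LOAD_STATUS": "COMPLETED"} for p in range(1, 11)]

def periods_to_pull(requested_periods, status_rows):
    """Return only the periods that are neither completed nor currently being pulled."""
    blocked = {
        row["LOAD_PERIOD"]
        for row in status_rows
        if row["LOAD_STATUS"] in ("IN_PROGRESS", "COMPLETED")
    }
    return [p for p in requested_periods if p not in blocked]

# The last pull covered periods 1-10, so a request for 1-12 should yield only 11 and 12.
print(periods_to_pull(range(1, 13), status_rows))   # -> [11, 12]
```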
Extraction from Databricks is performed in two steps:
- Databricks ODBC driver to read the data: The ODBC driver was used to extract the data from Databricks in a highly efficient manner. It also supports auto-indexing, so the data can be extracted in packages and only once (see the sketch after this list).
- Update the status tables: As this requires an update to the Delta table, we deployed notebooks via workflow jobs that update the tables with the execution status of the ongoing data loads (e.g. which period load is in progress or completed). The job is triggered using the Jobs API (/api/2.1/jobs/run-now).
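A rough Python equivalent of the two steps might look like the sketch below (in SAP DI this logic sits inside graph operators rather than a standalone script). The workspace host, HTTP path, table names and job id are placeholders; the ODBC connection attributes are the usual ones for the Databricks (Simba Spark) driver but may differ per installation, and only the /api/2.1/jobs/run-now endpoint comes from the setup described above.

```python
import os
import pyodbc
import requests

DATABRICKS_HOST = "adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace host
HTTP_PATH = "/sql/1.0/warehouses/abcdef1234567890"               # placeholder SQL warehouse path
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Step 1: read one period from the Delta table via the Databricks (Simba Spark) ODBC driver.
conn = pyodbc.connect(
    "Driver=Simba Spark ODBC Driver;"
    f"Host={DATABRICKS_HOST};Port=443;HTTPPath={HTTP_PATH};"
    "SSL=1;ThriftTransport=2;AuthMech=3;UID=token;"
    f"PWD={TOKEN};",
    autocommit=True,
)
cursor = conn.cursor()
cursor.execute(
    "SELECT * FROM finance.fact_actuals WHERE fiscal_period = ?",   # hypothetical table
    "2024011",
)
rows = cursor.fetchmany(50000)      # read in packages instead of everything at once
conn.close()

# Step 2: trigger the status-update notebook through a workflow job (/api/2.1/jobs/run-now).
resp = requests.post(
    f"https://{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": 111222333,                                         # placeholder job id
        "notebook_params": {"period": "2024011", "status": "COMPLETED"},
    },
    timeout=30,
)
resp.raise_for_status()
print("Triggered status-update run:", resp.json()["run_id"])        # only confirms the trigger
```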
Updating the tables is also possible via the SQL Statement Execution API (/api/2.0/sql/statements/) in newer versions. It is worth trying, as you can send SQL statements directly from SAP DI. Here you get a quick response on whether the SQL statement was successful or not and can take action accordingly, as sketched below. With the Jobs API, the response only tells you whether the job was triggered; it does not give the status of the whole job execution.
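A sketch of that alternative, again with a placeholder warehouse id and hypothetical table/column names, could look like this. Waiting synchronously via wait_timeout is what gives the immediate success/failure answer mentioned above; the exact field names should be checked against the API version in your workspace.

```python
import os
import requests

DATABRICKS_HOST = "adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace host
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Mark one period as completed in the (hypothetical) status table via the
# SQL Statement Execution API, waiting synchronously for the result.
resp = requests.post(
    f"https://{DATABRICKS_HOST}/api/2.0/sql/statements/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "warehouse_id": "abcdef1234567890",        # placeholder SQL warehouse id
        "statement": (
            "UPDATE finance.load_status "
            "SET load_status = 'COMPLETED' "
            "WHERE fiscal_period = '2024011'"
        ),
        "wait_timeout": "30s",                     # wait for the statement result, not just the trigger
    },
    timeout=60,
)
resp.raise_for_status()
state = resp.json()["status"]["state"]             # e.g. SUCCEEDED, FAILED, RUNNING
if state != "SUCCEEDED":
    raise RuntimeError(f"Status-table update did not succeed: {state}")
print("Status table updated, state =", state)
```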
How data is sent to SAP BW:
The data is then sent to SAP HANA tables using the standard SAP HANA operator. It is advisable to keep the operator in INSERT-only mode, as it gives best-in-class write performance. This setup is not idempotent on the HANA side, which is the trade-off for the speed.
REMEMBER: SAP HANA does not support multiple parallel updates to the same HANA table at the same time. Although it worked properly in our case, it is not a supported feature.
SAP DI can efficiently trigger process chains to load the data further from the SAP HANA tables (flow: table -> calculation view -> DataSource) into BW providers.
We noticed that the standard SAP DI operator waits for the process chain execution response for the allocated wait time; however, this was not a useful feature in our case. We did not want to occupy DI cluster resources until the process chain completed, but rather release them and use them for the next parallel loads.
To address this problem, we developed a BAdI (definition: BADI_LTAPE_ENGINE_OPERATOR) that is triggered by SAP DI using the DI operator. The BAdI holds the code that triggers the SAP process chain. This way, SAP DI just triggers the process chain via events and releases its resources.
A lot of work went into setting up such a tightly integrated environment. Teams from SAP DI, SAP BW and Databricks worked hard to make this a reality for us.
I hope this helps, and please do provide feedback, as I strongly feel there is never only one solution to a problem.