I have spent a decent amount of time in Data Warehousing. The biggest challenge we faced, numerous times, was that the business requested some new data or KPIs, and when they finally arrived, the requested KPI was already superseded or the data interpretation did not have the accuracy the business expected. It would seem obvious to introduce DevOps as the development methodology for DWH. Why is it not done, or can it even be done?
Where I come from
A DWH sources the data from the source system through several stages, such as data hub, integration, staging and warehouse layers, up to the final star or cube model the business runs its reports from. The journey is long: from start to end it takes several weeks, or sprints, until a newly requested KPI reaches its report. It seems natural to reduce the number of steps and the elapsed time wherever possible, and there should be a huge opportunity to do so.
As the opportunity was huge and DevOps, some years back, was not yet known (or even born), we felt the urgency and started our journey to optimize time to market for our DWH. In numerous attempts and steps we reduced the risk of such a long delivery process, and the related delivery failures, by including the business in testing and in defining the data processing rules. One of the biggest issues to solve was analyzing the data of a source system in its entirety. As Excel blew up regularly, we managed to set up data hubs (the data from the source system copied 1:1 into a SQL database) and made them accessible to the business analysts. This gave the business access to the data and let them explore the complexity of even “simple” business rules, as they had to cope with the entire data set.
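To make this concrete, here is a minimal sketch of such a 1:1 data hub copy, assuming Python with pandas and SQLAlchemy; the connection strings and the chunk size are hypothetical placeholders, not what we actually ran back then:

```python
# Minimal sketch: replicate every table of a source system 1:1 into a
# SQL database ("data hub") that business analysts can query directly.
import pandas as pd
from sqlalchemy import create_engine, inspect

source = create_engine("mssql+pyodbc://source_dsn")     # hypothetical source system
hub = create_engine("postgresql://etl:pw@hub/datahub")  # hypothetical data hub

for table in inspect(source).get_table_names():
    # Stream each table in chunks so large source tables don't exhaust memory.
    for i, chunk in enumerate(pd.read_sql_table(table, source, chunksize=50_000)):
        chunk.to_sql(table, hub,
                     if_exists="replace" if i == 0 else "append",
                     index=False)
```

The point is not the tooling but the principle: the analysts work on the complete, untouched source data instead of Excel extracts.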
Faster but still not enough
With this step in place, we delivered more accurately and closer to the business's expectations, and the first phase of a waterfall approach almost disappeared. As the business analysts were able to analyze the source data, the definitions were more accurate and the selection criteria were set upfront. However, as the process of code integration, testing and loading all the data through the layers still took a huge amount of elapsed time, the business soon managed to report directly from the data hub through their Excel sheets. This was not anticipated, but the appetite grew to the point where the business accepted almost any risk. So we had to move on and speed up the development cycle. But with the system in place, we did not manage to significantly reduce load times of multiple hours. The idea of developing multiple layers in parallel did not bring the desired outcome either. In the end we accepted that we could not speed up, and that what would now be called a DevOps mode was not possible.
As Scrum and other methods became more popular, IT was challenged to introduce a more rapid way to bring code to life. But while numerous processes were optimized and many discussions were held, ideas for new integration patterns remained unheard.
Defeated by a number of no's and after many talks and attempts, I think it is time to challenge the DWH world and its setups: can they deliver in a continuously integrated manner, and can this happen without fancy cloud technology, in a legacy environment?
It works
Numerous blogs and articles exist where a DWH is introduced on a cloud platform with automated testing and integration processes running on a daily basis [Microsoft Azure, Teradata]. But can it be done on a legacy architecture? Some additional tools and development are still required, such as a code repository and test automation tools. But a few new tools alone do not make the difference; a fair amount of test and comparison logic must be implemented on top of them [DevOps in DWH].
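As an illustration of the kind of comparison logic I mean, here is a hedged sketch in Python that reconciles a staging table against its warehouse counterpart before a change is promoted; the table names, key column and connection are illustrative assumptions, not part of any product:

```python
# Sketch: reconcile two layers of the DWH as an automated pre-release test.
# Assumes both tables share the same columns and a unique business key.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl:pw@dwh/warehouse")  # hypothetical DWH

def reconcile(staging_table: str, warehouse_table: str, key: str) -> list:
    """Return a list of findings; an empty list means the layers agree."""
    stg = pd.read_sql_table(staging_table, engine).set_index(key).sort_index()
    dwh = pd.read_sql_table(warehouse_table, engine).set_index(key).sort_index()

    findings = []
    if len(stg) != len(dwh):
        findings.append(f"row count differs: {len(stg)} vs {len(dwh)}")

    missing = stg.index.difference(dwh.index)
    if len(missing) > 0:
        findings.append(f"{len(missing)} keys missing in the warehouse layer")

    # Cell-by-cell comparison on the keys both layers share. Note that NaN
    # compares as unequal here, which is acceptable for a smoke test.
    common = stg.index.intersection(dwh.index)
    unequal = (stg.loc[common] != dwh.loc[common]).any(axis=1)
    if unequal.any():
        findings.append(f"{int(unequal.sum())} rows differ on common keys")
    return findings

# Example: fail the pipeline when the layers drift apart.
assert reconcile("stg_sales", "dwh_sales", "sale_id") == []
```

Wired into the integration pipeline, such checks run after every load and replace the manual spot checks the business used to do in Excel.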
Be careful
Also, a change to autonomous integration must be well thought through, and the legacy must be cleaned up first. Merely doing the same inefficient data processing in a faster, more repetitive way does not serve the purpose [Be careful when implementing automation]. This matches my initial setback, when we did not manage to speed up the existing deployment process because loads and clones took hours of processing. However, even long-lagging load processes can be optimized and improved [Analytics Today].
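One classic optimization of such a load, sketched under the assumption that the source carries a reliable last-modified timestamp, is to replace the full reload with a delta load against a watermark; the table and column names (etl_watermarks, src_orders, stg_orders, modified_at) are made up for the example:

```python
# Sketch: delta load against a watermark instead of a full reload.
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("mssql+pyodbc://source_dsn")     # hypothetical source
hub = create_engine("postgresql://etl:pw@hub/datahub")  # hypothetical hub

with hub.connect() as conn:
    # High-water mark of the previous run instead of truncate-and-reload.
    last_run = conn.execute(
        text("SELECT last_loaded FROM etl_watermarks WHERE tbl = 'src_orders'")
    ).scalar_one()

delta = pd.read_sql_query(
    text("SELECT * FROM src_orders WHERE modified_at > :wm"),
    source, params={"wm": last_run},
)
delta.to_sql("stg_orders", hub, if_exists="append", index=False)

if not delta.empty:
    with hub.begin() as conn:  # advance the watermark only after a successful load
        conn.execute(
            text("UPDATE etl_watermarks SET last_loaded = :now WHERE tbl = 'src_orders'"),
            {"now": delta["modified_at"].max()},
        )
```

Hours of nightly processing can shrink to minutes this way, which is what makes a faster release cadence realistic in the first place.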
Conclusion
As a final conclusion, a DWH following DevOps principles and pace can be introduced. Maybe not on a daily implementation cadence; a bit less seems more practical. But going from six releases a year to a weekly release is, in my view, a huge step forward. The effort involved in getting to such a solution is quite large, and there are numerous hurdles to overcome in a classical DWH. Depending on the legacy architecture and the technical debt, it is almost easier to start from scratch and consider a move to the cloud. From such a starting point it sometimes seems more practical to live with the slow pace and the few release cycles, and to let the business run a few more Excel queries and VLOOKUPs.