Last month I wrote about some of the ways in which analytics has become significantly more complex. A major factor has been the proliferation of software-as-a-service (SaaS) applications with many companies now having to contend with company data splintered across a wide variety of applications.
This article examines some of the ways in which analytics solutions have evolved to meet this increased complexity, beginning with the area of integration.
Analytics solutions comprise a few key components, namely:
Just as companies have turned to the cloud for many of their ongoing business operations, they are also using it for analytics solutions. Many are now using solutions for each of the main stages of analytics, with those solutions often supplied by different vendors. An ecosystem of products has emerged replacing the all-encompassing one-stop shop.
The Modern Data Stack: “Choose One Solution from Each Component Area”
Some of the market leading solutions for each of the key components are listed below:
|Component Area||Leading Solutions|
|Integrate||Fivetran, Stitch Data, Matillion, Hevo Data, Segment, Rudderstack|
|Store||Snowflake, Amazon Redshift, Microsoft Azure, Google BigQuery, SQL Server|
|Analyse & Report||Looker, Mode, Tableau, Chartio, Power BI, Sisense, Qlik|
|Programming Languages||SQL, R|
For many companies their analytics thus comprises a combination of solutions – at least one for each component area. The rest of this article focuses on the integration stage or, as we’ll see shortly, on the integration stages.
So What Is Data Integration?
Data integration is essentially about moving data. Frequently data needs to be moved, or copied, on a repeated basis. This explains the common reference to data pipelines. This process is often referred to as Extract, Transform, Load (ETL). Indeed, the terms “data integration” and “ETL” are frequently used interchangeably. Extract, Transform and Load are the main sub-components of any solution with each required to complete a specific key task as outlined below.
|Extract||Extraction of data from sources||Transform||Preparation of data for loading based on source||Load||Load or output of the transformed data to targets|
This is a model that works well for many use cases. Many utilities, for example, use solutions to integrate usage records data through to operational systems such as billing and fraud management. Generally, data is transformed to meet the format required by the downstream system. Many ‘traditional’ integration solutions are also used as a key part of analytics solutions. McDonald’s, for example, selected Talend to integrate data from more than 1,000 sources across its global operations and supply chain.
Recently, however, many vendors have developed products specifically to address analytics requirements and, in the process, exploit some of the features offered by the new cloud storage options.
Data Sources and Targets
The main kinds of data source are shown below.
As reported in my previous article, even medium-sized companies are now using an average of 185 SaaS applications. Integration solutions need to connect to both ends of each source-to-target combination and carry out relevant extract, transform and load tasks. Solutions need to be constantly updated to reflect any changes to source data formats. When it comes to SaaS applications this can prove to be a constant game of ‘Whac-a-mole’ that has frequently tilted many companies towards the buy side of any build versus buy decision.
Integration Solution Vendors
There are now several vendors focussing on the analytics niche. Several have secured significant funding including Fivetran and Matillion who were featured in the sector’s leading market research survey, the 2021 Gartner Magic Quadrant for Data Integration alongside established generalist integration vendors such as Informatica and IBM.
Most of the larger vendors offer support for several target data stores (such as Snowflake, BigQuery etc.) and all of the major source types (Files, Events etc.) including from 50 to 300 of the most popular SaaS Applications. Vendors are constantly adding sources and targets in response to market demand.
Specialist Vendors Are Emerging
Many vendors have acquired expertise in handling particular sources such as SaaS Apps (Fivetran, Stitch Data, Hevo Data), databases (HVR*), events (Segment, Rudderstack) or even file format data (Sonra).
Even within the particular realm of SaaS applications some specialists are emerging: Adverity, Funnel, and Improvado all have a focus on general marketing data. Likewise, Supermetrics have acquired a particular prominence in advertising data, integrating some 10% of the paid advertising data created on the major platforms.
Many of the generalist vendors have expanded their product set to include analytics/cloud-focused integration. Informatica, CloverDx and Talend have done so – the latter via the acquisition of Stitch Data.
Key Differences between Traditional Integration and Integration for Analytics Solutions
When it comes to integrating data for analytics there are generally no requirements to filter out or transform data. In contrast to many non-analytics solutions, the general ethos is to integrate all of the source data – every record, every field – and to carry out only the bare minimum of transformation needed to enable the data to be loaded to the data store. The reason behind this is that it makes logistical sense to defer any significant transformation until after the load stage, which is effectively a second data integration stage, commonly referred to as Data Transformation, as shown in the diagram below.
With the transform part of analytics integration solutions generally being delayed until after the load part the approach is often described as ELT. The initial phase is largely an Extract and Load one, sometimes referred to as “Data Replication”. One of the key reasons for deferring the transform stage is the computational power and scalability that the modern cloud data store offers. The transformation can also be carried out using SQL, a language widely used by data analytics professionals.
The primary aim of the transformation stage is, in essence, to create analytics-ready data based on the already landed raw data. This can range from relatively basic format transformations such as format changes or data filtering through to transformations that require the complex modelling of several data sources.
The rise of the ELT approach has spurred considerable innovation in the area of SQL editors. These have brought many useful software engineering practices such as version management. It has also promoted various productivity features such as query correction, autocompletion and query parameterization. The latter enables the same script to be run against a production or test environment. However, it was not until the arrival of tools specifically focused on data transformation that more structured ways of handling transformations could be put in place.
Data Transformation Vendors
The first transformation solution to gain significant traction was dbt. An acronym for “data base tool”, it was created in 2016 and released as open source by dbt Labs, which was then primarily an analytics consultancy. The dbt tool enables analysts to use their preferred editor to write scripts in a combination of SQL and the templating language Jinja. Scripts are then compiled to SQL and run against the data store creating views or tables.
Like the new breed of SQL editors, the tool supports version management and parameterization. It can be used to create complex data models that can be used across the data team. Check out this advice regarding Recurring Revenue transformations, for a flavour of the sort of models and techniques that can be used.
The dbt tool is widely considered to have set a new standard for data transformation. A substantial user community has coalesced around it. This was helped in no small part by the company blog and Slack channel that promote collaboration and the sharing of scripts and techniques both within a data team and within the wider community. At the time of writing, more than 5,000 companies are using dbt in production with many using dbt cloud, a paid service, that includes an integrated development environment. You can read more on dbt here.
Other transformation tools include Dataform, Datameer, and Trifacta, all of which, like dbt, provide data teams with a means of creating transformation scripts using SQL. Dataform was recently acquired by Google and is currently closed to new sign-ups as the primary focus is integrating it into the broader Google Cloud Platform offering. Some of the “first stage” integration vendors including Matillion and Rudderstack also offer transform stage SQL-based options. You can read more about data transformation tools here.
Whilst transformation tools allowed analysts to create and run new models without help from data engineers, a downside is that this has often led to fairly chaotic data stores. In the next article I’ll look in further detail at some of the ways companies have begun to bring more sophistication to their analytics efforts.