Serverless Data Engineering in Azure - Part 1

Serverless computing is taking over when it comes to developing new micro-services - the wealth of tooling to develop impressive distributed systems with the associated elastic scaling properties is phenomenal - but where does this leave Data Engineers?


28 Sep 2018 - Ed


We recently had a chance to explore the Azure offerings for a customer with a requirement for a low-volume pipeline to track the history of HR records for their staff. Utilising serverless data engineering tools allowed us to build an experience like a traditional data warehouse, but without the cost of running a SQL database 24/7.

In this blog series we’ll focus on tracking a slowly changing dimension in Azure Data Factory - we’ll cover how we ingest incoming data from an external source, schematise the data on read and build a slowly changing dimension from the data set. All the resources used are usage billed - so there’s no $5-an-hour data warehouse to worry about if you want to follow along - in fact, during development of this project our Azure bill came to less than $3 in total!

The first blog in this series covers moving data sets between cloud providers, automatically detecting schemas and converting the data into an efficient columnar format.

Data Ingestion

In this blog post we’ll use a set of GitHub users we collected from their public API and wrote in raw JSON format to S3. We’ll write another blog post detailing this in the near future, but for now we can tell you that we populate an AWS S3 bucket with date-partitioned files, each containing a set of objects like the one shown below.

{
  "login": "mazdak",
  "id": 691,
  "node_id": "MDQ6VXNlcjY5MQ==",
  "avatar_url": "https://avatars2.githubusercontent.com/u/691?v=4",
  "gravatar_id": "",
  "url": "https://api.github.com/users/mazdak",
  "html_url": "https://github.com/mazdak",
  "followers_url": "https://api.github.com/users/mazdak/followers",
  "following_url": "https://api.github.com/users/mazdak/following{/other_user}",
  "gists_url": "https://api.github.com/users/mazdak/gists{/gist_id}",
  "starred_url": "https://api.github.com/users/mazdak/starred{/owner}{/repo}",
  "subscriptions_url": "https://api.github.com/users/mazdak/subscriptions",
  "organizations_url": "https://api.github.com/users/mazdak/orgs",
  "repos_url": "https://api.github.com/users/mazdak/repos",
  "events_url": "https://api.github.com/users/mazdak/events{/privacy}",
  "received_events_url": "https://api.github.com/users/mazdak/received_events",
  "type": "User",
  "site_admin": false,
  "name": null,
  "company": null,
  "blog": "http://mazdakrezvani.com",
  "location": "Toronto, Ontario, Canada",
  "email": null,
  "hireable": null,
  "bio": null,
  "public_repos": 2,
  "public_gists": 0,
  "followers": 4,
  "following": 0,
  "created_at": "2008-02-23T19:12:21Z",
  "updated_at": "2018-09-06T20:47:47Z",
  "extract_year": 2018,
  "extract_month": 10,
  "extract_day": 4,
  "extract_hour": 13,
  "extract_minute": 43,
  "extract_second": 44,
  "extract_timestamp": 1538660624
}
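
For reference, the date partitioning in the bucket looks something like the layout below. The prefix and file names here are purely illustrative - substitute whatever key structure your extraction job writes:

github-users/2018/10/04/13/users-0001.json
github-users/2018/10/04/13/users-0002.json
github-users/2018/10/04/14/users-0001.json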

Microsoft are clearly targeting a wide audience with their Data Factory offering - they support a huge range of data sources, both from the Microsoft ecosystem (cloud and on-prem) and from 3rd parties. This is great for us, as the AWS S3 connector gives us a conduit to shuffle data between two disparate cloud providers. We’ll have to create an IAM user on the AWS side to provide the required security credentials to Data Factory (you can find documentation on this [here](LINK)).
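
As a rough guide, the IAM user only needs read access to the bucket, so a minimal policy along the lines of the sketch below is enough. The bucket name is a placeholder - swap in your own - and the exact set of actions Data Factory needs is worth checking against the connector documentation:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataFactoryReadOnly",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my-github-users-bucket",
        "arn:aws:s3:::my-github-users-bucket/*"
      ]
    }
  ]
}

Attach this policy to the IAM user, then copy its access key ID and secret access key into the Data Factory linked service.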

Create a DataSet in Data Factory

Data sources and destinations in Data Factory are represented by Dataset objects - these encapsulate the details of the storage location, file types and potentially schemas. Adding a new dataset first asks you which type of data store you want to use - we select Amazon S3 as shown below

create dataset

Moving over to the connection details we provide the bucket name and prefix - this then allows the console to explore the data set and show a preview to confirm you’re in the right place. It can also detect the format of the files in the bucket - however, in my experience this isn’t entirely reliable - we can just select JSON and Set of Objects as shown below

select-connection

Now moving over to the Schema tab we can hit the Import Schema button to automatically detect and populate our schema.

import-schema
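
If you flip to the JSON view of the dataset (the Code button in the Data Factory UI), you can see everything the wizard has configured. A rough sketch of what ours looked like is below - the dataset and linked service names are placeholders of our own, and the exact type strings may differ between Data Factory versions, so treat it as illustrative rather than copy-paste ready:

{
  "name": "s3_data",
  "properties": {
    "type": "AmazonS3Object",
    "linkedServiceName": {
      "referenceName": "AmazonS3LinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "bucketName": "my-github-users-bucket",
      "key": "github-users/",
      "format": {
        "type": "JsonFormat",
        "filePattern": "setOfObjects"
      }
    }
  }
}

Importing the schema adds a structure array of column names and types to this JSON - we’ve left it out here for brevity.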

Create a DataSink

We follow the same steps as above - but this time choosing Azure Data Lake Storage Gen1 as the data store. There’s no need to provide a schema this time as it will be inferred, but we should choose a well-known location for the output file so we can process it later in the pipeline. Usefully, Data Factory supports dynamic content in many places, so we can build the file name from the ID of a particular pipeline run with @concat(pipeline().RunId, '.parquet'). This is shown in the image below

create-datasink
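
The staging dataset’s JSON view shows how that dynamic file name is stored - the expression is wrapped as an Expression value rather than a plain string. As before, the names, paths and type strings below are a sketch rather than the exact generated output:

{
  "name": "staging_parquet",
  "properties": {
    "type": "AzureDataLakeStoreFile",
    "linkedServiceName": {
      "referenceName": "AzureDataLakeStoreLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "folderPath": "staging/github-users",
      "fileName": {
        "value": "@concat(pipeline().RunId, '.parquet')",
        "type": "Expression"
      },
      "format": {
        "type": "ParquetFormat"
      }
    }
  }
}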

Create a Copy Activity

Now we’re ready to produce the first part of our pipeline to import the data over to Azure Data Lake, converting it into Parquet format en route. Create a new pipeline and add a Copy activity from the side bar. All we have to do now is select our s3_data dataset as the data source, specifying the copy files recursively option.

create-copy-action-source

Add the staging dataset as the data sink and ask Data Factory to merge the files into one for ease.

create-copy-action-sink
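
For completeness, the Copy activity the pipeline ends up with looks roughly like the sketch below. The dataset names refer to the earlier sketches, and the source and sink type strings are from memory, so double-check them against what your own Data Factory generates:

{
  "name": "CopyS3ToDataLake",
  "type": "Copy",
  "inputs": [
    { "referenceName": "s3_data", "type": "DatasetReference" }
  ],
  "outputs": [
    { "referenceName": "staging_parquet", "type": "DatasetReference" }
  ],
  "typeProperties": {
    "source": {
      "type": "FileSystemSource",
      "recursive": true
    },
    "sink": {
      "type": "AzureDataLakeStoreSink",
      "copyBehavior": "MergeFiles"
    }
  }
}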

Conclusions

This blog post has shown how powerful Azure Data Factory is for simple Data Engineering tasks. We’ve copied data between two cloud systems, automatically inferred schemas, and converted the data into a columnar format. What’s more, we achieved all of this without a single server - and the cost shows it!