This is a follow-up blog post based on the Intro to Data Factory session I gave on Training on the T’s with Pragmatic Works. Find more free training sessions, past and upcoming, here. I did my session on January 13, 2015.
In this session, I gave a simple introduction to the new Azure Data Factory using a CopyActivity pipeline between Azure Blob Storage and Azure SQL Database. Below is a diagram illustrating the factory that is created in the demo.
I have published my presentation materials here. This includes the sample JSON files, the Movies.csv, and PowerShell scripts.
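For reference, a CopyActivity pipeline definition of the kind used in the demo looks roughly like the JSON below. This is a minimal sketch in the style of the preview-era Data Factory JSON schema; the pipeline, dataset, and activity names are placeholders, not the exact names from my sample files.

```json
{
    "name": "CopyMoviesPipeline",
    "properties": {
        "description": "Copy Movies.csv data from blob storage to Azure SQL Database",
        "activities": [
            {
                "name": "BlobToSqlCopy",
                "type": "CopyActivity",
                "inputs": [ { "name": "MoviesBlobDataset" } ],
                "outputs": [ { "name": "MoviesSqlDataset" } ],
                "transformation": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            }
        ]
    }
}
```

The inputs and outputs refer to dataset definitions by name, which is how the factory wires the blob source to the SQL sink.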
Q & A
Here are a few questions that were answered during the session.
1. Does Availability refer to when data that has been transferred will be available? Or when the data source is actually available for query?
Availability refers to when a dataset will make a slice available. This is when the dataset can be consumed as an input or targeted as an output. This means you can consume data hourly but choose to push it to its final destination on a different cadence to prevent issues on the receiving end.
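In the dataset JSON, this shows up as the availability section. A hedged sketch (the dataset name and cadence are illustrative, not from the demo files):

```json
{
    "name": "MoviesSqlDataset",
    "properties": {
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```

Setting a different frequency or interval on the output dataset than on the input is what lets you consume hourly but deliver on a slower cadence.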
2. What prerequisites are must-haves (e.g. an Azure account, HDInsight, Blob Storage accounts, etc.)?
- An Azure account is the only real must-have. You could use two on-premises SQL Server instances.
- HDInsight, if you want to use the HDInsight activities
- An Azure Storage account to use blob or table storage
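To make a storage account usable from a factory, you register it as a linked service. Roughly, in the preview-era JSON schema (the name, account, and key below are placeholders):

```json
{
    "name": "MyBlobStore",
    "properties": {
        "type": "AzureStorageLinkedService",
        "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
}
```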
3. How do you decide whether to use a Data Factory or a data warehouse?
The factory is more of a data movement tool. A warehouse could be a source or target of a factory pipeline.
4. Is this similar to SSIS in SQL Server?
Yes and no. SSIS is definitely more mature and has more tooling available, such as data sources and transformations. SSIS also has a good workflow constructor. The initial focus of Data Factory was to load HDInsight tables from a variety of sources with more flexibility. The other note here is that Data Factory is being built from the ground up to support cloud scale on Azure.
5. Can this be used for Big Data?
Absolutely. I would say that it is one of the primary reasons for the tool. In reference to the previous question, it will likely be the tool of choice for big data operations because it will be able to scale with Azure.
Links to additional resources on Data Factory and the tools used in the presentation:
Thanks for joining me for this presentation. We look forward to seeing you at the next Free Training on the T’s.