<img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=612681139262614&amp;ev=PageView&amp;noscript=1">
Skip to content

Need help? Talk to an expert: phone(904) 638-5743

Incrementally Loading Data from Salesforce

Incrementally Loading Data from Salesforce

As a business owner and someone who is responsible for sales, reporting in SalesForce.com is challenging to say the least. The reporting is tightly restricted to a few objects and the interface can't create the beautiful reports that those in the data industry are used to seeing. Most importantly, I can't merge the data with my existing data sources like data from accounting, HR or our other internal systems. This is why most people pull data from SalesForce.com into a central reporting database to do more advanced reporting. This blog shows you my method for pulling data into a single repository for further reporting.

At Pragmatic Works, we wanted a single reporting repository that contained the raw data from all of our data sources like SalesForce.com. SalesForceAPI access can be expensive and if you exceed your API limit, external systems might shut down so if everyone creates their own smaller databases so creating this central database allows us to unify our work.

The Incremental Load Pattern Control Table

The first thing you'll need for any incremental load in SSIS is create a table to hold operational data called a control table. This control table in my case uses the below script to manage the ETL.

Column Name Description
SourceObject Object name in SalesForce.com
LastLoadDate Represents the last time the object was localized (pulled from)
RowsInserted Number of rows that have come over in the last retreival of data.

 

CREATE TABLE dbo.SalesForceControlTable
 (
 SourceObject varchar(50) NOT NULL,
 LastLoadDate datetime NOT NULL,
 RowsInserted int NOT NULL
 )  ON [PRIMARY]
GO
ALTER TABLE dbo.SalesForceControlTable ADD CONSTRAINT
 DF_SalesForceControlTable_LastLoadDate DEFAULT 1/1/1900 FOR LastLoadDate
GO
ALTER TABLE dbo.SalesForceControlTable ADD CONSTRAINT
 DF_SalesForceControlTable_RowsInserted DEFAULT 0 FOR RowsInserted
GO
ALTER TABLE dbo.SalesForceControlTable ADD CONSTRAINT
 PK_SalesForceControlTable PRIMARY KEY CLUSTERED
 (
 SourceObject
 ) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
 

Here's an example of what my table looks like after the Case object has been loaded.

 

ControlTable.jpg

The SSIS ETL

With the plumbing now complete, it's time to get our hands dirty programming theSSIS packages. The package will be broken into three steps:
  1. Determining when the last load of the package ran (reading the control table)
  2. Reading the data out of SalesForce
  3. Setting the last load date in the control table

Steps 1 and 3 will be Execute SQL tasks and the second step will be a standard Data Flow task. For Step 1, use an Execute SQL task that queries your control table as shown in the below query. Save the results of the query by changing the Result Set option to Single Row. Lastly, go to the Result Set tab and save the result to a variable called dtLastLoadDate.

SELECT LastLoadDate
FROM SalesForceControlTable
WHERE SourceObject = 'Case'

Next, add a Data Flow task. You can use the Task Factory SalesForce.com Source for SSIS to retreive data out of SalesForce.com. If you're in Visual Studio, Task Factory is free to use. In the below screenshot, you can see that I selected the Case entity from SalesForce. The query below the entity then automatically populates. I also wanted to retrieve only an incremental amount of data (never retrieve the same rows twice unless they've changed). So, I add the following WHERE clause to the SOQL query to only retreive rows that have changed in SalesForce.com since the latest time I ran this package. I personally prefer to do a date range so I'm protected from not missing rows that might be changed while the package is running. The StartTime variable you see below does just that. This WHERE clause is asking, "Which cases have been changed since the last time I ran the package until when I started this package?"

WHERE LastModifiedDate > <@User::dtLastLoadDate>
and LastModifiedDate < <@System::StartTime>

SalesForce.comSource.jpg

In my case, I sent the data to a Task Factory Upsert Destination. Doing so allows me to conditionally update or insert the data.

TaskFactoryUpsert.jpg

 

Now that the hard part of the package is complete, add a final Execute SQL task to set the date on the control table back to the curent run of the pacakge.  I use the below query to set that date to the start date of the package. The question mark is substituted with the StartTime variable in the Variable tab.

UPDATE SalesForceControlTable
SET LastLoadDate = ?
WHERE SourceObject = 'Lead'

Once run, the data flow only pulls all the rows over the first time. After that, subsequent runs of the package only pull over changed rows.

IncrementalLoadDF.jpg

 

The complete control flow tab looks like the below. This same pattern can be used to extract data incrementally for nearly any data source.

 

Incremental Load Complete SalesForce

In the below video Shawn Harrison walks through the complete example end-to-end.

 

Task Factory Free Trial Download

 

Sign-up now and get instant access

Leave a comment

Free Trial

On-demand learning

Most Recent

private training

Hackathons, enterprise training, virtual monitoring