I know the idea was to separate workflow and dataflow, but I have come across a scenario where it would be useful for one branch of a dataflow to wait until another branch has finished.
I have some transactional data which records events for the start and end of a session. I want to build a list of unique sessions with their start and end dates. I currently have the list of events sorted by time, followed by a conditional split on the start and end events. I can then insert all of the start events, and would like to wait until all of the starts are inserted before updating them with their relevant end times.
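In relational terms, the dependency I'm after is equivalent to the following two statements, where the second must not run until the first has completed. The table and column names here (dbo.Sessions, dbo.SessionEvents, and so on) are purely illustrative:

```sql
-- Step 1: insert a session row for every start event.
INSERT INTO dbo.Sessions (SessionId, StartTime)
SELECT e.SessionId, e.EventTime
FROM dbo.SessionEvents AS e
WHERE e.EventType = 'Start';

-- Step 2: only after all of the starts are committed, stamp each
-- session with its matching end event.
UPDATE s
SET s.EndTime = e.EventTime
FROM dbo.Sessions AS s
JOIN dbo.SessionEvents AS e
  ON e.SessionId = s.SessionId
WHERE e.EventType = 'End';
```

In a single data flow, the insert branch and the update branch run concurrently, so there is no way to guarantee this ordering.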
Is this achievable?
Does anyone else think it would be a good idea to be able to set precedence across multiple branches of a data flow?
Does anyone have a better solution?
I know this is the wrong forum, but is there a way to model this against the transactional data in SSAS? I will move this question to the SSAS forum if anyone thinks this would work!
Philip Coupar
Unfortunately, setting execution precedence between branches in the data flow is not supported today. Until we redesign the engine to support this feature, I would consider other possibilities for solving the problem, such as using a combination of task flow and data flow.
|||A simple strategy is to use raw files. Stage the updates in a raw file and use a second Data Flow Task to load them and run the updates. Raw Files are very efficient.|||
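With that pattern, the first Data Flow Task inserts the start events and writes the end events to a raw file; the second Data Flow Task reads the raw file back and applies the updates, typically through an OLE DB Command transformation. A sketch of the parameterized statement such a component might run per row (table and column names are illustrative):

```sql
-- OLE DB Command statement executed for each row read back from the
-- raw file; the ? placeholders are mapped to input columns
-- (EventTime, SessionId) in the component's parameter mappings.
UPDATE dbo.Sessions
SET EndTime = ?
WHERE SessionId = ?;
```

Because the second task only starts once the first completes, the insert-before-update ordering is guaranteed by the control flow.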
Philip Coupar wrote:
I know the idea was to separate workflow and dataflow, but I have come across a scenario where it would be useful for one branch of a dataflow to wait until another branch has finished.
I have some transactional data which records events for the start and end of a session. I want to build a list of unique sessions with the start and end date. I currently have the list of events sorted by time, followed by a conditional split for the start and end events. I can then insert all of the start events and would like to wait until all of the starts are inserted before updating them with their relevant end times.
Is this achievable?
Does anyone else think it would be a good idea to be able to set precedence across multiple branches of a data flow?
Does anyone have a better solution?
I know this is the wrong forum, but is there a way to model this against the transactional data in SSAS? I will move this question to the SSAS forum if anyone thinks this would work!
Philip Coupar
Phil,
As Darren said, raw files are the solution here.
In a similar vein, though, I have requested some functionality that would allow you to prioritise destinations:
Intrinsic Flow Priority
(https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178058)
Would this help you?
-Jamie
|||
Raw files may well be very efficient, and space may be cheap, but I already have a staging table with 4 million transaction rows, which breaks out into 2 million start events and 2 million end events. I don't really want to read 4 million rows from a set of text files into a SQL Server table, read all the rows back out while applying some transformation logic, split on one of the transformed columns to write half the data back out to a file, and then at some later point pick up that file again to update my 2 million start events with their end times.
That is just the incremental data; rebuilding the entire model would involve reprocessing 250 million transaction rows. The pipeline uses buffers to manage the number of rows in memory, but the raw file in this case would require 125 million rows to be written out. I know I could batch the rebuild into 4-million-row sets, but I quite like the "Read Once, Transform Many, Write Once" mentality.
|||Jamie Thomson wrote:
Phil,
As Darren said, raw files are the solution here.
On a similar vein though I have requested some functionality that would allow you to prioritise destinations:
Intrinsic Flow Priority
(https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178058)
Would this help you?
-Jamie
That looks great; I hope anyone else following this thread will have a look and vote. My only additional request to your proposal would be to include the OLE DB Command component as well as the destinations, so that precedence could be specified over inserts and updates.
As well as the scenario above, we have a number of data feeds where the data columns are split across multiple source files, and not all sets of files contain distinct data sets; sometimes a file can contain data items that relate to data from a previous set of files. This introduces exactly the same scenario, where I would like to insert all new rows and afterwards use other data in the data flow to update these or existing rows.
|||Philip Coupar wrote:
Jamie Thomson wrote: Phil,
As Darren said, raw files are the solution here.
On a similar vein though I have requested some functionality that would allow you to prioritise destinations:
Intrinsic Flow Priority
(https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178058)
Would this help you?
-Jamie
That looks great; I hope anyone else following this thread will have a look and vote. My only additional request to your proposal would be to include the OLE DB Command component as well as the destinations, so that precedence could be specified over inserts and updates.
Agree with that totally.
Note to anyone else: don't just vote - that is useless. Add a use case explaining why you want this functionality (as Phil has done).
Philip Coupar wrote:
As well as the scenario above, we have a number of data feeds where the data columns are split across multiple source files, and not all sets of files contain distinct data sets; sometimes a file can contain data items that relate to data from a previous set of files. This introduces exactly the same scenario, where I would like to insert all new rows and afterwards use other data in the data flow to update these or existing rows.
Hmmm... the scenario I had in mind is that you either insert a row or update it. But I guess there is no reason why this "Intrinsic Flow Priority" feature couldn't be used to both insert a record and then update it. Interesting...
One more thing, Phil: can you make sure you add all the details from your use case on this thread to the Connect posting? They'll take more notice of it if you give them more reasons to.
Ta
Jamie