Darren,
I am afraid there is no way to influence this mapping. The Data Reader Source adapter uses only the CLR type (the DataType column from the table's schema) to determine which DT_... type to choose. The ProviderType field cannot be used because it has a different meaning for different providers.
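To make that concrete, here is a rough sketch (not the adapter's actual code; the connection string and query are placeholders) that dumps the schema table an ADO.NET reader exposes. The DataType column is the CLR type the adapter keys on, so a varchar and an nvarchar column both report System.String and therefore both surface as DT_WSTR, while ProviderType is provider-specific and ignored.

using System;
using System.Data;
using System.Data.SqlClient;

class SchemaDump
{
    static void Main()
    {
        using (SqlConnection conn = new SqlConnection("Data Source=(local);Integrated Security=SSPI"))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand("SELECT * FROM sys.objects", conn);
            using (SqlDataReader reader = cmd.ExecuteReader(CommandBehavior.SchemaOnly))
            {
                DataTable schema = reader.GetSchemaTable();
                foreach (DataRow row in schema.Rows)
                {
                    // DataType (the CLR type) is all the adapter looks at; ProviderType
                    // means different things for different providers, so it is ignored.
                    Console.WriteLine("{0}: DataType={1}, ProviderType={2}",
                        row["ColumnName"], row["DataType"], row["ProviderType"]);
                }
            }
        }
    }
}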
The mapping files would definitely help here, but that infrastructure is not used by this component.
I do not have any good advice; explicit data conversion to the DT_STR type or building your own custom ADO.NET adapter are the options I see available at the moment.
Thanks.
|||Bob,
Thanks for confirming what I already suspected, but I had to ask. I'm thinking an MSDN feedback request for the ability to supply mapping files will be coming your way.
Conversion works, but I'm concerned about the impact of doubling the buffer size each time I do this. Most columns I am working with are DT_WSTR but need to be DT_STR, so it is 2 x the buffer every time. Is this really twice the size, or is there some fancy pointer-type work going on? I will probably test when I have time, as I had some other ideas about custom components for such conversion, but it depends on what the impact really is.
A custom provider has been considered but rejected for now due to time. It took me long enough to get the managed provider written :)
Thanks
|||Hi Darren,
I believe you are right about the buffer size. It might impact your package performance, but I am not sure how significant that would be; it may depend on many factors. If you get a chance to measure the impact in your configuration, please share the results with us.
If the "power" stays with us, we should be able to provide much better story with managed providers in the next version.
Thanks.
|||I have done some playing around with this.
For information, my theory goes like this. If I use a Data Conversion transform I am increasing the number of columns in my buffer, so I get fewer rows per buffer. This seems inefficient. On the other side, we know that copying data between buffers has a cost. So let's test which is more efficient: keeping the larger row size, or paying the cost of moving between buffers to keep a small row size.
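To put rough numbers on the theory (back-of-the-envelope only, assuming DT_WSTR costs 2 bytes per character and DT_STR 1 byte per character in the buffer, an assumed default 10 MB buffer, and ignoring per-row overhead and the DefaultBufferMaxRows cap):

class RowWidth
{
    static void Main()
    {
        const int bufferBytes = 10 * 1024 * 1024;   // assumed default DefaultBufferSize
        int baseline  = 4 + 9 * 50 * 2;             // DT_I4 + 9 x DT_WSTR(50) = 904 bytes per row
        int converted = baseline + 9 * 50;          // plus 9 extra DT_STR(50) copies = 1354 bytes
        System.Console.WriteLine("rows per buffer before: {0}", bufferBytes / baseline);   // ~11,600
        System.Console.WriteLine("rows per buffer after:  {0}", bufferBytes / converted);  // ~7,700
    }
}

The row layout (one integer plus nine 50-character columns) matches the test data described below.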
I wrote a simple asynchronous component that lets you select columns from the input buffer, which are then copied directly to the output buffer. The one feature is that any DT_WSTR column is reproduced as DT_STR. So the buffer size/structure is the same for input and output, except for the change in type and any overheads associated with each type. One would think that Unicode types require twice the space of non-Unicode, so this should make the asynchronous component test even faster, as it allows even more rows to fit into the output buffer of my component.
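The core of the component looks roughly like this. This is a sketch against the SSIS pipeline object model rather than the exact code I ran; validation, metadata setup and the PreExecute code that builds the column index arrays (via BufferManager.FindColumnByLineageID) are stripped out, and only the string columns are handled.

using Microsoft.SqlServer.Dts.Pipeline;

public class DeUnicodeAsynchTestComponent : PipelineComponent
{
    private PipelineBuffer outputBuffer;   // cached in PrimeOutput, filled in ProcessInput
    private int[] inputColIndex;           // buffer column indexes, assumed built in PreExecute
    private int[] outputColIndex;

    public override void PrimeOutput(int outputs, int[] outputIDs, PipelineBuffer[] buffers)
    {
        // Asynchronous output: keep hold of the output buffer so ProcessInput can add rows to it.
        outputBuffer = buffers[0];
    }

    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        while (buffer.NextRow())
        {
            outputBuffer.AddRow();
            for (int i = 0; i < inputColIndex.Length; i++)
            {
                // GetString reads the DT_WSTR input column; SetString on a DT_STR output
                // column stores it single-byte, which is the conversion under test.
                outputBuffer.SetString(outputColIndex[i], buffer.GetString(inputColIndex[i]));
            }
        }

        if (buffer.EndOfRowset)
        {
            outputBuffer.SetEndOfRowset();
        }
    }
}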
For a baseline I used a Script Component -> Union All. The script component generated a variable number of rows, as determined by a package variable. The columns produced are one integer column (a row count) and nine fully populated 50-character DT_WSTR columns.
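The generator itself is trivial. A sketch of its logic follows; note the real SSIS 2005 script component is VB.NET, and the Output0Buffer property, variable and column names here are the designer-generated names I am assuming, so this is illustration rather than paste-in code.

// C# sketch of the VB.NET script component's CreateNewOutputRows
public override void CreateNewOutputRows()
{
    int rowsToGenerate = Variables.RowCount;   // package variable controlling the volume
    string payload = new string('X', 50);      // fully populates each 50-character column

    for (int i = 0; i < rowsToGenerate; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.RowNumber = i;           // the single integer column
        Output0Buffer.Column1 = payload;       // repeated for each of the nine DT_WSTR columns
        Output0Buffer.Column2 = payload;
        Output0Buffer.Column3 = payload;
        // Column4 .. Column9 are assigned the same way
    }
}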
For testing I used the same script component and two methods of converting the columns:
Script Component -> Data Conversion -> Union All
Script Component -> DeUnicodeAsynchTestComponent -> Union All
Tests showed that the data conversion was 1.5-2.5 times slower than the baseline, and the asynchronous component was in turn 2-2.5 times slower than the data conversion. Times were averaged across 6 executions; the range reflects different row counts, from 100,000 to 1,000,000.
N.B. These ratios are for my local machine, and I would fully expect results to vary on different hardware and with different resource constraints. These are for illustration only. If you want to know how this equates to your environment, test it for yourself, and use real hardware, not a test system.
So, whilst it may not look pretty, leaving the buffer alone is the way to go. Trying to remove columns or change columns is a non-starter, as this means creating a new buffer, the cost of which far outweighs the benefit of the smaller row size. When you do need to work on columns, use a synchronous component such as the Data Conversion or Derived Column transformations, and don't worry if you end up with more columns than you will use at the end. (Obviously don't create columns for the sake of it!)
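For reference, the Derived Column form of that conversion is just a cast expression added as a new column, along the lines of the one below, where the column name, length and code page are example values rather than anything from the packages above:

(DT_STR, 50, 1252)[MyUnicodeColumn]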