Connect to CSV with CData

Author: Thomas Lind Last updated: November 22, 2024 13:42
txt Text File ODX FTP Multiple Textfile Data source CData CSV Business Unit SFTP

Note: There is an updated version of this article, which applies to the next generation of TimeXtender.

Before you attempt this you need to add the CSV provider to your list of available sources.

In the business unit, we already have two data source providers that can connect to CSV/TXT files, and for the most part, I would suggest using those. There is however a few times where this can be used. Especially because this can connect to a folder with multiple files with different columns and fields in each of them.

Update

It is now possible to aggregate multiple files with different names, but the same structure into a single file, with the file name as a field.

Go through this guide before you start.

Add a CData data source

CSV Information

Here is a link to the file I use in this example

http://mysafeinfo.com/api/data?list=moviesbyearnings2005&format=csv

This one is set up with elements like this.

Year,Rank,Title,Gross $,Month Opened,Theatres
2005,1,Star Wars: Episode III - Revenge of the Sith,380270577.000000,2015-05,3663
2005,2,The Chronicles of Narnia: The Lion the Witch and the Wardrobe,291710957.000000,2015-12,3853
2005,3,Harry Potter and the Goblet of Fire,290013036.000000,2015-11,3858
2005,4,War of the Worlds,234280354.000000,2015-06,3910

CData Setup

Add a CData data source and choose the CSV provider.

The first step in setting up a CData connector should always be to click the ? in the top right corner.

CSV guide from CData

Connecting to one file

You need to save the link in the information area as something csv and save it in a folder your user has rights to.

I saved mine in C:/Flat Files/Movies2005.csv. Then you just add a link to it in the Data Source field.

You can then set it up like this.

You just need to add the location of the file in the URI or Data Source field and then it should work.
Another important thing to change is the following Row Scan Depth field.

By default, this is set to 100 as you can see, but I would change it to 0 which is equal to indefinite and then it will read the whole file before determining the field data types.

Connecting to more than one file

If you have multiple files, you can connect to a folder. Like this

You can also point to files across many folders. You just need to be sure the following features are turned on.

So you need to set Include Subdirectories to True and if you do not want it to go all the way to the end of the path, you can change the Directory Retrieval Depth to another number. -1 means that it has no limit.

Connect to CSV files on a FTP/SFTP server

You also have the option of connecting to a FTP server that contains CSV files. The connection string is as follows.

URI=ftps://localhost:990/folder1; User=user1; Password=password1;

I have a FTP server with a folder that contains some CSV files and here is how I set it up.

So I added my password to the Password field, my user account to the User field, and in the URI field I added the location of the server. The default ftp port is 21.

If the files are stored on a SFTP server it works similarly, but you need to change the string slightly.

URI=sftp://127.0.0.1:22/folder1; User=user1; Password=password1;

So the URI needs to start with sftp:// and the default port is 22 instead.

Connect to files on an Azure Storage

There are two methods to connect to Blob store containers and one to connect to File Shares:

For all of them, you need to add the following.

Find the storage account name and add that to the Azure Storage Account field. You also need an Azure Access Key. This can be found here.

ABFSS:// Data Lake Gen 2

Important: specify abfss:// in URI in lowercase letters

AzureBlob://

For both methods what comes after the // is the same. First is the container name, then the folders, and finally the file. However, you can choose to point at only the folder if you want.

If your file is stored in a FileShare instead of a container you can use the azurefile:// option

AzureFile://

The file is not using the default delimiter settings

As long as it is comma-separated, this is all you need. If it is not comma-separated there are two options.

1st option

There are two fields you can use to change the default settings.

The first field is the FMT field. Seemingly it has three options.

That is actually not the case. As explained in the following Link it can actually be filled with whatever sign you want to use as a delimiter.

Here I overwrote the std options with a 'pipe' | sign.

This when synchronized will give the same as if it was a , sign like default.

The other field is the Row Delimiter field.

So if your file doesn't use CR LF as the row delimiter, you can add it here.

2nd option

http://mysafeinfo.com/api/data?list=moviesbyearnings2005&format=csv&delimiter=|

It still needs to be saved in that folder in C:/Flat Files/Movies2005.csv. Additionally, you need to create a file called schema.ini. It has to be named exactly that, or it won't work.

You can read about the various parameters by googling csv and schema.ini, but for this, we need it set up like this.

[Movies2005.csv]
Format=Delimited(|)

First off you need to have the full file name as the first line in square brackets. Then you need to add each change to the file that is not default, mainly delimiters and such.

Then when you synchronize it looks at this file and finds the correct delimiter. This also works for multiple files, they just need to be added below with the same setup.

Like this for a folder.

[Company.csv]
Format=Delimited(;)
[Customer.csv]
Format=Delimited(;)
[Date.csv]
Format=Delimited(;)
[Item.csv]
Format=Delimited(;)
[Sales Invoice Line.csv]
Format=Delimited(;)

Are you connecting to files on an FTP, they also need the ini file added to its folder.

Getting the data

After setting up the provider, all needed is to synchronize/read the data source.

After that, you can set up incremental load and make data type overwrites.

If your data doesn't come out as you expected, go through the CData guide and see if you can turn on a setting that will give you the correct look.

Aggregate multiple files into one

A new feature of CSV is the option to aggregate files in a folder into one.

I have this folder with some files that contain data from a specified month.

Other than that they are exactly the same. This is important for it to work, they have the exact same field names in all of the files. If not there are some options which is present in the issues segment.

I connect to the files and use the following settings.

URI: points to the folder where the files are stored. As described before, this could also be a folder on a FTP/SFTP server.

Aggregate Files: Needs to be set to True for the CSV provider to attempt to merge the files.

Specifically for this I also change the default Row Scan Depth to 10000. This is because I had no decimal points in a field for at least 100 rows, but there where one in a row above that number.

When I synchronize the data source it will generate a table called AggregatedFiles.

It contains data from all 10 files with the URI field showing the file name.

Using an asterisk to point at specific files

It can be that you have more than one file type in the folder you do aggregates at, for example, some files with a date part of their names, but also files that do not contain the same fields and data types, like so.

You only want to merge the Orders files, not the others. To do this you change the Include Files field to contain the name and add an asterisk for the part you want to ignore. Like so.

Now it will only aggregate the two order files and not the others. You can also mix this with the Include Subdirectories feature to do it across many folders. It also works for all the special connections like FTP, Azure, and similar.

Issues

If you cannot control that the field names always is the same, you can set it to ignore the field names and skip the 1st row in each file. Such as below.

Another option is to point to a specific Metadata file in the Metadata Discovery URI field and pull the info from there.

The above settings will give this. Where all field names is called Col1 and so forth.

Troubleshooting data types

CData provider attempts to detect data types, but sometimes it may run into errors. There are a few workarounds for quick data extraction:
1. Row Scan Depth
The number of rows to scan when dynamically determining columns for the table.
Set it to 0 if the source file is not too big. Setting this value to 0 (zero) will parse the entire CSV document.
increasing RowScanDepth ( >100 ) may help in some cases.

2. Data Type Override
Manually set up rules to override specific data types.

3. Set Type Detection Scheme = “None” in the provider setup dialog options

This will bring all data as string type (varchar). You may convert it later to other types as needed.

This technique may work in any CData provider that exposes Type Detection Scheme property.

Dates are shown in the wrong timezone

Sometimes your datetime fields will have added or subtracted hours. This is due to it being converted to your time zone. The CData provider expects all dates will come in UTC and add hours according to the settings on your computer. It even knows about summertime.

If you do not want this to happen, you need to add this to the other field

ConvertDateTimeToGMT

Convert the character set

UPDATE - The following are not necessary there now is a Character field you can set.

If your CData release is a legacy version, you can set it like the below still

If you have a CSV file in a specific encoding, you can convert it by adding this to the Other field.

Charset=windows-1252

You can also use Charset=CP1252, which gives the same result and it could be other encodings as well.

The files were made with different culture settings

If your file is created with some specific culture settings you can use the two fields Culture and Client Culture to align it to a specific one.

The list of available options is here.

So if my files are set up with Dutch culture settings and I wanted to convert them to settings for the USA, I would add nl-NL in the Client Culture field and en-US in the Culture field. Then it should align the values after this when you synchronize.

9 Comments

Date Votes

2

Hansen, Mikkel Hecht DK - NOH March 26, 2019 12:26

Hi can you create a connection to a file folder and only include some but not all of the files in the folder, using some kind of wildcard, Regular expression or other? like File_Actuals_*.CSV or like?

Comment actions Permalink
5

Nils Östborg October 11, 2019 20:35

Good question, Mikkel! And also, can you get the file name as part of the data? Other things like file size and OS timestamp? Thanks

Comment actions Permalink
0

Lennaert van den Brink July 26, 2021 07:35

Better late than never: yes, you can give a mask under the option "include files". For example, you can do [include files] = file_*.csv and it will load all CSV files that start with "file_"

Comment actions Permalink
0

Rik Miltenburg March 24, 2022 09:02

Hi,

When i'm reading my csv files with Cdata source it gives me the "SqlDateTime overflow. Must be between 1/1/1753 12:00:00 AM and 12/31/9999 11:59:59 PM." error. Is there a way in Cdata source settings similar to the oracle data soruce settings to convert out of range dates?

Comment actions Permalink
0
Thomas Lind March 24, 2022 09:09
Hi Rik

No, there is no option like that.

You can try two things.
1. Convert the field to a datetime2 using the data type override feature. A datetime2 is allowed to be outside the normal range set by a normal datetime.
2. Or be sure that the field is read as a string and then convert it to an allowed date when it is added to the DSA.
Comment actions Permalink
0

Bart Peek June 22, 2022 12:20

Is it possible to have the Data Source delete the files after the processing has been completed?

Comment actions Permalink
0
Thomas Lind June 22, 2022 12:57
Hi Bart

No, there are no options for post-transfer changes.

My suggestion is to use a bat file that removes or moves the files from the folder you look in.
```
cd "C:\FlatFiles\GL\"
del *.csv
```
You can run a job with a query in a script action. The job can run the batch file. You can then set it to run the script as a post-transfer job.
Edited by Thomas Lind June 22, 2022 13:01

Comment actions Permalink
0

Sierd August 17, 2022 15:20

Hi,

Is it possible to use ConvertDateTimeToGMT inside an RSD file for JSON connections?

Comment actions Permalink
1
Permanently deleted user August 22, 2022 19:57
Hi Sierd,

yes, you can set the ConvertDateTimeToGMT option within the RSD file like this:
```
<api:set attr="Other" value="ConvertDateTimeToGMT=False;" />
```
or
```
<api:set attr="Other" value="ConvertDateTimeToGMT=True;" />
```
respectively.

/Frank
Comment actions Permalink

Please sign in to leave a comment.