reading tab delimited file in C

Question

srao1 0 Newbie Poster

10 Years Ago

I am trying to read in a large tab-delimited file which has 16 columns and about five lakh (500,000) rows. The first column is a date string. The second contains integers, and so on. However, I am not able to read the values into the arrays correctly. Also, how can one ignore the three header lines of the file?

FILE *ptr_file;
char buf[1000000];
ptr_file =fopen(filename,"r");

fscanf(ptr_file,"%s %f %s %f %s %f %s %f %s %f %s %f %s %f %s",Date1,&Load1,QCLoad,&Tamb1,QCTamb,&TOT1,QCTOT,&WindA1,QCWindA1,&WindB1,QCWindB1,&WindC1,QCWindC1,&Tamb2,QCTamb2);

mathijs 0 Light Poster

10 Years Ago

To skip the header lines, just read them in but don't do anything with them. Or look into fseek() to move the file pointer past them.

To read in the columns you seem to be on the right track. If they are separated by tabs, you should probably use \t (the tab character) to represent them. You don't know what string %s will catch, but you could use \t as a delimiter.

Ancient Dragon 5,243 Achieved Level 70

10 Years Ago

You can easily skip the first line by calling fgets() to read it, then just don't use what was read.

Do any of the strings contain spaces? fscanf with %s can't read the spaces. If there are no spaces within the strings then the fscanf line you posted will probably work, because tabs are considered white space. %s stops reading at the first white space character (space, tab, newline, vertical tab, form feed, or carriage return).

Ancient Dragon 5,243 Achieved Level 70

10 Years Ago

fgets() is incorrect. See the parameters here. If you are not sure about parameters you need to look them up -- just google "c++ fgets" and you will find the link I posted.

What are A, B and C?? fgets() reads an entire line, so unless you want to skip 3 lines it should only appear once.

deceptikon 1,790 Code Sniper

10 Years Ago

I wouldn't recommend using %s unless you can guarantee that the fields won't contain embedded whitespace. Further, you should always use a field width for string specifiers in scanf to avoid buffer overflow.
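As an illustration of the field-width advice, here is a minimal sketch. The helper name read_word and the 20-byte buffer are assumptions made for this example and do not come from the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copies the first whitespace-delimited word of `text`
   into `out`, which must hold at least 20 bytes. The width 19 leaves room
   for the terminating '\0', so a longer word is truncated instead of
   overflowing the buffer. Returns 1 if a word was read, 0 otherwise. */
int read_word(const char *text, char out[20])
{
    assert(text != NULL && out != NULL);
    return sscanf(text, "%19s", out) == 1;
}
```

The width in the format string is always one less than the size of the destination array, because scanf adds the terminator on its own.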

Ignoring the possibility of quoted fields that contain embedded tabs, it's straightforward to read a line with fgets and then parse it with sscanf:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *in = fopen("input.txt", "r");

    if (in != NULL)
    {
        char line[1024];

        while (fgets(line, sizeof line, in) != NULL)
        {
            char field[1024];
            int offset = 0;

            // Break down the line based on tabs ('\n' included because fgets retains a newline if it's present)
            while (sscanf(line + offset, "%1023[^\t\n]", field) == 1)
            {
                puts(field);

                offset += strlen(field);

                // Safety check to avoid stepping off the end of the array
                if (line[offset] != '\0')
                {
                    ++offset;
                }
            }
        }

        fclose(in);
    }

    return 0;
}

Note the use of a scanset in scanf to look explicitly for tabs.

srao1 0 Newbie Poster · Answer 1 · 2014-01-12T01:07:52+00:00

Thank you AncientDragon and mathijs. I did what you suggested to skip the header lines. Here is the code

            FILE *ptr_file;
            char buf[1000000];

            ptr_file =fopen(filename,"r");
            fgets(A, ptr_file);
            fgets(B, ptr_file);
            fgets(C, ptr_file);
fscanf(ptr_file,"%s %f %s %f %s %f %s %f %s %f %s %f %s %f %s",Date1,&Load1,QCLoad,&Tamb1,QCTamb,&TOT1,QCTOT,&WindA1,QCWindA1,&WindB1,QCWindB1,&WindC1,QCWindC1,&Tamb2,QCTamb2);

I do not understand what it is that I am missing. Is there a mistake in my declaration? Do you suggest that I replace the %s? The problem is that the number of characters in each field is not fixed. In fact, columns like QCLoad and QCTamb may even be empty.

srao1 0 Newbie Poster · Answer 2 · 2014-01-12T02:41:48+00:00

Yes, there are three header lines that are of no use to me, and hence I called it three times. Also, I had referred to the link while writing this.
char * fgets ( char * str, int num, FILE * stream );
Since I wanted it to read till new line character, I did not use num and I am storing it in A,B,C which I do not intend to use.

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 3 · 2014-01-12T04:08:33+00:00

You can never skip parameters in C language -- they are always required. In this case num is the maximum number of characters that can be put into the buffer. If the newline is encountered in the file before that then the newline will appear in the buffer. If the newline does not appear in the file then the buffer will not contain a newline.

How are A, B, and C declared? You really don't even need them; just reuse buf.

Are those lines really 1,000,000 characters??? That's a whopping big line. Reduce the size of buf to a more reasonable number.

            FILE *ptr_file;
            char buf[256];

            ptr_file =fopen(filename,"r");
            fgets(buf, sizeof(buf), ptr_file);
            fgets(buf, sizeof(buf), ptr_file);
            fgets(buf, sizeof(buf), ptr_file);
fscanf(ptr_file,"%s %f %s %f %s %f %s %f %s %f %s %f %s %f %s",Date1,&Load1,QCLoad,&Tamb1,QCTamb,&TOT1,QCTOT,&WindA1,QCWindA1,&WindB1,QCWindB1,&WindC1,QCWindC1,&Tamb2,QCTamb2);
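If a header line could be longer than the buffer, a single fgets() call only consumes part of it. The following is a minimal sketch of one way to handle that; the helper name skip_lines is an assumption for illustration and is not from the posts above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: discards `count` whole lines from `fp`.
   Returns the number of lines actually skipped (fewer at end of file). */
int skip_lines(FILE *fp, int count)
{
    char buf[256];
    int skipped = 0;

    assert(fp != NULL);
    while (skipped < count && fgets(buf, sizeof buf, fp) != NULL)
    {
        /* A chunk without '\n' is either part of a line longer than buf
           or the last line of a file with no trailing newline. Keep
           reading until the newline or end of file is reached. */
        while (strchr(buf, '\n') == NULL)
        {
            if (fgets(buf, sizeof buf, fp) == NULL)
                break;
        }
        ++skipped;
    }
    return skipped;
}
```

With this, the three fgets() calls would become skip_lines(ptr_file, 3), and the return value tells you whether the file really had three header lines.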

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 4 · 2014-01-12T15:52:49+00:00

I tested your code, and as it turns out "%1023[^\t\n]" is no different than "%1023s" because with %s scanf() stops converting when it encounters the first white space (space, tab, newline, vertical tab, form feed, or carriage return).

In either event it still doesn't work right because offset is only being incremented by 1, which produces undesirable results. offset has to be incremented to the beginning of the next field.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

void foo(const char* format)
{
    int offset = 0;
    char field[1024];
    char line [] = "now\tis the\t0123\t   time\n";
    while (sscanf(line + offset, format, field) == 1)
    {
        puts(field);
        if (line[offset] != '\0')
        {
            offset += strlen(field);
            while (isspace(line[offset]))
                ++offset;
        }
    }


}

int main()
{

    foo("%1023[^\t\n]");
    printf("\n\n\n");
    foo("%1023s");

}

and the results are

now
is the
0123
time



now
is
the
0123
time
Press any key to continue . . .

deceptikon 1,790 Code Sniper Team Colleague Featured Poster · Answer 5 · 2014-01-12T19:01:27+00:00

I tested your code, and as it turns out "%1023[^\t\n]" is no different than "%1023s" because with %s scanf() stops converting when it encounters the first white space (space, tab, newline, vertical tab, form feed, or carriage return).

Not quite, as your test results clearly show. ;) %s stops on all whitespace, which includes more than just tab and newline. Granted that those two will be the most common by far, but in general it's not safe to assume the file won't contain other whitespace sequences.

In either event it still doesn't work right because offset is only being incremented by 1, which produces undesirable results. offset has to be incremented to the beginning of the next field.

If the format is tab delimited, trimming non-tab whitespace from a field is the wrong behavior. It's another matter entirely if you need to support empty fields (i.e., multiple adjacent delimiters). In that case a more intelligent parsing method would be required; scanf isn't well suited to complex delimited formats.
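One possible shape for such a parser, as a rough sketch only: split the line in place on tab characters with strchr, so that adjacent tabs yield empty fields and spaces inside a field are left alone. The name split_tabs and its signature are assumptions for illustration, and quoted fields with embedded tabs are still not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits `line` in place on tabs, storing a pointer to
   each field in `fields`. Empty fields are kept as empty strings and spaces
   inside a field are preserved. Returns the number of fields stored, which
   is at most `max_fields`; any remaining text stays in the last field. */
size_t split_tabs(char *line, char *fields[], size_t max_fields)
{
    size_t n = 0;
    char *p = line;

    assert(line != NULL && fields != NULL);
    line[strcspn(line, "\r\n")] = '\0';   /* drop the newline kept by fgets */

    while (n < max_fields)
    {
        char *tab = strchr(p, '\t');

        fields[n++] = p;
        if (tab == NULL || n == max_fields)
            break;
        *tab = '\0';      /* terminate this field */
        p = tab + 1;      /* the next field starts after the tab */
    }
    return n;
}
```

Each line read with fgets can be passed straight to split_tabs; numeric columns can then be converted with strtod or atof, and an empty quality-control column simply shows up as an empty string.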

srao1 0 Newbie Poster · Answer 6 · 2014-01-21T16:05:11+00:00

Hi,

I decided to write code that avoids stopping on white space. Here is the C code. Can anyone help me convert it to a .mex file? I am having difficulty declaring the structure as an output.

#include <stdio.h>
#include<string.h>
#include<stdlib.h>
#define NAME "Cooley3_dynxfmr_20100501000000_20100930235930.txt"
#define INPUT 1

int main(void)
//int mexFunction(int INPUT, char *NAME)
{
//Structure
typedef struct _col{
    char date[20];
    float mva;
    char qc_load[2];    // one flag character plus the terminating '\0' written by strcpy
    float air;
    char qc_air[2];
    float oil;
    char qc_oil[2];
    float wind_a;
    char qc_wind_a[2];
    float wind_b;
    char qc_wind_b[2];
    float wind_c;
    char qc_wind_c[2];
#if INPUT
    float tamb1;
    char qc_tamb1[2];
#endif
} col;




    char buf[1024];
    char temp[20];
    int count = 0;
    int i, j, k, l;

//open file to count elements   
    FILE *ptr_file = fopen(NAME,"r");
    if (ptr_file != NULL)
    {
//skip first 3 lines    
    fgets(buf, sizeof(buf), ptr_file);
    fgets(buf, sizeof(buf), ptr_file);
    fgets(buf, sizeof(buf), ptr_file);
//start counting no. of elements    
    while(fgets(buf, sizeof(buf), ptr_file) != NULL)
        count++;
    fclose(ptr_file);
    }
    col *elem = malloc(count*sizeof(col));
//open file again for storing elements columnwise
    ptr_file =fopen(NAME,"r");
    if (ptr_file != NULL)
    {
//skip first 3 lines    
    fgets(buf, sizeof(buf), ptr_file);
    fgets(buf, sizeof(buf), ptr_file);
    fgets(buf, sizeof(buf), ptr_file);  
//start collecting data 
    for(i=0;i<count;i++){    //increment line
        //get line
        fgets(buf, sizeof(buf), ptr_file);
        j=0;
        k=0;
        //extract first word
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].date, temp);

        //extract second word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        elem[i].mva = atof(temp);   

        //extract third word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].qc_load, temp);  

        //extract fourth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        elem[i].air = atof(temp);


        //extract fifth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].qc_air, temp);

        //extract sixth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        elem[i].oil = atof(temp);

        //extract seventh word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].qc_oil, temp);

        //extract eighth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        elem[i].wind_a = atof(temp);

        //extract ninth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].qc_wind_a, temp);

        //extract tenth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        elem[i].wind_b =atof(temp);

        //extract eleventh word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].qc_wind_b, temp);

        //extract twelfth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        elem[i].wind_c = atof(temp);

        //extract thirteenth word
        k=0;
#if !INPUT      
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\n'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].qc_wind_c, temp);
#endif      
#if INPUT   

//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].qc_wind_c, temp);

        //extract fourteenth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\t'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        elem[i].tamb1 = atof(temp); 

        //extract fifteenth word
        k=0;
//      strcpy(temp, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0");
        while(buf[j] != '\n'){
            temp[k] = buf[j];
            j++;
            k++;
        }
        temp[k] = '\0';
        j++;
        strcpy(elem[i].qc_tamb1, temp);
#endif
    }

    fclose(ptr_file);
    }

printf("%s\t%f\t%s\t%f\t%f\t%s\n", elem[count-1].date, elem[count-1].mva, elem[count-1].qc_load, elem[count-1].air, elem[count-1].wind_a, elem[count-1].qc_wind_c); 
free(elem);
return 0;
}
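The fifteen near-identical extraction loops above could be folded into a single helper. The sketch below is one way to do it; the name next_field is an assumption for illustration. Unlike the loops above, it also stops at the end of the string and never writes past the destination buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies the field starting at buf[*pos] into `out`
   (capacity `cap`, at least 1), stopping at a tab, newline, or end of
   string. Characters that do not fit are dropped. On return *pos is just
   past the delimiter, ready for the next call. Returns the number of
   characters stored in `out`. */
size_t next_field(const char *buf, size_t *pos, char *out, size_t cap)
{
    size_t k = 0;

    assert(buf != NULL && pos != NULL && out != NULL && cap > 0);
    while (buf[*pos] != '\t' && buf[*pos] != '\n' && buf[*pos] != '\0')
    {
        if (k + 1 < cap)
            out[k++] = buf[*pos];
        ++*pos;
    }
    out[k] = '\0';
    if (buf[*pos] != '\0')
        ++*pos;             /* step over the tab or newline */
    return k;
}
```

Each block in the loop body then shrinks to a call such as next_field(buf, &j, temp, sizeof temp) followed by the atof or strcpy for that column, with j declared as size_t. An empty column returns 0 and leaves an empty string in the output buffer.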

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 7 · 2014-01-21T16:36:35+00:00

Can anyone help me converting it to a .mex file ?

What is a .mex file???

iamthwee · Answer 8 · 2014-01-21T22:49:42+00:00

^^

http://www.mathworks.co.uk/help/matlab/matlab_external/table-of-mex-examples.html

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 9 · 2014-01-21T23:11:46+00:00

That doesn't explain what a .mex file is.

iamthwee · Answer 10 · 2014-01-21T23:20:08+00:00

MEX stands for MATLAB Executable. A MEX-file provides an interface between MATLAB and subroutines written in C, C++ or Fortran. When compiled, MEX files are dynamically loaded and allow non-MATLAB ...

Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 11 · 2014-01-22T00:17:02+00:00

Oh, normally something that contains the dot is a file extension, such as .txt, .dat, .mex, etc. I thought you were talking about a file format. I never studied MATLAB so I can't help you.