Not clear with strtok

Question

smith32 0 Newbie Poster

12 Years Ago

Hi, here is the file content. I want to use strtok to separate each field.
The problem is that if I use strtok(string, "\",");, the ZIP code (19428) of the first record is misplaced and the ZIP field takes the State value (PA) instead, because of the "," inside the quoted Company field ("Steele, Kendall D Esq"). The second record is OK. Do you have any idea how to fix it? I also tried strtok(string, "'\",\"'\""); to cover the "," and " separators, but that doesn't work either.

FirstName,LastName,Company,Address,City,County,State,ZIP,Phone,Fax,Email,Web  
"Brandon","Ortolano","Steele, Kendall D Esq","100 Front St", "Conshohocken", "Montgomery", "PA","19428", "610-834-2651", "610-834-9404","brandon@ortolano.com","http://www.brandonortolano.com"
"Jonathon","Todeschi","Central Motive Power Inc","210 W Pennsylvania Ave #-650", "Towson", "Baltimore", "MD","21204", "410-828-0316","410-828-3043","jonathon@todeschi.com","http://www.jonathontodeschi.com"

c

Edited 12 Years Ago by smith32


deceptikon 1,790 Code Sniper

12 Years Ago

strtok() doesn't respect CSV quoting rules, and as such it's the wrong solution to your problem. You need to actually parse the CSV format and break it down properly, which means either using a library or writing your own parser (not especially difficult, but not trivial either).
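
To see the problem concretely, here is a minimal sketch that just counts the tokens strtok() produces. The helper name count_strtok_tokens is made up for illustration. Fed the first four quoted fields of the record above, it reports five tokens, because the comma inside "Steele, Kendall D Esq" is treated like any other delimiter.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok() produces for a line.
   Works on a private copy because strtok() overwrites delimiters with '\0'. */
size_t count_strtok_tokens(const char *line, const char *delims)
{
    size_t count = 0;
    char *copy = malloc(strlen(line) + 1);
    char *tok;

    if (!copy) {
        return 0;
    }

    strcpy(copy, line);

    /* strtok() has no notion of quoting: every delimiter splits */
    for (tok = strtok(copy, delims); tok != NULL; tok = strtok(NULL, delims)) {
        ++count;
    }

    free(copy);

    return count;
}
```

Calling it with "," as the delimiter set on a line of four quoted fields, one of which contains a comma, gives a count that is one too high, which is exactly the shift seen in the ZIP column.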

rubberman 1,355 Nearly a Posting Virtuoso

12 Years Ago

I seem to be agreeing with deceptikon a lot lately! :-) That said, strtok() is not the be-all and end-all of string parsing. I tried it a long time ago when writing an object serializer/de-serializer and found out its limitations, ending up writing my own tokenizer that could handle such things as nested quotes, etc.

Edited 12 Years Ago by rubberman

rubberman 1,355 Nearly a Posting Virtuoso

12 Years Ago

Dec - per your signature - water, water, everywhere! :-)

deceptikon 1,790 Code Sniper

12 Years Ago

that code is a bit more advanced

CSV is a bit more complicated, so the code to handle it must also be more complicated.

what if it's just a regular csv file, no quotes.

Let's make one thing clear: when you say CSV you imply a format that includes the following restrictions (based on RFC 4180):

The first line may optionally be a header.
Each record is placed on a separate line.
Each record contains one or more fields separated by a single comma.
Empty fields are represented by two adjacent commas.
Each field may optionally be surrounded by double quotes.
If a field contains a newline, comma, or double quote, then the field must be surrounded by double quotes.
Double quotes inside of a field must be escaped by doubling them (eg. "allowed ""'s inside").

So when you say "regular" CSV, the above is what everyone will assume. In my experience the most common CSV feature that's not included is support for embedded newlines.
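
The quoting rules in the list are easiest to see from the writing side. This is only a rough sketch, and csv_quote_field is a name I made up: it wraps a field in double quotes when it contains a comma, double quote, or line break, and doubles any embedded quotes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of field, quoted only if the
   rules above require it. The caller frees the result. */
char *csv_quote_field(const char *field)
{
    int needs_quotes = strpbrk(field, ",\"\r\n") != NULL;
    size_t len = strlen(field);
    /* Worst case: every character is a quote (doubled), plus two outer quotes and '\0' */
    char *out = malloc(2 * len + 3);
    size_t i, j = 0;

    if (!out) {
        return NULL;
    }

    if (needs_quotes) {
        out[j++] = '"';
    }

    for (i = 0; i < len; i++) {
        if (field[i] == '"') {
            /* Escape an embedded quote by doubling it */
            out[j++] = '"';
        }

        out[j++] = field[i];
    }

    if (needs_quotes) {
        out[j++] = '"';
    }

    out[j] = '\0';

    return out;
}
```

A reader has to undo exactly these steps, which is why splitting on commas alone is not enough for this format.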

You can say "simple" CSV, or CSV "without quoting rules", or even "comma delimited" to make it clear that you only support comma delimiters and none of the special rules. However, this also introduces the restriction that you cannot have embedded commas in a field. In my experience, this is often prohibitive to the point where I prefer pipe characters ('|') for simple delimited files because pipes are less likely to be in a field.

what if it's just a regular csv file, no quotes. i have 5 cols, by 10 rows (with only names in each field)? wouldn't using the comma as the delimiter be enough?

In that case, maybe. I say maybe because it depends on how you want to handle empty fields. strtok() ignores empty fields, while most of the time in a delimited file you want to extract and process them as blank strings. So once again you can't use strtok() if you can expect empty fields.
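
If empty fields matter, one option is to split by hand with strchr() instead. A minimal sketch, assuming a simple delimited line with no quoting rules (split_keep_empty is a hypothetical name, not a standard function):

```c
#include <string.h>

/* Hypothetical helper: splits line in place on delim, keeping empty fields.
   Stores up to max_fields pointers in fields and returns how many were stored.
   delim must not be '\0'. */
size_t split_keep_empty(char *line, char delim, char **fields, size_t max_fields)
{
    size_t n = 0;
    char *start = line;

    /* Drop the line break that fgets() leaves behind */
    line[strcspn(line, "\r\n")] = '\0';

    while (n < max_fields) {
        char *end = strchr(start, delim);

        /* Every delimiter closes a field, even if nothing came before it */
        fields[n++] = start;

        if (!end) {
            break;
        }

        *end = '\0';
        start = end + 1;
    }

    return n;
}
```

For the line "a,,c," this stores four fields (two of them blank strings), where strtok() with "," would return only "a" and "c".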

smith32 commented: That helped me understand CSV files and inspired me to learn the CSV file format. Thank you a lot. +0


deceptikon 1,790 Code Sniper Team Colleague Featured Poster · Answer 1 · 2013-03-27T20:02:41+00:00

I seem to be agreeing with deceptikon a lot lately! :-)

I'm very agreeable. ;)

saukembears 0 Newbie Poster · Answer 2 · 2013-03-28T03:41:29+00:00

Hello. I'm rewriting my post because I read the reply wrong. I guess strtok can read a CSV file, but I'm having a hard time understanding why I only get the first column of my CSV file (5 cols, 10 rows) with this piece of code. I've read this website plenty of times and it's helped me out, but now I decided to make a profile and ask something for myself. I've tokenized other files before, but not a CSV file. I thought it would be fairly easy to look for a comma, but I was wrong, just like everything else in C turns out to be for me.

char *token ;
char * del = "," ;

while( fgets(buffer, sizeof(buffer), fp) != NULL ) {    
    token = strtok(buffer, del) ;   
        printf("%s ", token) ;
        token = strtok(NULL, ",") ;
}

deceptikon 1,790 Code Sniper Team Colleague Featured Poster · Answer 3 · 2013-03-28T13:14:26+00:00

I guess strtok can read a csv file

Only if there are no embedded delimiters in a field. strtok() isn't smart enough to determine that a field is quoted and ignore a delimiter inside that quoted field, nor is it smart enough to recognize escaped quotes.

As I mentioned before, you need to take more care in parsing the format, because it's not as simple as splitting on commas. For example:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

char **split_csv(const char *line, char delim)
{
    char **buf = (char**)malloc(2 * sizeof *buf);
    size_t pos, n = 0, len = 0;
    int inquote = 0;

    if (!buf) {
        return NULL;
    }

    buf[n] = NULL;

    /* Trim leading whitespace on the first field */
    for (pos = 0; line[pos] == ' ' || line[pos] == '\t'; pos++) {
        /* The loop header does all the work */
    }

    for (;; pos++) {
        char ch = line[pos];

        if (!inquote && len == 0 && ch == '"') {
            /* Starting a quoted field */
            inquote = 1;
        }
        else if (inquote && ch == '"') {
            if (line[pos + 1] != '"') {
                /* Terminating a quoted field */
                inquote = 0;

                /* Trim trailing whitespace on the field */
                while (line[pos + 1] && line[pos + 1] == ' ' || line[pos + 1] == '\t') {
                    ++pos;
                }
            }
            else {
                /* The quote was escaped, so keep one quote and skip its twin */
                buf[n] = (char*)realloc(buf[n], (++len + 1) * sizeof *buf[n]);
                buf[n][len - 1] = ch;
                ++pos;
            }
        }
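        else if (!ch || (!inquote && (ch == delim || ch == '\r' || ch == '\n'))) {
            /* Finalize the current field and prepare for the next field.
               End of string always ends the record, even inside an unterminated quote. */
            if (!buf[n]) {
                /* Empty field: nothing was appended, so allocate room for the terminator */
                buf[n] = (char*)malloc(1);
            }

            buf = (char**)realloc(buf, (++n + 1) * sizeof *buf);
            buf[n - 1][len] = '\0';
            buf[n] = NULL;
            len = 0;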

            if (ch == delim) {
                /* Trim leading whitespace on the next field */
                while (line[pos + 1] && line[pos + 1] == ' ' || line[pos + 1] == '\t') {
                    ++pos;
                }
            }
            else {
                /* Unquoted line break means we're done */
                break;
            }
        }
        else {
            buf[n] = (char*)realloc(buf[n], (++len + 1) * sizeof *buf[n]);
            buf[n][len - 1] = ch;
        }
    }

    buf[n] = NULL;

    return buf;
}

int main(void)
{
    FILE *in = fopen("test.txt", "r");
    int skip_header = 1;

    if (in) {
        char line[BUFSIZ];
        char **fields;
        int rows = 0;
        int i;

        while (fgets(line, sizeof line, in)) {
            if (rows++ == 0 && skip_header) {
                continue;
            }

            fields = split_csv(line, ',');

            if (fields) {
                printf("Line #%d\n", rows - skip_header);

                for (i = 0; fields[i]; i++) {
                    printf("\t'%s'\n", fields[i]);
                    free(fields[i]);
                }

                free(fields);
            }
        }

        fclose(in);
    }

    return 0;
}

Note that this was a quick and dirty implementation, so I won't guarantee correctness. And I'm reasonably sure it's slow, given how memory is being allocated. Making that faster would be an amusing exercise, but would drastically complicate the code and hide the underlying CSV algorithm.

I'd recommend finding a good CSV parsing library and using that rather than ad hoc parsing, because it's easy to get wrong.

saukembears 0 Newbie Poster · Answer 4 · 2013-03-28T16:53:37+00:00

Thank you very much for the response. That code is a bit more advanced, but I will study it for a while. What if it's just a regular CSV file, no quotes? I have 5 cols by 10 rows (with only names in each field). Wouldn't using the comma as the delimiter be enough?

My final goal is to read the file and then store the fields in an array or a data structure. Since they are all the same kind of value (names), I was planning on using an array. Then I'll need to sort each column in ascending order. I have a plan of using malloc to clear the memory after I sort each column, but I can't do any of that without reading the file. So far I can only read the first column. I'm not sure what you mean by finding a good CSV parsing library, but I'm about to do some Google searching. Thanks again.

saukembears 0 Newbie Poster · Answer 5 · 2013-03-28T19:58:20+00:00

That is a lot of complicated information. Very important. I'm going to bookmark this thread.
Here is the data information for my CSV file. I should have posted this in my original post. In previous assignments we've been reading regular text files, and I was able to tokenize those assignments. Again, thank you for taking the time to help me. I've been all over the net trying to look up info to complete the assignment. I've attached what I've done so far. I apologize for the code not showing up in the proper format; I'm trying to figure that out now. I'm not trying to be a hassle. If it looks like I'm heading in the right direction then I'll do what I can these next two days to figure it out. I really do appreciate the help.

each file consists of a comma-delimited list of names (there are no other visible commas, quotes etc)
number of lines in the file will be exactly 10
number of names per line will be exactly 5
no name will have more than 9 letters

pseudocode :
1. determine the number of fields in each line
2. create a structure or array to hold each field
3. create a file object that will read in the data.
4. create a character buffer large enough to hold one line of the file at one time.
5. open the file, read the file
6. parse the file using the strtok function
7. for subsequent reads from the same line, pass strtok a NULL parameter instead of the buffer string that was read in before.

/*this code kind of gives me the output i need. is there a better way to do this? after this i need to store whats in the output into an array or structure */ 
int main(void) {

    FILE* fp ;
    char filename[] = "homeworkFile.csv" ;
    char buffer[51] ;
    // char [10][5] ; // array to hold the names
    char *token ;
    char *del = "," ;

    if( (fp = fopen(filename, "r")) == NULL ) {
        printf("unable to open %s for reading\n", filename) ;
        exit(1);
    }
    while( fgets(buffer, sizeof(buffer), fp) != NULL ) {
             // printf("read ==> %s", buffer) ;  // this line prints the line with commas

             // these actually print each field without the commas. i think this should be enough
             // to populate an array or structure
             token = strtok(buffer, del) ;
                 printf("%-10s", token) ;
             token = strtok(NULL, del) ;
                 printf("%-10s", token) ;
             token = strtok(NULL, del) ;
                 printf("%-10s", token) ;
             token = strtok(NULL, del) ;
                 printf("%-10s", token) ;
             token = strtok(NULL, del) ;
                 printf("%-10s\n", token) ;
    }

    fclose( fp ) ;

    return 0 ;
}

deceptikon 1,790 Code Sniper Team Colleague Featured Poster · Answer 6 · 2013-03-28T20:00:27+00:00

this code kind of gives me the output i need. is there a better way to do this?

Always start with "working", then work on "better". But aside from tossing those several calls to strtok() into a loop, I'd say you're on the right track. Do you have any concerns about storing the strings in a record?
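
A rough sketch of what the loop could look like. The names tokenize_line and print_line are hypothetical, MAX_FIELDS comes from the "exactly 5 names per line" rule, and passing ",\r\n" as the delimiter set is my own choice so that the last name doesn't keep the line break from fgets().

```c
#include <stdio.h>
#include <string.h>

#define MAX_FIELDS 5   /* from the assignment: exactly 5 names per line */

/* Hypothetical helper: stores up to max_tokens strtok() tokens from line and
   returns the count. line is modified in place. */
size_t tokenize_line(char *line, const char *delims, char **tokens, size_t max_tokens)
{
    size_t n = 0;
    char *tok;

    /* First call passes the buffer, later calls pass NULL to continue */
    for (tok = strtok(line, delims); tok != NULL && n < max_tokens; tok = strtok(NULL, delims)) {
        tokens[n++] = tok;
    }

    return n;
}

/* Prints one line of names in fixed-width columns, like the original printf() calls */
void print_line(char *line)
{
    char *tokens[MAX_FIELDS];
    size_t i, n = tokenize_line(line, ",\r\n", tokens, MAX_FIELDS);

    for (i = 0; i < n; i++) {
        printf("%-10s", tokens[i]);
    }

    putchar('\n');
}
```

Inside the fgets() loop you would call print_line(buffer). Because the loop stops when strtok() returns NULL, a short line prints fewer columns instead of handing a null pointer to printf().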

saukembears 0 Newbie Poster · Answer 7 · 2013-03-28T22:59:30+00:00

I'm glad I'm on the right track. I'm not sure about storing the strings in a record. C is definitely a challenge.
I need to use malloc to dynamically store the names and I'm doing some reading on that now. Do you think I should use an array or a structure to store the records? I'm thinking I'm going to have to use strncpy in order to get the strings inside of a record.

/*what i've written down so far in regards to setting up an array.*/
#include <stdio.h>

#define rows 10 
#define cols 5

int main(void) {
    int i , k ;
    char **data ;

    data = malloc(rows * sizeof(char*)) ;

    for(k = 0; k < cols; k++) {
        data[i] = malloc(rows * sizeof(char) ) ;

        for(i = 0; i < rows; i++) {
            // im not sure what the heck im doing right now to be honest
        }


    }

}


deceptikon 1,790 Code Sniper Team Colleague Featured Poster · Answer 8 · 2013-03-29T12:19:38+00:00

By "record" I was thinking more of a structure-based solution. You'd read a line from the flat file, break it down into fields, then populate a structure with validation rules to determine if the line was legit or not. For example, using this test flat file:

12345,Joe,Blow,19,3.1
55326,Jane,Doe,20,4.0
47200,Stu,Bernstein,20,2.5
78309,Foo,McDiddles,19,1.0

You might parse and display it like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct student {
    int    id;
    char   first[10];
    char   last[50];
    int    age;
    double gpa;
} student;

int deserialize(const char *s, student *record, char *delim)
{
    char *temp = (char*)malloc(strlen(s) + 1);
    int rc = 0;

    if (temp) {
        char *tok;

        /* make a working copy because strtok modifies the string */
        strcpy(temp, s);

        /* Reject names that would overflow the fixed-size buffers before copying */
        rc = ((tok = strtok(temp, delim)) != NULL && sscanf(tok, "%d", &record->id) == 1) &&
             ((tok = strtok(NULL, delim)) != NULL && strlen(tok) < sizeof record->first &&
              strcpy(record->first, tok)) &&
             ((tok = strtok(NULL, delim)) != NULL && strlen(tok) < sizeof record->last &&
              strcpy(record->last, tok)) &&
             ((tok = strtok(NULL, delim)) != NULL && sscanf(tok, "%d", &record->age) == 1) &&
             ((tok = strtok(NULL, delim)) != NULL && sscanf(tok, "%lf", &record->gpa) == 1);

        free(temp);
    }

    return rc;
}

int main(void)
{
    FILE *in = fopen("test.txt", "r");

    if (in) {
        student records[10];
        char line[BUFSIZ];
        size_t i, n;

        for (n = 0; n < 10 && fgets(line, sizeof line, in); n++) {
            if (!deserialize(line, &records[n], ",")) {
                break;
            }
        }

        for (i = 0; i < n; i++) {
            printf("Student #%d\n", records[i].id);
            printf("\t%s, %s (%d)\n", records[i].last, records[i].first, records[i].age);
            printf("\tGPA: %1.1f\n", records[i].gpa);
        }

        fclose(in);
    }

    return 0;
}
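
For the names-only file described earlier in the thread (10 lines, 5 names per line, at most 9 letters each), a fixed-size array is probably enough and malloc isn't strictly needed. A rough sketch of the sorting step, assuming "sort each column" means each column is sorted independently of the others; sort_column and compare_names are names made up for this example.

```c
#include <stdlib.h>
#include <string.h>

#define ROWS 10      /* lines per file, from the assignment */
#define COLS 5       /* names per line */
#define NAME_LEN 10  /* at most 9 letters plus '\0' */

/* qsort() comparator for fixed-size name buffers */
static int compare_names(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Hypothetical helper: sorts the first `rows` entries of one column in
   ascending order. Assumes rows <= ROWS and col < COLS. */
void sort_column(char table[ROWS][COLS][NAME_LEN], size_t col, size_t rows)
{
    char column[ROWS][NAME_LEN];
    size_t r;

    /* Copy the column out so qsort() sees one contiguous array */
    for (r = 0; r < rows; r++) {
        strcpy(column[r], table[r][col]);
    }

    qsort(column, rows, sizeof column[0], compare_names);

    /* Copy the sorted names back into place */
    for (r = 0; r < rows; r++) {
        strcpy(table[r][col], column[r]);
    }
}
```

Each parsed token would be copied into table[row][col] as the file is read, after checking that its length is below NAME_LEN, and then sort_column would be called once per column.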

smith32 0 Newbie Poster · Answer 9 · 2013-04-01T03:06:22+00:00

Thank you all. I've solved the problem. Sorry to saukembears, I have to mark this problem as solved. I am not sure whether the discussion can still continue or not. If it can, feel free to continue. It can be good for others, too.