My project is to write a tokenizer that can tokenize input C code (an already-written C file given to the system) and identify the keywords, variables, comments, conditionals, etc. (all syntax provided by C). Can anyone help me do this?

What have you done so far? We're not going to help you by doing it for you, but we'll be glad to help with any issues when you prove that you've made some kind of honest attempt.

Okay, thanks Narue. I'm starting to write my translator (I'm still at the initial step). I've already finished scanning the file line by line. I wrote a function to split the string using spaces, then pass each part into my function called FindToken, which identifies what tokens are included in that part of the string and prints them. I want to extend this to cover all of the grammar in C, but when I try, it becomes much more complex. I've tried to reduce that complexity, and I need help doing so.

I've already finished scanning the file line by line.

This is your first problem.

I wrote a function to split the string using spaces.

This is your second problem.

You need to take the stream as a whole and recognize tokens as they come character by character. Why? Because the following is perfectly legal C:

int(a)=12345;a*\
=10;

If you only read line by line, you'll have to somehow recognize tokens split across lines. If you rely on tokens being separated by whitespace, you'll fail on a huge amount of common code, since C doesn't require whitespace between tokens (note `int(a)=12345;` above).

A simple tokenization method is to read from the stream character by character while peeking at the next character, and build a token string. Compare the token string to valid operators, keywords, and literals as you build it, using the peek character to determine if you've completed the longest possible token (C tokenization employs a maximal munch strategy). If the token string isn't an operator, keyword, or literal, treat it as an identifier.

I'd start by incrementally recognizing more and more. So start by recognizing just keywords, then add support for operators, then literals, then identifiers. That way you don't get overwhelmed by all of the different cases.

Thanks Narue, your comment is very valuable to me. I get the point. I will change my already-implemented methods and scan characters one by one, but I have one question: after getting one character, must I search for that character (or character sequence) against all the grammar rules? E.g., if I read "in", must I check those two characters against every rule?
How can I cut down that search cost and space? Are there any heuristic methods for this?

[sorry for my broken English]

after getting one character, must I search for that character (or character sequence) against all the grammar rules

You don't have to exhaustively search against all rules, just the ones that are valid for the current token string. For example, if you read "b", then there's no reason to check the operator or literal rules because you clearly have either an identifier or a keyword.

Thanks, I will try this out. If I have any questions I will ask them in this thread, okay? Please help me with this. Thanks, Narue.