Lexer- Tokenizer problem

Question

n.aggel 13 Posting Whiz in Training

17 Years Ago

Hi, I use the flex tool (http://www.gnu.org/software/flex/manual/) to generate a tokenizer, but I have the following problem, which has to do with the way flex tokenizes the input:

FILE : flex.l

%{
		#define WEB 0
		#define SPACE 1
		#define STRING 2	
%}

string_component [0-9a-zA-Z \t\.!#$%^&()*@_]

%%

"daniweb"               {return WEB;}
[ \t\n]                 {return SPACE;}
{string_component}+     {return STRING;}

%%

#include <iostream>
			
using namespace std;
		
int main()
{	
	cout<<yylex()<<endl;
	cout<<yylex()<<endl;

	return 0;
}

int yywrap(void){return 1;}

Example file:

test_string daniweb

What I want is to have the above string tokenized as
STRING SPACE WEB
Instead, flex recognizes the whole line as one STRING, because it tries to match the longest input.

How can I fix this problem?
All ideas are welcome.

PS: to compile:

flex flex.l
g++ lex.yy.c
./a.out <example

c++


iamthwee

17 Years Ago

What is this Flex? some kinda regular expression library or something. Do you even need it or can your problem be simplified?

iamthwee

17 Years Ago

Um ok, please explain this:

string_component [0-9a-zA-Z \t\.!#$%^&()*@_]

and what you think it does?

iamthwee

17 Years Ago

First you gotta know what your regular expressions are doing.

To me, string_component [0-9a-zA-Z \t\.!#$%^&()*@_] and the example you have given are contradictory, like Salem mentioned.


Salem 5,265 Posting Sage · Answer 1 · 2007-08-28T22:13:13+00:00

Your string component matches spaces, and now you're complaining that you don't want to match spaces.

You can't have it both ways.

n.aggel 13 Posting Whiz in Training · Answer 2 · 2007-08-29T15:49:57+00:00

Your string component matches spaces, and now you're complaining that you don't want to match spaces.
You can't have it both ways.

Thank you for answering (apparently, few people have read the post...)

Yes, you are right, it seems that I can't have it both ways. But what I want to use flex for is the following:

Recognize some specific keywords (in the simplified example I provided, the keyword was "daniweb") and recognize everything else as a string. Any ideas on how I can do that?

PS: maybe start conditions could help me solve the problem? (I haven't understood them so well...)
PS2: in the beginning I thought it wouldn't be that difficult, but I was wrong...

n.aggel 13 Posting Whiz in Training · Answer 3 · 2007-08-30T00:04:28+00:00

What is this Flex? some kinda regular expression library or something. Do you even need it or can your problem be simplified?

Flex

Flex (The Fast Lexical Analyzer)
Flex is a fast lexical analyser generator. It is a tool for generating programs that perform pattern-matching on text. Flex is a non-GNU free implementation of the well known Lex program.

http://www.gnu.org/software/flex/
http://flex.sourceforge.net/

nedrocks 0 Newbie Poster · Answer 4 · 2007-08-30T02:40:07+00:00

There's a way to set precedence of regex's in flex. I don't remember the exact syntax, but you should put it before your catchall regex that you have defined there.

n.aggel 13 Posting Whiz in Training · Answer 5 · 2007-08-30T21:57:50+00:00

There's a way to set precedence of regex's in flex. I don't remember the exact syntax, but you should put it before your catchall regex that you have defined there.

I haven't seen what you mention in the manual...

Unfortunately I haven't found the solution. I worked around my problem by changing the grammar (i.e. the bison file), and finally I handed in the project. When I find the time, I will try to find a solution using start conditions.

vijayan121 1,152 Posting Virtuoso · Answer 6 · 2007-08-30T23:53:19+00:00

Using Boost.Spirit may be much easier: http://www.boost.org/libs/spirit/doc/quick_start.html

#include <boost/spirit/core.hpp>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>   // std::ostream_iterator
#include <boost/assign.hpp>
using namespace std ;
using namespace boost ;
using namespace boost::spirit ;
using namespace boost::assign ;

struct parse_it
{ 
  void operator() ( const string& str ) const
  {
    vector<string> tokens ;
    const char* cstr = str.c_str() ;
    size_t n = 0 ;
    while( n < str.size() )
      n += parse( cstr + n,
                  (+space_p) [  push_back_a( tokens, "SPACE" ) ] |
                  str_p("daniweb") [ push_back_a( tokens, "WEB" ) ] |
                  str_p("lexer") [ push_back_a( tokens, "LEX" ) ] |
                  str_p("tokenizer") [ push_back_a( tokens, "TOK" ) ] |
                  (+~space_p) [ push_back_a( tokens, "STRING" ) ]
                ).length ;
    cout << '\n' << "parsed: " << str << "\ntokens: " ;      
    copy( tokens.begin(), tokens.end(), 
               ostream_iterator<string>(cout," ") ) ;
    cout << '\n' ;      
  }
};
int main()
{
  vector<string> test_cases = list_of
                ( "test daniweb lexer xyz tokenizer lexer" )
                ( "daniweblexer tokenizerlexer abcd lexerlexer" )
                ( "daniwebtest lexerdaniweblexertest tokenizerxxx" ) ;
  for_each( test_cases.begin(), test_cases.end(), parse_it() ) ;
}
/**
>g++ -Wall -std=c++98 -I/usr/local/include keyword.cpp && ./a.out

parsed: test daniweb lexer xyz tokenizer lexer
tokens: STRING SPACE WEB SPACE LEX SPACE STRING SPACE TOK SPACE LEX

parsed: daniweblexer tokenizerlexer abcd lexerlexer
tokens: WEB LEX SPACE TOK LEX SPACE STRING SPACE LEX LEX

parsed: daniwebtest lexerdaniweblexertest tokenizerxxx
tokens: WEB STRING SPACE LEX WEB LEX STRING SPACE TOK STRING
*/

n.aggel 13 Posting Whiz in Training · Answer 7 · 2007-08-31T00:18:57+00:00

Man, I did not know that Boost had a parsing tool... Unfortunately, I was obliged by the project guidelines to use bison and flex!

n.aggel 13 Posting Whiz in Training · Answer 8 · 2007-08-31T00:21:46+00:00

First you gotta know what your regular expressions are doing.
To me, string_component [0-9a-zA-Z \t\.!#$%^&()*@_] and the example you have given are contradictory, like Salem mentioned.

OK, maybe it is contradictory, but how can you express in flex the concept I described before, i.e. recognize some tokens and consider everything else a string?