Hi, I use the flex tool (http://www.gnu.org/software/flex/manual/) to generate a tokenizer, but I have the following problem (it has to do with the way that flex tokenizes the input):

FILE : flex.l

%{
		#define WEB 0
		#define SPACE 1
		#define STRING 2	
%}

string_component [0-9a-zA-Z \t\.!#$%^&()*@_]

%%

"daniweb"		              {return WEB;}
[ \t\n]			{return SPACE;}
{string_component}+	{return STRING;}

%%

#include <iostream>

using namespace std;

int main()
{	
	cout<<yylex()<<endl;
	cout<<yylex()<<endl;

	return 0;
}

int yywrap(void){return 1;}

Example file:

test_string daniweb

What I want is to have the above string tokenized as
STRING SPACE WEB
but instead flex recognizes the whole line as a single STRING, because it tries to match the longest possible input...

How can I fix this problem?
All ideas are welcome...

PS: to compile:

flex flex.l
g++ lex.yy.c
./a.out <example
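
For reference, a rough sketch of what the program above prints on the example file (assuming the file ends with a single newline): flex always takes the longest possible match, and because {string_component} contains the space and tab characters, {string_component}+ swallows the entire line before the "daniweb" rule ever gets a chance.

./a.out <example
2
1

That is, the first yylex() call returns STRING (2) for the whole line "test_string daniweb", and the second returns SPACE (1) for the trailing newline; WEB (0) is never returned.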

Your string component matches spaces, and now you're complaining that you don't want to match spaces.

You can't have it both ways.

Thank you for answering (apparently, few people have read the post...).

Yes, you are right, it seems that I can't have it both ways... but from where I stand I want to use flex to do the following:

Recognize some specific keywords (in the simplified example I provided, the keyword was "daniweb") and recognize everything else as a string... any ideas on how I can do that?

PS: maybe start conditions could help me solve the problem? (I haven't understood them that well; there is a short sketch right below.)
PS2: in the beginning I thought it wouldn't be that difficult, but I was wrong...
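
For what it's worth, here is a minimal sketch of what start conditions look like in flex, using the classic strip-C-comments example rather than this thread's grammar: %x declares an exclusive state, rules prefixed with <COMMENT> are active only in that state, and BEGIN switches between states.

%option noyywrap
%x COMMENT

%%

"/*"            { BEGIN(COMMENT); /* switch into the exclusive COMMENT state */ }
<COMMENT>"*/"   { BEGIN(INITIAL); /* back to the normal state */ }
<COMMENT>.|\n   { /* discard everything inside the comment */ }
.|\n            { ECHO; /* copy everything else through to the output */ }

%%

int main() { return yylex(); }

That said, for the keyword-versus-string problem above they are probably not needed; see the sketch near the end of the thread.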

iamthwee:

What is this Flex? Some kind of regular expression library or something? Do you even need it, or can your problem be simplified?

Flex

Flex (The Fast Lexical Analyzer)
Flex is a fast lexical analyser generator. It is a tool for generating programs that perform pattern-matching on text. Flex is a non-GNU free implementation of the well known Lex program.


http://www.gnu.org/software/flex/
http://flex.sourceforge.net/

iamthwee:

Um ok, please explain this:

string_component [0-9a-zA-Z \t\.!#$%^&()*@_]

and what do you think it does?

There's a way to set the precedence of regexes in flex. I don't remember the exact syntax, but you should put it before the catch-all regex you have defined there.

I haven't seen what you mention in the manual...

Unfortunately I haven't found the solution... I worked around my problem by changing the grammar (i.e. the bison file) and in the end I handed in the project... When I find the time I will try to find a solution using start conditions.
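
In case it helps anyone who finds this later: as far as I know flex has no separate precedence directive; its two fixed rules (described in the manual's section on how the input is matched) are that the longest match always wins, and that among matches of the same length the rule listed first wins. So rule order only breaks ties; it cannot beat a catch-all that matches strictly more text. A minimal sketch of the tie-breaking:

%{
#include <stdio.h>
%}
%option noyywrap

%%

"daniweb"   { printf("WEB\n"); /* also 7 chars, but listed first: wins the tie */ }
[a-z]+      { printf("WORD\n"); /* catch-all for lowercase words */ }
[ \t\n]     { /* skip whitespace */ }

%%

int main() { return yylex(); }

For the input "daniweb" both rules match seven characters and the first one listed wins, so WEB is printed; for "daniwebx" the second rule matches eight characters, which is longer, so WORD is printed no matter how the rules are ordered.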

iamthwee:

First you gotta know what your regular expressions are doing.

To me, string_component [0-9a-zA-Z \t\.!#$%^&()*@_] and the example you have given are contradictory, as salem mentioned.

Using Boost.Spirit may be much easier: http://www.boost.org/libs/spirit/doc/quick_start.html

#include <boost/spirit/core.hpp>
#include <boost/spirit/actor/push_back_actor.hpp> // push_back_a (some versions pull this in via core.hpp)
#include <iostream>
#include <iterator>                               // std::ostream_iterator
#include <string>
#include <vector>
#include <algorithm>
#include <boost/assign.hpp>                       // list_of
using namespace std ;
using namespace boost ;
using namespace boost::spirit ;
using namespace boost::assign ;

struct parse_it
{ 
  // parse str from left to right, labelling each token as it is recognised
  void operator() ( const string& str ) const
  {
    vector<string> tokens ;
    const char* cstr = str.c_str() ;
    size_t n = 0 ;
    // spirit tries the alternatives of | in order and stops at the first one
    // that matches, so the keyword parsers must come before the catch-all
    // (+~space_p); push_back_a appends its label when the parser matches,
    // and parse(...).length is the number of characters consumed
    while( n < str.size() )
      n += parse( cstr + n,
                  (+space_p) [  push_back_a( tokens, "SPACE" ) ] |
                  str_p("daniweb") [ push_back_a( tokens, "WEB" ) ] |
                  str_p("lexer") [ push_back_a( tokens, "LEX" ) ] |
                  str_p("tokenizer") [ push_back_a( tokens, "TOK" ) ] |
                  (+~space_p) [ push_back_a( tokens, "STRING" ) ]
                ).length ;
    cout << '\n' << "parsed: " << str << "\ntokens: " ;      
    copy( tokens.begin(), tokens.end(), 
               ostream_iterator<string>(cout," ") ) ;
    cout << '\n' ;      
  }
};
int main()
{
  vector<string> test_cases = list_of
                ( "test daniweb lexer xyz tokenizer lexer" )
                ( "daniweblexer tokenizerlexer abcd lexerlexer" )
                ( "daniwebtest lexerdaniweblexertest tokenizerxxx" ) ;
  for_each( test_cases.begin(), test_cases.end(), parse_it() ) ;
}
/**
>g++ -Wall -std=c++98 -I/usr/local/include keyword.cpp && ./a.out

parsed: test daniweb lexer xyz tokenizer lexer
tokens: STRING SPACE WEB SPACE LEX SPACE STRING SPACE TOK SPACE LEX

parsed: daniweblexer tokenizerlexer abcd lexerlexer
tokens: WEB LEX SPACE TOK LEX SPACE STRING SPACE LEX LEX

parsed: daniwebtest lexerdaniweblexertest tokenizerxxx
tokens: WEB STRING SPACE LEX WEB LEX STRING SPACE TOK STRING
*/

Man, I did not know that boost had a parsing tool... unfortunately I was obligated to use bison and flex by the project guidelines!

First you gotta know what your regular expressions are doing.

To me, string_component [0-9a-zA-Z \t\.!#$%^&()*@_] and the example you have given are contradictory, as salem mentioned.

Ok, maybe it is contradictory, but how can you express in flex the concept I wrote before? E.g. recognize some specific tokens and consider everything else a string...
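
One way to express that in flex, sticking close to the original flex.l from this thread (a minimal sketch, not tested against the real project grammar): keep the keyword rule first and, more importantly, take the space and tab out of string_component so the catch-all can never swallow the separator and the keyword after it. With whitespace excluded, "daniweb" and {string_component}+ both match exactly seven characters on the keyword, and flex breaks the tie in favour of the rule listed first.

%{
    #define WEB    0
    #define SPACE  1
    #define STRING 2
%}

string_component [0-9a-zA-Z\.!#$%^&()*@_]

%%

"daniweb"               { return WEB; }
[ \t\n]                 { return SPACE; }
{string_component}+     { return STRING; }

%%

#include <iostream>

using namespace std;

int main()
{
    cout << yylex() << endl;   // STRING (2) for "test_string"
    cout << yylex() << endl;   // SPACE  (1) for the blank
    cout << yylex() << endl;   // WEB    (0) for "daniweb"
    return 0;
}

int yywrap(void) { return 1; }

On the example file this should print 2 1 0, i.e. STRING SPACE WEB.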
