Hi,
I need to write a very simple RTF parser to read and write basic RTF (bold, italic, underline, paragraphs, etc.). I have searched for techniques I could implement but haven't found a good example.
I would like to hear which approach you would consider best.
Thanks!

Member Avatar for jmichae3

Just google "rtf format specification". RTF is essentially an ANSI text file. It can have pictures embedded in it; as far as I know they are stored as hex-encoded (or raw binary) data inside a \pict group.
http://support.microsoft.com/kb/86999

Parsers and lexical analyzers are the subject of compiler construction. Compilers translate source text (for example, C code) into binary code or into some other language, so the same techniques can be used to build a format converter.
flex and bison (the GNU versions of lex and yacc) are lexer and parser generators: they take a grammar description and generate C or C++ code you can compile. There are books on writing compilers. It's a complicated subject, and it's generally a good idea to take a college class on it (it's a credit class, available at some community colleges) and to be very familiar with the language you are using.

The issue is not the docs; I have read them. I want to know how others would implement a parser that, for example, takes an RTF document, converts it to HTML for the browser, and converts it back to RTF for saving to the file system.
And no, I don't want to write all that complex compiler stuff.
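
For the "back to RTF" half of that round trip, no compiler machinery is needed. Below is a minimal sketch in Python, under some loud assumptions: the document is represented as a list of (text, styles) runs, the function names escape_rtf and runs_to_rtf are made up for this example, the font table is a fixed placeholder, and non-ASCII text is not handled.

```python
def escape_rtf(text):
    """Escape backslash and braces, and turn newlines into paragraph breaks."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)      # these three characters are RTF syntax
        elif ch == "\n":
            out.append("\\par ")
        else:
            out.append(ch)             # assumption: plain ASCII input
    return "".join(out)


def runs_to_rtf(runs):
    """Build a small RTF document from (text, set_of_styles) runs.

    Styles are the RTF control words without the backslash: "b", "i", "ul".
    """
    body = []
    for text, styles in runs:
        codes = "".join("\\" + s for s in ("b", "i", "ul") if s in styles)
        escaped = escape_rtf(text)
        if codes:
            # A styled run gets its own group, so the style ends at the brace.
            body.append("{" + codes + " " + escaped + "}")
        else:
            body.append(escaped)
    # Placeholder header: one font, selected as the default.
    header = r"{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}\f0 "
    return header + "".join(body) + "}"


print(runs_to_rtf([("Hello ", set()), ("world", {"b"}), ("!", set())]))
```

Putting each styled run in its own group keeps the writer simple, because nothing ever has to be switched off explicitly.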

Member Avatar for jmichae3

Just watch what happens with certain changes, and you will see how complex .rtf files can become:
1. Create a plain text document in WordPad and save it as an .rtf file. Keep the program and document open. Use Shift+Enter on some lines and Enter on others. Try some tabs.
2. Install Notepad++ (a programmer's editor) and tie it into Windows Explorer.
3. Use Windows Explorer to right-click the .rtf file and open it with Notepad++.
4. Make a bulleted list in WordPad in the same document using the nice buttons.
5. Save the document.
6. Alt+Tab to Notepad++. It will say that the document has changed and ask whether you want to load the new version. Yes, you always want to.
7. Watch the changes! It has now grown from a 10-line text document with an ASCII NUL (\0) character at the end into a gigantic document.

Some of these format specifiers are sequential. Some are referential, I think. For instance, it looks like you can \def a following identifier and use that "macro" elsewhere in your document. I don't know yet how to tell where the macro ends, because I haven't read the spec. If I had read the spec, I may or may not be writing a parser... it looks really complicated. The RTF document is structured as a tree of curly braces, with the leaves being sequential content. The sequential commands begin with \ and consist of a command name followed by data. The file ends with \0. This is what I have discovered just by looking at the innards of an RTF file.
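
The structure described above (braces forming groups, backslash commands with optional data, plain text in between) can be split into tokens with one regular expression. This is only a sketch in Python: the names TOKEN and tokenize are invented for this example, and it deliberately ignores \'hh hex escapes, \bin data, and Unicode, all of which a real parser would have to handle.

```python
import re

# One token per match: a group brace, a control word with an optional
# numeric parameter (and one optional delimiter space), a control symbol,
# or a run of plain text.
TOKEN = re.compile(
    r"(?P<open>\{)"
    r"|(?P<close>\})"
    r"|\\(?P<word>[A-Za-z]+)(?P<param>-?\d+)? ?"
    r"|\\(?P<symbol>[^A-Za-z])"
    r"|(?P<text>[^\\{}]+)"
)


def tokenize(rtf):
    """Yield (kind, value) pairs for a string of RTF source."""
    for m in TOKEN.finditer(rtf):
        if m.group("open"):
            yield ("open", None)
        elif m.group("close"):
            yield ("close", None)
        elif m.group("word"):
            param = m.group("param")
            yield ("control", (m.group("word"), int(param) if param else None))
        elif m.group("symbol"):
            yield ("symbol", m.group("symbol"))
        else:
            # Raw CR, LF and NUL in the source are not document text; drop them.
            text = m.group("text")
            for junk in ("\r", "\n", "\0"):
                text = text.replace(junk, "")
            if text:
                yield ("text", text)


for token in tokenize(r"{\rtf1\ansi{\b bold} plain\par}"):
    print(token)
```

A parser built on top of this only needs a stack: push on "open", pop on "close", and apply each "control" token to whatever is on top.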

Member Avatar for jmichae3

those fonts (especially for bullet lists) don't transfer very well to web fonts...

If you want to write your own, I think you could start easy by replacing RTF tags with HTML tags. For example, replace \b with <strong>, \b0 with </strong>, and likewise \i with <em> and \i0 with </em>. It depends on how far you want to take this.
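
A rough sketch of that replacement idea in Python. The TAGS table and the function name naive_rtf_to_html are made up for this example, and it assumes a body fragment containing only simple formatting commands. It ignores groups, does not escape HTML characters, and does nothing to guarantee well-nested output.

```python
import re

# Hypothetical mapping from a few RTF control words to HTML tags.
TAGS = {
    "b": "<strong>", "b0": "</strong>",
    "i": "<em>", "i0": "</em>",
    "ul": "<u>", "ulnone": "</u>",
}

# A control word, its optional numeric parameter, and one delimiter space.
CONTROL = re.compile(r"\\([a-z]+\d*) ?")


def naive_rtf_to_html(fragment):
    """Swap known control words for tags; silently drop unknown ones."""
    return CONTROL.sub(lambda m: TAGS.get(m.group(1), ""), fragment)


print(naive_rtf_to_html(r"\b bold\b0 , \i italic\i0 ."))
```

The space after a control word is a delimiter, not text, which is why the pattern swallows one space after each match.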

Member Avatar for jmichae3

Part of the problem is that an RTF file is both sequential and sometimes hierarchical.

HTML is purely hierarchical. At some point, you would need to make a decision about WHEN to turn off something like bold or italics.

Also, because RTF is sequential, with straight translation and no parsing you could theoretically end up with

\b this is bold\i bold italics \b0 italics\i0 regular text

coming out as invalidly nested HTML like this:

<strong>this is bold<em>bold italics </strong>italics</em>regular text

In a situation like this, how would you go about it?
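
One possible way to handle it, sketched in Python: instead of translating each command into a tag, keep track of the current formatting state and wrap every run of text in exactly the tags that are active at that point. The output is then always well nested, whatever order the RTF switches things on and off. Everything named here (TOKEN, ELEMENTS, rtf_fragment_to_html) is invented for this example, only \b, \i, \ul, \ulnone, \par and groups are understood, and mapping \par to <br> is a simplification.

```python
import html
import re

# Braces, control word + optional parameter, control symbol, or plain text.
TOKEN = re.compile(
    r"(\{)|(\})|\\([A-Za-z]+)(-?\d+)? ?|\\([^A-Za-z])|([^\\{}]+)"
)

# Hypothetical mapping from state keys to HTML elements, outermost first.
ELEMENTS = [("b", "strong"), ("i", "em"), ("ul", "u")]


def rtf_fragment_to_html(rtf):
    """Convert a small RTF body fragment to well-nested HTML."""
    state = {"b": False, "i": False, "ul": False}
    stack = []  # one saved state per open group
    out = []
    for brace_open, brace_close, word, param, symbol, text in TOKEN.findall(rtf):
        if brace_open:
            stack.append(dict(state))    # a group inherits the current state
        elif brace_close:
            if stack:
                state = stack.pop()      # leaving a group restores the old state
        elif word == "ulnone":
            state["ul"] = False
        elif word in state:
            state[word] = param != "0"   # \b switches on, \b0 switches off
        elif word == "par":
            out.append("<br>")
        elif symbol in ("\\", "{", "}"):
            out.append(html.escape(symbol))  # escaped literal character
        elif text:
            text = text.replace("\r", "").replace("\n", "")
            if not text:
                continue
            piece = html.escape(text)
            # Wrap from the innermost element outwards.
            for key, tag in reversed(ELEMENTS):
                if state[key]:
                    piece = f"<{tag}>{piece}</{tag}>"
            out.append(piece)
    return "".join(out)


print(rtf_fragment_to_html(
    r"\b this is bold\i bold italics \b0 italics\i0 regular text"
))
```

The result has some redundant tags (adjacent runs each get their own <strong>), which could be merged in a later pass, but it never produces overlapping elements.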

I am not convinced by jmichae3's example. I think the nesting of RTF is the same as HTML, so the example given is faulty RTF. I have never seen an RTF editor output bold and italic in that order.
