Given the following string (with no whitespace):
String s = "<a><b>e</b></a>";
How can I get the following tokens?
<a>, <b>, e, </b>, </a>
Thanks for your attention!
You could try looping through each index of the string: if the character at that index is '>', then the substring from the start up to and including that index will be, for example, <a>. You then make a note of the index of the last '>' found and repeat the process from there.
I hope I explained that okay.
I can't think of a good way to do this with built-in string methods. You might have to write a little scanner method to do this. It's actually not hard, but you have to think it through before you start writing it.
What you'll be doing is writing a little state machine, essentially. That is, what you come across in the string will determine what decisions you make next.
This machine will have two states: an initial state and an XML-tag state. The initial state is just where you start: you take characters off the front of the string and, as long as they're not '<', you just add them to a temp string (or, better, a StringBuilder). If you come across a '<', then you're scanning an XML tag. You close off the old string you were building (maybe you put it in an array for safekeeping) and you start building a new String (or StringBuilder). The simplest thing would be to just look for a '>' at this point. When you find that, you close off that String, put it in your array, and go back to the initial state.
When you come to the end of the String, you've got an array of Strings, like you were asking for.
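For reference once you have worked the logic out yourself, here is a minimal sketch of the two-state scanner described above. The class name TagScanner, the State enum, and the choice to collect tokens in a List rather than an array are my own assumptions, and it does no error checking at all.

```java
import java.util.ArrayList;
import java.util.List;

class TagScanner {
    // The two states from the description: plain text, or inside a tag.
    private enum State { TEXT, TAG }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.TEXT;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (state == State.TEXT) {
                if (c == '<') {
                    // Close off any text collected so far, then start a tag.
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                    current.append(c);
                    state = State.TAG;
                } else {
                    current.append(c);
                }
            } else { // State.TAG
                current.append(c);
                if (c == '>') {
                    // The tag is complete: store it and go back to the initial state.
                    tokens.add(current.toString());
                    current.setLength(0);
                    state = State.TEXT;
                }
            }
        }
        // Keep whatever is left at the end (trailing text or an unfinished tag).
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<a><b>e</b></a>"));
        // prints [<a>, <b>, e, </b>, </a>]
    }
}
```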
There are some things you can do to make this work better, but they complicate the process, so start with this and we'll figure out what else you want to do.
Very important: with something like this, work out the logic before you start coding. Please be ready to explain your steps in English before you even think about what it looks like in Java. It can also be good to draw diagrams: I'm in this state, what am I looking for? When I find it, where do I go?
Of course, there's probably an easy way to do this that someone will post in about two minutes, and I'll look like a fool. If so, do it the easy way!
@Garee
Sorry, I feel a little bit confused.
Do you mean that I should first convert the string to characters and then, when I encounter "<", keep reading until I encounter ">" and make that a token? Or something like that?
Could you please explain in a little more detail?
A code illustration would be more helpful, I suppose.
Thanks
I think what he means is that when you come to a '>', you know that you've finished a tag. However, you also need to know that when you come to a '<' that you're starting a new tag. If you just look for '>' then you'll parse
<a><b>e</b></a>
into
<a>, <b>, e</b>, </a>
which is incorrect.
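To make that concrete, here is a rough sketch of the "only look for '>'" idea, so you can see the incorrect result for yourself. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CloseBracketOnly {
    // Cuts the input after every '>' and ignores '<' entirely.
    static List<String> splitAfterCloseBracket(String input) {
        List<String> tokens = new ArrayList<>();
        int start = 0; // index just past the last '>' found
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '>') {
                tokens.add(input.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Anything after the final '>' is kept as one last piece.
        if (start < input.length()) {
            tokens.add(input.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitAfterCloseBracket("<a><b>e</b></a>"));
        // prints [<a>, <b>, e</b>, </a>]  (the text "e" is stuck to "</b>")
    }
}
```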
You just have to know the name of what you are looking for:
http://www.google.com/#q=html+parse+regex
You can use Regex "lookarounds" to split at the <> delimiters while retaining the delimiters. For example:
String s = "<A><B>cde</B></A>";
String[] a = s.split("(?=[<>])");
gives the following array elements
<A
>
<B
>cde
</B
>
</A
>
Which is 90% of the way there (i.e. each element is <tag, </tag, or >value, where value may be "").
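One possible way to cover the remaining 10% is to glue each "<tag" piece to the ">" that follows it and keep any text after the ">" as its own token. This is only a sketch: the class name SplitAndMerge is hypothetical, and it makes no attempt to reject malformed input.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndMerge {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String openTag = null; // a "<tag" piece still waiting for its '>'

        for (String piece : input.split("(?=[<>])")) {
            if (piece.isEmpty()) {
                continue; // older Java versions can yield a leading empty piece
            }
            if (piece.charAt(0) == '<') {
                if (openTag != null) {
                    tokens.add(openTag); // unfinished tag, kept as is (no validation)
                }
                openTag = piece;
            } else if (piece.charAt(0) == '>' && openTag != null) {
                tokens.add(openTag + ">");           // "<tag" + ">" becomes "<tag>"
                openTag = null;
                if (piece.length() > 1) {
                    tokens.add(piece.substring(1));  // the value after the '>'
                }
            } else {
                tokens.add(piece); // text before the first tag, or a stray '>' piece
            }
        }
        if (openTag != null) {
            tokens.add(openTag);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<a><b>e</b></a>"));
        // prints [<a>, <b>, e, </b>, </a>]
    }
}
```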
Nice one, James. You still have to do a certain amount of parsing-type work, but you know that you only have to look at the front of the string.
But what about error handling? You have to be pretty careful dealing with something like
<a><b>c<d<e></e>></d> ...
which gives
<a, >, <b, >c, <d, <e, >, </e, >, >, </d, > ...
It seems to me that it would be tricky to ensure that you rejected that, or rather it would be easy to end up passing it. Whereas if you just scan it yourself, you know that if you've accepted a '<', you can't take another '<' in that token.
This is why I didn't like the split() for this particular problem - it seems to me that keeping track of the brackets becomes more complicated in this approach, and the hand-rolled scanner would end up being conceptually simpler. However, it's a valid approach and one that tedtdu should consider.
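To illustrate the point about rejecting a second '<' inside a token, a hand-rolled scanner could be made strict along these lines. Treat it as a sketch under my own assumptions: the class name StrictTagScanner and the use of IllegalArgumentException to signal errors are just choices made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class StrictTagScanner {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false; // true once a '<' has been accepted

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                if (inTag) {
                    // We already accepted a '<', so another one is an error.
                    throw new IllegalArgumentException("unexpected '<' inside a tag at index " + i);
                }
                if (current.length() > 0) {
                    tokens.add(current.toString()); // close off the text token
                    current.setLength(0);
                }
                current.append(c);
                inTag = true;
            } else if (c == '>') {
                if (!inTag) {
                    throw new IllegalArgumentException("unexpected '>' outside a tag at index " + i);
                }
                current.append(c);
                tokens.add(current.toString()); // the tag is complete
                current.setLength(0);
                inTag = false;
            } else {
                current.append(c);
            }
        }
        if (inTag) {
            throw new IllegalArgumentException("unterminated tag at end of input");
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<a><b>e</b></a>"));
        try {
            tokenize("<a><b>c<d<e></e>></d>");
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

With the malformed example above, the scanner stops at the '<' that follows "<d" instead of quietly producing tokens.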
The other consideration is that writing a simple scanner like this is a good introduction to state machines, which you're likely to run into later on if you take a compilers course.
Yes, I thought the regex lookaround was so neat that I just had to share it.
However, if this were my project I'd write my own simple scanner exactly as you described (in fact, a few years ago I did exactly that to scan the iTunes XML database file to look for particular items of interest. It was smaller, easier and very much faster than the SAX/DOM alternatives. But that's another story...)