I'm trying to write a bash script that converts an HTML file into a LaTeX file by processing each line with sed. I'm stuck on the following issue:
I need to replace <a name="something"> with \index{something_else}. I have an associative array with key-value pairs like this: "something" => "something_else". However, there are two catches:

  1. Not every name that appears in an <a> tag is present in the array. If there is no array element with the key "something", I need to skip creating the \index{}.
  2. Some lines contain more than one occurrence of '<a name="...">'.

Any suggestions?

Loop through the array and create a tmp file (/tmp/tmp.sed) containing one sed expression per line, like

s/<a name="something1">/\\index{something_else1}/g
s/<a name="something2">/\\index{something_else2}/g
   .
   .
   .

The 'g' at the end says to replace all occurrences.

Then sed -f /tmp/tmp.sed < file.html > file.latex.

You could do it in the script too, but writing the temp file is cleaner, allows you to build the complete set of editing expressions at one time and lets you process the file once.
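
A minimal sketch of generating that file from a bash associative array (bash 4 or later). The sample entries in references and the sample input line are made up for illustration. It assumes the keys and values contain no characters that are special to sed (such as /, & or \). The final catch-all rule, which deletes any anchor that has no array entry, is my assumption about what "skip" should mean; drop that line to leave unknown anchors untouched.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the real array would be filled elsewhere.
declare -A references=(
	[intro]="Introduction"
	[setup]="Installation"
)

# Write one substitution command per known anchor name.
sed_script=$(mktemp)
for key in "${!references[@]}"; do
	# printf turns \\\\ into \\, which sed then emits as a single backslash
	printf 's/<a name="%s">/\\\\index{%s}/g\n' "$key" "${references[$key]}"
done > "$sed_script"

# Assumption: anchors without an array entry are simply removed.
printf '%s\n' 's/<a name="[^"]*">//g' >> "$sed_script"

html='<a name="intro">Hello <a name="nope">world <a name="setup">!'
result=$(printf '%s\n' "$html" | sed -f "$sed_script")
printf '%s\n' "$result"

rm -f "$sed_script"
```

Because every rule ends in g, lines with several anchors are handled in the same pass.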

If bc() can be written in sed, you can translate html to latex with a potentially long file of sed commands.

Thanks a lot for the suggestion.
The problem is that I have to process one line at a time rather than the whole file at once. The reason is the <pre> tag, whose content is processed differently from the rest of the file.
I can put the sed commands in a separate file, but I'll have to create it dynamically, since all the replacement strings are collected into an array over 300 elements long, populated from another HTML page that could change in the future. Or I could loop through the array and execute a sed command in each iteration, but I believe that would be much slower.
This is generally a good solution for processing the whole file, but it would be too slow for line-by-line processing.
In the meantime, I came up with this solution, which doesn't even involve sed:
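
For reference, sed can also be told to skip a range of lines, which may make whole-file processing workable despite the <pre> blocks. The following is only a sketch: the single substitution rule and the sample input are made up, and it assumes that <pre> and </pre> never appear on the same line.

```shell
#!/usr/bin/env bash
# Hypothetical sed script: the address range followed by ! restricts the
# commands inside the braces to lines outside <pre> ... </pre>.
sed_script=$(mktemp)
cat > "$sed_script" <<'EOF'
/<pre>/,/<\/pre>/!{
s/<a name="intro">/\\index{Introduction}/g
}
EOF

input='<a name="intro">before
<pre>
<a name="intro">untouched
</pre>
<a name="intro">after'

result=$(printf '%s\n' "$input" | sed -f "$sed_script")
printf '%s\n' "$result"

rm -f "$sed_script"
```

The lines from <pre> through </pre> pass through unchanged, so they could be handled by a separate set of rules.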

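# Only lines that contain an anchor need processing.
if [[ $line == *'<a name="'* ]]; then
	OIFS=$IFS
	IFS='<'		# split the line into segments at every '<'
	line2=
	for word in $line; do
		# 1-based position of the first '>' in the segment, 0 if none
		pos=$( expr index "$word" '>' )
		if [ "$pos" -ne 0 ]; then
			if [ "${word:0:8}" == "a name=\"" ]; then
				# the name sits between 'a name="' and '">'
				indx=${word:8:$(( pos - 10 ))}
				if [ -n "${references[$indx]}" ]; then
					word="\\index{${references[$indx]}}${word:$pos}"
				else
					# unknown name: drop the tag, keep the text after it
					word=${word:$pos}
				fi
			else
				# some other tag: restore the '<' removed by splitting
				word="<$word"
			fi
		fi
		line2+=$word
	done
	IFS=$OIFS
	line=$line2
fi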

$line is the line of text currently being processed, and references is the array containing the replacement strings.
The idea is to break the line on '<'s and then search for '>' in each segment. If a segment starts with a name="...">, we replace that tag with the matching array element (or delete it if there is none). If not, that means we found some other (not yet processed) tag, so we put back the '<'. Works like a charm.
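
The same idea can also be sketched with bash's regex matching instead of splitting on IFS. This is an alternative, not the code above: the function name replace_anchors and the sample array entries are made up for illustration. Since bash regexes are greedy, the loop consumes the last anchor on the line first and builds the result from right to left.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the real references array.
declare -A references=( [intro]="Introduction" [setup]="Installation" )

# Print $1 with every <a name="..."> replaced by \index{...},
# or removed when the name has no entry in references.
replace_anchors() {
	local rest=$1 tail= name after
	local re='^(.*)<a name="([^"]*)">(.*)$'
	# The greedy (.*) makes each match pick the LAST anchor in $rest.
	while [[ $rest =~ $re ]]; do
		rest=${BASH_REMATCH[1]}
		name=${BASH_REMATCH[2]}
		after=${BASH_REMATCH[3]}
		if [[ -n $name && -n ${references[$name]+set} ]]; then
			tail="\\index{${references[$name]}}$after$tail"
		else
			tail="$after$tail"
		fi
	done
	printf '%s\n' "$rest$tail"
}

out=$(replace_anchors '<b>x</b><a name="intro">Hello <a name="nope">world <a name="setup">!')
printf '%s\n' "$out"
```

Other tags such as <b> are never split apart here, so nothing has to be put back afterwards.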
