I am parsing XML data generated by backup software for SMS and phone logs.
Most of the data appears like this:
<sms protocol="0" address="##########" date="1495903715000" body="Ok. See you then.;-)" subject="none" ... />
For most text-message bodies, the data follows the same pattern as above: the keyword body="<string>". If the message contains single- or double-quotes, things are more complex. The software converts a single-quote to an XML entity such as &apos;. If double-quotes " are used in the body of the text message, the attribute delimiter changes from double-quotes " to single-quotes ', like this:
... body='Member, this win-win is for you! We&apos;re awesome, &apos;cause you&apos;re awesome. Some company was just ranked "Highest in "Customer Service" and "Purchase Experience" Satisfaction Among Wireless Providers" by J.D. Power.'
The PHP script written to parse the rows of data is:
function parse_rows($rows)
{
    $data = array();
    for ($a = 0; $a < count($rows); $a++) {
        // Collect every key="value" pair found in the row.
        preg_match_all("/[a-z_]+=[\"|'](?:[^\"]*)['|\"]/", $rows[$a], $matches);
        $data[$a] = $matches[0];
    }
    return $data;
}
The preg_match_all call is not working well for body data in the exceptional case described above: the match is clipped at the first double-quotation mark inside the body. This is the breakdown of the regex "/[a-z_]+=[\"|'](?:[^\"]*)['|\"]/":
[a-z_]+= matches the key names, such as 'protocol=', 'address=', 'date='.
[\"|'] is intended to match the opening delimiter, either a single-quote ' or a double-quote " (inside a character class the | is a literal pipe, not an alternation).
(?:[^\"]*) is a non-capturing group, not a look-ahead; it is intended to match everything up to the next double-quote ".
['|\"] is intended to match the closing delimiter.
Can someone assist with a regex that will also capture these exceptional body= text streams?
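One possible direction, offered as a minimal sketch and not a definitive fix: capture the opening quote in a group and require the same quote to close the value, using a back-reference. This assumes, as described above, that apostrophes inside a single-quoted body are stored as entities, so the delimiter character never appears unescaped inside its own value. The function name and the shape of $data are kept from the script above.

```php
<?php
// Sketch: the closing delimiter must be the same character as the opening one.
function parse_rows(array $rows): array
{
    $data = [];
    foreach ($rows as $a => $row) {
        // (["'])         group 1: the opening delimiter, " or '
        // (?:(?!\1).)*   any run of characters that are not that delimiter
        // \1             the same delimiter again, closing the value
        preg_match_all('/[a-z_]+=(["\'])(?:(?!\1).)*\1/s', $row, $matches);
        $data[$a] = $matches[0];
    }
    return $data;
}
```

With this pattern a value opened with ' may contain any number of " characters, and the reverse, so the match no longer stops at the first double-quote in the body.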
Each value of $rows comes in as shown above:
<sms protocol="0" address="##########" date="1495903715000" body="Ok. See you then.;-)" subject="none" ... />
The array $data is multi-dimensional. The above row would become:
$data
  ->[0]
    ->[0] = "protocol="0""
    ->[1] = "address="##########""
    ->[2] = "date="1495903715000""
    ->[3] = "body="Ok. See you then.;-)""
    ->[4] = "subject="none""
    .
    .
    ->[17] = "contact_name"
The variable $data is then parsed further later on.
Is there another method to convert a string like:
<sms protocol="0" address="##########" date="1495903715000" body="Ok. See you then.;-)" subject="none" ... />
to an array like $data?
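An alternative worth considering, sketched under a few assumptions: if each row is a complete, well-formed XML element, an XML parser can handle the quoting and the entities so no regex is needed. The sketch below uses PHP's SimpleXML extension; parse_rows_xml is a hypothetical name, and the result differs from $data above in that each row becomes an associative array of attribute name => decoded value instead of a list of key="value" strings.

```php
<?php
// Sketch: let the XML parser deal with delimiters and entities.
function parse_rows_xml(array $rows): array
{
    // Collect parse errors internally instead of emitting warnings.
    libxml_use_internal_errors(true);

    $data = [];
    foreach ($rows as $a => $row) {
        $data[$a] = [];
        $element = simplexml_load_string($row);
        if ($element === false) {
            continue; // row is not well-formed XML; leave it empty
        }
        foreach ($element->attributes() as $name => $value) {
            // Casting to string yields the attribute value with entities decoded.
            $data[$a][$name] = (string) $value;
        }
    }
    return $data;
}
```

Any later parsing step that expects the key="value" strings would need adjusting to this shape.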