I am parsing XML data generated by backup software for SMS and phone logs.

Most of the data appears like this:

  <sms protocol="0" address="##########" date="1495903715000" body="Ok. See you then.;-)" subject="none" ... />

For most message bodies, the data follows the same pattern as above: the keyword body="<string>". If the message contains single or double quotes, things are more complex. The software converts a single quote to the entity &apos;. If double quotes " are used in the body of the text message, the attribute delimiter changes from double quotes " to single quotes ', like this:

... body='Member, this win-win is for you! We&apos;re awesome, &apos;cause you&apos;re awesome. Some company was just ranked "Highest in "Customer Service" and "Purchase Experience" Satisfaction Among Wireless Providers" by J.D. Power.' 

The PHP function written to parse the rows of data is:

function parse_rows($rows) {
    $data = array();
    for ($a=0; $a < count($rows); $a++) {
        preg_match_all("/[a-z_]+=[\"|'](?:[^\"]*)['|\"]/",$rows[$a], $matches);
        $data[$a] = $matches[0];
    }
    return $data;
}

The preg_match_all call does not handle body data of the exceptional kind described above: it clips the match at the first double-quotation mark inside the body. This is the breakdown of the regex "/[a-z_]+=[\"|'](?:[^\"]*)['|\"]/"

[a-z_]+= matches the key tags like 'protocol=', 'address=', 'date='
Intended [\"|'] matches the opening delimiter, either a single quote ' or a double quote "
Intended (?:[^\"]*) non-capturing group (not a look-ahead), matching everything that is not a double quote "
Intended ['|\"] matches the closing delimiter

Can someone assist with the regex that will capture for these exceptional body= text streams?
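
One possible direction, offered as a sketch rather than a tested fix for the real backup files: capture the opening quote and require the same quote to close the value with a backreference, so a single-quoted body may contain double quotes and vice versa. The function name parse_rows_sketch and the sample row are made up for illustration, and the sketch assumes a value never contains its own delimiter unescaped.

```php
<?php
// Sketch only: parse_rows_sketch() is a hypothetical name, not the original function.
function parse_rows_sketch(array $rows): array
{
    $data = [];
    foreach ($rows as $i => $row) {
        // (["']) captures the opening quote; \2 requires that same quote to close.
        // .*? is lazy, and the s flag lets a body span line breaks.
        preg_match_all('/([a-z_]+)=(["\'])(.*?)\2/s', $row, $matches);
        // $matches[0] holds the full key=value strings, as in the original $data.
        $data[$i] = $matches[0];
    }
    return $data;
}

// Made-up row: a single-quoted body containing double quotes and &apos;.
$rows = [
    '<sms protocol="0" body=\'Ranked "Highest" and we&apos;re glad.\' subject="none" />',
];
$data = parse_rows_sketch($rows);
```

With this pattern, $matches[1] would hold the keys and $matches[3] the unquoted values, which may save a later parsing step.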

Each element of $rows looks like the row shown above:

  <sms protocol="0" address="##########" date="1495903715000" body="Ok. See you then.;-)" subject="none" ... />

The array $data is multi-dimensional. The above row would become:

$data
  ->[0]
    ->[0] = "protocol="0"" 
    ->[1] = "address="##########"" 
    ->[2] = "date="1495903715000""
    ->[3] = "body="Ok. See you then.;-)"" 
    ->[4] = "subject="none""
    .
    .
    ->[17] = "contact_name"

Variable $data is then parsed further later on.

Is there another method to convert a string like:

  <sms protocol="0" address="##########" date="1495903715000" body="Ok. See you then.;-)" subject="none" ... />

to an array like $data?
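
One alternative that might avoid the regex entirely is an XML parser. The sketch below assumes each row is a complete, well-formed element (the ... in the sample above stands for elided attributes) and that PHP's SimpleXML extension is available. The helper name row_to_pairs and the sample row are hypothetical. Note that it returns name => value pairs rather than the key="value" strings held in $data, and that the parser decodes entities such as &apos;.

```php
<?php
// Sketch only: row_to_pairs() is a hypothetical helper name.
function row_to_pairs(string $row): array
{
    $element = simplexml_load_string($row);
    if ($element === false) {
        return []; // the row was not well-formed XML
    }
    $pairs = [];
    foreach ($element->attributes() as $name => $value) {
        // Casting to string yields the decoded attribute value (&apos; becomes ').
        $pairs[$name] = (string) $value;
    }
    return $pairs;
}

// Made-up row with a single-quoted body that contains double quotes.
$row = '<sms protocol="0" date="1495903715000" body=\'We&apos;re "awesome"\' subject="none" />';
$pairs = row_to_pairs($row);
```

Because the parser handles both quoting styles itself, the delimiter switch for bodies containing double quotes should not need special treatment here.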
