I'm just starting with PHP and I was given a rather difficult task. I need to search through multiple XML files in different directories, then return the names of the files that contain a given search input.

Will it be easier to write a script that reads all XML files and writes their content to a MySQL database?

No, I don't really think it would make much of a difference, though maybe someone else disagrees with me. Let me cautiously say instead that it shouldn't make a big difference if you're only talking about a few hundred XML files that aren't ridiculously huge (say, more than 50KB each). In an environment like that you might get a performance boost by going through MySQL, since you can filter with conditions right in the SQL statement, like this:

SELECT * FROM `table` WHERE `xmldata` LIKE '%searchterm%';

By doing that you would actually be filtering out records before ever pulling in and reading all of their data. But I really don't think this would be all that much quicker unless the field you are doing the LIKE on is a VARCHAR field, and if it is, it has to be under 255 characters, which pretty much rules out most XML files.
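For what it's worth, here is a minimal sketch of what the database route might look like on the PHP side, assuming a hypothetical xml_files table with filename and xmldata columns that some import script has already filled (all table, column, and connection names here are placeholders):

<?php
// Minimal sketch of the MySQL route, assuming a hypothetical `xml_files`
// table with `filename` and `xmldata` columns. Connection details are
// placeholders.
$searchterm = "your search term here";

$pdo = new PDO("mysql:host=localhost;dbname=yourdb", "user", "password");

// Let MySQL filter the rows before any XML content reaches PHP.
$stmt = $pdo->prepare("SELECT filename FROM xml_files WHERE xmldata LIKE ?");
$stmt->execute(array("%" . $searchterm . "%"));

foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $filename) {
    echo $filename, "\n";
}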

Here is some example code which could recursively open a directory and scan through all sub-directories looking for XML files (I didn't test this; just jotted it down, but it should work):

define("REGEXP", "/{$searchterm}/i");
$dirPath = "lib/xml";

$matches = scanXML($dirPath);
print_r($matches);

function scanXML ($dirPath)
	{
	static $matches;
	$dir = opendir($dirPath);
	while(false !== ($file = readdir($dir)))
		{
		if ($file != '.' && $file != '..')
			{
			if (strtolower(substr($file, strlen($file) - 4, 4)) == '.xml')
				{
				// read the file here and perform necessary search
				if (preg_match(REGEXP, @file_get_contents("{$dirPath}/{$file}")))
					{
					$matches[] = "{$dirPath}/{$file}";
					}
				}
			elseif (is_dir("{$dirPath}/$file"))
				{
				scanXML("{$dirPath}/$file");
				}
			}
		}
	closedir($dir);
	return $matches;
	}
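As a side note, the same walk can also be written with glob(), which some people find easier to read than a hand-rolled opendir()/readdir() loop. A rough, untested equivalent (scanXMLGlob is just a name I made up, and it reuses the REGEXP constant from above):

// Rough glob()-based equivalent of scanXML(): collect matching .xml files
// from $dirPath and every sub-directory.
function scanXMLGlob($dirPath)
{
    $matches = array();

    // All .xml files directly inside this directory (note: glob() is
    // case-sensitive, so files named .XML would be skipped here).
    foreach (glob("{$dirPath}/*.xml") as $file) {
        if (preg_match(REGEXP, @file_get_contents($file))) {
            $matches[] = $file;
        }
    }

    // Recurse into each sub-directory and merge its results in.
    foreach (glob("{$dirPath}/*", GLOB_ONLYDIR) as $subDir) {
        $matches = array_merge($matches, scanXMLGlob($subDir));
    }

    return $matches;
}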

Keep in mind: with the regular expression, you will probably need to escape certain characters if your search term contains any special characters.
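PHP's preg_quote() handles that escaping for you; a quick sketch (this would replace the define() line in the example above):

// preg_quote() escapes regex metacharacters in the search term; the second
// argument also escapes the pattern delimiter ("/").
$searchterm = "price (USD) / item";
define("REGEXP", "/" . preg_quote($searchterm, "/") . "/i");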

There will be no special characters. An additional problem is that each XML file is in a different directory, and there are 100+ of those directories and growing. But I will probably copy all the XML files into one directory with a bash script.

Thanks a lot. Your post was very helpful.

Honestly, I don't think it will be a big performance hit to scan a couple hundred or so directories as long as they don't all have a hundred or so XML files in them. If you are dealing with a structure that has lots and lots of directories with lots and lots of XML, then you may benefit from caching in some sense (MySQL or otherwise). I would just try it a few times live and see how it goes. If it takes a few seconds you may want to consider caching.

If this is going to be something that normal users of a website will be allowed to search, then I would probably recommend caching, because it's not too healthy for the disk to constantly cycle through directories. It likely won't hurt anything, but disk access speeds aren't great, especially compared to RAM. If you do use a MySQL table, consider setting up a cron script to refresh the cache regularly, and set up the table with the MEMORY (HEAP) engine, which means it will all live in RAM and perform MUCH quicker.
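As a rough sketch of what that cron-driven cache could look like (table, column, and connection names are placeholders, and the script would simply be run on a schedule):

<?php
// Rough sketch of a cron-driven cache refresh: walk the XML tree and dump
// every file's content into a MEMORY table. All names here are placeholders.
$pdo = new PDO("mysql:host=localhost;dbname=yourdb", "user", "password");

// MEMORY (a.k.a. HEAP) tables keep their rows entirely in RAM, but they
// cannot hold TEXT/BLOB columns, so the XML goes into a sized VARCHAR.
$pdo->exec("CREATE TABLE IF NOT EXISTS xml_cache (
    filename VARCHAR(255) NOT NULL,
    xmldata  VARCHAR(10000) NOT NULL
) ENGINE=MEMORY");

// Start from a clean slate on every run.
$pdo->exec("TRUNCATE TABLE xml_cache");

$insert = $pdo->prepare("INSERT INTO xml_cache (filename, xmldata) VALUES (?, ?)");

// Walk lib/xml and all of its sub-directories, loading each .xml file.
$iterator = new RecursiveIteratorIterator(new RecursiveDirectoryIterator("lib/xml"));
foreach ($iterator as $fileinfo) {
    if ($fileinfo->isFile() && strtolower($fileinfo->getExtension()) == "xml") {
        $insert->execute(array($fileinfo->getPathname(), file_get_contents($fileinfo->getPathname())));
    }
}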
