My company is retiring an online document warehousing application that stored lots of text data. The application stored the data in a folder hierarchy that looked just like a Windows folder tree. I'm trying to replicate that hierarchy on a UNIX file system, but the tools provided with the application to extract the hierarchy information are not terribly useful.
One tool will give me a list of folder ID numbers and names like so:
Folder 8777 - fozzy.
Folder 8778 - fozzy1.
Folder 8779 - fozzy2.
Folder 8780 - grover1.
Folder 8781 - grover2.
Folder 8782 - rolf1.
Folder 8783 - rolf2.
Folder 8784 - rolf3.
Folder 8785 - rolf4.
Folder 8786 - travel_statements.
Folder 8787 - invoices.
Another tool will give me sub-folder relationships based on folder ID like the following:
Folder 100 - <root>.
subfolder 101 - flag 0.
subfolder 119 - flag 0.
subfolder 227 - flag 0.
subfolder 239 - flag 0.
subfolder 1198 - flag 0.
subfolder 1320 - flag 0.
subfolder 2264 - flag 0.
subfolder 3025 - flag 0.
subfolder 3028 - flag 0.
subfolder 3031 - flag 0.
Folder 1198 - kermit1.
subfolder 1227 - flag 0.
subfolder 1231 - flag 0.
subfolder 1238 - flag 0.
subfolder 1374 - flag 0.
subfolder 1504 - flag 0.
subfolder 1538 - flag 0.
subfolder 1642 - flag 0.
subfolder 2459 - flag 0.
subfolder 2635 - flag 0.
subfolder 2642 - flag 0.
subfolder 3998 - flag 0.
subfolder 7942 - flag 0.
subfolder 8656 - flag 0.
Folder 1227 - monkey1.
subfolder 1228 - flag 0.
subfolder 1327 - flag 0.
subfolder 1347 - flag 0.
subfolder 1390 - flag 0.
subfolder 1396 - flag 0.
Folder 3333 - piggy1.
No sub folders.
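For reference, here is a minimal Python sketch of how the second tool's output could be loaded into memory. It assumes every line follows exactly the patterns shown above, that the trailing period is not part of the folder name, and that the flag value can be ignored. The names `parse_listing`, `names` and `children` are made up for illustration.

```python
import re
from collections import defaultdict

# Patterns assumed from the sample output; adjust if real lines differ.
FOLDER_RE = re.compile(r"^Folder (\d+) - (.*)\.$")
SUBFOLDER_RE = re.compile(r"^subfolder (\d+) - flag \d+\.$")

def parse_listing(lines):
    """Return (names, children) parsed from the sub-folder listing."""
    names = {}                    # folder ID -> folder name
    children = defaultdict(list)  # parent ID -> child IDs, in listed order
    current = None                # ID of the most recent "Folder" line
    for raw in lines:
        line = raw.strip()
        m = FOLDER_RE.match(line)
        if m:
            current = int(m.group(1))
            names[current] = m.group(2)
            continue
        m = SUBFOLDER_RE.match(line)
        if m and current is not None:
            children[current].append(int(m.group(1)))
        # "No sub folders." and any unrecognised line are skipped.
    return names, dict(children)

sample = """\
Folder 100 - <root>.
subfolder 1198 - flag 0.
subfolder 3025 - flag 0.
Folder 1198 - kermit1.
subfolder 1227 - flag 0.
Folder 3333 - piggy1.
No sub folders.
""".splitlines()

names, children = parse_listing(sample)
```

The same `FOLDER_RE` pattern should also match the lines from the first tool, so its output could be fed through the same function to fill in names for folders that only appear as sub-folder IDs here.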
I first approached this problem by looping through the list of folder IDs and, for each one, running a recursive function that scanned the sub-folder information until a path could be built back to the root folder (folder ID 100). This appeared to work well, but I ran into two problems:
- I discovered that some sub-folders were present in more than one location, but my code only picked up the first instance.
- I also found that some folders sit outside the hierarchy of the root folder, so no path back to folder ID 100 exists for them.
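Both problems can be illustrated with a small sketch that walks down from the root instead of up from each folder. This is only an illustration under my own assumptions: the parent-to-children mapping is held in a plain dict, and the IDs 9000 and 9001 are invented to stand in for folders outside the root. The function names are hypothetical.

```python
def all_paths(children, root):
    """Yield every path that starts at root, as a tuple of folder IDs."""
    stack = [(root,)]
    while stack:
        path = stack.pop()
        yield path
        for child in children.get(path[-1], []):
            if child not in path:  # guard against cycles in the data
                stack.append(path + (child,))

def unreachable(children, root):
    """Folder IDs mentioned in the data that no path from root reaches."""
    seen = {path[-1] for path in all_paths(children, root)}
    mentioned = set(children)
    for kids in children.values():
        mentioned.update(kids)
    return mentioned - seen

# Invented data: 1227 sits under both 100 and 1198; 9000 is outside the root.
children = {100: [1198, 1227], 1198: [1227], 9000: [9001]}
paths = sorted(all_paths(children, 100))
orphans = unreachable(children, 100)
```

Because each path is extended independently, folder 1227 shows up once per location rather than only at its first match, and anything left in `orphans` needs a separate starting point.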
Next, I tried starting from the sub-folder information instead. I built a list of simple strings, each representing one parent/child pair, like this: 100/1198. Then, for each pair, I looped through the sub-folder info again and tried to extend the path wherever the pair's child matched the parent on a scanned line. This caught some of the duplicate paths, but I also ended up with many path fragments that were connected to neither the top nor the bottom of the tree.
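To make the pair idea concrete, here is a rough sketch of one way the "parent/child" strings could be indexed so that only complete paths come out. It assumes each pair is two integer IDs separated by a slash; `index_pairs`, `paths_to` and the ID 9000 are invented for the example.

```python
from collections import defaultdict

def index_pairs(pairs):
    """Build a child -> parents map and the set of top-level folder IDs."""
    parents = defaultdict(list)
    all_parents, all_children = set(), set()
    for pair in pairs:
        parent, child = (int(x) for x in pair.split("/"))
        parents[child].append(parent)
        all_parents.add(parent)
        all_children.add(child)
    tops = all_parents - all_children  # folders that are nobody's child
    return dict(parents), tops

def paths_to(folder, parents, _tail=()):
    """Return every path ending at folder, walking up until no parent is left."""
    tail = (folder,) + _tail
    # Skip parents already on this path so cyclic data cannot loop forever.
    ups = [p for p in parents.get(folder, []) if p not in tail]
    if not ups:
        return [tail]
    result = []
    for p in ups:
        result.extend(paths_to(p, parents, tail))
    return result

# Invented pairs: 1227 has three parents, one of them outside folder 100.
pairs = ["100/1198", "1198/1227", "100/1227", "9000/1227"]
parents, tops = index_pairs(pairs)
found = paths_to(1227, parents)
```

Every returned path begins at a folder that has no further parent to follow, so the first element tells me whether the path belongs to the root (100) or to some other disconnected tree.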
Can anyone suggest how I could build the folder hierarchy from this kind of data? Or is there a good way to represent the data internally so that I can build the paths without missing any possible path combinations? Any assistance would be greatly appreciated. Thank you!