Hello there guys! :)

It's been quite a while since I was last here. I am currently working on topics outside my usual area, and I have found myself baffled by a problem. So, here goes:

I am working on a web application which collects user fiscal data. Quite sensitive stuff. I need to be able to insert the data, available through a web portal, into a local database. The only way I have to access said data is through the official web site of the organization hosting it, therefore the user must authenticate his access. So, the whole thing boils down to parsing the data from the HTML generated by the portal.

I have already developed a parser, and it works nicely if the HTML is provided as a normal, local file (i.e. the user goes to the site, logs in, views the data he wants, saves it as a web page in HTML format, and then feeds it into the application). I can even read the file online (the component is written in Java, so it's just a matter of changing the stream from a local file to a URL). But, as you can see, the page is not static, and generating it requires authentication. Furthermore, I cannot have the user's credentials.
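A minimal sketch of that stream swap, for illustration only. The class and method names (HtmlSource, readAll, fromFile, fromUrl) are hypothetical and not taken from the actual component, and UTF-8 is an assumed encoding. Note that fromUrl only works for pages that need no login, which is exactly the limitation described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: the parser only ever sees an InputStream (or its text),
// so it does not care whether the HTML came from disk or from a URL.
class HtmlSource {

    // Reads the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    // Local case: the user saved the page and hands the file to the application.
    static String fromFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return readAll(in, StandardCharsets.UTF_8); // assumed encoding
        }
    }

    // Online case: same reading code, different stream. No authentication is
    // performed here, so a protected page would return the login form instead.
    static String fromUrl(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return readAll(in, StandardCharsets.UTF_8); // assumed encoding
        }
    }
}
```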

What I wanted to do is create a more user-friendly solution than guiding the user to save the page and feed it into the application. I tried creating a simple page with an iframe which would load the external page, allow the user to navigate to his data, and save the HTML of the page at the press of a button. And then I came across cross-domain policies, which I had no idea about.
Since I am no good with JavaScript, I thought of using another Java component. Using Swing/SWT components, I created a web-browser window with complete control over the data, so the idea worked: when the user sees the data he wants, a single click is enough to save it, pass it as an argument, even pre-parse it, as you can imagine. But my application runs on Tomcat, so everything is OK only when working locally. When I try to access the Java component from another machine, the window never appears there; it is displayed on the server side instead.

So, another dead end. Today I ran across another idea, suggesting the use of a PHP proxy script. That is, I create a simple PHP script on my server, and instead of querying the foreign site, I query my script. My script queries the site, gets the data, and then passes it back to me. I could get around to it, and I actually think it's quite a robust and good solution, but here is the thing: how can the script access the data when I have no access to it myself? Is the only option to request the credentials of the user and log into the system by using a GET method with the appropriate parameters?

Perhaps I am really confused, but it's been more than a week now, and I really am at a loss. Any suggestion is welcome, and I thank you in advance! :)

Cheers!

Is the only option to request the credentials of the user and log into the system by using a GET method with the appropriate parameters?

If that website does not provide some other method, then yes, you'll need those. The problem is, will the user trust you with his credentials?

Exactly... I don't think it is a good idea to implement something like this.
I think the best practice, after all, is to just instruct the user to navigate to the page by himself and save it locally, then feed it into the system.

Agreed. It's a trust issue. If you do implement this, be sure to use SSL.
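One small way to enforce that in Java is to refuse any endpoint that is not HTTPS before credentials are ever sent. This is only a sketch; SecureEndpoint and requireHttps are hypothetical names, and the check covers the URL scheme only, not certificate validation.

```java
import java.net.URI;

// Hypothetical guard: call this before building any request that carries
// the user's ID and password.
class SecureEndpoint {

    // Returns the parsed URI if its scheme is https, otherwise throws.
    static URI requireHttps(String url) {
        URI uri = URI.create(url);
        if (!"https".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException(
                "Refusing to send credentials over scheme: " + uri.getScheme());
        }
        return uri;
    }
}
```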

I wrote something a bit similar to this and it is still in production use. It logs in to a site, navigates to a specific page, extracts data, and then uploads that data to another site (sending the data to a custom PHP program that updates a MySQL database). For the login, the user entered the ID and PW the first time and then the program saved them to an encrypted file (which the program has to decrypt, so it isn't highly secure). The program could have been written to require the ID and PW every time if security of the info was a bigger deal. I did give the user the program to run, and that mostly worked for a while, but it is a little bit fragile, so I ended up taking it back and now I run it for the user. It is getting less fragile as time goes on because I add more defensive code as I run into issues.
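A rough Java sketch of the save-credentials-to-an-encrypted-file idea. This is not the program described above. It assumes AES-GCM with a key derived from a passphrase via PBKDF2; CredentialStore, the iteration count, and the salt + IV + ciphertext layout are all illustrative choices. As noted, if the passphrase lives inside the program, this only obscures the credentials.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: produces a byte blob that can be written to a file
// (e.g. with Files.write) and read back later.
class CredentialStore {
    private static final int SALT_LEN = 16;
    private static final int IV_LEN = 12;
    private static final int TAG_BITS = 128;

    // Derives a 128-bit AES key from the passphrase and salt (PBKDF2, assumed parameters).
    private static SecretKeySpec deriveKey(char[] passphrase, byte[] salt)
            throws GeneralSecurityException {
        PBEKeySpec spec = new PBEKeySpec(passphrase, salt, 100_000, 128);
        byte[] key = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
        return new SecretKeySpec(key, "AES");
    }

    // Output layout: salt | iv | ciphertext (with GCM tag appended by the cipher).
    static byte[] encrypt(String plain, char[] passphrase) throws GeneralSecurityException {
        SecureRandom rng = new SecureRandom();
        byte[] salt = new byte[SALT_LEN];
        byte[] iv = new byte[IV_LEN];
        rng.nextBytes(salt);
        rng.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, deriveKey(passphrase, salt),
                new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
        return ByteBuffer.allocate(SALT_LEN + IV_LEN + ct.length)
                .put(salt).put(iv).put(ct).array();
    }

    // Reverses encrypt(); throws if the passphrase is wrong or the blob was altered.
    static String decrypt(byte[] blob, char[] passphrase) throws GeneralSecurityException {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        byte[] salt = new byte[SALT_LEN];
        byte[] iv = new byte[IV_LEN];
        buf.get(salt);
        buf.get(iv);
        byte[] ct = new byte[buf.remaining()];
        buf.get(ct);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, deriveKey(passphrase, salt),
                new GCMParameterSpec(TAG_BITS, iv));
        return new String(cipher.doFinal(ct), StandardCharsets.UTF_8);
    }
}
```

Asking the user for the passphrase at startup, rather than embedding it, would move this closer to the "require the ID and PW every time" variant.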

It seems that you could give the user a program to extract the data / HTML and then upload it over a secure connection to a file on the server. This would require the user to enter the login information every time. You would probably need to have login authority for a while as you develop it (and they might need some special procedure to monitor the usage). If you are going to have access to the extracted (sensitive) data then they will still need to have some trust in you (non-disclosure agreement?). Your program to parse/save the data could run on a scheduled basis and check if a new file was uploaded.
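The "check if a new file was uploaded" step could look roughly like this in Java. It is a sketch under assumptions: uploads land as .html files in a single directory, and "new" means modified after the time of the previous check. UploadChecker and newUploads are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for the scheduled job: lists uploads not yet processed.
class UploadChecker {

    // Returns regular *.html files in uploadDir modified strictly after 'since',
    // sorted by path so the processing order is stable.
    static List<Path> newUploads(Path uploadDir, FileTime since) throws IOException {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(uploadDir, "*.html")) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)
                        && Files.getLastModifiedTime(entry).compareTo(since) > 0) {
                    found.add(entry);
                }
            }
        }
        Collections.sort(found);
        return found;
    }
}
```

The scheduled run (cron, or a ScheduledExecutorService inside the application) would remember the time of its last check, call newUploads with it, and hand each returned file to the parser.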

I've done quite a bit of experimenting with screen-scraping and I've tried a number of different tools (not including Java). My tool of choice is AutoIt. It has functions that use the Windows (IE) COM interface, so you can work at a data element level rather than try to parse your way through a bunch of HTML. I also developed some of my own (AutoIt) tools using a Chrome interface. AutoIt has a syntax very similar to PHP, so it is pretty easy to go back and forth between Windows and web development.
