Member Avatar for kohkohkoh

Hi,
I'm at my wits' end. I just need some ideas and brainstorming from the experts and professionals here.

Scenario:
I have 30 text files (each about 300 MB to 500 MB).
What I need to do is convert these files into some sort of binary format and store them somewhere,
but not in a SQL database.

I intend to store these files in some kind of container.

For example:
I have container A and container B.
Each container has a size cap of 1 GB.
Each text file is moved into container A until the quota is reached.
Once it is reached, files go into container B, and so on to C, D, E...
On top of that, I will have an application to locate these files again.

Is there any medium/container that I can use for this purpose?

Thanks

Storage isn't an issue, and storing them as binary won't change the size unless the majority of the size is formatting of record data. The question you should be asking is "how do I need to access this data"?

For example, if you're just storing the files and will read them sequentially, the file system is adequate. If you need to search within the files and retrieve data based on complex queries, a relational database may be your best option.

What you propose is essentially what Hadoop's HDFS (Hadoop Distributed File System) does: it stores your data in chunks of a fixed (configurable) size. Go to the Apache Hadoop web pages and read the design documentation for more ideas. FWIW, you can install HDFS on a single computer. It doesn't need to be distributed per se, except that you lose redundancy and fault resilience.
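To make the "fixed-size chunks" idea concrete, here is a minimal Python sketch. It is not Hadoop's API, just an illustration of cutting a byte stream into pieces of a configurable size; the function name `split_into_chunks` and the sizes are made up for the example.

```python
import io


def split_into_chunks(stream, chunk_size):
    """Yield successive byte chunks of at most chunk_size from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    # Hypothetical 2500-byte "file" cut into 1024-byte chunks.
    data = io.BytesIO(b"x" * 2500)
    print([len(c) for c in split_into_chunks(data, 1024)])  # [1024, 1024, 452]
```

The last chunk is simply whatever is left over, so only the final piece can be smaller than the chunk size.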

Member Avatar for kohkohkoh

Hi deceptikon,

I have thought about compression.

Our company previously had major problems storing the binary files as blobs in a relational DB,
e.g. network problems and others,
because we are in the financial sector and deal with terabytes of data and files.

Therefore we have been asked to change the way this works.
My team proposed using the file system: we would compress the files and mask them in a container.
(Not sure if that is the correct term.)

The question is, how should we mask them in that container?
And, following your point, what comes after the masking:
how do we access the data? I believe the answer will follow once we know the
masking technique.

Thanks

Our company previously had a major problem storing the binary files as blob in a relational db.

That's a bad idea in general. I'm not surprised you had issues. Usually you'll store references to the file (e.g. a file system path or repository ticket) in the database to avoid the problems inherent in blobs. And if you were using blobs, you must already have something in place to access the data.
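As a rough sketch of the "store a reference, not the blob" approach, something like the following could work. It assumes SQLite purely for illustration; the table name `files`, its columns, and the helper functions are hypothetical, and the same idea applies to whatever database you already run.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open (or create) a small catalog that maps file names to storage paths."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " name TEXT PRIMARY KEY,"
        " path TEXT NOT NULL,"
        " size_bytes INTEGER NOT NULL)"
    )
    return conn


def register_file(conn, name, path, size_bytes):
    """Record where a file lives; the file content itself stays on disk."""
    conn.execute(
        "INSERT INTO files (name, path, size_bytes) VALUES (?, ?, ?)",
        (name, path, size_bytes),
    )
    conn.commit()


def locate_file(conn, name):
    """Return the stored path for a file name, or None if it is unknown."""
    row = conn.execute(
        "SELECT path FROM files WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

The database rows stay tiny because only the path and size are stored, so the network and backup problems that come with multi-hundred-megabyte blobs do not apply to the catalog itself.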

Member Avatar for kohkohkoh

You just switched on a light bulb for me :)
Great idea.

Our only concern is that we might have hundreds and thousands of files.
We would rather not have individual files lying around in a folder. That would be harder for us to control for archiving and other program logic. (Unsure.)

It would be better to have something like the jar file concept:
all incoming files are pushed into a container, and whenever that container is full,
a new container is created to receive the incoming files.

:)

How would you ever extract a specific file? Is the data for a specific file static, that is, is the data written in stone and will never change? Or do you need to edit and modify the data occasionally? The jar concept along with an index file might work as long as no changes are possible to the data. Then you have only one humongous multi-terabyte file to work with. Access time might be slow though.
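For what it's worth, the jar-plus-index idea can be sketched with Python's standard `zipfile` module (a jar is a zip archive underneath). This is only a sketch under assumptions: the class name `RollingArchive`, the `container_NNNN.zip` naming, and the JSON index file are all invented for the example, and the size check is a conservative estimate (compressed size on disk plus the uncompressed size of the incoming file).

```python
import json
import os
import zipfile


class RollingArchive:
    """Append-only store: files go into numbered zip containers, tracked by an index."""

    def __init__(self, root, cap_bytes):
        self.root = root
        self.cap_bytes = cap_bytes
        os.makedirs(root, exist_ok=True)
        self.index_path = os.path.join(root, "index.json")
        if os.path.exists(self.index_path):
            with open(self.index_path) as f:
                self.index = json.load(f)  # file name -> container number
        else:
            self.index = {}
        # Resume with the highest container number seen so far.
        self.current = max(self.index.values(), default=0)

    def _container_path(self, number):
        return os.path.join(self.root, f"container_{number:04d}.zip")

    def add(self, src_path):
        """Append one file; start a new container if the cap would be exceeded."""
        name = os.path.basename(src_path)
        if name in self.index:
            raise ValueError(f"{name} is already stored; the archive is append-only")
        path = self._container_path(self.current)
        incoming = os.path.getsize(src_path)
        if os.path.exists(path) and os.path.getsize(path) + incoming > self.cap_bytes:
            self.current += 1
            path = self._container_path(self.current)
        # zf.write streams from disk, so a 500 MB file is not loaded into memory.
        with zipfile.ZipFile(path, "a", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.write(src_path, arcname=name)
        self.index[name] = self.current
        with open(self.index_path, "w") as f:
            json.dump(self.index, f)

    def read(self, name):
        """Look the file up in the index and return its bytes."""
        with zipfile.ZipFile(self._container_path(self.index[name])) as zf:
            return zf.read(name)
```

Because each container is capped, looking a file up means opening one container of bounded size rather than scanning one multi-terabyte file, which should help with the access-time concern. A single file larger than the cap still gets a container of its own in this sketch.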

Member Avatar for kohkohkoh

There won't be any changes to the files,
but new incoming files will be appended to the container.

Thanks, rubber man.
Allow me to explore Hadoop some more.
:)
