Hi all,

(this is a repost from here)

I'm working on a relational data model in which I'd like to store boolean expressions. I'm a little rusty so I'm humbly asking for a little push.

For example, let's say I have the following boolean expression:

((A = 8 OR A >= 9) AND B < 3) OR NOT (A <> 8 AND B = 0)

What would be a simple, elegant data model to store this kind (or any kind) of expression? Also important, how would you query the database in order to rebuild the logical expression from the stored data?

BTW, any kind of help would be appreciated. Reference to books, papers, web sites are perfectly ok.

Hey, thanks!

At first glance the answer is 'store it as is in a varchar field'. But it seems you want to do something else. Perhaps if you gave us the context of this problem, it would suggest a design. What do you really want to do?

Are these answers to questions? Are they questions to be answered?

Fair enough, here's a little more information.

I'm writing a data mining application where the database has about a hundred tables sharing close to a thousand fields among them. The application lets users -- university researchers mainly -- create ad-hoc queries on any table, any field to "water down" the data. That is to say, the user input, all done into web forms, is made into a valid SQL query that filters the data. From there they perform statistical analysis and what have you on the resulting data.

The current ad-hoc querying module I wrote is fairly simple but also quite limited. Users can select a table and a field (operand A), an operator, and a value (operand A'). In its simplest form, you get a boolean expression like this:

A > A'

For the same operand A, users can create an OR rule as such:

(A > A' OR A = A'' OR A <= A''')

Furthermore, users can create AND rules for different operands B and C like this:

(A > A' OR A = A'' OR A <= A''')
AND (B <= B' OR B = B'')
AND (C <> C')

...and so on. There is no limit to the amount of OR rules you can put on one line nor is there any limit to the number of AND rule lines.

Like I said, the above model is fairly simple, I store all rules in three tables:

Filter
---------
* id
name

AndRule
-----------
* id
Filter_id (FK)
OperandA
Operator
OperandA'


OrRule
----------
* id
AndRule_id (FK)
Operator
OperandA'

Of course my model breaks when trying to store more complex expressions. For example, I can't store this:

(A = A' OR B = B')
OR (A <> A'' AND B <> B'')

Hence me asking: is there a known way to store expressions (any kind really, they don't have to be boolean I suppose) is a relational model supporting any combination of operators and parenthesis?

Storing the expression (or the resulting SQL query) in a varchar field is of course feasible but I was hoping for something more flexible, i.e. that could be displayed/modified by the user in the same UI where they built the expression in the first place without having to parse it. In other words, I by far prefer a broken down expression that I can assemble fairly easily, rather than a text expression that I must break down anyway if I want to edit it or build a valid SQL query from it.

This is indeed a very interesting problem. Thank you :)

Questions:

Are you allowing the logical expressions XOR and NOT to be included in your model?

Will there ever be more than 3 variables (A,B,C)?

I'm not clear on your use of A=A' (A prime). Can you explain a little more?

If I understand correctly, you want to map the combinations of logical operators with an unknown but finite number of variables. For argument's sake let's say there are 4 logical operators (AND, OR, NOT, XOR) and 3 variables (A, B, C). As well, there are the equivalancy operators (=, <, >, <=, >=, <>).

So, there are 3 things to track (logical operators, variables, equivalency operators). In my mind this indicates 3 tables. Since these are fairly atomic values, they can all be stored as boolean values themselves in the tables (0,1). This is going to get very confusing very quickly :)

Is this the right interpretation of your problem?

For some further insight try to get a copy of 'Data & databases: Concepts in practice' by Joe Celko.

Are you allowing the logical expressions XOR and NOT to be included in your model?

No, since the expressions translate to SQL statements (the WHERE clause to be more specific), I'm sticking to the usual AND and OR operators for the moment.

Will there ever be more than 3 variables (A,B,C)?

Yes, in the current model there is no limit to the number of AND rules, therefore no limit to the number of variables.

I'm not clear on your use of A=A' (A prime). Can you explain a little more?

I use this notation to illustrate in a generic form the two operands of an equivalence test. In practice, I could have said Field = Value, since...

A, B, C... are field names in a table out of the thousand fields we are mining into. For example, if I have a table X in my database with fields X1, X2 and X3, variable A would represent one of those fields (X.X1), B another field (X.X3) and C could be Y.Y65 in a different table. I'm mining into about 1000 fields spread across 100 tables as it is right now.

A', B', C'... are values we are testing against the selected fields. For example, if X1's type is numerical, the expression A = A' could translate to X.X1 = 10. This is how it would be stored in the AndRule table:

A | Operator | A'
-------------------------
X.X1 | = | 10
X.X3 | >= | 10 Feb 1971
Y.Y65 | = | 'apple'

You'll notice that I can test against numerical values, dates and strings. Of course to do this I must track each and every field in the database and know its data type.

So, there are 3 things to track (logical operators, variables, equivalency operators). In my mind this indicates 3 tables.

I would add a fourth dimension to track: the precedence set by the use of parentheses. These two expressions are not logically identical:

(A AND B) OR C
A AND (B OR C)

To me, that is the most difficult part and the main reason why I'm looking for a shortcut :)
Thanks for the help and the book reference, I'll look it up.

So you are trying to create ad-hoc queries on the fly? Is this correct?

I'm not sure why you need to store the elements of a query in a database. If you want users to query a database directly, then provide a controlled environment where they pick ONLY the elements you want them to access. For example drop down lists, radio buttons, check boxes, etc. Build the query from that and execute it. If this is the case you will also want to code against SQL injection attacks. You will also have to allow for an unknown number of variables using some client-side programming.

Or am I completely missing something here? What kind of data are you querying?

Correct.

I feel we're going a little off topic here but if you really must know, the data is medical and I already have a controlled ad-hoc environment with everything you suggest.

The purpose of this application is scientific. We're mining data and looking for statistical meaning where none was before. If a user works hard at creating a filter that will produce a list of brain scans with such and such characteristics belonging to people that show this and that symptom and are smokers or have a genotype like this or like that, you can probably guess that there are many reasons to save that query. I can think of two big ones: 1) perhaps the user will want to come back later, change a single parameter out of the 30 he already set and get a new filtered dataset from there without having to start over; 2) if that researcher publishes a science paper based on the data, I would imagine he'd want a little bit of traceability, i.e. be able to tell exactly where the data is coming from.

I've said enough about the what and the why, any idea how?

Thank you for the further information. You're right, it isn't pertinent, I was just trying to make sure I had the right interpretation to work with.

Saving a 'view' of the query along with the exact query isn't acceptable?

There's certainly merit to the idea as it solves reason #2 in my previous post. It makes it somewhat harder however when wanting to edit the query, I would have to:

a) parse the WHERE clause
b) rebuild the web form
c) accept user input
d) rebuild the WHERE clause
e) rerun the query

Following Occam's Razor logic, if I store it in pure boolean form, I save step a) which in my mind is always a royal pain (maybe because I'm no better at writing parsers than at designing an "expression data model" - sigh).

But thanks for the suggestion, I'll certainly consider it if I come to an absolute stop with my current efforts.

I'm about at the limit of any help I might be able to provide you. This is a very good exercise.

If you were to save the query verbatim, could you not then present it back to the user for editing? Obviously they must know the structure of the database or they wouldn't have been able to create the query in the first place. Your task would be to accept an edited query (properly sanitized) before executing it. An option to save the new query results might be available.

If you haven't yet done so, you may want to check the newsgroups - there are some good ones for database design, and that is where I first ran into Joe Celko. He actually responded to a query I had a few years ago.

The way I designed it so far, the users actually don't need to know the structure of the database. I present them with a web form that contains select boxes for tables, fields, operators and even sometimes values. The table names are comprehensive enough so that someone having the right vocabulary can zoom in on a field in very little time and write a filter (ad-hoc query) fairly quickly. I think this is a strong point of my application and I want to hold on to it.

I'll send you a private message with some info on how to connect. It'll be easier if I show you...

Dear LeBurt,

Did you figure out a solution? I am in the same boat. Would you mind sharing your soltuion?

I have a fair idea of how I want to store the conditions but haven't got to a point on how to evaulate. I am thinking of accomplishing the whole thing with two tables.

I can elaborate the solution if you are interested.

Thanks,
Nachi

Hey ranchcharm,

No, I sort of let it go a while back and focused on other things. I kept thinking about it though, on and off. I have a feeling I could possibly apply a stack model used to store and evaluate any kind of expression (arithmetic, boolean). I haven't pushed that wagon very far and therefore don't know how well a database could be used to modelize a stack...

Here's an example: http://www.cse.iitk.ac.in/~sbaswana/Courses/ESO211/stack-lec.pdf

I'd be glad to hear about your solution, if you have the time.

pretty much away in time, but i am trying to achieve a similar thing as the original poster. my best idea so far is to store the expression in polish form (operand before operands, no brackets needed), it is much easier to parse after retrieval. another option would be to store the expression as a tree, where the operands and the operators are all nodes (with fields for node type, parent id etc.). i think you only need a single database table for that.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.