Member Avatar for westmeadboy

Apologies for the vagueness of the title but I cannot see a way to sum this up in a title...

I have a table storing the individual words of sentences.

Think one column is for a sentence id, the other is for a word in that sentence. So there are 6 rows with sentence id 1, if and only if that sentence has 6 words.

What is the query to get back all sentences (I mean, as ids) containing, say, three given words?

I've thought of a couple of ways:

1. Search the three words separately and intersect the results together.
2. Use subselects to narrow down the search results gradually. So the outermost one might search the first search term but only taking into account sentence ids returned by the subselect etc etc.

These both sound a bit inefficient IMHO.

Is there a better way?

So you want to search for one or many words to see all the different sentences they exist in?

Member Avatar for westmeadboy

For example, return all sentences containing ALL these words:

"fox", "quick", "lazy"

Here is one way:

select *
from sentence
where sentence_id in
(
  select sentence_id
  from word
  where `text` in ('today','is')
  group by sentence_id
)

The sub query selects all sentence_id's for the words you pass, in this case I passed 'today' and 'is'. I group by so I don't get multiple of the same sentence_id back. Then I run all those ids by the first query which returns everything from the sentence table.

In this scenario you have two tables, word and sentence. Here's the db schema I just created to do a quick test. These are not optimized, just for sample purposes:

DROP TABLE IF EXISTS `test`.`sentence`;
CREATE TABLE  `test`.`sentence` (
  `sentence_id` int(10) unsigned NOT NULL auto_increment,
  `text` varchar(100) default NULL,
  PRIMARY KEY  (`sentence_id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `test`.`word`;
CREATE TABLE  `test`.`word` (
  `word_id` int(10) unsigned NOT NULL auto_increment,
  `sentence_id` int(10) unsigned NOT NULL,
  `text` varchar(45) default NULL,
  PRIMARY KEY  (`word_id`)
) ENGINE=InnoDB AUTO_INCREMENT=14 DEFAULT CHARSET=latin1;
select *
from sentence
where sentence_id in
(
  select [B]distinct [/B]sentence_id
  from word
  where `text` in ('today','is')
)

You could also use distinct instead of group by, I don't remember which yields better performance.

Member Avatar for westmeadboy

Thanks very much for that. However, would that query not just select all sentences that contain at least one of the words in ('today', 'is')?

Alternatively, couldn't the subselect be written like this:

select sentence_id, count(sentence_id) as N
from word
where text in ('today','is')
group by sentence_id
having N = 2

(here "2" is the number of words in the ('today', 'is') set)

?

This would only work if there is uniqueness for the combination of the sentence_id and text column.

I apologize, I misunderstood what you were doing. Full-Text searching would work I believe. For example:

select *
from sentence
where match(`text`) against ('+today +good' IN BOOLEAN MODE);

This would return all the sentences that have both 'today' and 'good' in them. There are requirements to use this method. You have to use th MyISAM db engine on that table and you have to create the `text` field (the one you want to search in) as a 'FULLTEXT' index.

More info:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html#function_match
http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html

Member Avatar for westmeadboy

Unfortunately, I don't actually store the full sentence (and don't want to because of capacity limits).

Not sure I'll be much help. Only thing I'll mention is careful with using IN since it can be expensive. I can't think of other ways based on your requirements than what you have already mentioned. Curious to see what others come up with though.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.