Hi ALL,
The project we're working on is a document retrieval system. Two kinds of Features for each document in a repository are extracted as described in table-1 and table-2.
Table1
DocID Word LocationName PersonName Freq
1 Book N N 9
1 Study N N 3
2 Work N N 4
2 Run N N 5
2 User N N 2
3 Search N N 6
3 Sweat N N 7
3 London Y N 3
3 Clinton N Y 1
Table2
DocID Word1 Word2 Freq
1 Book Read 3
1 Study English 2
2 Work Hard 1
3 Sweat warm 1
Given an example document, which is described as
Word LocationName PersonName Freq
ExTable1
Book N N 8
Read N N 7
London Y N 2
Clinton N Y 6
ExTable2
WordA WordB Freq
Book Read 2
Book Write 1
I need to calculate the similarities between the example document and each of the document in the repository.
For example, the similarity between Doc1 and the example document( based on ExTable1 and Table1) can be calculated as follows
Read: 9+8= 17
the similarity based on ExTable2 and Table3 can be calculated as follows
Book--Read: 3+2 = 5
Finally, calculating the final result: 17+5=22
In my program, I used the following SQL statement to implement the calculation.
SELECT view1.DocID
, view1.Score+view2.Score As TotalScore
FROM (
SELECT Table1.DocID
, SUM(ExTable1.Freq + Table1.Freq) Score
FROM Table1
INNER
JOIN ExTable1
ON ExTable1.Word = Table1.Word
GROUP
BY Table1.DocID
) AS view1
LEFT OUTER
JOIN (
SELECT Table2.DocID
, SUM(ExTable1.Freq + Table1.Freq) Score FROM Table2
INNER
JOIN ExTable2
ON ExTable2.WordA = Table2.WordA
AND ExTable2.WordB = Table2.WordB
GROUP
BY Table2.DocID
) AS view2 ON view2.DocID = view1.DocID
ORDER BY TotalScore
Actually,the real picture is a bit more complicated in that some documents in the repository should precluded from matching with the example document.
That is, given an example document (a query), only qualified documents in the repository are compared with it to get the similarities.
I have difficult in implementing the process of document filtering using SQL statements.
The process of document filtering is described below.
Rule1, (Based on ExTable1 and Table1):
if there are person names in the title of the example document, then those documents in the repository , which do NOT include any of the person names will be ruled out. The remaining documents will continue to be evaluated with the following rules before they become qualified.
Rule2(Based on ExTable1 and Table1):
A document is qualified if it contains at least one location name which appears in the example document.
Rule3(Based on ExTable2 and Table2):
A document is qualified if there are more than one word-pair matches between the example document and this document, by examing ExTable2 and Table2.
That is, documents after filtering by rule1 can ruled as qualified by rule2 or rule3.
I have no idea how to implement the doucument filtering. SHould I use a stored procedure to generate a tabular reuslt(Tabular_Result) containing ALL DocIDs of qualified documents and use it in the statemennt to generate fiew1, like
(
SELECT Table1.DocID
, SUM(ExTable1.Freq + Table1.Freq) Score
FROM Table1
INNER
JOIN ExTable1
ON ExTable1.Word = Table1.Word AND Table1.DocID IN Tabular_Result
GROUP
BY Table1.DocID
) AS view1
However, I am not sure whether a stored procedure can generate a tabular result, and whether wiser approaches exists for this purpose.
p.s.
'Word's are extacted from a document, LocationName and PersonName, both are Boolean value(Y/N), indicate whether a particular 'Word' is a name of a place or a name of Person.
Place name and person name in a document are important info for a document matching. Given an example document containing a person name, if a candidate document is similar to the exmaple document, it is very likely the candidate doucment contain the person name in its text title or text body. Hence there is no need to do matching work between the example document and those documents which have no the person name in its text. Similarly, documents cannot meet the rule2 or rule3 also will be ruled out.
Based on this observation, we figure out the document filtering rule1.
Thanks.