Author |
Topic  |
|
Kat
Advanced Member
    
United Kingdom
3065 Posts |
Posted - 08 May 2001 : 05:48:32
|
Can anyone give me some tips on how to go about writing a search?
I already have a search working nicely where it is just using filters to narrow down results. However, I need to add a keywords box to this search, and it needs to use CONTAINSTABLE rather than LIKE because we need the results ranked by relevance.
I am not sure how to pull back the values from the text field in the form and split them up correctly.
Someone could enter "'Search Phrase', Search Words , another, word" or other combinations, and I am not sure how to pull these values back and split them into the correct parts to put in the CONTAINS("search for this") value. Request.Form("textfieldname") on its own is not enough to split them up.
Any input would be appreciated!
KatsKorner
|
|
HuwR
Forum Admin
    
United Kingdom
20595 Posts |
Posted - 08 May 2001 : 06:01:43
|
Kat,
You will need to write some parsing functions to extract the keywords from the input box. It will be quite complex, as you can't control how users enter the keywords. If they were all nicely separated by commas, as in your example, you could just use split to obtain the separate keywords or phrases.
|
 |
|
gor
Retired Admin
    
Netherlands
5511 Posts |
Posted - 08 May 2001 : 06:11:19
|
Even if they are not separated by commas, you can still use split(). The best approach is to split on " first, so users can enter "search for this sentence", and then split on spaces. That way you also get the boolean operators split from the rest.
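The two-pass split described above can be sketched roughly as follows. This is a minimal illustration in Python rather than ASP/VBScript; the function name `parse_keywords`, the `OPERATORS` set, and the choice to treat commas as spaces are assumptions made for the example, not something taken from the thread.

```python
# Boolean operators recognised in this sketch (an assumption, not a full grammar).
OPERATORS = {"AND", "OR", "NOT"}

def parse_keywords(raw):
    """Split raw search-box text into (kind, text) tuples.

    kind is "phrase", "word" or "operator".
    """
    tokens = []
    # Splitting on the double quote yields alternating pieces:
    # even indexes lie outside quotes, odd indexes lie inside quotes.
    # An unclosed quote therefore turns the rest of the input into a phrase.
    for index, piece in enumerate(raw.split('"')):
        if index % 2 == 1:
            phrase = " ".join(piece.split())  # collapse runs of whitespace
            if phrase:
                tokens.append(("phrase", phrase))
            continue
        # Outside quotes: treat commas as spaces, then split on whitespace.
        for word in piece.replace(",", " ").split():
            word = word.strip("'")  # drop stray single quotes around a word
            if not word:
                continue
            if word.upper() in OPERATORS:
                tokens.append(("operator", word.upper()))
            else:
                tokens.append(("word", word))
    return tokens

tokens = parse_keywords('"search for this" and another, word')
```

Note that this sketch only treats double quotes as phrase delimiters; a single-quoted entry such as 'Search Phrase' comes back as two separate words.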
Pierre Join the Snitz WebRing |
 |
|
Kat
Advanced Member
    
United Kingdom
3065 Posts |
Posted - 08 May 2001 : 07:04:47
|
Ok Gor,
I will go with your suggestion and split by spaces.
I understand that there are some words that need stripping from a search, otherwise SQL Server won't like it. Has anyone got any advice on what they are and how to handle them?
KatsKorner
|
 |
|
HuwR
Forum Admin
    
United Kingdom
20595 Posts |
Posted - 08 May 2001 : 07:08:20
|
I can't think of any that would affect your search, there are words you can't use as column names, but there should be no restriction on what you can search for.
|
 |
|
Kat
Advanced Member
    
United Kingdom
3065 Posts |
Posted - 08 May 2001 : 08:06:05
|
Hmm. Some of the guys here have a file of 'noise' words that apparently I should filter before trying to do a search.
I shall investigate.
KatsKorner
|
 |
|
HuwR
Forum Admin
    
United Kingdom
20595 Posts |
Posted - 08 May 2001 : 08:11:55
|
They probably just mean commonly used words that are likely to appear in most records, things like 'the'.
|
 |
|
gor
Retired Admin
    
Netherlands
5511 Posts |
Posted - 08 May 2001 : 08:16:35
|
Yes, a noise file has words like 'the', 'of', 'a', etc. You can also see this at http://www.google.com/ (not the list, but how it works). If you type a search string with words that are on that list, it disregards them in the search because they are too common.
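Filtering noise words out of the user's keywords before searching might look something like this. It is a small Python sketch, not ASP code; `NOISE_WORDS` holds only a handful of entries from the noise list quoted in this thread, and the function name `strip_noise` is made up for the example.

```python
# A small subset of the noise list quoted in this thread. A real list would
# normally be loaded from a file rather than hard-coded like this.
NOISE_WORDS = {"a", "about", "an", "and", "of", "the", "to", "with"}

def strip_noise(words, noise=NOISE_WORDS):
    """Return the words that are not in the noise set (case-insensitive)."""
    return [w for w in words if w.lower() not in noise]

kept = strip_noise(["The", "history", "of", "a", "company"])
```

If every word the user typed is a noise word, the result is an empty list, so it is worth checking for that case before building the query. Boolean operators such as AND would also need to be pulled out before this step, because 'and' is itself on the noise list.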
Pierre Join the Snitz WebRing |
 |
|
gor
Retired Admin
    
Netherlands
5511 Posts |
Posted - 08 May 2001 : 08:30:37
|
I found this info in an Oracle document (http://ksu154.himolde.no/oracledok/doc/cartridg.804/a58164.pdf):
To calculate a relevance score for a returned document in a text query, ConText uses an inverse frequency algorithm. Inverse frequency scoring assumes that frequently occurring terms in a document set are "noise" terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
Note: This section discusses how ConText calculates score for text queries, which is different from the way it calculates score for theme queries.
The following table illustrates ConText's inverse frequency scoring. The first column shows the number of documents in the document set, and the second column shows the number of terms in the document necessary to score 100.
Number of Documents    Frequency of Term
in Document Set        in Document
1                      34
5                      20
10                     17
50                     13
100                    12
500                    10
1,000                  9
10,000                 7
100,000                5
1,000,000              4
This table assumes that only one document in the set contains the query term. The table illustrates that if only one document contained the query term and there were five documents in the set, the term would have to occur 20 times in the document to score 100. Whereas, if there were 1,000,000 documents in the set, the term would have to occur only 4 times in the document to score 100.
Example: You have 5000 documents dealing with chemistry in which the term chemical occurs at least once in every document. The term chemical thus occurs frequently in the document set. You have a document that contains 5 occurrences of chemical and 5 occurrences of the term hydrogen. No other document contains the term hydrogen. Because chemical occurs so frequently in the document set, its score for the document is lower with respect to hydrogen, which is infrequent in the document set as a whole. This is so even though both terms occur 5 times in the document.
Note: Even if the relatively infrequent term hydrogen occurred 4 times in the document, and chemical occurred 5 times in the document, the score for hydrogen might still be higher, because chemical occurs so frequently in the document set (at least 5000 times).
Inverse frequency scoring also means that adding documents that contain hydrogen lowers the score for that term in the document, and adding more documents that do not contain hydrogen raises the score.
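To make the chemical/hydrogen example concrete, here is a small Python sketch. The quoted passage does not give ConText's actual formula, so this uses a generic TF-IDF weight (term count multiplied by the log of total documents over documents containing the term) purely as a stand-in. It shows the same qualitative behaviour but will not reproduce the numbers in the table above; the function name `tf_idf` is an assumption for the example.

```python
import math

def tf_idf(term_count, total_docs, docs_with_term):
    """Generic TF-IDF weight, NOT ConText's formula.

    term_count     -- occurrences of the term in this document
    total_docs     -- number of documents in the set
    docs_with_term -- number of documents that contain the term
    """
    return term_count * math.log(total_docs / docs_with_term)

# 5000 chemistry documents: 'chemical' is in all of them, 'hydrogen' in one.
chemical = tf_idf(5, 5000, 5000)  # log(1) is 0, so the weight is 0.0
hydrogen = tf_idf(5, 5000, 1)     # rare term, so it gets a much higher weight
```

With this stand-in formula, hydrogen still outscores chemical if it occurs only 4 times, its weight drops as more documents come to contain it, and its weight rises as documents without it are added, which matches the behaviour the passage describes.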
This site, http://www.phpbuilder.com/columns/clay19990421.php3, explains how to build it in PHP and provides a noise list:
noisewords.txt
--------------
a about after ago all almost along also am an and answer any anybody anywhere are aren't around as ask at bad be been before being best better between big but by can can't come could couldn't day did didn't do does don't down each either else even ever every everybody everyone far find for found from get go going gone good got had has have haven't having her here hers him his home how href I if in into is isn't it its know large less like little looking look many me more most must my near never new news no none not nothing of off often old on once only or other our ours out over page please question rather recent she should sites small so some something sometime somewhere than true thank that the their theirs them then there these they this those though through thus time times to too under until untrue up upon use users version very via want was way web were what when where which who whom whose why wide will with within without world worse worst would www yes yet you your yours how
Pierre Join the Snitz WebRing |
 |
|
Kat
Advanced Member
    
United Kingdom
3065 Posts |
Posted - 08 May 2001 : 09:04:41
|
Thanks Gor. That noise list is great. I think I know how to handle that bit now. Time for a trial. And to learn more about CONTAINSTABLE.
KatsKorner
|
 |
|
Kat
Advanced Member
    
United Kingdom
3065 Posts |
Posted - 08 May 2001 : 09:45:25
|
I can't get CONTAINSTABLE to work. Can anyone help?
I have a full-text-indexed field on tblCompanyContact called s_companyname. At least one row should contain the word 'company', but I can't get the SQL to return anything.
I based it on something I found on the Microsoft site:
SELECT s_companyname
FROM tblcompanycontact AS FT_TBL
INNER JOIN CONTAINSTABLE(tblcompanycontact, s_companyname, 'company') AS KEY_TBL
    ON FT_TBL.l_companyid = KEY_TBL.[KEY]
I'm confused. I don't understand what the join is trying to do, but if I remove it, nothing works. It doesn't work with the join either. Help?
KatsKorner
Edited by - kat on 08 May 2001 09:46:18 |
 |
|
Kat
Advanced Member
    
United Kingdom
3065 Posts |
Posted - 08 May 2001 : 10:02:48
|
Have decided to not bother with full-text-indexing because the amount of data does not justify it. Going to use PATINDEX instead.
Thanks for the help guys!
KatsKorner
|
 |
|
Doug G
Support Moderator
    
USA
6493 Posts |
|
Kat
Advanced Member
    
United Kingdom
3065 Posts |
Posted - 08 May 2001 : 11:45:56
|
Thanks Doug!
KatsKorner
|
 |
|