Snitz Forums 2000 - Tips on writing a search

Kat
Advanced Member

United Kingdom
3065 Posts

Posted - 08 May 2001 : 05:48:32

Can anyone give me some tips on how to go about writing a search?

I already have a search working nicely where it is just using filters to narrow down results. However, I need to add a keywords box to this search, and it needs to use CONTAINSTABLE rather than LIKE because we need the results to be returned by relevance.

I am not sure how to pull back the values from the text field in the form and split them up correctly.

Someone could enter "'Search Phrase', Search Words , another, word" or other combinations, and I am not sure how to pull these values back and split them into the correct parts to put in the CONTAINS("search for this") value. Request.Form("textfieldname") is not enough to split them up.

Any input would be appreciated!

KatsKorner

HuwR
Forum Admin

United Kingdom
20595 Posts

Posted - 08 May 2001 : 06:01:43

Kat,

You will need to write some parsing functions to extract the keywords from the input box. It will be quite complex, as you can't control how the users input the keywords. If they were all nicely separated by commas as in your example, you could just use split to obtain the separate keywords or phrases.

gor
Retired Admin

Netherlands
5511 Posts

Posted - 08 May 2001 : 06:11:19

Even if they are not separated using commas, you can still use split().
Best thing is to first split on " so users can use "search for this sentence"
and then split on spaces.
That way you also get the boolean operators split from the rest.
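
A minimal sketch of that two-stage split, written in Python for illustration rather than ASP. The function name parse_search_input and the (kind, text) token format are made up for this example, and it assumes the only operators users type are AND, OR and NOT. An unclosed quote is simply treated as a phrase running to the end of the input.

```python
import re

# Assumed set of boolean operators; adjust to whatever the search supports.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def parse_search_input(raw):
    """Split raw search text into phrases, words and boolean operators."""
    tokens = []
    # Splitting on the double quote puts quoted text at the odd indexes.
    for index, chunk in enumerate(raw.split('"')):
        if index % 2 == 1:
            # Inside quotes: keep the phrase whole, collapsing extra spaces.
            phrase = " ".join(chunk.split())
            if phrase:
                tokens.append(("phrase", phrase))
            continue
        # Outside quotes: treat commas like spaces and split on whitespace.
        for word in re.split(r"[\s,]+", chunk):
            if not word:
                continue
            if word.upper() in BOOLEAN_OPERATORS:
                tokens.append(("operator", word.upper()))
            else:
                tokens.append(("word", word))
    return tokens


if __name__ == "__main__":
    for token in parse_search_input('"search for this sentence" asp AND forum, snitz'):
        print(token)
```

The same idea carries over to VBScript with Split(), as long as the quoted chunks are handled before the split on spaces.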

Pierre

Kat
Advanced Member

United Kingdom
3065 Posts

Posted - 08 May 2001 : 07:04:47

Ok Gor,

I will go with your suggestion and split by spaces.

I understand that there are some words that need stripping from a search, otherwise SQL Server won't like it. Has anyone got any advice on what they are and how to handle them?

KatsKorner

HuwR
Forum Admin

United Kingdom
20595 Posts

Posted - 08 May 2001 : 07:08:20

I can't think of any that would affect your search. There are words you can't use as column names, but there should be no restriction on what you can search for.

Kat
Advanced Member

United Kingdom
3065 Posts

Posted - 08 May 2001 : 08:06:05

Hmm. Some of the guys here have a file of 'noise' words that apparently I should filter before trying to do a search.

I shall investigate.

KatsKorner

HuwR
Forum Admin

United Kingdom
20595 Posts

Posted - 08 May 2001 : 08:11:55

They probably just mean common usage words which are likely to appear in most records, things like 'the'.

gor
Retired Admin

Netherlands
5511 Posts

Posted - 08 May 2001 : 08:16:35

Yes, a noise file has words like 'the', 'of', 'a', etc.
You can also see that at http://www.google.com/ (not the list, but how it works).
If you type a search string with words that are on that list, it disregards them in the search because they are too common.
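
A rough sketch of that filtering step, in Python for illustration. The names load_noise_words and remove_noise_words are hypothetical, and the sample set holds only a handful of words; a real noise file would be much longer.

```python
# Small sample only; a real noise file would hold many more entries.
SAMPLE_NOISE_WORDS = {"a", "an", "and", "of", "the", "to"}


def load_noise_words(lines):
    """Build a lowercase set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_noise_words(words, noise_words=SAMPLE_NOISE_WORDS):
    """Drop words that appear in the noise set, ignoring case."""
    return [word for word in words if word.lower() not in noise_words]


if __name__ == "__main__":
    print(remove_noise_words(["The", "history", "of", "ASP"]))
```

Comparing in lowercase means 'The' and 'the' are treated alike, which is usually what you want for a keyword box.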

Pierre

gor
Retired Admin

Netherlands
5511 Posts

Posted - 08 May 2001 : 08:30:37

I found this info in an Oracle document (http://ksu154.himolde.no/oracledok/doc/cartridg.804/a58164.pdf):

To calculate a relevance score for a returned document in a text query, ConText uses an inverse frequency algorithm. Inverse frequency scoring assumes that frequently occurring terms in a document set are "noise" terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

Note: This section discusses how ConText calculates score for text queries, which is different from the way it calculates score for theme queries.

The following table illustrates ConText’s inverse frequency scoring. The first column shows the number of documents in the document set, and the second column shows the number of terms in the document necessary to score 100. This table assumes that only one document in the set contains the query term.

Number of Documents    Frequency of Term
in Document Set        in Document
1                      34
5                      20
10                     17
50                     13
100                    12
500                    10
1,000                  9
10,000                 7
100,000                5
1,000,000              4

The table illustrates that if only one document contained the query term and there were five documents in the set, the term would have to occur 20 times in the document to score 100. Whereas, if there were 1,000,000 documents in the set, the term would have to occur only 4 times in the document to score 100.

Example
You have 5000 documents dealing with chemistry in which the term chemical occurs at least once in every document. The term chemical thus occurs frequently in the document set.
You have a document that contains 5 occurrences of chemical and 5 occurrences of the term hydrogen. No other document contains the term hydrogen.
Because chemical occurs so frequently in the document set, its score for the document is lower with respect to hydrogen, which is infrequent in the document set as a whole. This is so even though both terms occur 5 times in the document.

Note: Even if the relatively infrequent term hydrogen occurred 4 times in the document, and chemical occurred 5 times in the document, the score for hydrogen might still be higher, because chemical occurs so frequently in the document set (at least 5000 times).

Inverse frequency scoring also means that adding documents that contain hydrogen lowers the score for that term in the document, and adding more documents that do not contain hydrogen raises the score.
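
To make the idea concrete, here is a small Python sketch using the common textbook form, term count times log(total documents / documents containing the term). This is an assumption for illustration only: the quoted text does not give ConText's actual formula, and this sketch will not reproduce the numbers in the table. The function name inverse_frequency_score is made up.

```python
import math


def inverse_frequency_score(term_count, docs_with_term, total_docs):
    """Textbook TF-IDF style score; NOT the formula ConText uses."""
    if docs_with_term <= 0 or docs_with_term > total_docs:
        raise ValueError("docs_with_term must be between 1 and total_docs")
    return term_count * math.log(total_docs / docs_with_term)


if __name__ == "__main__":
    # The chemistry example: 5000 documents, 'chemical' in all of them,
    # 'hydrogen' in only one, both occurring 5 times in that document.
    print("chemical:", inverse_frequency_score(5, 5000, 5000))
    print("hydrogen:", inverse_frequency_score(5, 1, 5000))
```

Even with this simplified formula the behaviour described above shows up: the rare term outscores the common one at the same count, more documents containing the term lower its score, and more documents without it raise the score.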

This site explains how to build it in PHP and provides a noise word list: http://www.phpbuilder.com/columns/clay19990421.php3


noisewords.txt 
-------------- 
a 
about 
after 
ago 
all 
almost 
along 
also 
am 
an 
and 
answer 
any 
anybody 
anywhere 
are 
aren't 
around 
as 
ask 
at 
bad 
be 
been 
before 
being 
best 
better 
between 
big 
but 
by 
can 
can't 
come 
could 
couldn't 
day 
did 
didn't 
do 
does 
don't 
down 
each 
either 
else 
even 
ever 
every 
everybody 
everyone 
far 
find 
for 
found 
from 
get 
go 
going 
gone 
good 
got 
had 
has 
have 
haven't 
having 
her 
here 
hers 
him 
his 
home 
how 
href 
I 
if 
in 
into 
is 
isn't 
it 
its 
know 
large 
less 
like 
little 
looking 
look 
many 
me 
more 
most 
must 
my 
near 
never 
new 
news 
no 
none 
not 
nothing 
of 
off 
often 
old 
on 
once 
only 
or 
other 
our 
ours 
out 
over 
page 
please 
question 
rather 
recent 
she 
should 
sites 
small 
so 
some 
something 
sometime 
somewhere 
than 
true 
thank 
that 
the 
their 
theirs 
them 
then 
there 
these 
they 
this 
those 
though 
through 
thus 
time 
times 
to 
too 
under 
until 
untrue 
up 
upon 
use 
users 
version 
very 
via 
want 
was 
way 
web 
were 
what 
when 
where 
which 
who 
whom 
whose 
why 
wide 
will 
with 
within 
without 
world 
worse 
worst 
would 
www 
yes 
yet 
you 
your 
yours 

Pierre

Kat
Advanced Member

United Kingdom
3065 Posts

Posted - 08 May 2001 : 09:04:41

Thanks Gor. That noise list is great. I think I know how to handle that bit now. Time for a trial. And to learn more about CONTAINSTABLE.

KatsKorner

Kat
Advanced Member

United Kingdom
3065 Posts

Posted - 08 May 2001 : 09:45:25

I can't get CONTAINSTABLE to work. Can anyone help?

I have a full-text-indexed field on tblCompanyContact called s_companyname. It should contain the word 'company', but I can't get the SQL to return anything.

I based it on something I found on the Microsoft site:

SELECT s_companyname
FROM tblcompanycontact AS FT_TBL INNER JOIN
   CONTAINSTABLE(tblcompanycontact, s_companyname, 'company') AS KEY_TBL
   ON FT_TBL.l_companyid = KEY_TBL.[KEY]

Confused. I don't understand what the join is trying to do, but if I remove it nothing works. It doesn't work anyway. Help?
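
On what the join does: CONTAINSTABLE does not return the matching rows themselves. It returns a small table with two columns, KEY and RANK, where KEY is the value of the full-text index's unique key column for each matching row. The join looks those keys up in the real table to fetch the other columns. The Python sketch below mimics that with in-memory data. All of the rows and rank values are invented, join_on_key is a hypothetical name, and it assumes l_companyid is the key column of the full-text index.

```python
# Invented stand-in for tblcompanycontact: l_companyid -> s_companyname.
company_contacts = {
    1: "Acme Company Ltd",
    2: "Widget Works",
    3: "The Company Store",
}

# Invented stand-in for what CONTAINSTABLE hands back: (KEY, RANK) rows only.
containstable_rows = [(3, 48), (1, 112)]


def join_on_key(table, key_rank_rows):
    """Mimic the INNER JOIN on [KEY], then sort by RANK, highest first."""
    joined = [
        (table[key], rank)
        for key, rank in key_rank_rows
        if key in table  # inner join: keys with no matching row are dropped
    ]
    return sorted(joined, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, rank in join_on_key(company_contacts, containstable_rows):
        print(rank, name)
```

In the SQL itself, the sort step corresponds to adding ORDER BY KEY_TBL.RANK DESC, which is what gives results by relevance.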

KatsKorner

Edited by - kat on 08 May 2001 09:46:18

Kat
Advanced Member

United Kingdom
3065 Posts

Posted - 08 May 2001 : 10:02:48

I have decided not to bother with full-text indexing because the amount of data does not justify it. I am going to use PATINDEX instead.
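
For anyone taking the same route, here is a rough Python sketch of building a PATINDEX-based WHERE clause from a list of keywords. The helper names escape_pattern and build_patindex_query are made up, the ? placeholders assume a driver that supports positional parameters, and the table and column names are the ones from the post above. Note that PATINDEX only reports where a pattern starts (0 if it is absent), so this finds matches but does not rank them by relevance.

```python
def escape_pattern(text):
    """Make %, _ and [ literal by wrapping each one in square brackets."""
    return "".join("[" + ch + "]" if ch in "%_[" else ch for ch in text)


def build_patindex_query(terms, table="tblcompanycontact", column="s_companyname"):
    """Return (sql, params) matching rows that contain any of the terms."""
    if not terms:
        raise ValueError("at least one search term is required")
    # One PATINDEX test per term; PATINDEX returns 0 when there is no match.
    tests = ["PATINDEX(?, {0}) > 0".format(column) for _ in terms]
    sql = "SELECT {0} FROM {1} WHERE {2}".format(column, table, " OR ".join(tests))
    # Values are passed as parameters rather than pasted into the SQL string.
    params = ["%" + escape_pattern(term) + "%" for term in terms]
    return sql, params


if __name__ == "__main__":
    sql, params = build_patindex_query(["company", "50%"])
    print(sql)
    print(params)
```

Swapping " OR " for " AND " would require every keyword to be present instead of any one of them.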

Thanks for the help guys!

KatsKorner

Doug G
Support Moderator

USA
6493 Posts

Posted - 08 May 2001 : 10:48:14

SQL Server 7 has a built-in noise word list.

http://support.microsoft.com/support/kb/articles/q247/5/61.asp

======
Doug G
======

Kat
Advanced Member

United Kingdom
3065 Posts

Posted - 08 May 2001 : 11:45:56

Thanks Doug!

KatsKorner
