Snitz Forums 2000
Snitz Forums 2000
Home | Profile | Register | Active Topics | Members | Search | FAQ
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 Community Forums
 Community Discussions (All other subjects)
 How to identify phrases
 New Topic  Reply to Topic
 Printer Friendly
Author Previous Topic Topic Next Topic  

Podge
Support Moderator

Ireland
3775 Posts

Posted - 31 March 2009 :  15:38:11  Show Profile  Send Podge an ICQ Message  Send Podge a Yahoo! Message  Reply with Quote
Say you have a block of text and you need to identify some phrases that may or may not be in the block of text, what is the most efficient way of doing it. Phrases are given priority based on the length of words.

quote:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer sit amet lacus. Fusce erat. Proin vel arcu quis justo viverra fermentum. Praesent ante mi, pretium vitae, dapibus ut, laoreet vitae, lacus. Aenean non diam. Sed non ipsum. In hac habitasse platea dictumst. Donec tincidunt mollis dui. Praesent magna mauris, elementum sed, cursus non, sodales quis, massa. Vestibulum mattis volutpat leo. Proin ornare ipsum ac justo. Quisque accumsan.


DB structure

phrase					word_count
Lorem ipsum dolor sit amet		5
Lorem ipsum dolor				3
sit amet					2
Lorem						1


The phrase Lorem ipsum dolor sit amet would take preference over Lorem ipsum dolor and Lorem

You could iterate through all of the phrases (starting with the longest) and use inStr to test whether it is there or not. For the purposes of this example the detected phrase is removed i.e. if Lorem ipsum dolor is detected you don't have to worry about Lorem being detected in the same place later on but other Lorem phrases would be if they exist. There would be problems with this method if there are hundreds of thousands of phrases.

At the moment I am breaking the block of text up into words and store them individually in an array. I then use a LIKE query to identify the phrases in the database beginning with the first word in the array. If rows are returned I then check to see if the next word is the second word of any of the rows and if it is I remove it from its position in the array and add it to the preceeding array element and start the process again (recursive function). I don't have this working fully yet but it seems like a very inefficient way of doing things.

asp, vb dotnet, c#, java, pseudo code or any other suggestions welcome.

Podge.

The Hunger Site - Click to donate free food | My Blog | Snitz 3.4.05 AutoInstall (Beta!)

My Mods: CAPTCHA Mod | GateKeeper Mod
Tutorial: Enable subscriptions on your board

Warning: The post above or below may contain nuts.

Edited by - Podge on 31 March 2009 15:39:22

AnonJr
Moderator

United States
5768 Posts

Posted - 31 March 2009 :  17:13:33  Show Profile  Visit AnonJr's Homepage  Reply with Quote
http://lifehacker.com/5190716/primitive-word-counter-analyzes-word-and-phrase-frequency

May help, or maybe one or two of the projects mentioned in the comments.
Go to Top of Page

ruirib
Snitz Forums Admin

Portugal
26364 Posts

Posted - 31 March 2009 :  17:43:25  Show Profile  Send ruirib a Yahoo! Message  Reply with Quote
Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.

Regular expressions are supported in C#, and quite easy to use...


Snitz 3.4 Readme | Like the support? Support Snitz too
Go to Top of Page

SiSL
Average Member

Turkey
671 Posts

Posted - 31 March 2009 :  17:58:25  Show Profile  Visit SiSL's Homepage  Reply with Quote
quote:
Originally posted by ruirib

Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.

Regular expressions are supported in C#, and quite easy to use...



Agreed there, good way to go with RegEx, can be used in all languages even. .net & php regex also supports look-backs too. But if you want to use in DB query , it is whole another story...

For DB, I suggest FREETEXTTABLE option and "RANK" and "DENSE" etc. If you want to sort, or at least CONTAINS statements on Full-Text...

Say like:

SELECT t.COLTOSEARCH, kt.RANK FROM TABLE t
LEFT JOIN FREETEXTTABLE(TABLE, COLTOSEARCH, 'Lorem ipsum something something') AS kt ON (kt.[KEY] = t.ID)

(Note: It may not be exact , I'm just typing what's left in my mind)
Will return numbers as Rank relativeness rate of your search... So fitting 5 words for example will have higher Rank then those who fits only 1 word..


PS: I think it is SQL-2005 or SQL-2008 functions)

CHIP Online Forum

My Mods
Select All Code | Fix a vulnerability for your private messages | Avatar Categories W/ Avatar Gallery Mod | Complaint Manager
Admin Level Revisited | Merge Forums | No More Nested Quotes Mod

Edited by - SiSL on 31 March 2009 18:07:17
Go to Top of Page

Podge
Support Moderator

Ireland
3775 Posts

Posted - 31 March 2009 :  21:51:58  Show Profile  Send Podge an ICQ Message  Send Podge a Yahoo! Message  Reply with Quote
quote:
Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.
It is easy to find phrases in the db but it means that you you have to iterate through every one of what could be thousands of possible phrases stored in the db in order to match any that may be in the block of text. I'm eager to reduce the amount of querying that needs to be done.

It would be very convenient if you could just pass the block of text to the db (as a query) and it would give you back all of the longest phrases contained within it.

Podge.

The Hunger Site - Click to donate free food | My Blog | Snitz 3.4.05 AutoInstall (Beta!)

My Mods: CAPTCHA Mod | GateKeeper Mod
Tutorial: Enable subscriptions on your board

Warning: The post above or below may contain nuts.
Go to Top of Page

ruirib
Snitz Forums Admin

Portugal
26364 Posts

Posted - 01 April 2009 :  07:20:16  Show Profile  Send ruirib a Yahoo! Message  Reply with Quote
Seems like you need some coding there... Probably would do a double pronged process. Would index the words in the text block, would then retrieve phrases containing those words, probably ordered from longest to shortest and try to match them. As I am admiting that you use SQL Server, probably using text indexing for the phrases would be best.



Snitz 3.4 Readme | Like the support? Support Snitz too
Go to Top of Page
  Previous Topic Topic Next Topic  
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
Snitz Forums 2000 © 2000-2021 Snitz™ Communications Go To Top Of Page
This page was generated in 0.14 seconds. Powered By: Snitz Forums 2000 Version 3.4.07