How to identify phrases - Snitz™ Forums 2000


How to identify phrases - Posted 31 Mar 2009 15:38 (1363 Views)

Podge

Support Moderator

Posts: 3776


Say you have a block of text and you need to identify some phrases that may or may not be in it. What is the most efficient way of doing that? Phrases are given priority based on the number of words they contain.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer sit amet lacus. Fusce erat. Proin vel arcu quis justo viverra fermentum. Praesent ante mi, pretium vitae, dapibus ut, laoreet vitae, lacus. Aenean non diam. Sed non ipsum. In hac habitasse platea dictumst. Donec tincidunt mollis dui. Praesent magna mauris, elementum sed, cursus non, sodales quis, massa. Vestibulum mattis volutpat leo. Proin ornare ipsum ac justo. Quisque accumsan.

DB structure

Code:


phrase                        word_count
Lorem ipsum dolor sit amet    5
Lorem ipsum dolor             3
sit amet                      2
Lorem                         1

The phrase Lorem ipsum dolor sit amet would take precedence over Lorem ipsum dolor and Lorem.
You could iterate through all of the phrases (starting with the longest) and use InStr to test whether each one is there or not. For the purposes of this example the detected phrase is removed, i.e. if Lorem ipsum dolor is detected you don't have to worry about Lorem being detected in the same place later on, but other occurrences of Lorem elsewhere in the text would still be detected. This method would scale poorly if there are hundreds of thousands of phrases.
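A minimal Python sketch of the longest-first scan described above, where `str.find` plays the role of InStr. The function name `find_phrases` and the use of a NUL placeholder to "remove" a detected phrase are assumptions made for illustration, not part of the original post.

```python
def find_phrases(text, phrases):
    """Return the phrases found in text, longest (by word count) first.

    Each detected phrase is blanked out so that a shorter phrase
    cannot be detected again in the same place.
    Note: this is plain substring matching, so "sit amet" would also
    match inside "visit amet".
    """
    found = []
    remaining = text
    # Longest phrases first, mirroring the word_count column.
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    for phrase in ordered:
        pos = remaining.find(phrase)  # equivalent of InStr
        while pos != -1:
            found.append(phrase)
            # Overwrite the match with placeholders (assumed not to occur in text).
            end = pos + len(phrase)
            remaining = remaining[:pos] + "\0" * len(phrase) + remaining[end:]
            pos = remaining.find(phrase, end)
    return found


text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer sit amet lacus."
phrases = ["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet", "Lorem"]
print(find_phrases(text, phrases))
```

This still tests every phrase against the text, so the cost grows with the number of phrases, which is the scaling concern raised above.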
At the moment I am breaking the block of text up into words and storing them individually in an array. I then use a LIKE query to find the phrases in the database that begin with the first word in the array. If rows are returned, I check whether the next word is the second word of any of those rows; if it is, I remove it from its position in the array, append it to the preceding array element and start the process again (recursive function). I don't have this working fully yet, but it seems like a very inefficient way of doing things.
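One possible in-memory variant of the word-by-word idea above, sketched in Python. It swaps the repeated LIKE queries for a dictionary keyed by each phrase's first word, which is an assumption (it requires the phrase list to fit in memory); the names `build_index` and `match_phrases` are hypothetical.

```python
import re
from collections import defaultdict


def build_index(phrases):
    """Map each first word to its phrases (as word tuples), longest first."""
    index = defaultdict(list)
    for phrase in phrases:
        words = tuple(phrase.split())
        index[words[0]].append(words)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return index


def match_phrases(text, index):
    """Scan the text once, taking the longest phrase starting at each word."""
    words = re.findall(r"\w+", text)  # punctuation is dropped here
    found = []
    i = 0
    while i < len(words):
        for candidate in index.get(words[i], ()):
            if tuple(words[i:i + len(candidate)]) == candidate:
                found.append(" ".join(candidate))
                i += len(candidate)  # skip the words this phrase consumed
                break
        else:
            i += 1  # no phrase starts at this word
    return found


phrases = ["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet", "Lorem"]
index = build_index(phrases)
print(match_phrases("Lorem ipsum dolor sit amet, consectetur. Integer sit amet lacus. Lorem ipsum.", index))
```

Only phrases sharing a first word with the current position are ever compared, so the work per word depends on that bucket's size rather than on the total number of phrases. Because punctuation is discarded, a phrase could match across a comma or full stop.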
ASP, VB.NET, C#, Java, pseudocode or any other suggestions welcome.


Last edited by Podge on 31 March 2009, 15:39

Posted 31 Mar 2009 17:13

AnonJr

Forum Moderator

Posts: 5768


http://lifehacker.com/5190716/primitive-word-counter-analyzes-word-and-phrase-frequency

May help, or maybe one or two of the projects mentioned in the comments.

Posted 31 Mar 2009 17:43

ruirib

Snitz Forums Admin

Posts: 26364


Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.
Regular expressions are supported in C#, and quite easy to use...
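A rough sketch of the regular expression idea, written in Python rather than C# purely for illustration. It assumes the phrase candidates are already loaded into a list; `build_pattern` is a hypothetical helper name. Listing the alternatives longest first matters because Python's `re` takes the first alternative that matches at a given position.

```python
import re


def build_pattern(phrases):
    """Compile one alternation of all phrases, longest (by word count) first."""
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # \b keeps "sit amet" from matching inside a longer word such as "visit".
    return re.compile(r"\b(?:" + alternation + r")\b")


phrases = ["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet", "Lorem"]
pattern = build_pattern(phrases)
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer sit amet lacus."
print(pattern.findall(text))
```

Matches found by `findall` do not overlap, so a shorter phrase is not reported again inside a longer one that has already matched.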

Snitz 3.4 Readme | Like the support? Support Snitz too

Posted 31 Mar 2009 17:58

SiSL

Average Member

Posts: 671


Originally posted by ruirib
Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.
Regular expressions are supported in C#, and quite easy to use...

Agreed, RegEx is a good way to go, and it can be used in practically every language. .NET and PHP regex also support lookbehind. But if you want to do it in a DB query, that is a whole other story...
For the DB, I suggest the FREETEXTTABLE option, with "RANK", "DENSE" etc. if you want to sort, or at least CONTAINS statements on Full-Text...
Something like:

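-- TABLE, COLTOSEARCH and ID are placeholders for your own table, column and key.
SELECT t.COLTOSEARCH, kt.RANK FROM TABLE t
LEFT JOIN FREETEXTTABLE(TABLE, COLTOSEARCH, 'Lorem ipsum something something') AS kt ON (kt.[KEY] = t.ID)
ORDER BY kt.RANK DESC -- best matches first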

(Note: it may not be exact, I'm just typing what I remember.)
It will return a number in RANK indicating how relevant each row is to your search, so a row matching 5 words, for example, will have a higher rank than one matching only 1 word.

(PS: I think these are SQL 2005 or SQL 2008 functions.)

Last edited by SiSL on 31 March 2009, 18:07

Posted 31 Mar 2009 21:51

Podge

Support Moderator

Posts: 3776


Originally posted by ruirib
Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.

It is easy to find the phrases, but it means that you have to iterate through every one of what could be thousands of possible phrases stored in the db in order to match any that may be in the block of text. I'm keen to reduce the amount of querying that needs to be done.

It would be very convenient if you could just pass the block of text to the db (as a query) and it would give you back all of the longest phrases contained within it.
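For what it's worth, a single query can get close to this by reversing the usual search: the text block is the parameter and the phrase column is the needle. The sketch below uses Python with an in-memory SQLite database as a stand-in, which is an assumption (the thread does not say which database is in use); the `phrases` table layout follows the DB structure from the first post, and `phrases_in_text` is a hypothetical name. On SQL Server, `CHARINDEX` could presumably take the place of SQLite's `instr`.

```python
import sqlite3


def phrases_in_text(conn, text):
    """Return (phrase, word_count) rows for every phrase contained in text."""
    return conn.execute(
        "SELECT phrase, word_count FROM phrases "
        "WHERE instr(?, phrase) > 0 "  # the text block is searched for each phrase
        "ORDER BY word_count DESC",    # longest phrases first
        (text,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (phrase TEXT PRIMARY KEY, word_count INTEGER)")
conn.executemany(
    "INSERT INTO phrases VALUES (?, ?)",
    [
        ("Lorem ipsum dolor sit amet", 5),
        ("Lorem ipsum dolor", 3),
        ("sit amet", 2),
        ("Lorem", 1),
        ("Quisque accumsan", 2),  # not present in the sample text below
    ],
)
rows = phrases_in_text(conn, "Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
print(rows)
```

This is one round trip instead of one query per word, but the database still has to test every row, and it returns every contained phrase rather than only the longest at each position, so overlapping matches would still need to be resolved in code.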

Posted 01 Apr 2009 07:20

ruirib

Snitz Forums Admin

Posts: 26364


Seems like you need some coding there... I would probably use a two-pronged process: index the words in the text block, then retrieve the phrases containing those words, probably ordered from longest to shortest, and try to match them. Since I am assuming that you use SQL Server, full-text indexing for the phrases would probably be best.
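The two steps above could look roughly like this in Python. This is a sketch under assumptions: the phrases are held in a plain list rather than a full-text index, a phrase counts as a candidate only if all of its words occur somewhere in the text, and the names `candidate_phrases` and `confirm` are hypothetical.

```python
import re


def candidate_phrases(text, phrases):
    """Step 1: keep phrases whose words all occur in the text, longest first."""
    text_words = set(re.findall(r"\w+", text))  # word index of the text block
    candidates = [p for p in phrases if set(p.split()) <= text_words]
    return sorted(candidates, key=lambda p: len(p.split()), reverse=True)


def confirm(text, candidates):
    """Step 2: keep only candidates that appear as a contiguous phrase."""
    return [p for p in candidates
            if re.search(r"\b" + re.escape(p) + r"\b", text)]


text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
phrases = ["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet",
           "Lorem", "dolor ipsum", "Quisque accumsan"]
candidates = candidate_phrases(text, phrases)
print(confirm(text, candidates))
```

The first step is cheap and discards most phrases without touching the text again; the second step is needed because a phrase such as "dolor ipsum" passes the word test even though its words never appear in that order.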
