Snitz Forums 2000 - How to identify phrases


Podge
Support Moderator

Ireland
3776 Posts

Posted - 31 March 2009 : 15:38:11

Say you have a block of text and you need to identify some phrases that may or may not be in it. What is the most efficient way of doing that? Phrases are given priority based on the number of words they contain.

quote:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer sit amet lacus. Fusce erat. Proin vel arcu quis justo viverra fermentum. Praesent ante mi, pretium vitae, dapibus ut, laoreet vitae, lacus. Aenean non diam. Sed non ipsum. In hac habitasse platea dictumst. Donec tincidunt mollis dui. Praesent magna mauris, elementum sed, cursus non, sodales quis, massa. Vestibulum mattis volutpat leo. Proin ornare ipsum ac justo. Quisque accumsan.

DB structure


phrase                        word_count
Lorem ipsum dolor sit amet    5
Lorem ipsum dolor             3
sit amet                      2
Lorem                         1

The phrase Lorem ipsum dolor sit amet would take precedence over Lorem ipsum dolor and Lorem.

You could iterate through all of the phrases (starting with the longest) and use InStr to test whether each one is present. For the purposes of this example a detected phrase is removed from the text, i.e. if Lorem ipsum dolor is detected you don't have to worry about Lorem being detected in the same place later on, but other occurrences of Lorem elsewhere in the text would still be detected. This method would run into problems if there are hundreds of thousands of phrases.
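A minimal sketch of that longest-first scan, written in Python rather than ASP and using str.find in place of InStr. It assumes the phrases have already been loaded into a list; the function name is made up for illustration. It matches plain substrings, so it does not check word boundaries.

```python
def find_phrases_longest_first(text, phrases):
    """Return the phrases found in text, trying phrases with more words first."""
    found = []
    remaining = text
    # Sort by word count, highest first, so longer phrases win.
    for phrase in sorted(phrases, key=lambda p: len(p.split()), reverse=True):
        pos = remaining.find(phrase)
        while pos != -1:
            found.append(phrase)
            # Blank out the match so a shorter phrase cannot match in the same place.
            remaining = remaining[:pos] + "\0" * len(phrase) + remaining[pos + len(phrase):]
            pos = remaining.find(phrase, pos + len(phrase))
    return found
```

Every phrase is still tested against the text once, which is the cost the post is worried about when the phrase list is very large.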

At the moment I am breaking the block of text up into words and storing them individually in an array. I then use a LIKE query to identify the phrases in the database beginning with the first word in the array. If rows are returned I check whether the next word in the array is the second word of any of those rows; if it is, I remove it from its position in the array, append it to the preceding array element and start the process again (recursive function). I don't have this working fully yet, and it seems like a very inefficient way of doing things.
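One way to picture the word-by-word idea without a query per word is sketched below in Python. It is an in-memory variant, not the LIKE-based version described above: it assumes all phrases can be loaded once and grouped by their first word, and the names build_index and match_tokens are hypothetical. Matching is case-insensitive and ignores punctuation, which is also an assumption.

```python
import re
from collections import defaultdict

def build_index(phrases):
    """Map each first word to its phrases (as word tuples), longest first."""
    index = defaultdict(list)
    for phrase in phrases:
        words = tuple(phrase.lower().split())
        if words:
            index[words[0]].append(words)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return index

def match_tokens(text, index):
    """Walk the text word by word, taking the longest phrase at each position."""
    tokens = re.findall(r"\w+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        for words in index.get(tokens[i], ()):
            if tuple(tokens[i:i + len(words)]) == words:
                matches.append(" ".join(words))
                i += len(words)  # skip the words consumed by this phrase
                break
        else:
            i += 1  # no phrase starts with this word
    return matches
```

Only phrases that start with the current word are ever compared, so the work per word depends on how many phrases share that first word rather than on the total number of phrases.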

asp, vb dotnet, c#, java, pseudo code or any other suggestions welcome.

Podge.


Edited by - Podge on 31 March 2009 15:39:22

AnonJr
Moderator

United States
5768 Posts

Posted - 31 March 2009 : 17:13:33

http://lifehacker.com/5190716/primitive-word-counter-analyzes-word-and-phrase-frequency

May help, or maybe one or two of the projects mentioned in the comments.

ruirib
Snitz Forums Admin

Portugal
26364 Posts

Posted - 31 March 2009 : 17:43:25

Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.

Regular expressions are supported in C#, and quite easy to use...
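As a rough illustration of the regular expression route, here is a sketch in Python rather than C#. It assumes the phrase candidates are available as a list and that one combined pattern is acceptable; the function names are made up for the example. The phrases are ordered longest first because alternation takes the first alternative that matches at a given position.

```python
import re

def build_phrase_regex(phrases):
    """Compile one pattern that matches any of the phrases as whole words."""
    # Phrases with more words come first so they win over their own prefixes.
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def find_with_regex(text, phrases):
    """Return the matched phrases in the order they occur in the text."""
    return [m.group(0) for m in build_phrase_regex(phrases).finditer(text)]
```

Matches do not overlap, so once Lorem ipsum dolor sit amet has matched, Lorem cannot match again at the same place.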


SiSL
Average Member

Turkey
671 Posts

Posted - 31 March 2009 : 17:58:25

quote:
Originally posted by ruirib

Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.

Regular expressions are supported in C#, and quite easy to use...

Agreed, RegEx is a good way to go, and it can be used in pretty much any language. .NET and PHP regex also support lookbehinds. But if you want to do this in a DB query, that is a whole other story...

For the DB, I suggest the FREETEXTTABLE option, together with "RANK", "DENSE" etc. if you want to sort, or at least CONTAINS statements on a full-text index...

Say like:

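-- TABLE, COLTOSEARCH and ID are placeholders for your own table, full-text indexed column and key column
SELECT t.COLTOSEARCH, kt.RANK FROM [TABLE] t
INNER JOIN FREETEXTTABLE([TABLE], COLTOSEARCH, 'Lorem ipsum something something') AS kt ON (kt.[KEY] = t.ID)
ORDER BY kt.RANK DESC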

(Note: this may not be exact, I'm just typing what I remember.)
This returns a RANK value for each row showing how relevant it is to your search, so a row matching 5 words, for example, will have a higher RANK than one matching only 1 word.

(PS: I think these are SQL Server 2005 or 2008 functions.)

Edited by - SiSL on 31 March 2009 18:07:17

Podge
Support Moderator

Ireland
3776 Posts

Posted - 31 March 2009 : 21:51:58

quote:
Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.

It is easy to find phrases in the db, but it means that you have to iterate through every one of what could be thousands of possible phrases stored in the db in order to match any that may be in the block of text. I'm eager to reduce the amount of querying that needs to be done.

It would be very convenient if you could just pass the block of text to the db (as a query) and it would give you back all of the longest phrases contained within it.

ruirib
Snitz Forums Admin

Portugal
26364 Posts

Posted - 01 April 2009 : 07:20:16

Seems like you need some coding there... I would probably do a two-pronged process: index the words in the text block, then retrieve the phrases containing those words, probably ordered from longest to shortest, and try to match them. Assuming you use SQL Server, full-text indexing the phrases would probably be best.
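The retrieval half of that process might look roughly like the Python sketch below. It stands in for the database with an in-memory word index, which is an assumption, and build_word_index and candidate_phrases are hypothetical names. It only narrows the phrase list; the surviving candidates would still need an exact match against the text.

```python
import re
from collections import defaultdict

def build_word_index(phrases):
    """Map each word to the set of phrases that contain it."""
    index = defaultdict(set)
    for phrase in phrases:
        for word in phrase.lower().split():
            index[word].add(phrase)
    return index

def candidate_phrases(text, word_index):
    """Return phrases whose words all occur in the text, most words first."""
    text_words = set(re.findall(r"\w+", text.lower()))
    candidates = set()
    for word in text_words:
        candidates |= word_index.get(word, set())
    # Drop phrases that share only some of their words with the text.
    candidates = {p for p in candidates if set(p.lower().split()) <= text_words}
    # Most words first; ties are broken alphabetically to keep the order stable.
    return sorted(candidates, key=lambda p: (-len(p.split()), p))
```

A phrase can pass this filter without its words being adjacent in the text, which is why the final matching step is still required.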
