How to identify phrases

Author : Podge
Subject : How to identify phrases
Posted : 31 March 2009, 15:38
Message :

Say you have a block of text and you need to identify some phrases that may or may not be in the block of text. What is the most efficient way of doing it? Phrases are given priority based on the number of words they contain.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer sit amet lacus. Fusce erat. Proin vel arcu quis justo viverra fermentum. Praesent ante mi, pretium vitae, dapibus ut, laoreet vitae, lacus. Aenean non diam. Sed non ipsum. In hac habitasse platea dictumst. Donec tincidunt mollis dui. Praesent magna mauris, elementum sed, cursus non, sodales quis, massa. Vestibulum mattis volutpat leo. Proin ornare ipsum ac justo. Quisque accumsan.

DB structure

Code:


phrase                        word_count
Lorem ipsum dolor sit amet    5
Lorem ipsum dolor             3
sit amet                      2
Lorem                         1

The phrase Lorem ipsum dolor sit amet would take precedence over Lorem ipsum dolor and Lorem.
You could iterate through all of the phrases (starting with the longest) and use inStr to test whether each one is there or not. For the purposes of this example the detected phrase is removed, i.e. if Lorem ipsum dolor is detected you don't have to worry about Lorem being detected in the same place later on, but other occurrences of Lorem would still be detected if they exist. There would be problems with this method if there are hundreds of thousands of phrases.
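A minimal sketch of that longest-first scan, written in Python purely for illustration. The function name `find_phrases` is hypothetical, `str.find` stands in for inStr, and like inStr it does plain substring matching, so word boundaries are ignored. Instead of deleting a detected phrase from the text, it marks the characters as claimed, which has the same effect on shorter phrases.

```python
def find_phrases(text, phrases):
    """Return (start, phrase) pairs; phrases with more words win overlaps."""
    # Longest first, mirroring the word_count column in the table above.
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    taken = [False] * len(text)  # characters already claimed by a longer phrase
    found = []
    for phrase in ordered:
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            if not any(taken[start:end]):
                found.append((start, phrase))
                taken[start:end] = [True] * len(phrase)
            # Keep looking for later occurrences of the same phrase.
            start = text.find(phrase, start + 1)
    return sorted(found)


phrases = ["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet", "Lorem"]
text = "Lorem ipsum dolor sit amet, consectetur. Integer sit amet lacus. Lorem."
for start, phrase in find_phrases(text, phrases):
    print(start, phrase)
```

Every phrase is still tested against the whole text, so the cost grows with the number of phrases, which is exactly the scaling problem described above.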
At the moment I am breaking the block of text up into words and storing them individually in an array. I then use a LIKE query to identify the phrases in the database beginning with the first word in the array. If rows are returned I then check whether the next word is the second word of any of the rows, and if it is I remove it from its position in the array, append it to the preceding array element and start the process again (recursive function). I don't have this working fully yet but it seems like a very inefficient way of doing things.
ASP, VB.NET, C#, Java, pseudo code or any other suggestions welcome.
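For comparison, here is a rough in-memory variant of the word-by-word idea, in Python. It is not the LIKE-query version described above: it assumes the phrases have been loaded once into a nested dictionary keyed by word (a trie), so each word of the text is looked up in memory instead of triggering a query. The names `build_trie`, `longest_matches` and `END` are made up for this sketch. Punctuation is dropped when splitting into words, so a phrase could match across a comma, and matching is case-sensitive.

```python
import re

END = object()  # hypothetical sentinel key: "a stored phrase ends at this node"


def build_trie(phrases):
    """Build a nested dict where each level is keyed by the next word."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root


def longest_matches(text, trie):
    """Scan left to right, taking the longest phrase starting at each word."""
    words = re.findall(r"\w+", text)
    found, i = [], 0
    while i < len(words):
        node, best, j = trie, None, i
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                best = (node[END], j)  # remember the longest phrase so far
        if best:
            found.append(best[0])
            i = best[1]  # skip past the words the phrase consumed
        else:
            i += 1
    return found


trie = build_trie(["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet", "Lorem"])
print(longest_matches("Lorem ipsum dolor sit amet, consectetur. Integer sit amet lacus.", trie))
```

The scan is greedy from left to right, so a shorter phrase that starts earlier can block a longer one that overlaps it. If strict longest-first priority matters, that case needs extra handling.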

Reply Author: AnonJr
Replied on: 31 March 2009, 17:13
Message:

http://lifehacker.com/5190716/primitive-word-counter-analyzes-word-and-phrase-frequency

May help, or maybe one or two of the projects mentioned in the comments.

Reply Author: ruirib
Replied on: 31 March 2009, 17:43
Message:

Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.
Regular expressions are supported in C#, and quite easy to use...
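As a rough illustration of the regex idea (in Python here, although the same pattern syntax should work in C#), all candidate phrases can be combined into one alternation with the longest phrases first, so the engine prefers them at each position. The helper name `compile_phrase_pattern` is made up for this sketch, and it assumes phrases and text use single spaces between words.

```python
import re


def compile_phrase_pattern(phrases):
    """Build one regex that prefers longer phrases over their prefixes."""
    # Alternatives are tried in order, so longer phrases must come first.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # \b keeps "Lorem" from matching inside a longer word.
    return re.compile(r"\b(?:" + alternation + r")\b")


pattern = compile_phrase_pattern(
    ["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet", "Lorem"]
)
text = "Lorem ipsum dolor sit amet, consectetur. Integer sit amet lacus. Lorem ipsum."
print(pattern.findall(text))
```

With a very large phrase list the alternation itself becomes huge, so this is probably only practical for a modest number of candidates.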

Reply Author: SiSL
Replied on: 31 March 2009, 17:58
Message:

Originally posted by ruirib
Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.
Regular expressions are supported in C#, and quite easy to use...

Agreed there, RegEx is a good way to go, and it can be used in nearly all languages. .NET and PHP regex also support lookbehinds. But if you want to use it in a DB query, it is a whole other story...
For the DB, I suggest the FREETEXTTABLE option with "RANK", "DENSE" etc. if you want to sort, or at least CONTAINS statements on Full-Text...
Something like:

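-- YourTable, COLTOSEARCH and ID are placeholders; the table needs a full-text index
SELECT t.COLTOSEARCH, kt.[RANK]
FROM YourTable AS t
INNER JOIN FREETEXTTABLE(YourTable, COLTOSEARCH, 'Lorem ipsum something something') AS kt
    ON kt.[KEY] = t.ID
ORDER BY kt.[RANK] DESC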

(Note: It may not be exact, I'm just typing what's left in my mind.)
This will return a Rank number showing how relevant each row is to your search... So a row fitting 5 words, for example, will have a higher Rank than one that fits only 1 word.

(PS: I think these are SQL Server 2005 or SQL Server 2008 functions.)

Reply Author: Podge
Replied on: 31 March 2009, 21:51
Message:

Why not use regular expressions? If you know the phrase candidates, as it seems per your post, seems pretty easy to find them in the text block.

It is easy to find phrases in the db but it means that you have to iterate through every one of what could be thousands of possible phrases stored in the db in order to match any that may be in the block of text. I'm eager to reduce the amount of querying that needs to be done.

It would be very convenient if you could just pass the block of text to the db (as a query) and it would give you back all of the longest phrases contained within it.
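Something close to this can be sketched by turning the comparison around: pass the text as a parameter and let the database test each stored phrase against it. The example below uses Python with an in-memory SQLite database and its `instr` function only so that it is runnable; the table layout follows the DB structure above, and the function name `phrases_in` is made up. On SQL Server the equivalent test would presumably be `CHARINDEX(phrase, @text) > 0`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (phrase TEXT PRIMARY KEY, word_count INTEGER)")
conn.executemany(
    "INSERT INTO phrases VALUES (?, ?)",
    [
        ("Lorem ipsum dolor sit amet", 5),
        ("Lorem ipsum dolor", 3),
        ("sit amet", 2),
        ("Lorem", 1),
    ],
)


def phrases_in(text):
    """One query: every stored phrase that occurs in text, longest first."""
    rows = conn.execute(
        "SELECT phrase, word_count FROM phrases "
        "WHERE instr(?, phrase) > 0 "  # instr(haystack, needle)
        "ORDER BY word_count DESC",
        (text,),
    )
    return rows.fetchall()


print(phrases_in("Integer sit amet lacus. Lorem."))
```

This is a single round trip, but the database still scans every phrase row, and it returns all contained phrases, not only the longest ones, so overlapping shorter phrases still have to be discarded afterwards.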

Reply Author: ruirib
Replied on: 01 April 2009, 07:20
Message:

Seems like you need some coding there... I would probably use a two-pronged process: index the words in the text block, then retrieve the phrases containing those words, probably ordered from longest to shortest, and try to match them. As I am assuming that you use SQL Server, using full-text indexing for the phrases would probably be best.
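A rough sketch of that two-step process, using Python and SQLite in place of SQL Server so it can be run as-is. It narrows the idea slightly: instead of full-text indexing, it assumes an extra indexed `first_word` column (not part of the original table) and retrieves only phrases whose first word occurs in the text. The function name `match_phrases` is made up for the example.

```python
import re
import sqlite3

PHRASES = ["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet", "Lorem"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (phrase TEXT, first_word TEXT, word_count INTEGER)")
conn.execute("CREATE INDEX idx_first_word ON phrases (first_word)")
conn.executemany(
    "INSERT INTO phrases VALUES (?, ?, ?)",
    [(p, p.split()[0], len(p.split())) for p in PHRASES],
)


def match_phrases(text):
    words = re.findall(r"\w+", text)
    if not words:
        return []
    # Step 1: one query for all phrases whose first word occurs in the text.
    # A very long text may exceed SQLite's parameter limit and need chunking.
    distinct = sorted(set(words))
    marks = ",".join("?" * len(distinct))
    rows = conn.execute(
        "SELECT phrase FROM phrases WHERE first_word IN (%s) "
        "ORDER BY word_count DESC" % marks,
        distinct,
    ).fetchall()
    # Step 2: try candidates longest first, claiming the word positions they use.
    taken = [False] * len(words)
    found = []
    for (phrase,) in rows:
        parts = phrase.split()
        n = len(parts)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == parts and not any(taken[i:i + n]):
                taken[i:i + n] = [True] * n
                found.append((i, phrase))
    # Report matches in the order they appear in the text.
    return [phrase for _, phrase in sorted(found)]


print(match_phrases("Lorem ipsum dolor sit amet, consectetur. Integer sit amet lacus."))
```

Only phrases that could possibly match are pulled from the database, so the matching loop in step 2 works on a much smaller candidate set than the full phrase table.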
