Say you have a block of text and you need to identify some phrases that may or may not be in it. What is the most efficient way of doing this? Phrases are given priority based on the number of words they contain.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer sit amet lacus. Fusce erat. Proin vel arcu quis justo viverra fermentum. Praesent ante mi, pretium vitae, dapibus ut, laoreet vitae, lacus. Aenean non diam. Sed non ipsum. In hac habitasse platea dictumst. Donec tincidunt mollis dui. Praesent magna mauris, elementum sed, cursus non, sodales quis, massa. Vestibulum mattis volutpat leo. Proin ornare ipsum ac justo. Quisque accumsan.
DB structure:
phrase                        word_count
Lorem ipsum dolor sit amet    5
Lorem ipsum dolor             3
sit amet                      2
Lorem                         1
The phrase "Lorem ipsum dolor sit amet" would take preference over "Lorem ipsum dolor" and "Lorem".
You could iterate through all of the phrases (starting with the longest) and use inStr to test whether each one is there or not. For the purposes of this example a detected phrase is removed, i.e. if "Lorem ipsum dolor" is detected you don't have to worry about "Lorem" being detected in the same place later on, but "Lorem" would still be detected elsewhere in the text if it occurs there. This method would run into problems if there are hundreds of thousands of phrases, because every phrase has to be tested against the whole text.
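To make the longest-first idea concrete, here is a minimal Python sketch of it. The function name and the use of a NUL character as a placeholder for removed matches are my own choices for illustration, and the substring test stands in for inStr. It assumes the phrase list fits in memory and that plain, case-sensitive substring matching is acceptable.

```python
def find_phrases_longest_first(text, phrases):
    """Return the phrases found in text, testing phrases with more words first."""
    found = []
    # Sort by word count, descending, so longer phrases take preference.
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    for phrase in ordered:
        # Plain substring test, comparable to an inStr check.
        while phrase in text:
            found.append(phrase)
            # Replace the first occurrence with a placeholder so that shorter
            # phrases cannot match in the same place, and so the text on
            # either side of the gap is not joined into a new match.
            text = text.replace(phrase, "\x00", 1)
    return found


sample = "Lorem ipsum dolor sit amet, consectetur. Lorem ipsum dolor est. Lorem."
phrases = ["Lorem", "sit amet", "Lorem ipsum dolor", "Lorem ipsum dolor sit amet"]
print(find_phrases_longest_first(sample, phrases))
```

Each phrase triggers at least one scan of the whole text, so the cost grows with the number of phrases multiplied by the text length, which is where the trouble with very large phrase lists comes from.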
At the moment I break the block of text up into words and store them individually in an array. I then use a LIKE query to find the phrases in the database that begin with the first word in the array. If rows are returned, I check whether the next word is the second word of any of those rows; if it is, I remove it from its position in the array, append it to the preceding array element, and start the process again (a recursive function). I don't have this working fully yet, but it seems like a very inefficient way of doing things.
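For comparison, the same word-by-word idea can be sketched in memory instead of with one LIKE query per word. This is not the approach described above but a swapped-in technique: a trie keyed on words, walked greedily to find the longest phrase starting at each position. The names `tokenize`, `build_trie` and `match_phrases` are hypothetical, and the sketch assumes the phrases have been loaded from the table beforehand, that matching is case-insensitive, and that punctuation can be ignored (so a phrase could match across a sentence boundary).

```python
import re

END = None  # key marking a complete phrase; never equal to a word token


def tokenize(s):
    # Assumption: lower-case everything and drop punctuation.
    return re.findall(r"\w+", s.lower())


def build_trie(phrases):
    """Build a nested-dict trie with one level per word of each phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in tokenize(phrase):
            node = node.setdefault(word, {})
        node[END] = phrase
    return root


def match_phrases(text, trie):
    """Scan the text once, taking the longest phrase at each position."""
    words = tokenize(text)
    found = []
    i = 0
    while i < len(words):
        node, best, best_len = trie, None, 0
        j = i
        # Follow the trie for as long as the next word continues some phrase.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                best, best_len = node[END], j - i
        if best is not None:
            found.append(best)
            i += best_len  # skip the matched words so they are not reused
        else:
            i += 1
    return found


trie = build_trie(["Lorem ipsum dolor sit amet", "Lorem ipsum dolor", "sit amet", "Lorem"])
print(match_phrases("Lorem ipsum dolor sit amet, consectetur. Integer sit amet lacus. Lorem.", trie))
```

With this layout the work per text position is bounded by the length of the longest phrase rather than by the number of phrases, at the cost of holding the phrase list in memory.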
ASP, VB.NET, C#, Java, pseudocode or any other suggestions welcome.