Author |
Topic  |
shaneb
Junior Member
 
USA
319 Posts |
Posted - 08 August 2007 : 22:43:41
|
Hi everyone. I've been out of the loop on web development for a long time, and Snitz members have always been great at helping me when I've had a problem. So thanks in advance to those who help me.
Here is my problem. I found a script using the MS XMLHTTP object that lets me grab a web page, but I don't know how to strip out the unwanted navigation on the left side and at the top of the page. All I need is the tables on this page http://www.i44speedway.com/Trackpoints.htm without the navigation. The script I found is below:
<%
' Url of the webpage we want to retrieve
thisURL = "http://www.i44speedway.com/Trackpoints.htm"
' Creation of the xmlHTTP object
Set GetConnection = CreateObject("Microsoft.XMLHTTP")
' Connection to the URL
GetConnection.Open "get", thisURL, False
GetConnection.Send
' ResponsePage now have the response of
' the remote web server
ResponsePage = GetConnection.responseText
' We write out now
' the content of the ResponsePage var
Response.write (ResponsePage)
Set GetConnection = Nothing
%>
Can anyone rewrite this for me so that it parses out the navigation on the left side and at the top of the page and leaves me with just the tables? Better yet, I only need one specific table on this page: the table called Turf Tires. I am making a web site for my little cousin so that he can hopefully get sponsored. I appreciate any help you guys can offer. Thanks again! |
'Surround your mind and you shall see a great future ahead'
Shane B.
|
|
Doug G
Support Moderator
    
USA
6493 Posts |
Posted - 08 August 2007 : 23:09:10
|
google for "screen scraping"
|
====== Doug G ====== Computer history and help at www.dougscode.com |
 |
|
shaneb
Junior Member
 
USA
319 Posts |
Posted - 08 August 2007 : 23:25:35
|
quote: Originally posted by Doug G
google for "screen scraping"
Hi Doug.
Thanks, but I did that, as well as searching for Web Scraping, Page Scraping, Web Fetching, and Page Grabber. There was a script called ASP Page Grabber, but the site no longer exists. I know that there are components out there such as ASP Tear, but my host will not install them. It looks like there is some stuff for .NET but not for classic ASP, so I am having a heck of a time trying to find something. Do developers even do this kind of stuff anymore? If not, what do they do to grab content from a web page and put it on their own web site? Just so you guys know, I always ask permission from the owner before I display their content on my web page. |
'Surround your mind and you shall see a great future ahead'
Shane B.
|
Edited by - shaneb on 08 August 2007 23:40:31 |
 |
|
Shaggy
Support Moderator
    
Ireland
6780 Posts |
Posted - 09 August 2007 : 04:50:23
|
I'm not an XML programmer, so I can't provide the exact code, but here is what I would do in this case: instead of writing out the entire contents of the retrieved file, place the contents of the page into a variable (such as ResponsePage above), scan through it (using XML, RegEx or whatever way you want) to find the opening and closing tags of the part you want, and dump that part out into its own variable.
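A minimal sketch of the approach described above, using only InStr and Mid. ExtractBetween and its marker parameters are hypothetical names chosen for illustration, and the sample string stands in for the page text held in ResponsePage:

```vbscript
<%
' Sketch: return the text from the first occurrence of startMarker through
' the next occurrence of endMarker (inclusive), or "" if either is missing.
' ExtractBetween is a hypothetical helper, not part of any library.
Function ExtractBetween(source, startMarker, endMarker)
    Dim posStart, posEnd
    ExtractBetween = ""
    ' Case-insensitive search for the opening marker
    posStart = InStr(1, source, startMarker, vbTextCompare)
    If posStart = 0 Then Exit Function
    ' Search for the closing marker only after the opening one
    posEnd = InStr(posStart + Len(startMarker), source, endMarker, vbTextCompare)
    If posEnd = 0 Then Exit Function
    ExtractBetween = Mid(source, posStart, posEnd - posStart + Len(endMarker))
End Function

' Stand-in for the page text held in ResponsePage
Dim sample
sample = "<p>menu</p><table><tr><td>1</td></tr></table><p>footer</p>"
Response.Write ExtractBetween(sample, "<table", "</table>")
%>
```

With the sample string this should write only the table markup and drop the surrounding paragraphs. Checking for a zero result from InStr matters, because passing 0 as the start position to a later InStr or Mid call raises a runtime error.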
|
Search is your friend “I was having a mildly paranoid day, mostly due to the fact that the mad priest lady from over the river had taken to nailing weasels to my front door again.” |
 |
|
pdrg
Support Moderator
    
United Kingdom
2897 Posts |
Posted - 09 August 2007 : 10:01:03
|
How about using a right() and a left() - sorry I'm so rusty now... |
 |
|
pdrg
Support Moderator
    
United Kingdom
2897 Posts |
|
Podge
Support Moderator
    
Ireland
3776 Posts |
|
shaneb
Junior Member
 
USA
319 Posts |
Posted - 09 August 2007 : 20:09:44
|
quote: Originally posted by Podge
You could try using a regex to match all tables on the page. More than likely it will always be the same match i.e. the second or third table.
http://regexlib.com/Search.aspx?k=table
Thanks everyone! I found this expression on regexlib:
(?s)<tr[^>]*>(?<content>.*?)</tr>
This expression will match complete table rows (<tr>...</tr>) and put everything between the tr tags into a group named "content". So if I change it to the following, it matches all of the tables and puts each one's contents in a group called content, correct?
(?s)<table[^>]*>(?<content>.*?)</table>
Therefore, technically, it will find all of the tables in the page located at http://www.i44speedway.com/Trackpoints.htm
Looking at the HTML code, the Turf Tires table is number 6 in the HTML. I had to copy and paste the code and then do a find next until I got to the table I needed (Turf Tires). My question is: how do you write the expression so that it will find only table 6 in the HTML page?
Sorry, I still know very little about ASP.
Thanks again everyone. |
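One caveat on the patterns quoted above: they use .NET-style syntax (the inline (?s) option and the named group (?<content>...)), and as far as I know the VBScript RegExp object available in classic ASP supports neither. A hedged sketch of an equivalent for classic ASP uses [\s\S]*? to match across line breaks and picks the wanted table by its position in the match collection instead of encoding the position in the pattern. NthTable is a hypothetical helper name:

```vbscript
<%
' Sketch: return the n-th <table>...</table> block (1-based) in html, or "".
' NthTable is a hypothetical helper name, not part of any library.
Function NthTable(html, n)
    Dim re, matches
    NthTable = ""
    Set re = New RegExp
    ' [\s\S]*? matches any characters, including line breaks, non-greedily
    re.Pattern = "<table\b[^>]*>[\s\S]*?</table>"
    re.IgnoreCase = True
    re.Global = True            ' collect every match, not just the first
    Set matches = re.Execute(html)
    ' The Matches collection is zero-based, so table n is item n - 1
    If n >= 1 And n <= matches.Count Then NthTable = matches(n - 1).Value
    Set re = Nothing
End Function

' Stand-in for the downloaded page
Dim sample
sample = "<table><tr><td>A</td></tr></table>" & vbCrLf & _
         "<table><tr><td>B</td></tr></table>"
Response.Write NthTable(sample, 2)
%>
```

With the sample string this should write the second table. The non-greedy match stops at the first closing tag it meets, so a table nested inside another table would be cut short and would shift the numbering; the position of the Turf Tires table is something to confirm against the real page.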
'Surround your mind and you shall see a great future ahead'
Shane B.
|
Edited by - shaneb on 09 August 2007 20:32:03 |
 |
|
Shaggy
Support Moderator
    
Ireland
6780 Posts |
Posted - 10 August 2007 : 04:45:16
|
Just to clarify, is it only the data from the "Turf Tires" table you need to pull into your page or do you need the data from any of the other tables, such as "Kid Sprints"?
|
Search is your friend “I was having a mildly paranoid day, mostly due to the fact that the mad priest lady from over the river had taken to nailing weasels to my front door again.” |
 |
|
Podge
Support Moderator
    
Ireland
3776 Posts |
|
shaneb
Junior Member
 
USA
319 Posts |
Posted - 11 August 2007 : 22:49:20
|
Sorry, the only working code I have is from my first post in this topic, above. Again, it pulls the whole page from that link; I just want one table from the page.
Yes, the only table I need is the Turf Tires table. This is the only class of race he runs in, and it is where his standings are.
The expression from regexlib looks like it would do the trick to pull just the Turf Tires table data. However, I wouldn't know where to begin.
Thanks |
'Surround your mind and you shall see a great future ahead'
Shane B.
|
Edited by - shaneb on 11 August 2007 22:52:08 |
 |
|
pdrg
Support Moderator
    
United Kingdom
2897 Posts |
|
shaneb
Junior Member
 
USA
319 Posts |
Posted - 20 August 2007 : 02:19:50
|
It seems all I can find is the XML script I posted already or a .NET solution located at http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack but I am using classic ASP.
I can't use regular expressions because I don't know how to use them. I guess the best I can do is just capture the whole page. I wanted a cleaner look, but I'm tired of looking for a solution I can understand and implement.
If someone could come up with a way to do this for me, so that it parses the HTML, grabs only the table data from Turf Tires and writes it back to a plain HTML table, I'll pay you $20.00 USD via PayPal.
Thanks guys!
|
'Surround your mind and you shall see a great future ahead'
Shane B.
|
 |
|
Podge
Support Moderator
    
Ireland
3776 Posts |
Posted - 20 August 2007 : 08:22:34
|
<%
' Url of the webpage we want to retrieve
thisURL = "http://www.i44speedway.com/Trackpoints.htm"
' Creation of the xmlHTTP object
Set GetConnection = CreateObject("Microsoft.XMLHTTP")
' Connection to the URL
GetConnection.Open "get", thisURL, False
GetConnection.Send
' ResponsePage now have the response of
' the remote web server
ResponsePage = GetConnection.responseText
' We write out now
' the content of the ResponsePage var
'Response.write (getTable(ResponsePage))
posStart = inStr(ResponsePage, "<a name=""Turf Tires"">Turf Tires</a>")
posEnd = inStr(posStart, ResponsePage, "</table>", 1)
Response.Write mid(ResponsePage, posStart+88, posEnd-posStart+8-88)
Set GetConnection = Nothing
Function getTable(pageString)
dim myMatches
Set RegularExpressionObject = New RegExp
With RegularExpressionObject
.Pattern = "(<table class=""MsoTableGrid"" border=""1"" cellspacing=""1"" style=""border: 3px ridge #0000FF; padding-left: 4; padding-right: 4; padding-top: 1; padding-bottom: 1"">.*?)</table>"
.IgnoreCase = True
.Global = True
End With
'stripHTMLtags = RegularExpressionObject.Replace(HTMLstring, "")
set myMatches = RegularExpressionObject.Execute(lcase(pageString))
getTable = myMatches(0)
'response.Write(matches)
Set RegularExpressionObject = nothing
End Function
%>
I couldn't get the regexp to work correctly, so I used inStr to find the start point and end point of the table. I've included the regexp function so you can see how to use matches, but you don't need it to get this to work. The line that outputs the table is the Response.Write mid(...) line. |
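The hard-coded offset of 88 characters in the script above depends on the exact markup between the heading and the table, so it would break if the page changes slightly. A sketch of one way to avoid the magic number is to search for the next "<table" after the heading anchor. TableAfterHeading is a hypothetical helper name, and the sample markup is invented for illustration; the only assumption taken from the script above is that the table follows the Turf Tires anchor:

```vbscript
<%
' Sketch: return the first table that follows headingMarker, or "" if the
' heading, the table start or the table end cannot be found.
' TableAfterHeading is a hypothetical helper name.
Function TableAfterHeading(html, headingMarker)
    Dim posHeading, posTable, posEnd
    TableAfterHeading = ""
    posHeading = InStr(1, html, headingMarker, vbTextCompare)
    If posHeading = 0 Then Exit Function
    ' First table that starts after the heading
    posTable = InStr(posHeading, html, "<table", vbTextCompare)
    If posTable = 0 Then Exit Function
    posEnd = InStr(posTable, html, "</table>", vbTextCompare)
    If posEnd = 0 Then Exit Function
    TableAfterHeading = Mid(html, posTable, posEnd - posTable + Len("</table>"))
End Function

' Invented sample markup standing in for ResponsePage
Dim sample
sample = "<a name=""Kid Sprints"">Kid Sprints</a><table><tr><td>K</td></tr></table>" & _
         "<a name=""Turf Tires"">Turf Tires</a><p></p><table><tr><td>T</td></tr></table>"
Response.Write TableAfterHeading(sample, "<a name=""Turf Tires"">")
%>
```

With the sample markup this should write only the table that follows the Turf Tires anchor, however many characters sit between the anchor and the table.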
Podge.
The Hunger Site - Click to donate free food | My Blog | Snitz 3.4.05 AutoInstall (Beta!)
My Mods: CAPTCHA Mod | GateKeeper Mod Tutorial: Enable subscriptions on your board
Warning: The post above or below may contain nuts. |
Edited by - Podge on 20 August 2007 08:36:36 |
 |
|
pdrg
Support Moderator
    
United Kingdom
2897 Posts |
Posted - 20 August 2007 : 08:29:25
|
I'm afraid it'll have to be pseudocode from me... no warranties for this code, but I hope it gives you enough of a starting point to get you going. If it's worth it, donate the $20 to the Snitz fund - it helps towards Huw's hosting costs as we don't carry any advertising!
sourcestr = "...all the page text as above..."
startpos = instr(sourcestr, ">Turf Tires</a></font></b></p>") + 0
mystring = right(sourcestr, len(sourcestr) - startpos)
mystring = left(mystring, instr(mystring, ">Multi Class</a></font></b></p>") + 0)
response.write mystring
It's not easy to pick unique markers that don't introduce extra quote mark complications, but the ones I've got above will work. If you find they return too much or too little, then instead of the + 0 (clearly meaningless, it's only there as a safe placeholder) you could put -6 or +12 etc., until you get the result you want.
Hope it helps |
 |
|
Podge
Support Moderator
    
Ireland
3776 Posts |
Posted - 20 August 2007 : 15:18:55
|
Finally got a version working with the regexp. It's a more elegant solution and probably more reliable. It matches all the tables on the page and puts them into a collection called myMatches. You can response.write any table on the page using this code by changing the index on the line that sets getTable from myMatches.
<%
' Url of the webpage we want to retrieve
thisURL = "http://www.i44speedway.com/Trackpoints.htm"
' Creation of the xmlHTTP object
Set GetConnection = CreateObject("Microsoft.XMLHTTP")
' Connection to the URL
GetConnection.Open "get", thisURL, False
GetConnection.Send
' ResponsePage now have the response of
' the remote web server
ResponsePage = GetConnection.responseText
' We write out now
' the content of the ResponsePage var
Response.write (getTable(ResponsePage))
Set GetConnection = Nothing
Function getTable(pageString)
dim myMatches
Set RegularExpressionObject = New RegExp
With RegularExpressionObject
.Pattern = "<table.*>(.|\n)*?</table>"
.IgnoreCase = True
.Global = True
End With
set myMatches = RegularExpressionObject.Execute(pageString)
getTable = myMatches(4) ' Get the fifth table on the page
Set RegularExpressionObject = nothing
End Function
%> |
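The scripts in this thread assume the request always succeeds. A hedged sketch of a more defensive fetch is below: it checks the HTTP status before using the response and uses the MSXML2.ServerXMLHTTP object, which is the MSXML variant intended for server-side code. Whether that component is available depends on the host, so treat it as an assumption. FetchPage and the fallback message are hypothetical:

```vbscript
<%
' Sketch: fetch a page server-side and return its text, or "" on failure.
' FetchPage is a hypothetical helper name.
Function FetchPage(url)
    Dim http
    FetchPage = ""
    On Error Resume Next        ' a network failure should not crash the page
    Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")
    http.Open "GET", url, False ' False = synchronous request
    http.Send
    If Err.Number = 0 Then
        ' Only accept the body when the remote server answered 200 OK
        If http.Status = 200 Then FetchPage = http.responseText
    End If
    Set http = Nothing
    On Error GoTo 0
End Function

Dim page
page = FetchPage("http://www.i44speedway.com/Trackpoints.htm")
If page = "" Then
    Response.Write "Standings are temporarily unavailable."
Else
    Response.Write page         ' or pass page to getTable as above
End If
%>
```

Returning an empty string on any failure keeps the calling code simple: it only has to test for "" before handing the text to the table extraction.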
Podge.
The Hunger Site - Click to donate free food | My Blog | Snitz 3.4.05 AutoInstall (Beta!)
My Mods: CAPTCHA Mod | GateKeeper Mod Tutorial: Enable subscriptions on your board
Warning: The post above or below may contain nuts. |
 |
|