HTML to plain text conversion
Posted: Sun May 15, 2011 11:26 pm
To All
I am trying to take some HTML-formatted text (a web page scrape), convert it to plain text, and remove all the spaces. I have removed the tags (for the most part) with the code below, but I cannot remove the spaces and get readable text.
Any advice would be appreciated.
Thanks
Rick Lipkin
Code:
cDESC := upper(IE:document:documentElement:outerHTML)
// clean up the text
nLOOP := 0
DO WHILE .T.
IF AT('<TD CLASS='+'"'+'STD'+'">' , cDESC) > 0
cDESC := STRTRAN( cDESC, '<TD CLASS='+'"'+'STD'+'">', space(0) )
ENDIF
nLOOP++
IF nLOOP > 10
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("<TD>", cDESC) > 0
cDESC := STRTRAN( cDESC, "<TD>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("</TD>", cDESC) > 0
cDESC := STRTRAN( cDESC, "</TD>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("</TR>", cDESC) > 0
cDESC := STRTRAN( cDESC, "</TR>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("<TR>", cDESC) > 0
cDESC := STRTRAN( cDESC, "<TR>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("<BR>", cDESC) > 0
cDESC := STRTRAN( cDESC, "<BR>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("<LI>", cDESC) > 0
cDESC := STRTRAN( cDESC, "<LI>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("</LI>", cDESC) > 0
cDESC := STRTRAN( cDESC, "</LI>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("<B>", cDESC) > 0
cDESC := STRTRAN( cDESC, "<B>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 40
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("</B>", cDESC) > 0
cDESC := STRTRAN( cDESC, "</B>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 40
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("</UL>", cDESC) > 0
cDESC := STRTRAN( cDESC, "</UL>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("</TBODY>", cDESC) > 0
cDESC := STRTRAN( cDESC, "</TBODY>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("<TBODY>", cDESC) > 0
cDESC := STRTRAN( cDESC, "<TBODY>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT("</TABLE>", cDESC) > 0
cDESC := STRTRAN( cDESC, "</TABLE>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 30
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
// cDESC was upper-cased above, so the search literal must be upper case too
IF AT("<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=1>", cDESC) > 0
cDESC := STRTRAN( cDESC, "<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=1>", space(0) )
ENDIF
nLOOP++
IF nLOOP > 2
EXIT
ENDIF
ENDDO
MSGINFO( CDESC )
IF AT( (chr(13)+chr(10)), cDESC) > 0
MSGINFO( "FOUND CR LINE FEED")
ELSE
MSGINFO( "NOT FOUND CR LINE FEED") // can not be found
ENDIF
IF AT( (chr(13)), cDESC) > 0
MSGINFO( "FOUND CR FEED")
ELSE
MSGINFO( "NOT FOUND CR FEED") // cannot be found
ENDIF
IF AT( space(2), cDESC) > 0
MSGINFO( "FOUND SPACE(2)")
ELSE
MSGINFO( "NOT FOUND SPACE(2)") // can not be found
ENDIF
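// Possible further checks (assumptions, not confirmed for this page):
// the line breaks may be bare line feeds, and the gaps may be tabs
// or &NBSP; entities rather than ordinary blanks.
IF AT( chr(10), cDESC ) > 0
MSGINFO( "FOUND LINE FEED" )
ENDIF
IF AT( chr(9), cDESC ) > 0
MSGINFO( "FOUND TAB" )
ENDIF
IF AT( "&NBSP;", cDESC ) > 0
MSGINFO( "FOUND &NBSP;" )
ENDIF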
nLOOP := 0
DO WHILE .T.
IF AT( (chr(13)+chr(10)), cDESC) > 0
cDESC := STRTRAN( cDESC, (chr(13)+chr(10)), space(0) )
ENDIF
nLOOP++
IF nLOOP > 500
EXIT
ENDIF
ENDDO
nLOOP := 0
DO WHILE .T.
IF AT( " ", cDESC) > 0
cDESC := STRTRAN( cDESC, " ", space(0) )
ENDIF
nLOOP++
IF nLOOP > 5000
EXIT
ENDIF
ENDDO
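One possible shortcut, offered as a sketch rather than a tested fix: STRTRAN() already replaces every occurrence in a single call, so the counted loops above are not strictly needed, and the tags can be dropped generically by skipping everything between "<" and ">". The function name HtmlToText() and its variables are hypothetical and not part of the code above. The sketch assumes that "readable text" means words separated by single blanks, that no attribute value contains a ">" character, and that the gaps are made of CR, LF, TAB, or &nbsp; entities.

```harbour
// Hypothetical helper: strips anything between "<" and ">" and
// collapses runs of whitespace to a single blank.
FUNCTION HtmlToText( cHtml )

   LOCAL cText := ""
   LOCAL nOpen, nClose

   // Copy the text found between tags, dropping the tags themselves
   DO WHILE ( nOpen := At( "<", cHtml ) ) > 0
      cText += Left( cHtml, nOpen - 1 ) + Space( 1 )
      cHtml := SubStr( cHtml, nOpen + 1 )
      nClose := At( ">", cHtml )
      IF nClose == 0
         cHtml := ""      // unterminated tag: discard the remainder
         EXIT
      ENDIF
      cHtml := SubStr( cHtml, nClose + 1 )
   ENDDO
   cText += cHtml

   // Treat CR, LF, TAB and the &nbsp; entity as ordinary blanks
   cText := StrTran( cText, Chr( 13 ), Space( 1 ) )
   cText := StrTran( cText, Chr( 10 ), Space( 1 ) )
   cText := StrTran( cText, Chr( 9 ), Space( 1 ) )
   cText := StrTran( cText, "&nbsp;", Space( 1 ) )
   cText := StrTran( cText, "&NBSP;", Space( 1 ) )

   // One pass only halves a long run of blanks, so repeat until
   // no double blank is left
   DO WHILE At( Space( 2 ), cText ) > 0
      cText := StrTran( cText, Space( 2 ), Space( 1 ) )
   ENDDO

RETURN AllTrim( cText )
```

It could be called as cDESC := HtmlToText( IE:document:documentElement:outerHTML ). Upper-casing the page first is not required, because no tag names are compared.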
Code:
<TD CLASS="STD">DESCRIPTION:</TD>
<TD CLASS="STD">SPINDLE ASSEMBLY - HEAVY DUTY<BR>AYP 130794</TD>
</TR>
<TR>
<TD CLASS="STD">PACK SIZE:</TD>
<TD CLASS="STD">1</TD>
</TR>
<TR>
<TD CLASS="STD">LIST PRICE:</TD>
<TD CLASS="STD">$48.96</TD>
</TR>
<TR>
<TD CLASS="STD">REPLACES (OEM):</TD>
<TD CLASS="STD">
AYP 130794<BR>
HUSQVARNA 532 13 07-94<BR>
</TD>
</TR>
<TR>
<TD CLASS="STD">FITS MODELS:</TD>
<TD CLASS="STD"><B>AYP</B> 36", 38" AND 42" CUT VENTILATED DECKS USING STAR SHAPED CENTER HOLE BLADES<BR><B>HUSQVARNA</B> 36", 38" AND 42" CUT VENTILATED DECKS USING STAR SHAPED CENTER HOLE BLADES</TD>
</TR>
<TR>
<TD CLASS="STD">
SPECS:
</TD>
<TD>
<TABLE BORDER="0" CELLSPACING="0" CELLPADDING="1">
<TBODY><TR>
<TD>
<UL STYLE="MARGIN-LEFT: 20PX;">
<LI>HEIGHT:7" </LI>
<LI>HEAVY DUTY VERSION OF OUR 285-456</LI>
<LI>INCLUDES PULLEY NUT, BLADE BOLT, WASHER AND SPACER</LI>
<LI>NO THREADS, SELF TAPPING</LI>
<LI>USES 275-280 SPINDLE PULLEY FOR 38" CUT DECKS</LI>
<LI>USES 275-284 SPINDLE PULLEY FOR 42" CUT DECKS</LI>
</UL>
</TD>
</TR>
</TBODY></TABLE>
</TD>
</TR>
<TR>
<TD CLASS="STD">