annotate emeraldtree/html.py @ 68:d7e235461c97

HTML Support - Change imports to internal names
author Bastian Blank <bblank@thinkmo.de>
date Sun, 30 May 2010 17:39:14 +0200
parents 672b181a8ce0
children 54c60c7e7e35
rev   line source
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
1 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
2 # The ElementTree toolkit is
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
3 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
4 # Copyright (c) 1999-2007 by Fredrik Lundh
68
d7e235461c97 HTML Support - Change imports to internal names
Bastian Blank <bblank@thinkmo.de>
parents: 52
diff changeset
5 # 2008-2010 Bastian Blank <bblank@thinkmo.de>
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
6 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
7 # By obtaining, using, and/or copying this software and/or its
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
8 # associated documentation, you agree that you have read, understood,
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
9 # and will comply with the following terms and conditions:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
10 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
11 # Permission to use, copy, modify, and distribute this software and
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
12 # its associated documentation for any purpose and without fee is
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
13 # hereby granted, provided that the above copyright notice appears in
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
14 # all copies, and that both that copyright notice and this permission
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
15 # notice appear in supporting documentation, and that the name of
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
16 # Secret Labs AB or the author not be used in advertising or publicity
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
17 # pertaining to distribution of the software without specific, written
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
18 # prior permission.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
19 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
20 # SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
21 # TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANT-
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
22 # ABILITY AND FITNESS. IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
23 # BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
24 # DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
25 # WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
26 # ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
27 # OF THIS SOFTWARE.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
28 # --------------------------------------------------------------------
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
29
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
30 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
31 # Tools to build element trees from HTML files.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
32 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
33
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
34 import htmlentitydefs
47
e647f30cc08e remove backwards compat code for python < 2.4
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1
diff changeset
35 import re
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
36 import mimetools, StringIO
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
37 from HTMLParser import HTMLParser as HTMLParserBase
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
38
68
d7e235461c97 HTML Support - Change imports to internal names
Bastian Blank <bblank@thinkmo.de>
parents: 52
diff changeset
39 from . import tree
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
40
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
41
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
42 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
43 # ElementTree builder for HTML source code. This builder converts an
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
44 # HTML document or fragment to an ElementTree.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
45 # <p>
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
46 # The parser is relatively picky, and requires balanced tags for most
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
47 # elements. However, elements belonging to the following group are
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
48 # automatically closed: P, LI, TR, TH, and TD. In addition, the
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
49 # parser automatically inserts end tags immediately after the start
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
50 # tag, and ignores any end tags for the following group: IMG, HR,
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
51 # META, and LINK.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
52 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
53 # @keyparam builder Optional builder object. If omitted, the parser
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
54 # uses the standard <b>elementtree</b> builder.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
55 # @keyparam encoding Optional character encoding, if known. If omitted,
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
56 # the parser looks for META tags inside the document. If no tags
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
57 # are found, the parser defaults to ISO-8859-1. Note that if your
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
58 # document uses a non-ASCII compatible encoding, you must decode
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
59 # the document before parsing.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
60 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
61 # @see elementtree.ElementTree
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
62
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
63 class HTMLParser(HTMLParserBase):
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
64 AUTOCLOSE = "p", "li", "tr", "th", "td", "head", "body"
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
65 IGNOREEND = "img", "hr", "meta", "link", "br", "input", "col"
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
66
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
67 namespace = "http://www.w3.org/1999/xhtml"
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
68
52
672b181a8ce0 HTML Support - Change module name, cleanup contents
Bastian Blank <bblank@thinkmo.de>
parents: 51
diff changeset
69 def __init__(self, encoding=None, builder=None):
68
d7e235461c97 HTML Support - Change imports to internal names
Bastian Blank <bblank@thinkmo.de>
parents: 52
diff changeset
70 HTMLParserBase.__init__(self)
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
71 self.__stack = []
68
d7e235461c97 HTML Support - Change imports to internal names
Bastian Blank <bblank@thinkmo.de>
parents: 52
diff changeset
72 self.__builder = builder or tree.TreeBuilder()
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
73 self.encoding = encoding or "iso-8859-1"
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
74
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
75 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
76 # Flushes parser buffers, and return the root element.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
77 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
78 # @return An Element instance.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
79
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
80 def close(self):
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
81 HTMLParserBase.close(self)
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
82 return self.__builder.close()
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
83
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
84 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
85 # (Internal) Handles start tags.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
86
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
87 def handle_starttag(self, tag, attrs):
68
d7e235461c97 HTML Support - Change imports to internal names
Bastian Blank <bblank@thinkmo.de>
parents: 52
diff changeset
88 tag = tree.QName(tag.lower(), self.namespace)
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
89 if tag.name == "meta":
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
90 # look for encoding directives
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
91 http_equiv = content = None
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
92 for k, v in attrs:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
93 if k == "http-equiv":
47
e647f30cc08e remove backwards compat code for python < 2.4
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1
diff changeset
94 http_equiv = v.lower()
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
95 elif k == "content":
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
96 content = v
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
97 if http_equiv == "content-type" and content:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
98 # use mimetools to parse the http header
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
99 header = mimetools.Message(
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
100 StringIO.StringIO("%s: %s\n\n" % (http_equiv, content))
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
101 )
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
102 encoding = header.getparam("charset")
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
103 if encoding:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
104 self.encoding = encoding
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
105 if tag.name in self.AUTOCLOSE:
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
106 if self.__stack and self.__stack[-1] == tag:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
107 self.handle_endtag(tag)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
108 self.__stack.append(tag)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
109 attrib = {}
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
110 if attrs:
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
111 for key, value in attrs:
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
112 # Handle short attributes
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
113 if value is None:
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
114 value = key
68
d7e235461c97 HTML Support - Change imports to internal names
Bastian Blank <bblank@thinkmo.de>
parents: 52
diff changeset
115 key = tree.QName(key.lower(), self.namespace)
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
116 attrib[key] = value
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
117 self.__builder.start(tag, attrib)
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
118 if tag.name in self.IGNOREEND:
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
119 self.__stack.pop()
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
120 self.__builder.end(tag)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
121
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
122 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
123 # (Internal) Handles end tags.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
124
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
125 def handle_endtag(self, tag):
68
d7e235461c97 HTML Support - Change imports to internal names
Bastian Blank <bblank@thinkmo.de>
parents: 52
diff changeset
126 if not isinstance(tag, tree.QName):
d7e235461c97 HTML Support - Change imports to internal names
Bastian Blank <bblank@thinkmo.de>
parents: 52
diff changeset
127 tag = tree.QName(tag.lower(), self.namespace)
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
128 if tag.name in self.IGNOREEND:
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
129 return
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
130 lasttag = self.__stack.pop()
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
131 if tag != lasttag and lasttag.name in self.AUTOCLOSE:
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
132 self.handle_endtag(lasttag)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
133 self.__builder.end(tag)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
134
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
135 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
136 # (Internal) Handles character references.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
137
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
138 def handle_charref(self, char):
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
139 if char[:1] == "x":
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
140 char = int(char[1:], 16)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
141 else:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
142 char = int(char)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
143 if 0 <= char < 128:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
144 self.__builder.data(chr(char))
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
145 else:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
146 self.__builder.data(unichr(char))
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
147
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
148 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
149 # (Internal) Handles entity references.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
150
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
151 def handle_entityref(self, name):
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
152 entity = htmlentitydefs.entitydefs.get(name)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
153 if entity:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
154 if len(entity) == 1:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
155 entity = ord(entity)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
156 else:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
157 entity = int(entity[2:-1])
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
158 if 0 <= entity < 128:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
159 self.__builder.data(chr(entity))
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
160 else:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
161 self.__builder.data(unichr(entity))
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
162 else:
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
163 self.unknown_entityref(name)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
164
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
165 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
166 # (Internal) Handles character data.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
167
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
168 def handle_data(self, data):
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
169 if isinstance(data, str):
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
170 # convert to unicode, but only if necessary
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
171 data = unicode(data, self.encoding, "ignore")
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
172 self.__builder.data(data)
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
173
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
174 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
175 # (Hook) Handles unknown entity references. The default action
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
176 # is to ignore unknown entities.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
177
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
178 def unknown_entityref(self, name):
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
179 pass # ignore by default; override if necessary
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
180
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
181 ##
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
182 # Parse an HTML document or document fragment.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
183 #
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
184 # @param source A filename or file object containing HTML data.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
185 # @param encoding Optional character encoding, if known. If omitted,
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
186 # the parser looks for META tags inside the document. If no tags
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
187 # are found, the parser defaults to ISO-8859-1.
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
188 # @return An ElementTree instance
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
189
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
190 def parse(source, encoding=None):
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
191 return ElementTree.parse(source, HTMLTreeBuilder(encoding=encoding))
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
192
51
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
193 def HTML(text):
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
194 parser = HTMLParser()
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
195 parser.feed(text)
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
196 return parser.close()
847897e5fab8 HTMLTreeBuilder - Make namespace aware, add testcases
Bastian Blank <bblank@thinkmo.de>
parents: 47
diff changeset
197
0
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
198 if __name__ == "__main__":
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
199 import sys
5169fce2d144 Import ElementTree (1.3a3-20070912-preview).
Bastian Blank <bblank@thinkmo.de>
parents:
diff changeset
200 ElementTree.dump(parse(open(sys.argv[1])))