annotate emeraldtree/TidyTools.py @ 47:e647f30cc08e

remove backwards compat code for python < 2.4
author Thomas Waldmann <tw AT waldmann-edv DOT de>
date Sun, 03 Aug 2008 20:59:44 +0200
parents 7b33b90de8be
children
rev   line source
changeset 0:5169fce2d144 Import ElementTree (1.3a3-20070912-preview). Bastian Blank <bblank@thinkmo.de>
  0    1 #
  0    2 # ElementTree
  0    3 # $Id: TidyTools.py 3265 2007-09-06 20:42:00Z fredrik $
  0    4 #
  0    5 # tools to run the "tidy" command on an HTML or XHTML file, and return
  0    6 # the contents as an XHTML element tree.
  0    7 #
  0    8 # history:
  0    9 # 2002-10-19 fl added to ElementTree library; added getzonebody function
  0   10 #
  0   11 # Copyright (c) 1999-2004 by Fredrik Lundh. All rights reserved.
  0   12 #
  0   13 # fredrik@pythonware.com
  0   14 # http://www.pythonware.com
  0   15 #
  0   16
  0   17 ##
  0   18 # Tools to build element trees from HTML, using the external <b>tidy</b>
  0   19 # utility.
  0   20 ##
  0   21
 47   22 import glob, os, sys
  0   23
  0   24 from ElementTree import ElementTree, Element
  0   25
  0   26 NS_XHTML = "{http://www.w3.org/1999/xhtml}"
  0   27
  0   28 ##
  0   29 # Convert an HTML or HTML-like file to XHTML, using the <b>tidy</b>
  0   30 # command line utility.
  0   31 #
  0   32 # @param file Filename.
  0   33 # @param new_inline_tags An optional list of valid but non-standard
  0   34 # inline tags.
  0   35 # @return An element tree, or None if not successful.
  0   36
  0   37 def tidy(file, new_inline_tags=None):
  0   38
  0   39     command = ["tidy", "-qn", "-asxml"]
  0   40
  0   41     if new_inline_tags:
  0   42         command.append("--new-inline-tags")
 47   43         command.append(",".join(new_inline_tags))
  0   44
  0   45     # FIXME: support more tidy options!
  0   46
  0   47     # convert
  0   48     os.system(
 47   49         "%s %s >%s.out 2>%s.err" % (" ".join(command), file, file, file)
  0   50         )
  0   51     # check that the result is valid XML
  0   52     try:
  0   53         tree = ElementTree()
  0   54         tree.parse(file + ".out")
  0   55     except:
  0   56         print "*** %s:%s" % sys.exc_info()[:2]
  0   57         print ("*** %s is not valid XML "
  0   58                "(check %s.err for info)" % (file, file))
  0   59         tree = None
  0   60     else:
  0   61         if os.path.isfile(file + ".out"):
  0   62             os.remove(file + ".out")
  0   63         if os.path.isfile(file + ".err"):
  0   64             os.remove(file + ".err")
  0   65
  0   66     return tree
  0   67
changeset 21:7b33b90de8be some minor coding style cleanups. Thomas Waldmann <tw AT waldmann-edv DOT de> (parents: 1)
  0   68 ##
  0   69 # Get document body from an HTML or HTML-like file. This function
  0   70 # uses the <b>tidy</b> function to convert HTML to XHTML, and cleans
  0   71 # up the resulting XML tree.
  0   72 #
  0   73 # @param file Filename.
  0   74 # @return A <b>body</b> element, or None if not successful.
  0   75
  0   76 def getbody(file, **options):
  0   77     # get clean body from text file
  0   78
  0   79     # get xhtml tree
  0   80     try:
 21   81         tree = apply(tidy, (file, ), options)
  0   82         if tree is None:
  0   83             return
  0   84     except IOError, v:
  0   85         print "***", v
  0   86         return None
  0   87
  0   88     NS = NS_XHTML
  0   89
  0   90     # remove namespace uris
  0   91     for node in tree.getiterator():
  0   92         if node.tag.startswith(NS):
  0   93             node.tag = node.tag[len(NS):]
  0   94
  0   95     body = tree.getroot().find("body")
  0   96
  0   97     return body
  0   98
  0   99 ##
  0  100 # Same as <b>getbody</b>, but turns plain text at the start of the
  0  101 # document into an H1 tag. This function can be used to parse zone
  0  102 # documents.
  0  103 #
  0  104 # @param file Filename.
  0  105 # @return A <b>body</b> element, or None if not successful.
  0  106
  0  107 def getzonebody(file, **options):
  0  108
  0  109     body = getbody(file, **options)
  0  110     if body is None:
  0  111         return
  0  112
 47  113     if body.text and body.text.strip():
  0  114         title = Element("h1")
 47  115         title.text = body.text.strip()
  0  116         title.tail = "\n\n"
  0  117         body.insert(0, title)
  0  118
  0  119     body.text = None
  0  120
  0  121     return body
  0  122
  0  123 if __name__ == "__main__":
  0  124
  0  125     import sys
  0  126     for arg in sys.argv[1:]:
  0  127         for file in glob.glob(arg):
  0  128             print file, "...", tidy(file)
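
The listing above is Python 2 code that shells out through os.system and writes `.out`/`.err` files next to the input. As a rough illustration of the same three steps (run tidy, strip the XHTML namespace, promote leading body text to an H1), here is a minimal Python 3 sketch. It is not part of the annotated file: the helper names `build_tidy_command`, `run_tidy`, `strip_xhtml_namespace`, and `promote_leading_text` are hypothetical, it assumes the standard-library `xml.etree.ElementTree` in place of the listing's `ElementTree` import, and it assumes a `tidy` executable on the PATH that accepts the same flags the listing passes.

```python
import subprocess
import xml.etree.ElementTree as ET

NS_XHTML = "{http://www.w3.org/1999/xhtml}"


def build_tidy_command(filename, new_inline_tags=None):
    """Build the tidy argument list, using the same flags as the listing."""
    command = ["tidy", "-qn", "-asxml"]
    if new_inline_tags:
        command += ["--new-inline-tags", ",".join(new_inline_tags)]
    command.append(filename)
    return command


def run_tidy(filename, new_inline_tags=None):
    """Run tidy without a shell and parse its stdout.

    Returns an ElementTree, or None if tidy could not be started or its
    output is not well-formed XML. (Hypothetical helper; unlike the
    listing, no .out/.err files are written.)
    """
    try:
        result = subprocess.run(
            build_tidy_command(filename, new_inline_tags),
            capture_output=True,  # requires Python 3.7+
        )
        return ET.ElementTree(ET.fromstring(result.stdout))
    except (OSError, ET.ParseError):
        return None


def strip_xhtml_namespace(root):
    """Remove the XHTML namespace prefix from every element tag, in place."""
    for node in root.iter():
        # Comments and processing instructions have non-string tags.
        if isinstance(node.tag, str) and node.tag.startswith(NS_XHTML):
            node.tag = node.tag[len(NS_XHTML):]
    return root


def promote_leading_text(body):
    """Turn plain text at the start of body into a leading h1 element."""
    if body.text and body.text.strip():
        title = ET.Element("h1")
        title.text = body.text.strip()
        title.tail = "\n\n"
        body.insert(0, title)
    body.text = None
    return body
```

Passing the command as a list means file names containing spaces or shell metacharacters are handed to tidy as-is, which the string-formatted os.system call in the listing does not guarantee. The two tree helpers can be tried without tidy installed by feeding them a tree built with `ET.fromstring`.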