= Hpricot, Read Any HTML
Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
accommodating (like Tanaka Akira's HTree) and to have a very helpful library
(like some JavaScript libs -- jQuery, Prototype -- give you). The XPath and CSS
parser, in fact, is based on John Resig's jQuery.
Also, Hpricot can be handy for reading broken XML files, since many of the same
techniques can be used. If a quote is missing, Hpricot tries to figure it out.
If tags overlap, Hpricot works on sorting them out. You know, that sort of
thing.
*Please read this entire document* before making assumptions about how this
software works.
== An Overview
Let's clear up what Hpricot is.
# Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
# While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
pays a small penalty in order to get that right. So that's slightly more important
to me than speed.
# *If you can see it in Firefox, then Hpricot should parse it.* That's
how it should be! Let me know the minute it's otherwise.
# Primarily, Hpricot is used for reading HTML and tries to sort out troubled
HTML by having some idea of what good HTML is. Some people still like to use
Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!
== The Hpricot Kingdom
First, here are all the links you need to know:
* http://code.whytheluckystiff.net/hpricot is the Hpricot wiki and bug tracker.
Go there for news and recipes and patches. It's the center of activity.
* http://code.whytheluckystiff.net/svn/hpricot/trunk is the main Subversion
repository for Hpricot. You can get the latest code there.
* http://code.whytheluckystiff.net/doc/hpricot is the home for the latest copy of
this reference.
* See COPYING for the terms of this software. (Spoiler: it's absolutely free.)
If you have any trouble, don't hesitate to contact the author. As always, I'm
not going to say "Use at your own risk" because I don't want this library to be
risky. If you trip on something, I'll share the liability by repairing things
as quickly as I can. Your responsibility is to report the inadequacies.
== Installing Hpricot
You may get the latest stable version from RubyForge. Win32 binaries and source
gems are available.
  $ gem install hpricot
As Hpricot is still under active development, you can also try the most recent
candidate build here:
  $ gem install hpricot --source http://code.whytheluckystiff.net
The development gem is usually in pretty good shape actually. You can also
get the bleeding edge code or plain Ruby tarballs on the wiki.
== An Hpricot Showcase
We're going to run through a big pile of examples to get you jump-started.
Many of these examples are also found at
http://code.whytheluckystiff.net/hpricot/wiki/HpricotBasics, in case you
want to add some of your own.
=== Loading Hpricot Itself
You have probably got the gem, right? To load Hpricot:
  require 'rubygems'
  require 'hpricot'
If you've installed the plain source distribution, go ahead and just:
  require 'hpricot'
=== Load an HTML Page
The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
contents into a document object.
  doc = Hpricot("<p>A simple <b>test</b> string.</p>")
To load from a file, just get the stream open:
  doc = open("index.html") { |f| Hpricot(f) }
To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:
  require 'open-uri'
  doc = open("http://qwantz.com/") { |f| Hpricot(f) }
Hpricot uses an internal buffer to parse the file, so the IO will stream
properly and large documents won't be loaded into memory all at once. However,
the parsed document object will be present in memory, in its entirety.
=== Search for Elements
Use <tt>doc.search</tt>:
  doc.search("//p[@class='posted']")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>
<tt>doc.search</tt> can take an XPath or CSS expression. In the above example,
all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
attribute of <tt>"posted"</tt>.
A shortcut is to use the <tt>/</tt> (division) operator with a CSS expression:
  (doc/"p.posted")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>
=== Finding Just One Element
If you're looking for a single element, the <tt>at</tt> method will return the
first element matched by the expression. In this case, you'll get back the
element itself rather than the <tt>Hpricot::Elements</tt> array.
  doc.at("body")['onload']
The above code will find the body tag and give you back the <tt>onload</tt>
attribute. This is the most common reason to use the element directly: when
reading and writing HTML attributes.
=== Fetching the Contents of an Element
Just as with browser scripting, the <tt>inner_html</tt> property can be used to
get the inner contents of an element.
  (doc/"#elementID").inner_html
  #=> "..<b>contents</b>.."
If your expression matches more than one element, you'll get back the contents
of <em>all the matched elements</em>. So you may want to use <tt>first</tt> to be
sure you get back only one.
  (doc/"#elementID").first.inner_html
  #=> "..<b>contents</b>.."
=== Fetching the HTML for an Element
If you want the HTML for the whole element (not just the contents), use
<tt>to_html</tt>:
  (doc/"#elementID").to_html
  #=> "<div id='elementID'>...</div>"
=== Looping
All searches return a set of <tt>Hpricot::Elements</tt>. Go ahead and loop
through them like you would an array.
  (doc/"p/a/img").each do |img|
    puts img.attributes['class']
  end
=== Continuing Searches
Searches can be continued from a collection of elements, in order to search deeper.
  # find all paragraphs.
  elements = doc.search("/html/body//p")
  # continue the search by finding any images within those paragraphs.
  (elements/"img")
  #=> #<Hpricot::Elements[{img ...}, {img ...}]>
Searches can also be continued by searching within container elements.
  # find all images within paragraphs.
  doc.search("/html/body//p").each do |para|
    puts "== Found a paragraph =="
    pp para
    imgs = para.search("img")
    if imgs.any?
      puts "== Found #{imgs.length} images inside =="
    end
  end
Of course, the most succinct ways to do the above are using CSS or XPath.
  # the xpath version
  (doc/"/html/body//p//img")
  # the css version
  (doc/"html > body > p img")
  # ..or symbols work, too!
  (doc/:html/:body/:p/:img)
=== Looping Edits
You may certainly edit objects from within your search loops. Then, when you
spit out the HTML, the altered elements will show.
  (doc/"span.entryPermalink").each do |span|
    span.attributes['class'] = 'newLinks'
  end
  puts doc
This changes all <tt>span.entryPermalink</tt> elements to
<tt>span.newLinks</tt>. Keep in mind that there are often more convenient ways
of doing this, such as the <tt>set</tt> method:
  (doc/"span.entryPermalink").set(:class => 'newLinks')
=== Figuring Out Paths
Every element can tell you its unique path (either XPath or CSS) to get to the
element from the root tag.
The <tt>css_path</tt> method:
  doc.at("div > div:nth(1)").css_path
  #=> "div > div:nth(1)"
  doc.at("#header").css_path
  #=> "#header"
Or, the <tt>xpath</tt> method:
  doc.at("div > div:nth(1)").xpath
  #=> "/div/div:eq(1)"
  doc.at("#header").xpath
  #=> "//div[@id='header']"
== Hpricot Fixups
When loading HTML documents, you have a few settings that can make Hpricot more
or less intense about how it gets involved.
=== :fixup_tags
Really, there are so many ways to clean up HTML and your intentions may be to
keep the HTML as-is. So Hpricot's default behavior is to keep things flexible:
it makes sure to open and close all the tags, but ignores any validation problems.
As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
to shift the document's tags to meet XHTML 1.0 Strict.
  doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow
the rules a bit better. For example, if Hpricot finds a paragraph in a link, it's
going to move the paragraph below the link, or up and out of other elements
where paragraphs don't belong.
Again, this is with <tt>:fixup_tags</tt>: if an unknown element is found, it is
ignored.
=== :xhtml_strict
So, let's go beyond just trying to fix the hierarchy. The
<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
1.0 Strict document. Even at the cost of removing elements that get in the way.
  doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
What measures does <tt>:xhtml_strict</tt> take?
1. Shift elements into their proper containers just like :fixup_tags.
2. Remove unknown elements.
3. Remove unknown attributes.
4. Remove illegal content.
5. Alter the doctype to XHTML 1.0 Strict.
=== Hpricot.XML()
The last option is the <tt>:xml</tt> option, which makes some slight variations
on the standard mode. The main difference is that :xml mode won't try to output
tags which are friendlier for browsers. For example, if an opening and closing
<tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.
XML mode also doesn't downcase the tags and attributes for you. So pay attention
to case, friends.
The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:
  doc = open("http://redhanded.hobix.com/index.xml") do |f|
    Hpricot.XML(f)
  end
*Also, :fixup_tags is canceled out by the :xml option.* This is because
:fixup_tags makes assumptions based on how HTML is structured. Specifically, how
tags are defined in the XHTML 1.0 DTD.