Support compiled XPath expressions #3380

Draft
wants to merge 2 commits into
base: main

flavorjones
Member

What problem is this PR intended to solve?

There has been some discussion, summarized at #3266, about exposing libxml2's support for compiled XPath expressions. The idea is that, if you have a complex expression that you use a lot and you don't want to pay the cost of parsing/compiling it multiple times, then you can compile it once and presumably your document search will be faster.

This PR implements a new T_DATA class, XML::XPath::Expression, which stores the result of compiling an XPath expression via xmlXPathCompile. The XPathContext class knows how to accept either a String or an Expression.

However, I'm not seeing noticeable improvements in speed, though my benchmark may not capture the benefits.

I'm posting this as a draft in case someone wants to write me a benchmark that shows compiled XPath expressions are compellingly faster than just using Strings. Right now, based on what I'm seeing, I'm not at all sure the complexity is worth the benefit.

Have you included adequate test coverage?

Yes.

Does this change affect the behavior of either the C or the Java implementations?

This optimization is available on CRuby only. The idea, though, is that the shorthand methods Nokogiri::XML::XPath.expression and Nokogiri::CSS.selector will be no-ops on JRuby (returning the string argument), so code that uses Expressions can be portable across both implementations.
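The portability idea can be sketched in plain Ruby. The helper below is hypothetical and not part of Nokogiri or this PR; the name `portable_xpath` is made up for illustration. It assumes that `Nokogiri::XML::XPath::Expression` exists only where this branch is loaded, and it falls back to returning the string everywhere else, which mirrors the intended JRuby no-op behavior.

```ruby
# Hypothetical helper (not part of Nokogiri): return a compiled expression when
# the Expression class from this PR is loaded, otherwise return the string as-is.
def portable_xpath(expression_str)
  if defined?(Nokogiri::XML::XPath::Expression)
    # Only reached on a build that includes this PR's Expression class.
    Nokogiri::XML::XPath::Expression.new(expression_str)
  else
    # Mirrors the proposed JRuby behavior: the "compiled" form is just the string.
    expression_str
  end
end

query = portable_xpath("//@href")
puts query.class
```

Either return value could then be passed to `xpath`, since the PR description says XPathContext accepts a String or an Expression.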

@flavorjones
Member Author

flavorjones commented Dec 21, 2024

An example benchmark script:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "benchmark-ips"
end

doc_large = Nokogiri::HTML5.parse(File.read(File.join(__dir__, "../test/files/tlm.html")))
doc_small = Nokogiri::HTML5.parse(File.read(File.join(__dir__, "../test/files/noencoding.html")))

Benchmark.ips do |x|
  x.warmup = 0
  expression_str = "//p[nokogiri-builtin:css-class(@class,'br0') and count(preceding-sibling::*)=0]"
  expression_comp = Nokogiri::XML::XPath::Expression.new(expression_str)

  x.report("small: compiled") do
    doc_small.xpath(expression_comp).length == 0 or raise("nope")
  end

  x.report("small: string") do
    doc_small.xpath(expression_str).length == 0 or raise("nope")
  end

  x.compare!
end

outputs:

Calculating -------------------------------------
     small: compiled     56.468k (±12.3%) i/s   (17.71 μs/i) -    269.438k in   4.948751s
       small: string     49.666k (±14.9%) i/s   (20.13 μs/i) -    236.924k in   4.955582s

Comparison:
     small: compiled:    56468.3 i/s
       small: string:    49665.7 i/s - same-ish: difference falls within error
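To see why the comparison reports "same-ish", the two results above can be turned into intervals and checked for overlap. This is a minimal sketch that assumes the ± percentage is a symmetric error band around the mean; benchmark-ips's own comparison logic may differ in detail. The `Result` struct is a made-up name for illustration.

```ruby
# Figures copied from the benchmark output above: label, iterations/second,
# and the reported error as a percentage of the mean.
Result = Struct.new(:label, :ips, :error_pct) do
  # Lower and upper bounds of the error band (assumed symmetric).
  def low
    ips * (1 - error_pct / 100.0)
  end

  def high
    ips * (1 + error_pct / 100.0)
  end

  # Two results are indistinguishable when their bands intersect.
  def overlaps?(other)
    high > other.low && low < other.high
  end
end

compiled = Result.new("small: compiled", 56_468.3, 12.3)
string   = Result.new("small: string",   49_665.7, 14.9)

[compiled, string].each do |r|
  puts format("%-16s %.1f .. %.1f i/s", r.label, r.low, r.high)
end
puts "intervals overlap: #{compiled.overlaps?(string)}"
```

With error bands this wide, the roughly 14% gap between the two means is not enough to separate the runs.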

flavorjones force-pushed the flavorjones-compiled-xpath-queries branch from 930e231 to bda0ec6 on January 3, 2025 21:02
@flavorjones
Member Author

Updated benchmark script, using one extremely small expression as advised by Nick in #3378:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "benchmark-ips"
end

doc = Nokogiri::HTML5.parse(File.read(File.join(__dir__, "../test/files/tlm.html")))

expressions = [
  ["//@href", 203],
  ["//*[nokogiri-builtin:css-class(@class,'br0') and count(preceding-sibling::*)=0]", 8],
]

expressions.each do |expression_str, count|
  Benchmark.ips do |x|
    x.warmup = 0
    expression_comp = Nokogiri::XML::XPath::Expression.new(expression_str)

    x.report("string") do
      doc.xpath(expression_str).length == count or raise("nope #{doc.xpath(expression_str).length}")
    end

    x.report("compiled") do
      doc.xpath(expression_comp).length == count or raise("nope #{doc.xpath(expression_comp).length}")
    end

    x.compare!
  end
end

Result:

Calculating -------------------------------------
              string     18.797k (± 5.6%) i/s   (53.20 μs/i) -     91.300k in   4.978472s
            compiled     18.887k (± 5.6%) i/s   (52.95 μs/i) -     91.721k in   4.978564s

Comparison:
            compiled:    18887.3 i/s
              string:    18797.1 i/s - same-ish: difference falls within error

Calculating -------------------------------------
              string      1.707k (± 3.1%) i/s  (585.67 μs/i) -      8.518k in   4.998124s
            compiled      1.713k (± 3.9%) i/s  (583.92 μs/i) -      8.536k in   4.997768s

Comparison:
            compiled:     1712.6 i/s
              string:     1707.5 i/s - same-ish: difference falls within error

So I'm still not seeing any measurable benefit from compiled expressions. I'll leave this PR open and invite anyone to write a benchmark (macro or micro) demonstrating that they're faster in a real-world scenario.
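To put the results above in perspective, the gap between the two variants can be expressed as a share of per-call time. This is only an illustrative calculation: it assumes the whole gap is due to skipping compilation, and the measured gap is within error, so these are not real measurements of compile cost. `compile_share_pct` is a hypothetical helper name.

```ruby
# Share of per-call time saved by the compiled variant, as a percentage.
# Per-call time is 1/ips, so (1/s - 1/c) / (1/s) simplifies to 1 - s/c.
def compile_share_pct(string_ips, compiled_ips)
  100.0 * (1 - string_ips / compiled_ips)
end

# [label, string i/s, compiled i/s], copied from the comparison output above.
runs = [
  ["//@href",              18_797.1, 18_887.3],
  ["css-class expression",  1_707.5,  1_712.6],
]

runs.each do |label, string_ips, compiled_ips|
  per_call_us = 1_000_000.0 / string_ips
  share = compile_share_pct(string_ips, compiled_ips)
  puts format("%-22s %.2f us per call, compiled saves about %.2f%%", label, per_call_us, share)
end
```

Under that assumption, the saving is well under 1% of each call on this document, far below the 3% to 6% error reported for each run. That fits the expectation that evaluation over a large document dominates the cost.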

flavorjones force-pushed the flavorjones-compiled-xpath-queries branch from bda0ec6 to 0c1362f on January 3, 2025 21:17