Monday, January 12, 2009

Parsing an OpenXML document (Word 2007)

Back to the geeky ...

I was attempting to parse a Word 2007 document for mail merge purposes and found that libxml was the fastest way to do it with Ruby. The document's XML uses namespaces heavily, but it's not readily apparent how to pass those namespaces to libxml when searching the document with XPath.

Here is a sample section and the code I found after an extensive Google search:

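<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Text being sought</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>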

And here is the code used to find the paragraph nodes ("w:p"):

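# doc is a LibXML::XML::Document parsed from the Word document's XML
# Map the prefix used in the XPath to its namespace URI, as "prefix:uri"
ns = "w:http://schemas.openxmlformats.org/wordprocessingml/2006/main"
doc.find("//w:p", ns).each do |para|
  # do something special with the paragraph node here
end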
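
If you want to try the prefix-to-URI idea without installing the libxml gem, here is a minimal sketch of the same kind of lookup. Note that it swaps in REXML, which comes with a standard Ruby install, in place of libxml, so the call shapes differ from the snippet above (REXML takes the mapping as a hash rather than a "prefix:uri" string). The names SAMPLE_XML and paragraph_texts are made up for illustration, and the inline sample stands in for a real document's XML.

```ruby
require "rexml/document"

# Namespace URI for WordprocessingML (same URI as in the snippet above).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical inline sample standing in for a real document's XML.
SAMPLE_XML = <<~XML
  <w:document xmlns:w="#{W_NS}">
    <w:body>
      <w:p><w:r><w:t>Text being sought</w:t></w:r></w:p>
      <w:p><w:r><w:t>Another paragraph</w:t></w:r></w:p>
    </w:body>
  </w:document>
XML

# Returns an array with the joined w:t text of each w:p paragraph.
def paragraph_texts(xml)
  doc = REXML::Document.new(xml)
  # REXML takes the prefix-to-URI mapping as a hash.
  namespaces = { "w" => W_NS }
  REXML::XPath.match(doc, "//w:p", namespaces).map do |para|
    # Collect the text runs inside this paragraph only.
    REXML::XPath.match(para, ".//w:t", namespaces).map(&:text).join
  end
end

puts paragraph_texts(SAMPLE_XML)
```

Either way, the important part is the namespace URI: the XPath prefix only resolves because it is mapped to the same URI the document declares.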

It took me a while to track this down so I thought I would share here in the hopes of helping someone else.

