Requirements for XML parsing in neon
------------------------------------

Before describing the interface given in neon for parsing XML, here
are the requirements which it must satisfy:

 1. to support using either libxml or expat as the underlying parser
 2. to allow "independent" sections to handle parsing one XML
    document
 3. to map element namespaces/names to an integer for easier
    comparison.

A description of requirement (2) is useful since it is the "hard"
requirement, and adds most of the complexity of the interface: WebDAV
PROPFIND responses are made up of a large boilerplate XML

   <multistatus><response><propstat><prop>...</prop></propstat> etc.

neon should handle the parsing of these standard elements, and expose
the meaning of the response using a convenient interface. But, within
the <prop> elements, there may also be fragments of XML: neon can
never know how to parse these, since they are property- and hence
application-specific. The simplest example of this is the
DAV:resourcetype property.

Hence requirement (2): two "independent" sections of code must be
able to handle the parsing of one XML document.

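Requirement (3) can be pictured as a table lookup. The following is a
minimal sketch and not neon's implementation: the elm_map structure,
the lookup_id function and the ID_* constants are hypothetical names
chosen for illustration. It maps a namespace/name pair to an integer,
so that later comparisons are a single integer test rather than two
string comparisons.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ids, for illustration only. */
enum { ID_UNKNOWN = 0, ID_MULTISTATUS, ID_RESPONSE, ID_PROP };

/* Hypothetical table entry: one namespace/name pair and its id. */
struct elm_map {
    const char *nspace;
    const char *name;
    int id;
};

static const struct elm_map known[] = {
    { "DAV:", "multistatus", ID_MULTISTATUS },
    { "DAV:", "response", ID_RESPONSE },
    { "DAV:", "prop", ID_PROP },
    { NULL, NULL, ID_UNKNOWN } /* terminator */
};

/* Return the id for the given namespace/name pair, or ID_UNKNOWN if
 * the pair is not in the table. */
static int lookup_id(const char *nspace, const char *name)
{
    const struct elm_map *e;

    for (e = known; e->nspace != NULL; e++) {
        if (strcmp(e->nspace, nspace) == 0 && strcmp(e->name, name) == 0)
            return e->id;
    }
    return ID_UNKNOWN;
}
```

With such a table, code which has already looked up an element can
test "id == ID_PROP" instead of comparing both strings each time.
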
Callback-based XML parsing
--------------------------

There are two ways of parsing XML documents commonly used:

 1. Build an in-memory tree of the document
 2. Use callbacks

Where practical, using callbacks is more efficient than building a
tree, since the whole document need not be held in memory, so this is
what neon uses. The standard interface for callback-based XML parsing
is called SAX, so understanding the SAX interface is useful for
understanding the XML parsing interface provided by neon.

The SAX interface works by registering callbacks which are called *as
the XML is parsed*. The most important callbacks are for 'start
element' and 'end element'. For instance, suppose the XML document
below is parsed by a SAX-like interface:

   <hello>
     <foobar></foobar>
   </hello>

Say we have registered callbacks "startelm" for 'start element' and
"endelm" for 'end element'. Simplified somewhat, the callbacks will
be called in this order, with these arguments:

 1. startelm("hello")
 2. startelm("foobar")
 3. endelm("foobar")
 4. endelm("hello")

See the expat 'xmlparse.h' header for a more complete definition of a
SAX-like interface.

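The callback order described above can be reproduced with a toy
scanner. This is only a sketch to show when the callbacks fire: it is
not expat, libxml or neon code, and toy_scan, start_cb, end_cb and the
trace helpers are hypothetical names. The scanner ignores attributes,
namespaces, character data and malformed input.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback types, modelled loosely on a SAX-like API. */
typedef void (*start_cb)(void *userdata, const char *name);
typedef void (*end_cb)(void *userdata, const char *name);

/* Toy scanner: finds <name> and </name> tags and fires the matching
 * callback as each tag is met.  NOT a real XML parser. */
static void toy_scan(const char *doc, start_cb startelm, end_cb endelm,
                     void *userdata)
{
    const char *p = doc;

    while ((p = strchr(p, '<')) != NULL) {
        const char *close = strchr(p, '>');
        char name[64];
        size_t len;
        int is_end;

        if (close == NULL)
            break;                  /* unterminated tag: stop */
        is_end = (p[1] == '/');
        p += is_end ? 2 : 1;        /* skip "<" or "</" */
        len = (size_t)(close - p);
        if (len >= sizeof name)
            len = sizeof name - 1;  /* truncate over-long names */
        memcpy(name, p, len);
        name[len] = '\0';

        if (is_end)
            endelm(userdata, name);
        else
            startelm(userdata, name);
        p = close + 1;
    }
}

/* Example userdata: a buffer which records each callback in order. */
struct trace { char buf[256]; };

static void trace_event(struct trace *t, const char *what, const char *name)
{
    size_t used = strlen(t->buf);
    snprintf(t->buf + used, sizeof t->buf - used, "%s(%s) ", what, name);
}

static void trace_start(void *userdata, const char *name)
{
    trace_event(userdata, "startelm", name);
}

static void trace_end(void *userdata, const char *name)
{
    trace_event(userdata, "endelm", name);
}
```

Calling toy_scan on the <hello> document with trace_start and
trace_end, and a zero-initialised struct trace as userdata, records
the four calls in the order listed above.
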
The hip_xml interface
---------------------

The hip_xml interface satisfies requirement (2) by introducing the
"handler" concept. A handler is made up of these things:

 - a set of XML elements
 - a callback to validate an element
 - a callback which is called when an element is opened
 - a callback which is called when an element is closed
 - (optionally, a callback which is called for CDATA)

Registering a handler essentially says:

 "If you encounter any of this set of elements, I want these
  callbacks to be called."

Handlers are kept in a STACK inside hip_xml. The first handler
registered becomes the BASE of the stack; subsequent handlers are
PUSHed on top.

During XML parsing, the handler which is used for each XML element is
recorded. When a new element is started, the search for a handler for
this element begins at the handler used for the parent element, and
carries on up the stack. For the root element, the search always
starts at the BASE of the stack.

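The search rule can be modelled in a few lines. This is a sketch of
the rule as described here, not the hip_xml source: struct handler,
claims and find_handler are hypothetical names, and elements are
identified by plain name only, to keep the model short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a handler: the element names it claims, plus
 * a link to the handler pushed after it (the next one up the stack). */
struct handler {
    const char *label;         /* for illustration only */
    const char *const *names;  /* NULL-terminated list of element names */
    struct handler *above;     /* next handler up the stack, or NULL */
};

/* Does handler h claim the element called 'name'? */
static int claims(const struct handler *h, const char *name)
{
    const char *const *n;

    for (n = h->names; *n != NULL; n++) {
        if (strcmp(*n, name) == 0)
            return 1;
    }
    return 0;
}

/* Search for a handler for 'name', starting at 'start' (the handler
 * used for the parent element, or the base for the root element) and
 * carrying on up the stack.  Returns NULL if no handler claims it. */
static struct handler *find_handler(struct handler *start, const char *name)
{
    struct handler *h;

    for (h = start; h != NULL; h = h->above) {
        if (claims(h, name))
            return h;
    }
    return NULL;
}
```

Under this reading of the rule, the search never moves down the
stack: once an element has been given to a handler pushed later, its
children are offered only to that handler and to those above it.
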
A user's guide to hip_xml
-------------------------

The first thing to do when using hip_xml is to determine the set of
XML elements you are going to be parsing. This can usually be done by
looking at the DTD provided for those documents. The DTD is also very
useful in writing the 'validate' callback function, since it can tell
you which parent/child pairs are valid, and which aren't.

In this example, we'll parse XML documents which look like:

   <T:list-of-things xmlns:T="http://things.org/">
     <T:a-thing>foo</T:a-thing>
     <T:a-thing>bar</T:a-thing>
   </T:list-of-things>

So, given the set of elements, declare the element ids and the
element array:

   #define ELM_listofthings (HIP_ELM_UNUSED)
   #define ELM_a_thing (HIP_ELM_UNUSED + 1)

   static const struct hip_xml_elm my_elms[] = {
       { "http://things.org/", "list-of-things", ELM_listofthings, 0 },
       { "http://things.org/", "a-thing", ELM_a_thing, HIP_XML_CDATA },
       { NULL } /* terminates the array */
   };

This declares that we know about two elements, list-of-things and
a-thing, and that the 'a-thing' element contains character data.

The definition of the validation callback is very simple:

   static int validate(hip_xml_elmid parent, hip_xml_elmid child)
   {
       /* Only allow 'list-of-things' as the root element, and
        * 'a-thing' only as a child of 'list-of-things'. */
       if ((parent == HIP_ELM_root && child == ELM_listofthings) ||
           (parent == ELM_listofthings && child == ELM_a_thing)) {
           return HIP_XML_VALID;
       } else {
           return HIP_XML_INVALID;
       }
   }

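For a larger DTD, the same parent/child check may be easier to
maintain as a table of allowed pairs. The sketch below is one possible
variant, not part of hip_xml. The typedef and the numeric values given
to the HIP_* names are stand-ins, assumed here only so that the
fragment compiles on its own; real code would take them from the
hip_xml header. The allowed table and the validate_by_table name are
likewise hypothetical.

```c
#include <stddef.h>

/* Stand-ins for the hip_xml definitions.  The values are assumptions
 * made for this sketch only. */
typedef int hip_xml_elmid;
#define HIP_ELM_root     0
#define HIP_ELM_UNUSED   100
#define HIP_XML_VALID    0
#define HIP_XML_INVALID  (-1)

#define ELM_listofthings (HIP_ELM_UNUSED)
#define ELM_a_thing      (HIP_ELM_UNUSED + 1)

/* Each entry is one parent/child pair which the DTD allows. */
static const struct {
    hip_xml_elmid parent;
    hip_xml_elmid child;
} allowed[] = {
    { HIP_ELM_root, ELM_listofthings },
    { ELM_listofthings, ELM_a_thing }
};

/* Table-driven equivalent of the validate callback: a pair is valid
 * only if it appears in the allowed table. */
static int validate_by_table(hip_xml_elmid parent, hip_xml_elmid child)
{
    size_t i;

    for (i = 0; i < sizeof allowed / sizeof allowed[0]; i++) {
        if (allowed[i].parent == parent && allowed[i].child == child)
            return HIP_XML_VALID;
    }
    return HIP_XML_INVALID;
}
```

Adding a new element to the document format then means adding rows to
the table rather than extending a compound condition.
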
For this example, we can ignore the start-element callback, and just
use the end-element callback:

   static int endelm(void *userdata, const struct hip_xml_elm *s,
                     const char *cdata)
   {
       printf("Got a thing: %s\n", cdata);
       return 0;
   }

This endelm callback just prints the cdata which was contained in the
"a-thing" element.

Now, on to parsing. A new parser object is created for parsing each
XML document. Creating a new parser object is as simple as:

   hip_xml_parser *parser;

   parser = hip_xml_create();

Next register the handler, passing NULL as the start-element callback,
and also as userdata, which we don't use here.

   hip_xml_push_handler(parser, my_elms,
                        validate, NULL, endelm,
                        NULL);

Finally, call hip_xml_parse, passing the chunks of the XML document to
hip_xml as you get them. For the example document above, the output
should be:

   Got a thing: foo
   Got a thing: bar