Vanilla XML Parser

By Malcolm McLean

This is a "vanilla" XML parser, written in ANSI C. I looked for an XML parser on the net and, surprisingly, the ones I found were all either heavyweight or written in C++. For instance Expat is a free C XML parser, but it requires you to set callback handlers to interpret each tag as it is read in. That's useful if you're parsing massive documents which won't fit into memory. But quite commonly you just want a 20-line configuration file defining some paths and user constants. You make it XML so that it is self-documenting, and pulling in a huge parser for that is overkill.

The general rule with text format documents is that writing them is easy - just step through your arrays or traverse your trees, printing out the data. Parsing them is hard, because you have to validate the data at each turn. Every single numerical field, for example, can technically cause undefined behaviour if you try to decode it with atoi() and the value is too large to fit into an integer. The vanilla parser simply ignores complexities like entities, Unicode, special embedded characters, and so on. That means that it's harder to exploit, other than by feeding it a huge file.
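To make the atoi() point concrete, here is a minimal sketch of a stricter conversion built on strtol(). The name parse_int is hypothetical and is not part of the parser; it just shows one way to validate a numerical field before trusting it.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the parser: convert a decimal string
   to an int, rejecting empty input, trailing junk and out-of-range
   values. Returns 1 on success and stores the result in *out,
   returns 0 on failure and leaves *out alone. */
int parse_int(const char *str, int *out)
{
  char *end;
  long value;

  if (!str || !out)
    return 0;
  errno = 0;
  value = strtol(str, &end, 10);
  if (end == str || *end != '\0')
    return 0;                /* no digits, or junk after the number */
  if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
    return 0;                /* too big for a long, or for an int */
  *out = (int) value;
  return 1;
}
```

Calling something like parse_int(xml_getdata(node), &value) would then fail cleanly on an oversized or malformed field. Note that strtol() skips leading whitespace, so " 42" is accepted while "42 " is not.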

This simple parser will only handle ASCII-encoded files. It loads the whole thing into memory, being quite greedy with lots of small allocations. On the other hand, memory sizes are often quite big these days. You can traverse the tree yourself, or you can use the access functions for simple data retrieval.

xmlparser.h
xmlparser.c

The code is just recently developed so might have some bugs, particularly when fed with malformed XML.

The basic idea is that the document is held as a tree of nodes, each with these fields

typedef struct xmlnode
{
  char *tag;                 /* tag to identify data type */
  XMLATTRIBUTE *attributes;  /* attributes */
  char *data;                /* data as ascii */
  int position;              /* position of the node within parent's data string */
  struct xmlnode *next;      /* sibling node */
  struct xmlnode *child;     /* first child node */
} XMLNODE;

The Lisp-style convention is used whereby each node of the tree has two pointers, one to siblings and one to first child. It differs from a binary tree in that the pointers are called "next" and "child" rather than "left_child" and "right_child". The practical difference is that the sibling list is likely to be quite long, the child depth quite shallow.
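If you want to walk the tree yourself, the pattern might look like the sketch below. The struct is copied from the definition above so that the example compiles on its own (drop it if you include xmlparser.h); count_nodes and tree_depth are hypothetical names, not functions the parser provides.

```c
#include <stddef.h>

typedef struct xmlattribute XMLATTRIBUTE;  /* only used via pointer here */

typedef struct xmlnode
{
  char *tag;
  XMLATTRIBUTE *attributes;
  char *data;
  int position;
  struct xmlnode *next;      /* sibling node */
  struct xmlnode *child;     /* first child node */
} XMLNODE;

/* Count a node plus everything below it. The loop follows the sibling
   list (possibly long), the recursion follows child (usually shallow). */
int count_nodes(const XMLNODE *node)
{
  int total;
  const XMLNODE *child;

  if (!node)
    return 0;
  total = 1;
  for (child = node->child; child; child = child->next)
    total += count_nodes(child);
  return total;
}

/* Number of levels in the tree rooted at node: a lone node has depth 1. */
int tree_depth(const XMLNODE *node)
{
  int deepest = 0;
  int d;
  const XMLNODE *child;

  if (!node)
    return 0;
  for (child = node->child; child; child = child->next)
  {
    d = tree_depth(child);
    if (d > deepest)
      deepest = d;
  }
  return deepest + 1;
}
```

Recursing only on child and looping on next keeps the stack use proportional to the nesting depth rather than to the length of the sibling list.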

The tag is the element name, without the angle brackets. Data is the data associated with the node. This is currently a char *, so ASCII only. XML allows for tags embedded in data strings, e.g. <GENESIS>In the beginning God created the heavens and the earth, and the Earth was without form, and void, and darkness <I>was</I> on the face of the deep.</GENESIS> The "was" is italicised because the word doesn't have a matching word in the Hebrew. This feature doesn't really have much place in XML used for data documents, but it's supported. Note that whitespace is a bit dicey in XML.

typedef struct xmlattribute
{
  char *name;                /* attribute name */
  char *value;               /* attribute value (without quotes) */
  struct xmlattribute *next; /* next pointer in linked list */
} XMLATTRIBUTE;

The attributes are a simple linked list with name, value pairs. Note that values are always strings. You've got to validate them yourself.
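Searching the list by hand is a plain linear walk. The sketch below repeats the struct so that it stands alone; find_attribute is a hypothetical name, and returning NULL for a missing attribute is my assumption for the example, not a statement about what xml_getattribute() does.

```c
#include <string.h>

typedef struct xmlattribute
{
  char *name;                /* attribute name */
  char *value;               /* attribute value (without quotes) */
  struct xmlattribute *next; /* next pointer in linked list */
} XMLATTRIBUTE;

/* Hypothetical helper: return the value of the first attribute called
   name, or NULL if the list has no such attribute. The string returned
   belongs to the list, so don't write to it or free it. */
const char *find_attribute(const XMLATTRIBUTE *list, const char *name)
{
  for (; list; list = list->next)
  {
    if (list->name && strcmp(list->name, name) == 0)
      return list->value;
  }
  return NULL;
}
```

Whatever comes back is still just a string, so a numerical attribute needs range checking before you use it.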

To save you having to access these structures directly, there are some simple tools.


Usage

  XMLDOC *doc;
  int Nchildren;
  XMLNODE *node;
  XMLNODE **desc;
  int N;
  int err;
  int i;

  doc = loadxmldoc("fred.xml", &err);
  if(!doc)
  {
    printf("Can't load file error %d\n", err);
    return 0;
  }
  printf("root tag %s\n", xml_gettag(xml_getroot(doc)));
  Nchildren = xml_Nchildren(xml_getroot(doc));
  for(i=0;i<Nchildren;i++)
  {
    node = xml_getchild(xml_getroot(doc), NULL, i);
    printf("child tag %s data %s\n", xml_gettag(node), xml_getdata(node));
  }
  desc = xml_getdescendants(doc->root, "down", &N);
  printf("N descendants with tag \"down\", %d\n", N);
  free(desc);
  killxmldoc(doc);


loadxmldoc

XMLDOC *loadxmldoc(char *fname, int *err);

The master function. Loads an entire XML file into memory in one go.


floadxmldoc

XMLDOC *floadxmldoc(FILE *fp, int *err);

This is the same function, but reads the XML from an already opened stream.


killxmldoc

void killxmldoc(XMLDOC *doc);

The document destructor. Note it may take non-trivial time to execute.


xml_getroot

XMLNODE *xml_getroot(XMLDOC *doc);

Simple access function, to get the root node of the document. Everything starts from there.


xml_gettag

const char *xml_gettag(XMLNODE *node);

Simple access function. Get the tag associated with the node. Note that the tag isn't copied, so don't write to it.


xml_getdata

const char *xml_getdata(XMLNODE *node);

Simple access function. Get the data string associated with the node. Note that the data is not copied, so do not write to it. Note also that a lot of nodes will have "data" which is purely whitespace.
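Since whitespace-only data is so common, it can be handy to filter it out before doing anything else with the string. A minimal sketch, where is_blank is a hypothetical helper and not part of the parser:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if str is NULL, empty, or made up
   entirely of whitespace characters, otherwise 0. */
int is_blank(const char *str)
{
  if (!str)
    return 1;
  while (*str)
  {
    /* the cast keeps isspace() defined for chars with the top bit set */
    if (!isspace((unsigned char) *str))
      return 0;
    str++;
  }
  return 1;
}
```

A test such as if(!is_blank(xml_getdata(node))) then skips nodes whose only data is the indentation between child tags.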


xml_getattribute

const char *xml_getattribute(XMLNODE *node, const char *attr);

Get the named attribute.


xml_Nchildren

int xml_Nchildren(XMLNODE *node);

Get the number of direct children of the node.


xml_Nchildrenwithtag

int xml_Nchildrenwithtag(XMLNODE *node, const char *tag);

Get the number of direct children of a node with a particular tag. This function is used to check nodes for consistency with the expected data format. Passing NULL for the tag is equivalent to calling xml_Nchildren().


xml_getchild

XMLNODE *xml_getchild(XMLNODE *node, const char *tag, int index);

Get the ith child of the node with the tag. If you've called xml_Nchildrenwithtag(), this function will retrieve the actual nodes, in order. It's rather inefficient since it searches through the child list from the beginning on each call. This won't matter if you've only a small file, but for huge data sets, work through the list explicitly. If you pass NULL for the tag, you get the ith child.
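Working through the list explicitly means one pass over the sibling pointers instead of restarting from the first child for every index. The sketch below shows the loop shape; the struct is repeated from above so the example compiles alone, and next_with_tag and count_children_with_tag are hypothetical names rather than parser functions.

```c
#include <string.h>

typedef struct xmlattribute XMLATTRIBUTE;  /* only used via pointer here */

typedef struct xmlnode
{
  char *tag;
  XMLATTRIBUTE *attributes;
  char *data;
  int position;
  struct xmlnode *next;
  struct xmlnode *child;
} XMLNODE;

/* Hypothetical helper: starting at node, return the first node in the
   sibling list whose tag matches. A NULL tag matches any node.
   Returns NULL when the list runs out. */
const XMLNODE *next_with_tag(const XMLNODE *node, const char *tag)
{
  while (node && tag && (!node->tag || strcmp(node->tag, tag) != 0))
    node = node->next;
  return node;
}

/* Visit each matching direct child exactly once. Here we only count
   them, but the loop body is where real processing would go. */
int count_children_with_tag(const XMLNODE *parent, const char *tag)
{
  int n = 0;
  const XMLNODE *child;

  if (!parent)
    return 0;
  for (child = next_with_tag(parent->child, tag);
       child;
       child = next_with_tag(child->next, tag))
    n++;
  return n;
}
```

With N children this does N steps in total, whereas calling xml_getchild() for every index walks the list from the start each time, which adds up to roughly N squared steps.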


xml_getdescendants

XMLNODE **xml_getdescendants(XMLNODE *node, const char *tag, int *N);

Get all the descendants of a node with a given tag. This function is rather expensive, but it's convenient for fishing in possibly extended files. Find the tags you can process, then look at their children to check that they are genuinely the tags you want. Returns a malloc()ed array of node pointers, with the number of entries in N. You must free() the array, but don't free the nodes themselves, as they still belong to the document.
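To show why only the array is freed, here is a sketch of how such a collection could be built: count the matches, allocate once, then fill in pointers to nodes that still belong to the tree. This is an illustration under my own assumptions (depth-first order, the starting node itself not included), not the code in xmlparser.c, and collect_descendants is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

typedef struct xmlattribute XMLATTRIBUTE;  /* only used via pointer here */

typedef struct xmlnode
{
  char *tag;
  XMLATTRIBUTE *attributes;
  char *data;
  int position;
  struct xmlnode *next;
  struct xmlnode *child;
} XMLNODE;

/* Count the nodes below node whose tag matches. */
static int count_matches(XMLNODE *node, const char *tag)
{
  int n = 0;
  XMLNODE *child;

  for (child = node->child; child; child = child->next)
  {
    if (child->tag && strcmp(child->tag, tag) == 0)
      n++;
    n += count_matches(child, tag);
  }
  return n;
}

/* Store matching nodes in out[] starting at index n; returns new count. */
static int fill_matches(XMLNODE *node, const char *tag, XMLNODE **out, int n)
{
  XMLNODE *child;

  for (child = node->child; child; child = child->next)
  {
    if (child->tag && strcmp(child->tag, tag) == 0)
      out[n++] = child;
    n = fill_matches(child, tag, out, n);
  }
  return n;
}

/* Hypothetical stand-in: returns a malloc()ed array of pointers into the
   existing tree and sets *N to its length. Returns NULL with *N set to 0
   if memory runs out. The caller frees the array, never the nodes. */
XMLNODE **collect_descendants(XMLNODE *node, const char *tag, int *N)
{
  int count = count_matches(node, tag);
  XMLNODE **answer = malloc((count ? count : 1) * sizeof *answer);

  if (!answer)
  {
    *N = 0;
    return NULL;
  }
  *N = fill_matches(node, tag, answer, 0);
  return answer;
}
```

The array holds borrowed pointers only, so a single free() on it is enough, and the nodes go away later when the document is destroyed.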