Cargo for Rust is required.
libxml2
needs to be installed.
$ cargo install --path .
will install validate-xml
into $HOME/.cargo/bin
.
Basic usage:
$ validate-xml root_dir 2> log.txt
Detailed usage:
$ validate-xml --help
Validate XML files concurrently and downloading remote XML Schemas only once.
Usage:
validate-xml [--extension=<extension>] <dir>
validate-xml (-h | --help)
validate-xml --version
Options:
-h --help Show this screen.
--version Show version.
--extension=<extension> File extension of XML files [default: cmdi].
This was written to be super fast:
- Concurrency, using number of cores available.
- Use of C FFI with
libxml2
library. - Downloading of remote XML Schema files only once per run, and caching to memory.
- Caching of
libxml2
schema data structure for reuse in multiple concurrent parses.
On my machine, it takes a few seconds to validate a sample set of 20,000 XML files. This is hundreds of times faster than the first attempt, which was a shell script that sequentially runs xmllint
(after first pre-downloading remote XML Schema files and passing the local files to xmllint
with --schema
). The concurrency is a big win, as is the reuse of the libxml2
schema data structure across threads and avoiding having to spawn xmllint
as a heavyweight process.
If I had known about Python's lxml
binding to C libxml2
, I might not have bothered this Rust program. I found out about it and wrote the Python version of this program, which looks very similar (but without all the types): https://github.com/FranklinChen/validate-xml-python
However, as the Rust ecosystem fills in gaps in available libraries, I think I would actually be pretty happy to write in Rust instead of Python.