Encoding and XML
Filed under Quick Tips on January 6, 2009 by Joel Potischman
In speccing a project last month, we discussed the best way to attach PDFs to the XML documents we send between widely distributed systems. We quickly decided that simply updating our XML schema to carry Base64 encoded documents inside the body of the XML document itself would avoid a whole host of atomicity, reliability, and redesign issues.
For my half of our application ecosystem I researched .NET's Base64 encoding/decoding support, got curious about character set encoding, and set out to write a universal encoder to make it simple, easy, and guaranteed safe to insert any kind of binary or text data into an XML document.
But first, a quick nano-refresher: Character encoding specifies how a string of bytes should be mapped to specific text characters. In the simplest case, ASCII, one byte maps to one of 128 possible characters (8-bit extensions such as Latin-1 use all 256 byte values). 65=A, 66=B, ... 90=Z, etc. In the Unicode encodings UTF-16 and UTF-32, two or four bytes map to one of more than a million possible characters. So when my program reads four bytes from a text file, I need to know if they represent four 1-byte characters, two 2-byte characters, or one 4-byte character. It's actually even more complicated than that, but the basic problem is making sure I don't accidentally turn a 400-byte ASCII text file into 200 Japanese characters. Or vice versa. Or garbage.
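To make the ambiguity concrete, here is a minimal sketch that decodes the same four bytes three ways and counts the resulting characters. The class and method names are made up for illustration; only the `System.Text.Encoding` members are real .NET API.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decodes the same bytes with three encodings and returns the number
    // of characters each decoding produces, in the order ASCII, UTF-16, UTF-32.
    public static int[] CountCharacters(byte[] bytes)
    {
        return new[]
        {
            Encoding.ASCII.GetString(bytes).Length,   // one byte per character
            Encoding.Unicode.GetString(bytes).Length, // UTF-16 little-endian
            Encoding.UTF32.GetString(bytes).Length    // UTF-32 little-endian
        };
    }
}
```

Calling `EncodingDemo.CountCharacters(new byte[] { 0x41, 0x00, 0x00, 0x00 })` yields 4, 2, and 1: an "A" followed by three nulls in ASCII, an "A" followed by one null in UTF-16, and a lone "A" in UTF-32.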
Fortunately, .NET has robust support for character encoding, so all I have to do is load the correct encoding class and ask it to take care of this for me. If I know I will only ever need to deal with UTF-16, that class is Encoding.Unicode (which in .NET means little-endian UTF-16), but for maximum flexibility I can call Encoding.GetEncoding(encodingName) and get any supported encoding by name. Like so:
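// Reads a whole file and decodes its bytes with the named encoding, e.g. "utf-16".
public string GetStringFromFile(string myFilename, string encodingName)
{
    byte[] fileBytes = System.IO.File.ReadAllBytes(myFilename);
    // Throws ArgumentException if encodingName is not a supported encoding name.
    System.Text.Encoding myEncoding = System.Text.Encoding.GetEncoding(encodingName);
    // GetString does not strip a byte order mark; it is kept as a leading U+FEFF character.
    return myEncoding.GetString(fileBytes);
}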
After I'm done modifying that string I can easily convert it back into a byte array and save it, preserving the original character encoding:
// Encodes a string with the named encoding and writes the bytes to a file.
public void SaveStringToFile(string myFilename, string encodingName, string myString)
{
    System.Text.Encoding myEncoding = System.Text.Encoding.GetEncoding(encodingName);
    // GetBytes does not add a byte order mark of its own.
    byte[] fileBytes = myEncoding.GetBytes(myString);
    System.IO.File.WriteAllBytes(myFilename, fileBytes);
}
But now let's get back to my actual business problem: storing a PDF or other binary data in my XML document. Because the bytes I encounter are not supposed to represent character data, attempting to map them to characters may result in nonsense. For example, if I decode the byte sequence 00 00 00 00 as ASCII, UTF-16, or UTF-32, I get four, two, or one null characters respectively, any of which will screw up string processing. Note that use of CDATA sections doesn't help, because XML 1.0 does not allow the null character anywhere in a document, CDATA included. Only being very lucky about which byte sequences I encounter would avert disaster, and I don't like writing lucky code.
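The "lucky code" problem can be checked directly. The sketch below is an illustration, not code from the attached project, and the class and method names are invented. It decodes a byte array as text, re-encodes it, and reports whether the original bytes came back.

```csharp
using System;
using System.Linq;
using System.Text;

public static class LossyRoundTripDemo
{
    // Returns true when decoding the bytes as text and then re-encoding
    // that text reproduces the original bytes exactly.
    public static bool SurvivesRoundTrip(byte[] original, Encoding encoding)
    {
        string asText = encoding.GetString(original);
        byte[] roundTripped = encoding.GetBytes(asText);
        return original.SequenceEqual(roundTripped);
    }
}
```

Plain ASCII text survives a trip through `Encoding.ASCII`, but a binary header such as `{ 0x25, 0x50, 0x44, 0x46, 0xFF, 0xD8 }` does not, because 0xFF and 0xD8 have no ASCII character and the decoder substitutes a replacement character for them.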
Enter Base64, which is designed for exactly this purpose: encoding arbitrary binary data into a string guaranteed to consist of only "safe" ASCII characters, and decoding that string back to bytes with 100% fidelity later. Microsoft places Base64 functionality under the System.Convert class rather than System.Text.Encoding, presumably because it's more of a conversion and translation process than a direct byte-to-character encoding like those described above.
To read a file into a Base64 string:
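// Reads a whole file and returns its contents as a Base64 string.
public string GetBase64StringFromFile(string myFilename)
{
    byte[] fileBytes = System.IO.File.ReadAllBytes(myFilename);
    // InsertLineBreaks adds a line break after every 76 characters of output.
    return System.Convert.ToBase64String(fileBytes, System.Base64FormattingOptions.InsertLineBreaks);
}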
And to decode it and save it back to the filesystem:
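// Decodes a Base64 string and writes the resulting bytes to a file.
public void SaveBase64StringToFile(string myFilename, string myString)
{
    // FromBase64String ignores white space such as line breaks and throws
    // FormatException if the string is not valid Base64.
    byte[] fileBytes = System.Convert.FromBase64String(myString);
    System.IO.File.WriteAllBytes(myFilename, fileBytes);
}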
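The same round-trip check can be run in memory, without touching the filesystem. This is a small sketch with invented names, assuming only the standard `System.Convert` methods used above.

```csharp
using System;
using System.Linq;

public static class Base64RoundTripDemo
{
    // Encodes the bytes to Base64 (with line breaks, as in the file helper),
    // decodes them again, and reports whether the result matches the input.
    public static bool SurvivesBase64(byte[] original)
    {
        string encoded = Convert.ToBase64String(original, Base64FormattingOptions.InsertLineBreaks);
        byte[] decoded = Convert.FromBase64String(encoded);
        return original.SequenceEqual(decoded);
    }
}
```

Unlike the character encodings, this should return true for any input, including arrays that contain every possible byte value.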
I combined both "real" character encodings and Base64 encoding in my XmlFileEncoder class (attached) to provide unified access to both encodings when working with XML documents. If you know you're dealing with Unicode, simply call XmlFileEncoder.InsertFileIntoXmlDocument with the encoding UTF-16 and the file will safely be inserted as text. If you don't always know the file format, or you are dealing with binary files, simply call the same method with the encoding Base64. In either case, a new node will be added to contain your file data and the encoding attribute will record the encoding method so XmlFileEncoder.ExtractFileFromXmlDocument will use the correct character/Base64 decoding automatically.
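For readers without the attachment handy, the general idea can be sketched as follows. This is not the XmlFileEncoder source: the class name, the `file` element, and the method signatures are assumptions made for illustration. Only the `encoding` attribute and the "Base64" value come from the description above.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlEmbedSketch
{
    // Appends a new element holding the data to the document's root element.
    // Assumes the document already has a root. The element name "file" is hypothetical.
    public static XmlElement Insert(XmlDocument doc, byte[] data, string encodingName)
    {
        XmlElement node = doc.CreateElement("file");
        // Record how the data was encoded so it can be decoded the same way later.
        node.SetAttribute("encoding", encodingName);
        node.InnerText = encodingName == "Base64"
            ? Convert.ToBase64String(data)
            : Encoding.GetEncoding(encodingName).GetString(data);
        doc.DocumentElement.AppendChild(node);
        return node;
    }

    // Reads the encoding attribute and reverses whichever encoding was used.
    public static byte[] Extract(XmlElement node)
    {
        string encodingName = node.GetAttribute("encoding");
        return encodingName == "Base64"
            ? Convert.FromBase64String(node.InnerText)
            : Encoding.GetEncoding(encodingName).GetBytes(node.InnerText);
    }
}
```

Storing the encoding name next to the data is the design choice that matters here: the reader never has to guess, and it picks the matching decoder from the attribute alone.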
The demo WinForms app starts up displaying an XML document with some UTF-16 data already encoded into it. Use the controls along the bottom to experiment with inserting differently encoded files (provided, or use your own) using different application character encodings. Some files will clearly look wrong in the XML when you select the wrong encoding. Others may look correct, or almost correct, but when you press the SaveEncodedFile to File button it will report Copy accuracy FAILED when it verifies against the source file. However, the files encoded and decoded using Base64 will always copy accurately. The only downsides are that Base64 encoded data is about one third bigger than the source (four output characters for every three input bytes, plus padding and any inserted line breaks) and much less human readable.
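The size overhead follows directly from how Base64 works: every group of up to three input bytes becomes four output characters. A minimal sketch of that arithmetic, with an invented helper name and ignoring any line breaks:

```csharp
using System;

public static class Base64SizeDemo
{
    // Length in characters of the Base64 form of byteCount bytes when no
    // line breaks are inserted: four characters per group of up to three bytes.
    public static int EncodedLength(int byteCount)
    {
        return 4 * ((byteCount + 2) / 3);
    }
}
```

A 300-byte file therefore becomes 400 characters, and a 1-byte file becomes 4 because the final group is padded. Asking for line breaks adds a little more on top of that.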
You can find the source code here: XmlEncodingDemo.zip
Have a happy, healthy, and correctly encoded 2009!
