SharpZipLib and Mac OS X

TL;DR When creating zip archives with SharpZipLib disable Zip64 format if you care about Mac compatibility.

Extra TL;DR When creating zip archives with SharpZipLib make sure you set the file size, disabling Zip64 is neither here nor there.

A project I am working on sources a zip archive from a Web service, extracts the XML file from the zip and then does silly amounts of processing of the data in that XML to produce a new XML file which is returned to the user.

But the sourced zip archive cannot be opened using Python’s zipfile module, and when saved on a Mac the archive cannot be opened using the built-in Archive If one double-clicks the zip, Archive just compresses it again and sticks “.cpgz” on the end of the file name.

Fortunately the developer of the Web service is very helpful (the service is written in C# and runs on Windows) and although he didn’t know why Macs were having problems (the built-in Windows zip tool can handle the archive fine) he showed me the code that creates the zip file.

Turns out they were using SharpZipLib, an open-source library for C#. And it turns out SharpZipLib creates archives using Zip64 format by default.

The fix was to disable Zip64 when creating the archive. Here’s a trivial command-line program that creates a zip and disables Zip64 for Mac compatibility:

using System;
using System.IO;
using ICSharpCode.SharpZipLib.Core;
using ICSharpCode.SharpZipLib.Zip;

public class ZipTool
    public static void Main(string[] args)
        if (args.Length != 2) {
            Console.WriteLine("Usage: ziptool <input file> <output file>");

        using (ZipOutputStream zipout = new ZipOutputStream(File.Create(args[1]))) {
            byte[] buffer = new byte[4096];
            string filename = args[0];


            // Disable Zip64 for Mac compatibility
            zipout.UseZip64 = UseZip64.Off;

            ZipEntry entry = new ZipEntry(Path.GetFileName(filename));            
            entry.DateTime = File.GetLastWriteTime(filename);

            using (FileStream fs = File.OpenRead(filename)) {
                int sourceBytes;
                do {
                    sourceBytes = fs.Read(buffer, 0, buffer.Length);
                    zipout.Write(buffer, 0, sourceBytes);
                } while (sourceBytes > 0);


The disadvantage of disabling Zip64 is you cannot create archives larger than 4 gigabytes, nor can you add files larger than 4 gigabytes before compression.

The advantage of disabling Zip64 is you make me and all my Mac-using hippy friends happy. In concrete terms, disabling Zip64 makes it more likely I will buy you a drink.

Hi Gaston!

Also many thanks to the maintainer of NAnt in MacPorts, who responded to my bug report and pushed an updated NAnt to MacPorts extremely quickly. Although SharpZipLib doesn’t officially support Mono, it builds without a hitch using the “build-net-2.0” target if you hack the NAnt script like so:

<target name="build-mono-2.0" >
    <call target="build-net-2.0" />
    <!--<fail message="Dont know how to build for Mono 2.0." />-->

Also thanks to the Mono project! Saved me having to fire up a Windows virtual machine to figure out this problem.

Widths & Heights with xlwt + Python

This article about using xlwt to generate Excel in Python reminded me I needed to see exactly how to set column widths (the xlwt documentation doesn’t cover it).

Let’s create a new Excel workbook and add a sheet:

>>> import xlwt
>>> book = xlwt.Workbook(encoding='utf-8')
>>> sheet = book.add_sheet('sheeeeeet')

We need to get a column in order to set its width. You do that by call col() on the sheet, passing the column’s index as the only argument (or row() for accessing rows):

>>> sheet.col(0)    # First column
<xlwt.Column.Column object at 0x10b2a6190>
>>> sheet.row(2)    # Third row
<xlwt.Row.Row object at 0x10b2a7050>

The index is zero-based. You can fetch a column even if you have not written to any cell in that column (this applies equally to rows).

Columns have a property for setting the width. The value is an integer specifying the size measured in 1/256 of the width of the character ‘0’ as it appears in the sheet’s default font. xlwt creates columns with a default width of 2962, roughly equivalent to 11 characters wide.

>>> first_col = sheet.col(0)
>>> first_col.width = 256 * 20              # 20 characters wide (-ish)
>>> first_col.width

For rows, the height is determined by the style applied to the row or any cell in the row. (In fact rows also have a property called height but it doesn’t do what you want.) To set the height of the row itself, create a new style with a font height:

>>> tall_style = xlwt.easyxf('font:height 720;') # 36pt
>>> first_row = sheet.row(0)
>>> first_row.set_style(tall_style)

Setting the style on the row does not change the style of the cells in that row.

There is no obvious way to set a default width and height for all columns and rows. An instance of xlwt.Worksheet.Worksheet has properties for col_default_width and row_default_height but changing those does not actually change the defaults.

The problem is that new columns are always created with an explicit width, while rows take their height from the style information.

My first attempt at setting defaults set the width on every column and the height on every row. It works, but creates 65,536 unnecessary empty row objects.

A slightly better approach is to set the width on every column and to set the font height on the default style record in the workbook:

import itertools
import xlwt

book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('sheeeeeet')

col_width = 256 * 20                        # 20 characters wide

    for i in itertools.count():
        sheet.col(i).width = col_width
except ValueError:

default_book_style = book.default_style
default_book_style.font.height = 20 * 36    # 36pt'example.xls')

Here I used itertools.count() and wrapped the loop in a try block so I can forget exactly how many columns are permitted. When the loop tries to access a bad index it will throw ValueError and the loop will exit.

You mustn’t replace the default style on an instance of xlwt.Workbook.Workbook, you have to update the property of the existing style (to ensure you are changing the first style record). Unfortunately there is no way to set a default column width (as of xlwt version 0.7.2) so the brute force method of setting every column will have to do – it isn’t so bad since there are only 256 columns.

Talking of widths and heights, have you heard “Widths & Heights” by Magic Arm? Is good.

Date variables in InDesign

Interesting InDesign problem: the format for a modification date variable changes per document.

(This post describes a problem using Adobe InDesign CS4 but applies just as well to CS5 and CS5 and a half.)

Suppose you have a text frame containing the file modification date variable, created using TypeText VariablesInsert VariableModification Date, displays as “14 September 2011 7:50 PM” (on my system). Now open an existing document that was created on a different system and copy and paste the text frame containing the date variable. But depending on what system created the other document the date displays using a different format, e.g as “September 14, 2011 19:50”.

It appears that the date format is determined by the document in which it is placed, rather than by the date format in use when the variable was created.

Workaround: define a new / custom text variable that uses the file modification time but with your own explicit format, then insert your custom variable instead of the pre-defined “Modification Date” variable.

In my brief testing this custom variable and its format is preserved when pasting the text frame into other documents that were created on other systems.

But now I want to know exactly how the format is chosen for the built-in “Modification Date” variable. I am guessing that when a story with a variable is pasted into a document the format is determined by the format of an existing variable of the same name, and if there is no existing variable of that name then InDesign brings in the new variable definition (along with its format) from the clipboard.

But why models?

No, that’s not what I mean… why the default date format? I tried changing the date formats in System Preferences. Doesn’t seem that InDesign picks it up from there. I tried trashing the Adobe InDesign preferences folder, changing the date format in System Preferences and launching InDesign again. InDesign is still using the original format, so doesn’t get it from the user’s preferences. I had a look through ~/Library/Preferences/com.adobe.InDesign.plist but nothing date-related in there.

Perhaps it is set by the local-domain /Library/Preferences/*. Perhaps it is set by the built-in preferences of your installed language version of Creative Suite. Perhaps…

So I gave up. I will leave the investigation for some day when I am younger and it is more important to understand how Adobe’s InDesign picks the format for the built-in date modified variable. The workaround works around.


Free software FTW! Updated

Two years ago (flippin’ heck it seems like only yesterday) I wrote about converting between Unix timestamps and Windows timestamps using Python. In that post I linked to my very simple implementation of a module that provides converting back and forth between the formats.

A few weeks ago I received an e-mail from Timothy Williams with changes to the my module so that it preserves the fractions of a second in the conversion. How sweet is that?!?!! Exclamation mark question mark exclamation mark cellida diaresis em-dash full stop king of punctuation.

It is fantastic that not only did someone find my code useful but also that they were generous enough to take the time to improve it and give the changes back to me. I love tasty, delicious free software and the people like Tim who make it tastier and more delicious.

So here is the new version of incorporating Tim’s fixes.

XPath bug in old versions of ElementTree

I figured out why my XML parsing code works fine using the pure-Python ElementTree XML parsing module but fails when using the speedy and memory-optimized cElementTree XML parsing module.

The XPath 1.0 specification says '.' is short-hand for 'self::node()', selecting a node itself.

Parsing an XML document and selecting the context node with ElementTree in Python 2.5:

>>> from xml.etree import ElementTree
>>> ElementTree.VERSION
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node1 = ElementTree.fromstring(doc).find('./Example')
>>> node1
<Element Example at 10e0ed8c0>
>>> node1.find('.')
<Element Example at 10e0ed8c0>
>>> node1.find('.') == node1

See how the result of node1.find('.') is the node itself? As it should be.

Parsing an XML document and selecting the context node with cElementTree in Python 2.5:

>>> from xml.etree import cElementTree
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node2 = cElementTree.fromstring(doc).find('./Example')
>>> node2
<Element 'Example' at 0x10e0e3660>
>>> node2.find('.')
>>> node2.find('.') == node2

Balls. The result of node2.find('.') is None.

However! I have a kludgey work-around that works whether you use ElementTree or cElementTree. Use './' instead of '.':

>>> node1.find('./')
<Element Example at 10e0ed8c0>
>>> node1.find('./') == node1
>>> node2.find('./')
<Element 'Example' at 0x10e0e3660>
>>> node2.find('./') == node2

Kludgey because './' is not a valid XPath expression.

So we are back on track. Also works for Python 2.6 which has the same version of ElementTree.

Fortunately Python 2.7 got a new version of ElementTree and the bug is fixed:

>>> from xml.etree import ElementTree
>>> ElementTree.VERSION
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node3 = ElementTree.fromstring(doc).find('./Example')
>>> node3
<Element 'Example' at 0x107257210>
>>> node3.find('.')
<Element 'Example' at 0x107257210>
>>> node3.find('.') == node3

However! They also fixed my kludgey work-around:

>>> node3.find('./')
>>> node3.find('./') == node3

So I can’t code something that works for all three versions. This is annoying. I was hoping to just replace ElementTree with the C version, makes my code run in one third the time (the XML parts of it run in one tenth the time). And cannot install any compiled modules – the code can only rely on Python 2.5’s standard library.

Lion: Spotlight still broken

I’ve installed Mac OS X 10.7 (upgrading from 10.6) and was very interested to see how the new search tokens feature would work in Spotlight. But on my Mac it doesn’t. Here’s the result of a search for files whose name contains the text “david buxton”:

Note how none of the files in that list has a name containing the text “david buxton”.

Perhaps deleting the Spotlight index would fix things. Meh.

CSV sucks

For future reference: CSV is a terrible format for spreadsheet data.

Any by spreadsheet data I mean CSV is a terrible format for text data that will be opened in Microsoft Excel (unless you only speak English).

My objection to CSV as a spreadsheet format is it has no way to indicate the text encoding, and Excel (at least Excel 2008 for Mac and Excel 2007 for Windows) has no way of choosing the encoding when opening a CSV file. When your data is text from any number of non-English languages you will likely be dealing with characters outside the standard ‘latin-1’ character set.

Even if Excel did have a little pop-up menu for choosing the encoding when opening a CSV file, it would mean instructing your Excel-happy friends to “open as UTF-8” whenever they wanted to open your CSV data.

Wot you mean you don’t have Excel-happy friends?

Styling your Excel data with xlwt

This post is about how to create styles in Excel spreadsheets with the most excellent xlwt for Python. The documentation for xlwt (version 0.7.2) is a little sketchy on how to use formatting. So here goes…

To apply formatting to a cell you pass an instance of the xlwt.XFStyle class as the fourth argument to the xlwt.Worksheet.write method. The best way to create an instance is to use the xlwt.easyxf helper, which takes a string that specifies the formatting for a cell.

The other thing about using styles is you should only make one instance of each, then pass that same style object every time you want to apply it to a cell.

An example which uses a few styles:

import xlwt

styles = dict(
    bold = 'font: bold 1',
    italic = 'font: italic 1',
    # Wrap text in the cell
    wrap_bold = 'font: bold 1; align: wrap 1;',
    # White text on a blue background
    reversed = 'pattern: pattern solid, fore_color blue; font: color white;',
    # Light orange checkered background
    light_orange_bg = 'pattern: pattern fine_dots, fore_color white, back_color orange;',
    # Heavy borders
    bordered = 'border: top thick, right thick, bottom thick, left thick;',
    # 16 pt red text
    big_red = 'font: height 320, color red;',

I have no idea what it is based on, but 20 = 1 pt. So 320 = 16 pt text.

book = xlwt.Workbook()
sheet = book.add_sheet('Style demo')

for idx, k in enumerate(sorted(styles)):
    style = xlwt.easyxf(styles[k])
    sheet.write(idx, 0, k)
    sheet.write(idx, 1, styles[k], style)'Example.xls')

It isn’t included with the current distribution on the cheese shop, but there is a useful Excel spreadsheet demonstrating cell patterns in the source repository.

You can find the complete list of possible cell formats by reading the source for xlwt.Styles.

VCS for Cocoa Programming for Mac OS X

I am working my way through Aaron HillegassCocoa Programming for Mac OS X (again). I find it a good book, despite the fact I’ve started it twice before and have abandoned my tuition twice before. I like to think my progression is an awful lot like young Luke Skywalker recklessly abandoning his Jedi training on Dagobah in order to save his Web application friends in the cloud city called Bespinternet. Eventually Luke (and I) will go back to complete his (and my) training and will become a truly great Jedi programmer for Mac OS X.

Anyway, for the first few chapters the book leads you through a number of small applications, each exercise starting a new project. Then in the later chapters Hillegass has you adding features to a single project called RaiseMan.

It strikes me that the next edition of the book should introduce using version control to track development.

Many chapters in the book end with challenges that go off on a tangent, getting you to implement alternate approaches to the exercises covered in that chapter. But then the subsequent chapter will expect you to work with the version of the RaiseMan application as it stood before the previous chapter’s programming challenge.

This is a natural fit for revision control (aka VCS). The book would teach you how to tag “release” versions of your code. It would teach you how to create an experimental feature branch for the chapter-end challenges, and then how to resume development from the last “release” version for the next chapter’s main exercises.

This is particularly relevant now that Xcode 4 includes Git integration. Of course there is no reason one can’t employ a suitable version control system with the current (third) edition of the book; I just think it makes a lot of sense to get younglings used to this workflow while teaching them all the other stuff about lightsabres and the Force while you’re at it.

You will up-vote this on Reddit / Hacker News (waves hand like her Imperial majesty).

Adobe Software Updates

What Adobe’s software update site needs is:

Anyway, those be my principal beefs with the current Adobe software updates site. So I made a site that tries to satisfy my beeves. Beefs.

It is here:

Please provide feedback to