Python + Encoding

Like so many computer science students, I started my programming career using Java. It's a fantastic first language for programmers for a host of reasons: it's free (including some awesome IDEs like Eclipse and IntelliJ), there's a huge and supportive community with significant amounts of commercial use (jobs!), the compiler is fairly helpful in finding bugs, strict typing encourages explicit programming, and so on. My first job was programming in Java, too. I loved it.

I was baffled by Python the first time I saw it. It felt like cheating, as though programming Python was simply a matter of politely asking the computer to complete various tasks in plain English. Windsor Circle is a Python shop on the back-end, and while my work here wasn't my first rodeo with Python, it is the first time I've used it in a large production environment.

Python does a lot of things well. It's easy to read in nearly plain English (much the same way that Ruby is) and programs are usually far shorter than in Java. Typing is dynamic, so it's easier to whip together a script without having to fuddle around with variable type mismatches. But Python isn't perfect. It's slower than Java. And it doesn't play well with encoding.

I'm not going to go into much depth here in terms of what encoding is, but if you're a dev and don't understand the topic well, I recommend a) learning about it stat, and b) this article.

Here are the three workarounds I've used to make Python play nicely when it comes to issues with encoding. In order from most- to least-hacky, they are:

1. Do a string literal replace of the problematic characters.

I ran into an issue where Python didn't want to write the copyright symbol back out to UTF-8 to send to another server for use in HTML. Andy pointed out that since we would later be escaping the copyright symbol for HTML anyway, we could replace the problematic Unicode symbol directly with the HTML escape, so we wrote a little method to do just that:

    def _copyright_encode(self, string):
        # Swap the raw copyright character for its HTML entity.
        return string.replace(u'\u00A9', '&copy;')

2. Use Python's encode() and decode() to play with strings.

If you know the encoding of your input and output formats (and this can't always be guaranteed), you can specify the codec for encoding / decoding using Python's built-in methods for dealing with Unicode (see this HOWTO and this guide to codecs). This is considerably more explicit and technical than I'm used to Python being - specifying codecs is too close to MIPS for me (kidding!) - but it does work, and there's even a replace param (errors='replace') for when Python can't figure out how to re-encode a particular character.
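Here's a minimal sketch of what that looks like in practice. The sample bytes and variable names are made up for illustration, and it assumes you already know the input is Latin-1:

```python
# Hypothetical input: bytes we happen to know were written as Latin-1.
raw = b'Caf\xe9 \xa9 2013'

# decode(): bytes -> text, naming the input codec explicitly.
text = raw.decode('latin-1')

# encode(): text -> bytes in the codec the receiving side expects.
utf8_bytes = text.encode('utf-8')

# With 'replace', characters the target codec can't represent become '?'.
ascii_bytes = text.encode('ascii', 'replace')

# With 'xmlcharrefreplace', they become numeric HTML/XML references instead.
html_bytes = text.encode('ascii', 'xmlcharrefreplace')
```

The 'replace' handler loses information (you can't get the original character back from '?'), so 'xmlcharrefreplace' is probably the friendlier choice when the output is headed for HTML anyway.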

3. Use csvkit.

In a different instance, I ran into an issue where I was handling a CSV saved from Excel. I spent a while trying to access the data with DictReader, tinker with it, then spit it back out with DictWriter. Except the data I was spitting out was different - the plain ASCII characters (a-z, A-Z, 0-9 and common punctuation) were fine, but anything outside of that range rendered as weird (Gaelic-looking, maybe?) strings.

After what certainly felt like (and may have been) weeks, I figured out the issue: Excel saves CSVs in Latin-1 encoded format, but Python was trying to print out in UTF-8. I perhaps could have wrangled the unruly CSV into submission using methods 1 and 2 as detailed above, but at some point I stumbled across csvkit.
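For the record, a rough sketch of the do-it-by-hand route might look like the following. This is not the code I actually ended up with: reencode_csv is a hypothetical helper, it assumes Python 3 (where the csv module works on text rather than bytes), and it assumes the source really is Latin-1. Excel on Windows often writes Windows-1252 instead, in which case you'd pass 'cp1252' as the source encoding.

```python
import csv
import io


def reencode_csv(src_path, dst_path, src_encoding='latin-1', dst_encoding='utf-8'):
    """Copy a CSV row by row, decoding with one codec and encoding with another."""
    # newline='' lets the csv module handle line endings itself.
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as src, \
            io.open(dst_path, 'w', encoding=dst_encoding, newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Tinker with the row here if needed, then write it back out.
            writer.writerow(row)
```

The important part is naming both codecs explicitly rather than trusting whatever default your platform picks.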

As the name implies, csvkit is a set of tools for playing with CSVs. csvlook, csvcut and csvstat are all tremendously useful for manipulating CSVs, but in2csv is what takes the cake. With it, you can take a pesky XLSX and convert it to a CSV on the command line like so:

    in2csv pesky.xlsx > farbetterfileformat.csv

And csvkit can be used as a library in your Python code! I strongly recommend taking a look at the library if you deal with Excel-saved files in Python (which, weirdly, we do).

Hope that's all helpful - and good luck with encoding!

