I recently ran into a problem where I was trying to export an Excel (.xlsx) file to a csv (comma-separated value) file. This should be as easy as saving the file as a csv file in Excel, but when Asian characters are involved, this doesn’t work at all (apparently it exports to ASCII only).
The object of the game was to get to Unicode (UTF-8), and several options did come to mind (parsing the XML source, importing into R, etc.), but I wanted to see if there was a simpler solution. I did find a clever workaround on the Salesforce helpdesk, but this wasn’t particularly elegant (in short, this involves saving the file as “Unicode text” and then swapping out the tabs for commas in Notepad). In trying to work with this solution, my experience resulted in a few extra tips:
- Save a duplicate of the Excel master file, and tidy the data to be used beforehand, rather than later on. Use the power of Excel while you still can! Basic things like the columns/rows to include and their order can often be easily adjusted while still in Excel.
- If you attempt to save the file as “Unicode text” (in the file-type menu when saving), the file will be saved in the UTF-16 format (as indicated by the included BOM), meaning that most programs expecting UTF-8 files will have serious problems. This may not be a bad thing if you are okay with UTF-16, but it is critical to know this.
- If you really badly need commas as delimiters, you can do a find-replace command as explained in the article. Otherwise, depending on what you’re doing, the exported txt file is really just a tsv (tab-separated value) file, so provided you can specify the delimiter, this may be a better option, especially if dealing with phrases or numerical figures that may include commas.
- One quick way of producing a UTF-8 tsv or csv file is by uploading the Excel file to Google Docs and exporting it from there. Not the solution I really wanted, but it’s at least a step easier than trying to convert a UTF-16 file to UTF-8.
In summary, there was no direct, easy way of exporting Excel documents to UTF-8 csv files via Excel itself (at least on a Windows version of Excel 2010). Only mildly annoying.