Changed mt2rest to hatena2rest¶
I released mt2rest at 30 March that is tool to convert(GEEK DAY TOKYOに参加しました。). It is to convert Hatena Diary exported into reST format. I had migrated from Hatena Diary to Tinkerer. But it was incomplete conversion with this tool, because I made improvised for that time event.
I rewrote this tool with using Hatena diary XML format instead of MovableType format as input. The two reason is as follows. Firstly it is too hard to distinguish the notation of Hatena Diary written in the code block to separate the entries of daily correctly. Secondary it is also hard to parse distinguish the code of blog-parts when I use the MovableType format as input file. I renamed a project name “hatena2rest. It is now not use MovableType format why “mt” of mt2rest is Abbreviation of “MovableType”.
Use this tool if you will migrate tinkerer from Hatena Diary. Usage is written in README.
I make a note that I had a hard time.
Detection of half-width characters and full-with characters¶
String must has border line as “=” * character length at section, subsection and simple table of Sphinx. Single-byte character is simple, so character length of border equals character length of string. String encoded with unicode is also the same.
>>> len(r"single") 6 >>> len(u"single") 6
But double-byte character is complicated. Character length of string is different with encoded with unicode and raw.
>>> len(r"全角") 6 >>> len(u"全角") 2
“Hankaku kana” is below
>>> len(r"ﾃｽﾄ") 9 >>> len(u"ﾃｽﾄ") 3
“Zenkaku” must be two or more characters, “Hankaku” must be at least one character when Sphinx border character. It is hard to detect with len(). So I used unixcodedata.east_asian_width().
unicodedata module is embedded with Python. east_asian_width() has next 6 values.
- F: Fullwidth
- H: Halfwidth
- W: Wide
- Na: Narrow
- A: Ambiguous
- N: Neutral
According to “WikiPedia”, “A” is processed as 1 or 2 characters, but Sphinx processes as like below probably.
- F: 2 character width
- H: 1 character width
- W: 2 character width
- Na: 1 character width
- A: 1 character witdh
- N: 1 character width
Then next code is enable to get width of string.
def length_str(string) fwa = ['F', 'W', 'A'] hnna = ['H', 'N', 'Na'] if isinstance(string, unicode): zenkaku = len([unicodedata.east_asian_width(c) for c in string if unicodedata.east_asian_width(c) in fwa]) hankaku = len([unicodedata.east_asian_width(c) for c in string if unicodedata.east_asian_width(c) in hnna]) return (zenkaku * 2 + hankaku) elif isinstance(string, str): return len(string)
Exception occurs when use “&” in raw directive of html¶
“&” is disable to use in raw directive of html. Blog parts is converted to html raw directive. Then exception occurs when running build(tinker -b command)..
# Sphinx version: 1.1.3 # Python version: 2.7.3 # Docutils version: 0.8.1 release # Jinja2 version: 2.6 Traceback (most recent call last): File "/usr/lib/pymodules/python2.7/sphinx/cmdline.py", line 189, in main app.build(force_all, filenames) File "/usr/lib/pymodules/python2.7/sphinx/application.py", line 204, in build self.builder.build_update() File "/usr/lib/pymodules/python2.7/sphinx/builders/__init__.py", line 196, in build_update 'out of date' % len(to_build)) File "/usr/lib/pymodules/python2.7/sphinx/builders/__init__.py", line 255, in build self.finish() File "/usr/lib/pymodules/python2.7/sphinx/builders/html.py", line 433, in finish for pagename, context, template in pagelist: File "/usr/lib/python2.7/dist-packages/tinkerer/ext/blog.py", line 85, in html_collect_pages for name, context, template in rss.generate_feed(app): File "/usr/lib/python2.7/dist-packages/tinkerer/ext/rss.py", line 54, in generate_feed app.config.website + post[:11])), File "/usr/lib/python2.7/dist-packages/tinkerer/ext/patch.py", line 91, in patch_links doc = xml.dom.minidom.parseString(in_str) File "/usr/lib/python2.7/xml/dom/minidom.py", line 1930, in parseString return expatbuilder.parseString(string) File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString return builder.parseString(string) File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString parser.Parse(string, True) ExpatError: not well-formed (invalid token): line 70, column 363
This problem is solved with escaping to character entity references, but there is no meaning as hyperlink. So I extracted URI as simple hyperlink.
I spent a lot of regular expression.