Regex golf with calibre

I noticed one of my ebooks is a bit odd. It’s potentially a DRM issue, as I’m converting it through several formats to read on an old (and so far moderately indestructible) nook. It seems that speech marks and apostrophes have been converted to question marks, so you end up with:

?Have you seen the screwdriver?? ?Didn?t I already give you it? Sure it?s not with you?? ?It?s all right, I?m an idiot, it?s right here!?

Gets annoying pretty quickly, even with better dialogue, so regex to the rescue.

First, the apostrophes. We only need to match question marks in the _middle_ of words, not at the end, so that’s nice and easy:

# Search:
(\w)\?(\w)
# Replace:
\1'\2

So, with \w we find a letter (or digit, or underscore, but not a space or a hyphen) followed by a question mark (escaped with the backslash because it has special meaning in regex world), followed by another letter (or digit, or underscore, but again not a space). We remember the letters in two separate match groups by wrapping each in parentheses.
We then replace the three characters with the original first letter we found (\1), an apostrophe in place of the question mark, and then the second letter (\2). Boom: straight away it?s, I?m and what?s go back to being readable.
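
For anyone who wants to try the same search and replace outside calibre, here is a minimal sketch in Python. The `fix_apostrophes` name is made up for this example, and Python's `re` module is assumed to behave the same as calibre's regex search for a pattern this simple.

```python
import re

# Same pattern as the calibre search: a word character, a literal
# question mark, then another word character.
APOSTROPHE_FIX = re.compile(r"(\w)\?(\w)")


def fix_apostrophes(text: str) -> str:
    """Swap a '?' stuck between two word characters for an apostrophe."""
    # \1 and \2 put the captured characters back on either side.
    return APOSTROPHE_FIX.sub(r"\1'\2", text)


sample = "?Didn?t I already give you it? Sure it?s not with you??"
print(fix_apostrophes(sample))
# ?Didn't I already give you it? Sure it's not with you??
```

Question marks at the start or end of a word are left alone, which is why the mangled speech marks are still there afterwards.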

The next bit, the speech marks, is more complicated. But because I used calibre to convert the book in the first place, it’s littered with calibre’s mad class formatting wrapping every paragraph or line, which helps. It’s also badly documented, as it’s 4am and I really should have gone to bed. I realise this fails in a lot of cases, so part 2 to follow.

Testcase:
<span class="calibre6">?How long is it??</span>
# Find:
()\?(.+?)\?()
# Replace:
\1'\2'\3
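
To see what that does to the test case, here is a rough sketch in Python. The first substitution uses the pattern exactly as printed above, empty groups and all. The second is a guess on my part, not something from the original search: it assumes the groups were meant to hold the surrounding span tags, so the closing quote only matches right before `</span>`.

```python
import re

testcase = '<span class="calibre6">?How long is it??</span>'

# Pattern as printed: both outer groups are empty, so the lazy (.+?)
# stops at the first '?' it meets, which is the real question mark.
as_written = re.sub(r"()\?(.+?)\?()", r"\1'\2'\3", testcase)
print(as_written)
# <span class="calibre6">'How long is it'?</span>

# Hypothetical variant (an assumption): anchor on the span tags so the
# closing quote has to sit immediately before </span>.
anchored = re.sub(r"(<span[^>]*>)\?(.+?)\?(</span>)", r"\1'\2'\3", testcase)
print(anchored)
# <span class="calibre6">'How long is it?'</span>
```

Even the anchored version only handles a span that starts and ends with a speech mark, so it is one of the many cases rather than a general fix.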