In the previous post about regular expressions we explained how to express a variable number of characters at a certain position in our regex, and how to write a regex that captures words inside a context (specific characters before and after the word). But we’ve seen that the regex engine considers that context part of the match, and we need to avoid that. Let’s see how to solve that issue.
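To make the problem concrete, here is a minimal sketch in C using POSIX regex.h. It is an assumption on my side, not necessarily the solution the post goes on to describe: POSIX regular expressions have no lookaround, so this sketch isolates the word with a capture group instead. The helper name word_in_brackets and the < > context characters are hypothetical, picked only for illustration.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the word found between '<' and '>' into out.
 * Returns 0 on success, -1 if nothing matches or out is too small. */
int word_in_brackets(const char *text, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];

    /* The parentheses form capture group 1 around the word only. */
    if (regcomp(&re, "<([A-Za-z]+)>", REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, text, 2, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    /* m[0] spans the whole match, context included ("<word>");
     * m[1] spans just the captured word, which is what we want. */
    size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
    if (len + 1 > out_size)
        return -1;

    memcpy(out, text + m[1].rm_so, len);
    out[len] = '\0';
    return 0;
}
```

The whole match in m[0] still contains the brackets, which is exactly the issue described above; only the group in m[1] is free of context.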
In the first post about regular expressions we explained how to write a regex matching specific characters (or character groups/types) at a certain position. That makes it very easy to write a regex that finds an x followed by a white space, followed by a y. But what if we need to find an a followed by four to six decimal digits, followed by a b?
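One way to express "an a, then four to six decimal digits, then a b" is an interval quantifier. The sketch below assumes POSIX extended regular expressions through regex.h; the helper name matches_a_digits_b is hypothetical, and other regex flavors may spell the digit class differently.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text contains an 'a', then four to six
 * decimal digits, then a 'b'; 0 if it does not; -1 if the pattern fails
 * to compile. */
int matches_a_digits_b(const char *text)
{
    regex_t re;

    /* {4,6} means: repeat the preceding item at least 4 and at most 6 times. */
    if (regcomp(&re, "a[0-9]{4,6}b", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this pattern a1234b and a123456b match, while a123b (too few digits) and a1234567b (too many) do not.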
Not kidding! Developers fear the regex. And yeah, I get it. They look ugly. But they are powerful. And they are much harder to read than to write, which is not ideal. But at least you can benefit from writing a regex here and there. Document them in-code with a meaningful comment for your future self and you’ll be fine. Learning to write some basic, simple yet powerful regexes is not impossible, and I’m writing this post to prove that.
The following code defines, tests and illustrates the use of the utf8len() function, a small piece of code for counting characters in a UTF-8 (multibyte) string. Compile this example with GCC by running: $ gcc utf8len.c -lrt -o utf8len The RT library is used for the high-precision clock only; you don’t need to link it if you are using the function itself in your own code. This utf8len() function provides a portable (and small-footprint) way of counting UTF-8 characters in standard C or C++.
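As a rough idea of how such a count can work, here is a minimal sketch. It is not the author's utf8len(); the name utf8len_sketch is hypothetical, and the sketch assumes a valid, NUL-terminated UTF-8 string. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so counting all the other bytes gives the number of code points.

```c
#include <stddef.h>

/* Hypothetical sketch: counts code points in a NUL-terminated UTF-8 string.
 * Assumes the input is valid UTF-8; no validation is performed. */
size_t utf8len_sketch(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Continuation bytes look like 10xxxxxx. Every other byte starts
         * a new code point, so only those are counted. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Note that this counts code points, not user-perceived characters: a letter followed by a combining accent counts as two.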