Regular Expressions

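01 Metacharacters

A regular expression is a rule that describes a set of strings.

Metacharacters are the characters that carry a special meaning inside a regular expression instead of standing for themselves.

Special single characters
- . any character (except a newline)
- \d any digit \D any non-digit
- \w A-Za-z0-9_ \W any character that is not a word character
- \s whitespace \S any non-whitespace character
Whitespace characters
- \r carriage return
- \n line feed (newline)
- \f form feed
- \t tab
- \v vertical tab
Ranges
- | alternation (or)
- [abc] any one of the listed characters
- [a-z] any one character in the range
- [^abc] negation: any single character that is not listed in the brackets
Quantifiers
- * 0 or more times
- + 1 or more times
- ? 0 or 1 time
- {m} exactly m times
- {m,} m or more times
- {m,n} between m and n times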
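
The metacharacters and quantifiers listed above can be tried out directly. The following is a minimal sketch using Python's built-in re module; the sample strings and variable names are made up for illustration.

```python
import re

text = "Order 42 shipped on 2023-03-01,\tpaid 19.99 USD"

digits = re.findall(r"\d+", text)          # + : one or more digits
# ['42', '2023', '03', '01', '19', '99']
four_digits = re.findall(r"\d{4}", text)   # {m} : exactly four digits
# ['2023']
upper_runs = re.findall(r"[A-Z]{2,}", text)  # [A-Z] range, {m,} : two or more
# ['USD']
optional_u = re.findall(r"colou?r", "color colour")  # ? : zero or one "u"
# ['color', 'colour']
punctuation = re.findall(r"[^\w\s]", "a-b, c!")  # negated class
# ['-', ',', '!']
either = re.findall(r"cat|dog", "cat dog bird")  # | : alternation
# ['cat', 'dog']

print(digits, four_digits, upper_runs, optional_u, punctuation, either)
```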

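02 Quantifiers and Greediness

Greedy, e.g. *: match the longest. A greedy quantifier consumes as much of the string as the rule allows, and gives characters back (backtracks) when the rest of the pattern cannot match.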

preg_match_all("/a*/i", "aaabb", $matches);
var_dump($matches);

Lazy (reluctant, non-greedy), e.g. *? or +?: match the shortest. A lazy quantifier consumes as little of the string as possible and extends the match only when the rest of the pattern requires it.

Environment: Python 3

import re
re.findall(r'a*', 'aaabb') # greedy
# ['aaa', '', '', '']
re.findall(r'a*?', 'aaabb') # lazy (result shown is for Python 3.7+)
# ['', 'a', '', 'a', '', 'a', '', '', '']

re.findall(r'".+"', '"the little cat" is a toy, it looks "a little bad"') # greedy
# ['"the little cat" is a toy, it looks "a little bad"']
re.findall(r'".+?"', '"the little cat" is a toy, it looks "a little bad"') # lazy
# ['"the little cat"', '"a little bad"']

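Possessive, e.g. ++: matches the longest, like greedy. A possessive quantifier also consumes as much as it can, but it never gives anything back: if the rest of the pattern then fails, the match attempt fails without backtracking.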

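# Backtracking example:
import re
re.findall(r'xy{1,3}z', 'xyyz') # greedy
# ['xyyz']
# y{1,3} greedily takes "yy", tries a third y, and fails on "z", so it settles for "xyy"
# z then matches the remaining "z"
re.findall(r'xy{1,3}?z', 'xyyz') # lazy
# ['xyyz']
# y{1,3}? takes as little as possible, so it first matches "xy"
# z is tried against "y" and fails, so the engine backtracks
# y{1,3}? extends by one character to "xyy"
# z then matches the remaining "z"

# Possessive example:
# pip install regex -i https://mirrors.aliyun.com/pypi/simple/
import regex
regex.findall(r'xy{1,3}+z', 'xyyz') # possessive
# ['xyyz']
# y{1,3}+ takes as much as possible ("xyy") and keeps it
# z matches the remaining "z"
regex.findall(r'xy{1,3}+yz', 'xyyz') # possessive
# []
# y{1,3}+ takes as much as possible ("xyy") and keeps it
# yz cannot match the remaining "z", and nothing is given back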
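
The possessive behaviour above can also be approximated with the built-in re module alone. The sketch below uses a well-known emulation, a lookahead that captures the run followed by a backreference that consumes it; this is an illustration under that assumption, not how the regex package implements possessive quantifiers. The variable names are made up.

```python
import re

# (?=(y{1,3}))\1 : the lookahead captures the longest run of y, and the
# backreference then consumes exactly that run. The engine does not
# backtrack into a completed lookahead, so the run is never shortened.
possessive_like_z = re.compile(r'x(?=(y{1,3}))\1z')
possessive_like_yz = re.compile(r'x(?=(y{1,3}))\1yz')
greedy_yz = re.compile(r'xy{1,3}yz')  # ordinary greedy version for comparison

kept = possessive_like_z.search('xyyz')
blocked = possessive_like_yz.search('xyyz')
backtracked = greedy_yz.search('xyyz')

print(kept.group(0))         # xyyz
print(blocked)               # None: "yy" is kept, so "yz" cannot match
print(backtracked.group(0))  # xyyz: greedy gives one y back
```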

03 Grouping and References

import regex
# Non-capturing group (?:pattern): groups without saving, so it gets no group number
regex.sub(r'(\d{4})-(?:\d{2})-(\d{2})', r"year: \1  day: \2", '2023-03-01')
# 'year: 2023  day: 01'

# Remove consecutive repeated words
regex.sub(r'(\w+)(\s\1)+', r"\1", 'the little cat cat is in the hat hat hat, we like it.')
# 'the little cat is in the hat, we like it.'
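
To make the numbering effect of a non-capturing group visible, here is a small sketch with the built-in re module; the variable names are illustrative only.

```python
import re

date = '2023-03-01'

# Three capturing groups: year, month, day are numbered 1, 2, 3.
capturing = re.search(r'(\d{4})-(\d{2})-(\d{2})', date)
# The month group is non-capturing, so the day becomes group 2.
non_capturing = re.search(r'(\d{4})-(?:\d{2})-(\d{2})', date)

print(capturing.groups())      # ('2023', '03', '01')
print(non_capturing.groups())  # ('2023', '01')
print(non_capturing.group(2))  # 01
```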

04 Match Modes

Match modes change how certain metacharacters behave.

Case-insensitive mode: written as (?flag), here (?i).

import regex
regex.findall(r"(?i)cat", "cat Cat CAt")
# ['cat', 'Cat', 'CAt']

# https://regex101.com/r/3OUJda/1
# The repeated word must have the same case as the first occurrence
((?i)cat) \1
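
Python's re module expects a global inline flag such as (?i) at the very start of the pattern, so the pattern above does not carry over unchanged. A rough Python equivalent, assuming the scoped form (?i:...) available since Python 3.6, might look like this:

```python
import re

# The flag applies only inside the group; \1 is compared case-sensitively.
same_case = re.compile(r'((?i:cat)) \1')
# The flag applies to the whole pattern, including the backreference.
any_case = re.compile(r'(?i)(cat) \1')

same_case_hit = same_case.search('Cat Cat') is not None
same_case_miss = same_case.search('Cat cat') is not None
any_case_hit = any_case.search('Cat cat') is not None

print(same_case_hit)   # True
print(same_case_miss)  # False
print(any_case_hit)    # True
```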

Dot-all mode (?s) lets the dot . match any character, including a newline. It is equivalent to [\s\S], [\d\D], or [\w\W].

# https://regex101.com/r/zXtwLv/1
# Match everything, including newlines
(?s).+
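
A short Python sketch of the same idea, with a made-up two-line string, shows the difference between the default dot, dot-all mode, and the character-class equivalent:

```python
import re

text = 'first line\nsecond line'

default_dot = re.findall(r'.+', text)      # dot stops at the newline
dotall_dot = re.findall(r'(?s).+', text)   # dot also matches the newline
char_class = re.findall(r'[\s\S]+', text)  # same effect without a flag

print(default_dot)  # ['first line', 'second line']
print(dotall_dot)   # ['first line\nsecond line']
print(char_class)   # ['first line\nsecond line']
```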

Multiline mode (?m) makes ^ and $ match at the start and end of every line, not only at the start and end of the whole string.

# Match line by line
(?m)^cat|dog$
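
The effect of (?m) on this pattern can be checked in Python; the sample text below is invented for the demonstration.

```python
import re

text = 'cat and dog\ncat\ndog'

# Without (?m): ^ only matches at the start of the string and
# $ only at the end (or just before a final newline).
single_line = re.findall(r'^cat|dog$', text)
# With (?m): ^ and $ also match at every line break.
multi_line = re.findall(r'(?m)^cat|dog$', text)

print(single_line)  # ['cat', 'dog']
print(multi_line)   # ['cat', 'dog', 'cat', 'dog']
```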

Comment mode (?#comment): text inside (?#...) is ignored by the engine.

(\w+)(?#word) \1(?#word repeat again)
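
Python's re module accepts the (?#...) form as well, and its re.VERBOSE flag offers a related way to annotate a pattern. The sketch below assumes both are acceptable for the same purpose; the sample sentence is made up.

```python
import re

inline = re.compile(r'(\w+)(?#word) \1(?#word repeated again)')

# In verbose mode, unescaped whitespace and "# ..." comments are ignored,
# so the literal space has to be written as [ ].
verbose = re.compile(r'''
    (\w+)   # a word
    [ ]     # a literal space
    \1      # the same word again
''', re.VERBOSE)

sentence = 'it is is fine'
inline_hit = inline.search(sentence).group(0)
verbose_hit = verbose.search(sentence).group(0)

print(inline_hit)   # is is
print(verbose_hit)  # is is
```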

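05 Assertions

Sometimes a pattern also places requirements on where the text occurs. A construct that matches only a position, not the text itself, is called an assertion.

Boundaries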

import re
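# Word boundary \b
# tom -> jerry; tomorrow is left untouched
re.sub(r'\btom\b', 'jerry', "tom asked me if I would go fishing with him tomorrow.")
# 'jerry asked me if I would go fishing with him tomorrow.'

# Start and end of the string
# \A and \z are not affected by match modes (Python's re writes the end anchor as \Z)
# \A corresponds to ^, \z corresponds to $
re.sub(r'\Atom', 'jerry', "tom asked me if I would go fishing with him tomorrow.")
# 'jerry asked me if I would go fishing with him tomorrow.'

# Lookaround: a left angle bracket means look to the left, no angle bracket means look to the right, and an exclamation mark means negation
# (?<=Y) preceded by Y
# (?<!Y) not preceded by Y
# (?=Y) followed by Y
# (?!Y) not followed by Y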

re.findall(r'[1-9]\d{5}', "138001380002")
# ['138001', '380002']
re.findall(r'(?<!\d)[1-9]\d{5}(?!\d)', "138001380002")
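# not preceded by a digit and not followed by a digit
# []
re.findall(r'(?<!\d)[1-9]\d{5}(?!\d)', "code138001code")
# not preceded by a digit and not followed by a digit
# ['138001']

# \b\w+\b -> (?<!\w)\w+(?!\w); (?<=\W)\w+(?=\W) is close but not equivalent, because it needs an actual non-word character on each side and so misses words at the very start or end of the string
# https://regex101.com/r/PBEKxY/1

# (\w+)(\s+\b\1\b)+
# A word, then one or more repetitions of: one or more whitespace characters followed by the same word with a word boundary on both sides
# Stricter than (\w+)(\s+\1)+, e.g. the little cat cat2 is in the hat hat2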
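
The difference between the two deduplication patterns can be checked on the example sentence. This is a small sketch with the built-in re module; the variable names are illustrative.

```python
import re

text = 'the little cat cat2 is in the hat hat2'

# Without boundaries, "cat cat2" is treated as "cat" repeated, and the 2 is left over.
loose = re.sub(r'(\w+)(\s+\1)+', r'\1', text)
# With \b around the backreference, "cat2" is not accepted as a repeat of "cat".
strict = re.sub(r'(\w+)(\s+\b\1\b)+', r'\1', text)
# Genuine repeats are still collapsed by the strict pattern.
collapsed = re.sub(r'(\w+)(\s+\b\1\b)+', r'\1', 'in the hat hat hat')

print(loose)      # the little cat2 is in the hat2
print(strict)     # the little cat cat2 is in the hat hat2
print(collapsed)  # in the hat
```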

06 Escaping

An escape character changes the meaning of the character that follows it.

import re
re.findall(r'\\d', 'abc\\d123d\\')
# ['\\d']
re.findall('\\', 'a*b+c?\\d123d\\')
# bad escape (end of pattern) at position 0
re.findall('\\\\', 'a*b+c?\\d123d\\')
# ['\\', '\\']
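# From string literal to regular expression there are two layers of escaping: string escaping, then regex escaping
# \\\\ becomes \\ after string escaping
# \\ becomes a literal \ after regex escaping
re.findall(r'\\', 'a*b+c?\\d123d\\')
# ['\\', '\\']
re.findall(r'\(\)\[]\{}', '()[]{}')  # raw string, so Python does not warn about \( as an invalid string escape
# ['()[]{}']
# For square brackets and braces it is usually enough to escape the opening one, but both parentheses must be escaped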

import re
re.escape(r'\d') # the backslash is escaped; the letter d needs no escaping
# '\\\\d'
re.findall(re.escape(r'\d'), r'\d')
# ['\\d']
re.escape('[+]')
# '\\[\\+\\]'
re.findall(re.escape('[+]'), '[+]')
# ['[+]']

import re
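re.findall(r'[^ab]', '^ab')  # unescaped ^ at the start means "not"
# ['^']
re.findall(r'[^cd]', '^ab')
# ['^', 'a', 'b']
re.findall(r'[\^ab]', '^ab')  # escaped, so ^ is an ordinary character
# ['^', 'a', 'b']
re.findall(r'[a-c]', 'abc-')  # hyphen in the middle means a range
# ['a', 'b', 'c']
re.findall(r'[a\-c]', 'abc-')  # hyphen in the middle, escaped
re.findall(r'[-ac]', 'abc-')  # at the start, no escape needed
re.findall(r'[ac-]', 'abc-')  # at the end, no escape needed
# all three return ['a', 'c', '-']
re.findall(r'[]ab]', ']ab')  # unescaped ] in first position
# [']', 'a', 'b']
re.findall(r'[a]b]', ']ab')  # unescaped ] not in first position: the class is [a], followed by a literal b]
# []
re.findall(r'[a\]b]', ']ab')  # escaped, so ] is an ordinary character
# [']', 'a', 'b']
re.findall(r'[.*+?()]', '[.*+?()]')  # single-character metacharacters need no escape inside brackets
# ['.', '*', '+', '?', '(', ')']
re.findall(r'[\d]', 'd12\\')  # \w, \d and the like keep their metacharacter meaning inside brackets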
# ['1', '2']

import re
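re.findall('\n', '\\n\n\\')
# ['\n']  string \n -> a newline character -> the regex matches a newline
re.findall('\\n', '\\n\n\\')
# ['\n']  string \\n -> regex \n -> matches a newline
re.findall('\\\n', '\\n\n\\')
# ['\n']  string \\\n -> backslash + newline character -> matches a newline
re.escape('\n')
# '\\\n'
re.findall('\\\\n', '\\n\n\\')
# ['\\n']  string \\\\n -> regex \\n -> matches a literal backslash followed by n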
re.escape('\\n')
# '\\\\n'

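07 Flavors and Their Features

POSIX (Portable Operating System Interface). \d is not available.
- BRE (Basic Regular Expression): grep, sed. The characters { } ( ) ? | + are literal unless escaped with a backslash (in GNU tools, \?, \+, and \| are extensions).
- ERE (Extended Regular Expression): egrep, grep -E, sed -E.
PCRE (Perl Compatible Regular Expressions). \d, \w, and \s are available. grep -P (GNU sed has no -P option).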

grep --help | grep PATTERN
# PATTERN is, by default, a basic regular expression (BRE).
#   -E, --extended-regexp     PATTERN is an extended regular expression (ERE)
#   -F, --fixed-strings       PATTERN is a set of newline-separated fixed strings
#   -G, --basic-regexp        PATTERN is a basic regular expression (BRE)
#   -P, --perl-regexp         PATTERN is a Perl regular expression

Linux/Unix tools and the POSIX specification for regular expressions | Yu Sheng (余晟)

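08 Handling Unicode Text

Unicode assigns a code point to each character, and a code point has to be encoded as bytes before it can be transmitted or stored. The most common encoding is UTF-8; others include UTF-16 and UTF-32. UTF-8 became popular because of its variable-length design: one Unicode character takes 1 to 4 bytes. Most importantly, UTF-8 is backward compatible with ASCII, so plain English text takes no extra space. Chinese characters usually take three bytes in UTF-8.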
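
The variable-length property described above can be observed in Python 3. The sample characters in this sketch are chosen only to cover each byte length.

```python
# One sample character per UTF-8 length: ASCII letter, Latin letter with
# an accent (U+00E9), a Chinese character (U+5B66), and an emoji (U+1F600).
samples = ['a', '\u00e9', '\u5b66', '\U0001F600']

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
ascii_compatible = 'regex'.encode('utf-8') == 'regex'.encode('ascii')

print(byte_lengths)      # [1, 2, 3, 4]
print(ascii_compatible)  # True
```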

# python2.7
import re
u'极客'.encode('utf-8')
# '\xe6\x9e\x81\xe5\xae\xa2'
u'时间'.encode('utf-8')
# '\xe6\x97\xb6\xe9\x97\xb4'
# both byte strings contain the byte e6

re.search(r'[时间]', '极客') is not None
# True

re.compile(r'[时间]', re.DEBUG)
# in
#   literal 230
#   literal 151
#   literal 182
#   literal 233
#   literal 151
#   literal 180
# <_sre.SRE_Pattern object at 0x10ab44d78>

re.compile(r'[极客]', re.DEBUG)
# in
#   literal 230
#   literal 158
#   literal 129
#   literal 229
#   literal 174
#   literal 162
# <_sre.SRE_Pattern object at 0x10ab44e40>

re.compile(ur'[时间]', re.DEBUG)
# in
#   literal 26102
#   literal 38388
# <_sre.SRE_Pattern object at 0x10ac02710>

re.search(ur'[时间]', '时间') is not None
# False

re.search(ur'[时间]', u'时间') is not None
# True

# python2.7
import re
re.findall(r'^.$', '学')
# []
re.findall(r'^.$', u'学')
# [u'\u5b66']
re.findall(ur'^.$', u'学')
# [u'\u5b66']
print(unichr(0x5B66))
# 学

# python3
import re
re.findall(r'^.$', '学')
# ['学']
re.findall(r'(?a)^.$', '学')
# ['学']
# (?a) enables ASCII mode; the dot still matches any character except a newline
chr(0x5B66)
# '学'

// Matches Chinese (Han) characters in PHP
\p{Han}

# python2.7
import re
re.findall(r'客{3}', '极客客客客')
# []
re.findall(ur'客{3}', '极客客客客')
# []
re.findall(r'客{3}', u'极客客客客')
# []
re.findall(ur'客{3}', u'极客客客客')
# [u'\u5ba2\u5ba2\u5ba2']
re.findall(r'(客){3}', '极客客客客')

# python3
re.findall(r'客{3}', '极客客客客')
# ['客客客']
# In Python 3 the pattern string needs no u prefix, because every str is a Unicode string by default.

Script (Unicode) | wikipedia

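09 Using Regular Expressions in Editors

Column (vertical) editing: on macOS, hold Alt (Option) and drag the mouse vertically.

10 Using Regular Expressions in Programming Languages

Validating text: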

import re
reg = re.compile(r'\A\d{4}-\d{2}-\d{2}\Z')  # compiling once up front is recommended when the pattern is reused
reg.search('2020-06-01') is not None
# True
reg.match('2020-06-01') is not None  # with match, \A can be omitted because match always starts at the beginning
# True

reg = re.compile(r'\d{4}-\d{2}')
reg.findall('2020-05 2020-06')
# ['2020-05', '2020-06']

/^\d{4}-\d{2}-\d{2}$/.test("2020-06-01")
// true
var regex = new RegExp(/^\d{4}-\d{2}-\d{2}$/)
regex.test("2020-01-01")
// true
var regex = /^\d{4}-\d{2}-\d{2}$/
"2020-06-01".search(regex)
// 0

$regex = '/^\d{4}-\d{2}-\d{2}$/';
$ret = preg_match($regex, "2020-06-01");
var_dump($ret);
// int(1)
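
One reason the Python example anchors with \A and \Z instead of ^ and $ is that $ in Python also matches just before a trailing newline. The sketch below illustrates this and shows re.fullmatch as another way to validate a whole string; the variable names are made up.

```python
import re

dollar = re.compile(r'^\d{4}-\d{2}-\d{2}$')
strict = re.compile(r'\A\d{4}-\d{2}-\d{2}\Z')
bare = re.compile(r'\d{4}-\d{2}-\d{2}')

with_newline = '2020-06-01\n'

dollar_ok = dollar.search(with_newline) is not None   # $ tolerates the final \n
strict_ok = strict.search(with_newline) is not None   # \Z does not
full_ok = bare.fullmatch(with_newline) is not None    # fullmatch must cover everything
full_clean = bare.fullmatch('2020-06-01') is not None

print(dollar_ok, strict_ok, full_ok, full_clean)  # True False False True
```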

Extracting text:

import re
# Without groups
reg = re.compile(r'\d{4}-\d{2}')
reg.findall('2020-05 2020-06')
# ['2020-05', '2020-06']

# With groups
reg = re.compile(r'(\d{4})-(\d{2})')
reg.findall('2020-05 2020-06')
# [('2020', '05'), ('2020', '06')]

reg = re.compile(r'(\d{4})-(\d{2})')
for match in reg.finditer('2020-05 2020-06'):
    print('date: ', match[0])  # the whole match
    print('year: ', match[1])  # first group
    print('month:', match[2])  # second group
# date:  2020-05
# year:  2020
# month: 05
# date:  2020-06
# year:  2020
# month: 06

// With the g flag, find every match
"2020-06 2020-07".match(/\d{4}-\d{2}/g)
// ['2020-06', '2020-07']

// Without the g flag, matching stops at the first match
"2020-06 2020-07".match(/\d{4}-\d{2}/)
// ['2020-06', index: 0, input: '2020-06 2020-07', groups: undefined]

$regex = "/\d{4}-\d{2}/";
$str = "2020-05 2020-04";
$matchs = [];
preg_match_all($regex, $str, $matchs, PREG_SET_ORDER);
var_dump($matchs);
// array(2) {
//   [0] =>
//   array(1) {
//     [0] =>
//     string(7) "2020-05"
//   }
//   [1] =>
//   array(1) {
//     [0] =>
//     string(7) "2020-04"
//   }
// }

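// PREG_PATTERN_ORDER: $matches[0] holds all matches of the full pattern, $matches[1] holds all matches of the first group, and so on.
// PREG_SET_ORDER: $matches[0] holds everything from the first match (full match and groups), $matches[1] holds everything from the second match, and so on.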

Replacing text:

reg = re.compile(r'(\d{2})-(\d{2})-(\d{4})')
reg.sub(r'\3年\1月\2日', '02-20-2020 05-21-2020')
# '2020年02月20日 2020年05月21日'

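# \g<number> can be used in the replacement; it avoids ambiguity when a group reference is followed by a digit or when there are 10 or more groups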
reg.sub(r'\g<3>年\g<1>月\g<2>日', '02-20-2020 05-21-2020')
# '2020年02月20日 2020年05月21日'

# Also return the number of replacements
reg.subn(r'\3年\1月\2日', '02-20-2020 05-21-2020')
# ('2020年02月20日 2020年05月21日', 2)

// With the g flag, replace every match
"02-20-2020 05-21-2020".replace(/(\d{2})-(\d{2})-(\d{4})/g, "$3年$1月$2日")
// "2020年02月20日 2020年05月21日"

// Without the g flag, only the first match is replaced
"02-20-2020 05-21-2020".replace(/(\d{2})-(\d{2})-(\d{4})/, "$3年$1月$2日")
// "2020年02月20日 05-21-2020"

$ret = preg_replace('/(\d{2})-(\d{2})-(\d{4})/', '\3年\1月\2日', "02-20-2020 05-21-2020");
var_dump($ret);
// string(35) "2020年02月20日 2020年05月21日"

Splitting text:

reg = re.compile(r'\W+')
reg.split("apple, pear! orange; tea")
# ['apple', 'pear', 'orange', 'tea']

# Limit the number of splits, e.g. one cut gives two parts
reg.split("apple, pear! orange; tea", 1)
# ['apple', 'pear! orange; tea']

"apple, pear! orange; tea".split(/\W+/)
// ["apple", "pear", "orange", "tea"]

// With a second argument, which limits the number of items returned
"apple, pear! orange; tea".split(/\W+/, 1)
// ["apple"]
"apple, pear! orange; tea".split(/\W+/, 2)
// ["apple", "pear"]
"apple, pear! orange; tea".split(/\W+/, 10)
// ["apple", "pear", "orange", "tea"]

$ret = preg_split('/\W+/', 'apple, pear! orange; tea');
var_dump($ret);
// array(4) {
//   [0] =>
//   string(5) "apple"
//   [1] =>
//   string(4) "pear"
//   [2] =>
//   string(6) "orange"
//   [3] =>
//   string(3) "tea"
// }
$ret = preg_split('/\W+/', 'apple, pear! orange; tea', 2);
var_dump($ret);
// array(2) {
//   [0] =>
//   string(5) "apple"
//   [1] =>
//   string(17) "pear! orange; tea"
// }

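11 Matching Principles and Optimization Guidelines

Backtracking itself is not the problem; the goal is to minimize the work the engine has to redo after it backtracks.

import re
x = '-' * 1000000 + 'abc'
%timeit re.search('abc', x)  # timeit is an IPython/Jupyter magic command

Compile the regular expression ahead of time.
Describe what to match as precisely as possible: to match the content between quotes, rewrite .+? as [^"]+.
Factor out common parts: (abcd|abxy) => ab(cd|xy), (^this|^that) => ^th(is|at).
Put the more likely alternative on the left: \.(?:com|net)\b.
Use capturing groups only when needed: add ?: inside parentheses that are only used for grouping.
Beware of nested repetition: the number of attempts for (.*)* can grow exponentially, so avoid writing patterns like this.
Avoid alternatives that can match the same text.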
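
Two of the rewrites listed above can be checked for equivalence on a sample input. This is only a sketch with an invented test string; it compares results, not speed, and the two quote patterns are not interchangeable in general (for instance, [^"]+ can cross line breaks while .+? cannot by default).

```python
import re

text = 'say "hello" and "bye", visit abcd.com or abxy.net'

# Precompile once and reuse.
lazy_quotes = re.compile(r'"(.+?)"')
negated_quotes = re.compile(r'"([^"]+)"')   # states exactly what may appear inside
unfactored = re.compile(r'abcd|abxy')
factored = re.compile(r'ab(?:cd|xy)')       # common prefix pulled out, no capture

quoted_lazy = lazy_quotes.findall(text)
quoted_negated = negated_quotes.findall(text)
words_unfactored = unfactored.findall(text)
words_factored = factored.findall(text)

print(quoted_lazy, quoted_negated)       # ['hello', 'bye'] ['hello', 'bye']
print(words_unfactored, words_factored)  # ['abcd', 'abxy'] ['abcd', 'abxy']
```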

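An NFA engine is regex-directed: it walks through the regular expression and checks the text against it. A DFA engine is text-directed: it walks through the text and tracks which parts of the expression can still match. A POSIX NFA is an NFA engine that conforms to the POSIX standard; it keeps backtracking to make sure it finds the leftmost-longest match.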
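
Python's re module is a backtracking, regex-directed engine, so the order of alternatives decides the result. The sketch below shows this with a made-up input; a POSIX leftmost-longest engine would be expected to report posix for either ordering.

```python
import re

# The first alternative that leads to a match wins, even if a later
# alternative would match more text.
short_first = re.search(r'pos|posix', 'posix').group(0)
long_first = re.search(r'posix|pos', 'posix').group(0)

print(short_first)  # pos
print(long_first)   # posix
```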

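12 Common Problems

import re
re.match(r'^(?:(?!\d\d)\w){6}$', '11abcd') # no match
# The negative lookahead (?!...) rules out two consecutive digits
# (?!...) matches a position where the given condition does not hold
re.match(r'^(?:\w(?!\d\d)){6}$', '11abcd') # incorrect pattern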
# <re.Match object; span=(0, 6), match='11abcd'>
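# The lookahead runs only after a character has been consumed, so the leading "11" is never checked
# after the first 1, the lookahead sees "1a": ok
# after the second 1, the lookahead sees "ab": ok, and so on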

Optional sign, up to two decimal places, trailing zeros in the fraction allowed (Regulex): ^[-+]?\d+(?:\.(?:\d){0,2}0*)?$
Mobile phone number: 1(?:3\d|4[5-9]|5[0-35-9]|6[2567]|7[0-8]|8\d|9[1389])\d{8}
ID card number: [1-9]\d{14}(\d\d[0-9Xx])?
Postal code: (?<!\d)\d{6}(?!\d)
Chinese characters: [\u4E00-\u9FFF] \p{Han}
Email: [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
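
A few of the patterns above can be exercised in Python 3 as a quick sanity check. The inputs are invented, and the sketch makes no claim that the patterns cover every real-world case.

```python
import re

DECIMAL = re.compile(r'^[-+]?\d+(?:\.(?:\d){0,2}0*)?$')
POSTAL_CODE = re.compile(r'(?<!\d)\d{6}(?!\d)')
HAN = re.compile(r'[\u4E00-\u9FFF]')

decimal_ok = [s for s in ['-12.50', '3.1400', '3.141', 'abc'] if DECIMAL.match(s)]
# The 11-digit phone number contains no 6-digit run with non-digits on both sides.
postal_codes = POSTAL_CODE.findall('zip 100084, phone 13800138000')
han_chars = HAN.findall('regex \u6b63\u5219 101')

print(decimal_ok)    # ['-12.50', '3.1400']
print(postal_codes)  # ['100084']
print(han_chars)     # ['正', '则']
```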

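Understanding regular expressions from a programming-language perspective

The worldview of imperative programming: a program is an ordered list of action instructions.
The methodology of imperative programming: store data in variables and execute instructions with statements.
The worldview of declarative programming: a program is an ordered list of target tasks.
The methodology of declarative programming: describe tasks with syntax elements and let an engine translate them into instructions and execute them.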
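
The contrast can be made concrete with one small task, checking the YYYY-MM-DD shape of a string. Both functions below are hypothetical illustrations written for this comparison: the first spells out the steps, the second only describes the target and leaves the steps to the regex engine.

```python
import re

def is_date_imperative(s: str) -> bool:
    """Check the YYYY-MM-DD shape step by step, one character at a time."""
    if len(s) != 10:
        return False
    for i, ch in enumerate(s):
        if i in (4, 7):
            if ch != '-':
                return False
        elif ch not in '0123456789':
            return False
    return True

# re.ASCII keeps \d limited to 0-9, matching the imperative version.
DATE = re.compile(r'\d{4}-\d{2}-\d{2}', re.ASCII)

def is_date_declarative(s: str) -> bool:
    """Describe the shape and let the regex engine do the checking."""
    return DATE.fullmatch(s) is not None

for candidate in ['2020-06-01', '2020/06/01', '2020-6-1']:
    print(candidate, is_date_imperative(candidate), is_date_declarative(candidate))
```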

References

Mastering Regular Expressions, 3rd Edition (《精通正则表达式（第三版）》)
《正则指引（第二版）》 (Regular Expression Guide, 2nd Edition)
