DESCRIPTION
   POSIX Regex Compiling
       regcomp() is used to compile a regular expression into a form that is
       suitable for subsequent regexec() searches.

       regcomp() is supplied with preg, a pointer to a pattern buffer storage
       area; regex, a pointer to the null-terminated string and cflags, flags
       used to determine the type of compilation.

       All regular expression searching must be done via a compiled pattern
       buffer, thus regexec() must always be supplied with the address of a
       regcomp() initialized pattern buffer.

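"preg, a pointer to a pattern buffer storage area" tells us that we have to
allocate the space for the preg object ourselves, and then pass its address
(that is, preg) to regcomp. A man page will not say outright "allocate the space
yourself and then pass it to me"; that would be too silly. You have to work out
for yourself what it is really trying to tell you.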
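
To make the point about caller-owned storage concrete, here is a minimal sketch. The helper name matches() is made up for illustration; only regcomp(), regexec() and regfree() come from the man page. The regex_t lives on the caller's stack and only its address is handed to the library.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of the regex API): returns 1 if text
 * matches pattern, 0 if it does not, and -1 if the pattern fails to
 * compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;  /* the pattern buffer: storage owned by us, the caller */

    /* Pass the ADDRESS of our buffer; regcomp() fills it in. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* regexec() must be given the same, already compiled buffer. */
    int rc = regexec(&re, text, 0, NULL, 0);

    /* Releases what regcomp() allocated inside re, not re itself. */
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling matches("^ab+c$", "abbbc") should report a match. Note that regfree() does not free the regex_t object itself, which is consistent with the caller having allocated it.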
       cflags may be the bitwise-or of one or more of the following:

       REG_EXTENDED
              Use POSIX Extended Regular Expression syntax when interpreting
              regex. If not set, POSIX Basic Regular Expression syntax is
              used.

       REG_ICASE
              Do not differentiate case. Subsequent regexec() searches using
              this pattern buffer will be case insensitive.

       REG_NOSUB
              Support for substring addressing of matches is not required.
              The nmatch and pmatch parameters to regexec() are ignored if the
              pattern buffer supplied was compiled with this flag set.

       REG_NEWLINE
              Match-any-character operators don’t match a newline.

              A non-matching list ([^...]) not containing a newline does not
              match a newline.

              Match-beginning-of-line operator (^) matches the empty string
              immediately after a newline, regardless of whether eflags, the
              execution flags of regexec(), contains REG_NOTBOL.

              Match-end-of-line operator ($) matches the empty string
              immediately before a newline, regardless of whether eflags
              contains REG_NOTEOL.

   POSIX Regex Matching
       regexec() is used to match a null-terminated string against the
       precompiled pattern buffer, preg. nmatch and pmatch are used to provide
       information regarding the location of any matches. eflags may be the
       bitwise-or of one or both of REG_NOTBOL and REG_NOTEOL which cause
       changes in matching behavior described below.

       REG_NOTBOL
              The match-beginning-of-line operator always fails to match (but
              see the compilation flag REG_NEWLINE above). This flag may be
              used when different portions of a string are passed to regexec()
              and the beginning of the string should not be interpreted as the
              beginning of the line.

       REG_NOTEOL
              The match-end-of-line operator always fails to match (but see
              the compilation flag REG_NEWLINE above).

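
The flags described in this excerpt can be tried out with a small sketch like the one below. The helper match_with_flags() is a made-up name for illustration; the flag constants are the ones listed in the man page. It compiles with the given cflags, executes with the given eflags, and reports the outcome.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern with cflags, run it against text
 * with eflags. Returns 1 on a match, 0 on no match, -1 if the pattern
 * does not compile. */
int match_with_flags(const char *pattern, int cflags,
                     const char *text, int eflags)
{
    regex_t re;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    /* nmatch is 0, so no match positions are requested. */
    int rc = regexec(&re, text, 0, NULL, eflags);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

A few calls worth trying: with REG_ICASE the pattern "HELLO" should match "say hello"; "ab+c" should match "abbc" only when REG_EXTENDED is set, because + is an ordinary character in basic syntax; and "^world" against "world" should fail under REG_NOTBOL unless the pattern was compiled with REG_NEWLINE and the text contains a newline just before "world".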
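As guessed earlier, since cflags and eflags do not share a name, they must take
different sets of values, and those values are usually combined with bitwise-or.
This article is about how to read and understand a man page, not about the
technique itself, so I will not explain in detail what each flag does. But let's
do a few more exercises in guessing abbreviations. This helps with understanding
and even more with remembering the flags; once you have memorized the common
ones, you do not have to check the manual every time you use them.

REG_ICASE: ICASE means ignore case, a very common abbreviation. REG_NOSUB: SUB
sometimes means substitute and sometimes substring; here it means substring.
REG_NOTBOL: at first glance it is not obvious what BOL is, but look at its
counterpart REG_NOTEOL. From experience we already know that EOF is end of file,
so EOL should be end of line, and correspondingly BOL should be beginning of
line.

Author: linux_Ultra  Time: 2009-6-30 09:36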
   BYTE OFFSETS
       Unless REG_NOSUB was set for the compilation of the pattern buffer, it
       is possible to obtain substring match addressing information. pmatch
       must be dimensioned to have at least nmatch elements. These are filled
       in by regexec() with substring match addresses. Any unused structure
       elements will contain the value -1.

       The regmatch_t structure which is the type of pmatch is defined in
       <regex.h>.

              typedef struct {
                  regoff_t rm_so;
                  regoff_t rm_eo;
              } regmatch_t;

       Each rm_so element that is not -1 indicates the start offset of the
       next largest substring match within the string. The relative rm_eo
       element indicates the end offset of the match.

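Sure enough. We guessed earlier that regmatch_t objects hold the position of a
match, and that after regexec returns, the unused trailing elements of the
regmatch_t array must be marked with some special value; that value is -1. The
position information consists of a start and an end, and one more guess tells us
that rm_so means regmatch start offset and rm_eo means regmatch end offset. You
need this kind of sensitivity: rm_so and rm_eo are identical except for the s and
the e, and an s/e pair expressing opposite concepts means start and end, which
is very common in program code. Another very common convention is that struct
member names carry a prefix that abbreviates the struct name, such as rm_ for
regmatch here.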
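
Here is a minimal sketch of how rm_so and rm_eo are typically used. The helper extract_group() and the constant MAX_GROUPS are made up for illustration; the regmatch_t fields are the ones from the man page. Element 0 of pmatch describes the whole match and element n describes the n-th parenthesized group.

```c
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 8  /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: copies capture group `group` of the first match of
 * pattern in text into out. Returns 0 on success, -1 otherwise. */
int extract_group(const char *pattern, const char *text,
                  size_t group, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];  /* at least nmatch elements, as required */
    int result = -1;

    if (group >= MAX_GROUPS || out_size == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* An unused element has rm_so == -1, so check before using it. */
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[group].rm_so != -1) {
        /* rm_eo is the offset just past the last matched byte. */
        size_t len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (len < out_size) {
            memcpy(out, text + pm[group].rm_so, len);
            out[len] = '\0';
            result = 0;
        }
    }
    regfree(&re);
    return result;
}
```

With the pattern "([0-9]+)-([0-9]+)" and the text "range 10-25 ok", group 0 should come back as "10-25", group 1 as "10" and group 2 as "25". Asking for group 3 returns -1 because that element is unused.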
   POSIX Error Reporting
       regerror() is used to turn the error codes that can be returned by both
       regcomp() and regexec() into error message strings.

       regerror() is passed the error code, errcode, the pattern buffer, preg,
       a pointer to a character string buffer, errbuf, and the size of the
       string buffer, errbuf_size. It returns the size of the errbuf required
       to contain the null-terminated error message string. If both errbuf
       and errbuf_size are nonzero, errbuf is filled in with the first
       errbuf_size - 1 characters of the error message and a terminating null.

   POSIX Pattern Buffer Freeing
       Supplying regfree() with a precompiled pattern buffer, preg, will free
       the memory allocated to the pattern buffer by the compiling process,
       regcomp().

RETURN VALUE
       regcomp() returns zero for a successful compilation or an error code
       for failure.

       regexec() returns zero for a successful match or REG_NOMATCH for
       failure.

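
The sentence about regerror() returning the required size suggests a two-call pattern: ask for the size first, then fetch the message. The sketch below assumes that reading. The helper regex_error_message() is a made-up name, and using malloc() for the buffer is just one possible choice.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc()ed, null-terminated message for
 * errcode, or NULL if allocation fails. The caller must free() it. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    /* First call: with a zero-sized buffer nothing is written, but the
     * return value is the size needed, terminating null included. */
    size_t needed = regerror(errcode, preg, NULL, 0);

    char *msg = malloc(needed);
    if (msg != NULL)
        regerror(errcode, preg, msg, needed);  /* second call fills it */
    return msg;
}
```

It can be fed any nonzero value returned by regcomp(), or the REG_NOMATCH returned by regexec(). The exact wording of the message is up to the implementation, so the sketch makes no assumption about it.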
ERRORS
       The following errors can be returned by regcomp():

       REG_BADBR
              Invalid use of back reference operator.

       REG_BADPAT
              Invalid use of pattern operators such as group or list.

       REG_BADRPT
              Invalid use of repetition operators such as using '*' as the
              first character.

       REG_EBRACE
              Un-matched brace interval operators.

       REG_EBRACK
              Un-matched bracket list operators.

       REG_ECOLLATE
              Invalid collating element.

       REG_ECTYPE
              Unknown character class name.

       REG_EEND
              Non specific error. This is not defined by POSIX.2.

       REG_EESCAPE
              Trailing backslash.

       REG_EPAREN
              Un-matched parenthesis group operators.

       REG_ERANGE
              Invalid use of the range operator, e.g., the ending point of the
              range occurs prior to the starting point.

       REG_ESIZE
              Compiled regular expression requires a pattern buffer larger
              than 64Kb. This is not defined by POSIX.2.

       REG_ESPACE
              The regex routines ran out of memory.

       REG_ESUBREG
              Invalid back reference to a subexpression.

CONFORMING TO
       POSIX.1-2001.

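
As a rough sketch of how these codes show up in a program: the helper names try_compile() and regcomp_error_name() below are made up, and the switch covers only a handful of the codes listed above. Which code a given bad pattern produces can vary between implementations, so the sketch does not promise a particular one.

```c
#include <regex.h>

/* Hypothetical helper: returns regcomp()'s result for pattern, which is
 * 0 on success or one of the REG_* error codes on failure. */
int try_compile(const char *pattern, int cflags)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    if (rc == 0)
        regfree(&re);  /* only a successfully compiled buffer is freed */
    return rc;
}

/* Hypothetical helper: maps a few of the error codes to their names. */
const char *regcomp_error_name(int errcode)
{
    switch (errcode) {
    case 0:           return "success";
    case REG_BADBR:   return "REG_BADBR";
    case REG_BADPAT:  return "REG_BADPAT";
    case REG_BADRPT:  return "REG_BADRPT";
    case REG_EBRACE:  return "REG_EBRACE";
    case REG_EBRACK:  return "REG_EBRACK";
    case REG_EPAREN:  return "REG_EPAREN";
    case REG_ERANGE:  return "REG_ERANGE";
    case REG_ESPACE:  return "REG_ESPACE";
    case REG_ESUBREG: return "REG_ESUBREG";
    default:          return "other error";
    }
}
```

For example, regcomp_error_name(try_compile("[abc", REG_EXTENDED)) would be expected to name a bracket-related error, most likely REG_EBRACK, while a valid pattern such as "a+" yields "success". For a human-readable message, regerror() is still the right tool.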
SEE ALSO
       grep(1), regex(7), GNU regex manual

COLOPHON
       This page is part of release 2.77 of the Linux man-pages project. A
       description of the project, and information about reporting bugs, can
       be found at http://www.kernel.org/doc/man-pages/.

GNU                               1998-05-08                          REGEX(3)

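The most valuable part of this last section of the man page is SEE ALSO. Each
man page sticks to its own topic and does not digress, so sometimes you need to
read several related man pages together and build an overview from a series of
related topics. Some man pages have a BUGS section, which is also very
important. The classic example is gets(3): after describing at length what the
function is for, it says in the BUGS section, "Never use gets()". If you miss
that one sentence, everything you read before it was wasted.

Author: 宇宙飛船  Time: 2009-6-30 09:43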
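I'll post some English reading material in a little while too, also about the
GNU tools. These are the tools of the trade that electrical engineers make
their living with.

Author: qupeng2008  Time: 2009-6-30 09:51
Bumping this, even though I can't follow it ~O(∩_∩)O~

Author: linux_Ultra  Time: 2009-6-30 09:52
The author of the post also has a Linux programming book released under the GNU
Free Documentation License. It has some commercial motive behind it, but it is
still worth a look: http://djkings.javaeye.com/blog/218542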