Saturday, January 9, 2010

Python Regular expression pattern special sequences

In Table 1-18, "C" is any character, "R" is any regular expression form in the left column of the table, and "m" and "n" are integers. Each form usually consumes as much of the string being matched as possible, except for the nongreedy forms (which consume as little as possible, as long as the entire pattern still matches the target string).

Table 1-18. Regular expression pattern syntax

Form

Description

.

Matches any character (including newline if DOTALL flag is specified).

^

Matches start of string (of every line in MULTILINE mode).

$

Matches end of string (of every line in MULTILINE mode).

C

Any nonspecial character matches itself.

R*

Zero or more occurrences of preceding regular expression R (as many as possible).

R+

One or more occurrences of preceding regular expression R (as many as possible).

R?

Zero or one occurrence of preceding regular expression R.

R{m,n}

Matches from m to n repetitions of preceding regular expression R.

R*?, R+?, R??, R{m,n}?

Same as *, +, ?, and {m,n} but matches as few characters/times as possible; nongreedy.

[...]

Defines character set; e.g., [a-zA-Z] matches all letters (also see Table 1-19).

[^...]

Defines complemented character set: matches if character is not in set.

\

Escapes special characters (e.g., *?+|( )) and introduces special sequences (see Table 1-19). Due to Python string rules, write the backslash as \\ in a normal string literal, or use a raw string (e.g., '\\d' or r'\d').

\\

Matches a literal \; due to Python string rules, write as \\\\ in pattern, or r'\\'.

R|R

Alternative: matches left or right R.

RR

Concatenation: matches both Rs.

(R)

Matches any RE inside ( ), and delimits a group (retains matched substring).

(?: R)

Same as (R) but doesn't delimit a group.

(?= R)

Look-ahead assertion: matches if R matches next, but doesn't consume any of the string (e.g., X(?=Y) matches X only if followed by Y).

(?! R)

Negative look-ahead assertion: matches if R doesn't match next. Negative of (?=R).

(?P<name> R)

Matches any RE inside ( ) and delimits a named group (e.g., r'(?P<id>[a-zA-Z_]\w*)' defines a group named id).

(?P=name)

Matches whatever text was matched by the earlier group named name.

(?<= R)

Positive look-behind assertion: matches if preceded by a match of fixed-width R.

(?<! R)

Negative look-behind assertion: matches if not preceded by a match of fixed-width R.

(?#...)

A comment; ignored.

(?letter)

letter is one of "i", "L", "m", "s", "x", or "u". Set flag (re.I, re.L, etc.) for entire RE.
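
A few of the forms in Table 1-18 are easier to see in action. The following is a minimal sketch using Python's standard re module; the sample strings and variable names are made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* consumes as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.*>", text).group()

# Nongreedy: .*? consumes as little as possible, stopping at the first '>'.
lazy = re.search(r"<.*?>", text).group()

# Named group: (?P<id>...) captures an identifier under the name "id".
m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "total_2 = 5")
ident = m.group("id")

# Look-ahead: match "X" only when "Y" follows; the "Y" is not consumed.
ahead = re.findall(r"X(?=Y)", "XY XZ XY")

# Look-behind: match digits only when they are preceded by a "$".
behind = re.findall(r"(?<=\$)\d+", "cost $45, qty 3")

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
print(ident)   # total_2
print(ahead)   # ['X', 'X']
print(behind)  # ['45']
```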

In Table 1-19, \b, \B, \d, \D, \s, \S, \w, and \W behave differently depending on flags: if LOCALE (?L) is used, they depend on the current 8-bit locale; if UNICODE (?u) is used, they depend on the Unicode character properties; if neither flag is used, they assume 7-bit U.S. ASCII. Tip: use raw strings (r'\n') to literalize backslashes in Table 1-19 class escapes.

Table 1-19. Regular expression pattern special sequences

Sequence

Description

\num

Matches text of the group num (numbered from 1)

\A

Matches only at the start of the string

\b

Empty string at word boundaries

\B

Empty string not at word boundary

\d

Any decimal digit (like [0-9])

\D

Any nondigit character (like [^0-9])

\s

Any whitespace character (like [ \t\n\r\f\v])

\S

Any non-whitespace character (like [^ \t\n\r\f\v])

\w

Any alphanumeric character or underscore (like [a-zA-Z0-9_])

\W

Any character that is not alphanumeric or underscore (like [^a-zA-Z0-9_])

\Z

Matches only at the end of the string
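
The sequences in Table 1-19 can be tried out with a short script. This is only a sketch with invented sample strings; it assumes Python's standard re module and uses raw strings throughout so the backslashes reach the regex engine untouched.

```python
import re

# \d and \w match digits and word characters respectively.
digits = re.findall(r"\d+", "room 42, floor 7")
words = re.findall(r"\w+", "snake_case, CamelCase!")

# \b matches the empty string at a word boundary, so "concat" is skipped.
whole = re.findall(r"\bcat\b", "cat concat cat.")

# \1 matches the same text as group 1: here, a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)

# \A anchors to the very start of the string even in MULTILINE mode,
# whereas ^ also matches at the start of every line.
starts = re.findall(r"\A\w+", "one\ntwo", re.M)
carets = re.findall(r"^\w+", "one\ntwo", re.M)

print(digits)   # ['42', '7']
print(words)    # ['snake_case', 'CamelCase']
print(whole)    # ['cat', 'cat']
print(doubled)  # is
print(starts)   # ['one']
print(carets)   # ['one', 'two']
```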



*************** FROM PERL (common) *************

my $str = "Hello saju how are you"; # Declare and initialize the variable '$str'.

## Finding ##
$str =~ /pattern/ # Equivalent to function 'regmatch(pattern,$str)'.
Example: $str =~ /saju/ # Search for the pattern 'saju' in the variable '$str'.

## Finding and substituting or replacing ##
$str =~ s/pattern/replacement/ # Equivalent to function 'regsubstitute(pattern,replacement,$str)'.
Example: $str =~ s/saju/sanu/ # Find the first 'saju' in the variable '$str' and replace it with 'sanu'.
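
For comparison, here is a rough Python counterpart of the two Perl operations above. It is a sketch, not an exact translation: it assumes Python's re module, and count=1 is passed because Perl's s/// without the /g modifier replaces only the first match.

```python
import re

s = "Hello saju how are you"

# Perl: $str =~ /saju/  -> re.search scans the string for the pattern.
found = re.search(r"saju", s) is not None

# Perl: $str =~ s/saju/sanu/  -> replace only the first match.
# Python strings are immutable, so re.sub returns a new string.
replaced = re.sub(r"saju", "sanu", s, count=1)

print(found)     # True
print(replaced)  # Hello sanu how are you
```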

***************************************

# A regular expression is written between / ... /.
# Eleven characters (or character pairs) have a special meaning between / ... /:
[], {}, (), *, +, ?, ., \, ^, $, |
# All other characters between / ... / simply match themselves.

# \ ---> A backslash removes the special meaning of a special character between / ... /.
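
The same escaping rule applies in Python. The sketch below uses an invented sample string and shows both escaping by hand and re.escape, a standard-library helper that escapes a whole literal string.

```python
import re

price = "Total: $5.00 (approx.)"

# Unescaped, '$' is an anchor and '.' matches any character.
# A backslash in front of each makes it match literally.
literal = re.search(r"\$5\.00", price).group()

# re.escape builds the escaped pattern for an arbitrary literal string.
pattern = re.escape("(approx.)")
escaped = re.search(pattern, price).group()

print(literal)  # $5.00
print(escaped)  # (approx.)
```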

**************************************************************
### Character Classes ###

[aeiou] ---> Match any one of these characters a,e,i,o,u.
Example: /p[aeiou]t/ matches pat,pet,pit,pot,put.
Example: /p[^aeiou]t/ does not match pat, pet, pit, pot, put, but it matches pbt, pct, pdt, and so on.

## character class shortcuts ##

[0123456789] ----> \d
[abc...xyzABC...XYZ0123456789_] ----> \w # Word characters (letters, digits, underscore); excludes whitespace.
[ \n\r\t\f] ----> \s # Whitespace character class.

[^0123456789] ----> \D
[^abc...xyzABC...XYZ0123456789_] ----> \W
[^ \n\r\t\f] ----> \S # Opposite of the whitespace character class.

. ---> The 'dot' shortcut matches any single character except a newline (whitespace, abc...xyzABC...XYZ0123456789_, punctuation). It is widely used.

## character class shortcuts Examples ##

/\d\d-\d\d-\d\d\d\d/ ---> 12-11-2009
/\d\d\/\d\d\/\d\d\d\d/ ---> 11/10/2009
#or
m{\d\d/\d\d/\d\d\d\d} ---> 11/10/2009 # Using m{...} instead of /.../ as the delimiter avoids having to escape each '/' with a backslash '\'.
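
The character class examples above can be checked in Python as follows. This is a sketch with an invented word list; re.fullmatch is used so that each whole word has to match. Python patterns have no / delimiter, so the slashes in the date need no escaping.

```python
import re

words = ["pat", "pet", "pit", "pot", "put", "pbt", "pct"]

# [aeiou] matches exactly one vowel; [^aeiou] matches one non-vowel.
vowel = [w for w in words if re.fullmatch(r"p[aeiou]t", w)]
non_vowel = [w for w in words if re.fullmatch(r"p[^aeiou]t", w)]

# \d is shorthand for [0123456789].
dashed = re.search(r"\d\d-\d\d-\d\d\d\d", "due 12-11-2009").group()
slashed = re.search(r"\d\d/\d\d/\d\d\d\d", "due 11/10/2009").group()

print(vowel)      # ['pat', 'pet', 'pit', 'pot', 'put']
print(non_vowel)  # ['pbt', 'pct']
print(dashed)     # 12-11-2009
print(slashed)    # 11/10/2009
```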

***************************************
### Quantifiers ###

# Normally, we match exactly one thing (a literal character or character class).
# Put a quantifier after a thing (a literal character or character class) to change that.
# We can say we want to match
1) Zero or one thing (literal character or character class). ---> ?
2) Zero or more things (literal character or character class). ---> *
3) One or more things (literal character or character class). ---> +

## Zero or One thing --> ? ##

/sa?ju/ ----> Matches 'saju', 'sju'. Here the literal character 'a' matches zero or one time. Does not match 'saaju'.

/s[aj]?u/ ----> Matches 'sju', 'sau', 'su'. Does not match 'saju'. Here a character from the character class '[aj]' matches zero or one time.

## Zero or More thing --> * ##

/sa*ju/ ----> Matches 'saju', 'sju', 'saaaju', etc. Here literal character 'a' matches Zero or More times.

/s[aj]*u/ ----> Matches 'saaaju', 'sjjju', 'saaajjju', 'su', 'saju', 'sajajaajju'. Here characters from the character class '[aj]' match zero or more times.

## One or More thing --> + ##

/sa+ju/ ----> Matches 'saju', 'saaaju'. Does not match 'sju'. Here the literal character 'a' must match one or more times.

/s[aj]+u/ ----> Matches 'saju', 'saaaju', 'sajjju', 'saaajjju', 'sajajaajju'. Does not match 'su'. Here characters from the character class '[aj]' must match one or more times.

## Without Quantifiers ##

/(s[aj]u)/ ----> Matches 'sau', 'sju'. Does not match 'saju', 'su'.
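
The quantifier examples above can be verified with a small Python sketch. The helper name matches and the candidate list are made up for illustration, and re.fullmatch is used so that each candidate must match in its entirety.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

names = ["sju", "saju", "saaju", "su", "sau"]

optional = matches(r"sa?ju", names)   # 'a' zero or one time
star = matches(r"sa*ju", names)       # 'a' zero or more times
plus = matches(r"sa+ju", names)       # 'a' one or more times
klass = matches(r"s[aj]?u", names)    # one of 'a'/'j', zero or one time

print(optional)  # ['sju', 'saju']
print(star)      # ['sju', 'saju', 'saaju']
print(plus)      # ['saju', 'saaju']
print(klass)     # ['sju', 'su', 'sau']
```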

***************************************
### Anchors ###

\A ---> Matches only at the beginning of the text. It must be the first thing in the regular expression.

\Z ---> Matches only at the end of the text, or just before a newline at the end. It must be the last thing in the regular expression.

Examples:

/\A\d+/ ---> Line starts with digits.

/\s\d{5}\Z/ ----> Line ends with a whitespace character and 5 digits.
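
A Python sketch of the same anchor patterns follows, using invented sample strings. One difference worth knowing: Python's \Z matches only at the very end of the string, so the Perl behaviour of also matching just before a final newline corresponds to Python's $ instead.

```python
import re

# \A: the digits must be at the very start of the string.
leading = re.search(r"\A\d+", "2009 was the year") is not None
not_leading = re.search(r"\A\d+", "year 2009") is not None

# \Z: a whitespace character and five digits must end the string.
trailing = re.search(r"\s\d{5}\Z", "Springfield 12345") is not None

# With a trailing newline, Python's \Z fails but $ still matches.
strict = re.search(r"\d{5}\Z", "12345\n") is not None
lenient = re.search(r"\d{5}$", "12345\n") is not None

print(leading, not_leading)  # True False
print(trailing)              # True
print(strict, lenient)       # False True
```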

***************************************
### Capturing Groups ###

# Parentheses '()' capture whatever they match in $1, $2, and so on.
# Count opening parentheses from left to right to get the corresponding $n.

Example:

/\A(\w+),\s+([A-Z][A-Z])\s+(\d{5})\Z/ then,

print "City: $1 , State: $2 , Zip: $3 ";
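
In Python the captured groups are read from the match object rather than from $1, $2, $3. The sketch below reuses the pattern above; the input line is a hypothetical example, not taken from the original notes.

```python
import re

line = "Portland, OR 97201"  # hypothetical input line

m = re.search(r"\A(\w+),\s+([A-Z][A-Z])\s+(\d{5})\Z", line)
if m:
    # m.group(1), m.group(2), m.group(3) correspond to Perl's $1, $2, $3.
    city, state, zip_code = m.groups()
    print(f"City: {city} , State: {state} , Zip: {zip_code}")
```

Note that (\w+) only matches a one-word city name; a name containing a space would need a wider pattern.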

***************************************

