Regular expressions
1.1 Basic syntax
The chart below reviews the basics of regular expressions.
Single character | Quantifier (count) | Position |
---|---|---|
\d matches a digit | * 0 or more | ^ beginning of line |
\w matches a word character (letter, digit, or underscore) | + 1 or more | $ end of line |
\W matches a non-word character | ? 0 or 1 (optional) | \b word boundary |
\s matches whitespace (space, tab, etc.) | {min,max} between min and max occurrences | |
\S matches non-whitespace | {n} exactly n occurrences | |
. matches any character | | |
1.1.1. single char
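Suppose you have a passage of text that begins as follows:
These are some phone numbers 915-555-1234…
\w
matches every word character (letters, digits, and the underscore), but not characters such as ( ) or -.
\w\w\w
matches three word characters in a row. In the passage it finds The, are, som, pho, num, ber, and so on. Note that a regular expression matches a continuous run of characters, not whole words: it matches the three-letter word are, the first three letters of longer words, and the six-letter number as two separate matches.
\s\s
matches two consecutive whitespace characters.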
quantifiers
Suppose we have this passage:
The colors of the rainbow have many colours
and the rainbow does not have a single colour.
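We want to find all the color words: colors, colours, and colour.
Answer: colou?rs?
Each ? makes the character before it optional, so the u and the final s may each be present or absent. It looks simple and easy.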
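As a quick check, the pattern can be tried in JavaScript on the passage above. This is only a sketch: the variable names are made up, and the g flag (covered later, in the JavaScript section) is assumed so that every occurrence is returned.

```javascript
// Sample passage copied from the text above.
var passage = "The colors of the rainbow have many colours\n" +
              "and the rainbow does not have a single colour.";

// The g flag asks match() for every occurrence, not just the first.
var found = passage.match(/colou?rs?/g);
console.log(found); // ["colors", "colours", "colour"]
```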
Now suppose you want to match 4 digits in a row, or 5 letters in a row, and so on. This is where the counting quantifiers come in handy.
Say we are looking for words with 5 letters:
\w{5}
Is that right? Not quite. Look at what it matches in the earlier passage: These, phone, and numbe (the first five letters of numbers). The pattern is very simple: it only looks for 5 consecutive word characters. So let's refine it:
\w{5}\s
The idea is 5 word characters followed by a whitespace character, so that the match ends where a word ends. But look at the matches: These, phone, and also mbers (the last five letters of numbers), each with its trailing space. With the tools covered so far this cannot be done, so we need a third tool: position.
1.1.2. position
Before returning to the earlier question, get familiar with ^, $, and \b. The sample text for this section is:
This is somthing
is about
a blah
words
sequence of words
Hello and
GoodBye and
Go gogo!
Take a look at what the various rules match
\w+
There should be no doubt that this one matches all the words
^\w+
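The extra ^ anchors the match to the start of a line, so (with multiline matching) only the word at the beginning of each line is matched:
This
is
a
words
sequence
Hello
GoodBye
Go
\w+$
This matches the last word of each line, provided the line ends with a word character. The last line ends with !, so nothing matches there.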
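The two anchors can be tried side by side in JavaScript. This is a sketch under two assumptions: the sample text is laid out as the eight lines shown above, and the g and m flags (explained in the JavaScript section) are used so that ^ and $ apply to every line.

```javascript
// Assumed line layout of the sample text from this section.
var text = [
  "This is somthing",
  "is about",
  "a blah",
  "words",
  "sequence of words",
  "Hello and",
  "GoodBye and",
  "Go gogo!"
].join("\n");

// m makes ^ and $ match at every line break; g returns all matches.
var firstWords = text.match(/^\w+/gm);
var lastWords = text.match(/\w+$/gm);

console.log(firstWords); // ["This", "is", "a", "words", "sequence", "Hello", "GoodBye", "Go"]
console.log(lastWords);  // ["somthing", "about", "blah", "words", "words", "and", "and"]
```

The last line contributes nothing to lastWords because it ends with !, which is not a word character.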
Back to the earlier question: finding words with exactly 5 letters.
This becomes very simple with the word boundary \b.
The answer: \b\w{5}\b
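The three attempts at "words with 5 letters" can be compared directly. A minimal sketch, assuming the sample sentence below (taken from the passage used earlier) and the g flag so that all matches are listed:

```javascript
var sample = "These are some phone numbers 915-555-1234";

// Five word characters anywhere: also cuts into longer words.
console.log(sample.match(/\w{5}/g));     // ["These", "phone", "numbe"]

// Five word characters plus whitespace: still matches the tail of "numbers".
console.log(sample.match(/\w{5}\s/g));   // ["These ", "phone ", "mbers "]

// Word boundaries on both sides: only whole five-letter words.
console.log(sample.match(/\b\w{5}\b/g)); // ["These", "phone"]
```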
1.1.3 Get a phone number.
Finally, look for a phone number like the one that came up earlier: 123-456-1231
The most basic pattern is \d{3}-\d{3}-\d{4}, and that finds it. But sometimes the phone number is written as 123.456.1234 or (212)867-4233. How do we handle those forms?
That requires "or" logic in the regular expression, which is described below.
1.2 Character classes
The previous section covered the most basic tools. Next comes the character class [].
This syntax expresses "or" logic: for example, [abc] means a or b or c, and [-.] means the symbol - or the symbol . (note that inside [] the . stands for a literal dot, whereas outside it means "match any character". So to match a literal . outside [], you have to escape it as \.)
1.2.1 Simple applications of character classes
Character sequence:
The lynk is quite a link don't you think? l nk l(nk
Regular expression: l[yi (]nk
Matches:
lynk link l nk l(nk
This is easy to understand: the class expresses "or" logic, so the second character may be y, i, a space, or (.
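The same check can be run in JavaScript. A small sketch (variable names are illustrative; the g flag is assumed so that every match is returned):

```javascript
var line = "The lynk is quite a link don't you think? l nk l(nk";

// [yi (] accepts exactly one of: y, i, space, or an opening parenthesis.
var hits = line.match(/l[yi (]nk/g);
console.log(hits); // ["lynk", "link", "l nk", "l(nk"]
```

Note that think is not matched, because the character before ink is h rather than l.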
1.2.2. match all possible phone numbers
OK, now back to the problem left over from earlier. Given the following text, match all possible phone numbers:
These are some phone numbers 915-134-3122. Also,
you can call me at 643.123.1333 and of course,
I'm always reachable at (212)867-5509
Step by step: we just used \d{3}-\d{3}-\d{4} to match the hyphenated case. Now we can easily add the . case to it.
Step 1: \d{3}[-.]\d{3}[-.]\d{4}
Step 2: To match the parentheses as well, put ? after the opening parenthesis, since it is optional, and add ) to the first separator class. You end up with
\(?\d{3}[-.)]\d{3}[-.]\d{4}
Note again that inside [], special characters do not need to be escaped and can be used directly, as in [.()], but outside [] they must be escaped as \(, \., etc.
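Here is a sketch of the final pattern applied to the three-line sample text in JavaScript. The variable names are made up for illustration, and the g flag is assumed so that all numbers are returned.

```javascript
var text = "These are some phone numbers 915-134-3122. Also,\n" +
           "you can call me at 643.123.1333 and of course,\n" +
           "I'm always reachable at (212)867-5509";

// Optional "(", area code, one of - . ), three digits, - or ., four digits.
var phonePattern = /\(?\d{3}[-.)]\d{3}[-.]\d{4}/g;
var phones = text.match(phonePattern);
console.log(phones); // ["915-134-3122", "643.123.1333", "(212)867-5509"]
```

The pattern is deliberately loose: it would also accept an odd mix such as (212-867.5509, which is usually acceptable for extraction but not for strict validation.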
1.2.3. Special syntax of []
The simplest and most basic uses have just been described, but there are some special points to note.
- The hyphen - as the first character
For example, [-.] means the hyphen - or the dot . itself. However, when the hyphen is not the first character, as in [a-z], it denotes a range: any character from a to z.
- ^ in []
Earlier, ^ meant the beginning of a line, but inside [] it has a different meaning. [ab] means a or b; [^ab] means any character except a or b, that is, the inverse of [ab].
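The three cases can be seen in a few lines of JavaScript. The input strings are invented for this sketch.

```javascript
// [-.] : the hyphen comes first, so it is a literal hyphen (or a dot).
var parts = "2024-01.15".split(/[-.]/);
console.log(parts); // ["2024", "01", "15"]

// [a-z] : a range, any lowercase letter from a to z.
var lower = "Hello World".match(/[a-z]/g).join("");
console.log(lower); // "elloorld"

// [^ab] : negation, any character that is not a or b.
var onlyAB = "abcabd".replace(/[^ab]/g, "");
console.log(onlyAB); // "abab"
```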
1.2.4. [] and ()
In addition to [], "or" logic can also be expressed with (). The usage is (a|b) for a or b.
For example, the following pattern matches all of these emails:
[email protected]
[email protected]
[email protected]
The first thing to think about is exactly what needs to be matched. Here it is, step by step:
Step 1: one or more word characters at the start: \w+
Step 2: immediately followed by an @ symbol: \w+@
Step 3: followed by one or more word characters: \w+@\w+
Step 4: followed by a dot: \w+@\w+\.
Step 5: followed by com, net, or edu: \w+@\w+\.(com|net|edu)
Note the \. escape in step 4.
This matches all the mailboxes above. But there is still a problem: the username part can contain a ., such as [email protected]
The fix is simple: [\w.]+@\w+\.(com|net|edu)
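To see the difference between the two patterns, here is a sketch in JavaScript. The addresses are hypothetical and made up for this example, and ^ and $ are added (an assumption beyond the patterns in the text) so that each pattern has to cover the whole string.

```javascript
// Hypothetical addresses, invented for this sketch.
var addresses = [
  "jane@example.com",
  "john.smith@example.edu",
  "bob@example.org"
];

// Anchored versions of the two patterns from the text.
var simple = /^\w+@\w+\.(com|net|edu)$/;
var withDots = /^[\w.]+@\w+\.(com|net|edu)$/;

addresses.forEach(function (address) {
  console.log(address, simple.test(address), withDots.test(address));
});
// jane@example.com true true
// john.smith@example.edu false true
// bob@example.org false false
```

The last address fails both patterns only because org is not in the (com|net|edu) list.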
1.2.5 Summary
- [] expresses "or" logic over single characters: it matches any one character in the set, as in [-.(]
- A hyphen - placed first in [] means the hyphen itself; placed between two characters it means "from… to…". For example, [a-z] means a through z
- Special symbols inside [], as in [.)], stand for themselves and do not need to be escaped
- ^ at the start of [], as in [^ab], means "not": any character except a and b
- (a|b) also expresses a choice, but it is more powerful
So what is the powerful feature of ()? Capturing groups, which help with substitution and with swapping parts of a match. They are covered in the next section.
1.3. capturing groups
What is a capturing group? Let's go back to the earlier phone number example:
212-555-1234
915-412-1333
👇👇👇👇👇👇👇👇👇👇👇👇
212-xxx-xxxx
915-xxx-xxxx
With the earlier pattern \d{3}-\d{3}-\d{4}, the whole phone number is matched as a single group. We call the whole match, such as 212-555-1234, Group0.
If we now add a pair of parentheses, \d{3}-(\d{3})-\d{4}, then the part matching 555 is called Group1. By analogy, with two pairs of parentheses, \d{3}-(\d{3})-(\d{4}), the groups are:
212-555-1234 Group0
555 Group1
1234 Group2
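In JavaScript, the group numbering shows up as array indexes on the result of match (without the g flag). A minimal sketch:

```javascript
var m = "212-555-1234".match(/\d{3}-(\d{3})-(\d{4})/);

console.log(m[0]); // "212-555-1234"  (Group0, the whole match)
console.log(m[1]); // "555"           (Group1)
console.log(m[2]); // "1234"          (Group2)
```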
1.3.1 Selection of groups
Now that the groups are defined, how do we refer to the groups that were matched?
There are two ways. The first uses the $ symbol, such as $1 for 555 and $2 for 1234; the second uses \, such as \1 for 555. They are used in different situations. Let's start with $.
To fulfill the original requirement (keep the area code and mask the rest), we can do this:
reg: \(?(\d{3})[-.)]\d{3}[-.]\d{4}
replace: $1-xxx-xxxx
PS: You could do this directly with the JS replace function, but regular expressions are not exclusive to JS, so the general method is introduced first and the JS specifics are summarized later.
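For reference, the same reg/replace pair can be written in JavaScript as follows. This is a sketch: the input string is assembled from the two sample numbers above, and the g flag is assumed so that both lines are processed.

```javascript
var numbers = "212-555-1234\n915-412-1333";

// $1 in the replacement string stands for the first captured group (the area code).
var masked = numbers.replace(/\(?(\d{3})[-.)]\d{3}[-.]\d{4}/g, "$1-xxx-xxxx");
console.log(masked);
// 212-xxx-xxxx
// 915-xxx-xxxx
```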
1.3.2 Scenario-based training
Now there is a list of names, but the last name and first name are in the wrong order and need to be swapped:
shiffina, Daniel
shifafl, Daniell
shquer, Danny
...
Solution:
reg: (\w+),\s(\w+)
replace: $2 $1
Note: $0 is the entire match, so the first parenthesized group is $1.
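A sketch of the swap in JavaScript, using the three sample names above (the g flag is assumed so that every line is rewritten):

```javascript
var names = "shiffina, Daniel\nshifafl, Daniell\nshquer, Danny";

// $2 is the first name, $1 the last name.
var swapped = names.replace(/(\w+),\s(\w+)/g, "$2 $1");
console.log(swapped);
// Daniel shiffina
// Daniell shifafl
// Danny shquer

// In JavaScript replacement strings, the whole match is written $& rather than $0.
var bracketed = "Danny".replace(/\w+/, "[$&]");
console.log(bracketed); // "[Danny]"
```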
Next exercise: match link tags in markdown and replace them with html tags.
[google](http://google.com)
[itp](http://itp.nyu.edu)
[Coding Rainbow](http://codingrainbow.com)
Answer: This problem has a pitfall, so take it slowly.
The first thing I considered was matching the [google] part, and I immediately thought of the regular expression \[.*\]. This is a big pitfall. For the three lines above it does match correctly. But suppose a line contains two bracketed items, such as [google] and [test].
In that case the match on that line runs all the way from the first [ to the last ], without distinguishing [google] from [test]. The reason is that the quantifier * is greedy: .* grabs as much as it can, including any ], and the match only ends at the last ] in the line.
So for it to match correctly, the greedy behavior needs to be turned off. This is done with ?. When ? is placed after a quantifier, the quantifier becomes non-greedy (lazy): it matches as little as possible and stops as soon as the rest of the pattern can match.
\[.*?\]
This way, [google] and [test] are matched separately.
To finish the whole task:
reg: \[(.*?)\]\((http.*?)\)
replace: <a href="$2">$1</a>
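Both the greedy pitfall and the finished replacement can be reproduced in JavaScript. A sketch, assuming a made-up line with two bracketed items for the greedy case and the three sample links for the final step:

```javascript
// Hypothetical line with two bracketed items.
var line = "[google] and [test]";
var greedy = line.match(/\[.*\]/g);
var lazy = line.match(/\[.*?\]/g);
console.log(greedy); // ["[google] and [test]"]
console.log(lazy);   // ["[google]", "[test]"]

var markdown = "[google](http://google.com)\n" +
               "[itp](http://itp.nyu.edu)\n" +
               "[Coding Rainbow](http://codingrainbow.com)";

// $1 is the link text, $2 the URL.
var html = markdown.replace(/\[(.*?)\]\((http.*?)\)/g, '<a href="$2">$1</a>');
console.log(html);
// <a href="http://google.com">google</a>
// <a href="http://itp.nyu.edu">itp</a>
// <a href="http://codingrainbow.com">Coding Rainbow</a>
```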
1.3.3. Using the \ selector
The $ selector refers to a group in the replacement string. Inside the regular expression itself, use \ to refer to a group instead. Consider the following scenario:
This is is a a dog , I think think this is is really
a a good good dog. Don't you you thinks so so ?
We want to match repeated words such as is is and so so, so we use the expression (\w+)\s\1
It almost works, but there is a small bug: in the first sentence, This is is a, it matches the is at the end of This together with the following is, rather than the repeated word is is. The fix is the word boundary \b from the first section, which gives \b(\w+)\s\1\b
That completes the job. The results are not shown here; try it yourself.
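For readers who want to check the result, here is a sketch in JavaScript. The short string "This is a test" is an invented example that isolates the false positive; the long text is the sample from this section.

```javascript
var loose = /(\w+)\s\1/g;
var strict = /\b(\w+)\s\1\b/g;

// Without \b, the "is" inside "This" pairs up with the following "is".
var falsePositive = "This is a test".match(loose);
var noMatch = "This is a test".match(strict);
console.log(falsePositive); // ["is is"]
console.log(noMatch);       // null

var text = "This is is a a dog , I think think this is is really\n" +
           "a a good good dog. Don't you you thinks so so ?";
var repeats = text.match(strict);
console.log(repeats);
// ["is is", "a a", "think think", "is is", "a a", "good good", "you you", "so so"]
```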
1.3.4 Summary
- Capturing groups: use () to group parts of a match. Number 0 is the entire match; the parenthesized groups start at number 1
- A group can be selected with $1 or \1, in different situations: $ is used in the replacement string, \ inside the regular expression itself
- The ? symbol, placed after a quantifier as in .*?, turns off greedy matching: the match stops at the first point where the rest of the pattern can match. Otherwise it keeps matching as far as possible
1.4. in JavaScript
In JS, regular expressions are mainly used when working with strings.
var str = "hello"
var r = /\w+/
These are the literal forms for creating a string and a regular expression, respectively. The methods r.test(), str.match(), and str.replace() are used to work with regular expressions.
1.4.1. reg.test()
A regular expression has its own test method. It only checks whether the string contains a match and returns a boolean.
var r = /\d{3}/;
var a = '123';
var b = '123ABC';
var c = 'abc';
r.test(a) //true
r.test(b) //true
r.test(c) //false
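Because test only checks for containment, '123ABC' passes above. If the whole string must consist of exactly three digits, the position anchors from the first section can be added. A small sketch with illustrative variable names:

```javascript
var contains = /\d{3}/;   // three digits anywhere in the string
var exact = /^\d{3}$/;    // the whole string is exactly three digits

console.log(contains.test("123ABC")); // true
console.log(exact.test("123ABC"));    // false
console.log(exact.test("123"));       // true
```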
This one is simple and not used much in practice, so the rest of this section focuses on the string methods.
1.4.2. str.match()
Unlike test(), match() does not just return a boolean; it returns what was matched.
var r = /compus/
var reg = /\w+/
var s = "compus, I know something about you"
r.test(s) //true
s.match(r) //["compus"]
s.match(reg) //["compus"]
Wait, something looks wrong. Why does the last call return only "compus" when \w+ should match every word?
In fact, match() returns only the first match by default. To get all the matches, you need to use one of the regular expression flags in JS.
1.4.2.1. flag
A flag is set when the regular expression is created. There are three main ones:
flag | meaning |
---|---|
g | global: find all matches, not just the first |
i | ignore case |
m | multiline: ^ and $ match at the start and end of each line |
So to solve the problem, just create the regular expression like this:
var reg = /\w+/g
Look at the following exercise
var str = "Here is a Phone Number 111-2313 and 133-2311"
var r = /\d{3}[-.]\d{4}/
var rg = /\d{3}[-.]\d{4}/g
console.log(str.match(r)); //["111-2313"]
console.log(str.match(rg));//["111-2313","133-2311"]
Finding phone numbers this way is convenient. But there is another question: we talked about groups earlier, so does match return the groups?
var sr = /(\d{3})[-.]\d{4}/
var srg = /(\d{3})[-.]\d{4}/g
console.log(str.match(sr)); //["111-2313","111"]
console.log(str.match(srg)); //["111-2313","133-2311"]
So the conclusion is: with the global flag g, match does not return the groups, only all the matched results; without g, it returns the first match followed by its groups, as an array.
So how do you get the groups for every match?
1.4.3. reg.exec()
Literally, the "execute" method of a regular expression. With the g flag, this method can walk through all the matches and return the groups for each one.
Each call to reg.exec() returns one match, as an array containing the match and its groups. The next call returns the next match, and so on until it returns null.
var str = "Here is a Phone Number 111-2313 and 133-2311" ;
var srg = /(\d{3})[-.]\d{4}/g;
var result = srg.exec(str);
while (result !== null) {
    console.log(result);
    result = srg.exec(str);
}
The result contains more than meets the eye. It is an array with extra properties; for example, the second call returns:
["133-2311", "133", index: 36,
input: "Here is a Phone Number 111-2313 and 133-2311", groups: undefined]
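The loop above can be wrapped in a small helper that collects every match together with its first group and position. This is only a sketch: collectMatches is a hypothetical name, not a built-in, and it assumes the pattern has at least one group.

```javascript
// Hypothetical helper: gather the match, first group, and index for every match.
function collectMatches(pattern, text) {
  if (!pattern.global) {
    // Without g, exec() would return the first match forever.
    throw new Error("pattern needs the g flag");
  }
  var results = [];
  var result;
  pattern.lastIndex = 0; // start from the beginning, in case the regex was used before
  while ((result = pattern.exec(text)) !== null) {
    results.push({ match: result[0], group: result[1], index: result.index });
  }
  return results;
}

var str = "Here is a Phone Number 111-2313 and 133-2311";
var found = collectMatches(/(\d{3})[-.]\d{4}/g, str);
console.log(found);
// [ { match: "111-2313", group: "111", index: 23 },
//   { match: "133-2311", group: "133", index: 36 } ]
```

The guard on the g flag matters: exec only advances through the string when the regular expression is global, so a non-global pattern would loop endlessly.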
1.4.4. str.split
Now for some more powerful functions, starting with splitting. split breaks a string apart at a given separator. For example, take the following string, which needs to be split into words.
var s = "unicorns and rainbows And, Cupcakes"
The first thing that comes to mind when splitting into words is to separate them by spaces, so this can be done in the following way
var result = s.split(' ');
var result1 = s.split(/\s/);
//["unicorns", "and", "rainbows", "And,", "Cupcakes"]
That doesn't show the power of regular expressions and, more importantly, it doesn't meet the requirement, because one of the items is still "And,". So let's use a regular expression that splits on a comma or a whitespace character:
result = s.split(/[,\s]/);
//["unicorns", "and", "rainbows", "And", "", "Cupcakes"]
The result is still not what is needed, because there is an extra "". We don't want to split on the , and the following space separately; together they should count as one separator. Adding a + to the character class changes it to /[,\s]+/, which means one or more commas or whitespace characters in a row:
result = s.split(/[,\s]+/);
// ["unicorns", "and", "rainbows", "And", "Cupcakes"]
1.4.4.1. word segmentation
To expand on that, a regular expression that splits a paragraph into words is
result = s.split(/[,.!?\s]+/)
Of course, there is an even easier way:
result = s.split(/\W+/);
Next, to separate a paragraph into sentences (or clauses), one workable expression is
result = s.split(/[.,!?]+/)
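One detail worth knowing when splitting on \W+: if the text ends with punctuation, the result ends with an empty string. The sketch below uses an invented sentence and shows one way (an assumption, not the only option) to drop empty items with filter.

```javascript
var sentence = "Hello, world! How are you?";

var raw = sentence.split(/\W+/);
console.log(raw); // ["Hello", "world", "How", "are", "you", ""]

// Drop empty strings left over from leading or trailing separators.
var words = raw.filter(function (word) {
  return word.length > 0;
});
console.log(words); // ["Hello", "world", "How", "are", "you"]
```

Also note that \W treats an apostrophe as a separator, so a word like What's is split into What and s.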
Finally, there is one more small requirement: split into sentences while keeping the separators.
var s = "Hello,My name is Vincent. Nice to Meet you!What's your name? Haha."
One small point to remember: if you want to keep the separators, wrap the separator pattern in a capturing group
var result = s.split(/([.,!?]+)/)
//["Hello", ",", "My name is Vincent", ".", " Nice to Meet you", "!", "What's your name", "?", " Haha", ".", ""]
As you can see, this stores the separators as well.
1.4.5. str.replace()
replace is also a string method. Its basic usage is str.replace(reg, replace|function): the first parameter is a regular expression describing what to match, and the second is the replacement string or a callback function.
Note that replace does not modify the original string; it returns a new string. Also, a regular expression without the g flag only matches and replaces the first occurrence, just like match.
1.4.5.1 Simplest substitution
Double every vowel (aeiou) in a string, e.g. x->xx
var s = "Hello,My name is Vincent."
var result = s.replace(/([aeiou])/g,"$1$1")
//"Heelloo,My naamee iis Viinceent."
Note that $1 is only expanded when the second argument is a string, as in "$1$1" here; and be careful not to forget the g flag.
1.4.5.2. Here come the awesome function parameters!
This is the most powerful part: passing a function as the second parameter. Let's look at the simplest example first
var s = "Hello,My name is Vincent. What is your name?"
var newStr = s.replace(/\b\w{4}\b/g,replacer)
console.log(newStr)
function replacer(match) {
    console.log(match);
    return match.toUpperCase();
}
/*
name
What
your
name
Hello,My NAME is Vincent. WHAT is YOUR NAME?
*/
So the function receives the matched text as its parameter and returns the replacement text. That covers the basic usage. What about the groups discussed earlier? How are they passed in?
function replacer(match,group1,group2) {
    console.log(group1);
    console.log(group2);
    return match; // the return value is used as the replacement
}
If the regular expression contains groups, then the second and third arguments of the callback are group1 and group2, and so on. With this you can do a lot of useful things!
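As one possible use of the group arguments, the sketch below masks phone numbers while keeping the area code and the last two digits. The input string and the masking rule are invented for illustration.

```javascript
var contacts = "212-555-1234 and 915-412-1333";

var masked = contacts.replace(
  /(\d{3})-(\d{3})-(\d{4})/g,
  function (match, area, middle, last) {
    // area, middle, last are groups 1, 2, and 3 of the current match.
    return area + "-" + "x".repeat(middle.length) + "-xx" + last.slice(2);
  }
);
console.log(masked); // "212-xxx-xx34 and 915-xxx-xx33"
```

A plain replacement string could not do this, because slicing part of a group requires code rather than a fixed template.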
1.4.5.3 Comprehensive exercise questions
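- Find the character that occurs most often in a string and count its occurrences
var s = 'aaabbbcccaaabbbaaa';
var a = s.split('').sort().join(""); //"aaaaaaaaabbbbbbccc"
var ans = a.match(/(\w)\1*/g); // runs of the same character: ["aaaaaaaaa", "bbbbbb", "ccc"]
ans.sort(function(a,b) {
    return a.length - b.length; // shortest run first, longest last
})
var longest = ans[ans.length-1];
console.log('ans is : ' + longest[0] + ', count : ' + longest.length) // ans is : a, count : 9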
1.4.6 Summary
In JS, the regular expression literal /reg/ and the string literal "str" create regular expressions and strings.
- A regular expression has two methods, reg.test() and reg.exec(). reg.test(str) returns a boolean indicating whether there was a match; reg.exec(str) works somewhat like an iterator, returning the next match and its groups on each call (with the g flag) until it returns null.
- The three main string methods are str.match(reg), str.split(reg), and str.replace(reg, str|function).
- match: if the regular expression does not have the g flag, it returns the first match and its groups; if it has the g flag, it returns all matches without the groups.
- split: mainly used for splitting strings; remember to wrap the separator pattern in parentheses if you want to keep the separators.
- replace: the most powerful method. When a callback function is used, its return value is the replacement; its parameters are match, group1, group2, …