A common thread in coding, writing, formatting, and almost anything else done on a computer is the need to search for information related to whatever you are working on. Whether it lives in a database, a list, a document, a book, a script, or code, there will come a time when you need to find something. The usual approach is a simple CTRL + F to search for text you already know is there. But what if you have a database full of information about how users use your site? What if the exported information contains sensitive material such as emails, passwords, or addresses? You probably won't know every address you need to remove, so a simple find won't work. This is where a regex, or "regular expression", steps in: a pattern that describes the text you are searching for.
In this demonstration, we will examine a regex that searches specifically for an email address. The key parts of an email address are a run of characters, followed by the @ symbol, followed by the domain characters, followed by the . symbol, and finished off with additional characters such as com or org. A common regex to identify this pattern is the following: /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
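To see the whole pattern at work before breaking it into pieces, here is a minimal sketch in Python. Python is my choice for illustration, not something the pattern depends on, and the names EMAIL_PATTERN and is_email are made up for this example. Note that Python's re module takes the pattern without the surrounding / delimiters.

```python
import re

# The tutorial's pattern, written without the surrounding / delimiters.
EMAIL_PATTERN = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")


def is_email(text):
    """Return True when the whole string matches the email pattern (hypothetical helper)."""
    return EMAIL_PATTERN.match(text) is not None


print(is_email("test@email.com"))  # True
print(is_email("test.email.com"))  # False, there is no @
print(is_email("test@email"))      # False, no . followed by 2 to 6 characters
```

The sections that follow explain why each of these results comes out the way it does.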
A common component of a regex is what is known as an anchor. There are two characters that are considered anchors:
^ and $
The ^ symbol refers to the start of a string or line. For example, if I wrote ^benjamin and searched a document with that regex, I would be looking for benjamin at the start of a string or line.
The $ symbol refers to the end of a string or line. For example, if I wrote min$ and searched a document, I would find any string or line ending in min. I would find every benjamin that ends a line, but I would also find words like cumin, vermin, and vitamin in the same position.
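Here is a small sketch of both anchors, assuming Python's re module. The sample text is invented for the example. Two things in it are my additions: the re.MULTILINE flag, which makes ^ and $ apply to every line instead of only the whole text, and the \w* in front of min$, which is there only so the output shows the whole word.

```python
import re

# Invented sample text with three lines.
text = "benjamin likes cumin\na daily vitamin\nmint tea with benjamin"

# ^benjamin only matches where benjamin starts a line.
starts = re.findall(r"^benjamin", text, flags=re.MULTILINE)

# min$ only matches where min ends a line; \w* grabs the rest of the word.
ends = re.findall(r"\w*min$", text, flags=re.MULTILINE)

print(starts)  # ['benjamin']
print(ends)    # ['cumin', 'vitamin', 'benjamin']
```

The benjamin on the third line is not in the first list because it does not start its line, and mint is not in the second list because the line does not end right after min.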
Anchors are often paired with another regex component called boundaries. Boundaries are beyond the scope of this tutorial, but together anchors and boundaries let you describe where in the text a match is allowed to occur.
Quantifiers define how many times part of your pattern may repeat. The quantifier characters are *, +, ?, and {}. We are going to focus on + and {} because those are the ones found in our email regex.
+ requires the preceding part of the pattern to match one or more times. Let's take the first part of our regex, /^([a-z0-9_\.-]+). As a recap, ^ means we are looking for a string that begins with characters from a-z, digits from 0-9, or the special characters _, ., and -. The + after the closing bracket means at least one of those characters must be present, and as many more as follow will be included in the match. An address with nothing in front of the @ would not register as a positive match.
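A quick way to check this is to run only the first part of the pattern on its own. This is a sketch assuming Python's re module, with variable names chosen for the example. The second pattern swaps + for * purely for contrast, since * is the quantifier that also accepts zero characters.

```python
import re

# First part of the email pattern: one or more allowed characters from the start.
local_part = re.compile(r"^([a-z0-9_\.-]+)")

found = local_part.match("test_22@email.com").group(1)
missing = local_part.match("@email.com")

print(found)    # test_22
print(missing)  # None, because + needs at least one character

# For contrast only: * accepts zero characters, so it matches an empty string here.
zero_or_more = re.compile(r"^([a-z0-9_\.-]*)")
empty = zero_or_more.match("@email.com").group(1)
print(repr(empty))  # ''
```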
{} sets explicit limits on how many times the preceding part of the pattern may repeat. In the last portion of our regex, ([a-z\.]{2,6})$/, we can see a pair of {} characters. {2,6} requires that part of the match to be a minimum of 2 characters and a maximum of 6 characters. In the example email address test@email.com, the com section meets the requirements of {2,6}. By contrast, test@email.coalition would not match, because coalition is 9 characters long.
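The limits can be tried out by testing the last part of the pattern by itself. In this sketch, which assumes Python's re module, I have added a ^ in front so the piece can be tested in isolation, and the candidate endings other than com and coalition are invented for the example.

```python
import re

# Last part of the email pattern, with ^ added so it can be tested alone.
ending = re.compile(r"^[a-z\.]{2,6}$")

candidates = ["c", "com", "museum", "coalition"]
results = [bool(ending.match(candidate)) for candidate in candidates]
print(results)  # [False, True, True, False]

# The full pattern rejects the address from the text for the same reason.
full = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")
coalition_ok = bool(full.match("test@email.coalition"))
print(coalition_ok)  # False
```

One character is too few, nine is too many, and anything from two to six passes.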
Grouping constructs use the () characters. Each section within a pair of parentheses is called a subexpression. Our regex has three subexpressions. With our example, test@email.com, the word test matches the first subexpression, email matches the second, and com matches the third. The @ and the . sit outside of the parentheses as literal characters, so they are required to be there.
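Most regex tools let you read back what each subexpression matched. As a sketch, assuming Python's re module, the groups() method returns the three pieces in order:

```python
import re

pattern = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")
match = pattern.match("test@email.com")

whole = match.group(0)   # the entire matched text
parts = match.groups()   # one entry per pair of parentheses

print(whole)  # test@email.com
print(parts)  # ('test', 'email', 'com')
```

The @ and the . show up in the whole match but in none of the three parts, because they sit outside the parentheses.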
A bracket expression is anything that falls inside of a [] pair. In our regex example, /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/, we have three bracket expressions. Putting characters inside the brackets means that any one of the listed characters is acceptable at that position. For example, our first bracket expression accepts characters from a-z, digits from 0-9, or the special characters mentioned above. The following are examples of emails that would be a positive match:
1983@123mail.org
test_22@email.com
__--..@email.test
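All of these emails return a positive match because a bracket expression lists the characters that are allowed, without requiring any particular one of them to appear. None of our bracket expressions would match a capital letter. That would require adding A-Z inside the brackets, as in [a-zA-Z].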
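The three example emails can be checked directly. This sketch assumes Python's re module. The address Test@email.com is my own addition to show the capital letter case, and re.IGNORECASE is Python's option for ignoring case, used here as a shortcut that has the same effect as adding A-Z to each bracket expression.

```python
import re

pattern = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")

examples = [
    "1983@123mail.org",
    "test_22@email.com",
    "__--..@email.test",
    "Test@email.com",  # added for this example: starts with a capital letter
]
results = {email: bool(pattern.match(email)) for email in examples}
for email, matched in results.items():
    print(email, matched)  # the first three print True, the last prints False

# Same pattern, but letters are matched regardless of case.
relaxed = re.compile(pattern.pattern, re.IGNORECASE)
capital_ok = bool(relaxed.match("Test@email.com"))
print(capital_ok)  # True
```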
Character escapes are used in a regex to make a special character be treated as plain text instead of as an instruction. For example, if you wanted to search specifically for a {} pair, you would search with \{\}. The backslash tells the regex to read the braces literally. If you do not want { to start a quantifier, you need to use the character escape. In our email regex, /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/, we are searching for email addresses that end in .com or something similar. However, an unescaped . is a special character that matches almost any single character (it is usually covered under character classes, which we won't discuss in depth here). In order to require an actual . before com, we need a character escape so that the . is treated as plain text and not as a special character.
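The difference an escape makes is easy to see side by side. In this sketch, which assumes Python's re module, the sample strings are invented for the example.

```python
import re

# Unescaped . matches any single character, so X is accepted in its place.
wildcard_hit = bool(re.search(r"email.com", "emailXcom"))

# Escaped \. matches only a literal period.
escaped_miss = bool(re.search(r"email\.com", "emailXcom"))
escaped_hit = bool(re.search(r"email\.com", "email.com"))

print(wildcard_hit)  # True
print(escaped_miss)  # False
print(escaped_hit)   # True

# Escaped braces are read as text rather than as a quantifier.
braces = re.findall(r"\{\}", "an empty object is {}")
print(braces)  # ['{}']
```

Without the backslash, our email regex would accept any character between the domain and the ending, not just a period.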
Benjamin Bushman is an IT professional with years of experience in network administration, computer hardware, and technical support. Having recently made the jump into web development, Benjamin excels in all things tech. Check out his GitHub profile at: https://github.com/benbushman98