@benbushman98
Last active October 17, 2022 18:00
Regex Tutorial

Nearly all work with computers, from coding and writing to formatting, eventually requires searching for information. Whether you are working with databases, lists, and documents or with books, scripts, and code, there will come a time when you need to find something. The usual approach is CTRL + F, which finds text you already know is there. But what if you have a database full of information about how users use your site? What if the exported information contains sensitive material such as emails, passwords, or addresses? You probably won't know every address you need to remove, so a simple find won't work. This is where a regex, or "regular expression", steps in: a pattern that describes the text you are searching for.

Summary

In this demonstration, we will examine a regex that searches specifically for an email address. The key parts of an email address are a run of characters, followed by the @ symbol, followed by the domain characters, followed by the . symbol, and finished off with a few more characters such as com or org. A common regex to identify this pattern is the following: /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
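
As a quick way to try the pattern out, here is a minimal sketch in JavaScript, assuming you run it in Node.js or a browser console. The helper name looksLikeEmail is made up for this example and is not part of any library.

```javascript
// The email pattern from this tutorial, written as a JavaScript regex literal.
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical helper name, chosen for this example only.
function looksLikeEmail(text) {
  // RegExp.prototype.test returns true when the pattern matches the text.
  return emailPattern.test(text);
}

console.log(looksLikeEmail("test@email.com")); // true
console.log(looksLikeEmail("test.email.com")); // false (no @ symbol)
console.log(looksLikeEmail("Test@email.com")); // false (capital letter)
```

The sections below walk through why each of these results comes out the way it does.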

Regex Components

Anchors

A common component of a regex is what is known as an anchor. There are two characters that are considered an anchor:

^ and $

The ^ symbol refers to the start of a string or line. For example, if I wrote ^benjamin and searched a document with that regex, I would be looking for the string benjamin at the start of a string or line.

The $ symbol refers to the end of a string or line. For example, if I wrote min$ and searched a document, I would find any string or line ending in min. That includes strings ending in benjamin, but also ones ending in words like cumin, vermin, or vitamin.

Anchors are often paired with another regex component called boundaries. Boundaries are beyond the scope of this tutorial, but together anchors and boundaries let you describe where in the text a match must occur.
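
The two anchors can be tried on their own with a short JavaScript sketch. The variable names are made up for this example. One assumption worth calling out: in JavaScript, ^ and $ match the start and end of the whole string by default, and only match at each line when the m (multiline) flag is added.

```javascript
// ^ anchors the match to the start of the string.
const startsWithBenjamin = /^benjamin/;
console.log(startsWithBenjamin.test("benjamin bushman")); // true
console.log(startsWithBenjamin.test("hello benjamin"));   // false

// $ anchors the match to the end of the string.
const endsWithMin = /min$/;
console.log(endsWithMin.test("benjamin")); // true
console.log(endsWithMin.test("cumin"));    // true
console.log(endsWithMin.test("minute"));   // false

// With the m flag, $ also matches at the end of each line.
const endsLineWithMin = /min$/m;
console.log(endsWithMin.test("vitamin\nsalt"));     // false
console.log(endsLineWithMin.test("vitamin\nsalt")); // true
```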

Quantifiers

Quantifiers define how many times part of your pattern must appear. The quantifier characters are *, +, ?, and {}. We are going to focus on + and {} because those are what we find in our regex for an email address.

+ requires whatever comes right before it to match one or more times. Let's take the first part of our regex, /^([a-z0-9_\.-]+). As a recap, ^ means the match must begin at the start of the string, and the bracket expression allows letters from a-z, digits from 0-9, or the special characters _, ., and -. The + after the brackets means at least one of those characters must appear, and any number more may follow, for the string to register as a positive match.

{} sets explicit limits on how many times the preceding item may repeat. In the last portion of our regex, ([a-z\.]{2,6})$/, we can see a pair of {} characters. This requires that part of the match to be a minimum of 2 characters and a maximum of 6 characters. In the example email address test@email.com, the com section meets the requirements of {2,6}. By contrast, test@email.coalition would not match, because coalition is 9 characters long.
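
To see both quantifiers in isolation, here is a small JavaScript sketch. It pulls the first and last bracket expressions out of the email regex and wraps each in ^ and $ so the whole test string has to fit. The variable names are made up for this example.

```javascript
// + : one or more characters from the bracket expression.
const oneOrMore = /^[a-z0-9_\.-]+$/;
console.log(oneOrMore.test("test_22")); // true
console.log(oneOrMore.test(""));        // false (needs at least one character)

// {2,6} : between 2 and 6 characters from the bracket expression.
const twoToSix = /^[a-z\.]{2,6}$/;
console.log(twoToSix.test("com"));       // true  (3 characters)
console.log(twoToSix.test("c"));         // false (1 character)
console.log(twoToSix.test("coalition")); // false (9 characters)
```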

Grouping Constructs

Grouping constructs use the () characters. Each section within a pair of parentheses is called a subexpression. Our regex has three subexpressions. With the example test@email.com, the word test matches the first subexpression, email matches the second, and com matches the third. The @ and the . sit outside the parentheses as literal characters, so they must appear in exactly those positions.
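
In JavaScript, each subexpression is also captured, so you can read back the text it matched. The sketch below uses the built-in String.prototype.match method; the variable names are made up for this example.

```javascript
const groupedPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// match() returns an array: index 0 is the full match,
// and indexes 1 to 3 hold the text matched by each subexpression.
const parts = "test@email.com".match(groupedPattern);

console.log(parts[0]); // "test@email.com"
console.log(parts[1]); // "test"
console.log(parts[2]); // "email"
console.log(parts[3]); // "com"
```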

Bracket Expressions

A bracket expression is anything that falls inside a [] pair. Our regex, /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/, has three bracket expressions. Putting characters inside brackets means any one of the listed characters is allowed at that position. For example, our first bracket expression accepts letters from a-z, digits from 0-9, or the special characters mentioned above; in the second, \d is shorthand for 0-9. The following are examples of emails that would be a positive match:

1983@123mail.org
test_22@email.com
__--..@email.test

All of these emails would return a positive match because a bracket expression allows any of its characters without requiring every one of them to appear. Our regex would not match anything with a capital letter. That would require adding A-Z to the bracket expressions.
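
The three sample addresses and the capital-letter case can be checked with a short JavaScript sketch. The variable names are made up for this example. The second pattern is not from the original regex: it is one possible way to accept capitals, using JavaScript's i (ignore case) flag as an alternative to adding A-Z to each bracket expression.

```javascript
const lowercaseOnly = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

const samples = ["1983@123mail.org", "test_22@email.com", "__--..@email.test"];
const results = samples.map((email) => lowercaseOnly.test(email));
console.log(results); // [ true, true, true ]

// A capital letter falls outside every bracket expression, so there is no match.
console.log(lowercaseOnly.test("Test@email.com")); // false

// Assumption for illustration: the i flag makes a-z match A-Z as well.
const anyCase = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/i;
console.log(anyCase.test("Test@email.com")); // true
```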

Character Escapes

Character escapes are used in a regex to make a special character match literally instead of performing its special function. For example, if you wanted to search specifically for a {} pair, you would search with \{\}. Without the escape, { could be read as the start of a quantifier. In our email regex, /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/ we are searching for email addresses that end in .com or something similar. However, an unescaped . is a special character that matches any single character other than a line break. To require an actual . before com, we escape it as \. so that it is treated as a literal period and not as a special character.
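
The difference an escape makes is easiest to see side by side. This is a minimal JavaScript sketch with made-up variable names and test strings, not part of the email regex itself.

```javascript
// Unescaped . matches any single character (other than a line break).
const unescapedDot = /^email.com$/;
console.log(unescapedDot.test("email.com")); // true
console.log(unescapedDot.test("emailXcom")); // true (X stands in for the .)

// Escaped \. matches only a literal period.
const escapedDot = /^email\.com$/;
console.log(escapedDot.test("email.com")); // true
console.log(escapedDot.test("emailXcom")); // false

// Escaped braces match a literal {} pair instead of acting as a quantifier.
const literalBraces = /\{\}/;
console.log(literalBraces.test("let settings = {};")); // true
```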

Author

Benjamin Bushman is an IT professional with years of experience in network administration, computer hardware, and technical support. Having recently made the jump into Web Development, Benjamin excels in all things tech. Check out his GitHub profile at: https://github.com/benbushman98
