@alexgeis
Last active April 1, 2022 22:50
Regex Tutorial

Regex Tutorial - Matching a URL

This gist provides a tutorial on regular expressions (regex), specifically using regex to match a URL.

Summary

Regular expressions use a specific sequence of characters to define a search pattern or validate data.

Different requirements will utilize different regex sequences. For example, the regex below matches a valid URL:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

The pattern above tests whether a piece of text has the general shape of a URL: an optional http:// or https:// scheme, a domain, a top-level domain, and an optional path. The meaning and purpose of each character are detailed below.
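As a quick check of the pattern, the sketch below runs it against a few sample strings with JavaScript's RegExp test method. The sample strings and the names urlPattern, samples, and results are made up for illustration and are not part of the original gist.

```javascript
// The URL pattern from this tutorial, written as a JavaScript regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  "https://www.example.com/path/to/page",
  "example.com",
  "http://example",    // no top-level domain
  "ftp://example.com", // scheme other than http/https
];

// test() returns true when the whole string fits the pattern.
const results = samples.map((text) => urlPattern.test(text));

samples.forEach((text, i) => console.log(text, "=>", results[i]));
```

Note that the pattern only checks the shape of the text; it says nothing about whether the address actually exists.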

Regex Components

Below are the components used in the URL regex above. These components exclude the slash characters (/) that wrap the regex, since those slashes only mark it as a regex literal.

Anchors

The characters ^ and $ are both considered to be anchors. They match a position before, after, or between characters - they "anchor" the regex match at a certain position.

^ matches the position before the first character in the string.

$ matches the position after the last character in the string.

By wrapping our regex in these anchors, we require the pattern to match the entire string from its first character to its last, rather than just some portion of it.
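To see what the anchors change, here is a minimal sketch that compares the same word with and without them. The patterns and test strings are hypothetical examples, not taken from the URL regex.

```javascript
// Hypothetical patterns: the same word with and without anchors.
const anchored = /^cat$/;
const unanchored = /cat/;

console.log(unanchored.test("concatenate")); // true: "cat" appears somewhere inside
console.log(anchored.test("concatenate"));   // false: extra characters before and after
console.log(anchored.test("cat"));           // true: the whole string is "cat"

// Each anchor can also be used on its own.
console.log(/^cat/.test("catalog")); // true: the string starts with "cat"
console.log(/cat$/.test("bobcat"));  // true: the string ends with "cat"
```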

Quantifiers

Quantifiers in regex set constraints on how many instances of a character, group, or character class must exist in the input for a match to be found.

They include the following:

* - Match zero or more times

+ - Match one or more times

? - Match zero or one time

{n} - Match exactly n times

{n,} - Match at least n times

{n,x} - Match from n to x times
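The list above can be sketched as a small table of checks. Each entry below applies one quantifier to the letter b inside a hypothetical a...c pattern and lists strings it should accept and reject; the patterns and strings are illustrative assumptions only.

```javascript
// Each entry pairs a quantifier with strings it should accept and reject.
const examples = [
  { quantifier: "*",     pattern: /^ab*c$/,     accepts: ["ac", "abbbc"],   rejects: ["adc"] },
  { quantifier: "+",     pattern: /^ab+c$/,     accepts: ["abc", "abbbc"],  rejects: ["ac"] },
  { quantifier: "?",     pattern: /^ab?c$/,     accepts: ["ac", "abc"],     rejects: ["abbc"] },
  { quantifier: "{2}",   pattern: /^ab{2}c$/,   accepts: ["abbc"],          rejects: ["abc", "abbbc"] },
  { quantifier: "{2,}",  pattern: /^ab{2,}c$/,  accepts: ["abbc", "abbbbc"], rejects: ["abc"] },
  { quantifier: "{2,3}", pattern: /^ab{2,3}c$/, accepts: ["abbc", "abbbc"], rejects: ["abc", "abbbbc"] },
];

// True only if every pattern accepts and rejects exactly the strings listed.
const allBehaveAsListed = examples.every(({ pattern, accepts, rejects }) =>
  accepts.every((s) => pattern.test(s)) && rejects.every((s) => !pattern.test(s))
);

console.log(allBehaveAsListed);
```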

The URL matching regex uses ?, *, +, and {n,x}. Let's highlight the purpose of ?, *, and {n,x} in detail; + is covered under Bracket Expressions.

--?--

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

? is used twice in the following grouping:

/^(https?:\/\/)?

This grouping is checking for either http:// OR https://. The first ? directly follows s, which means the s can be present 0 or 1 times (in other words, it's optional).

The second use of ? directly follows the grouping. As with the previous usage, this ? is declaring the preceding grouping to be optional. In simple terms, this means that not only is the s following "http" optional, but the entire grouping http:// OR https:// doesn't need to be included (a valid URL could start with either, or even "www.").
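A minimal sketch of this grouping on its own, assuming three made-up inputs, shows what the scheme portion consumes in each case. The names schemePart, inputs, and matchedSchemes are hypothetical.

```javascript
// Only the scheme portion of the URL regex, kept anchored to the start.
const schemePart = /^(https?:\/\/)?/;

// Hypothetical inputs: with "s", without "s", and with no scheme at all.
const inputs = ["https://example.com", "http://example.com", "www.example.com"];

// exec() always succeeds here because the whole group is optional;
// element [0] holds the text the pattern actually consumed.
const matchedSchemes = inputs.map((text) => schemePart.exec(text)[0]);

console.log(matchedSchemes);
```

For the last input the match is an empty string, which is how an optional group behaves when it is skipped.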

--*--

Similar to the ? quantifier, * makes an expression optional, as it matches the preceding element zero or more times. It is equivalent to {0,}, whereas ? is equivalent to {0,1}.

([\/\w \.-]*)*

The filepath section of the URL regex above uses the * quantifier twice. The first instance sits inside the subexpression (a section of regex grouped using parentheses () ), directly after the bracket expression (a set of characters grouped using brackets [] ).

Let's break down this bracket expression first:

[\/\w \.-]

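\/ => this sequence uses the escape character \ so that "/" is read as a literal character. Because the regex is written as a literal wrapped in slashes, escaping "/" keeps it from being confused with the closing slash. Paths are built from "/" separators, so it is included in our bracket expression. Note that the order of items inside a bracket expression does not matter, so listing \/ first does not force the path to begin with "/".

\w => this is a character class, which matches any alphanumeric character from the basic Latin alphabet, including the underscore. This class is equivalent to the bracket expression [A-Za-z0-9_].

(space) => the literal space character between \w and \. is part of the bracket expression, so spaces are also accepted in the filepath.

\. => as with the "/" character, we want to include a literal "." in our search, so it is preceded by the escape character \. Inside a bracket expression "." already loses its special meaning, so the escape is optional here, but it is harmless.

- => a hyphen placed at the end of a bracket expression is treated as a literal hyphen rather than as a range operator, so "-" itself is one of the characters we accept.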
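The pieces of this bracket expression can be checked one character at a time. The sketch below anchors the bracket expression so it must match exactly one character; the anchors, the sample characters, and the names pathChar, allowed, and rejected are assumptions added for illustration.

```javascript
// The filepath bracket expression, anchored so it matches a single character.
const pathChar = /^[\/\w \.-]$/;

// Hypothetical samples: one character for each member of the bracket expression.
const allowed = ["/", "a", "Z", "7", "_", " ", ".", "-"];
// Characters that are not members of the bracket expression.
const rejected = ["?", "#", "=", ":"];

const everyAllowedMatches = allowed.every((ch) => pathChar.test(ch));
const anyRejectedMatches = rejected.some((ch) => pathChar.test(ch));

console.log(everyAllowedMatches, anyRejectedMatches);

// Inside brackets an unescaped "." is literal too: it matches "." but not "a".
const bareDot = /^[.]$/;
console.log(bareDot.test("."), bareDot.test("a"));
```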

With this context we can now tackle the first usage of *:

([\/\w \.-]*)

In this instance * allows any number of the permitted characters in the filepath that follows the TLD. Note that this * applies to the bracket expression.

([\/\w \.-]*)*

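In the second usage, the * quantifier applies to the entire subexpression. Because the inner * already allows any number of characters, the outer * does not change which strings are accepted.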
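To compare the two uses of *, the sketch below tests a few made-up paths against the nested form from the URL regex and against a simplified form with a single *. The simplified pattern, the anchors, and the sample paths are assumptions for illustration, not part of the original regex.

```javascript
// The filepath portion as written in the URL regex (nested quantifiers).
const nested = /^([\/\w \.-]*)*$/;
// A hypothetical simplified form with only the inner quantifier.
const single = /^[\/\w \.-]*$/;

// Short sample paths; the last one contains "?" and "=", which are not allowed.
const paths = ["", "/docs/index.html", "/my file-v2.txt", "/a?b=1"];

const comparison = paths.map((path) => ({
  path,
  nested: nested.test(path),
  single: single.test(path),
}));

console.log(comparison);
```

On these samples both forms agree. One practical difference is that a nested quantifier gives a backtracking engine many ways to split the same text, so it can become slow on long inputs that fail to match; the single-quantifier form avoids that.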

--{n,x}--

Using this curly bracket quantifier, we're setting the minimum (n) and maximum (x) limits for our match.

The following section of our URL regex concerns the TLD (e.g. .com, .net, .gov, etc.):

\.([a-z\.]{2,6})

\. => using an escape character \ to allow the use of a reserved special character in regex as a literal (in this case .).

() => start/end of subexpression

[] => start/end of bracket expression

[a-z\.] => within bracket expressions and not preceded by \, a hyphen between two characters defines the range between them. Regex is case sensitive by default, so in this instance the range is all lowercase letters between "a" and "z". The \. at the end adds a period (".") to the set of allowed characters using an escape character.

Now this brings us to the {} quantifier: {2,6} => since two values are provided, we're setting a min and max for our search pattern. This quantifier directly follows the expression it's being applied to.

In simple terms, our {} quantifier is saying "I'm looking for a string between 2 and 6 characters long that includes only lowercase letters or periods."
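The length limits can be tried out on the TLD section in isolation. In the sketch below, the ^ and $ anchors are added so each candidate is tested on its own; the anchors, the candidate strings, and the names tldPart, candidates, and accepted are assumptions for illustration.

```javascript
// The TLD section of the URL regex, anchored so it must match the whole string.
const tldPart = /^\.([a-z\.]{2,6})$/;

// Hypothetical candidates: valid lengths, too short, too long, and uppercase.
const candidates = [".com", ".co.uk", ".museum", ".c", ".company", ".COM"];

// Keep only the candidates with 2 to 6 lowercase letters or periods after the dot.
const accepted = candidates.filter((text) => tldPart.test(text));

console.log(accepted);
```

Since the bracket expression includes a period, a compound ending such as ".co.uk" counts as five characters and fits within the limit.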

Grouping Constructs

As noted above, there are multiple ways to group expressions between your anchors.

() => start/end of subexpression (groups a section of a regex)

[] => start/end of bracket expression (set of characters we want to match - positive character group)

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

Our full URL regex can be broken up into the following grouping constructs (excluding our slash characters and anchors):

http/https scheme: (https?:\/\/)?

second-level domain (plus any subdomains, since periods are allowed): ([\da-z\.-]+)

period between second-level domain and top-level domain: \.

top-level domain (TLD): ([a-z\.]{2,6})

subdirectory/filepath: ([\/\w \.-]*)*\/?
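Because each part is wrapped in parentheses, the matched text for each grouping can be read back after a match. The sketch below does this with JavaScript's exec method on one made-up URL; the sample URL and the variable names are hypothetical.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// exec() returns an array: the full match first, then one entry per group.
const match = urlPattern.exec("https://www.example.com/docs/index.html");

// Destructure the groups in the order they appear in the pattern.
const [whole, scheme, domain, tld, path] = match;

console.log({ scheme, domain, tld, path });
```

For this input the domain group holds "www.example" and the TLD group holds "com", which shows that the second grouping also absorbs any subdomain.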

Bracket Expressions

Three bracket expressions are used in total, and their purposes are explained below:

[\da-z\.-]+

The expression above allows numbers (\d), all lowercase letters from a-z (a-z), periods (\.), and hyphens (-). The plus at the end requires a match that occurs one or more times (equivalent to {1,}).

([a-z\.]{2,6})

The expression above allows all lowercase letters from a-z (a-z) and periods (\.). The quantifier that follows ({2,6}) requires a match with a minimum length of 2 and a maximum length of 6.

([\/\w \.-]*)*

The expression above allows forward slashes (\/), all alphanumeric characters from the basic Latin alphabet plus the underscore (\w), spaces, periods (\.), and hyphens (-). The * directly after the bracket expression allows a match that occurs zero or more times (equivalent to {0,}); the second * applies the same quantifier to the whole subexpression.
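To compare the three bracket expressions side by side, the sketch below reports which of them accept a given single character. The quantifiers are dropped and anchors added so that each pattern tests exactly one character; the helper acceptedBy and the labels domain, tld, and path are hypothetical names.

```javascript
// The three bracket expressions, each anchored to match one character.
const classes = {
  domain: /^[\da-z\.-]$/,
  tld: /^[a-z\.]$/,
  path: /^[\/\w \.-]$/,
};

// Hypothetical helper: lists which bracket expressions accept a character.
function acceptedBy(ch) {
  return Object.keys(classes).filter((name) => classes[name].test(ch));
}

for (const ch of ["a", ".", "7", "-", "A", "/", "?"]) {
  console.log(JSON.stringify(ch), "=>", acceptedBy(ch));
}
```

Lowercase letters and periods are accepted everywhere, digits and hyphens are not accepted in the TLD, and uppercase letters and slashes are only accepted in the filepath.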

Character Classes

Two character classes are used within the URL regex:

\d => Matches any Arabic numeral digit. This class is equivalent to the bracket expression [0-9].

\w => Matches any alphanumeric character from the basic Latin alphabet, including the underscore (_). This class is equivalent to the bracket expression [A-Za-z0-9_].
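The stated equivalences can be checked directly. The sketch below walks through the 128 ASCII characters and confirms that each class agrees with its bracket expression on every one of them; the loop and the names digitsAgree and wordsAgree are illustrative assumptions.

```javascript
// Track whether each class agrees with its bracket-expression equivalent.
let digitsAgree = true;
let wordsAgree = true;

// Compare the classes character by character over the ASCII range.
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  if (/\d/.test(ch) !== /[0-9]/.test(ch)) digitsAgree = false;
  if (/\w/.test(ch) !== /[A-Za-z0-9_]/.test(ch)) wordsAgree = false;
}

console.log(digitsAgree, wordsAgree);
```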

Author

Alex Geis is an aspiring full-stack developer using the MERN stack. He can be found in Denver making music or coding.
