    Regex URL Validation

In this article we are going to talk about Regex and URL validation.
We will go over everything from Anchors to Character Escapes, and the examples are in JavaScript.
Summary

If Regex is new to you: Regex (short for regular expression) is used in many programming languages to find patterns within strings. Each character in a Regex represents part of the pattern being searched for. Patterns are made up of normal and special characters. Normal characters like abc 123 represent themselves, while special characters like /.$ have special meaning.
For example, the following pattern checks whether a string contains abc. Without anchors, a pattern can match anywhere in the string.
/abc/
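And this checks whether a string contains a run of 8 to 50 lowercase letters or digits. To require that the whole string consists of 8 to 50 such characters, the pattern needs the ^ and $ anchors, which are covered in the Anchors section.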
/[a-z0-9]{8,50}/
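To make the difference between "contains" and "is exactly" concrete, here is a small sketch using test(). The variable names are only for illustration, and the anchored variants are an addition that uses ^ and $ from the Anchors section.

```javascript
// test() returns true when the pattern is found anywhere in the string.
const containsAbc = /abc/;
const exactlyAbc = /^abc$/; // anchored: the whole string must be "abc"

console.log(containsAbc.test("xxabcxx")); // true
console.log(exactlyAbc.test("xxabcxx"));  // false
console.log(exactlyAbc.test("abc"));      // true

const hasRun = /[a-z0-9]{8,50}/;        // a run of 8 to 50 characters somewhere
const wholeString = /^[a-z0-9]{8,50}$/; // the entire string is 8 to 50 characters

console.log(hasRun.test("!!password1!!"));      // true
console.log(wholeString.test("!!password1!!")); // false
console.log(wholeString.test("password1"));     // true
console.log(wholeString.test("short1"));        // false, only 6 characters
```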
Here is the Regex for URL validation we will be looking at today:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
It's pretty intimidating at first, but by the end of this article you'll be able to decipher it easily.

Really quick before we begin: there are two ways to create a Regex in JS.

RegExp Constructor Object

let regex1 = new RegExp("123");

Literal Value

let regex2 = /123/
For this article we will be using the Literal Value.
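The two forms produce equivalent Regex objects. One practical difference, sketched below with illustrative variable names, is that the constructor takes a string, so every backslash has to be doubled to survive string escaping.

```javascript
// In a string, "\\d" is how you write the two characters \d.
const fromConstructor = new RegExp("\\d+\\.\\d+");
const fromLiteral = /\d+\.\d+/;

// Both objects hold the same pattern text.
console.log(fromConstructor.source === fromLiteral.source); // true

console.log(fromLiteral.test("3.14"));     // true
console.log(fromConstructor.test("3.14")); // true
console.log(fromLiteral.test("314"));      // false, no literal period
```

The constructor is mostly useful when the pattern has to be built from a string at runtime; for a fixed pattern the literal is easier to read.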
Table of Contents


Anchors
Character Escapes
Quantifiers
Bracket Expressions
Grouping Constructs
Character Classes

Regex Components

Let's start by constructing our Regex. Then we can break it back down by component.
First off, all Regex Literals are wrapped in forward slashes.
/abc/
And our Regex has the following components:

Plain Text

http

Anchors

^ $

Quantifiers

{2,6} + * ?

Grouping Constructs

()

Bracket Expressions

[]

Character Classes

\d \w

Character Escapes

\/ \.
Take another look at the Regex. Try to identify the components we listed before we move on.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Anchors

Let's take a quick look at the beginning and end of our Regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
/^$/
These are called Anchors. The ^ indicates that whatever follows it must appear at the beginning of the string. This could either be plain text or a group of Regex (more on groups later). The other Anchor is $, which indicates that the string must end with whatever comes before it. Here is an example:
/^abc[a-z0-9_-]*123$/

Example Matches:
abc123
abcdefg123
abcdaniel123
Because there is another section of Regex between abc and 123, the ^ anchor only requires abc at the beginning of the string and the $ anchor only requires 123 at the end. Everything in between is matched by the bracket expression. More on those later.
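As a quick check of what the anchors add, the sketch below compares the pattern above with an unanchored copy of it. The variable names are only for illustration.

```javascript
const anchored = /^abc[a-z0-9_-]*123$/;
const unanchored = /abc[a-z0-9_-]*123/;

console.log(anchored.test("abc123"));       // true
console.log(anchored.test("abcdaniel123")); // true
console.log(anchored.test("xabc123"));      // false, does not start with abc
console.log(anchored.test("abc123!"));      // false, does not end with 123

// Without anchors the same text is found inside a longer string.
console.log(unanchored.test("xabc123!"));   // true
```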
Character Escapes

The second component we are going over is Character Escapes, specifically in this piece of code:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
/^  (https?:\/\/)  ?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
(https?:\/\/)
https?:\/\/
In short, this piece of Regex is searching for http:// or https://. Regex will match any normal characters, so https and :. But what is the ? doing? And what about \/\/? Let's focus on the \/\/.
Regex requires certain characters to be escaped when you want to match them literally. Characters such as { and ( have special meaning in Regex: they are used to define components like Quantifiers and Grouping Constructs.
In our example, we are using the backslash to escape the forward slashes. That's because a Regex Literal uses / to mark its beginning and end. If we wrote a normal /, JavaScript would consider this the end of the literal. We don't want that; we want the character /. Therefore we must escape the / with a backslash.
Here is how the forward slashes translate in our Regex Literal:
https?:\/\/
https?://
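The sketch below shows escaping in action. It assumes nothing beyond standard JavaScript: the first two lines show that the slash only needs a backslash in the literal form, and the rest show how escaping a period changes what it matches.

```javascript
// In a literal, / must be escaped because it also ends the literal.
const protocolLiteral = /https?:\/\//;
// In a constructor string there is no closing slash to confuse, so no escape is needed.
const protocolFromString = new RegExp("https?://");

console.log(protocolLiteral.test("https://example.com"));    // true
console.log(protocolFromString.test("https://example.com")); // true

// An unescaped period matches any character; an escaped one matches only "."
console.log(/a.c/.test("abc"));  // true
console.log(/a\.c/.test("abc")); // false
console.log(/a\.c/.test("a.c")); // true
```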
Quantifiers

Our snippet of Regex is almost deciphered. But what's the ? doing? Well, Regex has a component called a Quantifier.
Quantifiers allow you to control the number of times the character or group preceding them should match.

? - Will match a pattern 0 or 1 times.
* - Will match a pattern 0 or more times.
+ - Will match a pattern 1 or more times.

Check out our snippet:
https?://
s?
This is what allows us to match http and https. The Regex is looking for either no s or exactly one s after http.
Aside from those characters there are {}'s. The curly brackets allow you to determine the exact number of matches. There are three ways to write a curly bracket quantifier. Note that there are no spaces inside the brackets.

{5} - Will match exactly 5 times.
{5,} - Will match at least 5 times.
{5,10} - Will match at least 5 times and at most 10 times.
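Here is a short sketch that exercises each quantifier. The patterns are made up for illustration and are anchored with ^ and $ so that the counts apply to the whole string; the curly bracket forms are written without spaces inside the brackets.

```javascript
// ? : zero or one
console.log(/^https?$/.test("http"));   // true
console.log(/^https?$/.test("https"));  // true
console.log(/^https?$/.test("httpss")); // false

// * : zero or more, + : one or more
console.log(/^ab*c$/.test("ac"));    // true
console.log(/^ab*c$/.test("abbbc")); // true
console.log(/^ab+c$/.test("ac"));    // false

// {n}, {n,}, {n,m}
console.log(/^\d{5}$/.test("12345"));          // true
console.log(/^\d{5}$/.test("1234"));           // false
console.log(/^\d{5,}$/.test("1234567"));       // true
console.log(/^\d{5,10}$/.test("12345678901")); // false, 11 digits
```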

Take one more look at our sequence and try to decipher what you see.
(https?:\/\/)
I bet you're wondering what those () brackets are.
Grouping Constructs

We've matched the http(s):// protocol.
https://
http://
But what if the protocol is optional? We would need to put a ? after it. This way the URL can either contain or not contain the http protocol. A ? on its own only applies to the single character before it, so to make the whole protocol optional we use () to group it.
https?:\/\/
(https?:\/\/)
Now that the Regex for the http protocol is grouped, let's look at our full example:
/^  (https?:\/\/)  ?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
And let's drag that ? over.
/^  (https?:\/\/)?  ([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Tada! We can now match the http protocol 0 or 1 times. Seems like we have a lot more Regex to go, right? Well, now that we know about Grouping Constructs, let's space the rest of the Regex into its Grouping Constructs:
/^  (https?:\/\/)?  ([\da-z\.-]+)  \.  ([a-z\.]{2,6})  ([\/\w \.-]*)  *\/?$/
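To see why the group matters for the protocol, the sketch below compares a ? placed after the bare protocol with a ? placed after the grouped protocol. The word example and the variable names are placeholders chosen for illustration.

```javascript
// Without a group, ? only makes the final slash optional.
const ungrouped = /^https?:\/\/?example$/;
// With a group, ? makes the whole protocol optional.
const grouped = /^(https?:\/\/)?example$/;

console.log(ungrouped.test("example"));         // false
console.log(ungrouped.test("https:/example"));  // true, one slash is accepted
console.log(grouped.test("example"));           // true
console.log(grouped.test("https://example"));   // true
console.log(grouped.test("https:/example"));    // false

// A group also captures what it matched.
console.log("https://example".match(grouped)[1]); // "https://"
console.log("example".match(grouped)[1]);         // undefined
```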
Let's move on to the second grouping:
([\da-z\.-]+)
[\da-z\.-]+
We will come back to the + later. Let's focus on this:
[\da-z\.-]
Bracket Expressions

So I mentioned earlier that we can match with just plain text.
/abc 123/
But what if we wanted to match any letter or any number? That is where Bracket Expressions come in. A Bracket Expression defines a set of characters, and it matches any single character from that set. For example, the following Regex will match any lowercase letter:
/[a-z]/
Notice the - in between a and z. The hyphen, when written between two characters within a Bracket Expression, indicates a range. Ranges are determined by each character's Unicode code point.
/[a-z]/ === U+0061-U+007A
And if we want to test for both uppercase and lowercase we would simply repeat the range with capitals:
/[a-zA-Z]/
Numbers work the same:
/[0-9]/
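The ranges above can be checked directly. This sketch anchors each pattern so it tests a single character, and prints the code points that bound the a-z range.

```javascript
console.log(/^[a-z]$/.test("q"));    // true
console.log(/^[a-z]$/.test("Q"));    // false, uppercase is outside a-z
console.log(/^[a-zA-Z]$/.test("Q")); // true
console.log(/^[0-9]$/.test("7"));    // true
console.log(/^[0-9]$/.test("x"));    // false

// The range a-z runs from U+0061 to U+007A.
console.log("a".codePointAt(0).toString(16)); // "61"
console.log("z".codePointAt(0).toString(16)); // "7a"
```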
Let's check back to the Regex group:
/[\da-z\.-]/
For now, let's focus on this:
/[a-z\.-]/
Here, the \. is an escaped period. Inside a Bracket Expression a period already loses its special meaning, so the backslash is optional, but it does no harm. We also have a -. Note that the - comes last: a hyphen between two characters defines a range, so a literal hyphen should be placed at the start or end of the Bracket Expression (or escaped as \-). In our example we are searching for a-z, a period, or a hyphen. But what about the \d?
/[\da-z\.-]/
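Before turning to \d, here is a quick sketch of how the period and the hyphen behave inside a Bracket Expression. The sample strings and variable names are made up for illustration.

```javascript
// The period is literal inside brackets, with or without the backslash.
const withEscape = /^[a-z\.-]+$/;
const withoutEscape = /^[a-z.-]+$/;

console.log(withEscape.test("face-book.com"));    // true
console.log(withoutEscape.test("face-book.com")); // true
console.log(withEscape.test("face_book"));        // false, underscore is not in the set

// Outside brackets a period matches any character; inside it matches only "."
console.log(/^.$/.test("x"));   // true
console.log(/^[.]$/.test("x")); // false
console.log(/^[.]$/.test(".")); // true
```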
Character Classes

Regex can define a set of character matches with a shorthand Character Class. Take a look at these two Regex Literals:
/[0-9a-z\.-]/
/[\da-z\.-]/
They will both match the same exact pattern. Let's break the most common Character Classes down for a better understanding.

\d - Will match any digit (0-9).
\w - Will match any alphanumeric character from the basic Latin alphabet. This includes lowercase letters, uppercase letters, digits, and the underscore _.
. - Will match any single character except line breaks such as \n.
\s - Will match a single whitespace character. This includes tabs, spaces, and line breaks.

All Character Classes, except for the ., can be inverted. This is done by capitalizing the letter.

\D - Will match any character that is not a digit (0-9).
\W - Will match any character that is not a letter, digit, or underscore.
\S - Will match any single non-whitespace character.
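The sketch below checks each Character Class against its capitalized, inverted form (\D, \W, \S) and confirms that \d and 0-9 are interchangeable inside a Bracket Expression. The sample strings are made up for illustration.

```javascript
console.log(/^\d$/.test("7")); // true
console.log(/^\D$/.test("7")); // false
console.log(/^\D$/.test("a")); // true

console.log(/^\w+$/.test("snake_case9")); // true, letters, digits and underscore
console.log(/^\w$/.test("-"));            // false
console.log(/^\W$/.test("-"));            // true

console.log(/^\s$/.test("\t")); // true
console.log(/^\S$/.test(" "));  // false

console.log(/^.$/.test("\n"));  // false, the period skips line breaks

// \d is shorthand for 0-9, so these two behave the same.
const longForm = /^[0-9a-z\.-]+$/;
const shortForm = /^[\da-z\.-]+$/;
console.log(longForm.test("w1w.face-book"));  // true
console.log(shortForm.test("w1w.face-book")); // true
```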

Moving back to our snippet we can now decipher that we are trying to match any number, any lowercase letter, a period, or a hyphen.
([\da-z\.-])
Adding the + after the bracket matches this expression 1 or more times.
([\da-z\.-]+)
So guess what? We're pretty much done. So let's step through our URL and decipher the pattern one group at a time:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

This first section matches http://, https://, or no protocol. The ^ means the string should start with this. Since the ? comes after the group attached to the ^ anchor, this group remains optional.

/^(https?:\/\/)?

Example Matches: 
http://
https://

Adding the second section matches any number, any lowercase letter, a period, or a hyphen 1 or more times.

/^(https?:\/\/)?  ([\da-z\.-]+)

Example Matches: 
http://www.facebook
http://w1w.face-book
https://www.facebook
https://w1w.face-book
www
w1w
w-w

Between the second and third groups there is an escaped period, meant to be a literal period. The third group matches any lowercase letter or period a minimum of 2 times and a maximum of 6. This is where the top-level domain is matched.

/^(https?:\/\/)?([\da-z\.-]+)  \.  ([a-z\.]{2,6})

Example Matches: 
http://www.facebook.com
http://w1w.face-book.com
https://www.facebook.com
https://w1w.face-book.com
www.com
w1w.co
w-w.paint
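To see which part of the string each group ends up with, the sketch below runs the partial pattern through exec() and reads the captures. The variable names are only for illustration. Note how the second group first grabs as much as it can and then gives characters back so the literal period and the third group can still match.

```javascript
const partial = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})/;

// Index 0 is the full match; 1 to 3 are the groups.
const [, protocol, host, tld] = partial.exec("https://www.facebook.com");
console.log(protocol); // "https://"
console.log(host);     // "www.facebook"
console.log(tld);      // "com"

// When the optional protocol group does not take part, its capture is undefined.
const noProtocol = partial.exec("w1w.co");
console.log(noProtocol[1]); // undefined
console.log(noProtocol[2]); // "w1w"
console.log(noProtocol[3]); // "co"
```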

This last group uses a bracket expression to match a forward slash, any word character (letters, digits, or underscore), a space, a period, or a hyphen 0 or more times. And the group itself can match 0 or more times, meaning this section of the URL doesn't have to exist. But if it does, let's make sure it only uses specific characters.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})  ([\/\w \.-]*)*

Example Matches: 
http://www.facebook.com/peperoni
http://w1w.face-book.com/90AzaZ
https://www.facebook.com
https://w1w.face-book.com.pizza
www.com/poAzaZ
w1w.co.uk
w-w.paint/brush

After the last group we have \/?$. This matches 0 or 1 forward slashes at the end of our string.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*  \/?$/
Example Matches: 
http://www.facebook.com/peperoni/
http://w1w.face-book.com/90AzaZ
https://www.facebook.com
https://w1w.face-book.com.pizza/
www.com/poAzaZ
w1w.co.uk
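Putting it all together, here is a minimal sketch that wraps the full pattern in a function and runs it against the example matches above. The names isValidUrl, shouldMatch, and shouldFail are hypothetical, and the failing inputs are additions chosen to show where the pattern draws the line.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// isValidUrl is a hypothetical helper name used for illustration.
function isValidUrl(candidate) {
  return typeof candidate === "string" && urlPattern.test(candidate);
}

const shouldMatch = [
  "http://www.facebook.com/peperoni/",
  "http://w1w.face-book.com/90AzaZ",
  "https://www.facebook.com",
  "https://w1w.face-book.com.pizza/",
  "www.com/poAzaZ",
  "w1w.co.uk",
];

const shouldFail = [
  "http://",            // protocol only, no domain
  "ftp://example.com",  // only http and https are allowed
  "example.c",          // top-level domain shorter than 2 characters
  "HTTP://EXAMPLE.COM", // the pattern is lowercase only, it has no i flag
];

for (const url of shouldMatch) console.log(url, isValidUrl(url)); // true for each
for (const url of shouldFail) console.log(url, isValidUrl(url));  // false for each
```

If uppercase host names should pass, one option is to add the i flag after the closing slash of the literal.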
Thank you for taking the time to read this article! I hope this helped you understand how this URL Validation Regex works.
Author

Daniel Stark (Revivedaniel)