joshnck/Regex_Capstone

## Regex_Capstone
There comes a time in every engineer's life when they ask themselves, "Should I use regex for this?" As often as possible, I try to answer that question with "no" and sometimes "No! Absolutely not!" Today is one of those days where I went against my better judgement and attempted to parse Sysmon logs using regex instead of just using the built-in `Splunk for Windows` app.

The Data:
```
<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'><System><Provider Name='Microsoft-Windows-Sysmon' Guid='{5770385f-c22a-43e0-bf4c-06f5698ffbd9}'/><EventID>1</EventID><Version>5</Version><Level>4</Level><Task>1</Task><Opcode>0</Opcode><Keywords>0x8000000000000000</Keywords><TimeCreated SystemTime='2023-07-04T00:32:35.4466929Z'/><EventRecordID>6695055</EventRecordID><Correlation/><Execution ProcessID='4528' ThreadID='5312'/><Channel>Microsoft-Windows-Sysmon/Operational</Channel><Computer>HillarysEmails</Computer><Security UserID='S-1-5-18'/></System><EventData><Data Name='RuleName'>-</Data><Data Name='UtcTime'>2023-07-04 00:32:35.440</Data><Data Name='ProcessGuid'>{be777238-68a3-64a3-b7a5-020000007c00}</Data><Data Name='ProcessId'>12984</Data><Data Name='Image'>C:\Program Files\SplunkUniversalForwarder\bin\splunk-powershell.exe</Data><Data Name='FileVersion'>-</Data><Data Name='Description'>-</Data><Data Name='Product'>-</Data><Data Name='Company'>-</Data><Data Name='OriginalFileName'>-</Data><Data Name='CommandLine'>"C:\Program Files\SplunkUniversalForwarder\bin\splunk-powershell.exe"</Data><Data Name='CurrentDirectory'>C:\WINDOWS\system32\</Data><Data Name='User'>NT AUTHORITY\SYSTEM</Data><Data Name='LogonGuid'>{be777238-5370-6489-e703-000000000000}</Data><Data Name='LogonId'>0x3e7</Data><Data Name='TerminalSessionId'>0</Data><Data Name='IntegrityLevel'>System</Data><Data Name='Hashes'>MD5=5D35E9914422D9C706AFE92A26C18BA4,SHA256=A6A74EBF5C9B5AEBFE110416C8E96078AFBCC4582B3F200DABB7353946A5A7F6,IMPHASH=6A5601498E7E7959885DB6B8832ECC0A</Data><Data Name='ParentProcessGuid'>{be777238-b99e-6490-c447-000000007c00}</Data><Data Name='ParentProcessId'>20076</Data><Data Name='ParentImage'>C:\Program Files\SplunkUniversalForwarder\bin\splunkd.exe</Data><Data Name='ParentCommandLine'>"C:\Program Files\SplunkUniversalForwarder\bin\splunkd.exe" service</Data><Data Name='ParentUser'>NT AUTHORITY\SYSTEM</Data></EventData></Event>
```

There are some patterns in here worth calling out. First off, the data is broken into two important parts: `<System>` and `<EventData>`. For the sake of my own sanity, I'm going to focus only on `<EventData>` for now.

The Regex:
```
Data Name\=\'(?<_KEY_>[A-Za-z]+)\'>(?<_VAL_>[^<]+)<\/Data>
```
Every field starts with `Data Name='`, so the pattern opens by matching that literal text. Next comes a named capture group, `_KEY_`, which greedily matches one or more upper- or lowercase letters; that is the field name between the single quotes. After the closing quote and the `>`, a second named capture group, `_VAL_`, greedily matches one or more characters that are not `<`, which is everything up to the closing `</Data>` tag. The backslashes are harmless: `=` and `'` never need escaping, and `/` only needs it where the pattern is written between `/` delimiters.
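To see the pattern in action outside Splunk, here is a minimal Python sketch. It assumes Python's `re` module, which spells named groups `(?P<name>...)` rather than `(?<name>...)`; the `SAMPLE` string is an abbreviated copy of the event above, and `extract_event_data` is a hypothetical helper name, not something from this repo.

```python
import re

# Abbreviated copy of the <EventData> portion of the sample event above.
SAMPLE = (
    "<EventData>"
    "<Data Name='RuleName'>-</Data>"
    "<Data Name='ProcessId'>12984</Data>"
    "<Data Name='User'>NT AUTHORITY\\SYSTEM</Data>"
    "</EventData>"
)

# Same pattern as above, minus the unneeded backslashes, using Python's
# (?P<name>...) syntax. In Python, _KEY_ and _VAL_ are ordinary group names.
DATA_FIELD = re.compile(r"Data Name='(?P<_KEY_>[A-Za-z]+)'>(?P<_VAL_>[^<]+)</Data>")


def extract_event_data(xml_text):
    """Return a {field name: value} dict for every <Data> element that matches."""
    return {
        match.group("_KEY_"): match.group("_VAL_")
        for match in DATA_FIELD.finditer(xml_text)
    }


fields = extract_event_data(SAMPLE)
print(fields)
```

One limitation this makes easy to check: `[^<]+` needs at least one character, so a `<Data>` element with an empty value would be skipped rather than captured as an empty string.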

Lessons Learned:
This was a bad use of regex! A better use of regex would be to parse the `Hashes` field, pull out the specific hash types, and drop them into new fields. Did I learn something? Yes: I learned about named capture groups and some clever, versatile ways to use them. After doing this, though, I discovered some problems with how Splunk handles regex and named capture groups, especially in the context of XML data. My primary lesson was that you should use XML parsers for XML data; for complex, structured data, regex is a last resort. A future and probably better project would be to parse the spreadsheets my company uses to track our trainings and replace all of the disparate date formats with ISO 8601.
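As a rough illustration of that lesson, the sketch below uses Python's standard-library `xml.etree.ElementTree` for the XML and saves regex for the flat `Hashes` value. The helper names `parse_event_data` and `split_hashes` are hypothetical, the sample is a trimmed copy of the event above, and this is not how Splunk does it internally.

```python
import re
import xml.etree.ElementTree as ET

# The namespace URI comes from the xmlns attribute on <Event> in the sample.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the sample event, keeping two <Data> fields.
SAMPLE = (
    "<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>"
    "<EventData>"
    "<Data Name='ProcessId'>12984</Data>"
    "<Data Name='Hashes'>MD5=5D35E9914422D9C706AFE92A26C18BA4,"
    "IMPHASH=6A5601498E7E7959885DB6B8832ECC0A</Data>"
    "</EventData></Event>"
)

# ALGORITHM=HEXDIGEST pairs, as they appear in the Hashes field.
HASH_PAIR = re.compile(r"(?P<algo>[A-Z0-9]+)=(?P<digest>[0-9A-Fa-f]+)")


def parse_event_data(xml_text):
    """Let the XML parser find each <Data> element; empty values become ''."""
    root = ET.fromstring(xml_text)
    return {
        data.get("Name"): (data.text or "")
        for data in root.findall("./e:EventData/e:Data", NS)
    }


def split_hashes(hashes):
    """Split 'MD5=...,SHA256=...' into a {algorithm: digest} dict."""
    return {m.group("algo"): m.group("digest") for m in HASH_PAIR.finditer(hashes)}


fields = parse_event_data(SAMPLE)
hashes = split_hashes(fields["Hashes"])
print(hashes)
```

The parser handles quoting, namespaces, and empty elements, so the regex only has to deal with a short, predictable string.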