LPI 010-160 – Searching and Extracting Data from Files and Archiving Part 2
- Basic Regular Expressions
Let’s talk about how we can search, extract and archive information from data files. To do this we’re going to use what’s known as a regular expression. Regular expressions are a way to describe patterns that a user might want to look for inside of files. Lots of different Linux programs use regular expressions as a tool for expressing these patterns, so we can search for things not just by the word itself, but by the pattern. When we look at regular expressions, I want you to think about the concept of wildcards that we talked about before.
Where we used wildcards to look for file names, we’re going to use regular expressions to search within those files. Now at their simplest, regular expressions can be plain text without any extra symbols. For example, if I wanted to search a file for every place where the word Jason appears, I would just type Jason and I’d be able to find it. But what if I want to find things that are variations on that? That’s where regular expressions really help us out. There are certain characters that we can use to denote the different patterns we want to look for.
Now we have two different forms of this. There are basic regular expressions and extended ones. The form that we’re going to use depends on the program that we’re using to do our searches. Some programs accept just one form, but others will let you use both the basic and the extended form. Now the simplest type of regular expression is an alphanumeric or alphabetic string, something like HWaddr, which stands for hardware address, or Linux3.
Now these regular expressions will match any string of the same size or longer that contains that regular expression. So if I search for HWaddr, I might find HWaddr, but I also might find this is the HWaddr or the HWaddr is unknown. All three of these contain the string and therefore they’re going to be found. But the real strength of regular expressions comes from the non-alphanumeric characters, which activate more advanced matching rules.
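To make the plain-text case concrete, here is a minimal sketch using grep. The file sample.txt and its four lines are made up for illustration, and the -i option (ignore case) is an extra that the lesson itself does not cover.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample lines; any text file would do.
printf '%s\n' \
  'Jason' \
  'Ask Jason about the logs' \
  'jasonsmith logged in' \
  'nothing to see here' > sample.txt

# A plain string is already a regular expression: it matches every line
# that contains it, no matter how long the line is.
grep 'Jason' sample.txt

# grep is case-sensitive by default; -i also finds the lowercase "jason".
grep -i 'jason' sample.txt
```

The first search returns the two lines with a capital J, and the second one returns three lines.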
Now when we talk about these basic but powerful regular expression features, there are lots of different special characters we can use. The first one is known as a bracket expression. This is where you enclose characters in square brackets, and it will match any one character from within the brackets at that position in the search term. So let’s say I had something like b[aeiou]g and I ran a search. What words would I find? Well, I would find bag and beg and big and bog and bug, because each of those takes one letter out of that bracket and puts it in the middle of the word between b and g. The next thing we can look at is a range expression. Now, a range expression is a variant on that bracket expression.
Instead of listing every character or number that I want to look for, in a range expression I just give the starting and ending points separated by a dash. For example, maybe I’m looking for all of the combinations that start with a, end with z, and have the number two, three or four in between. So I would write a[2-4]z, and that matches a2z, a3z and a4z. That’s the idea. Next, we have the dot, which represents any single character. So if I’m doing something like a.z, I can find everything that has an a, then a z, with exactly one character, such as a letter or a number, in between.
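The bracket, range and dot expressions can be tried out with a short sketch like this one. The word list in words.txt is invented purely for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical word list, one word per line.
printf '%s\n' bag beg big bog bug bzg a2z a3z a4z a5z abz a-z az > words.txt

# Bracket expression: exactly one of the listed vowels between b and g.
grep 'b[aeiou]g' words.txt   # bag beg big bog bug, but not bzg

# Range expression: one character from 2 through 4 between a and z.
grep 'a[2-4]z' words.txt     # a2z a3z a4z, but not a5z

# Dot: any single character between a and z.
grep 'a.z' words.txt         # a2z a3z a4z a5z abz a-z, but not az
```

Notice that az does not match a.z, because the dot needs one character to be there.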
The next thing we have is for when we’re looking through log files and we need to know where the start of a line is and where the end of a line is. There are special characters for that too. If I use a caret (^), that matches the start of a line. If I use a dollar sign ($), that matches the end of a line, and that will help me break apart those logs and find the information I need. Another thing you might use is repetition. You can have a full or partial regular expression followed by a special symbol that denotes repetition of the matched item. Specifically, I might use an asterisk (*), which denotes zero or more matches of the item before it.
So the asterisk can be combined with a dot, as in dot star (.*), and this will match any substring at all, including an empty one. Next, we have all these special characters, like the dot that represents any character. But what if I’m trying to find an actual dot inside the text I’m looking for? For instance, maybe I’m looking for the file name filename.txt. If I want to search for a dot or another special character, I have to escape it first. If you remember, in our previous lesson we talked about the fact that you can escape a character by using a backslash. So in this example, I would type filename\.txt to search for filename.txt within my log files.
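Here is a small sketch of the line anchors, the dot star combination and the escaped dot. The log file app.log and its contents are hypothetical, written just to give each pattern something to hit and something to miss.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical log lines.
printf '%s\n' \
  'error: disk full' \
  'no error: all good' \
  'opened filename.txt' \
  'opened filenameXtxt' \
  'saved filename.txt' > app.log

grep '^error' app.log          # only the line that starts with "error"
grep 'txt$' app.log            # the three lines that end with "txt"
grep 'opened.*txt' app.log     # "opened", then anything, then "txt"
grep 'filename\.txt' app.log   # escaped dot: a literal "." only
grep 'filename.txt' app.log    # unescaped dot: also matches "filenameXtxt"
```

Comparing the last two commands shows why the backslash matters: without it, the dot quietly accepts any character in that position.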
- Archiving Files
Let’s talk about file archiving. When you’re archiving files, you need to use a tool, and a file archiving tool simply collects a group of files into a single package file that you can easily move around on a single system, back up to a recordable DVD, a flash drive or other removable media, or transfer across the network. Linux supports several archiving commands, but the most prominent of them are tar and zip.
The tar program’s name actually comes from the term tape archiver. Regardless of its background, tar is still used nowadays to back up or archive your data. It’s a very popular tool that’s used to collect various data files into a single file called an archive file. And when it does this, it keeps the original files on the disk as well. This can result in a very large archive file, though, and because of that, we want to compress it. We can use the tar program to do that too, and the file it creates is what’s known as a tarball.
Now, in fact, these tarballs are often used to transfer lots of files from one computer to another in a single step, just like when we’re trying to distribute source code. So if you’re downloading a lot of programs, you may find that they come as tarball files. The tar program is a very complex package with lots of different options, but most of what needs to be done with the utility can be covered with just a few commands or options. When you’re running tar, you use one command with at least one qualifier or option. Assuming the tar package is installed in the distro being used, you can check the tar man page for more information by typing man tar. If the tar package isn’t installed in your distro, you’re going to have to install it first.
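As a rough sketch of the few tar commands that cover most day-to-day use, the following creates, lists and extracts an archive. The directory and file names (project, a.txt, b.txt, restore) are placeholders chosen for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical files to archive.
mkdir project
echo 'first file' > project/a.txt
echo 'second file' > project/b.txt

tar -cf project.tar project      # c = create, f = name of the archive file
tar -tf project.tar              # t = list what is inside the archive

mkdir restore
tar -xf project.tar -C restore   # x = extract, -C = extract into this directory
ls restore/project
```

After this runs, the originals in project are still on the disk, and restore/project holds the extracted copies.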
Once we go back into the command line environment, I’ll show you how we can compress and uncompress files to and from a tarball. Next, we have programs that compress individual files. These are programs like gzip, bzip2, and xz. The result is a file with a name that’s like the original, but with a new extension added to it that tells you which format was used to compress it. Now, most programs you’re going to use in Linux aren’t able to read a compressed file natively. Instead, you have to uncompress the file first, and then other programs will be able to read it. So let’s take a look at some common file names that you’re going to find for files compressed with one of these programs. If somebody compresses a file with gzip, you’re going to see it with a .gz extension.
This can be uncompressed using the gunzip program. If you have bzip2 compression, you’re going to see .bz2 as your extension, and it can be uncompressed using bunzip2. If somebody compresses something with xz, you’re going to see the extension .xz, and to uncompress it, you’re going to use unxz. Now, the tar program provides explicit support for all three of these compression standards as well. And tarballs often have their own file extensions that indicate the type of compression that was used. So if you see .tgz, this is a tarball that was compressed with gzip. If you see .tbz, .tbz2 or .tb2, this is a tarball that was compressed with bzip2. If you see .txz on the tarball, this means the tarball was compressed with xz.
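The compression tools and tar’s built-in compression support can be sketched like this. Only gzip is actually run here, on the assumption that it is installed; bzip2 and xz are shown in comments because they may not be present on every system. The file names are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data to compress.
mkdir project
echo 'some data' > project/data.txt
cp project/data.txt notes.txt

gzip notes.txt        # replaces notes.txt with notes.txt.gz
ls
gunzip notes.txt.gz   # restores notes.txt and removes the .gz file
# bzip2/bunzip2 (.bz2) and xz/unxz (.xz) are used the same way.

tar -czf project.tgz project   # z = compress the tarball with gzip
tar -tzf project.tgz           # list the compressed tarball
# tar uses -j instead of -z for bzip2, and -J for xz.
```

Note that gzip swaps the original file for the compressed one, while tar leaves the project directory in place.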
- Data Search and Extraction
In this lesson we’re going to learn how to use grep to search for data within files. So what I’m going to do first is use a file that’s sitting in my Documents directory. If I run ls, you’ll be able to see what files I have. And I have one known as fruitstand.txt. Now fruitstand.txt is simply a file that contains a list of names, fruit, and a dollar amount. So if I want to display that on the screen, I can just type cat and then fruitstand.txt. Now one thing to note here is how quickly I was able to type fruitstand. The way I did that is I used a feature known as autocomplete in Linux. When you’re typing a command or a file name or a directory, if you hit the Tab key and the shell can work out what you mean, it will try to autocomplete it for you. So in this case I only typed in cat and then fr from fruit, hit Tab, and it filled in the rest of the file name for me.
And now I can just hit Enter. It’s a nice little shortcut that you’ll get very comfortable using. So here you see on the screen I have this column that has the names of people. The second column has the type of fruit that they wanted, and the third column has the amount of money that they spent on that fruit. Maybe I was collecting lunch orders. I was going to go to the grocery store and pick up a lot of fruit for everybody. So I might take this list with me. Now if I wanted to answer some simple questions like how many melons were ordered, I can do that by looking through this and going, okay, on the third line there were watermelons, and there were twelve. If I look down further I see rock melons. There are two of those, but I have to kind of search for it myself, right? Well, if I use grep, I can actually have it search for me. To use grep, you’re just going to type in grep and then the expression that you’re looking for, in my case the word melon, and then whatever file you want to search in, in my case fruitstand.txt. And again I used that autocomplete with the Tab key there. If I hit Enter, I should see two things returned to me.
I should see Mark for watermelons and Oliver for rock melons. Let’s see if that happens. And there you go. You’ll notice also that it highlighted the word melon because that’s what I was searching for. That m-e-l-o-n is now shown in red because it was found on those two lines. Now this is helpful, but we can actually go a little bit further. Maybe I wanted to know what line numbers those were inside my file, so I can know which order number it was. Well, I can do that by simply modifying the command and adding -n to it. Now, I can type this whole thing out again by typing in grep -n and then melon and then fruitstand.txt if I wanted to. And that would work. Or, as another shortcut, you can hit the up arrow, and that will bring back the last command you entered. If you keep hitting the up arrow, it’ll keep going back through the history of the previous commands you entered.
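If you want to follow along, here is a sketch of the two melon searches. The file below is a hypothetical stand-in for fruitstand.txt: the names, the fruit and the line order follow what is described in the lesson where it says so, but Betty’s fruit and most of the amounts are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical reconstruction of fruitstand.txt (name, fruit, amount).
cat > fruitstand.txt <<'EOF'
Fred apples 20
Susie oranges 5
Mark watermelons 12
Robert pears 4
Terry oranges 9
Lisa peaches 7
Tim oranges 12
Matt grapes 25
Ann mangoes 6
Greg pineapples 3
Oliver rock melons 2
Betty bananas 8
EOF

# Print every line that contains "melon".
grep 'melon' fruitstand.txt

# Same search, with -n adding the line number in front of each match.
grep -n 'melon' fruitstand.txt
# 3:Mark watermelons 12
# 11:Oliver rock melons 2
```

The red highlighting mentioned in the lesson comes from grep’s color output, which many distros switch on through a shell alias, so it may or may not appear on your system.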
So here I have the last command I entered, grep melon fruitstand.txt. If I arrow over, I can then go back and type in the -n, and that way I had a lot less typing to do. Then we’ll hit Enter. And you can see here on the third line we had Mark, and on the 11th line we had Oliver. So that’s the basic way to use grep. grep is basically a search command. I want to search for the word melon within the file fruitstand.txt, or I want to search for the word melon within that file and have it tell me what line number it was on. But we can get a lot more creative when we’re using grep, because grep also supports regular expressions. And so if I want to go ahead and slice and dice this data file in a bunch of different ways, I can do that. So let’s go ahead and try it. What we want to use here is -E, with a capital E. This tells grep to use extended regular expressions. And then I’m going to use the quote mark to say what I’m searching for. I want to search for [aeiou], which are the vowels, and I want to find where two of those vowels appear in a row. What that means is if there is something like ae or oe, where I have two vowels next to each other, it’s going to come back and report that line to me. And then I have fruitstand.txt as my file.
So if I run this command, which lines do you think are going to pop up because they have two vowels next to each other? Well, the first one is the second line, which is Susie, because Susie has an ie in it. Those are two vowels sitting next to each other. The next one is on the fourth line, it’s Robert, because Robert has pears, p-e-a, and that ea is two vowels next to each other. Then I have Lisa with peaches, because, again, ea. If I go down to Ann and mangoes, I have an oe. And if I go down to Greg with pineapples, I have an ea.
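One way to write the two-vowels search is sketched below. The exact pattern typed in the lesson is not spelled out, so [aeiou]{2} is an assumption: a bracket expression followed by the {2} repetition count from extended regular expressions. The file is a hypothetical stand-in for fruitstand.txt, with Betty’s fruit and most amounts invented.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical reconstruction of fruitstand.txt (name, fruit, amount).
cat > fruitstand.txt <<'EOF'
Fred apples 20
Susie oranges 5
Mark watermelons 12
Robert pears 4
Terry oranges 9
Lisa peaches 7
Tim oranges 12
Matt grapes 25
Ann mangoes 6
Greg pineapples 3
Oliver rock melons 2
Betty bananas 8
EOF

# -E enables extended regular expressions.
# [aeiou]{2} means: one lowercase vowel, immediately followed by another.
grep -E '[aeiou]{2}' fruitstand.txt
```

With this sample file the matching lines are Susie, Robert, Lisa, Ann and Greg. The pattern is case-sensitive, so a capital O as in Oliver does not count as one of the listed vowels.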
And so if I hit Enter, I’m going to find all of those. And here we go. We’ve got Susie with the ie, Robert with the ea, Lisa with the ea, the mangoes with oe, and Greg’s pineapples with ea. And you can see how we sliced and diced that and found a particular data set using a regular expression. Now, we can take this a step further and find other types of data. For example, maybe I wanted to find anything that has a two in it that isn’t at the end of the line. So, for instance, I want to find things like Fred, who has 20 apples, where the two isn’t the last thing on the line. But Mark, who has twelve, I don’t want that to show up. Or Tim’s oranges, which has twelve, I don’t want that to show up. Or Oliver with his rock melons, I don’t want that one to show up even though it’s got a two, because the two is the last thing.
So instead, what I’m trying to find here is Fred’s apples and Matt’s grapes, because both of those are 20-something, not two or twelve. So if I wanted to do that, I again can use grep with -E, that dash capital E, for an extended expression. And in this case, I’m going to type quote, 2.+, quote, and that says I want anything that has a number two in it that is not at the end of the line, because the dot matches any single character and the plus means the preceding item matches one or more times. So here I want the number two to match, but I don’t want it to be at the end of my line. Basically what I’m looking for is a two followed by at least one more character.
That’s all I’m looking for. And then we’re going to search for that inside of fruitstand.txt. Okay. And again, this works because the dot is a wildcard for a single character when we are using a regular expression. So let’s go ahead and hit Enter. And there we go. We found Fred, who has 20 apples, and Matt, who has grapes, and he has 25. And again, that two is not the last thing. It’s the second-to-last character here. All right, that’s a good example as well. Let’s take this another step further. Let’s say that I now want to find all the lines that have two as the last number. So now I want to find things like Mark’s watermelons, Tim’s oranges, and Oliver’s rock melons. Well, again, I can use grep.
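The search for a two that is not at the end of the line can be reproduced with a sketch like this. The file is a hypothetical stand-in for fruitstand.txt; Fred’s 20, Matt’s 25 and the amounts ending in two come from the lesson, while the other amounts and Betty’s fruit are invented.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical reconstruction of fruitstand.txt (name, fruit, amount).
cat > fruitstand.txt <<'EOF'
Fred apples 20
Susie oranges 5
Mark watermelons 12
Robert pears 4
Terry oranges 9
Lisa peaches 7
Tim oranges 12
Matt grapes 25
Ann mangoes 6
Greg pineapples 3
Oliver rock melons 2
Betty bananas 8
EOF

# 2   a literal two
# .   any single character
# +   one or more of the preceding item (here, the dot)
# Together: a two with at least one more character after it on the line.
grep -E '2.+' fruitstand.txt
```

With this sample file only Fred and Matt come back. Keep in mind that an amount like 22 would also match, because its first two is followed by another character.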
I’ll use -E again, and this time I’m going to use 2$. What that dollar sign says is that this has to match at the end of the line. The dollar sign is a special character in a search: anytime you have the dollar sign, it means the end of the line. So we’re looking for a two and then the end of the line. And then we have fruitstand.txt. And there we go. We have Mark, Tim and Oliver, because the two is the last thing on each of those lines. So you can see how we can start slicing and dicing this data, right? Now let’s say that I want to find anything that has either is, go, or or in it. If I want to find those three things, I can do that again using grep and -E.
And this time I’m going to look for is|go|or. By using that pipe, the vertical bar, which is Shift plus the backslash key, right above your Return or Enter key, you can say I want is or go or or. I’m looking for any of those two-letter combinations within my list up there. And then again we’re going to search inside fruitstand.txt. Which ones are we going to find? Well, let’s find out. Here we have Susie’s oranges because of or, Terry’s oranges because of or, Lisa has the is, Tim’s oranges, and Ann’s mangoes. And so those are the ones that get pulled out based on finding those letter combinations. Now, I know these are all kind of silly examples, but think about the fact that you might have a log file that has hundreds or thousands or hundreds of thousands of lines in it. Being able to slice it and dice it to find the keywords, key phrases or other things that you need out of it is really important.
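The end-of-line search and the is, go, or or search can both be sketched against the same kind of file. As before, this fruitstand.txt is a hypothetical reconstruction, with Betty’s fruit and most of the amounts invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical reconstruction of fruitstand.txt (name, fruit, amount).
cat > fruitstand.txt <<'EOF'
Fred apples 20
Susie oranges 5
Mark watermelons 12
Robert pears 4
Terry oranges 9
Lisa peaches 7
Tim oranges 12
Matt grapes 25
Ann mangoes 6
Greg pineapples 3
Oliver rock melons 2
Betty bananas 8
EOF

# $ anchors the match to the end of the line: the two must be the last character.
grep -E '2$' fruitstand.txt

# | means "or" in extended regular expressions: match is, go, or or.
# The quotes keep the shell from treating | as a pipe between commands.
grep -E 'is|go|or' fruitstand.txt
```

With this sample file the first search returns Mark, Tim and Oliver, and the second returns Susie, Terry, Lisa, Tim and Ann.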
And that’s what these regular expressions allow you to do. For example, let’s say you were searching a list and you wanted to find all the people whose names start with A through L. Could you do that? Well, certainly, because you can use a grep command with a regular expression. In this case, what we want to use is the caret. The caret matches the beginning of the line. So what is the line going to start with? Well, since we know that in our data format the name is the first word, we want the line to start with a certain letter. If the line starts with a letter from A through L, written as the range [A-L] right after the caret, that gives us names like Fred and Lisa and Ann and Greg and Betty. Those are the names we want to find, the first half of the alphabet. We’ll type that, close it with the quote mark, and then we’ll add fruitstand.txt.
Let’s see if we find those answers. And there we go. We have Fred and Lisa and Ann and Greg and Betty because those are all in the first half of the alphabet. Now maybe I wanted to go back and find everything that was in the second half of the alphabet. So that would be M through Z. Again I just hit the up arrow, arrowed over, changed the range and hit Enter. And there’s the second half of the alphabet. Now, where this gets really useful is we can start sending this data out to other files. For instance, if I took this first half, I can then use the greater-than sign, that angle bracket pointing to the right, which is going to redirect this output to a file. And I can call this alnames.txt. If I hit Enter, nothing shows up on the screen. Why is that? Because the output, instead of going to the screen, went to the file that we just created. And so if I do a directory listing, you’ll now see alnames.txt is listed there. And if I do a cat of alnames.txt, we should have Fred, Lisa, Ann, Greg, and Betty, just like we did when we output that to the screen. And there you go.
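Here is a sketch of the two alphabet searches and the redirect to a file. The fruitstand.txt below is a hypothetical reconstruction (Betty’s fruit and most amounts are invented), the output file name alnames.txt is a best guess at the name used in the lesson, and setting LC_ALL=C is an added precaution so the letter ranges behave as plain ASCII ranges in every locale.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: keeps [A-L] a plain uppercase ASCII range

# Hypothetical reconstruction of fruitstand.txt (name, fruit, amount).
cat > fruitstand.txt <<'EOF'
Fred apples 20
Susie oranges 5
Mark watermelons 12
Robert pears 4
Terry oranges 9
Lisa peaches 7
Tim oranges 12
Matt grapes 25
Ann mangoes 6
Greg pineapples 3
Oliver rock melons 2
Betty bananas 8
EOF

# ^ anchors to the start of the line, [A-L] is any one letter from A through L.
grep -E '^[A-L]' fruitstand.txt

# The second half of the alphabet.
grep -E '^[M-Z]' fruitstand.txt

# > sends the output to a file instead of the screen (overwriting the file).
grep -E '^[A-L]' fruitstand.txt > alnames.txt
ls
cat alnames.txt
```

The redirected command prints nothing, and cat then shows the same five lines that the first search printed to the screen.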
You can see that data is now in that file. And so you can save this and pull out from a big log of hundreds of thousands of entries maybe the five or ten people that you’re looking for based on certain search criteria, okay? So that gives you an idea of how you can start pulling things out based on letters, character sets or whatever else it is that you’re trying to pick out of this information. Let me show you one more thing that I think is important. And this is where you’re going to be able to say, I don’t want these characters. So in this case, let’s say I run grep and I want anything where the beginning of the line doesn’t have the letter F, because maybe I don’t like people named Frank. I start with the caret for the beginning of the line, and then inside the brackets I use the caret again. The second time, when the caret is the first thing inside the brackets, it says that I don’t want these characters.
So in this case, I don’t want people whose names start with F, and maybe I don’t want L for Lisa and I don’t want T for Terry. Okay? So if I do that and then I put in fruitstand.txt, let’s see what I get. This time I should get everybody’s names except the ones that start with F, L, or T. And so if we hit Enter, you can see I have Susie, Mark, Robert, Matt, Ann, Greg, Oliver, and Betty, because none of those names start with F, L, or T. And so I can exclude that information from my search. Now where this becomes really helpful again is if I’m going through logs as a cybersecurity analyst and trying to find anybody who logged into the system between 2:00 a.m. and 3:00 a.m., or everybody who didn’t log into the system between 2:00 a.m. and 3:00 a.m. You can find these things out by including people or excluding people based on your search criteria. So this is the way that you use things like grep.
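The exclusion search can be sketched as follows. The pattern ^[^FLT] is inferred from the description of F, L and T being excluded, and the fruitstand.txt below is again a hypothetical reconstruction with some invented values.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical reconstruction of fruitstand.txt (name, fruit, amount).
cat > fruitstand.txt <<'EOF'
Fred apples 20
Susie oranges 5
Mark watermelons 12
Robert pears 4
Terry oranges 9
Lisa peaches 7
Tim oranges 12
Matt grapes 25
Ann mangoes 6
Greg pineapples 3
Oliver rock melons 2
Betty bananas 8
EOF

# The first ^ (outside the brackets) anchors to the start of the line.
# The second ^ (first thing inside the brackets) negates the set:
# the first character may be anything except F, L, or T.
grep -E '^[^FLT]' fruitstand.txt
```

With this sample file, Fred, Terry, Lisa and Tim drop out and the other eight lines are printed. Tim is excluded along with Terry because his name also starts with T.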
And again, this is just a very quick introduction. I would recommend that you try this out yourself. If you want to try this yourself, go ahead and create a text file like I did here with fruitstand.txt, and I’ll display it on my screen again, one more time. You can type this in with whatever data set you want. Give it some names, give it some fruit, give it some numbers, and then you can start slicing and dicing it. Or if you go online, you can find a lot of data sets that are already out there, things like log files, and you can start going through and playing with them.
But I would recommend you start out with a small data set, something like this, five or ten lines, because that way you can go through and think about what it is you’re trying to do and then try to create those grep commands to output what you’re expecting, just like we did here where I talked it through. I wanted the people whose names all start with A through L, then I crafted a command for that, and then we executed it. So that’s the way we want to do this kind of stuff. Now, one more thing. As you’re playing with grep, if you ever get lost or confused, one of the best things you can do is check the man pages. So just type man and then grep and hit Enter, and it will display the manual on the screen.
This will show you all of your options, your patterns and how you can use them. As you go through here, you hit the spacebar to go down a page. You’ll get more and more detail about all of this and how you can send results to the screen, save them to a file, and slice and dice these data sets in different ways, using the different types of inclusions and exclusions based on all of the regular expressions. And again, this is a great manual to read through because it will really get you comfortable with doing things like regular expressions and search.