Python Institute PCAP – File IO and Exception Handling in Python Part 4
- argv Command Line Arguments and the re Module
This part has to do with scripting. We talked earlier in the course about how we can run a Python program from the command line in a terminal window. Now we’re going to build a utility: a Python file that does something useful with the files in a set of directories. You’ll also get practice working with the os.walk method. If you want to accept arguments in your program from the command line, I’ll show you how to do that. For example, let’s say I have a program, application.py, that just prints something. To run it, let me open the terminal and head over to where this file is located.
It’s in my home directory, under PyCharm projects, in the app folder. So here it is, application.py. I can run it from the command line by typing python3 followed by the file name, application.py. Let’s run that, and notice it prints program. Now let’s say I wanted to give some arguments to this program, arguments that the code could use to perform various tasks. Say I wanted to run application.py with my name and my age. Obviously, if you just run it like this, the program doesn’t do anything with those additional arguments. But I can accept those arguments in the code and do things with them.
So I’ll prove it to you. To do this, you need to import sys, which stands for system. It brings in a lot of handy tools, and one of the most important is the ability to accept arguments from the terminal. To access an argument, you type sys.argv and then specify the index position of that argument. In the terminal, my name is the first argument, at index position one. The file name, application.py, is at index position zero, believe it or not. So if I print sys.argv[0] and run the same command again, notice it prints application.py, because the script name comes first in the list by default. Even though the first argument I gave to this program was my name, sys.argv[0] is the program file itself, application.py.
So officially, the arguments for a given program start at index position one, sys.argv[1]. If I print sys.argv[1] and run this again, you’ll see that it now prints the first argument, which is my name. And if I change this to sys.argv[2], hopefully you get the point: it’s going to print my age. Let’s run that again, and notice it now prints my age. So that’s how you get access to the arguments that are passed on the command line when you run a Python application.
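As a minimal sketch of the indexing just described: the helper name describe_arguments is made up for illustration and is not from the lesson, and the None fallbacks are an added safety check so the code does not fail when fewer arguments are given.

```python
import sys


def describe_arguments(argv):
    """Return (script_name, first_argument, second_argument) from an argv-style list.

    Hypothetical helper for illustration; missing arguments come back as None.
    """
    script_name = argv[0]  # when run as "python3 application.py ...", this is the script name
    first = argv[1] if len(argv) > 1 else None   # the first real argument, if any
    second = argv[2] if len(argv) > 2 else None  # the second real argument, if any
    return script_name, first, second


if __name__ == "__main__":
    # Shows whatever this script was actually started with.
    print(describe_arguments(sys.argv))
```

Note that everything in sys.argv is a string, so an age passed on the command line would need int() before it could be used as a number.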
Now let’s say you want to accept an unlimited number of arguments. For that, you can give a one here followed by a colon, sys.argv[1:], and that slice takes everything, all of the arguments after the script name. Let’s try that now. Not only that, I’m going to save this into a list called argument_list, and then I can loop through it: for arg in argument_list, print arg. Let’s test this out. I’m going to run the program again and give it a few words, starting with apple. You can’t put commas between the arguments; on the command line, arguments are separated by spaces. So this is the program we’re trying to run, and this is the first argument.
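The slice-and-loop idea can be sketched roughly like this. The name argument_list follows the spoken description; the wrapper function collect_arguments is an assumption added so the slice can be tried without a real command line.

```python
import sys


def collect_arguments(argv):
    """Return every argument after the script name (hypothetical helper)."""
    return argv[1:]  # slice from index 1 to the end; empty list if nothing was passed


if __name__ == "__main__":
    argument_list = collect_arguments(sys.argv)
    for arg in argument_list:
        print(arg)  # one argument per line, in the order they were typed
```

Run as python3 application.py apple orange, this would print apple and orange on separate lines.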
The remaining arguments can be anything, really. I’m just going to add a few random words here: fruit, orange, computer, mouse, person, car. When we run this, notice it takes all these arguments and prints them out, because that’s exactly what the code is doing: we’re looping through the argument list and printing each item. So that’s how you deal with arguments passed on the command line. Another thing I want to look at in this lesson is right here in the browser. I suggest you visit w3schools.com and open the Python regex page, python_regex.asp. You could also just search Google for Python regex and it’ll take you to this page.
I think it’s the first or second link that appears in Google. This module, called re, is used to search for things using regular expressions. If you’re not familiar with regular expressions, don’t worry. This is a handy little tutorial that you can go through to learn about them. It’s pretty straightforward. It gets tricky if you need to build really complex regular expressions, but you’re probably not going to have to do that very often. There are plenty of tools now that help you search, but basically this re library helps you search text for specific patterns. Notice that in this first example, we are searching the given text for this particular regular expression. re.search is the call; search is the function that comes from the re module. It searches for this regular expression in this text.
If the pattern is found, it returns a match object, which counts as true in an if statement; otherwise it returns None, which counts as false. You can click on Try it Yourself, and it has this little Python code where this is the text. You can change the text around and try different regular expressions using the chart further down the page, which goes over several examples. This particular example says to search the string to see if it starts with The and ends with Spain. You can of course run this, and you’ll see it says yes, we have a match.
So it was able to find this particular regex in this text: it starts with The and it ends with Spain. I think that was the requirement. It uses some pattern syntax that you can learn yourself if you need to, but this is not going to be a regex tutorial. We don’t want to spend too much time on that; it’s kind of off topic, and you can learn it on your own. We’re going to use two functions, re.search and re.findall. What findall does is give back a list of all the places where the search pattern was found in the text. In this case, The rain in Spain is the text, and every time ai is found in the text, it gets saved in a list, and that list is returned.
So let’s click on this and try it. findall is going to return a list of the matches for ai. Let’s run this. Notice it found ai in two places, and we know where they are: rain and Spain. To be specific, it doesn’t return the entire word. It just returns the text that matched the regex, once for each occurrence. So findall returns a list of all the occurrences where your regex matched in the text, and search just tells you whether there was a match at all: a match object if so, None if not. So go through this tutorial and play around with it. You don’t have to go too crazy with regex, but understand how to use re.search and re.findall, because that’s what is going to be required for the assignment that follows this lecture.
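Here is a small sketch of the two calls, using the same example text as the tutorial page. The variable names are chosen here for illustration.

```python
import re

text = "The rain in Spain"

# re.search returns a match object when the pattern is found, and None when it is not.
match = re.search(r"^The.*Spain$", text)
found = match is not None

# re.findall returns a list with one entry per non-overlapping match.
occurrences = re.findall("ai", text)

print(found)        # True
print(occurrences)  # ['ai', 'ai']
```

Because a match object is truthy and None is falsy, writing if re.search(pattern, text): behaves like the true/false check described in the lesson.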
- Section 6 Assignment
This is the assignment for section six. I’ll go over exactly what you need to do in this video. First, in this particular folder there’s a folder called Project Files. If I expand it, you’ll see there’s a folder in here called First Folder, and inside First Folder there’s another folder called Second Folder. Each of these folders has files in it, so let’s open one up. All of the files contain text like this. This is pretty much most of the work Shakespeare ever wrote, combined and put into these files. The order doesn’t really matter; I just put all that work into these files and split them up for this assignment. So understand the layout: there’s the Project Files folder, which contains some files; there’s First Folder, which is a subfolder of Project Files and contains a few more files; and there’s Second Folder, which is a subfolder of First Folder. You get the picture, right?
So this assignment has to do with walking directories. You’re going to walk the Project Files directory, which holds all these files. The assignment is for you to write code that accepts input arguments to the script we’re about to write. Notice the first argument is saved into a variable called directory_containing_files. This is the directory to search, which points to Project Files in this particular case, but it could be any directory on the machine. That’s why we’re making it a command line argument. The second variable is words_to_aggregate. This is a list of the remaining arguments that are accepted when a user runs this file. It could be any number of words: three words, ten words, 20 words, whatever. Those words are saved into the words_to_aggregate list variable. The idea is that I want to know how many times each of the words in words_to_aggregate occurs in the text files that we have in each of these folders. So your code is going to search through every single file in this directory structure, the files in Project Files itself as well as those inside First Folder and Second Folder, looking for the words that are passed as arguments. It’s going to count the number of occurrences of each of those words, and then that tabulated result needs to be printed out.
Okay, so before we make this a sys argument, I’m temporarily going to comment those lines out so that we’re not accepting any command line arguments. We’re going to hard code these values to start with, just to test the basic functionality. I was going to set directory_containing_files to os.getcwd(), the current working directory, but as a matter of fact, let’s assign that to a variable called dir and print it, so that we can see exactly where we are in this project. Let’s run this real quick. We are in my home directory, then PyCharm projects, then the assignments project, and then section six, which is the actual parent folder of this project. That’s the current working directory. Next I can change directory with os.chdir, which we already know, and change it to Project Files; make sure the name is in quotes. Now os.getcwd() is going to print something else. Let’s run this, and notice we are now in Project Files. So we changed the working directory to Project Files.
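A rough sketch of that check is below. To keep it runnable anywhere, it creates a throwaway folder first; the folder name project_files and the use of tempfile are assumptions for the demo, not part of the lesson, where the folder already exists.

```python
import os
import tempfile

original_dir = os.getcwd()  # where the script starts out
print(original_dir)

with tempfile.TemporaryDirectory() as parent:
    # Stand-in for the lesson's Project Files folder (hypothetical name).
    project_dir = os.path.join(parent, "project_files")
    os.mkdir(project_dir)

    os.chdir(project_dir)       # change the working directory
    current_dir = os.getcwd()   # now reports the new location
    print(current_dir)

    os.chdir(original_dir)      # step back out so the temporary folder can be removed
```

The printed path from os.getcwd() is the kind of string that can be pasted in, in quotes, as the hard coded directory.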
And we’re just making sure that that’s where we are. Now let’s take this entire directory location and paste it in as the value of this variable. Make sure you put it in quotes so that it’s a string; I forgot that at first. So right now we’re hard coding where those files are, and I’m assigning that directory location to directory_containing_files. For words_to_aggregate, I’m just going to say that the words for now are there, Michael, and running. Shakespeare wrote a lot of words, so I just picked a few at random. These words are going to be searched for throughout the entire directory structure, in all of the files, and at the end what I expect to see is a dictionary printed. Let me give you the expected output here in a comment: expected output of running your program. We want to see how many occurrences of the word there were found in those files. Let’s say there are something like 2000 occurrences.
I don’t know, it could be anything. For Michael, there could be 30 occurrences or something like that. And for running, there could be some other number, let’s just say 200. Some kind of output like this. As a matter of fact, just to be clear, I’ll put N here for each count, because we don’t know how many occurrences of each of these words there are going to be. But this is generally the output I expect from your code when you run it. So why don’t you start coding this project up, and you can resume to watch my solution once you’re done. Make sure to work on this on your own. Don’t just skip ahead to my solution. I really want you to struggle with it. I also talked about the re import. This is for regular expression searches, but you won’t need to get very complicated at all. You’re going to be searching for the words there, Michael, and running, or any other words, using the re library.
We’re also importing sys for the case where we want to get the command line arguments when you run this file. Right now we’re just hard coding the words and the directory location, but we’d like this code to be usable anywhere on the machine. Any person who has this file should be able to run it on their machine against a particular directory and get the counts of the words they’re interested in, across however many files are in that directory structure. But for now, I’m leaving both as hard coded values. So pause the video now, try this out, and then resume to watch my solution. Okay, welcome back. Hopefully you took your time and really worked on this assignment. First things first: I need to figure out what kind of structure this output is, and of course it’s a dictionary. So let’s create a variable that will serve as the output, and we’ll call it words_and_counts.
Okay. This will be a dictionary. It’s the output that I expect, and I’ll be filling it in with the proper counts once we figure them out. What we need to do, of course, is loop through the directory structure and get the data we need from the files, and we need to do that for each of the words in words_to_aggregate. So I can say for word in words_to_aggregate. For each of those words, I’m going to reset count to zero, because each word has its own count. Then I’m going to use the os.walk functionality that we talked about. That’ll be for path, folder_names, file_names in os.walk. We’ve seen this. We are unpacking what os.walk yields into these three variables, and we’re mostly going to be concerned with the file names.
But I’m just showing you that you have access to the folder names as well as the path. So it’s for these three variables in os.walk, and what are we walking through? We’re walking through directory_containing_files. Again, right now I’m using the hard coded location, but it could be a command line argument as well. We’ll switch that later. Now, inside of this for loop, we can concern ourselves with the files.
Every directory is going to have a number of files. So we can say for file_name in file_names. Remember, file_names is a list: on each iteration, os.walk gives us a list of the names of the files in the current folder. A bare file name is not enough for us to use the open command. We need to give the full location to open a file. So we’re going to use os.path.join, which we looked at before, and pass it the path and then the file name.
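A sketch of the walk-and-join step described above might look like this. The function name list_file_paths and the temporary demo tree are assumptions for illustration; the loop variable names follow the spoken description.

```python
import os
import tempfile


def list_file_paths(directory):
    """Return the full path of every file under directory (hypothetical helper)."""
    full_paths = []
    for path, folder_names, file_names in os.walk(directory):
        for file_name in file_names:
            # file_name is only the bare name; join it with the folder it lives in.
            full_paths.append(os.path.join(path, file_name))
    return sorted(full_paths)


# Small throwaway tree standing in for the lesson's nested folders.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "first_folder"))
    for relative in ("a.txt", os.path.join("first_folder", "b.txt")):
        open(os.path.join(root, relative), "w").close()
    found = [os.path.relpath(p, root) for p in list_file_paths(root)]

print(found)
```

The folder_names variable is unused here, just as in the lesson; it is unpacked only because os.walk yields three values per folder.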
Okay? So this joins the file name, which is just the name taken from that list, with the path. The path is that of the folder we are in during a particular iteration. We’re going to assign the result to a variable called file, and this is the full file name with the proper path and everything. Now we can use the with open statement that we talked about for opening files. We pass it file and open it in read mode, so we have to put r here. We’re not writing to these files; we just want to read through them. Then I’m going to save it as a_file. So now we are opening that file, and we’re using this variable to represent the open file. Using a_file, I can go through each of the lines. For that we need a loop, so we say for line in a_file, and then if re.search. Here is where we specify the pattern that we’re looking for, and the pattern is as simple as the particular word. So we search for the given word in the given text, and that text is this line. This re.search call returns a match object if the word appears anywhere in the line, and None if it doesn’t, so the if statement treats the result as true or false. If it is true, we go into the if block, where we can use the re.findall method: we call re.findall and specify the word we’re looking for and the line to look in.
So this method is going to give us every occurrence of the word in this line of text, and we’ll save that into a list called word_list. There could be one occurrence in a particular line or ten. Once we have all of them, we can get the length of that list with len of word_list and assign that to count. Now, we can’t just assign it. We need to increment count by the number of occurrences that were found. Then we step outside of the for loop that works through each line of the file, outside of the for loop that deals with each of the file names, and outside of the os.walk loop over the paths. By that point, count has been incremented once for every occurrence of the word that was found. Here is where we do the assignment to the variable that I created up top, words_and_counts.
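The per-line counting step can be sketched on its own, away from the file handling. The function name count_word_in_lines is made up for illustration; it takes any iterable of lines, which could be an open file.

```python
import re


def count_word_in_lines(word, lines):
    """Count occurrences of word across lines (hypothetical helper)."""
    count = 0
    for line in lines:
        if re.search(word, line):                # is the word anywhere in this line?
            word_list = re.findall(word, line)   # every occurrence in this line
            count += len(word_list)              # increment, do not overwrite
    return count


sample_lines = ["there and there\n", "nothing here\n", "therefore\n"]
print(count_word_in_lines("there", sample_lines))  # 3
```

Two things worth knowing about this approach: the word is treated as a regular expression and matched as a substring, so there also matches inside therefore, and the match is case sensitive. Wrapping the word as r"\b" + re.escape(word) + r"\b" would be one way to count whole words only, though that goes beyond what the lesson does.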
So this is where we do the final assignment. In words_and_counts, we specify the dictionary key to be the particular word that we are going through, and we assign it the value of the count variable. Of course, once it has gone through this word, the loop moves on to the next word in the words_to_aggregate list. Then count is set back to zero, and it starts incrementing again for every occurrence of that word in those files. So this looping through the entire directory structure is going to happen once for each of the words in words_to_aggregate. This is more advanced code for many of you who are new to this, so don’t worry.
Study this code. Practice with it. You have for loops nested four levels deep: the words, the folders from os.walk, the file names, and the lines. That’s just the structure we have to set up to get the data that we want. Now let’s print the result, which is just words_and_counts. By the time we get to this line of code, all of the looping is complete. We’ve looped through all of those words, we’ve got the counts, and we’ve assigned them into the dictionary, so each word has its count. Here we print that dictionary. Okay, so now let’s run this. And there we go. The number of occurrences of the word there is 2191. The number of occurrences of the word Michael is 21. The number of occurrences of the word running is 33. Let’s choose a different set of words. We’ll say hello and Peter. And we know computers didn’t exist back then, so just for the fun of it, let’s search for the word computer.
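Putting the pieces together, the solution as described might look roughly like the sketch below. It is reconstructed from the spoken walkthrough, so the exact spelling of the variable names, the wrapper function aggregate_word_counts, and the small temporary demo tree are all assumptions rather than the instructor’s actual file.

```python
import os
import re
import tempfile


def aggregate_word_counts(directory_containing_files, words_to_aggregate):
    """Count each word across every file under the directory (sketch)."""
    words_and_counts = {}
    for word in words_to_aggregate:
        count = 0  # reset for each word
        for path, folder_names, file_names in os.walk(directory_containing_files):
            for file_name in file_names:
                file = os.path.join(path, file_name)  # full path needed by open
                with open(file, "r") as a_file:
                    for line in a_file:
                        if re.search(word, line):
                            word_list = re.findall(word, line)
                            count += len(word_list)
        words_and_counts[word] = count  # final assignment, once per word
    return words_and_counts


# Tiny stand-in for the Shakespeare files so the sketch can be run anywhere.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "first_folder"))
    with open(os.path.join(root, "one.txt"), "w") as f:
        f.write("hello there\nthere is Peter\n")
    with open(os.path.join(root, "first_folder", "two.txt"), "w") as f:
        f.write("hello hello\n")
    result = aggregate_word_counts(root, ["hello", "there", "computer"])

print(result)  # {'hello': 3, 'there': 2, 'computer': 0}
```

Note that this walks the whole directory tree once per word, which mirrors the lesson. Walking once and checking every word against each line would avoid rereading the files, at the cost of a slightly different loop structure.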
Now, I got this text from the Internet. It’s a public work, and the site displays all of the texts along with some other text that may relate to the website itself. So I don’t know if they use the word computer in it, but let’s see if the word computer exists anywhere in these files. Let’s run this. And there we go. The count for hello is 60; there are 60 hellos in there. There are 41 Peters, and there are no computers, because in Shakespeare’s time there were no computers. This is also a good check that the code is working. All right, so that’s it. This is the code. Now, to change this so that it accepts arguments from the command line, all we have to do is get rid of this hard coding and assign the directory from sys.argv instead, and we can do the same for the words here.
As a matter of fact, let me comment out both of these lines in case we want to get back to this code later, and I’ll do the assignment from sys.argv down here. This gives us the ability to get the directory from the command line, and I’ll do a similar thing for words_to_aggregate. Once we have this, we don’t really need any of this other code; it was just something I wrote to find the exact location of the folder. So that’s all we really need. This file can now be run anywhere on your machine against any given directory structure. Now, for Windows users, this code will work normally. For Mac users, sometimes there’s a catch, and I think I showed you this a couple of lectures ago. There’s a hidden file called .DS_Store in your file system. If you run this code and run into some kind of issue, go to that particular directory location. Let’s go here, for example; this is my home directory.
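The switch from hard coded values to command line arguments can be sketched as follows: the directory comes from index one and the words from index two onward. The function name parse_arguments and the usage check are additions for illustration; the lesson assigns directly from sys.argv.

```python
import sys


def parse_arguments(argv):
    """Split an argv-style list into (directory, words) (hypothetical helper)."""
    if len(argv) < 3:
        # Added safety check, not in the lesson: need a directory and at least one word.
        raise ValueError("usage: python3 <script> <directory> <word> [<word> ...]")
    directory_containing_files = argv[1]  # first real argument
    words_to_aggregate = argv[2:]         # everything after the directory
    return directory_containing_files, words_to_aggregate


# Demonstrated with a literal list; in the real script this would be sys.argv.
example_argv = ["script.py", "/some/directory", "hello", "Peter", "computer"]
directory, words = parse_arguments(example_argv)
print(directory, words)
```

In the script itself, the call would be parse_arguments(sys.argv), with the two returned values feeding the counting loops.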
And if I type ls -ltrha, this gives me the details of all the files in this directory, including hidden ones. If I hit enter, notice there’s this .DS_Store. You’ll find it somewhere if you’re using the Mac operating system. If you’re not using a Mac, you won’t see it. But if you run this code on some directory on your machine and it gives you an error, that error could be because it didn’t know what to do with .DS_Store. This script is really only supposed to look at text files, things that you can open and read. Notice the r here? We’re opening the file to read it as text. If it’s some operating-system-specific file like .DS_Store, then this program might not work. So what you’ll have to do is delete that file from the given folder.
Now, I know this particular folder on my machine. Let’s go to the Project Files directory location; I have it here, so I’m going to paste it. Let’s cd over to that location and do an ls -ltrha, and you’ll see that .DS_Store is not there. Let’s navigate into First Folder and run the same command, and you won’t see .DS_Store there either. Let’s cd into Second Folder and make sure that it’s not there. Yes, and that’s why we don’t see an error. But for Mac users, if you run into some error, the first thing to check is that this file has been removed from the directory in which you’re running the script. Also make sure it’s not an important directory. It should just be a directory that you’re playing around with.
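As an alternative to deleting .DS_Store by hand, the script could skip hidden files or tolerate files that cannot be read as text. This is not what the lesson does; it is a sketch under the assumption that non-text files should simply contribute nothing to the counts. The helper names and the sample bytes are made up, and whether a real .DS_Store actually raises an error depends on its contents and the encoding in use.

```python
import os
import tempfile


def is_hidden(file_name):
    """True for dot-files such as .DS_Store (hypothetical helper)."""
    return file_name.startswith(".")


def read_text_lines(file_path):
    """Return the file's lines, or an empty list if it is not valid UTF-8 text."""
    try:
        with open(file_path, "r", encoding="utf-8") as a_file:
            return a_file.readlines()
    except UnicodeDecodeError:
        return []  # treat undecodable files as having no text


with tempfile.TemporaryDirectory() as folder:
    text_path = os.path.join(folder, "notes.txt")
    binary_path = os.path.join(folder, ".DS_Store")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("hello\n")
    with open(binary_path, "wb") as f:
        f.write(b"\xff\xfe\x00\x01")  # bytes that are not valid UTF-8
    text_lines = read_text_lines(text_path)
    binary_lines = read_text_lines(binary_path)

hidden_flags = [is_hidden(name) for name in ("notes.txt", ".DS_Store")]
print(text_lines, binary_lines, hidden_flags)
```

Inside the file loop, a line such as if is_hidden(file_name): continue would skip those files before they are ever opened.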
As a matter of fact, you could just copy this directory and move it onto your desktop, into your documents, or into your downloads folder. You can run this script anywhere on your machine, as long as you give it, as an argument, the directory location of those files, followed by the particular words that you want to aggregate. Let’s test that functionality out here on the command line. Where does this script file exist? Let’s go back up a couple of levels with cd. And here is the file, the assignment one .py script.
So let’s run this from the command line with the same approach that we used before. I’m going to type python3 and then the file name of the script. The first argument that we’re going to pass to the script is the directory location. We know that these files are in the Project Files folder, so let’s take this entire directory location and paste it here. You don’t need to put quotes around it on the command line, as long as the path contains no spaces. So I just pasted that there. The next arguments after this are the set of words. We chose hello, Peter, and computer, so let’s type them: hello Peter computer. No commas, just spaces. And it could be any number of words here.
You can just keep listing them out with a space between each of the words. Remember, words_to_aggregate accepts an unlimited number of arguments, starting at index position two onward. So you can put any number of words here, and it will run for those words. Let’s run this, and notice that we do get the expected response, right here. It’s giving us the exact same response on the command line that it gave us in PyCharm.