Though the current project started as a series of posts charting my grief journey after the death of my mother, I am no longer actively grieving. Now, the blog charts a conversation in living, mainly whatever I want it to be. This is an activity that goes well with the theme of this blog (updated 2018). The Sense of Doubt blog is dedicated to my motto: EMBRACE UNCERTAINTY. I promote questioning everything because just when I think I know something is concrete, I find out that it’s not.
Hey, Mom! The Explanation.
Here's the permanent dedicated link to my first Hey, Mom! post and the explanation of the feature it contains.
Saturday, December 9, 2017
Hey, Mom! Talking to My Mother #887 - Web Scraping with Python
Hi Mom,
I may have mentioned that I am taking this Python course. It's a MOOC, actually. My primary care doctor turned me on to MOOCs a couple of years ago.
Here's my most recent Python program from the third of the Python For Everybody courses. Strictly speaking, I am not supposed to share code. If someone taking the course were to find this, they could post it as their assignment without doing the work. However, I think it's unlikely that the code will be found and used in this way. My blog does not come up easily in Google searches, I have found. I have experimented with this searching. Even when I put the NAME of my blog in the search, I do not always find the entry I am seeking. I can search "sense of doubt blog anger" to try to get to yesterday's blog post, and it does not appear in the top ten results. I have to force the search with "sense of doubt blog spot anger" to get just the main page, not even the specific page of yesterday's entry.
Anyway...
This Python program is related to such searching. It's a web-scraping program. It uses a Python library called Beautiful Soup to parse the URL links on web pages.
But to keep it tricky, Dr. Chuck added some wrinkles. The program asks for a count and a position and starts from a given link, such as
http://py4e-data.dr-chuck.net/known_by_Fikret.html
With a count of 4 and a position of 3, the program extracts the third web link on the page, feeds that link back to the parser to retrieve a new web page, then extracts the third link from that page, and repeats this process as many times as the count specifies, which in the test case is four times.
Here's sample output of the tester materials.
$ python3 solution.py
Enter URL: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Enter count: 4
Enter position: 3
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html
Official comments from the course code are marked with # and appear in the gold font.
My comments are marked with # and appear in the normal white font.
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4
# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Some materials have to be imported for the program to work, such as the aforementioned Beautiful Soup library and three modules from Python's urllib library: request, parse, and error.
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
# SSL certificates give funny errors, so the above code circumvents this problem.
# User input below
url = input('Enter URL: ')
count = int(input('Enter Count: '))
position = int(input('Enter Position: '))
# Below, a while loop runs until the count runs out. Using the "urlopen" function, the URL from the user input is opened and the data is read into the html variable as bytes. Beautiful Soup is called on the html variable to parse the contents. Soup is called again to get the anchor tags, which retrieves all the links, one for each anchor tag, and puts them all in a list object called tags. You will see I commented out some of my test print statements that I used to make sure I was getting the right data.
# I retrieve from tags the link at the position I want, which will be "position-1" because we start counting at 0, as in tags[0] is the first link. I call the get method on the tag at that position to extract just the web address, the URL, and I put that string in the url variable, overwriting the previous one, which the first time through is the user input. I count off one extraction by decrementing the count variable, and the program loops back and checks the "while" condition. If we still have a positive count, we do another, and so on until we run out of count.
while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Retrieve all of the anchor tags
    tags = soup('a')
    # print(tags)
    print('Retrieving: ', tags[position-1])
    # print(tags[position-1].get('href', None))
    url = str(tags[position-1].get('href', None))
    count = count - 1
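By the way, here's a tiny toy example of my own, not part of the assignment, just to show what Beautiful Soup hands back: the soup('a') call returns a list of anchor tags, and calling get('href') on one of them pulls out just the URL string. The little page and the link names here are made up by me.
# A little sketch of my own (the HTML and link names are invented) showing how
# Beautiful Soup turns anchor tags into a list and how get('href') extracts
# just the address.
from bs4 import BeautifulSoup
sample_html = """
<a href="http://py4e-data.dr-chuck.net/known_by_A.html">A</a>
<a href="http://py4e-data.dr-chuck.net/known_by_B.html">B</a>
<a href="http://py4e-data.dr-chuck.net/known_by_C.html">C</a>
"""
soup = BeautifulSoup(sample_html, 'html.parser')
tags = soup('a')                  # a list with one entry per anchor tag
print(tags[2])                    # the whole third anchor tag
print(tags[2].get('href', None))  # just its URL, ending in known_by_C.html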
This program would be MUCH harder to write without these useful libraries: Python's urllib and the cool Beautiful Soup, created by Leonard Richardson at crummy.com. There's cool stuff at the crummy site, such as a blog, a zine, and documentation.
https://www.crummy.com/software/BeautifulSoup/
That is all.
I am having fun with coding.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Reflect and connect.
Have someone give you a kiss, and tell you that I love you, Mom.
I miss you so very much, Mom.
Talk to you tomorrow, Mom.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- Days ago = 889 days ago
- Bloggery committed by chris tower - 1712.09 - 10:10
NEW (written 1708.27) NOTE on time: I am now in the same time zone as Google! So, when I post at 10:10 a.m. PDT to coincide with the time of your death, Mom, I am now actually posting late, so it's really 1:10 p.m. EDT. But I will continue to use the time stamp of 10:10 a.m. to remember the time of your death, Mom. I know this only matters to me, and to you, Mom.