Though the current project started as a series of posts charting my grief journey after the death of my mother, I am no longer actively grieving. Now, the blog charts a conversation in living, mainly whatever I want it to be. This is an activity that goes well with the theme of this blog (updated 2018). The Sense of Doubt blog is dedicated to my motto: EMBRACE UNCERTAINTY. I promote questioning everything because just when I think I know something is concrete, I find out that it’s not.
Hey, Mom! The Explanation.
Here's the permanent dedicated link to my first Hey, Mom! post and the explanation of the feature it contains.
Saturday, December 9, 2017
Hey, Mom! Talking to My Mother #887 - Web Scraping with Python
Hi Mom,
I may have mentioned that I am taking this Python course. It's a MOOC, actually. My primary care doctor turned me on to MOOCs a couple of years ago.
Here's my most recent Python program from the third of the Python For Everybody courses. Strictly speaking, I am not supposed to share code. If someone taking the course were to find this, they could post it as their assignment without doing the work. However, I think it's unlikely that the code will be found and used in this way. My blog does not come up easily in Google searches, I have found. I have experimented with this searching. Even when I put the NAME of my blog in the search, I do not always find the entry I am seeking. I can search "sense of doubt blog anger" to try to get to yesterday's blog post, and it does not appear in the top ten results. I have to force the search with "sense of doubt blog spot anger" to get just the main page, not even the specific page of yesterday's entry.
Anyway...
This Python program is related to such searching. It's a web-scraping program. It uses a Python library called Beautiful Soup to parse the URL links on web pages.
But to keep it tricky, Dr. Chuck added some wrinkles. The program asks for a count and a position and starts from a given link, such as
http://py4e-data.dr-chuck.net/known_by_Fikret.html
With a count of 4 and a position of 3, the program extracts the third web link on the page, feeds that link back to the parser to retrieve a new web page, then extracts the third link from that page, and repeats this process as many times as the count specifies, which in the test case is four times.
Here's sample output of the tester materials.
$ python3 solution.py
Enter URL: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Enter count: 4
Enter position: 3
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html
Official comments from the course code are marked with # and appear in the gold font.
My comments are marked with # and appear in the normal white font.
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4
# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Some materials have to be imported for the program to work, such as the aforementioned Beautiful Soup library and three modules from Python's urllib library: request, parse, and error.
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
# SSL certificates give funny errors, so the above code circumvents this problem.
# User input below
url = input('Enter URL: ')
count = int(input('Enter Count: '))
position = int(input('Enter Position: '))
# Below, a while loop runs until the count runs out. Using the "urlopen" function, the URL from the user input is opened and the data is read into the html variable as bytes. Beautiful Soup is called on the html variable to parse the contents. Soup is called again to get the anchor tags, which retrieves all the links, one for each anchor tag, and puts them all in a list object called tags. You will see I commented out some of my test print statements that I used to make sure I was getting the right data.
# I retrieve from tags the link at the position I want, which will be "position-1" because we start counting at 0, as in tags[0] is the first link. I call the get method on the tag at that position to extract just the web address, the URL, and I put that string in the url variable, overwriting the previous one, which the first time through is the user input. I count off one extraction by decrementing the count variable, and the program loops back and checks the "while" condition. If we still have a positive count, we do another, and so on until we run out of count.
while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Retrieve all of the anchor tags
    tags = soup('a')
    # print(tags)
    print('Retrieving: ', tags[position-1])
    # print(tags[position-1].get('href', None))
    url = str(tags[position-1].get('href', None))
    count = count - 1
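By the way, here's a tiny toy example of my own, not part of the assignment, just to show what Beautiful Soup hands back: the soup('a') call returns a list of anchor tags, and calling get('href') on one of them pulls out just the URL string. The little page and the link names here are made up by me.
# A little sketch of my own (the HTML and link names are invented) showing how
# Beautiful Soup turns anchor tags into a list and how get('href') extracts
# just the address.
from bs4 import BeautifulSoup
sample_html = """
<a href="http://py4e-data.dr-chuck.net/known_by_A.html">A</a>
<a href="http://py4e-data.dr-chuck.net/known_by_B.html">B</a>
<a href="http://py4e-data.dr-chuck.net/known_by_C.html">C</a>
"""
soup = BeautifulSoup(sample_html, 'html.parser')
tags = soup('a')                  # a list with one entry per anchor tag
print(tags[2])                    # the whole third anchor tag
print(tags[2].get('href', None))  # just its URL, ending in known_by_C.html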
This program would be MUCH harder to write without these useful libraries: Python's urllib and the cool Beautiful Soup, created by Leonard Richardson at crummy.com. There's cool stuff at the crummy site, such as a blog, a zine, and documentation.
https://www.crummy.com/software/BeautifulSoup/
That is all.
I am having fun with coding.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Reflect and connect.
Have someone give you a kiss, and tell you that I love you, Mom.
I miss you so very much, Mom.
Talk to you tomorrow, Mom.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- Days ago = 889 days ago
- Bloggery committed by chris tower - 1712.09 - 10:10
NEW (written 1708.27) NOTE on time: I am now in the same time zone as Google! So, when I post at 10:10 a.m. PDT to coincide with the time of your death, Mom, I am now actually posting late, so it's really 1:10 p.m. EDT. But I will continue to use the time stamp of 10:10 a.m. to remember the time of your death, Mom. I know this only matters to me, and to you, Mom.