Add health check for urls by huangsam · Pull Request #148 · mattmakai/fullstackpython.com · GitHub
Merged: mattmakai merged 1 commit into mattmakai:master from huangsam:feature/check-url-health on Jan 13, 2018

Contributor

Here is the output that comes from the check_urls.py script: urlout.txt

This solution uses ThreadPoolExecutor to work around the inherent I/O bottleneck of URL requests. It also uses a fairly comprehensive regular expression for matching URLs. The pattern can be tweaked in the future if needed.
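
For readers unfamiliar with the approach, here is a minimal sketch of checking URLs on a thread pool. It is not the code from check_urls.py: `check_all`, `fake_status`, and the `max_workers` value are hypothetical names chosen for illustration, and the status function is a stand-in so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, get_status, max_workers=8):
    """Return {url: status} by calling get_status(url) on a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, so zip pairs each
        # URL with its own status even though the calls run concurrently.
        return dict(zip(urls, executor.map(get_status, urls)))


def fake_status(url):
    # Hypothetical stand-in for a real HTTP request.
    return 404 if url.endswith('/missing') else 200


statuses = check_all(
    ['http://example.com/', 'http://example.com/missing'],
    fake_status,
)
print(statuses)
```

Threads suit this workload because each request spends most of its time waiting on the network rather than using the CPU.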

Owner

thanks @huangsam this is super useful! looks like there may be a bug though because the URLs that end in .html do not resolve correctly in this program. Any ideas there?

For example, on the Flask page it says "http://blog.startifact.com/posts/older/what-is-pythonic" is a 404 but the actual URL is "http://blog.startifact.com/posts/older/what-is-pythonic.html" which resolves fine.

Contributor Author

Thanks for pointing out an example case. The regular expression is good at detecting URLs, but it does not always capture each one in full. Separate parsing for Markdown and HTML might be necessary to capture the URLs in their entirety. As for the core logic of verifying a URL, that works just fine.
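
One way the Markdown-specific parsing mentioned above could look is sketched below. This is an assumption about a possible direction, not something the PR implements: `MARKDOWN_LINK` and `markdown_urls` are hypothetical names, and the pattern only handles inline links of the form `[text](url)`, so URLs that themselves contain `)` would be cut short.

```python
import re

# Hypothetical pattern: capture the target of an inline Markdown link.
# [^\]]* is the link text, and [^)\s]+ is the URL up to the closing paren.
MARKDOWN_LINK = re.compile(r'\[[^\]]*\]\((https?://[^)\s]+)\)')


def markdown_urls(text):
    """Return the URLs used as inline Markdown link targets in text."""
    return MARKDOWN_LINK.findall(text)


sample = (
    'See [What is Pythonic?]'
    '(http://blog.startifact.com/posts/older/what-is-pythonic.html) '
    'and `https://c6c6d4e8.ngrok.io` in a code sample.'
)
found = markdown_urls(sample)
print(found)
```

Because only link targets are captured, a URL that merely appears inside a code sample is skipped.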

Comment thread check_urls.py Outdated
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Configurable variables
URL_MATCH = 'https?:\/\/[a-zA-Z0-9\.\-]+(html|\/)[=a-zA-Z0-9\_\/\?\&\-]+'
Contributor Author

The last portion of the regex, [=a-zA-Z0-9\_\/\?\&\-]+, should be [=a-zA-Z0-9\_\/\?\&\.\-]+. The original character class omits the ".", so a match stops at the first dot in the path and URLs ending in .html are truncated.
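
The effect of the missing "." can be reproduced with the two patterns from this thread and the URL reported earlier. The sketch below writes the patterns as raw strings and uses a hypothetical helper, `first_url`; the Markdown sample line is made up for illustration.

```python
import re

# Pattern from the reviewed line, and the same pattern with "." added
# to the final character class as proposed in the comment above.
OLD_URL_MATCH = r'https?:\/\/[a-zA-Z0-9\.\-]+(html|\/)[=a-zA-Z0-9\_\/\?\&\-]+'
NEW_URL_MATCH = r'https?:\/\/[a-zA-Z0-9\.\-]+(html|\/)[=a-zA-Z0-9\_\/\?\&\.\-]+'

# Illustrative Markdown line containing the URL from the bug report.
SAMPLE = (
    '[What is Pythonic?]'
    '(http://blog.startifact.com/posts/older/what-is-pythonic.html)'
)


def first_url(pattern, text):
    """Return the full text of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    # group(0) is the whole match; the pattern's capture group is ignored.
    return match.group(0) if match else None


old_result = first_url(OLD_URL_MATCH, SAMPLE)
new_result = first_url(NEW_URL_MATCH, SAMPLE)
print(old_result)  # stops before ".html"
print(new_result)  # includes ".html"
```

One side effect worth keeping in mind: with "." allowed in the path, a URL followed directly by a sentence-ending period will include that period in the match.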

Contributor Author

The output has been reduced significantly down to the following:

http://intermediatepythonista.com/python-comprehensions: -1
http://intermediatepythonista.com/python-generators: -1
http://learntocodewith.me/: -1
http://learntocodewith.me/getting-started/: -1
http://testdriven.io/part-five-intro/: 404
http://packetbeat.com/: -1
http://w3techs.com/technologies/details/ws-cherrypy/all/all: -1
https://c6c6d4e8.ngrok.io: 404
https://wiki.jenkins-ci.org/display/JENKINS/Securing: 404

- Add url collection algorithm
- Optimize regex + config for clarity
- Handle exceptions in get_url_status
huangsam force-pushed the feature/check-url-health branch from 0cdd819 to b56bb5e Compare January 12, 2018 08:07
Contributor Author

Timeout errors are now showing up with 504 instead of -1 in urlout.txt. Let me know if there's anything else that needs to be done to get this merged in.
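
The thread does not show how the script maps timeouts to 504, so the following is only a sketch of the idea using the standard library's urllib. The script under review imports urllib3, so its real calls likely differ. The function body, the `opener` parameter, and the constant names are assumptions made here so the example can be exercised without network access.

```python
import socket
import urllib.error
import urllib.request

TIMEOUT_STATUS = 504  # reported when the request times out
UNKNOWN_STATUS = -1   # reported for any other connection failure


def get_url_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return an HTTP status code, 504 on timeout, or -1 on other errors."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (socket.timeout, TimeoutError):
        return TIMEOUT_STATUS
    except urllib.error.URLError as err:
        # urlopen may wrap a timeout in URLError; check the underlying reason.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return TIMEOUT_STATUS
        return UNKNOWN_STATUS


def raise_timeout(url, timeout):
    # Hypothetical opener that simulates a request timing out.
    raise socket.timeout('timed out')


timeout_status = get_url_status('http://example.com/', opener=raise_timeout)
print(timeout_status)
```

Passing the opener in as a parameter is purely a testing convenience; calling `get_url_status(url)` with the default performs a real request.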

Owner

I'm happy to merge this now because it's super helpful. I guess the only other bit is that it's picking up non-URLs like "https://c6c6d4e8.ngrok.io", which are embedded in the code but don't actually link to sites. It's not a huge deal, but filtering those out would be a big improvement to the script if you want to take it on.

mattmakai merged commit 24737bc into mattmakai:master Jan 13, 2018
mattmakai added a commit that referenced this pull request Jan 13, 2018
Owner

Updated change log with a shout out for the new health check script. Thanks again @huangsam!

huangsam deleted the feature/check-url-health branch January 13, 2018 18:44
Contributor Author

Thanks for the reference @mattmakai!

I do understand that non-URLs are being picked up. That is not a fault of the regex itself; it comes from the context of the content surrounding the URLs. As a workaround, I created this line to ignore some obvious URLs.

To provide an "authentic" solution, I imagine that the one-line Bash command I invoke at the start of main won't be sufficient to cover this use case. I would be open to suggestions on how to proceed.
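
As one possible direction, placeholder hosts could be filtered out before any request is made. The sketch below is not the workaround line referenced above, whose contents are not shown in this thread; `IGNORED_HOST_SUFFIXES` and `is_checkable` are hypothetical names, and the suffix list is an illustrative guess.

```python
from urllib.parse import urlsplit

# Hypothetical ignore list: hosts that tend to appear in code samples
# rather than as real links.
IGNORED_HOST_SUFFIXES = ('ngrok.io', 'localhost', 'example.com')


def is_checkable(url):
    """Return False for URLs whose host is, or is a subdomain of, an ignored suffix."""
    host = urlsplit(url).hostname or ''
    return not any(
        host == suffix or host.endswith('.' + suffix)
        for suffix in IGNORED_HOST_SUFFIXES
    )


candidates = [
    'https://c6c6d4e8.ngrok.io',
    'http://learntocodewith.me/',
    'http://testdriven.io/part-five-intro/',
]
kept = [url for url in candidates if is_checkable(url)]
print(kept)
```

Matching on the parsed hostname rather than on the raw string avoids skipping a legitimate URL that happens to mention an ignored name in its path.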
