Tree building functions
def make_results_tree(name, page):
    """ Returns a tree - takes a name to search for and a page number and returns an etree suitable for apply_xpath() """
    url = "http://solicitors.lawsociety.org.uk/search/results?Name=%s&Type=0&IncludeNlsp=false&Pro=True&Page=%s" % (name, page)
    with open(lawsoc_results_fo, 'r') as fo:
        if not any(url == x.rstrip('\r\n') for x in fo):
            print("New Lawsoc results page URL, writing to file." + " - " + url)
            write_file(lawsoc_results_fo, url)
        else:
            print("Lawsoc results page URL already saved.")
    text = fetch_url(url)
    # print(text)
    tree = parse(StringIO(text))
    return tree

def make_office_url(classification, idnumber, options):
    """ Returns a URL - takes a classification (office/person), an id and any options (the part after a ?) and returns a url. """
    url = "http://solicitors.lawsociety.org.uk/%s/%s/%s" % (classification, idnumber, options)
    # print("make_office_url: url: " + url)
    return url

def make_office_tree(classification, idnumber, options):
    """ Returns a tree - takes a classification (office/person), an id and any options (the part after a ?) and returns an etree suitable for apply_xpath(). """
    url = "http://solicitors.lawsociety.org.uk/%s/%s/%s" % (classification, idnumber, options)
    # print("make_office_tree: url: " + url)
    text = fetch_url(url)
    tree = parse(StringIO(text))
    return tree

def make_simple_tree(url):
    """ Returns a tree - fetches the content from the given URL.
    Takes a url and returns a tree suitable for apply_xpath(). """
    text = fetch_url(url)
    # print(text)
    tree = parse(StringIO(text))
    return tree

The functions make_results_tree, make_office_tree, and make_simple_tree each fetch a web page and parse it into a tree suitable for XPath queries; make_office_url only builds a URL. Here are some suggestions to improve these functions:
General Suggestions
- Error Handling: Add error handling for network requests and parsing, so that a failed fetch or malformed page is reported clearly instead of surfacing as an unrelated exception later.
- Logging: Replace print statements with the logging module. Log output can be filtered by level and redirected to a file, which is useful for debugging and monitoring.
- URL Construction: The URLs are built with % formatting, so a name containing a space, & or # produces a broken query string. Consider urllib.parse.urlencode for query parameters and urllib.parse.urljoin for combining a base URL with a relative path.
- Use of StringIO: On Python 3, make sure StringIO is imported from the io module. It wraps the fetched text in a file-like object for parse().
- Function Documentation: Expand each docstring to describe the parameters and the return value, so the purpose and usage of each function are clear.
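The URL construction point can be sketched as follows. This is a minimal sketch, not the original author's code: BASE_URL and build_results_url are hypothetical names, and the query parameters are copied from the URL in make_results_tree.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical constant; the host is taken from the original functions.
BASE_URL = "http://solicitors.lawsociety.org.uk/"


def build_results_url(name, page):
    """Build the search-results URL with an escaped query string."""
    # urlencode escapes spaces, '&', '#' and other reserved characters.
    query = urlencode({
        "Name": name,
        "Type": 0,
        "IncludeNlsp": "false",
        "Pro": "True",
        "Page": page,
    })
    # urljoin resolves the relative path against the base URL.
    return urljoin(BASE_URL, "search/results") + "?" + query


print(build_results_url("Smith & Co", 2))
```

With this approach a name such as "Smith & Co" stays inside the Name parameter instead of splitting the query string at the ampersand.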
Specific Suggestions
make_results_tree
- Check for URL Existence: The current check scans the whole file line by line on every call, so its cost grows with the number of saved URLs. A set (or a database, for large collections) gives much faster lookups.
- File Handling: The file is reopened for every URL check. If the file is not too large, read it once, keep its contents in memory, and only append when a new URL appears.
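One way to combine both points is a small in-memory set backed by the same one-URL-per-line file. This is a sketch under assumptions: SeenUrls is a hypothetical helper, and it appends to the file directly rather than calling the original write_file, whose behaviour is not shown.

```python
import os


class SeenUrls:
    """In-memory set of saved URLs, backed by a text file (one URL per line)."""

    def __init__(self, path):
        self.path = path
        self.urls = set()
        # Read the file once at start-up instead of on every lookup.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fo:
                self.urls = {line.rstrip("\r\n") for line in fo if line.strip()}

    def add(self, url):
        """Record url. Return True if it was new, False if already saved."""
        if url in self.urls:
            return False
        self.urls.add(url)
        # Append so the file stays in sync with the in-memory set.
        with open(self.path, "a", encoding="utf-8") as fo:
            fo.write(url + "\n")
        return True
```

make_results_tree could then create one SeenUrls instance at module level and call add(url), logging a different message depending on the return value.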
make_office_url and make_office_tree
- Parameter Validation: Add validation for the classification, idnumber, and options parameters to ensure they are in the expected format.
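A validation step might look like the sketch below. The allowed classifications ("office" and "person") are inferred from the docstrings, and the assumption that ids are purely numeric is a guess that should be checked against real URLs; validate_office_params is a hypothetical name.

```python
# Assumption: only these two classifications occur, as the docstrings suggest.
ALLOWED_CLASSIFICATIONS = {"office", "person"}


def validate_office_params(classification, idnumber):
    """Return (classification, idnumber as str) or raise ValueError."""
    if classification not in ALLOWED_CLASSIFICATIONS:
        raise ValueError("unexpected classification: %r" % (classification,))
    idnumber = str(idnumber)
    # Assumption: ids are ASCII digits only; relax this if real ids differ.
    if not (idnumber.isascii() and idnumber.isdigit()):
        raise ValueError("unexpected id number: %r" % (idnumber,))
    return classification, idnumber
```

Raising ValueError early keeps a malformed argument from turning into a request for a URL that cannot exist.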
make_simple_tree
- Error Handling for fetch_url: Ensure that fetch_url can handle network errors gracefully and that make_simple_tree can handle the case where fetch_url returns an empty or invalid response.
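The sketch below shows one possible shape for this. Because fetch_url and parse are not shown in the original, they are passed in as the hypothetical parameters fetch and parse; safe_make_tree is likewise a made-up name. It assumes network failures surface as OSError, which holds for urllib and requests but may not for other HTTP clients.

```python
import logging
from io import StringIO

logger = logging.getLogger(__name__)


def safe_make_tree(url, fetch, parse):
    """Fetch url and parse it; return the tree, or None on any failure."""
    try:
        text = fetch(url)
    except OSError as exc:  # assumption: the fetcher raises OSError subclasses
        logger.error("Fetch failed for %s: %s", url, exc)
        return None
    if not text or not text.strip():
        logger.warning("Empty response for %s", url)
        return None
    try:
        return parse(StringIO(text))
    except Exception as exc:  # parser errors vary; narrow this for your parser
        logger.error("Parse failed for %s: %s", url, exc)
        return None
```

Callers then check for None before applying XPath queries, for example `tree = safe_make_tree(url, fetch_url, parse)`.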
Example Improvements
import logging
from io import StringIO
from urllib.parse import urlencode, urljoin

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# lawsoc_results_fo, write_file, fetch_url and parse are defined elsewhere in the original script.

def make_results_tree(name, page):
    """ Returns a tree - takes a name to search for and a page number and returns an etree suitable for apply_xpath() """
    base_url = "http://solicitors.lawsociety.org.uk/search/results"
    # urlencode escapes the name so that spaces or '&' cannot break the query string
    query = urlencode({"Name": name, "Type": 0, "IncludeNlsp": "false", "Pro": "True", "Page": page})
    url = urljoin(base_url, "?" + query)
    with open(lawsoc_results_fo, 'r') as fo:
        if not any(url == x.rstrip('\r\n') for x in fo):
            logging.info("New Lawsoc results page URL, writing to file. - %s", url)
            write_file(lawsoc_results_fo, url)
        else:
            logging.info("Lawsoc results page URL already saved.")
    text = fetch_url(url)
    tree = parse(StringIO(text))
    return tree

# Similar improvements can be applied to make_office_url, make_office_tree, and make_simple_tree
These improvements aim to make the code more robust, maintainable, and easier to understand. Remember to test the refactored code thoroughly to ensure it behaves as expected.