lawsoc-scraper/refactor_notes/lawsoc_scraper/2024-04-06_00-00-00_Phind-S...

Tree-building functions

def make_results_tree(name, page):
    """ Returns a tree - takes a name to search for and a page number and returns an etree suitable for apply_xpath() """
    url = "http://solicitors.lawsociety.org.uk/search/results?Name=%s&Type=0&IncludeNlsp=false&Pro=True&Page=%s" % (name, page)
    with open(lawsoc_results_fo, 'r') as fo:
        if not any(url == x.rstrip('\r\n') for x in fo):
            print("New Lawsoc results page URL, writing to file." + " - " + url)
            write_file(lawsoc_results_fo, url)
        else:
            print("Lawsoc results page URL already saved.")

    text = fetch_url(url)
    # print(text)
    tree = parse(StringIO(text))
    return tree


def make_office_url(classification, idnumber, options):
    """ Returns a URL - takes a office/person, id options after a ? and returns a url. """
    url = "http://solicitors.lawsociety.org.uk/%s/%s/%s" % (classification, idnumber, options)
    # print("make_office_url: url: " + url)
    return url


def make_office_tree(classification, idnumber, options):
    """ Returns a tree - takes a office/person, id options after a ? and returns an etree suitable for apply_xpath(). """
    url = "http://solicitors.lawsociety.org.uk/%s/%s/%s" % (classification, idnumber, options)
    # print("make_office_tree: url: " + url)
    text = fetch_url(url)
    tree = parse(StringIO(text))
    return tree


def make_simple_tree(url):
    """ Returns a tree - Fetches the content from the given URL
    takes a url and returns a tree suitable for apply_xpath()"""
    text = fetch_url(url)
    # print(text)
    tree = parse(StringIO(text))
    return tree
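
These functions lean on names defined elsewhere in the module. A minimal sketch of the imports they appear to assume; lxml's HTML parser is an assumption, since this excerpt does not show where parse comes from:

from io import StringIO      # turns fetched text into a file-like object for parse()
from lxml.html import parse  # assumption: xml.etree would also work, but lxml copes better with real-world HTML

# fetch_url(), write_file() and lawsoc_results_fo (path to the saved-URL file)
# are assumed to be defined elsewhere in the scraper module.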

The functions make_results_tree, make_office_url, make_office_tree, and make_simple_tree are designed to fetch and parse web pages into tree structures suitable for XPath queries. Here are some suggestions to improve these functions:

General Suggestions

  1. Error Handling: Add error handling for network requests and parsing. This helps identify issues with the fetched content or the URL format; see the fetch-and-parse sketch after this list.

  2. Logging: Replace print statements with logging. This provides more control over the output and can be configured to write logs to files, which is useful for debugging and monitoring.

  3. URL Construction: Use urllib.parse.urljoin when combining a base URL with relative paths, and urllib.parse.urlencode when building query strings, so that values such as Name are escaped correctly.

  4. Use of StringIO: Ensure that StringIO is imported from the io module. This is used to convert the fetched text into a file-like object for parsing.

  5. Function Documentation: Improve the documentation of each function to include more details about the parameters and return values. This helps in understanding the purpose and usage of each function.
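
A minimal sketch of points 1 and 2 combined. It assumes fetch_url raises OSError-derived exceptions on network failure (true of both urllib's URLError and requests' RequestException) and that parsing uses lxml; adapt the except clauses to whatever fetch_url is actually built on:

import logging
from io import StringIO

from lxml import etree
from lxml.html import parse  # assumption: lxml's HTML parser

log = logging.getLogger(__name__)

def fetch_tree_safely(url):
    """Fetch url and parse it into a tree; log and return None on failure."""
    try:
        text = fetch_url(url)  # existing helper in this module
    except OSError as exc:
        log.error("Fetch failed for %s: %s", url, exc)
        return None
    if not text:
        log.warning("Empty response from %s", url)
        return None
    try:
        return parse(StringIO(text))
    except etree.XMLSyntaxError as exc:
        log.error("Parse failed for %s: %s", url, exc)
        return None

Returning None pushes the failure decision to the caller, which is usually preferable to an unhandled traceback halfway through a scrape.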

Specific Suggestions

make_results_tree

  • Check for URL Existence: The current check re-reads the file line by line on every call, which is inefficient. Consider a set or a database for faster lookups.
  • File Handling: Opening the file for each URL check compounds the cost. Read the file once at startup and keep its contents in memory if the file is not too large; a sketch of both ideas follows.
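
A sketch of both points, assuming lawsoc_results_fo holds the path of the saved-URL file and write_file appends a single line to it (their exact semantics are not shown in this excerpt):

def load_seen_urls(path):
    """Read the saved-URL file once into a set for O(1) membership tests."""
    try:
        with open(path, 'r') as fo:
            return {line.rstrip('\r\n') for line in fo}
    except FileNotFoundError:
        return set()

seen_urls = load_seen_urls(lawsoc_results_fo)  # load once at startup

def record_url(url):
    """Remember and persist a URL only the first time it is seen."""
    if url not in seen_urls:
        seen_urls.add(url)
        write_file(lawsoc_results_fo, url)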

make_office_url and make_office_tree

  • Parameter Validation: Validate the classification, idnumber, and options parameters before building a URL, so malformed input fails fast instead of producing a broken request; a sketch follows.
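
A minimal validation sketch. The accepted classification values are hypothetical, inferred from the URL pattern rather than confirmed against the site:

VALID_CLASSIFICATIONS = {"person", "organisation"}  # hypothetical values; adjust to the site's real path segments

def validate_office_params(classification, idnumber, options):
    """Raise ValueError early rather than fetching a malformed URL."""
    if classification not in VALID_CLASSIFICATIONS:
        raise ValueError("unexpected classification: %r" % classification)
    if not str(idnumber).isdigit():
        raise ValueError("idnumber should be numeric, got %r" % idnumber)
    if options and not options.startswith("?"):
        raise ValueError("options should start with '?', got %r" % options)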

make_simple_tree

  • Error Handling for fetch_url: Ensure that fetch_url handles network errors gracefully and that make_simple_tree copes with an empty or invalid response; a sketch follows.
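
Building on the fetch_tree_safely sketch from the general suggestions, make_simple_tree can absorb bad responses by returning None, which callers must then check for:

def make_simple_tree(url):
    """ Returns a tree or None - fetches url and parses it, tolerating failures. """
    tree = fetch_tree_safely(url)  # sketched under the general suggestions above
    if tree is None:
        log.warning("make_simple_tree: could not build a tree for %s", url)
    return tree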

Example Improvements

import logging
from io import StringIO
from urllib.parse import urlencode

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def make_results_tree(name, page):
    """ Returns a tree - takes a name to search for and a page number and returns an etree suitable for apply_xpath() """
    base_url = "http://solicitors.lawsociety.org.uk/search/results"
    url = urljoin(base_url, f"?Name={name}&Type=0&IncludeNlsp=false&Pro=True&Page={page}")
    with open(lawsoc_results_fo, 'r') as fo:
        if not any(url == x.rstrip('\r\n') for x in fo):
            logging.info("New Lawsoc results page URL, writing to file." + " - " + url)
            write_file(lawsoc_results_fo, url)
        else:
            logging.info("Lawsoc results page URL already saved.")

    text = fetch_url(url)
    tree = parse(StringIO(text))
    return tree

# Similar improvements can be applied to make_office_url, make_office_tree, and make_simple_tree

These improvements aim to make the code more robust, maintainable, and easier to understand. Remember to test the refactored code thoroughly to ensure it behaves as expected.