# A Lawsoc site scraper

## The main script

### lawsoc_scraper.py

Polipo is no longer considered useful and is not available other than in a docker container. We never used it for caching, just as a way of creating an HTTP proxy wrapper for tor's SOCKS proxy.

Now we have requests_tor that can talk directly to tor/torbrowser and use a new identity each time. See: https://pypi.org/project/requests-tor/

Remove the need for polipo in the refactoring.
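
A minimal sketch of what the polipo-free version might look like. The `RequestsTor` constructor arguments follow the requests-tor documentation and should be checked against the installed version; nothing below is taken from this codebase:

```python
# Sketch only: replaces the polipo HTTP-proxy hop with requests_tor talking
# straight to tor's SOCKS proxy. Constructor arguments per the requests-tor
# docs; verify against the installed version.
from requests_tor import RequestsTor

# 9050/9051 are the standalone tor daemon's SOCKS/control ports; use
# 9150/9151 instead when routing through the Tor Browser Bundle.
rt = RequestsTor(tor_ports=(9050,), tor_cport=9051)

resp = rt.get("https://check.torproject.org")  # confirms traffic exits via tor
print(resp.status_code)
```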

This script expects polipo to be running, with its SOCKS parent proxy setting currently pointing at port 9150 (the port used by the Tor Browser Bundle, which also needs to be running).

To run headless, change the polipo port to 9050 and run tor, or use the tor rotating-proxy docker container and port 5566; see /usr/share/doc/polipo/examples.

/etc/polipo/config:

```
socksParentProxy = localhost:9050
socksProxyType = socks5
```
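
For reference, the current hop through polipo presumably looks something like the following; the use of `requests` and polipo's default HTTP port 8123 are assumptions, not something this README states:

```python
# Assumed current approach: requests -> polipo (HTTP proxy, default port
# 8123) -> tor's SOCKS proxy. Adjust the port to match your polipo config.
import requests

PROXIES = {
    "http": "http://localhost:8123",
    "https": "http://localhost:8123",
}

resp = requests.get("https://check.torproject.org", proxies=PROXIES)
print(resp.status_code)
```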

```python
import os

# Output file locations used by lawsoc_scraper.py, relative to the working directory.
current_dir = os.getcwd()
fo_home = os.path.join(current_dir, 'text_files')
lawsoc_results_fo = os.path.join(fo_home, "lawsoc_results_urls.txt")
missing_page_fo = os.path.join(fo_home, "missing_files.txt")
office_url_fo = os.path.join(fo_home, "lawsoc_office_urls.txt")
office_email_fo = os.path.join(fo_home, "lawsoc_office_email.txt")
person_url_fo = os.path.join(fo_home, "lawsoc_person_urls.txt")
person_email_fo = os.path.join(fo_home, "lawsoc_person_email.txt")
firm_url_fo = os.path.join(fo_home, "lawsoc_firm_urls.txt")
prefix_fo = os.path.join(fo_home, "lawsoc_prefix.txt")
geodata_fo = os.path.join(fo_home, "geodata.txt")
```

## Other scripts and directories

### create_geodata_file.py

For each office JSON file in json_out, extract the URL domain and output it with lat/lon to text_files/geodata.txt (see the sketch after the path definitions below).

```python
import os

# Workspace layout: read office JSON from json_out/office/ and write the
# geodata outputs to text_files/.
workspace_home = os.getcwd()
workspace = workspace_home + "/json_out/"
workspace_out = workspace_home + "/text_files/"
office_workspace = workspace + "office/"
geodata_fo = workspace_out + "geodata.txt"
geodata_named_fo = workspace_out + "geodata-named.txt"
geodata_missing_both_fo = workspace_out + "geodata_missing_both.txt"
geodata_missing_web_fo = workspace_out + "geodata_missing_website.txt"
geodata_missing_geo_fo = workspace_out + "geodata_missing_geo.txt"
```
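
Continuing from the path definitions above, a hedged sketch of the per-office step. The field names follow the office JSON shown under json_out/ below; the exact geodata.txt line format is an assumption:

```python
import json
import os
from urllib.parse import urlparse

def office_to_geodata_line(path):
    with open(path) as f:
        office = json.load(f)
    web = office.get("web", "")
    loc = office.get("location", {})
    if not web or not loc.get("lat") or not loc.get("lon"):
        return None  # these cases feed the geodata_missing_* files instead
    domain = urlparse(web).netloc or web
    return "{} {} {}\n".format(domain, loc["lat"], loc["lon"])

with open(geodata_fo, "w") as out:
    for name in sorted(os.listdir(office_workspace)):
        if name.endswith(".json"):
            line = office_to_geodata_line(os.path.join(office_workspace, name))
            if line:
                out.write(line)
```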

### convert_json_address_space.py

Removes leading spaces from the fields of the JSON address.
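
A minimal sketch of that transformation, assuming the script rewrites each file in place and that `address` is a list of strings as in the examples below:

```python
import json

def strip_address_spaces(path):
    with open(path) as f:
        data = json.load(f)
    if isinstance(data.get("address"), list):
        data["address"] = [line.lstrip() for line in data["address"]]
        # Write back in the json_out style (2-space indent, sorted keys).
        with open(path, "w") as f:
            json.dump(data, f, indent=2, sort_keys=True)
```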

### convert_json_date.py

Updates the format of the admitted_date field in the JSON files for individuals.
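
Judging by the examples below, the scraped format is DD/MM/YY (outline) and the stored format is YYYY/MM/DD (json_out). A sketch of that conversion, assuming those two formats:

```python
from datetime import datetime

def convert_admitted_date(value):
    # e.g. "15/02/94" (outline format) -> "1994/02/15" (json_out format)
    return datetime.strptime(value, "%d/%m/%y").strftime("%Y/%m/%d")
```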

### convert_json_latlon.py

English latitudes should be around 50, so lat and long were recorded the wrong way round; this script swaps them.
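
A sketch of the swap. Whether the script operates on the top-level lat/long keys (as in the outline) or on location.lat/lon (as in json_out) is an assumption; the version below uses the json_out layout:

```python
import json

def swap_latlon(path):
    with open(path) as f:
        office = json.load(f)
    loc = office.get("location")
    if loc and "lat" in loc and "lon" in loc:
        loc["lat"], loc["lon"] = loc["lon"], loc["lat"]
        with open(path, "w") as f:
            json.dump(office, f, indent=2, sort_keys=True)
```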

### strip_office_email.py

Opens lawsoc_office_email.txt, deduplicates the addresses, splits on '@' to take the domain part of each email address, and writes the result to office_email_stripped.txt.
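
A minimal sketch of that step; the assumption here is one email address per line and an output location under text_files/:

```python
# Deduplicate email domains: read addresses, keep the part after '@'.
domains = set()
with open("text_files/lawsoc_office_email.txt") as f:
    for line in f:
        line = line.strip()
        if "@" in line:
            domains.add(line.split("@")[-1])

with open("text_files/office_email_stripped.txt", "w") as out:
    for domain in sorted(domains):
        out.write(domain + "\n")
```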

### utils/

Contains utility scripts:

- install-debs.sh
- ipinfo_request.sh
- polipo_offline.sh
- polipo_online.sh
- push_lawfirm_seed_list_to_nutch.sh
- push_lawsoc_seed_list_to_nutch.sh
- start_nutch_lawfirms.sh
- start_nutch_lawsoc.sh

### outlines/

Marked-up representations of typical lawsoc entries. I used these to help write the JSON output functions.

#### office_info_outline.json

```json
{
  "name": "Dutton Gregory LLP",
  "tel": "01962 844333",
  "web": "http://duttongregory.co.uk",
  "email": "contact@duttongregory.co.uk",
  "sra_id": "496960",
  "solicitor_id": "468235",
  "type": "Recognised body law practice",
  "address": [
    "Trussell House,",
    "23 St. Peter Street,",
    "Winchester,",
    "Hampshire,",
    "SO23 8BT,",
    "England"
  ],
  "lat": "-1.313652",
  "long": "51.06269",
  "dx_address": "DX 2515 WINCHESTER",
  "facilities": [
    "Office does not provide Sign Language",
    "Office has no disabled access",
    "Office does not support hearing induction loops",
    "Office accepts legal aid"
  ]
}
```

#### solicitor_info_outline.json

```json
{
  "name": "Karen Lorraine Andrews",
  "admitted_date": "15/02/94",
  "sra_id": "162857",
  "type": "SRA Regulated",
  "person_id": "119847",
  "solicitor_id": "468235",
  "employer_address": [
    "Dutton Gregory LLP",
    "Ambassador House",
    "8 Carlton Crescent",
    "Southampton",
    "SO15 2EY"
  ],
  "dx_address": "49653 SOUTHAMPTON 2",
  "tel": "02380 221344",
  "accreditations": ["Family - Advanced accreditation"],
  "roles": ["Associate", "Assistant"],
  "areas_of_practice": ["Family - Advanced Accredited"],
  "languages": ["English"]
}
```

### json_out/

The collected JSON representations of offices and persons.

#### json_out/office/office_523803.json
```json
{
  "address": [
    "220 Soho Road",
    "Handsworth",
    "Birmingham",
    "West Midlands",
    "B21 9LR",
    "England"
  ],
  "dx_address": "",
  "email": "bassisolicitors@googlemail.com",
  "facilities": "[]",
  "location": {
    "lat": "52.50331",
    "lon": "-1.934184"
  },
  "name": "Bassi Solicitors Limited",
  "solicitor_id": "523803",
  "sra_id": "566080",
  "tel": "01215540868",
  "type": "Recognised body law practice",
  "web": ""
}
```

#### json_out/person/person_139188.json

```json
{
  "accreditations": "['Family accreditation']",
  "admitted_date": "1990/08/15",
  "areas_of_practice": "['Family', 'Family Accredited']",
  "dx_address": "",
  "email": "sld45@hotmail.co.uk",
  "languages": "['French', 'English']",
  "name": "Joanne Graham",
  "person_id": "139188",
  "roles": "Locum",
  "solicitor_id": "556394",
  "sra_id": "146314",
  "tel": "01706 399919",
  "type": "SRA Regulated"
}
```

### elasticsearch/

Elasticsearch-related Python scripts and curl examples.

We had wanted to add the lawsoc data, and the subsequent firm-website job-page data, to an ELK stack to process, map and graph the results.
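
A hedged sketch of what pushing one office document into Elasticsearch might look like with the official Python client; the index name, document id choice and the 8.x-style call are assumptions, not the project's actual scripts:

```python
# Sketch only: index name, id choice and client version are assumptions.
# location.lat/lon could be mapped as a geo_point so offices can be plotted
# on a Kibana map.
import json
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

with open("json_out/office/office_523803.json") as f:
    office = json.load(f)

es.index(index="lawsoc-offices", id=office["solicitor_id"], document=office)
```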

### notes/

Information on the current state of the code; some files are more useful than others.

- current_errors.txt
- exclude_mail.txt
- firms_missing_geo.txt
- firms_missing_website.txt
- remaining-xpaths.txt
- todo_2015-10-07.txt
- TODO.txt
- uncaught_errors.txt
- uncaught_exceptions.txt

### refactor_notes/

Suggestions from Phind for refactoring this functional but sequentially constructed code.

### test/

A scratchpad area: scripts that demonstrate the functionality of libraries later used in the main code.

### vcard/

Initial examples of vCards, used to compare formats and decide on the layout for the vCard output function (not yet implemented).