lawsoc-scraper/notes/TODO.txt

1. polipo not active as an offline proxy, or css not working, or something else?
----------------------
502 Disconnected operation and object not in cache
The following error occurred while trying to access http://solicitors.lawsociety.org.uk/office/476844/a-e-payne-limited:
502 Disconnected operation and object not in cache
Generated Wed, 26 Aug 2015 12:04:20 BST by Polipo on sparky:8123.
----------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Proxy error: 502 Disconnected operation and object not in cache.</title>
</head>
<body>
<h1>502 Disconnected operation and object not in cache</h1>
<p>The following error occurred while trying to access <strong>http://bugs.pearsoncomputing.net/show_bug.cgi?id=1998</strong>:<br><br>
<strong>502 Disconnected operation and object not in cache</strong></p>
<hr>Generated Thu, 27 Aug 2015 12:12:42 BST by Polipo on <em>sparky:8123</em>.
</body>
</html>
----------------------
xpath for the error text in the page above:
/html/body/p/strong[2]/text()
Either add code to catch the 502 from polipo and write the unfetched url to a file; or
check that polipo is working, i.e. that it only fetches pages it has not already fetched.
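Rough sketch of the first option, assuming Python 3 and that the scraper has the status code and body text of each response to hand. The names is_polipo_miss and record_unfetched and the unfetched.txt default are made up for illustration, not existing scraper code. It matches on the error string from the pages pasted above rather than on the xpath, so it needs no HTML parser.

```python
# Sketch only: function names and the unfetched.txt default are assumptions.
POLIPO_MISS = "502 Disconnected operation and object not in cache"


def is_polipo_miss(status, body):
    """True if the response looks like polipo's offline cache-miss page."""
    return status == 502 and POLIPO_MISS in body


def record_unfetched(url, status, body, path="unfetched.txt"):
    """Append url to path if polipo could not serve it; return True if logged."""
    if not is_polipo_miss(status, body):
        return False
    with open(path, "a") as fh:
        fh.write(url + "\n")
    return True
```

The urls collected in the file could then be retried with polipo back online.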
2. are string() xpaths returning more than one item when needed?
Person
Roles at this organisation - first only
Roles at other organisations - missing
Office
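Possible cause of "first only": in XPath 1.0, string() applied to a node-set returns the string-value of the first node only. A minimal sketch of collecting every match instead, using the stdlib ElementTree rather than whatever the scraper actually uses. The markup, the "roles" class name and the all_texts helper are hypothetical, since the real page structure is not recorded in these notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a "roles" list on a person page.
SAMPLE = """<div>
  <ul class="roles">
    <li>Partner</li>
    <li>Compliance <b>Officer</b></li>
  </ul>
</div>"""


def all_texts(root, path):
    """Return the stripped text of every element matching path, not just the first."""
    return ["".join(el.itertext()).strip() for el in root.findall(path)]


root = ET.fromstring(SAMPLE)
roles = all_texts(root, ".//ul[@class='roles']/li")
# roles == ['Partner', 'Compliance Officer']
```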
3. lawsoc_prefix.txt is not being updated, so a restart begins from the beginning again.
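One way this could be handled, assuming Python 3 and that lawsoc_prefix.txt holds just the prefix the scraper last finished: save after each prefix and load on startup. load_prefix and save_prefix are made-up names. Writing to a temp file and renaming means a crash mid-write should not leave a half-written file behind.

```python
import os


def load_prefix(path="lawsoc_prefix.txt", default=""):
    """Return the last saved prefix, or default if nothing has been saved yet."""
    try:
        with open(path) as fh:
            return fh.read().strip() or default
    except FileNotFoundError:
        return default


def save_prefix(prefix, path="lawsoc_prefix.txt"):
    """Write the prefix to a temp file, then rename it over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(prefix + "\n")
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)
```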
4. signal.SIGHUP doesn't work for non-root users when killing tor. Rotating the proxy doesn't work when privoxy points to it, but we are using tor browser atm anyway.
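A possible alternative to signalling the process: ask tor for a new identity over its control port, which does not need root. Sketch only, assuming Python 3, that ControlPort is enabled in torrc with password authentication, and that it listens on 127.0.0.1:9051. The newnym name, the port and the empty default password are assumptions. A password containing quotes or backslashes would need escaping first.

```python
import socket


def newnym(host="127.0.0.1", port=9051, password="", timeout=10):
    """Ask tor for a new identity over its control port; return True on success."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rwb") as fh:
            for command in ('AUTHENTICATE "%s"' % password, "SIGNAL NEWNYM"):
                fh.write(command.encode("ascii") + b"\r\n")
                fh.flush()
                # tor answers each accepted command with a "250 ..." line
                if not fh.readline().startswith(b"250"):
                    return False
            fh.write(b"QUIT\r\n")
            fh.flush()
    return True
```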
5. duplicates added to all files
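Sketch of one way to stop duplicates at write time, assuming Python 3 and that each output file holds one JSON record per line with an id field such as person_id or solicitor_id. The file format and the append_unique name are assumptions, since the notes don't say how the files are written.

```python
import json


def append_unique(path, records, key):
    """Append records whose key field is not already in path; return count written."""
    seen = set()
    try:
        with open(path) as fh:
            for line in fh:
                if line.strip():
                    seen.add(json.loads(line)[key])
    except FileNotFoundError:
        pass  # first run: nothing written yet
    written = 0
    with open(path, "a") as fh:
        for rec in records:
            if rec[key] in seen:
                continue
            seen.add(rec[key])
            fh.write(json.dumps(rec, sort_keys=True) + "\n")
            written += 1
    return written
```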
6. add elasticsearch mapping for lawsoc/office and lawsoc/person to specify:
lawsoc/person
a. date format for the date field
b. _parent for person solicitor_id
c. unique key on lawsoc's person_id, copy person_id -> _id
{
  "person" : {
    "properties" : {
      "person_id" : {
        "type" : "string",
        "index" : "not_analyzed",
        "copy_to" : "_id"
      }
    }
  }
}
lawsoc/office
a. unique key on solicitor_id, copy solicitor_id -> _id
b. geo_point for location
{
  "office" : {
    "properties" : {
      "solicitor_id" : {
        "type" : "string",
        "index" : "not_analyzed",
        "copy_to" : "_id"
      }
    }
  }
}
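Fuller sketch of both mappings as Python dicts, covering the date format, _parent and geo_point items too. Assumptions: the string / not_analyzed style used above, field names "date" and "location", the yyyy-MM-dd date format, and office as the parent type of person. Instead of copy_to, this version assumes the lawsoc id is passed explicitly as the document id at index time; index_kwargs is a made-up helper that only builds the arguments for whatever client call does the indexing.

```python
import json

# Field names "date" and "location" and the date format are assumptions.
PERSON_MAPPING = {
    "person": {
        "_parent": {"type": "office"},
        "properties": {
            "person_id": {"type": "string", "index": "not_analyzed"},
            "solicitor_id": {"type": "string", "index": "not_analyzed"},
            "date": {"type": "date", "format": "yyyy-MM-dd"},
        },
    }
}

OFFICE_MAPPING = {
    "office": {
        "properties": {
            "solicitor_id": {"type": "string", "index": "not_analyzed"},
            "location": {"type": "geo_point"},
        }
    }
}


def index_kwargs(doc_type, record, id_field, parent_field=None):
    """Build index-request arguments that reuse the lawsoc id as the document id."""
    kwargs = {
        "index": "lawsoc",
        "doc_type": doc_type,
        "id": record[id_field],
        "body": record,
    }
    if parent_field:
        kwargs["parent"] = record[parent_field]
    return kwargs


mapping_json = json.dumps(PERSON_MAPPING, indent=2)
```

Indexing the same person_id twice would then overwrite the one document rather than add a second.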
7. Copy to David
rsync -avz -e ssh markm@igm-legal.co.uk:/var/tmp/lawsoc_new /var/tmp/