[clug] [SOLUTION] command line Web scraping ASP.net, Webforms & Gridview
steve jenkin
sjenkin at canb.auug.org.au
Sat Feb 22 07:24:52 UTC 2020
Developed a bash / wget solution - posted below for anyone discovering this thread who wants to web-scrape similar Webforms and is having trouble.
Checked against one division and it produces identical output.
Thanks to everyone who offered advice and help - it will make a big difference in my next adventure down these roads.
The biggest hurdle was seeing the FORM / POST data being sent over SSL to the webserver - the tricks & tools suggested to help with that are generic and far better than the GUI tool I found.
> On 21 Feb 2020, at 18:01, steve jenkin via linux <linux at lists.samba.org> wrote:
>
> The problem:
>
> A friend wants to analyse some data by Electoral Division, but only has addresses, which includes Postcodes.
> He’s done a few thousand records by hand and wanted a better way for the rest.
>
> There's an AEC "Locality search" that returns postcodes by Electorate (Division),
> but while they offer *downloads* for many other datasets (SA1, polling place and some SA2 data),
> I couldn't find a download of Postcodes within Divisions.
==========
The master file used to drive this script has 4 tab-separated fields per line.
The last field is the number of web pages per Division, each page having up to 20 lines - found manually.
steve$ head AEC-divs.tsv
101 Canberra ACT 3
102 Fenner ACT 2
103 Banks NSW 2
104 Barton NSW 2
105 Bennelong NSW 2
106 Berowra NSW 3
107 Blaxland NSW 2
108 Bradfield NSW 2
109 Calare NSW 20
111 Chifley NSW 2
This file layout creates a shell problem that could be avoided if the Name (2nd field) were the last field.
Five Names contain a space, which the shell, with the default IFS of "space, tab or newline", splits into extra variables. Another two Names contain non-alphabetic characters [shown below in a grep].
Naive and dangerous: a 2nd word in the Division Name breaks the assumption of 4 fields per line:
cat AEC-divs.tsv | while read divid Name State n ; do echo $n; done # Name, State and 'n' are wrong
This fragment works correctly. It uses a sub-shell '(...)' to limit the scope of the IFS change.
(IFS=$'\t'; cat AEC-divs.tsv | while read divid Name State n ;do echo $n; done)
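A sketch of another option that avoids the sub-shell entirely: bash allows IFS to be set for the read builtin alone, so $IFS everywhere else is untouched. (The sample line here is made up, in the file's layout, using "Kingsford Smith" from the grep below.)

```shell
# Setting IFS only on the read command confines the change to that one
# invocation; no sub-shell needed and $IFS is unchanged afterwards.
IFS=$'\t' read -r divid Name State n <<< $'999\tKingsford Smith\tNSW\t2'
echo "$Name"    # -> Kingsford Smith, kept as one field
echo "$n"       # -> 2
```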
steve$ cut -f2 AEC-divs.tsv | grep -Ev -e '^[A-Za-z]+$'
Eden-Monaro
Kingsford Smith
New England
North Sydney
Wide Bay
La Trobe
O'Connor
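Putting the tab-only splitting to work, a minimal driver loop over the master file could look like this sketch (echo stands in for the real ./get-div call, and the printf just creates a one-line sample file for illustration):

```shell
# One sample line in the real file's layout (4 tab-separated fields).
printf '109\tCalare\tNSW\t20\n' > AEC-divs.tsv

# Run get-div once per Division; splitting strictly on tabs keeps
# two-word Names like "Kingsford Smith" in one variable.
while IFS=$'\t' read -r divid Name State n
do
    echo ./get-div "$divid" "$Name" "$n"    # drop 'echo' for the real run
done < AEC-divs.tsv
```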
==========
#!/bin/bash
# get-div - fetch all HTML 'Locality' pages for a single AEC Division.
# HTML files dropped in local directory, but won't be overwritten.
# Checks for failed page loads ('Temporarily Unavailable', due to bad credentials), but keeps going.
# Stopping on error and explaining it would be the right thing to do.
# Steve Jenkin Sat 22 Feb 2020 17:33:47 AEDT
# rudimentary & unchecked argument passing.
divid=${1:-109}
Name=${2:-Calare}
n=${3:-20}
# Ideally, should fetch these (unchanging) values from HTML of 1st page
EVENTTARGET='ctl00%24ContentPlaceHolderBody%24gridViewLocalities'
VIEWSTATEGENERATOR='DC0A9FCB'
### get first page. Default to 'Page 1', loads VIEWSTATE and EVENTVALIDATION
i=1
f="${divid}_${Name}_${i}.html"
URL="https://electorate.aec.gov.au/LocalitySearchResults.aspx?filter=${Name}&filterby=Electorate&divid=${divid}"
echo "$f" >/dev/tty
wget -nc -O "${f}" "${URL}";
for (( strt = 2; strt <= n; strt += 5 ))
do
end=$(( strt + 4 ))
# must renew credentials after every 5 pages
# cheating, use last value in "$f"
VIEWSTATE=$( urlencode "$(grep 'VIEWSTATE"' "$f" | hxpipe - | grep -e 'Avalue CDATA ' | sed -e 's/^Avalue CDATA //' | tr -d '\n')" )
EVENTVALIDATION=$( urlencode "$(grep 'EVENTVALIDATION"' "$f" | hxpipe - | grep -e 'Avalue CDATA ' | sed -e 's/^Avalue CDATA //' | tr -d '\n')" )
for (( i = strt; i <= n && i <= end; i++ ))
do
f="${divid}_${Name}_${i}.html"
URL="https://electorate.aec.gov.au/LocalitySearchResults.aspx?filter=${Name}&filterby=Electorate&divid=${divid}"
echo "$f" >/dev/tty
EVENTARGUMENT="Page%24${i}"
wget -nc -O "${f}" --post-data="__EVENTTARGET=${EVENTTARGET}&__EVENTARGUMENT=${EVENTARGUMENT}&__VIEWSTATE=${VIEWSTATE}&__VIEWSTATEGENERATOR=${VIEWSTATEGENERATOR}&__EVENTVALIDATION=${EVENTVALIDATION}&ctl00%24ContentPlaceHolderBody%24searchControl%24searchFilter=&ctl00%24ContentPlaceHolderBody%24searchControl%24dropDownListSearchOptions=0" "${URL}"
done
# if any of these pages are found it's an error; the script should exit, but for now just list them
grep -l 'Temporarily Unavailable' ???_*_?.html ???_*_??.html | tr '\n' '\0' | xargs -0 echo
done
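The script assumes an external urlencode command is on the PATH (some distributions ship one, e.g. in gridsite-clients). Where it is missing, an ASCII-only pure-bash stand-in such as this sketch could be used - ASCII is enough here, since VIEWSTATE and EVENTVALIDATION are base64 strings:

```shell
# Percent-encode every character that is not "unreserved" per RFC 3986.
# ASCII-only sketch; a stand-in for the external urlencode the script uses.
urlencode() {
    local s="$1" c out='' i
    for (( i = 0; i < ${#s}; i++ )); do
        c="${s:i:1}"
        case "$c" in
            [A-Za-z0-9._~-]) out+="$c" ;;
            *) printf -v c '%%%02X' "'$c"   # "'X" yields the character's value
               out+="$c" ;;
        esac
    done
    printf '%s\n' "$out"
}

urlencode 'abc+/='    # -> abc%2B%2F%3D
```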
==========
--
Steve Jenkin, IT Systems and Design
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA
mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin