Database Techniques

From nswccWiki
Jump to: navigation, search

Database Techniques-II

Contents

Driving Google Earth using AppleScript

Where in the world is ...

We teach a course in Year 8 Science called “Stuck on a Deserted island”. As part of their work, students build a physical model of the island “somewhere in the Pacific Ocean”. Model building is a desirable outcome of the course.

I designed the model of the island that students build and later found a real island that closely matches my model. The model design resembled an enlarged topographical map.

Wouldn’t it be sound pedagogy to give our students some context of where they are in the world and where an island like their model island is located? To “demonstrate the technology” here is a short AppleScript that drives the Google Earth application.

This short application will immediately fix a Google Earth view over the Southern Highlands of NSW, wait 10 seconds then traverse to Erromango Island in the Pacific Ocean. The traversal is done automatically by Google Earth. Of course, for it to work, you need to be connected on-line to Google. The traversal behaves just like it does when you manually drive Google Earth. I was astounded by how simple it was. I thought I would have to generate the intermediate positions, and speed of the traversal. Not so... This AppleScript does it all for you. Imagine what could be done... clues for a crossword puzzle anyone.

-- on-line traversal from one location on the Earth to another location on the Earth
-- Ian W. Parker, 2008
-- the distance variables optimise views for the two locations
tell application "Google Earth"
	activate
	-- distance = height above ground level 
	-- latitude = decimal latitude (see GE preferences)
	-- longitude = decimal longitude
	-- titl, azimuth can be left as is
	-- Start Location - Southern Highlands of NSW
	SetViewInfo {latitude:-34.47, longitude:150.42, distance:1.0E+5, tilt:3.77E-10, azimuth:0.03}
	delay 10 -- this gives you 10 seconds to orient yourself at the initial view
	-- speed determines the behaviour. Slower better, smoother. 
	-- traversal path determined by Google Earth application
	-- End Location - Erromango Island in the Pacific Ocean
	-- so all we have to do is ...
	SetViewInfo {latitude:-18.825, longitude:169.175, distance:4.8E+4, tilt:3.77E-10, azimuth:0.03} speed 0.02
end tell

Parsing text fields using AppleScript

Data fields in a database may have very unusual features. For example: Suppose you had a database that used the following format: <SurName>'@'<firstName>'#'<group1>'%'<group2> <eol>. There are no embedded tabs or comma separators and each field string may have leading spaces, trailing spaces and/or embedded spaces. AppleScript can parse this very easily.

The aim is to extract a 4-tuple list {<SurName>,<FirstName>,<group1>,<group2>} without leading or trailing spaces and single spaces where there are multiple embedded spaces. Loading a string buffer from a file is not considered here: the string buffer is assumed to contain the data shown and further processing such as data integrity checking and copying the 4-tuples (where each would end up in a column) in an Excel Spreadsheet is trivial but not considered here.

Consider a typical extract from a text report file from a database with the assumed format:

test Record:: = "La1                 @ Step7          # 7     %G10"

We could attack this problem in many ways even only using just basic AppleScript. I will offer two options, one based on AppleScript's text item delimiter, and the second using Regular Expression Analysis part of Satimage's incredibly powerful OSAX.

Assumptions about the text.

The text file is assumed to be plain ASCII or UTF-8 text: nothing fancy. The text is divided into lines, ending with a <cr>-<lf> sequence (= Windows <eol>.) Usually we expect either ',' (CSV) or '<tab>' as the delimiters. In this case however, the line is divided into fields of varying length, delimited by '@', '#','%' ASCII characters. <,> and <tab> could be treated as acceptable characters in a field. One reason for using the special delimiter character set approach is that it is easy to identify missing field entries.

-- Ian W Parker 2008
-- first worked on 2008-02-08, last worked on 2008-02-08
-- set up the test harnass string with appropriate structure
set theRecord to " La1  @ Step7          # 7     %G10"

-- this executes
doTokenize(theRecord)

--result : {" La1  ", " Step7          ", " 7     ", "G10"}

-- -------------------------------------------------------------------------------
-- defines the basic splitting routine
-- ------------------------------------------------------------------------------
on doTokenize(currentRecord)
	-- initilise
	set {surname, firstname, Group1, Group2} to {"", "", "", ""}
	local currentRecord2, currentRecord3, surname, firstname, yearGroup, rollGroup
	--we save the original TID definition
	set TID to AppleScript's text item delimiters
	-- split on "@"
	set AppleScript's text item delimiters to "@"
	set surname to text item 1 of currentRecord
	set currentRecord2 to text item 2 of currentRecord
	-- split remainder on "#"
	set AppleScript's text item delimiters to "#"
	set firstname to text item 1 of currentRecord2
	set currentRecord3 to text item 2 of currentRecord2
	-- split remainder on %
	set AppleScript's text item delimiters to "%"
	set Group1 to text item 1 of currentRecord3
	set Group2 to text item 2 of currentRecord3
	-- restor the original TID
	set AppleScript's text item delimiters to TID
	-- these may have leading or trailing spaces
	return {surname, firstname, Group1, Group2}
end doTokenize

In the second example we use regular expression techniques to extract the sub-field items and squeeze multiple blanks.

-- Ian W Parker 2008
-- first worked on 2008-02-08, last worked on 2008-02-09
-- requires satimage OSAX to be installed
-- set up the test harnass string with appropriate structure
set crlf to (ASCII character 10) & (ASCII character 13)
-- assume for the example the string has a tralining crlf as eol as per Windows
-- in a file processing example, the eol would have already been stripped
set theRecord to " La1   @ Step7          # 7     %G10 " & crlf

-- this executes
getTokens(theRecord)

-- result is {"La1", "Step7", "7", "G10"}

-- -------------------------------------------------------------------------------
-- defines the enhanced REGEX based routine
-- returns a list of 4 strings. Internally the strings 
-- have no leading, no trailing spaces AND any internal
-- spaces are respected, but contracted to a single space
-- ------------------------------------------------------------------------------
to getTokens(currentRecord)
	set surname to ""
	set yearGroup to ""
	set firstname to ""
	set rollGroup to ""
	-- change trailing crlf to cr so Mac friendly
	set currentRecord to convert to Mac (currentRecord)
	-- use regular expressions to select components, could concatenate to one regular expression
	set MetaString1 to re_compile "[ ]*(.*)[ ]*@.*"
	set MetaString2 to re_compile ".*@(.*)#.*"
	set MetaString3 to re_compile ".*#[ ]*(.*)%.*"
	set MetaString4 to re_compile ".*%[ ]*(.*)"
	-- save TID and ensure text item delimiter set to ' '
	-- the TID enables us to extract a list of words separated by the TID
	-- if the TID is ' ' then blanks are squeezed
	set {TID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, space}
	-- extract surname
	set surname to find text MetaString1 in currentRecord using "\\1"
	-- chop leading trailing blanks, squeeze internal banks
	set surname to (every word of matchResult of surname) as string
	-- extract firstname
	set firstname to find text MetaString2 in currentRecord using "\\1"
	-- chop leading trailing blanks, squeeze internal banks
	set firstname to (every word of matchResult of firstname) as string
	-- extract group 1 information
	set Group1 to find text MetaString3 in currentRecord using "\\1"
	-- chop leading trailing blanks, squeeze internal banks
	set Group1 to (every word of matchResult of Group1) as string
	-- extract group 2 information
	set Group2 to find text MetaString4 in currentRecord using "\\1"
	-- chop leading trailing blanks, squeeze internal banks
	set Group2 to (every word of matchResult of Group2) as string
	-- revert TID as good programming practice
	set AppleScript's text item delimiters to TID
	-- collect items as strings into a list for convenience
	return {surname, firstname, Group1, Group2}
end getTokens

Language Interpretation

As a science teacher, a long time interest of mine has been the realisation of a language parser/ evaluator/ display string generator capable of correctly interpreting scientific expressions like: v = u + a * t, including perform dimensional consistency checking and evaluation to a result for the left hand side. This is not a trivial task. Significant time and effort is expended in physics courses at high school and university in this endeavour.

Although this idea may seem obvious to a physics teacher, there are few computer languages or systems incorporating this level of sophistication. And yet it is germane to building useful interactive online learning-science systems.

The main issues are:

Building a SI database to associate the names of the types of variables with a formal syntax of unevaluated internal string representations so that regular expression pattern matching and substitution may be used to resolve the left hand side with each of the terms on the right hand side of the expression. This is similar to the first pass parsing, symbol identification in a 2 -pass computer language compiler.

The interleaving of three distinct domains of discourse (a) numerical expression evaluation (b) SI prefix interpretation and mapping to the numerical domain (c) dimensional analysis and dimensional consistency checking of variables in an expression.

Building a translator and display system ultimately based on a stack evaluator for {length, time, mass, (electrical) current}.

The script below works towards resolving the first two issues.

-- Ian Parker 1998, 2008 first model  13/2/91 (in BBC BASIC) 
-- last worked on 2008-03-06
-- requires satimage OSAX for regex analysis
-- evaluates an equation of the form : SI_Dim1 = SI_Dim2  [* | / ] SI_Dim3
-- eg. <acceleration> = <velocity> / <time>
-- and returns an updated database as a recordset
-- ---------------------------------------------------------------------------------
-- construct a basic SI database for augmentation
property SI_symtab : ¬
	{{title:"distance", syn:{"length"}, type:"base", formula:"m"}, ¬
		{title:"time", syn:{""}, type:"base", formula:"s"}, ¬
		{title:"mass", syn:{""}, type:"base", formula:"k"}, ¬
		{title:"current", type:"base", formula:"A"}, ¬
		{title:"velocity", syn:{"speed"}, type:"distance / time", formula:"(m)/(s)"}, ¬
		{title:"acceleration", type:"velocity / time", formula:"((m)/(s))/(s)"}, ¬
		-- this type is correct BUT the formula is actually wrong!
		-- the system will ask the user to resolve the ambiguity
		{title:"force", syn:{"thrust", "push", "pull", "weight"}, type:"mass * acceleration", formula:"((k)*(((m)/(s))/(s)))"}}

-- -----------------------------------------------------------
-- extract item quad: (<lvar token>, <rhs1 token>, <operator>, <rhs2 token>)
set workList to SI_items_of("Force  = distance * Acceleration")
SI_process_rhs(workList)
set x to SI_symtab -- return the whole database as a recordset

-- result is {{title:"distance", syn:{"length"}, type:"base", formula:"m"},
-- {title:"time", syn:{""}, type:"base", formula:"s"}, 
--{title:"mass", syn:{""}, type:"base", formula:"k"}, 
--{title:"current", type:"base", formula:"A"}, 
--{title:"velocity", syn:{"speed"}, type:"distance / time", formula:"(m)/(s)"}, 
--{title:"acceleration", type:"velocity / time", formula:"((m)/(s))/(s)"}, 
--{title:"force", syn:{"thrust", "push", "pull", "weight"}, type:"distance * acceleration", formula:"(m)*(((m)/(s))/(s))"}}
-- ---------------------------------------------------------------------
--support routines
-- ---------------------------------------------------------------------
on SI_items_of(a_line)
	local div_replace, mul_replace, theItems, x
	-- lowercase to prevent issues with letter case
	set a_line to lowercase a_line
	-- ** uses regex based token parsing, not character by character parsing **
	set expr1 to re_compile "(.*)[ ]*=[ ]*(.*)[ ]*([/*])[ ]*(.*)" without case sensitive
	-- name some synonyms for '*' and '/'
	set div_replace to re_compile "divided[ ]*by|over" without case sensitive
	set mul_replace to re_compile "by|times|multiplied[ ]*by" without case sensitive
	-- standardise the form
	set a_line to change div_replace in a_line into " / " with regex
	set a_line to change mul_replace in a_line into " * " with regex
	-- recompose as a standard list
	set theItems to matchResult of (find text expr1 in a_line using "\\1, \\2, \\3, \\4")
	-- this part is tricky because of the '/' 
	if theItems contains "/" then
		set x to theItems & " d"
		set x to {word 1 of x} & {word 2 of x} & {word 4 of x} & {word 3 of x}
	else if theItems contains "*" then
		set x to theItems & " m"
		set x to {word 1 of x} & {word 2 of x} & {word 5 of x} & {word 4 of x}
	end if
	return x
end SI_items_of
-- ----------------------------------------------------------
on SI_check_sym_table(entry)
	global SI_symtab
	local x
	set x to 0
	-- because small a straight linear search of symtab, symtab is a record
	repeat with thisItem in SI_symtab
		set x to x + 1 -- we need to know the index value
		if entry is (title of thisItem) then return x
	end repeat
	set SI_symtab to SI_symtab & {{title:entry, syn:" ", type:" ", formula:" "}}
	return number of items in SI_symtab
end SI_check_sym_table
-- -----------------------------------------------------------
on SI_process_rhs(workList)
	-- create variable lists
	local my_result, lhs, rhs1, op, rhs2, checked
	set checkedOK to false
	set {lhs, rhs1, op, rhs2} to every item of workList --explode the workList string	
	set templhs to SI_check_sym_table(lhs)
	set {temprhs1, temprhs2} to {SI_check_sym_table(rhs1), SI_check_sym_table(rhs2)}
	-- cannot proceed, symbol not found on the rhs
	if (temprhs1 is number of items in SI_symtab or temprhs2 is number of items in SI_symtab) then
		display alert "Error: Unknown symbol on right hand side "
		return
	end if
	set rhs1result to (formula of item temprhs1 of SI_symtab) -- 
	set rhs2result to formula of (item temprhs2 of SI_symtab)
	if op is "m" then set op to "*"
	if op is "d" then set op to "/"
	set lhsresult to formula of last item of SI_symtab
	-- look at new definition, consider the reverse unless we reduce before comparison	
	set tryformula to "(" & rhs1result & ")" & op & "(" & rhs2result & ")"
	-- build entry, check formulas, then add to SI_symtab if not in agreement
	set checkrhsformula to "((" & rhs1result & ")" & op & "(" & rhs2result & "))"
	--do the formulas agree?
	if lhsresult is checkrhsformula then
		set checkedOK to true
	else
		if lhsresult = " " then set lhsresult to "'undefined'"
		display alert "New definition for " & lhs & ":  

" & checkrhsformula & " was " & lhsresult buttons {"OK", "Cancel"} default button 2
		if (button returned of result) is "OK" then
			set formula of item templhs of SI_symtab to tryformula
			set trytype to rhs1 & " " & op & " " & rhs2
			set type of item templhs of SI_symtab to trytype
		end if
	end if
	-- lhsresult is old, checkedformula is new
	return {checkedOK, lhsresult, checkrhsformula}
end SI_process_rhs

Slash it, Mash it, Mashups...

The Internet has invented the ‘mashup’. It is not new; the only thing new is the name. Many internet news providers have been at it for some time. Teachers are really familiar with printed mashups. A mashup takes content from a number of sources and brings it all together, in a new container. The look and feel might change, but the (selected) meta-content remains the same. I have to say that because usually the actual data changes. We sometimes call these collected teaching resources for a unit.

How would you begin the process of building a mashup?

The easy part is to determine where the mashup will fit into your teaching and learning programme. The second most difficult part is to design a web page. These two are familiar to many.

Then comes the most difficult part, finding reliable sources for each container on the page. You will invariably have to extract data from the source, and slice and dice it to fit your requirements. However, some service providers are now supplying web page feeds specifically for embedding in mashup containers. Google even supplies application programming interfaces (APIs) for a number of their free offerings.

As an example, you’re putting together a unit on weather, and want to have at least one page that automatically collates relevant up-to-date data and possibly graphs it for you in ways that make sense to you.

An example of a source for a mashup (that has connotations of dynamic Web data, ie. that changes periodically) is data from the Australian Bureau of Meteorology( Aust BOM).

A student could manually go to Aust BOM pages (via searching for useful information using their favourite internet search engine) and extract (or copy and paste) raw data using their favourite word processor, they then rearrange the data for personal use and hand it to their teacher, who, in the circular process, has probably done most of the work.

For a teacher, another approach is to build at least one page, using mashup technology. I’m sure that in the near future we will see applications for building a mashup, but for now...

This is a simple script that does the extracting, slicing AND the dicing of a portion of an Aust BOM page providing weather forecasts for the Illawarra and Southern Highlands of NSW. The end result of running this script is three web page files. Two of them could be embedded in another web page... a mashup. I had no idea that this would be so difficult using REGEXP! The main issue is that the weather data has several variants...

Now, if we could do the same for UV forecasts, thunderstorm forecasts, and put these all together on the front page of my T&L website... that would be useful. Remember, for a mashup, the actual data may change but the source and look and feel will remain the same.

-- Vers 1.1 -get current weather forecast from the Aust Bureau of Meteorology(BOM)
-- for Illawarra and Southern Highlands, NSW
-- Ian Parker, first worked on: 2008-03-23 last worked on: 2008-03-31
-- requires SATIMAGE OSAX to be installed, for REGEXP 
-- requires user currently connected on-line to the Internet
-- web pages on current user Desktop, new forecasts at 5:15 am, 4:05 pm 
-- store in a fixed location on a local server as source for LAN users, for a simple mashup
set myWeather1 to "ISHweather.htm"
set myLoc1 to (path to desktop as string) & myWeather1
--  sliced data file from BOM site, not useful as is...
set myWeather2 to "ISHweather1.htm"
set myLoc2 to (path to desktop as string) & myWeather2
-- the diced and condensed version ready for mashup use
set myWeather3 to "ISHweather2.htm"
set myLoc3 to (path to desktop as string) & myWeather3
-- get weather data  from the internet, ie. real CURRENT forecast
try -- to start a session... if not possible then skip the lot!
	tell application "URL Access Scripting"
		-- weather forecast, auto replacing old files
		set useFile to download ¬
			"http://www.bom.gov.au/products/IDN10061.shtml" to myLoc1 replacing yes
	end tell
	-- copy the (HTML) text file into string for slicing
	set theFile to open for access myLoc1
	set the_Text to read theFile
	close access theFile
	try -- to extract useful section of text from dowloaded file using SATIMAGE OSAX
		set thePattern1 to re_compile "ILLAWARRA" -- start of useful section
		set usefulData1 to find text thePattern1 in the_Text with regexp
		set thePattern2 to re_compile "IDN1006102" -- end of useful section
		set usefulData2 to find text thePattern2 in the_Text with regexp
		-- slice off some begin noise, and end noise -- stop pattern being greedy
		set usefulString to extract string the_Text ¬
			from ((matchPos of usefulData1) + 77) to ((matchPos of usefulData2) - 500)
		-- maintain original HTML data layout, replace the < pre>..< /pre> tags for browser
		-- this artificial chop is to placate the wiki
		set usefulString to "<P" & "RE><B>ILLAWARRA</B><BR>" & usefulString & "</" & "PRE> " as text
		-- save to a new HTML file (on desktop)
		set theFile to open for access myLoc2 with write permission
		write usefulString to theFile
		close access theFile
	on error
		display alert "Download file: No useful data" as critical
		close access theFile
	end try
	try
		-- Safari shows how  it would be displayed in a large mashup container?
		tell application "Safari" to open myLoc2
		-- --------------[ optional to extract essential forecast data] ----------------------------------
		-- dice for only the data giving a string: Bowral & Wollongong 		
		try --  to dice for first Wollongong entry with Min: and Max: data	
			set thePattern4 to re_compile "Wollongong :(.*)Min:.*([ |-][0-9]+).*Max:.*([ |-][0-9]+).*"
			set usefulData2 to find text thePattern4 in usefulString ¬
				using "Wol:\\1 \\2 &#186C to \\3 &#186C" with regexp
		on error -- to dice for first Wollongong entry  with Min: data missing	
			set thePattern4 to re_compile "Wollongong :(.*)Max:.*([ |-][0-9]+).*"
			set usefulData2 to find text thePattern4 in usefulString ¬
				using "Wol:\\1  to \\2 &#186C" with regexp
		end try
		try -- to dice for first Bowral entry  with Min: and Max: present
			set thePattern3 to re_compile "Bowral :(.*)Min:.*([ |-][0-9]+).*Max:.*([ |-][0-9]+).*</"
			set usefulData1 to find text thePattern3 in usefulString ¬
				using "Bow:\\1 \\2 &#186C to \\3 &#186C" with regexp
		on error -- to dice for first Bowral entry  with Min: data missing	
			set thePattern3 to re_compile "Bowral :(.*)Max:.*([ |-][0-9]+).*</"
			set usefulData1 to find text thePattern3 in usefulString ¬
				using "Bow:\\1  to \\3 &#186C" with regexp
		end try
		-- usefulData1-> Bowral forecast string
		--usefulData2 ->first Wollongong forecast string
		set weatherForecast to "Tomorrow"
		-- check for 5:15 am  forecast for today or assume for tomorrow
		if the_Text contains "515" then set weatherForecast to "Today"
		-- note line continuation char (opt-return)at end of line
		set weatherForecast to weatherForecast & "<BR>" & ¬
			matchResult of usefulData1 & "<BR>" & matchResult of usefulData2
		-- write the one liner to a text file with HTML formatting to 3 liner HTML
		try -- to delete the exiting file first to ensure file create, modify correct
			tell application "Finder" to delete alias myLoc3
		end try
		set theFile1 to open for access myLoc3 with write permission
		write weatherForecast to theFile1
		close access theFile1
		tell application "Safari" to open myLoc3 -- to visually check result
	on error --  alert user if weather data could not be extracted
		display alert "Extracted data file: cannot get useful data" as critical
		close access theFile1
	end try
on error -- alert user if session could not be started
	display alert "Aust. BofM not responding: check connection to Internet" as critical
end try

UV to the Max - updated

Each day, the Australian Bureau of Meteorology(Aust. BoM) supplies a large amount of aggregated data. Here is a script to get next day forecast of UV maxima for a selected town from the Aust BoM.

We can embed try...on error.. end try blocks, allowing the script to gracefully degrade its performance. Besides using the http:// protocol the download command can also access ftp sites, especially public sites. To compress the data and recompose it we use some date and time slicing and dicing. The “test harnass” lines show how to construct simple record structures in AppleScript. The original line for Bowral was

0019 068102 94747 Bowral              26 10 2010  UV Alert from  8.40 to 16.40  Max: 13

The result is an internally held list

 {"Bowral", "8:40:00 AM", "4:40:00 PM", "UV Max:13"} 

And the script to do this ...

-- Ian Parker Vers 1.1 first worked on 2008-10-10, last worked on 2010-10-25
-- extract a list of useful data from a live ftp site: BOM, Australia
global theFile
set {myLoc3, UVfileName} to {path to desktop as string, "IDYGP007.txt"}
-- the Aust. BoM page for next day UV forecasts
set UVDataFile to "ftp://ftp2.bom.gov.au/anon/gen/fwo/IDYGP007.txt"
try -- to ...
	tell application "URL Access Scripting" -- then
		set UVfile to download "ftp://ftp2.bom.gov.au/anon/gen/fwo/IDYGP007.txt" to ¬
			myLoc3 & UVfileName replacing yes
	end tell
	try -- to get lines for bowral
		set theFile to open for access UVfile
		set UVBowral to UVstringFor("bowral") -- replace with any valid town name (case unimportant)
		close theFile
	on error
		display alert "UV data file: " & myLoc3 & UVfileName & " missing" as critical
	end try
on error
	display alert "Cannot connect to Bureau of Meteorology(BOM)- Australia, check Internet connection" as critical
end try

UVBowral
-- result is {"Bowral", "8:40:00 AM", "4:40:00 PM", "UV Max:13"} -- for 2010-10-26

--  ----- [ conditional read of text file ]-------------------------------
on UVstringFor(the_loc)
	set target to ""
	-- search for, and slice  line from the text file
	repeat while (the_loc is not in target)
		-- this is terminated by a newline 
		set target to read theFile before "
" -- slurp in a line as per readline
	end repeat
	set target to every word of target -- split the line on spaces
	tell target to set loc to item 4
	set earlyUV to time string of date (item 11 of target & "am") -- dice the time
	set lateUV to time string of date (item 13 of target & "am") -- dice the time (wierd!)
	tell target to set UVMax to item 15
	tell target to set thedate to item 7 & "-" & item 6 & "-" & item 5
	-- recompose, and return list of strings with ISO date formats
	return {thedate, loc, earlyUV, lateUV, "UV Max:" & UVMax}
end UVstringFor
Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox