Database Techniques-II


Content, Context and Construct, or “The three Cons”

There is shallow depth meaning in that title, there is medium depth meaning and there is real deep meaning. There is meaning here and there is meaning there, and there would be even more meaning if you had a lisp. I mean, see what it means to you.

We live in a non-linear world. The content of printed documents is arranged linearly. We could all benefit from being able to deconstruct a document and reconstruct it for our own needs. The modern approach is to do this while constructing the document. Well, that’s OK if you have a very neat linear mindset and you are the one constructing the document. But what if you were given a pdf file (such as a Syllabus document) and wanted detailed cross references to just about every (significant) word in it, so you could construct your teaching program from it, including references, cross references and happy references (emoticons, for those who don’t know. Ah. Deeper meaning. Do you see the connexion?)?

OK.. the manual way goes like this:

  • Print the document- you have to start by printing!
  • Think of a useful word- “a” and “word” are not useful words here.
  • Write the word in a spreadsheet (or a Word document, for argument’s sake).
  • Enter the word into the search function of Adobe Acrobat or Skim.
  • Append the page number where the word is found to a list of page numbers for that word in your spreadsheet or Word document.
  • Repeat the process for all words you think would be useful.

Oh yeah! Mind numbing is not an adequate term to describe this.

I want to take the content of the Syllabus and construct from it a database of every unique word used, a list of pages on which the word occurs in the document, AND some context. Context means show me some text before, and some text after the word. That might be one word before, and one word after, or three words before and three words after.. you get the idea.

A record would look something like this (assuming a one-word context is set):

“The”, “MODERN”, “approach”
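
To make the idea of a context record concrete, here is a minimal sketch in Python (not AppleScript, and not part of the original scripts). The function name context_records, the width parameter and the sample sentence are all assumptions made for illustration only.

```python
import re

def context_records(text, keyword, width=1):
    """Return (before, KEYWORD, after) tuples for each occurrence of keyword.

    width is the number of context words kept on each side (hypothetical name).
    """
    words = re.findall(r"[A-Za-z']+", text)
    records = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            before = " ".join(words[max(0, i - width):i])
            after = " ".join(words[i + 1:i + 1 + width])
            records.append((before, word.upper(), after))
    return records

sample = "The modern approach is to do this while constructing a document."
print(context_records(sample, "modern"))  # [('The', 'MODERN', 'approach')]
```

Setting width to 3 gives three words before and three words after, and a word at the very start or end of the text simply gets less context on that side.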

Now we could do this many ways, but one of the most efficient ways is to use an application to do the heavy lifting. Really we are talking about making a database of words and strings (phrases or text).

I have resisted referring to database applications such as FileMaker Pro because I wanted to find a relatively simple example that could be constructed easily and that clearly demonstrates this very powerful approach. This simple example does not give you the context field.

Make sure you are using FileMaker Pro (FMP) 7, 8 or 9; the .fp7 file format used here is not available in FMP 6. Before you go any further, you will need to manually build a very simple FileMaker Pro database.

  • Open FileMaker Pro
  • Select New Database from the menu...
  • Create a new EMPTY database
  • Save the database with the name: “WordXRef.fp7” (on the desktop to make it easily accessible to the AppleScript)
  • Create a field: “Keyword”, (using the “text” type- which is not the confection “text type”)
  • Create a field: “Page” (also using the “text” type, since the script stores a comma-separated list of page numbers in it)
  • Click “Done”

Make sure the database file is saved on the desktop. Now copy a syllabus pdf file onto the desktop. In the AppleScript below, replace the name of the syllabus pdf document with the actual file name.

You do not need to do any other configuration. Letter rip. It will take some time to do its job. You can watch it do its magic if you so desire. The “WordXRef.fp7” database file will be of indeterminable value in promoting self confidence to meet educational contingencies.

-- Syllabus Analyser - constructs a concordance database
-- Ian Parker First worked on : 1998-01-20, last worked on 2008-08-10
-- requires Skim and FileMaker Pro 7+
set myPath to path to desktop as string -- assumes files on your desktop
set writeDocument to myPath & "WordXRef.fp7"
set readDocument to myPath & "chemistry_stg6_syl.pdf"
-- ------------------------------------------------------------------------
-- key verbs list
-- note: the loop below does not consult this list; it records every word on each page
set keyWordList to {"identify", "describe", "give", "distinguish", "define", "apply", "quantify", ¬
	"perform", "present", "analyse", "plan", "choose", "solve", "gather", "relate", "explain", ¬
	"compare", "outline", "discuss", "assess", "design", "draw", "construct", "process", "build", ¬
	"evaluate", "determine", "recommend", "predict", "recognise", "undertake", "dispose", ¬
	"use", "measure", "observe", "record", "access", "extract", "practice", "summarise", ¬
	"collate", "illustrate", "consider", "address", "express", "show", "convey", "justify", ¬
	"make", "develop", "propose", "include", "formulate", "account", ¬
	"recount", "demonstrate"}
-- -----------------------------------------------------------------------------------	
tell application "Finder" to open file writeDocument -- "file" turns the path string into a Finder reference
tell application "Skim"
	open readDocument -- open the file to scan
	set startPage to 28 -- first page to scan
	set endPage to 44 -- last page to scan
	set view settings of document 1 to {display mode:single page} -- one up
	-- pages 28 to 44 are the relevant pages of interest in the syllabus used here;
	-- adjust startPage and endPage for your own document
	repeat with cur_page from startPage to endPage
		-- page text into a string!
		set page_text to (get text for page cur_page of document 1)
		set page_text to (every word of page_text) as list -- to tokenise into words!
		repeat with nextWord in page_text
			tell application "FileMaker Pro"
				activate
				show every record -- because we use find
				-- take the word itself, not a reference into the list being looped over
				set thisWord to (contents of nextWord) as text
				try -- to find in database the keyword
					set x to get (ID of every record whose cell "Keyword" is thisWord)
					-- still in correct error trapping mode, and so ...
					go to record ID x
					set page_list to get cell "Page" of current record
					set cell "Page" of current record to page_list & "," & cur_page
				on error -- we expect to create a record and populate with information
					go to (create new record)
					set cell "Keyword" of current record to thisWord
					set cell "Page" of current record to cur_page
				end try
			end tell
		end repeat
	end repeat
end tell
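
If you do not have Skim or FileMaker Pro to hand, the find-or-create logic of the script above can be sketched in plain Python. This is only an illustration under stated assumptions: the pages are supplied as a list of strings rather than read from a pdf, a dictionary stands in for the database, and the names build_page_index and first_page are made up for the example. Unlike the AppleScript, the sketch lowercases each word and records a page only once per word.

```python
import re

def build_page_index(pages, first_page=1):
    """Map each word to the list of page numbers on which it occurs.

    pages is a list of page texts; first_page is the number of the first one.
    """
    index = {}
    for page_number, page_text in enumerate(pages, start=first_page):
        for word in re.findall(r"[a-z]+", page_text.lower()):
            # find the existing "record" for this word, or create an empty one
            page_list = index.setdefault(word, [])
            if page_number not in page_list:  # record each page once only
                page_list.append(page_number)
    return index

pages = ["Identify and describe the elements.",
         "Describe how to identify a compound."]
index = build_page_index(pages, first_page=28)
print(index["describe"])  # [28, 29]
```

The dictionary lookup plays the part of the FileMaker find, and setdefault plays the part of the "on error" branch that creates a new record.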

Reading Age, Concordance Tables and teaching praxis literacy levels

There is a groundswell of concern about the poor literacy levels of students leaving compulsory schooling. This is producing a shunt-back effect onto teachers, whom some consider ill-equipped to begin teaching formal grammar. Many have little or no formal training in this area, and precious few useful resources to help them in this task.

In the past, documents were carefully prepared by teachers and the content passed on to students, often by writing on a mass presentation device (MPD): first called a blackboard, then a chalkboard, then a whiteboard, then an interactive whiteboard. Now teachers write material that is presented directly to students as worksheets, prepared using very sophisticated tools, mostly with little checking of the reading age other than through professional judgement.

One very useful but overlooked tool is one that produces what is called a concordance table: a table of unique words along with a count of how many times each word was used in a document. This provides you with very important information regarding the focus and complexity of the document. The table or database can be sorted alphabetically or by frequency of word occurrence, so you can see what technical terminology is used and guesstimate the "reading age" of the document without using fancy formulas that sometimes do not accurately satisfy your needs.
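
As a rough illustration of what such a table contains, here is a small Python sketch (Python rather than AppleScript, so it can be tried without Skim or FileMaker Pro). The function name concordance and the sample text are assumptions for the example only.

```python
import re
from collections import Counter

def concordance(text):
    """Count how many times each unique (lowercased) word occurs in text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

text = "Acids react with bases. Strong acids react quickly; weak acids react slowly."
table = concordance(text)

# the same table, sorted two ways
by_word = sorted(table.items())
by_frequency = sorted(table.items(), key=lambda item: (-item[1], item[0]))

print(by_frequency[:2])  # [('acids', 3), ('react', 3)]
```

Sorting by frequency pushes the words the document leans on most to the top, which is where the technical terminology tends to show up.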

The following AppleScript uses a similar core to the script above, but instead of finding the pages on which particular words occur, it builds a database of the unique words in a document. You will observe that there are also two forms of exclusion: by length of word (in this case, less than 4 characters long), and by a special exclusion list of words. These two techniques can be used as simple global and local filters. The accumulated database will contain only words longer than 3 characters, and none of the words in the exclusion list.
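
The two filters can be sketched on their own in a few lines of Python. This is an illustration only: the exclusion set here is a shortened stand-in for the script's list, and the names EXCLUDES, MIN_LENGTH and is_interesting are hypothetical.

```python
# shortened, illustrative exclusion list (the local filter)
EXCLUDES = {"this", "that", "which", "with", "from", "have"}
# words shorter than this are dropped (the global filter)
MIN_LENGTH = 4

def is_interesting(word, excludes=EXCLUDES, min_length=MIN_LENGTH):
    """Keep a word only if it is long enough and not in the exclusion list."""
    word = word.lower()
    return len(word) >= min_length and word not in excludes

words = "The ions that form from this salt have a charge".split()
kept = [w.lower() for w in words if is_interesting(w)]
print(kept)  # ['ions', 'form', 'salt', 'charge']
```

The length test removes most articles and prepositions cheaply, and the exclusion list catches the longer common words that slip past it.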

I have used this to build a database of words used in science education and their frequency of occurrence in a number of .pdf documents to build a teaching lexicon. The lexicon, in the form of a simple FileMaker Pro database, can be manually maintained. As the number of documents presented to it increases, the database will begin to stabilise, indicating which words are being used in the documents I use when working with students, or use in presentations for students, and so highlight the literacy level at which I am working. Many applications such as Word, PowerPoint etc. can export documents as .pdf files.

The basic FileMaker Pro database only requires two fields per record: one is a text field called "word", and the other a number field called "accessed". The database is expected to be called "SciLexicon.fp7" and be located on the user desktop. The .pdf file acting as the source for the words can be located anywhere, since a choose file pane is used to access the source file. I have called the script "Lexicon Builder" but any useful name is possible. The end of the process is signalled when the source document is closed and no longer visible. Because of the way in which AppleScript works you can watch the process at work. Nice!

As always, this is a rough cut, without any warranties, and you should consider it to be a proof of concept. This script contains some interesting code that relies on forcing errors to cause particular actions to occur. For example, the script tries to coerce each word to a number. An error occurs if this cannot be done, and only then is the item treated as a word. In some computer languages this requires many lines of code or specialised use of regexp functions. Accessing Skim and FileMaker Pro in this way is equivalent to bolting a huge library of high level functions (with a GUI!) on to a conventional language. The search for a particular word is done by FileMaker, not AppleScript, making the script very high-level and readable. Debugging uses the GUI of the applications!
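
The same "force an error and act on it" trick looks like this in Python. It is a sketch of the idea rather than a translation of the script: the name is_number and the sample tokens are assumptions for illustration.

```python
def is_number(token):
    """Return True if token can be coerced to a number, using the error as the test."""
    try:
        float(token)
        return True
    except ValueError:
        # the coercion failed, so this is a word rather than a number
        return False

tokens = ["acid", "7", "3.5", "pH", "2010"]
words_only = [t for t in tokens if not is_number(t)]
print(words_only)  # ['acid', 'pH']
```

One caveat with this Python version: float also accepts strings such as "nan" and "inf", so those would be treated as numbers and dropped.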

-- Ian Parker, science lexicon builder for use in determining literacy level of content
-- reads a pdf file and places "interesting" words into a database
-- first worked on 2010-01-21, last worked on 2010-01-25

set excludes to {"this", "both", "after", "while", "want", "where", "your", "that", "which", "when", "with", "here", "such", "wherever", "whenever", "also", "what", "could", "these", "they", "there", "their", "does", "only", "them", "those", "were", "about", "above", "across", "again", "away", "will", "from", "next", "than", "within", "have"}

set WordAnDB to (path to desktop as string) & "SciLexicon.fp7" -- assumes on your desktop
set myPath to choose file with prompt "Open pdf file..." default location (path to desktop) -- use a file chooser

-- initialise
tell application "FileMaker Pro"
	open WordAnDB
	activate
	go to database 1
end tell

tell application "Skim"
	set current_document to open (myPath) -- open the file to scan
	set all_pages to count of pages of current_document -- to only calculate it once
end tell

-- loop for all pages
repeat with next_page from 1 to all_pages
	tell application "Skim"
		activate -- to watch
		try
			tell current_document
				set view settings to {display mode:single page} -- page display, one up
				-- for each page
				go to page next_page -- only to see it in action; comment out to speed up (a little)
				set page_words to get text for page next_page
			end tell
		end try
	end tell
	-- for each page of words
	repeat with newWord in (every word of page_words as list)
		set temp to 0
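		-- note: "lowercase" is not part of standard AppleScript;
		-- this line needs a scripting addition that provides such a command
		set newWord to lowercase newWord -- to ensure consistent case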
		try -- to exclude any valid numbers!
			set x to newWord as number
		on error
			tell application "FileMaker Pro"
				activate -- to watch		
				-- all words less than 4 characters are not collected
				if not (newWord is in excludes or length of newWord ≤ 3) then
					show every record of database 1
					try --to find an entry	
						show (the records whose cell "word" is newWord)
						delay 0.1
						copy (cell "accessed" of current record) + 1 to temp
						copy temp to cell "accessed" of current record
						show every record of database 1
					on error -- we could not find the word in the database, so ...
						show every record of database 1
						create new record
						go to last record
						set cell "word" of current record to newWord
						set cell "accessed" of current record to 1
						show every record of database 1
					end try
				else
					-- ignore this item		
				end if
			end tell
		end try
	end repeat
end repeat
tell application "FileMaker Pro" to show every record of database 1

tell application "Skim" to close current_document