Text Processing

From nswccWiki

The importance of being earnest! Did I use that word?

Pages is a capable application for writing documents. One way to mark words for inclusion in a dictionary or index at the end of the document is to set them in italics. The marking can then stay in the text without having to be "un-marked" later. It is a good example of how computers can use commonly accepted human mark-up. If the document is only 2 to 3 pages long, you might extract the list manually. But what if the document is many pages long, and you want to be sure you have collected every marked word, or to check that you have used a particular word at all?

One way is to proceed as follows:

  • As you write the document, italicise words that you wish to be automatically placed in the index.
  • When you have finished, keeping the document open (but saved for safety), run the AppleScript shown below.
  • The result will be a list of two-item lists in the form {{"<word>",<page number>},...}

This script is a very short way to extract a list of words with their page numbers from a Pages document of any length, and it demonstrates the expressive power of AppleScript. Ideally the script would be extended to dump the result into a text file for further processing or, going further, to produce a two-column formatted index complete with a title. An intermediate step would be to sort the entries alphabetically and combine repeated words, so that {{"environment",12},{"environment",13},{"environment",20},...} becomes something like {"environment",12,13,20}.

-- Construct an Index list from italicised words in a Pages Document
-- cross references against page
-- by Ian W. Parker first worked on: 2009-01-04, Last worked on: 2009-01-04
--
set word_index to {}
--
tell application "Pages"
	-- scan each paragraph
	repeat with current_paragraph in every paragraph of document 1
		-- scan each word
		repeat with next_word from 1 to count of words in current_paragraph
			-- check to see if word in italics to accumulate for index
			set current_word to (a reference to word next_word of current_paragraph)
			if italic of current_word then
				-- accumulate word
				set current_page to page number of containing page of current_paragraph
				set word_index to word_index & {{word of current_word as string, current_page}}
			end if
		end repeat
	end repeat
end tell
-- see result in result pane in the form {{"<word>",<page number>},...}
word_index
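
The intermediate step of sorting and combining entries could be sketched as follows. This is only a minimal sketch, not part of the original script: the handler name merge_index and the sample data are made up for illustration. It assumes the input is the {{"<word>",<page number>},...} list built above, and it uses a simple insertion sort because AppleScript has no built-in list sort. AppleScript compares text case-insensitively by default, so "Environment" and "environment" end up in one entry.

```applescript
-- Hypothetical helper (not in the original script): merge and sort index entries.
-- Input:  {{"<word>", <page>}, ...} as produced by the script above
-- Output: {{"<word>", <page>, <page>, ...}, ...} sorted alphabetically by word
on merge_index(word_index)
	set merged to {}
	repeat with entry in word_index
		set {this_word, this_page} to contents of entry
		set found to false
		repeat with i from 1 to count of merged
			if (item 1 of item i of merged) is this_word then
				-- known word: add the page unless it is already listed
				if this_page is not in (rest of item i of merged) then
					set item i of merged to (item i of merged) & {this_page}
				end if
				set found to true
				exit repeat
			end if
		end repeat
		if not found then set end of merged to {this_word, this_page}
	end repeat
	-- insertion sort on the word (item 1 of each entry)
	repeat with i from 2 to count of merged
		set current_entry to item i of merged
		set j to i - 1
		repeat while j > 0
			if not ((item 1 of item j of merged) > (item 1 of current_entry)) then exit repeat
			set item (j + 1) of merged to item j of merged
			set j to j - 1
		end repeat
		set item (j + 1) of merged to current_entry
	end repeat
	return merged
end merge_index

-- sample data standing in for the word_index list built by the script above
set sample_index to {{"index", 3}, {"environment", 12}, {"environment", 13}, {"index", 3}, {"environment", 20}}
merge_index(sample_index)
-- result: {{"environment", 12, 13, 20}, {"index", 3}}
```

Because the document is scanned from start to finish, the page numbers for each word arrive in ascending order, so only the words need sorting. To use the handler with the real data, paste it into the same script and replace the last line, word_index, with merge_index(word_index).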

Fonts I have known, but now wish to forget

I have a large number of documents that have undergone many revisions over the years. In the process, stray characters in strange fonts have been left behind in the documents. This happens for several reasons:

  • similar glyph rendering in different fonts, which is difficult to weed out manually
  • fonts that were not transferred from one system to another, so the current system picks the "nearest font"
  • rogue fonts embedded in non-printing characters, especially blanks and tabs.

The aim is a clean document that won't give formatting trouble when transferred from one system to another. In some ways this solution is not as good as converting to Adobe Acrobat, which carries the required characters from the fonts used along with the document, but it is quick and works well. The Acrobat conversion technique just staves off the inevitable.

One good feature of the "Pages" word processor is that it will alert you to any required font that is not present in the system when you open a document from an external source, such as a Word document from a Windows machine.

Using this script will, of course, change some of the text formatting, but if you want to remove stray fonts it is a very useful start. You will notice it scans the document one paragraph at a time rather than one character at a time! Because of the way AppleScript works, this makes the process very, very fast. You will also notice that three fonts are not touched. These are the three fonts I use for writing: Times New Roman for normal text, Helvetica for headings and Andale Mono for fixed-width text (computer program listings etc). The list can be extended, BUT the entries must be the internal names of the fonts (such as "TimesNewRomanPSMT"), not the names of the font files.

Apart from the font name and size, no other formatting, such as indenting, underlining, bold or italic, is changed. For anyone interested in fast removal of (some) font formatting, or in changing the look of a whole document, this would be a good start. Of course, you could do much more in one go, but you have to start somewhere.


-- text font cleaner, changes every paragraph whose font is not in the default list to the first named font in default_fonts
-- designed to remove rogue or obsolete text formatting
-- vers 0.1 - first worked on 2010-07-10, last worked on 2010-07-10
-- requires an already open Pages application document, and you can watch it do its work!!!
set default_fonts to {"TimesNewRomanPSMT", "Helvetica", "AndaleMono"}
tell application "Pages"
	activate
	-- roll over each paragraph
	-- this does not look at each character in turn, so expect some reformatting to be done
	repeat with current_paragraph from 1 to count every paragraph of document 1
		try -- using try is a fast way to skip over embedded images
			set current_text to select every text of paragraph current_paragraph of document 1
			if font name of current_text is not in default_fonts then
				set font name of current_text to item 1 of default_fonts
				set font size of current_text to 12 -- default to 12 pt
			end if
		on error
			-- ignore the change font request
		end try
	end repeat
end tell