Wednesday, November 1, 2017

Database documentation in R

R database and process documentation

As someone who has to return to old R projects the management of 'legacy' databases is an on-going 'challenge'. Some analysis are frequently called on to replicated or modify previous work or to use old databases for new analysis. Knowing what is in a database, what changes have been made to it and how to repeat key processes is central to reusing those resources efficiently. Comprehensive documentation is obviously key to redeploying the databases efficiently but, candidly, documentation tends to be poor. Usually impatience coupled and the perception that analysis and procedures seemed perfectly clear, if not intuitive, at the time contributes to inadequate documentation. In practice returning to old databases usually results in considerable time being spent on deciphering .Rhistory scripts and divining what had been done previously. This invariably results in a few 'WTF' moments or total bewilderment when trying to reverse engineer what the earlier analyst did.

Fortunately a number of protocals and procedures can radically improve the maintenance of databases and scripts.


1. Keep databases (i.e. everything in the current R directory) small

By keeping databases small and focussed on key subjects it is easier to increase the level of documentation (see below) and eliminate the fluff that inevitably clutters working directories.

Ideally issuing the ls() command should result in as sparse an output as possible. Using multiple small databases allows for information to be compartmentalised (and thus 'filed') it may results in confusion as to which database should be used in the first place. As far as possible start every project in its own directory.

2. Issue all commands from the CLI.
To record all changes and procedures in scripts only issue commands through the command line (LI),  this way a record of changes and procedures are always at hand. If you exit R using q('yes') R records all command line instructions in the hidden ".Rhistory" file. Unless you make extensive use of drop-down menus and some spreadsheet-like facilities, this automatically creates a record of the commands used. By systematically recording commands database changes and procedures become replicable.

Many spreadsheet-type actions are not recorded and using commands like fix(dataframename) results in changes that are not documented. Accordingly these routines should be avoided and all changes are made by issuing a command on the CLI.

Be generous in setting the memory allocated to ".Rhistory". On linux systems this is set using a command in a system-wide or personal .Rprofile file. This file should contain a line like:

Sys.setenv(R_HISTSIZE='100000')

The size should be large enough to ensure all changes to that database are recorded. As .Rhistory is a record of all commands submitted keeping these files clean requires regular housekeeping. Ideally .Rhistory should have only the commands required to replicate the output ab initio.  The .Rhistory file can easily be cluttered by repeated iterations, experimentation or the running of demo syntax. Users making minute adjustments on a ggplot, for example, can easily issue very similar ggplot( commands a few dozen times thereby cluttering the script file with fluff.

3. Use "comments" everywhere you can.

"Comments" enable users  to keep track of observations they wish to have a record of.  and using them often is key to keeping track of modifications. Comments can be pegged to R objects like dataframes and dataframe columns using syntax like:

'Dataframe altered on 3 May 2016 by adding the column "imp_val"' -> comment(dfname)  

or, referencing a column in a dataframe,

'These are imputed values for the age variable assuming missing obs were C.A.R.' -> comment(dfname$imp_val)

To retrieve the comment assigned to a particular dataframe column use:

comment(dfname$imp_val)

The main problem with comments is that they are concealed and users have to tell R to show the comment associated with an object. In other words the user  must know, or suspect, that a comment was assigned to the relevant object. If users do not routinely use comment assignment (and retrieval) the facility easily falls into disuse.

4. Use R through a script editor

To get better use from .Rhistory it is useful to use R through an editor like EMACS (vis ESS) or VI. These editors bring highlighting, colour coding, spell checking, formatting and other facilities to both .Rhistory and the current session. More importantly syntax editors bring search functionality into the active session of R and .Rhistory. The use of a script editor makes it easy to find every instance where a particular command or string was used or a particular variable referred to. Clearly this can be done by simply searching the .Rhistory but with slight modifications it is possible to rotate through every instance a particular command was used. The user can, for example, rotate through every ggplot(), lm() or table() command to find when it was last used and what the syntax was.

Under EMACS a slight modification to the .emacs file in the home directory  is needed. These changes enable users to type the first few letters of a command and and then use the up and down arrows to rotate through useage. The following text must be place in a personal .emacs file:

---------------------------------------------------------------------
 (eval-after-load
   "comint"
   '(progn
      (setq comint-input-ring-size 20480)
      (setq comint-scroll-to-bottom-on-output 'others) ; not current
      ;;=default: (setq comint-scroll-to-bottom-on-input nil)
      (setq comint-scroll-show-maximum-output t) ;;; this is the key
      (define-key comint-mode-map [up]
        'comint-previous-matching-input-from-input)
      (define-key comint-mode-map [down]
        'comint-next-matching-input-from-input)
(Define-key comint-mode-map "\C-a" 'comint-bol)))
----------------------------------------------------------------

Now, by typing "ggplot(" at the prompt and pressing the up arrow EMACS will show, in order, every command starting with the string "ggplot(" showing when it was used. While this is useful to iterate through your history its utility is enhanced by adopting standardised practices that allow easy access to key functions.

Instead of using the standard format of  LHS <- eg.="" i="" rhs="">comment(dfname)  <- i="" nbsp="">'this dataframe was altered by adding column called "imp_val"' )

routinely reverse the direction of assignment to: "xx ->  yyy". For example the above command becomes ...

comment(dfname) -> 'this dataframe was altered by adding column called "imp_val" '

Now if, in EMACS,  the user types "comment(" and press the up arrow EMACS will rotate through every comment and (hopefully) find the relevant assignment. Analysts are more likely to be looking for statistical procedures or plots rather than comment assignment. If every import of csv data  followed the syntax of "read.csv(" followed by the assignment  " ->  importedfilename" the user can easily rotate through all the csv imports. It now becomes easy to examine processes used in creating the current database. If the direction of assignment was not reversed and remains "LHS <- -="" i="" rhs=""> then pressing the up arrow will not find commands but assignment objects. For this to be of value the user will need to know the assigned name.

5. Create a sparse model of core commands

The practice of following the above syntax format also has the advantage of enabling the user to extract every assignment with ease. Typically analysts will experiment with a procedure or plot until the desired results are achieved. Once the user is happy with the syntax the result/output is assigned a name (by using the " ->  newname" syntax).

If this syntax format is routinely used then every change in a value or assignment can be extracted from .Rhistory with ease. Under linux this key syntax can be parsed to a new file (called "mychanges.txt") using a command like:

less .Rhistory | grep -E ' -> |%>%' > mychanges.txt

If the user has been inserting #comments into the syntax file then use:

less .Rhistory | grep -E ' -> |%>%|#' > mychanges.txt

The file "mychanges.txt' is a record of every assignment made in that database using either "->" or tidyverse's  "%>%". This new file is a complete record of every change made to that database and contains all the information needed to replicate all the work to date.

Commands are often refined incrementally until the desired outcome is reached and the extracted "mychanges.txt" file will contain every iteration made. In practice analyst usually only want the final iteration to be preserved (analysis presumably stops when the solution has been identified). It is a little extra housekeeping to remove all iterations but the last in "mychanges.txt". Doing this should result in a sparse and clean syntax file that records all changes to the database and no 'fluff'.

6. Routinely use a file to record notes 

The key to adequate documentation is being reminded to make notes and being able to do so easily. Database documentation can be improved by adding a decriptive file to record intentions, procedures, progress or conclusions to every R directory (and thus to every R database). To be consistent keep one of these files in every R directory and use the same name eg. "desc.txt".

On linux systems the process is facilitated by creating an alias for EMACS (or VI) that equates the intial command "emacs" with "emacs desc.txt". Now when the user navigates to the relevant R database/directory typing "emacs"results in "desc.txt" being created (or re-opened) and opened in EMACS. The user can thus edit this file while running R in another buffer ("Ctrl-x 2" to open a new buffer in which to run R via "Alt-x R").

On Debian systems the alias is set in the home directory's .bashrc or .bash_aliases file using a string like:

alias emacs="emacs -g -1 desc.txt" (or put this into a script as shown below)

The "desc.txt" file is used to record notes that don't quite belong in .Rhistory. Once the R session is over exiting requires that EMACS be closed. Before the user exits EMACS they will be prompted to save changes to "desc.txt". This can be done after adding notes that will guide future users. As the notes made gradually accumulate a useful addition to .Rhistory is  created.

7. Routinely add a timestamp to .Rhistory 

Placing a timestamp in .Rhistory may help you to find  previously used procedures or identify when modifications were made to a database. However placing a timestamp in .Rhistory is not straight-forward as this file is only updated when you exit R. Also .Rhistory records only the commands issued and not the output (the date etc.). One workaround is to create a script that appends a timestamp to .Rhistory immediately on starting R. I combine the insertion of the timestamp with the opening of "desc.txt" referred to above through a script.

To use the script navigate to the working directory and start "emacsR.sh". emacsR.sh is a text file which contains the following text

sed -i -e '$a\ ' .Rhistory
echo 'Session started.... ' $(date +%Y-%m-%d:%H:%M) >> .Rhistory
sed -i -e '$a\ ' .Rhistory
emacs -g -1 desc.txt

Line 2 inserts a timestamp at the end of the .Rhistory file in the working directory. Lines 1 and 3 pad that text with a blank line and the last line opens "desc.txt" with emacs. Remember to make emacsR.sh executable ("sudo chmod a+x emacsR.sh").


8. Refer to R objects in external products

The best record of how a database was used/applied is kept by writing reports in R Markdown, or knitr. This is because the code 'chunks' in the scripts refers to the R objects directly. Users, for example, can see that a particular table was used by crosstbulating variables x and y and what the  syntax was used in creating a particular plot.

Unfortunately many clients insist on tables, graphics etc. being  placed in a word processor document (eg. Word or LibreOffice) document. By doing this  they can apply their own styles easily or not have to extend their IT skills. If you are required to do this rather than pasting plots and tables into their document try linking to the R object (OLE?) instead. This way you may be able to preserve references to both the name of the R object used and the database/directory it camefrom. These references help when trying to work out what the author (you?) did last time.

9. Full circle

The use of many small databases coupled to the creation of sparse histories and descriptive files may result in a plethora of files and greater confusion.  Fortunately on linux systems it is easy enough to examine every .Rhistory (or 'desc.txt' file) for a particular search phrase. Once the directory has been identified the user has a reference to the relevant database and is better able to locate the key information on procedures or  data.

An example of a process may run something like this:

A user needs to change a plot they provided in a report last year.

1. From item 8.  above the report In LibreOffice/Word contains a table produced by R called, say,  "fiscalcliff".

2. Search every R database on the system that produced an object called "fiscalcliff". This is done by typing, in linux,

find / -name .Rhistory -exec grep -H "fiscalcliff" {} \;

The '-H'  command above outputs the filename of relevance. Now you should know which directory is most relevant.

3. Navigate to the database and open  'mychanges.txt'  created from step 5 above (or just use .Rhistory). Search (based on step 4 above) from the end of the file backwards for  'fiscalcliff'  ("Ctrl-r fiscalcliff" in EMACS) to show the last time that term was used.

4. The mychanges.txt file will contain all the commands needed to replicate everything to get to that point of output. Theoretically the user merely has to paste all the relevant text into R to get to that point again.


Caveats

There are procedures that may not be saved using the above text varbatim. Both base plots and ggplot, for example do not output graphics using the ' -> ' or ' <- assignment.="" base="" i="" like="" plots="" something="" use="">png('filename.png')
and ggplot uses ggsave('filename.png'). These commands can be outputted to the sparse history (mychanges.txt) file by widening the grep term in:
less .Rhistory | grep -E ' -> |%>%|#'  mychanges.txt

to include the additional terms. For example ggplot users may want to use:

less .Rhistory | grep -E ' -> |%>%|#|ggsave' > mychanges.txt