How to: Custom charts with R, OpenOffice Calc, and Inkscape

In my recent posts analysing the Ontario Public Sector Salary Disclosure, I produced several visuals, and I thought I would share how I made them. Custom, high-quality, effective, professional, clean, stylish visuals can be difficult to wrestle out of standard analytical software, so here’s some guidance on how to do it.

pathologistsovertime top4000_raise_2012 graphv2

I used all open source tools:

  • R – Open source statistical analysis and charting software
  • OpenOffice Calc – Open source alternative to Excel
  • Inkscape – Open source alternative to Adobe Illustrator

The process was:

  1. Data Gathering and Analysis – a deep topic for another post
  2. CSV as a bridge between analysis tools and charting tools
  3. R or Calc to generate base chart
  4. Inkscape to clean up

Charts with R and Inkscape

Chart 1: Average Pathologist Salary by List Ranking
pathologistsovertime

1. Data Gathering and Analysis – this was an extensive task undertaken with C# and is a topic for another post

2. CSV – In order to bridge from the analysis tool to the charting application (R), I used a simple flat file. If the analysis were done in R, this would obviously not be necessary.
csv_image
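
One detail of the CSV step is worth a minimal sketch: with its default settings, `read.csv` passes column names through `make.names`, so a header such as `2013` arrives in R as `X2013`. The two-row sample below is made up purely for illustration; it assumes only that the file has a `rank` column followed by one column per year.

```r
# Hypothetical two-row sample standing in for the real CSV
csv_text <- "rank,2012,2013
1,0.031,0.072
2,0.018,0.065"

# text= lets read.csv parse a string instead of a file
sample_data <- read.csv(text = csv_text, header = TRUE)

# Headers that start with a digit are prefixed with "X"
print(names(sample_data))

# So a year column is fetched by building its name as a string
year_col <- paste0("X", 2013)
print(sample_data[, year_col] * 100)
```

This is why code that works with such a file has to prepend "X" to the year before indexing into the data frame.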

3. R – Load the CSV into R and produce the base chart with the following code. See in-line comments for detail on how it works.

# Clear all existing variables from memory
rm(list=ls())
 
# Set working directory for the csv file
setwd("C:\\Users\\Aleksey\\Documents\\Data Journalism\\c#\\salaryDisclosure")
 
# load the csv file
data <- read.csv("rankRaiseAalysis.csv", header=TRUE, sep=",", as.is=TRUE)
 
# take a subset of the data, only the top 4000 rows
lessData <- data[1:4000,]
 
# x axis is rank
x <- lessData$rank
 
# set up the grid for the graphs
# mfrow = c(4,4) defines a 4 x 4 grid
# mar defines margins, bottom, left, top, right
# mgp moves the axis labels around and is currently redundant
par(mfrow = c(4,4), mar = c(0.1,0.5,2,0.5), mgp=c(1,1,0))
 
 
for (i in 1998:2013)
{
  # for whatever reason R doesn't do sensible string concatenation
  # this adds X to i to get the string for fetching the variable from the dataframe
  y <- paste0("X",i)
  plot(x, lessData[,y]*100, # also multiply by 100 to get % values
       type="h", # histogram-like vertical lines
       ylim=c(0,20), # y-axis limits min 0, max 20
       main=i-1, # this is the chart title
       ylab="", # no y axis labels
       xlab="", # no x axis labels
       xaxt="n", # suppress x axis
       yaxt="n", # suppress y axis
       xaxs="i", # no margin within the plotting frame to the left or right
       yaxs="i", # similarly
       col="#550000" # plotting colour
       )
}
 
# for the next plot, we don't want the 4x4 grid, so set it back to 1x1
par(mfrow=c(1,1), mar=c(3,3,3,3), mgp=c(1.5,0.5,0))
 
plot(x, lessData$X2013*100, # plot only 2013 in this chart, multiply by 100 to get % values
     type="h", # histogram-like vertical lines
     ylim=c(0,20), # y-axis goes from 0 to 20
     ylab="", # no y axis labels
     yaxs="i", # no margin within the plotting frame to the top or bottom
     yaxt="n", # suppress y axis
     xlim=c(0,4000), # x-axis goes from 0 to 4000
     xlab="", # no x axis labels
     xaxs="i", # no margin within the plotting frame to the left or right
     xaxt="n", # suppress the x-axis
     main="Year 2012 % Salary Growth For Top 4000 on Sunshine List", #title
     col="#D45500")
 
# add our own axis title
title(xlab="Rank",
      cex.lab = 1) # size
 
# add a custom x-axis
axis(1, # 1 = at the bottom
     at=c(1,1000,2000,3000,4000), # vector of value locations for the ticks
     labels=c(1,1000,2000,3000,4000)) # vector of labels for those ticks
 
# add a custom y-axis
axis(2, # 2 = y axis
     at=c(0, 10, 20), # vector of value locations for the ticks
     labels=c("0", "10", "20"), # vector of labels for those ticks
     cex.lab=0.5) # size


The following chart is generated by R and can be exported to SVG for loading into Inkscape.
Rplot
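
The post does not say how the SVG export was done (it may well have been through the R GUI). As one possible approach, base R's `svg()` graphics device can write a plot straight to an SVG file, provided the R build has cairo support. The file name and the placeholder plot below are assumptions for illustration only.

```r
# Hypothetical output file; tempfile() keeps the sketch self-contained
out_file <- tempfile(fileext = ".svg")

# Open an SVG graphics device (width and height are in inches)
svg(out_file, width = 8, height = 6)

# Any base-graphics plotting calls made now are written to the file
plot(1:10, (1:10)^2, type = "h", col = "#D45500",
     xlab = "", ylab = "", main = "Placeholder chart")

# Close the device to finish writing the file
dev.off()

print(file.exists(out_file))
```

The resulting file can then be opened directly in Inkscape.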

4. Inkscape

Open the file in Inkscape, ungroup the elements and start cleaning:

  • Colourise
  • Add legends and labels
  • Remove excess ink for a cleaner look
    • No X or Y axis lines; these are visually implied by the ticks
    • No plotting area border; again, visually implied by the other elements
  • Improve the X and Y axis ticks and labels
  • Make extensive use of the object align and distribute features




Chart 2 – Salary Growth at top of “Sunshine List”
top4000_raise_2012

1. Data Gathering and Analysis – this was an extensive task undertaken with C# and is a topic for another post

2. CSV – In order to bridge from the analysis tool to the charting application (R), I used a simple flat file. If the analysis were done in R, this would obviously not be necessary.
csv_image2

3. R

Load the CSV into R and produce the base chart with the following code. See in-line comments for detail on how it works.

# Clear all existing variables from memory
rm(list=ls())
 
# Set working directory for the csv file
setwd("C:\\Users\\Aleksey\\Documents\\Data Viz\\blogging\\017 - Pathologists follow-up")
 
# load the csv file
data <- read.csv("newPathologists.csv", # csv file
                 header=TRUE, # variable names are at the top
                 sep=",", # it's commas, it's a csv
                 as.is=TRUE)
 
# build a plot with the first series of data against year
plot(data$year, data$X1.to.25,
     type="l", # makes a line plot
     ylim=c(0,450000), # sets the range for the y axis
     xaxt="n", # suppresses the x-axis for customisation later
     lwd=3) # sets the line width
 
# build our own custom x axis 
axis(1, # puts it at the bottom
     at=c(1997,2002,2007,2012), # position for the ticks
     labels=c(1997,2002,2007,2012)) # labels for those ticks
 
# use a for loop for the rest of the data, the other 7 series
for (i in 1:7)
{
  # build the column name for series i, e.g. i = 1 gives "X26.to.50"
  # (paste0 accepts any number of arguments and joins them without a separator)
  y <- paste0("X", 1+25*i, ".to.", 25+25*i)
 
  # the lines command adds lines to an existing plot
  lines(data$year, # x is still year
        data[,y], # having built y as a string, e.g. "X26.to.50", you can reference it this way
        lwd=(3-2*(i/7))) # a little function for line width that makes later series thinner
}


The following two charts are generated by R and can be exported to SVG for loading into Inkscape.
rankraise part1
rankraise part2

4. Inkscape

Open the file in Inkscape, ungroup the elements and start cleaning:

  • Customise colours
  • Better titles
  • Grey-out borders and labels to reduce chart clutter




Chart with OpenOffice Calc and Inkscape

Chart 3 – Ontario “Sunshine List” Salary Growth
graphv2

1. Data Gathering and Analysis – this was an extensive task undertaken with C# and is a topic for another post

2. CSV

In order to bridge from the analysis tool to the charting application (Calc), I used a simple flat file. If the analysis were done in Calc itself, this would obviously not be necessary.

3. OpenOffice Calc

Load the CSV into Calc. Create the three bar charts.

4. Inkscape

  • Copy and paste from OpenOffice Calc into Inkscape.
  • Tear up everything, keeping only the bars.
  • Custom labelling, lines, row shading, etc.
  • Extensive use of align and distribute features.

Pathologists follow-up, data from 1996 to 2013 from Ontario “Sunshine List”

This is a follow-up to my article, 20-25% raise for Ontario’s pathologists in 2012. In an exploratory analysis of the 2013 Ontario Public Sector Salary Disclosure, the “Sunshine List”, I identified a surprising change in the packages (salary + benefits) of pathologists on the list.

Further Analysis

Extending the analysis back to 1996, the start of the “Sunshine List”, gives a more complete picture of salary changes for Ontario pathologists.

pathologistsovertime

The analysis and the visual show that:

  • Since 1997, pathologist packages have increased on average 5.5% annually
  • From 1997 to 2007, packages increased 6.6% annually, an exponential trend which was clearly unsustainable
  • From 2007 to 2012, packages saw little to no growth
  • In 2013, there was an unprecedented, single-year growth of 22%
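
Assuming the annual figures above are compound growth rates, the calculation behind such a number can be sketched as follows. The start and end values here are hypothetical and are not taken from the pathologist data.

```r
# Compound annual growth rate between two values `years` apart
cagr <- function(start_value, end_value, years) {
  (end_value / start_value)^(1 / years) - 1
}

# Hypothetical average packages, for illustration only
start_package <- 100000
end_package   <- 150000

# A 50% rise spread over 10 years works out to roughly 4.1% a year
print(round(cagr(start_package, end_package, 10) * 100, 1))
```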

Contact

I contacted the Ontario Ministry of Health and Long-Term Care. Their media relations team passed me on to the Ontario Association of Pathologists, whom I had already contacted and from whom I have yet to receive a response. A sensible strategy for them would be to ignore enquiries from bloggers, regardless of whether they had anything to hide.

Second largest growth-gap in 2012 for Ontario “Sunshine List”

I have previously shown that 2012 was a good year for the highest paid individuals in the Ontario Public Sector Salary Disclosure (“Sunshine List”). The 1,000 best paid workers saw average salary growth of 7.2%, whereas everyone else saw 2.2%. It can further be shown that 2012 was one of the biggest years for disproportionate growth at the top. Only 2008 shows a bigger gap between the 1,000 best paid and everyone else:

top4000_raise_2012

Furthermore, we can see that:

  • 2008, 2006, 2004, and 2000 have a similar shape to 2012. In all of these years, the very top (1,000 or so) of the list saw considerably more growth than those near the top (ranks 1,000 to 4,000 or so).
  • 2009-2011 were weak growth years, with 2009 and 2011 actually showing lower salary growth at the very top
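
As a rough sketch of how such a growth gap could be computed, assume a data frame with a `rank` column and one column of fractional growth per year. The numbers below are randomly generated for illustration and are not the Sunshine List figures.

```r
# Made-up growth figures for 4,000 ranks and two years, for illustration only
set.seed(1)
growth <- data.frame(
  rank  = 1:4000,
  X2011 = runif(4000, 0.00, 0.04),
  X2012 = c(runif(1000, 0.05, 0.09), runif(3000, 0.01, 0.03))
)

# Split the ranks into the top 1,000 and everyone else
is_top <- growth$rank <= 1000
year_cols <- setdiff(names(growth), "rank")

# Gap per year = mean growth of the top 1,000 minus mean growth of the rest
gap <- sapply(year_cols, function(col) {
  mean(growth[is_top, col]) - mean(growth[!is_top, col])
})

# Express the gap in percentage points
print(round(gap * 100, 1))
```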

Bolivia: El Régimen Tributario Simplificado

I am writing this from Sucre, Bolivia, which is under siege, with all major roads in and out of the city blockaded. Doing some research, I found this article (Spanish language): http://eju.tv/2013/04/la-semana-se-inicia-con-amenazas-de-paros-y-bloqueos-el-gobierno-no-cede/

Naturally I saw the infographic, didn’t like it, and as an exercise, spruced it up.

Improved:

bolivian bars

Original:

bolivian bars original

Current publishing of Ontario “Sunshine List” not good enough

Standing where we are in 2013, Ontario’s Public Sector Salary Disclosure Act of 1996 seems ahead of its time in terms of open government data: 17 consecutive lists have been published of all public sector employees who earned more than $100,000 in a year. But what was once a bold step forward in public accountability is now falling behind in other ways.

By today’s standards, publishing an intimidatingly long list of approaching 100,000 names and salaries across 100 or so HTML or PDF pages does not constitute disclosure. Sure, it’s great if you want to look up how much your boss makes or to keep an eye on the salaries of TVO presenters, but after that it falters. Making data possible to access and making it easy to access are different things. If the data were published in print, but not made available online, would that be acceptable? Was it in 1996?

Any data journalist who wants to work with the data must first scrape it from those hundred pages, which requires either some technical skill and some time, or brute force and quite a lot of time. Even answering simple questions like “How many names are on the list?” and “What is the average salary?” has to wait for this scraping to be performed.

What about the general public? Even if they’ve never heard of scraping a web page, they should still have natural questions like: How many people from each employer are on the list? How much money are CEOs making on average this year? Is that more than last year? How many people on the list are Pathologists?

Easy change

At a minimum, the entire list should be made available for download in a single file in CSV and/or XLS format. This would remove barriers and save time for any data journalist wanting to access the information. This should be trivially easy to do, because by the looks of the URLs, the 2013 (for 2012) disclosure is already stored in a database.
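
To illustrate how little work those questions would take with a single file, here is a minimal sketch. The column names (`employer`, `name`, `position`, `salary`) and every row are invented; the real disclosure does not necessarily use this layout.

```r
# Hypothetical extract of a single-file disclosure; names and figures are invented
csv_text <- "employer,name,position,salary
Hospital A,Person One,Pathologist,310000
Hospital A,Person Two,Chief Executive Officer,420000
University B,Person Three,Professor,150000
University B,Person Four,Professor,130000"

disclosure <- read.csv(text = csv_text, header = TRUE, as.is = TRUE)

# How many names are on the list?
print(nrow(disclosure))

# What is the average salary?
print(mean(disclosure$salary))

# How many people from each employer are on the list?
print(table(disclosure$employer))

# How many people on the list are pathologists?
print(sum(disclosure$position == "Pathologist"))
```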

Empower the ecosystem

Not only does this mean that citizens and journalists could better access the data, but it also would enable data visualisation and interaction practitioners to create tools for the entire public to access the information.

20-25% raise for Ontario’s pathologists in 2012

Evidence from the Ontario Public Sector Salary Disclosure, the so-called “Sunshine List”, shows that pathologists in Ontario saw an average salary increase of 20-25% in 2012 over the previous year. The average for the entire list was 2.2%.

Appearing in both the 2013 and 2012 disclosures, 195 pathologists saw their average package (salary + taxable benefits) increase by $57k or 20.6% from $280k to $337k. The top 200 earning pathologists in 2012 averaged $348k, a 25.4% increase over 2011.

Ontario Public Salary Disclosure

Every year since 1996, the Ontario Ministry of Finance has released a list of all public sector employees who earned more than $100,000 in the previous year.

Why?

So what’s happening here? Why are pathologists seeing a 25% raise while the rest of the list shows a very reasonable growth of 2.2%?

At this point all we have are hypotheses. Analysis of the publicly available data has uncovered a surprising feature, and further investigation is required to find the cause. This is exactly the sort of process we should expect from an open government/open data initiative like the Sunshine List.

Detail

The positions in the data mapped to pathologist were:
  • Pathologist
  • Pathologist / Pathologiste
  • Pathologist Laboratory Medical Director and Chief of Medical Staff
  • Pathologist/Laboratory Medical Director
  • Pathologist/ Anatomopathologiste
  • Pathologist – Pathology / Pathologiste
  • Associate Director Pathology / Directrice adjointe Pathologie
  • Neuropathologist / Professor
  • Neuropathologist
  • Laboratory Pathologist / Pathologiste du laboratoire
  • Laboratory Pathologist/Pathologiste du laboratoire
  • Medical Director Clinical Lab Services / Pathologiste
  • Pathologist/Pathologiste
  • Associate Director Pathology / Directeur adjoint Pathologie
  • Associate Pathologist / Pathologiste adjoint
  • Associate Pathologist/Pathologiste adjoint
  • Senior Associate Pathologist
  • Senior Associate Pathologist / Pathologiste associé(e) principal(e)
  • Division Head Haematopathology
  • Chief Pathology & Laboratory Director
  • Associate Pathologist
  • Associate Pathologist / Pathologiste associé(e)
  • Pathologist and Director Laboratory Medicine
  • Pathologist/Director Laboratory Medicine
  • Director Pathology / Directeur Pathologie
  • Pathologist–in–Chief
  • Pathologist-in-Chief
  • Associate Head of Pathology
  • Associate Head Pathology
  • Division Head Pathology
  • Pathologist – General / Pathologiste général
  • Pathologist – General/Pathologiste général
  • Anatomical Pathologist / Anatomopathologiste
  • Anatomical Pathologist/Anatompoathologiste
  • Chief Pathologist
  • Administrative Director Pathology and Laboratory Medicine
  • Senior Pathologists Assistant
  • Speech Pathologist – Voice
  • Speech Pathologist Voice
  • Administrative Director Pathology & Laboratory Medicine
  • Haematopathologist
  • Senior Manager Pathology & Laboratory Medicine
  • Anatomic Pathologist / Professor
  • Pathologist & Discipline Director
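
The original mapping was part of the C# analysis and is not shown here. As a rough R sketch of the same idea, an explicit lookup of accepted titles can be used to flag rows; the `positions` vector is hypothetical and only a few of the titles above are included.

```r
# A few of the titles from the list above, used as an explicit lookup
pathologist_titles <- c(
  "Pathologist",
  "Pathologist / Pathologiste",
  "Neuropathologist",
  "Chief Pathologist"
)

# Hypothetical position column from the disclosure
positions <- c("Pathologist", "Professor", "Chief Pathologist",
               "Pathologist / Pathologiste", "Psychiatrist")

# Flag rows whose position appears in the lookup, after trimming whitespace
is_pathologist <- trimws(positions) %in% pathologist_titles

print(sum(is_pathologist))
```

An explicit lookup is more tedious than a pattern match, but it makes every included title visible and reviewable.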

Sources:

http://www.fin.gov.on.ca/en/publications/salarydisclosure/pssd/

7.2% raise for 1,000 best paid Ontario public sector employees

graphv2

The top 1,000 employees with the highest package (salary + taxable benefits) in the Ontario Public Sector Salary Disclosure, the so-called “Sunshine List”, saw an average increase of almost $25,000 in 2012 compared to the previous year, an increase of 7.2%, much higher than the bottom half of the 80,000-strong list which saw an increase of only 2.2%.

Is this cause for alarm? Highly paid CEOs are fully in the public spotlight, and the many, many school principals have their pay closely monitored, but what about the highly paid individuals near, but not at, the top? The data shows that for them, 2012 was a good year.

Every year since 1996, the Ontario Ministry of Finance has released a list of all public sector employees who earned more than $100,000 in the previous year.

Oversight

We can all see that “Sunshine List” champion Thomas Mitchell, President & CEO of Ontario Power Generation, took a pay cut this year, but with approaching 100,000 names on the list, more sophisticated, data-driven oversight is possible.

Government-friendly observers point out that the average salary on the list has decreased, just like last year, but that is a red herring. Anyone can add over 9,000 people earning just over $100k to a list with an average salary of $129k and bring down the average. As the list continues to grow from the bottom, we can expect the average salary to decline, without this being any indicator of public fiscal discipline.
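
The arithmetic behind this point can be sketched in a few lines. The $129k average and the 9,000 newcomers come from the paragraph above; the existing list size of 79,000 and the newcomers' average of $102k are assumptions made only for the example.

```r
# Combined mean of two groups, weighted by group size
combined_mean <- function(n_a, mean_a, n_b, mean_b) {
  (n_a * mean_a + n_b * mean_b) / (n_a + n_b)
}

# Assumed figures: an existing list of 79,000 averaging $129k,
# joined by 9,000 newcomers averaging $102k
existing_n    <- 79000
existing_mean <- 129000
new_n         <- 9000
new_mean      <- 102000

new_average <- combined_mean(existing_n, existing_mean, new_n, new_mean)

# The list average falls even though nobody took a pay cut
print(round(new_average))
```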

Opposition partisans will lament the increasing growth of the list, 9,600 more this year and 7,500 the year before. This is again misleading. The pyramid shape of any organisation tells us that there are more people as you move down the salary brackets. With a perfectly reasonable average salary growth at just over 2.5%, 9,600 employees graduated to the “Sunshine List” this year after having earned around $98k last year. Probably more than 9,600 employees, currently earning around $98k will be new additions to the list next year, and more the year after. Inflation and economic growth will ensure that the list grows, and the pyramid shape will ensure that it grows faster.

Top 1,000

So who are these lucky 1,000 who on average made 7.2% more in 2012?

This year the top 1,000 packages on the list included:
  • 583 individuals working in hospitals
    • 176 Pathologists
    • 50 Chief Executive Officers
    • 66 Vice-Presidents (Senior, Executive, etc.)
    • 79 Psychiatrists
  • 86 employees in electricity
    • 56 Vice-Presidents (Senior, Executive, etc.)
  • 144 working at Universities
    • 100 Professors

Big raises

Of the 1,000, 737 can be matched exactly by name and organisation type to last year. 92 of those fortunate souls saw an increase of over 25%! At the top of the pack was Mohamed Abelaziz Elbestawi, Vice-President Research/Professor at McMaster University, whose reported salary paid was $266k in 2011 and $506k in 2012! Trung Kien Mai, a Pathologist at The Ottawa Hospital, saw his salary paid move from $306k in 2011 to $515k in 2012!

Of those 92 with big raises:
  • 83 work in hospitals
  • 50 are Pathologists
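
The matching itself was done in the C# analysis. As a minimal sketch of the same idea in R, two years can be joined on name and organisation type with `merge`, and the raise computed per person. All names, sectors and salaries below are invented.

```r
# Invented records for two consecutive years, for illustration only
last_year <- data.frame(
  name   = c("Person One", "Person Two", "Person Three"),
  sector = c("Hospitals", "Universities", "Hospitals"),
  salary = c(300000, 200000, 250000),
  stringsAsFactors = FALSE
)
this_year <- data.frame(
  name   = c("Person One", "Person Two", "Person Four"),
  sector = c("Hospitals", "Universities", "Electricity"),
  salary = c(390000, 206000, 280000),
  stringsAsFactors = FALSE
)

# Inner join: keep only people found in both years with the same name and sector
matched <- merge(last_year, this_year,
                 by = c("name", "sector"),
                 suffixes = c(".prev", ".curr"))

# Relative raise for each matched person
matched$raise <- matched$salary.curr / matched$salary.prev - 1

# People whose raise exceeded 25%
big_raises <- matched[matched$raise > 0.25, ]
print(big_raises[, c("name", "sector", "raise")])
```

Exact matching on name is conservative: anyone whose name is spelled differently between years, or who changed sector, drops out of the comparison.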

More questions

At this point, this analysis raises more questions than it answers, but that is to be expected from an analysis of this salary disclosure data. The Public Sector Salary Disclosure Act can help us find questions, not answers.

What we do know is that:
  • Salaries near the top grew substantially
  • Those salaries grew much more, even on a percentage basis, than those at the bottom
  • Growth was higher than expected given slow economic growth
  • Some individuals can be shown to have experienced extraordinary raises
  • Pathologists do well, and 2012 was a particularly good year for some

Source

http://www.fin.gov.on.ca/en/publications/salarydisclosure/pssd/

Create: Information Tree

Today I am releasing another tool that allows users to create and export a diagram/visualisation. Today it is an Information Tree.
Click here to access the tool.



Users can:

  • Define a complete 4-level hierarchy, breaking a concept down to four levels
  • Customise some aspects
  • Export to SVG

This is an alpha release of the tool, hopefully the first of many. Any and all feedback is welcome.

Also, if you have the time and ability to do similar or better things, I invite you to contact me regarding collaboration.

Create: Information Wheel

Today I am releasing a tool that allows users to create and export an Information Wheel.
Click here to access the tool.



Users can:

  • Define a complete 4-level hierarchy, breaking a concept down to four levels
  • Customise ring-sizes
  • Customise colouring
  • Customise text
  • Export to SVG

This is an alpha release of the tool, hopefully the first of many. Any and all feedback is welcome.

Also, if you have the time and ability to do similar or better things, I invite you to contact me regarding collaboration.

New Project: K-means Clustering

When it comes to data visualisation design, it’s always important to consider your purpose and your audience. Are you trying to convince your audience of a particular point of view? Are you giving your audience a platform from which to explore and find their own insights? In my latest piece I take a step down a less discussed path.

I have created an interactive tool using D3.js that gives the user a chance to see and interact with the typical k-means clustering algorithm from data mining/machine learning. It is my hope that it will enable students to develop an intuition for how the algorithm works, and a better appreciation of its shortcomings.

You can learn more about k-means clustering here.
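
For readers who prefer code to animation, the standard algorithm (Lloyd's algorithm) can be sketched in a few lines of R. This is not the code behind the D3.js tool; the function name `simple_kmeans` and the synthetic points are assumptions for illustration, and base R already ships a production version as `kmeans`.

```r
# Minimal k-means (Lloyd's algorithm) on a numeric matrix `points`
simple_kmeans <- function(points, k, max_iter = 100) {
  # Start from k distinct rows chosen at random as initial centroids
  centroids <- points[sample(nrow(points), k), , drop = FALSE]
  assignment <- rep(0L, nrow(points))

  for (iter in seq_len(max_iter)) {
    # Assignment step: squared distance from every point to every centroid
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(points, 2, centroids[j, ])^2)
    })
    new_assignment <- max.col(-dists, ties.method = "first")

    # Stop once no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Update step: move each centroid to the mean of its points
    for (j in seq_len(k)) {
      members <- points[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(centroids = centroids, cluster = assignment)
}

# Two well-separated synthetic blobs, for illustration only
set.seed(42)
points <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
result <- simple_kmeans(points, k = 2)
print(table(result$cluster))
```

The random choice of starting centroids is one of the shortcomings worth exploring: on less cleanly separated data, different starts can lead to different final clusters.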

K-means Clustering