Question. Everything.

Area Determination of Property Images

April 06, 2013

In a class I am taking, we were given the following problem:

Calculate (rougly) the area of land shaded green, and the area of property in the green land (the grey buildings).

Here is a typical image:

'Sample Image'

My Attempt

Thankfully, reading .PNG files into R is easy

library(png)
library(geometry)

files <- dir("rawimages", full.names=TRUE)
j <- 2
x <- readPNG(files[j])
dim(x)
## [1] 512 517   3
x[1:5,1,]
##        [,1]   [,2]   [,3]
## [1,] 0.9725 0.9804 0.9922
## [2,] 0.9725 0.9804 0.9922
## [3,] 0.9725 0.9804 0.9922
## [4,] 0.9725 0.9804 0.9922
## [5,] 0.9725 0.9804 0.9922

The output of the readPNG() is a three dimensional array (x-coordinate, y-coordinate, RGB values), which is fairly unwieldy. Thankfully, it is easy to coerce matrices in R.

nrows <- dim(x)[1]
ncols <- dim(x)[2]
total <- nrows * ncols

dim(x) <- c(total, 3)
Note! R stores it's matrices as vectors, by column.

If you’ve played around with matrices, you’ll have noticed that for a 2-dimensional matrix, x[3] will return the element x[3,1]. Thus, the first 5 elements of the new coerced x is equivalent to the above x[1:5,1,] command (the first 5 elements in column 1).

head(x)
##        [,1]   [,2]   [,3]
## [1,] 0.9725 0.9804 0.9922
## [2,] 0.9725 0.9804 0.9922
## [3,] 0.9725 0.9804 0.9922
## [4,] 0.9725 0.9804 0.9922
## [5,] 0.9725 0.9804 0.9922
## [6,] 0.9725 0.9804 0.9922

To make sure plotting gives the same orientation, we need to flip things around, since graphs start the origin at the bottom left, while the data structure of the PNG will be starting from the top left.

yvals <- rep(nrows:1, ncols)
xvals <- rep(1:ncols, each=nrows)

head(cbind(xvals, yvals))
##      xvals yvals
## [1,]     1   512
## [2,]     1   511
## [3,]     1   510
## [4,]     1   509
## [5,]     1   508
## [6,]     1   507

Now, we are ready to do some analysis. At first glance, we want to be able to retrieve the red line (or the green area), and the grey area inside that region. It seemed natural to start with k-means clustering, to get a better picture of the different color groups.

km.0 <- kmeans(x, 5, 10)
km.0$centers
##     [,1]   [,2]   [,3]
## 1 0.8980 0.8980 0.8981
## 2 0.0000 0.1789 0.0000
## 3 0.9985 0.9989 0.9996
## 4 1.0000 0.0000 0.0000
## 5 0.8392 0.8392 0.8392
these <- km.0$cluster==4

Recall that the second column is the green channel (so we would expect the green points to have negligable values in the 1st and 3rd, but high in the second). However, the k-means clustering doesn’t seem to pick them up (weird). We were able to pick up the red and the grey though (red shown below).

plot(xvals, yvals,
    col=apply(x, 1, function(a) rgb(a[1], a[2], a[3])),
    pch=18, cex=0.2)
points(xvals[these], yvals[these], cex=1)

plot of chunk unnamed-chunk-6

Unfortunately, there are images that have multiple areas denoted by a red border. Since we only care about those shaded green, it is not enough to use the red points. We overstep this problem by restricting our attention to the neighbours of the green area. To get the green values, we shall hazard an educated guess as to the form of the RGB.

green <- x[,2] > 0.9 &
         x[,1] < 0.1 &
         x[,3] < 0.1

Let’s check to see if we’re right

plot(xvals, yvals,
    col=apply(x, 1, function(a) rgb(a[1], a[2], a[3])),
    pch=18, cex=0.2)
points(xvals[green], yvals[green], cex=1)

plot of chunk unnamed-chunk-8

Nice. Okay, so now we can create a rectangle around this area (with some slack, to ensure the red border lies inside the area) that we are interested in.

slack <- 5
xmin <- min(xvals[green])-slack
xmax <- max(xvals[green])+slack
ymin <- min(yvals[green])-slack
ymax <- max(yvals[green])+slack

x[xvals <= xmin | xvals >= xmax | yvals <= ymin | yvals >= ymax,] <- c(1,1,1)

The restricted diagram looks like so:

plot(xvals, yvals,
    col=apply(x, 1, function(a) rgb(a[1], a[2], a[3])),
    pch=18, cex=0.2)

plot of chunk unnamed-chunk-10

Great! Now we can ignore the green points, and focus our attention on the greys and reds. In the above diagram, we come across another potential problem: there are grey areas in the image that aren’t inside the red boundary. Our algorithm handles that nicely though.

grey <- x[,1] >0.80 & x[,1] < 0.95 &
        x[,2] >0.80 & x[,2] < 0.95 &
        x[,3] >0.80 & x[,3] < 0.95
greys <- which(grey)

red <- x[,1] > 0.95 &
       x[,2] < 0.05 &
       x[,3] < 0.05
reds <- which(red)

The Algorithm

The crux of the algorith is the use of the areapl() function in the splancs library. This function takes a sequence of points that join up to make a polygon, and calculates the area. Importantly, the points must be sequential around the polygon. Hence, the variable red is insufficient. The algorithm is as follows: for each row,

  • retrieve the first and last red point
  • retrieve the first and last grey point within the red points
min.red <- c()
max.red <- c()
min.grey <- c()
max.grey <- c()
groups <- split(1:total, yvals)
for(i in 1:nrows) {
  row <- groups[[i]]
  row.reds <- row[row %in% reds]
  if (length(row.reds) < 2) next
  min.red <- c(min.red, row.reds[1])
  max.red <- c(max.red, row.reds[length(row.reds)])
  row.greys <- row[row %in% greys]
  # greys must be within the range of red
  row.greys <- row.greys[row.greys > row.reds[1] & row.greys < row.reds[length(row.reds)]]
  if (length(row.greys) < 2) next
  min.grey <- c(min.grey, row.greys[1])
  max.grey <- c(max.grey, row.greys[length(row.greys)])
}

red.pts <- cbind(c(xvals[min.red], rev(xvals[max.red])),
                 c(yvals[min.red], rev(yvals[max.red])))
grey.pts <- cbind(c(xvals[min.grey], rev(xvals[max.grey])),
                  c(yvals[min.grey], rev(yvals[max.grey])))

The polygons we retrieve are shown below.

plot(red.pts, col=1, type='b')
points(grey.pts, col=2, type='b')

plot of chunk unnamed-chunk-13

We can now calculate a rough area.

library(splancs)
c(areapl(red.pts), areapl(grey.pts))
## [1] 22306 18594

We can now create a function with takes a .PNG file path (and a jump parameter, which determines the number of rows to skip), and returns a plot of the original, the polygons, and the area of the polygons:

area <- function(file, jump=1){
  x <- readPNG(file)

  nrows <- dim(x)[1]
  ncols <- dim(x)[2]
  total <- nrows * ncols

  yvals <- rep(nrows:1, ncols)
  xvals <- rep(1:ncols, each=nrows)
  dim(x) <- c(total, 3)

  plot(xvals, yvals,
       col=apply(x, 1, function(a) rgb(a[1], a[2], a[3])),
       pch=18, cex=0.2)

  green <- x[,2] > 0.9 &
           x[,1] < 0.1 &
           x[,3] < 0.1

  slack <- 5
  xmin <- min(xvals[green])-slack
  xmax <- max(xvals[green])+slack
  ymin <- min(yvals[green])-slack
  ymax <- max(yvals[green])+slack

  x[xvals <= xmin | xvals >= xmax | yvals <= ymin | yvals >= ymax,] <- c(1,1,1)

  grey <- x[,1] >0.80 & x[,1] < 0.95 &
          x[,2] >0.80 & x[,2] < 0.95 &
          x[,3] >0.80 & x[,3] < 0.95
  greys <- which(grey)

  red <- x[,1] > 0.95 &
         x[,2] < 0.05 &
         x[,3] < 0.05
  reds <- which(red)

  min.red <- c()
  max.red <- c()
  min.grey <- c()
  max.grey <- c()
  groups <- split(1:total, yvals)
  n <- floor(nrows/jump)
  for (i in 1:n) {
    row <- groups[[i*jump]]
    row.reds <- row[row %in% reds]
    if (length(row.reds) < 2) next
    min.red <- c(min.red, row.reds[1])
    max.red <- c(max.red, row.reds[length(row.reds)])
    row.greys <- row[row %in% greys]
    # greys must be within the range of red
    row.greys <- row.greys[row.greys > row.reds[1] & row.greys < row.reds[length(row.reds)]]
    if (length(row.greys) < 2) next
    min.grey <- c(min.grey, row.greys[1])
    max.grey <- c(max.grey, row.greys[length(row.greys)])
  }

  red.pts <- cbind(c(xvals[min.red], rev(xvals[max.red])),
                   c(yvals[min.red], rev(yvals[max.red])))
  grey.pts <- cbind(c(xvals[min.grey], rev(xvals[max.grey])),
                    c(yvals[min.grey], rev(yvals[max.grey])))

  points(red.pts, col=2, type='b', cex=0.5)
  points(grey.pts, col=1, type='b', cex=0.5)
  return(c(areapl(red.pts), areapl(grey.pts)))
}
files <- dir("rawimages", full.names=TRUE)
for (i in c(1,3,4)) {
  print(area(files[i]))
}

plot of chunk unnamed-chunk-16

## [1] 32394  1438

plot of chunk unnamed-chunk-16

## [1] 20930  4111

plot of chunk unnamed-chunk-16

## [1] 13334  1140

A Little Extra

If you take a look at the first image, in the above loop, we see that there is anomaly in the red border. Upon further inspection, it turns out that there are times when there will be no red points on the left border, but two red points on the right border (the algorithm ignores situations when there is only one red in a row).

One possible solution is to increase the jump:

area(files[1], jump=5)

plot of chunk unnamed-chunk-17

## [1] 32412  1848

Alternatively, we can iterate through the min.red and max.red values, and see if there are any anomalies. But this is equivalent to putting more significant figures on an already crude estimate - we’ll leave it here for today.