Title: | Model-Based Detection of Disease Clusters |
---|---|
Description: | Model-based methods for the detection of disease clusters using GLMs, GLMMs and zero-inflated models. These methods are described in 'V. Gómez-Rubio et al.' (2019) <doi:10.18637/jss.v090.i14> and 'V. Gómez-Rubio et al.' (2018) <doi:10.1007/978-3-030-01584-8_1>. |
Authors: | Virgilio Gomez-Rubio, Paula Esther Moraga Serrano, Barry Rowlingson |
Maintainer: | Virgilio Gomez-Rubio <[email protected]> |
License: | GPL-3 |
Version: | 1.0-1 |
Built: | 2025-02-24 04:19:56 UTC |
Source: | https://github.com/cran/DClusterm |
This data set contains the number of incident brain cancer cases in the 32 counties of New Mexico, USA, and each year of the period 1973-1991, and the location of Los Alamos National Laboratory. In addition, the data set also includes for each county and year information about the expected cases, the Standardized Morbidity Ratio (SMR), the FIPS, ...
File brainNM_clusters contains the results of running DetectClustersModel on a null model ('nm.m0') and another one with covariates ('nm.m1'). The results are in 'nm.cl0' and 'nm.cl1', respectively.
data(brainNM)
data(brainNM)
brainst: A STFDF object containing the following information for each county and year:
Observed | Number of observed brain cancer cases |
Expected | Number of expected brain cancer cases. Standardisation is done taking the whole time-period and not year-ly to keep any temporal trend. |
SMR | Standardized Morbidity Ratio (observed/expected) |
Year | Year |
FIPS | FIPS Code |
ID | ID (from 1 to 32) |
IDLANL | Inverse distance to Los Alamos National Laboratory |
IDLANLre | Re-scaled Inverse distance to Los Alamos National Laboratory (i.e., IDLANL/mean(IDLANL)) |
losalamos: A SpatialPoints object which contains the location (in long/lat) of Los Alamos National Laboratory obtained from the Wikipedia: -106.298333, 35.881667.
Data have been downlodad from the SatScan website. Boundaries have been obtained from the U.S. Census Bureau. Cibola and Valencia counties has been merged together.
SatScan (c). http://www.spatstat.org
Kulldorff, M., W. F. Athas, E. J. Feurer, B. A. Miller, and C. R. Key (1998). Evaluating cluster alarms: a space-time scan statistic and brain cancer in los alamos, new mexico. American Journal of Public Health 88, 1377-1380.
This function orders the regions according to the distance to a given center and selects the regions with distance to the center less than sqrt(rr). Then it calls glmAndZIP.iscluster() to obtain the cluster with the maximum log-likelihood ratio or minimum DIC of all the clusters with the same center and start and end dates, and where the maximum fraction of the total population inside the cluster is less than fractpop.
CalcStatClusterGivenCenter( point, stfdf, rr, minDateCluster, maxDateCluster, fractpop, model0, ClusterSizeContribution )
CalcStatClusterGivenCenter( point, stfdf, rr, minDateCluster, maxDateCluster, fractpop, model0, ClusterSizeContribution )
point |
vector with the coordinates of the center of the cluster. |
stfdf |
spatio-temporal class object containing the data. |
rr |
square of the maximum radius of the cluster. |
minDateCluster |
start date of the cluster. |
maxDateCluster |
end date of the cluster. |
fractpop |
maximum fraction of the total population inside the cluster. |
model0 |
Initial model (including covariates). |
ClusterSizeContribution |
Variable used to check the fraction of the
population at risk in the cluster
This can be "glm" for generalized linear models (glm stats),
"glmer" for generalized linear mixed model (glmer lme4),
"zeroinfl" for zero-inflated models (zeroinfl), or
"inla" for generalized linear, generalized linear mixed or zero-inflated models fitted with |
vector containing the coordinates of the center, the size, the start and end dates, the log-likelihood ratio or DIC, the p-value and the risk of the cluster with the maximum log-likelihood ratio or minimum DIC.
This function explores all possible clusters changing their center and start and end dates. For each center and time periods, it obtains the cluster with the maximum log-likelihood ratio or minimum DIC so that the maximum fraction of the total population inside the cluster is less than fractpop, and the maximum distance to the center is less than radius.
CalcStatsAllClusters( thegrid, CalcStatClusterGivenCenter, stfdf, rr, typeCluster, sortDates, idMinDateCluster, idMaxDateCluster, fractpop, model0, ClusterSizeContribution, numCPUS )
CalcStatsAllClusters( thegrid, CalcStatClusterGivenCenter, stfdf, rr, typeCluster, sortDates, idMinDateCluster, idMaxDateCluster, fractpop, model0, ClusterSizeContribution, numCPUS )
thegrid |
grid with the coordinates of the centers of the clusters explored. |
CalcStatClusterGivenCenter |
function to obtain the cluster with the maximum log-likelihood ratio of all the clusters with the same center and start and end dates |
stfdf |
spatio-temporal class object containing the data. |
rr |
square of the maximum radius of the cluster. |
typeCluster |
type of clusters to be detected. "ST" for spatio-temporal clusters or "S" spatial clusters. |
sortDates |
sorted vector of the times where disease cases occurred. |
idMinDateCluster |
index of the closest date to the start date of the cluster in the vector sortDates |
idMaxDateCluster |
index of the closest date to the end date of the cluster in the vector sortDates |
fractpop |
maximum fraction of the total population inside the cluster. |
model0 |
Initial model (including covariates).
This can be "glm" for generalized linear models (glm stats),
"glmer" for generalized linear mixed model (glmer lme4),
"zeroinfl" for zero-inflated models (zeroinfl), or
"inla" for generalized linear, generalized linear mixed or zero-inflated models fitted with |
ClusterSizeContribution |
Variable used to check the fraction of the population at risk in the cluster |
numCPUS |
Number of cpus used when using parallel to run the method. If parallel is not used numCPUS is NULL. |
data frame with information of the clusters with the maximum log-likelihood ratio or minimum DIC for each center and start and end dates. It contains the coordinates of the center, the size, the start and end dates, the log-likelihood ratio or DIC, the p-value and the risk of each of the clusters.
This function will be used to calculate the P(coeficient variable cluster <=0)
computeprob(func, k)
computeprob(func, k)
func |
is the inla marginals of the model parameter |
k |
is the cutoff |
probability model coefficient <=k
If the argument thegrid of DetectClustersModel() is null, this function is used to create a rectangular grid with a given step. If step is NULL the step used is equal to 0.2*radius. The grid contains the coordinates of the centers of the clusters explored.
CreateGridDClusterm(stfdf, radius, step)
CreateGridDClusterm(stfdf, radius, step)
stfdf |
spatio-temporal class object containing the data. |
radius |
maximum radius of the clusters. |
step |
step of the grid. |
two columns matrix where each row represents a point of the grid.
Searches all possible clusters with start and end dates within minDateUser
and maxDateUser, so that the maximum fraction of the total population inside
the cluster is less than fractpop, and the maximum distance to the center is
less than radius.
The search can be done for spatial or spatio-temporal clusters.
The significance of the clusters is obtained with a Monte Carlo procedure
or based on the chi-square distribution (glm, glmer or zeroinfl models)
or DIC (inla
models).
DetectClustersModel( stfdf, thegrid = NULL, radius = Inf, step = NULL, fractpop, alpha, typeCluster = "S", minDateUser = NULL, maxDateUser = NULL, R = NULL, model0, ClusterSizeContribution = "Population" )
DetectClustersModel( stfdf, thegrid = NULL, radius = Inf, step = NULL, fractpop, alpha, typeCluster = "S", minDateUser = NULL, maxDateUser = NULL, R = NULL, model0, ClusterSizeContribution = "Population" )
stfdf |
object containing the data. If data is spatial, stfdf is a SpatialPolygonsDataFrame object from sp. If data is spatio-temporal, stfdf is a STFDF object from spacetime. The data contain a SpatialPolygons object with the coordinates, and if applicable, a time object holding time information, an endTime vector of class POSIXct holding end points of time intervals. It also contain a data.frame with the Observed, Expected and potential covariates in each location and time (if applicable). Note that the function DetectClustersModel does not use the endTime vector. We can define endTime, for example, as the vector of class POSIXct which contains the same dates as the ones contained in the time object. |
thegrid |
two-columns matrix containing the points of the grid to be used. If it is null, a rectangular grid is built. |
radius |
maximum radius of the clusters. |
step |
step of the thegrid built. |
fractpop |
maximum fraction of the total population inside the cluster. |
alpha |
significance level used to determine the existence of clusters. |
typeCluster |
type of clusters to be detected. "ST" for spatio-temporal or "S" spatial clusters. |
minDateUser |
start date of the clusters. |
maxDateUser |
end date of the clusters. |
R |
If the cluster's significance is calculated based on the chi-square distribution or DIC, R is NULL. If the cluster's significance is calculated using a Monte Carlo procedure, R represents the number replicates under the null hypothesis. |
model0 |
Initial model (including covariates). |
ClusterSizeContribution |
Indicates the variable to be used as the population at risk in the cluster. This is the variable name to be used by 'fractpop' when checking the fraction of the population inside the cluster. The default column name is 'Population'.
This can be "glm" for generalized linear models (glm stats),
"glmer" for generalized linear mixed model (glmer lme4),
"zeroinfl" for zero-inflated models (zeroinfl), or
"inla" for generalized linear, generalized linear mixed or zero-inflated models fitted with |
data frame with information of the detected clusters ordered by its log-likelihood ratio value or DIC. Each row represents the information of one of the clusters. It contains the coordinates of the center, the size, the start and end dates, the log-likelihood ratio or DIC, the p-value, the risk of the cluster, and a boolean indicating if it is a cluster (TRUE in all cases). It also returns alpha_bonferroni which is the level of significance adjusted for multiple testing using Bonferroni correction. Thus, rows that should be considered clusters are the ones with p-value less than alpha_bonferroni.
Bilancia M, Demarinis G (2014) Bayesian scanning of spatial disease rates with the Integrated Nested Laplace Approximation (INLA). Statistical Methods & Applications 23(1): 71 - 94. http://dx.doi.org/10.1007/s10260-013-0241-8
Jung I (2009) A generalized linear models approach to spatial scan statistics for covariate adjustment. Statistics in Medicine 28(7): 1131 - 1143. Gómez-Rubio V, Molitor J, Moraga P (2018) Fast Bayesian Classification for Disease Mapping and the Detection of Disease Clusters. In: Cameletti M., Finazzi F. (eds) Quantitative Methods in Environmental and Climate Research. Springer, Cham
Gómez-Rubio V, Moraga P, Molitor J, Rowlingson B (2019). "DClusterm: Model-Based Detection of Disease Clusters." _Journal of Statistical Software_, *90*(14), 1-26. doi: 10.18637/jss.v090.i14 (URL: https://doi.org/10.18637/jss.v090.i14).
library("DClusterm") data("NY8") NY8$Observed <- round(NY8$Cases) NY8$Expected <- NY8$POP8 * sum(NY8$Observed) / sum(NY8$POP8) NY8$x <- coordinates(NY8)[, 1] NY8$y <- coordinates(NY8)[, 2] #Model to account for covariates ny.m1 <- glm(Observed ~ offset(log(Expected)) + PCTOWNHOME + PCTAGE65P + PEXPOSURE, family = "poisson", data = NY8) #Indices of areas that are possible cluster centres idxcl <- c(120, 12, 89, 139, 146) #Cluster detection adjusting for covariates ny.cl1 <- DetectClustersModel(NY8, thegrid = as.data.frame(NY8)[idxcl, c("x", "y")], fractpop = 0.15, alpha = 0.05, typeCluster = "S", R = NULL, model0 = ny.m1, ClusterSizeContribution = "POP8") #Display results ny.cl1
library("DClusterm") data("NY8") NY8$Observed <- round(NY8$Cases) NY8$Expected <- NY8$POP8 * sum(NY8$Observed) / sum(NY8$POP8) NY8$x <- coordinates(NY8)[, 1] NY8$y <- coordinates(NY8)[, 2] #Model to account for covariates ny.m1 <- glm(Observed ~ offset(log(Expected)) + PCTOWNHOME + PCTAGE65P + PEXPOSURE, family = "poisson", data = NY8) #Indices of areas that are possible cluster centres idxcl <- c(120, 12, 89, 139, 146) #Cluster detection adjusting for covariates ny.cl1 <- DetectClustersModel(NY8, thegrid = as.data.frame(NY8)[idxcl, c("x", "y")], fractpop = 0.15, alpha = 0.05, typeCluster = "S", R = NULL, model0 = ny.m1, ClusterSizeContribution = "POP8") #Display results ny.cl1
This function returns a categorical vector that identifies to which cluster a given areas belongs. It is the empty string for areas not in a cluster.
get.allknclusters(spdf, knresults)
get.allknclusters(spdf, knresults)
spdf |
Spatial object with data used in the detection of clusters. |
knresults |
Table with the clusters detected. |
A categorical vector with value the cluster to which area belongs. It is the empty string for regions not in a cluster.
This function is similar to get.knclusters but it also allows for spatio-temporal clusters.
get.stclusters(stfdf, results)
get.stclusters(stfdf, results)
stfdf |
A sp or spacetime object with the information about the data. |
results |
Results from a call to DetectClustersModel |
A list with as many elements as clusters in 'results'
library("DClusterm") library("RColorBrewer") data("brainNM") data("brainNM_clusters") stcl <- get.stclusters(brainst, nm.cl0) #Get first cluster brainst$CLUSTER <- "" brainst$CLUSTER[ stcl[[1]] ] <- "CLUSTER" #Plot cluster stplot(brainst[, , "CLUSTER"], at = c(0, 0.5, 1.5), col = "#4D4D4D", col.regions = c("white", "gray"))
library("DClusterm") library("RColorBrewer") data("brainNM") data("brainNM_clusters") stcl <- get.stclusters(brainst, nm.cl0) #Get first cluster brainst$CLUSTER <- "" brainst$CLUSTER[ stcl[[1]] ] <- "CLUSTER" #Plot cluster stplot(brainst[, , "CLUSTER"], at = c(0, 0.5, 1.5), col = "#4D4D4D", col.regions = c("white", "gray"))
This function constructs all the clusters with start date equal to
minDateCluster, end date equal to maxDateCluster, and with center specified
by the first element of idxorder, so that the maximum fraction of the total
population inside the cluster is less than fractpop, and the maximum
distance to the center is less than radius.
For each one of these clusters, the log-likelihood ratio test statistic
for comparing the alternative model with the cluster versus the null model
of no clusters (if model is glm, glmer or zeroinfl),
or the DIC (if model is inla
) is calculated.
The cluster with maximum value of the log-likelihood ratio or
minimum DIC is returned.
glmAndZIP.iscluster( stfdf, idxorder, minDateCluster, maxDateCluster, fractpop, model0, ClusterSizeContribution )
glmAndZIP.iscluster( stfdf, idxorder, minDateCluster, maxDateCluster, fractpop, model0, ClusterSizeContribution )
stfdf |
a spatio-temporal class object containing the data. |
idxorder |
a permutation of the regions according to their distance to the current center. |
minDateCluster |
start date of the cluster. |
maxDateCluster |
end date of the cluster. |
fractpop |
maximum fraction of the total population inside the cluster. |
model0 |
Initial model (including covariates). |
ClusterSizeContribution |
Variable used to check the fraction of the
population at risk in the cluster
This can be "glm" for generalized linear models (glm stats),
"glmer" for generalized linear mixed model (glmer lme4),
"zeroinfl" for zero-inflated models (zeroinfl), or
"inla" for generalized linear, generalized linear mixed or zero-inflated models fitted with |
vector containing the size, the start and end dates, the log-likelihood ratio or DIC, the p-value and the risk of the cluster with the maximum log-likelihood ratio or minimum DIC.
This function constructs a data frame with number of columns equal to the number of clusters. Each column is a binary representation of one of the clusters. The position i of the column is equal to 1 if the polygon i is in the cluster or 0 if it is not in the cluster.
knbinary(datamap, knresults)
knbinary(datamap, knresults)
datamap |
data of the SpatialPolygonsDataFrame with the polygons of the map. |
knresults |
data frame with information of the detected clusters. Each row represents the information of one of the clusters. It contains the coordinates of the center, the size, the start and end dates, the log-likelihood ratio, a boolean indicating if it is a cluster (TRUE in all cases), and the p-value of the cluster. |
data frame where the columns represent the clusters in binary format. The position i of the column is equal to 1 if the polygon i is in the cluster or 0 if it is not in the cluster.
library("DClusterm") library("RColorBrewer") data("NY8") data("NY8_clusters") stcl <- knbinary(NY8, ny.cl1) #Get first cluster NY8$CLUSTER <- stcl[, 1] #Plot cluster spplot(NY8, "CLUSTER", at = c(0, 0.5, 1.5), col = "#4D4D4D", col.regions = c("white", "gray"))
library("DClusterm") library("RColorBrewer") data("NY8") data("NY8_clusters") stcl <- knbinary(NY8, ny.cl1) #Get first cluster NY8$CLUSTER <- stcl[, 1] #Plot cluster spplot(NY8, "CLUSTER", at = c(0, 0.5, 1.5), col = "#4D4D4D", col.regions = c("white", "gray"))
Given a data frame with clusters that do not overlap this function merges the clusters and construct a factor. The levels of the factor are "NCL" if the polygon of the map is not in any cluster, and "CL" if the polygon i is in cluster i.
mergeknclusters(datamap, knresults, indClustersPlot)
mergeknclusters(datamap, knresults, indClustersPlot)
datamap |
data of the SpatialPolygonsDataFrame with the polygons of the map. |
knresults |
Data frame with information of the detected clusters. Each row represents the information of one of the clusters. It contains the coordinates of the center, the size, the start and end dates, the log-likelihood ratio, a boolean indicating if it is a cluster (TRUE in all cases), and the p-value of the cluster. |
indClustersPlot |
rows of knresults that denote the clusters to be plotted. |
factor with levels that represent the clusters.
library("DClusterm") library("RColorBrewer") data("NY8") data("NY8_clusters") stcl <- mergeknclusters(NY8, ny.cl1, 1:2) #Get first cluster NY8$CLUSTER <- stcl #Plot cluster spplot(NY8, "CLUSTER", col.regions = c("white", "lightgray", "gray"))
library("DClusterm") library("RColorBrewer") data("NY8") data("NY8_clusters") stcl <- mergeknclusters(NY8, ny.cl1, 1:2) #Get first cluster NY8$CLUSTER <- stcl #Plot cluster spplot(NY8, "CLUSTER", col.regions = c("white", "lightgray", "gray"))
This data set provides the number of incident leukemia cases per census tract in an eight-county region of upstate New York in the period 1978-1982. In addition, the data set also includes information about the location of the census tracts, the population in 1980, the inverse of the distance to the nearest Trichloroethene (TCE) site, the percentage of people aged 65 or more, and the percentage of people who own their home.
The dataset also provides the locations of the TCE sites.
File NY8_clusters contains the results of running DetectClustersModel on a null model ('ny.m0') and another one with covariates ('ny.m1'). The results are in 'ny.cl0' and 'ny.cl1', respectively.
data(NY8)
data(NY8)
A SpatialPolygonsDataFrame with 281 polygons representing the census tracts, and the following information about each census tract:
AREANAME | Name |
AREAKEY | Identifier |
X | x coordinate |
Y | y coordinate |
POP8 | Population in 1980 |
TRACTCAS | Number of leukemia cases rounded to 2 decimals |
PROPCAS | Ratio of the number of leukemia cases to the population in 1980 |
PCTOWNHOME | Proportion of people who own their home |
PCTAGE65P | Proportion of people aged 65 or more |
Z | |
AVGIDIST | |
PEXPOSURE | Inverse of the distance to the nearest TCE site |
Cases | Number of leukemia cases |
Xm | x coordinate (in meters) |
Ym | y coordinates(in meters) |
Xshift | Shifted Xm coordinate |
Yshift | Shifted Ym coordinate |
Waller and Gotway (2004) and Bivand et al. (2008)
Bivand, R.S., E. J. Pebesma and V. Gómez-Rubio (2008). Applied Spatial Data Analysis with R. Springer.
Waller, L., B. Turnbull, L. Clark, and P. Nasca (1992). Chronic disease surveillance and testing of clustering of disease and exposure: application to leukemia incidence in tce-contamined dumpsites in upstate New York. Environmetrics 3, 281-300
Waller, L. A. and C. A. Gotway (2004). Applied Spatial Statistics for Public Health Data. John Wiley & Sons, Hoboken, New Jersey. http://web1.sph.emory.edu/users/lwaller/WGindex.htm
Function DetectClustersModel() detects duplicated clusters. This function reduces the number of clusters by removing the overlapping clusters.
SelectStatsAllClustersNoOverlap(stfdf, statsAllClusters)
SelectStatsAllClustersNoOverlap(stfdf, statsAllClusters)
stfdf |
spatio-temporal class object containing the data. |
statsAllClusters |
data frame with information of the detected clusters obtained with DetectClustersModel(). |
data frame with the same information than statsAllClusters but only for clusters that do not overlap.
library("DClusterm") data("brainNM") data("brainNM_clusters") SelectStatsAllClustersNoOverlap(brainst, nm.cl1)
library("DClusterm") data("brainNM") data("brainNM_clusters") SelectStatsAllClustersNoOverlap(brainst, nm.cl1)
This function constructs a variable that indicates the locations and times that pertain to a cluster. Each position of the variable is equal to 1 if it corresponds to a location and time inside the cluster, and 0 otherwise. This is one of the explanatory variables used in the glmAndZIP.iscluster function to model the observed cases.
SetVbleCluster(stfdf, idTime, idSpace)
SetVbleCluster(stfdf, idTime, idSpace)
stfdf |
spatio-temporal class object containing the data. |
idTime |
vector with the indexes of the stfdf object corresponding to the time inside the cluster. |
idSpace |
vector with the indexes of the stfdf object corresponding to the locations inside the cluster. |
vector with 1's or 0's that indicates the locations and times that pertain to a cluster.
This function slims the number of clusters down. The spatial scan statistic is known to detect duplicated clusters. This function aims to reduce the number of clusters by removing duplicated and overlapping clusters.
slimknclusters(d, knresults, minsize = 1)
slimknclusters(d, knresults, minsize = 1)
d |
Data.frame with data used in the detection of clusters. |
knresults |
Object returned by function opgam() with the clusters detected. |
minsize |
Minimum size of cluster (default to 1). |
A subset of knresults with non-overlaping clusters of at least minsize size.
data("brainNM_clusters") nm.cl1.s <- slimknclusters(brainst, nm.cl1) nm.cl1.s
data("brainNM_clusters") nm.cl1.s <- slimknclusters(brainst, nm.cl1) nm.cl1.s