Maxima Function
subsample (data_matrix, logical_expression)
subsample(data_matrix,logical_expression,col_num1,col_num2,...)
This is a sort of variation of the Maxima submatrix
function. The first argument is the name of the data matrix, the second is a quoted logical expression and optional additional arguments are the numbers of the columns to be taken. Its behaviour is better understood with examples,
(%i1) load (descriptive)$ (%i2) load (numericalio)$ (%i3) s2 : read_matrix (file_search ("wind.data"))$ (%i4) subsample (s2, '(%c[1] > 18)); [ 19.38 15.37 15.12 23.09 25.25 ] [ ] [ 18.29 18.66 19.08 26.08 27.63 ] (%o4) [ ] [ 20.25 21.46 19.95 27.71 23.38 ] [ ] [ 18.79 18.96 14.46 26.38 21.84 ]
These are multivariate records in which the wind speeds in the first meteorological station were greater than 18. See that in the quoted logical expression the i-th component is refered to as %c[i]
. Symbol %c[i]
is used inside function subsample
, therefore when used as a categorical variable, Maxima gets confused. In the following example, we request only the first, second and fifth components of those records with wind speeds greater or equal than 16 in station number 1 and lesser than 25 knots in station number 4,
(%i1) load (descriptive)$ (%i2) load (numericalio)$ (%i3) s2 : read_matrix (file_search ("wind.data"))$ (%i4) subsample (s2, '(%c[1] >= 16 and %c[4] < 25), 1, 2, 5); [ 19.38 15.37 25.25 ] [ ] [ 17.33 14.67 19.58 ] (%o4) [ ] [ 16.92 13.21 21.21 ] [ ] [ 17.25 18.46 23.87 ]
Here is an example with the categorical variables of biomed.data
. We want the records corresponding to those patients in group B
who are older than 38 years,
(%i1) load (descriptive)$ (%i2) load (numericalio)$ (%i3) s3 : read_matrix (file_search ("biomed.data"))$ (%i4) subsample (s3, '(%c[1] = B and %c[2] > 38)); [ B 39 28.0 102.3 17.1 146 ] [ ] [ B 39 21.0 92.4 10.3 197 ] [ ] [ B 39 23.0 111.5 10.0 133 ] [ ] [ B 39 26.0 92.6 12.3 196 ] (%o4) [ ] [ B 39 25.0 98.7 10.0 174 ] [ ] [ B 39 21.0 93.2 5.9 181 ] [ ] [ B 39 18.0 95.0 11.3 66 ] [ ] [ B 39 39.0 88.5 7.6 168 ]
Probably, the statistical analysis will involve only the blood measures,
(%i1) load (descriptive)$ (%i2) load (numericalio)$ (%i3) s3 : read_matrix (file_search ("biomed.data"))$ (%i4) subsample (s3, '(%c[1] = B and %c[2] > 38), 3, 4, 5, 6); [ 28.0 102.3 17.1 146 ] [ ] [ 21.0 92.4 10.3 197 ] [ ] [ 23.0 111.5 10.0 133 ] [ ] [ 26.0 92.6 12.3 196 ] (%o4) [ ] [ 25.0 98.7 10.0 174 ] [ ] [ 21.0 93.2 5.9 181 ] [ ] [ 18.0 95.0 11.3 66 ] [ ] [ 39.0 88.5 7.6 168 ]
This is the multivariate mean of s3
,
(%i1) load (descriptive)$ (%i2) load (numericalio)$ (%i3) s3 : read_matrix (file_search ("biomed.data"))$ (%i4) mean (s3); 65 B + 35 A 317 6 NA + 8145.0 (%o4) [-----------, ---, 87.178, -------------, 18.123, 100 10 100 3 NA + 19587 ------------] 100
Here, the first component is meaningless, since A
and B
are categorical, the second component is the mean age of individuals in rational form, and the fourth and last values exhibit some strange behaviour. This is because symbol NA
is used here to indicate non available data, and the two means are of course nonsense. A possible solution would be to take out from the matrix those rows with NA
symbols, although this deserves some loss of information,