casacore
StatisticsAlgorithm.h
Go to the documentation of this file.
1 //# Copyright (C) 2000,2001
2 //# Associated Universities, Inc. Washington DC, USA.
3 //#
4 //# This library is free software; you can redistribute it and/or modify it
5 //# under the terms of the GNU Library General Public License as published by
6 //# the Free Software Foundation; either version 2 of the License, or (at your
7 //# option) any later version.
8 //#
9 //# This library is distributed in the hope that it will be useful, but WITHOUT
10 //# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
11 //# FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public
12 //# License for more details.
13 //#
14 //# You should have received a copy of the GNU Library General Public License
15 //# along with this library; if not, write to the Free Software Foundation,
16 //# Inc., 675 Massachusetts Ave, Cambridge, MA 02139, USA.
17 //#
18 //# Correspondence concerning AIPS++ should be addressed as follows:
19 //# Internet email: aips2-request@nrao.edu.
20 //# Postal address: AIPS++ Project Office
21 //# National Radio Astronomy Observatory
22 //# 520 Edgemont Road
23 //# Charlottesville, VA 22903-2475 USA
24 //#
25 //# $Id: Array.h 21545 2015-01-22 19:36:35Z gervandiepen $
26 
27 #ifndef SCIMATH_STATSALGORITHM_H
28 #define SCIMATH_STATSALGORITHM_H
29 
30 #include <casacore/casa/aips.h>
31 #include <casacore/casa/Exceptions/Error.h>
32 #include <casacore/casa/Utilities/CountedPtr.h>
33 #include <casacore/scimath/Mathematics/StatsDataProvider.h>
34 #include <casacore/scimath/Mathematics/StatisticsData.h>
35 #include <casacore/scimath/Mathematics/StatisticsTypes.h>
36 
37 #include <map>
38 #include <set>
39 #include <vector>
40 
41 // because the template signature has become unwieldy
42 #define CASA_STATD template <class AccumType, class DataIterator, class MaskIterator, class WeightsIterator>
43 #define CASA_STATP AccumType, DataIterator, MaskIterator, WeightsIterator
44 
45 namespace casacore {
46 
47 // Base class of statistics algorithm class hierarchy.
48 
49 // The default implementation is such that statistics are only calculated when
50 // getStatistic() or getStatistics() is called. Until then, the iterators which
51 // point to the beginning of data sets, masks, etc. are held in memory. Thus,
52 // the caller must keep all data sets available for the statistics object until
53 // these methods are called, and of course, if the actual data values are changed
54 // between adding data and calculating statistics, the updated values are used when
55 // calculating statistics. Derived classes may override this behavior.
56 //
57 // PRECISION CONSIDERATIONS
58 // Many statistics are computed via accumulators. This can lead to precision issues,
59 // especially for large datasets. For this reason, it is highly recommended that the
60 // data type one uses as the AccumType be of higher precision, if possible, than the
61 // data type pointed to by input iterator. So for example, if one has a data set of
62 // Float values (to which the InputIterator type points to), then one should use type
63 // Double for the AccumType. In this case, the Float data values will be converted to
64 // Doubles before they are accumulated.
65 //
66 // METHODS OF PROVIDING DATA
67 // Data may be provided in one of two mutually exclusive ways. The first way is
68 // simpler, and that is to use the setData()/addData() methods. Calling setData() will
69 // clear any previous data that was added via these methods or via a data provider (see
70 // below). Calling addData() subsequently to setData() will add a data set to the set
71 // of data sets on which statistics will be calculated. In order for this to work
72 // correctly, the iterators which are passed into these methods must still be valid when
73 // statistics are calculated (although note that some derived classes allow certain
74 // statistics to be updated as data sets are added via these methods. See specific
75 // classes for details).
76 //
77 // The second way to provide data is via a data provider. This takes the form of
78 // a derived class of StatsDataProvider, in which various methods are implemented for
79 // retrieving various information about the data sets. Such an interface is necessary for
80 // data which does not easily lend itself to be provided via the setData()/addData()
81 // methods. For example, in the case of iterating through a lattice, a lattice iterator
82 // will overwrite the memory location of the previous chunk of data with the current chunk
83 // of data. Therefore, if one does not wish to load data from the entire lattice into
84 // memory (which is why LatticeIterator was designed in this way), one must the
85 // LatticeStatsDataProvider class, which the statistics framework will use to iteratate
86 // through the lattice, only keeping one chunk of the data of the lattice in memory at one
87 // time.
88 //
89 // QUANTILES
90 // A quantile is a value contained in a data set, such that, it has a zero-based
91 // index of ceil(q*n)-1 in the equivalent ordered dataset, where 0 < q < 1 specifies
92 // the fractional location within the ordered dataset and n is the total number of elements.
93 // Note that, for a dataset with an odd number of elements, the median is the same as
94 // the quantile value when q = 0.5. However, there is no such correspondance between the
95 // median in a dataset with an even number of elements, since the median in that case is
96 // given by the mean of the elements of zero-based indeces n/2-1 and n/2 in the equivalent
97 // ordered dataset. Thus, in the case of a dataset with an even number of values, the
98 // median may not even exist in the dataset, while a quantile value must exist in the
99 // dataset. Note when calculating quantile values, a dataset that does not fall in
100 // specified dataset ranges, is not included via a stride specification, is masked, or
101 // has a weight of zero is not considered a member of the dataset for the pruposes of
102 // quantile calculations.
103 
104 template <class AccumType, class DataIterator, class MaskIterator=const Bool *, class WeightsIterator=DataIterator>
106 
107 public:
108 
109  virtual ~StatisticsAlgorithm();
110 
111  // Clone this instance
112  virtual StatisticsAlgorithm<CASA_STATP>* clone() const = 0;
113 
114  // <group>
115  // Add a dataset to an existing set of datasets on which statistics are
116  // to be calculated. nr is the number of points to be considered.
117  // If <src>dataStride</src> is greater than 1, when <src>nrAccountsForStride</src>=True indicates
118  // that the stride has been taken into account in the value of <src>nr</src>. Otherwise, it has
119  // not so that the actual number of points to include is nr/dataStride if nr % dataStride == 0 or
120  // (int)(nr/dataStride) + 1 otherwise.
121  // If one calls this method after a data provider has been set, an exception will be
122  // thrown. In this case, one should call setData(), rather than addData(), to indicate
123  // that the underlying data provider should be removed.
124  // <src>dataRanges</src> provide the ranges of data to include if <src>isInclude</src> is True,
125  // or ranges of data to exclude if <src>isInclude</src> is False. If a datum equals the end point
126  // of a data range, it is considered good (included) if <src>isInclude</src> is True, and it is
127  // considered bad (excluded) if <src>isInclude</src> is False.
128 
129  virtual void addData(
130  const DataIterator& first, uInt nr, uInt dataStride=1,
131  Bool nrAccountsForStride=False
132  );
133 
134  virtual void addData(
135  const DataIterator& first, uInt nr,
136  const DataRanges& dataRanges, Bool isInclude=True, uInt dataStride=1,
137  Bool nrAccountsForStride=False
138  );
139 
140  virtual void addData(
141  const DataIterator& first, const MaskIterator& maskFirst,
142  uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False, uInt maskStride=1
143  );
144 
145  virtual void addData(
146  const DataIterator& first, const MaskIterator& maskFirst,
147  uInt nr, const DataRanges& dataRanges,
148  Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False,
149  uInt maskStride=1
150  );
151 
152  virtual void addData(
153  const DataIterator& first, const WeightsIterator& weightFirst,
154  uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False
155  );
156 
157  virtual void addData(
158  const DataIterator& first, const WeightsIterator& weightFirst,
159  uInt nr, const DataRanges& dataRanges,
160  Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False
161  );
162 
163  virtual void addData(
164  const DataIterator& first, const WeightsIterator& weightFirst,
165  const MaskIterator& maskFirst, uInt nr, uInt dataStride=1,
166  Bool nrAccountsForStride=False,
167  uInt maskStride=1
168  );
169 
170  virtual void addData(
171  const DataIterator& first, const WeightsIterator& weightFirst,
172  const MaskIterator& maskFirst, uInt nr, const DataRanges& dataRanges,
173  Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False,
174  uInt maskStride=1
175  );
176  // </group>
177 
178  // get the algorithm that this object uses for computing stats
179  virtual StatisticsData::ALGORITHM algorithm() const = 0;
180 
181  // delete any (partially) sorted array
182  void deleteSortedArray();
183 
184  virtual AccumType getMedian(
185  CountedPtr<uInt64> knownNpts=NULL, CountedPtr<AccumType> knownMin=NULL,
186  CountedPtr<AccumType> knownMax=NULL, uInt binningThreshholdSizeBytes=4096*4096,
187  Bool persistSortedArray=False, uInt64 nBins=10000
188  ) = 0;
189 
190  // The return value is the median; the quantiles are returned in the <src>quantileToValue</src> map.
191  virtual AccumType getMedianAndQuantiles(
192  std::map<Double, AccumType>& quantileToValue, const std::set<Double>& quantiles,
193  CountedPtr<uInt64> knownNpts=NULL, CountedPtr<AccumType> knownMin=NULL,
194  CountedPtr<AccumType> knownMax=NULL,
195  uInt binningThreshholdSizeBytes=4096*4096, Bool persistSortedArray=False,
196  uInt64 nBins=10000
197  ) = 0;
198 
199  // get the median of the absolute deviation about the median of the data.
200  virtual AccumType getMedianAbsDevMed(
201  CountedPtr<uInt64> knownNpts=NULL,
202  CountedPtr<AccumType> knownMin=NULL, CountedPtr<AccumType> knownMax=NULL,
203  uInt binningThreshholdSizeBytes=4096*4096, Bool persistSortedArray=False,
204  uInt64 nBins=10000
205  ) = 0;
206 
207  AccumType getQuantile(
208  Double quantile, CountedPtr<uInt64> knownNpts=NULL,
209  CountedPtr<AccumType> knownMin=NULL, CountedPtr<AccumType> knownMax=NULL,
210  uInt binningThreshholdSizeBytes=4096*4096,
211  Bool persistSortedArray=False, uInt64 nBins=10000
212  );
213 
214  // get a map of quantiles to values.
215  virtual std::map<Double, AccumType> getQuantiles(
216  const std::set<Double>& quantiles, CountedPtr<uInt64> npts=NULL,
218  uInt binningThreshholdSizeBytes=4096*4096, Bool persistSortedArray=False,
219  uInt64 nBins=10000
220  ) = 0;
221 
222  // get the value of the specified statistic
223  virtual AccumType getStatistic(StatisticsData::STATS stat);
224 
225  // certain statistics such as max and min have locations in the dataset
226  // associated with them. This method gets those locations. The first value
227  // in the returned pair is the zero-based dataset number that was set or
228  // added. The second value is the zero-based index in that dataset. A data stride
229  // of greater than one is not accounted for, so the index represents the actual
230  // location in the data set, independent of the dataStride value.
231  virtual std::pair<Int64, Int64> getStatisticIndex(StatisticsData::STATS stat) = 0;
232 
234 
235  // reset this object by clearing data.
236  virtual void reset();
237 
238  // <group>
239  // setdata() clears any current datasets or data provider and then adds the specified data set as
240  // the first dataset in the (possibly new) set of data sets for which statistics are
241  // to be calculated. See addData() for parameter meanings.
242  virtual void setData(const DataIterator& first, uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False);
243 
244  virtual void setData(
245  const DataIterator& first, uInt nr,
246  const DataRanges& dataRanges, Bool isInclude=True, uInt dataStride=1,
247  Bool nrAccountsForStride=False
248  );
249 
250  virtual void setData(
251  const DataIterator& first, const MaskIterator& maskFirst,
252  uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False,
253  uInt maskStride=1
254  );
255 
256  virtual void setData(
257  const DataIterator& first, const MaskIterator& maskFirst,
258  uInt nr, const DataRanges& dataRanges,
259  Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False,
260  uInt maskStride=1
261  );
262 
263  virtual void setData(
264  const DataIterator& first, const WeightsIterator& weightFirst,
265  uInt nr, uInt dataStride=1,
266  Bool nrAccountsForStride=False
267  );
268 
269  virtual void setData(
270  const DataIterator& first, const WeightsIterator& weightFirst,
271  uInt nr, const DataRanges& dataRanges,
272  Bool isInclude=True, uInt dataStride=1,
273  Bool nrAccountsForStride=False
274  );
275 
276  virtual void setData(
277  const DataIterator& first, const WeightsIterator& weightFirst,
278  const MaskIterator& maskFirst, uInt nr, uInt dataStride=1,
279  Bool nrAccountsForStride=False,
280  uInt maskStride=1
281  );
282 
283  virtual void setData(
284  const DataIterator& first, const WeightsIterator& weightFirst,
285  const MaskIterator& maskFirst, uInt nr, const DataRanges& dataRanges,
286  Bool isInclude=True, uInt dataStride=1, Bool nrAccountsForStride=False,
287  uInt maskStride=1
288  );
289  // </group>
290 
291  // instead of settng and adding data "by hand", set the data provider that will provide
292  // all the data sets. Calling this method will clear any other data sets that have
293  // previously been set or added.
294  virtual void setDataProvider(StatsDataProvider<CASA_STATP> *dataProvider);
295 
296  // Provide guidance to algorithms by specifying a priori which statistics the
297  // caller would like calculated.
298  virtual void setStatsToCalculate(std::set<StatisticsData::STATS>& stats);
299 
300 protected:
302 
303  // use copy semantics
306  );
307 
308  // Allows derived classes to do things after data is set or added.
309  // Default implementation does nothing.
310  virtual void _addData() {}
311 
312  const vector<Int64>& _getCounts() const { return _counts; }
313 
314  const vector<DataIterator>& _getData() const { return _data; }
315 
317  return _dataProvider;
318  }
319 
321  return _dataProvider;
322  }
323 
324  const vector<uInt>& _getDataStrides() const { return _dataStrides; }
325 
326  const std::map<uInt, Bool>& _getIsIncludeRanges() const { return _isIncludeRanges; }
327 
328  const std::map<uInt, MaskIterator> _getMasks() const { return _masks; }
329 
330  const std::map<uInt, uInt>& _getMaskStrides() const { return _maskStrides; }
331 
332  const std::map<uInt, DataRanges>& _getRanges() const { return _dataRanges; }
333 
334  virtual AccumType _getStatistic(StatisticsData::STATS stat) = 0;
335 
336  virtual StatsData<AccumType> _getStatistics() = 0;
337 
338  const std::set<StatisticsData::STATS> _getStatsToCalculate() const {
339  return _statsToCalculate;
340  }
341 
342  std::vector<AccumType>& _getSortedArray() { return _sortedArray; }
343 
344  virtual const std::set<StatisticsData::STATS>& _getUnsupportedStatistics() const {
345  return _unsupportedStats;
346  }
347 
348  const std::map<uInt, WeightsIterator>& _getWeights() const {
349  return _weights;
350  }
351 
352  /*
353  // get the zero-based indices of the specified quantiles in sorted dataset with npts
354  // number of good points. The returned map maps quantiles to indices.
355  static std::map<Double, uInt64> _indicesFromQuantiles(
356  uInt64 npts, const std::set<Double>& quantiles
357  );
358  */
359 
360  // The array can be changed by paritally sorting it up to the largest index. Return
361  // a map of index to value in the sorted array.
362  static std::map<uInt64, AccumType> _valuesFromArray(
363  vector<AccumType>& myArray, const std::set<uInt64>& indices
364  );
365 
366  void _setSortedArray(const vector<AccumType>& v) { _sortedArray = v; }
367 
368 private:
369  vector<DataIterator> _data;
370  // maps data to weights
371  std::map<uInt, WeightsIterator> _weights;
372  // maps data to masks
373  std::map<uInt, MaskIterator> _masks;
374  vector<Int64> _counts;
375  vector<uInt> _dataStrides;
376  std::map<uInt, uInt> _maskStrides;
377  std::map<uInt, Bool> _isIncludeRanges;
378  std::map<uInt, DataRanges> _dataRanges;
379  vector<AccumType> _sortedArray;
380  std::set<StatisticsData::STATS> _statsToCalculate, _unsupportedStats;
382 
383  void _throwIfDataProviderDefined() const;
384 };
385 
386 }
387 
388 #ifndef CASACORE_NO_AUTO_TEMPLATES
389 #include <casacore/scimath/Mathematics/StatisticsAlgorithm.tcc>
390 #endif
391 
392 #endif
virtual void setStatsToCalculate(std::set< StatisticsData::STATS > &stats)
Provide guidance to algorithms by specifying a priori which statistics the caller would like calculat...
std::vector< AccumType > & _getSortedArray()
StatisticsAlgorithm< CASA_STATP > & operator=(const StatisticsAlgorithm< CASA_STATP > &other)
use copy semantics
virtual void _addData()
Allows derived classes to do things after data is set or added.
const std::map< uInt, uInt > & _getMaskStrides() const
std::map< uInt, DataRanges > _dataRanges
virtual void addData(const DataIterator &first, uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False)
Add a dataset to an existing set of datasets on which statistics are to be calculated.
std::map< uInt, WeightsIterator > _weights
maps data to weights
virtual StatsData< AccumType > _getStatistics()=0
struct Node * first
Definition: malloc.h:330
LatticeExprNode max(const LatticeExprNode &left, const LatticeExprNode &right)
std::map< uInt, MaskIterator > _masks
maps data to masks
unsigned long long uInt64
Definition: aipsxtype.h:39
virtual AccumType getStatistic(StatisticsData::STATS stat)
get the value of the specified statistic
virtual AccumType _getStatistic(StatisticsData::STATS stat)=0
virtual AccumType getMedianAndQuantiles(std::map< Double, AccumType > &quantileToValue, const std::set< Double > &quantiles, CountedPtr< uInt64 > knownNpts=NULL, CountedPtr< AccumType > knownMin=NULL, CountedPtr< AccumType > knownMax=NULL, uInt binningThreshholdSizeBytes=4096 *4096, Bool persistSortedArray=False, uInt64 nBins=10000)=0
The return value is the median; the quantiles are returned in the quantileToValue map...
virtual AccumType getMedianAbsDevMed(CountedPtr< uInt64 > knownNpts=NULL, CountedPtr< AccumType > knownMin=NULL, CountedPtr< AccumType > knownMax=NULL, uInt binningThreshholdSizeBytes=4096 *4096, Bool persistSortedArray=False, uInt64 nBins=10000)=0
get the median of the absolute deviation about the median of the data.
void _setSortedArray(const vector< AccumType > &v)
std::map< uInt, Bool > _isIncludeRanges
ALGORITHM
implemented algorithms
const std::map< uInt, MaskIterator > _getMasks() const
Referenced counted pointer for constant data.
Definition: CountedPtr.h:86
LatticeExprNode min(const LatticeExprNode &left, const LatticeExprNode &right)
virtual void setDataProvider(StatsDataProvider< CASA_STATP > *dataProvider)
instead of settng and adding data "by hand", set the data provider that will provide all the data set...
StatsDataProvider< CASA_STATP > * _getDataProvider()
double Double
Definition: aipstype.h:55
virtual StatsData< AccumType > getStatistics()
std::map< uInt, uInt > _maskStrides
StatsDataProvider< CASA_STATP > * _dataProvider
virtual StatisticsAlgorithm< CASA_STATP > * clone() const =0
Clone this instance.
const vector< DataIterator > & _getData() const
virtual StatisticsData::ALGORITHM algorithm() const =0
get the algorithm that this object uses for computing stats
#define DataRanges
Commonly used types in statistics framework.
bool Bool
Define the standard types used by Casacore.
Definition: aipstype.h:42
virtual const std::set< StatisticsData::STATS > & _getUnsupportedStatistics() const
void _throwIfDataProviderDefined() const
const std::map< uInt, DataRanges > & _getRanges() const
const std::set< StatisticsData::STATS > _getStatsToCalculate() const
const Bool False
Definition: aipstype.h:44
const std::map< uInt, WeightsIterator > & _getWeights() const
const StatsDataProvider< CASA_STATP > * _getDataProvider() const
virtual AccumType getMedian(CountedPtr< uInt64 > knownNpts=NULL, CountedPtr< AccumType > knownMin=NULL, CountedPtr< AccumType > knownMax=NULL, uInt binningThreshholdSizeBytes=4096 *4096, Bool persistSortedArray=False, uInt64 nBins=10000)=0
const vector< uInt > & _getDataStrides() const
virtual std::pair< Int64, Int64 > getStatisticIndex(StatisticsData::STATS stat)=0
certain statistics such as max and min have locations in the dataset associated with them...
void deleteSortedArray()
delete any (partially) sorted array
virtual void setData(const DataIterator &first, uInt nr, uInt dataStride=1, Bool nrAccountsForStride=False)
setdata() clears any current datasets or data provider and then adds the specified data set as the fi...
std::set< StatisticsData::STATS > _statsToCalculate
std::set< StatisticsData::STATS > _unsupportedStats
*static std::map< uInt64, AccumType > _valuesFromArray(vector< AccumType > &myArray, const std::set< uInt64 > &indices)
The array can be changed by paritally sorting it up to the largest index.
AccumType getQuantile(Double quantile, CountedPtr< uInt64 > knownNpts=NULL, CountedPtr< AccumType > knownMin=NULL, CountedPtr< AccumType > knownMax=NULL, uInt binningThreshholdSizeBytes=4096 *4096, Bool persistSortedArray=False, uInt64 nBins=10000)
const std::map< uInt, Bool > & _getIsIncludeRanges() const
const vector< Int64 > & _getCounts() const
virtual void reset()
reset this object by clearing data.
Base class of statistics algorithm class hierarchy.
const Bool True
Definition: aipstype.h:43
this file contains all the compiler specific defines
Definition: mainpage.dox:28
virtual std::map< Double, AccumType > getQuantiles(const std::set< Double > &quantiles, CountedPtr< uInt64 > npts=NULL, CountedPtr< AccumType > min=NULL, CountedPtr< AccumType > max=NULL, uInt binningThreshholdSizeBytes=4096 *4096, Bool persistSortedArray=False, uInt64 nBins=10000)=0
get a map of quantiles to values.
unsigned int uInt
Definition: aipstype.h:51