cocopp.toolsstats module documentation
Function | fix_data_number | Obsolete and subject to removal. Use instead np.asarray(data)[randint_derandomized(0, len(data), ndata)] or [data[i] for i in randint_derandomized(0, len(data), ndata)]. |
Function | sp1 | sp1(data, maxvalue=Inf, issuccessful=None) computes a mean value over successful entries in data divided by success rate, the so-called SP1 |
Function | sp | sp(data, issuccessful=None) computes the sum of the function evaluations over all runs divided by the number of successes, the so-called success performance which estimates the average runtime aRT |
Function | drawSP_from_dataset | returns ``(percentiles, all_sampled_values_sorted)`` of simulated runlengths to reach ``ftarget`` based on a ``DataSet`` class instance, specifically:: |
Function | drawSP_from_dataset_new | new implementation, old interface (which should also change at some point) |
Function | drawSP | Returns the percentiles of the bootstrapped distribution of 'simulated' running lengths of successful runs. |
Function | randint_derandomized | return a numpy array of derandomized random integers. |
Function | simulated_evals | Obsolete: see DataSet.evals_with_simulated_restarts instead. |
Function | draw | Generates the empirical bootstrap distribution from a sample. |
Function | prctile | Computes percentile based on data with linear interpolation |
Function | randint | Undocumented |
Function | ranksum_statistic | Returns the U test statistic of the rank-sum (Mann-Whitney-Wilcoxon) test. |
Function | zprob | Returns the area under the normal curve 'to the left of' the given z value. |
Function | ranksumtest | Calculates the rank sum statistics for the two input data sets x and y and returns z and p. |
Function | rankdata | Ranks the data in a, dealing with ties appropriately. |
Function | significancetest | Compute the rank-sum test between two data sets. |
Function | significance_all_best_vs_other | for each target, applies the rank-sum test of the best algorithm against each of the other algorithms |
Function | fastsort | Sort an array and provide the argsort. |
Function | sliding_window_data | width is an absolute number; the resulting data has the same length as the original data, and the window width is between width/2 at the borders and width in the middle. |
Function | equals_approximately | Undocumented |
Function | in_approximately | return True if a equals approximately any of the elements in list_ |
Class | Evals | Undocumented |
Function | _has_len | Undocumented |
Function | _randint_derandomized_generator | the generator for randint_derandomized |
Obsolete and subject to removal. Use instead np.asarray(data)[randint_derandomized(0, len(data), ndata)] or [data[i] for i in randint_derandomized(0, len(data), ndata)].
Return a copy of the data vector modified to length ndata, or data itself.
Assures len(data) == ndata.
>>> from cocopp.toolsstats import fix_data_number
>>> data = [1, 2, 4]
>>> assert len(fix_data_number(data, 1)) == 1
>>> assert len(fix_data_number(data, 3)) == 3
>>> assert len(fix_data_number(data, 4)) == 4
>>> assert len(fix_data_number(data, 14)) == 14
>>> assert fix_data_number(data, 14)[2] == data[2]
See also data[randint_derandomized(0, len(data), ndata)], which should do pretty much the same, a little more randomized.
Parameters | data | is a (row)-vector |
sp1(data, maxvalue=Inf, issuccessful=None) computes a mean value over successful entries in data divided by success rate, the so-called SP1
sp(data, issuccessful=None) computes the sum of the function evaluations over all runs divided by the number of successes, the so-called success performance which estimates the average runtime aRT.
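For illustration, a minimal sketch of both statistics as described above (the helper names are hypothetical and the maxvalue handling of sp1 is omitted; this is not the module's actual implementation):

    import numpy as np

    def sp1_sketch(data, issuccessful=None):
        # SP1: mean over successful entries, divided by the success rate
        data = np.asarray(data, dtype=float)
        succ = (np.ones(len(data), dtype=bool) if issuccessful is None
                else np.asarray(issuccessful, dtype=bool))
        return data[succ].mean() / (succ.sum() / len(data))

    def sp_sketch(data, issuccessful=None):
        # SP (aRT estimate): sum over all runs, divided by the number of successes
        data = np.asarray(data, dtype=float)
        succ = (np.ones(len(data), dtype=bool) if issuccessful is None
                else np.asarray(issuccessful, dtype=bool))
        return data.sum() / succ.sum()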
returns (percentiles, all_sampled_values_sorted) of simulated runlengths to reach ftarget based on a DataSet class instance, specifically:
    evals = data_set.detEvals([ftarget])[0]  # likely to be 15 "data points"
    idx_nan = np.isnan(evals)  # nan == did not reach ftarget
    return drawSP(evals[~idx_nan], data_set.maxevals[idx_nan], percentiles, samplesize)
The expected value of all_sampled_values_sorted is the average runtime aRT, as obtained by data_set.detERT([ftarget])[0].
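A hedged usage sketch, assuming ds is a loaded DataSet instance and that the signature is drawSP_from_dataset(data_set, ftarget, percentiles, samplesize):

    from cocopp.toolsstats import drawSP_from_dataset
    percentiles, sampled = drawSP_from_dataset(ds, 1e-8, [5, 50, 95], 1000)
    # per the note above, the mean of sampled approximates ds.detERT([1e-8])[0]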
new implementation, old interface (which should also change at some point)
returns (None, evals), that is, no percentiles, only the data=runtimes=evals
Returns the percentiles of the bootstrapped distribution of 'simulated' running lengths of successful runs.
This implementation is deprecated and replaced by simulated_evals. The latter is also deprecated; see DataSet.evals_with_simulated_restarts instead.
See also: simulated_evals.
Return a numpy array of derandomized random integers.
The interface is the same as for numpy.random.randint, however the default value for size is high-low, and each "random" integer is guaranteed to appear exactly once in each chunk of size high-low. (That is, by default a permutation is returned.) As for numpy.random.randint, the value range is [low, high-1], or [0, low-1] if high is None.
>>> import numpy as np
>>> from cocopp.toolsstats import randint_derandomized
>>> np.random.seed(1)
>>> list(randint_derandomized(0, 4, 6))
[3, 2, 0, 1, 0, 2]
A typical use case is indexing of data like:

    [data[i] for i in randint_derandomized(0, len(data), ndata)]
    # or almost equivalently
    np.asarray(data)[randint_derandomized(0, len(data), ndata)]
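To illustrate the chunk-wise derandomization, a minimal sketch of the semantics (hypothetical helper; the module's actual implementation is built on _randint_derandomized_generator):

    import numpy as np

    def randint_derandomized_sketch(low, high=None, size=None):
        # each chunk of high-low integers is a fresh permutation of range(low, high)
        if high is None:
            low, high = 0, low
        if size is None:
            size = high - low
        out = []
        while len(out) < size:
            out.extend(np.random.permutation(range(low, high)))
        return np.asarray(out[:size])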
Obsolete: see DataSet.evals_with_simulated_restarts instead.
Return samplesize "simulated" run lengths (#evaluations), sorted, where the last nfail entries of evals come from unsuccessful trials.
Example:
>>> import numpy as np
>>> from cocopp import set_seed
>>> from cocopp.toolsstats import simulated_evals
>>> set_seed(4)
>>> evals_succ = [1]  # only one evaluation in the successful trial
>>> evals_unsucc = [2, 4, 2, 6, 100]
>>> simulated_evals(np.hstack([evals_succ, evals_unsucc]),
...                 len(evals_unsucc), 13)  # doctest: +ELLIPSIS
[1, 1, 3, ...
Generates the empirical bootstrap distribution from a sample.

Input:
  - *data* -- a sequence of data values
  - *percentiles* -- a single scalar value or a sequence of percentiles to be computed from the bootstrapped distribution
  - *func* -- function that computes the statistics as func(data, *args) or func(data, *args)[0], by default toolsstats.sp1
  - *args* -- arguments to func; the zero-th element of args is expected to be a sequence of booleans giving the success status of the associated data values. This specialization of the draw procedure is due to the interface of the performance computation methods sp1 and sp.
  - *samplesize* -- number of bootstraps drawn, default is 1e3; for more reliable values rather choose 1e4. Performance is linear in samplesize, 0.2s for samplesize=1000.

Return:
    (prctiles, all_samplesize_bootstrapped_values_sorted)

Example:
    >> import toolsstats
    >> data = np.random.randn(22)
    >> res = toolsstats.draw(data, (10, 50, 90), samplesize=1e4)
    >> print(res[0])

.. note:: NaN-values are also bootstrapped, but disregarded for the calculation of percentiles, which can lead to somewhat unexpected results.
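As an illustration of the bootstrap procedure, a minimal sketch under simplified assumptions (func takes only the data, no success-status argument, and NaN handling is omitted):

    import numpy as np

    def draw_sketch(data, percentiles, func=np.mean, samplesize=1000):
        # resample with replacement, compute the statistic on each resample,
        # and report the requested percentiles of the bootstrap distribution
        data = np.asarray(data, dtype=float)
        stats = np.sort([func(np.random.choice(data, size=len(data)))
                         for _ in range(int(samplesize))])
        return np.percentile(stats, percentiles), stats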
Computes a percentile based on data, with linear interpolation.

:keyword sequence data: (list, array) of data values
:keyword prctiles: percentiles to be calculated. Values beyond the interval [0, 100] also return the respective extreme value in data.
:type prctiles: scalar or sequence
:keyword issorted: indicate whether data is sorted
:Return: sequence of percentile values in data according to argument prctiles

.. note:: treats np.Inf and -np.Inf; np.NaN and None are simply disregarded
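A minimal sketch of linear-interpolation percentiles for a single scalar prctile (the exact interpolation convention, here MATLAB-style positioning, is an assumption):

    import numpy as np

    def prctile_sketch(data, p):
        # drop None and NaN, sort, then interpolate linearly;
        # p beyond [0, 100] clamps to the extreme values
        xs = np.sort([x for x in data if x is not None and x == x])
        pos = p / 100. * len(xs) - 0.5  # assumed MATLAB-style position
        if pos <= 0:
            return xs[0]
        if pos >= len(xs) - 1:
            return xs[-1]
        i = int(pos)
        return xs[i] + (pos - i) * (xs[i + 1] - xs[i])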
Returns the U test statistic of the rank-sum (Mann-Whitney-Wilcoxon) test.
Reference: http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U. Uses the direct method for small sample sizes.
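The direct method counts, over all pairs, how often an x-value precedes a y-value, with ties counting one half; a minimal sketch (hypothetical helper name):

    def u_statistic_sketch(x, y):
        # U = #{(xi, yj): xi < yj} + 0.5 * #{(xi, yj): xi == yj}
        return sum(1. if xi < yj else .5 if xi == yj else 0.
                   for xi in x for yj in y)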
Returns the area under the normal curve 'to the left of' the given z value.
http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html
Thus:
- for z<0, zprob(z) = 1-tail probability
- for z>0, 1.0-zprob(z) = 1-tail probability
- for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability
Adapted from z.c in Gary Perlman's |Stat. Can handle multiple dimensions.
Usage: zprob(z), where z is a z-value.
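The identities above can be exercised directly, for instance:

    from cocopp.toolsstats import zprob
    z = 1.96
    print(2.0 * (1.0 - zprob(abs(z))))  # two-tailed p, approximately 0.05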
Calculates the rank sum statistics for the two input data sets x and y and returns z and p.
The two-tailed p-value returned here differs slightly from that of scipy.stats.ranksumtest; this behavior should be tested further...
Returns: z-value for first data set x and two-tailed p-value
Ranks the data in a, dealing with ties appropriately.
Equal values are assigned a rank that is the average of the ranks that would have been otherwise assigned to all of the values within that set. Ranks begin at 1, not 0.
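For example, two tied middle values would share the average of ranks 2 and 3 (expected output, not verified here):

    from cocopp.toolsstats import rankdata
    rankdata([1, 2, 2, 3])  # expected: [1.0, 2.5, 2.5, 4.0]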
Compute the rank-sum test between two data sets.
For a given target function value, the performances of two algorithms are compared. The result of a significance test is computed on the number of function evaluations for reaching the target or, if not available, the function values for the smallest budget in an unsuccessful trial.
Known bugs: this is not a fair comparison, because the successful trials could be very long.
Parameters | DataSet entry0 | data set 0 |
 | DataSet entry1 | data set 1 |
 | list targets | list of target function values |
Returns | list of (z, p) for each target function value in input argument targets. z and p are the values returned by the ranksumtest method. |
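A hedged usage sketch, assuming entry0 and entry1 are loaded DataSet instances on the same function and dimension:

    from cocopp.toolsstats import significancetest
    results = significancetest(entry0, entry1, [1e-2, 1e-5, 1e-8])
    # one (z, p) pair per target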
Parameters | datasets | a list of DataSet instances from different algorithms, otherwise on the same function and dimension (which is not necessarily checked) |
 | targets | a list of target values |
 | best_alg_idx | for each target, the index of the best algorithm to be tested against the others |
Sort an array and provide the argsort.
width is an absolute number; the resulting data has the same length as the original data, and the window width is between width/2 at the borders and width in the middle.
Return (smoothed_data, stats), where stats is a list with elements [index_in_data, 2_10_25_50_75_90_98_percentile_of_window_at_i].
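A minimal sketch of the window geometry described above (hypothetical helper; the percentile bookkeeping of stats is omitted):

    import numpy as np

    def sliding_window_sketch(data, width):
        # half-window of width/2 on each side, truncated at the borders, so the
        # effective width grows from about width/2 at the ends to width in the middle
        data = np.asarray(data, dtype=float)
        h = max(1, width // 2)
        return np.array([np.median(data[max(0, i - h):i + h + 1])
                         for i in range(len(data))])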