Set balancing
This article needs additional citations for verification. (December 2015) |
The set balancing problem in mathematics is the problem of dividing a set to two subsets that have roughly the same characteristics. It arises naturally in design of experiments.[1]: 71–72
There is a group of subjects. Each subject has several features, which are considered binary. For example: each subject can be either young or old; either black or white; either tall or short; etc. The goal is to divide the subjects to two sub-groups: treatment group (T) and control group (C), such that for each feature, the number of subjects that have this feature in T is roughly equal to the number of subjects that have this feature in C. E.g., both groups should have roughly the same number of young people, the same number of black people, the same number of tall people, etc.
Matrix representation
Formally, the set balancing problem can be described as follows.
is the number of subjects in the general population.
is the number of potential features.
The subjects are described by , an matrix with entries in . Each column represents a subject and each row represents a feature. if subject has feature , and if subject does not have feature .
The partition to groups is described by , an vector with entries in . if subject is in the treatment group T and is subject is in the control group C.
The balance of features is described by . This is an vector. The numeric value of is the imbalance in feature : if then there are more subjects with in T and if then there are more subjects with in C.
The imbalance of a given partition is defined as:
The set balancing problem is to find a vector which minimizes the imbalance .
Randomized algorithm
An approximate solution can be found with the following very simple randomized algorithm:
- Send each subject to the treatment group with probability 1/2.
In matrix formulation:
- Choose the elements of randomly with probability 1/2 to each value in {1,-1}.
Surprisingly, although this algorithm completely ignores the matrix , it achieves a small imbalance with high probability when there are many features. Formally, for a random vector :
PROOF:
Let be the total number of subjects that have feature (equivalently, the number of ones in the -th row of the matrix ). Consider the following two cases:
Easy case: . Then, with probability 1, the imbalance in feature (that we marked by ) is at most .
Hard case: . For every , let . Each such is a random variable that can be either 1 or -1 with probability 1/2. The imbalance in feature is: . Since the are independent random variables, by the Chernoff bound, for every :
Select: and get:
By the union bound,
- .
References
- ^ Mitzenmacher, Michael & Upfal, Eli (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. ISBN 0-521-83540-2.
Content Disclaimer
Informasi ini disarikan dari Wikipedia dan disajikan kembali untuk tujuan edukasi. Konten tersedia di bawah lisensi CC BY-SA 3.0. Kami tidak bertanggung jawab atas ketidakakuratan data yang bersumber dari kontribusi publik tersebut.
- The information displayed on this website is sourced in part or in whole from Wikipedia and has been adapted for the purpose of restating it. We strive to provide accurate and relevant information, however:
- There is no guarantee of absolute accuracy. Wikipedia is an open, collaborative project that can be edited by anyone, so information is subject to change.
- It is not intended to constitute professional advice. The content displayed is for informational and educational purposes only. For important decisions (e.g., medical, legal, or financial), please consult a professional.
- Content copyright. Wikipedia is licensed under the Creative Commons Attribution-ShareAlike License (CC BY-SA). This means that content may be reused with appropriate attribution and shared under a similar license.
- Responsible use. Any risk arising from the use of information from this website is entirely the responsibility of the user.