h**t Posts: 1678 | 1 What formula can I use to determine the right sample size for a clustering
analysis with 100-300 variables?
What sampling methodology can be used for k-means or hierarchical clustering
on categorical fields so that all values of the categorical fields are
included in the sample?
Thanks a lot! |
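One possible approach to the second question, as a minimal Python sketch rather than a recommended method: first force in one randomly chosen row for every observed value of every categorical field, then fill the rest of the sample at random. The function name `sample_covering_all_levels` and the list-of-dicts row layout are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def sample_covering_all_levels(rows, cat_fields, n, seed=0):
    """Return about n rows in which every observed value of every
    categorical field appears at least once (hypothetical helper)."""
    rng = random.Random(seed)

    # Index row positions by (field, value) so each level can be covered.
    by_level = defaultdict(list)
    for i, row in enumerate(rows):
        for field in cat_fields:
            by_level[(field, row[field])].append(i)

    chosen = set()
    # Step 1: guarantee one row per level, skipping levels already covered.
    for idxs in by_level.values():
        if not chosen.intersection(idxs):
            chosen.add(rng.choice(idxs))

    # Step 2: top up with a simple random sample of the remaining rows.
    remaining = [i for i in range(len(rows)) if i not in chosen]
    extra = max(0, min(n - len(chosen), len(remaining)))
    chosen.update(rng.sample(remaining, extra))

    return [rows[i] for i in sorted(chosen)]
```

Two caveats: if there are more distinct levels than `n`, the result is larger than `n`; and the forced rows over-represent rare levels, so the result is not a simple random sample and cluster sizes estimated from it will be biased toward rare categories.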
c***z Posts: 6348 | 2 A side question: how does k-means compute the distance if some of the variables are
binary? |
h**t Posts: 1678 | 3 k-means is based on Euclidean distance calculations. So whatever the data is,
it still calculates the distance; a binary variable is simply treated as a numeric 0/1 column. |
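To make the answer about binary variables concrete, here is a small sketch (plain Python, with made-up example vectors). For 0/1 vectors the squared Euclidean distance is just the number of positions where the two rows differ, and the "mean" that k-means computes for a binary column is the proportion of ones in the cluster.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def column_means(rows):
    """Per-column mean, i.e. the centroid k-means would compute."""
    return [sum(col) / len(col) for col in zip(*rows)]


# Two illustrative binary rows (not taken from any real dataset).
a = [1, 0, 1, 1, 0]
b = [0, 0, 1, 0, 1]

# Squared Euclidean distance on 0/1 data equals the mismatch count.
assert abs(euclidean(a, b) ** 2 - hamming(a, b)) < 1e-9

# The centroid of binary columns is a vector of proportions, not 0/1 values.
centroid = column_means([a, b])
```

So k-means does run on binary columns, but the centroids are no longer valid binary profiles, which is one reason people often prefer other methods for purely categorical data.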
h***x Posts: 586 | 4 Use Varclus (SAS) and PCA to do variable reduction first, before running
the clustering. When you only have 10-20 variables, you won't need to agonize over
sampling strategies.
I do not like k-means. Every time I reset the seeds, or even reorder the
dataset, I get different results. The upside is that I can get the
results I want after trying again and again... Not sure if that is kind of
cheating...
Nonparametric clustering (MODECLUS) is a better choice most of the time. It
can handle situations that k-means cannot handle well because of problems
with the data structure.
Another good way is to combine k-means with a hierarchical method to make a
two-stage clustering.
[Quoting h**t's post] : k-means is based on Euclidean distance calculations. So whatever the data is, it still calculates the distance.
|
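The seed sensitivity described above comes from the random choice of initial centers. A common way to make the result reproducible without hand-picking the run you like is to fix the list of seeds in advance and keep the run with the lowest within-cluster sum of squares. The following is a minimal pure-Python sketch of that idea; the function names are hypothetical and this is not a tuned implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed, n_iter=100):
    """Basic Lloyd iteration; returns (labels, centers, inertia)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, inertia


def best_of_restarts(points, k, seeds):
    """Run k-means once per seed and keep the lowest-inertia solution."""
    return min((kmeans(points, k, s) for s in seeds), key=lambda r: r[2])
```

Choosing the run by a criterion fixed beforehand (here, inertia over a predeclared seed list) is what separates ordinary multi-start k-means from retrying until the clusters look the way you want.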
c***z Posts: 6348 | 5 Also, don't forget to normalize the variables. |
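Normalization matters because Euclidean distance is dominated by whichever variable has the largest scale. A minimal sketch of z-score standardization in plain Python follows; the function name `standardize` is hypothetical, and other scalings such as min-max would also fit the advice above.

```python
import statistics


def standardize(rows):
    """Z-score each column: subtract the column mean and divide by the
    population standard deviation. Constant columns are mapped to 0.0."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Illustrative data: the second column is on a much larger scale, so
# without scaling it would dominate every distance calculation.
raw = [[1, 1000], [2, 2000], [3, 3000]]
scaled = standardize(raw)
```

After scaling, both columns in this toy example carry the same weight in the distance.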
s*********h Posts: 6288 | 6 Is two-step clustering available in R?
[Quoting h***x's post] : Use Varclus (SAS) and PCA to do variable reduction first, before running the clustering. When you only have 10-20 variables, you won't need to agonize over sampling strategies. : I do not like k-means. Every time I reset the seeds, or even reorder the dataset, I get different results. The upside is that I can get the results I want after trying again and again... Not sure if that is kind of cheating... : Nonparametric clustering (MODECLUS) is a better choice most of the time. It can handle situations that k-means cannot handle well because of problems with the data structure.
|
g******2 Posts: 234 | 7 You can use sparse k-means. |
h**t Posts: 1678 | 8 Thank you!
I know model-based clustering and two-step clustering are more appropriate
for my data. For some reason, I can only use k-means or hierarchical
clustering to finish some demos...
PCA or FA is not preferred; we actually want to keep all of these variables...
[Quoting h***x's post] : Use Varclus (SAS) and PCA to do variable reduction first, before running the clustering. When you only have 10-20 variables, you won't need to agonize over sampling strategies. : I do not like k-means. Every time I reset the seeds, or even reorder the dataset, I get different results. The upside is that I can get the results I want after trying again and again... Not sure if that is kind of cheating... : Nonparametric clustering (MODECLUS) is a better choice most of the time. It can handle situations that k-means cannot handle well because of problems with the data structure.
|
h**t Posts: 1678 | 9 This is already done.
[Quoting c***z's post] : Also, don't forget to normalize the variables.
|
c***z Posts: 6348 | 10 Too many variables will run into the curse of dimensionality...
[Quoting h**t's post] : Thank you! : I know model-based clustering and two-step clustering are more appropriate for my data. For some reason, I can only use k-means or hierarchical clustering to finish some demos... : PCA or FA is not preferred; we actually want to keep all of these variables...
|
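One way to see the dimensionality problem for distance-based clustering is that, as the number of variables grows, the nearest and farthest neighbors of a point end up almost equally far away. The sketch below measures that "relative contrast" on uniformly random data; the function name and the uniform-data setup are assumptions made purely for illustration, and real data with correlated variables usually behaves less badly.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """(farthest - nearest) / nearest distance from one reference point
    to all others, for points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = pts[0]
    dists = [math.dist(query, p) for p in pts[1:]]
    return (max(dists) - min(dists)) / min(dists)


# Compare a low-dimensional case with one in the 100-300 variable range
# discussed in this thread.
low = relative_contrast(200, 2)
high = relative_contrast(200, 300)
```

With a few hundred uninformative dimensions the contrast shrinks sharply, which is why Euclidean k-means struggles to separate clusters unless most of the variables actually carry cluster signal.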
b********r Posts: 764 | 11 What is the algorithm for this?
[Quoting g******2's post] : You can use sparse k-means.
|
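As a rough answer to the question about the algorithm, and assuming "sparse k-means" here means the Witten and Tibshirani formulation (the thread does not say which variant was intended): each variable j gets a non-negative weight w_j, and the method maximizes the weighted between-cluster sum of squares under an L1 bound s that pushes many weights to exactly zero.

```latex
\max_{C_1,\dots,C_K,\; w}\ \sum_{j=1}^{p} w_j
  \left( \frac{1}{n}\sum_{i=1}^{n}\sum_{i'=1}^{n} d_{i,i',j}
       - \sum_{k=1}^{K}\frac{1}{n_k}\sum_{i,i' \in C_k} d_{i,i',j} \right)
\quad \text{subject to} \quad
\|w\|_2^2 \le 1,\ \ \|w\|_1 \le s,\ \ w_j \ge 0,
\qquad d_{i,i',j} = (x_{ij} - x_{i'j})^2 .
```

The optimization alternates between two steps: with w fixed, run ordinary k-means on the data with column j scaled by the square root of w_j; with the clusters fixed, update w by soft-thresholding the per-variable between-cluster sums of squares and rescaling to unit L2 norm. The tuning parameter s controls how many variables keep a non-zero weight, so all 100-300 variables can stay in the input while only the informative ones drive the clustering. In R, the `sparcl` package provides an implementation.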