KDD Cup 2006 FAQ: UNM Computer Science
Note: References to code or actual data look like this.
Q1: The spec says that there are 46 positive cases and 20 negative cases,
but I only count 38 positive and 8 negative in the training data. What
gives?
A1: Don't panic! As the README included with the data indicates, this release
is a preliminary data set — it may not match the spec in all ways. The slightly
longer story is that Siemens is preparing a more extensive version of the
training data. But it turned out to take longer to produce that data than they
had expected, and we judged that it was more important to go forward with the
competition than to wait on the final data. So we released the data that is
currently available. The spec is based on Siemens's projections of what
the final data will look like and is out of sync with the preliminary data that
we released. The "final" data may or may not become available in time to be of
use during the competition; if it does become available I will release it as
soon as I can. At that point, I'll correct the spec to reflect the final status
of the training data.
Q2: How do you define candidates which are associated with PE (see
pdf-specifications file)?
A2: The 'label' field gives the index of the PE with which the
candidate is associated. That is, label=0 means 'this candidate is not
associated with a PE', while label=X (for X!=0) means 'this candidate
is associated with PE number X'.
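As a rough illustration of the label convention in A2, the sketch below turns labels into yes/no targets and groups candidates by the PE they point to. The function names and the toy values are made up for illustration. Keying the groups by (patient ID, label) is a cautious assumption, since nothing above says whether PE numbers are unique across patients.

```python
from collections import defaultdict

def is_pe_candidate(label):
    """label=0 means 'no PE'; any nonzero label is the index of a PE."""
    return label != 0

def group_candidates_by_pe(patient_ids, labels):
    """Map (patient_id, pe_index) -> list of candidate row positions.

    Assumption: PE indices may repeat across patients, so the patient ID
    is kept as part of the key.
    """
    groups = defaultdict(list)
    for row, (patient, label) in enumerate(zip(patient_ids, labels)):
        if is_pe_candidate(label):
            groups[(patient, label)].append(row)
    return dict(groups)

# Toy values, not taken from the competition data.
patients = [7, 7, 7, 9, 9]
labels = [0, 2, 2, 0, 1]
binary_targets = [int(is_pe_candidate(l)) for l in labels]
pe_groups = group_candidates_by_pe(patients, labels)
```

Here two candidates from the toy patient 7 end up in the same group because they carry the same nonzero label.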
Q3: Why does the 'KDDPEfeatureNames.txt' file have fewer feature names (116,
including Patient ID) than the 'KDDPETrain.txt' file fields (117)?
A3: The original feature names file is missing its last entry. The last
feature name should be Tissue Feature.
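One way to patch the name list after loading it is sketched below. The helper name is hypothetical, and how you read the names out of 'KDDPEfeatureNames.txt' is left to you, since the file layout is not described here. The counts (116 names, 117 fields) come from Q3.

```python
def complete_feature_names(names, expected_fields=117,
                           missing_name="Tissue Feature"):
    """Return a copy of names with the missing last entry appended.

    The name is only appended when the list is exactly one short, so a
    corrected names file passes through unchanged.
    """
    names = list(names)
    if len(names) == expected_fields - 1:
        names.append(missing_name)
    if len(names) != expected_fields:
        raise ValueError(
            "expected %d names, got %d" % (expected_fields, len(names))
        )
    return names

# Placeholder names standing in for the 116 entries in the file.
loaded = ["name_%d" % i for i in range(116)]
fixed = complete_feature_names(loaded)
```

The length check is there so that a file with some other problem fails loudly instead of being silently misaligned with the data columns.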
Q4: What's with normalization? The contest documentation says "all features
are normalized from 0 to 1," but there are negative numbers all over the
place.
A4: Yes, that's a mistake in the documentation. The features are normalized to
a unit range with roughly zero mean, so negative values are expected. But feel
free to re-normalize the data in any way that you like.
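If you do want features in the 0 to 1 range, a minimal sketch of one option (min-max rescaling of a single feature column) follows. This is just one possible re-normalization, not something the contest prescribes, and the function names are made up. The minimum and maximum are taken from the training column and reused for other data, so new values can fall outside 0 to 1.

```python
def fit_min_max(train_column):
    """Record the range of one feature column from the training data."""
    return min(train_column), max(train_column)

def apply_min_max(column, lo, hi):
    """Rescale values so that lo maps to 0.0 and hi maps to 1.0."""
    if hi == lo:
        # Constant feature: no spread to rescale, map everything to 0.0.
        return [0.0 for _ in column]
    span = hi - lo
    return [(x - lo) / span for x in column]

# Toy column with zero mean and unit range, as described in A4.
train_values = [-0.5, 0.0, 0.5]
lo, hi = fit_min_max(train_values)
rescaled = apply_min_max(train_values, lo, hi)
```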
Q5: How many submissions (i.e. sets of predictions) are allowed for each
subtask? What should be the format for the submissions?
A5: I'm working on sorting out the submission process now. I'll let you all
know more when I do.
Q6: Once the evaluation is done, will the labels of the test data be
released to the participants?
A6: Yes. Once the competition is complete, the intention is to release the
entire data set.
Q7: Can you give us more info on the meaning of each of the features?
Background info, interpretation, and so on?
A7: Siemens is unwilling to provide this information. They are hoping for
"abstract" approaches that aren't engineered to specific feature sets. (E.g., so
that they're easily extensible to different medical problems.)
Q7': How far apart should candidates be before I can reasonably assume they
are independent (there is no way I can tell this from the processed data)? I.e.
how far apart should they be to represent a "different" part of the lung and not
be looking at essentially the same structure?
A7': That's a good question. It's part of the challenge for you to determine
that. No prior domain knowledge is available on this point.
Q8: There appear to be strong correlations within a patient. Can you explain
that? Will we be given patient ID in the test data?
A8: There are indeed strong correlations within a patient and even within
patient groups from a single hospital. Part of the challenge is to find useful
ways to exploit that information. You will be given patient ID in the
final test data, but you will not be given a hospital ID and you may
not assume that all of the patients are drawn from the same hospital.
The test data patient set will be disjoint from the training data patient set.
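Because the test patients are disjoint from the training patients, it may make sense to hold out whole patients when you build your own validation set, so that within-patient correlations do not leak across the split. The sketch below is one way to do that. The function name, the holdout fraction, and the seed are arbitrary choices for illustration.

```python
import random

def split_by_patient(patient_ids, holdout_fraction=0.25, seed=0):
    """Split candidate row positions so no patient is on both sides.

    patient_ids holds one entry per candidate row. Returns two lists of
    row positions: (train_rows, validation_rows).
    """
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)
    n_holdout = max(1, round(len(unique_patients) * holdout_fraction))
    held_out = set(unique_patients[:n_holdout])
    train_rows = [i for i, p in enumerate(patient_ids) if p not in held_out]
    valid_rows = [i for i, p in enumerate(patient_ids) if p in held_out]
    return train_rows, valid_rows

# Toy patient IDs, one per candidate row.
ids = [1, 1, 2, 2, 2, 3, 4, 4]
train_rows, valid_rows = split_by_patient(ids)
```

Splitting candidates at random instead would put candidates from the same patient on both sides, which tends to make validation scores look better than they would on new patients.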
Q9: Can I return "unknown" labels for some of the data points?
A9: I'll say more about the answer format in the future, but the short answer
is: no. You must commit to an answer for each candidate. In practice,
if your algorithm were embedded in a piece of medical imaging equipment, you
would be required to return an absolute yes/no answer to the physician; there
is no room for "I don't know".
Q10: Is there label noise in the training or test data?
A10: Answer from Siemens: "We have high confidence that the labels given in
the training and test data are correct. They have been rigorously examined
individually by domain experts. But there is always an element of human error —
there may be small errors in either direction, but we believe that such errors
are minimal, at best. There exists the possibility that there are PEs in the
original data to which no candidates were assigned at all, but this is beyond
the scope of this competition. (Such PEs will have no candidates in the test
data, so you won't be scored on them one way or the other.)"