KDD Cup 2003: Datasets

Sept 4, 2003: The datasets available for public download have been finalized.

I. Citation Prediction Task

Available for contestants:
  1. The LaTeX sources of all papers in the hep-th portion of the arXiv until May 1, 2003 are available for download. Each paper is identified by a unique arXiv id.

    There are approximately 29,000 hep-th papers totaling 1.7 GB of data. The papers have been compressed to about 500 MB and divided by year for downloading.
    hep-th 1992 (22M)
    hep-th 1993 (31M)
    hep-th 1994 (39M)
    hep-th 1995 (36M)
    hep-th 1996 (41M)
    hep-th 1997 (44M)
    hep-th 1998 (45M)
    hep-th 1999 (48M)
    hep-th 2000 (53M)
    hep-th 2001 (56M)
    hep-th 2002 (59M)
    hep-th 2003 (17M)

  2. The abstracts for all the hep-th papers as a hep-th abstracts tarball.

  3. The SLAC dates for each hep-th paper as a hep-th slacdates tarball.
    • The SLAC dates file is a sorted two-column list: the left column is the paper's arXiv id and the right column is its SLAC date:
      [arxiv id] [date in YYYY-MM-DD format]

  4. The citation graph of the hep-th portion of the arXiv as a hep-th citations tarball.
    • The citations file is a sorted two-column list: the left column is the arXiv id of the citing paper and the right column is the arXiv id of the cited paper:
      [paper cited from] [paper cited to]
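
As an illustration of the two-column formats described above, the following is a minimal Python sketch that reads SLAC date lines and citation lines and counts how often each paper is cited. The function names and the sample ids are hypothetical, and the sketch assumes the columns are separated by whitespace, which the format description does not state explicitly.

```python
from collections import Counter
from datetime import datetime


def parse_slac_dates(lines):
    """Map arXiv id -> date from '[arxiv id] [YYYY-MM-DD]' lines."""
    dates = {}
    for line in lines:
        parts = line.split()  # assumption: whitespace-separated columns
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        paper_id, raw_date = parts
        dates[paper_id] = datetime.strptime(raw_date, "%Y-%m-%d").date()
    return dates


def parse_citations(lines):
    """Return (cited_from, cited_to) pairs from the citation lines."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        edges.append((parts[0], parts[1]))
    return edges


def citation_counts(edges):
    """Count incoming citations per paper (the 'cited to' column)."""
    return Counter(cited_to for _, cited_to in edges)


# Made-up sample lines, for illustration only (not taken from the datasets).
slac_lines = ["0001001 2000-01-05", "0001002 2000-01-07", ""]
cite_lines = ["0001002 0001001", "0001003 0001001", "0001003 0001002"]

dates = parse_slac_dates(slac_lines)
counts = citation_counts(parse_citations(cite_lines))
```

Ids are kept as strings so that any leading zeros are preserved. When working with the real files, the same functions can be passed an open file object instead of a list of lines.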

II. Data Cleaning Task

For this task, the LaTeX sources of the hep-ph papers as of March 1, 2003 are available for download. Each paper has been assigned a random paper id between 1 and 100,000. A small subset of papers was converted from PDF/PS and appears only as plain text.

There are over 35,000 hep-ph papers with 1.8 GB of data, so the download has been split into 10 gzipped tarballs of 50 MB each, plus 1 extra tarball with the plain text papers.
hep-ph part 0
hep-ph part 1
hep-ph part 2
hep-ph part 3
hep-ph part 4
hep-ph part 5
hep-ph part 6
hep-ph part 7
hep-ph part 8
hep-ph part 9
hep-ph part 10 (plain text papers)
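
Once the parts are downloaded, they can be unpacked into a single directory. The following is a minimal Python sketch using the standard `tarfile` module. The function name `extract_archives` and the `*.tar.gz` file-name pattern are assumptions, since the actual archive file names are not given here.

```python
import tarfile
from pathlib import Path


def extract_archives(archive_dir, dest_dir):
    """Extract every gzipped tarball in archive_dir into dest_dir.

    Returns the names of the archives that were extracted, in sorted order.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Assumption: the downloaded parts are saved with a .tar.gz suffix.
    for archive in sorted(Path(archive_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        extracted.append(archive.name)
    return extracted
```

A call such as `extract_archives("downloads", "hep-ph")` would unpack all parts found in a hypothetical `downloads` directory. Note that `extractall` trusts the paths stored in the archive, so it should only be used on archives from a trusted source.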

Sept 4, 2003: The corresponding citation graph for hep-ph, used as the evaluation criterion, is now available here.

III. Download Estimation Task

Available for this task are the same datasets as for Task 1, plus:
  1. For each paper that was published in one of the following six months (2/2000, 3/2000, 2/2001, 4/2001, 3/2002, 4/2002), the download logs from its first 60 days in the arXiv are provided.
Update Sept 4, 2003: The download data is no longer publicly available.

IV. Open Task

Contestants can use any of the hep-th data from Tasks 1 or 3.