Datathon

Over the last decade, researchers in the network management field have started to leverage Machine Learning (ML) techniques. This challenge is organized to encourage the application of ML to automatic network analysis, given the success these techniques have shown in related domains. Through it, the ONTIC project aims to encourage participants to build ML-based systems that predict network traffic from real data. The principal task is to predict network traffic in the short term at different temporal levels, using real network traffic collected at a Spanish ISP. This large dataset will be made publicly available to encourage future research. A more detailed description is provided below.

5G, the new generation of communications networks, is expected to cater to the needs of billions of interconnected, heterogeneous devices. There is therefore increasing interest in applying recent advances in machine learning and data mining to the analysis and characterization of network traffic.

Among the different challenges that network service and infrastructure providers are interested in tackling, demand prediction is one of the most important. The ability to detect patterns in traffic behaviour that allow providers to make reliable predictions can be instrumental in efficiently managing virtualized infrastructure, deploying resources where they are most needed just in time to ensure the best possible performance.

This datathon proposes the challenge of predicting the number of flows crossing a large network link. The employed dataset was collected at a medium-sized Spanish ISP. It is a time series of one-second intervals, where each datum represents the number of TCP flows crossing the core network of the ISP. The dataset spans a period of 6 days.

 

Organizers

Chairs

  • Miguel Ángel López Peña, Innovation and Development Manager at SATEC, Spain. 

  • Alberto Mozo Velasco, Associate Professor at Universidad Politécnica de Madrid, and Project Coordinator of the ONTIC project, Spain. 

Challenge Organizers

  • Sandra Gómez Canaval, Universidad Politécnica de Madrid, Spain. 
  • Bruno Ordozgoiti Rubio,  Universidad Politécnica de Madrid, Spain. 
  • Bo Zhu,  Universidad Politécnica de Madrid, Spain. 

 

Sponsored by:

This datathon has been fully supported by the European Union's Seventh Framework Programme (FP7/2007-2013) within the ONTIC project under grant agreement n. 619633.

 

Registration

To participate in the challenge, interested parties must register according to the timeline below. Registration consists of sending the following documentation:

  1. A participation form, properly filled in and sent by email to the Challenge Organizers.
  2. An agreement signed by the participants.
  3. An accreditation document showing that the participants are European Union citizens or hold a valid identification card number for European residents or students: a national identity document, or a national identity document for foreign residents in the European Union. Please note that this challenge is open only to European Union citizens, European Union organizations (with a European Union fiscal number) and foreign residents in the European Union holding an identification card number for the European Union.

Registration is complete only when all the documents above have been sent correctly. The Challenge Organizers will accept only complete and valid registrations.

The Challenge Organizers will confirm correct registration by email to the accepted participants.

 

Task & Dataset

As stated above, the goal is to predict the number of flows crossing a network link based on a time series of previous values. Participants will be given a time series corresponding to the five weekdays of one week of 2016. Each data point in the time series represents the number of TCP sessions that were active during the previous one-second interval. The data set therefore totals 432,000 data points (5 days × 86,400 seconds per day). A data set from a different period will be provided for validation purposes.

For the testing phase, a set of randomly sampled time series segments from a different period will be employed. Participants will be allowed to use up to 1000 data points to predict the next one. Participants are required to make predictions 1 step, 2 steps, 4 steps and 8 steps into the future. That is, given the time series t_1, t_2, ..., t_1000, participants will be required to produce predictions for the values at t_1001, t_1002, t_1004 and t_1008.
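The windowing described above can be sketched in a few lines of Python. This is only an illustration of how training examples might be cut from the long training series, not the organizers' own tooling; the names `make_examples`, `WINDOW`, `HORIZONS` and the `stride` parameter are hypothetical.

```python
HORIZONS = (1, 2, 4, 8)  # steps ahead, as listed in the task description
WINDOW = 1000            # maximum number of past points a predictor may use


def make_examples(series, window=WINDOW, horizons=HORIZONS, stride=1):
    """Yield (history, targets) pairs cut from one long time series.

    history -- the `window` consecutive values t_1 .. t_window
    targets -- dict mapping each horizon h to the value at t_(window + h)
    """
    max_h = max(horizons)
    # Last start index that still leaves room for the furthest target.
    last_start = len(series) - window - max_h
    for start in range(0, last_start + 1, stride):
        end = start + window  # 0-based index of t_(window + 1)
        history = series[start:end]
        targets = {h: series[end + h - 1] for h in horizons}
        yield history, targets
```

With the default settings, each example pairs 1000 consecutive values with the values found 1, 2, 4 and 8 seconds after the end of the window. A larger `stride` reduces the overlap between consecutive examples.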

The test set will consist of 20,000 such segments, provided in a CSV file of 20,000 rows and 1000 columns. Participants are required to produce four files, one per prediction horizon, each containing one predicted value for each input row. For instance, the first file will contain the 1-second-ahead prediction for each row, the second file will contain the predictions 2 seconds ahead, and so on.
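A minimal sketch of this input/output flow is shown below, assuming one prediction per input row and one output file per horizon. The output file names, the one-value-per-line layout and all function names are assumptions made for illustration; the authoritative format is the example submission distributed by the organizers. The placeholder predictor simply repeats the last observed value.

```python
import csv
import io

HORIZONS = (1, 2, 4, 8)  # steps ahead required by the task


def read_segments(csv_file):
    """Read one time series segment per CSV row from an open file object."""
    return [[float(value) for value in row] for row in csv.reader(csv_file) if row]


def last_value(segment, horizon):
    """Placeholder predictor: repeat the last observed value."""
    return segment[-1]


def predict_all(segments, predictor=last_value, horizons=HORIZONS):
    """Return {horizon: [one prediction per input row, in row order]}."""
    return {h: [predictor(seg, h) for seg in segments] for h in horizons}


def write_prediction_files(predictions, prefix="predictions"):
    """Write one file per horizon, one value per line (hypothetical layout)."""
    for horizon, values in predictions.items():
        with open(f"{prefix}_{horizon}.csv", "w", newline="") as out:
            csv.writer(out).writerows([value] for value in values)


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real test file (3 rows instead of 20,000).
    demo = io.StringIO("10,12,11\n7,7,9\n3,4,5\n")
    predictions = predict_all(read_segments(demo))
    print(predictions[1])
```

For the real data, `read_segments` would be called on the opened test CSV and `write_prediction_files` on the resulting dictionary, after replacing `last_value` with an actual model.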

The quality of the predictions will be measured by their mean squared error (MSE) with respect to the actual values. The final score of a submission will be the sum of the MSEs of the four tasks.
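The scoring rule can be written down as a short Python sketch. It is not the official scoring script, which the organizers provide separately; the function names and the dictionary layout (one list of values per horizon) are assumptions.

```python
def mse(predicted, actual):
    """Mean squared error between two equally long sequences of numbers."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def final_score(predictions, targets, horizons=(1, 2, 4, 8)):
    """Sum of the per-horizon MSEs; lower is better.

    predictions, targets -- dicts mapping each horizon to a list of values,
    one value per test segment, in the same order.
    """
    return sum(mse(predictions[h], targets[h]) for h in horizons)
```

Because the four MSEs are summed without weights, the horizons with the largest errors (typically the longer ones) dominate the final score.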

In order to make sure that participants produce a submission in the right form, the validation data will be in the same format as the test data. In addition, the script used for measuring the score of the predictions will also be provided.

  

Submission & Important Dates

Registered participants are required to submit the following materials:

  1. Forecasting results obtained on the test data, in the same format as the provided example result, together with the configuration and parameter settings of the experiments conducted.
  2. Source code of the proposed algorithm, which should generate prediction results in the same format as the provided example. Programming languages are restricted to Python, Java, Scala, R and Matlab. If a compiled language is used, the submitted source code must build into an executable with which the organizers can reproduce the experiments and obtain the same forecasting results, given the provided experiment configuration. Submitted algorithms will be evaluated using the experiment results obtained by the challenge organizers.
  3. If the proposed algorithm uses any non-standard library, the source code of that library should be provided as well.
  4. If two solutions achieve sufficiently similar forecasting accuracy, the organizers will evaluate both and reward the more novel and interesting one.

 

Challenge Timeline:

  • March 3, Deadline for participant registration.
  • March 6, Notification of accepted registrations to participants.
  • March 9, The challenge starts. FTP access opens for downloading the training and validation datasets.
  • March 16, Release of the test data; submission page opens.
  • March 23, Submissions due.
  • March 30, Publication of results.

 

Evaluation Criteria

The quality of the predictions will be measured with the mean squared error (MSE) of the predictions with respect to the actual values. The final score of a submission will be the sum of the MSE of the four different experiments.

 

Baselines

  •  Naive (last point)
  • 2-point extrapolation
  • Linear regression
  • Linear regression with polynomial basis expansion
  •  Neural network
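The first two baselines are simple enough to sketch directly. The page does not say how the baselines are configured, so the code below is only one plausible reading: the function names are hypothetical, and `linear_trend` (an ordinary least-squares line fitted to the last `k` points against time) is an assumption about what a linear regression baseline could look like, not the organizers' implementation.

```python
def naive(history, horizon):
    """Naive (last point): predict that the last observed value persists."""
    return history[-1]


def two_point_extrapolation(history, horizon):
    """Extend the straight line through the last two points by `horizon` steps."""
    step = history[-1] - history[-2]
    return history[-1] + horizon * step


def linear_trend(history, horizon, k=30):
    """Fit a least-squares line to the last k points and extrapolate it.

    Assumed variant of a linear regression baseline; k is a hypothetical
    tuning parameter.
    """
    recent = history[-k:]
    n = len(recent)
    if n < 2:
        return recent[-1]
    mean_x = (n - 1) / 2
    mean_y = sum(recent) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(recent))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The last observed point sits at x = n - 1.
    return intercept + slope * (n - 1 + horizon)
```

All three functions share the signature `(history, horizon)`, so any of them can be evaluated for the 1, 2, 4 and 8 step horizons in the same loop.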

 

Download

All files for this challenge will be downloaded through FTP accounts that the Challenge Organizers provide by email. The files will be made available to registered participants following the timeline above.

  • Train data
  • Validation data
  • Test data
  • Scoring measure
  • Example submission for the validation data

 

Prizes

 1st prize: Apple iPad Air 2 128GB Wi-Fi (or equivalent model)

 2nd prize: Lenovo TAB 2 A10-70 16GB (or equivalent model)

 

Contact

Contact the Challenge Organizers by email.