Crowdsourcing identification of license violations

Sanghoon Lee, Daniel M. German, Seung won Hwang, Sunghun Kim

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Free and open source software (FOSS) has created a large pool of source code that can be easily copied to create new applications. However, a copy should preserve the copyright notice and license of the original file unless the license explicitly permits a change. As software evolves, it is challenging to keep the original licenses or to choose proper ones, and as a result there are many potential license violations. Although violations can seriously affect copyright protection, identifying them is highly complex and relies on manual inspection by experts. Such inspection cannot scale to the volume of open source software released daily worldwide. To make the process scalable, we propose two methods: using machine-based algorithms to narrow down potential violations, and guiding non-experts to inspect violations manually. Using the first method, we found 219 projects (76.6%) with potential violations. Using the second, we show that the accuracy of crowds is comparable to that of experts. Our techniques might help developers identify potential violations, understand their causes, and resolve them.

Original language: English
Pages (from-to): 190-203
Number of pages: 14
Journal: Journal of Computing Science and Engineering
Volume: 9
Issue number: 4
DOI: 10.5626/JCSE.2015.9.4.190
Publication status: Published - 2015 Jan 1

Fingerprint

Inspection
Open source software

All Science Journal Classification (ASJC) codes

  • Engineering(all)
  • Computer Science Applications

Cite this

Lee, Sanghoon; German, Daniel M.; Hwang, Seung won; Kim, Sunghun. Crowdsourcing identification of license violations. In: Journal of Computing Science and Engineering. 2015; Vol. 9, No. 4. pp. 190-203.
@article{cef6a8706e614fc499884d720e0f8aa3,
title = "Crowdsourcing identification of license violations",
author = "Sanghoon Lee and German, {Daniel M.} and Hwang, {Seung won} and Sunghun Kim",
year = "2015",
month = "1",
day = "1",
doi = "10.5626/JCSE.2015.9.4.190",
language = "English",
volume = "9",
pages = "190--203",
journal = "Journal of Computing Science and Engineering",
issn = "1976-4677",
publisher = "Korean Institute of Information Scientists and Engineers",
number = "4"
}

TY - JOUR

T1 - Crowdsourcing identification of license violations

AU - Lee, Sanghoon

AU - German, Daniel M.

AU - Hwang, Seung won

AU - Kim, Sunghun

PY - 2015/1/1

Y1 - 2015/1/1

UR - http://www.scopus.com/inward/record.url?scp=85008255741&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85008255741&partnerID=8YFLogxK

U2 - 10.5626/JCSE.2015.9.4.190

DO - 10.5626/JCSE.2015.9.4.190

M3 - Article

AN - SCOPUS:85008255741

VL - 9

SP - 190

EP - 203

JO - Journal of Computing Science and Engineering

JF - Journal of Computing Science and Engineering

SN - 1976-4677

IS - 4

ER -