Publication data

Towards Finding Scholarly Articles in Internet Using Hadoop MapReduce with Oozie Workflow

Article
Journal: Challenges of modern technology, Volume: 4, Issue: 4
Jakub Jurkiewicz [1], Aleksander Nowiński [1]
2014-03-11, English
Link to the publicly available full text
Publication features
  • Original scientific article
  • Peer-reviewed
Scientific disciplines
Computer science (field of mathematical sciences)
Abstract (English)
This article addresses new methods for the automatic processing and analysis of scientific papers. It covers the very first part of this task: the discovery and harvesting of scientific publications from the internet. The focus is on discovering and analyzing HTML documents to identify publication resources. Using data from the Common Crawl project makes it possible to operate on a large subset of web pages without performing an expensive crawl of the WWW. We present methods for automatically identifying pages that describe scholarly documents on the WWW using HTML meta headers. The presented set of rules, applied to the data, achieves reasonable quality. A system based on these tools is also presented. It allows easy operation and transfer of output to the COntent ANalysis SYStem (CoAnSys), a processing and analysis system developed at ICM. To achieve this goal, a set of MapReduce tasks running on Hadoop with Oozie was used. The quality and efficiency of the described rules are discussed. Finally, future challenges for the system are presented.
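The abstract does not list the actual rules, so the following is only a minimal sketch of how meta-header-based identification might look. The prefix list (`citation_`, `dc.`, `dcterms.`, `prism.`, `eprints.`), the `min_tags` threshold, and the function names are all assumptions made for illustration, not the authors' rule set. The `map_page` function mirrors the shape of a map step but is plain Python, not a Hadoop job.

```python
from html.parser import HTMLParser

# Assumption: meta-name prefixes often used for scholarly metadata.
# The article's real rule set is not given in the abstract.
SCHOLARLY_PREFIXES = ("citation_", "dc.", "dcterms.", "prism.", "eprints.")


class MetaCollector(HTMLParser):
    """Collects (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values may be None for valueless attributes.
        name = (attr.get("name") or "").strip().lower()
        content = (attr.get("content") or "").strip()
        if name and content:
            self.meta.append((name, content))


def scholarly_meta(html):
    """Return the meta tags whose names match an assumed scholarly prefix."""
    parser = MetaCollector()
    parser.feed(html)
    return [(n, c) for n, c in parser.meta if n.startswith(SCHOLARLY_PREFIXES)]


def looks_scholarly(html, min_tags=2):
    """Hypothetical rule: a page qualifies if it has at least min_tags matches."""
    return len(scholarly_meta(html)) >= min_tags


def map_page(url, html):
    """Map-style step: emit (url, metadata) only for candidate pages."""
    tags = scholarly_meta(html)
    if len(tags) >= 2:
        yield url, tags
```

In a real pipeline, a function like `map_page` would be applied to each page record read from the crawl archive, and a later step would deduplicate or merge the emitted metadata.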
Full text
  1. Text type: Publisher's final version
  2. License: Creative Commons BY 3.0 PL
  3. Files
    1. comt44_01.pdf, 214 kB
Bibliography
  1. Dendek, P. J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P. & Bolikowski, L. (2013). Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop, 12. Information Retrieval; Digital Libraries. Retrieved from http://arxiv.org/abs/1303.5367
  2. Zamlynska, K., Bolikowski, L. & Rosiek, T. (2008). Migration of the Mathematical Collection of Polish Virtual Library of Science to the YADDA platform. Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008, 127-130.
  3. Kosala, R. & Blockeel, H. (2000). Web mining research. ACM SIGKDD Explorations Newsletter, 2(1), 1–15. doi:10.1145/360402.360406
  4. Turner, T. P. & Lise, B. (1998). Rising to the Top: Evaluating the Use of the HTML META Tag to Improve Retrieval of World Wide Web Documents through Internet Search Engines. Library Resources & Technical Services, 42(4). Retrieved July 5, 2013, from http://alcts.metapress.com/content/gq8151m1l 8515845/
  5. Gyongyi, Z. & Garcia-Molina, H. (2005, April 1). Web Spam Taxonomy. First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). Retrieved from http://ilpubs.stanford.edu:8090/771/1/2005-9.pdf
  6. Beel, J. & Gipp, B. (2010). On the Robustness of Google Scholar Against Spam. Proceedings of the 21st ACM conference on Hypertext and hypermedia - HT ’10 (p. 297). New York, New York, USA: ACM Press. doi:10.1145/1810617.1810683
  7. Ardö, A. (2010). Can We Trust Web Page Metadata? Journal of Library Metadata, 10(1), 58–74. doi:10.1080/19386380903547008
  8. Dean, J. & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 1–13. doi:10.1145/1327452.1327492