Posts in the 'Development' category


[cascading] TemplateTap deprecated !!

Development/Cascading / 2014. 10. 8. 10:53

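Cascading's TemplateTap is deprecated.

Use PartitionTap instead to create sub-directories on output.

Its usage is similar to TemplateTap.

When a job works over many partitions, a lot of time goes into spinning M/R jobs up and down; doing it this way (one job that writes all partitions) cuts that time drastically.

The gain is especially large when the data is split into many small partitions.
On hourly-partitioned data, a job that took 3 weeks when run partition by partition finished in just an hour and a half.

In the following example, the field new Fields("partition") holds the date that becomes the sub-directory name.

It's that simple!!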

Tap multiSinkTap = new PartitionTap(new Hfs(new TextDelimited(fields, false, delimiter), outputPath), new DelimitedPartition(new Fields("partition")), SinkMode.REPLACE);
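To make the idea concrete outside of Cascading, here is a minimal Python sketch of the same pattern: a single pass over the rows that routes each one into a sub-directory named after its partition value. It is only an illustration, not how PartitionTap is implemented; the function name write_partitioned, the part-00000 file name, and the sample rows are all assumptions made for this example.

```python
import csv
import os
import tempfile


def write_partitioned(rows, output_path, partition_index, delimiter="\t"):
    """Write each row to output_path/<partition value>/part-00000 in one pass."""
    writers = {}  # partition value -> (file handle, csv writer)
    try:
        for row in rows:
            key = row[partition_index]
            if key not in writers:
                # First time this partition value is seen: create its directory.
                directory = os.path.join(output_path, key)
                os.makedirs(directory, exist_ok=True)
                handle = open(os.path.join(directory, "part-00000"), "w",
                              newline="", encoding="utf-8")
                writers[key] = (handle, csv.writer(handle, delimiter=delimiter,
                                                   lineterminator="\n"))
            writers[key][1].writerow(row)
    finally:
        for handle, _ in writers.values():
            handle.close()
    return sorted(writers)


if __name__ == "__main__":
    # Hypothetical sample data: the last column plays the role of "partition".
    sample = [["a", "1", "2014-10-07"], ["b", "2", "2014-10-08"]]
    with tempfile.TemporaryDirectory() as out:
        print(write_partitioned(sample, out, partition_index=2))
```

The point of the sketch is that all partitions are written from one run, rather than starting a separate job per partition.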

Posted by 꽃현주


[Hadoop,Cascading] Setting the classpath and distributed cache paths

Development / 2014. 6. 11. 14:34

In Hadoop, I bind the distributed cache and the classpath to files on HDFS dynamically at runtime.

(I'm using Cascading!!)

For the distributed cache, I set "mapred.cache.files" (YARN: mapreduce.job.cache.files) to a path like this:

"hdfs:///user/joo/x.sqlite"

For the classpath, I likewise set "mapred.job.classpath.files" (YARN: mapreduce.job.classpath.files) to a path like this:

"hdfs:///user/lib/simple-json.jar"

But, boom!! The job died with a class-not-found error.

The path was clearly attached correctly in the job file on the jobtracker, so I was dumbfounded.

The fix was very simple.

After removing "hdfs://" and writing only the path from the root, "/user/lib/simple-json.jar", it ran just fine.

Apparently the classpath entry already gets "hdfs://" added as a prefix.
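If the jar locations arrive as full hdfs:// URIs, the fix above can be applied programmatically. The following is a small Python sketch using the standard library's urlparse; the helper name strip_hdfs_scheme is hypothetical, and whether your setup needs the same treatment for other properties is something to verify yourself.

```python
from urllib.parse import urlparse


def strip_hdfs_scheme(path):
    """Return only the path part of an hdfs:// URI; other strings pass through."""
    parsed = urlparse(path)
    if parsed.scheme == "hdfs":
        return parsed.path
    return path


if __name__ == "__main__":
    # Only the classpath entry is rewritten, matching the fix described above.
    classpath_entry = strip_hdfs_scheme("hdfs:///user/lib/simple-json.jar")
    print({"mapred.job.classpath.files": classpath_entry})
```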


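When HIVE beeline does not run as a background job (nohup, &)

Development/Linux / 2014. 4. 21. 14:16

How to work around Hive beeline not running as a background job

(Written "solution", read "hack".)

I tried to call beeline from a shell script and run it as a background job.

As always, it ran fine when started normally in the foreground.

But as a background job, it hung at the point where the Hive query is sent: no messages for a long time, and nothing was executed.

Much later, the error message showed that the failure came from JLine.

It turns out that JLine tries to return values to the console,

and in a background job it could not read the information it needs to print to the console, hence the error!!

So I used screen to keep the session from being disconnected, and ran the script inside it.

Solution: use screen!!

Don't know what screen is??

See this link (explained clearly) >> http://forum.falinux.com/zbxe/index.php?document_srl=530766&mid=lecture_tip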
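For scripted use, one option is to start the script in a detached screen session instead of attaching by hand. This Python sketch only builds the command line; the session name beeline_job and the script name run_beeline.sh are made up for illustration, and starting screen detached with -dmS is a variation on the interactive approach described above, not something the post itself tested.

```python
import shlex


def screen_command(session_name, script_path):
    """Build a command that runs script_path inside a detached screen session."""
    # -d -m starts screen detached, -S gives the session a name.
    return ["screen", "-dmS", session_name, "bash", script_path]


if __name__ == "__main__":
    cmd = screen_command("beeline_job", "run_beeline.sh")
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to launch it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

You can later reattach with screen -r followed by the session name to check on the job.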


sqlite execute query

Development/Python / 2014. 2. 13. 23:15

A quick way in Python to

(1) connect to a sqlite db,

(2) create a table, and

(3) insert data into the table

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sqlite3

def makeSqliteTable(rows):
    # (1) connect to the sqlite db file (created if it does not exist)
    conn = sqlite3.connect("dbName")
    conn.text_factory = str
    c = conn.cursor()
    # (2) recreate the table
    c.execute("DROP TABLE IF EXISTS student_info")
    c.execute("CREATE TABLE student_info (kor_name text, address text, job text, age text)")
    # (3) insert one row per tuple
    c.executemany("INSERT INTO student_info VALUES(?,?,?,?)", rows)
    conn.commit()
    conn.close()

def main():
    rows = []
    # each row must be a single tuple: (kor_name, address, job, age)
    rows.append(("홍길동", "경기도", "대학생", "24살"))
    rows.append(("이상화", "경기도", "빙상여제", "26살"))
    makeSqliteTable(rows)

if __name__ == "__main__":
    main()
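As a quick check that the insert worked, the rows can be read back with a SELECT. The sketch below is a self-contained variation that uses an in-memory database (":memory:") so it leaves no file behind; the helper name fetch_students is hypothetical, and the table layout is copied from the script above.

```python
import sqlite3


def fetch_students(conn):
    """Return all rows of student_info in insertion order."""
    cursor = conn.execute(
        "SELECT kor_name, address, job, age FROM student_info ORDER BY rowid")
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_info (kor_name text, address text, job text, age text)")
conn.executemany(
    "INSERT INTO student_info VALUES(?,?,?,?)",
    [("홍길동", "경기도", "대학생", "24살"), ("이상화", "경기도", "빙상여제", "26살")],
)
conn.commit()

students = fetch_students(conn)
for student in students:
    print(student)
conn.close()
```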


About Cascading?

Development/Cascading / 2014. 2. 3. 23:19

Warning: rough translation ahead; please let me know if you find any errors = =

# What is Cascading?

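Cascading is a data processing API for defining and sharing queries and for executing data processing workflows, either on a distributed computing cluster or on a single computer node.

On a single node (local mode), you can efficiently test your code and workflows before deploying them to a cluster.

In a distributed environment, it runs on the Apache Hadoop platform.

With it you can easily develop Hadoop applications and create and schedule jobs.

Put simply,

you can use Cascading to easily extract and cleanse data distributed across HDFS and similar stores, and to build a workflow.

Users do not have to implement Map/Reduce directly; Cascading's Each, GroupBy, Aggregator, Filter, and so on drive the Mappers and Reducers of the Hadoop job.

And if none of the built-in Operations does what you need, you can implement the corresponding interface yourself.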

# Example - word count

source = a Tap created from the input data location and the desired schema

sink = a Tap created from the output schema, the output path, and the sink mode

assembly = create a pipe named wordcount,

use RegexGenerator to split each line into words and put them in a field named word,

then GroupBy on word and count

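// define source and sink Taps (inputPath is assumed to be defined like outputPath)
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );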
// the 'head' of the pipe assembly
Pipe assembly = new Pipe( "wordcount" );
// For each input Tuple
// parse out each word into a new Tuple with the field name "word"
// regular expressions are optional in Cascading
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// group the Tuple stream by the "word" value
assembly = new GroupBy( assembly, new Fields( "word" ) );
// For every Tuple group
// count the number of occurrences of "word" and store result in
// a field named "count"
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// initialize app properties, tell Hadoop which jar file to use
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
// plan a new Flow from the assembly using the source and sink Taps
// with the above properties
FlowConnector flowConnector = new HadoopFlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
// execute the flow, block until complete
flow.complete();
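To see what the assembly computes without a Hadoop cluster, here is a plain Python sketch of the same three steps: split each line into words (the Each with RegexGenerator), group by word (GroupBy), and count each group (Every with Count). It is not Cascading code. Python's re module does not support \pL, so the pattern below is a simpler stand-in that matches runs of letters, and the function name word_count is an assumption for this example.

```python
import re
from collections import Counter

# Stand-in for the Java regex above: one or more Unicode letters.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def word_count(lines):
    """Count how often each word occurs across all lines."""
    counts = Counter()
    for line in lines:
        # Each + RegexGenerator: emit one "word" per match in the line.
        # Counter.update then does the GroupBy + Count in one step.
        counts.update(WORD_PATTERN.findall(line))
    return counts


if __name__ == "__main__":
    for word, count in sorted(word_count(["a b a", "b c"]).items()):
        print(word, count, sep="\t")
```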

# Comparison of pipe types

Pipe type	Purpose	Input	Output
Pipe	instantiate a pipe; create or name a branch	name	a (named) pipe
SubAssembly	create nested subassemblies
Each	apply a filter or function, or branch a stream	tuple stream (grouped or not)	a tuple stream, optionally filtered or transformed
Merge	merge two or more streams with identical fields	two or more tuple streams	a tuple stream, unsorted
GroupBy	sort/group on field values; optionally merge two or more streams with identical fields	one or more tuple streams with identical fields	a single tuple stream, grouped on key field(s) with optional secondary sort
Every	apply aggregator or buffer operation	grouped tuple stream	a tuple stream plus new fields with operation results
CoGroup	join 1 or more streams on matching field values	one or more tuple streams	a single tuple stream, joined on key field(s)
HashJoin	join 1 or more streams on matching field values	one or more tuple streams	a tuple stream in arbitrary order
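The join rows of the table can be illustrated with a small in-memory inner join in plain Python. This is only a sketch of the general idea of joining two tuple streams on a matching field, with the right-hand side indexed in memory; it is not Cascading's implementation, and the function name inner_join and the dict-based rows are assumptions made for the example.

```python
from collections import defaultdict


def inner_join(left, right, key):
    """Join two lists of dict rows on the given key field."""
    # Index the right-hand rows by key so each left row is matched in O(1).
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            merged = dict(left_row)
            merged.update(right_row)  # on a field-name clash the right value wins
            joined.append(merged)
    return joined


if __name__ == "__main__":
    names = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    scores = [{"id": 1, "score": 90}, {"id": 3, "score": 70}]
    print(inner_join(names, scores, "id"))
```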

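# Conclusion

I have written M/R code exactly once, and developing even a simple M/R job was not easy.

Cascading makes it easy to build workflows, which makes it a good fit for ETL work.

The Cascading API can also be used from Scala, Clojure, Groovy, JRuby, Jython, and others, not only Java.

However, for simple queries that do not need a workflow, Impala, Shark, Hive, Pig, and the like may be a better choice.

Before you start developing, always weigh data size, shape, resolution, job complexity, whether the job is real-time or not, and so on,

and use whatever fits your situation!!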


Simple Pattern Detecting with awk!!

Development/Linux / 2014. 2. 2. 19:06

Simple data wrangling on Linux!!

이름 학과 번호 국어 영어 수학 평균

홍길동 컴공 1 100 90 95 95

홍진이 경영 2 100 100 100 100

철수 컴공 2 90 90 90 90

홍민수 산디 1 100 60 80 80

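Delimiter: tab. File: test.txt (columns: name, department, number, Korean, English, math, average)

* Print the name, number, and average of the students whose family name is '홍' (Hong) and whose average is 85 or higher, sorted by average.

(The average is sorted in descending order, and the output delimiter is tab.)

$ cat test.txt | grep '^홍' | awk -F'\t' '{if($7>=85) print $1"\t"$3"\t"$7}' | sort -t$'\t' -k3,3nr

With this one command you can filter, sort, and reshape the data into the form you want. Note the numeric sort key (-k3,3nr): a plain reverse string sort would put 95 ahead of 100.

Result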

홍진이 2 100
홍길동 1 95

File: result.txt
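For comparison, the same filter and sort can be written in a few lines of Python. This is a sketch under the assumptions stated above (tab-delimited lines, the name in column 1, the number in column 3, the average in column 7); the function name top_hong_students is hypothetical.

```python
def top_hong_students(lines, min_average=85):
    """Return 'name<TAB>number<TAB>average' lines, highest average first."""
    results = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Same role as grep '^홍': keep only names that start with 홍.
        if len(fields) < 7 or not fields[0].startswith("홍"):
            continue
        # Same role as the awk condition $7 >= 85.
        if float(fields[6]) >= min_average:
            results.append((fields[0], fields[2], fields[6]))
    # Numeric, descending sort on the average.
    results.sort(key=lambda row: float(row[2]), reverse=True)
    return ["\t".join(row) for row in results]


if __name__ == "__main__":
    sample = [
        "이름\t학과\t번호\t국어\t영어\t수학\t평균",
        "홍길동\t컴공\t1\t100\t90\t95\t95",
        "홍진이\t경영\t2\t100\t100\t100\t100",
        "철수\t컴공\t2\t90\t90\t90\t90",
        "홍민수\t산디\t1\t100\t60\t80\t80",
    ]
    print("\n".join(top_hong_students(sample)))
```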

Useful site (shared by ggang9, thanks): http://explainshell.com/explain?cmd=cat+text.txt+%7C+grep+%27%5E홍%27+%7C+awk+-F+%27%5Ct%27+%27%7Bif%28%247%3E%3D85%29+print+%241%22%5Ct%22%243%22%5Ct%22%247%7D%27+%7C+plan9-sort.1+-r+-k+3

