Hadoop Learning Notes

2020-02-28


1. FileInputFormat splits only large files. Here "large" means larger than an HDFS block. The split size is normally the size of an HDFS block, which is appropriate for most applications; however, it is possible to control this value by setting various Hadoop properties.

2. So the split size is blockSize.

3. Making the minimum split size greater than the block size increases the split size, but at the cost of locality.

4. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the file is very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of them (one per file), each of which imposes extra bookkeeping overhead.

Hadoop does not handle large numbers of small files well:

Hadoop processes data in blocks, 64 MB per block by default. If there are many small files (say, 2-3 MB each), then each of these files, although far smaller than a block, is still handled as its own block.

This has two consequences: 1. Storing a large number of small files occupies storage capacity inefficiently, and retrieval is slower than for large files.

2. During MapReduce jobs, such small files waste computing capacity, because map tasks are by default assigned one per block (this is arguably the main drawback of using small files).
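To get a feel for the scale of this overhead, here is a quick back-of-the-envelope calculation. The file counts and sizes below are made-up illustrative numbers, and the class is a standalone sketch, not Hadoop code:

```java
public class SmallFileOverhead {
    // One map task is launched per block; a file smaller than a block
    // still occupies its own block, hence gets its own map task.
    static long mapTasksForSmallFiles(long fileCount) {
        return fileCount; // one map per file
    }

    // If the same data is merged into large files, the number of maps
    // is roughly total size / block size (rounded up).
    static long mapTasksIfMerged(long totalBytes, long blockSize) {
        return (totalBytes + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // default 64 MB block
        long fileCount = 10_000;              // hypothetical: 10,000 files
        long fileSize  = 2L * 1024 * 1024;    // each 2 MB

        System.out.println(mapTasksForSmallFiles(fileCount));                   // 10000
        System.out.println(mapTasksIfMerged(fileCount * fileSize, blockSize));  // 313
    }
}
```

Ten thousand map tasks versus about three hundred for the same 20 GB of data: the per-task startup and bookkeeping cost is what makes the small-file case so much slower.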

So how can this problem be solved?

1. Use the HAR (Hadoop Archive) files that Hadoop provides; the Hadoop command manual describes how to archive small files. 2. Preprocess the data yourself, merging the small files into large files of more than 64 MB.
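The split-size rule described earlier (the interplay of minimum split size, maximum split size, and block size) can be sketched in plain Java. This mirrors the logic of FileInputFormat's split-size computation, but the class below is a standalone illustration, not Hadoop code:

```java
public class SplitSizeDemo {
    // Mirrors Hadoop's rule: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // 64 MB block

        // With default bounds, the split size equals the block size.
        System.out.println(computeSplitSize(block, 1, Long.MAX_VALUE) == block); // true

        // Raising the minimum above the block size grows the split,
        // at the cost of data locality.
        System.out.println(computeSplitSize(block, 128L * 1024 * 1024, Long.MAX_VALUE));
    }
}
```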

FileInputFormat is the base class for all implementations of InputFormat that use files as their data source (see Figure 7-2). It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files. The job of dividing splits into records is performed by subclasses.

An InputSplit has a length in bytes, and a set of storage locations, which are just hostname strings. Notice that a split doesn't contain the input data; it is just a reference to the data. As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat. An InputFormat is responsible for creating the input splits, and dividing them into records. Before we see some concrete examples of InputFormat, let's briefly examine how it is used in MapReduce. Here's the interface:

public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                       Reporter reporter) throws IOException;
}

The JobClient calls the getSplits() method. On a tasktracker, the map task passes the split to the getRecordReader() method on InputFormat to obtain a RecordReader for that split.

A related requirement that sometimes crops up is for mappers to have access to the full contents of a file. Not splitting the file gets you part of the way there, but you also need to have a RecordReader that delivers the file contents as the value of the record.

Example 7-2. An InputFormat for reading a whole file as a record

public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

We implement getRecordReader() to return a custom implementation of RecordReader.

Example 7-3. The RecordReader used by WholeFileInputFormat for reading a whole file as a record

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit fileSplit, Configuration conf)
            throws IOException {
        this.fileSplit = fileSplit;
        this.conf = conf;
    }

    @Override
    public NullWritable createKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? fileSplit.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (!processed) {
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public void close() throws IOException {
        // do nothing
    }
}
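To make the control flow concrete, here is a simplified, self-contained sketch of how a map task drives a RecordReader. The types below are stripped-down stand-ins for the Hadoop interfaces (no HDFS, no Writable types); they exist only to show the next() loop and the single-record behavior of a whole-file reader:

```java
import java.util.ArrayList;
import java.util.List;

public class RecordReaderLoop {
    // Stripped-down stand-in for Hadoop's RecordReader.
    interface SimpleRecordReader {
        boolean next(StringBuilder value); // fills value; returns false at end of split
    }

    // Delivers the entire content as a single record, mimicking
    // WholeFileRecordReader's "processed" flag.
    static class WholeContentReader implements SimpleRecordReader {
        private final String content;
        private boolean processed = false;

        WholeContentReader(String content) { this.content = content; }

        public boolean next(StringBuilder value) {
            if (!processed) {
                value.setLength(0);
                value.append(content);
                processed = true;
                return true;
            }
            return false;
        }
    }

    // The map-task loop: keep calling next() until the split is exhausted.
    static List<String> runMapLoop(SimpleRecordReader reader) {
        List<String> records = new ArrayList<>();
        StringBuilder value = new StringBuilder();
        while (reader.next(value)) {
            records.add(value.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = runMapLoop(new WholeContentReader("entire file contents"));
        System.out.println(records.size()); // one record for the whole "file"
        System.out.println(records.get(0));
    }
}
```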

Input splits are represented by the Java interface, InputSplit (which, like all of the classes mentioned in this section, is in the org.apache.hadoop.mapred package):

public interface InputSplit extends Writable {
    long getLength() throws IOException;
    String[] getLocations() throws IOException;
}

The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible, and the size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime (this is an instance of a greedy approximation algorithm).
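The largest-first ordering is easy to sketch: given the split lengths, sort them in descending order before dispatching. This is a standalone illustration of the heuristic, not Hadoop's actual scheduling code:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SplitOrdering {
    // Order splits so the largest are processed first: starting the biggest
    // pieces of work earliest tends to minimize the overall job runtime.
    static Long[] largestFirst(Long[] splitLengths) {
        Long[] ordered = splitLengths.clone();
        Arrays.sort(ordered, Comparator.reverseOrder());
        return ordered;
    }

    public static void main(String[] args) {
        Long[] lengths = {64L, 5L, 128L, 32L};
        System.out.println(Arrays.toString(largestFirst(lengths)));
        // [128, 64, 32, 5]
    }
}
```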

Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.

A path may represent a file, a directory, or, by using a glob, a collection of files and directories. A path representing a directory includes all the files in the directory as input to the job. See "File patterns" on page 60 for more on using globs.

It is a common requirement to process sets of files in a single operation. For example, a MapReduce job for log processing might analyze a month's worth of files, contained in a number of directories. Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
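Glob matching itself can be illustrated with plain Java NIO, whose glob syntax is similar (though not identical) to Hadoop's. This demonstrates only the pattern-matching idea, not the HDFS globStatus implementation; the log paths are made-up examples:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    // True if the path string matches the glob pattern.
    static boolean matches(String glob, String path) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return m.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // A month's worth of log directories, e.g. logs/2020/02/28:
        System.out.println(matches("logs/2020/02/*", "logs/2020/02/28"));    // true
        // Character classes restrict a path component, here months 01 or 02:
        System.out.println(matches("logs/2020/0[12]/*", "logs/2020/03/01")); // false
    }
}
```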
